Stars
[CVPR 2025] Adaptive Keyframe Sampling for Long Video Understanding
[ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Official code for Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
VELOCITI Benchmark Evaluation and Visualisation Code
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
ICLR 2025 - official implementation for "I-Con: A Unifying Framework for Representation Learning"
Code for "Scaling Language-Free Visual Representation Learning" paper (Web-SSL).
[CVPR 2025 Highlight] Official code for "Olympus: A Universal Task Router for Computer Vision Tasks"
[CVPR 2025 Oral] VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
A survey on MM-LLMs for long video understanding: From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Code for "Predicate Hierarchies Improve Few-Shot State Classification", ICLR 2025
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance
[ICLR 2025] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
[CVPR 2024] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
LaVIT: Empower the Large Language Model to Understand and Generate Visual Content
Easily create large video datasets from video URLs
[CVPR 2025] Official Repository of the paper "On the Consistency of Video Large Language Models in Temporal Comprehension"
Code for "CLIP Behaves like a Bag-of-Words Model Cross-modally but not Uni-modally"
Official PyTorch implementation of the paper "Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs"
Awesome papers & datasets specifically focused on long-form video understanding.