Stars
Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model
CrossLMM: Decoupling Long Video Sequences from LMMs via Dual Cross-Attention Mechanisms
The official repo for "Vidi: Large Multimodal Models for Video Understanding and Editing"
[CVPR 2025] Open-source, End-to-end, Vision-Language-Action model for GUI Agent & Computer Use.
Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Training released! Surpasses GPT-4o in ID persistence! Official ComfyUI workflow release! Only 4GB VRAM is enou…
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
An Enhanced CLIP Framework for Learning with Synthetic Captions
Shot2Story: a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports
FunQA benchmarks funny, creative, and magic videos for challenging tasks including timestamp localization, video description, reasoning, and beyond.
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
TransMLA: Multi-Head Latent Attention Is All You Need
A Fine-grained Benchmark for Video Captioning and Retrieval
MM-EUREKA: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning
Collections of Papers and Projects for Multimodal Reasoning.
[CVPR 2025] Online Video Understanding: OVBench and VideoChat-Online
Awesome-Long2short-on-LRMs is a collection of state-of-the-art, novel long2short methods for large reasoning models, containing papers, code, datasets, evaluations, and analyses.
Witness the aha moment of VLM with less than $3.
Fully open reproduction of DeepSeek-R1
[ICLR 2025] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark
[CVPR 2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
hhaAndroid / xtuner
Forked from InternLM/xtuner. XTuner is a toolkit for efficiently fine-tuning LLMs.