Stars
Ongoing research training transformer models at scale
[ICML'25] "ConText: Driving In-context Learning for Text Removal and Segmentation"
DeepFashion2 Dataset https://arxiv.org/pdf/1901.07973.pdf
Awesome work on hand pose estimation/tracking
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
[NeurIPS'23] Emergent Correspondence from Image Diffusion
CoTracker is a model for tracking any point (pixel) on a video.
ECCV 2020 paper "Whole-Body Human Pose Estimation in the Wild"
A real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body
SAM-PT: Extending SAM to zero-shot video segmentation with point-based tracking.
[CVPR2024, Highlight] Official code for DragDiffusion
This project is the official implementation of our ECCV 2018 paper "Simple Baselines for Human Pose Estimation and Tracking" (https://arxiv.org/abs/1804.06208)
[CVPR 2024] Official implementation of the paper "Visual In-context Learning"
🔥🔥 UNO: A Universal Customization Method for Both Single and Multi-Subject Conditioning
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o's performance
Code for the EMNLP 2024 paper "How do Large Language Models Learn In-Context? Query and Key Matrices of In-Context Heads are Two Towers for Metric Learning"
[ICLR'25] Official code for the paper 'MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs'
This repo contains the code for "MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks" [ICLR 2025]
[CVPR 2025] LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant
Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Official repository of "GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing"
Official Repo for Paper "OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision" [ICLR2025]
[ICLR 2025 Spotlight] OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text