Stars
[CVPR 2025 Best Paper Award Candidate] VGGT: Visual Geometry Grounded Transformer
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
Wan: Open and Advanced Large-Scale Video Generative Models
FlashMLA: Efficient MLA decoding kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
MoBA: Mixture of Block Attention for Long-Context LLMs
[CVPR 2025 Highlight] Video Depth Anything: Consistent Depth Estimation for Super-Long Videos
Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
New repo collection for NVIDIA Cosmos: https://github.com/nvidia-cosmos
[arXiv 2024] Edicho: Consistent Image Editing in the Wild
HunyuanVideo: A Systematic Framework For Large Video Generation Model
[CVPR'25] Official implementation of the paper "MagicQuill: An Intelligent Interactive Image Editing System"
A suite of image and video neural tokenizers
Code for "Diffusion Model Alignment Using Direct Preference Optimization"
Video Generation, Physical Commonsense, Semantic Adherence, VideoCon-Physics
Scaling Diffusion Transformers with Mixture of Experts
Meta Lingua: a lean, efficient, and easy-to-hack codebase to research LLMs.
Official Implementation of paper "MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion"
A PyTorch native platform for training generative AI models
Long context evaluation for large language models
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A continuously updated collection of DCTLs (DaVinci Color Transform Language) designed to enhance and educate on workflows using ARRI LogC3, Gen5 and Cineon in DaVinci Resolve. This collection offe…
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
[ECCV 2024] 3DPE: Real-time 3D-aware Portrait Editing from a Single Image