Stars
Official Repository of Paper "ROSA: Harnessing Robot States for Vision-Language and Action Alignment"
MAGI-1: Autoregressive Video Generation at Scale
[ECCV 2024] Video Foundation Models & Data for Multimodal Understanding
[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
ReNeg: Learning Negative Embedding with Reward Guidance
Liquid: Language Models are Scalable and Unified Multi-modal Generators
[CVPR 2025] StreetCrafter: Street View Synthesis with Controllable Video Diffusion Models
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
The official implementation of "[MASK] is All You Need"
A PyTorch library for implementing flow matching algorithms, featuring continuous and discrete flow matching implementations. It includes practical examples for both text and image modalities.
[CVPR 2025 Oral] Infinity ∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
HunyuanVideo: A Systematic Framework For Large Video Generation Model
[ECCV 2024] Street Gaussians: Modeling Dynamic Urban Scenes with Gaussian Splatting
[CVPR 2025 Highlight] Truncated Diffusion Model for Real-Time End-to-End Autonomous Driving
[ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
[NeurIPS 2024] OmniTokenizer: One model and one weight for image-video joint tokenization.
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
[NeurIPS 2024] A Generalizable World Model for Autonomous Driving
[AAAI 2025] Linear-complexity Visual Sequence Learning with Gated Linear Attention
[CVPR 2025] DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
[ICML 2024] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
[CVPR 2024] LMDrive: Closed-Loop End-to-End Driving with Large Language Models
A method for matching the 3D point cloud sub-map generated by a robot during SLAM with a 2D map.
[CVPR 2024] Official Repository of Paper "Panacea: Panoramic and Controllable Video Generation for Autonomous Driving"