Stars
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and of performing real-time speech generation.
SEED-Voken: A Series of Powerful Visual Tokenizers
HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation
Wan: Open and Advanced Large-Scale Video Generative Models
The official code implementation of Generalized Category Discovery in Semantic Segmentation
Taming Transformers for High-Resolution Image Synthesis
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment, and Generate Anything
Project Page for "LISA: Reasoning Segmentation via Large Language Model"
[ICCV 2023] Official PyTorch implementation of "Rethinking Mobile Block for Efficient Attention-based Models"
Using AI models to automatically provide commentary and edit videos with a single click.
[ICLR 2025] Autoregressive Video Generation without Vector Quantization
[CVPR 2025 Highlight🔥] Identity-Preserving Text-to-Video Generation by Frequency Decomposition
HunyuanVideo: A Systematic Framework For Large Video Generation Model
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
[ICLR 2025] Official Implementation of Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation
This project aims to reproduce Sora (OpenAI's T2V model); we hope the open-source community will contribute to it.
[CVPR 2024] SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis