Stars
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
🚀 SenseFlow: Scaling Distribution Matching for Flow-based Text-to-Image Distillation
Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
[CVPR 2025] MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis
We introduce the DI*-SDX-1step model, a leading human-preferred 1-step text-to-image model at 1024 resolution.
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
[CVPR 2025] Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
DreamO: A Unified Framework for Image Customization
A SOTA open-source image editing model that aims to provide performance comparable to closed-source models like GPT-4o and Gemini 2 Flash.
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think!
[ICLR'25 Oral] Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
PyTorch implementation of the paper "SimpleAR: Pushing the Frontier of Autoregressive Visual Generation"
Official repository of In-Context LoRA for Diffusion Transformers
A minimal and universal controller for FLUX.1.
LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation
[CVPRW 2025] UniToken is an auto-regressive generation model that combines discrete and continuous representations to process visual inputs, making it easy to integrate both visual understanding an…
Paper list: deep learning based video compression