
Simulating the Real World: Survey & Resources


This repository is divided into two main sections:

Our Survey Paper Collection - This section presents our survey, "Simulating the Real World: A Unified Survey of Multimodal Generative Models", which systematically unifies the study of 2D, video, 3D, and 4D generation within a single framework.

Text2X Resources – This section continues the original Awesome-Text2X-Resources, an open collection of state-of-the-art (SOTA) and novel Text-to-X (where X can be anything) methods, including papers, code, and datasets. The goal is to track the rapid progress in this field and provide researchers with up-to-date references.

⭐ If you find this repository useful for your research or work, a star is highly appreciated!

💗 This repository is continuously updated. If you find relevant papers, blog posts, videos, or other resources that should be included, feel free to submit a pull request (PR) or open an issue. Community contributions are always welcome!

Table of Contents

📜 Our Survey Paper Collection

𝐒𝐢𝐦𝐮𝐥𝐚𝐭𝐢𝐧𝐠 𝐭𝐡𝐞 𝐑𝐞𝐚𝐥 𝐖𝐨𝐫𝐥𝐝: 𝐀 𝐔𝐧𝐢𝐟𝐢𝐞𝐝 𝐒𝐮𝐫𝐯𝐞𝐲 𝐨𝐟 𝐌𝐮𝐥𝐭𝐢𝐦𝐨𝐝𝐚𝐥 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐯𝐞 𝐌𝐨𝐝𝐞𝐥𝐬

arXiv

Abstract

Understanding and replicating the real world is a critical challenge in Artificial General Intelligence (AGI) research. To achieve this, many existing approaches, such as world models, aim to capture the fundamental principles governing the physical world, enabling more accurate simulations and meaningful interactions. However, current methods often treat different modalities, including 2D (images), videos, 3D, and 4D representations, as independent domains, overlooking their interdependencies. Additionally, these methods typically focus on isolated dimensions of reality without systematically integrating their connections. In this survey, we present a unified survey of multimodal generative models that investigates the progression of data dimensionality in real-world simulation. Specifically, this survey starts from 2D generation (appearance), then moves to video (appearance+dynamics) and 3D generation (appearance+geometry), and finally culminates in 4D generation that integrates all dimensions. To the best of our knowledge, this is the first attempt to systematically unify the study of 2D, video, 3D and 4D generation within a single framework. To guide future research, we provide a comprehensive review of datasets, evaluation metrics, and future directions, fostering insights for newcomers. This survey serves as a bridge to advance the study of multimodal generative models and real-world simulation within a unified framework.

⭐ Citation

If you find this paper and repo helpful for your research, please cite it below:

@article{hu2025simulating,
  title={Simulating the Real World: A Unified Survey of Multimodal Generative Models},
  author={Hu, Yuqi and Wang, Longguang and Liu, Xian and Chen, Ling-Hao and Guo, Yuwei and Shi, Yukai and Liu, Ce and Rao, Anyi and Wang, Zeyu and Xiong, Hui},
  journal={arXiv preprint arXiv:2503.04641},
  year={2025}
}

Paradigms

Tip

Feel free to submit a pull request or contact us if you find any related papers that are not included here. The process for submitting a pull request is as follows:

  • a. Fork the project into your own repository.
  • b. Add the Title, Paper link, Conference, Project/GitHub link in README.md using the following format (see the example entry after this list):
[Origin] **Paper Title** [[Paper](Paper Link)] [[GitHub](GitHub Link)] [[Project Page](Project Page Link)]
  • c. Submit the pull request to this branch.
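
For example, a filled-in entry following the format above might look like the line below; the title and links here are placeholders, not a real paper:

[arXiv 2025] **A Placeholder Paper Title** [[Paper](https://arxiv.org/abs/xxxx.xxxxx)] [[GitHub](https://github.com/username/repo)] [[Project Page](https://username.github.io/project)]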

2D Generation

Text-to-Image Generation.

Here are some seminal papers and models.

  • Imagen: [NeurIPS 2022] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [Paper] [Project Page]
  • DALL-E: [ICML 2021] Zero-shot text-to-image generation [Paper] [GitHub]
  • DALL-E 2: [arXiv 2022] Hierarchical Text-Conditional Image Generation with CLIP Latents [Paper]
  • DALL-E 3: [Platform Link]
  • DeepFloyd IF: [GitHub]
  • Stable Diffusion: [CVPR 2022] High-Resolution Image Synthesis with Latent Diffusion Models [Paper] [GitHub]
  • SDXL: [ICLR 2024 spotlight] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis [Paper] [GitHub]
  • FLUX.1: [Platform Link]

Video Generation

Text-to-video generation models adapt text-to-image frameworks to handle the additional dimension of dynamics in the real world. We classify these models into three categories based on their underlying generative architectures: VAE/GAN-based, diffusion-based, and autoregressive approaches.
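
To make the diffusion-based route concrete, the snippet below sketches the common pattern of lifting a text-to-image denoising backbone to video by factorizing attention into a spatial pass over each frame and a temporal pass across frames. It is a minimal, illustrative PyTorch sketch under assumed tensor layouts and module names, not the implementation of any specific paper listed below.

```python
# Minimal sketch (not from any specific paper): factorized spatial-temporal
# attention, a common way to lift a text-to-image diffusion backbone to video
# by adding an extra attention pass over the time axis.
import torch
import torch.nn as nn


class FactorizedSpaceTimeAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: video latent of shape (batch, frames, spatial_tokens, channels)
        b, t, n, c = x.shape

        # Spatial attention: each frame attends over its own spatial tokens.
        xs = self.norm1(x).reshape(b * t, n, c)
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(b, t, n, c)

        # Temporal attention: each spatial location attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * n, t, c)
        out = self.temporal_attn(xt, xt, xt)[0].reshape(b, n, t, c).permute(0, 2, 1, 3)
        return x + out


if __name__ == "__main__":
    latent = torch.randn(2, 8, 16 * 16, 64)  # (batch, frames, tokens, channels)
    video_out = FactorizedSpaceTimeAttention(dim=64)(latent)
    print(video_out.shape)  # torch.Size([2, 8, 256, 64])
```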

Survey
  • [AIRC 2023] A Survey of AI Text-to-Image and AI Text-to-Video Generators [Paper]
  • [arXiv 2024] Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [Paper]

Video Algorithms

(1) VAE- and GAN-based Approaches.

VAE-based Approaches.

GAN-based Approaches.

  • [CVPR 2018] MoCoGAN: Decomposing Motion and Content for Video Generation [Paper] [GitHub]
  • [CVPR 2022] StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2 [Paper] [GitHub] [Project Page]
  • DIGAN: [ICLR 2022] Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks [Paper] [GitHub] [Project Page]
  • [ICCV 2023] StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation [Paper] [GitHub] [Project Page]
(2) Diffusion-based Approaches.

U-Net-based Architectures.

  • [NeurIPS 2022] Video Diffusion Models [Paper] [Project Page]
  • [arXiv 2022] Imagen Video: High Definition Video Generation with Diffusion Models [Paper] [Project Page]
  • [arXiv 2022] MagicVideo: Efficient Video Generation With Latent Diffusion Models [Paper] [Project Page]
  • [ICLR 2023 Poster] Make-A-Video: Text-to-Video Generation without Text-Video Data [Paper] [Project Page]
  • GEN-1: [ICCV 2023] Structure and Content-Guided Video Synthesis with Diffusion Models [Paper] [Project Page]
  • PYoCo: [ICCV 2023] Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [Paper] [Project Page]
  • [CVPR 2023] Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models [Paper] [Project Page]
  • [IJCV 2024] Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] VideoComposer: Compositional Video Synthesis with Motion Controllability [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Spotlight] AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project Page]
  • [ECCV 2024] Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [Paper] [Project Page]
  • [SIGGRAPH Asia 2024] Lumiere: A Space-Time Diffusion Model for Video Generation [Paper] [Project Page]

Transformer-based Architectures.

  • [ICLR 2024 Poster] VDT: General-purpose Video Diffusion Transformers via Mask Modeling [Paper] [GitHub] [Project Page]
  • W.A.L.T: [ECCV 2024] Photorealistic Video Generation with Diffusion Models [Paper] [Project Page]
  • [CVPR 2024] Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project Page]
  • [CVPR 2024] GenTron: Diffusion Transformers for Image and Video Generation [Paper] [Project Page]
  • [ICLR 2025 Poster] CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [Paper] [GitHub]
  • [ICLR 2025 Spotlight] Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers [Paper] [GitHub]
(3) Autoregressive-based Approaches.
  • VQ-GAN: [CVPR 2021 Oral] Taming Transformers for High-Resolution Image Synthesis [Paper] [GitHub]
  • [CVPR 2023 Highlight] MAGVIT: Masked Generative Video Transformer [Paper] [GitHub] [Project Page]
  • [ICLR 2023 Poster] CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers [Paper] [GitHub]
  • [ICML 2024] VideoPoet: A Large Language Model for Zero-Shot Video Generation [Paper] [Project Page]
  • [ICLR 2024 Poster] Language Model Beats Diffusion - Tokenizer is key to visual generation [Paper]
  • [arXiv 2024] Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation [Paper] [GitHub]
  • [arXiv 2024] Emu3: Next-Token Prediction is All You Need [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding [Paper] [GitHub]

Video Applications

Video Editing.
  • [ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Pix2Video: Video Editing using Image Diffusion [Paper] [GitHub] [Project Page]
  • [CVPR 2024] VidToMe: Video Token Merging for Zero-Shot Video Editing [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Video-P2P: Video Editing with Cross-attention Control [Paper] [GitHub] [Project Page]
  • [CVPR 2024 Highlight] CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] Towards Consistent Video Editing with Text-to-Image Diffusion Models [Paper]
  • [ICLR 2024 Poster] Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [Paper] [GitHub] [Project Page]
  • [arXiv 2024] UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [Paper] [GitHub] [Project Page]
  • [TMLR 2024] AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [Paper] [GitHub] [Project Page]
Novel View Synthesis.
  • [arXiv 2024] ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis [Paper] [GitHub] [Project Page]
  • [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] CameraCtrl: Enabling Camera Control for Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICLR 2025 Poster] NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer [Paper] [GitHub]
Human Animation in Videos.
  • [ICCV 2019] Everybody Dance Now [Paper] [GitHub] [Project Page]
  • [ICCV 2019] Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis [Paper] [GitHub] [Project Page] [Dataset]
  • [NeurIPS 2019] First Order Motion Model for Image Animation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Adding Conditional Control to Text-to-Image Diffusion Models [Paper] [GitHub]
  • [ICCV 2023] HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2023] Learning Locally Editable Virtual Humans [Paper] [GitHub] [Project Page] [Dataset]
  • [CVPR 2024] Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [Paper] [GitHub] [Project Page]
  • [CVPRW 2024] LatentMan: Generating Consistent Animated Characters using Image Diffusion Models [Paper] [GitHub] [Project Page]
  • [IJCAI 2024] Zero-shot High-fidelity and Pose-controllable Character Animation [Paper]
  • [arXiv 2024] UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [Paper] [GitHub] [Project Page]
  • [arXiv 2024] MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling [Paper] [GitHub] [Project Page]

3D Generation

3D Algorithms

Text-to-3D Generation.
Survey
  • [arXiv 2023] Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era [Paper]
  • [arXiv 2024] Advances in 3D Generation: A Survey [Paper]
  • [arXiv 2024] A Survey On Text-to-3D Contents Generation In The Wild [Paper]
Feedforward Approaches.
  • [arXiv 2022] 3D-LDM: Neural Implicit 3D Shape Generation with Latent Diffusion Models [Paper] [GitHub]
  • [arXiv 2022] Point-E: A System for Generating 3D Point Clouds from Complex Prompts [Paper] [GitHub]
  • [arXiv 2023] Shap-E: Generating Conditional 3D Implicit Functions [Paper] [GitHub]
  • [NeurIPS 2023] Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation [Paper] [GitHub] [Project Page]
  • [ICCV 2023] ATT3D: Amortized Text-to-3D Object Synthesis [Paper] [Project Page]
  • [ICLR 2023 Spotlight] MeshDiffusion: Score-based Generative 3D Mesh Modeling [Paper] [GitHub] [Project Page]
  • [CVPR 2023] Diffusion-SDF: Text-to-Shape via Voxelized Diffusion [Paper] [GitHub] [Project Page]
  • [ICML 2024] HyperFields: Towards Zero-Shot Generation of NeRFs from Text [Paper] [GitHub] [Project Page]
  • [ECCV 2024] LATTE3D: Large-scale Amortized Text-To-Enhanced3D Synthesis [Paper] [Project Page]
  • [arXiv 2024] AToM: Amortized Text-to-Mesh using 2D Diffusion [Paper] [GitHub] [Project Page]
Optimization-based Approaches.
  • [ICLR 2023 notable top 5%] DreamFusion: Text-to-3D using 2D Diffusion [Paper] [Project Page]
  • [CVPR 2023 Highlight] Magic3D: High-Resolution Text-to-3D Content Creation [Paper] [Project Page]
  • [CVPR 2023] Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models [Paper] [Project Page]
  • [ICCV 2023] Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023 Spotlight] ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] MVDream: Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Oral] DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion [Paper]
  • [CVPR 2024] VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation [Paper] [Project Page]
  • [CVPR 2024] GSGEN: Text-to-3D using Gaussian Splatting [Paper] [GitHub] [Project Page]
  • [CVPR 2024] GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior [Paper] [GitHub] [Project Page]
MVS-based Approaches.
  • [ICLR 2024 Poster] Instant3D: Fast Text-to-3D with Sparse-view Generation and Large Reconstruction Model [Paper] [Project Page]
  • [CVPR 2024] Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior [Paper] [GitHub] [Project Page]
Image-to-3D Generation.
Feedforward Approaches.
  • [arXiv 2023] 3DGen: Triplane Latent Diffusion for Textured Mesh Generation [Paper]
  • [NeurIPS 2023] Michelangelo: Conditional 3d shape generation based on shape-image-text aligned latent representation [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer [Paper] [GitHub] [Project Page]
  • [SIGGRAPH 2024 Best Paper Honorable Mention] CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets [Paper] [GitHub] [Project Page]
  • [arXiv 2024] CraftsMan: High-fidelity Mesh Generation with 3D Native Generation and Interactive Geometry Refiner [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Structured 3D Latents for Scalable and Versatile 3D Generation [Paper] [GitHub] [Project Page]
Optimization-based Approaches.
  • [arXiv 2023] Consistent123: Improve Consistency for One Image to 3D Object Synthesis [Paper] [Project Page]
  • [arXiv 2023] ImageDream: Image-Prompt Multi-view Diffusion for 3D Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2023] RealFusion: 360° Reconstruction of Any Object from a Single Image [Paper] [GitHub] [Project Page]
  • [ICCV 2023] Zero-1-to-3: Zero-shot One Image to 3D Object [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Poster] TOSS: High-quality Text-guided Novel View Synthesis from a Single Image [Paper] [GitHub] [Project Page]
  • [ICLR 2024 Spotlight] SyncDreamer: Generating Multiview-consistent Images from a Single-view Image [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Wonder3D: Single Image to 3D using Cross-Domain Diffusion [Paper] [GitHub] [Project Page]
  • [ICLR 2025] IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts [Paper] [GitHub]
MVS-based Approaches.
  • [NeurIPS 2023] One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization [Paper] [GitHub] [Project Page]
  • [ECCV 2024] CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model [Paper] [GitHub] [Project Page]
  • [arXiv 2024] InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models [Paper] [GitHub]
  • [ICLR 2024 Oral] LRM: Large Reconstruction Model for Single Image to 3D [Paper] [Project Page]
  • [NeurIPS 2024] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image [Paper] [GitHub] [Project Page]
Video-to-3D Generation.
  • [CVPR 2024 Highlight] ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models [Paper] [GitHub] [Project Page]
  • [ICML 2024] IM-3D: Iterative Multiview Diffusion and Reconstruction for High-Quality 3D Generation [Paper] [Project Page]
  • [arXiv 2024] V3D: Video Diffusion Models are Effective 3D Generators [Paper] [GitHub] [Project Page]
  • [ECCV 2024 Oral] SV3D: Novel Multi-view Synthesis and 3D Generation from a Single Image Using Latent Video Diffusion [Paper] [Project Page]
  • [NeurIPS 2024 Oral] CAT3D: Create Anything in 3D with Multi-View Diffusion Models [Paper] [Project Page]

3D Applications

Avatar Generation.
  • [CVPR 2023] Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation [Paper]
  • [SIGGRAPH 2023] DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance [Paper] [Project Page]
  • [NeurIPS 2023] Headsculpt: Crafting 3d head avatars with text [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023] DreamWaltz: Make a Scene with Complex 3D Animatable Avatars [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023 Spotlight] DreamHuman: Animatable 3D Avatars from Text [Paper] [Project Page]
Scene Generation.
  • [ACM MM 2023] RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture [Paper]
  • [TVCG 2024] Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields [Paper] [GitHub] [Project Page]
  • [ECCV 2024] DreamScene: 3D Gaussian-based Text-to-3D Scene Generation via Formation Pattern Sampling [Paper] [GitHub] [Project Page]
  • [ECCV 2024] DreamScene360: Unconstrained Text-to-3D Scene Generation with Panoramic Gaussian Splatting [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior [Paper] [GitHub] [Project Page]
  • [arXiv 2024] CityCraft: A Real Crafter for 3D City Generation [Paper] [GitHub]
3D Editing.

4D Generation

4D Algorithms

Feedforward Approaches.
Optimization-based Approaches.
  • [arXiv 2023] Text-To-4D Dynamic Scene Generation [Paper] [Project Page]
  • [CVPR 2024] 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling [Paper] [GitHub] [Project Page]
  • [CVPR 2024] A Unified Approach for Text- and Image-guided 4D Scene Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models [Paper] [Project Page]
  • [ECCV 2024] TC4D: Trajectory-Conditioned Text-to-4D Generation [Paper] [GitHub] [Project Page]
  • [ECCV 2024] SC4D: Sparse-Controlled Video-to-4D Generation and Motion Transfer [Paper] [GitHub] [Project Page]
  • [ECCV 2024] STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] 4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models [Paper] [Project Page]
  • [NeurIPS 2024] Compositional 3D-aware Video Generation with LLM Director [Paper] [Project Page]
  • [NeurIPS 2024] DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024] DreamMesh4D: Video-to-4D Generation with Sparse-Controlled Gaussian-Mesh Hybrid Representation [Paper] [GitHub] [Project Page]
  • [arXiv 2024] Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [Paper] [GitHub]

4D Applications

4D Editing.
  • [CVPR 2024] Control4D: Efficient 4D Portrait Editing with Text [Paper] [Project Page]
  • [CVPR 2024] Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion [Paper] [GitHub] [Project Page]
Human Animation.
  • [SIGGRAPH 2020] Robust Motion In-betweening [Paper]
  • [CVPR 2022] Generating Diverse and Natural 3D Human Motions from Text [Paper] [GitHub] [Project Page]
  • [SCA 2023] Motion In-Betweening with Phase Manifolds [Paper] [GitHub]
  • [CVPR 2023] T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations [Paper] [GitHub] [Project Page]
  • [ICLR 2023 notable top 25%] Human Motion Diffusion Model [Paper] [GitHub] [Project Page]
  • [NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language [Paper] [GitHub] [Project Page]
  • [ICML 2024] HumanTOMATO: Text-aligned Whole-body Motion Generation [Paper] [GitHub] [Project Page]
  • [CVPR 2024] MoMask: Generative Masked Modeling of 3D Human Motions [Paper] [GitHub] [Project Page]
  • [CVPR 2024] Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives [Paper] [GitHub] [Project Page]

Other Related Resources

World Foundation Model Platform

  • NVIDIA Cosmos ([GitHub] [Paper]): NVIDIA Cosmos is a world foundation model platform for accelerating the development of physical AI systems.

    • Cosmos-Transfer1: a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environments.
    • Cosmos-Predict1: a collection of general-purpose world foundation models for Physical AI that can be fine-tuned into customized world models for downstream applications.
    • Cosmos-Reason1: a model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning processes.

🎯Back to Top - Our Survey Paper Collection

🔥 Awesome Text2X Resources

An open collection of state-of-the-art (SOTA) and novel Text-to-X (where X can be anything) methods, including papers, code, and datasets, intended to keep pace with the anticipated surge of research.

Awesome

Update Logs

2025 Update Logs:
  • 2025.05.08 - update new layout.
  • 2025.03.10 - CVPR 2025 Accepted Papers🎉
  • 2025.02.28 - update several papers status "CVPR 2025" to accepted papers, congrats to all 🎉
  • 2025.01.23 - update several papers status "ICLR 2025" to accepted papers, congrats to all 🎉
  • 2025.01.09 - update layout.
Previous 2024 Update Logs:
  • 2024.12.21 - adjusted the layouts of several sections. Happy Winter Solstice ⚪🥣.
  • 2024.09.26 - update several papers status "NeurIPS 2024" to accepted papers, congrats to all 🎉
  • 2024.09.03 - add one new section 'text to model'.
  • 2024.06.30 - add one new section 'text to video'.
  • 2024.07.02 - update several papers status "ECCV 2024" to accepted papers, congrats to all 🎉
  • 2024.06.21 - add one hot topic about AIGC 4D Generation to the section of Survey and Awesome Repos.
  • 2024.06.17 - an awesome repo for CVPR2024 Link 👍🏻
  • 2024.04.05 - adjusted the layout and added accepted lists and arXiv lists to each section.
  • 2024.04.05 - an awesome repo for CVPR2024 on 3DGS and NeRF Link 👍🏻
  • 2024.03.25 - add one new survey paper of 3D GS into the section of "Survey and Awesome Repos--Topic 1: 3D Gaussian Splatting".
  • 2024.03.12 - add a new section "Dynamic Gaussian Splatting", including Neural Deformable 3D Gaussians, 4D Gaussians, Dynamic 3D Gaussians.
  • 2024.03.11 - CVPR 2024 Accepted Papers Link
  • update some papers accepted by CVPR 2024! Congratulations🎉

Text to 4D

(Also, Image/Video to 4D)

🎉 4D Accepted Papers

| Year | Title | Venue | Paper | Code | Project Page |
|------|-------|-------|-------|------|--------------|
| 2025 | Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images | ICLR 2025 | Link | Link | Link |
| 2025 | GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking | CVPR 2025 | Link | Link | Link |
| 2025 | Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos | CVPR 2025 Oral | Link | Link | Link |
Accepted Papers References
%accepted papers

@inproceedings{jinoptimizing,
  title={Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images},
  author={Jin, In-Hwan and Choo, Haesoo and Jeong, Seong-Hun and Park, Heemoon and Kim, Junghwan and Kwon, Oh-joon and Kong, Kyeongbo},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}

@article{bian2025gsdit,
  title={GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking},
  author={Bian, Weikang and Huang, Zhaoyang and Shi, Xiaoyu and Li, Yijin and Wang, Fu-Yun and Li, Hongsheng},
  journal={arXiv preprint arXiv:2501.02690},
  year={2025}
}

@article{jin2024stereo4d,
  title={Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos}, 
  author={Jin, Linyi and Tucker, Richard and Li, Zhengqi and Fouhey, David and Snavely, Noah and Holynski, Aleksander},
  journal={CVPR},
  year={2025},
}


💡 4D ArXiv Papers

1. AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian (University of Science and Technology of China, Microsoft Research Asia)

Abstract Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, typically leading to issues such as limited diversity, spatial-temporal inconsistency and poor prompt alignment, due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Specifically, our paradigm consists of three stages. To begin with, for a monocular video that is either generated or captured, we first utilize pre-trained expert models to create a 3D representation of the first frame, which is further fine-tuned to serve as the canonical space. Subsequently, motivated by the fact that videos happen naturally in an autoregressive manner, we propose to generate each frame's 3D representation based on its previous frame's representation, as this autoregressive generation manner can facilitate more accurate geometry and motion estimation. Meanwhile, to prevent overfitting during this process, we introduce a progressive view sampling strategy, utilizing priors from pre-trained large-scale 3D reconstruction models. To avoid appearance drift introduced by autoregressive generation, we further incorporate a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments have demonstrated that AR4D can achieve state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.

2. WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes

Ling Yang, Kaixin Zhu, Juanxi Tian, Bohan Zeng, Mingbao Lin, Hongjuan Pei, Wentao Zhang, Shuicheng Yan

(Peking University, University of the Chinese Academy of Sciences, National University of Singapore)

Abstract With the rapid development of 3D reconstruction technology, research in 4D reconstruction is also advancing; existing 4D reconstruction methods can generate high-quality 4D scenes. However, due to the challenges in acquiring multi-view video data, the current 4D reconstruction benchmarks mainly display actions performed in place, such as dancing, within limited scenarios. In practical scenarios, many scenes involve wide-range spatial movements, highlighting the limitations of existing 4D reconstruction datasets. Additionally, existing 4D reconstruction methods rely on deformation fields to estimate the dynamics of 3D objects, but deformation fields struggle with wide-range spatial movements, which limits the ability to achieve high-quality 4D scene reconstruction with wide-range spatial movements. In this paper, we focus on 4D scene reconstruction with significant object spatial movements and propose a novel 4D reconstruction benchmark, WideRange4D. This benchmark includes rich 4D scene data with large spatial variations, allowing for a more comprehensive evaluation of the generation capabilities of 4D generation methods. Furthermore, we introduce a new 4D reconstruction method, Progress4D, which generates stable and high-quality 4D results across various complex 4D scene reconstruction tasks. We conduct both quantitative and qualitative comparison experiments on WideRange4D, showing that our Progress4D outperforms existing state-of-the-art 4D reconstruction methods.

3. SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation

Chun-Han Yao, Yiming Xie, Vikram Voleti, Huaizu Jiang, Varun Jampani

(Stability AI, Northeastern University)

Abstract We present Stable Video 4D 2.0 (SV4D 2.0), a multi-view video diffusion model for dynamic 3D asset generation. Compared to its predecessor SV4D, SV4D 2.0 is more robust to occlusions and large motion, generalizes better to real-world videos, and produces higher-quality outputs in terms of detail sharpness and spatio-temporal consistency. We achieve this by introducing key improvements in multiple aspects: 1) network architecture: eliminating the dependency of reference multi-views and designing blending mechanism for 3D and frame attention, 2) data: enhancing quality and quantity of training data, 3) training strategy: adopting progressive 3D-4D training for better generalization, and 4) 4D optimization: handling 3D inconsistency and large motion via 2-stage refinement and progressive frame sampling. Extensive experiments demonstrate significant performance gain by SV4D 2.0 both visually and quantitatively, achieving better detail (-14% LPIPS) and 4D consistency (-44% FV4D) in novel-view video synthesis and 4D optimization (-12% LPIPS and -24% FV4D) compared to SV4D.

4. Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, Ziwei Liu

(Huazhong University of Science and Technology, Nanyang Technological University, Great Bay University)

Abstract We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable rendering, marking a significant advancement in single-image-based 4D scene generation.

5. Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

(Visual Geometry Group University of Oxford, Naver Labs Europe)

Abstract We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.

6. In-2-4D: Inbetweening from Two Single-View Images to 4D Generation

Sauradip Nag, Daniel Cohen-Or, Hao Zhang, Ali Mahdavi-Amiri

(Simon Fraser University, Tel Aviv University)

Abstract We propose a new problem, In-2-4D, for generative 4D (i.e., 3D + motion) inbetweening from a minimalistic input setting: two single-view images capturing an object in two distinct motion states. Given two images representing the start and end states of an object in motion, our goal is to generate and reconstruct the motion in 4D. We utilize a video interpolation model to predict the motion, but large frame-to-frame motions can lead to ambiguous interpretations. To overcome this, we employ a hierarchical approach to identify keyframes that are visually close to the input states and show significant motion, then generate smooth fragments between them. For each fragment, we construct the 3D representation of the keyframe using Gaussian Splatting. The temporal frames within the fragment guide the motion, enabling their transformation into dynamic Gaussians through a deformation field. To improve temporal consistency and refine 3D motion, we expand the self-attention of multi-view diffusion across timesteps and apply rigid transformation regularization. Finally, we merge the independently generated 3D motion segments by interpolating boundary deformation fields and optimizing them to align with the guiding video, ensuring smooth and flicker-free transitions. Through extensive qualitative and quantitative experiments as well as a user study, we show the effectiveness of our method and its components.

7. TesserAct: Learning 4D Embodied World Models

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan

(UMass Amherst, HKUST, Harvard University)

Abstract This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

8. HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation

Haiyang Zhou, Wangbo Yu, Jiawen Guan, Xinhua Cheng, Yonghong Tian, Li Yuan

(Peking University, Peng Cheng Laboratory, Harbin Institute of Technology)

Abstract The rapid advancement of diffusion models holds the promise of revolutionizing the application of VR and AR technologies, which typically require scene-level 4D assets for user experience. Nonetheless, existing diffusion models predominantly concentrate on modeling static 3D scenes or object-level dynamics, constraining their capacity to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, along with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that can convert panoramic images into high-quality panoramic videos. Following this, we present Panoramic Space-Time Reconstruction, which leverages a space-time depth estimation method to transform the generated panoramic videos into 4D point clouds, enabling the optimization of a holistic 4D Gaussian Splatting representation to reconstruct spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis with existing approaches, revealing its superiority in both panoramic video generation and 4D scene reconstruction. This demonstrates our method's capability to create more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.

| Year | Title | ArXiv Time | Paper | Code | Project Page |
|------|-------|------------|-------|------|--------------|
| 2025 | AR4D: Autoregressive 4D Generation from Monocular Videos | 3 Jan 2025 | Link | -- | Link |
| 2025 | WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes | 17 Mar 2025 | Link | Link | Dataset Page |
| 2025 | SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation | 20 Mar 2025 | Link | -- | Link |
| 2025 | Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency | 26 Mar 2025 | Link | Link | Link |
| 2025 | Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction | 10 Apr 2025 | Link | Link | Link |
| 2025 | In-2-4D: Inbetweening from Two Single-View Images to 4D Generation | 11 Apr 2025 | Link | Link | Link |
| 2025 | TesserAct: Learning 4D Embodied World Models | 29 Apr 2025 | Link | Link | Link |
| 2025 | HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation | 30 Apr 2025 | Link | Link | Link |
ArXiv Papers References
%arxiv papers

@misc{zhu2025ar4dautoregressive4dgeneration,
      title={AR4D: Autoregressive 4D Generation from Monocular Videos}, 
      author={Hanxin Zhu and Tianyu He and Xiqian Yu and Junliang Guo and Zhibo Chen and Jiang Bian},
      year={2025},
      eprint={2501.01722},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.01722}, 
}

@article{yang2025widerange4d,
  title={WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes},
  author={Yang, Ling and Zhu, Kaixin and Tian, Juanxi and Zeng, Bohan and Lin, Mingbao and Pei, Hongjuan and Zhang, Wentao and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2503.13435},
  year={2025}
}

@misc{yao2025sv4d20enhancingspatiotemporal,
      title={SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation}, 
      author={Chun-Han Yao and Yiming Xie and Vikram Voleti and Huaizu Jiang and Varun Jampani},
      year={2025},
      eprint={2503.16396},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.16396}, 
}

 @article{liu2025free4d,
     title={Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency},
     author={Liu, Tianqi and Huang, Zihao and Chen, Zhaoxi and Wang, Guangcong and Hu, Shoukang and Shen, Liao and Sun, Huiqiang and Cao, Zhiguo and Li, Wei and Liu, Ziwei},
     journal={arXiv preprint arXiv:2503.20785},
     year={2025}
}

@article{jiang2025geo4d,
  title={Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction},
  author={Jiang, Zeren and Zheng, Chuanxia and Laina, Iro and Larlus, Diane and Vedaldi, Andrea},
  journal={arXiv preprint arXiv:2504.07961},
  year={2025}
}

@misc{nag2025in24dinbetweeningsingleviewimages,
      title={In-2-4D: Inbetweening from Two Single-View Images to 4D Generation}, 
      author={Sauradip Nag and Daniel Cohen-Or and Hao Zhang and Ali Mahdavi-Amiri},
      year={2025},
      eprint={2504.08366},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      url={https://arxiv.org/abs/2504.08366}, 
}

@misc{zhen2025tesseract,
  title={TesserAct: Learning 4D Embodied World Models}, 
  author={Haoyu Zhen and Qiao Sun and Hongxin Zhang and Junyan Li and Siyuan Zhou and Yilun Du and Chuang Gan},
  year={2025},
  eprint={2504.20995},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.20995}, 
}

@article{zhou2025holotime,
  title={HoloTime: Taming Video Diffusion Models for Panoramic 4D Scene Generation},
  author={Zhou, Haiyang and Yu, Wangbo and Guan, Jiawen and Cheng, Xinhua and Tian, Yonghong and Yuan, Li},
  journal={arXiv preprint arXiv:2504.21650},
  year={2025}
}


Previous Papers

Year 2023

In 2023, tasks classified as text/image-to-4D and video-to-4D generally involve producing four-dimensional data from text/image or video input. For more details, please check the 2023 4D Papers, including 6 accepted papers and 3 arXiv papers.

Year 2024

For more details, please check the 2024 4D Papers, including 21 accepted papers and 13 arXiv papers.


Text to Video

🎉 T2V Accepted Papers

| Year | Title | Venue | Paper | Code | Project Page |
|------|-------|-------|-------|------|--------------|
| 2025 | TransPixar: Advancing Text-to-Video Generation with Transparency | CVPR 2025 | Link | Link | Link |
| 2025 | BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations | CVPR 2025 | Link | -- | Link |
| 2025 | Identity-Preserving Text-to-Video Generation by Frequency Decomposition | CVPR 2025 | Link | Link | Link |
| 2025 | One-Minute Video Generation with Test-Time Training | CVPR 2025 | Link | Link | Link |
| 2025 | The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation | CVPR 2025 | Link | Link | Link |
Accepted Papers References
%accepted papers

@misc{wang2025transpixar,
     title={TransPixar: Advancing Text-to-Video Generation with Transparency}, 
     author={Luozhou Wang and Yijun Li and Zhifei Chen and Jui-Hsien Wang and Zhifei Zhang and He Zhang and Zhe Lin and Yingcong Chen},
     year={2025},
     eprint={2501.03006},
     archivePrefix={arXiv},
     primaryClass={cs.CV},
     url={https://arxiv.org/abs/2501.03006}, 
}

@article{feng2025blobgen,
  title={BlobGEN-Vid: Compositional Text-to-Video Generation with Blob Video Representations},
  author={Feng, Weixi and Liu, Chao and Liu, Sifei and Wang, William Yang and Vahdat, Arash and Nie, Weili},
  journal={arXiv preprint arXiv:2501.07647},
  year={2025}
}

@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyuan and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}

@misc{dalal2025oneminutevideogenerationtesttime,
      title={One-Minute Video Generation with Test-Time Training}, 
      author={Karan Dalal and Daniel Koceja and Gashon Hussein and Jiarui Xu and Yue Zhao and Youjin Song and Shihao Han and Ka Chun Cheung and Jan Kautz and Carlos Guestrin and Tatsunori Hashimoto and Sanmi Koyejo and Yejin Choi and Yu Sun and Xiaolong Wang},
      year={2025},
      eprint={2504.05298},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.05298}, 
}

@article{gao2025devil,
  title={The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation},
  author={Gao, Bingjie and Gao, Xinyu and Wu, Xiaoxue and Zhou, Yujie and Qiao, Yu and Niu, Li and Chen, Xinyuan and Wang, Yaohui},
  journal={arXiv preprint arXiv:2504.11739},
  year={2025}
}


💡 T2V ArXiv Papers

1. Multi-subject Open-set Personalization in Video Generation

Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, Sergey Tulyakov

(Snap Inc., UC Merced, CMU)

Abstract Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist − a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.

2. Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation

Chenjie Cao, Jingkai Zhou, Shikai Li, Jingyun Liang, Chaohui Yu, Fan Wang, Xiangyang Xue, Yanwei Fu

(Alibaba DAMO Academy, Fudan University, Hupan Lab)

Abstract Camera and human motion controls have been extensively studied for video generation, but existing approaches typically address them separately, suffering from limited data with high-quality annotations for both aspects. To overcome this, we present Uni3C, a unified 3D-enhanced framework for precise control of both camera and human motion in video generation. Uni3C includes two key contributions. First, we propose a plug-and-play control module trained with a frozen video generative backbone, PCDController, which utilizes unprojected point clouds from monocular depth to achieve accurate camera control. By leveraging the strong 3D priors of point clouds and the powerful capacities of video foundational models, PCDController shows impressive generalization, performing well regardless of whether the inference backbone is frozen or fine-tuned. This flexibility enables different modules of Uni3C to be trained in specific domains, i.e., either camera control or human motion control, reducing the dependency on jointly annotated data. Second, we propose a jointly aligned 3D world guidance for the inference phase that seamlessly integrates both scenic point clouds and SMPL-X characters to unify the control signals for camera and human motion, respectively. Extensive experiments confirm that PCDController enjoys strong robustness in driving camera motion for fine-tuned backbones of video generation. Uni3C substantially outperforms competitors in both camera controllability and human motion quality. Additionally, we collect tailored validation sets featuring challenging camera movements and human actions to validate the effectiveness of our method.

3. We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

Minkyu Choi, S P Sharan, Harsh Goel, Sahil Shah, Sandeep Chinchali

(The University of Texas at Austin)

Abstract Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce NeuS-E, a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that NeuS-E significantly enhances temporal and logical alignment across diverse prompts by almost 40%.

4. GENMO: A GENeralist Model for Human MOtion

Jiefeng Li, Jinkun Cao, Haotian Zhang, Davis Rempe, Jan Kautz, Umar Iqbal, Ye Yuan (NVIDIA)

Abstract Human motion modeling traditionally separates motion generation and estimation into distinct tasks with specialized models. Motion generation models focus on creating diverse, realistic motions from inputs like text, audio, or keyframes, while motion estimation models aim to reconstruct accurate motion trajectories from observations like videos. Despite sharing underlying representations of temporal dynamics and kinematics, this separation limits knowledge transfer between tasks and requires maintaining separate models. We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Leveraging the synergy between regression and diffusion, GENMO achieves accurate global motion estimation while enabling diverse motion generation. We also introduce an estimation-guided training objective that exploits in-the-wild videos with 2D annotations and text descriptions to enhance generative diversity. Furthermore, our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control. This unified approach creates synergistic benefits: generative priors improve estimated motions under challenging conditions like occlusions, while diverse video data enhances generation capabilities. Extensive experiments demonstrate GENMO's effectiveness as a generalist framework that successfully handles multiple human motion tasks within a single model.

| Year | Title | ArXiv Time | Paper | Code | Project Page |
|------|-------|------------|-------|------|--------------|
| 2025 | Multi-subject Open-set Personalization in Video Generation | 10 Jan 2025 | Link | -- | Link |
| 2025 | Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation | 21 Apr 2025 | Link | Link | Link |
| 2025 | We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback | 25 Apr 2025 | Link | -- | -- |
| 2025 | GENMO: A GENeralist Model for Human MOtion | 2 May 2025 | Link | -- | Link |
ArXiv Papers References
%arxiv papers

@misc{chen2025multisubjectopensetpersonalizationvideo,
      title={Multi-subject Open-set Personalization in Video Generation}, 
      author={Tsai-Shien Chen and Aliaksandr Siarohin and Willi Menapace and Yuwei Fang and Kwot Sin Lee and Ivan Skorokhodov and Kfir Aberman and Jun-Yan Zhu and Ming-Hsuan Yang and Sergey Tulyakov},
      year={2025},
      eprint={2501.06187},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.06187}, 
}

@article{cao2025uni3c,
        title={Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation},
        author={Cao, Chenjie and Zhou, Jingkai and Li, Shikai and Liang, Jingyun and Yu, Chaohui and Wang, Fan and Xue, Xiangyang and Fu, Yanwei},
        journal={arXiv preprint arXiv:2504.14899},
        year={2025}
}

@article{choi2025we,
  title={We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback},
  author={Choi, Minkyu and Sharan, SP and Goel, Harsh and Shah, Sahil and Chinchali, Sandeep},
  journal={arXiv preprint arXiv:2504.17180},
  year={2025}
}

@article{genmo2025,
  title={GENMO: A GENeralist Model for Human MOtion},
  author={Li, Jiefeng and Cao, Jinkun and Zhang, Haotian and Rempe, Davis and Kautz, Jan and Iqbal, Umar and Yuan, Ye},
  journal={arXiv preprint arXiv:2505.01425},
  year={2025}
}


Video Other Additional Info

Previous Papers

Year 2024

For more details, please check the 2024 T2V Papers, including 21 accepted papers and 6 arXiv papers.

  • OSS video generation models: Mochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence.
  • Survey: The Dawn of Video Generation: Preliminary Explorations with SORA-like Models, arXiv, Project Page, GitHub Repo

📚 Dataset Works

1. VidGen-1M: A Large-Scale Dataset for Text-to-video Generation

Zhiyu Tan, Xiaomeng Yang, Luozheng Qin, Hao Li

(Fudan University, Shanghai Academy of AI for Science)

Abstract The quality of video-text pairs fundamentally determines the upper bound of text-to-video models. Currently, the datasets used for training these models suffer from significant shortcomings, including low temporal consistency, poor-quality captions, substandard video quality, and imbalanced data distribution. The prevailing video curation process, which depends on image models for tagging and manual rule-based curation, leads to a high computational load and leaves behind unclean data. As a result, there is a lack of appropriate training datasets for text-to-video models. To address this problem, we present VidGen-1M, a superior training dataset for text-to-video models. Produced through a coarse-to-fine curation strategy, this dataset guarantees high-quality videos and detailed captions with excellent temporal consistency. When used to train the video generation model, this dataset has led to experimental results that surpass those obtained with other models.
Year Title ArXiv Time Paper Code Project Page
2024 VidGen-1M: A Large-Scale Dataset for Text-to-video Generation 5 Aug 2024 Link Link Link
References
%arxiv papers

@article{tan2024vidgen,
  title={VidGen-1M: A Large-Scale Dataset for Text-to-video Generation},
  author={Tan, Zhiyu and Yang, Xiaomeng and Qin, Luozheng and Li, Hao},
  journal={arXiv preprint arXiv:2408.02629},
  year={2024}
}



Text to Scene

🎉 3D Scene Accepted Papers

| Year | Title | Venue | Paper | Code | Project Page |
|------|-------|-------|-------|------|--------------|
| 2025 | Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model | CVPR 2025 | Link | -- | Link |
Accepted Papers References
%accepted papers

@article{zhang2025scenesplatter,
        title   = {Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model},
        author  = {Zhang, Shengjun and Li, Jinzhao and Fei, Xin and Liu, Hao and Duan, Yueqi},
        journal = {IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)},
        year    = {2025},
}


💡 3D Scene ArXiv Papers

1. LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation

Yang Zhou, Zongjin He, Qixuan Li, Chao Wang (Shanghai University)

Abstract Recently, the field of text-guided 3D scene generation has garnered significant attention. High-quality generation that aligns with physical realism and high controllability is crucial for practical 3D scene applications. However, existing methods face fundamental limitations: (i) difficulty capturing complex relationships between multiple objects described in the text, (ii) inability to generate physically plausible scene layouts, and (iii) lack of controllability and extensibility in compositional scenes. In this paper, we introduce LayoutDreamer, a framework that leverages 3D Gaussian Splatting (3DGS) to facilitate high-quality, physically consistent compositional scene generation guided by text. Specifically, given a text prompt, we convert it into a directed scene graph and adaptively adjust the density and layout of the initial compositional 3D Gaussians. Subsequently, dynamic camera adjustments are made based on the training focal point to ensure entity-level generation quality. Finally, by extracting directed dependencies from the scene graph, we tailor physical and layout energy to ensure both realism and flexibility. Comprehensive experiments demonstrate that LayoutDreamer outperforms other compositional scene generation quality and semantic alignment methods. Specifically, it achieves state-of-the-art (SOTA) performance in the multiple objects generation metric of T3Bench.

2. Bolt3D: Generating 3D Scenes in Seconds

Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler

(Google Research, University of Oxford, Google DeepMind)

Abstract We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.

3. WORLDMEM: Long-term Consistent World Simulation with Memory

Zeqi Xiao, Yushi Lan, Yifan Zhou, Wenqi Ouyang, Shuai Yang, Yanhong Zeng, Xingang Pan

(Nanyang Technological University, Peking University, Shanghai AI Laboratory)

Abstract World simulation has gained increasing popularity due to its ability to model virtual environments and predict the consequences of actions. However, the limited temporal context window often leads to failures in maintaining long-term consistency, particularly in preserving 3D spatial consistency. In this work, we present WorldMem, a framework that enhances scene generation with a memory bank consisting of memory units that store memory frames and states (e.g., poses and timestamps). By employing a memory attention mechanism that effectively extracts relevant information from these memory frames based on their states, our method is capable of accurately reconstructing previously observed scenes, even under significant viewpoint or temporal gaps. Furthermore, by incorporating timestamps into the states, our framework not only models a static world but also captures its dynamic evolution over time, enabling both perception and interaction within the simulated world. Extensive experiments in both virtual and real scenarios validate the effectiveness of our approach.
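As a rough illustration of the memory-attention idea above (retrieving stored frames whose poses/timestamps are closest to the current state), here is a minimal NumPy sketch; the state encoding and single-head dot-product attention are simplifying assumptions, not WorldMem's actual architecture.

```python
import numpy as np

def memory_attention(query_state, memory_states, memory_feats, temperature=1.0):
    """Attend over a memory bank keyed by stored states (e.g., pose + timestamp).

    query_state:   (d,)   encoding of the current pose/timestamp
    memory_states: (M, d) encodings stored alongside each memory frame
    memory_feats:  (M, f) frame features to be aggregated
    """
    scores = memory_states @ query_state / (np.sqrt(len(query_state)) * temperature)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of memory-frame features conditions the next generation step.
    return weights @ memory_feats

# Toy usage: 8 memory frames with 16-d state keys and 32-d features.
rng = np.random.default_rng(0)
ctx = memory_attention(rng.normal(size=16), rng.normal(size=(8, 16)), rng.normal(size=(8, 32)))
```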

4. HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation

Wenqi Dong, Bangbang Yang, Zesong Yang, Yuan Li, Tao Hu, Hujun Bao, Yuewen Ma, Zhaopeng Cui

(Zhejiang University, ByteDance)

Abstract Scene-level 3D generation represents a critical frontier in multimedia and computer graphics, yet existing approaches either suffer from limited object categories or lack editing flexibility for interactive applications. In this paper, we present HiScene, a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation and delivers high-fidelity scenes with compositional identities and aesthetic scene content. Our key insight is treating scenes as hierarchical "objects" under isometric views, where a room functions as a complex object that can be further decomposed into manipulatable items. This hierarchical approach enables us to generate 3D content that aligns with 2D representations while maintaining compositional structure. To ensure completeness and spatial alignment of each decomposed instance, we develop a video-diffusion-based amodal completion technique that effectively handles occlusions and shadows between objects, and introduce shape prior injection to ensure spatial coherence within the scene. Experimental results demonstrate that our method produces more natural object arrangements and complete object instances suitable for interactive applications, while maintaining physical plausibility and alignment with user inputs.
| Year | Title | ArXiv Time | Paper | Code | Project Page |
|------|-------|------------|-------|------|--------------|
| 2025 | LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation | 4 Feb 2025 | Link | -- | -- |
| 2025 | Bolt3D: Generating 3D Scenes in Seconds | 18 Mar 2025 | Link | -- | Link |
| 2025 | WORLDMEM: Long-term Consistent World Simulation with Memory | 16 Apr 2025 | Link | Link | Link |
| 2025 | HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation | 17 Apr 2025 | Link | -- | Link |
ArXiv Papers References
%arxiv papers

@article{zhou2025layoutdreamer,
  title={LAYOUTDREAMER: Physics-guided Layout for Text-to-3D Compositional Scene Generation},
  author={Zhou, Yang and He, Zongjin and Li, Qixuan and Wang, Chao},
  journal={arXiv preprint arXiv:2502.01949},
  year={2025}
}

@article{szymanowicz2025bolt3d,
title={{Bolt3D: Generating 3D Scenes in Seconds}},
author={Szymanowicz, Stanislaw and Zhang, Jason Y. and Srinivasan, Pratul
     and Gao, Ruiqi and Brussee, Arthur and Holynski, Aleksander and
     Martin-Brualla, Ricardo and Barron, Jonathan T. and Henzler, Philipp},
journal={arXiv:2503.14445},
year={2025}
}

@misc{xiao2025worldmemlongtermconsistentworld,
      title={WORLDMEM: Long-term Consistent World Simulation with Memory}, 
      author={Zeqi Xiao and Yushi Lan and Yifan Zhou and Wenqi Ouyang and Shuai Yang and Yanhong Zeng and Xingang Pan},
      year={2025},
      eprint={2504.12369},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.12369}, 
}

@article{dong2025hiscene,
      title   = {HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation},
      author  = {Dong, Wenqi and Yang, Bangbang and Yang, Zesong and Li, Yuan and Hu, Tao and Bao, Hujun and Ma, Yuewen and Cui, Zhaopeng},
      journal = {arXiv preprint arXiv:2504.13072},
      year    = {2025},
}

Scene Other Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 3D Scene Papers, including 23 accepted papers and 8 arXiv papers.

Awesome Repos
Survey
  • [arXiv 16 Apr 2025] Recent Advance in 3D Object and Scene Generation: A Survey [Paper]
Awesome Repos

Text to Human Motion

🎉 Motion Accepted Papers

| Year | Title | Venue | Paper | Code | Project Page |
|------|-------|-------|-------|------|--------------|
| 2025 | MixerMDM: Learnable Composition of Human Motion Diffusion Models | CVPR 2025 | Link | Link | Link |
| 2025 | Dynamic Motion Blending for Versatile Motion Editing | CVPR 2025 | Link | Link | Link |
Accepted Papers References
%accepted papers

@article{ruiz2025mixermdm,
  title={MixerMDM: Learnable Composition of Human Motion Diffusion Models},
  author={Ruiz-Ponce, Pablo and Barquero, German and Palmero, Cristina and Escalera, Sergio and Garc{\'\i}a-Rodr{\'\i}guez, Jos{\'e}},
  journal={arXiv preprint arXiv:2504.01019},
  year={2025}
}

@article{jiang2025dynamic,
  title={Dynamic Motion Blending for Versatile Motion Editing},
  author={Jiang, Nan and Li, Hongjie and Yuan, Ziye and He, Zimo and Chen, Yixin and Liu, Tengyu and Zhu, Yixin and Huang, Siyuan},
  journal={arXiv preprint arXiv:2503.20724},
  year={2025}
}


💡 Motion ArXiv Papers

1. MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm

Ziyan Guo, Zeyu Hu, Na Zhao, De Wen Soh

(Singapore University of Technology and Design, LightSpeed Studios)

Abstract Human motion generation and editing are key components of computer graphics and vision. However, current approaches in this field tend to offer isolated solutions tailored to specific tasks, which can be inefficient and impractical for real-world applications. While some efforts have aimed to unify motion-related tasks, these methods simply use different modalities as conditions to guide motion generation. Consequently, they lack editing capabilities, fine-grained control, and fail to facilitate knowledge sharing across tasks. To address these limitations and provide a versatile, unified framework capable of handling both human motion generation and editing, we introduce a novel paradigm: Motion-Condition-Motion, which enables the unified formulation of diverse tasks with three concepts: source motion, condition, and target motion. Based on this paradigm, we propose a unified framework, MotionLab, which incorporates rectified flows to learn the mapping from source motion to target motion, guided by the specified conditions. In MotionLab, we introduce the 1) MotionFlow Transformer to enhance conditional generation and editing without task-specific modules; 2) Aligned Rotational Position Encoding} to guarantee the time synchronization between source motion and target motion; 3) Task Specified Instruction Modulation; and 4) Motion Curriculum Learning for effective multi-task learning and knowledge sharing across tasks. Notably, our MotionLab demonstrates promising generalization capabilities and inference efficiency across multiple benchmarks for human motion.
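Since MotionLab builds on rectified flows, a generic sketch of rectified-flow sampling (Euler integration of a learned velocity field from source motion toward target motion) may help make the paradigm concrete; the `velocity_net` interface, step count, and toy data below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sample_rectified_flow(velocity_net, x_source, condition, num_steps=50):
    """Integrate dx/dt = v_theta(x, t, condition) from t=0 (source) to t=1 (target)."""
    x = x_source.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_net(x, t, condition)  # simple Euler step
    return x

# Toy velocity field pulling the motion toward a fixed target (stand-in for the learned net).
target = np.ones((60, 22, 3))  # 60 frames, 22 joints, xyz
v = lambda x, t, c: target - x
edited = sample_rectified_flow(v, np.zeros_like(target), condition=None)
```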

2. Motion Anything: Any to Motion Generation

Zeyu Zhang, Yiran Wang, Wei Mao, Danning Li, Rui Zhao, Biao Wu, Zirui Song, Bohan Zhuang, Ian Reid, Richard Hartley

(The Australian National University, The University of Sydney, Tencent Canberra XR Vision Labs, McGill University, JD.com, University of Technology Sydney, Mohamed bin Zayed University of Artificial Intelligence, Zhejiang University, Google Research)

Abstract Conditional motion generation has been extensively studied in computer vision, yet two critical challenges remain. First, while masked autoregressive methods have recently outperformed diffusion-based approaches, existing masking models lack a mechanism to prioritize dynamic frames and body parts based on given conditions. Second, existing methods for different conditioning modalities often fail to integrate multiple modalities effectively, limiting control and coherence in generated motion. To address these challenges, we propose Motion Anything, a multimodal motion generation framework that introduces an Attention-based Mask Modeling approach, enabling fine-grained spatial and temporal control over key frames and actions. Our model adaptively encodes multimodal conditions, including text and music, improving controllability. Additionally, we introduce Text-Music-Dance (TMD), a new motion dataset consisting of 2,153 pairs of text, music, and dance, making it twice the size of AIST++, thereby filling a critical gap in the community. Extensive experiments demonstrate that Motion Anything surpasses state-of-the-art methods across multiple benchmarks, achieving a 15% improvement in FID on HumanML3D and showing consistent performance gains on AIST++ and TMD.
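As a loose illustration of attention-based mask modeling in general (masking the frames that attend most strongly to the condition so the model must reconstruct them), here is a minimal sketch; the scoring rule and mask ratio are illustrative assumptions, not Motion Anything's actual mechanism.

```python
import numpy as np

def condition_guided_mask(frame_tokens, cond_embedding, mask_ratio=0.4):
    """Pick the frames most relevant to the condition and mark them for masking.

    frame_tokens:   (T, d) per-frame motion tokens
    cond_embedding: (d,)   pooled text/music condition embedding
    """
    scores = frame_tokens @ cond_embedding          # relevance of each frame to the condition
    k = max(1, int(mask_ratio * len(frame_tokens)))
    masked_idx = np.argsort(scores)[-k:]            # most condition-relevant frames
    mask = np.zeros(len(frame_tokens), dtype=bool)
    mask[masked_idx] = True
    return mask  # training reconstructs the tokens where mask is True

mask = condition_guided_mask(np.random.randn(196, 256), np.random.randn(256))
```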

3. MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space

Lixing Xiao, Shunlin Lu, Huaijin Pi, Ke Fan, Liang Pan, Yueer Zhou, Ziyong Feng, Xiaowei Zhou, Sida Peng, Jingbo Wang

(Zhejiang University, The Chinese University of Hong Kong Shenzhen, The University of Hong Kong, Shanghai Jiao Tong University, DeepGlint, Shanghai AI Laboratory)

Abstract This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed response and error accumulation problem due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition.
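A bare-bones sketch of text-conditioned streaming decoding in a continuous causal latent space (each step predicts the next latent from all previous latents plus the text, and decodes it immediately) is shown below; the `ar_model` and `decoder` interfaces and the toy stand-ins are assumptions for illustration only.

```python
import numpy as np

def stream_motion(ar_model, decoder, text_embedding, num_steps, latent_dim=64):
    """Autoregressively emit one continuous motion latent per step and decode it online."""
    history = np.zeros((0, latent_dim))
    poses = []
    for _ in range(num_steps):
        # Predict the next continuous latent from the causal history and the text condition.
        next_latent = ar_model(history, text_embedding)        # shape (latent_dim,)
        history = np.vstack([history, next_latent])
        poses.append(decoder(next_latent))                     # decode immediately (online)
    return poses

# Toy stand-ins so the loop runs end to end.
fake_ar = lambda hist, txt: np.tanh(txt + (hist[-1] if len(hist) else 0.0))
fake_dec = lambda z: z.reshape(8, 8)
frames = stream_motion(fake_ar, fake_dec, np.random.randn(64), num_steps=30)
```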

4. Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models

Marc Benedí San Millán, Angela Dai, Matthias Nießner

(Technical University of Munich)

Abstract Animation of humanoid characters is essential in various graphics applications, but requires significant time and cost to create realistic animations. We propose an approach to synthesize 4D animated sequences of input static 3D humanoid meshes, leveraging strong generalized motion priors from generative video models -- as such video models contain powerful motion information covering a wide variety of human motions. From an input static 3D humanoid mesh and a text prompt describing the desired animation, we synthesize a corresponding video conditioned on a rendered image of the 3D mesh. We then employ an underlying SMPL representation to animate the corresponding 3D mesh according to the video-generated motion, based on our motion optimization. This enables a cost-effective and accessible solution to enable the synthesis of diverse and realistic 4D animations.

5. FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation

Manolo Canales Cuba, Vinícius do Carmo Melício, João Paulo Gois

(Universidade Federal do ABC, Santo André, Brazil)

Abstract Achieving high-fidelity and temporally smooth 3D human motion generation remains a challenge, particularly within resource-constrained environments. We introduce FlowMotion, a novel method leveraging Conditional Flow Matching (CFM). FlowMotion incorporates a training objective within CFM that focuses on more accurately predicting target motion in 3D human motion generation, resulting in enhanced generation fidelity and temporal smoothness while maintaining the fast synthesis times characteristic of flow-matching-based methods. FlowMotion achieves state-of-the-art jitter performance, achieving the best jitter in the KIT dataset and the second-best jitter in the HumanML3D dataset, and a competitive FID value in both datasets. This combination provides robust and natural motion sequences, offering a promising equilibrium between generation quality and temporal naturalness.
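One way to read "target-predictive" conditional flow matching is that the network regresses the clean target motion x1 (rather than the velocity) along the straight interpolation path. The sketch below shows a training step under that assumption, with a NumPy stand-in for the model; it is not taken from the FlowMotion codebase.

```python
import numpy as np

def target_predictive_cfm_loss(model, x1, cond, rng):
    """One training step of a target-predictive CFM objective (illustrative formulation).

    Standard CFM regresses the velocity u = x1 - x0 at x_t = (1 - t) * x0 + t * x1.
    Here the network instead predicts x1 directly; a velocity can be recovered as
    (x1_hat - x_t) / (1 - t) at sampling time.
    """
    x0 = rng.normal(size=x1.shape)          # noise sample
    t = rng.uniform(0.0, 1.0)
    x_t = (1.0 - t) * x0 + t * x1           # point on the straight path
    x1_hat = model(x_t, t, cond)
    return np.mean((x1_hat - x1) ** 2)

rng = np.random.default_rng(0)
motion = rng.normal(size=(60, 66))          # 60 frames of pose parameters (illustrative)
loss = target_predictive_cfm_loss(lambda x, t, c: x, motion, cond=None, rng=rng)
```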
| Year | Title | ArXiv Time | Paper | Code | Project Page |
|------|-------|------------|-------|------|--------------|
| 2025 | MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm | 6 Feb 2025 | Link | Link | Link |
| 2025 | Motion Anything: Any to Motion Generation | 12 Mar 2025 | Link | Link | Link |
| 2025 | MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space | 19 Mar 2025 | Link | Link | Link |
| 2025 | Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models | 20 Mar 2025 | Link | -- | Link |
| 2025 | FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation | 20 Apr 2025 | Link | -- | -- |
ArXiv Papers References
%arxiv papers

@article{guo2025motionlab,
  title={MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm},
  author={Guo, Ziyan and Hu, Zeyu and Zhao, Na and Soh, De Wen},
  journal={arXiv preprint arXiv:2502.02358},
  year={2025}
}

@article{zhang2025motion,
  title={Motion Anything: Any to Motion Generation},
  author={Zhang, Zeyu and Wang, Yiran and Mao, Wei and Li, Danning and Zhao, Rui and Wu, Biao and Song, Zirui and Zhuang, Bohan and Reid, Ian and Hartley, Richard},
  journal={arXiv preprint arXiv:2503.06955},
  year={2025}
}

@article{xiao2025motionstreamer,
      title={MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space},
      author={Xiao, Lixing and Lu, Shunlin and Pi, Huaijin and Fan, Ke and Pan, Liang and Zhou, Yueer and Feng, Ziyong and Zhou, Xiaowei and Peng, Sida and Wang, Jingbo},
      journal={arXiv preprint arXiv:2503.15451},
      year={2025}
}

@misc{millán2025animatinguncapturedhumanoidmesh,
        title={Animating the Uncaptured: Humanoid Mesh Animation with Video Diffusion Models}, 
        author={Marc Benedí San Millán and Angela Dai and Matthias Nießner},
        year={2025},
        eprint={2503.15996},
        archivePrefix={arXiv},
        primaryClass={cs.GR},
        url={https://arxiv.org/abs/2503.15996}, 
}

@article{cuba2025flowmotion,
  title={FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-Driven Human Motion Generation},
  author={Cuba, Manolo Canales and Gois, Jo{\~a}o Paulo},
  journal={arXiv preprint arXiv:2504.01338},
  year={2025}
}


Motion Other Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 Text to Human Motion Papers, including 36 accepted papers and 8 arXiv papers.

📚 Dataset Works

Datasets

| Motion | Info | URL | Others |
|--------|------|-----|--------|
| AIST | AIST Dance Motion Dataset | Link | -- |
| AIST++ | AIST++ Dance Motion Dataset | Link | dance video database with SMPL annotations |
| AMASS | optical marker-based motion capture datasets | Link | -- |

Additional Info

AMASS

AMASS is a large database of human motion unifying different optical marker-based motion capture datasets by representing them within a common framework and parameterization. AMASS is readily useful for animation, visualization, and generating training data for deep learning.

Awesome Repos
Survey
  • [TPAMI 2023] Human Motion Generation: A Survey [Paper]
  • [arXiv 7 Apr 2025] A Survey on Human Interaction Motion Generation [Paper] [GitHub]

Text to 3D Human

🎉 Human Accepted Papers

| Year | Title | Venue | Paper | Code | Project Page |
|------|-------|-------|-------|------|--------------|
| 2025 | Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion | CVPR 2025 | Link | Link | Link |
| 2025 | GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior | CVPR 2025 | Link | Link | Link |
| 2025 | Text-based Animatable 3D Avatars with Morphable Model Alignment | SIGGRAPH 2025 | Link | Link | Link |
Accepted Papers References
%accepted papers

@inproceedings{zhou2025zero1toa,
  title = {Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion},
  author = {Zhenglin Zhou and Fan Ma and Hehe Fan and Tat-Seng Chua},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025},
}

@article{tang2025gaussianip,
  title={GaussianIP: Identity-Preserving Realistic 3D Human Generation via Human-Centric Diffusion Prior},
  author={Tang, Zichen and Yao, Yuan and Cui, Miaomiao and Bo, Liefeng and Yang, Hongyu},
  journal={arXiv preprint arXiv:2503.11143},
  year={2025}
}

@article{AnimPortrait3D_sig25,
      author = {Wu, Yiqian and Prinzler, Malte and Jin, Xiaogang and Tang, Siyu},
      title = {Text-based Animatable 3D Avatars with Morphable Model Alignment},
      year = {2025}, 
      isbn = {9798400715402}, 
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA},
      url = {https://doi.org/10.1145/3721238.3730680},
      doi = {10.1145/3721238.3730680},
      numpages = {11},
      location = {Vancouver, BC, Canada},
      series = {SIGGRAPH '25}
}


💡 Human ArXiv Papers

1. Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, Shunsuke Saito

(Technical University of Munich, Meta Reality Labs)

Abstract Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts.

2. LAM: Large Avatar Model for One-shot Animatable Gaussian Head

Yisheng He, Xiaodong Gu, Xiaodan Ye, Chao Xu, Zhengyi Zhao, Yuan Dong, Weihao Yuan, Zilong Dong, Liefeng Bo

(Tongyi Lab, Alibaba Group)

Abstract We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated utilizing standard linear blend skinning (LBS) with corrective blendshapes as the FLAME model did and rendered in real-time on various platforms. Our experimental results demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks.
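The core of LAM, as described above, is cross-attention from FLAME canonical points (queries) to multi-scale image features to predict per-point Gaussian attributes. The minimal single-head NumPy sketch below illustrates that pattern; the shapes, projections, and attribute dimension are chosen for illustration rather than taken from the paper.

```python
import numpy as np

def predict_gaussian_attributes(canonical_points, image_feats, w_q, w_k, w_v, w_out):
    """Cross-attention: canonical FLAME points query image features to predict
    per-point Gaussian attributes (e.g., offset, scale, rotation, color, opacity)."""
    q = canonical_points @ w_q                    # (P, d)
    k = image_feats @ w_k                         # (N, d)
    v = image_feats @ w_v                         # (N, d)
    attn = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ w_out                     # (P, attribute_dim)

rng = np.random.default_rng(0)
P, N, d, out = 5023, 1024, 64, 14                 # point/feature counts are illustrative
attrs = predict_gaussian_attributes(
    rng.normal(size=(P, 3)), rng.normal(size=(N, 32)),
    rng.normal(size=(3, d)), rng.normal(size=(32, d)),
    rng.normal(size=(32, d)), rng.normal(size=(d, out)))
```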

3. HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration

Boyuan Wang, Runqi Ouyang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Guan Huang, Lihong Liu, Xingang Wang

(GigaAI, Institute of Automation Chinese Academy of Sciences, Peking University)

Abstract Single-image human reconstruction is vital for digital human modeling applications but remains an extremely challenging task. Current approaches rely on generative models to synthesize multi-view images for subsequent 3D reconstruction and animation. However, directly generating multiple views from a single human image suffers from geometric inconsistencies, resulting in issues like fragmented or blurred limbs in the reconstructed models. To tackle these limitations, we introduce HumanDreamer-X, a novel framework that integrates multi-view human generation and reconstruction into a unified pipeline, which significantly enhances the geometric consistency and visual fidelity of the reconstructed 3D models. In this framework, 3D Gaussian Splatting serves as an explicit 3D representation to provide initial geometry and appearance priority. Building upon this foundation, HumanFixer is trained to restore 3DGS renderings, which guarantee photorealistic results. Furthermore, we delve into the inherent challenges associated with attention mechanisms in multi-view human generation, and propose an attention modulation strategy that effectively enhances geometric details identity consistency across multi-view. Experimental results demonstrate that our approach markedly improves generation and reconstruction PSNR quality metrics by 16.45% and 12.65%, respectively, achieving a PSNR of up to 25.62 dB, while also showing generalization capabilities on in-the-wild data and applicability to various human reconstruction backbone models.

4. SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong

(USTC, Shanghai AI Lab, SJTU, CMU)

Abstract 3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which involves compressing multi-view images into Gaussians via a UV-structured VAE, along with DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support the large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.
| Year | Title | ArXiv Time | Paper | Code | Project Page |
|------|-------|------------|-------|------|--------------|
| 2025 | Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars | 27 Feb 2025 | Link | -- | Link |
| 2025 | LAM: Large Avatar Model for One-shot Animatable Gaussian Head | 4 Apr 2025 | Link | Link | Link |
| 2025 | HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration | 4 Apr 2025 | Link | Link | Link |
| 2025 | SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets | 9 Apr 2025 | Link | Link | Link |
ArXiv Papers References
%arxiv papers

@misc{kirschstein2025avat3r,
      title={Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars},
      author={Tobias Kirschstein and Javier Romero and Artem Sevastopolsky and Matthias Nie\ss{}ner and Shunsuke Saito},
      year={2025},
      eprint={2502.20220},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20220},
}

@article{he2025lam,
  title={LAM: Large Avatar Model for One-shot Animatable Gaussian Head},
  author={He, Yisheng and Gu, Xiaodong and Ye, Xiaodan and Xu, Chao and Zhao, Zhengyi and Dong, Yuan and Yuan, Weihao and Dong, Zilong and Bo, Liefeng},
  journal={arXiv preprint arXiv:2502.17796},
  year={2025}
}

@article{wang2025humandreamerx,
  title={HumanDreamer-X: Photorealistic Single-image Human Avatars Reconstruction via Gaussian Restoration}, 
  author={Boyuan Wang and Runqi Ouyang and Xiaofeng Wang and Zheng Zhu and Guosheng Zhao and Chaojun Ni and Guan Huang and Lihong Liu and Xingang Wang},
  journal={arXiv preprint arXiv:2504.03536},
  year={2025}
}

@article{yang2025sigman,
  title={SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets},
  author={Yang, Yuhang and Liu, Fengqi and Lu, Yixing and Zhao, Qin and Wu, Pingyu and Zhai, Wei and Yi, Ran and Cao, Yang and Ma, Lizhuang and Zha, Zheng-Jun and others},
  journal={arXiv preprint arXiv:2504.06982},
  year={2025}
}

Additional Info

Previous Papers

Year 2023-2024

For more details, please check the 2023-2024 3D Human Papers, including 18 accepted papers and 5 arXiv papers.

Survey and Awesome Repos

Survey

Awesome Repos

Pretrained Models
| Pretrained Models (human body) | Info | URL |
|--------------------------------|------|-----|
| SMPL | SMPL model (SMPL weights) | Link |
| SMPL-X | SMPL-X model (SMPL-X weights) | Link |
| human_body_prior | VPoser model (VPoser weights) | Link |
SMPL

SMPL is an easy-to-use, realistic model of the human body that is useful for animation and computer vision.

  • version 1.0.0 for Python 2.7 (female/male, 10 shape PCs)
  • version 1.1.0 for Python 2.7 (female/male/neutral, 300 shape PCs)
  • UV map in OBJ format
SMPL-X

SMPL-X extends SMPL with fully articulated hands and facial expressions (55 joints, 10,475 vertices).
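Both SMPL and SMPL-X pose their template mesh with linear blend skinning (LBS), plus shape and pose corrective blendshapes. The sketch below shows only the plain LBS step, v' = Σ_j w_j T_j v, with toy data; it omits the blendshapes and is not the official SMPL code.

```python
import numpy as np

def linear_blend_skinning(vertices, skin_weights, joint_transforms):
    """Plain LBS: each vertex is a weighted blend of its bones' rigid transforms.

    vertices:         (V, 3) rest-pose vertices
    skin_weights:     (V, J) per-vertex blend weights (rows sum to 1)
    joint_transforms: (J, 4, 4) world transforms of the posed joints
    """
    v_h = np.concatenate([vertices, np.ones((len(vertices), 1))], axis=1)   # homogeneous coords
    blended = np.einsum('vj,jab->vab', skin_weights, joint_transforms)      # (V, 4, 4) per-vertex transform
    posed = np.einsum('vab,vb->va', blended, v_h)
    return posed[:, :3]

# Toy example: 4 vertices, 2 joints; identity transforms leave the mesh unchanged.
verts = np.random.randn(4, 3)
weights = np.full((4, 2), 0.5)
posed = linear_blend_skinning(verts, weights, np.stack([np.eye(4)] * 2))
```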


🎯Back to Top - Text2X Resources

Related Resources

Text to 'other tasks'

Here, other tasks refer to CAD, 3D modeling, music generation, and so on.

Text to CAD
  • [arXiv 7 Nov 2024] CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM [Paper] [GitHub] [Project Page]
  • [NeurIPS 2024 Spotlight] Text2CAD: Generating Sequential CAD Designs from Beginner-to-Expert Level Text Prompts [Paper] [GitHub] [Project Page] [Dataset]
Text to Music
  • [arXiv 1 Sep 2024] FLUX that Plays Music [Paper] [GitHub]
Text to Model
  • [ICLR Workshop on Neural Network Weights as a New Data Modality 2025] Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization [Paper]

Survey and Awesome Repos

🔥 Topic 1: 3D Gaussian Splatting
Survey
  • [arXiv 6 May 2024] Gaussian Splatting: 3D Reconstruction and Novel View Synthesis, a Review [Paper]
  • [arXiv 17 Mar 2024] Recent Advances in 3D Gaussian Splatting [Paper]
  • [IEEE TVCG 2024] 3D Gaussian as a New Vision Era: A Survey [Paper]
  • [arXiv 8 Jan 2024] A Survey on 3D Gaussian Splatting [Paper] [GitHub] [Benchmark]
Awesome Repos
🔥 Topic 2: AIGC 3D
Survey
  • [arXiv 15 May 2024] A Survey On Text-to-3D Contents Generation In The Wild [Paper]
  • [arXiv 2 Feb 2024] A Comprehensive Survey on 3D Content Generation [Paper] [GitHub]
  • [arXiv 31 Jan 2024] Advances in 3D Generation: A Survey [Paper]
Awesome Repos
Benchmark
Foundation Model
  • [arXiv 19 Mar 2025] Cube: A Roblox View of 3D Intelligence [Paper] [GitHub]
🔥 Topic 3: 3D Human & LLM 3D
Survey
  • [arXiv 6 June 2024] A Survey on 3D Human Avatar Modeling -- From Reconstruction to Generation [Paper]
  • [arXiv 5 Jan 2024] Progress and Prospects in 3D Generative AI: A Technical Overview including 3D human [Paper]
Awesome Repos
🔥 Topic 4: AIGC 4D
Survey
  • [arXiv 18 Mar 2025] Advances in 4D Generation: A Survey [Paper] [GitHub]
Awesome Repos
🔥 Topic 5: Physics-based AIGC
Survey
  • [arXiv 27 Mar 2025] Exploring the Evolution of Physics Cognition in Video Generation: A Survey [Paper] [GitHub]
  • [arXiv 19 Jan 2025] Generative Physical AI in Vision: A Survey [Paper] [GitHub]
Dynamic Gaussian Splatting
Neural Deformable 3D Gaussians
  • [CVPR 2024] Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction [Paper] [GitHub] [Project Page]
  • [CVPR 2024] 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering [Paper] [GitHub] [Project Page]
  • [CVPR 2024] SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes [Paper] [GitHub] [Project Page]
  • [CVPR 2024 Highlight] 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos [Paper] [GitHub] [Project Page]
4D Gaussians
  • [SIGGRAPH 2024] 4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes [Paper]
  • [ICLR 2024] Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting [Paper] [GitHub] [Project Page]
Dynamic 3D Gaussians
  • [CVPR 2024 Highlight] Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle [Paper] [GitHub] [Project Page]
  • [3DV 2024] Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis [Paper] [GitHub] [Project Page]

🎯Back to Top - Table of Contents

License

This repo is released under the MIT license.

✉️ Any additions or suggestions, feel free to contact us.
