MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction
Yingshuang Zou, Yikang Ding, Chuanrui Zhang, Jiazhe Guo, Bohan Li,
Xiaoyang Lyu, Feiyang Tan, Xiaojuan Qi, Haoqian Wang
🔥🔥 (2025.03) Check out our other latest works on generative world models: UniScene, DiST-4D, HERMES.
🔥🔥 (2025.03) The data processing code is released!
🔥🔥 (2025.03) The training and inference code of Multi-modal Diffusion is available NOW!!!
🔥🔥 (2025.03) Paper is on arXiv: MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction
- Release data processing code.
- Release the pretrained model.
- Release training / inference code.
Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods exhibit substantial performance deterioration under large viewpoint deviations from training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG leverages aggregated LiDAR point clouds with RGB and geometric priors to condition a multi-modal video diffusion model, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and provides comprehensive supervision signals to refine 3DGS representations, enhancing rendering robustness under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
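For orientation, the skeleton below spells out the data flow described above: fused LiDAR is projected into sparse priors, the multi-modal diffusion model densifies them into RGB, depth, and semantics, and those outputs supervise a 3DGS model. Every function is an illustrative stub, not part of this repository's API.

```python
# Illustrative skeleton of the MuDG data flow; every function here is a stub
# standing in for a pipeline stage, not the repository's actual API.
def aggregate_lidar(frames):        # fuse per-frame LiDAR sweeps into one point cloud
    return frames

def project_priors(cloud, camera):  # render sparse RGB/depth priors for a novel view
    return {"sparse_rgb": None, "sparse_depth": None}

def multimodal_diffusion(priors):   # synthesize dense RGB, depth, and semantics
    return {"rgb": None, "depth": None, "semantics": None}

def refine_3dgs(gaussians, maps):   # use the synthesized maps as extra supervision
    return gaussians

cloud = aggregate_lidar(frames=[])
priors = project_priors(cloud, camera=None)
maps = multimodal_diffusion(priors)
gaussians = refine_3dgs(gaussians=None, maps=maps)
```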
| Model | Resolution | Checkpoint |
|---|---|---|
| MDM1024 | 576x1024 | Hugging Face |
| MDM512 | 320x512 | Hugging Face |
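If you prefer to script the download, the snippet below is a minimal sketch using `huggingface_hub`; the `repo_id` and `filename` are placeholders, so substitute the actual values from the Hugging Face links in the table above.

```python
# Minimal download sketch with huggingface_hub; repo_id and filename are
# placeholders -- take the real values from the Hugging Face links above.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<org>/<mudg-checkpoints>",   # placeholder repo id
    filename="512-mdm-checkpoint.ckpt",   # placeholder filename
    local_dir="checkpoints/512_mdm",
)
print("Saved to:", ckpt_path)
```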
conda create -n mudg python=3.8.5
conda activate mudg
pip install -r requirements.txt
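Assuming PyTorch is among the packages installed from `requirements.txt`, a quick sanity check like this confirms the environment sees your GPU:

```python
# Environment sanity check (assumes PyTorch is installed via requirements.txt).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```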
We project the fused point clouds onto novel viewpoints to generate sparse color and depth maps.
Note: The detailed data processing steps can be found in the Data Processing section.
For your convenience, we have also provided pre-processed data. You can access it via this link.
python virtual_render/generate_virtual_item.py
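For intuition, here is a minimal sketch of the kind of projection this step performs: colored LiDAR points are transformed into a novel camera frame, projected with a pinhole model, and z-buffered into sparse color and depth maps. Variable names and conventions are illustrative and may differ from `generate_virtual_item.py`.

```python
# Minimal sketch of projecting a colored point cloud into a novel camera view
# to obtain sparse color and depth maps. The pinhole model and conventions
# are illustrative; the repository's generate_virtual_item.py may differ.
import numpy as np

def project_points(points_xyz, colors, K, w2c, height, width):
    """points_xyz: (N, 3) world coords; colors: (N, 3); K: (3, 3) intrinsics;
    w2c: (4, 4) world-to-camera extrinsics."""
    # Transform points into the camera frame.
    pts_h = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    # Keep points in front of the camera.
    valid = cam[:, 2] > 1e-3
    cam, colors = cam[valid], colors[valid]
    # Pinhole projection to pixel coordinates.
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, cam, colors = u[inside], v[inside], cam[inside], colors[inside]
    # Z-buffer: write far-to-near so the nearest point per pixel wins.
    order = np.argsort(-cam[:, 2])
    depth = np.zeros((height, width), dtype=np.float32)
    color = np.zeros((height, width, 3), dtype=np.float32)
    depth[v[order], u[order]] = cam[order, 2]
    color[v[order], u[order]] = colors[order]
    return color, depth  # sparse maps; unfilled pixels stay zero
```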
- Download the pretrained models and place `model.ckpt` with the required resolution at `checkpoints/[1024|512]_mdm/[1024|512]-mdm-checkpoint.ckpt`.
- Run the commands below in your terminal, depending on your device and needs.
sh virtual_render/scripts/render.sh 15365
`15365` is the item ID, and you can change it to any item ID from the item list.
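To render several items in one go, a small driver like the one below works; the item IDs are placeholders, so take them from your generated item list.

```python
# Hypothetical batch driver: render several items in sequence.
# The item IDs below are placeholders -- use IDs from your item list.
import subprocess

for item_id in ["15365", "15366"]:
    subprocess.run(["sh", "virtual_render/scripts/render.sh", item_id], check=True)
```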
- Process the data and generate the item list.
- Generate the training data list:
python data/create_data_infos.py
- Download the pretrained model DynamiCrafter512 and place `model.ckpt` at `checkpoints/512_mdm/512-mdm-checkpoint.ckpt`.
- Train the 320x512 model with the following command:
sh configs/stage1-512_mdm_waymo/run-512.sh
- Then train the 576x1024 model with the following command:
sh configs/stage2-1024_mdm_waymo/run-1024.sh
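Before kicking off either stage, an optional check like the following (path copied from the download step above) catches a misplaced checkpoint early:

```python
# Optional pre-flight check: confirm the DynamiCrafter512 weights are where
# the step above placed them (path copied from this section).
from pathlib import Path

ckpt = Path("checkpoints/512_mdm/512-mdm-checkpoint.ckpt")
if ckpt.is_file():
    print(f"Found {ckpt} ({ckpt.stat().st_size / 1e9:.2f} GB)")
else:
    raise FileNotFoundError(f"Missing checkpoint: {ckpt}")
```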
This repository is released under the Apache 2.0 license.
Please consider citing our paper if you find our code useful:
@article{zou2025mudg,
title={MuDG: Taming Multi-modal Diffusion with Gaussian Splatting for Urban Scene Reconstruction},
author={Zou, Yingshuang and Ding, Yikang and Zhang, Chuanrui and Guo, Jiazhe and Li, Bohan and Lyu, Xiaoyang and Tan, Feiyang and Qi, Xiaojuan and Wang, Haoqian},
journal={arXiv preprint arXiv:2503.10604},
year={2025}
}
We would like to thank the authors of the following repositories for their valuable contributions to the community: