Shengqiong Wu1, Hao Fei1*, Jingkang Yang2, Xiangtai Li2, Juncheng Li3, Hanwang Zhang2, and Tat-Seng Chua1
The recently emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; moreover, the pipeline nature of the benchmark generation method leads to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end generation of 4D-PSG. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG.
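To give a feel for the chained SG inference idea described above, below is a minimal, hypothetical sketch (not part of the released code) of an iterative loop that repeatedly queries an LLM for scene-graph triplets, feeding previously discovered triplets back into the prompt so later rounds can refine labels and add missed relations. The names `chained_sg_inference`, `query_llm`, and the triplet format are assumptions for illustration only; the actual 4D-LLM interface may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical format: a scene-graph triplet is (subject, relation, object).
Triplet = Tuple[str, str, str]

def chained_sg_inference(
    query_llm: Callable[[str], List[Triplet]],
    scene_description: str,
    max_rounds: int = 3,
) -> List[Triplet]:
    """Iteratively ask the LLM for triplets, conditioning each round on the
    triplets found so far (open-vocabulary, chained inference)."""
    triplets: List[Triplet] = []
    for _ in range(max_rounds):
        prompt = (
            f"Scene description: {scene_description}\n"
            f"Known triplets so far: {triplets}\n"
            "List additional or corrected (subject, relation, object) triplets."
        )
        new_triplets = query_llm(prompt)
        # Stop once the model produces nothing new.
        added = [t for t in new_triplets if t not in triplets]
        if not added:
            break
        triplets.extend(added)
    return triplets


if __name__ == "__main__":
    # Toy stand-in for the LLM, returning a fixed set of triplets.
    def fake_llm(prompt: str) -> List[Triplet]:
        return [("person", "holding", "cup"), ("cup", "on", "table")]

    print(chained_sg_inference(fake_llm, "A person drinks coffee at a table."))
```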
- The main task dataset is PSG4D; please refer to the instructions for preparation.
- For the 2D-to-4D visual scene transfer learning, the datasets we leverage are:
Please follow the instructions to prepare the datasets.
Coming soon.
If you use PSG-4D-LLM in your project, please kindly cite:
@inproceedings{wu2025psg4dllm,
title={Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene},
author={Shengqiong Wu and Hao Fei and Jingkang Yang and Xiangtai Li and Juncheng Li and Hanwang Zhang and Tat-Seng Chua},
booktitle={CVPR},
year={2025}
}
Our 4D-LLM is developed based on the codebases of NExT-Chat, Chat-UniVi, SA-Gate, and SAM 2, and we would like to thank all of their developers.