SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

This repository will host the code, model, and datasets for SpaceVLLM.

Updates

2025-03-18: 📄 Our paper is now available on arXiv.

Overview

Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. First, it is difficult to extract accurate spatio-temporal information for each frame in the video. Second, the large number of visual tokens makes it hard to map the visual tokens of each frame precisely to their corresponding spatial coordinates. To address these issues, we introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability. Specifically, we adopt a set of interleaved Spatio-Temporal Aware Queries to capture temporal perception and dynamic spatial information. Moreover, we propose a Query-Guided Space Decoder that links the queries to spatial coordinates. Additionally, because spatio-temporal datasets are scarce, we construct the Unified Spatio-Temporal Grounding (Uni-STG) dataset, comprising 480K instances across three tasks. This dataset fully exploits the potential of MLLMs to facilitate localization in the temporal and spatial dimensions simultaneously. Extensive experiments demonstrate that SpaceVLLM achieves state-of-the-art performance across 11 benchmarks covering temporal, spatial, spatio-temporal, and video understanding tasks, highlighting the effectiveness of our approach. Our code, datasets, and model will be released in this repository.
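Until the official code is released, the idea of interleaving queries with frame tokens can be illustrated with a toy sketch. It is not the SpaceVLLM implementation: it assumes exactly one query placeholder is appended after each frame's visual tokens, and the names `interleave_queries`, `query_positions`, and the `<ST_QUERY>` token are hypothetical. The point it shows is that, under this assumption, each query position corresponds to exactly one frame, which is what would let a decoder map a query to that frame's coordinates.

```python
from typing import List

# Hypothetical placeholder token; the real model's query representation may differ.
ST_QUERY = "<ST_QUERY>"


def interleave_queries(frame_tokens: List[List[str]],
                       query_token: str = ST_QUERY) -> List[str]:
    """Build one flat sequence: each frame's visual tokens, then one query.

    Assumption (not confirmed by the source): one query per frame.
    """
    sequence: List[str] = []
    for tokens in frame_tokens:
        sequence.extend(tokens)       # visual tokens of this frame
        sequence.append(query_token)  # query that summarizes this frame
    return sequence


def query_positions(sequence: List[str],
                    query_token: str = ST_QUERY) -> List[int]:
    """Return the indices of the query tokens, in frame order."""
    return [i for i, tok in enumerate(sequence) if tok == query_token]


if __name__ == "__main__":
    # Two frames with two stand-in visual tokens each.
    frames = [["v0_0", "v0_1"], ["v1_0", "v1_1"]]
    seq = interleave_queries(frames)
    print(seq)
    print(query_positions(seq))
```

In this sketch the k-th entry of `query_positions` indexes the query for frame k, so hidden states gathered at those positions could be passed, frame by frame, to a coordinate decoder.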

Citation

@article{wang2025spacevllm,
  title={SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability},
  author={Wang, Jiankang and Zhang, Zhihan and Liu, Zhihang and Li, Yang and Ge, Jiannan and Xie, Hongtao and Zhang, Yongdong},
  journal={arXiv preprint arXiv:2503.13983},
  year={2025}
}

Repository: Jayce1kk/SpaceVLLM