Computer Science > Computer Vision and Pattern Recognition

arXiv:2311.18445 (cs)

[Submitted on 30 Nov 2023]

Title: VTimeLLM: Empower LLM to Grasp Video Moments

Authors: Bin Huang, Xin Wang, Hong Chen, Zihan Song, Wenwu Zhu

Abstract: Large language models (LLMs) have shown remarkable text understanding capabilities, which have been extended as Video LLMs to handle video data and comprehend visual details. However, existing Video LLMs can only provide a coarse description of the entire video, failing to capture the precise start and end time boundaries of specific events. In this paper, we address this issue by proposing VTimeLLM, a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundaries. Specifically, VTimeLLM adopts a boundary-aware three-stage training strategy, which uses, respectively, image-text pairs for feature alignment, multiple-event videos to increase temporal-boundary awareness, and high-quality video-instruction tuning to further improve temporal understanding and align with human intent. Extensive experiments demonstrate that on fine-grained time-related video comprehension tasks such as Temporal Video Grounding and Dense Video Captioning, VTimeLLM significantly outperforms existing Video LLMs. Moreover, this fine-grained temporal understanding enables VTimeLLM to outperform existing Video LLMs on a video dialogue benchmark, showing its superior cross-modal understanding and reasoning abilities.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.18445 [cs.CV]
	(or arXiv:2311.18445v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.18445

Submission history

From: Bin Huang
[v1] Thu, 30 Nov 2023 10:49:56 UTC (2,652 KB)
