
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Updates

  • 2025-05-27: Expanded Time-R1 ArXiv paper released! Read on ArXiv.
  • 2025-03-17: TimeZero initial release! Code and evaluation scripts are now available.
  • 2025-03-17: TimeZero achieves SOTA performance on Charades-STA!

Overview

TimeZero is a reasoning-guided Large Vision-Language Model (LVLM) for Temporal Video Grounding (TVG): given a natural language query, it localizes the temporal segment of a video that the query describes. TimeZero is trained entirely with reinforcement learning, which lets the model reason explicitly about video-language relationships at inference time.

Key Features:

  • Reinforcement Learning Training: TimeZero is trained entirely using reinforcement learning, enhancing its ability to generate accurate temporal boundaries.
  • Test-Time Reasoning: The model exhibits emergent reasoning capabilities during inference, generating a chain of thought to justify its segment predictions.
  • SOTA Performance: TimeZero sets a new SOTA on the Charades-STA benchmark.

This README provides an overview of TimeZero, including setup instructions, the training process, and evaluation guidelines.
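
As a concrete illustration of the test-time reasoning above: the model emits a chain of thought followed by a final segment prediction, which the caller parses into start/end timestamps. The sketch below assumes an R1-style <think>...</think><answer>...</answer> output template; the actual template is defined by this repo's training prompt, so treat the format and the parse_segment helper as hypothetical.

import re

# Assumed (hypothetical) output format: "<think>...</think><answer>START to END</answer>",
# with timestamps in seconds. Adjust the pattern to the repo's actual prompt template.
ANSWER_RE = re.compile(
    r"<answer>\s*(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*</answer>"
)

def parse_segment(output_text):
    """Return (start_sec, end_sec) parsed from the model output, or None if absent."""
    m = ANSWER_RE.search(output_text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_segment("<think>The person opens the door early on.</think>"
                    "<answer>2.1 to 7.8</answer>"))  # -> (2.1, 7.8)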

Example:

[example figure]

Training Visualization:

[training visualization figure]

Setup

# environment.yml supplies the Python version and dependencies;
# the environment name is assumed to be "timezero".
conda env create -f environment.yml
conda activate timezero

Training

TimeZero training involves the following steps:

  1. Data Preprocessing:

    Download the datasets: Charades-STA (annotations), Charades-v1 (videos), and ActivityNet.

    Before training, you need to preprocess the video data.

    bash preprocess_video.sh

    Specify the paths to the Charades-STA data (video files, annotations, etc.) in preprocess_video.sh.

  2. GRPO Training:

    cd scripts
    bash run_grpo_video.sh

    run_grpo_video.sh (a sketch of the GRPO reward computation follows this list):

    #!/bin/bash

    export DEBUG_MODE="false"  # Set to "true" for verbose logging during training.
    export LOG_PATH="./debug_log.txt"

    # Set these before launching; the values below are placeholders.
    OUTDIR="./outputs/timezero_grpo"
    WANDB_NAME="timezero_grpo"

    torchrun --nproc_per_node="4" \
        --nnodes="1" \
        --node_rank="0" \
        --master_addr="127.0.0.1" \
        --master_port="12361" \
        src/open_r1/grpo_video.py \
        --deepspeed scripts/zero3_offload.json \
        --output_dir $OUTDIR \
        --model_name_or_path mllm/Qwen2.5-VL-7B-Instruct \
        --preprocessed_data_path ./Charades_preprocessed_data_maxpix_3584 \
        --train_data_path ./Charades/charades_annotation/train.json \
        --eval_data_path ./Charades/charades_annotation/val.json \
        --video_folder ./Charades/Charades_v1 \
        --dataset_name xxx \
        --max_prompt_length 8192 \
        --max_completion_length 1024 \
        --num_generations 8 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 2 \
        --logging_steps 1 \
        --bf16 \
        --torch_dtype bfloat16 \
        --data_seed 42 \
        --gradient_checkpointing true \
        --attn_implementation flash_attention_2 \
        --num_train_epochs 2 \
        --run_name $WANDB_NAME \
        --report_to wandb \
        --save_steps 50 \
        --save_only_model true
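
GRPO samples a group of completions for each training query (num_generations above), scores each completion with a scalar reward, and normalizes the rewards within the group to obtain advantages. The sketch below pairs that group normalization with a temporal-IoU reward, a natural fit for TVG; the actual reward functions live in src/open_r1/grpo_video.py, and the helper names here (temporal_iou, group_relative_advantages) are illustrative, not this repo's API.

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 sampled completions for one query, scored against the ground truth.
gt = (2.0, 8.0)
preds = [(1.5, 7.0), (2.0, 8.0), (0.0, 3.0), (5.0, 9.0),
         (2.5, 7.5), (10.0, 12.0), (1.0, 8.5), (3.0, 6.0)]
rewards = [temporal_iou(p, gt) for p in preds]
print(group_relative_advantages(rewards))  # higher-IoU completions get positive advantages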

Evaluation

After training, evaluate your model's performance:

bash scripts/evaluate.sh

evaluate.sh

python evaluate.py --model_base <path_to_your_trained_model> --dataset <charades or activitynet>

The evaluation script (evaluate.py) is responsible for loading your model, processing the test data, and computing the relevant metrics (R1@0.3, R1@0.5, R1@0.7, etc.).
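
Here R1@t denotes Recall@1 at temporal-IoU threshold t: the fraction of test queries whose single predicted segment overlaps the ground truth with IoU >= t. A minimal sketch of the computation, with an illustrative function name and data layout rather than this repo's API:

def r1_at_iou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Recall@1: fraction of queries whose top-1 segment reaches each IoU threshold."""
    def iou(a, b):
        # Intersection over union of two (start, end) intervals, in seconds.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    return {f"R1@{t}": sum(i >= t for i in ious) / len(ious) for t in thresholds}

# Example with three (prediction, ground-truth) pairs:
preds = [(2.0, 7.5), (0.0, 4.0), (10.0, 15.0)]
gts = [(2.0, 8.0), (3.0, 9.0), (11.0, 15.5)]
print(r1_at_iou(preds, gts))  # approx. {'R1@0.3': 0.667, 'R1@0.5': 0.667, 'R1@0.7': 0.667}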

Results

  • Charades-STA (Finetuned)

TimeZero outperforms previous state-of-the-art methods by a large margin.

Method                  Type  R1@0.3  R1@0.5  R1@0.7
EaTR (VLP SOTA)         VLP   -       68.4    44.9
TimeSuite (LVLM SOTA)   SFT   79.4    67.1    43.0
TimeZero (ours)         RL    83.3    72.5    47.9
  • ActivityNet (Finetuned)

TimeZero surpasses previous state-of-the-art LVLMs.

Method                  Type  R1@0.3  R1@0.5  R1@0.7
EaTR (VLP SOTA)         VLP   -       58.18   37.64
TRACE (LVLM SOTA)       SFT   54.0    37.7    24.0
TimeZero (ours)         RL    68.6    47.3    26.9

Acknowledgements

We thank the authors of the following projects for their contributions:

Citation

@article{wang2025timer1posttraininglargevision,
  title={Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding},
  author={Ye Wang and Ziheng Wang and Boshen Xu and Yang Du and Kejun Lin and Zihan Xiao and Zihao Yue and Jianzhong Ju and Liang Zhang and Dingyi Yang and Xiangnan Fang and Zewen He and Zhenbo Luo and Wenxuan Wang and Junqi Lin and Jian Luan and Qin Jin},
  journal={arXiv preprint arXiv:2503.13377},
  year={2025}
}
