
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Updates

  • 2025-05-27: Expanded Time-R1 ArXiv paper released! Read on ArXiv.
  • 2025-03-17: TimeZero initial release! Code and evaluation scripts are now available.
  • 2025-03-17: TimeZero achieves SOTA performance on Charades-STA!

Overview

TimeZero is a reasoning-guided Large Vision-Language Model (LVLM) for Temporal Video Grounding (TVG): given a natural language query, it localizes the temporal segment of a video that the query describes. TimeZero is trained entirely with reinforcement learning, which lets the model reason explicitly about video-language relationships at inference time.

Key Features:

  • Reinforcement Learning Training: TimeZero is trained entirely using reinforcement learning, enhancing its ability to generate accurate temporal boundaries.
  • Test-Time Reasoning: The model exhibits emergent reasoning capabilities during inference, generating a chain of thought to justify its segment predictions.
  • SOTA Performance: TimeZero sets a new SOTA on the Charades-STA benchmark.

This README provides an overview of TimeZero, including setup instructions, the training process, and evaluation guidelines.
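
As a concrete illustration of the test-time reasoning above: the model emits a chain of thought followed by a final segment prediction, which the caller parses into start/end timestamps. The sketch below assumes an R1-style <think>...</think><answer>...</answer> output template; the actual template is defined by this repo's training prompt, so treat the format and the parse_segment helper as hypothetical.

import re

# Assumed (hypothetical) output format: "<think>...</think><answer>START to END</answer>",
# with timestamps in seconds. Adjust the pattern to the repo's actual prompt template.
ANSWER_RE = re.compile(
    r"<answer>\s*(\d+(?:\.\d+)?)\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*</answer>"
)

def parse_segment(output_text):
    """Return (start_sec, end_sec) parsed from the model output, or None if absent."""
    m = ANSWER_RE.search(output_text)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    return (start, end) if start <= end else (end, start)

print(parse_segment("<think>The person opens the door early on.</think>"
                    "<answer>2.1 to 7.8</answer>"))  # -> (2.1, 7.8)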

Example:

[example figure]

Training Visualization:

[training visualization figure]

Setup

# environment.yml supplies the Python version and dependencies;
# the environment name is assumed to be "timezero".
conda env create -f environment.yml
conda activate timezero

Training

TimeZero training involves the following steps:

  1. Data Preprocessing:

    Download the datasets: Charades-STA (annotations), Charades-v1 (videos), and ActivityNet.

    Before training, you need to preprocess the video data.

    bash preprocess_video.sh

    Specify the paths to the Charades-STA data (video files, annotations, etc.) in preprocess_video.sh.

  2. GRPO Training:

    cd scripts
    bash run_grpo_video.sh

    run_grpo_video.sh (a sketch of the GRPO reward computation follows this list):

    #!/bin/bash

    export DEBUG_MODE="false"  # Set to "true" for verbose logging during training.
    export LOG_PATH="./debug_log.txt"

    # Set these before launching; the values below are placeholders.
    OUTDIR="./outputs/timezero_grpo"
    WANDB_NAME="timezero_grpo"

    torchrun --nproc_per_node="4" \
        --nnodes="1" \
        --node_rank="0" \
        --master_addr="127.0.0.1" \
        --master_port="12361" \
        src/open_r1/grpo_video.py \
        --deepspeed scripts/zero3_offload.json \
        --output_dir $OUTDIR \
        --model_name_or_path mllm/Qwen2.5-VL-7B-Instruct \
        --preprocessed_data_path ./Charades_preprocessed_data_maxpix_3584 \
        --train_data_path ./Charades/charades_annotation/train.json \
        --eval_data_path ./Charades/charades_annotation/val.json \
        --video_folder ./Charades/Charades_v1 \
        --dataset_name xxx \
        --max_prompt_length 8192 \
        --max_completion_length 1024 \
        --num_generations 8 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 2 \
        --logging_steps 1 \
        --bf16 \
        --torch_dtype bfloat16 \
        --data_seed 42 \
        --gradient_checkpointing true \
        --attn_implementation flash_attention_2 \
        --num_train_epochs 2 \
        --run_name $WANDB_NAME \
        --report_to wandb \
        --save_steps 50 \
        --save_only_model true
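
GRPO samples a group of completions for each training query (num_generations above), scores each completion with a scalar reward, and normalizes the rewards within the group to obtain advantages. The sketch below pairs that group normalization with a temporal-IoU reward, a natural fit for TVG; the actual reward functions live in src/open_r1/grpo_video.py, and the helper names here (temporal_iou, group_relative_advantages) are illustrative, not this repo's API.

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 8 sampled completions for one query, scored against the ground truth.
gt = (2.0, 8.0)
preds = [(1.5, 7.0), (2.0, 8.0), (0.0, 3.0), (5.0, 9.0),
         (2.5, 7.5), (10.0, 12.0), (1.0, 8.5), (3.0, 6.0)]
rewards = [temporal_iou(p, gt) for p in preds]
print(group_relative_advantages(rewards))  # higher-IoU completions get positive advantages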

Evaluation

After training, evaluate your model's performance:

bash scripts/evaluate.sh

evaluate.sh

python evaluate.py --model_base <path_to_your_trained_model> --dataset <charades or activitynet>

The evaluation script (evaluate.py) is responsible for loading your model, processing the test data, and computing the relevant metrics (R1@0.3, R1@0.5, R1@0.7, etc.).
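
Here R1@t denotes Recall@1 at temporal-IoU threshold t: the fraction of test queries whose single predicted segment overlaps the ground truth with IoU >= t. A minimal sketch of the computation, with an illustrative function name and data layout rather than this repo's API:

def r1_at_iou(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """Recall@1: fraction of queries whose top-1 segment reaches each IoU threshold."""
    def iou(a, b):
        # Intersection over union of two (start, end) intervals, in seconds.
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    return {f"R1@{t}": sum(i >= t for i in ious) / len(ious) for t in thresholds}

# Example with three (prediction, ground-truth) pairs:
preds = [(2.0, 7.5), (0.0, 4.0), (10.0, 15.0)]
gts = [(2.0, 8.0), (3.0, 9.0), (11.0, 15.5)]
print(r1_at_iou(preds, gts))  # approx. {'R1@0.3': 0.667, 'R1@0.5': 0.667, 'R1@0.7': 0.667}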

Results

  • Charades-STA (Finetuned)

TimeZero outperforms previous state-of-the-art methods by a large margin.

Method                  Type  R1@0.3  R1@0.5  R1@0.7
EaTR (VLP SOTA)         VLP   -       68.4    44.9
TimeSuite (LVLM SOTA)   SFT   79.4    67.1    43.0
TimeZero (ours)         RL    83.3    72.5    47.9
  • ActivityNet (Finetuned)

TimeZero surpasses previous state-of-the-art LVLMs.

Method                  Type  R1@0.3  R1@0.5  R1@0.7
EaTR (VLP SOTA)         VLP   -       58.18   37.64
TRACE (LVLM SOTA)       SFT   54.0    37.7    24.0
TimeZero (ours)         RL    68.6    47.3    26.9

Acknowledgements

We thank the authors of the following projects for their contributions:

Citation

@article{wang2025timer1posttraininglargevision,
  title={Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding},
  author={Ye Wang and Ziheng Wang and Boshen Xu and Yang Du and Kejun Lin and Zihan Xiao and Zihao Yue and Jianzhong Ju and Liang Zhang and Dingyi Yang and Xiangnan Fang and Zewen He and Zhenbo Luo and Wenxuan Wang and Junqi Lin and Jian Luan and Qin Jin},
  journal={arXiv preprint arXiv:2503.13377},
  year={2025}
}
