
TStar: A Unified KeyFrame Searching Framework for Video Question Answering

TStar framework

Post | Project Page | Demo | Dataset | Paper

TStar is an advanced framework that integrates keyframe searching into Vision-Language Models (VLMs), enhancing their performance on extremely long video understanding tasks. By efficiently identifying the frames relevant to a question, TStar improves the ability of state-of-the-art models such as LLaVA-OneVision, Qwen-VL, and GPT-4o to understand and reason over video data.

2025.4.4 Update: We’ve shared a compact demo set of TStar outputs: LV-Haystack Tiny (Google Drive)

2025.4.4 Update: We are thrilled to release TStar and LongVideoHaystack!

Features

  • Iterative Searching: iteratively identifies and focuses on the visual information most relevant to the question being asked (a minimal sketch of the idea follows this list).
  • Pluggable: easily integrates various grounding and searching backends.
  • Efficient Video QA: combines T* keyframe search with advanced video question answering capabilities.
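
The snippet below is a minimal, illustrative sketch of the iterative-search idea only; it is not the repository's implementation, and the names (`iterative_keyframe_search`, the `score_frames` callback) are hypothetical. It maintains a sampling distribution over frame indexes, scores sampled frames against the question, and sharpens the distribution around high-scoring regions before returning the top-K frames.

# Illustrative sketch only -- not the actual TStar code. `score_frames` is a
# hypothetical callback that returns one relevance score per frame in [0, 1]
# (e.g., from an open-vocabulary detector prompted with question keywords).
import numpy as np

def iterative_keyframe_search(frames, question, score_frames,
                              n_keyframes=8, n_iters=5, samples_per_iter=32):
    n = len(frames)
    prob = np.full(n, 1.0 / n)      # start with a uniform sampling distribution
    scores = np.zeros(n)
    for _ in range(n_iters):
        # Sample candidate frames according to the current distribution
        idx = np.random.choice(n, size=min(samples_per_iter, n),
                               replace=False, p=prob)
        idx_scores = np.asarray(score_frames([frames[i] for i in idx], question))
        scores[idx] = np.maximum(scores[idx], idx_scores)
        # Re-weight sampling toward temporal regions whose frames scored well
        k = min(5, n)
        smoothed = np.convolve(scores, np.ones(k) / k, mode="same")
        prob = smoothed + 1e-6
        prob /= prob.sum()
    # Return the indexes of the top-K scoring frames
    return np.argsort(scores)[-n_keyframes:][::-1]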

Getting Started

Installation

## Follow docs/installation to implement the Grounding (e.g., LLaVA) and Searching (e.g., YOLO) functions
### Install the Query Grounder interface (LLaVA or GPT API)
### Optional if you test with GPT-4o or Qwen
git clone https://github.com/LLaVA-VL/LLaVA-NeXT

### Install the Image Scorer interface, e.g., YOLO-World
### Optional if you test with owl-vit (faster run but lower performance)
git clone --recursive https://github.com/AILab-CVC/YOLO-World.git

Structure:

LV-Haystack/
├── LLaVA-NeXT/                      # Query grounding and QA interface (e.g., LLaVA, GPT-4 API, or Qwen from HF)
├── YOLO-World/                      # Open-vocabulary object detection model (optional)
├── TStar/                           # Core Python module for T* keyframe search
│   ├── interface_grounding.py       # Interface for grounding questions with VLMs
│   ├── interface_heuristic.py       # Functions for scoring images using YOLO
│   ├── interface_searcher.py        # Logic for searching keyframes in T*
│   ├── TStarFramework.py            # Example class integrating T* searching with QA
├── LVHaystackBench/                 # Scripts for inference on the LV-Haystack dataset
│   ├── run_TStar_onDataset.py       # Run keyframe search on a given dataset (e.g., LongVideoBench)
│   ├── val_tstar_results.py         # Evaluate keyframe search results on LV-Haystack
│   ├── val_qa_results.py            # Evaluate video question answering with searched keyframes
├── README.md                        # Documentation for the repository

Run TStar Demo

The example below demonstrates how to perform video question answering with the keyframe searching framework, using GPT-4o as the grounder (VLM) and OWL-ViT as the scoring heuristic.

export OPENAI_API_KEY=your_openai_api_key

python run_TStar_Demo_onVideo.py \
    --video_path /path/to/LVHaystack/38737402-19bd-4689-9e74-3af391b15feb.mp4 \
    --question "What is the color of the couch?" \
    --options "A) Red, B) Blue, C) Green, D) Yellow" \
    --grounder gpt-4o \
    --heuristic owl-vit \
    --search_nframes 8
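
If you want to run the demo over several questions on the same video, a small Python wrapper can shell out to the demo script with the same flags as above. This is only a convenience sketch: the questions are made up for illustration, and OPENAI_API_KEY must be exported as shown.

# Batch-run the demo script over a list of questions (illustrative helper).
import subprocess

VIDEO = "/path/to/LVHaystack/38737402-19bd-4689-9e74-3af391b15feb.mp4"
QUESTIONS = [
    ("What is the color of the couch?", "A) Red, B) Blue, C) Green, D) Yellow"),
    ("What object is next to the chair?", "A) Table, B) Lamp, C) Sofa, D) Bookshelf"),
]

for question, options in QUESTIONS:
    subprocess.run(
        [
            "python", "run_TStar_Demo_onVideo.py",
            "--video_path", VIDEO,
            "--question", question,
            "--options", options,
            "--grounder", "gpt-4o",
            "--heuristic", "owl-vit",
            "--search_nframes", "8",
        ],
        check=True,  # raise if the demo script exits with an error
    )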

Test on LV-Haystack

To evaluate T* on a dataset (e.g., LV-Haystack), use the following command:

bash ./eval_LV_Haystack.sh

Running T* on Your Dataset

To process your own dataset with T*, you need to prepare a JSON file describing the dataset. The JSON file should follow the format below:

Example JSON:
[
    {
        "file_name": "example_video.mp4",
        "question": "What is the color of the couch?",
        "choices": {
            "A": "Red",
            "B": "Blue",
            "C": "Green",
            "D": "Yellow"
        },
        "frame_indexes": [10, 50, 100]  // Optional: Use this for specific frame sampling
    },
    {
        "file_name": "another_video.mp4",
        "question": "What object is next to the chair?",
        "choices": {
            "A": "Table",
            "B": "Lamp",
            "C": "Sofa",
            "D": "Bookshelf"
        }
    }
]
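
If you prefer to generate the annotation file programmatically, the sketch below writes a JSON file in the format shown above using only the standard library. The file names, questions, and output path are placeholders for your own data.

# Build an annotation JSON in the format expected by T* (placeholder data).
import json

annotations = [
    {
        "file_name": "example_video.mp4",
        "question": "What is the color of the couch?",
        "choices": {"A": "Red", "B": "Blue", "C": "Green", "D": "Yellow"},
        "frame_indexes": [10, 50, 100],  # optional: restrict to specific frames
    },
    {
        "file_name": "another_video.mp4",
        "question": "What object is next to the chair?",
        "choices": {"A": "Table", "B": "Lamp", "C": "Sofa", "D": "Bookshelf"},
    },
]

with open("my_dataset.json", "w") as f:
    json.dump(annotations, f, indent=4)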

Once your dataset is prepared, you can run TStar to perform keyframe searching. Use the following command:

Example command:
python ./run_TStar_onDataset.py \
    --dataset_meta LVHaystack/LongVideoHaystack \
    --split test_tiny \
    --video_root ./Datasets/ego4d_data/ego4d_data/v1/256p \
    --output_json_name TStar_LVHaystack_tiny.json \
    --grounder gpt-4o \
    --heuristic owl-vit \
    --search_nframes 8
# Now the predicted frame indexes have been added to your annotations JSON;
# sample frames with the T* predictions for your downstream tasks.
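
The sketch below shows one way to consume the generated results and extract the frames that T* selected. The key holding the predicted indexes is an assumption here (shown as "frame_indexes"); check the generated TStar_LVHaystack_tiny.json for the exact schema. Frame decoding uses OpenCV.

# Read the T* output JSON and save the selected frames as images.
# NOTE: the "frame_indexes" key is an assumed name for the predicted
# keyframes -- verify it against your generated output file.
import json
import os
import cv2

VIDEO_ROOT = "./Datasets/ego4d_data/ego4d_data/v1/256p"

with open("TStar_LVHaystack_tiny.json") as f:
    results = json.load(f)

for item in results:
    video_path = os.path.join(VIDEO_ROOT, item["file_name"])
    base = os.path.splitext(item["file_name"])[0]
    cap = cv2.VideoCapture(video_path)
    for frame_idx in item.get("frame_indexes", []):  # assumed key for T* predictions
        cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)  # seek to the predicted frame
        ok, frame = cap.read()
        if ok:
            cv2.imwrite(f"{base}_{frame_idx}.jpg", frame)
    cap.release()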

Contact

Citation

If you find TStar helpful, please consider citing us:

@misc{tstar,
      title={Re-thinking Temporal Search for Long-Form Video Understanding}, 
      author={Jinhui Ye and Zihan Wang and Haosen Sun and Keshigeyan Chandrasegaran and Zane Durante and Cristobal Eyzaguirre and Yonatan Bisk and Juan Carlos Niebles and Ehsan Adeli and Li Fei-Fei and Jiajun Wu and Manling Li},
      year={2025},
      eprint={2504.02259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.02259}, 
}
