ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou*, Yangfan He*, Yaofeng Su, Siwei Han, Joel Jang, Gedas Bertasius, Mohit Bansal, Huaxiu Yao

*Equal Contribution

ReAgent-V is a modular, extensible, and reward-aware video reasoning framework designed to elevate video question answering and reasoning through:

  • 🔧 Flexible Tool Integration — Plug-and-play support for OCR, ASR, object detection, scene graph generation, captioning, and more
  • 🧠 Reward-Guided Inference — Enables real-time self-correction via structured reward signals
  • 🎯 Adaptive Model Alignment — Aligns models dynamically based on inference-time feedback
  • 🗂️ High-Quality Data Selection — Facilitates sample-efficient learning using reflective evaluation
  • 📊 Entropy-Calibrated Frame Selection — Prioritizes key frames for focused reasoning
  • 🔁 Multi-Perspective Reflection — Refines answers through debate among conservative, neutral, and aggressive viewpoints

News

🔥 [June 2, 2025] Our latest paper is now live on arXiv: arXiv:2506.01300!


📌 Overview

[Framework overview figure]

🚀 Applications

ReAgent-V supports a range of real-world tasks via dedicated application modules:

🧭 VLA Alignment

Aligns Vision-Language-Action (VLA) models using Trajectory-wise Preference Optimization (TPO) guided by ReAgent-V’s reward feedback. Specifically, ReAgent-V evaluates each trajectory across multiple axes—such as task success, temporal stability, visual grounding, and semantic precision—and performs multi-agent reflection to produce refined, high-fidelity reward scores for alignment.
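As a rough illustration, the per-trajectory scoring could be aggregated as in the sketch below. The evaluation axes follow the text; the dataclass, the weights, and the reflection adjustment are hypothetical and not the repository's actual API.

```python
# Minimal sketch: turn multi-axis trajectory scores into one TPO reward.
# All names (AxisScores, reflection_delta, the weights) are illustrative.
from dataclasses import dataclass

@dataclass
class AxisScores:
    task_success: float        # did the trajectory complete the task? (0-1)
    temporal_stability: float  # smoothness / consistency over time (0-1)
    visual_grounding: float    # agreement with visual evidence (0-1)
    semantic_precision: float  # faithfulness of language to the scene (0-1)

def trajectory_reward(scores: AxisScores, reflection_delta: float = 0.0) -> float:
    """Combine per-axis scores into a single reward, then apply the
    multi-agent reflection correction (reflection_delta)."""
    weights = {"task_success": 0.4, "temporal_stability": 0.2,
               "visual_grounding": 0.2, "semantic_precision": 0.2}
    base = sum(getattr(scores, k) * w for k, w in weights.items())
    # Reflection nudges the base score up or down; keep it in [0, 1].
    return min(1.0, max(0.0, base + reflection_delta))

# Example: a trajectory that succeeds but is visually shaky.
print(trajectory_reward(AxisScores(1.0, 0.5, 0.8, 0.9), reflection_delta=-0.05))
```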

🎥 Video Understanding

  • Entropy-Calibrated Frame Selection
    Efficiently selects the most informative frames for video reasoning (see the sketch after this list).

  • Tool-Augmented Inference
    Dynamically integrates multimodal tools such as OCR, ASR, object detection, scene graph generation, and captioning.

  • Multi-Agent Reflection
    Iteratively refines outputs by encouraging disagreement and consensus among diverse agent personas (conservative / neutral / aggressive).

  • 📁 Module: ReAgent-V

  • 📘 Instructions: Video Understanding README
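A minimal sketch of entropy-calibrated frame selection, assuming each frame already has a relevance distribution from a lightweight scorer. Whether high or low entropy should be preferred depends on that scorer; here the lowest-entropy (most confident) frames are kept purely as an illustration, and all names are hypothetical rather than the repository's API.

```python
import numpy as np

def frame_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a per-frame relevance distribution."""
    p = probs / probs.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def select_frames(frame_probs: list[np.ndarray], k: int = 8) -> list[int]:
    """Keep the k frames whose distributions are most peaked (lowest
    entropy), i.e. the frames the scorer is most confident about."""
    entropies = [frame_entropy(p) for p in frame_probs]
    return sorted(np.argsort(entropies)[:k].tolist())

# Example: 32 frames, each with a 10-way relevance distribution.
rng = np.random.default_rng(0)
frames = [rng.random(10) for _ in range(32)]
print(select_frames(frames, k=8))
```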

📈 Reward-Aware Data Curation and Collection for SFT, DPO, GRPO, and Beyond

ReAgent-V enables inference-time data curation by leveraging real-time rewards and reflection-based diagnostics. Depending on the optimization paradigm, the extraction strategy varies:

🧪 For SFT (Supervised Fine-Tuning)

ReAgent-V can directly collect samples with high reward scores (from the evaluation report) without requiring additional reflection.

  • ✅ These samples indicate that the model’s initial reasoning is reliable.
  • 📥 Stored as supervised training pairs with accompanying scalar reward labels from the critic agent.

Simple, scalable, and label-efficient: reward scores enable dynamic filtering without manual annotation.
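As a minimal sketch (assuming each evaluation report exposes a scalar critic reward; the field names and the 8/10 threshold are illustrative, not the repository's actual format), the SFT collection step might look like:

```python
def collect_sft_samples(reports: list[dict], min_reward: float = 8.0) -> list[dict]:
    """Keep only samples whose critic reward clears the threshold and store
    them as supervised (prompt, target) pairs with the reward label attached."""
    sft_data = []
    for r in reports:
        if r["reward"] >= min_reward:
            sft_data.append({
                "prompt": {"video": r["video"], "question": r["question"]},
                "target": r["answer"],
                "reward": r["reward"],  # kept for later weighting or filtering
            })
    return sft_data
```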

🔄 For GRPO (Group Relative Policy Optimization)

To curate high-value training data for GRPO, ReAgent-V employs a reflection-triggering mechanism grounded in importance scoring, effectively identifying challenging yet informative video-text samples during the video understanding phase.

  • 🎥 Each input is a (video, text) pair, typically comprising a video and its initial response.
  • 📊 During inference, ReAgent-V computes an importance score (denoted as E.importance_score) based on the critic agent’s overall assessment of reasoning sufficiency.
  • ❗ If this importance score falls below a threshold (e.g., < 5 out of 10), the sample is considered difficult: the model struggled with its initial reasoning and likely requires further refinement.
  • 📥 The resulting (video, text) samples are labeled as reflection-worthy and collected as valuable candidates for GRPO training.
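A minimal sketch of this selection rule, assuming each inference record carries the importance score from the critic agent (the record layout and field names are illustrative):

```python
REFLECTION_THRESHOLD = 5  # out of 10, as in the example above

def collect_grpo_samples(records: list[dict]) -> list[dict]:
    """Flag (video, text) pairs whose importance score falls below the
    threshold as reflection-worthy GRPO training candidates."""
    candidates = []
    for rec in records:
        if rec["importance_score"] < REFLECTION_THRESHOLD:
            candidates.append({
                "video": rec["video"],
                "text": rec["initial_response"],
                "importance_score": rec["importance_score"],
                "label": "reflection_worthy",
            })
    return candidates
```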

⚖️ For DPO (Direct Preference Optimization)

ReAgent-V supports Direct Preference Optimization (DPO) by reframing itself from a video reasoning agent into a reward-generating agent. This is achieved through a task template modification that emphasizes evaluating answer quality rather than producing a single correct answer.

  • 🧠 Transforms the task from “answer the video question” into “score the video from different perspectives” based on visual evidence.

  • ♻️ Uses multi-perspective reflection outputs (e.g., conservative, neutral, aggressive) to generate candidate rewards.

  • 📊 Each candidate reward is generated along customizable reward dimensions, such as:

    • 🎯 Visual alignment
    • ⏱️ Temporal accuracy
    • 💬 Linguistic precision
    • 🧠 Reasoning specificity
    • 🔍 Option disambiguation
  • ✅ The system identifies the answer with the highest reflection reward as the preferred choice.

  • 🔗 Constructs (preferred, rejected) pairs from these outputs to serve as DPO training data.

Unlike static or hand-crafted rewards, ReAgent-V’s feedback is context-aware, multi-dimensional, and fully dynamic, adapting to each video-question instance.
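A minimal sketch of the pair-construction step described above: the perspectives and reward dimensions follow the text, while the mean aggregation and the record layout are assumptions for illustration.

```python
from statistics import mean

def aggregate_reward(dimension_scores: dict[str, float]) -> float:
    """Average the customizable reward dimensions (visual alignment,
    temporal accuracy, linguistic precision, ...)."""
    return mean(dimension_scores.values())

def build_dpo_pair(candidates: list[dict]) -> dict:
    """Rank candidate answers by their reflection reward and pair the
    best against the worst for DPO training."""
    ranked = sorted(candidates, key=lambda c: aggregate_reward(c["scores"]),
                    reverse=True)
    return {"preferred": ranked[0]["answer"], "rejected": ranked[-1]["answer"]}

# Example with two candidate answers scored along three dimensions.
pair = build_dpo_pair([
    {"answer": "A", "scores": {"visual": 0.9, "temporal": 0.8, "linguistic": 0.85}},
    {"answer": "B", "scores": {"visual": 0.4, "temporal": 0.6, "linguistic": 0.5}},
])
print(pair)  # {'preferred': 'A', 'rejected': 'B'}
```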


🌀 Unified Insight: ReAgent-V closes the data curation loop. Its multi-agent reward pipeline not only improves current inference but continuously supplies high-fidelity data for future optimization—making learning systems self-refining in the wild.

🧑‍💻 Getting Started

Each subfolder contains its own README.md with detailed installation, setup, and training instructions. To get started:

  1. Clone the repository
  2. Follow the environment setup and requirements in each module
  3. Explore the demo scripts and customize as needed

💬 If you have questions or encounter any issues, feel free to open an issue or contact the maintainers.


📚 Citation

If you find ReAgent-V helpful in your research or projects, please consider citing:

@article{zhou2025reagent,
  title={ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding},
  author={Zhou, Yiyang and He, Yangfan and Su, Yaofeng and Han, Siwei and Jang, Joel and Bertasius, Gedas and Bansal, Mohit and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2506.01300},
  year={2025}
}
