- 2024-04-10: All training code is released.
- 2024-04-08: All checkpoints (R1V-Free-VL-3B, R1V-Free-VL-7B) are released.
- 2024-04-01: Initial release of the R1V-Free framework (v0.1-alpha).
- AI-Feedback 🤖: The first visual reasoning model to use AI feedback as its reward signal (see the sketch after this list).
- Label-Free 🔄: No ground-truth labels are needed as supervision.
- Open-ended 🌍: Supports training on open-ended questions, enhancing the ability to understand open-world visual concepts.
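The snippet below is a minimal sketch of the label-free AI-feedback idea: instead of comparing a rollout against a reference answer, a pretrained reward model (the acknowledgements name InternLM-XComposer-2.5-Reward) scores each rollout directly. The `RewardModel` class and its `score` method are illustrative assumptions, not the repository's actual interface.

```python
# Minimal sketch (illustrative only): score rollouts with an AI reward model
# instead of ground-truth labels. `RewardModel` and `score` are hypothetical
# stand-ins for a pretrained scorer such as InternLM-XComposer-2.5-Reward.
from typing import List


class RewardModel:
    """Hypothetical wrapper around a pretrained multimodal reward model."""

    def score(self, question: str, image_path: str, answer: str) -> float:
        # In practice this would run the reward model's forward pass and
        # return a scalar preference score for the candidate answer.
        raise NotImplementedError


def ai_feedback_rewards(
    reward_model: RewardModel,
    question: str,
    image_path: str,
    completions: List[str],
) -> List[float]:
    """Score each rollout with the reward model; no reference answer is needed."""
    return [reward_model.score(question, image_path, c) for c in completions]
```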
- Release the Training Code.
- Release the Evaluation Code.
- Release the R1V-Free-3B Checkpoint.
- Release the R1V-Free-7B Checkpoint.
- Release the Wandb records of the training process.
Wandb training curves: acc_reward, format_reward, completion length, and reward_std.
```bash
# Create and activate the conda environment
conda create -n r1v-free python=3.11 -y && conda activate r1v-free

# Install dependencies with automatic CUDA detection
bash setup.sh
```
Note
If you encounter a bug when running the script, first try aligning your environment with `./src/requirements.txt`.
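If the automatic setup leaves the environment in an inconsistent state, a manual reinstall from the pinned requirements followed by a quick CUDA check usually helps. The commands below are a suggested sanity check, not part of the repository's documented workflow:

```bash
# Reinstall the pinned dependencies and confirm that PyTorch can see the GPUs.
pip install -r ./src/requirements.txt
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```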
1. Qwen2-VL series: [2B-Instruct 🤗] | [7B-Instruct 🤗]
2. Qwen2.5-VL series: [3B-Instruct 🤗] | [7B-Instruct 🤗]
🤗 R1V-Free Training Dataset: RLHF-V
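As a quick sanity check, the training set can be loaded from the Hugging Face Hub under the id used by the training command below (`Exgc/R1V-Free_RLHFV`). The snippet is a minimal sketch and makes no assumption about the dataset's field names; it simply prints whatever splits and columns exist:

```python
# Minimal sketch: inspect the R1V-Free training dataset from the Hugging Face Hub.
from datasets import load_dataset

# Dataset id taken from the training command (--dataset_name Exgc/R1V-Free_RLHFV).
ds = load_dataset("Exgc/R1V-Free_RLHFV")
print(ds)  # available splits, row counts, and column names

first_split = next(iter(ds.values()))
print(first_split[0].keys())  # fields of a single example
```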
```bash
cd src/R1V-Free

export DEBUG_MODE="true" # Enable debug mode to log the model's rollouts during RL
export LOG_PATH="./debug_log.txt"
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export REWARD_GPUS=<GPU_NUMS of REWARD_MODEL> # number of GPUs allocated to the reward model
torchrun --nproc_per_node="6" \
--nnodes="1" \
--node_rank="0" \
--master_addr="127.0.0.1" \
--master_port="12345" \
src/grpo.py \
--output_dir <OUTPUT_DIR> \
--model_name_or_path <PATH-TO-Qwen2-VL-2B-Instruct> \
--dataset_name Exgc/R1V-Free_RLHFV \
--deepspeed local_scripts/zero3.json \
--max_prompt_length 512 \
--max_completion_length 512 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--logging_steps 1 \
--bf16 \
--report_to wandb \
--gradient_checkpointing false \
--attn_implementation flash_attention_2 \
--max_pixels 401408 \
--num_train_epochs 10 \
--run_name Qwen2-VL-2B-GRPO-RLHF-V \
--save_steps 100 \
--save_only_model true \
--num_generations 8 # number of rollouts G in GRPO; reducing it speeds up training and lowers memory use but increases variance
```
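For intuition about `--num_generations`, the sketch below shows how GRPO-style group-relative advantages are typically computed: G rollouts are sampled per prompt, each is scored, and each score is normalized by the group's mean and standard deviation. This is a generic illustration of the idea, not the implementation in `src/grpo.py`:

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only).
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of G rollouts for one prompt by the group statistics.

    With a small G the group mean/std are noisy, which is why lowering
    --num_generations trades memory and speed for higher-variance advantages.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Example: 8 rollouts for one prompt, each scored by the AI reward model.
print(group_relative_advantages([0.9, 0.4, 0.7, 0.2, 0.8, 0.5, 0.6, 0.3]))
```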
We evaluate our pretrained models using VLMEvalKit.
- Install VLMEvalKit: navigate to the `src/VLMEvalKit` directory and run `pip install -e .`
- Set up the API key (if required): to use LLM APIs as the judge or choice extractor, set up the API keys in `src/VLMEvalKit/.env`.
- Run evaluation: for instance, to evaluate our R1V-Free-2.5VL-3B on the MMVet dataset, navigate to `src/VLMEvalKit` and run `python run.py --data MMVet --model R1V-Free-2.5VL-3B --verbose --reuse`
  - `--verbose`: shows detailed evaluation logs.
  - `--reuse`: reuses previously computed results if available.
Note
For other benchmarks supported by VLMEvalKit, you can replace `--data MMVet` with datasets such as `HallusionBench`, `MathVista`, etc.
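For example, a HallusionBench run with the same checkpoint looks like the following; only the `--data` value changes, all other flags stay the same as above:

```bash
cd src/VLMEvalKit
python run.py --data HallusionBench --model R1V-Free-2.5VL-3B --verbose --reuse
```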
We build upon these foundational works:
Category | Resources |
---|---|
Codebase | DeepSeek-R1, Open-R1-Multimodal, R1-V, VLMEvalKit |
Pretrained Model | QwenVL, InternLM-XComposer-2.5-Reward |
Training Data | RLHF-V |
Evaluation Data | MMVet |
@article{Cheng_R1V-Free_Advancing_Open-World_2025,
author = {Cheng, Xize and Cai, Zhengzhou and Wang, Zehan and Ji, Shengpeng and Jiang, Ziyue and Jin, Tao and Zhao, Zhou},
title = {{R1V-Free: Advancing Open-World Visual Reasoning with Label-Free AI Feedback}},
url = {https://github.com/Exgc/R1V-Free},
year = {2025}
}