[2025/05/23] We release our paper Efficient RL Training for Reasoning Models via Length-Aware Optimization and update the main branch
[2025/03/19] We release our GitHub project (branch old)
Code for the paper "Efficient RL Training for Reasoning Models via Length-Aware Optimization"
We introduce Short-RL, a simple yet effective technique to control response length during RL training of R1-like models while maintaining stable performance.
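As a rough illustration of the general idea only (not the exact formulation from the paper), one common way to make RL training length-aware is to fold a small length penalty into the task reward. The budget and coefficient below are hypothetical placeholders:

```python
# Illustrative sketch only -- NOT the exact objective used in the paper.
# A generic way to make an RL reward length-aware: keep the task reward
# dominant and subtract a small penalty once a response exceeds a budget.

def length_aware_reward(task_reward: float,
                        response_len: int,
                        length_budget: int = 2048,   # hypothetical budget
                        penalty_coef: float = 0.1) -> float:
    """Penalize only the portion of the response beyond the budget."""
    overflow = max(0, response_len - length_budget)
    penalty = penalty_coef * overflow / length_budget
    return task_reward - penalty

# Example: a correct but overly long response gets a slightly lower reward.
print(length_aware_reward(task_reward=1.0, response_len=3072))  # 0.95
```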
To begin working with Short-RL for the Logic-RL dataset, just run:
cd Logic-RL
conda create -n logic python=3.9
conda activate logic
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3 ray
pip3 install flash-attn --no-build-isolation
pip install -e . # For verl integration
pip install wandb IPython matplotlib
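After installation, a quick sanity check (assuming a CUDA-capable GPU is visible) confirms that the key packages resolve correctly:

```python
# Optional: verify the environment after installation.
import torch
import vllm
import ray
import flash_attn

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm:", vllm.__version__)
print("ray:", ray.__version__)
print("flash-attn:", flash_attn.__version__)
```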
To begin working with Short-RL for the three math settings, just run:
cd deepscaler
bash setup.sh
We directly use the data from Logic-RL, located at Logic-RL/data/kk/instruct
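To take a quick look at the Knights-and-Knaves training data before launching a run, something like the snippet below works; the sub-folder and column names are assumptions and may differ in your checkout:

```python
# Peek at the Knights-and-Knaves training data.
# NOTE: the sub-folder name ("5ppl") and the column layout are assumptions;
# adjust them to match your local checkout.
import pandas as pd

df = pd.read_parquet("Logic-RL/data/kk/instruct/5ppl/train.parquet")
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0])
```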
Train Short-RL
cd Logic-RL
bash sh/Short-RL.sh # Normal-RL.sh for baseline comparison
The performance of Logic-RL is sensitive to the learning rate. In our experiments, a learning rate of 1e-6 with a batch size of 8 yields the best convergence within 3 epochs, and this is the setting used in the paper. However, this configuration can be unstable, sometimes leading to sudden drops in test accuracy during training, regardless of whether standard RL or Short-RL is used. To reliably reproduce the paper's results, multiple runs may be necessary.
For more stable training, a learning rate of 4e-7 is a robust alternative, though it requires more epochs to converge.
Eval
cd eval_kk
bash eval.sh
cd Math_eval
bash test_aime.sh
bash test_amc.sh
Data preparation:
You can directly use the data provided at deepscaler/data/orzmath, deepscaler/data/ThinkDeepScaler, and deepscaler/data/ThinksimpleRL
Or, if you want to prepare it yourself, taking Open-Reasoner-Zero as an example, first download the curated 57k training data from Orz to ./deepscaler/data.
Then run
bash ./scripts/data/data.sh
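A quick way to verify the prepared data is to list the generated parquet files and their row counts; the directory layout below is an assumption, so adjust it to whatever data.sh actually produced:

```python
# List prepared math datasets and report row counts.
# NOTE: the output directory is an assumption; adjust to your local layout.
from pathlib import Path
import pandas as pd

for parquet_file in sorted(Path("deepscaler/data").rglob("*.parquet")):
    df = pd.read_parquet(parquet_file)
    print(f"{parquet_file}: {len(df)} rows")
```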
Train Short-RL
cd deepscaler
# Open Reasoner Zero
bash scripts/train/Short-RL.sh # Normal-RL.sh for baseline comparison
# DeepScaleR
bash scripts/deepscaler/Short-RL.sh # Normal-RL.sh for baseline comparison
# SimpleRL-Math
bash scripts/simplerl/Short-RL.sh # Normal-RL.sh for baseline comparison
Evaluation
The evaluation curves can be viewed in wandb during training. Alternatively, to evaluate the model after training, run:
bash ./scripts/eval/eval_model.sh
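If you prefer to pull the wandb curves programmatically instead of using the web UI, the sketch below uses wandb's public API; the run path and metric key are placeholders you will need to replace with values from your own run:

```python
# Optional: fetch evaluation curves from wandb via its public API.
# NOTE: "<entity>/<project>/<run_id>" and the metric key are placeholders;
# replace them with the values from your own training run.
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history(keys=["val/test_score"])  # hypothetical metric name
print(history.tail())
```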
- Our training framework is built on Logic-RL, deepscaler, verl, and Ray
- Our models are based on Qwen2.5-7B and DeepSeek-R1-Distill-Qwen-1.5B
- Our math data comes from Open-Reasoner-Zero, deepscaler, and simpleRL-reason
@misc{yuan2025efficientrltrainingreasoning,
title={Efficient RL Training for Reasoning Models via Length-Aware Optimization},
author={Danlong Yuan and Tian Xie and Shaohan Huang and Zhuocheng Gong and Huishuai Zhang and Chong Luo and Furu Wei and Dongyan Zhao},
year={2025},
eprint={2505.12284},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2505.12284},
}