Short RL

Short RL: Efficient RL Training for Reasoning Models via Length-Aware Optimization


News

[2025/05/23] We release our paper Efficient RL Training for Reasoning Models via Length-Aware Optimization and update the main branch

[2025/03/19] We release our GitHub project (branch old)

Overview

Code for the paper "Efficient RL Training for Reasoning Models via Length-Aware Optimization"

We introduce Short-RL, a simple yet effective technique to control response length during the RL training process of R1-like models, while maintaining stable performance.
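The precise objective is defined in the paper; as a rough, non-authoritative sketch of what a length-aware reward can look like, the snippet below subtracts a small length penalty from the reward of correct responses only, so that shorter correct answers are preferred without further punishing incorrect ones. The function name, the linear penalty, and the constants alpha and max_len are illustrative assumptions, not the paper's formulation.

# Illustrative sketch only; not the exact method from the paper.
def length_aware_reward(base_reward: float, response_len: int,
                        max_len: int = 4096, alpha: float = 0.1) -> float:
    """Shape a scalar RL reward so shorter correct responses score higher."""
    if base_reward <= 0:              # incorrect answer: leave the reward untouched
        return base_reward
    penalty = alpha * min(response_len / max_len, 1.0)
    return base_reward - penalty      # correct answer: mild penalty grows with length

# A correct 512-token response keeps most of its reward; a correct 4096-token one loses more.
print(length_aware_reward(1.0, 512))   # 0.9875
print(length_aware_reward(1.0, 4096))  # 0.9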

Getting Started 🚀

Installation & Training Scripts

Logic-RL Setup

To begin working with Short-RL for the Logic-RL dataset, just run:

cd Logic-RL
conda create -n logic python=3.9
conda activate logic
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3 ray
pip3 install flash-attn --no-build-isolation
pip install -e .  # For verl integration
pip install wandb IPython matplotlib
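
As an optional sanity check (not part of the original setup steps), you can confirm that the pinned packages import and that the GPU is visible, assuming a CUDA machine matching the cu121 wheels above:

# Optional check: run inside the activated "logic" environment.
import torch
import vllm

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("vllm", vllm.__version__)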

Math-RL Setup

To begin working with Short-RL for the three math settings, just run:

cd deepscaler
bash setup.sh

Start Logic-RL Training

We directly use the data from Logic-RL at Logic-RL/data/kk/instruct.

Train Short-RL

cd Logic-RL
bash sh/Short-RL.sh # Normal-RL.sh for baseline comparison

The performance of Logic-RL is sensitive to the learning rate. In our experiments, a learning rate of 1e-6 with a batch size of 8 yields the best convergence within 3 epochs, which is the setting used in the paper. However, this configuration can be unstable, sometimes leading to sudden drops in test accuracy during training, regardless of whether standard RL or Short-RL is used. For reliable reproduction of the paper results, multiple runs may be necessary.

For more stable training, a learning rate of 4e-7 is a robust alternative, though it requires more epochs to converge.

Eval

For the Knights-and-Knaves logic benchmark:

cd eval_kk
bash eval.sh

For the math benchmarks (AIME and AMC):

cd Math_eval
bash test_aime.sh
bash test_amc.sh

Start Math-RL Training

Data preparation:

You can directly use the data at deepscaler/data/orzmath, deepscaler/data/ThinkDeepScaler, and deepscaler/data/ThinksimpleRL.

Alternatively, to prepare the data yourself, take Open Reasoner Zero (ORZ) as an example: first download the curated 57k training data from ORZ into ./deepscaler/data, then run

bash ./scripts/data/data.sh
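
If you want a quick look at the processed data, a short pandas snippet like the one below works; the output path and the parquet format are assumptions based on the directories listed above and on verl's usual data format, so adjust them to whatever data.sh actually writes.

# Hypothetical inspection of the processed training set; path and format are assumptions.
import pandas as pd

df = pd.read_parquet("deepscaler/data/orzmath/train.parquet")  # adjust to the real output file
print(df.shape)
print(df.columns.tolist())
print(df.iloc[0])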

Train Short-RL

cd deepscaler
# Open Reasoner Zero
bash scripts/train/Short-RL.sh # Normal-RL.sh for baseline comparison
# DeepScaleR
bash scripts/deepscaler/Short-RL.sh # Normal-RL.sh for baseline comparison
# SimpleRL-Math
bash scripts/simplerl/Short-RL.sh # Normal-RL.sh for baseline comparison

Evaluation

The evaluation curves can be seen in wandb during training.

Alternatively, to evaluate a trained model afterwards, run:

bash ./scripts/eval/eval_model.sh

Acknowledgements

Our training framework is built on Logic-RL, DeepScaleR, verl, and Ray.

Citation

@misc{yuan2025efficientrltrainingreasoning,
      title={Efficient RL Training for Reasoning Models via Length-Aware Optimization}, 
      author={Danlong Yuan and Tian Xie and Shaohan Huang and Zhuocheng Gong and Huishuai Zhang and Chong Luo and Furu Wei and Dongyan Zhao},
      year={2025},
      eprint={2505.12284},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.12284}, 
}
