- [2025-04-24] 🎉🎉🎉 We release our first progress blog on Notion, together with the first version of our base and RL models on HuggingFace, which are trained on the Llama-3 series.
Note: We are still exploring more possibilities and expanding to different model families, but we are eager to share some empirical findings with the community in an open-source manner!
We explore how different early pre-training (mid-training) strategies affect post-training stages, especially Reinforcement Learning (RL). We hope to reshape the pre-training stage of LLMs in the era of RL scaling. 🐙 OctoThinker is our initial attempt in this direction: we go through a full pipeline of pre-training, RL, and evaluation to uncover deeper insights.
"Octo" is from the word "octopus", representing our base model families which are branched and trained via different strategies. "Thinker" means the model is finally trained to think and reason at RL stage, which is expected to show frequent self-reflection behaviors and strong reasoning abilities.
Currently, our repo contains 3 main parts:
- Pre-training code based on Nanotron
- RL code based on verl
- Evaluation code refined from DeepSeekMath and MegaMath
conda create -n nanotron python=3.10
conda activate nanotron
cd nanotron
pip install -r requirements.txt
#TODO: add pre-training scripts
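Until the official scripts are released, the sketch below shows what a typical launch looks like with Nanotron's standard run_train.py entry point; the config path and GPU count are placeholders, not our actual training setup.
# Sketch only: a typical Nanotron pre-training launch via torchrun (single node, 8 GPUs assumed).
# The config path is a placeholder; our OctoThinker configs are not included yet.
cd nanotron
torchrun --nproc_per_node=8 run_train.py --config-file path/to/your_config.yaml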
#TODO: add RL scripts
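Likewise, as a placeholder until our RL scripts land, the sketch below shows the general shape of a verl PPO launch with Hydra-style overrides; all paths, values, and hyperparameters are hypothetical, and the exact config keys may differ across verl versions.
# Sketch only: general shape of a verl PPO launch (keys follow verl's example scripts
# and may change across versions; all paths and values below are placeholders).
python3 -m verl.trainer.main_ppo \
    data.train_files=path/to/train.parquet \
    data.val_files=path/to/test.parquet \
    data.train_batch_size=256 \
    actor_rollout_ref.model.path=path/to/octothinker-base \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1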
conda create -n matheval python=3.10
conda activate matheval
cd eval
pip install -r requirements.txt
cd eval
bash scripts/en_math_cot_eval_last4dir.sh <model_root_dir>
We also provide the visualization code for the pre-training and RL processes. All visualizations are in the plot directory to ensure reproducibility.
For the training framework and inference engine, we use verl and vLLM. We thank the Hugging Face open-r1 team, the a-m-team, and the SimpleRL project for open-sourcing their datasets and training recipes. In fact, we are deeply grateful to the entire open-source community for their tireless efforts in making our exploration possible.
If you find this work useful, please cite:
@misc{wang2025octothinker,
  title={OctoThinker: Revisiting Mid-Training In the Era of RL Scaling},
  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
  year={2025},
  howpublished={\url{https://tinyurl.com/OctoThinker}},
  note={Notion Blog}
}