MathIF is a dedicated benchmark for evaluating the instruction-following capabilities of large reasoning models (LRMs) on mathematical reasoning tasks. It exposes a fundamental trade-off between a model’s problem-solving strength and its ability to comply with user-specified constraints.
• 📖 Paper • 🔧 Usage • 📊 Leaderboard • 🤗 Data • 🐦 Twitter
- **Compositional Constraints**: 15 Python-verifiable constraint types in four categories (length, lexical, format, affix), combined into single, dual, and triple constraints.
- **Diverse Math Sources**: Problems drawn from GSM8K, MATH-500, Minerva, Olympiad, and AIME, totaling 420 high-quality evaluation samples.
- **Fine-Grained Metrics** (see the sketch below):
  - Hard Accuracy (HAcc): the fraction of examples that satisfy all of their constraints
  - Soft Accuracy (SAcc): the average fraction of satisfied constraints per example
- **vLLM-Powered Inference**: Efficient decoding with nucleus sampling (T=1.0, p=0.95) and generation of up to 16k tokens.
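Both metrics aggregate per-example constraint verdicts. The snippet below is a minimal illustrative sketch of that aggregation (the `verdicts` structure is assumed for illustration; it is not the repository's actual evaluation code):

```python
# Illustrative HAcc / SAcc aggregation; `verdicts` holds one boolean
# per constraint for each example (assumed structure, not the repo's code).
verdicts = [
    [True, True],          # example 1: 2/2 constraints satisfied
    [True, False, True],   # example 2: 2/3 constraints satisfied
]

# Hard Accuracy: fraction of examples satisfying ALL of their constraints.
hacc = sum(all(v) for v in verdicts) / len(verdicts)

# Soft Accuracy: mean per-example fraction of satisfied constraints.
sacc = sum(sum(v) / len(v) for v in verdicts) / len(verdicts)

print(f"HAcc = {hacc:.2%}, SAcc = {sacc:.2%}")  # HAcc = 50.00%, SAcc = 83.33%
```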
Requirements:

- Python 3.9 or later
- CUDA 12.4
```bash
git clone https://github.com/TingchenFu/MathIF.git
cd MathIF
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```

Run inference, then score the outputs:

```bash
bash code/scripts/vllm_if.sh
bash code/scripts/eval_if.sh
```
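To call vLLM directly instead of going through the scripts, a sketch with the benchmark's decoding settings could look like the following (the model id, data file name, and prompt construction are placeholder assumptions; `vllm_if.sh` remains the supported entry point):

```python
# Sketch of decoding with the benchmark's sampling settings
# (temperature 1.0, top-p 0.95, up to 16k generated tokens).
import json

from vllm import LLM, SamplingParams

sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=16384)
llm = LLM(model="Qwen/Qwen3-4B")  # placeholder model id

# Placeholder prompt construction: append the constraint description
# to each question (the repo's scripts define the actual template).
with open("data/mathif.jsonl") as f:  # placeholder file name
    examples = [json.loads(line) for line in f]
prompts = [f"{ex['question']}\n\n{ex['constraint_desc']}" for ex in examples]

for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text[:200])
```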
Each line in the JSONL file contains:
Field | Description |
---|---|
`source` | Original data source |
`id` | Unique example identifier |
`question` | Math problem statement |
`answer` | Ground-truth solution |
`constraint_desc` | Human-readable constraint summary |
`constraint_name` | Constraint category |
`constraint_args` | Arguments used for verification |
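Since every constraint is Python-verifiable, `constraint_name` and `constraint_args` are enough to re-run a check on any model response. The toy verifier below is purely illustrative (the actual constraint names, argument keys, and verifier functions live in the repository's code):

```python
# Toy verifier for a length-category constraint; the constraint name
# "max_words" and its argument key are assumptions for illustration.
import json

def check_max_words(response: str, max_words: int) -> bool:
    """Pass if the response stays within a word budget."""
    return len(response.split()) <= max_words

example = json.loads(
    '{"constraint_name": "max_words", "constraint_args": {"max_words": 50}}'
)
response = "The answer is 42."

if example["constraint_name"] == "max_words":
    satisfied = check_max_words(response, **example["constraint_args"])
    print("constraint satisfied:", satisfied)  # constraint satisfied: True
```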
Repository layout:

```
.
├── data/              # MathIF JSONL files
├── code/
│   ├── scripts/       # Inference & evaluation scripts
│   └── ...            # Model wrappers and utilities
├── output/            # Generated predictions & logs
├── requirements.txt   # Python dependencies
└── README.md          # This overview
```
The leaderboard reports HAcc, SAcc, and answer correctness under constraints (w/ const.). Models are grouped by parameter size.
📢 Showcase Your Model’s Instruction-Following Capability
Feel free to contribute results from your own models; we welcome community submissions! We currently support evaluating newly added models on our platform. To be included on the leaderboard, please provide a Hugging Face model link for verification and testing.
**Small models (≤4B)**

Model | HAcc | SAcc | Correctness (w/ const.) |
---|---|---|---|
Qwen3-4B | 44.05 | 61.43 | 58.57 |
Qwen3-1.7B | 30.24 | 50.24 | 51.19 |
Qwen3-0.6B | 27.86 | 50.44 | 32.14 |
L1-Qwen-1.5B-Exact | 19.76 | 39.60 | 42.86 |
L1-Qwen-1.5B-Max | 19.76 | 39.40 | 45.71 |
DeepSeek-R1-Distill-Qwen-1.5B | 17.14 | 36.62 | 31.67 |
DeepScaler-1.5B-Preview | 14.52 | 34.52 | 36.19 |
Qwen2.5-1.5B-SimpleRL-Zoo | 9.05 | 24.33 | 22.38 |
Qwen2.5-Math-1.5B-Instruct | 7.62 | 21.39 | 44.29 |
**Medium models (7B–14B)**

Model | HAcc | SAcc | Correctness (w/ const.) |
---|---|---|---|
Qwen3-14B | 50.71 | 67.06 | 64.29 |
DeepSeek-R1-Distill-Qwen-14B | 39.28 | 60.55 | 50.95 |
Qwen3-8B | 37.86 | 57.34 | 66.43 |
DeepSeek-R1-Distill-Qwen-7B | 26.43 | 44.96 | 48.57 |
DeepSeek-R1-Distill-Llama-8B | 22.14 | 44.04 | 36.43 |
Open-Reasoner-Zero-7B | 13.57 | 32.26 | 51.90 |
Qwen2.5-Math-7B-Instruct | 9.05 | 25.60 | 37.14 |
**Large models (32B–70B)**

Model | HAcc | SAcc | Correctness (w/ const.) |
---|---|---|---|
Qwen3-32B | 43.81 | 62.82 | 70.00 |
DeepSeek-R1-Distill-Qwen-32B | 42.62 | 60.91 | 57.62 |
DeepSeek-R1-Distill-Llama-70B | 41.43 | 61.07 | 54.05 |
QwQ-32B | 40.24 | 59.99 | 68.81 |
OlympicCoder-32B | 35.95 | 57.97 | 54.52 |
s1-32B | 20.95 | 41.78 | 60.95 |
Open-Reasoner-Zero-32B | 15.47 | 35.52 | 67.62 |
MathIF is inspired by prior work on IFEval and ComplexBench, and leverages vLLM for efficient inference.
If you find MathIF useful, please cite:

```bibtex
@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}
```
For questions, feedback, or collaboration inquiries, please contact:
- Tingchen Fu: lucas.futingchen@gmail.com
- Yafu Li: yafuly@gmail.com