MathIF: Instruction-Following Benchmark for Large Reasoning Models

Python 3.9+ · CUDA 12.4 · License

MathIF is a dedicated benchmark for evaluating the instruction-following capabilities of large reasoning models (LRMs) on mathematical reasoning tasks. It exposes a fundamental trade-off between a model’s problem-solving strength and its ability to comply with user-specified constraints.

📖 Features

  • Compositional Constraints
    15 Python-verifiable constraint types in four categories (length, lexical, format, affix), combined into single, dual, and triple constraints.

  • Diverse Math Sources
    Problems drawn from GSM8K, MATH-500, Minerva, Olympiad, and AIME, totaling 420 high-quality evaluation samples.

  • Fine-Grained Metrics (see the sketch after this list)

    • Hard Accuracy (HAcc): the fraction of examples that satisfy all of their constraints
    • Soft Accuracy (SAcc): the average fraction of satisfied constraints per example
  • vLLM-Powered Inference
    Efficient decoding with nucleus sampling (T = 1.0, p = 0.95) and generation of up to 16k tokens.
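
The two metrics can be computed directly from per-constraint pass/fail results. A minimal sketch, assuming each example carries a list of boolean constraint checks (the function and variable names here are illustrative, not the repository's actual API):

```python
def hard_and_soft_accuracy(results: list[list[bool]]) -> tuple[float, float]:
    """Compute HAcc and SAcc from per-example constraint checks.

    `results[i]` holds one boolean per constraint attached to example i,
    True if the generated answer satisfies that constraint.
    """
    hacc = sum(all(checks) for checks in results) / len(results)
    sacc = sum(sum(checks) / len(checks) for checks in results) / len(results)
    return hacc, sacc

# Example: three samples with single, dual, and triple constraints.
results = [[True], [True, False], [True, True, False]]
hacc, sacc = hard_and_soft_accuracy(results)
print(f"HAcc = {hacc:.2%}, SAcc = {sacc:.2%}")  # HAcc = 33.33%, SAcc = 72.22%
```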

✨ Getting Started

Prerequisites

  • Python 3.9 or later
  • CUDA 12.4
  • git, bash

Installation

git clone https://github.com/TingchenFu/MathIF.git
cd MathIF

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

🔧 Usage

Inference

bash code/scripts/vllm_if.sh
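
Internally, the script drives vLLM with the decoding settings listed above. A self-contained sketch of equivalent inference, with a placeholder model name and prompt (the script's actual arguments and prompt template may differ):

```python
from vllm import LLM, SamplingParams

# Decoding settings from the benchmark: nucleus sampling with T=1.0,
# p=0.95, and a generous 16k-token budget for long chains of thought.
params = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=16384)

llm = LLM(model="Qwen/Qwen3-4B")  # placeholder: any evaluated model

# A math question with one illustrative length constraint attached.
prompt = (
    "Solve the following problem. Answer in no more than two sentences.\n\n"
    "Problem: What is 12 * 13?"
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```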

Evaluation

bash code/scripts/eval_if.sh
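
Each generation is scored against its constraints by small Python checkers. As an illustration of what a "Python-verifiable" constraint looks like, here is a hedged sketch of two checkers and a dispatch step; the real implementations live under `code/` and may use different names and arguments:

```python
def check_max_words(response: str, max_words: int) -> bool:
    """Length constraint: use at most `max_words` words."""
    return len(response.split()) <= max_words


def check_ends_with(response: str, suffix: str) -> bool:
    """Affix constraint: the response must end with a fixed suffix."""
    return response.strip().endswith(suffix)


# Hypothetical dispatch table keyed by constraint_name.
CHECKERS = {"max_words": check_max_words, "ends_with": check_ends_with}


def verify(response: str, constraint_name: str, constraint_args: dict) -> bool:
    """Return True if `response` satisfies a single constraint."""
    return CHECKERS[constraint_name](response, **constraint_args)


# Example: a dual constraint is satisfied only if both checks pass.
resp = "The answer is 156."
print(verify(resp, "max_words", {"max_words": 10}))                 # True
print(verify(resp, "ends_with", {"suffix": "The answer is 156."}))  # True
```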

Dataset Format

Each line in the JSONL file contains:

| Field | Description |
|---|---|
| `source` | Original data source |
| `id` | Unique example identifier |
| `question` | Math problem statement |
| `answer` | Ground-truth solution |
| `constraint_desc` | Human-readable constraint summary |
| `constraint_name` | Constraint category |
| `constraint_args` | Arguments used for verification |
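
For reference, the files can be iterated with standard-library JSON parsing. The file name below is a placeholder; use the actual JSONL file shipped under `data/`:

```python
import json

# Placeholder path: substitute the actual file under data/.
with open("data/mathif.jsonl", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Each record pairs a math question with its constraint spec.
        print(example["id"], example["constraint_name"], example["constraint_args"])
```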

Project Structure

.
├── data/                # MathIF JSONL files
├── code/
│   ├── scripts/         # Inference & evaluation scripts
│   └── ...              # Model wrappers and utilities
├── output/              # Generated predictions & logs
├── requirements.txt     # Python dependencies
└── README.md            # This overview


📊 Leaderboard

📢 Showcase Your Model’s Instruction-Following Capability

We welcome community submissions! We can evaluate newly added models; to be included on the leaderboard, please provide a Hugging Face model link for verification and testing. All scores below are percentages; Correctness is answer accuracy measured with the constraints applied (w/ const.).

≤ 4B Models

| Model | HAcc | SAcc | Correctness |
|---|---|---|---|
| Qwen3-4B | 44.05 | 61.43 | 58.57 |
| Qwen3-1.7B | 30.24 | 50.24 | 51.19 |
| Qwen3-0.6B | 27.86 | 50.44 | 32.14 |
| L1-Qwen-1.5B-Exact | 19.76 | 39.60 | 42.86 |
| L1-Qwen-1.5B-Max | 19.76 | 39.40 | 45.71 |
| DeepSeek-R1-Distill-Qwen-1.5B | 17.14 | 36.62 | 31.67 |
| DeepScaler-1.5B-Preview | 14.52 | 34.52 | 36.19 |
| Qwen2.5-1.5B-SimpleRL-Zoo | 9.05 | 24.33 | 22.38 |
| Qwen2.5-Math-1.5B-Instruct | 7.62 | 21.39 | 44.29 |

7B–14B Models

| Model | HAcc | SAcc | Correctness |
|---|---|---|---|
| Qwen3-14B | 50.71 | 67.06 | 64.29 |
| DeepSeek-R1-Distill-Qwen-14B | 39.28 | 60.55 | 50.95 |
| Qwen3-8B | 37.86 | 57.34 | 66.43 |
| DeepSeek-R1-Distill-Qwen-7B | 26.43 | 44.96 | 48.57 |
| DeepSeek-R1-Distill-Llama-8B | 22.14 | 44.04 | 36.43 |
| Open-Reasoner-Zero-7B | 13.57 | 32.26 | 51.90 |
| Qwen2.5-Math-7B-Instruct | 9.05 | 25.60 | 37.14 |

≥ 32B Models

| Model | HAcc | SAcc | Correctness |
|---|---|---|---|
| Qwen3-32B | 43.81 | 62.82 | 70.00 |
| DeepSeek-R1-Distill-Qwen-32B | 42.62 | 60.91 | 57.62 |
| DeepSeek-R1-Distill-Llama-70B | 41.43 | 61.07 | 54.05 |
| QwQ-32B | 40.24 | 59.99 | 68.81 |
| OlympicCoder-32B | 35.95 | 57.97 | 54.52 |
| s1-32B | 20.95 | 41.78 | 60.95 |
| Open-Reasoner-Zero-32B | 15.47 | 35.52 | 67.62 |

🌻 Acknowledgements

MathIF is inspired by prior work on IFEval and ComplexBench, and leverages vLLM for efficient inference.

📖 Citation

@article{fu2025scaling,
  title={Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models},
  author={Fu, Tingchen and Gu, Jiawei and Li, Yafu and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2505.14810},
  year={2025}
}

📬 Contact

For questions, feedback, or collaboration inquiries, please contact the authors.
