Ranajoy Sadhukhan*, Zhuoming Chen*, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Carnegie Mellon University
We introduce Kinetics, which challenges the traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook major key-value memory access bottlenecks in various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget.
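The eFLOPs cost model underlying this analysis charges memory traffic at the accelerator's arithmetic intensity, so compute and KV-cache access are measured in one unit. A minimal sketch of the idea, assuming eFLOPs ≈ FLOPs + intensity × memory bytes with purely illustrative constants (see the paper for the exact cost model):

```python
# Illustrative sketch: eFLOPs = FLOPs + I * MOPs, where I is the accelerator's
# arithmetic intensity (peak FLOP/s divided by memory bandwidth).
# All constants below are assumptions for illustration, not the paper's values.

def decode_eflops(n_params, kv_bytes_per_token, gen_len, intensity=500.0):
    """Approximate eFLOPs to decode `gen_len` tokens with a dense model.

    n_params:           model parameters
    kv_bytes_per_token: KV-cache bytes appended per generated token
    gen_len:            number of decoded tokens
    intensity:          hardware arithmetic intensity in FLOPs per byte
    """
    flops = 2 * n_params * gen_len  # ~2 FLOPs per parameter per token
    # Each step re-reads the KV cache accumulated so far: sum_{t=1..L} t * bytes.
    kv_mops = kv_bytes_per_token * gen_len * (gen_len + 1) // 2
    return flops + intensity * kv_mops

# Example: for long generations the KV term, not compute, dominates the cost.
print(decode_eflops(n_params=8e9, kv_bytes_per_token=100_000, gen_len=32_768))
```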
In this codebase, we provide:
- eFLOPs-based cost model analysis
- Comparison of the previous and Kinetics scaling laws
- Comparison of dense and sparse scaling laws under the eFLOPs-based cost model
- Block top-k attention speed-up demonstration
We provide AIME24 and AIME25 reasoning traces for the Qwen3 model series and its sparse-attention variants on Hugging Face.
The repository is organized as follows:
```
Kinetics/
├── benchmark/
│   ├── blocktopk.py
│   └── dense.py
├── cost_model/
│   ├── best_of_N/
│   │   ├── cost_model_ntrial.py
│   │   ├── frontier_numTrials.ipynb
│   │   └── compare_sparse_numTrials.ipynb
│   ├── long_CoT/
│   │   ├── cost_model_genlen.py
│   │   ├── frontier_genlen.ipynb
│   │   └── compare_sparse_genlen.ipynb
│   └── utils.py
├── README.md
└── requirements.txt
```
- `cost_model/`: code to perform eFLOPs-based cost analysis for two inference-scaling strategies, Best-of-N and long CoT.
- `benchmark/`: code for benchmarking task throughput with block top-k and dense attention.
```bash
conda create -n kinetics python=3.11
conda activate kinetics
# install flashinfer v0.1.6
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.6/flashinfer-0.1.6+cu124torch2.4-cp311-cp311-linux_x86_64.whl#sha256=19a01e2ec93662bc6b83819daaae277d93e7cc989343c5f8940af44a4cb66ba0
pip install flashinfer-0.1.6+cu124torch2.4-cp311-cp311-linux_x86_64.whl
pip install -r requirements.txt
```
Download the reasoning traces from Hugging Face.
The samples are saved as JSONL files. Every example dict contains the following keys:
- `query`: question text
- `choice`: ground-truth output list
- `prediction`: output prediction text list
- `score`: 0.0 or 100.0

Each example is replicated 32 times (32 max trials). For the 32B model, we only provide 8 trials.
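A minimal sketch of loading the traces and estimating Best-of-N accuracy from this schema; the repo id and file path below are placeholders, not the actual locations:

```python
import json
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Placeholder repo id -- substitute the actual traces dataset.
root = snapshot_download(repo_id="<hf_org>/<kinetics-traces>", repo_type="dataset")

examples = []
with open(f"{root}/AIME24/<model>/responses.jsonl") as f:  # hypothetical path
    for line in f:
        examples.append(json.loads(line))

# Each query appears up to 32 times (one dict per trial). A problem counts as
# solved under Best-of-N if any of its first N trials scores 100.0.
def best_of_n_accuracy(examples, n):
    by_query = {}
    for ex in examples:
        by_query.setdefault(ex["query"], []).append(ex["score"])
    solved = sum(any(s == 100.0 for s in scores[:n]) for scores in by_query.values())
    return solved / len(by_query)

print(best_of_n_accuracy(examples, n=8))
```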
Compute inference cost and accuracies for different test-time configurations (number of trials and generation lengths).
```bash
cd cost_model
python3 best_of_N/cost_model_ntrial.py <response_root> <method>
python3 long_CoT/cost_model_genlen.py <response_root> <method>
```
Args:
- `<response_root>`: dataset directory (e.g., `AIME24`, `AIME25`)
- `<method>`: one of `dense`, `topk`, `blocktopk`
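For example, assuming the AIME24 traces sit in the working directory:

```bash
python3 best_of_N/cost_model_ntrial.py AIME24 dense
```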
First generate the cost-analysis CSV files using the scripts above. Then use `cost_model/best_of_N/frontier_numTrials.ipynb` and `cost_model/long_CoT/frontier_genlen.ipynb` to obtain the accuracy-vs-cost-budget Pareto curves for the dense and sparse variants under the Best-of-N and long-CoT scaling strategies, respectively.
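If you prefer a script over the notebooks, a minimal Pareto-frontier sketch looks like the following (the CSV filename and the `cost`/`accuracy` column names are assumptions; check the generated files for the actual schema):

```python
import pandas as pd

def pareto_frontier(df, cost_col="cost", acc_col="accuracy"):
    """Keep only non-dominated configurations: no cheaper configuration
    achieves equal or higher accuracy."""
    df = df.sort_values(cost_col)
    best, frontier = -1.0, []
    for _, row in df.iterrows():
        if row[acc_col] > best:
            best = row[acc_col]
            frontier.append(row)
    return pd.DataFrame(frontier)

# Hypothetical output file from cost_model_ntrial.py.
df = pd.read_csv("cost_analysis_dense.csv")
print(pareto_frontier(df))
```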
Contains paged-attention implementations of dense attention and block top-k attention. We provide the implementation for a single decoder layer only, assuming n-way tensor parallelism (where n is the number of key-value heads).
```bash
cd benchmark
python3 dense.py --model Qwen/Qwen3-8B --gen_len 32768 --batch_size 4096 --page_size 16 --world_size 8
python3 blocktopk.py --model Qwen/Qwen3-8B --gen_len 32768 --batch_size 4096 --page_size 16 --topk_page 64 --world_size 8
```
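As a rough illustration of the block top-k mechanism itself, here is a plain PyTorch sketch of the selection logic for a single query step (not the repo's paged-attention kernel, and ignoring any remainder tokens past the last full block):

```python
import torch
import torch.nn.functional as F

def block_topk_attention(q, k, v, block_size=16, topk_blocks=64):
    """q: (d,); k, v: (T, d). Attend only to the top-k KV blocks whose
    mean-pooled keys score highest against the query."""
    T, d = k.shape
    n_blocks = T // block_size
    # Block-level scores from mean-pooled keys: one cheap GEMV per decode step.
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d).mean(dim=1)
    scores = k_blocks @ q                                    # (n_blocks,)
    picked = scores.topk(min(topk_blocks, n_blocks)).indices
    # Gather the selected blocks and run exact attention over them only.
    idx = (picked[:, None] * block_size + torch.arange(block_size)).reshape(-1)
    k_sel, v_sel = k[idx], v[idx]
    attn = F.softmax((k_sel @ q) / d**0.5, dim=0)            # (topk * block_size,)
    return attn @ v_sel

q, k, v = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
print(block_topk_attention(q, k, v).shape)  # torch.Size([128])
```

Only the selected blocks are read from memory at each decode step, which is what lowers the per-token KV-access cost relative to dense attention.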
- Add SGLang-based evaluation
```bibtex
@misc{sadhukhan2025kineticsrethinkingtesttimescaling,
  title={Kinetics: Rethinking Test-Time Scaling Laws},
  author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
  year={2025},
  eprint={2506.05333},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.05333},
}
```