Our repository, Awesome Test-time-Scaling in LLMs, gathers the papers on test-time scaling available to our current knowledge. Unlike other repositories that simply categorize papers, we decompose each paper's contributions according to the taxonomy provided by "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models", facilitating easier understanding and comparison for readers.
[13/Apr/2025] The second version is released:
- We correct some typos;
- We include "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
- We revise the figures and tables, such as the color of Table 1.
[9/Apr/2025] Our repository is created.
[31/Mar/2025] Our initial survey is on arXiv!
As enthusiasm for scaling computation (data and parameters) in the pre-training era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.
"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM's performance during inference.
- Parallel Scaling improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer.
- Sequential Scaling involves explicitly directing later computations based on intermediate steps.
- Hybrid Scaling exploits the complementary benefits of parallel and sequential scaling.
- Internal Scaling lets a model autonomously determine how much computation to allocate to reasoning at test time through its internal parameters, instead of relying on external human-guided strategies.
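The contrast between the parallel and sequential paradigms can be sketched in a few lines. The snippet below illustrates parallel scaling with majority voting (in the style of self-consistency); `generate` is a hypothetical stand-in for a stochastic LLM call, not a real model API.

```python
from collections import Counter

def generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a temperature-sampled LLM call:
    # most samples agree on the right answer, a minority err.
    return "42" if seed % 3 != 0 else "41"

def parallel_scale(prompt: str, n: int) -> str:
    """Parallel scaling: draw n independent samples, then aggregate
    them into a final answer by majority vote."""
    answers = [generate(prompt, seed) for seed in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

print(parallel_scale("What is 6 * 7?", n=9))  # "42" wins 6 votes to 3
```

Sequential scaling would instead feed each intermediate answer back into the next `generate` call, refining a single chain rather than voting over independent ones.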
- Tuning
- Supervised Fine-Tuning (SFT): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
- Reinforcement Learning (RL): RL can guide a modelβs policy to generate longer or more accurate solutions.
- Inference
- Stimulation (STI): Stimulation prompts the LLM to generate more and longer samples, rather than producing a single sample directly.
- Verification (VER): The verification process plays an important role in TTS, and it can be used to: i) directly select the final output from multiple candidates, under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop, under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; iv) determine which samples to aggregate and how to aggregate them, e.g., their weights.
- Search (SEA): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve performance on complex reasoning tasks.
- Aggregation (AGG): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time.
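These inference-time operations compose naturally: stimulation produces candidates, verification scores them, and aggregation picks the winner. Below is a minimal best-of-N sketch, where `sample_solutions` and `verify` are hypothetical stand-ins for an LLM sampler and a learned verifier (e.g., a reward model), not real APIs.

```python
def sample_solutions(prompt: str, n: int) -> list[str]:
    # Hypothetical sampler: returns n candidate answers (stimulation).
    return [f"answer_{i % 2}" for i in range(n)]

def verify(prompt: str, solution: str) -> float:
    # Hypothetical verifier: scores a candidate in [0, 1] (verification).
    return 0.9 if solution.endswith("0") else 0.4

def best_of_n(prompt: str, n: int = 6) -> str:
    """Weighted aggregation: sum verifier scores per distinct answer
    and return the answer with the highest total."""
    totals: dict[str, float] = {}
    for cand in sample_solutions(prompt, n):
        totals[cand] = totals.get(cand, 0.0) + verify(prompt, cand)
    return max(totals, key=totals.get)

print(best_of_n("solve the task"))  # answer_0 has the highest total score
```

Replacing the weighted sum with a plain count recovers majority voting; replacing it with `max` over individual scores recovers best-of-N selection.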
- Reasoning: Math, Code, Science, Game & Strategy, Medical, and so on.
- General-Purpose: Basics, Agents, Knowledge, Open-Ended, Multi-Modal and so on.
- Performance: This dimension measures the correctness and robustness of outputs.
- Efficiency: This dimension captures the cost-benefit trade-offs of TTS methods.
- Controllability: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
- Scalability: Scalability quantifies how well models improve with more test-time compute (e.g., tokens or steps).
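As one concrete way to quantify scalability, the unbiased pass@k estimator (popularized by code-generation benchmarks) tracks how the probability of solving a task grows with the sampling budget k; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn samples
    is correct, estimated from n generated samples of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model solving ~20% of attempts improves sharply with a larger budget.
for k in (1, 4, 16):
    print(f"pass@{k} = {pass_at_k(n=100, c=20, k=k):.3f}")
```

Plotting pass@k (or accuracy) against tokens or samples spent yields the scaling curves this dimension is meant to assess.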