
Awesome Test-time-Scaling in LLMs

Our repository, Awesome Test-time-Scaling in LLMs, gathers the papers on test-time scaling that are available to us. Unlike other repositories that simply categorize papers, we decompose each paper's contributions according to the taxonomy proposed in "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models", facilitating easier understanding and comparison for readers.

Figure 1: A Visual Map and Comparison: From What to Scale to How to Scale.

πŸ“’ News and Updates

  • [13/Apr/2025] πŸ“Œ The second version is released:

    1. We corrected some typos;
    2. We added "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
    3. We revised the figures and tables, such as the colors of Table 1.
  • [9/Apr/2025] πŸ“Œ Our repository was created.

  • [31/Mar/2025] πŸ“Œ Our initial survey is on arXiv!

πŸ“˜ Introduction

As enthusiasm for scaling computation (data and parameters) in the pre-training era has gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.

Figure 2: Comparison of Scaling Paradigms in Pre-training and Test-time Phases.

🧬 Taxonomy

1. What to Scale

"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM's performance during inference.

  • Parallel Scaling improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer.
  • Sequential Scaling involves explicitly directing later computations based on intermediate steps.
  • Hybrid Scaling exploits the complementary benefits of parallel and sequential scaling.
  • Internal Scaling trains a model to autonomously determine how much computation to allocate for reasoning at test time within its internal parameters, instead of relying on external human-guided strategies (see the sketch after this list).
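
To make the first three forms concrete, here is a minimal Python sketch, not taken from the survey: `generate` and `score` are hypothetical placeholders for an LLM call and a verifier. Internal scaling is not shown because it has no external control loop; the model itself decides how long to reason.

```python
# Minimal sketch contrasting the "what to scale" forms.
# `generate` and `score` are hypothetical stand-ins for an LLM sampling
# call and a verifier; they are not part of any specific library.
from collections import Counter

def generate(prompt: str, seed: int = 0) -> str:
    """Placeholder for a single LLM sample."""
    return f"answer-{seed % 3}"

def score(prompt: str, answer: str) -> float:
    """Placeholder for a verifier / reward model."""
    return float(len(answer))

def parallel_scaling(prompt: str, n: int = 8) -> str:
    # Sample n candidates independently, then aggregate (majority vote).
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return Counter(candidates).most_common(1)[0][0]

def sequential_scaling(prompt: str, steps: int = 3) -> str:
    # Each round explicitly conditions on the previous draft (self-refinement).
    draft = generate(prompt)
    for i in range(steps):
        draft = generate(f"{prompt}\nPrevious draft:\n{draft}\nRevise it.", seed=i)
    return draft

def hybrid_scaling(prompt: str, n: int = 4, steps: int = 2) -> str:
    # Refine several drafts in parallel, then keep the best-scoring one.
    drafts = [sequential_scaling(prompt, steps) for _ in range(n)]
    return max(drafts, key=lambda d: score(prompt, d))
```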

2. How to Scale

  • Tuning
    • Supervised Fine-Tuning (SFT): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
    • Reinforcement Learning (RL): RL can guide a model’s policy to generate longer or more accurate solutions.
  • Inference
    • Stimulation (STI): Stimulation techniques prompt the LLM to generate more and longer samples, rather than producing a single answer directly.
    • Verification (VER): Verification plays an important role in TTS and can: i) directly select the final output from multiple candidates under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; iv) determine which samples to aggregate and how to aggregate them, e.g., their weights.
    • Search (SEA): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve complex reasoning tasks.
    • Aggregation (AGG): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time.
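
These inference-time operations typically compose into a single pipeline. Below is a minimal sketch of verifier-weighted Best-of-N, a combination several entries in the tables use, where sampling corresponds to STI, scoring to VER, and weighted voting to AGG; `sample_cot` and `verifier_score` are hypothetical placeholders, not a real API.

```python
# Minimal sketch of how stimulation (STI), verification (VER), and
# aggregation (AGG) compose at inference time.
from collections import defaultdict

def sample_cot(question: str, i: int) -> tuple[str, str]:
    """Placeholder: returns (chain_of_thought, final_answer)."""
    return (f"reasoning trace {i}", f"answer-{i % 2}")

def verifier_score(question: str, cot: str, answer: str) -> float:
    """Placeholder for an outcome/process reward model score in [0, 1]."""
    return 1.0 / (1.0 + len(cot) % 5)

def weighted_best_of_n(question: str, n: int = 16) -> str:
    # STI: elicit n independent reasoning traces.
    samples = [sample_cot(question, i) for i in range(n)]
    # VER: score each trace with a verifier.
    # AGG: weighted vote over final answers (weighted self-consistency).
    weights = defaultdict(float)
    for cot, answer in samples:
        weights[answer] += verifier_score(question, cot, answer)
    return max(weights, key=weights.get)
```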

3. Where to Scale

  • Reasoning: Math, Code, Science, Game & Strategy, Medical, and so on.
  • General-Purpose: Basics, Agents, Knowledge, Open-Ended, Multi-Modal and so on.

4. How Well to Scale

  • Performance: This dimension measures the correctness and robustness of outputs.
  • Efficiency: This dimension captures the cost-benefit tradeoffs of TTS methods.
  • Controllability: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
  • Scalability: Scalability quantifies how well models improve with more test-time compute (e.g., tokens or steps).
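
Many "How Well" entries in the tables below report Pass@1 or Pass@k. For reference, the standard unbiased estimator (Chen et al., 2021) computes, for each problem with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
# Unbiased pass@k estimator for one problem with n samples, c of them correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 3 correct answers out of 20 samples.
print(round(pass_at_k(n=20, c=3, k=1), 3))  # 0.15
print(round(pass_at_k(n=20, c=3, k=5), 3))  # 0.601
```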

πŸ” Paper Tables

Each row decomposes a paper along the four dimensions of the taxonomy; the How dimension is split into its sub-categories SFT, RL, STI, SEA, VER, and AGG.

| Method (Paper Title) | What | SFT | RL | STI | SEA | VER | AGG | Where | How Well |
|---|---|---|---|---|---|---|---|---|---|
| Scaling llm test-time compute optimally can be more effective than scaling model parameters. | Parallel, Sequential | βœ— | βœ— | βœ— | Beam, LookAhead | Verifier | (Weighted) Best-of-N, Stepwise Aggregation | Math | Pass@1, FLOPs-Matched Evaluation |
| Multi-agent verification: Scaling test-time compute with goal verifiers | Parallel | βœ— | βœ— | Self-Repetition | βœ— | Multiple-Agent Verifiers | Best-of-N | Math, Code, General | BoN-MAV (Cons@k), Pass@1 |
| Evolving Deeper LLM Thinking | Sequential | βœ— | βœ— | Self-Refine | βœ— | Functional | βœ— | Open-Ended | Success Rate, Token Cost |
| Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models | Sequential | βœ— | βœ— | CoT + Self-Repetition | βœ— | Bandit | βœ— | Game, Sci, Math | Accuracy, Token Cost |
| START: Self-taught reasoner with tools | Parallel, Sequential | Rejection Sampling | βœ— | Hint-infer | βœ— | Tool | βœ— | Math, Code | Pass@1 |
| "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding | Sequential | βœ— | βœ— | Adaptive Injection Decoding | βœ— | βœ— | βœ— | Math, Logical, Commonsense | Accuracy |
| Chain of draft: Thinking faster by writing less | Sequential | βœ— | βœ— | Chain-of-Draft | βœ— | βœ— | βœ— | Math, Symbolic, Commonsense | Accuracy, Latency, Token Cost |
| rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Hybrid | imitation | βœ— | βœ— | MCTS | PRM | βœ— | Math | Pass@1 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Parallel, Hybrid | βœ— | βœ— | βœ— | DVTS, Beam Search | PRM | Best-of-N | Math | Pass@1, Pass@k, Majority, FLOPS |
| Tree of thoughts: Deliberate problem solving with large language models | Hybrid | βœ— | βœ— | Propose Prompt, Self-Repetition | Tree Search | Self-Evaluate | βœ— | Game, Open-Ended | Success Rate, LLM-as-a-Judge |
| Mindstar: Enhancing math reasoning in pre-trained llms at inference time | Hybrid | βœ— | βœ— | βœ— | LevinTS | PRM | βœ— | Math | Accuracy, Token Cost |
| Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving | Hybrid | βœ— | βœ— | βœ— | Reward Balanced Search | RM | βœ— | Math | Test Error Rate, FLOPs |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Hybrid | βœ— | βœ— | Self-Refine | Control Flow Graph | Self-Evaluate | Prompt Synthesis | Math, Code | Pass@1 |
| PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving | Parallel, Hybrid | βœ— | βœ— | MoA | βœ— | Verification Agent | Selection Agent | Math, General, Finance | Accuracy, F1 Score |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Hybrid | βœ— | βœ— | βœ— | Particle-based Monte Carlo | PRM + SSM | Particle Filtering | Math | Pass@1, Budget vs. Accuracy |
| Archon: An Architecture Search Framework for Inference-Time Techniques | Hybrid | βœ— | βœ— | MoA, Self-Repetition | βœ— | Verification Agent, Unit Testing (Ensemble) | Fusion | Math, Code, Open-Ended | Pass@1, Win Rate |
| Wider or deeper? scaling llm inference-time compute with adaptive branching tree search | Hybrid | βœ— | βœ— | Mixture-of-Model | AB-MCTS-(M,A) | βœ— | βœ— | Code | Pass@1, RMSLE, ROC-AUC |
| Thinking llms: General instruction following with thought generation | Internal, Parallel | βœ— | DPO | Think | βœ— | Judge Models | βœ— | Open-Ended | Win Rate |
| Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models | Internal, Hybrid | βœ— | DPO | Diversity Generation | MCTS | Self-Reflect | βœ— | Math | Pass@1 |
| MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving | Internal, Sequential | imitation | βœ— | MoA | βœ— | Tool | βœ— | Math | Pass@k |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Internal, Sequential | βœ— | OREO | βœ— | Beam Search | Value Function | βœ— | Math, Agent | Pass@1, Success Rate |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Internal | warmup | GRPO, Rule-Based | βœ— | βœ— | βœ— | βœ— | Math, Code, Sci | Pass@1, cons@64, Percentile, Elo Rating, Win Rate |
| s1: Simple test-time scaling | Internal | distillation | βœ— | Budget Forcing | βœ— | βœ— | βœ— | Math, Sci | Pass@1, Control, Scaling |
| O1 Replication Journey: A Strategic Progress Report -- Part 1 | Internal | imitation | βœ— | βœ— | Journey Learning | PRM, Critique | Multi-Agents | Math | Accuracy |
| From drafts to answers: Unlocking llm potential via aggregation fine-tuning | Internal, Parallel | imitation | βœ— | βœ— | βœ— | Fusion | βœ— | Math, Open-Ended | Win Rate |
| Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought | Internal, Hybrid | imitation | meta-RL | Think | MCTS, A* | PRM | βœ— | Math, Open-Ended | Win Rate |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Internal, Sequential | βœ— | PPO, Trajectory | Thought Template | Retrieve | βœ— | βœ— | Math | Pass@1 |
| L1: Controlling how long a reasoning model thinks with reinforcement learning | Internal | βœ— | GRPO, Length-Penalty | βœ— | βœ— | βœ— | βœ— | Math | Pass@1, Length Error |
| Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions | Internal, Hybrid | distillation, imitation | βœ— | Reflection Prompt | MCTS | Self-Critic | βœ— | Math | Pass@1, Pass@k |
