Our repository, Awesome Test-time-Scaling in LLMs, gathers the papers on test-time scaling available to our current knowledge. Unlike other repositories that simply categorize papers, we decompose each paper's contributions according to the taxonomy provided by "What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models", facilitating easier understanding and comparison for readers.
[13/Apr/2025] The second version is released:
- We correct some typos;
- We include "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
- We revise the figures and tables, such as the color of Table 1.
[9/Apr/2025] Our repository is created.
[31/Mar/2025] Our initial survey is on arXiv!
As enthusiasm for scaling computation (data and parameters) in the pre-training era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.
"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM's performance during inference.
- Parallel Scaling improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer.
- Sequential Scaling involves explicitly directing later computations based on intermediate steps.
- Hybrid Scaling exploits the complementary benefits of parallel and sequential scaling.
- Internal Scaling lets a model autonomously determine how much computation to allocate to reasoning at test time through its internal parameters, instead of relying on external human-guided strategies.
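The contrast between the parallel and sequential paradigms can be sketched in a few lines. The snippet below illustrates parallel scaling with majority voting (in the style of self-consistency); `generate` is a hypothetical stand-in for a stochastic LLM call, not a real model API.

```python
from collections import Counter

def generate(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a temperature-sampled LLM call:
    # most samples agree on the right answer, a minority err.
    return "42" if seed % 3 != 0 else "41"

def parallel_scale(prompt: str, n: int) -> str:
    """Parallel scaling: draw n independent samples, then aggregate
    them into a final answer by majority vote."""
    answers = [generate(prompt, seed) for seed in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner

print(parallel_scale("What is 6 * 7?", n=9))  # "42" wins 6 votes to 3
```

Sequential scaling would instead feed each intermediate answer back into the next `generate` call, refining a single chain rather than voting over independent ones.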
- Tuning
- Supervised Fine-Tuning (SFT): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
- Reinforcement Learning (RL): RL can guide a modelβs policy to generate longer or more accurate solutions.
- Inference
- Stimulation (STI): Stimulation prompts the LLM to generate more and longer samples, rather than producing a single sample directly.
- Verification (VER): The verification process plays an important role in TTS, and it can be used to: i) directly select the final output from multiple candidates, under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop, under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; iv) determine which samples to aggregate and how to aggregate them, e.g., their weights.
- Search (SEA): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve performance on complex reasoning tasks.
- Aggregation (AGG): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time.
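These inference-time operations compose naturally: stimulation produces candidates, verification scores them, and aggregation picks the winner. Below is a minimal best-of-N sketch, where `sample_solutions` and `verify` are hypothetical stand-ins for an LLM sampler and a learned verifier (e.g., a reward model), not real APIs.

```python
def sample_solutions(prompt: str, n: int) -> list[str]:
    # Hypothetical sampler: returns n candidate answers (stimulation).
    return [f"answer_{i % 2}" for i in range(n)]

def verify(prompt: str, solution: str) -> float:
    # Hypothetical verifier: scores a candidate in [0, 1] (verification).
    return 0.9 if solution.endswith("0") else 0.4

def best_of_n(prompt: str, n: int = 6) -> str:
    """Weighted aggregation: sum verifier scores per distinct answer
    and return the answer with the highest total."""
    totals: dict[str, float] = {}
    for cand in sample_solutions(prompt, n):
        totals[cand] = totals.get(cand, 0.0) + verify(prompt, cand)
    return max(totals, key=totals.get)

print(best_of_n("solve the task"))  # answer_0 has the highest total score
```

Replacing the weighted sum with a plain count recovers majority voting; replacing it with `max` over individual scores recovers best-of-N selection.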
- Reasoning: Math, Code, Science, Game & Strategy, Medical, and so on.
- General-Purpose: Basics, Agents, Knowledge, Open-Ended, Multi-Modal and so on.
- Performance: This dimension measures the correctness and robustness of outputs.
- Efficiency: This dimension captures the cost-benefit trade-offs of TTS methods.
- Controllability: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
- Scalability: Scalability quantifies how well models improve with more test-time compute (e.g., tokens or steps).
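As one concrete way to quantify scalability, the unbiased pass@k estimator (popularized by code-generation benchmarks) tracks how the probability of solving a task grows with the sampling budget k; a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k drawn samples
    is correct, estimated from n generated samples of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model solving ~20% of attempts improves sharply with a larger budget.
for k in (1, 4, 16):
    print(f"pass@{k} = {pass_at_k(n=100, c=20, k=k):.3f}")
```

Plotting pass@k (or accuracy) against tokens or samples spent yields the scaling curves this dimension is meant to assess.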