This repository contains the codebase for our paper "Can Past Experience Accelerate LLM Reasoning?". The project explores how prior experience and memory mechanisms can accelerate LLM reasoning while maintaining or improving answer quality. It introduces a framework for adaptive compute-budget allocation and memory-based reasoning across varying degrees of task similarity.
Memory methods control how prior examples or intermediate results are incorporated during reasoning:
| Memory Method | Description |
|---|---|
| `no_memory` | No memory used. Each query is answered independently. |
| `SFT` | Fine-tuned model with few-shot context from training examples. |
| `in_context` | Uses retrieved past examples dynamically as in-context demonstrations. |
| `reflect` | Uses self-reflection on failed trials to guide the next iteration. |
| `multi_case_reflect` | Reflects using multiple past answers for better generalization. |
| `reflect_update` | Reflection with iterative memory refinement after each round. |
In code (e.g., `main.py` or a SLURM script), the memory method is selected with:

```bash
--memory_method=SFT
```
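For example, to compare memory methods under one fixed setup, the methods can be swept in a simple shell loop. This is a sketch: the task and scaling values below are illustrative, not required defaults.

```bash
# Sketch: sweep every memory method under one fixed task/scaling setup.
# The --task and --scaling_method values are illustrative choices.
for mem in no_memory SFT in_context reflect multi_case_reflect reflect_update; do
  python main.py \
    --task=MATH500 \
    --scaling_method=best_of_n \
    --memory_method="$mem"
done
```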
Scaling methods define how reasoning is expanded or iterated at test time:
| Scaling Method | Description |
|---|---|
| `best_of_n` | Samples multiple answers and picks the best one based on a scoring model. |
| `self_refine` | Iteratively refines previous answers by self-editing. |
| `dfs` | Depth-first search over reasoning steps, exploring sequentially. |
| `long_cot` | Long chain-of-thought reasoning using larger context windows. |
In code (e.g., `main.py`), the scaling method is passed as:

```bash
--scaling_method=long_cot
```
These methods can be combined with different memory settings to study cost-quality tradeoffs.
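For instance, pairing an iterative scaling method with a reflection-based memory method only requires passing both flags. The specific pairing below is one illustrative combination:

```bash
# Illustrative pairing: iterative self-refinement plus reflection
# with memory updates after each round.
python main.py \
  --task=MATH500 \
  --scaling_method=self_refine \
  --memory_method=reflect_update
```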
We group questions into clusters to simulate varying degrees of similarity with previously solved tasks. This helps analyze whether LLMs can leverage prior exposure to speed up reasoning.
| Subgroup Key | Description |
|---|---|
| `1_same_question` | Exact same question repeated |
| `2_diff_wording` | Same semantics with different wording |
| `3_diff_number` | Same problem type, but numbers are varied |
| `4_diff_question` | Entirely new question type or topic |
In `main.py`, the similarity level is set using:

```bash
--subgroup=2_diff_wording
```
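To study how reasoning cost changes with task similarity, the four subgroups can be run under one fixed configuration. This is a sketch; the other flag values are held at illustrative settings:

```bash
# Sketch: hold the configuration fixed and vary only the similarity subgroup.
for sg in 1_same_question 2_diff_wording 3_diff_number 4_diff_question; do
  python main.py \
    --task=MATH500 \
    --scaling_method=best_of_n \
    --memory_method=in_context \
    --subgroup="$sg"
done
```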
A full example invocation:

```bash
python main.py \
  --cuda=0 \
  --backend=meta-llama/Meta-Llama-3.1-8B-Instruct \
  --value_backend=gpt-4o-mini \
  --task=MATH500 \
  --scaling_method=best_of_n \
  --memory_method=SFT \
  --subgroup=2_diff_wording \
  --cluster_id=1 \
  --experiment_id=4_0
```
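As noted above, the same flags can also be set from a SLURM script. A minimal wrapper sketch is shown below; the job name, GPU request, and time limit are assumptions to adapt to your cluster, not values shipped with this repository:

```bash
#!/bin/bash
# Resource values below are assumptions; adapt them to your cluster.
#SBATCH --job-name=past-exp
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Forward all flags from the sbatch call, e.g.:
#   sbatch run.sh --task=MATH500 --scaling_method=best_of_n --memory_method=SFT
python main.py "$@"
```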
Optional flags:

- `--use_lora`
- `--prm_use_lora`
- `--max_tokens`, `--num_iteration`, `--num_questions`, etc.
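For example, the command above can be extended with these flags. The values shown are illustrative assumptions; each flag's exact semantics are defined in `main.py`:

```bash
# Flag values below are illustrative assumptions; see main.py for definitions.
python main.py \
  --task=MATH500 \
  --scaling_method=best_of_n \
  --memory_method=SFT \
  --use_lora \
  --max_tokens=2048 \
  --num_iteration=3 \
  --num_questions=100
```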