This is the official repository containing the code, pre-trained models, and datasets for our paper, Efficient Test-Time Scaling via Self-Calibration.
- [2025-3-3]: We released our paper.
- [2025-2-25]: We released our code, models, and datasets.
We propose an efficient test-time scaling method that uses model confidence to adjust sampling dynamically, since confidence can be seen as an intrinsic measure that directly reflects model uncertainty on different tasks. For example, we can incorporate the model's confidence into self-consistency by assigning each sampled response a weight equal to its confidence score and selecting the answer with the largest total weight.
As shown in the previous figure, our approaches achieve comparable performance with substantially fewer computational resources. Confidence-weighted Self-Consistency needs 94.2% fewer samples than standard Self-Consistency to reach an accuracy of 85.0, demonstrating that reliable confidence estimation can significantly enhance the computational efficiency of test-time scaling.
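The confidence-weighted voting described above can be sketched in a few lines. This is a minimal illustration, not the repository's implementation: the function names `confidence_weighted_vote` and `majority_vote` and the toy answers and confidence values are assumptions made for the example.

```python
from collections import defaultdict


def confidence_weighted_vote(answers, confidences):
    """Return the answer whose supporting responses have the largest total confidence."""
    if len(answers) != len(confidences):
        raise ValueError("answers and confidences must have the same length")
    if not answers:
        raise ValueError("at least one sampled response is required")
    totals = defaultdict(float)
    for answer, confidence in zip(answers, confidences):
        totals[answer] += confidence  # each response votes with its confidence
    return max(totals, key=totals.get)


def majority_vote(answers):
    """Standard self-consistency: every response gets the same weight."""
    return confidence_weighted_vote(answers, [1.0] * len(answers))


# Toy data (hypothetical): three low-confidence votes for "18",
# two high-confidence votes for "20".
answers = ["18", "20", "20", "18", "18"]
confidences = [0.30, 0.95, 0.90, 0.20, 0.25]

print(majority_vote(answers))                          # 18
print(confidence_weighted_vote(answers, confidences))  # 20
```

In this toy case the two schemes disagree: plain majority voting picks the more frequent answer, while the weighted vote favors the answer backed by higher confidence.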
However, extracting accurate confidence can be challenging, since vanilla LLMs are known to be overconfident in their own responses: their confidence often exceeds their actual accuracy.
Hence, we propose a new framework, Self-Calibration, that enables a model to generate calibrated confidence scores.
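To make "confidence exceeds accuracy" concrete, one common way to quantify miscalibration is the expected calibration error (ECE), which bins responses by confidence and compares average confidence with accuracy in each bin. The sketch below is a generic ECE computation; it is an assumption that this matches the metric used in this repository, and the function name and toy inputs are illustrative only.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(num_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(mean_confidence - accuracy)
    return ece


# An overconfident toy model: claims 0.9 confidence but is right half the time.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, True, False])
# A calibrated toy model: claims 0.5 confidence and is right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(overconfident, calibrated)
```

A lower value means the stated confidence tracks accuracy more closely.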
conda create -n Self-Calibration python=3.10
conda activate Self-Calibration
pip install -r requirements.txt
pip install vllm -U
To use an (efficient) sampling method, you may use
inference = SampleInference(
    model_name=model_name,
    eos_token_str=eos_token_str,
    I=I,
    torch_dtype=torch.float16,
    device_map="auto"
)
to start an inference engine, and you may use
result = inference.run_inference_interactive(
    query=prompt,
    method=method,  # one of ["earlyexit", "asc_conf", "asc", "sc", "sc_conf", "best_of_n"]
    threshold=0.7,  # threshold used by earlyexit, asc and asc_conf
    max_samples=16,  # number of samples for sc, sc_conf and best_of_n; also the maximum number of samples for earlyexit, asc and asc_conf
    temperature=0.8,
    extract_handler=dataset_handler
)
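To illustrate how `threshold` and `max_samples` can interact in a confidence-based early-stopping scheme, here is a standalone sketch. It is an assumption about the general idea, not the repository's `earlyexit` implementation: `early_exit_sampling`, `make_scripted_sampler`, and the fallback of returning the most confident response when no sample reaches the threshold are all hypothetical.

```python
def early_exit_sampling(sample_fn, threshold=0.7, max_samples=16):
    """Draw responses until one reaches the confidence threshold.

    sample_fn is a hypothetical callable returning (answer, confidence).
    Returns the chosen answer and the list of all samples drawn.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    samples = []
    for _ in range(max_samples):
        answer, confidence = sample_fn()
        samples.append((answer, confidence))
        if confidence >= threshold:
            return answer, samples  # confident enough: stop sampling early
    # Budget exhausted (assumed fallback): return the most confident response.
    best_answer, _ = max(samples, key=lambda sample: sample[1])
    return best_answer, samples


def make_scripted_sampler(script):
    """Build a deterministic stand-in for a model from a list of (answer, confidence)."""
    iterator = iter(script)
    return lambda: next(iterator)


sampler = make_scripted_sampler([("41", 0.4), ("42", 0.9), ("43", 0.99)])
answer, drawn = early_exit_sampling(sampler, threshold=0.7, max_samples=16)
print(answer, len(drawn))  # 42 2
```

Here sampling stops after the second response, so only 2 of the 16 allowed samples are used.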
The example code can be run with
python sampling_methods/sample.py --use_cot --model_name HINT-lab/Llama_3.1-8B-Instruct-Self-Calibration --dataset_name gsm8k
You can generate the data with the following script:
bash data_gen.bash \
--model_name "meta-llama/Llama-3.1-8B-Instruct" \
--temperature 0.8 \
--use_cot_flag "--use_cot" \
--num_generations 32 \
--subset "train" \
--data_size 100 \
--save_path "llama"
Also, you can use the default settings by
bash data_gen.bash
The dynamic temperature version is quite slow. You may use the non-dt version instead by changing data_generator_dt to data_generator in data_gen.bash, which is faster but may produce less diverse responses.
# training details should be written in model_training/configs/{version}.json
bash scripts/main.bash \
--merged_model_path "./models/llama" \
--version "llama" \
--basemodel "meta-llama/Llama-3.1-8B-Instruct"
bash scripts/evaluate.bash \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--answer_folder "example" \
--num_generations 16
If you want to add a new dataset for data generation or testing, update utils/dataset_loader.py to implement a new dataset handler:
from abc import ABC, abstractmethod


class DatasetHandler(ABC):
    @abstractmethod
    def load_data(self):
        """
        Load the dataset and return a tuple: (splits_dict, answer_type).
        splits_dict: A dictionary where each key is a split name (e.g., 'train', 'test')
                     and the value is the corresponding dataset or data structure.
        answer_type: A string describing the type of the answer, e.g.:
                     'number', 'text', 'option letter', etc.
        """
        pass

    @abstractmethod
    def prepare_qa_data(self, data):
        """
        Given a particular split (like a list or IterableDataset),
        transform it into a dictionary: {prompt_text -> ground_truth_answer}.
        """
        pass

    @abstractmethod
    def extract_answer(self, response):
        """
        Given a model-generated response (string), extract the final answer
        so that it matches the ground truth format (number, letter, text, etc.).
        """
        pass

    def check(self, correct_answer, response):
        """
        Given the correct answer and the model-generated response,
        check if the response is correct. This is a simple equality check.
        """
        return correct_answer == response
Then add the name of the dataset in the function get_dataset.
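As an illustration of the handler interface, here is a minimal sketch of a concrete subclass. Everything specific in it is hypothetical: the class name `ToyArithmeticHandler`, the in-memory data, and the number-extraction regex are assumptions for the example, and the base class is restated in compact form only so the snippet runs on its own. In the repository you would subclass the `DatasetHandler` defined in utils/dataset_loader.py instead.

```python
import re
from abc import ABC, abstractmethod


class DatasetHandler(ABC):
    """Compact local copy of the interface, so this example is standalone."""

    @abstractmethod
    def load_data(self): ...

    @abstractmethod
    def prepare_qa_data(self, data): ...

    @abstractmethod
    def extract_answer(self, response): ...

    def check(self, correct_answer, response):
        return correct_answer == response


class ToyArithmeticHandler(DatasetHandler):
    """Hypothetical handler for a tiny in-memory arithmetic dataset."""

    def load_data(self):
        splits = {
            "train": [{"question": "What is 2 + 3?", "answer": "5"}],
            "test": [{"question": "What is 7 * 6?", "answer": "42"}],
        }
        return splits, "number"

    def prepare_qa_data(self, data):
        # Map each prompt to its ground-truth answer.
        return {row["question"]: row["answer"] for row in data}

    def extract_answer(self, response):
        # Assumed convention: the last number in the response is the final answer.
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return numbers[-1] if numbers else None


handler = ToyArithmeticHandler()
splits, answer_type = handler.load_data()
qa_pairs = handler.prepare_qa_data(splits["test"])
prediction = handler.extract_answer("7 * 6 = 42, so the answer is 42.")
print(handler.check(qa_pairs["What is 7 * 6?"], prediction))  # True
```

Keeping `extract_answer` in the same string format as the ground truth is what lets the default equality-based `check` work without overriding it.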
For new models, update utils/SPECIAL_SUFFIXS.py to add a new SPECIAL_SUFFIXS entry and split_marker.
Self-Calibration
├── data_creation # code for data generation
│   ├── data_generator_dt.py # data generator with dynamic temperature
│   ├── data_generator.py # data generator without dynamic temperature
│   ├── dataset_creat.py # create datasets from output responses
│   └── dt_generator.py # implementation of dynamic temperature
│
├── evaluation
│   ├── analysis.py # implementation of different inference methods
│   ├── calculate_confidence.py # confidence generation
│   ├── generate_responses.py # response generation
│   ├── llama_reward.py # ORM example
│   └── PRM_reward_score.py # PRM example
│
├── model_training
│   ├── configs/ # model training configs
│   ├── merge_lora_model.py # model merging and upload
│   └── train.py # training scripts
│
├── utils
│   ├── dataset_loader.py # dataset loader
│   ├── metric.py # evaluation metric
│   └── SPECIAL_SUFFIXS.py # model configs (confidence querying prompts)
We open-source our datasets and models on Hugging Face.
- DeepSeek-R1-Distill-Qwen-1.5B-Self-Calibration
- Qwen2.5-7B-Instruct-Self-Calibration
- Llama-3.1-8B-Instruct-Self-Calibration
@misc{huang2025efficienttesttimescalingselfcalibration,
    title={Efficient Test-Time Scaling via Self-Calibration},
    author={Chengsong Huang and Langlin Huang and Jixuan Leng and Jiacheng Liu and Jiaxin Huang},
    year={2025},
    eprint={2503.00031},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2503.00031},
}