Official repository for the paper "TableBench: A Comprehensive and Complex Benchmark for Table Question Answering".
Paper · Leaderboard · TableBench (Hugging Face) · TableInstruct (Hugging Face)
Apr. 18, 2025

- Enhanced TableBench: We've released a cleaner version of TableBench after thoroughly reviewing all test set cases and correcting the errors we identified. Please download the latest version of TableBench for the most accurate dataset.
- Brand New Leaderboard: The brand new Leaderboard is now live! We've included the performance of many newly released models in our latest leaderboard and will continue to keep it up to date. Submissions are welcome! For submission guidelines, please refer to the Submission section on the Leaderboard website.
- Refined Evaluation Metrics: In response to community feedback and in-depth discussions, we've updated the evaluation metrics for Fact Checking, Numerical Reasoning, and Data Analysis. You can find the detailed specifications of these new metrics in the Evaluation Metrics section.
Jan. 21, 2025:
We are thrilled to share that our paper has been accepted to AAAI 2025! We sincerely thank our co-authors, the anonymous reviewers, and all the researchers and users on GitHub or through email whose valuable feedback and support have greatly contributed to this work.
TableBench is a comprehensive and complex benchmark designed to evaluate Table Question Answering (TableQA) capabilities, aligning closely with the "Reasoning Complexity of Questions" dimension in real-world TableQA scenarios. It covers 18 question subcategories across 4 major categories, including Fact Checking, Numerical Reasoning, Data Analysis, and Visualization, with 886 carefully curated test cases. TableBench substantially pushes the boundaries of large language models in complex TableQA scenarios.
This module defines evaluation metrics for the various sub-tasks:

**Fact Checking**

- Metric: Exact Match (EM)
- Description: Assesses whether the predicted statement exactly matches the reference.

**Numerical Reasoning**

- Metric: Exact Match (EM)
- Description: Focuses on the correctness of numerical outputs.

**Data Analysis**

- Metrics vary based on sub-task type:

| Task Type | Metric | Description |
|---|---|---|
| Impact Analysis | Exact Match (EM) | Requires precise match of influential factors |
| Correlation Analysis | EM_with_error_10 | Allows ±10% numerical margin of error |
| Trend Forecasting | EM_with_error_10 | Allows ±10% numerical margin of error |
| Statistical Analysis | EM_with_error_10 | Allows ±10% numerical margin of error |
| Other Data Analysis Tasks | ROUGE-L | Suitable for open-ended, textual responses |

**Visualization**

- Metric: Pass@1
- Description: Measures whether the correct chart is generated on the first attempt.
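As a rough illustration of the tolerance-based metric, the sketch below shows one way a ±10% numerical match could be implemented. The function name and tolerance handling are assumptions for illustration only; the authoritative definitions live in the evaluation code (`QAMetric`).

```python
# Illustrative sketch only -- not the repository's official EM_with_error_10 implementation.

def em_with_error(prediction: str, reference: str, tolerance: float = 0.10) -> bool:
    """Exact string match, or numerical match within a relative tolerance."""
    if prediction.strip() == reference.strip():
        return True
    try:
        pred_val, ref_val = float(prediction), float(reference)
    except ValueError:
        return False  # non-numerical answers must match exactly
    if ref_val == 0:
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= tolerance


if __name__ == '__main__':
    print(em_with_error('105', '100'))  # True: within 10% of the reference
    print(em_with_error('120', '100'))  # False: 20% away from the reference
```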
Download the latest version of TableBench from Hugging Face and place it in your working directory.
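For example, the dataset files can be fetched with the `huggingface_hub` client. The `repo_id` below is our assumption; use the dataset path shown on the TableBench Hugging Face page.

```python
from huggingface_hub import snapshot_download

# Assumed repo_id -- replace it with the dataset path listed on the Hugging Face page.
snapshot_download(
    repo_id='Multilingual-Multimodal-NLP/TableBench',
    repo_type='dataset',
    local_dir='./TableBench',
)
```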
Use your preferred Large Language Model (LLM) to generate predictions for each test case.
Important notes:
- Store your model's predictions in the `prediction` field
- Include your model name in the `model_name` field
Example JSON structure:

```json
[
  {
    ...
    "model_name": "your-model-name-here",
    "prediction": "Final Answer: 1062"
  },
  ...
]
```
We provide an example inference result file at `eval_examples/inference_results/o3-mini-2025-01-31=TableBench_DP=Example.jsonl`.
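As a minimal sketch of this step (not the repository's official inference code), the loop below reads the test file, queries your model, and attaches the two required fields. `call_llm`, the input/output file names, and the `instruction` field name are placeholders and assumptions.

```python
import json

MODEL_NAME = 'your-model-name-here'


def call_llm(prompt: str) -> str:
    """Placeholder for your model call (API client, local inference, etc.)."""
    raise NotImplementedError


# File names are illustrative; point them at the downloaded test set and your output path.
with open('TableBench_test.jsonl', encoding='utf-8') as fin, \
        open(f'{MODEL_NAME}=TableBench=Example.jsonl', 'w', encoding='utf-8') as fout:
    for line in fin:
        sample = json.loads(line)
        # Assumes each test case carries its prompt in an 'instruction'-style field.
        sample['prediction'] = call_llm(sample['instruction'])
        sample['model_name'] = MODEL_NAME
        fout.write(json.dumps(sample, ensure_ascii=False) + '\n')
```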
Use our parsing script to extract final answers from your model's predictions. The script `parse_tablebench_instruction_response_script.py` includes a main method example:
```python
if __name__ == '__main__':
    # ==== Global settings ====
    PROJECT_ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    EXP_DIR = 'eval_examples'
    INFERENCE_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/inference_results'
    PARSED_RUSULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/parsed_results'

    # ==== Load inference results ====
    for inference_result_file in iter_file_from_dir(f'{INFERENCE_RESULT_DIR}', '.jsonl'):
        print(f'Parsing {inference_result_file}')
        # === Load inference results ===
        inference_results = read_json_file(inference_result_file)
        if not isinstance(inference_results, list):
            inference_results = [inference_results]
        # === Parse inference results ===
        parsed_results = parse_inference_results(inference_results)
        # === Save parsed results ===
        write_json_to_file(
            f'{PARSED_RUSULT_DIR}/{os.path.basename(inference_result_file)}',
            parsed_results, is_json_line=True)
    print('Parsing completed.')
```
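With the directory layout above, running `python parse_tablebench_instruction_response_script.py` from the repository root parses every `.jsonl` file under `eval_examples/inference_results` and writes the parsed files to `eval_examples/parsed_results`.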
After parsing the predictions, use our evaluation script to assess your model's performance. A main function example is provided in `eval_tablebench_script.py`:
```python
if __name__ == '__main__':
    # ==== Global settings ====
    PROJECT_ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    EXP_DIR = 'eval_examples'
    PARSED_RUSULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/parsed_results'
    EVAL_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/evaluation_results'

    # Initialize the metric evaluation engine
    metric_eval_engine = QAMetric()

    # Merge parsed results into one file
    overall_sim_inference_results_path = merge_parsed_results_to_one_sim_file(
        PARSED_RUSULT_DIR, EVAL_RESULT_DIR)

    # Multiple inference result files can be added here
    candidate_eval_file_paths = [
        overall_sim_inference_results_path
    ]

    # Evaluate only the targeted models; if eval_models is empty, all models are evaluated
    eval_models = [
        # 'o3-mini-2025-01-31',
    ]

    # Load the candidate evaluation results
    categoried_llm_inference_results = build_categoried_llm_inference_results(
        candidate_eval_file_paths, eval_models)

    # Evaluate by subtype
    print('==== Evaluate by subtype ====')
    llm_eval_subtype_results = eval_by_subtype(
        categoried_llm_inference_results, metric_eval_engine)
    # Save subtype results to csv
    llm_eval_subtype_results_csv_path = f'{EVAL_RESULT_DIR}/llm_eval_subtype_results.csv'
    save_subtype_results_to_csv(
        llm_eval_subtype_results, llm_eval_subtype_results_csv_path)

    # Evaluate by type
    print('==== Evaluate by type ====')
    llm_eval_type_results = eval_by_type(
        categoried_llm_inference_results, metric_eval_engine)
    # Save type results to csv
    llm_eval_type_results_csv_path = f'{EVAL_RESULT_DIR}/llm_eval_type_results.csv'
    save_type_results_to_csv(
        llm_eval_type_results, llm_eval_type_results_csv_path)
```
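Similarly, running `python eval_tablebench_script.py` from the repository root merges the parsed result files, evaluates them with `QAMetric`, and writes per-subtype and per-type summaries to `eval_examples/evaluation_results`.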
The final evaluation results will be stored in `eval_examples/evaluation_results`, in both `json` and `csv` formats.
- `parse@1`: Indicates whether the final answer was successfully parsed in a single run. This is a boolean value used to assess the LLM's ability to follow the expected output format.
- `ecr@1`: Stands for "execution correctness rate at 1". It checks whether the Python code generated by the LLM runs without errors on the first attempt. This metric reflects the correctness of code generation.
- `pass@1`: Used for the `chart_generation` task, this indicates whether the generated result passes the provided test cases in a single attempt.
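For intuition about `parse@1`, extracting the final answer from a direct-prompting response can be as simple as the sketch below. The actual parsing script handles more output formats (for example, generated code), so treat this as an illustration only.

```python
import re
from typing import Optional


def extract_final_answer(prediction: str) -> Optional[str]:
    """Return the text after the last 'Final Answer:' marker, or None if absent."""
    matches = re.findall(r'Final Answer:\s*(.+)', prediction)
    return matches[-1].strip() if matches else None


print(extract_final_answer('Some reasoning...\nFinal Answer: 1062'))  # -> '1062'
print(extract_final_answer('No marker here'))                         # -> None
```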
Refer to our paper for more details.
If you find our work helpful, please use the following citation.
```bibtex
@inproceedings{wu2025tablebench,
  title={Tablebench: A comprehensive and complex benchmark for table question answering},
  author={Wu, Xianjie and Yang, Jian and Chai, Linzheng and Zhang, Ge and Liu, Jiaheng and Du, Xeron and Liang, Di and Shu, Daixin and Cheng, Xianfu and Sun, Tianzhen and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={24},
  pages={25497--25506},
  year={2025}
}
```