TableBench

Official repository for paper TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

📚 Paper   🏆 Leaderboard   🤗 TableBench   🤗 TableInstruct

News

🔥Apr. 18, 2025🔥

  1. ☀️ Enhanced TableBench: We've released a cleaner version of TableBench after thoroughly reviewing all test set cases and correcting the errors we identified. Please download the latest version of TableBench for the most accurate dataset.

  2. 🚀 Brand New Leaderboard: The brand-new Leaderboard is now live! We've included the performance of many newly released models in the latest leaderboard and will continue to keep it up to date. Submissions are welcome! For submission guidelines, please refer to the Submission section on the Leaderboard website.

  3. 🔍 Refined Evaluation Metrics: In response to community feedback and in-depth discussions, we've updated the evaluation metrics for Fact Checking, Numerical Reasoning, and Data Analysis. You can find the detailed specifications of these new metrics in the 📝 Evaluation Metrics section below.

🚀Jan. 21, 2025🚀:

We are thrilled to share that our paper has been accepted to AAAI 2025! We sincerely thank our co-authors, the anonymous reviewers, and all the researchers and users who reached out on GitHub or by email; their valuable feedback and support have greatly contributed to this work.

🧾 Overview

TableBench is a comprehensive and complex benchmark designed to evaluate Table Question Answering (TableQA) capabilities, aligning closely with the "Reasoning Complexity of Questions" dimension of real-world TableQA scenarios. It covers 18 question subcategories across 4 major categories (Fact Checking, Numerical Reasoning, Data Analysis, and Visualization), with 886 carefully curated test cases. TableBench substantially pushes the boundaries of large language models in complex TableQA scenarios.


๐Ÿ“ Evaluation Metrics

This module defines evaluation metrics for various sub-tasks:

๐Ÿ” Fact Checking

  • Metric: Exact Match (EM)
  • Description: Assesses whether the predicted statement exactly matches the reference.

🔢 Numerical Reasoning

  • Metric: Exact Match (EM)
  • Description: Focuses on the correctness of numerical outputs.

📈 Data Analysis

  • Metrics vary based on sub-task type:

    Task Type                  Metric            Description
    Impact Analysis            Exact Match (EM)  Requires precise match of influential factors
    Correlation Analysis       EM_with_error_10  Allows ±10% numerical margin of error
    Trend Forecasting          EM_with_error_10  Allows ±10% numerical margin of error
    Statistical Analysis       EM_with_error_10  Allows ±10% numerical margin of error
    Other Data Analysis Tasks  ROUGE-L           Suitable for open-ended, textual responses
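
As a rough illustration of how these checks behave, here is a minimal sketch of EM and EM_with_error_10, assuming a simple normalized string comparison for EM and a ±10% relative tolerance on numeric values for EM_with_error_10; the repository's QAMetric evaluation code is the authoritative implementation.

# Illustrative sketch only: the official logic lives in the repository's
# QAMetric evaluation code. The normalization and the exact tolerance rule
# below are assumptions based on the metric descriptions above.

def exact_match(prediction: str, reference: str) -> bool:
    """Strict equality after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()

def em_with_error_10(prediction: str, reference: str) -> bool:
    """Count numeric answers as correct within a ±10% relative margin;
    fall back to plain EM when either value is not numeric."""
    try:
        pred_val = float(prediction.replace(',', '').strip())
        ref_val = float(reference.replace(',', '').strip())
    except ValueError:
        return exact_match(prediction, reference)
    if ref_val == 0:
        return pred_val == 0
    return abs(pred_val - ref_val) / abs(ref_val) <= 0.10

print(em_with_error_10('1062', '1100'))  # True: within 10% of the reference
print(em_with_error_10('900', '1100'))   # False: off by more than 10%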

📊 Visualization

  • Metric: Pass@1
  • Description: Measures whether the correct chart is generated on the first attempt.

🔧 How to evaluate on TableBench

Step 1. Download the Dataset

Download the latest version of TableBench from Hugging Face and place it in your working directory.
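
For a programmatic download, a minimal sketch using huggingface_hub is shown below; the repo_id is an assumption based on the 🤗 TableBench link above, so substitute the dataset path you actually use.

# Minimal sketch: fetch the TableBench dataset files into ./TableBench.
# The repo_id below is an assumption -- use the dataset linked in the
# 🤗 TableBench badge above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Multilingual-Multimodal-NLP/TableBench",  # assumed dataset id
    repo_type="dataset",
    local_dir="TableBench",
)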

Step 2. Run Inference with Your LLM

Use your preferred Large Language Model (LLM) to generate predictions for each test case.

Important notes:

  • Store your model's predictions in the prediction field
  • Include your model name in the model_name field

Example JSON structure:

[
  {
    ...
    "model_name": "your-model-name-here",
    "prediction": "Final Answer: 1062"
  },
  ...
]

We provide an example inference result file at: eval_examples/inference_results/o3-mini-2025-01-31=TableBench_DP=Example.jsonl
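
If you are assembling such a file yourself, the sketch below shows one way to produce it; the input file name, the instruction field used as the prompt, and the call_llm helper are placeholders for your own setup, while model_name and prediction match the structure described above.

# Minimal inference sketch: read test cases from a JSONL file, query your
# own model, and attach the required fields to each record.
# `call_llm`, the file names, and the `instruction` field are placeholders.
import json

MODEL_NAME = "your-model-name-here"

def call_llm(prompt: str) -> str:
    """Replace this stub with a call to your own LLM."""
    raise NotImplementedError

with open("TableBench.jsonl", encoding="utf-8") as fin, \
        open("inference_results.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        sample = json.loads(line)
        sample["model_name"] = MODEL_NAME
        sample["prediction"] = call_llm(sample["instruction"])
        fout.write(json.dumps(sample, ensure_ascii=False) + "\n")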

Step 3. Parse Final Answers from LLM Predictions

Use our parsing script to extract final answers from your model's predictions. The script parse_tablebench_instruction_response_script.py includes a main method example:

if __name__ == '__main__':
    # ==== Global settings ====
    PROJECT_ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    EXP_DIR = 'eval_examples'

    INFERENCE_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/inference_results'
    PARSED_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/parsed_results'

    # ==== Load inference results ====
    for inference_result_file in iter_file_from_dir(f'{INFERENCE_RESULT_DIR}', '.jsonl'):
        print(f'Parsing {inference_result_file}')
        # === Load inference results ===
        inference_results = read_json_file(inference_result_file)
        if not isinstance(inference_results, list):
            inference_results = [inference_results]
        # === Parse inference results ===
        parsed_results = parse_inference_results(inference_results)
        # === Save parsed results ===
        write_json_to_file(
            f'{PARSED_RESULT_DIR}/{os.path.basename(inference_result_file)}', parsed_results, is_json_line=True)
    print('Parsing completed.')
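
As the JSON example above shows, the expected prediction format places the answer after a "Final Answer:" marker. The snippet below only illustrates that convention; it is not a replacement for the repository's parsing script.

# Illustration only: pull the text after "Final Answer:" from a raw model
# output. Use parse_tablebench_instruction_response_script.py for the
# actual parsing.
import re

def extract_final_answer(prediction: str) -> str | None:
    match = re.search(r"Final Answer:\s*(.+)", prediction)
    return match.group(1).strip() if match else None

print(extract_final_answer("Some reasoning...\nFinal Answer: 1062"))  # -> 1062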

Step 4. Evaluate the Results

After parsing the predictions, use our evaluation script to assess your model's performance. A main-function example is provided in eval_tablebench_script.py.

if __name__ == '__main__':

    # ==== Global settings ====
    PROJECT_ROOT_DIR = os.path.dirname(os.path.abspath(__file__))
    EXP_DIR = 'eval_examples'

    PARSED_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/parsed_results'
    EVAL_RESULT_DIR = f'{PROJECT_ROOT_DIR}/{EXP_DIR}/evaluation_results'

    # Init metric evaluation engine
    metric_eval_engine = QAMetric()

    # Merge parsed results to one file
    overall_sim_inference_results_path = merge_parsed_results_to_one_sim_file(
        PARSED_RESULT_DIR, EVAL_RESULT_DIR)

    # Multiple inference result files can be added here
    candidate_eval_file_paths = [
        overall_sim_inference_results_path
    ]

    # Evaluate only the listed models; if eval_models is empty, all models will be evaluated
    eval_models = [
        # 'o3-mini-2025-01-31',
    ]

    # Load the candidate evaluation results
    categoried_llm_inference_results = build_categoried_llm_inference_results(
        candidate_eval_file_paths, eval_models)

    # Evaluate by subtype
    print('==== Evaluate by subtype ====')
    llm_eval_subtype_results = eval_by_subtype(
        categoried_llm_inference_results, metric_eval_engine)
    # Save subtype results to CSV
    llm_eval_subtype_results_csv_path = f'{EVAL_RESULT_DIR}/llm_eval_subtype_results.csv'
    save_subtype_results_to_csv(
        llm_eval_subtype_results, llm_eval_subtype_results_csv_path)

    # Evaluate by type
    print('==== Evaluate by type ====')
    llm_eval_type_results = eval_by_type(
        categoried_llm_inference_results, metric_eval_engine)
    # Save type results to CSV
    llm_eval_type_results_csv_path = f'{EVAL_RESULT_DIR}/llm_eval_type_results.csv'
    save_type_results_to_csv(
        llm_eval_type_results, llm_eval_type_results_csv_path)

The final evaluation results will be stored in eval_examples/evaluation_results, in both JSON and CSV formats.

Terminology

parse@1: Indicates whether the final answer was successfully parsed in a single run. This is a boolean value used to assess the LLM's ability to follow the expected output format.

ecr@1: Stands for "execution correctness rate at 1". It checks whether the Python code generated by the LLM can run without errors on the first attempt. This metric reflects the correctness of code generation.

pass@1: Used for the chart_generation task, this indicates whether the generated result passes the provided test cases in a single attempt.
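
As an illustration of the idea behind ecr@1, the sketch below simply executes a generated code string and reports whether it raised an exception; this is an assumption about the general mechanism, not the benchmark's exact implementation.

# Illustrative ecr@1-style check: does generated Python code run without
# raising an exception on the first attempt? Sketch of the idea only.

def runs_without_error(generated_code: str) -> bool:
    try:
        exec(generated_code, {"__name__": "__ecr_check__"})
        return True
    except Exception:
        return False

print(runs_without_error("x = 1 + 1\nprint(x)"))   # True
print(runs_without_error("print(undefined_var)"))  # False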

Reasoning Methods Examples


Refer to our paper for more details.

Citation

If you find our work helpful, please use the following citation.

@inproceedings{wu2025tablebench,
  title={Tablebench: A comprehensive and complex benchmark for table question answering},
  author={Wu, Xianjie and Yang, Jian and Chai, Linzheng and Zhang, Ge and Liu, Jiaheng and Du, Xeron and Liang, Di and Shu, Daixin and Cheng, Xianfu and Sun, Tianzhen and others},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={39},
  number={24},
  pages={25497--25506},
  year={2025}
}
