A Benchmark for Testing Materials Tool Usage Abilities of Large Language Models (LLMs)
- pymatgen_code_qa benchmark: `qa_benchmark/generated_qa/generation_results_code.json`, which consists of 34,621 QA pairs.
- pymatgen_code_doc benchmark: `qa_benchmark/generated_qa/generation_results_doc.json`, which consists of 34,604 QA pairs.
- real-world tool-usage benchmark: `src/question_segments`, which consists of 49 questions (138 tasks). Each subfolder contains one question, with a problem statement, a property list, and verification code.
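As a quick orientation, the two QA benchmark files can be inspected with a few lines of Python. This is a minimal sketch; the exact JSON schema (e.g. the field names of each QA record) is an assumption, so peek at the first entry to see the real structure.

```python
import json

# Load the code-based QA benchmark (unzip the archive in qa_benchmark/generated_qa/ first).
with open("qa_benchmark/generated_qa/generation_results_code.json") as f:
    qa_data = json.load(f)

# Assumption: the file holds a list (or dict) of QA records.
records = qa_data if isinstance(qa_data, list) else list(qa_data.values())
print(f"Loaded {len(records)} QA records")
print(records[0])  # inspect the actual field names of one record
```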
- QA example
- QA benchmark
- real-world tool-usage benchmark
- Results for single LLMs
- Results for LLM-RAG with different retrieval sources
- Results for advanced RAG agent system

We use Conda and Poetry to manage the Python environment.
conda create -n mattools python=3.13
conda activate mattools
poetry install
Alternatively, you can try installing with `requirements.txt` (you may encounter errors).
conda create -n mattools python=3.13
conda activate mattools
pip install -r requirements.txt
You can look up the results stored in `qa_benchmark/test_results`.
First, please unzip the two benchmark files (the doc and code QA benchmarks) in `qa_benchmark/generated_qa/`.
cd qa_benchmark/pymatgen-qa-generation/src
touch .env
GEMINI_API_KEY="Replace your api key here"
Then configure your Gemini API key in the `.env` file, or download a local LLM from Hugging Face.
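A minimal sketch of how a script can pick up the key from `.env` (assuming `python-dotenv` is used for loading; the project's actual loading code may differ):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads GEMINI_API_KEY from the .env file in the current directory
gemini_api_key = os.environ["GEMINI_API_KEY"]
```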
After that, modify the settings in `settings.py`:
TEST_CONFIG = {
"MODEL_NAME": "gemini-2.0-flash", # model name
"MODEL_TYPE": "remote", # ['remote', 'local']
"LOGGER_FILE": "question_evaluation_code_gemini-2-0-flash.log", # log file name
"CSV_FILE_NAME": "evaluation_results_gemini-2-0-flash.csv", # results file name
"TEST_FILE_PATH": "generation_results_doc.json", # qa or doc benchmark
}
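For reference, the kind of evaluation loop such a config drives might look roughly like the sketch below. This is an illustration only, not the actual `testing_script.py`: the benchmark's record fields and the `ask_model` helper are assumptions.

```python
import csv
import json
import logging

def ask_model(model_name: str, question: str) -> str:
    """Hypothetical helper: query the remote or local LLM and return its answer."""
    raise NotImplementedError

logging.basicConfig(filename=TEST_CONFIG["LOGGER_FILE"], level=logging.INFO)

with open(TEST_CONFIG["TEST_FILE_PATH"]) as f:
    qa_records = json.load(f)  # assumed: a list of {"question": ..., "answer": ...} records

with open(TEST_CONFIG["CSV_FILE_NAME"], "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "model_answer", "reference_answer"])
    for record in qa_records:
        answer = ask_model(TEST_CONFIG["MODEL_NAME"], record["question"])
        logging.info("Answered: %s", record["question"][:80])
        writer.writerow([record["question"], answer, record["answer"]])
```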
After that, run
cd qa_benchmark/pymatgen-qa-generation/src
python testing_script.py
You can look up the results stored in `pure_agent_test` (single LLMs), `RAG_agent_test` (LLM-RAG), `agentic_RAG_test` (agentic RAG), `lightrag` (LightRAG), and `mtr_rag_test` (self-reflection LLM-doc RAG system).
First, please configure your API keys:
cd src
touch .env
OPENAI_API_KEY = "Replace your api key here"
GEMINI_API_KEY = "Replace your api key here"
- Generate vector store and configure Docker
Unzip `src/documents_llm_doc_gemini_20_flash.json.zip`.
Open `src/construct_doc.ipynb` and run all cells (a minimal vector-store sketch follows below).
Or click Here to download the vector store and unzip the `vector_store` folder into the `src` folder.
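For orientation, loading the unzipped documents file into a persistent Chroma store could look like the following sketch; the collection name, the embedding function (Chroma's default), and the JSON layout are assumptions, so the actual notebook may differ.

```python
import json

import chromadb

# Assumed layout: the JSON file holds a list (or dict) of document strings.
with open("src/documents_llm_doc_gemini_20_flash.json") as f:
    docs = json.load(f)
texts = [str(t) for t in (docs if isinstance(docs, list) else docs.values())]

client = chromadb.PersistentClient(path="src/vector_store")    # persisted store consumed by the RAG agents
collection = client.get_or_create_collection("pymatgen_docs")  # hypothetical collection name
collection.add(ids=[str(i) for i in range(len(texts))], documents=texts)  # uses Chroma's default embedder
print(f"Indexed {collection.count()} documents")
```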
To configure the Docker sandbox used to verify the generated code:
Option 1: `docker pull grenzlinie/mat-tools:latest`.
Option 2: run `docker build -t mat-tool-ben .` in the root directory to create the Docker image.
When testing, `result_analysis.py` will automatically create a container for each question (the sketch below illustrates this idea).
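To illustrate the idea (this is not the actual `result_analysis.py` implementation), running one generated verification script inside the sandbox image could look roughly like this; the question folder name and the entry script are hypothetical:

```python
import subprocess

question_dir = "pure_agent_test/gpt-4o-mini-2024-07-18/question_001"  # hypothetical question folder

# Execute the generated/verification code inside the sandbox image so it stays contained.
result = subprocess.run(
    [
        "docker", "run", "--rm",
        "-v", f"{question_dir}:/workspace",           # mount the question folder into the container
        "grenzlinie/mat-tools:latest",                # or the locally built mat-tool-ben image
        "python", "/workspace/verification_code.py",  # hypothetical entry script
    ],
    capture_output=True,
    text=True,
)
print(result.returncode)
print(result.stdout[-500:], result.stderr[-500:])
```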
- Test single LLM
cd src
python build_agent.py --model_names gpt-4o-mini-2024-07-18 # generate code
python result_analysis.py --generated_function_path pure_agent_test/gpt-4o-mini-2024-07-18 # execute the code and analyze the results
- Test LLM-RAG with different retrieval sources
cd src
python build_agent.py --model_names gpt-4o-mini-2024-07-18 --retriever_type llm-doc-full
python result_analysis.py --generated_function_path RAG_agent_test/gpt-4o-mini-2024-07-18
- Test agentic RAG
cd src
python main.py --model_name gpt-4o-mini-2024-07-18 --retriever_type llm-doc-full
python result_analysis.py --generated_function_path agentic_RAG_test/gpt-4o-mini-2024-07-18
- Test LightRAG
Download LightRAG into the `src` folder.
cd src/lightrag/LightRAG-main
pip install -e .
export OPENAI_API_KEY="..."
python examples/mattoolben.py
python result_analysis.py --generated_function_path ./lightrag/LightRAG-main/gpt4o_function_generation_results/
- Test self-reflection LLM-RAG agent system
cd src
python mtr_rag_test/rag.py --model_name gpt-4o-mini-2024-07-18 --retriever_type llm-doc-full
python result_analysis.py --generated_function_path mtr_rag_test/gpt-4o-mini-2024-07-18
First, the versions of pymatgen and pymatgen-analysis-defects must be pinned.
We provide the source code of pymatgen and pymatgen-analysis-defects in `src/tool_source_code/pymatgen/src/pymatgen/`; the pymatgen-analysis-defects code is in `src/tool_source_code/pymatgen/src/pymatgen/analysis/defects`.
pymatgen version: 2024.8.9
pymatgen-analysis-defects version: 2024.7.19
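A quick way to confirm that the pinned versions are the ones installed in your environment (standard library only):

```python
from importlib.metadata import version

# The benchmark assumes these exact releases; other versions may change APIs and results.
assert version("pymatgen") == "2024.8.9"
assert version("pymatgen-analysis-defects") == "2024.7.19"
print("Pinned pymatgen and pymatgen-analysis-defects versions are installed.")
```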
pip install repoagent
export OPENAI_API_KEY=YOUR_API_KEY # on Linux/Mac
set OPENAI_API_KEY=YOUR_API_KEY # on Windows
$Env:OPENAI_API_KEY = "YOUR_API_KEY" # on Windows (PowerShell)
cd src/tool_source_code/pymatgen/src/pymatgen
repoagent run -m gemini-2.0-flash -b https://generativelanguage.googleapis.com/v1beta/openai/ -tp . --print-hierarchy
It will generate a markdown file for each Python file and a summary JSON file. Our summary JSON file generated by gemini-2.0-flash is at `src/project_hierarchy.json`. We used Chroma to extract each `md_content` and `code_content` entry from it into `qa_benchmark/pymatgen-qa-generation/src/files/documents_llm_doc_gemini_20_flash_full.json` for generating the QA benchmark.
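A rough sketch of that extraction step is shown below. Only the `md_content` and `code_content` field names come from the pipeline described above; the nested structure of `project_hierarchy.json` and the flat output format are assumptions.

```python
import json

with open("src/project_hierarchy.json") as f:
    hierarchy = json.load(f)

documents = []

def collect(node):
    """Recursively gather md_content / code_content fields from nested dicts and lists."""
    if isinstance(node, dict):
        for key in ("md_content", "code_content"):
            if node.get(key):
                documents.append(node[key])
        for value in node.values():
            collect(value)
    elif isinstance(node, list):
        for item in node:
            collect(item)

collect(hierarchy)

with open("documents_llm_doc_gemini_20_flash_full.json", "w") as f:
    json.dump(documents, f)
print(f"Extracted {len(documents)} document chunks")
```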
First, modify the settings in `settings.py`.
CONFIG = {
"PROMPT": "code_generation", # ['question_generation', 'code_generation']
"MODEL_NAME": "gemini-2.0-flash", # model name
"LOGGER_FILE": "question_generator_code.log", # log file name
"OUTPUT_FILE": "files/generation_results_code.json", # store path
}
Then run:
cd qa_benchmark/pymatgen-qa-generation/src
python main.py
Run all blocks in `src/question_generation/build_qa_test.ipynb`.
(Note: The number of generated triplets may be larger than what we report in our benchmark, as we conducted a manual review process to remove low-quality triplets.)
MatTools provides a systematic way to evaluate the ability of LLMs to handle tasks related to materials science tools. It includes question generation, test automation, and an analysis framework, ensuring robust assessment and consistent results.

@misc{MatTools,
title={MatTools: Benchmarking Large Language Models for Materials Science Tools},
author={Siyu Liu and Jiamin Xu and Beilin Ye and Bo Hu and David J. Srolovitz and Tongqi Wen},
year={2025},
eprint={2505.10852},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2505.10852},
}