blaiszik/awesome-matchem-datasets

Awesome Materials & Chemistry Datasets

A curated list of datasets in materials science and chemistry that are useful for training machine learning and AI foundation models. It covers experimental, computational, and literature-mined datasets, prioritizing open-access resources and community contributions.

This project aims to:

  • Catalog the best datasets by domain, type, quality, and size
  • Support reproducible research in AI for chemistry and materials
  • Provide a community-driven resource with contributions from researchers and developers

How to Use

  • Explore datasets by domain or data type using the tables below
  • Click the access links to explore or download the data
  • Sort/filter by quality, size, and suitability for ML models
  • Fork the repo and submit a pull request to add new datasets
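
Sorting by size takes a little care because the Size column is free text ("110M DFT entries", "500k+ compounds", "Private"). The sketch below is one possible way to turn those cells into comparable numbers. The helper name `parse_size` and the rule of using only the first number in the cell are assumptions made for illustration, not part of this repository.

```python
import re

# Multipliers for the magnitude suffixes that appear in the Size column.
SUFFIX = {"k": 1e3, "K": 1e3, "M": 1e6, "B": 1e9}

# First number in the cell, with an optional k/K/M/B suffix.
SIZE_RE = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([kKMB])?(?![A-Za-z\d])")


def parse_size(cell):
    """Return the first count in a Size cell as a float, or None if there is none."""
    match = SIZE_RE.search(cell)
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    return number * SUFFIX.get(match.group(2), 1)


# A few (name, size) pairs copied from the tables in this list.
rows = [
    ("Materials Project (LBL)", "500k+ compounds"),
    ("OMat24 (Meta)", "110M DFT entries"),
    ("Citrine Informatics DB", "Private"),
    ("opXRD", "92552 (2179 labeled)"),
]

# Entries without a parsable count sort last.
largest_first = sorted(rows, key=lambda row: parse_size(row[1]) or 0.0, reverse=True)
for name, size in largest_first:
    print(f"{name}: {size}")
```

The counts are not directly comparable across datasets (entries, structures, tokens, and calculations are different units), so treat the ordering as a rough guide only.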

Contributing

Want to add a new dataset or improve metadata?

  1. Fork the repository
  2. Edit the appropriate dataset list or add a new entry
  3. Submit a pull request with a brief description and source
  4. In each entry, include the following fields:
    • Dataset Name
    • Domain
    • Type (Computational, Experimental, Literature-mined)
    • Size
    • Access (Open/Restricted/Proprietary)
    • Format (JSON, CSV, CIF, HDF5, SMILES, etc.)
    • License
    • Access Link
    • Notes or Use Cases
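
Before opening a pull request, it can help to check that a new entry fills in every field listed above. The following is a minimal sketch of such a check, not tooling that ships with this repository. The class `DatasetEntry`, its `problems` method, and the example values are hypothetical. Only the Access field is validated against a fixed set, because the tables also use Type values beyond the three named above (for example "LLM Training").

```python
from dataclasses import dataclass, fields

# Access levels named in the contribution guidelines.
ACCESS_LEVELS = {"Open", "Restricted", "Proprietary"}


@dataclass
class DatasetEntry:
    """One row of a dataset table (hypothetical helper, for illustration only)."""

    name: str
    domain: str
    type: str
    size: str
    access: str
    format: str
    license: str
    link: str
    notes: str = ""  # "Notes or Use Cases"; treated as optional in this sketch

    def problems(self):
        """Return a list of issues; an empty list means the entry looks complete."""
        issues = []
        for f in fields(self):
            if f.name != "notes" and not getattr(self, f.name).strip():
                issues.append(f"missing {f.name}")
        if self.access.strip() and self.access not in ACCESS_LEVELS:
            issues.append(f"access must be one of {sorted(ACCESS_LEVELS)}")
        return issues


# Made-up entry; the name and URL are placeholders, not a real dataset.
entry = DatasetEntry(
    name="Example DB",
    domain="Inorganic crystals",
    type="Computational",
    size="10k structures",
    access="Open",
    format="JSON",
    license="CC BY 4.0",
    link="https://example.org/data",
)
print(entry.problems())
```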

Datasets

Computational Datasets

Dataset | Domain | Size | Type | Format | License | Access | Link
OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open | OMat24
OMol25 (Meta) | Molecular chemistry | 100M+ DFT calculations | Computational | LMDB | CC BY 4.0 | Open | OMol25
Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open | materialsproject.org
Open Catalyst 2020 (OC20) | Catalysis (surfaces) | 1.2M relaxations | Computational | JSON/HDF5 | CC BY 4.0 | Open | opencatalystproject.org
AFLOW | Inorganic materials | 3.5M materials | Computational | REST API | Open | Open | aflow.org
OQMD | Inorganic solids | 1M+ compounds | Computational | SQL/CSV | Open | Open | oqmd.org
JARVIS-DFT (NIST) | 3D/2D materials | 40k+ entries | Computational | JSON/API | Open | Open | jarvis.nist.gov
Carolina Materials DB | Hypothetical crystals | 214k structures | Computational | JSON | CC BY 4.0 | Open | carolinamatdb.org
NOMAD | Various DFT/MD | >19M calculations | Computational | JSON | CC BY 4.0 | Open | NOMAD Repository
MatPES | DFT Potential Energy Surfaces | ~400,000 structures from 300K MD simulations | Computational | JSON |  | Open | MatPES
Vector-QM24 | Small organic and inorganic molecules | 836k conformational isomers | Computational | JSON | Placeholder | Open | V-QM24
AIMNet2 Dataset | Non-metallic compounds | 20M hybrid DFT calculations | Computational | JSON | Open | Open | AIMNet
RDB7 | Barrier height and enthalpy for small organic reactions | 12k CCSD(T)-F12 calculations | Computational | CSV | Open | Open | Zenodo
RDB19-Rad | ΔG of activation and of reaction for organic reactions in 40 common solvents | 5.6k DFT + COSMO-RS calculations | Computational | CSV | Open | Open | Zenodo
QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open | Zenodo

Experimental Datasets

Dataset | Domain | Size | Type | Format | License | Access | Link
Crystallography Open Database | Crystal structures | 523k+ entries | Experimental | CIF | Public Domain | Open | crystallography.net
NIST ICSD (subset) | Inorganic structures | ~290k structures | Experimental | CIF | Proprietary | Restricted | icsd.products.fiz-karlsruhe.de
CSD (Cambridge) | Organic crystals | ~1.3M structures | Experimental | CIF | Proprietary | Restricted | ccdc.cam.ac.uk
opXRD | Crystal structures | 92552 (2179 labeled) | Experimental | JSON | CC BY 4.0 | Open | zenodo.org
MDR SuperCon | Superconductivity | Legacy superconductor database with material composition, structure, properties, and processes |  | Mixed | CC BY 4.0 | Open | NIMS MDR

LLM Training Datasets

Dataset | Domain | Size | Type | Format | License | Access | Link
ChemPile | Chemistry | 75B+ tokens | LLM Training | Mixed | Open | Open | ChemPile
SmolInstruct | Small molecules | 3.3M samples | LLM Training | JSON | CC BY 4.0 | Open | SmolInstruct
CAMEL | Chemistry | 20K problem-solution pairs | LLM Training | JSON | Open | Open | CAMEL
ChemNLP | Chemistry | Extensive, many combined datasets | LLM Training | JSON | Open | Open | ChemNLP
MaScQA | Materials Science | 640 QA pairs | LLM Training | XLSX | Open | Open | MaScQA
SciCode | Research Coding in Physics, Math, Material Science, Biology, and Chemistry | 338 subproblems | LLM Training | JSON | Open | Open | SciCode

Literature-mined & Text Datasets

Dataset | Domain | Size | Type | Format | License | Access | Link
PubChem | Molecules & data | 119M compounds | Literature | SMILES/SDF | Public Domain | Open | pubchem.ncbi.nlm.nih.gov
USPTO Reactions | Organic reactions | 1.8M reactions | Literature | RXN/SMILES | Open | Open | USPTO MIT
Open Reaction Database (ORD) | Synthetic reactions | ~1M reactions | Experimental/Lit | JSON | CC BY 4.0 | Open | open-reaction-database.org
PatCID (IBM) | Chemical image data | 81M images / 13M mols | Literature | PNG/SMILES | Open | Open | github.com/DS4SD/PatCID
MatScholar | NLP corpus (materials) | 5M+ abstracts | Literature | JSON/Graph | Open | Open | matscholar.com

Proprietary Datasets (for reference)

Dataset | Domain | Size | Access | Use Case / Notes
CAS Registry | Chemical substances | 250M+ substances | Proprietary | Industry standard for molecule indexing
Reaxys (Elsevier) | Reactions & properties | Millions of reactions | Proprietary | Rich curated literature reaction data
Citrine Informatics DB | Experimental materials | Private | Proprietary | Materials ML platform with industry data
CSD (Cambridge) | Organic crystals | 1.3M+ | Proprietary | Gold-standard X-ray structures
PoLyInfo | Polymers & properties | 500k+ data points / Experimental | Proprietary | Polymer properties from literature sources

Dataset Resources

  • The Materials Data Facility - Over 100 TB of open materials data. TODO: list some of these in the tables above.
  • Foundry-ML (search Foundry) - 61 structured datasets ready for download through a Python client. TODO: list some of these in the tables above.

License

This project is licensed under the MIT License. Each dataset listed has its own license, noted in the table. Always check the source's license before using the data in your project.
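
Several License cells in the tables name no specific license ("Open", "Placeholder", or blank), so those entries in particular need a manual check at the source. As a rough aid, the sketch below sorts a License cell into one of four buckets. The function name `triage_license` and the bucket labels are assumptions made for this example, and the result is a reminder of what to verify, not legal guidance.

```python
def triage_license(cell):
    """Classify the text of a License cell into a coarse follow-up category."""
    text = cell.strip()
    lowered = text.lower()
    if not text or lowered in {"open", "placeholder"}:
        return "check source"  # no specific license is named in the table
    if lowered == "proprietary":
        return "restricted"
    if "-nc" in lowered:
        return "non-commercial"  # e.g. CC BY-NC 4.0
    return "named license"


# License cells as they appear in the tables above.
for cell in ["CC BY 4.0", "CC BY-NC 4.0", "Public Domain", "Open", "Proprietary", ""]:
    print(repr(cell), "->", triage_license(cell))
```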


Acknowledgements

Thanks to the open data and research communities including:

  • Meta AI FAIR
  • The Materials Data Facility / Foundry-ML
  • NIST JARVIS and Materials Project
  • LBL, MIT, CCDC, FIZ Karlsruhe
  • Contributors to Open Catalyst, PubChem, ORD, and AFLOW
  • Developers of open chemistry toolkits (RDKit, Open Babel)

Citation

If this repository was helpful in your work, feel free to cite or star the repo. You can also reference the underlying dataset publications linked above.
