A curated list of the most useful datasets in materials science and chemistry for training machine learning and AI foundation models. This includes experimental, computational, and literature-mined datasets—prioritizing open-access resources and community contributions.
This project aims to:
- Catalog the best datasets by domain, type, quality, and size
- Support reproducible research in AI for chemistry and materials
- Provide a community-driven resource with contributions from researchers and developers

To use this list:
- Explore datasets by domain or data type using the tables below
- Click the access links to explore or download the data
- Sort or filter by quality, size, and suitability for ML models
- Fork the repo and submit a pull request to add new datasets
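
The sorting and filtering step can also be done offline. The following is a minimal Python sketch, assuming the tables are copied as pipe-delimited text in the layout used in this list. `parse_table` and `approx_size` are hypothetical helper names, not tools shipped with this repo, and the size parser is a rough heuristic that reads only the first number in a Size cell.

```python
import re

# Four rows copied from the computational table in this list.
TABLE = """\
Dataset | Domain | Size | Type | Format | License | Access | Link |
---|---|---|---|---|---|---|---|
OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open | OMat24 |
Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open | materialsproject.org |
JARVIS-DFT (NIST) | 3D/2D materials | 40k+ entries | Computational | JSON/API | Open | Open | jarvis.nist.gov |
QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B Semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open | Zenodo |
"""

# Suffixes used in the Size column (k = thousand, M = million, B = billion).
_MULTIPLIER = {"k": 1e3, "m": 1e6, "b": 1e9}


def parse_table(text):
    """Parse a pipe-delimited table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    rows = []
    for ln in lines[2:]:  # skip the header and the ---|--- separator
        cells = [c.strip() for c in ln.strip().strip("|").split("|")]
        if len(cells) != len(header):
            continue  # skip rows whose column count does not match the header
        rows.append(dict(zip(header, cells)))
    return rows


def approx_size(size_text):
    """Return the first number in a Size cell as a float, or None if there is none."""
    match = re.search(r"(\d[\d,]*\.?\d*)\s*([kKmMbB])?\b", size_text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    suffix = match.group(2)
    return value * _MULTIPLIER[suffix.lower()] if suffix else value


rows = parse_table(TABLE)

# Keep open-access entries under CC BY 4.0, largest first.
open_cc_by = [r for r in rows if r["Access"] == "Open" and r["License"] == "CC BY 4.0"]
open_cc_by.sort(key=lambda r: approx_size(r["Size"]) or 0, reverse=True)

for r in open_cc_by:
    print(r["Dataset"], "|", r["Size"])
```

With the four sample rows, only OMat24 and Materials Project pass the filter, in that order. Rows with a different column count are skipped rather than guessed at, so check for dropped rows when parsing a full table.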
Want to add a new dataset or improve metadata?
- Fork the repository
- Edit the appropriate dataset list or add a new entry
- Submit a pull request with a brief description and source
- Use the following fields:
  - Dataset Name
  - Domain
  - Type (Computational, Experimental, Literature-mined)
  - Size
  - Access (Open/Restricted/Proprietary)
  - Format (JSON, CSV, CIF, HDF5, SMILES, etc.)
  - License
  - Access Link
  - Notes or Use Cases
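
Before opening a pull request, it may help to check a new entry against the fields above. The sketch below is one possible way to do that, not a tool provided by this repo: `DatasetEntry`, `validate`, and `to_row` are hypothetical names, the allowed values are taken from the field list, and the example entry uses placeholder values rather than a real dataset.

```python
from dataclasses import dataclass, fields

# Allowed values taken from the contribution field list.
ALLOWED_TYPES = {"Computational", "Experimental", "Literature-mined"}
ALLOWED_ACCESS = {"Open", "Restricted", "Proprietary"}


@dataclass
class DatasetEntry:
    """Hypothetical container for one dataset entry."""
    name: str
    domain: str
    type: str
    size: str
    access: str
    format: str
    license: str
    link: str
    notes: str = ""  # optional: notes or use cases

    def validate(self):
        """Return a list of problems; an empty list means the entry looks complete."""
        problems = []
        for f in fields(self):
            if f.name != "notes" and not getattr(self, f.name).strip():
                problems.append(f"missing {f.name}")
        if self.type.strip() and self.type not in ALLOWED_TYPES:
            problems.append(f"unexpected type: {self.type}")
        if self.access.strip() and self.access not in ALLOWED_ACCESS:
            problems.append(f"unexpected access: {self.access}")
        return problems

    def to_row(self):
        """Render the entry in the column order used by the eight-column tables."""
        cells = [self.name, self.domain, self.size, self.type,
                 self.format, self.license, self.access, self.link]
        return " | ".join(cells) + " |"


# Placeholder values for illustration only; this is not a real dataset.
entry = DatasetEntry(
    name="Example DB", domain="Inorganic crystals", type="Computational",
    size="1k entries", access="Open", format="CSV",
    license="CC BY 4.0", link="https://example.org/dataset",
)
print(entry.validate())
print(entry.to_row())
```

Some existing rows use Type values outside the three listed here (for example "LLM Training"), so treat an "unexpected type" message as a prompt to double-check rather than a hard error.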

Computational datasets:

Dataset | Domain | Size | Type | Format | License | Access | Link |
---|---|---|---|---|---|---|---|
OMat24 (Meta) | Inorganic crystals | 110M DFT entries | Computational | JSON/HDF5 | CC BY 4.0 | Open | OMat24 |
OMol25 (Meta) | Molecular chemistry | 100M+ DFT calculations | Computational | LMDB | CC BY 4.0 | Open | OMol25 |
Materials Project (LBL) | Inorganic crystals | 500k+ compounds | Computational | JSON/API | CC BY 4.0 | Open | materialsproject.org |
Open Catalyst 2020 (OC20) | Catalysis (surfaces) | 1.2M relaxations | Computational | JSON/HDF5 | CC BY 4.0 | Open | opencatalystproject.org |
AFLOW | Inorganic materials | 3.5M materials | Computational | REST API | Open | Open | aflow.org |
OQMD | Inorganic solids | 1M+ compounds | Computational | SQL/CSV | Open | Open | oqmd.org |
JARVIS-DFT (NIST) | 3D/2D materials | 40k+ entries | Computational | JSON/API | Open | Open | jarvis.nist.gov |
Carolina Materials DB | Hypothetical crystals | 214k structures | Computational | JSON | CC BY 4.0 | Open | carolinamatdb.org |
NOMAD | Various DFT/MD | >19M calculations | Computational | JSON | CC BY 4.0 | Open | NOMAD Repository |
MatPES | DFT potential energy surfaces | ~400,000 structures from 300 K MD simulations | Computational | JSON | Open | Open | MatPES |
Vector-QM24 | Small organic and inorganic molecules | 836k conformational isomers | Computational | JSON | Placeholder | Open | V-QM24 |
AIMNet2 Dataset | Non-metallic compounds | 20M hybrid DFT calculations | Computational | JSON | Open | Open | AIMNet |
RDB7 | Barrier height and enthalpy for small organic reactions | 12k CCSD(T)-F12 calculations | Computational | CSV | Open | Open | Zenodo |
RDB19-Rad | ΔG of activation and of reaction for organic reactions in 40 common solvents | 5.6k DFT + COSMO-RS calculations | Computational | CSV | Open | Open | Zenodo |
QCML | Small molecules consisting of up to 8 heavy atoms | 14.7B Semi-empirical + 33.5M DFT calculations | Computational | TFDS | CC BY-NC 4.0 | Open | Zenodo |

Experimental datasets:

Dataset | Domain | Size | Type | Format | License | Access | Link |
---|---|---|---|---|---|---|---|
Crystallography Open Database | Crystal structures | 523k+ entries | Experimental | CIF | Public Domain | Open | crystallography.net |
NIST ICSD (subset) | Inorganic structures | ~290k structures | Experimental | CIF | Proprietary | Restricted | icsd.products.fiz-karlsruhe.de |
CSD (Cambridge) | Organic crystals | ~1.3M structures | Experimental | CIF | Proprietary | Restricted | ccdc.cam.ac.uk |
opXRD | Crystal structures | 92,552 diffractograms (2,179 labeled) | Experimental | JSON | CC BY 4.0 | Open | zenodo.org |
MDR SuperCon | Superconductivity | Legacy superconductor database with material composition, structure, properties, and processes | Experimental | Mixed | CC BY 4.0 | Open | NIMS MDR |

LLM training datasets:

Dataset | Domain | Size | Type | Format | License | Access | Link |
---|---|---|---|---|---|---|---|
ChemPile | Chemistry | 75B+ tokens | LLM Training | Mixed | Open | Open | ChemPile |
SmolInstruct | Small molecules | 3.3M samples | LLM Training | JSON | CC BY 4.0 | Open | SmolInstruct |
CAMEL | Chemistry | 20K problem-solution pairs | LLM Training | JSON | Open | Open | CAMEL |
ChemNLP | Chemistry | Extensive, many combined datasets | LLM Training | JSON | Open | Open | ChemNLP |
MaScQA | Materials Science | 640 QA pairs | LLM Training | XLSX | Open | Open | MaScQA |
SciCode | Research Coding in Physics, Math, Material Science, Biology, and Chemistry | 338 subproblems | LLM Training | JSON | Open | Open | SciCode |

Literature-mined datasets:

Dataset | Domain | Size | Type | Format | License | Access | Link |
---|---|---|---|---|---|---|---|
PubChem | Molecules & data | 119M compounds | Literature | SMILES/SDF | Public Domain | Open | pubchem.ncbi.nlm.nih.gov |
USPTO Reactions | Organic reactions | 1.8M reactions | Literature | RXN/SMILES | Open | Open | USPTO MIT |
Open Reaction Database (ORD) | Synthetic reactions | ~1M reactions | Experimental/Lit | JSON | CC BY 4.0 | Open | open-reaction-database.org |
PatCID (IBM) | Chemical image data | 81M images / 13M mols | Literature | PNG/SMILES | Open | Open | github.com/DS4SD/PatCID |
MatScholar | NLP corpus (materials) | 5M+ abstracts | Literature | JSON/Graph | Open | Open | matscholar.com |

Proprietary datasets:

Dataset | Domain | Size | Access | Use Case Notes |
---|---|---|---|---|
CAS Registry | Chemical substances | 250M+ substances | Proprietary | Industry standard for molecule indexing |
Reaxys (Elsevier) | Reactions & properties | Millions of reactions | Proprietary | Rich curated literature reaction data |
Citrine Informatics DB | Experimental materials | Private | Proprietary | Materials ML platform w/ industry data |
CSD (Cambridge) | Organic crystals | 1.3M+ | Proprietary | Gold-standard X-ray structures |
PoLyInfo | Polymers & properties | 500k+ experimental data points | Proprietary | Polymer properties from literature sources |

Other resources and to-do items:
- The Materials Data Facility: over 100 TB of open materials data. TODO: list some of these in the tables above.
- Foundry-ML (search Foundry): 61 structured datasets ready for download through a Python client. TODO: list some of these in the tables above.
- Classify and add CRIPT for polymer data
- Classify and add Polymer Genome and other datasets from Khazana
- Add a dataset on solubilities of gases in polymers: 15,000 experimental measurements of the uptake of 79 gases (0.01–50 wt%) in 102 different polymers, at pressures from 1 × 10⁻³ to 7 × 10² bar and temperatures from 233 to 508 K, covering nearly 500 solvent–polymer systems. Optimized structures of various repeating units are included. Available here: Data
- Add Materials Cloud Datasets
- Classify Atomly. This is a bit challenging because the content is not in English.
- Look into adding NOMAD for experimental data as well
- Review Alexandria Materials
- Add "A Quantum-Chemical Bonding Database for Solid-State Materials" (Part 1: https://zenodo.org/records/8091844, Part 2: https://zenodo.org/records/8092187)
- Add QM datasets: http://quantum-machine.org/datasets/
This project is licensed under the MIT License. Each dataset listed has its own license, noted in the table. Always check the source's license before using the data in your project.
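
As a first pass before reading the actual license text, the License column can be screened for entries that need a closer look. This is a rough sketch based on string matching only; `license_flags` is a hypothetical helper, the flag names are made up for illustration, and the sample values are copied from the tables in this list. It does not replace checking the license at the source.

```python
def license_flags(license_text):
    """Return review flags for a License cell (heuristic, string matching only)."""
    text = license_text.strip().lower()
    flags = []
    if text in ("", "placeholder"):
        flags.append("unspecified")      # no license recorded in the table
    if text == "open":
        flags.append("vague")            # "Open" does not name a specific license
    if "proprietary" in text:
        flags.append("proprietary")
    if "-nc" in text or "noncommercial" in text or "non-commercial" in text:
        flags.append("non-commercial")   # e.g. CC BY-NC
    return flags


# License values as they appear in the tables in this list.
licenses = {
    "OMat24 (Meta)": "CC BY 4.0",
    "QCML": "CC BY-NC 4.0",
    "Vector-QM24": "Placeholder",
    "AFLOW": "Open",
    "CSD (Cambridge)": "Proprietary",
}

for name, lic in licenses.items():
    flags = license_flags(lic)
    if flags:
        print(f"{name}: {lic!r} -> review ({', '.join(flags)})")
```

Entries with no flags still carry their own terms (attribution for CC BY 4.0, for instance), so an empty list means "nothing unusual spotted", not "no conditions".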
Thanks to the open data and research communities including:
- Meta AI FAIR
- The Materials Data Facility / Foundry-ML
- NIST JARVIS and Materials Project
- LBL, MIT, CCDC, FIZ Karlsruhe
- Contributors to Open Catalyst, PubChem, ORD, and AFLOW
- Developers of open chemistry toolkits (RDKit, Open Babel)
If this repository was helpful in your work, feel free to cite or star the repo. You can also reference the underlying dataset publications linked above.