Abstract
Identification of small molecules remains a central question in analytical chemistry, in particular for natural product research, metabolomics, environmental research, and biomarker discovery. Mass spectrometry is the predominant technique for high-throughput analysis of small molecules. However, it reveals information only about the mass of molecules and, through tandem mass spectrometry, about the masses of molecular fragments. Automated interpretation of mass spectra is often limited to searching in spectral libraries, so that we can only dereplicate molecules for which reference mass spectra have already been recorded. In my thesis “Computational methods for small molecule identification”, we developed SIRIUS, a tool for the structural elucidation of small molecules with tandem mass spectrometry. In a first step, the method computes a hypothetical fragmentation tree using combinatorial optimization. By using a Bayesian statistical model, we can learn the parameters and hyperparameters of the underlying scoring directly from data. We demonstrate that the statistical model, which was fitted on a small dataset, generalizes well across many different datasets and mass spectrometry instruments. In a second step, the fragmentation tree is used to predict a molecular fingerprint using kernel support vector machines. The predicted fingerprint can be searched in a structure database to identify the molecular structure. We demonstrate that our machine learning model outperforms all other methods for this task, including its predecessor FingerID. SIRIUS is available as a command-line tool and with a user interface. The molecular fingerprint prediction is implemented as a web service and receives over one million requests per month.
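The database search in the second step can be illustrated with a minimal sketch. It assumes the predictor outputs one probability per fingerprint bit and ranks candidate structures by a simple per-bit log-likelihood. This scoring, the function names, and the toy candidates are hypothetical and chosen for illustration only; they are not the scoring actually used by SIRIUS or CSI:FingerID.

```python
import math

def log_likelihood(predicted, candidate, eps=1e-6):
    """Score a binary candidate fingerprint against predicted bit probabilities.

    Illustrative scoring only (an assumption, not the published method):
    each bit contributes log(p) if the candidate has the property,
    and log(1 - p) otherwise.
    """
    score = 0.0
    for p, bit in zip(predicted, candidate):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        score += math.log(p if bit else 1.0 - p)
    return score

def rank_candidates(predicted, database):
    """Return candidate names sorted from best to worst score."""
    return sorted(
        database,
        key=lambda name: log_likelihood(predicted, database[name]),
        reverse=True,
    )

# Hypothetical predicted probabilities for four molecular properties.
predicted = [0.9, 0.1, 0.8, 0.2]

# Hypothetical structure database: name -> binary fingerprint.
database = {
    "candidate_A": [1, 0, 1, 0],
    "candidate_B": [1, 1, 0, 0],
    "candidate_C": [0, 1, 0, 1],
}

ranking = rank_candidates(predicted, database)
print(ranking)
```

In this toy setting the candidate whose fingerprint agrees with the predicted probabilities on every bit is ranked first; a real structure database would be filtered by molecular formula before scoring.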
Article note
The dissertation of Dr. Kai Dührkop has been awarded the best-thesis award of the Fachgruppe Bioinformatik (FaBI) (see https://www.bioinformatik.de/en/).
Funding source: Deutsche Forschungsgemeinschaft
Award Identifier / Grant number: BO 1910/20
Funding statement: We gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (BO 1910/20).
About the author
Dr. Kai Dührkop received his diploma in Bioinformatics in 2012 at the Friedrich-Schiller University (FSU) Jena, Germany, working on fixed-parameter tractable algorithms for tree alignments. For his dissertation on the identification of small molecules with tandem mass spectrometry, supervised by Prof. Dr. Sebastian Böcker at FSU Jena, he was awarded the best-thesis award by the Fachgruppe Bioinformatik (FaBI). At present, he is a postdoctoral researcher in the group of Prof. Juho Rousu at Aalto University, Finland.
Acknowledgments
We thank the GNPS community, S. Stein, and F. Kuhlmann and Agilent Technologies, Inc. (Santa Clara, USA) for providing data that was used to estimate the hyperparameters of SIRIUS 4 and to train CSI:FingerID.
Competing financial interests statement
K. D. is a co-founder of Bright Giant GmbH, Germany.
© 2019 Walter de Gruyter GmbH, Berlin/Boston