Abstract
Protein glycosylation, a post-translational modification of proteins by glycans, plays an important role in numerous physiological and pathological cellular functions. Glycoproteomics, the study of protein glycosylation on a proteome-wide scale, uses liquid chromatography coupled with tandem mass spectrometry (MS/MS) to obtain combined information on glycosylation site, glycosylation level and glycan structure. However, current database searching methods for glycoproteomics often struggle with glycan structure determination because structure-determining ions occur only rarely. Although spectral searching methods can leverage fragment intensity to facilitate the structure identification of glycopeptides, their application is hindered by difficulties in spectral library construction. In this work, we present DeepGP, a hybrid deep learning framework based on transformer and graph neural networks, for the prediction of MS/MS spectra and retention time of glycopeptides. Two graph neural network modules are employed to capture the branched glycan structure and predict glycan ion intensity, respectively. Additionally, a pretraining strategy is implemented to alleviate the insufficiency of glycoproteomics data. Tested on multiple biological datasets, DeepGP accurately predicts MS/MS spectra and retention time of glycopeptides, closely matching the experimental results. Comprehensive benchmarking of DeepGP on synthetic and biological datasets validates its effectiveness in distinguishing similar glycans. Across various decoy methods, DeepGP in combination with database searching can increase glycopeptide detection sensitivity. We anticipate that DeepGP can inspire extensive future work in glycoproteomics.
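To make the idea of capturing a branched glycan structure with a graph neural network more concrete, the following is a minimal sketch, not DeepGP's actual architecture. It encodes the N-glycan core (two HexNAc residues followed by a hexose that branches into two further hexoses) as a graph and applies one mean-aggregation message-passing step. The one-letter vocabulary, the function names and the choice of mean aggregation are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical one-letter vocabulary (an assumption for this sketch):
# H = hexose, N = HexNAc, F = fucose, A = NeuAc.
VOCAB = ["H", "N", "F", "A"]


def one_hot(symbol):
    """Encode a monosaccharide symbol as a one-hot feature vector."""
    return [1.0 if symbol == s else 0.0 for s in VOCAB]


def build_graph(nodes, edges):
    """Return an undirected adjacency list for a glycan tree."""
    adj = defaultdict(list)
    for parent, child in edges:
        adj[parent].append(child)
        adj[child].append(parent)
    return {i: sorted(adj[i]) for i in range(len(nodes))}


def message_pass(features, adj):
    """One mean-aggregation step: each node averages itself with its neighbours."""
    updated = []
    for i, feat in enumerate(features):
        group = [feat] + [features[j] for j in adj[i]]
        updated.append([sum(col) / len(group) for col in zip(*group)])
    return updated


# N-glycan core: HexNAc(0)-HexNAc(1)-Hex(2), with two Hex branches (3, 4) on node 2.
core_nodes = ["N", "N", "H", "H", "H"]
core_edges = [(0, 1), (1, 2), (2, 3), (2, 4)]

adj = build_graph(core_nodes, core_edges)
features = [one_hot(s) for s in core_nodes]
hidden = message_pass(features, adj)
```

After one step, the branching hexose (node 2) carries a mixture of its own type and that of its three neighbours, which is the kind of local structural context a linear sequence encoding would not provide. A trained model would replace the fixed averaging with learned weights and stack several such layers.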
Data availability
The raw datasets used in this study are available in the PRIDE database (ref. 45) under accession codes PXD005411 (ref. 7), PXD005412 (ref. 7), PXD005413 (ref. 7), PXD005553 (ref. 7), PXD005555 (ref. 7), PXD025859 (ref. 9), PXD015360 (ref. 36), PXD009654 (ref. 37), PXD023980 (ref. 18), PXD016428 (ref. 46), PXD005931 (ref. 47), PXD025455 (ref. 48), PXD009716 (ref. 49) and PXD005565 (ref. 7). Further details regarding the datasets and raw files used in this study can be found in Supplementary Table 1 and Supplementary Data 7. Source data are provided with this paper. The source data for the main figures, including statistics, are provided as a Source Data file. The source data for the supplementary figures, including statistics, are provided as Supplementary Data 8. FASTA files were from the UniProt H. sapiens reference proteome (20,600 entries), UniProt M. musculus reference proteome (17,082 entries) and UniProt S. pombe reference proteome (5,140 entries). For the synthetic glycopeptide dataset (Syn_1), the FASTA file was compiled from synthetic glycopeptide sequences as stated in the original publication.
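As a convenience for locating the datasets, the accession codes listed above can be mapped to PRIDE Archive project pages. The sketch below is illustrative only: the URL pattern is an assumption about how PRIDE Archive project pages are addressed and is not stated in this paper, and the helper name project_url is hypothetical.

```python
import re

# Accession codes copied from the Data availability statement.
ACCESSIONS = [
    "PXD005411", "PXD005412", "PXD005413", "PXD005553", "PXD005555",
    "PXD025859", "PXD015360", "PXD009654", "PXD023980", "PXD016428",
    "PXD005931", "PXD025455", "PXD009716", "PXD005565",
]

# Assumed URL pattern for PRIDE Archive project pages (not given in the paper).
PRIDE_PROJECT_URL = "https://www.ebi.ac.uk/pride/archive/projects/{accession}"

# ProteomeXchange dataset identifiers have the form PXD plus six digits.
ACCESSION_PATTERN = re.compile(r"^PXD\d{6}$")


def project_url(accession):
    """Build the project page URL for one accession, rejecting malformed codes."""
    if not ACCESSION_PATTERN.match(accession):
        raise ValueError(f"not a ProteomeXchange accession: {accession!r}")
    return PRIDE_PROJECT_URL.format(accession=accession)


urls = {accession: project_url(accession) for accession in ACCESSIONS}

for accession, url in urls.items():
    print(accession, url)
```

Validating each code before building the URL catches transcription errors, such as a dropped digit, early. Which raw files belong to each dataset is described in Supplementary Data 7.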
Code availability
DeepGP, along with the user guide, is freely available via GitHub (https://github.com/lmsac/DeepGP) and Zenodo (https://doi.org/10.5281/zenodo.11911189; ref. 50).
References
Wang, Y. C., Peterson, S. E. & Loring, J. F. Protein post-translational modifications and regulation of pluripotency in human stem cells. Cell Res. 24, 143–160 (2014).
Hart, G. W. & Copeland, R. J. Glycomics hits the big time. Cell 143, 672–676 (2010).
Hu, H., Khatri, K. & Zaia, J. Algorithms and design strategies towards automated glycoproteomics analysis. Mass Spectrom. Rev. 36, 475–498 (2017).
Hu, H., Khatri, K., Klein, J., Leymarie, N. & Zaia, J. A review of methods for interpretation of glycopeptide tandem mass spectral data. Glycoconj. J. 33, 285–296 (2016).
Bojar, D. & Lisacek, F. Glycoinformatics in the artificial intelligence era. Chem. Rev. 122, 15971–15988 (2022).
Zeng, W. F. et al. pGlyco: a pipeline for the identification of intact N-glycopeptides by using HCD- and CID-MS/MS and MS3. Sci. Rep. 6, 25102 (2016).
Liu, M. Q. et al. pGlyco 2.0 enables precision N-glycoproteomics with comprehensive quality control and one-step mass spectrometry for intact glycopeptide identification. Nat. Commun. 8, 438 (2017).
Zeng, W. F., Cao, W. Q., Liu, M. Q., He, S. M. & Yang, P. Y. Precise, fast and comprehensive analysis of intact glycopeptides and modified glycans with pGlyco3. Nat. Methods 18, 1515–1523 (2021).
Shen, J. C. et al. StrucGP: de novo structural sequencing of site-specific N-glycan on glycoproteins using a modularization strategy. Nat. Methods 18, 921–929 (2021).
Polasky, D. A., Yu, F. C., Teo, G. C. & Nesvizhskii, A. I. Fast and comprehensive N- and O-glycoproteomics analysis with MSFragger-Glyco. Nat. Methods 17, 1125–1132 (2020).
Lu, L., Riley, N. M., Shortreed, M. R., Bertozzi, C. R. & Smith, L. M. O-Pair search with MetaMorpheus for O-glycopeptide characterization. Nat. Methods 17, 1133–1138 (2020).
Medzihradszky, K. F., Maynard, J., Kaasik, K. & Bern, M. Intact N- and O-linked glycopeptide identification from HCD data using Byonic. Mol. Cell. Proteomics 13, S36 (2014).
Fang, Z. et al. Glyco-Decipher enables glycan database-independent peptide matching and in-depth characterization of site-specific N-glycosylation. Nat. Commun. 13, 1900 (2022).
Xiao, K. & Tian, Z. GPSeeker enables quantitative structural N-Glycoproteomics for site- and structure-specific characterization of differentially expressed N-glycosylation in hepatocellular carcinoma. J. Proteome Res. 18, 2885–2895 (2019).
Peng, W. et al. MS-based glycomics and glycoproteomics methods enabling isomeric characterization. Mass Spectrom. Rev. 42, 577–616 (2023).
Toghi Eshghi, S., Shah, P., Yang, W., Li, X. & Zhang, H. GPQuest: a spectral library matching algorithm for site-specific assignment of tandem mass spectra to intact N-glycopeptides. Anal. Chem. 87, 5181–5188 (2015).
Li, S. J., Zhu, J. H., Lubman, D. M., Zhou, H. & Tang, H. X. GlycoSLASH: concurrent glycopeptide identification from multiple related LC-MS/MS data sets by using spectral clustering and library searching. J. Proteome Res. 22, 1501–1509 (2023).
Yang, Y. et al. GproDIA enables data-independent acquisition glycoproteomics with comprehensive statistical control. Nat. Commun. 12, 6073 (2021).
Zeng, W. F. et al. MS/MS spectrum prediction for modified peptides using pDeep2 trained by transfer learning. Anal. Chem. 91, 9724–9731 (2019).
Zhou, X. X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Tarn, C. & Zeng, W. F. pDeep3: toward more accurate spectrum prediction with fast few-shot learning. Anal. Chem. 93, 5815–5822 (2021).
Gessulat, S. et al. Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning. Nat. Methods 16, 509–518 (2019).
Tiwary, S. et al. High-quality MS/MS spectrum prediction for data-dependent and data-independent acquisition data analysis. Nat. Methods 16, 519 (2019).
Yang, Y. et al. In silico spectral libraries by deep learning facilitate data-independent acquisition proteomics. Nat. Commun. 11, 146 (2020).
Lou, R. H. et al. DeepPhospho accelerates DIA phosphoproteome profiling through in silico library generation. Nat. Commun. 12, 6685 (2021).
Zong, Y. et al. DeepFLR facilitates false localization rate control in phosphoproteomics. Nat. Commun. 14, 2269 (2023).
Reily, C., Stewart, T. J., Renfrow, M. B. & Novak, J. Glycosylation in health and disease. Nat. Rev. Nephrol. 15, 346–366 (2019).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (eds Burstein, J. et al.) 4171–4186 (ACL, 2018); https://doi.org/10.18653/V1/N19-1423
Cao, W. et al. Recent advances in software tools for more generic and precise intact glycopeptide analysis. Mol. Cell. Proteomics 20, 100060 (2021).
Liu, J. et al. Methods for peptide identification by spectral comparison. Proteome Sci. 5, 3 (2007).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. Preprint at https://arxiv.org/abs/1609.02907 (2016).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? Preprint at https://arxiv.org/abs/1810.00826 (2018).
Veličković, P. et al. Graph attention networks. In Proc. 6th International Conference on Learning Representations (ICLR, 2018); https://doi.org/10.48550/arXiv.1710.10903
Xiong, Z. et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J. Med. Chem. 63, 8749–8760 (2020).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems (eds von Luxburg, U. et al.) 5999–6009 (Curran Associates, 2017); https://doi.org/10.48550/arXiv.1706.03762
Zhang, Y. et al. Comparative glycoproteomic profiling of human body fluid between healthy controls and patients with papillary thyroid carcinoma. J. Proteome Res. 19, 2539–2552 (2020).
Qin, H. et al. Highly efficient analysis of glycoprotein sialylation in human serum by simultaneous quantification of glycosites and site-specific glycoforms. J. Proteome Res. 18, 3439–3446 (2019).
Sun, W. et al. Glycopeptide database search and de novo sequencing with PEAKS GlycanFinder enable highly sensitive glycoproteomics. Nat. Commun. 14, 4046 (2023).
Polasky, D. A., Geiszler, D. J., Yu, F. & Nesvizhskii, A. I. Multiattribute glycan identification and FDR control for glycoproteomics. Mol. Cell. Proteomics 21, 100205 (2022).
Zhang, S. Spectrum and Retention Time Prediction for N-Glycopeptides Using Deep Learning. Master's thesis, Univ. of Waterloo (2023).
Kawahara, R. et al. Community evaluation of glycoproteomics informatics solutions reveals high-performance search strategies for serum glycopeptide analysis. Nat. Methods 18, 1304–1316 (2021).
Klein, J., Carvalho, L. & Zaia, J. Expanding N-Glycopeptide identifications by fragmentation prediction and glycome network smoothing. Preprint at bioRxiv https://doi.org/10.1101/2021.02.14.431154 (2021).
Zhang, Z. & Shah, B. Prediction of collision-induced dissociation spectra of common N-glycopeptides for glycoform identification. Anal. Chem. 82, 10194–10202 (2010).
Yang, Y. & Fang, Q. Prediction of glycopeptide fragment mass spectra by deep learning. Nat. Commun. 15, 2448 (2024).
Vizcaino, J. A. et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 44, D447–D456 (2016).
Zhang, Y. et al. Glyco-CPLL: an integrated method for in-depth and comprehensive N-glycoproteome profiling of human plasma. J. Proteome Res. 19, 655–666 (2020).
Bollineni, R. C., Koehler, C. J., Gislefoss, R. E., Anonsen, J. H. & Thiede, B. Large-scale intact glycopeptide identification by Mascot database search. Sci. Rep. 8, 2117 (2018).
Lin, Y. et al. A panel of glycopeptides as candidate biomarkers for early diagnosis of NASH hepatocellular carcinoma using a stepped HCD Method and PRM evaluation. J. Proteome Res. 20, 3278–3289 (2021).
Pioch, M., Hoffmann, M., Pralow, A., Reichl, U. & Rapp, E. glyXtool(MS): an open-source pipeline for semiautomated analysis of glycopeptide mass spectrometry data. Anal. Chem. 90, 11908–11916 (2018).
Zong, Y. Code for DeepGP. Zenodo https://doi.org/10.5281/zenodo.11911189 (2024).
Acknowledgements
We thank M. Ye and Z. Fang from the Dalian Institute of Chemical Physics, Chinese Academy of Sciences, for assisting us in using GP-plotter. This work was supported by the Science and Technology Commission of Shanghai Municipality (grant no. 23JS1400100, L.Q.) and the National Natural Science Foundation of China (grant no. 21934001, L.Q.). The work also received support from the AI for Science project of Fudan University (X.Q. and L.Q.).
Author information
Contributions
Y.Z. did the majority of the coding work and data analysis and wrote the original draft of the manuscript. Y.W. built the deep learning model. X.Q. and X.H. assisted in the building of the deep learning model and provided the computational resources. L.Q. supervised all aspects of the work and finalized the manuscript. All authors were involved in the design of this work.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Wen-Feng Zeng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Tables 1–3, Figs. 1–31 and Notes 1–7.
Supplementary Data 1
Scores by DeepGP and pGlyco3 for glycopeptide candidates of MS/MS spectra from mouse liver dataset with 1% or 100% FDR by pGlyco3.
Supplementary Data 2
Scores by DeepGP and pGlyco3 for glycopeptide candidates of MS/MS spectra from mouse brain dataset with 1% or 100% FDR by pGlyco3.
Supplementary Data 3
The glycopeptides identified by DeepGP, pGlyco3 and StrucGP from the MS/MS spectra extracted from the mouse liver and brain datasets.
Supplementary Data 4
Glycopeptides identified by DeepGP for MS/MS spectra, removing diagnostic ions.
Supplementary Data 5
Glycopeptides identified by DeepGP + pGlyco3 and pGlyco3 alone with different decoy methods.
Supplementary Data 6
Glycopeptides additionally identified by DeepGP + pGlyco3 with different decoy methods.
Supplementary Data 7
Summary of the raw files used for each dataset from the public resources.
Supplementary Data 8
Source data for supplementary figures.
Source data
Source Data Fig. 2
Statistical source data.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zong, Y., Wang, Y., Qiu, X. et al. Deep learning prediction of glycopeptide tandem mass spectra powers glycoproteomics. Nat Mach Intell 6, 950–961 (2024). https://doi.org/10.1038/s42256-024-00875-x
DOI: https://doi.org/10.1038/s42256-024-00875-x