Abstract
A DNA sequence can be represented in several ways; one of them is to split it into its component k-mers. In this work, we exploit the strong similarity between natural language and the “genomic sequence language”, both of which are character-based, to represent DNA sequences. In a natural language processing context, k-mers can be treated as words. In this representation, we process a DNA sequence as a set of overlapping word embeddings obtained with the Global Vectors (GloVe) representation. Embedding the k-mers helps to overcome the curse of dimensionality, one of the main issues of traditional methods that encode k-mer occurrences as one-hot vectors. Experiments on the first Critical Assessment of Metagenome Interpretation (CAMI) dataset demonstrated that our method is an efficient way to cluster metagenomic reads and predict their taxonomy. This method could be used as a first step in downstream metagenomic analysis.
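As a rough illustration of the "k-mers as words" idea, the sketch below tokenizes a read into overlapping k-mers and builds the kind of distance-weighted co-occurrence counts that GloVe is trained on. The value k = 4, the window size of 2, the example read, and the function names `kmers` and `cooccurrence` are assumptions made for illustration only; the abstract does not state the settings used in the chapter.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, left to right, with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cooccurrence(tokens, window):
    """Symmetric co-occurrence counts within `window` tokens.

    Each pair at distance d contributes 1/d, the weighting described in the
    GloVe paper. The window size is a hypothetical choice, not taken from
    the chapter.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                context = tokens[i + d]
                counts[(word, context)] += 1.0 / d
                counts[(context, word)] += 1.0 / d
    return counts


read = "ATGCGATGCA"               # made-up read, for illustration only
tokens = kmers(read, 4)           # the "words" of this read
counts = cooccurrence(tokens, 2)  # input statistics for a GloVe-style model

# A one-hot encoding needs one dimension per possible k-mer, which grows
# exponentially with k; a learned embedding keeps a fixed, much smaller size.
one_hot_dim = 4 ** 4              # 256 possible 4-mers over {A, C, G, T}
```

Because the one-hot dimension is 4^k, moving from k = 4 to k = 10 already raises it from 256 to more than a million, whereas the embedding dimension is chosen independently of k.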
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Matougui, B., Belhadef, H., Kitouni, I. (2021). An Approach Based Natural Language Processing for DNA Sequences Encoding Using the Global Vectors for Word Representation. In: Saeed, F., Mohammed, F., Al-Nahari, A. (eds) Innovative Systems for Intelligent Health Informatics. IRICT 2020. Lecture Notes on Data Engineering and Communications Technologies, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-030-70713-2_53
DOI: https://doi.org/10.1007/978-3-030-70713-2_53
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-70712-5
Online ISBN: 978-3-030-70713-2
eBook Packages: Intelligent Technologies and Robotics (R0)