An Approach Based Natural Language Processing for DNA Sequences Encoding Using the Global Vectors for Word Representation

  • Conference paper
Innovative Systems for Intelligent Health Informatics (IRICT 2020)

Abstract

A DNA sequence can be represented in several ways; one is to split it into its k-mer components. In this work, we exploit the high similarity between natural language and the “genomic sequence language”, both of which are character-based, to represent DNA sequences. We process a DNA sequence as a set of overlapping k-mers, each mapped to a word embedding using the Global Vectors (GloVe) representation; in a natural language processing context, k-mers play the role of words. The embedding representation helps to overcome the curse of dimensionality, one of the main issues of traditional methods that encode k-mer occurrences as one-hot vectors. Experiments on the first Critical Assessment of Metagenome Interpretation (CAMI) dataset demonstrated that our method is an efficient way to cluster metagenomic reads and predict their taxonomy. This method could be used as a first step in downstream metagenomic analysis.
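The abstract treats k-mers as words and feeds them to GloVe, which is trained on word co-occurrence statistics. The following is a minimal sketch of that preprocessing idea, not the authors' implementation: the function names `kmers` and `cooccurrence`, the stride of 1, the values `k=3` and `window=2`, and the example read are all assumptions chosen for illustration. The 1/distance weighting follows the convention described in the GloVe paper.

```python
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, using a stride of 1 (assumed)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cooccurrence(tokens, window):
    """Count symmetric co-occurrences of tokens at most `window` positions
    apart, weighting each pair by 1/distance."""
    counts = Counter()
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return counts


read = "ACGTAC"                      # hypothetical read, for illustration only
words = kmers(read, k=3)             # k-mers play the role of words
counts = cooccurrence(words, window=2)

# A one-hot encoding needs one dimension per possible k-mer, i.e. 4**k,
# which grows exponentially with k; a dense embedding keeps a fixed size.
one_hot_dim = 4 ** 3
```

A co-occurrence table of this kind is the input from which GloVe-style embeddings are fitted; a read could then be represented, for example, by combining the vectors of its k-mers, although the abstract does not specify how the authors aggregate them.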

Corresponding author

Correspondence to Brahim Matougui.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Matougui, B., Belhadef, H., Kitouni, I. (2021). An Approach Based Natural Language Processing for DNA Sequences Encoding Using the Global Vectors for Word Representation. In: Saeed, F., Mohammed, F., Al-Nahari, A. (eds) Innovative Systems for Intelligent Health Informatics. IRICT 2020. Lecture Notes on Data Engineering and Communications Technologies, vol 72. Springer, Cham. https://doi.org/10.1007/978-3-030-70713-2_53
