Hydropathy and Conformational Similarity-Based Distributed Representation of Protein Sequences for Properties Prediction

Hrushikesh Bhosale¹,
Ashwin Lahorkar²,
Divye Singh³,
Aamod Sane¹ &
…
Jayaraman Valadi ORCID: orcid.org/0000-0003-0185-9039¹

Abstract

In the natural language processing (NLP) community, conventional features such as TF-IDF are commonly employed for text mining and other applications. These features lack semantic and syntactic information. Researchers in text mining found that distributed representations of words can capture this information and increase the predictive power of algorithms. Word2Vec, which learns word embeddings from text, is a very popular distributed representation for NLP tasks. Recently, researchers have applied such distributed representations, notably ProtVec, to various protein function annotation tasks with considerable success. In this work, we develop reduced protein alphabet representations using two different reduction schemes and evaluate them on four regression tasks. Using all annotated Swiss-Prot sequences, we extracted embedding vectors with skip-gram models over a range of embedding vector sizes, k-mer sizes and context window sizes. We then used these vectors as input to the Support Vector Machines regression algorithm to build regression models. In this way we built seven different models: the original ProtVec model, a hydropathy-based reduced alphabet model, a conformational similarity-based reduced alphabet model, and all possible combinations of these three. The performance improvement on the absorption and enantioselectivity tasks indicates that grouping the amino acid alphabet on an appropriate basis can play a major role in enhancing algorithm capabilities. Our results on all four tasks indicate that individual reduced alphabet representations and certain synergistic combinations can considerably increase prediction performance. The new method offers multiple advantages, including improved semantic and syntactic information and more compact reduced representations. It can also provide important domain information that can guide further experiments to develop sequences with desired properties.
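The preprocessing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-class hydropathy grouping, the function names and the example sequence are all assumptions, and the frame-shifted k-mer tokenization follows the scheme commonly described for ProtVec. The resulting k-mer "sentences" would be the input to a skip-gram model, and the last helper simply enumerates the non-empty combinations of three base representations, which gives the seven models mentioned above.

```python
from itertools import combinations

# Illustrative three-class hydropathy grouping (an assumption for this sketch;
# the abstract does not list the grouping actually used by the authors).
HYDROPATHY_GROUPS = {
    "H": "ACFILMVW",  # hydrophobic
    "N": "GHPSTY",    # neutral
    "P": "DEKNQR",    # hydrophilic
}

# Invert the grouping into a residue -> group-letter lookup table.
RESIDUE_TO_GROUP = {
    residue: group
    for group, members in HYDROPATHY_GROUPS.items()
    for residue in members
}


def reduce_sequence(seq, mapping=RESIDUE_TO_GROUP, unknown="X"):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues missing from the mapping (e.g. ambiguity codes) become `unknown`.
    """
    return "".join(mapping.get(residue, unknown) for residue in seq.upper())


def kmer_sentences(seq, k=3):
    """Split a sequence into k frame-shifted lists of non-overlapping k-mers."""
    return [
        [seq[i:i + k] for i in range(offset, len(seq) - k + 1, k)]
        for offset in range(k)
    ]


def model_combinations(names):
    """Enumerate every non-empty combination of base representations."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


if __name__ == "__main__":
    reduced = reduce_sequence("MKVLAGS")  # hypothetical example sequence
    print(reduced)
    print(kmer_sentences(reduced, k=3))
    for combo in model_combinations(["ProtVec", "hydropathy", "conformational"]):
        print(" + ".join(combo))
```

A reduced alphabet shrinks the k-mer vocabulary sharply: with three groups and k = 3 there are at most 27 distinct k-mers, compared with 8000 for the full 20-letter alphabet, which is one way to read the abstract's claim of more compact representations.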

Data Availability

The software for the algorithms developed can be made available by writing to the corresponding author.

Funding

We declare that we did not receive funding from any agency for this work.

Author information

Authors and Affiliations

Department of Computer Science, FLAME University, Pune, Maharashtra, India
Hrushikesh Bhosale, Aamod Sane & Jayaraman Valadi
CMS SPPU, Pune, Maharashtra, India
Ashwin Lahorkar
Engineering for Research, Thoughtworks Technologies, Pune, Maharashtra, India
Divye Singh

Corresponding author

Correspondence to Jayaraman Valadi.

Ethics declarations

Conflict of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Enabling Innovative Computational Intelligence Technologies for IOT” guest edited by Omer Rana, Rajiv Misra, Alexander Pfeiffer, Luigi Troiano and Nishtha Kesswani.

About this article

Cite this article

Bhosale, H., Lahorkar, A., Singh, D. et al. Hydropathy and Conformational Similarity-Based Distributed Representation of Protein Sequences for Properties Prediction. SN COMPUT. SCI. 3, 61 (2022). https://doi.org/10.1007/s42979-021-00948-3

Received: 03 September 2021
Accepted: 18 October 2021
Published: 11 November 2021
DOI: https://doi.org/10.1007/s42979-021-00948-3
