
Prediction of structural classes for protein sequences and domains-Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy

Published: 01 December 2006

Abstract

This paper addresses computational prediction of protein structural classes. Although progress has been made in this field in recent years, the main drawback of the published prediction methods is the limited scope of their comparison procedures, which in some cases were also performed improperly. Two examples are the use of protein datasets of varying homology, which has a significant impact on prediction accuracy, and the comparison of methods in pairs using different datasets. Based on extensive experimental work, the main aim of this paper is to revisit and reevaluate the state of the art in this field. To this end, the paper performs a first-of-its-kind comprehensive, multi-goal study that investigates eight prediction algorithms, three protein sequence representations, three datasets with different homologies, and three test procedures. It evaluates the quality of several previously unused prediction algorithms, a newly proposed sequence representation, and a new-to-the-field testing procedure. Several important conclusions follow. First, the logistic regression classifier, which was not previously used, is shown to perform better than the other prediction algorithms, and the high quality of the previously used support vector machines is confirmed. The results also show that the proposed sequence representation improves the accuracy of the high-quality prediction algorithms but does not improve the results of the lower-quality classifiers. The study shows that the commonly used jackknife test is computationally expensive, and therefore the computationally less demanding 10-fold cross-validation procedure is proposed; the results show no statistically significant difference between the two procedures. The experiments show that sequence homology has a very significant impact on prediction accuracy: using highly homologous datasets results in higher accuracies. Thus, the results of several past studies that use homologous datasets should not be perceived as reliable. The best achieved prediction accuracy for low-homology datasets is about 57%, which confirms the results reported by Wang and Yuan [How good is the prediction of protein structural class by the component-coupled method? Proteins 2000;38:165-175]. For a highly homologous dataset, instance-based classification is shown to be better than the previously reported results. It achieved 97% prediction accuracy, demonstrating that homology is a major factor that can result in overestimated prediction accuracy.
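
The cost difference between the two test procedures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the round-robin fold assignment, the 1-nearest-neighbor rule (a simple stand-in for instance-based classification), and the toy data are all assumptions made for illustration. The sketch treats the jackknife test as leave-one-out, i.e. k-fold cross-validation with k equal to the number of samples, so it needs one training run per sample, whereas 10-fold cross-validation always needs ten.

```python
def kfold_splits(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.

    Samples are assigned to folds round-robin (an assumption of this sketch);
    real experiments would normally shuffle or stratify the data first.
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def jackknife_splits(n):
    """Jackknife (leave-one-out): k-fold with k == n, one held-out sample per run."""
    return kfold_splits(n, n)


def nearest_neighbor(train_x, train_y, x):
    """Predict the label of the closest training vector (squared Euclidean distance)."""
    best = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)),
    )
    return train_y[best]


def cross_validate(features, labels, splits, classify):
    """Return the fraction of held-out samples that classify() labels correctly."""
    correct = 0
    total = 0
    for train, test in splits:
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            correct += classify(train_x, train_y, features[i]) == labels[i]
            total += 1
    return correct / total


# Toy, made-up data: one feature per sample and two well-separated classes.
toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = ["a", "a", "a", "b", "b", "b"]

print("jackknife runs:", len(list(jackknife_splits(len(toy_x)))))
print("3-fold runs:   ", len(list(kfold_splits(len(toy_x), 3))))
print("jackknife accuracy:", cross_validate(toy_x, toy_y, jackknife_splits(len(toy_x)), nearest_neighbor))
print("3-fold accuracy:   ", cross_validate(toy_x, toy_y, kfold_splits(len(toy_x), 3), nearest_neighbor))
```

For a dataset of n sequences, the jackknife loop retrains the classifier n times, which is why the cheaper 10-fold procedure becomes attractive when the two give statistically indistinguishable accuracy estimates.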

References

[1]
Levitt, M. and Chothia, C., Structural patterns in globular proteins. Nature. v261. 552-557.
[2]
Gromiha, M. and Selvaraj, S., Protein secondary structure prediction in different structural classes. Protein Eng. v11. 249-251.
[3]
Chou, K.C. and Zhang, C.T., Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. v30. 275-349.
[4]
Bahar, I., Atilgan, A.R., Jernigan, R.L. and Erman, B., Understanding the recognition of protein structural classes by amino acid composition. Proteins. v29. 172-185.
[5]
Murzin, A., Brenner, S., Hubbard, T. and Chothia, C., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. v247. 536-540.
[6]
Nakashima, H., Nishikawa, K. and Ooi, T., The folding type of a protein is relevant to the amino acid composition. J. Biochem. v99. 153-162.
[7]
Klein, P. and Delisi, C., Prediction of protein structural class from the amino-acid sequence. Biopolymers. v25. 1659-1672.
[8]
Zhang, C.T. and Chou, K.C., An optimization approach to predicting protein structural class from amino-acid composition. Protein Sci. v1. 401-408.
[9]
Metfessel, B.A., Saurugger, P.N., Connelly, D.P. and Rich, S., Cross-validation of protein structural class prediction using statistical clustering and neural networks. Protein Sci. v2. 1171-1182.
[10]
Chou, K.C. and Zhang, C.T., Predicting protein-folding types by distance functions that make allowances for amino-acid interactions. J. Biol. Chem. v269. 22014-22020.
[11]
Dubchak, I., Muchnik, I., Holbrook, S.R. and Kim, S.H., Prediction of protein-folding class using global description of amino-acid sequence. Proc. Nat. Acad. Sci. v92. 8700-8704.
[12]
Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I. and Kim, S.H., Recognition of a protein fold in the context of the SCOP classification. Proteins. v35. 401-407.
[13]
Wang, Z.-X. and Yuan, Z., How good is the prediction of protein structural class by the component-coupled method?. Proteins. v38. 165-175.
[14]
Cai, Y., Is it a paradox or misinterpretation?. Proteins. v43. 336-338.
[15]
Cai, Y.D., Liu, X.J., Xu, X.B. and Chou, K.C., Support vector machines for prediction of protein domain structural class. J. Theor. Biol. v221. 115-120.
[16]
Jin, L., Fang, W. and Tang, H., Prediction of protein structural classes by a new measure of information discrepancy. Comput. Biol. Chem. v27. 373-380.
[17]
Chou, K.C. and Cai, Y.D., Predicting protein structural class by functional domain composition. Biochem. Biophys. Res. Commun. v321. 1007-1009.
[18]
Zhou, G. and Assa-Munt, N., Some insights into protein structural class prediction. Proteins. v44. 57-59.
[19]
Wang, Z.-X., The prediction accuracy for protein structural class by the component-coupled methods is around 60%. Proteins. v43. 339-340.
[20]
Chou, K.C., A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins. v21. 319-344.
[21]
Kabsch, W. and Sander, C., Dictionary of protein secondary structures: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. v22. 2577-2637.
[22]
Eisenhaber, F., Frömmel, C. and Argos, P., Prediction of secondary structural content of proteins from their amino acid composition alone, II the paradox with secondary structural class. Proteins. v25. 169-179.
[23]
Eisenhaber, F., Prediction of secondary structural contents of proteins from their amino acid composition alone, I new analytic vector decomposition methods. Proteins. v25 i2. 157-168.
[24]
Berman, H.M., The protein data bank. Nucleic Acids Res. v28. 235-242.
[25]
Chou, K.C. and Maggiora, G.M., Domain structural class prediction. Protein Eng. v11. 523-538.
[26]
Andreeva, A., Howorth, D., Brenner, S., Hubbard, T., Chothia, C. and Murzin, A., SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res. v32. D226-D229.
[27]
J. Grassmann, M. Reczko, S. Suhai, L. Edler, Protein fold class prediction-new methods of statistical classification, Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology (ISMB), 1999, pp. 106-112.
[28]
Ding, C.H. and Dubchak, I., Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics. v17. 349-358.
[29]
C. Leslie, E. Eskin, W. Stafford Noble, The spectrum kernel: a string kernel for SVM protein classification, Proceedings of the Pacific Symposium on Biocomputing, 2002, pp. 566-575.
[30]
Markowetz, F., Edler, L. and Vingron, M., Support vector machines for protein fold class prediction. Biometrical J. v45. 377-389.
[31]
Chou, K.C. and Zhang, C.T., A new approach to predicting protein folding types. J. Protein Chem. v12. 169-178.
[32]
Zhang, C.T., Chou, K.C. and Maggiora, G.M., Predicting protein structural classes from amino acid composition: application of fuzzy clustering. Protein Eng. v8. 425-435.
[33]
Bu, W.S., Feng, Z.P., Zhang, Z. and Zhang, C.T., Prediction of protein structural classes based on amino acid index. Eur. J. Biochem. v266. 1043-1049.
[34]
Luo, R., Feng, Z. and Liu, J., Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem. v269. 4219-4225.
[35]
Chou, K.C., Liu, W.M., Maggiora, G.M. and Zhang, C.T., Prediction and classification of domain structural classes. Proteins. v31. 97-103.
[36]
B. Rost, C. Sander, Third generation prediction of secondary structure, in: D.M. Webster (Ed.), Protein Structure Prediction: Methods and Protocols, 2000, pp. 71-95.
[37]
Finkelstein, A.V. and Ptitsyn, O., Statistical analysis of the correlation among amino acid residues in helical, β-structural and non-regular regions of globular proteins. J. Mol. Biol. v62. 613-624.
[38]
Chou, P.Y. and Fasman, G.D., Prediction of protein conformation. Biochemistry. v13. 211-215.
[39]
Zhang, C.T., Prediction of helix/strand content of globular proteins based on their primary sequences. Protein Eng. v11:11. 971-979.
[40]
Zhang, Z.D., Sun, Z.R. and Zhang, C.T., A new approach to predict the helix/strand content of globular proteins. J. Theor. Biol. v208. 65-78.
[41]
Lin, Z. and Pan, X-M., Accurate prediction of protein secondary structural content. J. Protein Chem. v20 i3. 217-220.
[42]
M.K. Ganapathiraju, et al., Characterization of protein secondary structure, IEEE Signal Process. Mag. (2004) 78-87.
[43]
Ruan, J., Wang, K., Yang, J., Kurgan, L. and Cios, K., Highly accurate and consistent method for prediction of helix and strand content from primary protein sequences. Artif. Intell. Med. v35 i1-2. 19-35.
[44]
I.V. Grigoriev, S.H. Kim, Detection of protein fold similarity based on correlation of amino acid properties, Proc. Nat. Acad. Sci. 96 (1999) 14318-14323.
[45]
Zhou, G., An intriguing controversy over protein structural class prediction. J. Protein Chem. v17. 729-738.
[46]
Li, W., Jaroszewski, L. and Godzik, A., Clustering of highly homologous sequences to reduce the size of large protein database. Bioinformatics. v17. 282-283.
[47]
Li, W., Jaroszewski, L. and Godzik, A., Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics. v18 i1. 77-82.
[48]
Hobohm, U. and Sander, C., Enlarged representative set of protein structures. Protein Sci. v3. 522
[49]
L.A. Kurgan, L. Homaeian, Prediction of secondary protein structure content from primary sequence alone-a feature selection based approach, Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition (MLDM 2005), Leipzig, Germany, LNAI 4587, 2005, pp. 334-345.
[50]
Oobatake, M. and Ooi, T., An analysis of non-bonded energy of proteins. J. Theor. Biol. v67. 567-584.
[51]
Cornette, J., Hydrophobicity scales and computational techniques for detecting amphipathic structures in protein. J. Mol. Biol. v195. 659-685.
[52]
Muskal, S.M. and Kim, S.H., Predicting protein secondary structure content: a tandem neural network approach. J. Mol. Biol. v225. 713-727.
[53]
H. Liu, R. Setiono, A probabilistic approach to feature selection-a filter solution, Proceedings of the 13th International Conference on Machine Learning, Italy, 1996, pp. 319-327.
[54]
Kohavi, R. and John, G., Wrappers for feature subset selection. Artif. Intell. v97 i1-2. 273-324.
[55]
M.A. Hall, Correlation-based feature subset selection for machine learning, Ph.D. Thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand, 1999.
[56]
Witten, I. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
[57]
G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, 1995, pp. 338-345.
[58]
Saha, A., Wu, C.L. and Tang, D.S., Approximation, dimension reduction and nonconvex optimization using linear superpositions of Gaussians. IEEE Trans. Comput. v42. 1222-1233.
[59]
Aha, D. and Kibler, D., Instance-based learning algorithms. Mach. Learn. v6. 37-66.
[60]
Quinlan, R., C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo.
[61]
Breiman, L., Random forests. Mach. Learn. v45 i1. 5-32.
[62]
W. Cohen, Fast effective rule induction, Proceedings of the 12th International Conference on Machine Learning, Lake Tahoe, CA, 1995, pp. 115-123.
[63]
Keerthi, S.S., Shevade, S.K., Bhattacharyya, C. and Murthy, K.R., Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput. v13 i3. 637-649.
[64]
le Cessie, S. and van Houwelingen, J.C., Ridge estimators in logistic regression. Appl. Stat. v41 i1. 191-201.



Published In

Pattern Recognition, Volume 39, Issue 12
December 2006
259 pages

Publisher

Elsevier Science Inc.
United States


Author Tags

1. Homology
2. Machine learning
3. Prediction
4. Protein structural class
5. SCOP
6. Secondary protein structure


Cited By

• (2019) Protein Tertiary Structure Prediction Based on Multiscale Recurrence Quantification Analysis and Horizontal Visibility Graph. Advances in Neural Networks – ISNN 2019, pp. 531-539. DOI: 10.1007/978-3-030-22808-8_52. Online publication date: 10-Jul-2019
• (2018) An improved method to enhance protein structural class prediction using their secondary structure sequences and genetic algorithm. International Journal of Bioinformatics Research and Applications 14:4, pp. 376-400. DOI: 10.5555/3282646.3282651. Online publication date: 1-Jan-2018
• (2017) A novel density-based ensemble learning algorithm with application to protein structural classification. Intelligent Data Analysis 21:1, pp. 167-179. DOI: 10.3233/IDA-150357. Online publication date: 1-Jan-2017
• (2017) Classification of Protein Structure Classes on Flexible Neutral Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14:5, pp. 1122-1133. DOI: 10.1109/TCBB.2016.2610967. Online publication date: 1-Sep-2017
• (2017) Feature selection by maximizing correlation information for integrated high-dimensional protein data. Pattern Recognition Letters 92:C, pp. 17-24. DOI: 10.1016/j.patrec.2017.03.011. Online publication date: 1-Jun-2017
• (2015) An ensemble classifier based prediction of G-protein-coupled receptor classes in low homology. Neurocomputing 154:C, pp. 110-118. DOI: 10.1016/j.neucom.2014.12.013. Online publication date: 22-Apr-2015
• (2015) A highly accurate protein structural class prediction approach using auto cross covariance transformation and recursive feature elimination. Computational Biology and Chemistry 59:PA, pp. 95-100. DOI: 10.1016/j.compbiolchem.2015.08.012. Online publication date: 1-Dec-2015
• (2014) Combining multiple clusterings for protein structure prediction. International Journal of Data Mining and Bioinformatics 10:2, pp. 162-174. DOI: 10.1504/IJDMB.2014.064012. Online publication date: 1-Jul-2014
• (2014) Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition. International Journal of Data Mining and Bioinformatics 9:1, pp. 89-105. DOI: 10.1504/IJDMB.2014.057776. Online publication date: 1-Nov-2014
• (2014) Combining multiple views. Engineering Applications of Artificial Intelligence 28, pp. 174-180. DOI: 10.1016/j.engappai.2013.11.004. Online publication date: 1-Feb-2014
