Abstract
Remote homology detection is a key element of protein structure and function analysis in computational and experimental biology. This paper presents a simple representation of protein sequences that uses the evolutionary information in profiles for efficient remote homology detection. Frequency profiles are calculated directly from the multiple sequence alignments produced by PSI-BLAST and converted into binary profiles using a probability threshold. These binary profiles serve as a new building block for protein sequences. Each protein sequence is mapped to a high-dimensional vector whose entries are the number of occurrences of each binary profile. The resulting vectors are used to train support vector machine (SVM) classifiers, which then classify the test protein sequences. The method is further improved by applying an efficient feature extraction algorithm from natural language processing, namely the latent semantic analysis (LSA) model. Testing on the SCOP 1.53 database shows that the method based on binary profiles outperforms those based on many other basic building blocks, including N-grams, patterns and motifs. The ROC50 score is 0.698, which is higher than that of the other methods by nearly 10 percent.
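The mapping from frequency profiles to count vectors described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the amino acid column order, the strict "greater than" comparison, the threshold value of 0.15, and all function names and toy frequencies are assumptions made for illustration.

```python
from collections import Counter

# Assumed column order for each profile row (hypothetical choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binarize_profile(freq_profile, threshold):
    """Convert each row of amino acid frequencies into a tuple of 0/1 flags.

    A flag is 1 when the frequency exceeds the threshold (assumed rule).
    """
    return [
        tuple(1 if f > threshold else 0 for f in row)
        for row in freq_profile
    ]


def binary_profile_word(bits):
    """Readable label for a binary profile: the residues whose flag is set."""
    return "".join(aa for aa, b in zip(AMINO_ACIDS, bits) if b)


def count_vector(binary_rows, vocabulary):
    """Number of occurrences of each binary profile, in vocabulary order."""
    counts = Counter(binary_rows)
    return [counts[v] for v in vocabulary]


def make_row(**freqs):
    """Helper for toy data: build a 20-value row from keyword frequencies."""
    return [freqs.get(aa, 0.0) for aa in AMINO_ACIDS]


# Toy frequency profile for a three-residue sequence (made-up numbers).
toy_profile = [
    make_row(A=0.6, G=0.3, S=0.1),
    make_row(L=0.5, I=0.4, V=0.1),
    make_row(A=0.7, G=0.2, S=0.1),
]

binary_rows = binarize_profile(toy_profile, threshold=0.15)
words = [binary_profile_word(bits) for bits in binary_rows]

# In practice the vocabulary would be collected over the whole training set;
# here it is just the distinct binary profiles of this one sequence.
vocabulary = sorted(set(binary_rows))
vector = count_vector(binary_rows, vocabulary)

print(words)
print(vector)
```

In this toy case the first and third positions collapse to the same binary profile, so the sequence is described by two distinct "words" rather than three. Vectors of this kind, built over a shared vocabulary, are what would be passed to an SVM, optionally after dimensionality reduction with latent semantic analysis.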
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Dong, Q., Lin, L., Wang, X. (2007). Protein Remote Homology Detection Based on Binary Profiles. In: Hochreiter, S., Wagner, R. (eds) Bioinformatics Research and Development. BIRD 2007. Lecture Notes in Computer Science, vol 4414. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71233-6_17
DOI: https://doi.org/10.1007/978-3-540-71233-6_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71232-9
Online ISBN: 978-3-540-71233-6
eBook Packages: Computer Science (R0)