Abstract
Techniques for analyzing genome sequences in high performance environments to predict the function and structure of a protein have been developing. The function of a protein is determined by its characteristics and the sequence pattern, and a protein is classified as belonging to a family according to its genealogy and structure. This study determines the protein family of unknown proteins by analyzing the sequence database of the proteins, which is classified using a clustering algorithm. The analysis of the experimental clustering results verified that, by applying the proposed pf_cluster algorithm, the protein family of new proteins can be found using their sequence information.
Similar content being viewed by others
References
Bork P, Koonin EV (1998) Predicting functions from Protein sequences–where are the bottlenecks? Nat Genet 18(4):313–318
Chargaff E (1950) Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia 6:201–209
Watson JD, Crick FHC (1953) A structure for deoxyribose nucleic acid. Nature 171:737–738
Altschul SF (1990) Basic local alignment search tool. J Mol Biol 215.3:403–410
Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N (1999) The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 96:2896–2901
Wu CH (2003) Protein family classification and functional annotation. Comput Biol Chem 27(1):37–47
Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two Proteins. J Mol Biol 48(3):443–453
Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147(1):195–197
Enright AJ, Ouzounis CA (2000) GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 16(5):451–457
Yona G, Linial N, Linial M (1999) ProtoMap: automatic classification of protein sequences, a hierarchy of Protein families, and local maps of the Protein space. Proteins 37(3):360–378
Sasson O et al (2003) ProtoNet: hierarchical classification of the protein space. Nucleic Acids Res 31(1):348–352
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575–1584
Chen Y et al (2006) SEQOPTICS: a protein sequence clustering system. BMC Bioinformatics 7(Suppl 4):S10
Finn RD et al (2013) Pfam: the protein families database. Nucleic Acids Res. doi:10.1093/nar/gkt1223
Bateman A et al (2002) The Pfam protein families database. Nucleic Acids Res 30:276–280
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2013R1A1A2063006).
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Han, SH., Yi, G. High performance clustering algorithm for analysis of protein family clusters. J Supercomput 72, 1878–1896 (2016). https://doi.org/10.1007/s11227-016-1706-y
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11227-016-1706-y