Abstract
A protein's tertiary structure, which is determined by its amino acid sequence through the protein folding process, plays an essential role in the protein's function. Protein fold recognition is therefore an important problem in bioinformatics. In this paper, we address it with a Feature Selection (FS) method based on the MapReduce framework and the Vortex Search Algorithm (VSA). FS is one of the most important data pre-processing steps and aims to select a subset of relevant features. For conventional data processed without parallelism, hundreds of feature selection and dimension reduction algorithms have been proposed, such as Principal Component Analysis and Linear Discriminant Analysis. However, these algorithms are not practical for real-world applications in which data grow along three dimensions (volume, velocity and variety), a setting known as Big Data: applying earlier feature selection methods to Big Data requires a large amount of complex computation. VSA is inspired by the vortex pattern created by the vortical flow of stirred fluids. In the MapReduce framework, the Map and Reduce functions are executed in parallel. In the proposed method, a VSA is employed in each step of the Map function to find an optimized subset of features and to shrink the feature search space. We evaluate the proposed method on the classification of a benchmark dataset for protein fold recognition. The experimental results indicate that the proposed method improves prediction accuracy considerably.
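The idea of running a vortex-style search inside each Map step can be illustrated with a minimal sketch. It is not the authors' implementation, and several parts are assumptions made only for illustration: the fitness function (a simple class-mean separation score with a per-feature penalty), the linear radius shrinkage (the published VSA uses a different decay schedule), clamping candidates to [0, 1], the majority vote in the Reduce step, and the toy data. The Map calls run sequentially here; in a real MapReduce deployment they would run in parallel on separate data partitions.

```python
import random
from statistics import mean

def fitness(mask, rows, labels, penalty=0.1):
    """Hypothetical filter score: class-mean separation minus a size penalty."""
    chosen = [j for j, keep in enumerate(mask) if keep]
    if not chosen:
        return float("-inf")  # an empty feature subset is never acceptable
    score = 0.0
    for j in chosen:
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        if a and b:
            score += abs(mean(a) - mean(b))
    return score - penalty * len(chosen)

def vortex_search(rows, labels, n_features, rng, iterations=30, candidates=50):
    """Simplified vortex-style search over feature masks.

    Candidates are sampled from a Gaussian around the current center, the best
    point found so far becomes the next center, and the radius shrinks
    (linearly here, which is an assumption of this sketch).
    """
    center = [0.5] * n_features
    best_point, best_mask, best_fit = center, None, float("-inf")
    for t in range(iterations):
        radius = 0.5 * (1 - t / iterations)
        for _ in range(candidates):
            point = [min(1.0, max(0.0, rng.gauss(c, radius))) for c in center]
            mask = [x > 0.5 for x in point]  # threshold to a binary mask
            fit = fitness(mask, rows, labels)
            if fit > best_fit:
                best_point, best_mask, best_fit = point, mask, fit
        center = best_point
    return best_mask

def map_phase(partition, seed):
    """Map step: select features on one data partition."""
    rows, labels = partition
    return vortex_search(rows, labels, len(rows[0]), random.Random(seed))

def reduce_phase(masks):
    """Reduce step (assumed): keep features chosen by a majority of mappers."""
    votes = [sum(m[j] for m in masks) for j in range(len(masks[0]))]
    return [j for j, v in enumerate(votes) if 2 * v > len(masks)]

def toy_data(n=60, seed=0):
    """Toy data: features 0 and 1 track the label, features 2 and 3 are constant."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    rows = [[y + rng.gauss(0, 0.1), y + rng.gauss(0, 0.1), 0.5, 0.5] for y in labels]
    return rows, labels

rows, labels = toy_data()
partitions = [(rows[i::3], labels[i::3]) for i in range(3)]
masks = [map_phase(p, seed=i) for i, p in enumerate(partitions)]
selected = reduce_phase(masks)
print("selected feature indices:", selected)
```

On this toy data the two label-tracking features should be kept and the constant ones dropped. With real protein descriptors, a wrapper fitness based on classifier accuracy would be a more natural choice than the filter score assumed here.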
Cite this article
Hekmatnia, E., Sajedi, H. & Habib Agahi, A. A parallel classification framework for protein fold recognition. Evol. Intel. 13, 525–535 (2020). https://doi.org/10.1007/s12065-020-00350-7
DOI: https://doi.org/10.1007/s12065-020-00350-7