Abstract
The management and analysis of big data has been identified as one of the most important emerging needs in recent years. This is because of the sheer volume and increasing complexity of data being created or collected. Current clustering algorithms can not handle big data, and therefore, scalable solutions are necessary. Since fuzzy clustering algorithms have shown to outperform hard clustering approaches in terms of accuracy, this paper investigates the parallelization and scalability of a common and effective fuzzy clustering algorithm named fuzzy c-means (FCM) algorithm. The algorithm is parallelized using the MapReduce paradigm outlining how the Map and Reduce primitives are implemented. A validity analysis is conducted in order to show that the implementation works correctly achieving competitive purity results compared to state-of-the art clustering algorithms. Furthermore, a scalability analysis is conducted to demonstrate the performance of the parallel FCM implementation with increasing number of computing nodes used.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Bell G, Hey AJG, Szalay A (2009) Beyond the data deluge. Science 323(AAAS):1297–1298. doi:10.112/science.1170411
Ghosh A, Jain LC (2005) Evolutionary computation in data mining series: studies in fuzziness and soft computing, vol 163. Springer, New York
Tan P, Steinbach M, Kumar V (2005) Introduction to data mining. Addison-Wesley, New York. ISBN: 0-321-32136-7
Jabeen H, Baig AR (2010) Review of classification using genetic programming. Int J Eng Sci Technol 2(2):94–103
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann Publishers Inc., San Francisco
Ludwig SA (2014) Clonal selection based fuzzy C-means algorithm for clustering. In: GECCO '14 Proceedings of the 2014 conference on genetic and evolutionary computation, pp 105–112
Babuska R (2014) Fuzzy clustering lecture. http://homes.di.unimi.it/ valenti/SlideCorsi/Bioinformatica05/Fuzzy-Clustering-lecture-Babuska. Accessed Oct 2014
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall Inc, Upper Saddle River
Ruspini EH (1970) Numerical methods for fuzzy clustering. Inf Sci 2:319350
Yang MS (1993) A survey of fuzzy clustering. Math Comput Model 18:1–16
Lee HS (1999) Automatic clustering of business process in business systems planning. Eur J Oper Res 114:354–362
Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic theory and application. Prentice Hall PTR, Upper Saddle River
Rosenfeld A (1975) Fuzzy graphs. In: Zadeh LA, Fu KS, Shimura M (eds) Fuzzy sets and their applications to cognitive and decision processes. Academic Press, New York
Matula DW (1970) Cluster analysis via graph theoretic techniques. In: Proceedings of the Louisiana conference on combinatorics, graph theory and computing, Winnipeg
Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57
Guerrero-Bote VP, Lopez-Pujalte C, de Moya-Anegon F, Herrero-Solana V (2003) Comparison of neural models for document clustering. Int J Approx Reason 34:287–305
Gath I, Geva AB (1989) Unsupervised optimal fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 11(7):773–781
Bezdek JC, Coray C, Gunderson R, Watson J (1981) Detection and characterization of cluster substructure—linear structure, fuzzy c-varieties and convex combinations thereof. SIAM J Appl Math 40(2):358–372
Yang Y, Huang S (2007) Image segmentation by fuzzy c-means clustering algorithm with a novel penalty term. Comput Inform 26:17–31
Cai W, Chen S, Zhang D (2007) Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation. Pattern Recognit 40(3):825–838
Sarma TH, Viswanath P, Reddy BE (2013) A hybrid approach to speed-up the k-means clustering method. Int J Mach Learn Cybern 4:107–113
Dean J, Ghemawat S (2004) Mapreduce: simplified data processing on large clusters. In: Proceedings of the 6th conference on symposium on operating systems design and implementation (OSDI’04), vol 6, p 10
He Y, Tan H, Luo W, Feng S, Fan J (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
Zhao W, Ma H, He Q (2009) Parallel k-means clustering based on mapreduce. In: Proceedings of the CloudCom’09. Springer, Berlin, pp 674–679
Zhou P, Lei J, Ye W (2011) Large-scale data sets clustering based on mapreduce and hadoop. Comput Inf Syst 7(16):5956–5963
Li H-G, Wu G-Q, Hu X-G, Zhang J, Li L, Wu X (2011) K-means clustering with bagging and mapreduce. In: Proceedings of the 44th Hawaii international conference on system sciences. IEEE Computer Society, Washington, DC, pp 1–8
Nair S, Mehta J (2011) Clustering with Apache Hadoop. In: Proceedings of the international conference, workshop on emerging trends in technology (ICWET’11), New York. ACM, New York, pp 505–509
Papadimitriou S, Sun J (2008) Disco: distributed co-clustering with map-reduce: a case study towards petabyte-scale end-to-end mining. In: Proceedings of the IEEE ICDM’08, Washington, DC, pp 512–521
Ene A, Im S, Moseley B (2011) Fast clustering using mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 681–689
Yang J, Li X (2013) Mapreduce based method for big data semantic clustering. In: Proceedings of the 2013 IEEE international conference on systems, man, and cybernetics (SMC’13). IEEE Computer Society, Washington, DC, pp 2814–2819
Cordeiro F, Traina Jr C, Traina AJM, Lopez J, Kang U, Taloutsos C (2011) Clustering very large multi-dimensional datasets with mapreduce. In: Proceedings of KDD’11. ACM, New York, pp 690–698
MPI (Message Passing Interface). http://www-unix.mcs.anl.gov/mpi/. Accessed 25 Apr 2015
PVM (Parallel Virtual Machine). http://www.csm.ornl.gov/pvm/. Accessed 25 Apr 2015
Modenesi MV, Costa MCA, Evsukoff AG, Ebecken NF (2007) Parallel fuzzy c-means cluster analysis. In: Lecture notes in computer science on high performance computing for computational science (VECPAR’06). Springer, New York
Blackard JA (1998) Comparison of neural networks and discriminant analysis in predicting forest cover types. Ph.D. dissertation, Department of Forest Sciences, Colorado State University, Fort Collins, Colorado
Apache Hadoop. http://hadoop.apache.org/. Accessed 25 Apr 2015
Han J (2005) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Karypis G (2003) CLUTO: a clustering toolkit. University of Minnesota, Computer Science. Tech. Rep. 02-017
Havens TC, Chitta R, Jain AK, Rong J (2011) Speedup of fuzzy and possibilistic kernel c-means for large-scale clustering. In: Proceedings of IEEE international conference on fuzzy systems (FUZZ), pp 463–470
Hathaway R, Bezdek J (1995) Optimization of clustering criteria by reformulation. IEEE Trans Fuzzy Syst 3:241245
Mahout Library. http://mahout.apache.org/. Accessed 25 Apr 2015
Acknowledgments
The author acknowledges the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Ludwig, S.A. MapReduce-based fuzzy c-means clustering algorithm: implementation and scalability. Int. J. Mach. Learn. & Cyber. 6, 923–934 (2015). https://doi.org/10.1007/s13042-015-0367-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s13042-015-0367-0