DOI: 10.1145/3481646.3481649
research-article

Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Published: 26 November 2021

Abstract

Most popular Big Data analytics tools have evolved to adapt their working environments to extract valuable information from vast amounts of unstructured data. The ability of data mining techniques to filter this useful information from Big Data led to the term ‘Big Data Mining’. Shifting the scope of data from small, structured, and stable data to high-volume, unstructured, and rapidly changing data brings many data management challenges. Different tools cope with these challenges in their own ways because of their architectural limitations. Numerous parameters must be considered when choosing the right data management framework for the task at hand. In this paper, we present a comprehensive benchmark of two widely used Big Data analytics tools, Apache Spark and Hadoop MapReduce, on a common data mining task: classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure performance on the classification task. To the best of our knowledge, no previous study in the literature employs all of these metrics while taking task-specific concerns into consideration. We show that Spark is 5 times faster than MapReduce at training the model. Nevertheless, the performance of Spark degrades as the input workload grows. Scaling the environment with additional clusters significantly improves the performance of Spark; however, a similar improvement is not observed for Hadoop. The machine learning utility of MapReduce tends to achieve better accuracy scores than that of Spark, by around 2%-3%, even on small data sets.
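As a rough illustration of the three kinds of metrics named in the abstract (execution time, accuracy, and scalability), the following Python sketch shows one way such quantities could be computed. It is not the authors' benchmark code; the helper names timed, accuracy, speedup, and scaling_efficiency are hypothetical, and the numbers in the demonstration are placeholders chosen only to make the arithmetic easy to follow.

```python
import time
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Fraction of predicted labels that match the true labels."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate ran compared with the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / candidate_seconds


def scaling_efficiency(seconds_small: float, seconds_large: float,
                       nodes_small: int, nodes_large: int) -> float:
    """Observed speedup from adding nodes divided by the ideal linear speedup."""
    observed = speedup(seconds_small, seconds_large)
    ideal = nodes_large / nodes_small
    return observed / ideal


if __name__ == "__main__":
    # Placeholder durations (hypothetical, not measurements from the paper).
    print(speedup(500.0, 100.0))                    # 5.0
    # Placeholder labels: three of four predictions are correct.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))     # 0.75
    # Halving the run time while doubling the nodes is ideal linear scaling.
    print(scaling_efficiency(100.0, 50.0, 2, 4))    # 1.0
    # Timing an arbitrary stand-in workload.
    _, elapsed = timed(lambda: sum(range(1_000_000)))
    print(f"elapsed: {elapsed:.4f} s")
```

A scaling efficiency near 1.0 would indicate that added resources translate almost fully into shorter run times, while values well below 1.0 would indicate that extra resources are not being used effectively.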


Published In

ICCBDC '21: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing
August 2021
122 pages
ISBN: 9781450390408
DOI: 10.1145/3481646
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Big Data
2. Classification
3. Data Mining

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

ICCBDC 2021

Bibliometrics

Article Metrics

• Downloads (last 12 months): 70
• Downloads (last 6 weeks): 17

Reflects downloads up to 10 Dec 2024

Cited By

• (2024) A State-of-the-Art Review in Big Data Management Engineering: Real-Life Case Studies, Challenges, and Future Research Directions. Eng 5:3 (1266-1297). DOI: 10.3390/eng5030068. Online publication date: 3-Jul-2024
• (2024) Optimizing Data Processing: A Comparative Study of Big Data Platforms in Edge, Fog, and Cloud Layers. Applied Sciences 14:1 (452). DOI: 10.3390/app14010452. Online publication date: 4-Jan-2024
• (2023) Fast DRL-based scheduler configuration tuning for reducing tail latency in edge-cloud jobs. Journal of Cloud Computing: Advances, Systems and Applications 12:1. DOI: 10.1186/s13677-023-00465-z. Online publication date: 17-Jun-2023
• (2023) Apache Spark Big data Analysis, Performance Tuning, and Spark Application Optimization. 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT) (1-8). DOI: 10.1109/EASCT59475.2023.10393086. Online publication date: 20-Oct-2023
• (2023) Performance Analysis of the Distributed Support Vector Machine Algorithm Using Spark for Predicting Flight Delays. E3S Web of Conferences 465 (02037). DOI: 10.1051/e3sconf/202346502037. Online publication date: 18-Dec-2023
• (2022) Expectativas en torno a Big Data. HUMAN REVIEW. International Humanities Review / Revista Internacional de Humanidades 11:Monográfico (1-10). DOI: 10.37467/revhuman.v11.3840. Online publication date: 5-Dec-2022
