Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

Authors:

Ali Cakmak

ICCBDC '21: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing

Pages 15 - 20

https://doi.org/10.1145/3481646.3481649

Published: 26 November 2021

Abstract

Most popular Big Data analytics tools have evolved to adapt their working environments so that they can extract valuable information from vast amounts of unstructured data. The ability of data mining techniques to filter this helpful information from Big Data led to the term ‘Big Data Mining’. Shifting the scope from small, structured, and stable data to high-volume, unstructured, and quickly changing data brings many data management challenges. Different tools cope with these challenges in their own way because of their architectural limitations. Numerous parameters must be considered when choosing the right data management framework for the task at hand. In this paper, we present a comprehensive benchmark for two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure performance on the classification task. To the best of our knowledge, no previous study in the literature employs all these metrics while taking task-specific concerns into consideration. We show that Spark is 5 times faster than MapReduce in training the model. Nevertheless, the performance of Spark degrades as the input workload gets larger. Scaling the environment with additional clusters significantly improves the performance of Spark. However, a similar improvement is not observed in Hadoop. The machine learning utility of MapReduce tends to achieve better accuracy scores than that of Spark, by around 2%-3%, even on small data sets.
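The three metric families named in the abstract (execution time, accuracy, and scalability) can be illustrated with a minimal Python sketch. The function names and the definition of scaling efficiency below are assumptions made for illustration only; the abstract does not state how the authors compute these metrics, and the numbers in the usage section are placeholders, not results from the paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate run is than the baseline run."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("durations must be positive")
    return baseline_seconds / candidate_seconds


def scaling_efficiency(seconds_before: float, seconds_after: float,
                       nodes_before: int, nodes_after: int) -> float:
    """Observed speedup from adding nodes divided by the ideal linear speedup.

    This definition is an assumption of this sketch, not taken from the paper.
    """
    if nodes_before <= 0 or nodes_after <= 0:
        raise ValueError("node counts must be positive")
    ideal = nodes_after / nodes_before
    return speedup(seconds_before, seconds_after) / ideal


if __name__ == "__main__":
    # Placeholder values for illustration only; NOT measurements from the paper.
    print(speedup(baseline_seconds=500.0, candidate_seconds=100.0))
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
    print(scaling_efficiency(100.0, 50.0, nodes_before=2, nodes_after=4))
```

A scaling efficiency near 1.0 would indicate that doubling the resources roughly halves the training time, while values well below 1.0 would indicate that added resources yield little improvement.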

Information & Contributors

Information

Published In

ICCBDC '21: Proceedings of the 2021 5th International Conference on Cloud and Big Data Computing

August 2021

122 pages

ISBN: 9781450390408

DOI: 10.1145/3481646

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICCBDC 2021: 2021 5th International Conference on Cloud and Big Data Computing

August 13 - 15, 2021

Liverpool, United Kingdom

Bibliometrics & Citations

Total Citations: 6

Total Downloads: 217

Downloads (Last 12 months): 70

Downloads (Last 6 weeks): 17

Reflects downloads up to 10 Dec 2024

Citations

Cited By

Theodorakopoulos L, Theodoropoulou A, Stamatiou Y. (2024). A State-of-the-Art Review in Big Data Management Engineering: Real-Life Case Studies, Challenges, and Future Research Directions. Eng 5:3 (1266-1297). Online publication date: 3-Jul-2024. https://doi.org/10.3390/eng5030068

Shwe T, Aritsugi M. (2024). Optimizing Data Processing: A Comparative Study of Big Data Platforms in Edge, Fog, and Cloud Layers. Applied Sciences 14:1 (452). Online publication date: 4-Jan-2024. https://doi.org/10.3390/app14010452

Wen S, Han R, Liu C, Chen L. (2023). Fast DRL-based scheduler configuration tuning for reducing tail latency in edge-cloud jobs. Journal of Cloud Computing: Advances, Systems and Applications 12:1. Online publication date: 17-Jun-2023. https://dl.acm.org/doi/10.1186/s13677-023-00465-z

Karthikeya Sahith C, Muppidi S, Merugula S. (2023). Apache Spark Big data Analysis, Performance Tuning, and Spark Application Optimization. 2023 International Conference on Evolutionary Algorithms and Soft Computing Techniques (EASCT) (1-8). Online publication date: 20-Oct-2023. https://doi.org/10.1109/EASCT59475.2023.10393086

Khotimah H, Wilda Al Aluf B, Ari Rifqi M, Hernawan A, Satya Nugraha G. (2023). Performance Analysis of the Distributed Support Vector Machine Algorithm Using Spark for Predicting Flight Delays. E3S Web of Conferences 465 (02037). Online publication date: 18-Dec-2023. https://doi.org/10.1051/e3sconf/202346502037

López Núñez J, Cerda-Neumann G. (2022). Expectativas en torno a Big Data. HUMAN REVIEW. International Humanities Review / Revista Internacional de Humanidades 11:Monográfico (1-10). Online publication date: 5-Dec-2022. https://doi.org/10.37467/revhuman.v11.3840
