An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster
<p>A typical Spark cluster architecture.</p> "> Figure 2
<p>Amdahl’s law for various serial factors and numbers of executors.</p> "> Figure 3
<p>Gustafson’s law for various percentages of serial work.</p> "> Figure 4
<p>A 2D plate’s homogeneous node communication.</p> "> Figure 5
<p>Communication model based on fully connected graphs.</p> "> Figure 6
<p>Schematic diagram of the Hadoop cluster used in the experiment.</p> "> Figure 7
<p>Spark stages DAG of WC.</p> "> Figure 8
<p>Spark stages DAG of Kmeans.</p> "> Figure 9
<p>Spark stages DAG of SVM.</p> "> Figure 10
<p>Spark stages DAG of NWeight.</p> "> Figure 11
<p>Spark stages DAG of PageRank.</p> "> Figure 12
<p>Single executor runtime complexity with different sizes.</p> "> Figure 13
<p>Fitting the model to WordCount workload for amount of data.</p> "> Figure 14
<p>Fitting the model to SVM workload with different dataset sizes.</p> "> Figure 15
<p>Fitting the model to PageRank workload for amount of data.</p> "> Figure 16
<p>Fitting the model to Kmeans workloads with different sizes.</p> "> Figure 17
<p>Fitting the model to Graph (NWeight) workloads of different sizes.</p> ">
Abstract
:1. Introduction
- We introduced two distinct parallelisation models for performance prediction of Spark jobs on Hadoop cluster. Each model is based on a different communication pattern between the nodes of a Hadoop cluster.
- We accomplished extensive experimental work. The authors analysed and verified the performance pattern based on two main parameters, the number of executors and the amount of data for each job. The data reliability was verified by running each workload at least three times.
- We evaluated our models on five HiBench workloads in order to test the data fitting accuracy. Our results show that the experimental data fitted one of the models accurately, and the fitness was compared with Amdhal’s law, Gustafson’s law, and Ernest’s model. The data fitness was compared based on two criteria, Rsquared and RRSE.
2. Apache Spark Platform
3. Related Work
4. Parallelisation Models
4.1. Amdahl’s Law and Gustafson’s Law
4.2. A Model Using a 2D Plate Communication Pattern
5. An Enhanced Model for Runtime Prediction
6. Experiments
6.1. Experimental Setup
6.2. Experiment Performance Evaluation
6.3. Configuration of Parameters
7. Results and Analysis
7.1. Procedure to Fit Equations
7.2. Finding the Approximate Algorithm Complexity ()
7.3. The Full Model Fitting
7.4. Evaluation of the Fitting Errors
7.5. Benefits of the Proposed Models
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
SVM | Support vector machines |
API | Application programming interface |
SQL | Structured query language |
HDFS | Hadoop distributed file system |
RDD | Resilient distributed datasets |
MLlib | Machine learning library |
CPU | Central processing unit |
I/O | input/output |
UC | University of california |
AMP | Algorithms, machines and people |
DAG | Directed acyclic graph |
YARN | Yet another resource negotiator |
PERIDOT | Performance predIction moDel fOr Spark applicaTions |
NEXEC | Number of executor |
SQRT | Square root |
2D | Two dimensional |
GHz | Gigahertz |
TB | Terabyte |
RAM | Random access memory |
DDR | Double data rate |
GB | Gigabyte |
MB | Megabyte |
WC | WordCount |
Exec | Executor |
MOP | Multi-object optimization |
References
- Katal, A.; Wazid, M.; Goudar, R.H. Big data: Issues, challenges, tools and good practices. In Proceedings of the 2013 Sixth international conference on contemporary computing (IC3), Nodia, India, 8–10 August 2013; pp. 404–409. [Google Scholar]
- Dean, J.; Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Commun. ACM 2008, 1, 107–113. [Google Scholar] [CrossRef]
- Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Ma, J.; McCauly, M.; Franklin, M.J.; Shenker, S.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th Symposium on Networked Systems Design and Implementation(NSDI), San Jose, CA, USA, 25 April 2012; pp. 15–28. [Google Scholar]
- Mazhar Javed, A.; Rafia Asad, K.; Haitham, N.; Awais, Y.; Syed Muhammad, A.; Usman, N.; Vishwa Pratap, S. A Recommendation Engine for Predicting Movie Ratings Using a Big Data Approach. Electronics 2021, 10, 1215. [Google Scholar]
- Zaharia, M.; Xin, R.S.; Wendell, P.; Das, T.; Armbrust, M.; Dave, A.; Meng, X.; Rosen, J.; Venkataraman, S.; Franklin, M.J.; et al. Apache spark: A unified engine for big data processing. Commun. ACM 2016, 59, 56–65. [Google Scholar] [CrossRef]
- Meng, X.; Bradley, J.; Yavuz, B.; Sparks, E.; Venkataraman, S.; Liu, D.; Freeman, J.; Tsai, D.B.; Amde, M.; Owen, S.; et al. Mllib: Machine learning in apache spark. J. Mach. Learn. Res. 2016, 17, 1235–1241. [Google Scholar]
- Kroß, J.; Krcmar, H. PerTract: Model Extraction and Specification of Big Data Systems for Performance Prediction by the Example of Apache Spark and Hadoop. Big Data Cogn. Comput. 2019, 3, 47. [Google Scholar] [CrossRef] [Green Version]
- Petridis, P.; Gounaris, A.; Torres, J. Spark Parameter Tuning via Trial-and-Error. In Proceedings of the INNS Conference on Big Data; Springer: Cham, Switzerland, 2016; pp. 226–237. [Google Scholar]
- Herodotou, H.; Lim, H.; Luo, G.; Borisov, N.; Dong, L.; Cetin, F.B.; Babu, S. Starfish: A self-tuning system for big data analytics. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, USA, 9–12 January 2011; pp. 261–272. [Google Scholar]
- Mustafa, S.; Elghandour, I.; Ismail, M.A. A machine learning approach for predicting execution time of spark jobs. Alex. Eng. J. 2018, 57, 3767–3778. [Google Scholar] [CrossRef]
- Cheng, G.; Ying, S.; Wang, B. Tuning configuration of apache spark on public clouds by combining multi-objective optimization and performance prediction model. J. Syst. Softw. 2021, 180, 111028. [Google Scholar] [CrossRef]
- Wang, G.; Xu, J.; He, B. A novel method for tuning configuration parameters of spark based on machine learning. In Proceedings of the 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Sydney, NSW, Australia, 12–14 December 2016; pp. 586–593. [Google Scholar]
- Wilkinson, B.; Allen, M. Parallel Programming, 2nd ed.; Prentice Hall: Hoboken, NJ, USA, 1999; p. 268. [Google Scholar]
- Ahmed, N.; Barczak, A.L.; Rashid, M.A.; Susnjak, T. A Parallelization Model for Performance Characterization of Spark Big Data Jobs on Hadoop Clusters. J. Big Data 2021, 8, 1–28. [Google Scholar] [CrossRef]
- Mavridis, I.; Karatza, H. Performance evaluation of cloud-based log file analysis with apache hadoop and apache spark. J. Syst. Softw. 2017, 125, 133–151. [Google Scholar] [CrossRef]
- Apache Spark Market Share. Available online: https://www.datanyze.com/market-share/big-data-processing–204/apache-spark-market-share (accessed on 9 October 2021).
- Companies using Apache Spark. Available online: https://enlyft.com/tech/products/apache-spark (accessed on 9 October 2021).
- Apache Spark Overview 2.4.4. RDD Programming Guide. Available online: https://spark.apache.org/docs/2.4.4/ (accessed on 7 August 2020).
- Chen, Y.; Goetsch, P.; Hoque, M.A.; Lu, J.; Tarkoma, S. d-simplexed: Adaptive delaunay triangulation for performance modeling and prediction on big data analytics. IEEE Trans. Big Data 2019, 1–12. [Google Scholar] [CrossRef]
- Vavilapalli, V.K.; Murthy, A.C.; Douglas, C.; Agarwal, S.; Konar, M.; Evans, R.; Graves, T.; Lowe, J.; Shah, H.; Seth, S. Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, Santa Clara, CA, USA, 1–3 October 2013; pp. 1–16. [Google Scholar]
- Al-Sayeh, H.; Hagedorn, S.; Sattler, K.U. A gray-box modeling methodology for runtime prediction of apache spark jobs. Distrib. Parallel Databases 2020, 38, 1–21. [Google Scholar] [CrossRef] [Green Version]
- Assefi, M.; Behravesh, E.; Liu, G.; Tafti, A.P. Big data machine learning using apache spark mllib. In Proceedings of the 2017 IEEE International Conference on Big Data (Big Data), Boston, MA, USA, 11–14 December 2017; pp. 3492–3498. [Google Scholar]
- Taneja, R.; Krishnamurthy, R.B.; Liu, G. Optimization of machine learning on apache spark. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), Las Vegas, NV, USA, 25–28 July 2016; pp. 163–167. [Google Scholar]
- Gounaris, A.; Torres, J. A methodology for spark parameter tuning. Big Data Res. 2018, 11, 22–32. [Google Scholar] [CrossRef] [Green Version]
- Javaid, M.U.; Kanoun, A.A.; Demesmaeker, F.; Ghrab, A.; Skhiri, S. A Performance Prediction Model for Spark Applications. In Proceedings of the International Conference on Big Data, Honolulu, HI, USA, 18–20 September 2020; pp. 13–22. [Google Scholar]
- Gulino, A.; Canakoglu, A.; Ceri, S.; Ardagna, D. Performance Prediction for Data-driven Workflows on Apache Spark. In Proceedings of the 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOT.S.), Nice, France, 17–19 November 2020; pp. 1–8. [Google Scholar]
- Cheng, G.; Ying, S.; Wang, B.; Li, Y. Efficient performance prediction for apache spark. J. Parallel Distrib. Comput. 2021, 149, 40–51. [Google Scholar] [CrossRef]
- Aziz, K.; Zaidouni, D.; Bellafkih, M. Leveraging resource management for efficient performance of apache spark. J. Big Data 2019, 6, 1–23. [Google Scholar] [CrossRef] [Green Version]
- Boden, C.; Spina, A.; Rabl, T.; Markl, V. Benchmarking data flow systems for scalable machine learning. In Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, Chicago, IL, USA, 19 May 2017; pp. 1–10. [Google Scholar]
- Maros, A.; Murai, F.; da Silva, A.P.C.; Almeida, M.J.; Lattuada, M.; Gianniti, E.; Hosseini, M.; Ardagna, D. Machine learning for performance prediction of spark cloud applications. In Proceedings of the 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 8–13 July 2019; pp. 99–106. [Google Scholar]
- Venkataraman, S.; Yang, Z.; Franklin, M.; Recht, B.; Stoica, I. Ernest:Efficient performance prediction for large-scale advanced analytics. In Proceedings of the 13th Symposium on Networked Systems Design and Implementation (NSDI), Santa Clara, CA, USA, 16–18 March 2016; pp. 363–378. [Google Scholar]
- Amannejad, Y.; Shah, S.; Krishnamurthy, D.; Wang, M. Fast and lightweight execution time predictions for spark applications. In Proceedings of the 2019 IEEE 12th International Conference on Cloud Computing (CLOUD), Milan, Italy, 8–13 July 2019; pp. 493–495. [Google Scholar]
- Shah, S.; Amannejad, Y.; Krishnamurthy, D.; Wang, M. Quick execution time predictions for spark applications. In Proceedings of the 2019 15th International Conference on Network and Service Management (CNSM), Halifax, NS, Canada, 21–25 October 2019. [Google Scholar]
- Ahmed, N.; Barczak, A.L.; Susnjak, T.; Rashid, M.A. A comprehensive performance analysis of apache hadoop and apache spark for large scale data sets using HiBench. J. Big Data 2020, 7, 1–18. [Google Scholar] [CrossRef]
- Chao, Z.; Shi, S.; Gao, H.; Luo, J.; Wang, H. A gray-box performance model for apache spark. Future Gener. Comput. Syst. 2018, 89, 58–67. [Google Scholar] [CrossRef]
- Amdahl, G.M. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the AFIPS ’67 (Spring): Spring Joint Computer Conference, Sunnyvale, CA, USA, 18–20 April 1967; pp. 483–485. [Google Scholar]
- Gustafson, J.L. Reevaluating amdahl’s law. Commun. ACM 1988, 31, 532–533. [Google Scholar] [CrossRef] [Green Version]
- Barczak, A.L.; Messom, C.H.; Johnson, M.J. Performance characteristics of a cost-effective medium-sized beowulf cluster supercomputer. In Proceedings of the International Conference on Computational Science, Melbourne, Australia, 2–4 June 2003; pp. 1050–1059. [Google Scholar]
- HiBench Suite. Available online: https://github.com/Intel-bigdata/HiBench (accessed on 4 June 2019).
- Huang, S.; Huang, J.; Dai, J.; Xie, T.; Huang, B. The HiBench benchmark suite: Characterization of the mapreduce-based data analysis. In Proceedings of the 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), Long Beach, CA, USA, 1–6 March 2010; pp. 41–51. [Google Scholar]
- Zhao, Y.; Hu, F.; Chen, H. An adaptive tuning strategy on spark based on in-memory computation characteristics. In Proceedings of the 2016 18th International Conference on Advanced Communication Technology (ICACT), PyeongChang, Korea, 31 January–3 February 2016; pp. 484–488. [Google Scholar]
- Marcu, O.C.; Costan, A.; Antoniu, G.; Perez-Hernandez, M. Spark versus flink: Understanding performance in big data analytics frameworks. In Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER), Taipei, Taiwan, 12–16 September 2016; pp. 433–442. [Google Scholar]
- Williams, T.; Kelley, C. Gnuplot 5.4: An Interactive Plotting Program. 2020. Available online: http://gnuplot.sourceforge.net/ (accessed on 7 July 2021).
- Bottou, L.; Lin, C.-J. Support vector machine solvers. Large Scale Kernel Mach. 2007, 3, 301–320. [Google Scholar]
- Li, X.; Fang, Z. Parallel clustering algorithms. Parallel Comput. 1989, 11, 275–290. [Google Scholar] [CrossRef]
- Chen, P.; Xie, H.; Maslov, S.; Redner, S. Finding scientific gems with Google’s PageRank algorithm. J. Inf. 2007, 1, 8–15. [Google Scholar] [CrossRef] [Green Version]
- Goel, A.; Munagala, K. Complexity measures for map-reduce, and comparison to parallel computing. arXiv 2012, arXiv:1211.6526. [Google Scholar]
- Tomita, E.; Tanaka, A.; Takahashi, H. The worst-case time complexity for generating all maximal cliques and computational experiments. Theor. Comput. Sci. 2006, 363, 28–42. [Google Scholar] [CrossRef] [Green Version]
- James, G.; Witten, D.; Hatie, T.; Tibshirani, R. An Introduction to Statistical Learning, 2nd ed.; Springer: New York, NY, USA, 2021. [Google Scholar]
References | Approach/Method | System/Environments |
---|---|---|
Cheng et al. [27] | Machine Learning | Efficient performance prediction for Apache Spark. |
Ahmed et al. [34] | Comprehensive Trial-and-Error | Apache Hadoop and Apache Spark for large scale datasets. |
Al-Sayeh et al. [21] | Gray-box modelling | Runtime prediction of Spark jobs. |
Shah et al. [33] | PERIDOT | Quick execution time predictions for Spark applications. |
Aziz et al. [28] | Machine Learning | Resource management for efficient performance of Apache Spark. |
Gounaris et al. [24] | Alternative Systematic | Spark parameter tuning. |
Mustafa et el. [10] | Machine Learning | Predicting execution time of Spark jobs. |
Chao et al. [35] | Gray-box modelling (Machine Learning) | Spark performance model for accuracy improvements. |
Petridis et al. [8] | Trial-and-Error | Spark parameter tuning. |
Server Configuration | |
---|---|
Processor | 2.9 GHz |
Main memory | 64 GB |
Storage | 10 TB |
Node Configuration | |
CPU | Intel (R) Xeon (R) CPU E3-1231 [email protected] GHz |
Main memory | 32 GB |
Number of Nodes | 9 |
Storage | 6 TB each, 54 TB total |
CPU cores | 8 each, 72 total |
Software | |
Operating System | Ubuntu 16.04.2 (GNU/Linux 4.13.0-37-generic x86 64) |
Hadoop | 2.4.0 |
Spark | 2.1.0 |
JDK | 1.7.0 |
Benchmark Categories | Application | Input Data Size | Input Samples | |
---|---|---|---|---|
Micro Benchmark | WordCount | Multiple-Exec. | Single-Exec. | - |
313 MB, 940 MB, 5.9 GB, 8.8 GB, and 19.2 GB | 3 GB, 5 GB, 7 GB, 10 GB, 12.8 GB, 14.4 GB, 16 GB, 18 GB, and 21.6 GB | |||
Machine Learning | Kmeans | 19 GB, 56 GB, 94 GB, 130 GB, and 168 GB | 1 GB, 38 GB, 75 GB, 113 GB, 149 GB, and 187 GB | 10, 30, 50, 70, and 90 (million samples) |
SVM | 34 MB, 60 MB, 1.2 GB, 1.8 GB and 2 GB | 200 MB, 400 MB, 600 MB, 800 MB, 1.35 GB, 2 GB, 2.3 GB, and 2.5 GB | 2100, 2600, 3600, 4100, and 5100 (samples) | |
Web Search | PageRank | 507 MB, 1.6 GB, 2.8 GB, 4 GB, and 5 GB | 100 MB, 250 MB, 750 MB, 6 GB, 7 GB, 8 GB, 9 GB, and 10 GB | 1, 3, 5, 7, and 9 (million of pages) |
Graph | NWeight | 37 MB, 70 MB, 129 MB, 155 MB, and 211 MB | 20 MB, 55 MB, 99 MB, 141 MB, 175 MB, 214 MB, 247 MB, 262 MB, and 286 MB | 1, 2, 4, 5, and 7 (million of edges) |
Workloads | Stages | Parallel Stages | Collect | Serialization | Deserialization | Shuffle | Aggregate |
---|---|---|---|---|---|---|---|
WC | 2 | no | yes | - | - | yes | - |
SVM | 209 | no | yes | no | yes | yes | yes |
NWeight | 9 | yes | - | no | yes | yes | - |
Kmeans | 20 | no | yes | yes | yes | yes | - |
PageRank | 5 | no | - | no | yes | yes | - |
Parameters | Default | Range | Description |
---|---|---|---|
Spark.executor.memory | 1 | 12 | Amount of memory to use per executor process, in GB. |
Spark.executor.cores | 1 | 2–14 | The number of cores to use on each executor. |
Spark.driver.memory | 1 | 4 | Amount of memory to use for the driver process, in GB. |
Spark.driver.cores | 1 | 3 | The Number of cores to use for the driver process. |
Spark.shuffle.file.buffer | 32 | 48 | Size of the in-memory buffer for each shuffle file output stream, in KB. |
Spark.reducer.maxSizeInFlight | 48 | 96 | Maximum size of map outputs to fetch simultaneously from each reduce task, in MB. |
Spark.memory.fraction | 0.6 | 0.1–0.4 | Fraction of heap space used for execution and storage. |
Spark.memory.storageFraction | 0.5 | 0.1-0.4 | Amount of storage memory immune to eviction expressed as a fraction of the size of the region. |
Spark.task.maxFailures | 4 | 5 | Number of failures of any particular the task before giving up on the job. |
Spark.speculation | False | True/ False | If set to “true” performs speculative execution of tasks. |
Spark.rpc.message.maxSize | 128 | 256 | Maximum message size to allow in “control plane” communication, in MB. |
Spark.io.compression.codec | snappy | lz4/lzf/snappy | Compress map output files. |
Spark.io.compression.snappy.blockSize | 32 | 32–128 | Block size in Snappy compression, in KB |
Workload | Theoretical Time Complexity | Single Executor Best Fit |
---|---|---|
WordCount | [47] | linear |
SVM | [44] | quadratic |
PageRank | [46] | linear |
Kmeans | or [45] | linear |
NWeight | or [48] | quadratic |
Workload | Best Fit | Equations (15) or (16) | Equations (15) or (16) (c = 1) | Amdhal Equation (3) | Gustafson Equation (5) | Equations (8) or (9) | Ernest [31] |
---|---|---|---|---|---|---|---|
Wordcount | linear | 0.996 | 0.996 | 0.997 | 0.996 | 0.997 | 0.995 |
SVM | quadrat. | 0.917 | 0.912 | 0.906 | 0.887 | 0.917 | 0.847 |
PageRank | linear | 0.990 | 0.989 | 0.990 | 0.989 | 0.990 | 0.988 |
Kmeans | linear | 0.992 | 0.992 | 0.992 | 0.993 | 0.993 | 0.992 |
NWeight | quadrat. | 0.964 | 0.964 | 0.956 | 0.965 | 0.966 | 0.950 |
Workload | Best Fit | Equations (15) or (16) | Equations (15) or (16) (c = 1) | Amdhal Equation (3) | Gustafson Equation (5) | Equations (8) or (9) | Ernest [31] |
---|---|---|---|---|---|---|---|
Wordcount | linear | 0.083 | 0.083 | 0.074 | 0.082 | 0.074 | 0.091 |
SVM | quadrat. | 0.271 | 0.276 | 0.285 | 0.313 | 0.271 | 0.367 |
PageRank | linear | 0.116 | 0.118 | 0.113 | 0.121 | 0.113 | 0.127 |
Kmeans | linear | 0.138 | 0.137 | 0.139 | 0.131 | 0.130 | 0.137 |
NWeight | quadrat. | 0.193 | 0.193 | 0.212 | 0.190 | 0.189 | 0.226 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Ahmed, N.; Barczak, A.L.C.; Rashid, M.A.; Susnjak, T. An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster. Big Data Cogn. Comput. 2021, 5, 65. https://doi.org/10.3390/bdcc5040065
Ahmed N, Barczak ALC, Rashid MA, Susnjak T. An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster. Big Data and Cognitive Computing. 2021; 5(4):65. https://doi.org/10.3390/bdcc5040065
Chicago/Turabian StyleAhmed, Nasim, Andre L. C. Barczak, Mohammad A. Rashid, and Teo Susnjak. 2021. "An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster" Big Data and Cognitive Computing 5, no. 4: 65. https://doi.org/10.3390/bdcc5040065
APA StyleAhmed, N., Barczak, A. L. C., Rashid, M. A., & Susnjak, T. (2021). An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster. Big Data and Cognitive Computing, 5(4), 65. https://doi.org/10.3390/bdcc5040065