Abstract
MapReduce is a currently popular programming model to support parallel computations on large datasets. Among the several existing MapReduce implementations, Hadoop has attracted a lot of attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting precedence constraints and synchronization delays that are characteristic of a pipeline communication between maps (producers) and reduces (consumers). We here address the challenge of designing analytical models to estimate the performance of MapReduce workloads, notably Hadoop workloads, focusing particularly on the intra-job pipeline parallelism between map and reduce tasks belonging to the same job. We propose a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among multiple tasks of the same job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time, throughput and resource utilization. We validate our solution against a queuing network simulator and a real setup in various scenarios, finding very close agreement in both cases. In particular, our model produces estimates of average job response time that deviate from measurements of a real setup by less than 15 %.
Similar content being viewed by others
References
Apache Software Foundation, Powered by Hadoop. URL http://wiki.apache.org/hadoop/PoweredBy. Access date: 1 July 2012 (2012)
Apache Software Foundation, Official Apache Hadoop Website. URL http://hadoop.apache.org/. Accessed date: 1 July 2012 (2012)
Berlińska J., Drozdowski M.: Scheduling divisible MapReduce computations. J. Parallel Distrib. Comput. 71(3), 450–459 (2011)
Chen, Y., Ganapathi, A., Griffith R., Katz, R.: The case for evaluating MapReduce performance using workload suites. In: Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Singapore, pp. 390–399 (2011)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears R.: MapReduce Online Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI), San Jose, California, pp. 21–35 (2010)
Dean, J., Ghemawat, S., MapReduce : Simplified data processing on large clusters. In: Proceedings of Operating Systems Design and Implementation (OSDI), San Francisco, California, pp. 137–150 (2004)
Dean J., Ghemawat S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)
Ganapathi, A.: Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning. Technical Report UCB/EECS-2009-181. EECS Department, University of California, Berkeley (2009)
Ganapathi, A., Kuno, H., Dayal, U., Wiener, J., Fox, A., Jordan, M., Patterson, D. : Predicting multiple metrics for queries: better decisions enabled by machine learning. In: Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE), Shanghai, China, pp. 592–603 (2009)
Herodotou, H.: Hadoop Performance Models. Technical Report CS-2011-05. Computer Science Department, Duke University. URL http://arxiv.org/abs/1106.0940 (2011)
Jain R.: The Art of Computer Systems Performance Analysis—Techniques for Experimental Design, Measurement, Simulation, and Modeling. Wiley, London (1991)
Jiang D.R., Ooi B.C., Shi L., Wu S.: The performance of MapReduce: an in-depth study. Proc. VLDB Endow 3(1–2), 472–483 (2010)
Jonkers H.: Queueing Models of Parallel Applications: The Glamis Methodology, Computer Performance Evaluation: Modeling Techniques & Tools (LNCS 794), pp. 123–138. Springer, Berlin (1994)
Kim S., Won J., Han H., Eom H., Yeom H.Y.: Improving hadoop performance in intercloud environments ACM SIGMETRICS. Perform. Eval. Rev. 39(3), 107–109 (2011)
Krevat, E., Shiran, T., Anderson, E., Tucek, J., Wylie, J.J. , Ganger, G.R.: Applying Performance Models to Understand Data-intensive Computing Efficiency. Technical Report CMU-PDL-10-108. Carnegie Mellon University, Pittsburgh (2010)
Kruskal C.P., Weiss A.: Allocating independent subtasks on parallel processors. IEEE Trans. Softw. Eng. 11(10), 1001–1016 (1985)
Lavenberg S., Reiser M.: Stationary state probabilities at arrival instants for closed queueing networks with multiple types of customers. J. Appl. Probab. 17(4), 1048–1061 (1980)
Lee K.H., Lee Y.J., Choi H., Chung Y.D., Moon B.: Parallel data processing with MapReduce: a survey. ACM SIGMOD Record J. 40(4), 11–20 (2011)
Liang D.R., Tripathi S. K.: On performance prediction of parallel computations with precedent constraints. IEEE Trans. Parallel Distrib. Syst. 11(5), 491–508 (2000)
Little J.: A proof for the queuing formula: L = λ W. Oper. Res. 9(3), 383–387 (1961)
Mak V.W., Lundstrom S.F.: Predicting performance of parallel computations. IEEE Trans. Parallel Distrib. Syst. 1(3), 257–260 (1990)
Menasce, D., Dowdy, L., Almeida, V.: Performance by Design: Computer Capacity Planning By Example. Prentice Hall PTR (2004)
Morton, K., Balazinska, M., Grossman, D.: ParaTimer: a progress indicator for MapReduce DAGs. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD), Indianapolis, Indiana, pp. 507–518 (2010)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D., Madden, S., Stonebraker, M.: A Comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data (SIGMOD), Providence, Rhode Island, pp. 165–178 (2009)
Reiser M., Lavenberg S.S.: Mean-value analysis of closed multichain queuing networks. J. ACM 27(2), 313–322 (1980)
Salza, S., Lavenberg, S.S.: Approximating response time distributions in closed queueing network models of computer performance. In: Proceedings Performance, North Holland, Amsterdam, pp. 133–145 (1981)
Thomasian A., Bay P.F.: Analytic queueing network models for parallel processing of task systems. IEEE Trans. Comput. 35(12), 1045–1054 (1986)
Trivedi K.S.: Probability and Statistics with Reliability, Queuing and Computer Science Applications. Prentice Hall PTR, Upper Saddle River (1882)
Varki, E.: Mean value technique for closed fork-join networks. In: Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Atlanta, Georgia, pp. 103–112 (1999)
Vianna, E.,Comarela, G., Pontes, T., Almeida, J., Almeida, V., Wilkinson, K., Kuno, H., Dayal, U.: Modeling the performance of the Hadoop online prototype. In: Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Vitória, Brazil, pp. 152–159 (2011)
Wang, G., Butt, A.R., Pandey, P., Gupta, K.: A simulation approach to evaluating design decisions in MapReduce setups. In: Proceedings of the IEEE International Symposium on Modeling, Analysis Simulation of Computer and Telecommunication Systems (MASCOTS), Imperial College London, UK, pp. 1–11 (2009)
Wang, G., Butt, A.R., Pandey, P., Gupta, K.: Using realistic simulation for performance analysis of MapReduce setups. In: Proceedings of the 1st ACM Workshop on Large-Scale System and Application Performance (LSAP), Munich, Germany, pp. 19–26 (2009)
Weng N., Wolf T.: Analytic modeling of network processors for parallel workload mapping. ACM Trans. Embed. Comput. Syst. 8(3), 18:1–18:29 (2009)
White T.: Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale. 2nd edn. O’Reilly Media, Sebastopol (2011)
Yang, H.C., Dasdan, A., Hsiao, R.L., Parker, D.S.: Map-Reduce-Merge: simplified relational data processing on LargeClusters. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD), Beijing, China, pp. 1029–1040 (2007)
Yang, X., Sun, J.: An Analytical performance model of MapReduce. In: Proceedings of the 2011 IEEE International Conference on Cloud Computing and Intelligence Systems (CCIS), Beijing, China, pp. 306–310 (2011)
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving MapReduce performance in heterogeneous environments. In: Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI), San Diego, California, pp. 29–42 (2008)
Zahorjan, J.: The Approximate Solution of Large Queueing Network Models, PhD. Thesis, University of Toronto, Canada (1980)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Vianna, E., Comarela, G., Pontes, T. et al. Analytical Performance Models for MapReduce Workloads. Int J Parallel Prog 41, 495–525 (2013). https://doi.org/10.1007/s10766-012-0227-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10766-012-0227-4