Analytical Performance Models for MapReduce Workloads

Emanuel Vianna¹,
Giovanni Comarela¹,
Tatiana Pontes¹,
Jussara Almeida¹,
Virgílio Almeida¹,
Kevin Wilkinson²,
Harumi Kuno² &
Umeshwar Dayal²

Abstract

MapReduce is a popular programming model for parallel computation on large datasets. Among the existing MapReduce implementations, Hadoop has attracted considerable attention from both industry and research. In a Hadoop job, map and reduce tasks coordinate to produce a solution to the input problem, exhibiting the precedence constraints and synchronization delays characteristic of pipelined communication between maps (producers) and reduces (consumers). Here we address the challenge of designing analytical models to estimate the performance of MapReduce workloads, notably Hadoop workloads, focusing on the intra-job pipeline parallelism between map and reduce tasks of the same job. We propose a hierarchical model that combines a precedence graph model and a queuing network model to capture the intra-job synchronization constraints. We first show how to build a precedence graph that represents the dependencies among the tasks of a job. We then apply it jointly with an approximate Mean Value Analysis (aMVA) solution to predict mean job response time, throughput, and resource utilization. We validate our solution against a queuing network simulator and a real setup in various scenarios, finding very close agreement in both cases. In particular, our model's estimates of average job response time deviate from measurements on a real setup by less than 15%.
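
To give a feel for the aMVA step mentioned in the abstract, the following is a minimal sketch of the classical Schweitzer approximation for a closed, single-class queueing network. It is not the hierarchical model proposed in the article, which additionally uses a precedence graph to account for synchronization among the tasks of a job. The function name `approximate_mva`, its parameters, and the example service demands are illustrative assumptions.

```python
def approximate_mva(demands, n_jobs, think_time=0.0, tol=1e-8, max_iter=10_000):
    """Schweitzer approximate MVA for a closed, single-class network.

    demands    -- service demand (seconds per job) at each queueing center
    n_jobs     -- number of jobs circulating in the network
    think_time -- optional delay outside the queueing centers
    Returns (mean response time, throughput, per-center utilizations).
    All names here are hypothetical; this is a textbook simplification.
    """
    if n_jobs < 1 or not demands:
        raise ValueError("need at least one job and one service center")

    # Start from an even spread of jobs over the centers.
    queue = [n_jobs / len(demands)] * len(demands)
    for _ in range(max_iter):
        # Schweitzer assumption: an arriving job sees (N - 1) / N of the
        # mean queue length observed with N jobs in the network.
        residence = [d * (1.0 + (n_jobs - 1) / n_jobs * q)
                     for d, q in zip(demands, queue)]
        response = sum(residence)
        throughput = n_jobs / (think_time + response)
        # Little's law applied per center gives the updated queue lengths.
        new_queue = [throughput * r for r in residence]
        converged = max(abs(a - b) for a, b in zip(new_queue, queue)) < tol
        queue = new_queue
        if converged:
            break

    utilizations = [throughput * d for d in demands]
    return response, throughput, utilizations


if __name__ == "__main__":
    # Two hypothetical resources (for instance CPU and disk) and 10 jobs.
    r, x, u = approximate_mva([0.2, 0.3], n_jobs=10)
    print(f"response={r:.3f}s throughput={x:.3f} jobs/s utilization={u}")
```

With a single job there is no queueing, so the sketch returns the sum of the service demands as the response time. As the number of jobs grows, throughput approaches the limit set by the most heavily loaded center.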

Author information

Authors and Affiliations

Federal University of Minas Gerais (UFMG), Belo Horizonte, Brazil
Emanuel Vianna, Giovanni Comarela, Tatiana Pontes, Jussara Almeida & Virgílio Almeida
Information Analytics Lab, HP Laboratories (HP-Labs), Palo Alto, CA, USA
Kevin Wilkinson, Harumi Kuno & Umeshwar Dayal

Corresponding author

Correspondence to Emanuel Vianna.

About this article

Cite this article

Vianna, E., Comarela, G., Pontes, T. et al. Analytical Performance Models for MapReduce Workloads. Int J Parallel Prog 41, 495–525 (2013). https://doi.org/10.1007/s10766-012-0227-4

Received: 10 December 2011
Accepted: 09 October 2012
Published: 27 November 2012
Issue Date: August 2013
DOI: https://doi.org/10.1007/s10766-012-0227-4
