Piranha: optimizing short jobs in Hadoop

Author:

Khaled Elmeleegy

Proceedings of the VLDB Endowment, Volume 6, Issue 11

Pages 985 - 996

https://doi.org/10.14778/2536222.2536225

Published: 01 August 2013

Abstract

Cluster computing has emerged as a key parallel processing platform for large-scale data. All major internet companies use it as their central processing platform. One of the most popular examples of cluster computing is MapReduce and its open source implementation, Hadoop. These systems were originally designed for massive-scale batch computations. Interestingly, over time their production workloads have evolved into a mix of a small fraction of large, long-running jobs and a much bigger fraction of short jobs. This came about because these systems end up being used as data warehouses, which store most of the data sets and attract ad hoc, short, data-mining queries. Moreover, the availability of higher-level query languages that operate on top of these cluster systems has caused these ad hoc queries to proliferate. Since existing systems were not designed for short, latency-sensitive jobs, short interactive jobs suffer from poor response times.

In this paper, we present Piranha, a system for optimizing short jobs on Hadoop without affecting the larger jobs. It runs on existing, unmodified Hadoop clusters, which facilitates its adoption. Piranha exploits characteristics of short jobs learned from production workloads on Yahoo! clusters to reduce the latency of such jobs. To demonstrate Piranha's effectiveness, we evaluated its performance using three realistic short queries. Piranha was able to reduce the queries' response times by up to 71%.

Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11, August 2013, 237 pages

ISSN: 2150-8097

Publisher

VLDB Endowment
