
Piranha: optimizing short jobs in Hadoop

Published: 01 August 2013

Abstract

Cluster computing has emerged as a key parallel processing platform for large-scale data. All major internet companies use it as their central processing platform. One of the most popular examples of cluster computing is MapReduce and its open-source implementation, Hadoop. These systems were originally designed for massive-scale batch computations. Interestingly, over time their production workloads have evolved into a mix of a small fraction of large, long-running jobs and a much bigger fraction of short jobs. This came about because these systems end up being used as data warehouses, which store most of the data sets and attract ad hoc, short, data-mining queries. Moreover, the availability of higher-level query languages that operate on top of these cluster systems has led to a proliferation of these ad hoc queries. Since existing systems were not designed for short, latency-sensitive jobs, short interactive jobs suffer from poor response times.
In this paper, we present Piranha, a system for optimizing short jobs on Hadoop without affecting the larger jobs. It runs on existing, unmodified Hadoop clusters, which facilitates its adoption. Piranha exploits characteristics of short jobs learned from production workloads on Yahoo! clusters to reduce the latency of such jobs. To demonstrate Piranha's effectiveness, we evaluated its performance using three realistic short queries. Piranha was able to reduce the queries' response times by up to 71%.



Published In

Proceedings of the VLDB Endowment, Volume 6, Issue 11
August 2013
237 pages

Publisher

VLDB Endowment


Cited By

  • (2023) QaaD (Query-as-a-Data): Scalable Execution of Massive Number of Small Queries in Spark. Proceedings of the ACM on Management of Data 1:2 (1-26). DOI: 10.1145/3589279. Online publication date: 20-Jun-2023
  • (2023) Selected Aspects of Interactive Feature Extraction. Transactions on Rough Sets XXIII (121-287). DOI: 10.1007/978-3-662-66544-2_8. Online publication date: 1-Jan-2023
  • (2022) Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications. Proceedings of the 2022 International Conference on Management of Data (1840-1854). DOI: 10.1145/3514221.3517892. Online publication date: 10-Jun-2022
  • (2021) Cost Optimization for Big Data Workloads Based on Dynamic Scheduling and Cluster-Size Tuning. Big Data Research 25:C. DOI: 10.1016/j.bdr.2021.100203. Online publication date: 15-Jul-2021
  • (2019) Accelerating Big Data Analytics Using Scale-Up/Out Heterogeneous Clusters. 2019 28th International Conference on Computer Communication and Networks (ICCCN) (1-9). DOI: 10.1109/ICCCN.2019.8847060. Online publication date: Jul-2019
  • (2019) Cluster-size optimization within a cloud-based ETL framework for Big Data. 2019 IEEE International Conference on Big Data (Big Data) (3754-3763). DOI: 10.1109/BigData47090.2019.9006547. Online publication date: Dec-2019
  • (2018) Scrub. Proceedings of the Thirteenth EuroSys Conference (1-15). DOI: 10.1145/3190508.3190513. Online publication date: 23-Apr-2018
  • (2018) DRESS: Dynamic RESource-Reservation Scheme for Congested Data-Intensive Computing Platforms. 2018 IEEE 11th International Conference on Cloud Computing (CLOUD) (694-701). DOI: 10.1109/CLOUD.2018.00095. Online publication date: Jul-2018
  • (2017) Large-scale 3D Reconstruction with an R-based Analysis Workflow. Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (85-93). DOI: 10.1145/3148055.3148062. Online publication date: 5-Dec-2017
  • (2017) Cocoa. ACM Transactions on Modeling and Performance Evaluation of Computing Systems 2:2 (1-31). DOI: 10.1145/3022876. Online publication date: 7-Feb-2017
