DOI: 10.1145/2213836.2213938
demonstration

Clydesdale: structured data processing on hadoop

Published: 20 May 2012

Abstract

Several recent proposals modify Hadoop, radically changing its storage organization or query processing techniques, to obtain good performance for structured data processing. We will showcase Clydesdale, a research prototype for structured data processing on Hadoop that can achieve dramatic performance improvements over existing solutions without any changes to the underlying MapReduce implementation. Clydesdale achieves this through a novel synthesis of several techniques from the database literature, carefully adapted to the Hadoop environment. On the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive, the dominant approach for structured data processing on Hadoop today. To the best of our knowledge, Clydesdale is the fastest solution on Hadoop for processing workloads on structured data sets that fit a star schema. Attendees will be able to run queries on the star schema benchmark data on a remote Hadoop cluster with Clydesdale and Hive installed, and get a breakdown of the time taken to execute each query. Attendees will also be able to pose their own queries using ClyQL, a novel embedded DSL in Scala that can be used to rapidly prototype star join queries. With this demonstration, we hope to convince attendees that, contrary to what was previously thought, Hadoop can indeed efficiently support structured data processing.

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN: 9781450312479
DOI: 10.1145/2213836
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hadoop
  2. mapreduce
  3. query processing
  4. star joins
  5. structured data processing

Qualifiers

  • Demonstration

Conference

SIGMOD/PODS '12
Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 7
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 01 Jan 2025

Cited By

  • (2017) Data Organization and Curation in Big Data. Handbook of Big Data Technologies, pp. 143-178. DOI: 10.1007/978-3-319-49340-4_5. Online publication date: 26-Feb-2017
  • (2014) MapReduce Family of Large-Scale Data-Processing Systems. Large Scale and Big Data, pp. 39-106. DOI: 10.1201/b17112-3. Online publication date: 12-Jun-2014
  • (2014) Correlation Aware Technique for SQL to NoSQL Transformation. Proceedings of the 2014 7th International Conference on Ubi-Media Computing and Workshops, pp. 43-46. DOI: 10.1109/U-MEDIA.2014.27. Online publication date: 12-Jul-2014
  • (2014) Cost-Based Join Algorithm Selection in Hadoop. Web Information Systems Engineering – WISE 2014, pp. 246-261. DOI: 10.1007/978-3-319-11746-1_18. Online publication date: 2014
  • (2014) Big Data Processing Systems. Cloud Data Management, pp. 135-176. DOI: 10.1007/978-3-319-04765-2_9. Online publication date: 17-Feb-2014
  • (2013) The family of mapreduce and large-scale data processing systems. ACM Computing Surveys, 46:1, pp. 1-44. DOI: 10.1145/2522968.2522979. Online publication date: 11-Jul-2013
  • (2012) On the optimization of schedules for MapReduce workloads in the presence of shared scans. The VLDB Journal — The International Journal on Very Large Data Bases, 21:5, pp. 589-609. DOI: 10.1007/s00778-012-0279-5. Online publication date: 1-Oct-2012
