demonstration

Clydesdale: Structured Data Processing on Hadoop

Authors: Andrey Balmin, Tim Kaldewey, Sandeep Tata

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

Pages 705 - 708

https://doi.org/10.1145/2213836.2213938

Published: 20 May 2012

Abstract

There have been several recent proposals for modifying Hadoop that radically change its storage organization or query processing techniques to obtain good performance for structured data processing. We will showcase Clydesdale, a research prototype for structured data processing on Hadoop that can achieve dramatic performance improvements over existing solutions without any changes to the underlying MapReduce implementation. Clydesdale achieves this by synthesizing several techniques from the database literature in a novel way and carefully adapting them to the Hadoop environment. On the star schema benchmark, we show that Clydesdale is on average 38x faster than Hive, the dominant approach for structured data processing on Hadoop today. To the best of our knowledge, Clydesdale is the fastest solution on Hadoop for processing workloads on structured data sets that fit a star schema. Attendees will be able to run queries on the data from the star schema benchmark on a remote Hadoop cluster with Clydesdale and Hive installed, and get a breakdown of the time taken to execute each query. Attendees will also be able to pose their own queries using ClyQL, a novel embedded DSL in Scala that can be used to rapidly prototype star join queries. With this demonstration, we hope to convince attendees that, contrary to what was previously thought, Hadoop can indeed efficiently support structured data processing.

Index Terms

Clydesdale: Structured Data Processing on Hadoop
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Information & Contributors

Information

Published In

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data

May 2012

886 pages

ISBN:9781450312479

DOI:10.1145/2213836

General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)

Program Chair: Luis Gravano (Columbia University)

Publications Chair: Ariel Fuxman (Microsoft Research)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Demonstration

Conference

SIGMOD/PODS '12

Sponsor:

SIGMOD

SIGMOD/PODS '12: International Conference on Management of Data

May 20 - 24, 2012

Scottsdale, Arizona, USA

Acceptance Rates

SIGMOD '12 Paper Acceptance Rate 48 of 289 submissions, 17%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 7

Total Downloads: 580

Downloads (Last 12 months): 7

Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Dec 2024

Citations

Cited By

Eltabakh M (2017). Data Organization and Curation in Big Data. Handbook of Big Data Technologies, pages 143-178. Online publication date: 26-Feb-2017.
https://doi.org/10.1007/978-3-319-49340-4_5

Sakr S, Liu A, Fayoumi A (2014). MapReduce Family of Large-Scale Data-Processing Systems. Large Scale and Big Data, pages 39-106. Online publication date: 12-Jun-2014.
https://doi.org/10.1201/b17112-3

Hsu J, Hsu C, Chen S, Chung Y (2014). Correlation Aware Technique for SQL to NoSQL Transformation. Proceedings of the 2014 7th International Conference on Ubi-Media Computing and Workshops, pages 43-46. Online publication date: 12-Jul-2014.
https://dl.acm.org/doi/10.1109/U-MEDIA.2014.27

Gu J, Peng S, Wang X, Rao W, Yang M, Cao Y (2014). Cost-Based Join Algorithm Selection in Hadoop. Web Information Systems Engineering – WISE 2014, pages 246-261. Online publication date: 2014.
https://doi.org/10.1007/978-3-319-11746-1_18

Zhao L, Sakr S, Liu A, Bouguettaya A (2014). Big Data Processing Systems. Cloud Data Management, pages 135-176. Online publication date: 17-Feb-2014.
https://doi.org/10.1007/978-3-319-04765-2_9

Sakr S, Liu A, Fayoumi A (2013). The family of mapreduce and large-scale data processing systems. ACM Computing Surveys, 46(1), pages 1-44. Online publication date: 11-Jul-2013.
https://dl.acm.org/doi/10.1145/2522968.2522979

Wolf J, Balmin A, Rajan D, Hildrum K, Khandekar R, Parekh S, Wu K, Vernica R (2012). On the optimization of schedules for MapReduce workloads in the presence of shared scans. The VLDB Journal: The International Journal on Very Large Data Bases, 21(5), pages 589-609. Online publication date: 1-Oct-2012.
https://dl.acm.org/doi/10.1007/s00778-012-0279-5
