DOI: 10.1145/2463676.2463709
research-article

Split query processing in Polybase

Published: 22 June 2013

Abstract

This paper presents Polybase, a feature of SQL Server PDW V2 that allows users to manage and query data stored in a Hadoop cluster using the standard SQL query language. Unlike other database systems that provide only a relational view over HDFS-resident data through the use of an external table mechanism, Polybase employs a split query processing paradigm in which SQL operators on HDFS-resident data are translated into MapReduce jobs by the PDW query optimizer and then executed on the Hadoop cluster. The paper describes the design and implementation of Polybase along with a thorough performance evaluation that explores the benefits of employing a split query processing paradigm for executing queries that involve both structured data in a relational DBMS and unstructured data in Hadoop. Our results demonstrate that while the use of a split-based query execution paradigm can improve the performance of some queries by as much as 10X, one must employ a cost-based query optimizer that considers a broad set of factors when deciding whether or not it is advantageous to push a SQL operator to Hadoop. These factors include the selectivity factor of the predicate, the relative sizes of the two clusters, and whether or not their nodes are co-located. In addition, differences in the semantics of the Java and SQL languages must be carefully considered in order to avoid altering the expected results of a query.
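The trade-off described in the abstract can be illustrated with a toy cost comparison for a single selection over an HDFS-resident table. This is only a sketch under stated assumptions, not the PDW optimizer's actual cost model: the names `SelectionPlanInputs`, `cost_import_then_filter`, `cost_push_to_hadoop`, and `should_push_down` are hypothetical, and every throughput and startup constant is a made-up placeholder rather than a figure from the paper. It uses only the three factors the abstract names: predicate selectivity, the relative sizes of the two clusters, and whether their nodes are co-located.

```python
from dataclasses import dataclass

# Hypothetical placeholder constants, NOT measurements from the paper.
HADOOP_SCAN_BYTES_PER_SEC_PER_NODE = 50e6   # assumed MapReduce scan rate per node
PDW_SCAN_BYTES_PER_SEC_PER_NODE = 200e6     # assumed PDW scan rate per node
MAPREDUCE_STARTUP_SEC = 30.0                # assumed fixed job startup overhead
NETWORK_BYTES_PER_SEC_COLOCATED = 1e9       # assumed transfer rate, co-located nodes
NETWORK_BYTES_PER_SEC_REMOTE = 250e6        # assumed transfer rate, separate clusters


@dataclass(frozen=True)
class SelectionPlanInputs:
    table_bytes: float   # size of the HDFS-resident table
    selectivity: float   # fraction of the data that survives the predicate
    hadoop_nodes: int    # size of the Hadoop cluster
    pdw_nodes: int       # size of the PDW cluster
    colocated: bool      # True if Hadoop and PDW nodes share machines

    def __post_init__(self):
        if not 0.0 <= self.selectivity <= 1.0:
            raise ValueError("selectivity must be in [0, 1]")
        if self.hadoop_nodes < 1 or self.pdw_nodes < 1:
            raise ValueError("node counts must be positive")


def _network_rate(p: SelectionPlanInputs) -> float:
    """Transfer rate between the clusters under the co-location assumption."""
    if p.colocated:
        return NETWORK_BYTES_PER_SEC_COLOCATED
    return NETWORK_BYTES_PER_SEC_REMOTE


def cost_import_then_filter(p: SelectionPlanInputs) -> float:
    """Estimated seconds to ship the whole table to PDW and filter it there."""
    transfer = p.table_bytes / _network_rate(p)
    scan = p.table_bytes / (p.pdw_nodes * PDW_SCAN_BYTES_PER_SEC_PER_NODE)
    return transfer + scan


def cost_push_to_hadoop(p: SelectionPlanInputs) -> float:
    """Estimated seconds to filter in a MapReduce job and ship only the survivors."""
    job = MAPREDUCE_STARTUP_SEC + p.table_bytes / (
        p.hadoop_nodes * HADOOP_SCAN_BYTES_PER_SEC_PER_NODE
    )
    transfer = p.selectivity * p.table_bytes / _network_rate(p)
    return job + transfer


def should_push_down(p: SelectionPlanInputs) -> bool:
    """Push the selection to Hadoop only when the estimate says it is cheaper."""
    return cost_push_to_hadoop(p) < cost_import_then_filter(p)


if __name__ == "__main__":
    for selectivity in (0.01, 0.5, 1.0):
        inputs = SelectionPlanInputs(1e12, selectivity, 16, 16, colocated=False)
        print(selectivity, should_push_down(inputs))
```

Under these assumed constants, a highly selective predicate favors pushing the operator to Hadoop because little data crosses the network, while a predicate that keeps every row makes the MapReduce job pure overhead. A real optimizer would weigh many more factors than this sketch does.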

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN: 9781450320375
DOI: 10.1145/2463676
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. hadoop
  2. hdfs
  3. parallel database systems
  4. split query execution

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'13
Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (last 12 months): 28
  • Downloads (last 6 weeks): 3
Reflects downloads up to 14 Jan 2025

Cited By

  • (2023) Accelerating Cloud-Native Databases with Distributed PMem Stores. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 3043-3057. DOI: 10.1109/ICDE55515.2023.00233. Online publication date: Apr-2023
  • (2023) Multi-model query languages: taming the variety of big data. Distributed and Parallel Databases 42:1, pp. 31-71. DOI: 10.1007/s10619-023-07433-1. Online publication date: 31-May-2023
  • (2022) Role of Big Data in Internet of Things Networks. Research Anthology on Big Data Analytics, Architectures, and Applications, pp. 336-363. DOI: 10.4018/978-1-6684-3662-2.ch016. Online publication date: 2022
  • (2022) Polyglot data management. Proceedings of the VLDB Endowment 15:12, pp. 3750-3753. DOI: 10.14778/3554821.3554891. Online publication date: 1-Aug-2022
  • (2021) Parallel query processing in a polystore. Distributed and Parallel Databases. DOI: 10.1007/s10619-021-07322-5. Online publication date: 3-Feb-2021
  • (2021) Dragoon: a hybrid and efficient big trajectory management system for offline and online analytics. The VLDB Journal. DOI: 10.1007/s00778-021-00652-x. Online publication date: 3-Feb-2021
  • (2021) Towards Taming the Adaptivity Problem. Service-Oriented Computing, pp. 83-99. DOI: 10.1007/978-3-030-87568-8_5. Online publication date: 26-Sep-2021
  • (2020) Helios. Proceedings of the VLDB Endowment 13:12, pp. 3231-3244. DOI: 10.14778/3415478.3415547. Online publication date: 14-Sep-2020
  • (2020) Decisional architectures from business intelligence to big data. Proceedings of the 2nd International Conference on Digital Tools & Uses Congress, pp. 1-9. DOI: 10.1145/3423603.3424049. Online publication date: 15-Oct-2020
  • (2020) Towards Elastic Data Warehousing by Decoupling Data Management and Computation. Proceedings of the 2020 4th International Conference on Cloud and Big Data Computing, pp. 52-57. DOI: 10.1145/3416921.3416935. Online publication date: 26-Aug-2020
