More Web Proxy on the site http://driver.im/

research-article

Optimizing joins in a map-reduce environment

Authors:

Foto N. Afrati,

Jeffrey D. UllmanAuthors Info & Claims

EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology

Pages 99 - 110

https://doi.org/10.1145/1739041.1739056

Published: 22 March 2010 Publication History

Abstract

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.

References

[1]

F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. Technical report, Stanford, http://ilpubs.stanford.edu:8090/952/, 2009.

[2]

Apache. Hadoop. http://hadoop.apache.org/, 2006.

[3]

R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD Conference, pages 261--272, 2000.

Digital Library

[4]

S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive caching for continuous queries. In ICDE, pages 118--129, 2005.

Digital Library

[5]

S. Babu and J. Widom. Streamon: an adaptive engine for stream query processing. In SIGMOD Conference, pages 931--932, New York, NY, USA, 2004. ACM.

Digital Library

[6]

S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks, 30(1--7):107--117, 1998.

Digital Library

[7]

F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2), 2008.

Digital Library

[8]

H. chih Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD Conference, pages 1029--1040, New York, NY, USA, 2007. ACM.

Digital Library

[9]

B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. PVLDB, 1(2):1277--1288, 2008.

Digital Library

[10]

J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008.

Digital Library

[11]

A. Deshpande and L. Hellerstein. Flow algorithms for parallel query optimization. In ICDE, pages 754--763, 2008.

Digital Library

[12]

D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, pages 27--40, 1992.

Digital Library

[13]

D. J. DeWitt, E. Paulson, E. Robinson, J. F. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. PVLDB, 1(1):28--41, 2008.

Digital Library

[14]

S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 19th ACM Symposium on Operating Systems Principles, 2003.

Digital Library

[15]

Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld. An adaptive query execution system for data integration. In SIGMOD Conference, pages 299--310, 1999.

Digital Library

[16]

H. Jacobsson. Tree-based techniques for query evaluation. Ph.D. thesis, Dept. of CS, Stanford Univ., Stanford CA USA, STAN-CS-93-1492, 1993.

Digital Library

[17]

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:668--677, 1999.

Digital Library

[18]

S. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conference, pages 49--60, 2002.

Digital Library

[19]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008.

Digital Library

[20]

K. A. Ross and J. Cieslewicz. Optimal splitters for database partitioning with size bounds. In ICDT, pages 98--110, New York, NY, USA, 2009. ACM.

Digital Library

[21]

U. Srivastava, K. Munagala, J. Widom, and R. Motwani. Query optimization over web services. In VLDB, pages 355--366, 2006.

Digital Library

[22]

K.-L. Tan and H. Lu. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Rec., 20(4):81--82, 1991.

Digital Library

[23]

S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In VLDB, pages 285--296, 2003.

Digital Library

Cited By

Mackey IChimni RSu J(2024)Early detection of temporal constraint violationsInformation and Computation10.1016/j.ic.2023.105114296(105114)Online publication date: Jan-2024
https://doi.org/10.1016/j.ic.2023.105114
Siachamis GPsarakis KFragkoulis MPapapetrou Ovan Deursen AKatsifodimos AKemme BRiviere ESchiavoni VPasin M(2023)Adaptive Distributed Streaming Similarity JoinsProceedings of the 17th ACM International Conference on Distributed and Event-based Systems10.1145/3583678.3596891(25-36)Online publication date: 27-Jun-2023
https://dl.acm.org/doi/10.1145/3583678.3596891
Uhartegaray RD'Orazio LDamigos MKalogeros E(2023)Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data2023 IEEE International Conference on Fuzzy Systems (FUZZ)10.1109/FUZZ52849.2023.10309759(01-06)Online publication date: 13-Aug-2023
https://doi.org/10.1109/FUZZ52849.2023.10309759
Show More Cited By

Recommendations

Optimizing Multiway Joins in a Map-Reduce Environment

Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the “map-key,” the set of attributes ...
On optimizing relational self-joins
EDBT '12: Proceedings of the 15th International Conference on Extending Database Technology

Self-join, which joins a relation with itself, is a prevalent operation in relational database systems. Despite its wide applicability, there has been little attention devoted to improving its performance. In this paper, we present SCALE (<u>S</u>ort ...
Processing multi-way spatial joins on map-reduce
EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

In this paper we investigate the problem of processing multi-way spatial joins on map-reduce platform. We look at two common spatial predicates - overlap and range. We address these two classes of join queries, discuss the challenges and outline novel ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

EDBT '10: Proceedings of the 13th International Conference on Extending Database Technology

March 2010

741 pages

ISBN:9781605589459

DOI:10.1145/1739041

Editors:
Ioana Manolescu
INRIA, France
,
Stefano Spaccapietra
EPFL, Switzerland
,
Jens Teubner
ETH Zurich, Switzerland
,
Masaru Kitsuregawa
Tokyo University, Japan
,
Alain Leger
Orange - France Telecom R&D, France
,
Felix Naumann
Hasso Plattner Institute, Germany
,
Anastasia Ailamaki
EPFL, Switzerland
,
Fatma Ozcan
IBM Almaden Research Center

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 March 2010

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Conference

EDBT/ICDT '10

EDBT/ICDT '10: EDBT/ICDT '10 joint conference

March 22 - 26, 2010

Lausanne, Switzerland

Acceptance Rates

Overall Acceptance Rate 7 of 10 submissions, 70%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

240
Total Citations
View Citations
1,972
Total Downloads

Downloads (Last 12 months)68
Downloads (Last 6 weeks)2

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Mackey IChimni RSu J(2024)Early detection of temporal constraint violationsInformation and Computation10.1016/j.ic.2023.105114296(105114)Online publication date: Jan-2024
https://doi.org/10.1016/j.ic.2023.105114
Siachamis GPsarakis KFragkoulis MPapapetrou Ovan Deursen AKatsifodimos AKemme BRiviere ESchiavoni VPasin M(2023)Adaptive Distributed Streaming Similarity JoinsProceedings of the 17th ACM International Conference on Distributed and Event-based Systems10.1145/3583678.3596891(25-36)Online publication date: 27-Jun-2023
https://dl.acm.org/doi/10.1145/3583678.3596891
Uhartegaray RD'Orazio LDamigos MKalogeros E(2023)Scalable Computation of Fuzzy Joins Over Large Collections of JSON Data2023 IEEE International Conference on Fuzzy Systems (FUZZ)10.1109/FUZZ52849.2023.10309759(01-06)Online publication date: 13-Aug-2023
https://doi.org/10.1109/FUZZ52849.2023.10309759
Ma HLi J(2023)The Hardness of Optimization Problems on the Weighted Massively Parallel Computation ModelComputing and Combinatorics10.1007/978-3-031-49193-1_9(106-117)Online publication date: 9-Dec-2023
https://doi.org/10.1007/978-3-031-49193-1_9
Phan APhan TCao HTrieu T(2022)Comparative Analysis of Skew-Join Strategies for Large-Scale Datasets with MapReduce and SparkApplied Sciences10.3390/app1213655412:13(6554)Online publication date: 28-Jun-2022
https://doi.org/10.3390/app12136554
Metwally AIves ZBonifati AEl Abbadi A(2022)Scaling Equi-JoinsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526042(2163-2176)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3526042
Deng TFan WLu PLuo XZhu XAn W(2022)Deep and Collective Entity Resolution in Parallel2022 IEEE 38th International Conference on Data Engineering (ICDE)10.1109/ICDE53745.2022.00200(2060-2072)Online publication date: May-2022
https://doi.org/10.1109/ICDE53745.2022.00200
DOVIER AFORMISANO AGUPTA GHERMENEGILDO MPONTELLI EROCHA R(2022)Parallel Logic Programming: A SequelTheory and Practice of Logic Programming10.1017/S147106842200005922:6(905-973)Online publication date: 28-Mar-2022
https://doi.org/10.1017/S1471068422000059
Parida PDora LSwain MAgrawal SPanda R(2022)Data science methodologies in smart healthcare: a reviewHealth and Technology10.1007/s12553-022-00648-912:2(329-344)Online publication date: 24-Feb-2022
https://doi.org/10.1007/s12553-022-00648-9
Ceballos ORamírez Restrepo CPabón MCastillo ACorcho O(2021)SPARQL2Flink: Evaluation of SPARQL Queries on Apache FlinkApplied Sciences10.3390/app1115703311:15(7033)Online publication date: 30-Jul-2021
https://doi.org/10.3390/app11157033
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents