More Web Proxy on the site http://driver.im/

research-article

Minimal MapReduce algorithms

Authors:

Xiaokui XiaoAuthors Info & Claims

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 529 - 540

https://doi.org/10.1145/2463676.2463719

Published: 22 June 2013 Publication History

Abstract

MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

References

[1]

A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. Hadoopdb: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. PVLDB, 2(1):922--933, 2009.

Digital Library

[2]

F. N. Afrati, A. D. Sarma, D. Menestrina, A. G. Parameswaran, and J. D. Ullman. Fuzzy joins using mapreduce. In ICDE, pages 498--509, 2012.

Digital Library

[3]

F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. TKDE, 23(9):1282--1298, 2011.

Digital Library

[4]

B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In SIGMOD, pages 973--984, 2011.

Digital Library

[5]

B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454--465, 2012.

Digital Library

[6]

K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011.

Digital Library

[7]

S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In SIGMOD, pages 975--986, 2010.

Digital Library

[8]

S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421--430, 2001.

Digital Library

[9]

R. Chaiken, B. Jenkins, P. ake Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008.

Digital Library

[10]

B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318--1327, 2011.

Digital Library

[11]

S. Chen. Cheetah: A high performance, custom data warehouse on top of mapreduce. PVLDB, 3(2):1459--1468, 2010.

Digital Library

[12]

F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, pages 231--240, 2010.

Digital Library

[13]

R. L. F. Cordeiro, C. T. Jr., A. J. M. Traina, J. Lopez, U. Kang, and C. Faloutsos. Clustering very large multi-dimensional datasets with mapreduce. In SIGKDD, pages 690--698, 2011.

Digital Library

[14]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.

Digital Library

[15]

A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007.

Digital Library

[16]

J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.

Digital Library

[17]

F. K. H. A. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In SoCG, pages 298--307, 1993.

Digital Library

[18]

J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop

[19]

: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010.

Digital Library

[20]

I. Elghandour and A. Aboulnaga. Restore: Reusing results of mapreduce jobs. PVLDB, 5(6):586--597, 2012.

Digital Library

[21]

M. Y. Eltabakh, Y. Tian, F. Ozcan, R. Gemulla, A. Krettek, and J. McPherson. Cohadoop: Flexible data placement and its exploitation in hadoop. PVLDB, 4(9):575--585, 2011.

Digital Library

[22]

A. Ene, S. Im, and B. Moseley. Fast clustering using mapreduce. In SIGKDD, pages 681--689, 2011.

Digital Library

[23]

A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7):419--429, 2011.

Digital Library

[24]

A. Ghoting, P. Kambadur, E. P. D. Pednault, and R. Kannan. Nimble: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In SIGKDD, pages 334--342, 2011.

Digital Library

[25]

A. Ghoting, R. Krishnamurthy, E. P. D. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. Systemml: Declarative machine learning on mapreduce. In ICDE, pages 231--242, 2011.

Digital Library

[26]

R. Grover and M. J. Carey. Extending map-reduce for efficient predicate-based sampling. In ICDE, pages 486--497, 2012.

Digital Library

[27]

B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in mapreduce based on scalable cardinality estimates. In ICDE, pages 522--533, 2012.

Digital Library

[28]

Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011.

Digital Library

[29]

H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, 4(11):1111--1122, 2011.

Digital Library

[30]

E. Jahani, M. J. Cafarella, and C. Re. Automatic optimization for mapreduce programs. PVLDB, 4(6):385--396, 2011.

Digital Library

[31]

J. Jestes, F. Li, and K. Yi. Building wavelet histograms on large data in mapreduce. In PVLDB, pages 617--620, 2012.

Digital Library

[32]

H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010.

Digital Library

[33]

N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598--609, 2012.

Digital Library

[34]

L. Kolb, A. Thor, and E. Rahm. Load balancing for mapreduce-based entity resolution. In ICDE, pages 618--629, 2012.

Digital Library

[35]

P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223--234, 2011.

Digital Library

[36]

H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. JACM, 22(4):469--476, 1975.

Digital Library

[37]

Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia. Skewtune: mitigating skew in mapreduce applications. In SIGMOD, pages 25--36, 2012.

Digital Library

[38]

W. Lang and J. M. Patel. Energy management for mapreduce clusters. PVLDB, 3(1):129--139, 2010.

Digital Library

[39]

N. Laptev, K. Zeng, and C. Zaniolo. Early accurate results for advanced analytics on mapreduce. PVLDB, 5(10):1028--1039, 2012.

Digital Library

[40]

S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, pages 85--94, 2011.

Digital Library

[41]

H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. PVLDB, 5(11):1196--1207, 2012.

Digital Library

[42]

Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In SIGMOD, pages 961--972, 2011.

Digital Library

[43]

W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k nearest neighbor joins using mapreduce. PVLDB, 5(10):1016--1027, 2012.

Digital Library

[44]

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330--339, 2010.

Digital Library

[45]

A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB, 5(8):704--715, 2012.

Digital Library

[46]

G. D. F. Morales, A. Gionis, and M. Sozio. Social content matching in mapreduce. PVLDB, 4(7):460--469, 2011.

Digital Library

[47]

K. Morton, M. Balazinska, and D. Grossman. Paratimer: a progress indicator for mapreduce dags. In SIGMOD, pages 507--518, 2010.

Digital Library

[48]

T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: Sharing across multiple queries in mapreduce. PVLDB, 3(1):494--505, 2010.

Digital Library

[49]

A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011.

Digital Library

[50]

C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008.

Digital Library

[51]

O. O'Malley. Terabyte sort on apache hadoop. Technical report, Yahoo, 2008.

[52]

B. Panda, J. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. PVLDB, 2(2):1426--1437, 2009.

Digital Library

[53]

N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11):1135--1145, 2011.

Digital Library

[54]

A. Shinnar, D. Cunningham, B. Herta, and V. A. Saraswat. M3r: Increased performance for in-memory hadoop jobs. PVLDB, 5(12):1736--1747, 2012.

Digital Library

[55]

S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011.

Digital Library

[56]

A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, pages 996--1005, 2010.

[57]

C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: counting triangles in massive graphs with a coin. In SIGKDD, pages 837--846, 2009.

Digital Library

[58]

L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990.

Digital Library

[59]

R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive mapreduce using situation-aware mappers. In EDBT, pages 420--431, 2012.

Digital Library

[60]

R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010.

Digital Library

[61]

G. Wang, M. A. V. Salles, B. Sowell, X. Wang, T. Cao, A. J. Demers, J. Gehrke, and W. M. White. Behavioral simulations in mapreduce. PVLDB, 3(1):952--963, 2010.

Digital Library

[62]

B. Zhang, S. Zhou, and J. Guan. Adapting skyline computation to the mapreduce framework: Algorithms and experiments. In DASFAA Workshops, pages 403--414, 2011.

Digital Library

[63]

X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using mapreduce. PVLDB, 5(11):1184--1195, 2012.

Digital Library

Cited By

Ma HLi J(2023)The Hardness of Optimization Problems on the Weighted Massively Parallel Computation ModelComputing and Combinatorics10.1007/978-3-031-49193-1_9(106-117)Online publication date: 9-Dec-2023
https://doi.org/10.1007/978-3-031-49193-1_9
Bazai SJang-Jaccard JAlavizadeh H(2021)Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache SparkElectronics10.3390/electronics1005058910:5(589)Online publication date: 3-Mar-2021
https://doi.org/10.3390/electronics10050589
Hu XKoutris PBlanas SLibkin LPichler RGuagliardo P(2021)Algorithms for a Topology-aware Massively Parallel Computation ModelProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458318(199-214)Online publication date: 20-Jun-2021
https://dl.acm.org/doi/10.1145/3452021.3458318
Show More Cited By

Index Terms

Minimal MapReduce algorithms
1. Theory of computation
  1. Design and analysis of algorithms

Recommendations

MapReduce: Review and open challenges

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...
Challenges for MapReduce in Big Data
SERVICES '14: Proceedings of the 2014 IEEE World Congress on Services

In the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting continuously increasing demands on computing resources imposed by massive data sets. The reason for this is the high scalability of the MapReduce ...
Survey on improving the performance of MapReduce in Hadoop
NISS '21: Proceedings of the 4th International Conference on Networking, Information Systems & Security

Hadoop has become the most popular and the most used platform in distributed data processing, Hadoop is also an open-source software that implements the MapReduce model for processing big data, it has taken a large part in scientific research in the ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs:
Kenneth Ross
Columbia University
,
Divesh Srivastava
AT&T Research
,
Program Chair:
Dimitris Papadias
HKUST

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

45
Total Citations
View Citations
1,471
Total Downloads

Downloads (Last 12 months)9
Downloads (Last 6 weeks)3

Reflects downloads up to 14 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Ma HLi J(2023)The Hardness of Optimization Problems on the Weighted Massively Parallel Computation ModelComputing and Combinatorics10.1007/978-3-031-49193-1_9(106-117)Online publication date: 9-Dec-2023
https://doi.org/10.1007/978-3-031-49193-1_9
Bazai SJang-Jaccard JAlavizadeh H(2021)Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache SparkElectronics10.3390/electronics1005058910:5(589)Online publication date: 3-Mar-2021
https://doi.org/10.3390/electronics10050589
Hu XKoutris PBlanas SLibkin LPichler RGuagliardo P(2021)Algorithms for a Topology-aware Massively Parallel Computation ModelProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458318(199-214)Online publication date: 20-Jun-2021
https://dl.acm.org/doi/10.1145/3452021.3458318
Lin WZhu FChin Ooi BMiao CWang HSkrypnyk IHsu WChawla S(2021)Large-Scale Network Embedding in Apache SparkProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining10.1145/3447548.3467136(3271-3279)Online publication date: 14-Aug-2021
https://dl.acm.org/doi/10.1145/3447548.3467136
Yoon J(2021)An intelligent distance learning framework: assessment-driven approachIntelligent Systems and Learning Data Analytics in Online Education10.1016/B978-0-12-823410-5.00011-5(273-299)Online publication date: 2021
https://doi.org/10.1016/B978-0-12-823410-5.00011-5
Qi JTao YChang YZhang R(2020)Packing R-trees with Space-filling CurvesACM Transactions on Database Systems10.1145/339750645:3(1-47)Online publication date: 26-Aug-2020
https://dl.acm.org/doi/10.1145/3397506
Cormode GVeselý PSuciu DTao YWei Z(2020)A Tight Lower Bound for Comparison-Based Quantile SummariesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387650(81-93)Online publication date: 14-Jun-2020
https://dl.acm.org/doi/10.1145/3375395.3387650
Xia DSimpson MSrinivasan VThomo A(2020)Strongly Minimal MapReduce Algorithms: A TeraSort Case StudyFoundations of Information and Knowledge Systems10.1007/978-3-030-39951-1_18(301-317)Online publication date: 3-Jan-2020
https://doi.org/10.1007/978-3-030-39951-1_18
Siddique MTian HQaosar MMorimoto Y(2019)MapReduce Algorithm for Variants of Skyline Queries: Skyband and Dominating QueriesAlgorithms10.3390/a1208016612:8(166)Online publication date: 13-Aug-2019
https://doi.org/10.3390/a12080166
Hu XYi KTao Y(2019)Output-Optimal Massively Parallel Algorithms for Similarity JoinsACM Transactions on Database Systems10.1145/331196744:2(1-36)Online publication date: 8-Apr-2019
https://dl.acm.org/doi/10.1145/3311967
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents