[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2463676.2463719acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Minimal MapReduce algorithms

Published: 22 June 2013 Publication History

Abstract

MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

References

[1]
A. Abouzeid, K. Bajda-Pawlikowski, D. J. Abadi, A. Rasin, and A. Silberschatz. Hadoopdb: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. PVLDB, 2(1):922--933, 2009.
[2]
F. N. Afrati, A. D. Sarma, D. Menestrina, A. G. Parameswaran, and J. D. Ullman. Fuzzy joins using mapreduce. In ICDE, pages 498--509, 2012.
[3]
F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. TKDE, 23(9):1282--1298, 2011.
[4]
B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In SIGMOD, pages 973--984, 2011.
[5]
B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454--465, 2012.
[6]
K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12):1272--1283, 2011.
[7]
S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in mapreduce. In SIGMOD, pages 975--986, 2010.
[8]
S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421--430, 2001.
[9]
R. Chaiken, B. Jenkins, P. ake Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008.
[10]
B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318--1327, 2011.
[11]
S. Chen. Cheetah: A high performance, custom data warehouse on top of mapreduce. PVLDB, 3(2):1459--1468, 2010.
[12]
F. Chierichetti, R. Kumar, and A. Tomkins. Max-cover in map-reduce. In WWW, pages 231--240, 2010.
[13]
R. L. F. Cordeiro, C. T. Jr., A. J. M. Traina, J. Lopez, U. Kang, and C. Faloutsos. Clustering very large multi-dimensional datasets with mapreduce. In SIGKDD, pages 690--698, 2011.
[14]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. The MIT Press, 2001.
[15]
A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007.
[16]
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004.
[17]
F. K. H. A. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. In SoCG, pages 298--307, 1993.
[18]
J. Dittrich, J.-A. Quiane-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop
[19]
: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010.
[20]
I. Elghandour and A. Aboulnaga. Restore: Reusing results of mapreduce jobs. PVLDB, 5(6):586--597, 2012.
[21]
M. Y. Eltabakh, Y. Tian, F. Ozcan, R. Gemulla, A. Krettek, and J. McPherson. Cohadoop: Flexible data placement and its exploitation in hadoop. PVLDB, 4(9):575--585, 2011.
[22]
A. Ene, S. Im, and B. Moseley. Fast clustering using mapreduce. In SIGKDD, pages 681--689, 2011.
[23]
A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4(7):419--429, 2011.
[24]
A. Ghoting, P. Kambadur, E. P. D. Pednault, and R. Kannan. Nimble: a toolkit for the implementation of parallel data mining and machine learning algorithms on mapreduce. In SIGKDD, pages 334--342, 2011.
[25]
A. Ghoting, R. Krishnamurthy, E. P. D. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. Systemml: Declarative machine learning on mapreduce. In ICDE, pages 231--242, 2011.
[26]
R. Grover and M. J. Carey. Extending map-reduce for efficient predicate-based sampling. In ICDE, pages 486--497, 2012.
[27]
B. Gufler, N. Augsten, A. Reiser, and A. Kemper. Load balancing in mapreduce based on scalable cardinality estimates. In ICDE, pages 522--533, 2012.
[28]
Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011.
[29]
H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. PVLDB, 4(11):1111--1122, 2011.
[30]
E. Jahani, M. J. Cafarella, and C. Re. Automatic optimization for mapreduce programs. PVLDB, 4(6):385--396, 2011.
[31]
J. Jestes, F. Li, and K. Yi. Building wavelet histograms on large data in mapreduce. In PVLDB, pages 617--620, 2012.
[32]
H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010.
[33]
N. Khoussainova, M. Balazinska, and D. Suciu. Perfxplain: Debugging mapreduce job performance. PVLDB, 5(7):598--609, 2012.
[34]
L. Kolb, A. Thor, and E. Rahm. Load balancing for mapreduce-based entity resolution. In ICDE, pages 618--629, 2012.
[35]
P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. In PODS, pages 223--234, 2011.
[36]
H. T. Kung, F. Luccio, and F. P. Preparata. On finding the maxima of a set of vectors. JACM, 22(4):469--476, 1975.
[37]
Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia. Skewtune: mitigating skew in mapreduce applications. In SIGMOD, pages 25--36, 2012.
[38]
W. Lang and J. M. Patel. Energy management for mapreduce clusters. PVLDB, 3(1):129--139, 2010.
[39]
N. Laptev, K. Zeng, and C. Zaniolo. Early accurate results for advanced analytics on mapreduce. PVLDB, 5(10):1028--1039, 2012.
[40]
S. Lattanzi, B. Moseley, S. Suri, and S. Vassilvitskii. Filtering: a method for solving graph problems in mapreduce. In SPAA, pages 85--94, 2011.
[41]
H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. PVLDB, 5(11):1196--1207, 2012.
[42]
Y. Lin, D. Agrawal, C. Chen, B. C. Ooi, and S. Wu. Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In SIGMOD, pages 961--972, 2011.
[43]
W. Lu, Y. Shen, S. Chen, and B. C. Ooi. Efficient processing of k nearest neighbor joins using mapreduce. PVLDB, 5(10):1016--1027, 2012.
[44]
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330--339, 2010.
[45]
A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB, 5(8):704--715, 2012.
[46]
G. D. F. Morales, A. Gionis, and M. Sozio. Social content matching in mapreduce. PVLDB, 4(7):460--469, 2011.
[47]
K. Morton, M. Balazinska, and D. Grossman. Paratimer: a progress indicator for mapreduce dags. In SIGMOD, pages 507--518, 2010.
[48]
T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: Sharing across multiple queries in mapreduce. PVLDB, 3(1):494--505, 2010.
[49]
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011.
[50]
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008.
[51]
O. O'Malley. Terabyte sort on apache hadoop. Technical report, Yahoo, 2008.
[52]
B. Panda, J. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. PVLDB, 2(2):1426--1437, 2009.
[53]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. PVLDB, 4(11):1135--1145, 2011.
[54]
A. Shinnar, D. Cunningham, B. Herta, and V. A. Saraswat. M3r: Increased performance for in-memory hadoop jobs. PVLDB, 5(12):1736--1747, 2012.
[55]
S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607--614, 2011.
[56]
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, pages 996--1005, 2010.
[57]
C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: counting triangles in massive graphs with a coin. In SIGKDD, pages 837--846, 2009.
[58]
L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990.
[59]
R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive mapreduce using situation-aware mappers. In EDBT, pages 420--431, 2012.
[60]
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010.
[61]
G. Wang, M. A. V. Salles, B. Sowell, X. Wang, T. Cao, A. J. Demers, J. Gehrke, and W. M. White. Behavioral simulations in mapreduce. PVLDB, 3(1):952--963, 2010.
[62]
B. Zhang, S. Zhou, and J. Guan. Adapting skyline computation to the mapreduce framework: Algorithms and experiments. In DASFAA Workshops, pages 403--414, 2011.
[63]
X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing using mapreduce. PVLDB, 5(11):1184--1195, 2012.

Cited By

View all
  • (2023)The Hardness of Optimization Problems on the Weighted Massively Parallel Computation ModelComputing and Combinatorics10.1007/978-3-031-49193-1_9(106-117)Online publication date: 9-Dec-2023
  • (2021)Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache SparkElectronics10.3390/electronics1005058910:5(589)Online publication date: 3-Mar-2021
  • (2021)Algorithms for a Topology-aware Massively Parallel Computation ModelProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458318(199-214)Online publication date: 20-Jun-2021
  • Show More Cited By

Index Terms

  1. Minimal MapReduce algorithms

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN:9781450320375
    DOI:10.1145/2463676
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 June 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. big data
    2. mapreduce
    3. minimal algorithm

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13
    Sponsor:

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)9
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 14 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)The Hardness of Optimization Problems on the Weighted Massively Parallel Computation ModelComputing and Combinatorics10.1007/978-3-031-49193-1_9(106-117)Online publication date: 9-Dec-2023
    • (2021)Scalable, High-Performance, and Generalized Subtree Data Anonymization Approach for Apache SparkElectronics10.3390/electronics1005058910:5(589)Online publication date: 3-Mar-2021
    • (2021)Algorithms for a Topology-aware Massively Parallel Computation ModelProceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3452021.3458318(199-214)Online publication date: 20-Jun-2021
    • (2021)Large-Scale Network Embedding in Apache SparkProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining10.1145/3447548.3467136(3271-3279)Online publication date: 14-Aug-2021
    • (2021)An intelligent distance learning framework: assessment-driven approachIntelligent Systems and Learning Data Analytics in Online Education10.1016/B978-0-12-823410-5.00011-5(273-299)Online publication date: 2021
    • (2020)Packing R-trees with Space-filling CurvesACM Transactions on Database Systems10.1145/339750645:3(1-47)Online publication date: 26-Aug-2020
    • (2020)A Tight Lower Bound for Comparison-Based Quantile SummariesProceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3375395.3387650(81-93)Online publication date: 14-Jun-2020
    • (2020)Strongly Minimal MapReduce Algorithms: A TeraSort Case StudyFoundations of Information and Knowledge Systems10.1007/978-3-030-39951-1_18(301-317)Online publication date: 3-Jan-2020
    • (2019)MapReduce Algorithm for Variants of Skyline Queries: Skyband and Dominating QueriesAlgorithms10.3390/a1208016612:8(166)Online publication date: 13-Aug-2019
    • (2019)Output-Optimal Massively Parallel Algorithms for Similarity JoinsACM Transactions on Database Systems10.1145/331196744:2(1-36)Online publication date: 8-Apr-2019
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media