[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2442992.2443000acmconferencesArticle/Chapter ViewAbstractPublication PagesppoppConference Proceedingsconference-collections
research-article

A generate-test-aggregate parallel programming library: systematic parallel programming for MapReduce

Published: 23 February 2013 Publication History

Abstract

Generate-Test-Aggregate (GTA for short) is a novel programming model for MapReduce, dramatically simplifying the development of efficient parallel algorithms. Under the GTA model, a parallel computation is encoded into a simple pattern: generate all candidates, test them to filter out invalid ones, and aggregate valid ones to make the result. Once users specify their parallel computations in the GTA style, they get efficient MapReduce programs for free owing to an automatic optimization given by the GTA theory.
In this paper, we report our implementation of a GTA library to support programming in the GTA model. In this library, we provide a compact programming interface for hiding the complexity of GTA's internal transformation, so that many problems can be encoded in the GTA style easily and straightforwardly. The GTA transformation and optimization mechanism implemented inside is a black-box to the end users, while users can extend the library by modifying existing (or implementing new) generators, testers or aggregators through standard programming interfaces of the GTA library. This GTA programming library supports both sequential or parallel execution on single computer and on-cluster execution with MapReduce computing engines. We evaluate our library by giving the results of our experiments on large data to show the efficiency, scalability and usefulness of this GTA library.

References

[1]
Apache Software Foundation. Avro. http://avro.apache.org/.
[2]
Apache Software Foundation. Hadoop. http://hadoop.apache.org/.
[3]
R. S. Bird. An introduction to the theory of lists. In M. Broy, editor, Logic of Programming and Calculi of Discrete Design, volume 36 of NATO ASI Series F, pages 5--42. Springer, 1987.
[4]
R. S. Bird. Introduction to Functional Programming using Haskell. Prentice Hall, 1998.
[5]
W. N. Chin, S. C. Khoo, Z. Hu, and M. Takeichi. Deriving parallel codes via invariants. In International Static Analysis Symposium (SAS2000), volume Lecture Notes in Computer Science 1824, pages 75--94. Springer Verlag, 2000.
[6]
M. Cole. List homomorphic parallel algorithms for bracket matching. Technical report, Department of Computer Science, University of Edinburgh, 1993.
[7]
M. Cole. Parallel programming, list homomorphisms and the maximum segment sum problems. Report CSR-25-93, Department of Computing Science, The University of Edinburgh, 1993.
[8]
M. Cole. Parallel programming with list homomorphisms. Parallel Processing Letters, 5(2):191--203, 1995.
[9]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In 6th Symposium on Operating System Design and Implementation (OSDI2004), December 6--8, 2004, San Francisco, California, USA, pages 137--150, 2004.
[10]
K. Emoto, S. Fischer, and Z. Hu. Filter-embedding semiring fusion for programming with mapreduce. Formal Aspects of Computing, 24:623--645, 2012.
[11]
K. Emoto, S. Fischer, and Z. Hu. Generate, test, and aggregate - a calculation-based framework for systematic parallel programming with mapreduce. In Programming Languages and Systems, 21st European Symposium on Programming (ESOP 2012), volume 7211 of Lecture Notes in Computer Science, pages 254--273. Springer, 2012.
[12]
A. L. Fisher and A. M. Ghuloum. Parallelizing complex scans and reductions. In Proceedings of the ACM SIGPLAN '94 Conference on Programming Language Design and Implementation (PLDI '94), pages 135--146, 1994.
[13]
A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. VLDB Endow., 2(2):1414--1425, Aug. 2009.
[14]
J. Goodman. Semiring parsing. Computational Linguistics, 25:573--605, 1999.
[15]
S. Gorlatch. Systematic efficient parallelization of scan and other list homomorphisms. In L. Bougé, P. Fraigniaud, A. Mignotte, and Y. Robert, editors, Euro-Par '96 Parallel Processing, Second International Euro-Par Conference, Lyon, France, August 26--29, 1996, Proceedings, Volume II, volume 1124 of Lecture Notes in Computer Science, pages 401--408. Springer, 1996.
[16]
S. Gorlatch. Systematic extraction and implementation of divide-and-conquer parallelism. In H. Kuchen and D. Swierstra, editors, Programming languages: Implementation, Logics and Programs. PLILP'96, Lecture Notes in Computer Science 1140, pages 274--288. Springer-Verlag, 1996.
[17]
Z. Hu, H. Iwasaki, and M. Takeichi. Formal derivation of parallel program for 2-dimensional maximum segment sum problem. In Euro-Par '96 Parallel Processing, Second International Euro-Par Conference, Lyon, France, August 26--29, 1996, Proceedings, Volume I, volume 1123 of Lecture Notes in Computer Science, pages 553--562. Springer, 1996.
[18]
Z. Hu, H. Iwasaki, and M. Takeichi. Formal derivation of efficient parallel programs by construction of list homomorphisms. ACM Transactions on Programming Languages and Systems, 19(3):444--461, 1997.
[19]
Z. Hu, M. Takeichi, and W.-N. Chin. Parallelization in calculational forms. In POPL '98, Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, January 19--21, 1998, San Diego, CA, USA, pages 316--328. ACM Press, 1998.
[20]
Y. Liu, Z. Hu, and K. Matsuzaki. Towards systematic parallel programming over mapreduce. In Proceedings of the 17th international Euro-Par conference on Parallel processing: Part II, Euro-Par'11, Berlin, Heidelberg, 2011. Springer-Verlag.
[21]
K. Morita, A. Morihata, K. Matsuzaki, Z. Hu, and M. Takeichi. Automatic inversion generates divide-and-conquer parallel programs. In J. Ferrante and K. S. McKinley, editors, Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10--13, 2007, pages 146--155. ACM, 2007.
[22]
K. Morita, A. Morihata, K. Matsuzaki, Z. Hu, and M. Takeichi. Automatic inversion generates divide-and-conquer parallel programs. In ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), pages 146--155. ACM Press, June 2007.
[23]
S.-C. Mu. Maximum segment sum is back: deriving algorithms for two segment problems with bounded lengths. In Proceedings of the 2008 ACM SIGPLAN symposium on Partial evaluation and semantics-based program manipulation, PEPM '08, pages 31--39, New York, NY, USA, 2008. ACM.
[24]
M. Odersky and al. The scala language specification, version 2.9. Technical report, EPFL Lausanne, Switzerland, 2011.
[25]
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Sci. Program., 13(4):277--298, Oct. 2005.
[26]
I. Sasano, Z. Hu, M. Takeichi, and M. Ogawa. Make it practical: A generic linear time algorithm for solving maximum-weightsum problems. In The 2000 ACM SIGPLAN International Conference on Functional Programming (ICFP 2000), pages 137--149. ACM Press, 2000.
[27]
S. Sato and H. Iwasaki. Automatic parallelization via matrix multiplication. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation (PLDI '11), pages 470--479. ACM, 2011.
[28]
D. B. Skillicorn. The Bird-Meertens formalism as a parallel model. In J. S. Kowalik and L. Grandinetti, editors, NATO ASI Workshop on Software for Parallel Computation, NATO ARW "Software for Parallel Computation", volume 106 of F. Springer-Verlag NATO ASI, 1992.
[29]
K. Varda. Protocol buffers: Google's data interchange format. Technical report, Google, 6 2008.
[30]
A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on, 13(2):260--269, apr 1967.
[31]
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, NSDI'12, pages 2--2, Berkeley, CA, USA, 2012. USENIX Association.
[32]
P. G. H. Zully N. Grant-Duff. Parallelism via homomorphism. Parallel Processing Letters, 6(2):279--295, 1996.

Cited By

View all
  • (2018)Thrust2D: A new design abstraction framework for structured grid class of algorithmsConcurrency and Computation: Practice and Experience10.1002/cpe.474030:19Online publication date: 17-Jul-2018
  • (2013)Hourglass: A library for incremental processing on Hadoop2013 IEEE International Conference on Big Data10.1109/BigData.2013.6691647(742-752)Online publication date: Oct-2013

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PMAM '13: Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores
February 2013
134 pages
ISBN:9781450319089
DOI:10.1145/2442992
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 February 2013

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. MapReduce
  2. functional programming
  3. generate test aggregate programming model
  4. high-level parallel programming
  5. optimization
  6. scala

Qualifiers

  • Research-article

Conference

PPoPP '13
Sponsor:

Acceptance Rates

Overall Acceptance Rate 53 of 97 submissions, 55%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2018)Thrust2D: A new design abstraction framework for structured grid class of algorithmsConcurrency and Computation: Practice and Experience10.1002/cpe.474030:19Online publication date: 17-Jul-2018
  • (2013)Hourglass: A library for incremental processing on Hadoop2013 IEEE International Conference on Big Data10.1109/BigData.2013.6691647(742-752)Online publication date: Oct-2013

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media