FlumeJava: easy, efficient data-parallel pipelines

Published: 05 June 2010

Abstract

MapReduce and similar systems significantly ease the task of writing data-parallel code. However, many real-world computations require a pipeline of MapReduces, and programming and managing such pipelines can be difficult. We present FlumeJava, a Java library that makes it easy to develop, test, and run efficient data-parallel pipelines. At the core of the FlumeJava library are a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies. To enable parallel operations to run efficiently, FlumeJava defers their evaluation, instead internally constructing an execution plan dataflow graph. When the final results of the parallel operations are eventually needed, FlumeJava first optimizes the execution plan, and then executes the optimized operations on appropriate underlying primitives (e.g., MapReduces). The combination of high-level abstractions for parallel data and computation, deferred evaluation and optimization, and efficient parallel primitives yields an easy-to-use system that approaches the efficiency of hand-optimized pipelines. FlumeJava is in active use by hundreds of pipeline developers within Google.
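
The deferred-evaluation idea in the abstract can be illustrated with a small, self-contained Java sketch. This is not the FlumeJava API: the class `DeferredPipelineSketch`, the type `LazyCollection`, and the methods `of`, `map`, `plannedSteps`, and `run` are hypothetical names invented for this example. It assumes a single in-memory list as input and uses function fusion (composing all recorded element-wise steps into one pass) as a stand-in for plan optimization; the abstract does not say which optimizations FlumeJava actually performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy illustration of deferred evaluation; NOT the FlumeJava API. */
public class DeferredPipelineSketch {

    /** An immutable, lazily evaluated collection: source data plus recorded steps. */
    public static final class LazyCollection<T> {
        private final List<?> source;                       // original input data
        private final List<Function<Object, Object>> steps; // recorded plan, in order

        private LazyCollection(List<?> source, List<Function<Object, Object>> steps) {
            this.source = source;
            this.steps = steps;
        }

        /** Wraps input data; the plan starts out empty. */
        public static <T> LazyCollection<T> of(List<T> data) {
            return new LazyCollection<>(List.copyOf(data), List.of());
        }

        /** Records an element-wise step and returns a new collection; nothing runs yet. */
        @SuppressWarnings("unchecked")
        public <R> LazyCollection<R> map(Function<? super T, ? extends R> fn) {
            List<Function<Object, Object>> extended = new ArrayList<>(steps);
            extended.add(x -> fn.apply((T) x));
            return new LazyCollection<>(source, List.copyOf(extended));
        }

        /** Number of steps currently recorded in the plan. */
        public int plannedSteps() {
            return steps.size();
        }

        /** Fuses all recorded steps into one function, then makes a single pass. */
        @SuppressWarnings("unchecked")
        public List<T> run() {
            Function<Object, Object> fused = Function.identity();
            for (Function<Object, Object> step : steps) {
                fused = fused.andThen(step);
            }
            List<T> out = new ArrayList<>(source.size());
            for (Object item : source) {
                out.add((T) fused.apply(item));
            }
            return out;
        }
    }

    public static void main(String[] args) {
        LazyCollection<String> words = LazyCollection.of(List.of("flume", "java"));
        // Two steps are recorded here, but no element is processed until run().
        LazyCollection<Integer> lengths = words.map(String::toUpperCase).map(String::length);
        System.out.println(lengths.plannedSteps()); // prints 2
        System.out.println(lengths.run());          // prints [5, 4]
    }
}
```

Because each `map` call returns a new `LazyCollection` and leaves the original untouched, the collections stay immutable, and the work happens only when `run()` asks for the final result. A real system would execute the optimized plan on distributed primitives rather than a local loop.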

Information & Contributors

Information

Published In

ACM SIGPLAN Notices, Volume 45, Issue 6
PLDI '10
June 2010
496 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1809028
  • PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2010
    514 pages
    ISBN: 9781450300193
    DOI: 10.1145/1806596
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2010
Published in SIGPLAN Volume 45, Issue 6

Author Tags

  1. data-parallel programming
  2. java
  3. mapreduce

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 104
  • Downloads (Last 6 weeks): 23
Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Kimbap: A Node-Property Map System for Distributed Graph Analytics. Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 566-581. DOI: 10.1145/3620665.3640421. Online publication date: 27-Apr-2024
  • (2024) Design and Implementation of a Big Query Dataset and Application Programmer Interface (API) for the U.S. National Water Model. Environmental Modelling & Software, article 106123. DOI: 10.1016/j.envsoft.2024.106123. Online publication date: Jun-2024
  • (2023) A Technique for Generating Preliminary Satellite Data to Evaluate SUHI Using Cloud Computing: A Case Study in Moscow, Russia. Remote Sensing 15:13, article 3294. DOI: 10.3390/rs15133294. Online publication date: 27-Jun-2023
  • (2023) Hydrus: Improving Personalized Quality of Experience in Short-form Video Services. Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1127-1136. DOI: 10.1145/3539618.3591696. Online publication date: 19-Jul-2023
  • (2023) Data Pipeline of Efficient Stream Data Ingestion for Game Analytics. Advances in Internet, Data & Web Technologies, pp. 483-490. DOI: 10.1007/978-3-031-26281-4_50. Online publication date: 12-Feb-2023
  • (2023) Dataset versus reality. Journal of the Association for Information Science and Technology 74:11, pp. 1293-1306. DOI: 10.1002/asi.24825. Online publication date: 3-Oct-2023
  • (2022) TaiSu. Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 16705-16717. DOI: 10.5555/3600270.3601485. Online publication date: 28-Nov-2022
  • (2022) A Serverless-Based, On-the-Fly Computing Framework for Remote Sensing Image Collection. Remote Sensing 14:7, article 1728. DOI: 10.3390/rs14071728. Online publication date: 3-Apr-2022
  • (2022) Cloud-based storage and computing for remote sensing big data: a technical review. International Journal of Digital Earth 15:1, pp. 1417-1445. DOI: 10.1080/17538947.2022.2115567. Online publication date: 24-Aug-2022
  • (2022) SQL Query Optimization in Distributed NoSQL Databases for Cloud-Based Applications. Algorithmic Aspects of Cloud Computing, pp. 21-41. DOI: 10.1007/978-3-031-33437-5_2. Online publication date: 6-Sep-2022
