MapReduce: simplified data processing on large clusters

Published: 01 January 2008

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.
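
To make the map/reduce decomposition described in the abstract concrete, here is a minimal word-count sketch in Python. It is illustrative only: the names map_fn, reduce_fn, and run_word_count are assumptions of this sketch, and the sequential driver merely simulates what the real runtime does in parallel across a cluster of machines.

```python
# Illustrative word count in the MapReduce style (not Google's implementation).
# map_fn takes a document and emits (word, 1) pairs; reduce_fn sums the counts
# for each word. A real runtime would shard the input, run map tasks in
# parallel, group intermediate pairs by key, and run reduce tasks in parallel.

from collections import defaultdict
from typing import Iterator, List, Tuple


def map_fn(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Emit an intermediate (word, 1) pair for every word in the document."""
    for word in contents.split():
        yield (word, 1)


def reduce_fn(word: str, counts: List[int]) -> Tuple[str, int]:
    """Sum all partial counts emitted for the same word."""
    return (word, sum(counts))


def run_word_count(documents: dict) -> dict:
    # Map phase: apply map_fn to every (document name, contents) pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    # Reduce phase: apply reduce_fn once per distinct intermediate key.
    return dict(reduce_fn(word, counts) for word, counts in intermediate.items())


if __name__ == "__main__":
    docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
    print(run_word_count(docs))  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```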

Supplementary Material

PDF File (p107-dean.jp.pdf)
Requires Asian language support in Adobe Reader and Japanese language support in your browser.

References

[1]
Hadoop: Open source implementation of MapReduce. http://lucene.apache.org/hadoop/.
[2]
The Phoenix system for MapReduce programming. http://csl.stanford.edu/~christos/sw/phoenix/.
[3]
Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Culler, D. E., Hellerstein, J. M., and Patterson, D. A. 1997. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. Tucson, AZ.
[4]
Barroso, L. A., Dean, J., and Hölzle, U. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2, 22-28.
[5]
Bent, J., Thain, D., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Livny, M. 2004. Explicit control in a batch-aware distributed file system. In Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation (NSDI).
[6]
Blelloch, G. E. 1989. Scans as primitive parallel operations. IEEE Trans. Comput. C-38, 11.
[7]
Chu, C.-T., Kim, S. K., Lin, Y. A., Yu, Y., Bradski, G., Ng, A., and Olukotun, K. 2006. Map-Reduce for machine learning on multicore. In Proceedings of Neural Information Processing Systems Conference (NIPS). Vancouver, Canada.
[8]
Dean, J. and Ghemawat, S. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of Operating Systems Design and Implementation (OSDI). San Francisco, CA. 137-150.
[9]
Fox, A., Gribble, S. D., Chawathe, Y., Brewer, E. A., and Gauthier, P. 1997. Cluster-based scalable network services. In Proceedings of the 16th ACM Symposium on Operating System Principles. Saint-Malo, France. 78-91.
[10]
Ghemawat, S., Gobioff, H., and Leung, S.-T. 2003. The Google file system. In 19th Symposium on Operating Systems Principles. Lake George, NY. 29-43.
[11]
Gorlatch, S. 1996. Systematic efficient parallelization of scan and other list homomorphisms. In L. Bouge, P. Fraigniaud, A. Mignotte, and Y. Robert, Eds. Euro-Par'96. Parallel Processing, Lecture Notes in Computer Science, vol. 1124. Springer-Verlag. 401-408.
[12]
Gray, J. Sort benchmark home page. http://research.microsoft.com/barc/SortBenchmark/.
[13]
Huston, L., Sukthankar, R., Wickremesinghe, R., Satyanarayanan, M., Ganger, G. R., Riedel, E., and Ailamaki, A. 2004. Diamond: A storage architecture for early discard in interactive search. In Proceedings of the 2004 USENIX File and Storage Technologies FAST Conference.
[14]
Ladner, R. E., and Fischer, M. J. 1980. Parallel prefix computation. JACM 27, 4. 831-838.
[15]
Rabin, M. O. 1989. Efficient dispersal of information for security, load balancing and fault tolerance. JACM 36, 2. 335-348.
[16]
Ranger, C., Raghuraman, R., Penmetsa, A., Bradski, G., and Kozyrakis, C. 2007. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of 13th International Symposium on High-Performance Computer Architecture (HPCA). Phoenix, AZ.
[17]
Riedel, E., Faloutsos, C., Gibson, G. A., and Nagle, D. 2001. Active disks for large-scale data processing. IEEE Computer 34, 6, 68-74.

Reviews

Chris A Mattmann

Google has revolutionized the way that large-scale data management is engineered, deployed, and evolved over time. In particular, it developed novel methods for file-based data management for rapid indexing and searching of Web pages (PageRank and Google's index structure) and for large-scale data computation. Dean and Ghemawat's article focuses on the description of Google's novel distributed computation paradigm, MapReduce, and its associated infrastructure and deployment at Google.

MapReduce is a programming paradigm in which developers are required to cast a computational problem in the form of two atomic components: a "map" function (similar to the Lisp map function), in which a set of input data in the form of (key, value) pairs is split into a set of intermediate (key, value) pairs, and a "reduce" function (similar to the Lisp reduce function) that takes as input an intermediate key and its set of associated values, and reduces that set to a smaller set, typically consisting of just a single value.

Google has found that several of its mission-critical services can be cast as MapReduce-style problems. Specifically, Dean and Ghemawat tout Google's major success in retooling its production crawling/indexing service as a MapReduce program; there are many other examples, including large-scale machine learning problems, clustering problems for Google News and Froogle, identification of popular queries, processing of satellite imagery, and over 10,000 others. The general applicability and simplicity of the MapReduce paradigm have led to implementation frameworks beyond Google's in-house solution becoming publicly available: Apache Hadoop, an open-source, Java-based implementation of MapReduce, and the Phoenix shared-memory MapReduce system developed by the computer science department at Stanford University (both are mentioned in the paper).

This is a very readable paper that serves as a higher-level summary of Dean and Ghemawat's earlier, more technical paper [1]. The casual practitioner who wants to learn the value added by adopting MapReduce-style programs will find this paper interesting, as will architects who want to understand the core components and architectural style of MapReduce. These readers should focus on sections 2 and 3. For those interested in the specifics of how MapReduce was implemented, optimized, and evaluated at Google, sections 4 and 5 will be of interest. Sections 1 and 6 identify the importance of MapReduce at Google and are valuable in making the business case for adopting it at a particular organization. Overall, this paper is a fast, enjoyable read for any software developer working in the area of data-intensive information systems, as Google has clearly engendered a viable computational paradigm and architectural style for simplifying the construction of software within the domain.

Online Computing Reviews Service
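
The grouping step the review describes, in which the runtime collects all values sharing an intermediate key before the user's reduce function runs, can be made concrete with a small sketch. The snippet below is an in-memory Python model only: the shuffle and run_reduce helpers are hypothetical names invented for illustration, not part of any MapReduce implementation. It hash-partitions intermediate pairs across R reduce tasks, in the spirit of the paper's default partitioning function (hash(key) mod R), then groups by key and invokes a user-supplied reduce.

```python
# Sketch of the shuffle step: intermediate pairs from map tasks are
# hash-partitioned across R reduce tasks, then grouped by key so that each
# reduce call sees one key together with all of its associated values.

from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple


def shuffle(pairs: Iterable[Tuple[Any, Any]], num_reduce_tasks: int):
    """Assign each intermediate pair to a partition via hash(key) mod R."""
    partitions = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        partitions[hash(key) % num_reduce_tasks][key].append(value)
    return partitions


def run_reduce(partitions, reduce_fn: Callable[[Any, List[Any]], Any]) -> dict:
    """Invoke reduce_fn once per distinct key within each partition."""
    output = {}
    for grouped in partitions:
        for key, values in grouped.items():
            output[key] = reduce_fn(key, values)
    return output


if __name__ == "__main__":
    intermediate = [("the", 1), ("fox", 1), ("the", 1), ("dog", 1)]
    result = run_reduce(shuffle(intermediate, num_reduce_tasks=2),
                        reduce_fn=lambda key, values: sum(values))
    print(result)  # {'the': 2, 'fox': 1, 'dog': 1}
```

In a real deployment each partition would live on a different reduce worker, so the choice of partitioning function determines how intermediate data is spread across machines; this in-process version only mimics that data flow.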

Published In

Communications of the ACM, Volume 51, Issue 1
50th anniversary issue: 1958 - 2008
January 2008
106 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/1327452
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 January 2008
Published in CACM Volume 51, Issue 1

Qualifiers

  • Research-article
  • Popular
  • Refereed

Article Metrics

  • Downloads (last 12 months): 8,797
  • Downloads (last 6 weeks): 1,052

Reflects downloads up to 20 Jan 2025

