More Web Proxy on the site http://driver.im/

research-article

Open access

Apache REEF: Retainable Evaluator Execution Framework

ACM Transactions on Computer Systems (TOCS), Volume 35, Issue 2

Article No.: 5, Pages 1 - 31

https://doi.org/10.1145/3132037

Published: 10 October 2017 Publication History

Abstract

Resource Managers like YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault tolerance, task scheduling and coordination) and reimplement common mechanisms (e.g., caching, bulk-data transfers). This article presents REEF, a development framework that provides a control plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service. We illustrate the power of REEF by showing applications built atop: a distributed shell application, a machine-learning framework, a distributed in-memory caching system, and a port of the CORFU system. REEF is currently an Apache top-level project that has attracted contributors from several institutions and it is being used to develop several commercial offerings such as the Azure Stream Analytics service.

References

[1]

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2011. A reliable effective terascale linear learning system. CoRR abs/1110.4198 (2011).

[2]

A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. 2012. Scalable inference in latent variable models. In ACM International Conference on Web Search and Data Mining (WSDM’12).

Digital Library

[3]

Peter Alvaro, Neil Conway, Joe Hellerstein, and William R. Marczak. 2011. Consistency analysis in bloom: A CALM and collected approach. In Conference on Innovative Data Systems Research (CIDR’11).

[4]

Colin McCabe and Andrew Wang. 2013. Centralized cache management in HDFS. https://issues.apache.org/jira/browse/HDFS-4949.

[5]

The Kubernetes Authors. 2015. Kubernetes. Retrieved from https://kubernetes.io/.

[6]

Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Transactions on Computer Systems (TOCS) 31, 4 (2013), 10.

Digital Library

[7]

Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. 2010. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In ACM Symposium on Cloud Computing (SoCC’10).

Digital Library

[8]

Alex Beutel, Markus Weimer, Vijay Narayanan, Yordan Zaykov, and Tom Minka. 2014. Elastic distributed bayesian collaborative filtering. In NIPS Workshop on Distributed Machine Learning and Matrix Computations.

[9]

Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Technical Committee on Data Engineering (TCDE) 35, 2 (2012), 24--32.

[10]

Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In International Conference on Data Engineering (ICDE’11).

Digital Library

[11]

Olivier Bousquet and Léon Bottou. 2007. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS’07).

[12]

Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 571--582.

[13]

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS’06).

[14]

Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.

[15]

Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113.

Digital Library

[16]

Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative mapreduce. In ACM International Symposium on High Performance Distributed Computing (HPDC’10).

Digital Library

[17]

Google. 2015. Guice. Retrieved from https://github.com/google/guice.

[18]

William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. 1998. MPI - The Complete Reference: Volume 2, The MPI-2 Extensions. MIT Press.

[19]

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for fine-grained resource sharing in the data center. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’11).

[20]

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM European Conference on Computer Systems (EuroSys’07).

Digital Library

[21]

Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. An analysis of traces from a production mapreduce cluster. In IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid’10).

Digital Library

[22]

Michael Kearns. 1998. Efficient noise-tolerant learning from statistical queries. Journal of the ACM 45, 6 (1998), 983--1006.

Digital Library

[23]

Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The click modular router. ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), 263--297.

Digital Library

[24]

J. Kreps, N. Narkhede, and J. Rao. 2011. Kafka: A distributed messaging system for log processing. In International Workshop on Networking Meets Databases (NetDB’11).

[25]

Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan. 2013. Distributed and scalable PCA in the cloud. In BigLearn NIPS Workshop.

[26]

Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In ACM Symposium on Cloud Computing (SoCC’14).

Digital Library

[27]

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14).

Digital Library

[28]

Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI’10).

[29]

Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’10).

Digital Library

[30]

Nathan Marz. 2015. Storm: Distributed and Fault-Tolerant Realtime Computation. http://storm.apache.org.

[31]

Erik Meijer. 2012. Your mouse is a database. Communications of the ACM 55, 5 (2012), 66--73.

Digital Library

[32]

Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In ACM Symposium on Operating Systems Principles (SOSP’13).

Digital Library

[33]

Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam, and S. Sathiya Keerthi. 2013. Towards resource-elastic machine learning. In BigLearn NIPS Workshop.

[34]

Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In IEEE International Conference on Data Mining Workshops (ICDMW’10).

Digital Library

[35]

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’08).

Digital Library

[36]

Ariel Rabkin. 2012. Using Program Analysis to Reduce Misconfiguration in Open Source Systems Software. Ph.D. Dissertation, UC Berkeley (2012).

[37]

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. DTIC Document.

[38]

Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. 2015. Apache tez: A unifying framework for modeling and building data processing applications. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15).

Digital Library

[39]

Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: Flexible, scalable schedulers for large compute clusters. In ACM European Conference on Computer Systems (EuroSys’13).

Digital Library

[40]

Marc Shapiro and Nuno M. Preguiça. 2007. Designing a commutative replicated data type. CoRR abs/0710.1784.

[41]

Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with datalog queries on spark. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York,1135--1149.

Digital Library

[42]

Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proceedings of the VLDB Endowment 3, 1--2 (2010), 703--710.

Digital Library

[43]

Michael Stonebraker and Ugur Cetintemel. 2005. One size fits all: An idea whose time has come and gone. In International Conference on Data Engineering (ICDE’05).

Digital Library

[44]

The Apache Software Foundation. 2017. Apache Accumulo. Retrieved from http://accumulo.apache.org/.

[45]

The Apache Software Foundation. 2017. Apache Giraph. Retrieved from http://giraph.apache.org/.

[46]

The Apache Software Foundation. 2017. Apache Hadoop. Retrieved from http://hadoop.apache.org.

[47]

The Apache Software Foundation. 2017. Apache Mahout. Retrieved from http://mahout.apache.org.

[48]

The Apache Software Foundation. 2017. Apache Slider. Retrieved from http://slider.incubator.apache.org/.

[49]

The Apache Software Foundation. 2017. Apache Twill. Retrieved from http://twill.apache.org/.

[50]

The Netty Project. 2015. Netty. Retrieved from http://netty.io.

[51]

Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive -- A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment (PVLDB’09).

[52]

Leslie G. Valiant. 1990. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990), 103--111.

Digital Library

[53]

Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In ACM Symposium on Cloud Computing (SoCC’13).

Digital Library

[54]

Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamuthy, Raghu Ramakrishnan, Sriram Rao, Russell Sear, Beysim Sezgin, and Julia Wang. 2015. REEF: Retainable evaluator execution framework. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15).

Digital Library

[55]

Markus Weimer, Sriram Rao, and Martin Zinkevich. 2010. A convenient framework for efficient parallel multipass algorithms. In NIPS Workshop on Learning on Cores, Clusters and Clouds.

[56]

Matt Welsh. 2013. What I wish systems researchers would work on. Retrieved from http://matt-welsh.blogspot.com/2013/05/what-i-wish-systems-researchers-would.html.

[57]

Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: An architecture for well-conditioned, scalable internet services. SIGOPS Operating Systems Review 35 (2001), 230--243.

Digital Library

[58]

Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. 2009. Stochastic gradient boosted distributed decision trees. In ACM Conference on Information and Knowledge Management (CIKM’09).

Digital Library

[59]

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’12).

Digital Library

[60]

Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’10).

[61]

Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel databases meet mapreduce. VLDB Journal 21, 5 (2012), 611--636.

Digital Library

Cited By

Xuejiao HYiwei ZJianxing YDianhua W(2022)Discrete configurations and combinatorial methods originated from distributed networkSCIENTIA SINICA Mathematica10.1360/SSM-2022-007453:2(151)Online publication date: 28-Sep-2022
https://doi.org/10.1360/SSM-2022-0074
Lee WLee YSong WYang YKim JChun B(2021)Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS51616.2021.00085(841-851)Online publication date: Jul-2021
https://doi.org/10.1109/ICDCS51616.2021.00085
Yang YChung JWang GGupta VKarnati AJiang KStoica IGonzalez JRamchandran K(2020)Robust Class Parallelism - Error Resilient Parallel Inference with Low Communication Cost2020 54th Asilomar Conference on Signals, Systems, and Computers10.1109/IEEECONF51394.2020.9443452(1064-1065)Online publication date: 1-Nov-2020
https://doi.org/10.1109/IEEECONF51394.2020.9443452
Show More Cited By

Index Terms

Apache REEF: Retainable Evaluator Execution Framework

Recommendations

REEF: Retainable Evaluator Execution Framework
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Resource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a ...
Learning Apache Spark 2.0
Apache Accumulo for Developers

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Computer Systems

ACM Transactions on Computer Systems Volume 35, Issue 2

May 2017

113 pages

ISSN:0734-2071

EISSN:1557-7333

DOI:10.1145/3129286

Editor:
Todd C. Mowry
Carnegie Mellon University, Pittsburgh, PA

Issue’s Table of Contents

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 October 2017

Accepted: 01 July 2017

Revised: 01 February 2017

Received: 01 December 2015

Published in TOCS Volume 35, Issue 2

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

IITP (Institute for Information 8 Communications Technology Promotion)
National Institute of Biomedical Imaging and Bioengineering (NIBIB)
UCLA
trans-NIH Big Data to Knowledge (BD2K)
MSIT (Ministry of Science and ICT), Korea

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

6
Total Citations
View Citations
1,068
Total Downloads

Downloads (Last 12 months)98
Downloads (Last 6 weeks)15

Reflects downloads up to 30 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Xuejiao HYiwei ZJianxing YDianhua W(2022)Discrete configurations and combinatorial methods originated from distributed networkSCIENTIA SINICA Mathematica10.1360/SSM-2022-007453:2(151)Online publication date: 28-Sep-2022
https://doi.org/10.1360/SSM-2022-0074
Lee WLee YSong WYang YKim JChun B(2021)Harmony: A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS)10.1109/ICDCS51616.2021.00085(841-851)Online publication date: Jul-2021
https://doi.org/10.1109/ICDCS51616.2021.00085
Yang YChung JWang GGupta VKarnati AJiang KStoica IGonzalez JRamchandran K(2020)Robust Class Parallelism - Error Resilient Parallel Inference with Low Communication Cost2020 54th Asilomar Conference on Signals, Systems, and Computers10.1109/IEEECONF51394.2020.9443452(1064-1065)Online publication date: 1-Nov-2020
https://doi.org/10.1109/IEEECONF51394.2020.9443452
Gautam BBasava A(2019)Performance prediction of data streams on high-performance architectureHuman-centric Computing and Information Sciences10.1186/s13673-018-0163-49:1(1-23)Online publication date: 1-Dec-2019
https://dl.acm.org/doi/10.1186/s13673-018-0163-4
Yang YInterlandi MGrover PKar SAmizadeh SWeimer M(2019)Coded Elastic Computing2019 IEEE International Symposium on Information Theory (ISIT)10.1109/ISIT.2019.8849212(2654-2658)Online publication date: Jul-2019
https://doi.org/10.1109/ISIT.2019.8849212
Zhao JBao QNi Y(2019)Improved SM4 Encryption Algorithm Based on Mixed Congruence MethodInnovative Mobile and Internet Services in Ubiquitous Computing10.1007/978-3-030-22263-5_23(236-245)Online publication date: 19-Jun-2019
https://doi.org/10.1007/978-3-030-22263-5_23

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Media

Figures

Other

Tables

View Issue’s Table of Contents