DOI: 10.1145/2670979.2670992
tutorial

The Case for Drill-Ready Cloud Computing

Published: 03 November 2014

Abstract

As cloud computing has matured, more and more local applications are being replaced by easy-to-use, on-demand services accessible over computer networks (a.k.a. cloud services). Behind these services run massive hardware infrastructures and complex management tasks (e.g., recovery, software upgrades) that, if not tested thoroughly, can exhibit failures that lead to major service disruptions. Some researchers estimate that 568 hours of downtime at 13 well-known cloud services since 2007 had an economic impact of more than $70 million [18]. Others predict worse: for every hour it is down, a cloud service can lose between $1 million and $5 million [32]. Moreover, an outage of a popular service can shut down other dependent services [11, 37, 59], leaving many more users frustrated and furious.

References

[1]
http://cloutage.org.
[2]
Amazon Web Services. http://aws.amazon.com.
[3]
Apache HBase Operational Management. http://hbase.apache.org/book/ops_mgt.html.
[4]
Cassandra Operations. http://wiki.apache.org/cassandra/Operations.
[5]
DevOps GameDay. https://github.com/cloudworkshop/devopsgameday/wiki.
[6]
Open Sourced Vulnerability Database. http://www.osvdb.org.
[7]
Peter Alvaro, Tyson Condie, Neil Conway, Khaled Elmeleegy, Joseph M. Hellerstein, and Russell C. Sears. BOOM Analytics: Exploring Data-Centric, Declarative Programming for the Cloud. In EuroSys '10.
[8]
Mona Attariyan, Michael Chow, and Jason Flinn. X-ray: Automating Root-Cause Diagnosis of Performance Anomalies in Production Software. In OSDI '12.
[9]
Cory Bennett and Ariel Tseitlin. Chaos Monkey Released Into The Wild. http://techblog.netflix.com, 2012.
[10]
Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski, and Larry Peterson. Lightweight, High-Resolution Monitoring for Troubleshooting Production Systems. In OSDI '08.
[11]
Henry Blodget. Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers' Data. http://www.businessinsider.com, 2011.
[12]
Andrew Bosworth. Building and testing at Facebook. http://www.facebook.com/Engineering, 2012.
[13]
Marco Canini, Vojin Jovanović, Daniele Venzano, Boris Spasojević, Olivier Crameri, and Dejan Kostić. Toward Online Testing of Federated and Heterogeneous Distributed Systems. In USENIX ATC '11.
[14]
Boston Computing. Data Loss Statistics. http://www.bostoncomputing.net.
[15]
Olivier Crameri, Nikola Knezevic, Dejan Kostic, Ricardo Bianchini, and Willy Zwaenepoel. Staged Deployment in Mirage, an Integrated Software Upgrade Testing and Distribution System. In SOSP '07.
[16]
Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. Limplock: Understanding the Impact of Limpware on Scale-Out Cloud Systems. In SoCC '13.
[17]
U. Erlingsson, M. Peinado, S. Peter, and M. Budiu. Fay: Extensible Distributed Tracing from Kernels to Clusters. In SOSP '11.
[18]
Loek Essers. Cloud Failures Cost More Than $70 Million Since 2007, Researchers Estimate. http://www.pcworld.com, 2012.
[19]
Daniel B. Giffin, Amit Levy, Deian Stefan, David Terei, David Mazieres, John C. Mitchell, and Alejandro Russo. Hails: Protecting Data Privacy in Untrusted Web Applications. In OSDI '12.
[20]
Haryadi S. Gunawi, Thanh Do, Joseph M. Hellerstein, Ion Stoica, Dhruba Borthakur, and Jesse Robbins. Failure as a Service (FaaS): A Cloud Service for Large-Scale, Online Failure Drills. UC Berkeley Technical Report UCB/EECS-2011-87.
[21]
Haryadi S. Gunawi, Thanh Do, Pallavi Joshi, Peter Alvaro, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Koushik Sen, and Dhruba Borthakur. Fate and Destini: A Framework for Cloud Recovery Testing. In NSDI '11.
[22]
Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang Satria. What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. In SoCC '14.
[23]
Zhenyu Guo, Sean McDirmid, Mao Yang, Li Zhuang, Pu Zhang, Yingwei Luo, Tom Bergan, Madan Musuvathi, Zheng Zhang, and Lidong Zhou. Failure Recovery: When the Cure Is Worse Than the Disease. In HotOS XIV, 2013.
[24]
Weihang Jiang, Chongfeng Hu, Shankar Pasupathy, Arkady Kanevsky, Zhenmin Li, and Yuanyuan Zhou. Understanding Customer Problem Troubleshooting from Storage System Logs. In FAST '09.
[25]
Baris Kasikci, Cristian Zamfir, and George Candea. RaceMob: Crowdsourced Data Race Detection. In SOSP '13.
[26]
Emre Kiciman and Benjamin Livshits. AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications. In SOSP '07.
[27]
Taesoo Kim, Ramesh Chandra, and Nickolai Zeldovich. Efficient Patch-based Auditing for Web Application Vulnerabilities. In OSDI '12.
[28]
Taesoo Kim, Xi Wang, Nickolai Zeldovich, and M. Frans Kaashoek. Intrusion Recovery Using Selective Re-execution. In OSDI '10.
[29]
Oren Laadan, Nicolas Viennot, Chia-che Tsai, Chris Blinn, Junfeng Yang, and Jason Nieh. Pervasive Detection of Process Races in Deployed Systems. In SOSP '11.
[30]
H. Andres Lagar-Cavilla, Joseph A. Whitney, Adin Scannell, Stephen M. Rumble, Philip Patchin, Eyal de Lara, Michael Brudno, and M. Satyanarayanan. SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing. In EuroSys '09.
[31]
Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi Joshi, Jeffrey F. Lukman, and Haryadi S. Gunawi. SAMC: Semantic-Aware Model Checking for Fast Discovery of Deep Bugs in Cloud Systems. In OSDI '14.
[32]
David Linthicum. Calculating the true cost of cloud outages. http://www.infoworld.com, 2013.
[33]
Lionel Litty, H. Andres Lagar-Cavilla, and David Lie. Computer Meteorology: Monitoring Compute Clouds. In HotOS XII, 2009.
[34]
Changbin Liu, Boon Thau Loo, and Yun Mao. Declarative Automated Cloud Resource Orchestration. In SoCC '11.
[35]
Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang. D3S: Debugging Deployed Distributed Systems. In NSDI '08.
[36]
Marissa Mayer. An Update on Yahoo Mail, December 2013.
[37]
Rich Miller. Amazon Cloud Outage KOs Reddit, Foursquare and Others. http://www.datacenterknowledge.com, 2012.
[38]
Michael J. Mior and Eyal de Lara. FlurryDB: A Dynamically Scalable Relational Database with Virtual Machine Cloning. In SYSTOR '11.
[39]
Iulian Neamtiu and Tudor Dumitras. Cloud Software Upgrades: Challenges and Opportunities. In MESOCA '11.
[40]
Netflix. 5 Lessons We've Learned Using AWS. http://techblog.netflix.com, December 2010.
[41]
Pertino. April 1st Service Disruption Postmortem, April 2013.
[42]
Ken Presti. 6 Devastating Cloud Outages Over The Last 6 Months. http://www.crn.com, 2013.
[43]
Ariel Rabkin and Randy Katz. Precomputing Possible Configuration Error Diagnoses. In ASE '11.
[44]
Patrick Reynolds, Janet L. Wiener, Jeffrey C. Mogul, Mehul A. Shah, Charles Killian, and Amin Vahdat. Pip: Detecting the unexpected in distributed systems. In NSDI '06.
[45]
Jesse Robbins, Kripa Krishnan, John Allspaw, and Tom Limoncelli. Resilience Engineering: Learning to Embrace Failure. ACM Queue, 10(9), September 2012.
[46]
Chuck Rossi. Ship early and ship twice as often. https://www.facebook.com/Engineering, 2012.
[47]
Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. Diagnosing Performance Changes by Comparing Request Flows. In NSDI '11.
[48]
Zhiming Shen, Sethuraman Subbiah, Xiaohui Gu, and John Wilkes. CloudScale: Elastic Resource Scaling for Multi-Tenant Cloud Systems. In SoCC '11.
[49]
Rob Sherwood, Glen Gibb, Kok-Kiong Yap, Guido Appenzeller, Martin Casado, Nick McKeown, and Guru Parulkar. Can the Production Network Be the Testbed? In OSDI '10.
[50]
Atul Singh, Petros Maniatis, Timothy Roscoe, and Peter Druschel. Using Queries for Distributed Monitoring and Forensics. In EuroSys '06.
[51]
AWS Team. Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region. http://aws.amazon.com/message/65648, 2011.
[52]
AWS Team. Summary of the December 24, 2012 Amazon ELB Service Event in the US-East Region. http://aws.amazon.com/message/680587, 2012.
[53]
Gmail Team. More on today's Gmail issue. http://gmailblog.blogspot.com, September 2009.
[54]
Google AppEngine Team. Post-mortem for February 24th, 2010 outage. https://groups.google.com/group/google-appengine, February 2010.
[55]
Google Apps Team. GoogleApps IncidentReport, March 2013.
[56]
Skype Team. CIO update: Post-mortem on the Skype outage (December 2010). http://blogs.skype.com, December 2010.
[57]
The Joyent Team. Postmortem for outage of us-east-1, May 2014.
[58]
The Verge. Microsoft apologizes for Outlook, ActiveSync downtime, says error overloaded servers, August 2013.
[59]
Christina Warren. How Facebook killed the Internet. http://www.cnn.com, 2013.
[60]
Maysam Yabandeh, Nikola Knezevic, Dejan Kostic, and Viktor Kuncak. CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. In NSDI '09.
[61]
Wenchao Zhou, Qiong Fei, Arjun Narayan, Andreas Haeberlen, Boon Thau Loo, and Micah Sherr. Secure Network Provenance. In SOSP '11.
[62]
Wenchao Zhou, Micah Sherr, Tao Tao, Xiaozhou Li, Boon Thau Loo, and Yun Mao. Efficient Querying and Maintenance of Network Provenance at Internet-Scale. In SIGMOD '10.

Published In

SOCC '14: Proceedings of the ACM Symposium on Cloud Computing
November 2014
383 pages
ISBN:9781450332521
DOI:10.1145/2670979
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

SOCC '14: ACM Symposium on Cloud Computing
November 3 - 5, 2014
Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Cited By

  • (2022) Maximizing Error Injection Realism for Chaos Engineering With System Calls. IEEE Transactions on Dependable and Secure Computing 19:4 (2695-2708). DOI: 10.1109/TDSC.2021.3069715. Online publication date: 1-Jul-2022
  • (2021) A Chaos Engineering System for Live Analysis and Falsification of Exception-Handling in the JVM. IEEE Transactions on Software Engineering 47:11 (2534-2548). DOI: 10.1109/TSE.2019.2954871. Online publication date: 1-Nov-2021
  • (2019) Scalecheck. Proceedings of the 17th USENIX Conference on File and Storage Technologies (359-373). DOI: 10.5555/3323298.3323332. Online publication date: 25-Feb-2019
  • (2018) Maelstrom. Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation (373-389). DOI: 10.5555/3291168.3291196. Online publication date: 8-Oct-2018
  • (2017) Scalability Bugs. Proceedings of the 16th Workshop on Hot Topics in Operating Systems (24-29). DOI: 10.1145/3102980.3102985. Online publication date: 7-May-2017
  • (2016) Why Does the Cloud Stop Computing? Proceedings of the Seventh ACM Symposium on Cloud Computing (1-16). DOI: 10.1145/2987550.2987583. Online publication date: 5-Oct-2016
  • (2015) SAMC: a fast model checker for finding heisenbugs in distributed systems (demo). Proceedings of the 2015 International Symposium on Software Testing and Analysis (423-427). DOI: 10.1145/2771783.2784771. Online publication date: 13-Jul-2015
  • (2014) What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems. Proceedings of the ACM Symposium on Cloud Computing (1-14). DOI: 10.1145/2670979.2670986. Online publication date: 3-Nov-2014
