research-article

Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Authors: Chuanxiong Guo, Jacob R. Lorch, Murali Chintalapati, Randolph Yao

HotOS '17: Proceedings of the 16th Workshop on Hot Topics in Operating Systems

Pages 150 - 155

https://doi.org/10.1145/3102980.3103005

Published: 07 May 2017

Abstract

Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system's failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with gray failures, we should focus on bridging the gap between different components' perceptions of what constitutes failure.
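The notion of differential observability in the abstract can be illustrated with a small toy model. The sketch below is not taken from the paper: the `Server` class, its heartbeat check, the every-k-th-request failure pattern, and the 1% error threshold are all hypothetical choices made only to show how a failure detector and an application can disagree about the same component.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Hypothetical component with a control path (heartbeat) and a data path."""
    process_alive: bool = True
    data_path_error_rate: float = 0.0  # fraction of application requests that fail

    def heartbeat(self) -> bool:
        # The failure detector only checks that the process responds at all.
        return self.process_alive

    def handle_request(self, request_id: int) -> bool:
        # Deterministic stand-in for a partial fault: every k-th request fails.
        if not self.process_alive:
            return False
        if self.data_path_error_rate <= 0.0:
            return True
        period = round(1 / self.data_path_error_rate)
        return request_id % period != 0


def detector_view(server: Server) -> bool:
    """What the system's failure detector perceives (assumed: heartbeat only)."""
    return server.heartbeat()


def application_view(server: Server, n_requests: int = 100,
                     max_error_rate: float = 0.01) -> bool:
    """What an application perceives: is the observed error rate tolerable?"""
    failures = sum(not server.handle_request(i) for i in range(n_requests))
    return failures / n_requests <= max_error_rate


def classify(server: Server) -> str:
    detector_ok = detector_view(server)
    app_ok = application_view(server)
    if detector_ok and app_ok:
        return "healthy"
    if not detector_ok:
        return "fail-stop"  # the detector notices, so recovery can be triggered
    return "gray failure"   # detector reports healthy while the app is afflicted


if __name__ == "__main__":
    print(classify(Server()))                           # healthy
    print(classify(Server(process_alive=False)))        # fail-stop
    print(classify(Server(data_path_error_rate=0.25)))  # gray failure
```

In this toy setting the gap closes only if the detector's probe exercises the same path that applications use, which is one way to read the abstract's call to bridge the different components' perceptions of failure.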

Gray Failure: The Achilles' Heel of Cloud-Scale Systems
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques

Published In

HotOS '17: Proceedings of the 16th Workshop on Hot Topics in Operating Systems

May 2017

185 pages

ISBN:9781450350686

DOI:10.1145/3102980

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HotOS '17

Sponsor:

SIGOPS

HotOS '17: Workshop on Hot Topics in Operating Systems

May 7 - 10, 2017

Whistler, BC, Canada


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 107
Total Downloads: 1,023
Downloads (Last 12 months): 142
Downloads (Last 6 weeks): 12

Reflects downloads up to 16 Dec 2024

Citations

Cited By

Gomes Jr. E, Alchieri E, Dotti F, Mendizabal O (2024) Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint. Journal of Internet Services and Applications 15:1 (194-211). Online publication date: 26-Jul-2024
https://doi.org/10.5753/jisa.2024.3891

Lei H, Li C, Zhou K, Zhu J, Yan K, Xiao F, Xie M, Wang J, Di S (2024) X-Stor: A Cloud-Native NoSQL Database Service with Multi-Model Support. Proceedings of the VLDB Endowment 17:12 (4025-4037). Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685824

Sruthi P, Guo Z, Chu D, Chen Z, Zhang Y (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing (341-360). Online publication date: 20-Nov-2024
https://dl.acm.org/doi/10.1145/3698038.3698568

Hong F, Sarantopoulos I, Hogg E, Richardson D, Zhang Y, Williams H, Sweeney D, Chatzieleftheriou A, Rowstron A (2024) Self-maintaining [networked] systems: The rise of datacenter robotics! Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (159-166). Online publication date: 18-Nov-2024
https://dl.acm.org/doi/10.1145/3696348.3696872

Anandayuvaraj D, Campbell M, Tewari A, Davis J, Filkov V, Ray B, Zhou M (2024) FAIL: Analyzing Software Failures from the News Using LLMs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (506-518). Online publication date: 27-Oct-2024
https://dl.acm.org/doi/10.1145/3691620.3695022

Pears J, Bocchi L, Hu R, Fernandez-Reyes K, Voinea A (2024) Erlang on TOAST: Generating Erlang Stubs with Inline TOAST Monitors. Proceedings of the 23rd ACM SIGPLAN International Workshop on Erlang (33-44). Online publication date: 28-Aug-2024
https://dl.acm.org/doi/10.1145/3677995.3678192

Zhang S, Zhao Y, Xiong X, Sun Y, Nie X, Zhang J, Wang F, Zheng X, Zhang Y, Pei D, d'Amorim M (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (126-137). Online publication date: 10-Jul-2024
https://dl.acm.org/doi/10.1145/3663529.3663834

Yang T, Lee C, Shen J, Su Y, Feng C, Yang Y, Lyu M, Christakis M, Pradel M (2024) MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (325-337). Online publication date: 11-Sep-2024
https://dl.acm.org/doi/10.1145/3650212.3652131

Hong S, Yang H, Yoon Y, Lee J (2024) Group-Wise Verifiable Coded Computing Under Byzantine Attacks and Stragglers. IEEE Transactions on Information Forensics and Security 19 (4344-4357). Online publication date: 2024
https://doi.org/10.1109/TIFS.2024.3377929

Chen H, Chen P, Yu G, Li X, He Z, Zhang H (2024) MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. IEEE Transactions on Dependable and Secure Computing (1-18). Online publication date: 2024
https://doi.org/10.1109/TDSC.2024.3363902
