DOI: 10.1145/3102980.3103005
research-article

Gray Failure: The Achilles' Heel of Cloud-Scale Systems

Published: 07 May 2017

Abstract

Cloud scale provides the vast resources necessary to replace failed components, but this is useful only if those failures can be detected. For this reason, the major availability breakdowns and performance anomalies we see in cloud environments tend to be caused by subtle underlying faults, i.e., gray failure rather than fail-stop failure. In this paper, we discuss our experiences with gray failure in production cloud-scale systems to show its broad scope and consequences. We also argue that a key feature of gray failure is differential observability: that the system's failure detectors may not notice problems even when applications are afflicted by them. This realization leads us to believe that, to best deal with gray failures, we should focus on bridging the gap between different components' perceptions of what constitutes failure.
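
The idea of differential observability in the abstract can be illustrated with a small sketch. It is not taken from the paper: the `Server` type, the heartbeat-only detector, the latency budget, and the slow-request threshold are all hypothetical choices made for illustration. A server counts as being in gray failure here when the failure detector still sees it as healthy while the application, judging by its own requests, does not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Server:
    """Hypothetical server: it may answer heartbeats while real requests are slow."""
    heartbeat_ok: bool
    request_latencies_ms: List[float]


def detector_view(server: Server) -> bool:
    """Failure detector's perception: healthy if the heartbeat is answered."""
    return server.heartbeat_ok


def application_view(server: Server, budget_ms: float, max_slow_fraction: float) -> bool:
    """Application's perception: healthy if few enough requests exceed the budget."""
    latencies = server.request_latencies_ms
    if not latencies:
        # No requests observed, so the application has no evidence of a problem.
        return True
    slow = sum(1 for latency in latencies if latency > budget_ms)
    return slow / len(latencies) <= max_slow_fraction


def is_gray_failure(server: Server, budget_ms: float = 100.0,
                    max_slow_fraction: float = 0.1) -> bool:
    """Differential observability: the detector sees health, the application does not."""
    return detector_view(server) and not application_view(
        server, budget_ms, max_slow_fraction
    )


# Heartbeats succeed, but 4 of 10 requests blow the (assumed) 100 ms budget.
degraded = Server(True, [10, 12, 900, 11, 950, 1000, 9, 13, 980, 10])
# Fail-stop: the detector notices, so this is not a gray failure.
crashed = Server(False, [])
# Healthy from both points of view.
healthy = Server(True, [10, 12, 11])

print(is_gray_failure(degraded), is_gray_failure(crashed), is_gray_failure(healthy))
```

In this toy model, closing the gap between the two perceptions would mean feeding the application's observations back into the detector's decision, which is the direction the abstract argues for.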

Published In

HotOS '17: Proceedings of the 16th Workshop on Hot Topics in Operating Systems
May 2017
185 pages
ISBN: 9781450350686
DOI: 10.1145/3102980
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

HotOS '17: Workshop on Hot Topics in Operating Systems
May 7 - 10, 2017
Whistler, BC, Canada

Article Metrics

  • Downloads (Last 12 months): 142
  • Downloads (Last 6 weeks): 12
Reflects downloads up to 16 Dec 2024

Cited By

  • (2024) Reducing Persistence Overhead in Parallel State Machine Replication through Time-Phased Partitioned Checkpoint. Journal of Internet Services and Applications 15:1 (194-211). DOI: 10.5753/jisa.2024.3891. Online publication date: 26-Jul-2024
  • (2024) X-Stor: A Cloud-Native NoSQL Database Service with Multi-Model Support. Proceedings of the VLDB Endowment 17:12 (4025-4037). DOI: 10.14778/3685800.3685824. Online publication date: 1-Aug-2024
  • (2024) Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems. Proceedings of the 2024 ACM Symposium on Cloud Computing (341-360). DOI: 10.1145/3698038.3698568. Online publication date: 20-Nov-2024
  • (2024) Self-maintaining [networked] systems: The rise of datacenter robotics! Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (159-166). DOI: 10.1145/3696348.3696872. Online publication date: 18-Nov-2024
  • (2024) FAIL: Analyzing Software Failures from the News Using LLMs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (506-518). DOI: 10.1145/3691620.3695022. Online publication date: 27-Oct-2024
  • (2024) Erlang on TOAST: Generating Erlang Stubs with Inline TOAST Monitors. Proceedings of the 23rd ACM SIGPLAN International Workshop on Erlang (33-44). DOI: 10.1145/3677995.3678192. Online publication date: 28-Aug-2024
  • (2024) Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (126-137). DOI: 10.1145/3663529.3663834. Online publication date: 10-Jul-2024
  • (2024) MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination Indexing. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (325-337). DOI: 10.1145/3650212.3652131. Online publication date: 11-Sep-2024
  • (2024) Group-Wise Verifiable Coded Computing Under Byzantine Attacks and Stragglers. IEEE Transactions on Information Forensics and Security 19 (4344-4357). DOI: 10.1109/TIFS.2024.3377929. Online publication date: 2024
  • (2024) MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice Applications. IEEE Transactions on Dependable and Secure Computing (1-18). DOI: 10.1109/TDSC.2024.3363902. Online publication date: 2024
