Article

Capturing and enhancing in situ system observability for failure detection

Authors:

Chuanxiong Guo,

Jacob R. Lorch,

Yingnong Dang

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation

Pages 1 - 16

Published: 08 October 2018

Abstract

Real-world distributed systems suffer unavailability due to various types of failure. But, despite enormous effort, many failures, especially gray failures, still escape detection. In this paper, we argue that the missing piece in failure detection is detecting what the requesters of a failing component see. This insight leads us to the design and implementation of Panorama, a system designed to enhance system observability by taking advantage of the interactions between a system's components. By providing a systematic channel and analysis tool, Panorama turns a component into a logical observer so that it not only handles errors, but also reports them. Furthermore, Panorama incorporates techniques for making such observations even when indirection exists between components. Panorama can easily integrate with popular distributed systems and detect all 15 real-world gray failures that we reproduced in less than 7 s, whereas existing approaches detect only one of them in under 300 s.
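
As a rough illustration of the idea in the abstract (a requester that does not just handle an error but also reports what it saw about the component it called), a minimal sketch follows. Every name in it (`ObservationStore`, `observed_call`, `demo`, the component names) and the simple majority rule are assumptions made for illustration; they are not Panorama's actual API or decision algorithm.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

HEALTHY = "healthy"
UNHEALTHY = "unhealthy"


@dataclass
class Observation:
    observer: str  # component that issued the request
    subject: str   # component that served (or failed to serve) it
    status: str    # HEALTHY or UNHEALTHY
    context: str   # what the observer was doing at the time


class ObservationStore:
    """Hypothetical in-memory store: latest observation per (subject, observer)."""

    def __init__(self) -> None:
        self._latest: Dict[str, Dict[str, Observation]] = defaultdict(dict)

    def report(self, obs: Observation) -> None:
        self._latest[obs.subject][obs.observer] = obs

    def verdict(self, subject: str) -> str:
        # Illustrative rule (an assumption): the subject is unhealthy when
        # more than half of its observers' latest reports are negative.
        latest = self._latest.get(subject, {})
        if not latest:
            return HEALTHY  # no evidence against the subject
        negative = sum(1 for o in latest.values() if o.status == UNHEALTHY)
        return UNHEALTHY if negative * 2 > len(latest) else HEALTHY


def observed_call(store: ObservationStore, observer: str, subject: str,
                  context: str, request: Callable[[], T]) -> T:
    """Run a request, report what the requester saw, and still propagate errors."""
    try:
        result = request()
    except Exception:
        store.report(Observation(observer, subject, UNHEALTHY, context))
        raise
    store.report(Observation(observer, subject, HEALTHY, context))
    return result


def demo() -> str:
    store = ObservationStore()

    def failing_write() -> None:
        raise TimeoutError("write timed out")

    # Two clients see their writes time out; a third sees a ping succeed.
    for client in ("client-1", "client-2"):
        try:
            observed_call(store, client, "storage-node", "write", failing_write)
        except TimeoutError:
            pass  # the client handles the error as it normally would
    observed_call(store, "client-3", "storage-node", "ping", lambda: "pong")
    return store.verdict("storage-node")


if __name__ == "__main__":
    print(demo())  # prints "unhealthy"
```

In this sketch the ping still succeeds, so a ping-only detector would consider the node alive, while the requesters' own observations mark it unhealthy. That is the kind of gap the abstract attributes to gray failures.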

Capturing and enhancing in situ system observability for failure detection
1. Software and its engineering
  1. Software organization and properties
    1. Software system structures

Information & Contributors

Information

Published In

OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation

October 2018

815 pages

ISBN:9781931971478

Program Chairs:
Andrea Arpaci-Dusseau (University of Wisconsin-Madison),
Geoff Voelker (University of California, San Diego)

Sponsors

NetApp
Google Inc.
NSF
Microsoft
Facebook

In-Cooperation

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

USENIX Association

United States

Publication History

Published: 08 October 2018

Qualifiers

Article

Bibliometrics & Citations

Article Metrics

Total Citations: 18
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Mar 2025

Citations

Cited By

Wu H, Pan J, Huang P, Vanbever L, Zhang I (2024). Efficient exposure of partial failure bugs in distributed systems with inferred abstract states. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pages 1267-1283. DOI: 10.5555/3691825.3691895. Online publication date: 16-Apr-2024.
https://dl.acm.org/doi/10.5555/3691825.3691895

Lu R, Xu E, Zhang Y, Zhu F, Zhu Z, Wang M, Zhu Z, Xue G, Shu J, Li M, Wu J, Naor D, Goel A (2023). PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies, pages 49-63. DOI: 10.5555/3585938.3585942. Online publication date: 21-Feb-2023.
https://dl.acm.org/doi/10.5555/3585938.3585942

Lu R, Xu E, Zhang Y, Zhu F, Zhu Z, Wang M, Zhu Z, Xue G, Shu J, Li M, Wu J (2023). From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage, 19:4, pages 1-28. DOI: 10.1145/3617690. Online publication date: 1-Nov-2023.
https://dl.acm.org/doi/10.1145/3617690

Shen J, Zhang H, Xiang Y, Shi X, Li X, Shen Y, Zhang Z, Wu Y, Yin X, Wang J, Xu M, Li Y, Yin J, Song J, Li Z, Nie R, Schulzrinne H, Kohler E, Maltz D, Misra V (2023). Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. Proceedings of the ACM SIGCOMM 2023 Conference, pages 420-437. DOI: 10.1145/3603269.3604823. Online publication date: 10-Sep-2023.
https://dl.acm.org/doi/10.1145/3603269.3604823

Zhang Y, Yang J, Jin Z, Sethi U, Rodrigues K, Lu S, Yuan D (2021). Understanding and Detecting Software Upgrade Failures in Distributed Systems. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pages 116-131. DOI: 10.1145/3477132.3483577. Online publication date: 26-Oct-2021.
https://dl.acm.org/doi/10.1145/3477132.3483577

Yoo A, Wang Y, Sinha R, Mu S, Xu T, Angel S, Kasikci B, Kohler E (2021). Fail-slow fault tolerance needs programming support. Proceedings of the Workshop on Hot Topics in Operating Systems, pages 228-235. DOI: 10.1145/3458336.3465299. Online publication date: 1-Jun-2021.
https://dl.acm.org/doi/10.1145/3458336.3465299

Levy S, Yao R, Wu Y, Dang Y, Huang P, Mu Z, Zhao P, Ramani T, Govindaraju N, Li X, Lin Q, Shafriri G, Chintalapati M, Lu S, Howell J (2020). Predictive and adaptive failure mitigation to avert production cloud VM interruptions. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pages 1155-1170. DOI: 10.5555/3488766.3488831. Online publication date: 4-Nov-2020.
https://dl.acm.org/doi/10.5555/3488766.3488831

Lou C, Huang P, Smith S, Bhagwan R, Porter G (2020). Understanding, detecting and localizing partial failures in large system software. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pages 559-574. DOI: 10.5555/3388242.3388284. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388284

Li Z, Cheng Q, Hsieh K, Dang Y, Huang P, Singh P, Yang X, Lin Q, Wu Y, Levy S, Chintalapati M, Bhagwan R, Porter G (2020). Gandalf. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pages 389-402. DOI: 10.5555/3388242.3388271. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388271

Wang Z, Zhang G, Wang Y, Yang Q, Zhu J, Dan T, Dahlia M (2019). Dayu. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pages 993-1007. DOI: 10.5555/3358807.3358892. Online publication date: 10-Jul-2019.
https://dl.acm.org/doi/10.5555/3358807.3358892