DOI: 10.5555/3291168.3291170

Capturing and enhancing in situ system observability for failure detection

Published: 08 October 2018

Abstract

Real-world distributed systems suffer unavailability due to various types of failure. Despite enormous effort, many failures, especially gray failures, still escape detection. In this paper, we argue that the missing piece in failure detection is detecting what the requesters of a failing component see. This insight leads us to the design and implementation of Panorama, a system that enhances observability by taking advantage of the interactions between a system's components. By providing a systematic channel and analysis tool, Panorama turns a component into a logical observer so that it not only handles errors but also reports them. Furthermore, Panorama incorporates techniques for making such observations even when indirection exists between components. Panorama integrates easily with popular distributed systems and detects, in less than 7 s, all 15 real-world gray failures that we reproduced, whereas existing approaches detect only one of them in under 300 s.
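
The idea of a requester acting as a logical observer can be illustrated with a minimal sketch. It is not Panorama's implementation or API: the names `Observation`, `ObservationStore`, and `observed_call`, the in-memory store, and the majority-vote verdict are all assumptions made for illustration. The sketch only shows the pattern the abstract describes, in which a component that hits an error while calling another component reports what it saw in addition to handling the error.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"


@dataclass(frozen=True)
class Observation:
    observer: str  # component that issued the request
    subject: str   # component that served (or failed) the request
    status: Status
    detail: str = ""


class ObservationStore:
    """Hypothetical in-memory stand-in for a shared observation channel."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def report(self, obs):
        self._by_subject[obs.subject].append(obs)

    def verdict(self, subject):
        # Assumed policy: keep each observer's latest report, then majority vote.
        latest = {}
        for obs in self._by_subject[subject]:
            latest[obs.observer] = obs.status
        if not latest:
            return None
        bad = sum(1 for s in latest.values() if s is Status.UNHEALTHY)
        return Status.UNHEALTHY if bad * 2 > len(latest) else Status.HEALTHY


def observed_call(store, observer, subject, fn, *args):
    """Run fn on behalf of observer and report what it saw of subject."""
    try:
        result = fn(*args)
    except Exception as exc:
        store.report(Observation(observer, subject, Status.UNHEALTHY, repr(exc)))
        raise  # the caller's own error handling still runs
    store.report(Observation(observer, subject, Status.HEALTHY))
    return result


# Hypothetical operations on a "storage" component.
def flaky_write(key):
    raise TimeoutError(f"write of {key!r} timed out")


def ok_read(key):
    return f"value-of-{key}"


store = ObservationStore()
for client in ("frontend", "indexer"):
    try:
        observed_call(store, client, "storage", flaky_write, "k1")
    except TimeoutError:
        pass  # the requester handles the error as it normally would
observed_call(store, "cache", "storage", ok_read, "k1")
print(store.verdict("storage"))
```

In this sketch, two of the three requesters see the storage component fail, so the aggregated verdict is unhealthy even though a read-only probe would have succeeded. That partial view is the kind of evidence a gray failure leaves behind.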



    Published In

    OSDI'18: Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation
    October 2018
    815 pages
    ISBN: 9781931971478

    Sponsors

    • NetApp
    • Google Inc.
    • NSF
    • Microsoft
    • Facebook

    Publisher

    USENIX Association

    United States

    Cited By

    • (2024) Efficient exposure of partial failure bugs in distributed systems with inferred abstract states. Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, pages 1267-1283. DOI: 10.5555/3691825.3691895. Online publication date: 16-Apr-2024.
    • (2023) PERSEUS. Proceedings of the 21st USENIX Conference on File and Storage Technologies, pages 49-63. DOI: 10.5555/3585938.3585942. Online publication date: 21-Feb-2023.
    • (2023) From Missteps to Milestones: A Journey to Practical Fail-Slow Detection. ACM Transactions on Storage, 19:4, pages 1-28. DOI: 10.1145/3617690. Online publication date: 1-Nov-2023.
    • (2023) Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. Proceedings of the ACM SIGCOMM 2023 Conference, pages 420-437. DOI: 10.1145/3603269.3604823. Online publication date: 10-Sep-2023.
    • (2021) Understanding and Detecting Software Upgrade Failures in Distributed Systems. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pages 116-131. DOI: 10.1145/3477132.3483577. Online publication date: 26-Oct-2021.
    • (2021) Fail-slow fault tolerance needs programming support. Proceedings of the Workshop on Hot Topics in Operating Systems, pages 228-235. DOI: 10.1145/3458336.3465299. Online publication date: 1-Jun-2021.
    • (2020) Predictive and adaptive failure mitigation to avert production cloud VM interruptions. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pages 1155-1170. DOI: 10.5555/3488766.3488831. Online publication date: 4-Nov-2020.
    • (2020) Understanding, detecting and localizing partial failures in large system software. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pages 559-574. DOI: 10.5555/3388242.3388284. Online publication date: 25-Feb-2020.
    • (2020) Gandalf. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation, pages 389-402. DOI: 10.5555/3388242.3388271. Online publication date: 25-Feb-2020.
    • (2019) Dayu. Proceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference, pages 993-1007. DOI: 10.5555/3358807.3358892. Online publication date: 10-Jul-2019.
