DOI: 10.1145/3317550.3321440
Research article

Comprehensive and Efficient Runtime Checking in System Software through Watchdogs

Published: 13 May 2019

Abstract

Systems software today is composed of numerous modules and exhibits complex failure modes. Existing failure detectors focus on catching simple, complete failures and treat programs uniformly at the process level. In this paper, we argue that modern software needs intrinsic failure detectors that are tailored to individual systems and can detect anomalies within a process at a finer granularity. In particular, we advocate the notion of intrinsic software watchdogs and propose an abstraction for them. Among the different styles of watchdogs, we believe that watchdogs that imitate the main program can provide the best combination of completeness, accuracy, and localization for detecting gray failures. However, manually constructing such mimic-type watchdogs is challenging and time-consuming. To close this gap, we present an early exploration of automatically generating mimic-type watchdogs.
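To make the idea of a mimic-type watchdog concrete, the following Python sketch shows one possible shape for a checker that imitates an operation of the main program and reports where it failed. It is an illustration under stated assumptions, not the abstraction or generator proposed in the paper: the names `CheckResult`, `run_checker`, and `mimic_flush`, the scratch-file write path, and the timeout policy are all hypothetical.

```python
import os
import tempfile
import threading
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    checker: str      # name of the checker, used to localize the fault
    ok: bool          # True if the mimicked operation completed in time
    detail: str = ""  # error or timeout description when ok is False


def run_checker(name: str, op: Callable[[], None], timeout_s: float) -> CheckResult:
    """Run one mimic operation in a helper thread; flag errors and hangs."""
    errors: List[BaseException] = []

    def target() -> None:
        try:
            op()
        except Exception as exc:  # the mimicked operation failed outright
            errors.append(exc)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The operation is stuck or too slow: a liveness violation.
        return CheckResult(name, False, f"no progress within {timeout_s}s")
    if errors:
        return CheckResult(name, False, f"{type(errors[0]).__name__}: {errors[0]}")
    return CheckResult(name, True)


def mimic_flush(data_dir: str) -> None:
    """Hypothetical mimic of a main-program write path, on a scratch file."""
    fd, path = tempfile.mkstemp(prefix="wd-", dir=data_dir)
    try:
        os.write(fd, b"watchdog-probe")
        os.fsync(fd)  # exercise the same durable-write step the program relies on
    finally:
        os.close(fd)
        os.remove(path)
```

A driver could call `run_checker("flush", lambda: mimic_flush(data_dir), 5.0)` periodically from inside the monitored process. Because the checker runs a reduced copy of a real code path on scratch state, a failed or stalled result points at that specific operation instead of the process as a whole.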

Cited By

  • (2020) Tolerating slowdowns in replicated state machines using copilots. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pages 583-598. DOI: 10.5555/3488766.3488799. Online publication date: 4-Nov-2020.
  • (2020) Understanding, detecting and localizing partial failures in large system software. Proceedings of the 17th USENIX Conference on Networked Systems Design and Implementation, pages 559-574. DOI: 10.5555/3388242.3388284. Online publication date: 25-Feb-2020.
  • (2020) Gandalf. Proceedings of the 17th USENIX Conference on Networked Systems Design and Implementation, pages 389-402. DOI: 10.5555/3388242.3388271. Online publication date: 25-Feb-2020.

    Published In

    HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems
    May 2019
    227 pages
    ISBN: 9781450367271
    DOI: 10.1145/3317550
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from the publisher.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HotOS '19