research-article

Comprehensive and Efficient Runtime Checking in System Software through Watchdogs

Authors:

Scott Smith

HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems

Pages 51 - 57

https://doi.org/10.1145/3317550.3321440

Published: 13 May 2019

Abstract

Systems software today is composed of numerous modules and exhibits complex failure modes. Existing failure detectors focus on catching simple, complete failures and treat programs uniformly at the process level. In this paper, we argue that modern software needs intrinsic failure detectors that are tailored to individual systems and can detect anomalies within a process at finer granularity. In particular, we advocate a notion of intrinsic software watchdogs and propose an abstraction for them. Among the different styles of watchdogs, we believe that watchdogs which imitate the main program can provide the best combination of completeness, accuracy, and localization for detecting gray failures. However, manually constructing such mimic-type watchdogs is challenging and time-consuming. To close this gap, we present an early exploration of automatically generating mimic-type watchdogs.

Cited By

Ngo K, Sen S, Lloyd W, Lu S, Howell J. (2020). Tolerating slowdowns in replicated state machines using copilots. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pages 583-598. DOI: 10.5555/3488766.3488799. Online publication date: 4-Nov-2020.
https://dl.acm.org/doi/10.5555/3488766.3488799

Lou C, Huang P, Smith S, Bhagwan R, Porter G. (2020). Understanding, detecting and localizing partial failures in large system software. Proceedings of the 17th USENIX Conference on Networked Systems Design and Implementation, pages 559-574. DOI: 10.5555/3388242.3388284. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388284

Li Z, Cheng Q, Hsieh K, Dang Y, Huang P, Singh P, Yang X, Lin Q, Wu Y, Levy S, Chintalapati M, Bhagwan R, Porter G. (2020). Gandalf. Proceedings of the 17th USENIX Conference on Networked Systems Design and Implementation, pages 389-402. DOI: 10.5555/3388242.3388271. Online publication date: 25-Feb-2020.
https://dl.acm.org/doi/10.5555/3388242.3388271

Index Terms

Software and its engineering > Software creation and management > Software verification and validation > Software defect analysis

Information

Published In

HotOS '19: Proceedings of the Workshop on Hot Topics in Operating Systems

May 2019

227 pages

ISBN: 9781450367271

DOI: 10.1145/3317550

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HotOS '19

Sponsor:

SIGOPS

HotOS '19: Workshop on Hot Topics in Operating Systems

May 13 - 15, 2019

Bertinoro, Italy

Bibliometrics

Article Metrics

Total citations: 3
Total downloads: 223
Downloads (last 12 months): 23
Downloads (last 6 weeks): 9

Reflects downloads up to 09 Jan 2025.
