research-article

Probabilistic Model-Driven Recovery in Distributed Systems

Published: 01 November 2011

Abstract

Automatic system monitoring and recovery have the potential to provide effective, low-cost ways to improve dependability in distributed software systems. However, automating recovery is challenging in practice because accurate fault diagnosis is hampered by monitoring tools and techniques that often have low fault coverage, poor fault localization, detection delays, and false positives. In this paper, we present a holistic model-based approach that overcomes these challenges and enables automatic recovery in distributed systems. It uses theoretically sound techniques, including Bayesian estimation and Markov decision theory, to provide controllers that choose good, if not optimal, recovery actions according to user-defined optimization criteria. By combining monitoring and recovery, the approach realizes benefits that could not be obtained by using either in isolation. We experimentally validate our framework by fault injection on realistic e-commerce systems.
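
To make the abstract's two ingredients concrete, the sketch below combines a Bayesian update of a belief over fault hypotheses with a choice of the recovery action that minimizes expected cost under that belief. It is an illustration only and not the paper's controller: the hypothesis names, alarm probabilities, and cost table are invented for the example, and the action choice is a one-step (myopic) rule rather than a full solution of a partially observable Markov decision process.

```python
from typing import Dict

Belief = Dict[str, float]

# Hypothetical fault hypotheses and numbers, chosen only for illustration.
PRIOR: Belief = {"ok": 0.90, "db_fault": 0.05, "web_fault": 0.05}

# P(monitor raises an alarm | hypothesis). The nonzero value for "ok" models
# false positives; the low value for "web_fault" models poor fault coverage.
ALARM_LIKELIHOOD: Dict[str, float] = {"ok": 0.05, "db_fault": 0.90, "web_fault": 0.30}

# COST[action][hypothesis]: assumed cost of taking the action when the
# hypothesis is true (for example, downtime plus restart overhead).
COST: Dict[str, Dict[str, float]] = {
    "wait":        {"ok": 0.0, "db_fault": 10.0, "web_fault": 10.0},
    "restart_db":  {"ok": 2.0, "db_fault": 1.0,  "web_fault": 12.0},
    "restart_web": {"ok": 2.0, "db_fault": 12.0, "web_fault": 1.0},
}


def bayes_update(prior: Belief, likelihood: Dict[str, float]) -> Belief:
    """Return the posterior, proportional to prior(h) * P(observation | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / total for h, p in unnormalized.items()}


def expected_cost(belief: Belief, action: str, cost: Dict[str, Dict[str, float]]) -> float:
    """Expected one-step cost of an action under the current belief."""
    return sum(belief[h] * cost[action][h] for h in belief)


def best_action(belief: Belief, cost: Dict[str, Dict[str, float]]) -> str:
    """Myopic choice: the action with the lowest expected one-step cost."""
    return min(cost, key=lambda a: expected_cost(belief, a, cost))


# Belief after a single alarm from the monitor.
posterior = bayes_update(PRIOR, ALARM_LIKELIHOOD)

if __name__ == "__main__":
    print("before alarm:", best_action(PRIOR, COST))
    print("after alarm: ", best_action(posterior, COST))
```

With these assumed numbers, a single alarm leaves "ok" and "db_fault" equally likely because of the monitor's false positives, yet restarting the database already has a lower expected cost than waiting. This is the kind of trade-off between diagnosis uncertainty and recovery cost that a decision-theoretic controller weighs.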

Cited By

  • (2018) Root-Cause Diagnosis Using Logs Generated by User Actions. 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1-7. DOI: 10.1109/GLOCOM.2018.8647957. Online publication date: 9-Dec-2018
  • (2017) Security of Cyber-Physical Systems in the Presence of Transient Sensor Faults. ACM Transactions on Cyber-Physical Systems 1:3, pp. 1-23. DOI: 10.1145/3064809. Online publication date: 9-May-2017
  • (2015) MAD. Revised Selected Papers, Part II, of the 5th International Conference on Intelligence Science and Big Data Engineering. Big Data and Machine Learning Techniques - Volume 9243, pp. 308-315. DOI: 10.1007/978-3-319-23862-3_30. Online publication date: 14-Jun-2015

Published In

IEEE Transactions on Dependable and Secure Computing, Volume 8, Issue 6
November 2011
159 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Bayesian
  2. fault tolerance
  3. POMDP
  4. adaptive systems
  5. diagnosis
  6. distributed systems
  7. monitoring
  8. recovery

Qualifiers

  • Research-article

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 10 Dec 2024

Citations

Cited By

  • Recovery command generation towards automatic recovery in ICT systems by Seq2Seq learning. NOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium, pp. 1-6. DOI: 10.1109/NOMS47738.2020.9110370
