DOI: 10.1109/ICDCS.2008.34
Article

Toward Predictive Failure Management for Distributed Stream Processing Systems

Published: 17 June 2008

Abstract

Distributed stream processing systems (DSPSs) have many important applications such as sensor data analysis, network security, and business intelligence. Failure management is essential for DSPSs, which often require highly available system operation. In this paper, we explore a new predictive failure management approach that employs online failure prediction to achieve more efficient failure management than previous reactive or proactive approaches. We employ lightweight stream-based classification methods to perform online failure forecasting. Based on the prediction results, the system can take differentiated failure prevention actions on abnormal components only. Our failure prediction model is tunable: it can achieve a desired tradeoff between failure penalty reduction and prevention cost based on a user-defined reward function. To achieve low-overhead online learning, we propose adaptive data stream sampling schemes that adjust measurement sampling rates based on the states of monitored components, and we maintain a limited amount of historical training data using reservoir sampling. We have implemented an initial prototype of the predictive failure management framework within the IBM System S distributed stream processing system. Experimental results show that our system can achieve more efficient failure management than conventional reactive and proactive approaches, while imposing low overhead on the DSPS.
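
The abstract mentions reservoir sampling as the way a bounded set of historical training data is maintained. As an illustration only, the following is a minimal sketch of the classic reservoir sampling algorithm (Algorithm R), which keeps a uniform random sample of fixed size from a stream of unknown length. It is not taken from the paper: the function name `reservoir_sample`, its parameters, and the use of plain Python values as stand-ins for measurement records are all assumptions made for this example, and the paper's actual scheme may differ.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of at most k items from an iterable.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces an old one
            # only if the slot falls inside the reservoir, which happens
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: keep 10 of 1000 simulated measurements in bounded memory.
sample = reservoir_sample(range(1000), 10, random.Random(0))
```

The memory used is bounded by `k` regardless of how long the stream runs, which is the property that makes this kind of sampling attractive for low-overhead online learning.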


Published In

ICDCS '08: Proceedings of the 2008 The 28th International Conference on Distributed Computing Systems
June 2008
886 pages
ISBN:9780769531724

Publisher

IEEE Computer Society

United States

Author Tags

  1. Data Stream Processing
  2. Failure Prediction
  3. Fault Tolerance
  4. System Mining

Qualifiers

  • Article

Cited By

  • (2013) Research on Optimum Checkpoint Interval for Hybrid Fault Tolerance. Revised Selected Papers of the 10th International Symposium on Advanced Parallel Processing Technologies - Volume 8299, pp. 367-380. DOI: 10.1007/978-3-642-45293-2_28. Online publication date: 27-Aug-2013
  • (2011) Temporal data mining approaches for sustainable chiller management in data centers. ACM Transactions on Intelligent Systems and Technology 2:4, pp. 1-29. DOI: 10.1145/1989734.1989738. Online publication date: 15-Jul-2011
  • (2010) Adaptive system anomaly prediction for large-scale hosting infrastructures. Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing, pp. 173-182. DOI: 10.1145/1835698.1835741. Online publication date: 25-Jul-2010
  • (2010) Flow. Proceedings of the 2010 IEEE Workshop on Principles of Advanced and Distributed Simulation, pp. 97-105. DOI: 10.1109/PADS.2010.5471658. Online publication date: 17-May-2010
  • (2009) Self-correlating predictive information tracking for large-scale production systems. Proceedings of the 6th international conference on Autonomic computing, pp. 33-42. DOI: 10.1145/1555228.1555235. Online publication date: 15-Jun-2009
  • (2008) Proactive process-level live migration in HPC environments. Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pp. 1-12. DOI: 10.5555/1413370.1413414. Online publication date: 15-Nov-2008
