DOI: 10.1109/IPDPS.2014.16
Article

Communication-Efficient Distributed Variance Monitoring and Outlier Detection for Multivariate Time Series

Published: 19 May 2014

Abstract

Modern scale-out services comprise thousands of individual machines, which must be continuously monitored for unexpected failures. One recent approach to monitoring is latent fault detection, an adaptive statistical framework for scale-out, load-balanced systems. By periodically measuring hundreds of performance metrics and looking for outlier machines, it attempts to detect subtle problems such as misconfigurations, bugs, and malfunctioning hardware before they manifest as machine failures. Previous work on a large, real-world Web service has shown that many failures are indeed preceded by such latent faults. However, latent fault detection is an offline framework with large bandwidth and processing requirements: each machine must send all its measurements to a centralized location, which is prohibitive in some settings and requires data-parallel processing infrastructure. In this work we adapt the latent fault detector into an online version with reduced communication and computation, using stream processing techniques to trade accuracy for both. We first describe a novel communication-efficient online distributed variance monitoring algorithm that provides a continuous estimate of the global variance within guaranteed approximation bounds. Using the variance monitor, we provide an online distributed outlier detection framework for the non-stationary multivariate time series common in scale-out systems. The adapted framework reduces data size and central processing cost by processing the data in situ, making it usable in a wider range of settings. Like the original framework, our adaptation admits different comparison functions, supports non-stationary data, and provides statistical guarantees on the rate of false positives. Simulations on logs from a production system show that we are able to reduce bandwidth by an order of magnitude, with less than 1% error compared to the original algorithm.
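
To make the idea of processing data in situ concrete, the following is a minimal sketch of how a global variance can be assembled from small per-machine summaries instead of raw measurements. It is not the algorithm from the paper: it uses the standard Welford update and the Chan et al. pairwise merge formula, it computes the exact variance, and it does not implement the paper's approximation bounds or its message-saving logic. The names `Summary`, `summarize`, `merge`, and `global_variance`, as well as the sample readings, are hypothetical.

```python
from dataclasses import dataclass
from statistics import pvariance


@dataclass(frozen=True)
class Summary:
    """Per-machine sufficient statistics for one metric (illustrative only)."""
    n: int = 0          # number of samples seen
    mean: float = 0.0   # running mean
    m2: float = 0.0     # sum of squared deviations from the mean


def summarize(samples):
    """Build a summary in one pass using Welford's update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return Summary(n, mean, m2)


def merge(a, b):
    """Combine two summaries with the Chan et al. pairwise formula."""
    if a.n == 0:
        return b
    if b.n == 0:
        return a
    n = a.n + b.n
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)


def global_variance(summaries):
    """Population variance over all machines, from summaries alone."""
    total = Summary()
    for s in summaries:
        total = merge(total, s)
    return total.m2 / total.n if total.n else 0.0


# Hypothetical readings of one metric on three machines.
machines = [[1.0, 2.0, 3.0], [2.0, 4.0], [10.0, 11.0, 12.0, 13.0]]
estimate = global_variance(summarize(m) for m in machines)

# Reference value computed centrally from every raw measurement.
exact = pvariance([x for m in machines for x in m])
```

Here each machine would ship three numbers per metric rather than its full measurement stream, and `estimate` agrees with `exact` up to floating-point rounding. An approximate monitor of the kind the abstract describes would go further by suppressing updates while the estimate stays within its error bound.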

Published In

IPDPS '14: Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium
May 2014
1176 pages
ISBN:9781479938001

Publisher

IEEE Computer Society

United States

Author Tag

  1. distributed computing, distributed processing, data analysis, time series analysis, fault detection

Qualifiers

  • Article

Cited By

  • (2022) MinMax Sampling: A Near-optimal Global Summary for Aggregation in the Wide Area. Proceedings of the 2022 International Conference on Management of Data, pp. 744-758. DOI: 10.1145/3514221.3526160. Online publication date: 10-Jun-2022
  • (2018) Lightweight Monitoring of Distributed Streams. ACM Transactions on Database Systems 43:2, pp. 1-37. DOI: 10.1145/3226113. Online publication date: 31-Jul-2018
  • (2017) Anarchists, Unite. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 837-846. DOI: 10.1145/3097983.3098092. Online publication date: 13-Aug-2017
  • (2017) One for All and All for One. Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems, pp. 203-214. DOI: 10.1145/3093742.3093918. Online publication date: 8-Jun-2017
  • (2016) Trumpet. Proceedings of the 2016 ACM SIGCOMM Conference, pp. 129-143. DOI: 10.1145/2934872.2934879. Online publication date: 22-Aug-2016
  • (2016) Scalable Approximate Query Tracking over Highly Distributed Data Streams. Proceedings of the 2016 International Conference on Management of Data, pp. 1497-1512. DOI: 10.1145/2882903.2915225. Online publication date: 26-Jun-2016
  • (2016) Incremental computations over strongly distributed databases. Concurrency and Computation: Practice & Experience 28:11, pp. 3061-3076. DOI: 10.1002/cpe.3597. Online publication date: 10-Aug-2016
  • (2015) Streaming anomaly detection using randomized matrix sketching. Proceedings of the VLDB Endowment 9:3, pp. 192-203. DOI: 10.14778/2850583.2850593. Online publication date: 1-Nov-2015
  • (2015) Monitoring distributed streams using convex decompositions. Proceedings of the VLDB Endowment 8:5, pp. 545-556. DOI: 10.14778/2735479.2735487. Online publication date: 1-Jan-2015
  • (2015) Monitoring Least Squares Models of Distributed Streams. Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 319-328. DOI: 10.1145/2783258.2783349. Online publication date: 10-Aug-2015
