DOI: 10.1145/2623330.2623374
research-article

Correlating events with time series for incident diagnosis

Published: 24 August 2014

Abstract

As online services have become more and more popular, incident diagnosis has emerged as a critical task in minimizing service downtime and ensuring high quality of the services provided. For most online services, incident diagnosis is mainly conducted by analyzing a large amount of telemetry data collected from the services at runtime. Time series data and event sequence data are two major types of telemetry data. Correlation analysis techniques are important tools that engineers widely use for data-driven incident diagnosis. Despite their importance, there has been little previous work addressing the correlation between two types of heterogeneous data for incident diagnosis: continuous time series data and temporal event data. In this paper, we propose an approach to evaluate the correlation between time series data and event data. Our approach is capable of discovering three important aspects of event-timeseries correlation in the context of incident diagnosis: existence of correlation, temporal order, and monotonic effect. Our experimental results on simulation data sets and two real data sets demonstrate the effectiveness of the algorithm.
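
The abstract names three outputs (existence of correlation, temporal order, monotonic effect) but does not describe the procedure; the author tags point to the two-sample problem. The sketch below illustrates that framing only and should not be read as the paper's algorithm. It assumes the windows of the time series just before and just after each event can be compared against randomly placed windows, and it swaps in a plain permutation test on window means as the two-sample test. All names (`window_means`, `permutation_p_value`, `correlate`, `make_demo_series`) and parameters (`k`, `alpha`, `bump`) are hypothetical.

```python
import random
from statistics import mean


def window_means(series, anchors, k, before):
    """Mean of the k samples just before (or starting at) each anchor index."""
    means = []
    for t in anchors:
        lo, hi = (t - k, t) if before else (t, t + k)
        if lo >= 0 and hi <= len(series):
            means.append(mean(series[lo:hi]))
    return means


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def correlate(series, events, k=5, alpha=0.05, seed=0):
    """Assumed simplification: existence, temporal order, and effect sign."""
    rng = random.Random(seed)
    random_anchors = [rng.randrange(k, len(series) - k) for _ in events]
    baseline = window_means(series, random_anchors, k, before=False)
    front = window_means(series, events, k, before=True)   # before each event
    rear = window_means(series, events, k, before=False)   # after each event
    p_front = permutation_p_value(front, baseline, seed=seed)
    p_rear = permutation_p_value(rear, baseline, seed=seed)
    if p_front < alpha:
        order = "series -> event"   # series already unusual before the event
    elif p_rear < alpha:
        order = "event -> series"   # series unusual only after the event
    else:
        order = None
    effect = None
    if order is not None:
        effect = "increase" if mean(rear) > mean(front) else "decrease"
    return {"exists": order is not None, "order": order, "effect": effect,
            "p_front": p_front, "p_rear": p_rear}


def make_demo_series(n=3000, step=150, k=5, bump=10.0, seed=1):
    """Synthetic data: Gaussian noise that jumps by `bump` for k steps after each event."""
    rng = random.Random(seed)
    series = [rng.gauss(0.0, 1.0) for _ in range(n)]
    events = list(range(100, n, step))
    for t in events:
        for i in range(t, min(t + k, n)):
            series[i] += bump
    return series, events
```

Calling `correlate(*make_demo_series())` exercises all three aspects on synthetic data. Because the random baseline windows can overlap event windows, the temporal-order decision in this sketch is only indicative; a real deployment would need a more careful baseline and a stronger two-sample test.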

Supplementary Material

MP4 File (p1583-sidebyside.mp4)




    Information & Contributors

    Information

    Published In

    KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2014
    2028 pages
    ISBN: 9781450329569
    DOI: 10.1145/2623330
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. correlation
    2. incident diagnosis
    3. two-sample problem

    Qualifiers

    • Research-article

    Conference

    KDD '14
    Acceptance Rates

    KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 6
    Reflects downloads up to 11 Dec 2024

    Citations

    Cited By

    • (2024) HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources. ACM Transactions on Software Engineering and Methodology 33:8, pp. 1-25. DOI: 10.1145/3674726. Online publication date: 1-Jul-2024
    • (2024) On the Model Update Strategies for Supervised Learning in AIOps Solutions. ACM Transactions on Software Engineering and Methodology 33:7, pp. 1-38. DOI: 10.1145/3664599. Online publication date: 26-Aug-2024
    • (2024) X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 417-428. DOI: 10.1145/3663529.3663861. Online publication date: 10-Jul-2024
    • (2024) LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 388-398. DOI: 10.1145/3663529.3663858. Online publication date: 10-Jul-2024
    • (2024) Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 266-277. DOI: 10.1145/3663529.3663846. Online publication date: 10-Jul-2024
    • (2024) BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection. Proceedings of the ACM on Software Engineering 1:FSE, pp. 2214-2237. DOI: 10.1145/3660805. Online publication date: 12-Jul-2024
    • (2024) FaultInsight: Interpreting Hyperscale Data Center Host Faults. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 141-152. DOI: 10.1145/3637528.3672051. Online publication date: 25-Aug-2024
    • (2024) Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. Proceedings of the Nineteenth European Conference on Computer Systems, pp. 674-688. DOI: 10.1145/3627703.3629553. Online publication date: 22-Apr-2024
    • (2024) Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, pp. 1-13. DOI: 10.1145/3597503.3639081. Online publication date: 20-May-2024
    • (2024) STFT-TCAN. Computers and Security 144:C. DOI: 10.1016/j.cose.2024.103961. Online publication date: 1-Sep-2024
