DOI: 10.1145/3447851.3458737
short-paper

CARE: Infusing Causal Aware Thinking to Root Cause Analysis in Cloud System

Published: 26 April 2021

Abstract

With millions of customers accessing online services all over the world, ensuring high service availability is critical for cloud systems. In recent years, advances in data mining and machine learning have prompted extensive study of data-driven solutions for detecting anomalous system behavior and diagnosing its root cause. However, without any surveillance of the data generation process, existing passive data-driven approaches may produce biased analysis results induced by observed and unobserved confounding factors in a dynamic and heterogeneous system, and the resulting misleading mitigation actions can harm service availability. In this paper, we propose to infuse causal thinking into current data-driven solutions for cloud systems. We developed CARE, a causal-aware root cause discovery engine, which uses randomized controlled trials (RCTs) to proactively generate less ambiguous data for further analysis. A case study shows the application of CARE to Microsoft Office 365.
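To illustrate why randomized assignment can yield less ambiguous data than passive observation, the following is a minimal simulation sketch. It is not the CARE implementation: the scenario (a harmless build whose rollout is correlated with host load), the function name `apparent_build_effect`, and all probabilities are assumptions chosen purely for illustration.

```python
import random


def apparent_build_effect(n_hosts, randomized, seed=0):
    """Return failure rate of hosts with a new build minus hosts without it.

    Hypothetical setup: the build has NO real effect on failures.
    Host load is a confounder that raises the failure rate and, in the
    observational case, also makes a host more likely to get the build.
    """
    rng = random.Random(seed)
    with_build, without_build = [], []
    for _ in range(n_hosts):
        high_load = rng.random() < 0.5
        if randomized:
            # RCT: assignment ignores load, so load cannot bias the comparison.
            gets_build = rng.random() < 0.5
        else:
            # Passive data: busy hosts happened to receive the build more often.
            gets_build = rng.random() < (0.8 if high_load else 0.2)
        # Failure depends only on load, never on the build.
        failed = rng.random() < (0.30 if high_load else 0.05)
        (with_build if gets_build else without_build).append(failed)
    rate_with = sum(with_build) / len(with_build)
    rate_without = sum(without_build) / len(without_build)
    return rate_with - rate_without


if __name__ == "__main__":
    print("observational:", apparent_build_effect(20000, randomized=False))
    print("randomized:   ", apparent_build_effect(20000, randomized=True))
```

Under these assumed probabilities, the observational comparison attributes a sizable failure-rate increase to the build even though it has none, while the randomized comparison stays close to zero. A passive analysis of the first dataset could therefore point a mitigation action at the wrong cause.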

Cited By

  • (2024) ChangeRCA: Finding Root Causes from Software Changes in Large Online Systems. Proceedings of the ACM on Software Engineering 1:FSE (24-46). DOI: 10.1145/3643728. Online publication date: 12-Jul-2024
  • (2024) Understanding and Improving Change Risk Detection in Practice. 2024 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (717-727). DOI: 10.1109/SANER60148.2024.00079. Online publication date: 12-Mar-2024
  • (2022) Identifying Erroneous Software Changes through Self-Supervised Contrastive Learning on Time Series Data. 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE) (366-377). DOI: 10.1109/ISSRE55969.2022.00043. Online publication date: Oct-2022

Information

Published In

HAOC '21: Proceedings of the 1st Workshop on High Availability and Observability of Cloud Systems
April 2021
29 pages
ISBN: 9781450383363
DOI: 10.1145/3447851
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud system
  2. Reliability
  3. Root cause analysis

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

EuroSys '21

Article Metrics

  • Downloads (last 12 months): 21
  • Downloads (last 6 weeks): 2
Reflects downloads up to 11 Dec 2024
