DOI: 10.1145/3627703.3629553
research-article
Open access

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

Published: 22 April 2024

Abstract

Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigation of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system powered by a large language model that automates RCA of cloud incidents. RCACopilot matches each incoming incident to the corresponding incident handler based on its alert type, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot using a real-world dataset consisting of a year's worth of incidents from Microsoft. Our evaluation demonstrates that RCACopilot achieves an RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in successful use at Microsoft for over four years.
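
The four stages named in the abstract (handler matching by alert type, diagnostic aggregation, root cause category prediction, and explanation) can be illustrated with a minimal sketch. This is not the authors' implementation: the alert types, handler functions, historical records, and category names below are all hypothetical, and the prediction step uses a simple word-overlap lookup as a stand-in where RCACopilot uses a large language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Incident:
    alert_type: str
    description: str


# Hypothetical handlers: each gathers diagnostic text for one alert type.
def collect_disk_diagnostics(incident: Incident) -> List[str]:
    return ["disk usage at 98% on volume D", "log rotation job failed"]


def collect_network_diagnostics(incident: Incident) -> List[str]:
    return ["packet loss between racks", "dns lookup timeout"]


HANDLERS: Dict[str, Callable[[Incident], List[str]]] = {
    "DiskFull": collect_disk_diagnostics,
    "ConnectivityLoss": collect_network_diagnostics,
}

# Hypothetical labeled history: (diagnostic summary, root cause category).
HISTORY: List[Tuple[str, str]] = [
    ("disk usage high log rotation failed", "StorageExhaustion"),
    ("dns lookup timeout packet loss", "NetworkFailure"),
]


def predict_category(diagnostics: List[str]) -> Tuple[str, str]:
    """Stand-in for the LLM step: choose the historical incident with the
    largest word overlap and return its category plus a short explanation."""
    words = set(" ".join(diagnostics).lower().split())
    best_summary, best_category = max(
        HISTORY, key=lambda item: len(words & set(item[0].split()))
    )
    shared = sorted(words & set(best_summary.split()))
    return best_category, "shared terms: " + ", ".join(shared)


def analyze(incident: Incident) -> Tuple[str, str]:
    handler = HANDLERS[incident.alert_type]  # 1. match handler by alert type
    diagnostics = handler(incident)          # 2. aggregate diagnostic info
    return predict_category(diagnostics)     # 3-4. category and explanation
```

Calling `analyze(Incident("DiskFull", "volume nearly full"))` routes the incident to the disk handler and returns a category together with the terms that drove the match. Keeping the handler table separate from the predictor mirrors the abstract's split between diagnostic collection and root cause prediction.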

Published In

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems
April 2024
1245 pages
ISBN: 9798400704376
DOI: 10.1145/3627703
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Cloud Systems
  2. Large Language Models
  3. Root Cause Analysis

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

EuroSys '24

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Upcoming Conference

EuroSys '25
Twentieth European Conference on Computer Systems
March 30 - April 3, 2025
Rotterdam, Netherlands

Article Metrics

  • Downloads (Last 12 months): 2,433
  • Downloads (Last 6 weeks): 550
Reflects downloads up to 10 Dec 2024

Cited By

  • (2024) ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment 17:3 (359-372). DOI: 10.14778/3632093.3632101. Online publication date: 20-Jan-2024
  • (2024) Building AI Agents for Autonomous Clouds: Challenges and Design Principles. Proceedings of the 2024 ACM Symposium on Cloud Computing (99-110). DOI: 10.1145/3698038.3698525. Online publication date: 20-Nov-2024
  • (2024) FBDetect: Catching Tiny Performance Regressions at Hyperscale through In-Production Monitoring. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles (522-540). DOI: 10.1145/3694715.3695977. Online publication date: 4-Nov-2024
  • (2024) End-to-End AutoML for Unsupervised Log Anomaly Detection. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1680-1692). DOI: 10.1145/3691620.3695535. Online publication date: 27-Oct-2024
  • (2024) ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1183-1194). DOI: 10.1145/3691620.3695495. Online publication date: 27-Oct-2024
  • (2024) Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (1107-1119). DOI: 10.1145/3691620.3695489. Online publication date: 27-Oct-2024
  • (2024) The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (931-943). DOI: 10.1145/3691620.3695475. Online publication date: 27-Oct-2024
  • (2024) FAIL: Analyzing Software Failures from the News Using LLMs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (506-518). DOI: 10.1145/3691620.3695022. Online publication date: 27-Oct-2024
  • (2024) Effective Vulnerable Function Identification based on CVE Description Empowered by Large Language Models. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (393-405). DOI: 10.1145/3691620.3695013. Online publication date: 27-Oct-2024
  • (2024) X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (417-428). DOI: 10.1145/3663529.3663861. Online publication date: 10-Jul-2024
