research-article

Open access

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents

Authors:

Saravan Rajmohan,

Tianyin Xu

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems

Pages 674 - 688

https://doi.org/10.1145/3627703.3629553

Published: 22 April 2024

Abstract

Ensuring the reliability and availability of cloud services requires efficient root cause analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual investigation of data sources such as logs and traces, are often laborious, error-prone, and challenging for on-call engineers. In this paper, we introduce RCACopilot, an on-call system powered by a large language model that automates RCA of cloud incidents. RCACopilot matches each incoming incident to the corresponding incident handler based on its alert type, aggregates the critical runtime diagnostic information, predicts the incident's root cause category, and provides an explanatory narrative. We evaluate RCACopilot on a real-world dataset consisting of a year's worth of incidents from Microsoft. Our evaluation shows that RCACopilot achieves RCA accuracy of up to 0.766. Furthermore, the diagnostic information collection component of RCACopilot has been in successful use at Microsoft for over four years.
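The four stages named in the abstract (handler matching by alert type, diagnostic collection, root cause category prediction, explanation) can be illustrated with a minimal Python sketch. This is not the RCACopilot implementation: every name here (`Incident`, `HANDLERS`, `HISTORY`, `analyze`, the alert types, and the category labels) is hypothetical, and the prediction step swaps the paper's large language model for a simple token-overlap lookup against a tiny invented history so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Incident:
    alert_type: str
    description: str


# Hypothetical handlers: each one gathers diagnostic text for one alert type.
def collect_disk_diagnostics(incident: Incident) -> str:
    return f"{incident.description} | disk usage 98% on /var/log"


def collect_network_diagnostics(incident: Incident) -> str:
    return f"{incident.description} | packet loss detected on uplink"


# Stage 1: incidents are routed to a handler by alert type.
HANDLERS: Dict[str, Callable[[Incident], str]] = {
    "DiskFull": collect_disk_diagnostics,
    "ConnectionTimeout": collect_network_diagnostics,
}

# Invented labeled history: (diagnostic text, root cause category).
HISTORY: List[Tuple[str, str]] = [
    ("disk usage high log volume growth", "StorageExhaustion"),
    ("packet loss uplink switch failure", "NetworkFault"),
]


def tokenize(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,|:%").lower() for w in text.split())
    return {w for w in words if w}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_category(diagnostics: str) -> Tuple[str, str]:
    """Stages 3 and 4: stand-in for the LLM, using nearest past incident."""
    query = tokenize(diagnostics)
    best_text, best_label = max(
        HISTORY, key=lambda item: jaccard(query, tokenize(item[0]))
    )
    explanation = f"Most similar past incident: '{best_text}' ({best_label})"
    return best_label, explanation


def analyze(incident: Incident) -> Dict[str, str]:
    handler = HANDLERS.get(incident.alert_type)
    if handler is None:
        raise KeyError(f"no handler registered for {incident.alert_type!r}")
    diagnostics = handler(incident)  # Stage 2: collect diagnostic information
    category, explanation = predict_category(diagnostics)
    return {
        "diagnostics": diagnostics,
        "category": category,
        "explanation": explanation,
    }


if __name__ == "__main__":
    report = analyze(Incident("DiskFull", "Node storage alert"))
    print(report["category"])
    print(report["explanation"])
```

Keeping the handler registry separate from the prediction step mirrors the abstract's distinction between the diagnostic collection component and the root cause prediction; in a real system the `predict_category` stand-in would be replaced by a model call.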

Index Terms

Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
      1. Cloud computing
2. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software

Information & Contributors

Information

Published In

EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems

April 2024

1245 pages

ISBN:9798400704376

DOI:10.1145/3627703

Copyright © 2024 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

EuroSys '24

Sponsor:

SIGOPS

EuroSys '24: Nineteenth European Conference on Computer Systems

April 22 - 25, 2024

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%

Bibliometrics

Total Citations: 23
Total Downloads: 2,433
Downloads (Last 12 months): 2,433
Downloads (Last 6 weeks): 550

Reflects downloads up to 10 Dec 2024

Citations

Cited By

Chen Y, Zhang C, Ma M, Liu Y, Ding R, Li B, He S, Rajmohan S, Lin Q, Zhang D (2024). ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection. Proceedings of the VLDB Endowment 17:3, pp. 359-372. Online publication date: 20-Jan-2024.
https://doi.org/10.14778/3632093.3632101

Shetty M, Chen Y, Somashekar G, Ma M, Simmhan Y, Zhang X, Mace J, Vandevoorde D, Las-Casas P, Gupta S, Nath S, Bansal C, Rajmohan S (2024). Building AI Agents for Autonomous Clouds: Challenges and Design Principles. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 99-110. Online publication date: 20-Nov-2024.
https://dl.acm.org/doi/10.1145/3698038.3698525

Yoon D, Wang Y, Yu M, Huang E, Jones J, Kukkadapu A, Kocas O, Wiepert J, Goenka K, Chen S, Lin Y, Huang Z, Kong J, Chow M, Tang C, Witchel E, Arpaci-Dusseau A, Rossbach C, Keeton K (2024). FBDetect: Catching Tiny Performance Regressions at Hyperscale through In-Production Monitoring. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pp. 522-540. Online publication date: 4-Nov-2024.
https://dl.acm.org/doi/10.1145/3694715.3695977

Zhang S, Ji Y, Luan J, Nie X, Chen Z, Ma M, Sun Y, Pei D, Filkov V, Ray B, Zhou M (2024). End-to-End AutoML for Unsupervised Log Anomaly Detection. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1680-1692. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695535

Sun Y, Shi B, Mao M, Ma M, Xia S, Zhang S, Pei D, Filkov V, Ray B, Zhou M (2024). ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1183-1194. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695495

Tao L, Zhang S, Jia Z, Sun J, Ma M, Li Z, Sun Y, Yang C, Zhang Y, Pei D, Filkov V, Ray B, Zhou M (2024). Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 1107-1119. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695489

Han Y, Du Q, Huang Y, Wu J, Tian F, He C, Filkov V, Ray B, Zhou M (2024). The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 931-943. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695475

Anandayuvaraj D, Campbell M, Tewari A, Davis J, Filkov V, Ray B, Zhou M (2024). FAIL: Analyzing Software Failures from the News Using LLMs. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 506-518. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695022

Wu Y, Wen M, Yu Z, Guo X, Jin H, Filkov V, Ray B, Zhou M (2024). Effective Vulnerable Function Identification based on CVE Description Empowered by Large Language Models. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pp. 393-405. Online publication date: 27-Oct-2024.
https://dl.acm.org/doi/10.1145/3691620.3695013

Goel D, Husain F, Singh A, Ghosh S, Parayil A, Bansal C, Zhang X, Rajmohan S, d'Amorim M (2024). X-Lifecycle Learning for Cloud Incident Management using LLMs. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, pp. 417-428. Online publication date: 10-Jul-2024.
https://doi.org/10.1145/3663529.3663861
