research-article

Open access

Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach

Authors:

Michael R. Lyu

ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice

Pages 369 - 380

https://doi.org/10.1145/3639477.3639745

Published: 31 May 2024

Abstract

Due to the scale and complexity of cloud systems, a single system failure can trigger an "alert storm", i.e., a massive number of correlated alerts. Although these alerts can be traced back to a few root causes, their overwhelming number makes manual handling infeasible. Alert aggregation is thus critical for helping engineers concentrate on the root cause and for facilitating failure resolution. Existing approaches typically aggregate alerts using either semantic similarity or statistical methods. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts.

To tackle these limitations, we propose leveraging external knowledge, i.e., the Standard Operation Procedure (SOP) of alerts, as a supplement. We present COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module captures the temporal and spatial relations between alerts and measures their correlations efficiently. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show that COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods while achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
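The routing idea in the abstract (cheap statistics first, LLM only for uncertain pairs) can be illustrated with a minimal sketch. This is not COLA's implementation: the `Alert` fields, the co-occurrence score, the thresholds 0.2 and 0.8, the choice to treat unseen alert types as uncertain, and the `llm_judge` callable (stubbed here, where a real system might consult an LLM with SOP text) are all illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Alert:
    alert_id: str
    kind: str         # alert type (hypothetical field)
    timestamp: float  # seconds since an arbitrary epoch


def cooccurrence_score(history: List[Alert], kind_a: str, kind_b: str,
                       window: float = 300.0) -> float:
    """Share of historical alerts of either kind that have an alert of the
    other kind within `window` seconds (a stand-in for correlation mining)."""
    a_times = [x.timestamp for x in history if x.kind == kind_a]
    b_times = [x.timestamp for x in history if x.kind == kind_b]
    if not a_times or not b_times:
        return 0.5  # assumption: no statistical evidence means "uncertain"
    hits_a = sum(any(abs(ta - tb) <= window for tb in b_times) for ta in a_times)
    hits_b = sum(any(abs(tb - ta) <= window for ta in a_times) for tb in b_times)
    return (hits_a + hits_b) / (len(a_times) + len(b_times))


def aggregate(alerts: List[Alert], history: List[Alert],
              llm_judge: Callable[[Alert, Alert], bool],
              low: float = 0.2, high: float = 0.8) -> List[List[str]]:
    """Group alerts; only pairs scoring strictly between `low` and `high`
    are forwarded to the (expensive) judge."""
    parent: Dict[str, str] = {a.alert_id: a.alert_id for a in alerts}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(alerts, 2):
        score = cooccurrence_score(history, a.kind, b.kind)
        if score >= high:
            related = True             # confident: correlated
        elif score <= low:
            related = False            # confident: unrelated
        else:
            related = llm_judge(a, b)  # uncertain: defer to the judge
        if related:
            parent[find(a.alert_id)] = find(b.alert_id)

    groups: Dict[str, List[str]] = {}
    for a in alerts:
        groups.setdefault(find(a.alert_id), []).append(a.alert_id)
    return sorted(sorted(g) for g in groups.values())


# Toy data: disk_full and io_latency co-occur in history; fs_readonly is unseen.
history = [
    Alert("h1", "disk_full", 0.0), Alert("h2", "io_latency", 10.0),
    Alert("h3", "disk_full", 1000.0), Alert("h4", "io_latency", 1020.0),
    Alert("h5", "cpu_high", 5000.0), Alert("h6", "cpu_high", 9000.0),
]
incoming = [
    Alert("a1", "disk_full", 20000.0), Alert("a2", "io_latency", 20005.0),
    Alert("a3", "cpu_high", 20010.0), Alert("a4", "fs_readonly", 20015.0),
]
llm_calls = []


def stub_llm_judge(a: Alert, b: Alert) -> bool:
    """Hypothetical stand-in for an LLM call; records which pairs reach it."""
    llm_calls.append((a.alert_id, b.alert_id))
    return {a.kind, b.kind} == {"disk_full", "fs_readonly"}


groups = aggregate(incoming, history, stub_llm_judge)
print(groups, len(llm_calls))
```

In this toy run, the frequent pair is resolved from statistics alone, and only the pairs involving the unseen alert type reach the judge, which is the efficiency argument the abstract makes for the hybrid design.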


Cited By

Huang J, Jiang Z, Liu J, Huo Y, Gu J, Chen Z, Feng C, Dong H, Yang Z, Lyu M. (2024). Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 511-522. Online publication date: 28-Oct-2024.
https://doi.org/10.1109/ISSRE62328.2024.00055

Gu W, Sun X, Liu J, Huo Y, Chen Z, Zhang J, Gu J, Yang Y, Lyu M. (2024). KPIRoot: Efficient Monitoring Metric-based Root Cause Localization in Large-scale Cloud Systems. 2024 IEEE 35th International Symposium on Software Reliability Engineering (ISSRE), 403-414. Online publication date: 28-Oct-2024.
https://doi.org/10.1109/ISSRE62328.2024.00046

Sarda K, Namrud Z, Watts I, Shwartz L, Nagar S, Mohapatra P, Litoiu M. (2024). Augmenting Automatic Root-Cause Identification with Incident Alerts Using LLM. 2024 34th International Conference on Collaborative Advances in Software and COmputiNg (CASCON), 1-10. Online publication date: 11-Nov-2024.
https://doi.org/10.1109/CASCON62161.2024.10838171

Index Terms

Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues
      1. Maintaining software

Published In

ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice

April 2024

480 pages

ISBN:9798400705014

DOI:10.1145/3639477

Co-chairs: Ana Paiva, Rui Abreu, Maurício Aniche (Delft University of Technology, Netherlands), Nachiappan Nagappan (Meta, USA)

Program Co-chairs: Abhik Roychoudhury, Margaret Storey

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

The work described in this paper was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14206921 of the General Research Fund).

Conference

ICSE-SEIP '24

Sponsor:

SIGSOFT

ICSE-SEIP '24: 46th International Conference on Software Engineering: Software Engineering in Practice

April 14 - 20, 2024

Lisbon, Portugal

Bibliometrics

Article Metrics

Total Citations: 3
Total Downloads: 435
Downloads (Last 12 months): 435
Downloads (Last 6 weeks): 62

Reflects downloads up to 05 Mar 2025
