DOI: 10.5555/3615924.3615949

research-article

ADARMA Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models

Published: 11 September 2023

Abstract

In microservice architectures, anomalies that go undetected can cause slow response times and a poor user experience. Manual detection is time-consuming and error-prone, making real-time anomaly detection necessary. By applying runtime performance anomaly detection models, microservice systems can become more stable and reliable. However, anomaly detection alone is not enough: complementary auto-remediation techniques are required to automatically fix the issues that are detected. Auto-remediation techniques can optimize resource allocation, tune code, or trigger automatic recovery mechanisms. Combining anomaly detection with auto-remediation reduces downtime and enhances system reliability, resulting in increased productivity and customer satisfaction, which, in turn, drives higher revenue. Prior work has largely overlooked auto-remediation. In this work-in-progress paper, we propose a pipeline for automatic anomaly detection and remediation based on Large Language Models (LLMs).
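The detect-then-remediate loop the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: the z-score detector, the `checkout-service` name, and the canned remediation action are all hypothetical stand-ins for the metric models and LLM-generated fixes the pipeline envisions.

```python
import statistics

def detect_anomalies(latencies_ms, window=10, threshold=3.0):
    """Flag indices whose latency deviates more than `threshold` standard
    deviations from the trailing window's mean (a simple z-score check)."""
    anomalies = []
    for i in range(window, len(latencies_ms)):
        hist = latencies_ms[i - window:i]
        mean = statistics.mean(hist)
        stdev = statistics.stdev(hist)
        if stdev > 0 and abs(latencies_ms[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

def remediate(service, anomaly_index):
    """Placeholder remediation step: in the proposed pipeline this is where
    an LLM would be prompted to generate a fix (e.g. a scaling action or a
    recovery script); here it just returns a canned action."""
    return f"restart {service} (anomalous sample #{anomaly_index})"

# Steady ~100 ms latencies followed by a spike at the end.
samples = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 450]
for idx in detect_anomalies(samples):
    print(remediate("checkout-service", idx))
```

In a real deployment the detector would consume streaming metrics (latency, CPU, error rates) and the remediation step would be the interesting part: prompting an LLM with the anomaly context to produce an executable recovery action rather than a fixed string.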


Cited By

  • (2024) Leveraging Large Language Models for the Auto-remediation of Microservice Applications: An Experimental Study. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 10.1145/3663529.3663855, 358–369. Online publication date: 10-Jul-2024.
  • (2024) MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 10.1145/3663529.3663826, 38–49. Online publication date: 10-Jul-2024.
  • (2024) KubePlaybook: A Repository of Ansible Playbooks for Kubernetes Auto-Remediation with LLMs. Companion of the 15th ACM/SPEC International Conference on Performance Engineering. 10.1145/3629527.3653665, 57–61. Online publication date: 7-May-2024.
  • (2024) Closing the Loop: Building Self-Adaptive Software for Continuous Performance Engineering. Companion of the 15th ACM/SPEC International Conference on Performance Engineering. 10.1145/3629527.3652910, 258–259. Online publication date: 7-May-2024.


Published In

CASCON '23: Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering
September 2023, 251 pages

Publisher

IBM Corp., United States


Author Tags

1. AIOps
2. Anomaly Detection
3. Root-cause Analysis
4. Auto-remediation
5. Microservice

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 24 of 90 submissions, 27%
