DOI: 10.5555/3615924.3615949

research-article

ADARMA Auto-Detection and Auto-Remediation of Microservice Anomalies by Leveraging Large Language Models

Published: 11 September 2023

Abstract

In microservice architectures, anomalies that go undetected can cause slow response times and a poor user experience. Manual detection is time-consuming and error-prone, making real-time anomaly detection necessary. By applying runtime performance anomaly detection models, microservice systems can become more stable and reliable. However, anomaly detection alone is not enough: complementary auto-remediation techniques are required to automatically fix the issues that are detected. Auto-remediation techniques can optimize resource allocation, tune code, or trigger automatic recovery mechanisms. Combining anomaly detection with auto-remediation reduces downtime and enhances system reliability, resulting in increased productivity and customer satisfaction, which, in turn, drives higher revenue. Prior work has largely overlooked auto-remediation. In this work-in-progress paper, we propose a pipeline for automatic anomaly detection and remediation based on Large Language Models (LLMs).
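The detect-then-remediate loop the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: the z-score detector, the `checkout-service` name, and the canned remediation action are all hypothetical stand-ins for the metric models and LLM-generated fixes the pipeline envisions.

```python
import statistics

def detect_anomalies(latencies_ms, window=10, threshold=3.0):
    """Flag indices whose latency deviates more than `threshold` standard
    deviations from the trailing window's mean (a simple z-score check)."""
    anomalies = []
    for i in range(window, len(latencies_ms)):
        hist = latencies_ms[i - window:i]
        mean = statistics.mean(hist)
        stdev = statistics.stdev(hist)
        if stdev > 0 and abs(latencies_ms[i] - mean) / stdev > threshold:
            anomalies.append(i)
    return anomalies

def remediate(service, anomaly_index):
    """Placeholder remediation step: in the proposed pipeline this is where
    an LLM would be prompted to generate a fix (e.g. a scaling action or a
    recovery script); here it just returns a canned action."""
    return f"restart {service} (anomalous sample #{anomaly_index})"

# Steady ~100 ms latencies followed by a spike at the end.
samples = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 100, 450]
for idx in detect_anomalies(samples):
    print(remediate("checkout-service", idx))
```

In a real deployment the detector would consume streaming metrics (latency, CPU, error rates) and the remediation step would be the interesting part: prompting an LLM with the anomaly context to produce an executable recovery action rather than a fixed string.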


Cited By

  • (2024) Leveraging Large Language Models for the Auto-remediation of Microservice Applications: An Experimental Study. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 10.1145/3663529.3663855, 358–369. Online publication date: 10-Jul-2024.
  • (2024) MonitorAssistant: Simplifying Cloud Service Monitoring via Large Language Models. Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. 10.1145/3663529.3663826, 38–49. Online publication date: 10-Jul-2024.
  • (2024) KubePlaybook: A Repository of Ansible Playbooks for Kubernetes Auto-Remediation with LLMs. Companion of the 15th ACM/SPEC International Conference on Performance Engineering. 10.1145/3629527.3653665, 57–61. Online publication date: 7-May-2024.
  • (2024) Closing the Loop: Building Self-Adaptive Software for Continuous Performance Engineering. Companion of the 15th ACM/SPEC International Conference on Performance Engineering. 10.1145/3629527.3652910, 258–259. Online publication date: 7-May-2024.


Published In

CASCON '23: Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering
September 2023, 251 pages

Publisher

IBM Corp., United States


Author Tags

1. AIOps
2. Anomaly Detection
3. Root-cause Analysis
4. Auto-remediation
5. Microservice

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate: 24 of 90 submissions, 27%
