research-article

Open access

LogSD: Detecting Anomalies from System Logs through Self-Supervised Learning and Frequency-Based Masking

Authors:

Muhammad Ali Babar

Proceedings of the ACM on Software Engineering, Volume 1, Issue FSE

Article No.: 93, Pages 2098 - 2120

https://doi.org/10.1145/3660800

Published: 12 July 2024

Abstract

Log analysis is one of the main techniques that engineers use for troubleshooting large-scale software systems. Over the years, many supervised, semi-supervised, and unsupervised log analysis methods have been proposed to detect system anomalies by analyzing system logs. Among these, semi-supervised methods have garnered increasing attention because, compared with their supervised and unsupervised counterparts, they strike a balance between relaxed labeled-data requirements and optimal detection performance. However, existing semi-supervised methods overlook the potential bias that highly frequent log messages introduce into the learned normal patterns, which leads to less-than-satisfactory performance. In this study, we propose LogSD, a novel semi-supervised approach based on self-supervised learning. LogSD employs a dual-network architecture and incorporates a frequency-based masking scheme, a global-to-local reconstruction paradigm, and three self-supervised learning tasks. These features enable LogSD to focus more on relatively infrequent log messages and thereby learn less biased and more discriminative patterns from historical normal data, which in turn improves anomaly detection performance. Extensive experiments on three commonly used datasets show that LogSD significantly outperforms eight state-of-the-art benchmark methods.
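
To make the idea of frequency-based masking concrete, the following is a minimal illustrative sketch, not LogSD's actual scheme, which the abstract does not detail. It assumes that log messages have already been parsed into event IDs and that events whose relative frequency in the normal training data exceeds a threshold are hidden behind a mask token. The names `event_frequencies`, `mask_frequent`, the `<MASK>` token, and the threshold value are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def event_frequencies(sequences: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Relative frequency of each log event across all normal sequences."""
    counts = Counter(event for seq in sequences for event in seq)
    total = sum(counts.values())
    return {event: n / total for event, n in counts.items()}


def mask_frequent(
    seq: Sequence[str], freqs: Dict[str, float], threshold: float
) -> List[str]:
    """Replace events whose relative frequency exceeds `threshold` with MASK."""
    return [MASK if freqs.get(e, 0.0) > threshold else e for e in seq]


# Toy "normal" data: E1 dominates, E2 and E3 are rare.
normal_sequences = [
    ["E1", "E1", "E2", "E1"],
    ["E1", "E3", "E1", "E1"],
]
freqs = event_frequencies(normal_sequences)
masked = mask_frequent(normal_sequences[0], freqs, threshold=0.5)
print(masked)
```

In this toy setting only the dominant event is masked, so a model trained on the masked sequences would see the rare events unchanged. How LogSD actually selects and uses masked positions may differ from this assumption.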

Index Terms

LogSD: Detecting Anomalies from System Logs through Self-Supervised Learning and Frequency-Based Masking
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. Software and its engineering
  1. Software notations and tools
    1. Software maintenance tools

Information & Contributors

Information

Published In

Proceedings of the ACM on Software Engineering Volume 1, Issue FSE

July 2024

2770 pages

EISSN:2994-970X

DOI:10.1145/3554322

Editor:
Luciano Baresi
Politecnico di Milano, Italy

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Australian Research Council
