Abstract
Logs are a reliable source of information for development and maintenance purposes. They record information at runtime regarding the state of a system and are commonly used to analyze its behavior. Parsing structures the information embedded within log messages and is a crucial step for many log mining applications. In such use cases, parsing effectiveness can impact the performance of the downstream application. For systems that require real-time performance, parsing efficiency is also an important factor. In this paper, we present USTEP, an online log parser that uses an evolving tree structure to encode and discover new parsing rules on the fly. Our evaluation on 14 datasets from different logging environments highlights the superiority of our method in terms of robustness and effectiveness compared to the state of the art. Our analysis of space and time complexity shows that USTEP is the only considered method capable of processing logs in constant time regardless of their length. We also propose USTEP-UP, a way of running multiple USTEP instances in parallel.
References
Vervaet A, Chiky R, Callau-Zori M (2021) Ustep: unfixed search tree for efficient log parsing. In: 2021 IEEE international conference on data mining (ICDM)
Armbrust M, Fox A, Griffith R, Joseph AD, Katz R, Konwinski A, Lee G, Patterson D, Rabkin A, Stoica I et al (2010) A view of cloud computing. Commun ACM 53(4):50–58
Gartner (2021) 3 Cloud disciplines to fuel digital innovation. https://www.gartner.com/smarterwithgartner/3-cloud-disciplines-to-fuel-digital-innovation
Varghese B, Buyya R (2018) Next generation cloud computing: new trends and research directions. Futur Gener Comput Syst 79:849–861
He S, He P, Chen Z, Yang T, Su Y, Lyu MR (2021) A survey on automated log analysis for reliability engineering. ACM Comput Surv (CSUR) 54(6):1–37
Zeng L, Xiao Y, Chen H, Sun B, Han W (2016) Computer operating system logging and security issues: a survey. Secur Commun Netw 9(17):4804–4821
Mi H, Wang H, Zhou Y, Lyu MR-T, Cai H (2013) Toward fine-grained, unsupervised, scalable performance diagnosis for production cloud computing systems. IEEE Trans Parallel Distrib Syst 24(6):1245–1255
Liang H, Song L, Wang J, Guo L, Li X, Liang J (2021) Robust unsupervised anomaly detection via multi-time scale dcgans with forgetting mechanism for industrial multivariate time series. Neurocomputing 423:444–462
Du M, Li F, Zheng G, Srikumar V (2017) Deeplog: anomaly detection and diagnosis from system logs through deep learning. In: Proceedings of the 2017 ACM SIGSAC conference on computer and communications security, pp 1285–1298
Zhu J, He S, Liu J, He P, Xie Q, Zheng Z, Lyu MR (2019) Tools and benchmarks for automated log parsing. In: 2019 IEEE/ACM 41st international conference on software engineering: software engineering in practice (ICSE-SEIP). IEEE, pp 121–130
Mizutani M (2013) Incremental mining of system log format. In: 2013 IEEE international conference on services computing. IEEE, pp 595–602
Shima K (2016) Length matters: clustering system log messages using length of words. arXiv preprint arXiv:1611.03213
Du M, Li F (2016) Spell: streaming parsing of system event logs. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE, pp 859–864
He P, Zhu J, Zheng Z, Lyu MR (2017) Drain: an online log parsing approach with fixed depth tree. In: 2017 IEEE international conference on web services (ICWS). IEEE, pp 33–40
Fu Q, Lou J-G, Wang Y, Li J (2009) Execution anomaly detection in distributed systems through unstructured log analysis. In: 2009 Ninth IEEE international conference on data mining. IEEE, pp 149–158
Tang L, Li T, Perng C-S (2011) Logsig: Generating system events from raw textual logs. In: Proceedings of the 20th ACM international conference on information and knowledge management, pp 785–794
Hamooni H, Debnath B, Xu J, Zhang H, Jiang G, Mueen A (2016) Logmine: fast pattern recognition for log analytics. In: Proceedings of the 25th ACM international on conference on information and knowledge management, pp 1573–1582
Makanju AA, Zincir-Heywood AN, Milios EE (2009) Clustering event logs using iterative partitioning. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1255–1264
Vaarandi R (2003) A data clustering algorithm for mining patterns from event logs. In: Proceedings of the 3rd IEEE workshop on IP operations & management (IPOM 2003) (IEEE Cat. No. 03EX764). IEEE, pp 119–126
Nagappan M, Vouk MA (2010) Abstracting log lines to log event types for mining software system logs. In: 2010 7th IEEE working conference on mining software repositories (MSR 2010). IEEE, pp 114–117
Vaarandi R, Pihelgas M (2015) Logcluster: a data clustering and pattern mining algorithm for event logs. In: 2015 11th international conference on network and service management (CNSM). IEEE, pp 1–7
Jiang ZM, Hassan AE, Flora P, Hamann G (2008) Abstracting execution logs to execution events for enterprise applications (short paper). In: 2008 The eighth international conference on quality software. IEEE, pp 181–186
Dai H, Li H, Shang W, Chen T-H, Chen C-S (2020) Logram: efficient log parsing using n-gram dictionaries. arXiv preprint arXiv:2001.03038
Nedelkoski S, Bogatinovski J, Acker A, Cardoso J, Kao O (2020) Self-supervised log parsing. arXiv preprint arXiv:2003.07905
He P, Zhu J, Xu P, Zheng Z, Lyu MR (2018) A directed acyclic graph approach to online log parsing
He P, Zhu J, He S, Li J, Lyu MR (2017) Towards automated log parsing for large-scale log data analysis. IEEE Trans Dependable Secur Comput 15(6):931–944
Agrawal A, Karlupia R, Gupta R (2019) Logan: a distributed online log parser. In: 2019 IEEE 35th international conference on data engineering (ICDE). IEEE, pp 1946–1951
Pang G, Shen C, Cao L, Hengel AVD (2021) Deep learning for anomaly detection: a review. ACM Comput Surv (CSUR) 54(2):1–38
Xu W, Huang L, Fox A, Patterson D, Jordan MI (2009) Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles, pp 117–132
Lou J-G, Fu Q, Yang S, Xu Y, Li J (2010) Mining invariants from console logs for system problem detection. In: USENIX annual technical conference, pp 1–14
Zhang X, Xu Y, Lin Q, Qiao B, Zhang H, Dang Y, Xie C, Yang X, Cheng Q, Li Z et al (2019) Robust log-based anomaly detection on unstable log data. In: Proceedings of the 2019 27th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, pp 807–817
Nedelkoski S, Bogatinovski J, Acker A, Cardoso J, Kao O (2020) Self-attentive classification-based anomaly detection in unstructured logs. In: 2020 IEEE international conference on data mining (ICDM), pp 1196–1201. https://doi.org/10.1109/ICDM50108.2020.00148
Kimura T, Watanabe A, Toyono T, Ishibashi K (2018) Proactive failure detection learning generation patterns of large-scale network logs. IEICE Trans Commun
Lu S, Rao B, Wei X, Tak B, Wang L, Wang L (2017) Log-based abnormal task detection and root cause analysis for spark. In: 2017 IEEE international conference on web services (ICWS). IEEE, pp 389–396
Anitha V, Isakki P (2016) A survey on predicting user behavior based on web server log files in a web usage mining. In: 2016 International conference on computing technologies and intelligent data engineering (ICCTIDE’16), pp 1–4. https://doi.org/10.1109/ICCTIDE.2016.7725340
Awad M, Menascé DA (2015) Automatic workload characterization using system log analysis. In: Computer measurement group conference on performance and capacity, San Antonio, TX
He P, Zhu J, He S, Li J, Lyu MR (2016) An evaluation study on log parsing and its use in log mining. In: 2016 46th annual IEEE/IFIP international conference on dependable systems and networks (DSN). IEEE, pp 654–661
He S, Zhu J, He P, Lyu MR (2020) Loghub: a large collection of system log datasets towards automated log analytics. arXiv preprint arXiv:2008.06448
Ghomi EJ, Rahmani AM, Qader NN (2017) Load-balancing algorithms in cloud computing: a survey. J Netw Comput Appl 88:50–71
Mishra SK, Sahoo B, Parida PP (2020) Load balancing in cloud computing: a big picture. J King Saud Univ Comput Inf Sci 32(2):149–158
Acknowledgements
The work described in this paper was supported by the cloud provider 3DS OUTSCALE and by the French National Research and Technology Association (CIFRE program N° 2020/0289). We warmly thank both of them for their support.
Additional information
Some preliminary results have been published at the IEEE International Conference on Data Mining in 2021 [1].
Appendix A: Details about parsing experimental settings
Preprocessing with regular expressions helps log parsers achieve more accurate results. During the evaluation, we applied the same regex to all the parsers on a given dataset. For every algorithm, the parameter values were fine-tuned over more than 100 runs to avoid bias from randomization, and we kept the values for which the algorithm achieves the highest accuracy on a given dataset. The preprocessing regex for each dataset and the parameters for each parser are summarized in Table 5.
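As an illustration of this preprocessing step, the following is a minimal Python sketch rather than the code used in the evaluation. The two patterns (an IPv4 address with an optional port, and an HDFS-style block identifier), the `<*>` wildcard token, and the function name `preprocess` are assumptions chosen for the example; the regexes actually used for each dataset are those listed in Table 5.

```python
import re
from typing import List

# Hypothetical patterns for illustration only; the per-dataset regexes
# used in the evaluation are the ones reported in Table 5.
PATTERNS = [
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),  # IPv4 address, optional port
    re.compile(r"\bblk_-?\d+\b"),                          # HDFS-style block identifier
]
WILDCARD = "<*>"  # assumed placeholder for masked variable parts


def preprocess(message: str) -> List[str]:
    """Mask obvious variable fields, then split the message on whitespace."""
    for pattern in PATTERNS:
        message = pattern.sub(WILDCARD, message)
    return message.split()


if __name__ == "__main__":
    line = "Received block blk_-1608999687919862906 from 10.250.19.102:50010"
    print(preprocess(line))
```

Masking such fields before parsing means that every parser receives the same token sequence for a given message, which is why sharing one regex set per dataset keeps the comparison fair.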
Regarding the number of parameters, SHISO requires four: (1) maxChild, the maximum number of children for each internal node; (2) mergeThreshold, a threshold for searching for the most similar template among the children; (3) formatLookupThreshold, a lower bound for finding the most similar node to adjust; and (4) superFormatThreshold, the threshold on the average LCS length that determines whether a super format must be created. LenMa uses only one parameter, \(T_c\), the threshold for similarity comparisons between the log message and the clusters. Spell also requires only one parameter, \(\tau\), as a similarity threshold. Finally, Drain needs three parameters [14]: (1) depth, the depth of the parsing tree; (2) st, a threshold for similarity comparisons between the log messages and the discovered templates; and (3) maxChild, the maximum number of children that a node can have. Once this maximum is reached, every new value is sent to a default node. In the latest version [25], the number of parameters was reduced to only one, st, and a dynamic update is proposed (Table 6).
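To make the role of a similarity threshold such as Drain's st concrete, here is a simplified Python sketch. It is not Drain's implementation: the tree lookup governed by depth and maxChild is replaced by a flat list of templates, similarity is taken as the fraction of positions holding identical tokens, and the names `token_similarity`, `match_or_create`, the `<*>` wildcard, and the value `st=0.6` are assumptions made for the example.

```python
from typing import List

WILDCARD = "<*>"  # assumed placeholder for variable tokens


def token_similarity(template: List[str], tokens: List[str]) -> float:
    """Fraction of positions where template and message hold the same token.

    Assumes both sequences have the same length.
    """
    if not tokens:
        return 1.0
    same = sum(1 for t, m in zip(template, tokens) if t == m)
    return same / len(tokens)


def match_or_create(templates: List[List[str]], tokens: List[str], st: float = 0.6) -> int:
    """Return the index of the template matched by `tokens`, creating one if needed.

    A flat list stands in for the parsing tree; only templates with the
    same number of tokens as the message are compared.
    """
    best, best_sim = None, -1.0
    for i, template in enumerate(templates):
        if len(template) != len(tokens):
            continue
        sim = token_similarity(template, tokens)
        if sim > best_sim:
            best, best_sim = i, sim
    if best is not None and best_sim >= st:
        # Generalize the template: positions that differ become wildcards.
        templates[best] = [
            t if t == m else WILDCARD for t, m in zip(templates[best], tokens)
        ]
        return best
    templates.append(list(tokens))
    return len(templates) - 1


if __name__ == "__main__":
    templates: List[List[str]] = []
    for line in (
        "Connection from host A closed",
        "Connection from host B closed",
        "Disk usage at 91 percent",
    ):
        print(match_or_create(templates, line.split()), templates)
```

A higher threshold makes the parser create more, narrower templates, while a lower one merges more messages into fewer, more general templates, which is why this value is tuned per dataset.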
About this article
Cite this article
Vervaet, A., Callau-Zori, M., Chabchoub, Y. et al. Online log parsing using evolving research tree. Knowl Inf Syst 66, 1231–1255 (2024). https://doi.org/10.1007/s10115-023-01953-z
DOI: https://doi.org/10.1007/s10115-023-01953-z