CN111367964B - Method for automatically analyzing log - Google Patents
Method for automatically analyzing log
- Publication number
- CN111367964B (application CN202010132165.XA)
- Authority
- CN
- China
- Prior art keywords
- log
- state
- analysis
- data
- probability
- Prior art date: 2020-02-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a method for automatically parsing logs, comprising the following steps: S1, acquiring sample log data; S2, establishing a log database and a log parsing model from the sample log data; S3, acquiring target log data and preprocessing it; S4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, obtaining the parse structure of the target log by solving for the maximum-probability path; S5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log. Compared with the prior art, the invention solves the inefficiency of parsing logs with manually written regular expressions by constructing a hidden Markov log parsing model combined with the Viterbi algorithm, and can quickly and accurately identify the internal structure of a log and extract its effective information automatically.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continuing development of computer technology, computer systems are becoming more complex. For IT operations and maintenance, raw logs cannot directly provide effective information; the fields in a raw log must be parsed before the effective information can be extracted. The traditional log parsing method is to write a corresponding regular expression by hand. This approach is viable when there are few log types and the log structure rarely changes. However, as more and more functions are integrated into systems, a large number of IT subsystems generate a large volume of log data of many types. Designing regular matching rules for every type of log is extremely time-consuming and labor-intensive. How to parse text logs quickly and accurately has therefore become a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for automatically parsing logs which, based on natural language processing techniques, lets a computer automatically identify the internal structure of a text log and thus extract effective information from the log quickly and accurately.
The aim of the invention can be achieved by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
Further, the step S2 specifically includes the following steps:
s21, marking the structure of the sample log according to the effective information of the sample log data so as to establish a log database;
s22, constructing a hidden Markov model according to the noted log structure information in the log database to serve as a log analysis model.
Further, the sample log data in the step S21 includes eight types of log data: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall, and VPN logs.
Further, when the structure of the sample log is annotated in step S21, the structure of the log is labeled with the B, M, E, S, O identifiers to obtain a label in one-to-one correspondence with each character in the log structure, where S represents a single character, B, M, and E represent the beginning, middle, and end of a character string, respectively, and O represents a character that is not part of the log structure.
Further, the annotated log structure information in step S22 includes a log structure string and a corresponding label string, where each character in the log structure string is a distinct observation and each label in the label string is a distinct state.
Further, the specific process of constructing the hidden markov model in step S22 is as follows:
s221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
s222, counting the probabilities of states generating observations in the log database to obtain an observation probability matrix;
s223, counting initial state probabilities in a log database to obtain initial probability distribution;
s224, constructing a hidden Markov model through training a state transition matrix, an observation probability matrix and an initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
Further, the preprocessing in step S3 specifically means cleaning invalid characters from the target log, including garbled (mojibake) characters, carriage returns, and leading and trailing spaces.
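For illustration, a minimal Python sketch of this cleaning step (not part of the patent; keeping only printable ASCII is an assumption for illustration, and logs containing legitimate non-ASCII text would need a gentler filter):

```python
import re

def clean_log(line: str) -> str:
    # Drop non-printable bytes (mojibake fragments, carriage returns,
    # line feeds), then trim leading and trailing spaces.
    return re.sub(r"[^\x20-\x7E]", "", line).strip()

print(clean_log("  GET /index.html 200\r\n"))  # -> "GET /index.html 200"
```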
Compared with the prior art, the invention builds the log parsing model on a hidden Markov model, so different types of log data can be parsed automatically without manually writing regular expressions or retraining the model. This increases parsing speed and greatly reduces the labor and time spent on log parsing. In addition, the invention combines the hidden Markov model with the Viterbi algorithm to compute the maximum-probability path, which ensures the accuracy of log parsing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a diagram illustrating an application process of a log parsing model according to an embodiment;
FIG. 4 is an Apache access data sample in an embodiment;
FIG. 5 is an Apache error log data sample of an embodiment;
FIG. 6 is a data sample of Aruba wireless in an embodiment;
FIG. 7 is a data sample of an Nginx access in an embodiment;
FIG. 8 is a sample of Nginx error data in an example;
FIG. 9 is an Exchange data sample in an embodiment;
FIG. 10 is a Juniper firewall log sample of an embodiment;
FIG. 11 is a VPN sample in the embodiment;
FIG. 12 is a diagram of a log structure annotation in an embodiment;
FIG. 13 is a schematic diagram of a calculation process of a maximum probability path;
fig. 14 is a schematic diagram of a usage flow of REST API service in an embodiment.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
As shown in fig. 1, a method for automatically parsing a log includes the following steps:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
The embodiment applies the above method to automatically parse text logs and builds an application service based on a REST (Representational State Transfer) API, as shown in figs. 2-3:
1. preparation work
Before logs can be parsed, log data needs to be collected; this includes establishing the log database and determining the log entity labels.
1.1 Log library establishment
Various types of log data are collected and the effective information of the log data is marked.
1.2 Log resolution model construction
A model is built from the collected log data and the annotated log structure information. The invention uses a hidden Markov model as the parsing model, so three parameters of the hidden Markov model must be computed: the initial probability distribution, the state transition probability matrix, and the observation probability matrix.
Specifically, the hidden Markov model (HMM) is a probabilistic graphical model. An HMM describes the transitions of a system's hidden states and the probabilities of the observations those states produce. Its power lies in estimating the most likely hidden-variable sequence from a given sequence of observed variables, and in making predictions about future observations.
Take speech recognition: given a piece of audio, the task is to recognize the text. The audio is the observed variable and the text is the hidden variable. Pronunciation varies slightly with context, but the approximate pronunciation is statistically regular. Likewise, when we speak a sentence, there are transition regularities between successive words.
In terms of model representation:
the HMM comprises three parameters, namely an initial probability distribution, a state transition probability matrix and an observation probability matrix.
Let Q be the set of all possible states and V be the set of all possible observations.
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M}
where N is the number of possible states and M is the number of possible observations.
I is a state sequence of length T, and O is the corresponding observation sequence.
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
π is the initial state probability vector:
π = (π_i)^T
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_{N×N}
where
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transitioning to state q_j at time t+1 given state q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_{N×M}
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k at time t given state q_j.
HMM can mainly address three problems:
probability calculation problem. Given a model λ= (a, B, pi) and an observation sequence= (o) 1 ,o 2 ,...,o T ) The probability P (o|λ) of occurrence of the observation sequence O under the model λ is calculated.
Learning problems. Known observation sequence o= (O) 1 ,o 2 ,...,o T ) The model λ= (a, B, pi) parameter is estimated, which is the observed sequence probability P (o|λ) under this model. I.e. estimating the parameters using maximum likelihood estimation.
The prediction problem is also called decoding (decoding) problem. The known model λ= (a, B, pi) and the observed sequence o= (O) 1 ,o 2 ,...,o T ) Solving a state sequence I= (I) with the maximum conditional probability P (I|O) for a given observation sequence 1 ,i 2 ,...,i T ). I.e. given an observation sequence, the most likely corresponding sequence state is found.
When annotating logs, the structures of the logs are marked first, preliminarily determining the log structures of the different log types. This embodiment selects eight relatively typical log types for annotation: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall logs, and VPN; their data samples are shown in figs. 4 to 11, respectively.
The log content is then marked. The internal structure types in the log are shown in Table 1:
TABLE 1
Log structure type |
---|
host |
date |
http method |
http_code |
uri |
log level |
meaningless |
The log data is then labeled. For sequence labeling problems, identifiers such as B, M, E, S, O are typically used: S denotes a single character; B, M, and E denote the beginning, middle, and end of a character string, respectively; and O denotes a character that is not part of the log structure. The log internal structure types above are combined with the B, M, E, S labels. For the log: 192.168.3.1 - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171, the corresponding labels are shown in FIG. 12.
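For illustration, a small Python helper (hypothetical, not from the patent) that generates the per-character B/M/E/S labels for one labeled field:

```python
def bmeso_labels(field: str, tag: str) -> list:
    """Per-character labels for one log field: a single character gets
    tag-s; longer fields get tag-b, tag-m, ..., tag-m, tag-e."""
    if len(field) == 1:
        return [f"{tag}-s"]
    return [f"{tag}-b"] + [f"{tag}-m"] * (len(field) - 2) + [f"{tag}-e"]

# The host field of the example log:
print(bmeso_labels("192.168.3.1", "host"))
# ['host-b', 'host-m', ..., 'host-m', 'host-e'] (11 labels, one per character)
```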
The collected sample log data are labeled one by one so that the hidden Markov model can be constructed. For the hidden Markov model, three important variables must first be estimated by counting: π, A, and B. For a log, its string is the observation sequence and each character is an observation; the labels of the characters are the hidden variables, i.e., the states of the hidden Markov model. These parameter calculations are shown below.
The state transition matrix is an N×N matrix, where N is the number of states, i.e., the number of log labels. It is computed by counting: a_ij = C(q_i → q_j) / C(q_i), where C(q_i → q_j) is the number of times state q_i is immediately followed by state q_j in the labeled data and C(q_i) is the total number of occurrences of q_i.
The observation probability matrix is an N×M matrix, where N is the number of log labels and M is the number of character types. It is likewise computed by counting: b_j(k) = C(q_j emits v_k) / C(q_j).
The initial state probability π_i is computed as the frequency, over the S sample logs, with which the initial state is q_i.
For example: for three pieces of log data, i.e., three observation sequences.
“127.0.0.1get 200”.
“192.168.10.1post 404”
“127.0.0.1get 403”
The corresponding log structure sequences are shown below; the log structure type is given in brackets, and \s represents a space character.
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]"
"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]"
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]"
Then we can get the state set { host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e }, ten states total.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, ., g, e, t, p, o, s, \s}, seventeen observations in total, where \s represents a space and "." is the dot character in the IP addresses.
First, we count adjacent state pairs, i.e., the frequency with which each state follows the previous one. For example, consider the state transition probability p(host-e | host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total, so the transition probability from host-m to host-e is 3/24 = 0.125. In this way we obtain the 10×10 state transition matrix A.
Second, we count the observation probability matrix. For example, consider the observation probability p(3 | http-code-e) of state "http-code-e" emitting the character "3": the character "3" is labeled "http-code-e" once, and the state "http-code-e" occurs 3 times in total, so the probability of state "http-code-e" emitting "3" is 1/3. In this way we obtain the 10×17 observation probability matrix B.
Finally, we count the initial state probability π. There are 3 sequences in total; "host-b" is the initial state all 3 times and no other state ever is, so the initial probability of "host-b" is 1.0 and that of every other state is 0.
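The counts above can be reproduced directly from the three annotated sequences. A Python sketch (the parser for the char[label] notation is an illustrative assumption, not the patent's code):

```python
import re
from collections import Counter

# The three annotated sequences from the example ("\s" stands for a space).
annotated = [
    r"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]",
    r"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]",
    r"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]",
]

trans, emit, state_cnt, init = Counter(), Counter(), Counter(), Counter()
for seq in annotated:
    pairs = re.findall(r"(\\s|.)\[([^\]]+)\]", seq)  # [(char, label), ...]
    init[pairs[0][1]] += 1
    for ch, st in pairs:
        emit[(st, ch)] += 1
        state_cnt[st] += 1
    for (_, s1), (_, s2) in zip(pairs, pairs[1:]):
        trans[(s1, s2)] += 1

# host-m never ends a sequence, so its total count is a valid denominator.
print(trans[("host-m", "host-e")] / state_cnt["host-m"])      # 3/24 = 0.125
print(emit[("http-code-e", "3")] / state_cnt["http-code-e"])  # 1/3
print(init["host-b"] / sum(init.values()))                    # 1.0
```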
2 log structure parsing
The step of log structure parsing will be described by taking certain log data as an example.
The first step: preprocess the input log, removing carriage returns and spaces at both ends of the log and garbled characters inside it.
The second step: using the trained initial probability distribution, state transition probability matrix, and observation probability matrix, parse the structure with the Viterbi algorithm and select the structure with the maximum probability.
The third step: output the parse structure of the log, extract the effective information in it, and mark the corresponding positions.
Specifically, for a newly input log, the log data is first preprocessed; preprocessing mainly removes invalid characters such as garbled bytes. Then, based on the three hidden Markov parameters, the Viterbi algorithm is used to find the optimal parse. The Viterbi algorithm is a dynamic-programming method that solves for the maximum-probability path; here, a path corresponds to a log parse structure.
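A minimal NumPy sketch of this Viterbi decoding step under the notation defined above (a generic implementation for illustration, not the patent's own code):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the maximum-probability state path and its probability.

    pi  : (N,) initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = P(next state j | current state i)
    B   : (N, M) observation matrix, B[j, k] = P(observation k | state j)
    obs : sequence of observation indices of length T
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, i]: best probability of a path ending in state i at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: predecessor of state i on that best path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for i in range(N):
            scores = delta[t - 1] * A[:, i]
            psi[t, i] = int(np.argmax(scores))
            delta[t, i] = scores[psi[t, i]] * B[i, obs[t]]
    # Backtrack from the most probable final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

Run on a log string with the π, A, and B estimated above, the returned path is the label sequence from which the log fields can be cut out.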
For example, suppose we have obtained a state transition matrix A and an observation probability matrix B, and the initial probability distribution is:
π = (0.3, 0.2, 0.5)^T.
The state set is {"a", "b", "c"} and the observation set is {"m", "n"}; we compute the optimal parse structure for the observation sequence ("m", "n", "m").
First, initialize: at t = 1 (the first position), for each state i, i = 1, 2, 3, compute the probability of starting in state i and observing o_1 = "m"; denote it δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
and set ψ_1(i) = 0, i = 1, 2, 3.
At t = 2, for each state i, i = 1, 2, 3, find the maximum probability over all paths that are in state j at t = 1 and in state i at t = 2 while observing character o_2 = "n"; denote it δ_2(i):
δ_2(i) = max_j [δ_1(j) a_ji] b_i(o_2), i = 1, 2, 3
Meanwhile, for each state i, i = 1, 2, 3, record the previous state j on the maximum-probability path:
ψ_2(i) = arg max_j [δ_1(j) a_ji]
Calculating:
ψ_2(1) = 3
δ_2(2) = 0.024, ψ_2(2) = 3
δ_2(3) = 0.048, ψ_2(3) = 3
also, at t=3,
δ 3 (1)=0.00756,ψ 3 (1)=1
δ 3 (1)=0.00864,ψ 3 (2)=3
δ 2 (1)=0.00768,ψ 3 (3)=3
Let P* denote the probability of the optimal path:
P* = max_i δ_3(i) = 0.00864
The end point of the optimal path is i*_3 = arg max_i δ_3(i) = 2.
Backtracking from the end point of the optimal path: at t = 2, i*_2 = ψ_3(i*_3) = ψ_3(2) = 3; at t = 1, i*_1 = ψ_2(i*_2) = ψ_2(3) = 3.
The optimal path, i.e., the optimal state sequence, is therefore I* = (i*_1, i*_2, i*_3) = (3, 3, 2), i.e., ("c", "c", "b"). Fig. 13 illustrates the process of calculating the maximum-probability path.
After the model construction is completed, this embodiment evaluates the log parsing model. The parsing model should find as many of the log entities as possible, and the entities it finds should be as accurate as possible; that is, both recall and precision should be high. To ensure that recall and precision are balanced, the model is evaluated with the F1-measure.
Precision = correct_extract / extract_entity, Recall = correct_extract / data_entity, and F1 = 2 × Precision × Recall / (Precision + Recall), where correct_extract is the number of correctly extracted log entities, extract_entity is the total number of extracted log entities, and data_entity is the number of log entities in the data. For example, consider a log with the following format:
"Jan 12 17:47:48 127.0.0.1xxx, information, download, 175.42.41.4'
The correct parse structure is "Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4". Suppose the model's parse is "Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4". Then the evaluation quantities are as follows.
correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4"}
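Computing the metrics on this example (a quick Python check):

```python
correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4"}

precision = len(correct_extract) / len(extract_entity)  # 3/6 = 0.5
recall = len(correct_extract) / len(data_entity)        # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)      # 0.6
print(precision, recall, f1)
```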
To evaluate the parsing results, this embodiment splits each log data set, using 60% of the data as a training set and 40% as a test set. The test set is predicted with the model trained on the training set, and the results are evaluated. Table 2 shows the parsing results for each log model.
TABLE 2
To verify the model in scenarios with large data volumes, this embodiment also tests the parsing speed on each log type; the test results are shown in Table 3:
TABLE 3
Log category | Number of log entries | File size | File format | Parsing time |
---|---|---|---|---|
Apache Access | 523 | 51KB | txt | 0.1s |
Apache error | 30001 | 4.23MB | txt | 4.1s |
Aruba wireless | 380752 | 62.6MB | txt | 63.2s |
Nginx access | 2231408 | 482MB | txt | 420.1s |
Nginx error | 33026 | 13.5MB | txt | 10.1s |
Exchange | 648492 | 357MB | txt | 301.2s |
Juniper firewall log | 33034 | 12.4MB | txt | 23.5s |
VPN log | 18581 | 2.64MB | txt | 2.5s |
3 Construction of the REST service
The log parsing method is exposed as a REST service, so users can call it as a library through the REST API.
To put log parsing into practice, this embodiment provides a log parsing service based on a REST API, which is convenient for users.
The architecture is shown in fig. 14. The service is written in Python 3; with the Tornado framework as the basic REST framework, log parsing and classification are integrated into the service as a library, and a REST API is provided. The interface design is shown in Table 4:
TABLE 4
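As an illustrative sketch of such an interface (the /parse route, the request field "log", the parse_log placeholder, and the port are hypothetical assumptions, not the patent's Table 4 design):

```python
import json
import tornado.ioloop
import tornado.web

def parse_log(line: str) -> dict:
    # Placeholder: in the real service this would run preprocessing,
    # Viterbi decoding, and field extraction on the log line.
    return {"raw": line.strip()}

class ParseHandler(tornado.web.RequestHandler):
    def post(self):
        # Expect a JSON body such as {"log": "<raw log line>"}.
        body = json.loads(self.request.body)
        self.write({"fields": parse_log(body.get("log", ""))})

def make_app():
    return tornado.web.Application([(r"/parse", ParseHandler)])

if __name__ == "__main__":
    make_app().listen(8888)  # port chosen arbitrarily for the sketch
    tornado.ioloop.IOLoop.current().start()
```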
In summary, conventional manual parsing technology requires writing a large number of regular expressions for the different types of logs, whereas the invention uses natural language processing and data mining techniques, needs no hand-written regular expressions, and saves labor and time. Moreover, a hand-written regular expression must be rewritten whenever the log structure changes, while the method used by the invention does not need to be retrained.
Claims (5)
1. A method for automatically parsing a log, comprising the steps of:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log;
the step S2 specifically includes the following steps:
s21, marking the structure of the sample log and then the log data according to the effective information of the sample log data, so as to establish a log database;
s22, constructing a hidden Markov model according to the noted log structure information in the log database to serve as a log analysis model;
the sample log data in step S21 includes eight types of log data: apache access, apache error, aruba wireless, nginx access, nginx error, exchange, juniper firewall logs, and VPNs;
therefore, when labeling the structure of the sample log in step S21, labeling is specifically performed for host, date, http method, http_code, uri, log level, and meaningless characters;
in the step S21, when labeling the log data of the sample log, the B, M, E, S, O identifiers are specifically used to label each character, obtaining a label in one-to-one correspondence with each character in the log structure, where S represents a single character, B, M, and E represent the beginning, middle, and end of a character string, respectively, and O represents a character that is not part of the log structure.
2. The method according to claim 1, wherein the annotated log structure information in step S22 includes a log structure string and a corresponding label string, wherein each character in the log structure string is a distinct observation, and each label in the label string is a distinct state.
3. The method for automatically parsing a log according to claim 2, wherein the specific process of constructing the hidden markov model in step S22 is as follows:
s221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
s222, counting the probabilities of states generating observations in the log database to obtain an observation probability matrix;
s223, counting initial state probabilities in a log database to obtain initial probability distribution;
s224, constructing a hidden Markov model through training a state transition matrix, an observation probability matrix and an initial probability distribution.
4. A method for automatically parsing a log according to claim 3, wherein the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
5. The method according to claim 1, wherein the preprocessing in step S3 specifically refers to clearing invalid characters from the target log, including garbled (mojibake) characters, carriage returns, and spaces.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010132165.XA CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010132165.XA CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111367964A CN111367964A (en) | 2020-07-03 |
CN111367964B true CN111367964B (en) | 2023-11-17 |
Family
ID=71206461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010132165.XA Active CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111367964B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912570A (en) * | 2016-03-29 | 2016-08-31 | 北京工业大学 | English resume key field extraction method based on hidden Markov model |
CN107070852A (en) * | 2016-12-07 | 2017-08-18 | 东软集团股份有限公司 | Network attack detecting method and device |
CN107273269A (en) * | 2017-06-12 | 2017-10-20 | 北京奇虎科技有限公司 | Daily record analysis method and device |
CN109947891A (en) * | 2017-11-07 | 2019-06-28 | 北京国双科技有限公司 | Document analysis method and device |
CN108021552A (en) * | 2017-11-09 | 2018-05-11 | 国网浙江省电力公司电力科学研究院 | A kind of power system operation ticket method for extracting content and system |
CN108881194A (en) * | 2018-06-07 | 2018-11-23 | 郑州信大先进技术研究院 | Enterprises user anomaly detection method and device |
CN109388803A (en) * | 2018-10-12 | 2019-02-26 | 北京搜狐新动力信息技术有限公司 | Chinese word cutting method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111367964A (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |