CN111367964B - Method for automatically analyzing log - Google Patents
Method for automatically analyzing log
- Publication number
- CN111367964B (application CN202010132165.XA)
- Authority
- CN
- China
- Prior art keywords
- log
- state
- analysis
- data
- probability
- Prior art date: 2020-02-29
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/547—Remote procedure calls [RPC]; Web services
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a method for automatically parsing logs, comprising the following steps: S1, acquiring sample log data; S2, establishing a log database and a log parsing model from the sample log data; S3, acquiring target log data and preprocessing it; S4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, obtaining the parse structure of the target log by solving for the maximum-probability path; S5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log. Compared with the prior art, the invention solves the inefficiency of parsing logs with manually written regular expressions by constructing a hidden Markov log parsing model combined with the Viterbi algorithm, and can quickly and accurately identify the internal structure of a log and extract its effective information automatically.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continuing development of computer technology, computer systems are becoming more complex. For IT operations and maintenance, raw logs cannot directly provide effective information; the fields in a raw log must be parsed before the effective information can be extracted. The traditional log parsing method is to write a corresponding regular expression by hand. This approach is viable when there are few log types and the log structure rarely changes. However, as more and more functions are integrated into systems, a large number of IT subsystems generate a large volume of log data of many types. Designing regular matching rules for every type of log is extremely time-consuming and labor-intensive. How to parse text logs quickly and accurately has therefore become a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for automatically parsing logs which, based on natural language processing techniques, lets a computer automatically identify the internal structure of a text log and thus extract effective information from the log quickly and accurately.
The aim of the invention can be achieved by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
Further, the step S2 specifically includes the following steps:
s21, marking the structure of the sample log according to the effective information of the sample log data so as to establish a log database;
s22, constructing a hidden Markov model according to the noted log structure information in the log database to serve as a log analysis model.
Further, the sample log data in the step S21 includes eight types of log data: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall, and VPN logs.
Further, when the structure of the sample log is annotated in step S21, the structure of the log is labeled with the B, M, E, S, O identifiers to obtain a label in one-to-one correspondence with each character in the log structure, where S represents a single character, B, M, and E represent the beginning, middle, and end of a character string, respectively, and O represents a character that is not part of the log structure.
Further, the annotated log structure information in step S22 includes a log structure string and a corresponding label string, where each character in the log structure string is a distinct observation and each label in the label string is a distinct state.
Further, the specific process of constructing the hidden markov model in step S22 is as follows:
s221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
s222, counting the probabilities of states generating observations in the log database to obtain an observation probability matrix;
s223, counting initial state probabilities in a log database to obtain initial probability distribution;
s224, constructing a hidden Markov model through training a state transition matrix, an observation probability matrix and an initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
Further, the preprocessing in step S3 specifically means cleaning invalid characters from the target log, including garbled (mojibake) characters, carriage returns, and leading and trailing spaces.
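For illustration, a minimal Python sketch of this cleaning step (not part of the patent; keeping only printable ASCII is an assumption for illustration, and logs containing legitimate non-ASCII text would need a gentler filter):

```python
import re

def clean_log(line: str) -> str:
    # Drop non-printable bytes (mojibake fragments, carriage returns,
    # line feeds), then trim leading and trailing spaces.
    return re.sub(r"[^\x20-\x7E]", "", line).strip()

print(clean_log("  GET /index.html 200\r\n"))  # -> "GET /index.html 200"
```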
Compared with the prior art, the invention builds the log parsing model on a hidden Markov model, so different types of log data can be parsed automatically without manually writing regular expressions or retraining the model. This increases parsing speed and greatly reduces the labor and time spent on log parsing. In addition, the invention combines the hidden Markov model with the Viterbi algorithm to compute the maximum-probability path, which ensures the accuracy of log parsing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a diagram illustrating an application process of a log parsing model according to an embodiment;
FIG. 4 is an Apache access data sample in an embodiment;
FIG. 5 is an Apache error log data sample of an embodiment;
FIG. 6 is a data sample of Aruba wireless in an embodiment;
FIG. 7 is a data sample of an Nginx access in an embodiment;
FIG. 8 is a sample of Nginx error data in an example;
FIG. 9 is an Exchange data sample in an embodiment;
FIG. 10 is a Juniper firewall log sample of an embodiment;
FIG. 11 is a VPN sample in the embodiment;
FIG. 12 is a diagram of a log structure annotation in an embodiment;
FIG. 13 is a schematic diagram of a calculation process of a maximum probability path;
fig. 14 is a schematic diagram of a usage flow of REST API service in an embodiment.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
As shown in fig. 1, a method for automatically parsing a log includes the following steps:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
The embodiment applies the above method to automatically parse text logs and builds an application service based on a REST (Representational State Transfer) API, as shown in figs. 2-3:
1. preparation work
Before logs can be parsed, log data needs to be collected; this includes establishing the log database and determining the log entity labels.
1.1 Log library establishment
Various types of log data are collected and the effective information of the log data is marked.
1.2 Log resolution model construction
A model is built from the collected log data and the annotated log structure information. The invention uses a hidden Markov model as the parsing model, so three parameters of the hidden Markov model must be computed: the initial probability distribution, the state transition probability matrix, and the observation probability matrix.
Specifically, the hidden Markov model (HMM) is a probabilistic graphical model. An HMM describes the transitions of a system's hidden states and the probabilities of the observations those states produce. Its power lies in estimating the most likely hidden-variable sequence from a given sequence of observed variables, and in making predictions about future observations.
Take speech recognition: given a piece of audio, the task is to recognize the text. The audio is the observed variable and the text is the hidden variable. Pronunciation varies slightly with context, but the approximate pronunciation is statistically regular. Likewise, when we speak a sentence, there are transition regularities between successive words.
In terms of model representation:
the HMM comprises three parameters, namely an initial probability distribution, a state transition probability matrix and an observation probability matrix.
Let Q be the set of all possible states and V be the set of all possible observations.
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M}
where N is the number of possible states and M is the number of possible observations.
I is a state sequence of length T, and O is the corresponding observation sequence.
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
π is the initial state probability vector:
π = (π_i)^T
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_{N×N}
where
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transitioning to state q_j at time t+1 given state q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_{N×M}
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k at time t given state q_j.
HMM can mainly address three problems:
probability calculation problem. Given a model λ= (a, B, pi) and an observation sequence= (o) 1 ,o 2 ,...,o T ) The probability P (o|λ) of occurrence of the observation sequence O under the model λ is calculated.
Learning problems. Known observation sequence o= (O) 1 ,o 2 ,...,o T ) The model λ= (a, B, pi) parameter is estimated, which is the observed sequence probability P (o|λ) under this model. I.e. estimating the parameters using maximum likelihood estimation.
The prediction problem is also called decoding (decoding) problem. The known model λ= (a, B, pi) and the observed sequence o= (O) 1 ,o 2 ,...,o T ) Solving a state sequence I= (I) with the maximum conditional probability P (I|O) for a given observation sequence 1 ,i 2 ,...,i T ). I.e. given an observation sequence, the most likely corresponding sequence state is found.
When annotating logs, the structures of the logs are marked first, preliminarily determining the log structures of the different log types. This embodiment selects eight relatively typical log types for annotation: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall logs, and VPN; their data samples are shown in figs. 4 to 11, respectively.
The log content is then marked. The internal structure types in the log are shown in Table 1:
TABLE 1
Log structure type |
---|
host |
date |
http method |
http_code |
uri |
log level |
meaningless |
The log data is then labeled. For sequence labeling problems, identifiers such as B, M, E, S, O are typically used: S denotes a single character; B, M, and E denote the beginning, middle, and end of a character string, respectively; and O denotes a character that is not part of the log structure. The log internal structure types above are combined with the B, M, E, S labels. For the log: 192.168.3.1 - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171, the corresponding labels are shown in FIG. 12.
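For illustration, a small Python helper (hypothetical, not from the patent) that generates the per-character B/M/E/S labels for one labeled field:

```python
def bmeso_labels(field: str, tag: str) -> list:
    """Per-character labels for one log field: a single character gets
    tag-s; longer fields get tag-b, tag-m, ..., tag-m, tag-e."""
    if len(field) == 1:
        return [f"{tag}-s"]
    return [f"{tag}-b"] + [f"{tag}-m"] * (len(field) - 2) + [f"{tag}-e"]

# The host field of the example log:
print(bmeso_labels("192.168.3.1", "host"))
# ['host-b', 'host-m', ..., 'host-m', 'host-e'] (11 labels, one per character)
```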
The collected sample log data are labeled one by one so that the hidden Markov model can be constructed. For the hidden Markov model, three important variables must first be estimated by counting: π, A, and B. For a log, its string is the observation sequence and each character is an observation; the labels of the characters are the hidden variables, i.e., the states of the hidden Markov model. These parameter calculations are shown below.
The state transition matrix is an N×N matrix, where N is the number of states, i.e., the number of log labels. It is computed by counting: a_ij = C(q_i → q_j) / C(q_i), where C(q_i → q_j) is the number of times state q_i is immediately followed by state q_j in the labeled data and C(q_i) is the total number of occurrences of q_i.
The observation probability matrix is an N×M matrix, where N is the number of log labels and M is the number of character types. It is likewise computed by counting: b_j(k) = C(q_j emits v_k) / C(q_j).
The initial state probability π_i is computed as the frequency, over the S sample logs, with which the initial state is q_i.
For example: for three pieces of log data, i.e., three observation sequences.
“127.0.0.1get 200”.
“192.168.10.1post 404”
“127.0.0.1get 403”
The corresponding log structure sequences are shown below; the log structure type is given in brackets, and \s represents a space character.
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]"
"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]"
"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]"
Then we can get the state set { host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e }, ten states total.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, ., g, e, t, p, o, s, \s}, seventeen observations in total, where \s represents a space and "." is the dot character in the IP addresses.
First, we count adjacent state pairs, i.e., the frequency with which each state follows the previous one. For example, consider the state transition probability p(host-e | host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total, so the transition probability from host-m to host-e is 3/24 = 0.125. In this way we obtain the 10×10 state transition matrix A.
Second, we count the observation probability matrix. For example, consider the observation probability p(3 | http-code-e) of state "http-code-e" emitting the character "3": the character "3" is labeled "http-code-e" once, and the state "http-code-e" occurs 3 times in total, so the probability of state "http-code-e" emitting "3" is 1/3. In this way we obtain the 10×17 observation probability matrix B.
Finally, we count the initial state probability π. There are 3 sequences in total; "host-b" is the initial state all 3 times and no other state ever is, so the initial probability of "host-b" is 1.0 and that of every other state is 0.
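The counts above can be reproduced directly from the three annotated sequences. A Python sketch (the parser for the char[label] notation is an illustrative assumption, not the patent's code):

```python
import re
from collections import Counter

# The three annotated sequences from the example ("\s" stands for a space).
annotated = [
    r"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]",
    r"1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]",
    r"1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]",
]

trans, emit, state_cnt, init = Counter(), Counter(), Counter(), Counter()
for seq in annotated:
    pairs = re.findall(r"(\\s|.)\[([^\]]+)\]", seq)  # [(char, label), ...]
    init[pairs[0][1]] += 1
    for ch, st in pairs:
        emit[(st, ch)] += 1
        state_cnt[st] += 1
    for (_, s1), (_, s2) in zip(pairs, pairs[1:]):
        trans[(s1, s2)] += 1

# host-m never ends a sequence, so its total count is a valid denominator.
print(trans[("host-m", "host-e")] / state_cnt["host-m"])      # 3/24 = 0.125
print(emit[("http-code-e", "3")] / state_cnt["http-code-e"])  # 1/3
print(init["host-b"] / sum(init.values()))                    # 1.0
```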
2 log structure parsing
The step of log structure parsing will be described by taking certain log data as an example.
The first step: preprocess the input log, removing carriage returns and spaces at both ends of the log and garbled characters inside it.
The second step: using the trained initial probability distribution, state transition probability matrix, and observation probability matrix, parse the structure with the Viterbi algorithm and select the structure with the maximum probability.
The third step: output the parse structure of the log, extract the effective information in it, and mark the corresponding positions.
Specifically, for a newly input log, the log data is first preprocessed; preprocessing mainly removes invalid characters such as garbled bytes. Then, based on the three hidden Markov parameters, the Viterbi algorithm is used to find the optimal parse. The Viterbi algorithm is a dynamic-programming method that solves for the maximum-probability path; here, a path corresponds to a log parse structure.
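A minimal NumPy sketch of this Viterbi decoding step under the notation defined above (a generic implementation for illustration, not the patent's own code):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the maximum-probability state path and its probability.

    pi  : (N,) initial state probabilities
    A   : (N, N) transition matrix, A[i, j] = P(next state j | current state i)
    B   : (N, M) observation matrix, B[j, k] = P(observation k | state j)
    obs : sequence of observation indices of length T
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, i]: best probability of a path ending in state i at t
    psi = np.zeros((T, N), dtype=int)  # psi[t, i]: predecessor of state i on that best path
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for i in range(N):
            scores = delta[t - 1] * A[:, i]
            psi[t, i] = int(np.argmax(scores))
            delta[t, i] = scores[psi[t, i]] * B[i, obs[t]]
    # Backtrack from the most probable final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

Run on a log string with the π, A, and B estimated above, the returned path is the label sequence from which the log fields can be cut out.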
For example, suppose we have obtained a state transition matrix A and an observation probability matrix B, and the initial probability distribution is:
π = (0.3, 0.2, 0.5)^T.
The state set is {"a", "b", "c"} and the observation set is {"m", "n"}; we compute the optimal parse structure for the observation sequence ("m", "n", "m").
First, initialize: at t = 1 (the first position), for each state i, i = 1, 2, 3, compute the probability of starting in state i and observing o_1 = "m"; denote it δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
and set ψ_1(i) = 0, i = 1, 2, 3.
At t = 2, for each state i, i = 1, 2, 3, find the maximum probability over all paths that are in state j at t = 1 and in state i at t = 2 while observing character o_2 = "n"; denote it δ_2(i):
δ_2(i) = max_j [δ_1(j) a_ji] b_i(o_2), i = 1, 2, 3
Meanwhile, for each state i, i = 1, 2, 3, record the previous state j on the maximum-probability path:
ψ_2(i) = arg max_j [δ_1(j) a_ji]
Calculating:
ψ_2(1) = 3
δ_2(2) = 0.024, ψ_2(2) = 3
δ_2(3) = 0.048, ψ_2(3) = 3
also, at t=3,
δ 3 (1)=0.00756,ψ 3 (1)=1
δ 3 (1)=0.00864,ψ 3 (2)=3
δ 2 (1)=0.00768,ψ 3 (3)=3
Let P* denote the probability of the optimal path:
P* = max_i δ_3(i) = 0.00864
The end point of the optimal path is i*_3 = arg max_i δ_3(i) = 2.
Backtracking from the end point of the optimal path: at t = 2, i*_2 = ψ_3(i*_3) = ψ_3(2) = 3; at t = 1, i*_1 = ψ_2(i*_2) = ψ_2(3) = 3.
The optimal path, i.e., the optimal state sequence, is therefore I* = (i*_1, i*_2, i*_3) = (3, 3, 2), i.e., ("c", "c", "b"). Fig. 13 illustrates the process of calculating the maximum-probability path.
After the model construction is completed, this embodiment evaluates the log parsing model. The parsing model should find as many of the log entities as possible, and the entities it finds should be as accurate as possible; that is, both recall and precision should be high. To ensure that recall and precision are balanced, the model is evaluated with the F1-measure.
Precision = correct_extract / extract_entity, Recall = correct_extract / data_entity, and F1 = 2 × Precision × Recall / (Precision + Recall), where correct_extract is the number of correctly extracted log entities, extract_entity is the total number of extracted log entities, and data_entity is the number of log entities in the data. For example, consider a log with the following format:
"Jan 12 17:47:48 127.0.0.1xxx, information, download, 175.42.41.4'
The correct parse structure is "Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4". Suppose the model's parse is "Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4". Then the evaluation quantities are as follows.
correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4"}
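Computing the metrics on this example (a quick Python check):

```python
correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4"}

precision = len(correct_extract) / len(extract_entity)  # 3/6 = 0.5
recall = len(correct_extract) / len(data_entity)        # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)      # 0.6
print(precision, recall, f1)
```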
To evaluate the parsing results, this embodiment splits each log data set, using 60% of the data as a training set and 40% as a test set. The test set is predicted with the model trained on the training set, and the results are evaluated. Table 2 shows the parsing results for each log model.
TABLE 2
To verify the model in scenarios with large data volumes, this embodiment also tests the parsing speed on each log type; the test results are shown in Table 3:
TABLE 3
Log category | Number of log entries | File size | File format | Parsing time |
---|---|---|---|---|
Apache Access | 523 | 51KB | txt | 0.1s |
Apache error | 30001 | 4.23MB | txt | 4.1s |
Aruba wireless | 380752 | 62.6MB | txt | 63.2s |
Nginx access | 2231408 | 482MB | txt | 420.1s |
Nginx error | 33026 | 13.5MB | txt | 10.1s |
Exchange | 648492 | 357MB | txt | 301.2s |
Juniper firewall log | 33034 | 12.4MB | txt | 23.5s |
VPN log | 18581 | 2.64MB | txt | 2.5s |
3 Construction of the REST service
The log parsing method is exposed as a REST service, so users can call it as a library through the REST API.
To put log parsing into practice, this embodiment provides a log parsing service based on a REST API, which is convenient for users.
The architecture is shown in fig. 14. The service is written in Python 3; with the Tornado framework as the basic REST framework, log parsing and classification are integrated into the service as a library, and a REST API is provided. The interface design is shown in Table 4:
TABLE 4
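As an illustrative sketch of such an interface (the /parse route, the request field "log", the parse_log placeholder, and the port are hypothetical assumptions, not the patent's Table 4 design):

```python
import json
import tornado.ioloop
import tornado.web

def parse_log(line: str) -> dict:
    # Placeholder: in the real service this would run preprocessing,
    # Viterbi decoding, and field extraction on the log line.
    return {"raw": line.strip()}

class ParseHandler(tornado.web.RequestHandler):
    def post(self):
        # Expect a JSON body such as {"log": "<raw log line>"}.
        body = json.loads(self.request.body)
        self.write({"fields": parse_log(body.get("log", ""))})

def make_app():
    return tornado.web.Application([(r"/parse", ParseHandler)])

if __name__ == "__main__":
    make_app().listen(8888)  # port chosen arbitrarily for the sketch
    tornado.ioloop.IOLoop.current().start()
```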
In summary, conventional manual parsing technology requires writing a large number of regular expressions for the different types of logs, whereas the invention uses natural language processing and data mining techniques, needs no hand-written regular expressions, and saves labor and time. Moreover, a hand-written regular expression must be rewritten whenever the log structure changes, while the method used by the invention does not need to be retrained.
Claims (5)
1. A method for automatically parsing a log, comprising the steps of:
s1, acquiring sample log data;
s2, respectively establishing a log database and a log analysis model according to the sample log data;
s3, acquiring target log data and preprocessing the target log data;
s4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parse structure of the target log by solving for the maximum-probability path;
s5, extracting effective information from the parse structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log;
the step S2 specifically includes the following steps:
s21, marking the structure of the sample log and then the log data according to the effective information of the sample log data, so as to establish a log database;
s22, constructing a hidden Markov model according to the noted log structure information in the log database to serve as a log analysis model;
the sample log data in step S21 includes eight types of log data: apache access, apache error, aruba wireless, nginx access, nginx error, exchange, juniper firewall logs, and VPNs;
therefore, when labeling the structure of the sample log in step S21, labeling is specifically performed for host, date, http method, http_code, uri, log level, and meaningless characters;
in the step S21, when labeling the log data of the sample log, the B, M, E, S, O identifiers are specifically used to label each character, obtaining a label in one-to-one correspondence with each character in the log structure, where S represents a single character, B, M, and E represent the beginning, middle, and end of a character string, respectively, and O represents a character that is not part of the log structure.
2. The method according to claim 1, wherein the annotated log structure information in step S22 includes a log structure string and a corresponding label string, wherein each character in the log structure string is a distinct observation, and each label in the label string is a distinct state.
3. The method for automatically parsing a log according to claim 2, wherein the specific process of constructing the hidden markov model in step S22 is as follows:
s221, counting the transition probabilities between adjacent states in the log database to obtain a state transition matrix;
s222, counting the probabilities of states generating observations in the log database to obtain an observation probability matrix;
s223, counting initial state probabilities in a log database to obtain initial probability distribution;
s224, constructing a hidden Markov model through training a state transition matrix, an observation probability matrix and an initial probability distribution.
4. A method for automatically parsing a log according to claim 3, wherein the state transition matrix is specifically:
A = [a_ij]_{N×N}
a_ij = P(i_{t+1} = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_{N×M}
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k at time t given state q_j.
5. The method according to claim 1, wherein the preprocessing in step S3 specifically refers to clearing invalid characters from the target log, including garbled (mojibake) characters, carriage returns, and spaces.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010132165.XA CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010132165.XA CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111367964A CN111367964A (en) | 2020-07-03 |
CN111367964B true CN111367964B (en) | 2023-11-17 |
Family
ID=71206461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010132165.XA Active CN111367964B (en) | 2020-02-29 | 2020-02-29 | Method for automatically analyzing log |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111367964B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105912570A (en) * | 2016-03-29 | 2016-08-31 | 北京工业大学 | English resume key field extraction method based on hidden Markov model |
CN107070852A (en) * | 2016-12-07 | 2017-08-18 | 东软集团股份有限公司 | Network attack detecting method and device |
CN107273269A (en) * | 2017-06-12 | 2017-10-20 | 北京奇虎科技有限公司 | Daily record analysis method and device |
CN109947891A (en) * | 2017-11-07 | 2019-06-28 | 北京国双科技有限公司 | Document analysis method and device |
CN108021552A (en) * | 2017-11-09 | 2018-05-11 | 国网浙江省电力公司电力科学研究院 | A kind of power system operation ticket method for extracting content and system |
CN108881194A (en) * | 2018-06-07 | 2018-11-23 | 郑州信大先进技术研究院 | Enterprises user anomaly detection method and device |
CN109388803A (en) * | 2018-10-12 | 2019-02-26 | 北京搜狐新动力信息技术有限公司 | Chinese word cutting method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111367964A (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |