CN111367964B - Method for automatically analyzing log - Google Patents

Method for automatically analyzing log

Info

Publication number
CN111367964B
Authority
CN
China
Prior art keywords
log
state
analysis
data
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010132165.XA
Other languages
Chinese (zh)
Other versions
CN111367964A (en)
Inventor
李宁宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Information Technology Co Ltd
Original Assignee
Shanghai Eisoo Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eisoo Information Technology Co Ltd filed Critical Shanghai Eisoo Information Technology Co Ltd
Priority to CN202010132165.XA priority Critical patent/CN111367964B/en
Publication of CN111367964A publication Critical patent/CN111367964A/en
Application granted granted Critical
Publication of CN111367964B publication Critical patent/CN111367964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F9/547 Remote procedure calls [RPC]; Web services
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for automatically parsing logs, comprising the following steps: S1, acquiring sample log data; S2, respectively establishing a log database and a log parsing model according to the sample log data; S3, acquiring target log data and preprocessing it; S4, parsing the preprocessed target log data with a Viterbi algorithm based on the log parsing model, and obtaining the parsed structure of the target log by solving for the maximum-probability path; S5, extracting effective information from the parsed structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log. Compared with the prior art, the invention solves the inefficiency of manually formulating regular expressions to parse logs by constructing a hidden Markov log parsing model combined with the Viterbi algorithm, and can quickly and accurately identify the internal structure of a log automatically and extract its effective information.

Description

Method for automatically analyzing log
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method for automatically analyzing logs.
Background
With the continued development of computer technology, computer systems have become increasingly complex. For IT operation and maintenance, raw logs cannot directly provide effective information; the fields in a raw log must first be parsed before the effective information can be extracted. The traditional log parsing approach is to manually formulate corresponding regular-expression rules. This approach is viable when there are few log types and the log structure rarely changes. However, as more and more functions are integrated into a system, a large number of IT subsystems arise, and with them a large volume of log data of many types. For these logs, designing regular matching rules for each type is extremely time-consuming and labor-intensive. Therefore, how to parse text logs quickly and accurately has become a problem to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method for automatically parsing logs which, based on natural language processing technology, lets a computer automatically identify the internal structure of a text log so as to extract effective information from the log quickly and accurately.
The aim of the invention can be achieved by the following technical scheme: a method of automatically parsing a log, comprising the steps of:
S1, acquiring sample log data;
S2, respectively establishing a log database and a log parsing model according to the sample log data;
S3, acquiring target log data and preprocessing the target log data;
S4, parsing the preprocessed target log data by adopting a Viterbi algorithm based on the log parsing model, and obtaining the parsed structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the parsed structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
Further, the step S2 specifically comprises the following steps:
S21, labeling the structure of the sample logs according to the effective information of the sample log data, so as to establish the log database;
S22, constructing a hidden Markov model from the labeled log structure information in the log database to serve as the log parsing model.
Further, the sample log data in step S21 comprises eight types of log data: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall and VPN logs.
Further, when the structure of the sample logs is labeled in step S21, the identifiers B, M, E, S, O are specifically used to obtain a label corresponding one-to-one to each character in the log structure, where S denotes a single character, B, M and E denote the beginning, middle and end of a character string respectively, and O denotes a character that does not belong to the log structure.
Further, the labeled log structure information in step S22 comprises a log structure string and the corresponding tag string, where each character in the log structure string is a distinct observation and each label in the tag string is a distinct state.
Further, the specific process of constructing the hidden Markov model in step S22 is as follows:
S221, counting the transition probabilities between adjacent states in the log database to obtain the state transition matrix;
S222, counting the probabilities of states generating observations in the log database to obtain the observation probability matrix;
S223, counting the initial state probabilities in the log database to obtain the initial probability distribution;
S224, constructing the hidden Markov model from the trained state transition matrix, observation probability matrix and initial probability distribution.
Further, the state transition matrix is specifically:
A = [a_ij]_(N×N)
a_ij = P(i_(t+1) = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
The observation probability matrix is specifically:
B = [b_j(k)]_(N×M)
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
The initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
where Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k given state q_j at time t.
Further, the preprocessing in step S3 specifically means cleaning invalid characters from the target log, including garbled characters, carriage returns and spaces.
Compared with the prior art, the invention builds the log parsing model on a hidden Markov model, so that when different types of log data are processed the logs can be parsed automatically without manually formulating regular expressions or retraining the model, which improves the parsing speed and greatly saves the labor and time spent on log parsing; in addition, the invention combines the hidden Markov model with the Viterbi algorithm to compute the maximum-probability path, which ensures the accuracy of log parsing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention;
FIG. 2 is a process diagram of log parsing model construction in an embodiment;
FIG. 3 is a diagram illustrating an application process of a log parsing model according to an embodiment;
FIG. 4 is an Apache access data sample in an embodiment;
FIG. 5 is an Apache error log data sample of an embodiment;
FIG. 6 is a data sample of Aruba wireless in an embodiment;
FIG. 7 is a data sample of an Nginx access in an embodiment;
FIG. 8 is a sample of Nginx error data in an example;
FIG. 9 is an Exchange data sample in an embodiment;
FIG. 10 is a Juniper firewall log sample of an embodiment;
FIG. 11 is a VPN sample in the embodiment;
FIG. 12 is a diagram of a log structure annotation in an embodiment;
FIG. 13 is a schematic diagram of a calculation process of a maximum probability path;
FIG. 14 is a schematic diagram of the usage flow of the REST API service in an embodiment.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples.
Examples
As shown in fig. 1, a method for automatically parsing a log comprises the following steps:
S1, acquiring sample log data;
S2, respectively establishing a log database and a log parsing model according to the sample log data;
S3, acquiring target log data and preprocessing the target log data;
S4, parsing the preprocessed target log data by adopting a Viterbi algorithm based on the log parsing model, and obtaining the parsed structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the parsed structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log.
This embodiment applies the above method to automatically parse text logs and builds an application service based on a REST (Representational State Transfer) API, as shown in figs. 2-3:
1. Preparation work
Before logs can be parsed, log data must be collected; this includes establishing the log database and determining the log entity tags.
1.1 Log library establishment
Various types of log data are collected and the effective information of the log data is marked.
1.2 Log resolution model construction
A model is constructed from the collected log data and the annotated log structure information. The invention uses a hidden Markov model as the parsing model, so three parameters of the hidden Markov model must be calculated: the initial probability distribution, the state transition probability matrix and the observation probability matrix.
Specifically, the hidden Markov model (Hidden Markov Model, HMM) is a probabilistic graphical model. An HMM is mainly used to describe the probabilities of transitions among a system's hidden states and of the observations those states emit. The capability of an HMM is to estimate, from a given sequence of observed variables, the corresponding sequence of hidden variables, and to make predictions about future observed variables.
Take speech recognition: given a piece of audio data, the text it contains must be recognized. The audio data is the observed variable and the text is the hidden variable. The sound varies slightly across different contexts, but the approximate pronunciation is statistically regular. On the other hand, when we speak a sentence, there are also transfer rules between the words.
In terms of model representation:
the HMM comprises three parameters, namely an initial probability distribution, a state transition probability matrix and an observation probability matrix.
Let Q be the set of all possible states and V be the set of all possible observations.
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M}
where N is the number of possible states and M is the number of possible observations.
I is a state sequence of length T, and O is the corresponding observation sequence.
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
π is the initial state probability vector:
π = (π_i)
where
π_i = P(i_1 = q_i), i = 1, 2, ..., N
is the probability of being in state q_i at time t = 1.
A is the state transition probability matrix:
A = [a_ij]_(N×N)
where
a_ij = P(i_(t+1) = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
is the probability of transitioning to state q_j at time t+1 given that the state is q_i at time t.
B is the observation probability matrix:
B = [b_j(k)]_(N×M)
where
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
is the probability of generating observation v_k given that the state is q_j at time t.
An HMM mainly addresses three problems:
Probability calculation. Given a model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), compute the probability P(O|λ) of the observation sequence O occurring under the model λ.
Learning. Given an observation sequence O = (o_1, o_2, ..., o_T), estimate the parameters of the model λ = (A, B, π) so as to maximize the probability P(O|λ) of the observation sequence under the model, i.e., estimate the parameters by maximum likelihood.
Prediction, also called decoding. Given the model λ = (A, B, π) and an observation sequence O = (o_1, o_2, ..., o_T), solve for the state sequence I = (i_1, i_2, ..., i_T) that maximizes the conditional probability P(I|O); that is, given an observation sequence, find the most likely corresponding state sequence.
When labeling logs, the structures of the logs are labeled first, and the log structures of the different log types are preliminarily determined. In this embodiment, eight relatively typical log types are selected for labeling, namely Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall log and VPN; their data samples are shown in figs. 4 to 11, respectively.
The log content is then marked. The internal structure in the log is shown in table 1:
TABLE 1
The log content is then labeled. For sequence labeling problems, identifiers such as B, M, E, S, O are typically used: S denotes a single character; B, M and E denote the beginning, middle and end of a character string, respectively; and O denotes a character that does not belong to the log structure. The log internal structure types above are labeled in combination with B, M, E, S. For the log: "192.168.3.1 - - [08/Aug/2017:00:31:26 +0800] "GET /qx/xts/images/x_gkbg.jpg HTTP/1.1" 200 1171", the corresponding labels are shown in fig. 12.
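As an illustration, such BMES/O tags can be generated mechanically from annotated field spans. The following Python sketch (the helper name and the hand-picked spans are illustrative assumptions, not the patent's code) shows the idea on a simplified fragment:

```python
def bmes_tags(line, spans):
    """Tag every character of `line` with <type>-b/m/e/s, or o-s for
    characters outside any annotated field span.
    `spans` is a list of (field_type, start, end), end exclusive."""
    tags = ["o-s"] * len(line)
    for field, start, end in spans:
        if end - start == 1:                 # single-character field
            tags[start] = f"{field}-s"
        else:
            tags[start] = f"{field}-b"       # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"{field}-m"       # middle
            tags[end - 1] = f"{field}-e"     # end
    return list(zip(line, tags))

# Example on a simplified fragment from the sample data below
line = "127.0.0.1 get 200"
spans = [("host", 0, 9), ("http-method", 10, 13), ("http-code", 14, 17)]
for ch, tag in bmes_tags(line, spans):
    print(repr(ch), tag)
```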
The collected sample log data are labeled one by one, and a hidden Markov model is constructed from them. For the hidden Markov model, three key variables must first be counted: π, A and B. For a log, its string is the observation sequence and each character is an observation; the characters' labels are the hidden variables, i.e., the states in the hidden Markov model. These parameters are calculated as follows.
The state transition matrix is an M×M matrix, where M is the number of states, i.e., the number of log labels. Each entry is calculated as the number of times one state is immediately followed by another, divided by the total number of occurrences of the first state as a predecessor.
The observation probability matrix is an M×N matrix, where M is the number of log labels and N is the number of character types. Each entry is calculated as the number of times a state is observed as a given character, divided by the total number of occurrences of that state.
The initial state probability π_i is calculated as the frequency, over all S logs, with which a log's initial state is q_i.
For example, consider three pieces of log data, i.e., three observation sequences:
"127.0.0.1 get 200"
"192.168.10.1 post 404"
"127.0.0.1 get 403"
The corresponding log structure sequences are shown below; the log structure type is given in brackets, and \s denotes a space character.
“1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]2[http-code-b]0[http-code-m]0[http-code-e]”
“1[host-b]9[host-m]2[host-m].[host-m]1[host-m]6[host-m]8[host-m].[host-m]1[host-m]0[host-m].[host-m]1[host-e]\s[o-s]p[http-method-b]o[http-method-m]s[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]4[http-code-e]”
“1[host-b]2[host-m]7[host-m].[host-m]0[host-m].[host-m]0[host-m].[host-m]1[host-e]\s[o-s]g[http-method-b]e[http-method-m]t[http-method-e]\s[o-s]4[http-code-b]0[http-code-m]3[http-code-e]”
Then we can obtain the state set {host-b, host-m, host-e, o-s, http-method-b, http-method-m, http-method-e, http-code-b, http-code-m, http-code-e}, ten states in total.
The observation set is {1, 2, 7, 0, 9, 6, 8, 4, 3, g, e, t, p, o, s, ., \s}, seventeen observations in total, where \s denotes a space and "." the dot character.
First, we count adjacent state pairs, i.e., the probability of moving from the previous state to the next state. For example, to compute the state transition probability p(host-e|host-m) from "host-m" to "host-e": host-m is immediately followed by host-e 3 times, while host-m occurs 24 times in total, so the transition probability from host-m to host-e is 3/24 = 0.125. In this way we obtain the 10×10 state transition matrix A.
Second, we count the observation probability matrix. To compute the observation probability p(3|http-code-e) of state "http-code-e" emitting the character "3": the observed character "3" is labeled with state "http-code-e" 1 time, while the state "http-code-e" occurs 3 times in total, so the observation probability from state "http-code-e" to the character "3" is 1/3 ≈ 0.33. In this way we obtain the 10×17 observation probability matrix B.
Finally, we count the initial state probabilities π. There are 3 sequences in total; "host-b" appears as the initial state 3 times while no other state appears initially, so the initial probability of "host-b" is 1.0 and that of every other state is 0.
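This counting procedure can be sketched in a few lines of Python. The function below is an illustrative implementation, not the patent's code; it assumes each training example is a list of (character, tag) pairs like the three sequences above:

```python
from collections import Counter, defaultdict

def train_hmm(tagged_sequences):
    """Estimate (pi, A, B) by frequency counting over labeled sequences.
    Each sequence is a list of (observation_char, state_tag) pairs,
    e.g. [("1", "host-b"), ("2", "host-m"), ...]."""
    init = Counter()              # counts of initial states
    trans = defaultdict(Counter)  # state -> Counter of successor states
    emit = defaultdict(Counter)   # state -> Counter of emitted characters

    for seq in tagged_sequences:
        init[seq[0][1]] += 1
        for (_, state), (_, nxt) in zip(seq, seq[1:]):
            trans[state][nxt] += 1
        for char, state in seq:
            emit[state][char] += 1

    pi = {s: c / len(tagged_sequences) for s, c in init.items()}
    A = {s: {t: c / sum(succ.values()) for t, c in succ.items()}
         for s, succ in trans.items()}
    B = {s: {ch: c / sum(chars.values()) for ch, c in chars.items()}
         for s, chars in emit.items()}
    return pi, A, B
```

Applied to the three tagged sequences above, this yields p(host-e|host-m) = 3/24 = 0.125 and p(3|http-code-e) = 1/3, matching the hand computation.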
2 Log structure parsing
The steps of log structure parsing are described below, taking a piece of log data as an example.
The first step: preprocess the input log, cleaning the carriage returns and spaces at both ends of the log and any garbled characters inside it.
The second step: using the trained initial probability distribution, state transition probability matrix and observation probability matrix, parse the structure with the Viterbi algorithm and select the structure with the maximum probability.
The third step: output the parsed structure of the log, extract the effective information in it, and mark the corresponding positions.
Specifically, for a newly input log, the log data is first preprocessed; the preprocessing mainly removes invalid characters such as garbled bytes. Then, based on the three parameters of the hidden Markov model, the Viterbi algorithm is used to find the optimal parse. The Viterbi algorithm is a dynamic-programming method for solving the maximum-probability path; here, a path corresponds to a log parsing structure.
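The first and third steps might look as follows in Python. The exact cleaning rules and function names are assumptions for illustration, since the patent does not spell them out:

```python
import re

def preprocess(raw_log):
    """Strip leading/trailing whitespace and carriage returns, and drop
    non-printable bytes inside the line (assumed cleaning rule)."""
    line = raw_log.strip("\r\n ")
    return re.sub(r"[^\x20-\x7e]", "", line)  # keep printable ASCII only

def extract_entities(line, tags):
    """Group a BMES/O tag path back into (field_type, value) entities."""
    entities, buf, field = [], [], None
    for ch, tag in zip(line, tags):
        kind, pos = tag.rsplit("-", 1)        # e.g. "host", "b"
        if pos == "s" and kind != "o":        # single-character entity
            entities.append((kind, ch))
        elif pos == "b":
            buf, field = [ch], kind
        elif pos in ("m", "e") and field == kind:
            buf.append(ch)
            if pos == "e":
                entities.append((field, "".join(buf)))
                buf, field = [], None
    return entities

line = "127.0.0.1 get 200"
tags = ["host-b"] + ["host-m"] * 7 + ["host-e", "o-s",
        "http-method-b", "http-method-m", "http-method-e", "o-s",
        "http-code-b", "http-code-m", "http-code-e"]
print(extract_entities(line, tags))
# [('host', '127.0.0.1'), ('http-method', 'get'), ('http-code', '200')]
```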
For example, suppose training yields a state transition matrix A and an observation probability matrix B, with the initial probability distribution:
π = (0.3, 0.2, 0.5)^T
The state set is {"a", "b", "c"} and the observation set is {"m", "n"}; we compute the optimal parse structure for the observation sequence ("m", "n", "m").
First, the parameters are initialized. For t = 1, i.e., at the first position, for each state i, i = 1, 2, 3, compute the probability of being in state i and observing o_1, the character "m"; denote this probability δ_1(i). Then
δ_1(i) = π_i b_i(o_1) = π_i b_i(m), i = 1, 2, 3
Substituting the actual data:
δ_1(1) = 0.3 × 0.3 = 0.09
δ_1(2) = 0.2 × 0.6 = 0.12
δ_1(3) = 0.5 × 0.4 = 0.20
and we record ψ_1(i) = 0, i = 1, 2, 3.
For t = 2 and each state i, i = 1, 2, 3, find the maximum probability over paths that are in state j at t = 1 and in state i at t = 2 observing the character o_2 = "n"; denote this maximum δ_2(i):
δ_2(i) = max_j [δ_1(j) a_ji] b_i(o_2), i = 1, 2, 3
Meanwhile, for each state i, i = 1, 2, 3, record the predecessor state j on the maximum-probability path:
ψ_2(i) = arg max_j [δ_1(j) a_ji], i = 1, 2, 3
The calculation gives:
δ_2(1) = 0.042, ψ_2(1) = 3
δ_2(2) = 0.024, ψ_2(2) = 3
δ_2(3) = 0.048, ψ_2(3) = 3
also, at t=3,
δ 3 (1)=0.00756,ψ 3 (1)=1
δ 3 (1)=0.00864,ψ 3 (2)=3
δ 2 (1)=0.00768,ψ 3 (3)=3
Let P* denote the probability of the optimal path:
P* = max_i δ_3(i) = δ_3(2) = 0.00864
The end point of the optimal path is i*_3 = arg max_i δ_3(i) = 2.
From the end point of the optimal path, backtracking gives:
at t = 2, i*_2 = ψ_3(i*_3) = ψ_3(2) = 3;
at t = 1, i*_1 = ψ_2(i*_2) = ψ_2(3) = 3.
The optimal path, i.e., the optimal state sequence, is therefore I* = (i*_1, i*_2, i*_3) = (3, 3, 2), i.e., ("c", "c", "b"). Fig. 13 illustrates the process of computing the maximum-probability path.
After the model construction is completed, this embodiment evaluates the log parsing model. The parsing model is required to find as many of the log entities as possible and to make the found log entities as accurate as possible, i.e., both recall and precision should be high. To ensure that recall and precision are balanced, the model also needs to be evaluated with the F1-measure.
Here precision = |correct_extract| / |extract_entity|, recall = |correct_extract| / |data_entity| and F1 = 2 × precision × recall / (precision + recall), where correct_extract denotes the set of correctly extracted log entities, extract_entity denotes the set of all extracted log entities, and data_entity denotes the set of log entities actually present in the data. For example, consider a log of the following format:
"Jan 12 17:47:48 127.0.0.1 xxx, information, download, 175.42.41.4"
The correct parse structure is "Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4". If the model's parse is "Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4", then the evaluation results are as follows:
correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1", "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1", "information", "175.42.41.4"}
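With set-based matching as in this example, the three measures can be checked with a few lines of Python (an illustrative sketch; the variable names follow the definitions above):

```python
def evaluate(correct_extract, extract_entity, data_entity):
    """Set-based precision/recall/F1 for the parsing model."""
    precision = len(correct_extract) / len(extract_entity)
    recall = len(correct_extract) / len(data_entity)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

correct_extract = {"127.0.0.1", "information", "175.42.41.4"}
extract_entity = {"Jan", "12", "17:47:48", "127.0.0.1",
                  "information", "175.42.41.4"}
data_entity = {"Jan 12 17:47:48", "127.0.0.1",
               "information", "175.42.41.4"}
print(evaluate(correct_extract, extract_entity, data_entity))
# (0.5, 0.75, 0.6)
```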
To evaluate the log parsing results, this embodiment splits each log data set, using 60% of the data as the training set and 40% as the test set. The model trained on the training set is used to predict on the test set, and the results are evaluated. Table 2 shows the parsing results for each log model.
TABLE 2
To verify the applicability of the model to logs with large data volumes, this embodiment also tests the parsing speed on each type of log; the results are shown in Table 3:
TABLE 3
Log category Number of log entries File size File format Parsing time
Apache Access 523 51KB txt 0.1s
Apache error 30001 4.23MB txt 4.1s
Aruba wireless 380752 62.6MB txt 63.2s
Nginx access 2231408 482MB txt 420.1s
Nginx error 33026 13.5MB txt 10.1s
Exchange 648492 357MB txt 301.2s
Juniper firewall log 33034 12.4MB txt 23.5s
VPN log 18581 2.64MB txt 2.5s
3 REST service construction
The log parsing method is wrapped as a REST service, so that users can invoke it as a library through the REST API.
To put log parsing into practice, this embodiment provides a log parsing service based on a REST API, which is convenient for users to use.
The architecture is shown in fig. 14. The service is written in Python 3; log parsing and classification are integrated into the service as a library, the Tornado framework is used as the basic framework of the REST service, and a REST API is exposed. The interface design is shown in Table 4:
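Since the interface definition of Table 4 is not reproduced in this text, the endpoint path and payload shape below are assumptions; a minimal Tornado handler wrapping the parser might look like this:

```python
import json

import tornado.ioloop
import tornado.web

def parse_log(raw_log):
    # Placeholder for the preprocess -> Viterbi -> extract pipeline
    # sketched earlier; a real implementation returns (field, value) pairs.
    return [("raw", raw_log.strip())]

class ParseHandler(tornado.web.RequestHandler):
    """POST a raw log line, receive the extracted entities as JSON.
    Endpoint path and payload shape are illustrative assumptions."""
    def post(self):
        body = json.loads(self.request.body)
        entities = parse_log(body["log"])
        self.write({"entities": entities})

def make_app():
    return tornado.web.Application([(r"/api/v1/parse", ParseHandler)])

if __name__ == "__main__":
    make_app().listen(8888)
    tornado.ioloop.IOLoop.current().start()
```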
TABLE 4
In summary, the conventional manual parsing technique needs a large number of regular expressions formulated for the different types of logs, whereas the invention uses natural language processing and data mining techniques and needs no manually formulated regular expressions, saving manpower and time; moreover, a manually formulated regular expression must be reformulated whenever the log structure changes, while the method used by the invention does not need to be retrained.

Claims (5)

1. A method for automatically parsing a log, comprising the following steps:
S1, acquiring sample log data;
S2, respectively establishing a log database and a log parsing model according to the sample log data;
S3, acquiring target log data and preprocessing the target log data;
S4, parsing the preprocessed target log data by adopting a Viterbi algorithm based on the log parsing model, and obtaining the parsed structure of the target log by solving for the maximum-probability path;
S5, extracting effective information from the parsed structure of the target log and marking the corresponding positions, thereby completing the parsing of the target log;
the step S2 specifically comprises the following steps:
S21, labeling the structure of the sample logs and the log data in sequence according to the effective information of the sample log data, so as to establish the log database;
S22, constructing a hidden Markov model from the labeled log structure information in the log database to serve as the log parsing model;
the sample log data in step S21 comprises eight types of log data: Apache access, Apache error, Aruba wireless, Nginx access, Nginx error, Exchange, Juniper firewall and VPN logs;
accordingly, when labeling the structure of the sample logs in step S21, the labels specifically used are host, date, http method, http_code, uri, log level and meaningless;
when labeling the log data of the sample logs in step S21, the identifiers B, M, E, S, O are specifically used for sequence labeling to obtain a label corresponding one-to-one to each character in the log structure, wherein S denotes a single character, B, M and E denote the beginning, middle and end of a character string respectively, and O denotes a character that does not belong to the log structure.
2. The method for automatically parsing a log according to claim 1, wherein the labeled log structure information in step S22 comprises a log structure string and the corresponding tag string, wherein each character in the log structure string is a distinct observation and each label in the tag string is a distinct state.
3. The method for automatically parsing a log according to claim 2, wherein the specific process of constructing the hidden Markov model in step S22 is as follows:
s221, counting the transition probabilities of adjacent front and back states in a log database to obtain a state transition matrix;
s222, counting transition probability from states to observed quantities in a log database to obtain an observation probability matrix;
s223, counting initial state probabilities in a log database to obtain initial probability distribution;
s224, constructing a hidden Markov model through training a state transition matrix, an observation probability matrix and an initial probability distribution.
4. The method for automatically parsing a log according to claim 3, wherein the state transition matrix is specifically:
A = [a_ij]_(N×N)
a_ij = P(i_(t+1) = q_j | i_t = q_i), i = 1, 2, ..., N; j = 1, 2, ..., N
the observation probability matrix is specifically:
B = [b_j(k)]_(N×M)
b_j(k) = P(o_t = v_k | i_t = q_j), k = 1, 2, ..., M; j = 1, 2, ..., N
and the initial probability distribution is specifically:
π = (π_i)^T
π_i = P(i_1 = q_i), i = 1, 2, ..., N
Q = {q_1, q_2, ..., q_N}, V = {v_1, v_2, ..., v_M},
I = {i_1, i_2, ..., i_T}, O = {o_1, o_2, ..., o_T}
wherein Q is the set of states, V is the set of observations, N is the number of states, M is the number of observations, I is a state sequence of length T, O is the observation sequence corresponding to I, π is the initial probability distribution, π_i is the probability of being in state q_i at time t = 1, A is the state transition probability matrix, a_ij is the probability of transitioning to state q_j at time t+1 given state q_i at time t, B is the observation probability matrix, and b_j(k) is the probability of generating observation v_k given state q_j at time t.
5. The method for automatically parsing a log according to claim 1, wherein the preprocessing in step S3 specifically refers to cleaning invalid characters from the target log, including garbled characters, carriage returns and spaces.
CN202010132165.XA 2020-02-29 2020-02-29 Method for automatically analyzing log Active CN111367964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010132165.XA CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Publications (2)

Publication Number Publication Date
CN111367964A CN111367964A (en) 2020-07-03
CN111367964B (en) 2023-11-17

Family

ID=71206461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010132165.XA Active CN111367964B (en) 2020-02-29 2020-02-29 Method for automatically analyzing log

Country Status (1)

Country Link
CN (1) CN111367964B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105912570A (en) * 2016-03-29 2016-08-31 北京工业大学 English resume key field extraction method based on hidden Markov model
CN107070852A (en) * 2016-12-07 2017-08-18 东软集团股份有限公司 Network attack detecting method and device
CN107273269A (en) * 2017-06-12 2017-10-20 北京奇虎科技有限公司 Daily record analysis method and device
CN108021552A (en) * 2017-11-09 2018-05-11 国网浙江省电力公司电力科学研究院 A kind of power system operation ticket method for extracting content and system
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device
CN109388803A (en) * 2018-10-12 2019-02-26 北京搜狐新动力信息技术有限公司 Chinese word cutting method and system
CN109947891A (en) * 2017-11-07 2019-06-28 北京国双科技有限公司 Document analysis method and device

Also Published As

Publication number Publication date
CN111367964A (en) 2020-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant