WO2019060326A1 - Parsing system event logs while streaming - Google Patents
Parsing system event logs while streaming Download PDFInfo
- Publication number
- WO2019060326A1 WO2019060326A1 PCT/US2018/051599 US2018051599W WO2019060326A1 WO 2019060326 A1 WO2019060326 A1 WO 2019060326A1 US 2018051599 W US2018051599 W US 2018051599W WO 2019060326 A1 WO2019060326 A1 WO 2019060326A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- tokens
- message
- log entry
- log
- stored
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
Definitions
- Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture to improve how system event logs are parsed by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
- an unstructured log entry is extracted from an incoming stream of data and then parsed into a sequence of tokens. These tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry.
- a comparison is subsequently performed between message tokens classified as belonging to the log entry's structured message type and stored-message tokens that define message types of at least some, but potentially not all, previously stored token sequence objects.
- a result of this comparison indicates whether a relationship between the message tokens and a particular set of the stored- message tokens satisfies a relationship criteria.
- the particular set of the stored- message tokens are associated with a particular one of the previously stored token sequence objects. If the criteria is not satisfied, then a new token sequence object is created, where the object is based on the message tokens and any tokens classified as belonging to the log entry's parameter set.
- Figures 1A and IB illustrate an example method for operating a computing architecture that improves how system event logs are parsed by facilitating dynamic extraction of log patterns from incoming log entries and by classifying those log patterns in a streaming manner.
- Figure 2 illustrates an example of a set of unstructured log entries that are passed through a parsing engine to produce a set of structured token sequences.
- Figure 3 illustrates an example process flow for comparing log entries to sequence objects.
- Figure 4 illustrates an example pre-filtering technique that may be used to optimize an analysis of the log entries.
- Figure 5 illustrates another example pre-filtering technique that may be used to optimize the analysis.
- Figure 6 illustrates an example computer system capable of performing any of the disclosed operations.
- Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture to improve how system event logs are parsed by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
- an unstructured log entry is parsed into a sequence of tokens.
- the tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry.
- a comparison is performed between (1) message tokens classified as belonging to the log entry's structured message type and (2) stored-message tokens that define message types of previously stored token sequence objects.
- a result of this comparison indicates whether a relationship between the message tokens and a particular set of the stored-message tokens satisfies a relationship criteria. If the criteria is not satisfied, then a new token sequence object is created based on the message tokens and any tokens classified as belonging to the log entry's parameter set. If the criteria is satisfied, then certain data is added to one of the previously stored token sequence objects.
- the disclosed embodiments significantly improve the technical field and also improve how a computer system parses data (thereby also improving its own operations).
- the disclosed embodiments provide a structured streaming parser for event logs using an LCS (longest common subsequence) based approach. This technique parses unstructured log messages into structured message types and parameters in an online streaming fashion.
- LCS longest common subsequence
- the disclosed embodiments By streaming the data, real-time message types and parameters produced by the disclosed embodiments also enable the generation of a concise, intuitive summary for the end users. Furthermore, the disclosed embodiments produce log data that is clean and structured. This structured data can be easily processed and analyzed further down the pipeline using advanced data analytics by downstream analysts (e.g., human operators or autonomous computer systems). In this regard, the disclosed embodiments are able to significantly improve not only the preparation of data, but also how that data is subsequently analyzed as a result of its carefully structured/prepared format.
- the disclosed embodiments are designed as a general-purpose streaming log parsing engine.
- the disclosed embodiments are system, type, and even format agnostic. Consequently, the disclosed embodiments provide significant improvements over the art through their scalability and flexibility while also improving the operations of the computer system.
- system event logs are a universal resource that exists practically in every system.
- the disclosed embodiments use system event logs to perform a free-text audit trace generated by the system's execution (e.g., which is often stored in the /var/log folder).
- a log message or a log record/entry refers to at least one line in the log file, which can be produced by a log printing statement in a user program's source code or a kernel program's code (both of which may be executed within a computing architecture that includes one or more computer systems).
- One function/object of the disclosed embodiments is to parse each log entry "e” into a collection of tokens that describe a "message type” (also referred to herein as a "structured message type”) and a collection of tokens describing parameter values (or simply "parameter") that were delineated in the log entry.
- a message type also referred to herein as a "structured message type”
- a collection of tokens describing parameter values or simply “parameter”
- Figures 1 A and IB refer to a number of method acts that may be performed. Although the method acts may be discussed in a certain order or illustrated in a flowchart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed. This method is used to introduce the disclosed embodiments at a high level. Subsequent portions of this disclosure will delve more fully into specifics and features of the various different embodiments.
- Figure 1A illustrates a flowchart of an example method 100 for operating a computing architecture that is designed to improve how system event logs are parsed. This is achieved by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
- method 100 includes an act 105 of extracting an unstructured new log entry from an incoming stream of data.
- the incoming stream of data may be received in any manner such as, for example, from an application executing on the computer system (e.g., the system's operating system "OS"), from an outside source over a communication channel (e.g., via a TCP or UDP port), or via any other communication interface.
- OS operating system
- a communication channel e.g., via a TCP or UDP port
- act 110 the process of parsing will be described in more detail later.
- Each of the tokens in the token sequence is then classified (act 115). Specifically, the tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry. In some embodiments, parsing the log entry and/or classifying each token is performed in realtime (i.e. in a dynamic streaming manner) as the incoming stream of data is being received.
- each message type (e.g., "e") has a one- to-one mapping with a log printing statement in the source code producing the raw log entry "e.” For example, consider the following printing statement:
- This printing statement may produce several log entries such as, for example:
- the parameter values are "41C and " 3C" respectively while the message type is "Temperature * exceeds warning threshold ''.
- a comparison is subsequently performed between "message tokens" that have been classified as belonging to the log entry's structured message type and "stored-message tokens" that define structured message types of at least some (but potentially not all) of previously stored token sequence objects.
- These previously stored token sequence objects may be stored in a repository and may include any number of objects. It will be appreciated that the repository may be local to the computer system or it may be remote. Following the above step, an analysis of this comparison is performed in act 125.
- Figure IB further elaborates on this analysis. For instance, upon determining that the relationship criteria is not satisfied, a new token sequence object is created (act 130). This new token sequence is based on the message tokens and any tokens classified as belonging in the log entry's parameter set. Alternatively, upon determining that the relationship criteria is satisfied, then certain information is added to the particular one token sequence object (act 135). Specifically, a new line identification is added to a line identification list included as a part of the particular one token sequence object. In this manner, the disclosed embodiments provide an online streaming technique to parse data in a quick and efficient manner, as described earlier.
- log entries are produced by "print” statements in a system program's source code.
- a log entry can be viewed as a collection of ⁇ "message type", "parameter value” ⁇ pairs.
- the log printing statement "printfi("File %d finished. ", id)" contains a constant message type "File finished' and a variable parameter value which is the value loaded into the "/ ' if' field.
- one functionality of a structured log parser is to identify the message type "File * finished', where * is a placeholder dummy value for variables (i.e. parameter values).
- Figure 2 shows a set of unstructured log entries 200, which includes timestamp data and message data.
- each individual log entry in the set is "unstructured” text data.
- unstructured it is generally meant that the data is not formatted in a particular manner, nor are the individual data items grouped or otherwise identified as being distinct from one another.
- the data that was previously unstructured is now formatted and arranged in a structural manner, as shown by the structured data 210.
- the table shown by the structured data 210 is sometimes referred to as a "structured message type summary" and can be provided to a user in a graphical user interface.
- This table can include other data, such as metadata further describing the listed information.
- each line item in the table constitutes a "token sequence object.”
- the token sequence objects may include metadata information describing attributes of those object's corresponding structured message types and/or parameters.
- One example of the metadata includes the frequency by which each of the structured message types was observed within the input data stream.
- metadata includes a position indicator (e.g., the "*" placeholder value) indicating a location within each corresponding token sequence where a particular parameter is located, as will be described in further detail later.
- the structured message type summary lists (1) one or more identified message types, (2) a frequency of observation for each of those identified message types (the frequency is a form of metadata), and (3) a corresponding parameter that was previously associated with each of the identified message types (e.g., before their token sequences were parsed). This information (i.e. the structured message type summary) can then be displayed on the user interface.
- a structured log parser which can be performed by the parsing engine 205 in Figure 2, is defined as follows:
- log ⁇ ei; ... that contains "m” distinct message types produced by "m” different log printing statements from “p” different programs, where the values "m” and “p” (and the printing statements and the source code of these programs) are unknown
- a structured log parser is to parse "/og” and produce all message types from those "m” log printing statements.
- each log printing statement contains a single, unique message type that may have multiple parameters.
- the structured log parser is the first and foremost step for automatic and smart log mining and data-driven log analytics solutions, and it also a useful step for managing logs in a log management system ("LMS").
- LMS log management system
- Some of the disclosed embodiments are able to achieve high degrees of efficiency and scalability because the disclosed "streaming structured log parser" makes only one pass over the log (and hence the unstructured new log entry in the incoming stream of data) and processes each log entry in an online, continuous, and streaming (i.e. real-time) fashion.
- the first process is often to prepare the data by turning unstructured logs into structured data.
- Most previous structured log parsing methods focus on offline batched processing or matching new log entries with previously offline-extracted message types or regular expressions (e.g., from source code). Hence, they cannot be used as an online streaming structured log parser.
- some of these conventional tools provide an interface to parse logs upon their arrival, their parsers are often based on regular expressions defined by end users. Consequently, the system itself can only parse very simple and basic structured schema such as timestamp and hostname, while log messages are continued to be treated as unstructured text values.
- the disclosed embodiments operate using a streaming structured log parser for system event logs.
- a longest common subsequence (LCS) algorithm is used to facilitate these operations.
- LCS is more fully described below.
- performing the previously described "comparison" between the different tokens may include determining the LCS between (1) the message tokens classified as belonging to the log entry's structured message type and (2) the stored-message tokens defining the structured message types of at least some of the previously stored token sequence objects.
- ⁇ is a universe of alphabets (e.g., a-z, 0-9, etc.).
- a ⁇ ai; a.2; ar, ... a m f such that a, c ⁇ for 1 ⁇ i ⁇ m, a subsequence of a is defined as ⁇ a x i ; a X 2 ; a X 3 ; ... ; a x kj, where Vx x, c Z + , and 1 ⁇ xi ⁇ X2 ⁇ ... ⁇ Xk ⁇ m.
- ? (bi; b2; br, ...
- b n ⁇ " be another sequence such that b ⁇ e ⁇ for j e fl, n] .
- a subsequence ⁇ is called a common subsequence of a and ⁇ if and only if it is a subsequence of each.
- the longest common subsequence (LCS) problem for input sequences a and ⁇ is to find longest such ⁇ . For instance, sequence ⁇ 1; 3; 5; 7; 9 ⁇ and sequence ⁇ 1; 5; 7; 10 ⁇ yields an LCS of ⁇ 1; 5; 7 ⁇ .
- an LCS-based method is provided to efficiently and effectively extract and generate structured message types from raw system logs.
- One feature of the disclosed embodiments is that if the output by a log printing statement (which is a log entry) is viewed as a sequence, in most log printing statements, the constant that represents a "message type" often forms a relatively larger portion of the sequence, and the parameter values form a relatively smaller portion of the sequence (i.e. more text is included in the message type than in the parameter set). If two log entries are produced by the same log printing statement (e.g., designated "staf), but only differ by having different parameter values, the LCS of the two sequences is very likely to be the constant in the code "stat," which thereby implies a message type.
- the merit of using the LCS formulation to parse system event logs, as compared with the previously mentioned clustering and iterative partitioning methods, is that a particular message type can be derived even with very few instances of log entries produced by its log statement.
- the other benefit of using the LCS approach is that, instead of relying on an offline batched approach, it is possible to parse logs using a streaming approach, which is described below.
- each word is referred to as a token.
- a log entry "e” can be parsed to a set of tokens using system defined (or user input) delimiters according to the format of the log. Common delimiters (e.g., spaces, equal signs, colons, semicolons, etc.) are typically sufficient to cover most cases. It will be appreciated, however, that other delimiters may be used.
- LCS LCS
- Each log entry is assigned a unique line id.
- the first arriving log entry can be assigned a line id that was initialized to an initial base value (e.g., 0).
- the line ids of subsequently received log entries can be auto-incremented upon arrival of those new log entries, thereby defining a list of entries related to one another by their incremented line ids.
- LCSObject creates an object data structure called "LCSObject” to hold currently parsed LCS sequences and their related metadata information (e.g., each line item in structured data 210 from Figure 2 constitutes an “LCSObject” (aka a “token sequence object”)).
- LCSseq use "LCSseq” to denote a sequence that is the LCS of multiple log messages (also referred to herein as an LCS sequence). The LCS of the multiple log messages constitutes a candidate for the message type of those log entries.
- each LCSObject contains an LCSseq, a list of line indices (a form of metadata) called linelds that stores the line ids for the corresponding log entries that lead to this LCSseq, and a list of parameter positions (another form of metadata) called paramPos that indicates where in the LCSseq are the placeholders for parameter values (i.e. the positions of "*") with respect to each log entry indexed by linelds.
- the "*" can be considered metadata information describing attributes of the previously stored token sequence objects.
- these embodiments store one, some, or all currently parsed LCSObjects in a list/repository called LCSMap.
- the LCS algorithm runs in a streaming fashion, as shown in Figure 3.
- the LCSMap 300 list is empty.
- a new log entry 305 e.g., e,
- it is firstly parsed into a token sequence s, using a pre-defined set of delimiters (e.g., spaces, equal signs, colons, etc.).
- s is compared with the LCSseq from all LCSObject in the current LCSMap (i.e. LCSMap 100), to see if s, "matches" one of the existing LCSseq.
- line id "z" is added to the linelds list of the corresponding LCSObject.
- a new LCSObject is created and insert it into LCSMap 100.
- LSCMap 300 is updated to include the parsed information, thereby resulting in an updated LSCMap 310.
- another new log entry 315 is received, parsed, and compared. After parsing new log entry 315, it is determined that new log entry 315 includes the same structured message types as log entry 305. As such, a match is identified, and the metadata in LSCMap 310 is updated to thereby produce updated LSCMap 320, which includes new information in the "Uneids" and "paramPos" fields. Subsequently, another new log entry 325 is received.
- log entry 325 After being parsed, it is determined that log entry 325 is unique, so a new LCSObject is created, thereby producing updated LCSMap 330. The process continues with new log entry 330 and can continue until no new log entries are received. In this manner, the disclosed embodiments provide a highly scalable and flexible solution to parsing log entries and providing detailed information about those log entries, even when only a limited number of log entries are available.
- a new LCS will be generated. For instance, given a new log sequence "s" produced by the tokenization of a new log entry "e,” some embodiments conduct a search through LCSMap. For the 1 th LCSObject, suppose its LCSseq is "q ". It is possible to compute the value which is the length of the LCS(q s). While searching through the LCSMap, the largest value as well as the index to the corresponding LCSObject are retained.
- the LCS of "3 ⁇ 4, ⁇ ” and “s” is the maximum LCS among all LCSObjects in the LCSMap, and the length of LCS(3 ⁇ 4 , s) is at least half the length of "s”; hence, unless the total length of parameter values in "e” is more than half of its size, which is very unlikely in practice, the length of LCS(3 ⁇ 4 , s) is a good indicator whether the log entries in the th LCSObject (which share the LCSseq "3 ⁇ 4, ⁇ ” ) share the same message type with "e” or not (which would be LCSto , *)).
- some embodiments choose the one with the smallest ⁇ q, ⁇ value, since it has a higher set similarity value with "s”. Then, some embodiments use backtracking to generate a new LCS sequence to represent the message type for all log entries in the th LCSObject and "e".
- some embodiments mark a placeholder indicator (e.g., "*") at the places where the two sequences disagree, as the place holders for parameters, and consecutive adjacent "*"s are merged into one "*". With reference to Figure 2, the "*" constitutes a placeholder.
- one, some, or all of the identified message types may be displayed with a placeholder indicator to indicate a position where the corresponding parameters are located with respect to text included in the identified message types. For instance, consider the following two sequences:
- Some embodiments are configured to further improve the above efficiencies and effectiveness, especially for cases when message types differ by a small margin or for logs having a large number of parameters.
- some embodiments are further configured to achieve nearly optimal time complexity for most incoming log entries (i.e., linear to ⁇ s ⁇ , the number of tokens in the token sequence "s" of a log entry "e”).
- a new log entry arrives, it is beneficial to compute the length of its LCS with each existing message type.
- DP dynamic programming
- m ' be the number of currently parsed message types in LCSMap.
- the disclosed efficiencies lead to a time complexity of 0(m ' ⁇ n 2 ) for each new log entry. Note that since the number of possible tokens in a complex system could be large, it is not particularly beneficial to apply techniques that compute LCS efficiently by assuming a limited set of alphabets (i.e., by assuming small
- each string is a set of tokens and it is possible to simply view each token as a character.
- Such an operation facilitates the process of pre-filtering.
- the previously stored token sequence objects which were discussed earlier and which are used during the comparison process, may actually be included in a larger group of previously stored token sequence objects stored in the repository.
- the disclosed embodiments are able to limit the larger group of previously stored token sequence objects so that only a subset of those objects are considered when the comparison is performed. Accordingly, the following discussion will now focus on a few different techniques that can be followed to perform the pre-filtering process described above.
- These techniques include, but are not limited to, performing pre-filtering by generating any one or combination of (1) a simple loop, (2) an inverted list built from the structured message types of at least some, and potentially all, of the previously stored token sequence objects, and/or (3) a prefix tree that is also built from those structured message types.
- pre-filtering techniques some of the embodiments are able to improve efficiency by restricting the number of discrete comparisons that are performed. That is, by pre-filtering some of the stored-message tokens included in an token sequence object (and/or the objects themselves), those filtered stored-message tokens are not considered, or rather are refrained from being considered, during the comparison process.
- One pre-filtering method is to simply loop through “strs” . Specifically, for each “str ", the embodiments maintain a pointer "/?, ⁇ " pointing to the head of "str and another pointer "p " pointing to the head of ⁇ . If the characters (i.e. tokens) pointed to by "pi” and “p " match, both pointers are advanced; otherwise, only pointer " ?/' is advanced. When “p " has gone to the end of ⁇ , check if " ? " has also reached the end of "strT . A pruning can be applied which is to skip "strT if its length is less than 3 ⁇ 4
- An inverted index 400 (i.e. " ') is built over a set of incoming "strs" 405, as shown in Figure 4. For each unique character in strs 405, the embodiments record its position (/ ' , pos in str,) in all "stris” that it appears in, sorted by "F . Given this index "7", the procedure to find the subsequence of ⁇ is as follows.
- Another pre-filtering technique is to use a prefix tree. To perform this technique, some embodiments index "sfris" in "strs" using a prefix tree, and prune away many candidates.
- ⁇ ABPC
- the disclosed embodiments are able to use combinations of the different pre- filtering techniques.
- the pre-filtering process may include the following steps. For each new log entry "e”, attempt to find its message type using a prefix tree. If not found, apply the inverted index lookup.
- the refinement takes place, or rather is triggered, when certain conditions occur. Some example conditions include, but are not limited to, when the number of processed log entries has reached a user-defined threshold, when the system in the computing architecture is idle, or when the system is determined to be underutilized (e.g., arrival rates of new log entries become low or reach a threshold volume). Another condition is right before the message type lookup procedure completes.
- the refinement may be achieved via two steps, namely, split and merge.
- the first parameter position has only 2 unique tokens, while the second parameter position has many (in the actual log file). So the first parameter position has a high probability of being a part of a message type.
- the split procedure will replace the first parameter ' *' in the LCSseq with each unique token ("boof and "wait” in this example) at the same position, leading to two new LCSObjects with two new LCSseqs as shown above.
- the disclosed embodiments are highly versatile and work well with formatted log messages (as well as other types of log messages), especially where the number of tokens in the message types is more than that in parameters, and parameters vary in different log messages.
- some embodiments use a threshold to distinguish whether the LCS of two sequences is a valid message type. That said, there could be situations where the number of parameters in one log entry is so many that this threshold based approach is not optimal to extract the message type properly.
- Fan speeds (3552 3534 3375 4354 3515 3479 ) Fan speeds (3552 3534 3375 4299 3515 3479 ) Fan speeds (3552 3552 3391 4245 3515 3497 ) Fan speeds (3534 3534 3375 4245 3497 3479 ) Fan speeds (3534 3534 3375 4066 3497 3479 )
- a merge procedure which first clusters current LCSObjects that might have the same message type together and then counts the number of distinct tokens at each position of LCSseqs from these LCSObjects. To do so, some embodiments partition LCSMap into a set of clusters, until each cluster satisfies the following conditions:
- Condition 1 For a subset of token positions, at each such position, there is only one unique token for all "LCSObject.LCSseq" at this position from this cluster.
- Condition 2 For every other token position, the number of unique tokens of all "LCSObject.LCSseq" from this cluster exceeds a merge threshold.
- the merge threshold is different for positions having numbers and positions having only strings.
- LCSMap i.e. the repository
- LCSMap size is at most the number of total message types (which is "m") that could be produced by the corresponding source code, which is a constant.
- the time complexity for each new log entry can be 0(m ⁇ n 2 ), since the naive dynamic programming method to compute LCS between a log entry and a message type is 0(n 2 ) for log entries of size 0(n), whereas the backtracking method is often cheaper and it is performed only with a target message type in LCSMap which has the longest LCS length with respect to the new log entry and the length exceeds a threshold, thus it is computed at most once for each new log message.
- pre-filtering step for each log entry, it is beneficial to first try to find its message type in prefix tree, then inverted index, and only for the small portion that are still not located, the LCSMap is compared against (e.g., via the simple loop technique). For "J" log records, suppose the number of log records that fail to find message types in pre-filtering step is "F and the number of log records that are returned in inverted index step is "7". The amortized cost for each log record is only
- the split threshold is a small constant, so the total size of an adjusted LCSMap is at most a small constant times of the previous LCSMap size, which is still 0(m).
- the split method itself takes only 0(m) cost.
- the LCSMap is first repeatedly partitioned until each cluster satisfies the conditions to be merged. Note, this is done over currently parsed message types. Hence, the partition cost is at most the number of currently parsed message types, 0(m), and the LCSMap size only reduces after a merge step.
- the split/merge heuristics are typically executed only very infrequently, thus their impacts to the amortized cost on each log entry can be mostly ignored. It is worthwhile to highlight that the LCS-based construction is not similar to a so-called "edit distance based approach.” A crucial difference between the two is that the LCS sequence of two log messages is naturally a message type, which makes streaming log parsing possible.
- the disclosed embodiments are able to dynamically parse log entry data in a highly efficient, scalable, and effective manner.
- the efficiency, scalability, and effectiveness can be improved even further.
- the disclosed embodiments provide a significant improvement over conventional systems, thereby improving how computer systems operate.
- Figure 6 illustrates an example computer system 600 that may be used to facilitate the operations described herein.
- computer system 600 may take various different forms.
- computer system 600 may be embodied as a tablet 600A or a desktop 600B.
- the ellipsis 600C demonstrates that computer system 600 may be embodied in any other form.
- computer system 600 may also be a distributed system that includes one or more connected computing components/devices that are in communication with computer system 600, a laptop computer, a mobile phone, a server, a data center, and/or any other computer system.
- computer system 600 includes various different components.
- Figure 6 shows that computer system 600 includes at least one processor 605 (aka a "hardware processing unit"), a parsing engine 610 (an example implementation of the parsing engine 205 from Figure 2), and storage 615.
- processor 605 and/or parsing engine 610 may be configured to perform any of the operations discussed herein. That is, the parsing engine 610 may be a dedicated, specialized, or even general processor.
- Storage 615 is shown as including executable code/instructions 620. When executed, the executable code/instructions 620 causes the computer system 600 to perform the disclosed operations.
- Storage 615 may be physical system memory, which may be volatile, nonvolatile, or some combination of the two.
- memory may also be used herein to refer to non-volatile mass storage such as physical storage media. If computer system 600 is distributed, the processing, memory, and/or storage capability may be distributed as well.
- executable module can refer to software objects, routines, or methods that may be executed on computer system 600.
- the different components, modules, engines, and services described herein may be implemented as objects or processors that execute on computer system 600 (e.g. as separate threads).
- the disclosed embodiments may comprise or utilize a special-purpose or general-purpose computer including computer hardware, such as, for example, one or more processors (such as processor 605) and system memory (such as storage 615), as discussed in greater detail below.
- Embodiments also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures.
- Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system.
- Computer-readable media that store computer-executable instructions in the form of data are physical computer storage media.
- Computer-readable media that carry computer-executable instructions are transmission media.
- the current embodiments can comprise at least two distinctly different kinds of computer- readable media: computer storage media and transmission media.
- Computer storage media are hardware storage devices, such as RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) that are based on RAM, Flash memory, phase-change memory (PCM), or other types of memory, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions, data, or data structures and that can be accessed by a general-purpose or special-purpose computer.
- RAM random access memory
- ROM read-only memory
- EEPROM electrically erasable programmable read-only memory
- CD-ROM Compact Disk Read Only Memory
- SSDs solid state drives
- PCM phase-change memory
- Computer system 600 may also be connected (via a wired or wireless connection) to external sensors (e.g., one or more remote cameras, accelerometers, gyroscopes, acoustic sensors, magnetometers, etc.) or computer systems. Further, computer system 600 may also be connected through one or more wired or wireless networks 625 to remote systems(s) that are configured to perform any of the processing described with regard to computer system 600. As such, computer system 600 is able to collect streamed log data from those other external devices as well.
- a graphics rendering engine may also be configured, with processor 605, to render one or more user interfaces for the user to view and interact with.
- a "network,” like the network 625 shown in Figure 6, is defined as one or more data links and/or data switches that enable the transport of electronic data between computer systems, modules, and/or other electronic devices.
- a network either hardwired, wireless, or a combination of hardwired and wireless
- Computer system 600 will include one or more communication channels that are used to communicate with the network 625.
- Transmissions media include a network that can be used to carry data or desired program code means in the form of computer-executable instructions or in the form of data structures. Further, these computer-executable instructions can be accessed by a general-purpose or special- purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
- program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (or vice versa).
- program code means in the form of computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a network interface card or "NIC") and then eventually transferred to computer system RAM and/or to less volatile computer storage media at a computer system.
- NIC network interface card
- Computer-executable (or computer-interpretable) instructions comprise, for example, instructions that cause a general-purpose computer, special-purpose computer, or special-purpose processing device to perform a certain function or group of functions.
- the computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
- embodiments may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like.
- the embodiments may also be practiced in distributed system environments where local and remote computer systems that are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network each perform tasks (e.g. cloud computing, cloud services and the like).
- program modules may be located in both local and remote memory storage devices.
- the functionality described herein can be performed, at least in part, by one or more hardware logic components (e.g., the processor 605).
- hardware logic components e.g., the processor 605
- FPGAs Field-Programmable Gate Arrays
- ASICs Program-Specific or Application-Specific Integrated Circuits
- ASSPs Program-Specific Standard Products
- SOCs System-On-A-Chip Systems
- CPLDs Complex Programmable Logic Devices
- CPUs Central Processing Units
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Debugging And Monitoring (AREA)
Abstract
Improvements to parsing system logs are disclosed. Initially, an unstructured log entry is parsed into a sequence of tokens. The tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry. A comparison is performed between message tokens classified as belonging to the log entry's structured message type and stored-message tokens that define message types of previously stored token sequence objects. A result of this comparison indicates whether a relationship between the message tokens and a particular set of stored-message tokens satisfies a relationship criteria. If the criteria is not satisfied, then a new token sequence object is created based on the message tokens and any tokens classified as belonging to the log entry's parameter set. If the criteria is satisfied, then certain data is added to one of the previously stored token sequence objects.
Description
PARSING SYSTEM EVENT LOGS WHILE STREAMING
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of and priority to United States Provisional Patent Application Serial No. 62/561, 115 filed on September 20, 2017 and entitled "SPELL: STREAMING PARSING OF SYSTEM LOGS," which application is expressly incorporated herein by reference in its entirety.
BACKGROUND
[0002] Computers have impacted nearly every aspect of modern-day living. For instance, computers are generally involved in work, recreation, healthcare, transportation, and so forth.
[0003] The increasing complexity of modern computer systems has become a significant limiting factor in deploying and managing them. In complex systems such as a large distributed cloud, it becomes increasingly challenging to quickly identify the root cause of a problem in the system. The ability to be alerted and to mitigate the problem right away has become a fundamental requirement in many systems. Data-driven methods based on machine learning and data mining techniques are often employed to understand complex system behaviors by exploring machine data for automatic pattern discovery and anomaly detection. System logs, which are essentially a universal data source that contains important information (e.g., usage patterns, execution paths, and program running status), are valuable assets in assisting these data-driven system analytics in order to gain insights that are useful to enhance system health, stability, and usability.
[0004] While machine learning and other data mining techniques have proven to be highly beneficial in resolving computing issues, significant problems have arisen with regard to preparing the data from these logs so that the data can be fed into those machine learning and mining systems. If the data is not prepared adequately, then those subsequent systems will not be able to operate properly, or at least they will not be able to operate as efficiently as they otherwise might. Therefore, with the increasing complexity of computing systems and data analysis engines (e.g., machine learning applications and other data mining tools), there has arisen a substantial need to better prepare the data that is obtained from the complex computing system and that will subsequently be fed into these data analysis engines.
[0005] The subject matter claimed herein is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one exemplary technology area where some embodiments described herein may be practiced.
BRIEF SUMMARY
[0006] Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture to improve how system event logs are parsed by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
[0007] In some embodiments, an unstructured log entry is extracted from an incoming stream of data and then parsed into a sequence of tokens. These tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry. A comparison is subsequently performed between message tokens classified as belonging to the log entry's structured message type and stored-message tokens that define message types of at least some, but potentially not all, previously stored token sequence objects. A result of this comparison indicates whether a relationship between the message tokens and a particular set of the stored- message tokens satisfies a relationship criteria. Notably, the particular set of the stored- message tokens are associated with a particular one of the previously stored token sequence objects. If the criteria is not satisfied, then a new token sequence object is created, where the object is based on the message tokens and any tokens classified as belonging to the log entry's parameter set.
[0008] In some embodiments, If the criteria is satisfied, then certain data is added to one of the previously stored token sequence objects. Specifically, a new line identification to a line identification list is added/included as a part of the particular one previously stored token sequence object. Additionally, some embodiments are configured to perform combinations of the above processes.
[0009] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0010] Additional features and advantages will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the
practice of the teachings herein. Features and advantages of the invention may be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. Features of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] In order to describe the manner in which the above-recited and other advantages and features can be obtained, a more particular description of the subject matter briefly described above will be rendered by reference to specific embodiments which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments and are not therefore to be considered to be limiting in scope, embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
[0012] Figures 1A and IB illustrate an example method for operating a computing architecture that improves how system event logs are parsed by facilitating dynamic extraction of log patterns from incoming log entries and by classifying those log patterns in a streaming manner.
[0013] Figure 2 illustrates an example of a set of unstructured log entries that are passed through a parsing engine to produce a set of structured token sequences.
[0014] Figure 3 illustrates an example process flow for comparing log entries to sequence objects.
[0015] Figure 4 illustrates an example pre-filtering technique that may be used to optimize an analysis of the log entries.
[0016] Figure 5 illustrates another example pre-filtering technique that may be used to optimize the analysis.
[0017] Figure 6 illustrates an example computer system capable of performing any of the disclosed operations.
DETAILED DESCRIPTION
[0018] Embodiments disclosed herein relate to computer systems, methods, and hardware storage devices that operate within a computing architecture to improve how system event logs are parsed by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
Initially, an unstructured log entry is parsed into a sequence of tokens. The tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry. A comparison is performed between (1) message tokens classified as belonging to the log entry's structured message type and (2) stored-message tokens that define message types of previously stored token sequence objects. A result of this comparison indicates whether a relationship between the message tokens and a particular set of the stored-message tokens satisfies a relationship criteria. If the criteria is not satisfied, then a new token sequence object is created based on the message tokens and any tokens classified as belonging to the log entry's parameter set. If the criteria is satisfied, then certain data is added to one of the previously stored token sequence objects.
Technical Improvements
[0019] In this manner, the disclosed embodiments significantly improve the technical field and also improve how a computer system parses data (thereby also improving its own operations). For example, the disclosed embodiments provide a structured streaming parser for event logs using an LCS (longest common subsequence) based approach. This technique parses unstructured log messages into structured message types and parameters in an online streaming fashion. By following the disclosed principles, the time complexity to process each log entry ("e") is close to linear (i.e. to the size of "e"), thereby significantly reducing the time required to prepare the log data and hence improving the operations of the computer system.
[0020] By streaming the data, real-time message types and parameters produced by the disclosed embodiments also enable the generation of a concise, intuitive summary for the end users. Furthermore, the disclosed embodiments produce log data that is clean and structured. This structured data can be easily processed and analyzed further down the pipeline using advanced data analytics by downstream analysts (e.g., human operators or autonomous computer systems). In this regard, the disclosed embodiments are able to significantly improve not only the preparation of data, but also how that data is subsequently analyzed as a result of its carefully structured/prepared format.
[0021] Conventional parsing systems have not kept up with the rapid improvements in machine learning and data mining engines, or even with the increasing complexity of computing systems. For example, in most traditional data analysis systems, the first and foremost step is to automatically parse the unstructured system logs to structured data using techniques such as, for example, use of regular expressions, leveraging source
code, or parsing purely based on system log characteristics (e.g., using data mining approaches such as clustering and iterative partitioning). Nevertheless, these techniques fail to work efficiently for general purpose system log parsing. Additionally, these traditional techniques fail to perform online parsing in a streaming fashion.
[0022] Furthermore, previous methods tuned for a specific type of system log may work terribly on a new format or type of system log. For example, some popular open source cloud infrastructures use logs that contain various formats not present in previous system logs (e.g., in a JSON format). These traditional techniques are not scalable; thus, the resulting systems are extremely inefficient in parsing large amounts of differing types/formats of log entries.
[0023] In contrast, the disclosed embodiments are designed as a general-purpose streaming log parsing engine. Hence, the disclosed embodiments are system, type, and even format agnostic. Consequently, the disclosed embodiments provide significant improvements over the art through their scalability and flexibility while also improving the operations of the computer system.
Example Method(s)
[0024] As an initial matter, it is noted that system event logs are a universal resource that exists practically in every system. The disclosed embodiments use system event logs to perform a free-text audit trace generated by the system's execution (e.g., which is often stored in the /var/log folder). In some implementations described herein, a log message or a log record/entry refers to at least one line in the log file, which can be produced by a log printing statement in a user program's source code or a kernel program's code (both of which may be executed within a computing architecture that includes one or more computer systems).
[0025] One function/object of the disclosed embodiments is to parse each log entry "e" into a collection of tokens that describe a "message type" (also referred to herein as a "structured message type") and a collection of tokens describing parameter values (or simply "parameter") that were delineated in the log entry. By creating these new data fields (i.e. the structured message type and the parameters) and by classifying/grouping the tokens into these data fields, the disclosed embodiments are able to translate (in a streaming manner) an unstructured log entry into a searchable and easily workable structured log entry.
[0026] With that understanding, attention will now be directed to Figures 1 A and IB which refer to a number of method acts that may be performed. Although the method
acts may be discussed in a certain order or illustrated in a flowchart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed. This method is used to introduce the disclosed embodiments at a high level. Subsequent portions of this disclosure will delve more fully into specifics and features of the various different embodiments.
[0027] Figure 1A illustrates a flowchart of an example method 100 for operating a computing architecture that is designed to improve how system event logs are parsed. This is achieved by facilitating the dynamic extraction of log patterns from incoming log entries and the classification of those log patterns in a streaming manner.
[0028] Initially, method 100 includes an act 105 of extracting an unstructured new log entry from an incoming stream of data. The incoming stream of data may be received in any manner such as, for example, from an application executing on the computer system (e.g., the system's operating system "OS"), from an outside source over a communication channel (e.g., via a TCP or UDP port), or via any other communication interface. Once this log entry is received, then it is parsed into a sequence of tokens (act 110). The process of parsing will be described in more detail later.
[0029] Each of the tokens in the token sequence is then classified (act 115). Specifically, the tokens are classified as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry. In some embodiments, parsing the log entry and/or classifying each token is performed in realtime (i.e. in a dynamic streaming manner) as the incoming stream of data is being received.
[0030] In some instances (though not all), each message type (e.g., "e") has a one- to-one mapping with a log printing statement in the source code producing the raw log entry "e." For example, consider the following printing statement:
printfC Temperature %s exceeds warning thresholds", tmp);
[0031] This printing statement may produce several log entries such as, for example:
Temperature (41C) exceeds warning threshold
Temperature (43C) exceeds warning threshold
[0032] In this example scenario, the parameter values are "41C and " 3C" respectively while the message type is "Temperature * exceeds warning threshold '' From this example, it is shown that parsing a log entry into different fields (e.g., a structured message type and a parameter) will be highly beneficial when analyzing the
log entry because the analysis engines will have a better understanding of the content included in the log entry.
[0033] Continuing with method 100, a comparison is subsequently performed between "message tokens" that have been classified as belonging to the log entry's structured message type and "stored-message tokens" that define structured message types of at least some (but potentially not all) of previously stored token sequence objects. These previously stored token sequence objects may be stored in a repository and may include any number of objects. It will be appreciated that the repository may be local to the computer system or it may be remote. Following the above step, an analysis of this comparison is performed in act 125.
[0034] Figure IB further elaborates on this analysis. For instance, upon determining that the relationship criteria is not satisfied, a new token sequence object is created (act 130). This new token sequence is based on the message tokens and any tokens classified as belonging in the log entry's parameter set. Alternatively, upon determining that the relationship criteria is satisfied, then certain information is added to the particular one token sequence object (act 135). Specifically, a new line identification is added to a line identification list included as a part of the particular one token sequence object. In this manner, the disclosed embodiments provide an online streaming technique to parse data in a quick and efficient manner, as described earlier.
Structured Log Parsing Engine
[0035] As described earlier, in some implementations, log entries are produced by "print" statements in a system program's source code. In this regard, a log entry can be viewed as a collection of {"message type", "parameter value"} pairs. For example, the log printing statement "printfi("File %d finished. ", id)" contains a constant message type "File finished' and a variable parameter value which is the value loaded into the "/'if' field. According to the disclosed embodiments, one functionality of a structured log parser is to identify the message type "File * finished', where * is a placeholder dummy value for variables (i.e. parameter values).
[0036] Figure 2 shows a set of unstructured log entries 200, which includes timestamp data and message data. Currently, each individual log entry in the set is "unstructured" text data. By "unstructured" it is generally meant that the data is not formatted in a particular manner, nor are the individual data items grouped or otherwise identified as being distinct from one another.
[0037] After passing through a parsing engine 205, which performs the method described earlier in connection with Figures 1A and IB, the data that was previously unstructured is now formatted and arranged in a structural manner, as shown by the structured data 210.
[0038] For instance, certain portions of the data is now categorized as belonging to a structured message type while other portions of the data is categorized as belonging to a parameter set. The table shown by the structured data 210 is sometimes referred to as a "structured message type summary" and can be provided to a user in a graphical user interface. This table can include other data, such as metadata further describing the listed information. As such, each line item in the table constitutes a "token sequence object." Indeed, one, some, or all of the token sequence objects may include metadata information describing attributes of those object's corresponding structured message types and/or parameters. One example of the metadata includes the frequency by which each of the structured message types was observed within the input data stream. Another example of metadata includes a position indicator (e.g., the "*" placeholder value) indicating a location within each corresponding token sequence where a particular parameter is located, as will be described in further detail later.
[0039] In this regard, a user or computer system will be able to easily identify how often these structured message types were provided via the structured message type summary. Such information is particularly beneficial when diagnosing issues with the computer system. As generally shown in Figure 2, the structured message type summary lists (1) one or more identified message types, (2) a frequency of observation for each of those identified message types (the frequency is a form of metadata), and (3) a corresponding parameter that was previously associated with each of the identified message types (e.g., before their token sequences were parsed). This information (i.e. the structured message type summary) can then be displayed on the user interface.
[0040] Formally, a structured log parser, which can be performed by the parsing engine 205 in Figure 2, is defined as follows:
[0041] Given an ordered set of log entries (ordered by timestamps), "log = {ei; ... that contains "m" distinct message types produced by "m" different log printing statements from "p" different programs, where the values "m" and "p" (and the printing statements and the source code of these programs) are unknown, a structured log parser is to parse "/og" and produce all message types from those "m" log printing statements.
In some embodiments, each log printing statement contains a single, unique message type that may have multiple parameters.
[0042] In some embodiments, the structured log parser is the first and foremost step for automatic and smart log mining and data-driven log analytics solutions, and it also a useful step for managing logs in a log management system ("LMS"). Some of the disclosed embodiments are able to achieve high degrees of efficiency and scalability because the disclosed "streaming structured log parser" makes only one pass over the log (and hence the unstructured new log entry in the incoming stream of data) and processes each log entry in an online, continuous, and streaming (i.e. real-time) fashion.
[0043] To achieve effective data-driven log analytics, the first process is often to prepare the data by turning unstructured logs into structured data. Most previous structured log parsing methods focus on offline batched processing or matching new log entries with previously offline-extracted message types or regular expressions (e.g., from source code). Hence, they cannot be used as an online streaming structured log parser. Additionally, while some of these conventional tools provide an interface to parse logs upon their arrival, their parsers are often based on regular expressions defined by end users. Consequently, the system itself can only parse very simple and basic structured schema such as timestamp and hostname, while log messages are continued to be treated as unstructured text values.
[0044] In contrast, the disclosed embodiments operate using a streaming structured log parser for system event logs. In some embodiments, a longest common subsequence (LCS) algorithm is used to facilitate these operations. LCS is more fully described below. By way of a brief introduction, performing the previously described "comparison" between the different tokens may include determining the LCS between (1) the message tokens classified as belonging to the log entry's structured message type and (2) the stored-message tokens defining the structured message types of at least some of the previously stored token sequence objects. These operations are further described below.
Longest Common Subsequence
[0045] Suppose "∑" is a universe of alphabets (e.g., a-z, 0-9, etc.). Given any sequence "a = {ai; a.2; ar, ... amf such that a, c∑ for 1 < i < m, a subsequence of a is defined as {axi ; aX2 ; aX3 ; ... ; axkj, where Vx x, c Z+, and 1 < xi < X2 < ... < Xk≤m. Let " ? = (bi; b2; br, ... bn}" be another sequence such that b} e∑ for j e fl, n] . A subsequence γ is called a common subsequence of a and β if and only if it is a subsequence of each. The longest common subsequence (LCS) problem for input sequences a and β is to find
longest such γ. For instance, sequence { 1; 3; 5; 7; 9} and sequence { 1; 5; 7; 10} yields an LCS of { 1; 5; 7}.
[0046] With that understanding, it is noted that an LCS-based method is provided to efficiently and effectively extract and generate structured message types from raw system logs. One feature of the disclosed embodiments is that if the output by a log printing statement (which is a log entry) is viewed as a sequence, in most log printing statements, the constant that represents a "message type" often forms a relatively larger portion of the sequence, and the parameter values form a relatively smaller portion of the sequence (i.e. more text is included in the message type than in the parameter set). If two log entries are produced by the same log printing statement (e.g., designated "staf), but only differ by having different parameter values, the LCS of the two sequences is very likely to be the constant in the code "stat," which thereby implies a message type.
[0047] The merit of using the LCS formulation to parse system event logs, as compared with the previously mentioned clustering and iterative partitioning methods, is that a particular message type can be derived even with very few instances of log entries produced by its log statement. The other benefit of using the LCS approach is that, instead of relying on an offline batched approach, it is possible to parse logs using a streaming approach, which is described below.
Streaming Approach
[0048] In a log entry "e," each word is referred to as a token. A log entry "e" can be parsed to a set of tokens using system defined (or user input) delimiters according to the format of the log. Common delimiters (e.g., spaces, equal signs, colons, semicolons, etc.) are typically sufficient to cover most cases. It will be appreciated, however, that other delimiters may be used. After tokenization of a log, each log entry is translated into a token sequence, which is then used to compute the LCS (i.e., "∑ = {tokens from eij Π {tokens from e2j Π ... Π {tokens from e∑}"). Each log entry is assigned a unique line id. The first arriving log entry can be assigned a line id that was initialized to an initial base value (e.g., 0). The line ids of subsequently received log entries can be auto-incremented upon arrival of those new log entries, thereby defining a list of entries related to one another by their incremented line ids.
[0049] Some embodiments create an object data structure called "LCSObject" to hold currently parsed LCS sequences and their related metadata information (e.g., each line item in structured data 210 from Figure 2 constitutes an "LCSObject" (aka a "token sequence object")). Some embodiments also use "LCSseq" to denote a sequence that is
the LCS of multiple log messages (also referred to herein as an LCS sequence). The LCS of the multiple log messages constitutes a candidate for the message type of those log entries. As will be shown in Figure 3, each LCSObject contains an LCSseq, a list of line indices (a form of metadata) called linelds that stores the line ids for the corresponding log entries that lead to this LCSseq, and a list of parameter positions (another form of metadata) called paramPos that indicates where in the LCSseq are the placeholders for parameter values (i.e. the positions of "*") with respect to each log entry indexed by linelds. As described earlier, the "*" can be considered metadata information describing attributes of the previously stored token sequence objects. Finally, these embodiments store one, some, or all currently parsed LCSObjects in a list/repository called LCSMap. When a new log entry e, arrives, it is compared with all LCSseq in existing LCSObject in the LCSMap. Then, based on the results of these comparisons, the embodiments either insert the line id "/" to the linelds of an existing LCSObject, or they compute a new LCSObject and insert it into LCSMap, as described in Figure IB.
[0050] The LCS algorithm runs in a streaming fashion, as shown in Figure 3. Initially, the LCSMap 300 list is empty. When a new log entry 305 (e.g., e,) arrives, it is firstly parsed into a token sequence s, using a pre-defined set of delimiters (e.g., spaces, equal signs, colons, etc.). After that, s, is compared with the LCSseq from all LCSObject in the current LCSMap (i.e. LCSMap 100), to see if s, "matches" one of the existing LCSseq. If there is a match, then as described in method 100 of Figure 1, line id "z" is added to the linelds list of the corresponding LCSObject. Alternatively, if there is no match, a new LCSObject is created and insert it into LCSMap 100.
[0051] In Figure 3, because LSCMap 300 is initially empty, there was no match. Therefore, LSCMap 300 is updated to include the parsed information, thereby resulting in an updated LSCMap 310. As shown by the data flow in Figure 3, another new log entry 315 is received, parsed, and compared. After parsing new log entry 315, it is determined that new log entry 315 includes the same structured message types as log entry 305. As such, a match is identified, and the metadata in LSCMap 310 is updated to thereby produce updated LSCMap 320, which includes new information in the "Uneids" and "paramPos" fields. Subsequently, another new log entry 325 is received. After being parsed, it is determined that log entry 325 is unique, so a new LCSObject is created, thereby producing updated LCSMap 330. The process continues with new log entry 330 and can continue until no new log entries are received. In this manner, the disclosed embodiments provide a highly scalable and flexible solution to parsing log entries and
providing detailed information about those log entries, even when only a limited number of log entries are available.
[0052] In some instance, a new LCS will be generated. For instance, given a new log sequence "s" produced by the tokenization of a new log entry "e," some embodiments conduct a search through LCSMap. For the 1th LCSObject, suppose its LCSseq is "q ". It is possible to compute the value which is the length of the LCS(q s). While searching through the LCSMap, the largest value as well as the index to the corresponding LCSObject are retained. In the end, if "/, = max(l, s)" is greater than a threshold "f (by default, t = \s\ / 2), where \s\ denotes the length of a sequence "s", i.e., number of tokens in a log entry "e"), the LCSseq "¾,·" and the new log sequence "s" are considered as having the same message type. The intuition is that the LCS of "¾,·" and "s" is the maximum LCS among all LCSObjects in the LCSMap, and the length of LCS(¾ , s) is at least half the length of "s"; hence, unless the total length of parameter values in "e" is more than half of its size, which is very unlikely in practice, the length of LCS(¾ , s) is a good indicator whether the log entries in the th LCSObject (which share the LCSseq "¾,·" ) share the same message type with "e" or not (which would be LCSto , *)).
[0053] If there are multiple LCSObjects having the same max "Γ values, some embodiments choose the one with the smallest \q,\ value, since it has a higher set similarity value with "s". Then, some embodiments use backtracking to generate a new LCS sequence to represent the message type for all log entries in the th LCSObject and "e". Worthwhile to note, when using backtracking to get the new LCSseq of "¾,·" and "s", some embodiments mark a placeholder indicator (e.g., "*") at the places where the two sequences disagree, as the place holders for parameters, and consecutive adjacent "*"s are merged into one "*". With reference to Figure 2, the "*" constitutes a placeholder. As such, when the structured message type summary is displayed on the user interface, one, some, or all of the identified message types may be displayed with a placeholder indicator to indicate a position where the corresponding parameters are located with respect to text included in the identified message types. For instance, consider the following two sequences:
qj = Command Failed on: node-235 node-236
s = Command Failed on: node-127
[0054] Using backtracking to get the new LCSseq of these two, the result would be:
Command Failed on: *
[0055] It is beneficial to prove that this backtracking method gives LCS(¾ , s). Once this is done, the embodiments update the LCSseq of the th LCSObject from "¾,·" to LCS(¾ , s), and add the line id of V to the linelds of the f LCSObject.
[0056] If no existing "q " shares an LCS with "s" that is at least \s\/2 in length, the embodiments create a new LCSObject for "e" in LCSMap, and set its LCSseq as "s".
Additional Refinements And Improvements In Efficiency
[0057] Some embodiments are configured to further improve the above efficiencies and effectiveness, especially for cases when message types differ by a small margin or for logs having a large number of parameters.
[0058] For example, some embodiments are further configured to achieve nearly optimal time complexity for most incoming log entries (i.e., linear to \s\, the number of tokens in the token sequence "s" of a log entry "e"). Specifically, when a new log entry arrives, it is beneficial to compute the length of its LCS with each existing message type. Suppose each log entry is of size 0(n) for some small constant "«" (i.e., n = \s\), in many cases, it takes 0(n2) time to compute LCS of a log entry and an existing message type (using a standard dynamic programming (DP) formulation). This time can be improved, however, as described below.
[0059] Let m ' be the number of currently parsed message types in LCSMap. The disclosed efficiencies lead to a time complexity of 0(m ' · n2) for each new log entry. Note that since the number of possible tokens in a complex system could be large, it is not particularly beneficial to apply techniques that compute LCS efficiently by assuming a limited set of alphabets (i.e., by assuming small |∑| values).
[0060] A notable observation is that, for a vast majority of new log entries (over 99.9% in a tested evaluation), their message types are often already present in currently parsed message types (stored by LCSMap repository). Hence, instead of computing the LCS between a new log entry and each existing message type, some embodiments perform a pre-filtering step to find if its message type already exists, which reduces to the following problem:
[0061] For a new string σ and a set of current strings "strs = {stri; sir 2; ... ;
find the longest "str " such that " LCS(o , str,) = str, ", and return true if \str,\ > ¾ \σ\ .
[0062] Additionally, it is noted that in some embodiments, each string is a set of tokens and it is possible to simply view each token as a character. Such an operation facilitates the process of pre-filtering. In this regard, the previously stored token sequence objects, which were discussed earlier and which are used during the comparison process,
may actually be included in a larger group of previously stored token sequence objects stored in the repository. By performing a pre-filtering operation, the disclosed embodiments are able to limit the larger group of previously stored token sequence objects so that only a subset of those objects are considered when the comparison is performed. Accordingly, the following discussion will now focus on a few different techniques that can be followed to perform the pre-filtering process described above. These techniques include, but are not limited to, performing pre-filtering by generating any one or combination of (1) a simple loop, (2) an inverted list built from the structured message types of at least some, and potentially all, of the previously stored token sequence objects, and/or (3) a prefix tree that is also built from those structured message types. Through use of these pre-filtering techniques, some of the embodiments are able to improve efficiency by restricting the number of discrete comparisons that are performed. That is, by pre-filtering some of the stored-message tokens included in an token sequence object (and/or the objects themselves), those filtered stored-message tokens are not considered, or rather are refrained from being considered, during the comparison process.
Simple Loop Pre-Filtering
[0063] One pre-filtering method is to simply loop through "strs" . Specifically, for each "str ", the embodiments maintain a pointer "/?,■" pointing to the head of "str and another pointer "p " pointing to the head of σ. If the characters (i.e. tokens) pointed to by "pi" and "p " match, both pointers are advanced; otherwise, only pointer " ?/' is advanced. When "p " has gone to the end of σ, check if " ? " has also reached the end of "strT . A pruning can be applied which is to skip "strT if its length is less than ¾ | σ \ . The worst time complexity for this approach is 0(m · n).
Inverted List Pre-Filtering
[0064] To avoid going through the entire "stf set, some embodiments use an inverted index based approach which could skip checking many "strts . Such an approach is illustrated in Figure 4.
[0065] An inverted index 400 (i.e. " ') is built over a set of incoming "strs" 405, as shown in Figure 4. For each unique character in strs 405, the embodiments record its position (/' , pos in str,) in all "stris" that it appears in, sorted by "F . Given this index "7", the procedure to find the subsequence of σ is as follows.
[0066] First, for each character in σ, determine if it is included in the current If not, then treat that character as a parameter and skip; otherwise, assign a unique, auto-
increment id to the matching inverted list (e.g., 1, 2, 3, etc. on each row in Figure 4 for A, B, and C).
[0067] Next, for such matching lists, scan them using a sort-merge join style method. Instead of trying to find equal items across all lists (as that in sort-merge join), it is beneficial to try to find, by following the order of assigned ids to these lists, if any column with the same string id "7" in the inverted index 100 (aka an inverted index matrix) could form a subsequence of σ.
[0068] For example, this is represented in Figure 4, by (1, 1) from the list for A, by (1, 2) from the list for B, and by (1, 3) from the list for C. These values follow the order of assigned ids and share the same string id value "1" and form a "strT that is a proper subsequence of σ (by checking if the position values are properly ordered). Note that this step will return all "stris" in "strs" that satisfy "str, = LCS(a , str,)" .
[0069] Additionally, for all "stris" returned by the previous step described above, it is beneficial to find the longest one and check if its length is greater than ¾ |σ| . By performing the inverted index lookup procedure, all "str,-s" are considered, and the time complexity is only O(cn), where "c" is the average length of each inverted list (i.e. the average number of duplicated characters in the set "strs").
Prefix Tree Pre-Filtering
[0070] Another pre-filtering technique is to use a prefix tree. To perform this technique, some embodiments index "sfris" in "strs" using a prefix tree, and prune away many candidates.
[0071] An example is shown in Figure 5. Here, strs 500 is shown in the following manner: "strs = {ABC ; ACD ; AD ; EFf and the strs 500 are indexed by a prefix tree 505 (i.e. "7"). Instead of checking σ against every "strT in "strs", it is beneficial to first check prefix tree 505 to see if there is an existing "strT that is a subsequence of σ. If such a "strT is identified, some embodiments apply the length filter (i.e., check if \str,\ > ¾ σ|). As shown in Figure 5, suppose a new log entry 510 (i.e. σ = ABPC) is received. By comparing each character of σ with each node of the prefix tree 505, it is possible to efficiently prune most branches in the prefix tree 505, and mark the characters in σ that do not match any node in the prefix tree 505 as parameters in a message type. In this case, the embodiments will successfully identify "ABC as the message type for σ, and as its parameter.
[0072] For most log entries, it is highly likely that their message type already exists in the prefix tree 505, so it is often beneficial to stop here. As such, the time complexity
is only 0(n). This is optimal, since the embodiments will have to go through every token in a new log entry at least once. However, this approach only guarantees to return a sir " if such "str, = LCS(a , str,)" exists. In contrast to the inverted list approach, it does not guarantee that the returned sir " is the longest one among all "sfris" that satisfy "str, = LCS(o , strif. For example, if σ = DAPBC while "sirs = {DA ; ABC}", the prefix tree 505 would return "DA" instead of "ABC.
[0073] In practice, it has been determined that although the prefix tree approach does not guarantee to find the longest message type, its returned message type is almost identical to the results of simple loop and inverted list methods. That is because parameters in each log record tend to appear near the end.
[0074] The disclosed embodiments are able to use combinations of the different pre- filtering techniques. For instance, the pre-filtering process may include the following steps. For each new log entry "e", attempt to find its message type using a prefix tree. If not found, apply the inverted index lookup.
[0075] If those two techniques are not successful, then the simple look technique can be applied. For instance, for log entries (less than 0: 1% in a real test case scenario) that do not find message types using the pre-filtering step, some embodiments then compare the new log entry "e" with all existing parsed message types to see if a new message type could be generated. However, instead of computing LCS between each message type "q" and "e" (and then finding the length of this LCS), it is beneficial to first compute their set similarity score using Jaccard similarity. In some embodiments, only for those message types that have more than half common elements (i.e., tokens) with "e" do those embodiments then compute their LCS. Then, if their LCS length exceeds ¾ \s\, it is possible to adjust that message type, and adjust inverted index " ' and prefix tree "7" accordingly. Otherwise "e" is a new message type, so the embodiments simply add it to "7" and "T thereby creating a new LCSObject in LCSMap.
Improvements On Effectiveness
[0076] For a small portion of message types that are not recognized by the above processes, it is possible to further improve the embodiments effectiveness through certain refinement techniques. The refinement takes place, or rather is triggered, when certain conditions occur. Some example conditions include, but are not limited to, when the number of processed log entries has reached a user-defined threshold, when the system in the computing architecture is idle, or when the system is determined to be underutilized (e.g., arrival rates of new log entries become low or reach a threshold volume).
Another condition is right before the message type lookup procedure completes. The refinement may be achieved via two steps, namely, split and merge.
Split Refinement
[0077] Consider the following log entries:
boot (command 1880) Error: Console-Busy Port already in use boot (command 2359) Error: Console-Busy Port already in use wait (command 3964) Error: Console-Busy Port already in use
[0078] By following the principles discussed thus far, the extracted message types would likely be a single message type:
* (command *) Error: Console-Busy Port already in use
[0079] However the correct message types should be:
boot (command *) Error: Console-Busy Port already in use wait (command *) Error: Console-Busy Port already in use
[0080] It is noted that the token positions contributing to the message types have fewer number of unique tokens. By utilizing this feature, it is possible to introduce a split method that is applied to each LCSObject in LCSMap. For this purpose, it is possible to introduce another field in an LCSObject called "params" which contains key-value pairs that store all parameter values (as keys), seen from the history, at each parameter position, and how many times each parameter value has been seen (as values). These key-value pairs can be easily produced and/or updated during the backtracking process to produce LCS.
[0081] In particular, for each LCSObject "O", "O" has a paramPos list that contains the token positions as place holders for parameters for the message type LCSseq of "O". A split refinement process leverage the above observation and splits some existing parameters and adds/categories those split parameter values as message types.
[0082] Specifically, for each LCSObject "O" in LCSMap, it is possible to go through "O.params", its key-value pair parameter list, and count the number of unique tokens at each parameter position. The embodiments then find the position having the least number of unique tokens. If the number of unique tokens in that position is less than a split threshold, and no token contains any digit value, the embodiments determine that the position will contribute to a message type. In that case, the embodiments split "O" to several new LCSObjects, with that parameter position in "O.LCSseq" being replaced by a unique token in the same position in each new LCSObject's LCSseq.
[0083] To illustrate using the above example, the first parameter position has only 2 unique tokens, while the second parameter position has many (in the actual log file). So the first parameter position has a high probability of being a part of a message type. The split procedure will replace the first parameter ' *' in the LCSseq with each unique token ("boof and "wait" in this example) at the same position, leading to two new LCSObjects with two new LCSseqs as shown above.
Merge Refinement
[0084] The disclosed embodiments are highly versatile and work well with formatted log messages (as well as other types of log messages), especially where the number of tokens in the message types is more than that in parameters, and parameters vary in different log messages. Recall, that some embodiments use a threshold to distinguish whether the LCS of two sequences is a valid message type. That said, there could be situations where the number of parameters in one log entry is so many that this threshold based approach is not optimal to extract the message type properly. Consider the case below:
Fan speeds (3552 3534 3375 4354 3515 3479 ) Fan speeds (3552 3534 3375 4299 3515 3479 ) Fan speeds (3552 3552 3391 4245 3515 3497 ) Fan speeds (3534 3534 3375 4245 3497 3479 ) Fan speeds (3534 3534 3375 4066 3497 3479 )
[0085] Clearly, they should lead to the following single message type: "Fan speeds: ( *)". However, each of the log messages contains 10 tokens (i.e. "Fan" is the first token, "speeds" is the second token, is the third token, "3552" is the fourth token, as so on), and the message type has only 4 tokens (i.e. "Fan" and "speeds " and "("' and " '). If no refinements were made, it is possible that the outputted message types would be the following:
Fan speeds ( * 3552 * 3515 * )
Fan speeds (3534 3534 3375 * 3497 3479 )
[0086] Though their LCS has correctly identified "Fan speeds", its length is less than the threshold "t" used to qualify an LCSseq as a message type. So in this case, it is possible that some embodiments not using refinement will identify the above example as two distinct message types.
[0087] To address this problem (when the number of parameters in a log entry is too many), a merge procedure is introduced, which first clusters current LCSObjects that
might have the same message type together and then counts the number of distinct tokens at each position of LCSseqs from these LCSObjects. To do so, some embodiments partition LCSMap into a set of clusters, until each cluster satisfies the following conditions:
[0088] Condition 1 - For a subset of token positions, at each such position, there is only one unique token for all "LCSObject.LCSseq" at this position from this cluster.
[0089] Condition 2 - For every other token position, the number of unique tokens of all "LCSObject.LCSseq" from this cluster exceeds a merge threshold. The merge threshold is different for positions having numbers and positions having only strings. Some embodiments consider if any token in one position has digits in it (might be ids or values), then that position could be a parameter position. Such an operation works quite well in practice.
[0090] After this partition step, some embodiments consider all LCSObjects inside one cluster as having the same message type. So, these embodiments merge them into a single LCSObject, assign the positions from above as parameter positions, and the tokens from above will be inserted into LCSseq as part of the message type. As in the above example, the positions having only one unique token are positions 1, 2, 3, 10, which are "Fan speeds: ()" (i.e. "Fan" occupies position 1, "speeds" occupies position 2, occupies position 3, and " ' occupies position 10) and the rest having many unique tokens are identified as parameter positions.
Timing Analysis
[0091] The disclosed embodiments ensure that the size of LCSMap (i.e. the repository) increases by one only when a new message type has been identified; otherwise, a new log entry id will be added to an existing LCSObject with an updated message type. This guarantees that LCSMap size is at most the number of total message types (which is "m") that could be produced by the corresponding source code, which is a constant. Earlier, it was shown that the time complexity for each new log entry can be 0(m · n2), since the naive dynamic programming method to compute LCS between a log entry and a message type is 0(n2) for log entries of size 0(n), whereas the backtracking method is often cheaper and it is performed only with a target message type in LCSMap which has the longest LCS length with respect to the new log entry and the length exceeds a threshold, thus it is computed at most once for each new log message.
[0092] With the pre-filtering step, for each log entry, it is beneficial to first try to find its message type in prefix tree, then inverted index, and only for the small portion that
are still not located, the LCSMap is compared against (e.g., via the simple loop technique). For "J" log records, suppose the number of log records that fail to find message types in pre-filtering step is "F and the number of log records that are returned in inverted index step is "7". The amortized cost for each log record is only
[0093] where "c" is the average number of duplicated tokens in all message types, "m" is the number of message types, and "«" is the log record length. In some sample test timing evaluations, (J+FJ/L < 0:01 and F/L < 0:001, thus the cost for each log record to find its message type using the disclosed embodiments is approximately only 0(n) in practice.
[0094] For the complexity of the split method, note that the split threshold is a small constant, so the total size of an adjusted LCSMap is at most a small constant times of the previous LCSMap size, which is still 0(m). The split method itself takes only 0(m) cost. In the merge procedure, the LCSMap is first repeatedly partitioned until each cluster satisfies the conditions to be merged. Note, this is done over currently parsed message types. Hence, the partition cost is at most the number of currently parsed message types, 0(m), and the LCSMap size only reduces after a merge step. Note that the split/merge heuristics are typically executed only very infrequently, thus their impacts to the amortized cost on each log entry can be mostly ignored. It is worthwhile to highlight that the LCS-based construction is not similar to a so-called "edit distance based approach." A crucial difference between the two is that the LCS sequence of two log messages is naturally a message type, which makes streaming log parsing possible.
[0095] Accordingly, the disclosed embodiments are able to dynamically parse log entry data in a highly efficient, scalable, and effective manner. By utilizing certain refinement procedures, the efficiency, scalability, and effectiveness can be improved even further. In this regard, the disclosed embodiments provide a significant improvement over conventional systems, thereby improving how computer systems operate.
Example Computer Systems
[0096] Attention will now be directed to Figure 6 which illustrates an example computer system 600 that may be used to facilitate the operations described herein. It will be appreciated that previous references to a "computing architecture" may refer to computer system 600 by itself or to computer system 600 along with any number of other
computer systems. As such, computer system 600 may take various different forms. For example, in Figure 6, computer system 600 may be embodied as a tablet 600A or a desktop 600B. The ellipsis 600C demonstrates that computer system 600 may be embodied in any other form. For example, computer system 600 may also be a distributed system that includes one or more connected computing components/devices that are in communication with computer system 600, a laptop computer, a mobile phone, a server, a data center, and/or any other computer system.
[0097] In its most basic configuration, computer system 600 includes various different components. For example, Figure 6 shows that computer system 600 includes at least one processor 605 (aka a "hardware processing unit"), a parsing engine 610 (an example implementation of the parsing engine 205 from Figure 2), and storage 615. Processor 605 and/or parsing engine 610 may be configured to perform any of the operations discussed herein. That is, the parsing engine 610 may be a dedicated, specialized, or even general processor. Storage 615 is shown as including executable code/instructions 620. When executed, the executable code/instructions 620 causes the computer system 600 to perform the disclosed operations.
[0098] Storage 615 may be physical system memory, which may be volatile, nonvolatile, or some combination of the two. The term "memory" may also be used herein to refer to non-volatile mass storage such as physical storage media. If computer system 600 is distributed, the processing, memory, and/or storage capability may be distributed as well. As used herein, the term "executable module," "executable component," "module," or even "component" can refer to software objects, routines, or methods that may be executed on computer system 600. The different components, modules, engines, and services described herein may be implemented as objects or processors that execute on computer system 600 (e.g. as separate threads).
[0099] The disclosed embodiments may comprise or utilize a special-purpose or general-purpose computer including computer hardware, such as, for example, one or more processors (such as processor 605) and system memory (such as storage 615), as discussed in greater detail below. Embodiments also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system. Computer-readable media that store computer-executable instructions in the form of data are physical computer storage media. Computer-readable media that carry computer-executable
instructions are transmission media. Thus, by way of example and not limitation, the current embodiments can comprise at least two distinctly different kinds of computer- readable media: computer storage media and transmission media.
[00100] Computer storage media are hardware storage devices, such as RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) that are based on RAM, Flash memory, phase-change memory (PCM), or other types of memory, or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code means in the form of computer-executable instructions, data, or data structures and that can be accessed by a general-purpose or special-purpose computer.
[00101] Computer system 600 may also be connected (via a wired or wireless connection) to external sensors (e.g., one or more remote cameras, accelerometers, gyroscopes, acoustic sensors, magnetometers, etc.) or computer systems. Further, computer system 600 may also be connected through one or more wired or wireless networks 625 to remote systems(s) that are configured to perform any of the processing described with regard to computer system 600. As such, computer system 600 is able to collect streamed log data from those other external devices as well. A graphics rendering engine may also be configured, with processor 605, to render one or more user interfaces for the user to view and interact with.
[00102] A "network," like the network 625 shown in Figure 6, is defined as one or more data links and/or data switches that enable the transport of electronic data between computer systems, modules, and/or other electronic devices. When information is transferred, or provided, over a network (either hardwired, wireless, or a combination of hardwired and wireless) to a computer, the computer properly views the connection as a transmission medium. Computer system 600 will include one or more communication channels that are used to communicate with the network 625. Transmissions media include a network that can be used to carry data or desired program code means in the form of computer-executable instructions or in the form of data structures. Further, these computer-executable instructions can be accessed by a general-purpose or special- purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[00103] Upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (or vice versa). For
example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a network interface card or "NIC") and then eventually transferred to computer system RAM and/or to less volatile computer storage media at a computer system. Thus, it should be understood that computer storage media can be included in computer system components that also (or even primarily) utilize transmission media.
[00104] Computer-executable (or computer-interpretable) instructions comprise, for example, instructions that cause a general-purpose computer, special-purpose computer, or special-purpose processing device to perform a certain function or group of functions. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[00105] Those skilled in the art will appreciate that the embodiments may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, pagers, routers, switches, and the like. The embodiments may also be practiced in distributed system environments where local and remote computer systems that are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network each perform tasks (e.g. cloud computing, cloud services and the like). In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[00106] Additionally, or alternatively, the functionality described herein can be performed, at least in part, by one or more hardware logic components (e.g., the processor 605). For example, and without limitation, illustrative types of hardware logic components that can be used include Field-Programmable Gate Arrays (FPGAs), Program-Specific or Application-Specific Integrated Circuits (ASICs), Program-Specific Standard Products (ASSPs), System-On-A-Chip Systems (SOCs), Complex
Programmable Logic Devices (CPLDs), Central Processing Units (CPUs), and other types of programmable hardware.
[00107] The present invention may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims
1. A computer system comprising:
one or more processors; and
one or more computer-readable hardware storage devices having stored thereon computer-executable instructions that are executable by the one or more processors to cause the computer system to:
from an incoming stream of data, extract an unstructured new log entry; parse the log entry into a sequence of tokens;
from the token sequence, classify each token as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry;
perform a comparison between message tokens classified as belonging to the log entry's structured message type and stored-message tokens defining corresponding structured message types of at least some previously stored token sequence objects, wherein a result of the comparison indicates whether a relationship between the message tokens and a particular set of the stored- message tokens satisfies a relationship criteria, the particular set of the stored- message tokens being associated with a particular one of the previously stored token sequence objects; and
upon determining that the relationship criteria is not satisfied, create a new token sequence object that is based on the message tokens and any tokens classified as belonging in the log entry's parameter set.
2. The computer system of claim 1, wherein the at least some previously stored token sequence objects are included in a plurality of previously stored token sequence objects, and wherein a pre-filtering is performed to limit the plurality of previously stored token sequence objects so that only the at least some previously stored token sequence objects are considered when the comparison is performed.
3. The computer system of claim 2, wherein the pre-filtering is performed using either (1) an inverted list built from the corresponding structured message types of the at least some previously stored token sequence objects or (2) a prefix tree that is also built from the corresponding structured message types.
4. The computer system of claim 1, wherein performing the comparison includes determining a longest common subsequence between (1) the message tokens classified as belonging to the log entry's structured message type and (2) the stored- message tokens defining the corresponding structured message types of the at least some previously stored token sequence objects.
5. The computer system of claim 1, wherein parsing the log entry and classifying each token is performed in real-time as the incoming stream of data is being received.
6. The computer system of claim 1, wherein only one pass is performed over each unstructured new log entry included in the incoming stream of data such that each unstructured new log entry is processed in an online, continuous streaming manner.
7. The computer system of claim 1, wherein the structured message type of the log entry corresponds to a print statement in corresponding source code.
8. The computer system of claim 1, wherein each unstructured new log entry extracted from the incoming stream of data is assigned a unique line identification.
9. The computer system of claim 1, wherein each of the at least some previously stored token sequence objects includes metadata information describing attributes of the corresponding structured message types.
10. The computer system of claim 1, wherein parsing the log entry into the sequence of tokens is performed using a defined set of delimiters.
11. A method for operating a computing architecture that improves how system event logs are parsed by facilitating dynamic extraction of log patterns from incoming log entries and classifying those log patterns in a streaming manner, the method comprising:
from an incoming stream of data, extract an unstructured new log entry; parse the log entry into a sequence of tokens;
from the token sequence, classify each token as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry;
perform a comparison between message tokens classified as belonging to the log entry's structured message type and stored-message tokens defining corresponding structured message types of at least some previously stored token sequence objects, wherein a result of the comparison indicates whether a relationship between the message tokens and a particular set of the stored-
message tokens satisfies a relationship criteria, the particular set of the stored- message tokens being associated with a particular one of the previously stored token sequence objects; and
upon determining that the relationship criteria is satisfied, adding a new line identification to a line identification list included as a part of the particular one token sequence object.
12. The method of claim 11, wherein each of the at least some previously stored token sequence objects includes metadata, and wherein the metadata includes a position indicator indicating a location within each corresponding token sequence where a particular parameter is located.
13. The method of claim 11, wherein the method further includes:
generating an inverted index based on the stored-message tokens included within the at least some previously stored token sequence objects; and
using the inverted index to pre-filter some of the stored-message tokens such that those filtered stored-message tokens are refrained from being considered during the comparison.
14. The method of claim 11, wherein the method further includes:
generating a prefix tree based on the stored-message tokens included within the at least some previously stored token sequence objects; and
using the prefix tree to pre-filter some of the stored-message tokens such that those filtered stored-message tokens are refrained from being considered during the comparison.
15. The method of claim 11, wherein the method further includes:
generating a structured message type summary that lists (1) one or more identified message types, (2) a frequency of observation for each of the one or more identified message types, and (3) a corresponding parameter that was previously associated with each of the one or more identified message types; and displaying the structured message type summary on a user interface.
16. The method of claim 11, wherein each unstructured new log entry that is extracted from the incoming stream of data constitutes a single line in a log file and corresponds to a log printing statement in source code of a user program or to a kernel program executing within the computing architecture.
17. The method of claim 11, wherein the method further includes:
in response to one or more of (1) a number of processed log entries reaching a pre-defined threshold number, or (2) a computer system in the computing architecture being idle, or (3) the computer system in the computing architecture being determined to be under-utilized, trigger performance of a refinement process that, when performed, either (1) splits tokens that were previously identified as being parameters so that those split tokens become portions of associated structured message types or (2) merges tokens that were previously identified as being parts of structured message types so that those merged tokens become parameters.
18. One or more hardware storage devices having stored thereon computer- executable instructions that are executable by one or more processors of a computer system to cause the computer system to:
from an incoming stream of data, extract an unstructured new log entry; parse the log entry into a sequence of tokens;
from the token sequence, classify each token as belonging to a structured message type of the log entry or, alternatively, as belonging to a parameter set of the log entry;
perform a comparison between message tokens classified as belonging to the log entry's structured message type and stored-message tokens defining corresponding structured message types of at least some previously stored token sequence objects, wherein a result of the comparison indicates whether a relationship between the message tokens and a particular set of the stored- message tokens satisfies a relationship criteria, the particular set of the stored- message tokens being associated with a particular one of the previously stored token sequence objects; and
upon determining that the relationship criteria is not satisfied, create a new token sequence object that is based on the message tokens and any tokens classified as belonging in the log entry's parameter set and add the new sequence object to the repository, or, alternatively, upon determining that the relationship criteria is satisfied, adding a new line identification to a line identification list included as a part of the particular one token sequence object.
19. The one or more hardware storage devices of claim 18, wherein execution of the computer-executable instructions further causes the computer system to:
generate a structured message type summary that lists (1) one or more identified message types, (2) a frequency of observation for each of the one or more identified message types, and (3) a corresponding parameter that was previously associated with each of the one or more identified message types; and display the structured message type summary on a user interface, wherein each of the one or more identified message types are displayed with a placeholder indicator to indicate a position where the corresponding parameters are located with respect to text included in the one or more identified message types.
20. The one or more hardware storage devices of claim 18, wherein each unstructured new log entry includes only a single unique message type, and wherein each unstructured new log entry includes one or more parameters.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762561115P | 2017-09-20 | 2017-09-20 | |
US62/561,115 | 2017-09-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019060326A1 true WO2019060326A1 (en) | 2019-03-28 |
Family
ID=65810791
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2018/051599 WO2019060326A1 (en) | 2017-09-20 | 2018-09-18 | Parsing system event logs while streaming |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019060326A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110333990A (en) * | 2019-05-29 | 2019-10-15 | 阿里巴巴集团控股有限公司 | Data processing method and device |
CN111400500A (en) * | 2020-03-25 | 2020-07-10 | 上海擎创信息技术有限公司 | L CS-based Chameleon real-time log clustering method |
CN112560407A (en) * | 2020-12-18 | 2021-03-26 | 上海中畅数据技术有限公司 | Method for extracting computer software log template on line |
CN113190426A (en) * | 2020-07-02 | 2021-07-30 | 北京睿知图远科技有限公司 | Stability monitoring method for big data scoring system |
CN113268462A (en) * | 2020-02-14 | 2021-08-17 | 西安诺瓦星云科技股份有限公司 | Log management method, device and system based on embedded equipment |
CN114722014A (en) * | 2022-06-09 | 2022-07-08 | 杭银消费金融股份有限公司 | Batch data time sequence transmission method and system based on database log file |
CN116383155A (en) * | 2023-06-05 | 2023-07-04 | 成都融见软件科技有限公司 | Log query system based on EDA verification simulator |
CN116542558A (en) * | 2023-04-27 | 2023-08-04 | 上海数禾信息科技有限公司 | Service index calculation method, device, computer equipment and storage medium |
WO2024145208A1 (en) * | 2022-12-29 | 2024-07-04 | Bitdrift, Inc. | Systems and methods for managing log data |
US12093162B1 (en) * | 2022-03-24 | 2024-09-17 | Amazon Technologies, Inc. | Block anchors for online log parsing |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050015624A1 (en) * | 2003-06-09 | 2005-01-20 | Andrew Ginter | Event monitoring and management |
US20060129555A1 (en) * | 2004-12-09 | 2006-06-15 | Microsoft Corporation | System and method for indexing and prefiltering |
US20110296244A1 (en) * | 2010-05-25 | 2011-12-01 | Microsoft Corporation | Log message anomaly detection |
US8154812B1 (en) * | 2009-03-06 | 2012-04-10 | Western Digital Technologies, Inc. | Condensing a defect scan log for a disk of a disk drive |
US20120163196A1 (en) * | 2009-08-13 | 2012-06-28 | International Business Machines Corporation | Automatic Address Range Detection for IP Networks |
US20140082463A1 (en) * | 2012-09-14 | 2014-03-20 | David H. Sitrick | Multi-Event Processing Systems And Methodologies |
US20170180404A1 (en) * | 2015-12-22 | 2017-06-22 | Sap Se | Efficient identification of log events in enterprise threat detection |
-
2018
- 2018-09-18 WO PCT/US2018/051599 patent/WO2019060326A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050015624A1 (en) * | 2003-06-09 | 2005-01-20 | Andrew Ginter | Event monitoring and management |
US20060129555A1 (en) * | 2004-12-09 | 2006-06-15 | Microsoft Corporation | System and method for indexing and prefiltering |
US8154812B1 (en) * | 2009-03-06 | 2012-04-10 | Western Digital Technologies, Inc. | Condensing a defect scan log for a disk of a disk drive |
US20120163196A1 (en) * | 2009-08-13 | 2012-06-28 | International Business Machines Corporation | Automatic Address Range Detection for IP Networks |
US20110296244A1 (en) * | 2010-05-25 | 2011-12-01 | Microsoft Corporation | Log message anomaly detection |
US20140082463A1 (en) * | 2012-09-14 | 2014-03-20 | David H. Sitrick | Multi-Event Processing Systems And Methodologies |
US20170180404A1 (en) * | 2015-12-22 | 2017-06-22 | Sap Se | Efficient identification of log events in enterprise threat detection |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110333990B (en) * | 2019-05-29 | 2023-06-27 | 创新先进技术有限公司 | Data processing method and device |
CN110333990A (en) * | 2019-05-29 | 2019-10-15 | 阿里巴巴集团控股有限公司 | Data processing method and device |
CN113268462B (en) * | 2020-02-14 | 2024-05-10 | 西安诺瓦星云科技股份有限公司 | Log management method, device and system based on embedded equipment |
CN113268462A (en) * | 2020-02-14 | 2021-08-17 | 西安诺瓦星云科技股份有限公司 | Log management method, device and system based on embedded equipment |
CN111400500A (en) * | 2020-03-25 | 2020-07-10 | 上海擎创信息技术有限公司 | L CS-based Chameleon real-time log clustering method |
CN113190426B (en) * | 2020-07-02 | 2023-10-20 | 北京睿知图远科技有限公司 | Stability monitoring method for big data scoring system |
CN113190426A (en) * | 2020-07-02 | 2021-07-30 | 北京睿知图远科技有限公司 | Stability monitoring method for big data scoring system |
CN112560407A (en) * | 2020-12-18 | 2021-03-26 | 上海中畅数据技术有限公司 | Method for extracting computer software log template on line |
US12093162B1 (en) * | 2022-03-24 | 2024-09-17 | Amazon Technologies, Inc. | Block anchors for online log parsing |
CN114722014B (en) * | 2022-06-09 | 2022-09-02 | 杭银消费金融股份有限公司 | Batch data time sequence transmission method and system based on database log file |
CN114722014A (en) * | 2022-06-09 | 2022-07-08 | 杭银消费金融股份有限公司 | Batch data time sequence transmission method and system based on database log file |
WO2024145208A1 (en) * | 2022-12-29 | 2024-07-04 | Bitdrift, Inc. | Systems and methods for managing log data |
CN116542558A (en) * | 2023-04-27 | 2023-08-04 | 上海数禾信息科技有限公司 | Service index calculation method, device, computer equipment and storage medium |
CN116542558B (en) * | 2023-04-27 | 2024-06-04 | 上海数禾信息科技有限公司 | Service index calculation method, device, computer equipment and storage medium |
CN116383155B (en) * | 2023-06-05 | 2023-08-11 | 成都融见软件科技有限公司 | Log query system based on EDA verification simulator |
CN116383155A (en) * | 2023-06-05 | 2023-07-04 | 成都融见软件科技有限公司 | Log query system based on EDA verification simulator |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019060326A1 (en) | Parsing system event logs while streaming | |
US11734315B2 (en) | Method and system for implementing efficient classification and exploration of data | |
Du et al. | Spell: Online streaming parsing of large unstructured system logs | |
CN108701254B (en) | System and method for dynamic lineage tracking, reconstruction and lifecycle management | |
CN107111625B (en) | Method and system for efficient classification and exploration of data | |
US11995071B1 (en) | Assigning field values based on an identified extraction rule | |
EP2092419B1 (en) | Method and system for high performance data metatagging and data indexing using coprocessors | |
Meng et al. | Logparse: Making log parsing adaptive through word classification | |
KR102361153B1 (en) | Managing data profiling operations related to data type | |
CN104112026B (en) | A kind of short message text sorting technique and system | |
US20100057736A1 (en) | Techniques for performing regular expression-based pattern matching in data streams | |
Taerat et al. | Baler: deterministic, lossless log message clustering tool | |
US11500871B1 (en) | Systems and methods for decoupling search processing language and machine learning analytics from storage of accessed data | |
CN111160021A (en) | Log template extraction method and device | |
US11748634B1 (en) | Systems and methods for integration of machine learning components within a pipelined search query to generate a graphic visualization | |
US11567735B1 (en) | Systems and methods for integration of multiple programming languages within a pipelined search query | |
US11727007B1 (en) | Systems and methods for a unified analytics platform | |
US10635662B2 (en) | Signature detection | |
Hadi et al. | Aobtm: Adaptive online biterm topic modeling for version sensitive short-texts analysis | |
US10467276B2 (en) | Systems and methods for merging electronic data collections | |
US20080222063A1 (en) | Extensible mechanism for detecting duplicate search items | |
WO2016093839A1 (en) | Structuring of semi-structured log messages | |
Eljasik-Swoboda et al. | Leveraging Clustering and Natural Language Processing to Overcome Variety Issues in Log Management. | |
US10387474B2 (en) | System and method for cross-cloud identification of topics | |
CN116822491A (en) | Log analysis method and device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18859774 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18859774 Country of ref document: EP Kind code of ref document: A1 |