CN109766254B

CN109766254B - IT system operation and maintenance monitoring data auxiliary preprocessing method and system

Info

Publication number: CN109766254B
Application number: CN201811545056.XA
Authority: CN
Inventors: 陈劭力; 王巍
Original assignee: Eccom Network System Co ltd
Current assignee: Eccom Network System Co ltd
Priority date: 2018-12-17
Filing date: 2018-12-17
Publication date: 2022-04-08
Anticipated expiration: 2038-12-17
Also published as: CN109766254A

Abstract

The invention provides an auxiliary preprocessing method and system for IT system operation and maintenance monitoring data. Operation and maintenance monitoring data are collected based on Hadoop and stored in file form to obtain original file data. Whether the lengths of all data entries of the original file data are consistent is judged: if they are consistent, the data entries are defined as protocol data; if they are not, the data entries are defined as log data. The protocol data and the log data are preprocessed separately to obtain preprocessing result data, and the preprocessing result data and the original file data are displayed so that operation and maintenance personnel can adjust them as required. The method is applicable to many kinds of IT system operation and maintenance monitoring data and reduces the dependence of operation and maintenance personnel on experts when analyzing data manually. It is easy to implement and has low complexity, and its preprocessing results become more accurate as the number of data samples increases, assisting operation and maintenance personnel in further data processing.

Description

IT system operation and maintenance monitoring data auxiliary preprocessing method and system

Technical Field

The invention relates to the technical field of big data processing, in particular to an auxiliary preprocessing method and system for IT system operation and maintenance monitoring data, and particularly relates to an auxiliary preprocessing method for IT system operation and maintenance monitoring data in a big data storage environment.

Background

At present, almost no enterprise can develop its business without the support of IT systems. To keep those systems running normally, an enterprise usually uses monitoring software to assist its operation and maintenance personnel in operating and maintaining the IT systems. As business volume grows, IT systems become more and more complex, and the variety and quantity of the operation and maintenance monitoring data they generate also increase greatly, which creates two problems: 1) a large amount of operation and maintenance monitoring data must be stored, but traditional storage methods are too costly and their processing capacity degrades severely at large data volumes; 2) data protocols and data types are numerous, and new types can appear at any time as systems are updated; analyzing the new data manually by operation and maintenance personnel alone is a heavy workload and may also require the cooperation of domain experts. At the same time, growing data volume also means richer data samples, which makes automatic analysis of the data possible.

Given this situation, adopting big data technology to store large-scale operation and maintenance monitoring data can greatly reduce storage cost and improve data processing capacity. With a large number of data samples available, the operation and maintenance monitoring data can be automatically preprocessed to a certain degree and roughly divided into data sections. This reduces the data analysis workload of operation and maintenance personnel, who can complete the analysis by taking the preprocessing result as a blueprint and fine-tuning it.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide an auxiliary preprocessing method and system for IT system operation and maintenance monitoring data.

The invention provides an auxiliary preprocessing method for IT system operation and maintenance monitoring data, which comprises the following steps:

a data acquisition step: acquiring operation and maintenance monitoring data based on Hadoop, and storing the operation and maintenance monitoring data in file form to obtain original file data;

a data classification step: judging whether the lengths of all data entries of the original file data are consistent; if so, defining the data entries as protocol data; if not, defining the data entries as log data;

a data preprocessing step: preprocessing the protocol data and the log data respectively to obtain preprocessing result data;

a data display step: displaying the preprocessing result data and the original file data for operation and maintenance personnel to adjust as required.
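The data classification step can be illustrated with a minimal Python sketch. The function name `classify_entries` and the sample entries are hypothetical and do not appear in the patent; the sketch only assumes that the original file data has already been read into a list of byte strings.

```python
from typing import List


def classify_entries(entries: List[bytes]) -> str:
    """Return "protocol" if every entry has the same length, otherwise "log".

    This mirrors the length-consistency test of the data classification step.
    The name and return values are illustrative assumptions.
    """
    if not entries:
        raise ValueError("need at least one data entry")
    lengths = {len(entry) for entry in entries}
    return "protocol" if len(lengths) == 1 else "log"


# Hypothetical sample data: three fixed-length records and two text lines.
fixed = [b"\x01\x02\x03\x04", b"\x01\x02\x09\x04", b"\x01\x07\x03\x04"]
varied = [b"Boot Failure, Error State A", b"Link up on port 3"]

print(classify_entries(fixed))   # protocol
print(classify_entries(varied))  # log
```

The test looks only at entry lengths, so a learning set of log lines that happen to share one length would be treated as protocol data; a sufficiently large sample makes that unlikely.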

Preferably, the data preprocessing step comprises:

a protocol data preprocessing step: for protocol data, obtaining the section division of the data entries by comparing the common parts among all data entries, and storing the result in a Hive table;

a log data preprocessing step: for log data, calculating the longest common subsequence of the data entries according to set logic to obtain the constant sections and variable sections of the data entries, and storing them in a Hive table.

Preferably, the protocol data preprocessing step includes:

a common part extracting step: extracting each data entry of the protocol data, comparing the data entries, dividing their bytes into identical bytes and different bytes, and marking the identical bytes as the common part;

a change rate counting step: comparing the common parts of the data entries, counting the change rate of each byte position, and obtaining a section division basis according to a preset change rate percentage threshold;

a first table saving step: creating a first Hive table by using the component according to the section division basis, and storing the section-divided data in the first Hive table.

Preferably, the log-class data preprocessing step includes:

a section distinguishing step: distinguishing constants from variables among the bytes in the data entries according to set logic to obtain constant sections and variable sections;

a longest common subsequence acquiring step: calculating the longest common subsequence of each data entry according to the constant sections, recording the length of the longest common subsequence as a second length, and recording the length of the data entry as a first length;

a data structure instance creating step: comparing the second length with the first length, and if the second length exceeds half of the first length, creating a first data structure instance; otherwise, creating a second data structure instance;

a second table saving step: creating second Hive tables for the first data structure instance and the second data structure instance respectively, and storing the data, with constants and variables distinguished, in the second Hive tables.

The invention provides an auxiliary preprocessing system for IT system operation and maintenance monitoring data, which comprises:

a data acquisition module: acquiring operation and maintenance monitoring data based on Hadoop, and storing the operation and maintenance monitoring data in file form to obtain original file data;

a data classification module: judging whether the lengths of all data entries of the original file data are consistent; if so, defining the data entries as protocol data; if not, defining the data entries as log data;

a data preprocessing module: preprocessing the protocol data and the log data respectively to obtain preprocessing result data;

a data display module: displaying the preprocessing result data and the original file data for operation and maintenance personnel to adjust as required.

Preferably, the data preprocessing module comprises:

a protocol data preprocessing module: for protocol data, obtaining the section division of the data entries by comparing the common parts among all data entries, and storing the result in a Hive table;

a log data preprocessing module: for log data, calculating the longest common subsequence of the data entries according to set logic to obtain the constant sections and variable sections of the data entries, and storing them in a Hive table.

Preferably, the protocol data preprocessing module includes:

a common part extracting module: extracting each data entry of the protocol data, comparing the data entries, dividing their bytes into identical bytes and different bytes, and marking the identical bytes as the common part;

a change rate counting module: comparing the common parts of the data entries, counting the change rate of each byte position, and obtaining a section division basis according to a preset change rate percentage threshold;

a first table saving module: creating a first Hive table by using the component according to the section division basis, and storing the section-divided data in the first Hive table.

Preferably, the log-class data preprocessing module includes:

a section distinguishing module: distinguishing constants from variables among the bytes in the data entries according to set logic to obtain constant sections and variable sections;

a longest common subsequence acquiring module: calculating the longest common subsequence of each data entry according to the constant sections, recording the length of the longest common subsequence as a second length, and recording the length of the data entry as a first length;

a data structure instance creating module: comparing the second length with the first length, and if the second length exceeds half of the first length, creating a first data structure instance; otherwise, creating a second data structure instance;

a second table saving module: creating second Hive tables for the first data structure instance and the second data structure instance respectively, and storing the data, with constants and variables distinguished, in the second Hive tables.

Preferably, comparing the data entries means randomly extracting one of the data entries as a reference entry and comparing the reference entry with the other entries one by one; each pair of entries is compared byte by byte, byte positions where the two entries agree are marked as identical bytes, and byte positions where they differ are marked as different bytes;

counting the change rate of each byte position means calculating the change rate of each byte in the data entry relative to the previous byte;

the section division basis takes bytes with a high change rate as the starting bytes of sections.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention can save massive operation and maintenance monitoring data with low cost and ensure the feasibility of a data preprocessing algorithm;

2. the monitoring data is divided into sections in the preprocessing process and stored in a table form, so that the workload of manual analysis of the data by operation and maintenance personnel is greatly reduced, and the operation and maintenance personnel can obtain the data processing effect only by carrying out small amount of adjustment;

3. the preprocessing algorithm is suitable for various IT system operation and maintenance monitoring data, has universality, and reduces the dependence of operation and maintenance personnel on experts in the process of manually analyzing the data;

4. the data preprocessing algorithm is easy to implement, the complexity is low, and more accurate preprocessing results can be continuously obtained along with the increase of the number of data samples.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a detailed diagram of the protocol data parsing algorithm of the present invention;

FIG. 2 is a detailed diagram of the log-class data parsing algorithm of the present invention;

FIG. 3 is a block diagram of a system of the present invention;

FIG. 4 is a flow chart of data processing according to the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

The invention aims at various massive monitoring data received by IT system operation and maintenance monitoring software, realizes distributed storage based on a big data architecture, designs a data automatic preprocessing method on the basis, and can automatically complete section division of data entries by learning a certain amount of data samples, thereby storing the data in a form of a table according to the divided sections to assist operation and maintenance personnel in further data processing.

In the technical scheme of the invention, the overall architecture is based on a Hadoop big data infrastructure and covers data acquisition, storage, query and display. Based on its characteristics, the monitoring data is divided into protocol data and log data, and a different preprocessing method is applied to each. For protocol data, the entries are marked by comparing the common parts among all data entries, and the change rate of each byte position of the data entries is then counted, which yields the section division of the protocol data entries. For log data, the longest common subsequence of the data entries is calculated according to set logic, which clusters the data entries into several types and yields the constant pattern and variable section division of each type. A table is then created automatically according to the preprocessing result, and a copy of the corresponding data is stored in table form.

IT system operation and maintenance monitoring data can be divided into two types: protocol data and log data. For both types, the basic unit is a single data entry. Typically, each data entry of protocol data has a fixed length, while entries of log data have variable length and contain a series of constant and variable sections. Accordingly, when data samples are sufficient, protocol data entries can be divided into sections by computing the common parts among the entries and counting the change rate of each byte relative to the previous byte. Constant and variable sections can likewise be extracted from log data entries by calculating the longest common subsequence between entries.

According to the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above.

The auxiliary preprocessing system for the IT system operation and maintenance monitoring data can be realized through the step flow of the auxiliary preprocessing method for the IT system operation and maintenance monitoring data. The person skilled in the art can understand the auxiliary preprocessing method of the IT system operation and maintenance monitoring data as a preferred example of the auxiliary preprocessing system of the IT system operation and maintenance monitoring data.

Preferred embodiments of the present invention will be described below with reference to the accompanying drawings.

As shown in fig. 1, for protocol data, once a sufficiently large set of data samples has been obtained, one data entry is selected at random as the reference entry and compared with the other entries one by one. Each pair of entries is compared byte by byte: byte positions where the two entries agree are marked P, and byte positions where they differ are marked X.
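The P/X marking can be sketched in Python as follows. The function names are hypothetical, and the reference entry is chosen by an explicit index instead of at random so that the example is reproducible; that is an assumption of this sketch, not part of the described method.

```python
from typing import List


def mark_against_reference(reference: bytes, other: bytes) -> str:
    """Compare two fixed-length entries byte by byte.

    Returns a string with "P" where the bytes agree and "X" where they differ.
    """
    if len(reference) != len(other):
        raise ValueError("protocol entries are expected to have equal length")
    return "".join("P" if a == b else "X" for a, b in zip(reference, other))


def mark_all(entries: List[bytes], ref_index: int = 0) -> List[str]:
    """Mark every entry except the reference against the reference entry.

    ref_index is a hypothetical parameter; the text picks the reference at random.
    """
    reference = entries[ref_index]
    return [
        mark_against_reference(reference, entry)
        for i, entry in enumerate(entries)
        if i != ref_index
    ]


# Hypothetical 4-byte protocol entries.
entries = [b"\x01\x02\x03\x04", b"\x01\x02\x09\x04", b"\x01\x07\x03\x05"]
marks = mark_all(entries)
print(marks)  # ['PPXP', 'PXPX']
```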

After the same/different marking is complete, all comparison results are tallied by byte position, and the change rate of each position relative to the previous position is computed (except for the first position). The byte positions whose change rate ranks in the top s% (s ranges over [0, 100]; in fig. 1, s is 20, meaning the byte positions ranked within the top 20%) are then selected as the starting byte positions of the sections in the protocol data entry, which divides the data entry into sections. The value of s is configurable: the smaller the value, the coarser the section division, and the larger the value, the finer the division; operation and maintenance personnel can set it as required. Note that if s is too large, a single section of the original protocol may be split into several sections.
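One possible reading of the change rate statistic is sketched below in Python. The text does not give a formula, so this sketch assumes that the change rate of position i is the fraction of P/X comparison strings whose mark at position i differs from the mark at position i-1. The function names, the tie-breaking rule, and the sample marks are all hypothetical.

```python
import math
from typing import List


def change_rates(marks: List[str]) -> List[float]:
    """rates[i] = share of mark strings where mark i differs from mark i-1.

    Position 0 has no previous position, so its rate is fixed at 0.0.
    This definition is an assumption; the source gives no explicit formula.
    """
    n = len(marks[0])
    rates = [0.0] * n
    for i in range(1, n):
        changed = sum(1 for m in marks if m[i] != m[i - 1])
        rates[i] = changed / len(marks)
    return rates


def section_starts(rates: List[float], s_percent: float) -> List[int]:
    """Pick the top s% of positions (by change rate) as section starts.

    Position 0 always starts a section. Ties are broken by lower index,
    and positions with a zero change rate are never selected.
    """
    candidates = list(range(1, len(rates)))
    k = math.floor(len(candidates) * s_percent / 100)
    ranked = sorted(candidates, key=lambda i: (-rates[i], i))
    top = [i for i in ranked[:k] if rates[i] > 0]
    return sorted({0, *top})


def split_sections(entry: bytes, starts: List[int]) -> List[bytes]:
    """Cut one entry into sections at the given start positions."""
    bounds = starts + [len(entry)]
    return [entry[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical P/X strings for 8-byte entries.
marks = ["PPXXPPPP", "PPXXPPPP", "PPXPPPPP", "PPPPPPPP"]
rates = change_rates(marks)
starts = section_starts(rates, s_percent=30)
print(rates)                                # [0.0, 0.0, 0.75, 0.25, 0.5, 0.0, 0.0, 0.0]
print(starts)                               # [0, 2, 4]
print(split_sections(b"ABCDEFGH", starts))  # [b'AB', b'CD', b'EFGH']
```

With this reading, a larger `s_percent` admits more positions as section starts and so produces a finer division, which matches the behaviour described for s.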

After the section division basis has been obtained from the statistics, a corresponding table is created in the Hadoop big data storage environment, with table fields created according to the sections. The original protocol data is then copied into the newly created table, while the original data is retained.
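As an illustration of creating table fields from sections, the Python sketch below builds a HiveQL `CREATE TABLE` statement as a string and converts one entry into a row. The table name `proto_demo`, the `sec_<first>_<last>` column naming, and the choice to store each section as a hex string in a `STRING` column are assumptions of this sketch; the patent does not specify them. The statement is only generated here, not executed against Hive.

```python
from typing import List


def section_columns(starts: List[int], entry_len: int) -> List[str]:
    """Name one column per section after its first and last byte offsets."""
    bounds = starts + [entry_len]
    return [f"sec_{a}_{b - 1}" for a, b in zip(bounds, bounds[1:])]


def build_create_table(table: str, starts: List[int], entry_len: int) -> str:
    """Build a HiveQL CREATE TABLE statement with one STRING column per section."""
    cols = ",\n".join(
        f"  {name} STRING" for name in section_columns(starts, entry_len)
    )
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n)"


def to_row(entry: bytes, starts: List[int]) -> List[str]:
    """Split an entry at the section starts and hex-encode each section."""
    bounds = starts + [len(entry)]
    return [entry[a:b].hex() for a, b in zip(bounds, bounds[1:])]


# Hypothetical division of an 8-byte entry into sections starting at 0, 2 and 4.
ddl = build_create_table("proto_demo", [0, 2, 4], 8)
row = to_row(bytes.fromhex("0102030405060708"), [0, 2, 4])
print(ddl)
print(row)  # ['0102', '0304', '05060708']
```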

For log data, the invention computes the longest common subsequence, using a simplified version of the method proposed by Min Du and Feifei Li (Spell: Streaming Parsing of System Event Logs). Given arbitrary sequences α = {a₁, a₂, …, a_m} and β = {b₁, b₂, …, b_n}, a sequence γ that is a subsequence of both α and β is a common subsequence of α and β, and the longest common subsequence of α and β is the longest of all such γ. For example, the longest common subsequence of the sequence {A, B, C, D, E} and the sequence {A, D, E, G} is {A, D, E}.
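The longest common subsequence can be computed with the standard dynamic programming approach. The Python sketch below is a textbook implementation, not the simplified Spell-based procedure referred to above, and the function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def longest_common_subsequence(a: Sequence[T], b: Sequence[T]) -> List[T]:
    """Return one longest common subsequence of a and b (O(len(a) * len(b)))."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to reconstruct one optimal subsequence.
    result: List[T] = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


print(longest_common_subsequence(list("ABCDE"), list("ADEG")))  # ['A', 'D', 'E']
```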

As shown in fig. 2, log data is generated by print statements in a program, such as printf("Boot Failure, Error State %s Code %d Description %s\n", x, y, z). If a log data entry is treated as a sequence, the fixed constants usually account for most of the sequence, while the variable sections account for only a small part. By the definition of the longest common subsequence, if two log data entries are generated by the same print statement, the longest common subsequence of their sequences is likely to be the constant part of that print statement.

For a collection of log data entries, each entry can be split into a sequence of words using a separator, typically a space. This word sequence is then used to calculate the longest common subsequence.

A string array named log_type is constructed to store, by type, the longest common subsequence (that is, the constant pattern) of each log entry type in the learning set, together with the corresponding entries.

When there are enough log data entries in the learning set, each entry is first split into a sequence according to the separator. Two entry sequences are then selected at random, such as entry 1 and entry 2 in fig. 2, and their longest common subsequence is calculated. If the length of the resulting subsequence is not less than half the length of entry 2, entry 1 and entry 2 are considered to be of the same type; the longest common subsequence is recorded as the constant pattern of that type, and the constant pattern (the schema in fig. 2) and both entries are stored under log_type1. If the length of the longest common subsequence is less than half the length of entry 2, the two entries are considered to be of different types; two log_types are created, and the constant pattern recorded in each log_type is entry 1 or entry 2 itself, respectively.

Still taking fig. 2 as an example, after entries 1 and 2 have been processed, for entry 3 the longest common subsequence between the entry and the schema of each existing log_type is first calculated. If the length of that subsequence is not less than half the length of entry 3, entry 3 is also included in that log_type; otherwise a new log_type is created, such as log_type2 in the figure.

Thereafter, for entry 4, the longest common subsequence between the entry and the schemas of all existing log_types is likewise calculated. In fig. 2, the longest common subsequence between entry 4 and the schema of log_type2 is more than half the length of entry 4, so entry 4 is grouped into log_type2 and the schema of that log_type is updated to make it more accurate.

If the length of the longest common subsequence between the current entry and the schemas of several existing log_types exceeds half the length of the current entry, the new entry is assigned to the log_type whose longest common subsequence is longest. If several log_types tie for the maximum, the new entry is assigned to the one among them with the shortest schema.
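The classification flow for log entries can be sketched in Python as follows. This is a simplified illustration under stated assumptions: entries are processed in order instead of starting from a random pair, the match test is "not less than half the entry length" (the text uses both "not less than" and "exceeds"), and the class name `LogType`, the function names, and the sample lines are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def lcs(a: List[str], b: List[str]) -> List[str]:
    """Textbook dynamic programming longest common subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


@dataclass
class LogType:
    schema: List[str]                       # constant pattern of this type
    entries: List[str] = field(default_factory=list)


def cluster(lines: List[str]) -> List[LogType]:
    log_types: List[LogType] = []
    for line in lines:
        tokens = line.split()               # separator assumed to be whitespace
        if not tokens:
            continue
        best: Optional[LogType] = None
        best_common: List[str] = []
        for lt in log_types:
            common = lcs(lt.schema, tokens)
            if 2 * len(common) < len(tokens):
                continue                    # shorter than half the entry: no match
            longer = len(common) > len(best_common)
            tie = len(common) == len(best_common)
            if best is None or longer or (tie and len(lt.schema) < len(best.schema)):
                best, best_common = lt, common
        if best is None:
            # No existing type matches: the entry itself becomes the schema.
            log_types.append(LogType(schema=tokens, entries=[line]))
        else:
            best.schema = best_common       # refine the constant pattern
            best.entries.append(line)
    return log_types


lines = [
    "Boot Failure, Error State S1 Code 7 Description disk",
    "Boot Failure, Error State S2 Code 9 Description fan",
    "Link up on port 3",
    "Link up on port 12",
]
types = cluster(lines)
for t in types:
    print(t.schema, len(t.entries))
```

On this sample the sketch yields two types, with schemas "Boot Failure, Error State Code Description" and "Link up on port", each holding two entries.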

Using this flow, the constant patterns and variable sections of most log data can be parsed. According to the parsing result, a corresponding Hive table is created for each log_type in the Hadoop big data storage environment, and the data in the log_type is stored in the table. As shown in FIG. 2, each variable field corresponds to a column in the table, and the schema of the log_type corresponds to the comment attribute of the table.
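The mapping from a log_type to a table can be illustrated with the Python sketch below, which extracts the variable fields of one entry given its schema and builds a HiveQL `CREATE TABLE` statement whose table comment carries the schema. The `var_<n>` column names, the table name `log_type1`, and the rule that consecutive non-constant tokens form one variable field are assumptions of this sketch. The statement is only produced as a string, not run against Hive.

```python
from typing import List


def extract_variables(schema: List[str], tokens: List[str]) -> List[str]:
    """Return the variable fields of an entry, given its constant schema.

    Tokens are matched against the schema left to right; runs of tokens
    that are not part of the schema are joined into one variable field.
    """
    fields: List[str] = []
    current: List[str] = []
    k = 0  # index of the next schema token to match
    for tok in tokens:
        if k < len(schema) and tok == schema[k]:
            if current:
                fields.append(" ".join(current))
                current = []
            k += 1
        else:
            current.append(tok)
    if current:
        fields.append(" ".join(current))
    if k != len(schema):
        raise ValueError("schema is not a subsequence of the entry")
    return fields


def build_create_table(table: str, schema: List[str], n_vars: int) -> str:
    """Build a CREATE TABLE statement with the schema as the table comment."""
    cols = ",\n".join(f"  var_{i} STRING" for i in range(1, n_vars + 1))
    comment = " ".join(schema).replace("'", "\\'")
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n) COMMENT '{comment}'"


schema = ["Boot", "Failure,", "Error", "State", "Code", "Description"]
entry = "Boot Failure, Error State S1 Code 7 Description disk"
variables = extract_variables(schema, entry.split())
ddl = build_create_table("log_type1", schema, len(variables))
print(variables)  # ['S1', '7', 'disk']
print(ddl)
```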

Fig. 3 is a block diagram of the system modules according to the present invention. The whole architecture is built on a Hadoop big data infrastructure and comprises three main modules: 1) a data acquisition module; 2) a data storage module; 3) a data query and display module. The data acquisition module is responsible for receiving the various kinds of monitoring data imported by IT system operation and maintenance monitoring software. It can flexibly use Sqoop and Flume to acquire data of different types, such as structured and unstructured data, and can also receive imported data directly through Kafka. Data collected by Sqoop can be imported directly into the data storage module and stored in table form. Other data is sent to Kafka for caching, with each data category corresponding to one Topic in Kafka; Spark then consumes the data from Kafka and stores the cached data in the data storage module in batches as files, with different files corresponding to different Topics. The data storage module, namely the HDFS file system at the bottom layer of Hadoop, is responsible for storing all operation and maintenance monitoring data, including both data to be processed and processed data. The data query and display module comprises several component tools, among which Spark, Hive, Impala, Pig and the like can be used to implement the automatic preprocessing method for protocol data and log data. HBase and Phoenix can be used together to manage the processed data. Solr is a data search engine. Hue is a user interaction interface tool. In addition, Yarn and Zookeeper are responsible for coordinating and managing the various components and resources, and the API provides an interaction method for external IT system operation and maintenance monitoring programs.

Fig. 4 is a data processing flow chart of the present invention, and the whole flow is described in detail as follows:

Step 101: After a new monitoring data type is connected to Kafka, a corresponding Topic is created in Kafka to cache the data. Then proceed to step 102.

Step 102: 24 hours after the new Topic is created, the data in it is saved through Spark into a newly created corresponding file on HDFS. Thereafter, newly received data in the Topic is imported into that file on HDFS every 24 hours. Then proceed to step 103.

Step 103: Judge whether the number of data entries in the file storing the data has reached the preset number of entries. If so, the data in the file is taken as the learning set of the automatic preprocessing process and the flow proceeds to step 104; otherwise, return to step 102.

Step 104: Judge whether the lengths of all data entries in the file are consistent. If so, the data is protocol data and the flow proceeds to step 105; if not, the data is log data and the flow proceeds to step 108.

Step 105: Extract each data entry in the file, compare the common parts among the entries using Spark according to the method of the invention, count the change rate of each byte, and obtain a section division basis for the data according to the preset change rate percentage threshold. Then proceed to step 106.

Step 106: Create a corresponding table using the components in the data query and display module according to the obtained section division basis. Then proceed to step 107.

Step 107: Copy the data in the file into the created table according to the section division basis. Then proceed to step 111.

Step 108: Extract each data entry in the file, classify the log entries using Spark by repeatedly calculating the longest common subsequence according to the method of the invention, obtain the constant pattern and variable section division basis of each category, and record these contents and the corresponding entries in several log_type string arrays. Then proceed to step 109.

Step 109: Create several corresponding tables using the components in the data query and display module according to the obtained log_types. Then proceed to step 110.

Step 110: Copy the data in each log_type into the corresponding table according to the identified constant pattern and variable sections. Then proceed to step 111.

Step 111: Because all original data files are retained, operation and maintenance personnel can use the preprocessing result as a blueprint and make finer adjustments against the original data files.
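The branching in steps 103 and 104 can be summarized with a short Python sketch. The function `next_step` and the parameter `min_entries` (standing in for the preset number of entries) are hypothetical names; the return values are simply the step numbers used above.

```python
from typing import List


def next_step(entries: List[bytes], min_entries: int) -> int:
    """Return the number of the step that follows steps 103 and 104.

    102: not enough entries yet, keep importing data from the Topic.
    105: all entries have the same length, treat them as protocol data.
    108: entry lengths differ, treat them as log data.
    """
    if len(entries) < min_entries:
        return 102
    if len({len(entry) for entry in entries}) == 1:
        return 105
    return 108


# Hypothetical learning sets.
print(next_step([b"\x01\x02"], min_entries=3))                    # 102
print(next_step([b"\x01\x02", b"\x03\x04", b"\x05\x06"], 3))      # 105
print(next_step([b"Link up", b"Boot Failure", b"Link down"], 3))  # 108
```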

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. An auxiliary preprocessing method for IT system operation and maintenance monitoring data is characterized by comprising the following steps:

a data display step: displaying the preprocessing result data and the original file data for operation and maintenance personnel to adjust as required; the data preprocessing step comprises:

2. The IT system operation and maintenance monitoring data auxiliary preprocessing method according to claim 1, wherein the protocol data preprocessing step comprises:

3. The IT system operation and maintenance monitoring data auxiliary preprocessing method according to claim 1, wherein the log-class data preprocessing step comprises:

4. The IT system operation and maintenance monitoring data auxiliary preprocessing method of claim 2, wherein comparing the data entries means randomly extracting one of the data entries as a reference entry and comparing the reference entry with the other entries one by one; each pair of entries is compared byte by byte, byte positions where the two entries agree are marked as identical bytes, and byte positions where they differ are marked as different bytes;

5. An auxiliary preprocessing system for IT system operation and maintenance monitoring data, comprising:

a data display module: displaying the preprocessing result data and the original file data for operation and maintenance personnel to adjust as required; the data preprocessing module comprises:

6. The auxiliary preprocessing system for IT system operation and maintenance monitoring data according to claim 5, wherein the protocol data preprocessing module comprises:

7. The IT system operation and maintenance monitoring data auxiliary preprocessing system of claim 5, wherein the log-class data preprocessing module comprises:

8. The auxiliary preprocessing system for IT system operation and maintenance monitoring data according to claim 6, wherein comparing the data entries means randomly extracting one of the data entries as a reference entry and comparing the reference entry with the other entries one by one; each pair of entries is compared byte by byte, byte positions where the two entries agree are marked as identical bytes, and byte positions where they differ are marked as different bytes;