WO2006136055A1 - A text data mining method

Info

Publication number: WO2006136055A1
Application number: PCT/CN2005/000894
Authority: WO
Inventors: Jin Li; Xiaojin Li; Zhaoming Deng; Wenbin Tang; Meipeng Guo; Mei Xiang
Original assignee: ZTE Corporation
Priority date: 2005-06-22
Filing date: 2005-06-22
Publication date: 2006-12-28
Also published as: CN101151843B; CN101151843A

Abstract

A text data mining method comprises: reading a pre-made template file that includes at least one template variable rule; compiling the template file into a template object composed of regular expressions according to the template variable rule; scanning the text data to be mined and performing data matching according to the template object; sequentially extracting the matched original information from the text data according to the regular expression; and parsing the extracted original information into data values with the specified data name and data type according to the template variable rule. With the present invention, text data of different types can be analyzed simply by modifying the template file, without developing program code or using expensive commercial data mining tools, which reduces the complexity and cost of a communication network management system.

Description

A text data mining method

The present invention relates to data analysis and processing techniques, and in particular to a text data mining method.

Background technique

In the past decade, telecom network management technology has developed rapidly. In the field of telecommunication network management, it is often necessary to process a large amount of data in real time, most of which is text data. For example, various service-operation devices generate large numbers of alarms, performance data, and real-time billing records. These text data contain rich business information and are often an important source of profit.

These data share the following characteristics:

1. They are text-based;

2. They are massive in volume;

3. They have certain real-time processing requirements;

4. When the device or system that generates the data is stable, the data format is relatively fixed;

5. Because multiple devices or systems are usually involved, the number of data formats is very large.

Such text data is typically processed either with traditional hard-coded methods or with commercial data mining tools.

Traditional text data analysis generally relies on hard coding, which has the following problems:

1. It is inflexible: when the data format of the device or system that generates the data changes, even slightly, the code may need to be rewritten;

2. The amount of hard-coded code grows sharply with the complexity and variety of data formats. Data analysis and processing often requires tens of thousands of lines of code, and the efficiency and maintainability of that code are extremely poor;

3. Generally no data analysis algorithm is adopted, so execution efficiency is low, which makes the approach unsuitable for massive, real-time data processing.

Commercial data mining tools are also used for this processing, but they have many disadvantages:

1. Commercial data mining tools are mainly aimed at analyzing and processing data held in a database, and they are difficult to separate from the database system;

2. They are complicated to use and difficult to integrate quickly into existing applications;

3. They are very expensive.

In summary, how to process these massive amounts of data easily and efficiently and obtain valuable information from them has become an urgent problem.

Summary of the invention

The technical problem to be solved by the present invention is to provide a text data mining method with which text data in different formats can be analyzed by modifying a template file, without developing code or using expensive commercial data mining tools.

In order to solve the above technical problems, the present invention provides the following solutions:

A text data mining method includes the following steps:

Reading a pre-made template file containing at least one template variable rule;

Compiling the template file into a template object composed of regular expressions according to the template variable rule;

Scanning the text data to be mined according to the template object and performing data matching; sequentially extracting the matched original information from the text data according to the regular expression; and

Parsing the extracted original information into data values with the specified data name and data type according to the template variable rule.

In the above method, the pre-made template file is generated according to the structure of the text data to be mined and the variable rules of the template language.

In the above method, the template variable rule includes a variable name attribute and a variable type attribute.

In the above method, each template variable rule corresponds to one data item to be extracted from the text data to be mined.

In the above method, the template file is compiled into a template object composed of regular expressions by a template compiler.

In the above method, the compiler performs the following processing steps:

Scan the template file and record the template variable rules in it;

Replace the template variable rule parts of the template file with regular expressions;

Compile the generated regular expression into a regular expression object.

In the above method, scanning the template file further comprises: filtering out the annotation information in it, and masking the parts of the template file that are not template variable rules.

In the above method, masking the parts of the template file that are not template variable rules means quoting those parts using the quotation syntax of the regular expression language.

In the above method, extracting the original information from the text data further comprises: sequentially storing the extracted original information in a temporary storage area.

In the above method, the extracted original information is parsed into data values with the specified data name and data type according to the attributes of the template variable rule.

The advantages of the present invention over the prior art are:

With the method of the invention, there is no need to modify code to handle text data in different formats; only the template file needs to be modified, according to the template definition language, to adapt to a different data format, which greatly reduces the time spent on data analysis. Data information is mined with a regular-expression data matching algorithm, which is much more efficient than the traditional method. Moreover, converting the data values into the specified format reduces the difficulty of subsequent processing. The method is suitable for concurrent data mining and makes full use of the processing capability of the computer. It can also be applied quickly to systems implemented with different development tools, is simple to implement, and is low in cost.

The technical problems to be solved by the present invention, its technical points, and its advantageous effects are further described below with reference to the accompanying drawings and embodiments.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a text data mining method according to the present invention.

FIG. 2 is a schematic diagram of pre-production of a template file according to the present invention.

FIG. 3 is a schematic flowchart of a process of a compiler according to the present invention.

Figure 4 is a schematic diagram of the compiled template file.

Figure 5 is a schematic diagram of text data mining.

FIG. 6 and FIG. 7 are schematic flowcharts of an embodiment of a text data mining method according to the present invention.

BEST MODE FOR CARRYING OUT THE INVENTION

FIG. 1 is a schematic flowchart of the text data mining method according to the present invention. First, a pre-made template file containing at least one template variable rule is read (step 101). A template variable rule may include two attributes: the name of the variable and the type of the variable. Each template variable rule corresponds to one data item to be extracted from the text data to be mined. Next, according to the template variable rules, the template file is compiled into a template object composed of regular expressions (step 102); this compilation is performed by a template compiler. The text data to be mined is then scanned according to the template object and data matching is performed on it (step 103), after which the matched original information in the text data is extracted sequentially according to the regular expression (step 104). The extracted original information may be stored sequentially in a temporary storage area. Finally, the extracted original information is parsed into data values with the specified data name and data type according to the template variable rules (step 105); that is, the extracted raw text is converted to the specified data type according to the variable name and variable type in each template variable rule.

It should be noted that the pre-made template file used in the present invention is not limited to any particular template language. In other words, different template languages can be defined, and the template file written, according to the type of text data to be mined. The data mining process of the present invention uses the previously generated template file to mine the text data. To illustrate the mining process more clearly, an example of a pre-made template file is provided below.

First, templates can support annotations (comments) to make template files easier to maintain.

Annotations are explanatory text that is ignored during template compilation and use, but that is indispensable for the readability of the template.

Comment format: comment content } "

Description of the comment format: a comment is similar to a multi-line comment in the Java language. Everything from the start of the comment up to the first "}" encountered is treated as comment content.

Second, template variable rules can be defined.

Each piece of data information to be mined from the text data is defined as a template variable, and the description corresponding to this variable is called a template variable rule. A template variable rule requires at least two attributes: the variable name and the variable type.

For example, the template variable rule format can be: "${VAR[;VAR_TYPE]}"

The format of the variable name "VAR" is similar to that of a variable in a programming language: it must begin with a letter or an underscore and consist of letters, digits, and underscores.

The variable type "VAR_TYPE" is an enumerated value, which can be S, N, D, A, and so on, corresponding to string, number, date, list, and so on.

Example: "${USERNAME; S}" represents a template variable rule with a variable named "USERNAME" and a data type of string.

Here, one or more template variable rules can be defined in a template file. If no variable type is specified, the default is a string type variable. Templates automatically convert raw data information from text data into data values of the specified type.
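The rule format described above can be recognized with a short scanner. The following is a minimal sketch in Python, not the implementation described in this document: the pattern `RULE`, the function `parse_rules`, and the sample template string are illustrative assumptions. It records each variable name and type in order of appearance and falls back to the string type "S" when no type is given.

```python
import re

# Assumed pattern for "${VAR[;VAR_TYPE]}": VAR begins with a letter or an
# underscore; VAR_TYPE is optional; whitespace around ";" is tolerated.
RULE = re.compile(r"\$\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*(?:;\s*([A-Za-z]+)\s*)?\}")

def parse_rules(template):
    """Return (name, type_code) pairs in order; the type defaults to "S"."""
    return [(m.group(1), m.group(2) or "S") for m in RULE.finditer(template)]

# Hypothetical template text, used only to exercise the scanner.
rules = parse_rules("User: ${USERNAME; S} Host: ${HOST} Port: ${PORT;N}")
print(rules)
```

Keeping the rules as an ordered list matters because the capture groups of the compiled regular expression are later paired with the rules by position.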

As shown in Figure 2, the text data in this example is a real alarm message sent by a certain type of telecommunication device to the network management system. Our goal is to extract the alarm sequence number, alarm location, and so on from this text data. We therefore write a template file based on a sample of this text data. Each template variable in the template file corresponds to one piece of data information we need to extract. For example, the template variable rules for the alarm sequence number and alarm location are as follows:

Alarm sequence number: ${ALARMID; S}

The variable name of the above alarm sequence number is "ALARMID", and the variable type is string.

The alarm location is as follows:

Rack: ${Rack; N}

Chassis: ${Shelf; N}

Slot: ${Slot; N}

Here, the alarm location is composed of three template variables, namely "Rack", "Shelf", and "Slot", and the variable types are all numeric.

FIG. 3 is a schematic flowchart of the processing performed by the compiler according to the present invention. First, the template file is scanned and the template variable rules in it are recorded (step 201). During scanning, the annotation information is filtered out, and the parts of the template file that are not template variable rules are quoted using the quotation syntax of the regular expression language, which masks them. Then the template variable rule parts of the template file are replaced with regular expressions (step 202). Finally, the generated regular expression is compiled into a regular expression object (step 203).

Figure 4 is a schematic diagram of compiling the template file. The purpose of compilation is to scan a template file written in the template language and compile it into a regular expression. Figure 4 shows our implementation, which is based on the regular expression engine of the Java language. Other applications can use whichever regular expression engine suits their development language.

First, the template file is scanned: the annotation information is filtered out, and the template variable rule definitions are scanned and their content recorded. Then the parts of the template file that are not template variable rule definitions are quoted using the quotation syntax of the regular expression language, to prevent conflicts with regular expression keywords. Next, the parts of the template file defined by template variable rules are replaced with regular expressions. The replacement rules are as follows:

Strings are replaced with "(.*)"; numbers are replaced with "(\\d*)"; other types of template variables are handled analogously.

Finally, the generated regular expression is compiled into a regular expression object, which yields the template object. The compilation result is shown in Figure 4.
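The compilation steps can be sketched as follows. This is an illustrative Python version under stated assumptions, not the Java implementation shown in Figure 4: `compile_template`, `RULE`, and `GROUPS` are hypothetical names, Python's `re.escape` stands in for the regular-expression quotation mechanism, comment filtering is left out, and only the string and number replacements given in the text are covered.

```python
import re

# Assumed syntax of a template variable rule: ${NAME} or ${NAME; TYPE}.
RULE = re.compile(r"\$\{\s*([A-Za-z_]\w*)\s*(?:;\s*(\w+)\s*)?\}")

# Replacement groups taken from the text; other types are not handled here
# and would raise KeyError in this sketch.
GROUPS = {"S": "(.*)", "N": r"(\d*)"}

def compile_template(template):
    """Compile a template into (regex object, ordered [(name, type), ...])."""
    parts, rules, pos = [], [], 0
    for m in RULE.finditer(template):
        # Quote the literal text before the rule so it cannot be read as regex syntax.
        parts.append(re.escape(template[pos:m.start()]))
        name, vtype = m.group(1), m.group(2) or "S"
        rules.append((name, vtype))
        parts.append(GROUPS[vtype])  # replace the rule with a capture group
        pos = m.end()
    parts.append(re.escape(template[pos:]))  # quote the trailing literal text
    return re.compile("".join(parts)), rules

# Hypothetical one-line template; the parentheses are literal text.
pattern, rules = compile_template("Rack: ${Rack; N} (slot ${Slot; N})")
match = pattern.fullmatch("Rack: 3 (slot 12)")
print(rules, match.groups())
```

Because the literal parts are escaped, the parentheses in the template match themselves instead of opening a new capture group, which is the conflict the quoting step is meant to prevent.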

Figure 5 is a schematic diagram of text data mining. It illustrates how a template object is used to extract and mine data information from text data.

First, the text data is scanned, and the original information in the text data is extracted through the regular expression object in the template;

The raw data information is then converted to a data value of the specified type based on the template variable rule definition in the template. The text data mining results are shown in Figure 5.
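A minimal sketch of these two steps in Python is given below. The compiled pattern, the rule list, the converter table, and the sample alarm text are all assumptions made for illustration; they are not the contents of Figure 2 or Figure 5. Each match yields raw strings in order, which are then converted according to the type recorded for the corresponding rule.

```python
import re

# Assumed result of template compilation: a regex object plus the ordered
# (name, type) rules whose positions line up with the capture groups.
pattern = re.compile(r"Alarm sequence number: (.*), Rack: (\d*), Slot: (\d*)")
rules = [("ALARMID", "S"), ("Rack", "N"), ("Slot", "N")]

# Converters for the two types used here. int("") raises ValueError, so an
# empty numeric field would need extra handling in real use.
CONVERTERS = {"S": str, "N": int}

def mine(text):
    """Extract every match and convert each raw value to its declared type."""
    records = []
    for m in pattern.finditer(text):
        raw = list(m.groups())  # raw strings, held in order as temporary storage
        records.append(
            {name: CONVERTERS[vtype](value) for (name, vtype), value in zip(rules, raw)}
        )
    return records

# Hypothetical alarm text with two records.
text = (
    "Alarm sequence number: A-1001, Rack: 1, Slot: 7\n"
    "Alarm sequence number: A-1002, Rack: 2, Slot: 14\n"
)
records = mine(text)
print(records)
```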

The data mining process supports multi-threaded concurrent operations, which improves the utilization of computer resources.
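One way to picture concurrent mining is to share a single compiled pattern among several worker threads, each handling its own message. The sketch below uses Python's `concurrent.futures.ThreadPoolExecutor`; the pattern, the messages, and the function `mine_one` are hypothetical, and the document does not specify which threading model its implementation uses. In standard CPython builds the global interpreter lock limits the speedup for CPU-bound matching, so a process pool may be a better fit there.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# A compiled pattern is not modified by matching, so threads can share it.
pattern = re.compile(r"Rack: (\d*), Slot: (\d*)")

def mine_one(message):
    """Mine a single message; return None when the template does not match."""
    m = pattern.search(message)
    if m is None:
        return None
    return {"Rack": int(m.group(1)), "Slot": int(m.group(2))}

# Hypothetical incoming messages.
messages = ["Rack: 1, Slot: 7", "no alarm here", "Rack: 2, Slot: 14"]

with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map returns results in the same order as the input messages.
    results = list(pool.map(mine_one, messages))

print(results)
```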

FIG. 6 and FIG. 7 are schematic flowcharts of an embodiment of the text data mining method according to the present invention. First, a template file generated according to the structure of the text data to be mined and the variable rules of the template language is read (step 301). Next, the template file is scanned, the template annotation information is filtered out, the template variable rules are recorded, and the parts that are not template variable rules are quoted using the quotation syntax of the regular expression language so that they are masked (step 302). The template variable rule parts of the template file are then replaced with regular expressions (step 303), and the generated regular expression is compiled into a regular expression object (step 304). Then, according to the template object, the text data to be mined is scanned and data matching is performed on it (step 305); the matched original information in the text data is extracted sequentially according to the regular expression (step 306); and the extracted original information is parsed into data values with the specified data name and data type according to the template variable rules (step 307). Finally, it is determined whether all the text data to be mined has been processed (step 308): if so, the process ends; if not, the process returns to step 305.

The text data mining method of the present invention is not limited to the applications listed in the specification and the embodiments; it can be applied to various fields suitable for the present invention. Other advantages and modifications can readily be made by those skilled in the art. Therefore, without departing from the scope of the claims and their equivalents, the present invention is not limited to the specific details, representative devices, and illustrated examples shown and described herein.

Claims

1. A text data mining method,

characterized by comprising the following steps:

Reading a pre-made template file containing at least one template variable rule;

Compiling the template file into a template object composed of regular expressions according to the template variable rule;

Scanning the text data to be mined according to the template object, and performing data matching on the text data; sequentially extracting the matched original information from the text data according to the regular expression; and

Parsing the extracted original information into data values with the specified data name and data type according to the template variable rule.

2. The method of claim 1,

characterized in that the pre-made template file is generated according to the structure of the text data to be mined and the variable rules of the template language.

3. The method of claim 1,

characterized in that the template variable rule includes a variable name attribute and a variable type attribute.

4. The method of claim 1,

characterized in that each template variable rule corresponds to one data item to be extracted from the text data to be mined.

5. The method of claim 1,

characterized in that the template file is compiled into a template object composed of regular expressions by a template compiler.

6. The method of claim 4,

characterized in that the compiler performs the following processing steps:

Scan the template file and record the template variable rules in it;

Replace the template variable rule parts of the template file with regular expressions; and

Compile the generated regular expression into a regular expression object.

7. The method of claim 5,

characterized in that scanning the template file further comprises: filtering out the annotation information in it, and masking the parts of the template file that are not template variable rules.

8. The method of claim 7,

characterized in that masking the parts of the template file that are not template variable rules means quoting those parts using the quotation syntax of the regular expression language.

9. The method of claim 1,

characterized in that extracting the matched original information from the text data further comprises: sequentially storing the extracted original information in a temporary storage area.

10. The method of claim 2,

characterized in that the extracted original information is parsed into data values with the specified data name and data type according to the attributes of the template variable rule.