WO2006136055A1 - A text data mining method - Google Patents
A text data mining method
- Publication number
- WO2006136055A1 (PCT/CN2005/000894)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- template
- data
- variable
- text data
- regular expression
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A text data mining method comprises: reading a pre-made template file that includes at least one template variable rule; compiling the template file, according to the template variable rule, into a template object composed of regular expressions; scanning the text data to be mined and matching it against the template object; sequentially extracting the matched original information from the text data according to the regular expression; and parsing the extracted original information, according to the template variable rule, into data values with the specified data name and data type. With this method, text data in different formats can be analyzed simply by modifying the template file, without developing program code or using expensive commercial data mining tools, which reduces the complexity and cost of a communication network management system.
Description
A text data mining method
Technical field
The present invention relates to data analysis and processing techniques, and in particular to a text data mining method.
Background
Over the past decade or so, telecommunication network management technology has developed rapidly. Telecommunication network management often requires processing large amounts of data in real time, most of it text data. For example, service and operations equipment generates large volumes of alarms, performance data, and real-time call records (billing data). This text data contains rich business information and is often an important source of profit.
These data share the following characteristics:
1. They consist mainly of text data;
2. The volume of data is massive;
3. There are real-time processing requirements;
4. Once the device or system generating the data is stable, the data format is relatively fixed;
5. The data often come from many kinds of devices or systems, so there are a great many data formats.
Such text data is generally processed either with traditional hard-coded programs or with commercial data mining tools.
Traditional text data analysis generally relies on hard coding, which has the following problems:
1. It is inflexible: when the data format of the generating device or system changes, even slightly, the code may need to be rewritten;
2. The amount of hard-coded code grows sharply with the complexity and variety of data formats; tens of thousands of lines of code are often needed for data analysis, and the efficiency and maintainability of that code are very poor;
3. Data analysis algorithms are generally not used, so execution is inefficient and unsuited to processing massive, real-time data.
Commercial data mining tools are also widely used at present, but they have the following drawbacks:
1. They are designed mainly to analyze data held in a database, so the tools are hard to separate from the database system;
2. They are complicated to use and hard to integrate quickly into existing applications;
3. They are very expensive.
In summary, how to process this massive data simply and efficiently, and obtain valuable information from it, has become a problem that urgently needs solving.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text data mining method. Text data in different formats can be analyzed simply by modifying a template file, without developing program code or using expensive commercial data mining tools.
To solve the above technical problem, the present invention provides the following solution:
A text data mining method includes the following steps:
reading a pre-made template file containing at least one template variable rule;
compiling the template file, according to the template variable rule, into a template object composed of regular expressions;
scanning the text data to be mined according to the template object and performing data matching on it;
sequentially extracting the matched original information from the text data according to the regular expression; and
parsing the extracted original information, according to the template variable rule, into data values with a specified data name and data type.
In the method of the present invention, the pre-made template file is generated according to the structure of the text data to be mined and the variable rules of the template language.
In the method of the present invention, the template variable rule includes a variable name attribute and a variable type attribute.
In the method of the present invention, each template variable rule corresponds to one data item to be extracted from the text data to be mined.
In the method of the present invention, the template file is compiled into the template object composed of regular expressions by a template compiler.
In the method of the present invention, the compiler performs the following processing steps:
scanning the template file and recording the template variable rules in it;
replacing the template variable rule parts of the template file with regular expressions; and
compiling the generated regular expression into a regular expression object.
In the method of the present invention, scanning the template file further includes filtering out the comment information in it and masking the parts of the template file that are not template variable rules.
In the method of the present invention, the parts of the template file that are not template variable rules are masked by quoting them with the quotation construct of the regular expression syntax.
In the method of the present invention, extracting the matched original information from the text data further includes storing the extracted original information sequentially in a temporary storage area.
In the method of the present invention, the extracted original information is parsed into data values with a specified data name and data type according to the attributes of the template variable rule.
Compared with the prior art, the present invention has the following advantages:
With the method of the present invention, no code needs to be modified to handle text data in different formats; modifying the template file according to the template definition language is enough to adapt to a new data format, which greatly reduces the development time spent on data analysis. Mining data with a regular-expression matching algorithm executes much more efficiently than the traditional approach. Converting the extracted information into data values of specified types reduces the difficulty of subsequent processing. The method suits concurrent data mining and makes full use of the computer's processing capacity. It can also be applied quickly to systems built with different development tools, and it is simple and inexpensive to implement.
The technical problem to be solved, the key points of the technical solution, and the beneficial effects of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Brief description of the drawings
FIG. 1 is a schematic flowchart of the text data mining method of the present invention.
FIG. 2 is a schematic diagram of preparing a template file according to the present invention.
FIG. 3 is a schematic flowchart of the compiler processing of the present invention.
FIG. 4 is a schematic diagram of compiling a template file.
FIG. 5 is a schematic diagram of text data mining.
FIG. 6 and FIG. 7 are schematic flowcharts of an embodiment of the text data mining method of the present invention.
Best mode for carrying out the invention
FIG. 1 is a schematic flowchart of the text data mining method of the present invention. First, a pre-made template file containing at least one template variable rule is read (step 101). A template variable rule may include two attributes: the variable name and the variable type. Each template variable rule corresponds to one data item to be extracted from the text data to be mined. Next, according to the template variable rules, the template file is compiled into a template object composed of regular expressions (step 102); this compilation is performed by a template compiler. The text data to be mined is then scanned according to the template object and matched against it (step 103). After that, the matched original information is extracted sequentially from the text data according to the regular expression (step 104); the extracted original information may be stored sequentially in a temporary storage area. Finally, according to the template variable rules, the extracted original information is parsed into data values with the specified data names and data types (step 105); the parsing follows the variable names and variable types given in the template variable rules.
It should be noted that the pre-made template file used in the present invention is not limited to any particular template language. A different template language may be defined, and the template file written in it, according to the type of text data actually to be mined; the data mining process of the present invention simply uses the previously generated template file to mine the text data. To illustrate the mining process more clearly, an example of preparing a template file in advance is given below.
First, a template can support comments, which make the template file easier to maintain.
A comment is explanatory text that is ignored when the template is compiled and used, but that is indispensable for the readability of the template.
Comment format: the comment content, closed by "}" (the opening delimiter is missing from this copy of the text).
Explanation of the comment format: a comment is similar to a multi-line comment in the Java language. Everything from the opening delimiter up to the first "}" encountered is treated as comment content.
Second, template variable rules can be defined.
Each piece of data to be mined from the text data is defined as a template variable, and the specification that corresponds to the variable is called a template variable rule. A template variable rule needs at least two attributes: the variable name and the variable type.
For example, the template variable rule format can be: "${VAR[;VAR_TYPE]}"
Here the variable name "VAR" follows the format of a variable definition in a programming language: it must begin with a letter or an underscore and consist of letters, digits, and underscores.
The variable type "VAR_TYPE" is a value of an enumerated type, such as S, N, D, or A, which correspond to the string, number, date, and list types respectively; other types may be defined in the same way.
Example: "${USERNAME;S}" denotes a template variable rule whose variable name is "USERNAME" and whose data type is string.
One or more template variable rules can be defined in a template file. If no variable type is specified, the variable defaults to the string type. The template can automatically convert the original data information in the text data into data values of the specified types.
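The rule format described above can be illustrated with a small parser. This is a minimal sketch in Java, not the patent's own implementation: the class name `TemplateRule`, its fields, and the tolerance for whitespace around the semicolon are assumptions made for illustration. It reads one rule of the form "${VAR[;VAR_TYPE]}" and applies the default string type when no type is given.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One template variable rule: a variable name plus a one-letter type code (hypothetical class). */
final class TemplateRule {
    // Matches ${NAME} or ${NAME;TYPE}. Whitespace around ';' is tolerated (an assumption).
    private static final Pattern RULE = Pattern.compile(
            "\\$\\{\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*(?:;\\s*([A-Za-z])\\s*)?\\}");

    final String name;
    final char type; // e.g. 'S' string, 'N' number, 'D' date, 'A' list

    TemplateRule(String name, char type) {
        this.name = name;
        this.type = type;
    }

    /** Parses a single rule; the type defaults to 'S' (string) when it is omitted. */
    static TemplateRule parse(String text) {
        Matcher m = RULE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a template variable rule: " + text);
        }
        String typeCode = m.group(2);
        char type = (typeCode == null) ? 'S' : typeCode.charAt(0);
        return new TemplateRule(m.group(1), type);
    }
}
```

For instance, `TemplateRule.parse("${USERNAME;S}")` yields a rule named `USERNAME` of type `S`, and a rule written without a type code falls back to `S`.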
As shown in FIG. 2, the text data in this example is a real alarm message reported to a network management system by a certain type of telecommunication device. The goal is to extract information such as the alarm sequence number and the alarm location from this text data. A template file is then written based on this sample text. Each template variable in the template file corresponds to one piece of data to be extracted. For example, the template variable rules for the alarm sequence number and the alarm location are as follows:
Alarm sequence number: ${ALARMID;S}
The variable name for the alarm sequence number is "ALARMID" and the variable type is string.
The alarm location is as follows:
Rack: ${Rack;N}
Shelf: ${Shelf;N}
Slot: ${Slot;N}
Here the alarm location consists of three template variables, "Rack", "Shelf", and "Slot", all of numeric type.
FIG. 3 is a schematic flowchart of the compiler processing of the present invention. First, the template file is scanned and the template variable rules in it are recorded (step 201). During this scan, the comment information is filtered out, and the parts of the template file that are not template variable rules are masked by quoting them with the quotation construct of the regular expression syntax. Next, the template variable rule parts of the template file are replaced with regular expressions (step 202). Finally, the generated regular expression is compiled into a regular expression object (step 203).
FIG. 4 is a schematic diagram of compiling a template file. The purpose of compiling a template file is to scan a template file written in the template language and compile it into a regular expression. FIG. 4 shows one implementation based on the regular expression engine of the Java language; for other applications, it can be adapted to the development language and regular expression engine in use.
First, the template file is scanned for comments, and the comment information in it is filtered out.
Next, the template variable rule definitions are scanned and their contents are recorded.
Then, the parts of the template file that are not template variable rule definitions are quoted using the quotation construct of the regular expression syntax, which prevents them from conflicting with keywords in the regular expression.
After that, the parts of the template file that are template variable rule definitions are replaced with regular expressions.
The replacement rules are as follows:
A string is replaced with "(.*)"; a number is replaced with "(\\d*)"; other types of template variables are handled in the same way.
Finally, the generated regular expression is compiled into a regular expression object, which completes the template object. The compilation result is shown in FIG. 4.
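The compilation steps above can be sketched in Java as follows. This is an illustrative sketch under stated assumptions, not the implementation shown in FIG. 4: the class name `TemplateObject` and its fields are hypothetical, comment filtering is omitted because the opening comment delimiter is not recoverable from the text, and only the two replacement rules the text gives (string and number) are supported. Literal text is quoted with `Pattern.quote`, which is the standard Java way to make regular expression special characters inert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** A compiled template: one regular expression plus the recorded rules, in order (hypothetical class). */
final class TemplateObject {
    // Finds rules of the form ${NAME} or ${NAME;TYPE} inside the template text.
    private static final Pattern RULE = Pattern.compile(
            "\\$\\{\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*(?:;\\s*([A-Za-z])\\s*)?\\}");

    final Pattern pattern;
    final List<String> names;
    final List<Character> types;

    private TemplateObject(Pattern pattern, List<String> names, List<Character> types) {
        this.pattern = pattern;
        this.names = names;
        this.types = types;
    }

    static TemplateObject compile(String template) {
        StringBuilder regex = new StringBuilder();
        List<String> names = new ArrayList<>();
        List<Character> types = new ArrayList<>();
        Matcher m = RULE.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote the literal text before this rule so it cannot act as regex syntax.
            if (m.start() > last) {
                regex.append(Pattern.quote(template.substring(last, m.start())));
            }
            char type = (m.group(2) == null) ? 'S' : m.group(2).charAt(0);
            regex.append(groupFor(type)); // replace the rule with a capturing group
            names.add(m.group(1));        // record the rule for later parsing
            types.add(type);
            last = m.end();
        }
        if (last < template.length()) {
            regex.append(Pattern.quote(template.substring(last)));
        }
        return new TemplateObject(Pattern.compile(regex.toString()), names, types);
    }

    // Replacement rules stated in the text; other type codes are left unsupported here.
    private static String groupFor(char type) {
        switch (type) {
            case 'S': return "(.*)";
            case 'N': return "(\\d*)";
            default: throw new IllegalArgumentException("Unsupported type code: " + type);
        }
    }
}
```

Because each rule becomes exactly one capturing group, group i of a match corresponds to the i-th recorded rule, which is what lets the later parsing step pair raw values with names and types.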
FIG. 5 is a schematic diagram of text data mining. It shows the result of using a template object to extract and mine the data information in a piece of text data.
First, the text data is scanned, and the original information in it is extracted by the regular expression object in the template.
Then, according to the template variable rule definitions in the template, the original data information is converted into data values of the specified types. The text data mining result is shown in FIG. 5.
The data mining process supports multi-threaded concurrent operation, which improves the utilization of computer resources.
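The two mining steps, extraction of raw strings followed by conversion to typed values, might look like the following Java sketch. The class `RecordMiner`, its method names, and the `names`/`types` arrays are hypothetical; only the string and number types are converted, and an empty numeric capture is mapped to `null` as an assumption. In Java, a compiled `Pattern` is immutable and can be shared between threads, while each thread should create its own `Matcher`, which fits the concurrent operation mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts raw values with a compiled pattern, then converts them by type code (hypothetical class). */
final class RecordMiner {
    /** Matches one record and copies the captured raw strings, in order, into a temporary list. */
    static List<String> extractRaw(Pattern pattern, String text) {
        List<String> raw = new ArrayList<>();
        Matcher m = pattern.matcher(text); // one Matcher per call; the Pattern itself can be shared
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                raw.add(m.group(i));
            }
        }
        return raw;
    }

    /** Converts each raw string according to the type code of the rule at the same position. */
    static Map<String, Object> parse(List<String> raw, String[] names, char[] types) {
        Map<String, Object> values = new LinkedHashMap<>();
        for (int i = 0; i < raw.size(); i++) {
            String s = raw.get(i).trim();
            Object value = s; // 'S' and any unhandled type stay as strings in this sketch
            if (types[i] == 'N') {
                value = s.isEmpty() ? null : Long.valueOf(s);
            }
            values.put(names[i], value);
        }
        return values;
    }
}
```

The intermediate list plays the role of the temporary storage area: the raw strings are kept in match order until they are converted.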
Figures 6 and 7 are schematic flowcharts of an embodiment of the text data mining method of the present invention. First, the template file, generated according to the structure of the text data to be mined and the template language variable rules, is read (step 301). Then the template file is scanned, the template comment information is filtered out, and the template variable rules are recorded; the parts of the template file that are not template variable rules are masked by quoting them with the quotation construct of the regular expression syntax (step 302). The template variable rule parts of the template file are then replaced with regular expressions (step 303), and the generated regular expression is compiled into a regular expression object (step 304). Next, the text data to be mined is scanned according to the template object and data matching is performed on it (step 305). The matched original information in the text data is then extracted in order according to the regular expression (step 306), and the extracted original information is parsed, according to the template variable rules, into data values with the specified data names and data types (step 307). Finally, it is determined whether all of the text data to be mined has been processed; if so, the process ends, and if not, the process returns to step 305 (step 308).
The text data mining method of the present invention is not limited to the applications listed in the specification and the embodiments. It can be applied in any field suited to the invention, and those skilled in the art can readily realize additional advantages and make modifications. Therefore, without departing from the spirit and scope of the general concept defined by the claims and their equivalents, the invention is not limited to the specific details, representative devices, and illustrated examples shown and described herein.
Claims
1. A text data mining method, characterized by comprising the following steps:
reading a pre-made template file containing at least one template variable rule;
compiling the template file, according to the template variable rule, into a template object composed of a regular expression;
scanning the text data to be mined according to the template object, and performing data matching on it;
extracting, in order, the matched original information from the text data according to the regular expression; and
parsing the extracted original information, according to the template variable rule, into data values with a specified data name and data type.
2. The method of claim 1, characterized in that the pre-made template file is generated according to the structure of the text data to be mined and the template language variable rules.
3. The method of claim 1, characterized in that the template variable rule comprises a variable name attribute and a variable type attribute.
4. The method of claim 1, characterized in that each template variable rule corresponds to one data item to be extracted from the text data to be mined.
5. The method of claim 1, characterized in that the compiling of the template file into a template object composed of a regular expression is performed by a template compiler.
6. The method of claim 4, characterized in that the processing steps of the compiler are as follows:
scanning the template file and recording the template variable rules in it;
replacing the template variable rule parts of the template file with regular expressions; and
compiling the generated regular expression into a regular expression object.
7. The method of claim 5, characterized in that scanning the template file further comprises: filtering out the comment information in it, and masking the parts of the template file that are not template variable rules.
8. The method of claim 7, characterized in that the masking of the parts of the template file that are not template variable rules is performed by quoting those parts with the quotation construct of the regular expression syntax.
9. The method of claim 1, characterized in that extracting the matched original information from the text data further comprises: storing the extracted original information, in order, in a temporary storage area.
10. The method of claim 2, characterized in that the parsing of the extracted original information into data values with a specified data name and data type is performed according to the attributes of the template variable rule.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2005800493417A CN101151843B (en) | 2005-06-22 | 2005-06-22 | Text data digging method |
PCT/CN2005/000894 WO2006136055A1 (en) | 2005-06-22 | 2005-06-22 | A text data mining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2005/000894 WO2006136055A1 (en) | 2005-06-22 | 2005-06-22 | A text data mining method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2006136055A1 (en) | 2006-12-28 |
Family
ID=37570080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2005/000894 WO2006136055A1 (en) | 2005-06-22 | 2005-06-22 | A text data mining method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101151843B (en) |
WO (1) | WO2006136055A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095745A (en) * | 2016-05-27 | 2016-11-09 | 厦门市美亚柏科信息股份有限公司 | Transaction record extracting method based on log and system thereof |
CN109726284A (en) * | 2018-12-07 | 2019-05-07 | 成都品果科技有限公司 | A kind of versatile data analysing method |
CN111291547A (en) * | 2020-01-20 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Template generation method, device, equipment and medium |
CN111569427A (en) * | 2020-06-10 | 2020-08-25 | 网易(杭州)网络有限公司 | Resource processing method and device, storage medium and electronic device |
US11714849B2 (en) | 2021-08-31 | 2023-08-01 | Alibaba Damo (Hangzhou) Technology Co., Ltd. | Image generation system and method |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609984B (en) * | 2008-06-16 | 2012-08-29 | 上海申瑞电力科技股份有限公司 | Fast aided modeling method for supervisory control and system |
CN104731555A (en) * | 2013-12-23 | 2015-06-24 | 中兴通讯股份有限公司 | Method and device for avoiding conflict among registers |
CN105739947A (en) * | 2014-12-10 | 2016-07-06 | 中兴通讯股份有限公司 | Register conflict detection method and apparatus |
CN108279883B (en) * | 2016-12-30 | 2021-11-26 | 北京京东尚科信息技术有限公司 | Configurable feature calculation method and system |
CN112580298B (en) * | 2019-09-29 | 2024-05-07 | 大众问问(北京)信息科技有限公司 | Method, device and equipment for acquiring annotation data |
CN111880838B (en) * | 2020-08-03 | 2024-04-12 | 北京神舟航天软件技术有限公司 | Data analysis method based on template matching technology |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2002025564A1 (en) * | 2000-09-25 | 2002-03-28 | Kent Ridge Digital Labs | A system, method and interface for building biological databases using templates |
CN1492336A (en) * | 2003-09-04 | 2004-04-28 | 上海格尔软件股份有限公司 | Information system auditing method based on data storehouse |
US20050027710A1 (en) * | 2003-07-30 | 2005-02-03 | International Business Machines Corporation | Methods and apparatus for mining attribute associations |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5692107A (en) * | 1994-03-15 | 1997-11-25 | Lockheed Missiles & Space Company, Inc. | Method for generating predictive models in a computer system |
-
2005
- 2005-06-22 CN CN2005800493417A patent/CN101151843B/en active Active
- 2005-06-22 WO PCT/CN2005/000894 patent/WO2006136055A1/en active Application Filing
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106095745A (en) * | 2016-05-27 | 2016-11-09 | 厦门市美亚柏科信息股份有限公司 | Transaction record extracting method based on log and system thereof |
CN109726284A (en) * | 2018-12-07 | 2019-05-07 | 成都品果科技有限公司 | A kind of versatile data analysing method |
CN111291547A (en) * | 2020-01-20 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Template generation method, device, equipment and medium |
CN111291547B (en) * | 2020-01-20 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Template generation method, device, equipment and medium |
CN111569427A (en) * | 2020-06-10 | 2020-08-25 | 网易(杭州)网络有限公司 | Resource processing method and device, storage medium and electronic device |
CN111569427B (en) * | 2020-06-10 | 2023-04-25 | 网易(杭州)网络有限公司 | Resource processing method and device, storage medium and electronic device |
US11714849B2 (en) | 2021-08-31 | 2023-08-01 | Alibaba Damo (Hangzhou) Technology Co., Ltd. | Image generation system and method |
Also Published As
Publication number | Publication date |
---|---|
CN101151843B (en) | 2010-05-12 |
CN101151843A (en) | 2008-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112149399B (en) | Table information extraction method, device, equipment and medium based on RPA and AI | |
JP2000148461A (en) | Software model and existing source code synchronizing method and device | |
US7792851B2 (en) | Mechanism for defining queries in terms of data objects | |
CN115543402B (en) | Software knowledge graph increment updating method based on code submission | |
Dean et al. | Using design recovery techniques to transform legacy systems | |
Neubauer et al. | XMLText: from XML schema to Xtext | |
WO2006136055A1 (en) | A text data mining method | |
CN109299074A (en) | A kind of data verification method and system based on templating data base view | |
US20030200534A1 (en) | Mechanism for reformatting a simple source code statement into a compound source code statement | |
CN111124380A (en) | Front-end code generation method | |
CN113608903A (en) | Fault management method based on XML language | |
JP4086253B1 (en) | XML document processing method and processing program | |
CN108241658A (en) | A kind of logging mode finds method and system | |
CN112506488A (en) | Method for generating programming language class based on sql creating statement | |
CN109325217B (en) | File conversion method, system, device and computer readable storage medium | |
Ballance et al. | Grammatical abstraction and incremental syntax analysis in a language-based editor | |
CN113326261B (en) | Data blood relationship extraction method and device and electronic equipment | |
CN113971044A (en) | Component document generation method, device, equipment and readable storage medium | |
JP2006011756A (en) | Program converting program, program converting device and program converting method | |
CN110222169A (en) | A kind of visualized data processing resolution system and its processing method | |
KR100762712B1 (en) | Method for transforming of electronic document based on mapping rule and system thereof | |
CN115203494A (en) | Text-oriented time information extraction method and device | |
CN107577476A (en) | A kind of Android system source code difference analysis method, server and medium based on Module Division | |
CN110515653A (en) | Document structure tree method, apparatus, electronic equipment and computer readable storage medium | |
CN116962407B (en) | Distributed link label processing method and device, distributed link tracking system and distributed system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 200580049341.7; Country of ref document: CN |
NENP | Non-entry into the national phase | Ref country code: DE |
WWW | Wipo information: withdrawn in national office | Country of ref document: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 05754937; Country of ref document: EP; Kind code of ref document: A1 |