CN110598203B

CN110598203B - Method and device for extracting entity information of military design document combined with dictionary

Info

Publication number: CN110598203B
Application number: CN201910653281.3A
Authority: CN
Inventors: 蒋序平; 鲁义威; 杨若鹏; 张建军; 卢稳新; 朱巍; 刘乾
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2019-07-19
Filing date: 2019-07-19
Publication date: 2023-08-01
Anticipated expiration: 2039-07-19
Also published as: CN110598203A

Abstract

The invention discloses a method and a device for extracting military design document entity information in combination with a dictionary. The method comprises the following steps: 1. preprocessing, namely establishing a military design corpus and a military design entity dictionary; 2. constructing a military design dictionary and a word vector matrix; 3. determining 14 military design entity types and their semantic description rules, selecting corpus for labeling, establishing a training corpus and a testing corpus respectively, and preparing for model training; 4. establishing an entity information extraction model and training its parameters; 5. extracting military design entity information from the military design text data to be predicted. The method effectively addresses problems such as insufficient manually constructed features and strong dependence on word segmentation in military design entity information extraction, thereby improving extraction efficiency.

Description

Method and device for extracting entity information of military design document combined with dictionary

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a military-design-oriented entity information extraction method and device.

Background

Military design text is descriptive text that sets out assumptions and hypotheses about the intentions, situations, and course of combat of the two opposing sides. Military design text entity information is a basic information element of military design data and the basis for extracting, processing, and analyzing military design text data. Entity information extraction aims to find the entities hidden in unstructured and semi-structured military design text and to extract them by suitable means.

At present, named entity recognition methods in the general domain mainly comprise rule-based methods, methods based on statistics and machine learning, and methods based on deep learning. Rule-based methods achieve high accuracy, but their coverage and portability are limited and their development cost is high; methods based on statistics and machine learning have low development cost but depend strongly on feature engineering and Chinese word segmentation; methods based on deep learning achieve high precision and strong portability, but constructing word vectors still requires word segmentation, and they place high demands on computing capacity and corpus scale.

In military design entity information extraction, methods based on rules and dictionaries are popular for extracting semantic entities from military design text data. A Conditional Random Field (CRF) model can also be used to learn text features and identify entity information in a scenario, and multi-model methods (CRF combined with rules, or CRF combined with dictionaries and rules) can likewise be used. These traditional methods are targeted, but their recognition performance and extensibility are somewhat lacking; they struggle to keep pace with rapidly changing military design information and cannot meet the requirements of automated, intelligent processing of massive data.

Currently, military design entity information extraction mainly faces the following problems:

1) Across different scenarios, entities appear in a large number of combined, nested, and abbreviated forms;

2) Because language styles and habits differ between scenarios, some entity types are very numerous, their name forms are complex and variable, and no strict unified rule exists, so comprehensive and reasonable entity features are difficult to construct;

3) Existing word segmentation tools are mainly suited to the general domain, and their accuracy on military design text data is low. Scenario-specific technical terms in particular are rare in the general domain, and even when a scenario dictionary is added it is difficult to cover all scenario entities, so methods that depend strongly on word segmentation struggle to break through the current bottleneck in recognition performance.

Disclosure of Invention

In view of practical problems such as the complexity of military design data and the difficulty of collecting it manually, the invention aims to overcome the shortcomings of the prior art. It establishes a military design entity dictionary based on authoritative dictionaries in the military field, establishes a training corpus and a testing corpus by determining 14 military design entity types and their semantic description rules, trains an entity information extraction model, and thereby provides a method and a device for extracting military design document entity information in combination with a dictionary.

In order to achieve the above purpose, the invention adopts the following technical scheme:

A method for extracting military design document entity information in combination with a dictionary, said method comprising the following steps:

S1, data preprocessing: preprocessing military design document data and establishing a military design entity dictionary, specifically comprising the following steps:

S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish the corpus;

the Chinese sentence-breaking symbols include "。", "！" and similar symbols;

S1.2, military design entity dictionary establishment: selecting authoritative in-field dictionaries according to the fields involved in military design, collecting proper nouns from them, establishing the military design entity dictionary organized by type into a military service word stock, a weapon equipment word stock, a facility word stock and an operation word stock, and analyzing and marking the semantic structure of each entity;

an authoritative in-field dictionary is a publicly published and widely accepted dictionary in the field, including but not limited to the Chinese Military Encyclopedia, the Military Dictionary, and the Concise Military Dictionary.
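
Step S1.1 can be illustrated with a minimal sketch. The noise pattern below is hypothetical, since the text does not list the meaningless symbols, and the inclusion of "？" among the sentence-breaking symbols is likewise an assumption; `build_corpus` is an illustrative name, not part of the invention.

```python
import re

# Hypothetical noise symbols; the text does not enumerate them.
NOISE_PATTERN = re.compile(r"[|\t\r\u3000]+")
# Split points directly after sentence-ending punctuation.
# "？" is an assumption; "。" and "！" appear in the text.
SENTENCE_END = re.compile(r"(?<=[。！？])")


def build_corpus(raw_text: str) -> list[str]:
    """Remove noise symbols and split the text into sentences."""
    cleaned = NOISE_PATTERN.sub("", raw_text)
    sentences = []
    for line in cleaned.splitlines():
        for piece in SENTENCE_END.split(line):
            piece = piece.strip()
            if piece:  # drop the empty tail produced by the final split
                sentences.append(piece)
    return sentences


corpus = build_corpus("红军|进攻。蓝军防御！\n演习结束。")
```

Each sentence keeps its closing punctuation, so the corpus stays aligned with the original characters when labels are attached later.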

S2, word vector generation: constructing a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, specifically comprising the following steps:

S2.1, character statistics: counting all characters appearing in the military design entity dictionary and the authoritative in-field dictionaries, assigning a numeric index to each character to obtain the military design dictionary, and recording the total word count Z of the corpus dictionary;

S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector generation tool to obtain a multidimensional military design word vector matrix;

open-source word vector generation tools include, but are not limited to, word2vec and GloVe.
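
The character indexing of step S2.1 might look like the following sketch. The function name, the sample sentences and the entity entries are all hypothetical; training the vectors themselves (step S2.2, for example with word2vec) is not shown.

```python
def build_char_index(*text_sources):
    """Assign a numeric index to every distinct character, in first-seen order."""
    char_to_index = {}
    for source in text_sources:
        for text in source:
            for ch in text:
                if ch not in char_to_index:
                    char_to_index[ch] = len(char_to_index)
    return char_to_index


# Hypothetical inputs for illustration only.
sample_sentences = ["红军进攻。", "蓝军防御！"]
sample_entities = ["红军", "蓝军", "坦克"]

char_to_index = build_char_index(sample_sentences, sample_entities)
Z = len(char_to_index)  # total number of indexed characters
```

Indexing by character rather than by segmented word is what removes the dependence on a word segmentation tool at this stage.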

S3, corpus labeling: combining the authoritative dictionaries and the corpus to determine complete definition rules for military design entity types, selecting corpus for labeling, establishing a training corpus and a testing corpus respectively, and preparing for model training, specifically comprising the following steps:

S3.1, determining military design entity types and semantic description rules: combining the authoritative dictionaries, analyzing the corpus content and consulting several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions and numeric expressions;

S3.2, word label generation: using manual labeling of word attributes together with automatic generation of character labels, assigning a unified label to each word in the preprocessed data from step S1, sentence by sentence;

S3.3, character label generation: using an open-source toolkit and a specific labeling scheme to generate character labels for the labeled text;

open-source toolkits include, but are not limited to, YEDDA and brat;

specific labeling schemes include, but are not limited to, BIO and BIEOS. In the BIO scheme, the B label marks the first character of a word, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO scheme, the B label marks the first character of a word, the I label marks a character inside a word, the E label marks the last character of a word, the W label marks a single-character word, and the O label marks a non-entity character.
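
As a rough illustration of steps S3.2 and S3.3 under the BIO scheme, the sketch below expands word-level labels into character-level tags. The type names "ORG" and "EQP" and the sample sentence are hypothetical; the 14 entity types themselves are not reproduced here.

```python
def words_to_char_tags(labeled_words):
    """Expand (word, entity_type or None) pairs into per-character BIO tags."""
    chars, tags = [], []
    for word, entity_type in labeled_words:
        for position, ch in enumerate(word):
            chars.append(ch)
            if entity_type is None:
                tags.append("O")                  # non-entity character
            elif position == 0:
                tags.append("B-" + entity_type)   # first character of the word
            else:
                tags.append("I-" + entity_type)   # character inside the word
    return chars, tags


# Hypothetical manually labeled sentence.
labeled_sentence = [("红军", "ORG"), ("出动", None), ("坦克", "EQP")]
chars, tags = words_to_char_tags(labeled_sentence)
```

Annotators label whole words once, and the per-character tags needed by a character-level model follow mechanically.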

S4, model training: establishing an entity information extraction model from the military design dictionary and the word vector matrix and training the model parameters, specifically comprising the following steps:

S4.1, text sequence segmentation: dividing the input text sequence with sentences as the basic unit, where a sentence containing n words is expressed as $X = (x_1, x_2, \ldots, x_n)$. Based on the military design dictionary and word vector matrix established in step S2, each character $x_i$ of $X$ is converted, through the word vector matrix $V^w \in D^{w \times Z}$ of dimension $w$, into a word vector $e_i$:

$$e_i = V^w \times z^i \qquad (1)$$

where $w$ is the dimension of the word vector, the dimension of the vector $z^i$ is the total word count $Z$ of the corpus dictionary, and $z^i$ is the vector whose $i$-th row is 1 and whose other rows are 0. The input sentence $X$ thus becomes the character embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
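
Formula (1) amounts to selecting one column of the word vector matrix. The sketch below checks this on a tiny hypothetical matrix (w = 2, Z = 3); the values and function names are illustrative only.

```python
def one_hot(index, size):
    """Return a vector of length `size` with 1 at `index` and 0 elsewhere."""
    return [1.0 if row == index else 0.0 for row in range(size)]


def embed(V, z):
    """Compute e = V z for a w x Z matrix V (list of rows) and a Z-vector z."""
    return [sum(v * zj for v, zj in zip(row, z)) for row in V]


# Hypothetical word vector matrix with w = 2 and Z = 3.
V = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

e = embed(V, one_hot(1, 3))      # matrix product with the one-hot vector
column = [row[1] for row in V]   # direct lookup of column 1
```

In practice the product is never formed explicitly; an embedding lookup by index gives the same vector.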

S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are then concatenated position by position to obtain the complete hidden state sequence $S^{BiLSTM}$;
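
The position-wise concatenation in step S4.2 can be sketched without a deep learning library. Note the assumption: a toy tanh recurrent cell stands in for the LSTM cell here, so only the bidirectional wiring is illustrated, not the LSTM gating. All weights and names are hypothetical.

```python
import math


def run_cell(inputs, weight_in, weight_rec):
    """Run a toy tanh recurrent cell (a stand-in for an LSTM) over a sequence."""
    hidden = [0.0] * len(weight_rec)
    states = []
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(weight_in[k], x))
                + sum(w * hj for w, hj in zip(weight_rec[k], hidden))
            )
            for k in range(len(weight_rec))
        ]
        states.append(hidden)
    return states


def bidirectional_states(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states position by position."""
    forward = run_cell(inputs, *fwd_params)
    # Read the sequence right to left, then restore the original order.
    backward = run_cell(inputs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Hypothetical inputs (n = 3, vector size 2) and weights (hidden size 2).
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fwd_params = ([[0.5, -0.2], [0.1, 0.3]], [[0.1, 0.0], [0.0, 0.1]])
bwd_params = ([[-0.3, 0.4], [0.2, 0.2]], [[0.1, 0.0], [0.0, 0.1]])
states = bidirectional_states(inputs, fwd_params, bwd_params)
```

Each position ends up with a state of twice the hidden size, combining context from the characters before it and after it.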

S4.3, optimal output tag sequence generation: the sequence $S^{BiLSTM}$ generated in step S4.2 is input into a conditional random field model to obtain a first transition matrix $A$, and the label sequence of sentence $X$ is denoted $Y = (y_1, y_2, \ldots, y_n)$. Because the entity recognition process extracts many entity types, an exponential form is adopted to improve the distinction between features when constructing the evaluation function of the label sequence $Y$ of sentence $X$:

where $S^{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A$ is the first transition matrix. During model training, the evaluation function is calculated, and the label sequence at which it reaches its maximum is the optimal output tag sequence;
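
Finding the label sequence that maximizes the evaluation function is commonly done with Viterbi decoding. The sketch below assumes a standard linear-chain CRF score, the sum of per-position label scores and transition scores; the evaluation function of the invention is not reproduced in this text and may differ in form. The scores and the tag order (0 = O, 1 = B, 2 = I) are hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag index sequence and its score.

    emissions[i][t]: score of tag t at position i (derived from the hidden states).
    transitions[s][t]: score of moving from tag s to tag t (transition matrix).
    """
    num_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for scores in emissions[1:]:
        step_best, step_back = [], []
        for t in range(num_tags):
            # Best previous tag for reaching tag t at this position.
            prev = max(range(num_tags), key=lambda s: best[s] + transitions[s][t])
            step_best.append(best[prev] + transitions[prev][t] + scores[t])
            step_back.append(prev)
        best = step_best
        backpointers.append(step_back)
    last = max(range(num_tags), key=lambda t: best[t])
    path = [last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical scores; O -> I is penalized, B -> I is rewarded.
emissions = [[1.0, 2.0, 0.0], [0.5, 0.0, 1.5], [2.0, 0.0, 0.5]]
transitions = [[0.0, 0.0, -10.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
path, score = viterbi_decode(emissions, transitions)
```

Because an exponential is monotonic, maximizing an exponentiated score of this kind selects the same sequence as maximizing the score itself.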

S5, entity information extraction: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising the following steps:

S5.1, text preprocessing: preprocessing the input military design text data;

S5.2, vectorized representation: based on the military design dictionary and the word vector matrix established in step S2, vectorizing the sentence S1 to be extracted and inputting it into the trained model;

S5.3, entity information acquisition: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1^{BiLSTM}$, inputting it into the conditional random field model to obtain a second transition matrix $A1$, and denoting the tag sequence of sentence S1 as $Y1 = (y_1, y_2, \ldots, y_n)$, with the evaluation function of the tag sequence Y1 of the sentence S1 to be extracted:

where $S1^{BiLSTM}$ is the hidden state sequence, $y_i$ is the $i$-th label, and $A1$ is the second transition matrix. The evaluation function is calculated, the tag sequence at which it reaches its maximum is taken as the optimal output tag sequence, and the entity information of S1 is extracted from it.
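
Once the optimal tag sequence is available, the entities can be read off from the tags. The sketch below assumes BIO tags; the type names and the sample sentence are hypothetical.

```python
def extract_entities(chars, tags):
    """Collect (text, type) pairs from character-level BIO tags."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and current_type == tag[2:]
        if current_chars and not continues:
            # The open entity ends before this character.
            entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
        if starts_new:
            current_chars, current_type = [ch], tag[2:]
        elif continues:
            current_chars.append(ch)
        # An I- tag with no matching open entity is ignored.
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


predicted_tags = ["B-ORG", "I-ORG", "O", "O", "B-EQP", "I-EQP"]
entities = extract_entities(list("红军出动坦克"), predicted_tags)
```

Ignoring stray I- tags is one possible design choice; a stricter variant could treat them as the start of a new entity.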

By adopting the method for extracting military design document entity information in combination with a dictionary, the invention has the following advantages:

1. it effectively addresses problems such as insufficient manually constructed features and strong dependence on word segmentation in military design document entity information extraction;

2. it greatly reduces the workload of military design data collection;

3. it improves the efficiency of military design document entity information extraction.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of an embodiment of a method for extracting entity information of military design documents in combination with a dictionary according to the present invention;

FIG. 2 is a block diagram of the component structure of the present invention.

Detailed Description

The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, a flow chart of the method for extracting military design document entity information in combination with a dictionary according to the present invention is shown, comprising the following steps:

S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols such as "|", and splitting the text into sentences at Chinese sentence-breaking symbols to establish the corpus;

S1.2, military design entity dictionary establishment: selecting, according to the fields involved in military design, dictionaries that are publicly published and widely accepted in those fields, such as the Chinese Military Encyclopedia, the Military Dictionary and the Concise Military Dictionary; collecting proper nouns from these authoritative dictionaries; establishing the military design entity dictionary organized by type into a military service word stock, a weapon equipment word stock, a facility word stock and an operation word stock; and analyzing and marking the semantic structure of each entity.

S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector generation tool such as word2vec or GloVe to obtain a multidimensional military design word vector matrix.

S3.1, determining military design entity types and semantic description rules: combining the authoritative dictionaries, analyzing the corpus content and consulting several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories, namely entity names, time expressions and numeric expressions, as shown in the table below:

S3.3, character label generation: using open-source toolkits such as YEDDA and brat and labeling schemes such as BIO and BIEOS to generate character labels for the labeled text. In the BIO scheme, the B label marks the first character of a word, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO scheme, the B label marks the first character of a word, the I label marks a character inside a word, the E label marks the last character of a word, the W label marks a single-character word, and the O label marks a non-entity character.

S4, model training, namely establishing an entity information extraction model based on a military design dictionary and a word vector matrix, and training entity information extraction model parameters, wherein the method specifically comprises the following steps of:

$$e_i = V^w \times z^i \qquad (1)$$

S5.1, text preprocessing: preprocessing the input military design text data;

Referring to FIG. 2, a structural diagram of the device for extracting military design document entity information in combination with a dictionary according to the present invention is shown, specifically comprising the following components:

the data preprocessing module 100 preprocesses military design document data and establishes a military design entity dictionary, and specifically includes:

a corpus establishing unit 101, which preprocesses the military design data, removes meaningless symbols, and splits the text into sentences at Chinese sentence-breaking symbols to establish a corpus;

a military design entity dictionary establishing unit 102, which selects authoritative in-field dictionaries according to the fields involved in military design, collects proper nouns from them, establishes the military design entity dictionary organized by type into a military service word stock, a weapon equipment word stock, a facility word stock and an operation word stock, and analyzes and marks the semantic structure of each entity.

The word vector generation module 200 constructs a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, and specifically includes:

a character statistics unit 201, which counts all characters appearing in the military design entity dictionary and the authoritative in-field dictionaries, assigns a numeric index to each character to obtain the military design entity dictionary, and records the total word count of the corpus dictionary;

a military design word vector matrix generating unit 202, which trains on the corpus dictionary with an open-source word vector generation tool to obtain a military design word vector matrix of a given dimension.

The corpus labeling module 300 combines the authoritative in-field dictionaries and the corpus to determine complete definition rules for military design entity types, selects corpus for labeling, establishes a training corpus and a testing corpus respectively, and prepares for model training, and specifically includes:

a military design entity type and semantic description rule determining unit 301, which combines the authoritative dictionaries, analyzes the corpus content and consults several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions and numeric expressions;

a word label generating unit 302, which uses manual labeling of word attributes together with automatic generation of character labels to assign a unified label to each word in the preprocessed data from the data preprocessing module 100, sentence by sentence;

a character label generating unit 303, which uses an open-source toolkit and a specific labeling scheme to generate character labels for the labeled text.

The model training module 400 establishes an entity information extraction model based on the military design dictionary and the word vector matrix and trains the model parameters, and specifically includes:

a text sequence dividing unit 401, which divides the input text sequence with sentences as the basic unit;

a hidden state sequence generating unit 402, which takes the word vector sequence generated in the text sequence dividing unit 401 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and then concatenates the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM position by position to obtain the complete hidden state sequence;

an optimal output tag sequence generating unit 403, which inputs the sequence generated in the hidden state sequence generating unit 402 into a Conditional Random Field (CRF) model to obtain the optimal output tag sequence.

The entity information extraction module 500 applies the trained entity information extraction model to extract military design entity information from the text data to be predicted, and specifically includes:

a text preprocessing unit 501, which preprocesses the input military design text data;

a vectorized representation unit 502, which vectorizes the sentences to be extracted based on the military design dictionary and the word vector matrix established in the word vector generation module 200 and inputs them into the trained model;

an entity information acquisition unit 503, which applies the entity information extraction model to the input sentence vectors to generate a sequence, inputs the sequence into the Conditional Random Field (CRF) model, and extracts the entity information.

Claims

1. A method for extracting military design document entity information in combination with a dictionary, characterized in that the method comprises the following steps:

S1, data preprocessing: preprocessing military design document data and establishing a military design entity dictionary, specifically comprising the following steps:

S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish a corpus;

S1.2, military design entity dictionary establishment: selecting authoritative in-field dictionaries according to the fields involved in military design, collecting proper nouns from the authoritative in-field dictionaries, establishing the military design entity dictionary organized by type into a military service word stock, a weapon equipment word stock, a facility word stock and an operation word stock, and analyzing and marking the semantic structure of each entity;

S2, word vector generation: constructing a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary established in step S1.2, specifically comprising the following steps:

S2.1, character statistics: counting all characters appearing in the military design entity dictionary and the authoritative in-field dictionaries, assigning a numeric index to each character to obtain the military design dictionary, and recording the total word count of the corpus dictionary;

S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector generation tool to obtain a multidimensional military design word vector matrix;

S3, corpus labeling: combining the authoritative in-field dictionaries and the corpus to determine definition rules for military design entity types, selecting corpus for labeling, establishing a training corpus and a testing corpus respectively, and preparing for model training, specifically comprising the following steps:

S3.1, determining military design entity types and semantic description rules: combining the authoritative in-field dictionaries and analyzing the corpus content to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions and numeric expressions;

S3.2, word label generation: using manual labeling of word attributes together with automatic generation of character labels, assigning a unified label to each word in the preprocessed data from step S1, sentence by sentence;

S3.3, character label generation: using a specific open-source toolkit and a specific labeling scheme to generate character labels for the labeled text;

S4, model training: establishing an entity information extraction model from the military design dictionary and the word vector matrix and training the model parameters, specifically comprising the following steps:

$$e_i = V^w \times z^i \qquad (1)$$

S5.1, text preprocessing: preprocessing the input military design text data;

2. The method for extracting military design document entity information in combination with a dictionary as recited in claim 1, wherein said Chinese sentence-breaking symbols include "。" and "！".

3. The method for extracting military design document entity information in combination with a dictionary as recited in claim 1, wherein said authoritative in-field dictionaries include the Chinese Military Encyclopedia, the Military Dictionary, and the Concise Military Dictionary.

4. The method for extracting military design document entity information in combination with a dictionary as recited in claim 1, wherein said open-source word vector generation tool comprises word2vec and GloVe.

5. The method for extracting military design document entity information in combination with a dictionary as recited in claim 1, wherein said open-source toolkit comprises YEDDA and brat.

6. The method for extracting military design document entity information in combination with a dictionary as recited in claim 1, wherein said specific labeling scheme comprises BIO and BIEOS.

7. A device for extracting military design document entity information in combination with a dictionary, said device comprising:

a data preprocessing module 100: preprocessing military design document data and establishing a military design entity dictionary, specifically comprising:

a corpus establishing unit 101: preprocessing the military design data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish a corpus;

a military design entity dictionary establishing unit 102: selecting authoritative in-field dictionaries according to the fields involved in military design, collecting proper nouns from the authoritative dictionaries, establishing the military design entity dictionary organized by type into a military service word stock, a weapon equipment word stock, a facility word stock and an operation word stock, and analyzing and marking the semantic structure of each entity;

a word vector generation module 200: constructing a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, specifically comprising:

a character statistics unit 201: counting all characters appearing in the military design entity dictionary and the authoritative in-field dictionaries, assigning a numeric index to each character to obtain the military design entity dictionary, and recording the total word count of the corpus dictionary;

a military design word vector matrix generating unit 202: training on the corpus dictionary with an open-source word vector generation tool to obtain a multidimensional military design word vector matrix;

a corpus labeling module 300: combining the authoritative dictionaries and the corpus to determine definition rules for military design entity types, selecting corpus for labeling, establishing a training corpus and a testing corpus respectively, and preparing for model training, specifically comprising:

a military design entity type and semantic description rule determining unit 301: combining the authoritative dictionaries, analyzing the corpus content and consulting several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions and numeric expressions;

a word label generating unit 302: using manual labeling of word attributes together with automatic generation of character labels to assign a unified label to each word in the preprocessed data from the data preprocessing module 100, sentence by sentence;

a character label generating unit 303: using an open-source toolkit and a specific labeling scheme to generate character labels for the labeled text;

a model training module 400: establishing an entity information extraction model based on the military design dictionary and the word vector matrix and training the model parameters, specifically comprising:

a text sequence dividing unit 401: dividing the input text sequence with sentences as the basic unit;

a hidden state sequence generating unit 402: taking the word vector sequence generated in the text sequence dividing unit 401 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and then concatenating the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM position by position to obtain the complete hidden state sequence;

an optimal output tag sequence generating unit 403: inputting the sequence generated in the hidden state sequence generating unit 402 into a conditional random field model to obtain the optimal output tag sequence;

an entity information extraction module 500: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising:

a text preprocessing unit 501: preprocessing the input military design text data;

a vectorized representation unit 502: vectorizing the sentences to be extracted based on the military design dictionary and the word vector matrix established in the word vector generation module 200, and inputting them into the trained model;

an entity information acquisition unit 503: applying the entity information extraction model to the input sentence vectors to generate a sequence, inputting the sequence into the conditional random field model, and extracting the entity information.