
CN110598203B - Method and device for extracting entity information of military design document combined with dictionary - Google Patents


Info

Publication number
CN110598203B
Authority
CN
China
Prior art keywords
military
dictionary
entity
corpus
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910653281.3A
Other languages
Chinese (zh)
Other versions
CN110598203A (en)
Inventor
蒋序平
鲁义威
杨若鹏
张建军
卢稳新
朱巍
刘乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910653281.3A priority Critical patent/CN110598203B/en
Publication of CN110598203A publication Critical patent/CN110598203A/en
Application granted granted Critical
Publication of CN110598203B publication Critical patent/CN110598203B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for extracting entity information from military design documents in combination with a dictionary. The method comprises the following steps: 1. preprocessing, namely establishing a military design corpus and a military design entity dictionary; 2. constructing a military design dictionary and a word vector matrix; 3. determining 14 military design entity types and their semantic description rules, selecting corpus data for labeling, establishing a training corpus and a test corpus, and preparing for model training; 4. establishing an entity information extraction model and training its parameters; 5. extracting military design entity information from the military design text data to be predicted. The method can effectively solve problems in military design entity information extraction such as inadequate manually constructed features and strong dependence on word segmentation, thereby improving the efficiency of military design entity information extraction.

Description

Method and device for extracting entity information of military design document combined with dictionary
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a military-design-oriented entity information extraction method and device.
Background
Military design text is descriptive text that sets out hypothetical assumptions about the intentions, situations, and course of combat of the two opposing sides. Entity information is the basic information element of military design data and the basis for extracting, processing, and analyzing military design text data. Entity information extraction aims to find the entities hidden in unstructured and semi-structured military design text and to extract them by suitable means.
At present, methods for recognizing named entities in the general domain fall mainly into rule-based methods, methods based on statistics and machine learning, and methods based on deep learning. Rule-based methods are highly accurate, but their coverage and portability are poor and their development cost is high. Methods based on statistics and machine learning have a low development cost, but depend strongly on feature engineering and Chinese word segmentation. Methods based on deep learning offer high precision and good portability, but constructing word vectors still requires word segmentation, and they place high demands on computing capacity and corpus size.
In military design entity information extraction, methods based on rules and dictionaries are popular for extracting semantic entities from military design text data. Text features can also be learned with a Conditional Random Field (CRF) model to identify entity information in a scenario, and multi-model approaches (CRF combined with rules, or CRF combined with dictionaries and rules) can be used as well. These traditional methods are well targeted, but their recognition performance and extensibility fall somewhat short; they struggle to keep pace with rapidly changing military design information and cannot meet the requirements of automated, intelligent processing of massive data.
Currently, military design entity information extraction mainly faces the following problems:
1) In different scenarios, entities appear in a large number of combined, nested, and abbreviated forms;
2) Because language styles and habits differ between scenarios, some entity types are very numerous, their name forms are complex and variable, and no strict, unified rules exist, so comprehensive and reasonable entity features are difficult to construct;
3) Existing word segmentation tools are designed mainly for the general domain, so their segmentation accuracy on military design text data is low. In particular, specialized scenario terms are rare in the general domain, and even an added scenario dictionary can hardly cover all scenario entities, so methods that depend strongly on word segmentation struggle to break through the current bottleneck in recognition performance.
Disclosure of Invention
In view of practical problems such as the complexity of military design data and the difficulty of manual collection, the invention aims to overcome the shortcomings of the prior art. It establishes a military design entity dictionary based on authoritative dictionaries in the military field, establishes a training corpus and a test corpus by determining 14 military design entity types and their semantic description rules, and trains an entity information extraction model, thereby providing a method and a device for extracting military design document entity information in combination with a dictionary.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A method for extracting military design document entity information in combination with a dictionary, the method comprising the following steps:
S1, data preprocessing, which preprocesses the military design document data and establishes a military design entity dictionary, specifically comprising the following steps:
S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish the corpus;
the Chinese sentence-breaking symbols include ".", "!", and similar symbols;
S1.2, military design entity dictionary establishment: selecting authoritative dictionaries in the field according to the fields that military design involves, collecting proper nouns from the authoritative dictionaries, establishing the military design entity dictionary by category (military skill lexicon, weapon equipment lexicon, facility lexicon, and operation lexicon), and analyzing and marking the semantic structure of each entity;
an authoritative dictionary in the field is a published, widely accepted dictionary in the field, including but not limited to the Chinese military encyclopedia, the military dictionary, the concise military dictionary, and the like.
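Step S1.1 can be illustrated with a minimal sketch in Python. It is not the patented implementation: the set of sentence-breaking symbols (`SENTENCE_BREAKS`, here assumed to include full-width Chinese punctuation) and the set of meaningless symbols (`NOISE_PATTERN`) are illustrative assumptions, since the text above names only a few symbols, and the function name `build_corpus` is hypothetical.

```python
import re

# Assumption: sentence-breaking symbols; the source names only two "and similar symbols".
SENTENCE_BREAKS = "。！？!?"
# Assumption: which symbols count as "meaningless" is corpus-specific; this set is illustrative.
NOISE_PATTERN = re.compile(r"[\s#*|~^]+")


def build_corpus(raw_text):
    """Remove noise symbols, then split the text into sentences."""
    cleaned = NOISE_PATTERN.sub("", raw_text)
    # Split right after each sentence-breaking symbol so the symbol stays with its sentence.
    parts = re.split(f"(?<=[{re.escape(SENTENCE_BREAKS)}])", cleaned)
    return [part for part in parts if part]


corpus = build_corpus("红军 进攻。蓝军#防御！")
```

Keeping the sentence-breaking symbol attached to its sentence is a design choice of this sketch; a corpus that discards punctuation would split on the symbols instead.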
S2, word vector generation, which constructs a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, specifically comprising the following steps:
S2.1, character statistics: counting all characters that appear in the military design entity dictionary and the authoritative dictionaries in the field, assigning a numerical index to each character to obtain the military design dictionary, and recording the total number of characters Z in the corpus dictionary;
S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military design word vector matrix;
the open-source word vector matrix generation tools include, but are not limited to, word2vec, GloVe, and the like.
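The character statistics of step S2.1 can be sketched as follows. This is an illustration under stated assumptions: the dictionary entries are made-up examples, and the helper name `build_char_index` is hypothetical. Training the vectors themselves (step S2.2) is left to a tool such as word2vec and is not shown.

```python
def build_char_index(entries):
    """Map each distinct character to an integer index, in first-seen order."""
    char_to_index = {}
    for entry in entries:
        for ch in entry:
            if ch not in char_to_index:
                char_to_index[ch] = len(char_to_index)
    return char_to_index


# Hypothetical entries standing in for the entity dictionary and the authoritative dictionaries.
entity_dictionary = ["坦克", "步兵营"]
authoritative_terms = ["坦克营"]

char_index = build_char_index(entity_dictionary + authoritative_terms)
Z = len(char_index)  # total number of distinct characters
```

The resulting index is what later selects a character's column in the word vector matrix.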
S3, corpus labeling, which combines the authoritative dictionaries and the corpus to determine complete definition rules for the military design entity types, selects corpus data for labeling, establishes a training corpus and a test corpus, and prepares for model training, specifically comprising the following steps:
S3.1, determination of military design entity types and semantic description rules: combining the authoritative dictionaries, analyzing the corpus content, and consulting several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions, and numerical expressions;
S3.2, word label generation: manually labeling word attributes and automatically generating character labels, so that each word in the preprocessing result of step S1 is given a unified label, sentence by sentence;
S3.3, character label generation: generating character labels for the labeled text with an open-source toolkit and a specific labeling system;
the open-source toolkits include, but are not limited to, YEDDA, brat, and the like;
the specific labeling systems include, but are not limited to, BIO, BIEOS, and the like. In the BIO system, the B label marks the first character of a word, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO system, the B label marks the first character of a word, the I label marks a character inside a word, the E label marks the last character of a word, the W label marks a single-character word, and the O label marks a non-entity character.
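The conversion from word-level labels (step S3.2) to character labels in the BIO system (step S3.3) can be sketched as below. The entity type name `ORG` and the function `words_to_char_tags` are hypothetical; the 14 entity types are not enumerated in this text, and a real workflow would use a toolkit such as YEDDA or brat rather than this hand-written helper.

```python
def words_to_char_tags(labeled_words):
    """Turn (word, entity_type or None) pairs into (character, BIO tag) pairs."""
    tagged = []
    for word, entity_type in labeled_words:
        for position, ch in enumerate(word):
            if entity_type is None:
                tag = "O"                  # non-entity character
            elif position == 0:
                tag = "B-" + entity_type   # first character of the entity
            else:
                tag = "I-" + entity_type   # character inside the entity
            tagged.append((ch, tag))
    return tagged


# Hypothetical annotated sentence: one entity followed by a non-entity word.
sentence = [("坦克营", "ORG"), ("进攻", None)]
char_tags = words_to_char_tags(sentence)
```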
S4, model training, which establishes an entity information extraction model from the military design dictionary and the word vector matrix and trains the parameters of the entity information extraction model, specifically comprising the following steps:
S4.1, text sequence segmentation: dividing the input text sequence into sentences as basic units, where a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military design dictionary and the word vector matrix established in step S2, each character $x_i$ of X is converted into a word vector $e_i$ by means of the w-dimensional word vector matrix $V_w \in D^{w \times Z}$:
$$e_i = V_w \times z_i \qquad (1)$$
where w is the dimension of the word vector, the vector $z_i$ has dimension Z, the total number of characters in the corpus dictionary, and $z_i$ takes 1 in the i-th row and 0 in all other rows; the input sentence X thus becomes the character-embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are then concatenated position by position to obtain the complete hidden state sequence $S_{BiLSTM}$;
S4.3, optimal output tag sequence generation: the sequence $S_{BiLSTM}$ generated in step S4.2 is input into a conditional random field model to obtain a first transition matrix A, and the tag sequence of sentence X is denoted $Y = (y_1, y_2, \ldots, y_n)$. Because the entity recognition process extracts many entity types, an exponential form is adopted to improve feature discrimination when constructing the evaluation function of the tag sequence Y of sentence X:
where $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th tag, and A is the first transition matrix; during model training the evaluation function is computed, and the optimal output tag sequence is obtained when it takes its maximum value;
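Equation (1) in step S4.1 multiplies the word vector matrix by a one-hot vector, which amounts to selecting one column of the matrix. The sketch below demonstrates this with plain Python lists; the matrix values, the sizes w = 2 and Z = 3, and the helper names `one_hot` and `mat_vec` are illustrative assumptions, not values from the patent.

```python
def one_hot(index, size):
    """Vector of length `size` with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if row == index else 0.0 for row in range(size)]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Hypothetical word vector matrix V_w with w = 2 rows and Z = 3 columns.
V_w = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6]]

z = one_hot(1, 3)      # one-hot vector for the character with dictionary index 1
e = mat_vec(V_w, z)    # equals column 1 of V_w
```

In practice an embedding lookup indexes the column directly instead of performing the multiplication; the two are equivalent.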
S5, entity information extraction: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising the following steps:
S5.1, text preprocessing: preprocessing the input military design text data;
S5.2, vectorized representation: based on the military design dictionary and the word vector matrix established in step S2, vectorizing the sentence S1 to be extracted and inputting it into the trained model;
S5.3, entity information acquisition: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1_{BiLSTM}$, inputting it into the conditional random field model to obtain a second transition matrix A1, and denoting the tag sequence of sentence S1 as $Y1 = (y_1, y_2, \ldots, y_n)$, with the evaluation function of the tag sequence Y1 of the sentence S1 to be extracted:
where $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th tag, and A1 is the second transition matrix; the evaluation function is computed, the optimal output tag sequence is obtained when it takes its maximum value, and the entity information of S1 is extracted accordingly.
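Once the optimal tag sequence for a sentence is available, the last part of step S5.3 reduces to reading entity spans off the tags. A minimal sketch for BIO tags follows; the entity type `ORG` and the function name `extract_entities` are hypothetical, and the handling of malformed tag sequences (a stray I tag is ignored) is a choice made for this example.

```python
def extract_entities(chars, tags):
    """Collect (entity_text, entity_type) pairs from characters and their BIO tags."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_chars and tag[2:] == current_type:
            current_chars.append(ch)
        else:
            # "O" or an inconsistent I tag ends the open entity, if any.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


found = extract_entities("坦克营进攻", ["B-ORG", "I-ORG", "I-ORG", "O", "O"])
```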
The method of the invention for extracting military design document entity information in combination with a dictionary has the following advantages:
1. it effectively solves problems in military design document entity information extraction such as inadequate manually constructed features and strong dependence on word segmentation;
2. it greatly reduces the workload of collecting military design data;
3. it improves the efficiency of extracting military design document entity information.
Drawings
To illustrate the embodiments of the invention and the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of an embodiment of a method for extracting entity information of military design documents in combination with a dictionary according to the present invention;
fig. 2 is a block diagram of the constituent structure of the present invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. The description of these embodiments is provided to assist understanding of the present invention, but is not intended to limit the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, a flow diagram of the method of the present invention for extracting military design document entity information in combination with a dictionary is shown, comprising the following steps:
S1, data preprocessing, which preprocesses the military design document data and establishes a military design entity dictionary, specifically comprising the following steps:
S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols such as ".", "|", and "!", and splitting the text into sentences at Chinese sentence-breaking symbols to establish the corpus;
S1.2, military design entity dictionary establishment: according to the fields that military design involves, selecting published, widely accepted dictionaries in those fields, such as the Chinese military encyclopedia, the military dictionary, and the concise military dictionary; collecting proper nouns from these authoritative dictionaries; establishing the military design entity dictionary by category (military service lexicon, weapon equipment lexicon, facility lexicon, and operation lexicon); and analyzing and marking the semantic structure of each entity.
S2, word vector generation, which constructs a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, specifically comprising the following steps:
S2.1, character statistics: counting all characters that appear in the military design entity dictionary and the authoritative dictionaries in the field, assigning a numerical index to each character to obtain the military design dictionary, and recording the total number of characters Z in the corpus dictionary;
S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector matrix generation tool such as word2vec or GloVe to obtain a multidimensional military design word vector matrix.
S3, corpus labeling, which combines the authoritative dictionaries and the corpus to determine complete definition rules for the military design entity types, selects corpus data for labeling, establishes a training corpus and a test corpus, and prepares for model training, specifically comprising the following steps:
S3.1, determination of military design entity types and semantic description rules: combining the authoritative dictionaries, analyzing the corpus content, and consulting several experts in the field to determine 14 military design entity types and their semantic description rules across three major categories (entity names, time expressions, and numerical expressions), as shown in the table below:
S3.2, word label generation: manually labeling word attributes and automatically generating character labels, so that each word in the preprocessing result of step S1 is given a unified label, sentence by sentence;
S3.3, character label generation: generating character labels for the labeled text with an open-source toolkit such as YEDDA or brat and a labeling system such as BIO or BIEOS. In the BIO system, the B label marks the first character of a word, the I label marks a character inside a word, and the O label marks a non-entity character; in the BMEWO system, the B label marks the first character of a word, the I label marks a character inside a word, the E label marks the last character of a word, the W label marks a single-character word, and the O label marks a non-entity character.
S4, model training, which establishes an entity information extraction model based on the military design dictionary and the word vector matrix and trains the parameters of the entity information extraction model, specifically comprising the following steps:
S4.1, text sequence segmentation: dividing the input text sequence into sentences as basic units, where a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military design dictionary and the word vector matrix established in step S2, each character $x_i$ of X is converted into a word vector $e_i$ by means of the w-dimensional word vector matrix $V_w \in D^{w \times Z}$:
$$e_i = V_w \times z_i \qquad (1)$$
where w is the dimension of the word vector, the vector $z_i$ has dimension Z, the total number of characters in the corpus dictionary, and $z_i$ takes 1 in the i-th row and 0 in all other rows; the input sentence X thus becomes the character-embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
S4.2, hidden state sequence generation: the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 is used as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network; the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM are then concatenated position by position to obtain the complete hidden state sequence $S_{BiLSTM}$;
S4.3, optimal output tag sequence generation: the sequence $S_{BiLSTM}$ generated in step S4.2 is input into a conditional random field model to obtain a first transition matrix A, and the tag sequence of sentence X is denoted $Y = (y_1, y_2, \ldots, y_n)$. Because the entity recognition process extracts many entity types, an exponential form is adopted to improve feature discrimination when constructing the evaluation function of the tag sequence Y of sentence X:
where $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th tag, and A is the first transition matrix; during model training the evaluation function is computed, and the optimal output tag sequence is obtained when it takes its maximum value;
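Finding the tag sequence that maximizes an evaluation function over per-position scores and a transition matrix is commonly done with Viterbi decoding. The sketch below is a stand-in under explicit assumptions: it uses the standard additive linear-chain CRF score (sum of emission scores plus transition scores), not the exponential evaluation function of this patent, whose formula is not reproduced in this text. The score values, the tag order (0 = O, 1 = B, 2 = I), and the function name `viterbi` are illustrative.

```python
def viterbi(emissions, transitions):
    """Return the best-scoring tag path and its score.

    emissions[t][k]: score of tag k at position t (e.g. derived from hidden states).
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(transitions)
    best = list(emissions[0])      # best score of any path ending in each tag
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for k in range(n_tags):
            prev = max(range(n_tags), key=lambda j: best[j] + transitions[j][k])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][k] + emissions[t][k])
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Hypothetical scores for a three-character sentence; tags: 0 = O, 1 = B, 2 = I.
emissions = [[1, 2, 0], [1, 0, 2], [2, 0, 1]]
transitions = [[0, 0, -10],   # O -> I is heavily penalized
               [0, -1, 1],
               [0, -1, 0]]
path, score = viterbi(emissions, transitions)
```

With these made-up scores the decoded path is B, I, O: the transition penalty keeps an I tag from directly following an O tag.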
S5, entity information extraction: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising the following steps:
S5.1, text preprocessing: preprocessing the input military design text data;
S5.2, vectorized representation: based on the military design dictionary and the word vector matrix established in step S2, vectorizing the sentence S1 to be extracted and inputting it into the trained model;
S5.3, entity information acquisition: applying the entity information extraction model to the input sentence vectors to generate the sequence $S1_{BiLSTM}$, inputting it into the conditional random field model to obtain a second transition matrix A1, and denoting the tag sequence of sentence S1 as $Y1 = (y_1, y_2, \ldots, y_n)$, with the evaluation function of the tag sequence Y1 of the sentence S1 to be extracted:
where $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th tag, and A1 is the second transition matrix; the evaluation function is computed, the optimal output tag sequence is obtained when it takes its maximum value, and the entity information of S1 is extracted accordingly.
Referring to fig. 2, a structural diagram of the device of the present invention for extracting military design document entity information in combination with a dictionary is shown. The device comprises the following components:
the data preprocessing module 100, configured to preprocess the military design document data and establish a military design entity dictionary, specifically including:
a corpus establishing unit 101, which preprocesses the military design data, removes meaningless symbols, and splits the text into sentences at Chinese sentence-breaking symbols to establish the corpus;
a military design entity dictionary establishing unit 102, which selects authoritative dictionaries in the field according to the fields that military design involves, collects proper nouns from the authoritative dictionaries, establishes the military design entity dictionary by category (military weapon lexicon, facility lexicon, and operation lexicon), and analyzes and marks the semantic structure of each entity.
The word vector generation module 200 constructs a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary, and specifically includes:
a character statistics unit 201, which counts all characters that appear in the military design entity dictionary and the authoritative dictionaries in the field, assigns a numerical index to each character to obtain the military design dictionary, and records the total number of characters in the corpus dictionary;
a military design word vector matrix generating unit 202, which trains on the corpus dictionary with an open-source word vector matrix generation tool to obtain a military design word vector matrix of a given dimension.
The corpus labeling module 300 combines the authoritative dictionaries in the field and the corpus to determine complete definition rules for the military design entity types, selects corpus data for labeling, establishes a training corpus and a test corpus, and prepares for model training, and specifically includes:
determining, by combining the authoritative dictionaries, analyzing the corpus content, and consulting several experts in the field, a total of 14 military design entity types and their semantic description rules across the three major categories of entity names, time expressions, and numerical expressions;
a word label generating unit 302, which manually labels word attributes and automatically generates character labels, so that each word in the preprocessing result of the data preprocessing module 100 is given a unified label, sentence by sentence;
a character label generating unit 303, which generates character labels for the labeled text with an open-source toolkit and a specific labeling system.
The model training module 400 establishes an entity information extraction model based on the military design dictionary and the word vector matrix and trains the parameters of the entity information extraction model, and specifically includes:
a text sequence dividing unit 401, which divides the input text sequence into sentences as basic units;
a hidden state sequence generating unit 402, which uses the word vector sequence generated by the text sequence dividing unit 401 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and then concatenates, position by position, the hidden state sequence output by the forward LSTM and the hidden state sequence output by the backward LSTM to obtain the complete hidden state sequence;
an optimal output tag sequence generating unit 403, which inputs the sequence generated by the hidden state sequence generating unit 402 into a Conditional Random Field (CRF) model to obtain the optimal output tag sequence.
The entity information extraction module 500 applies the trained entity information extraction model to extract military design entity information from the text data to be predicted, and specifically includes:
a text preprocessing unit 501, which preprocesses the input military design text data;
a vectorized representation unit 502, which vectorizes the sentence to be extracted based on the military design dictionary and the word vector matrix established by the word vector generation module 200 and inputs it into the trained model;
an entity information acquisition unit 503, which applies the entity information extraction model to the input sentence vectors to generate a sequence, inputs the sequence into a Conditional Random Field (CRF) model, and extracts the entity information.

Claims (7)

1. A method for extracting military design document entity information in combination with a dictionary, characterized in that the method comprises the following steps:
S1, data preprocessing: preprocessing the military design document data and establishing a military design entity dictionary, specifically comprising the following steps:
S1.1, corpus establishment: preprocessing the military design data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish the corpus;
S1.2, military design entity dictionary establishment: selecting authoritative dictionaries in the field according to the fields that military design involves, collecting proper nouns from the authoritative dictionaries, establishing the military design entity dictionary by category (military service lexicon, weapon equipment lexicon, facility lexicon, and operation lexicon), and analyzing and marking the semantic structure of each entity;
S2, word vector generation: constructing a military design dictionary and a word vector matrix from the military design corpus and the military design entity dictionary established in step S1.2, specifically comprising the following steps:
S2.1, character statistics: counting all characters that appear in the military design entity dictionary and the authoritative dictionaries in the field, assigning a numerical index to each character to obtain the military design dictionary, and recording the total number of characters in the corpus dictionary;
S2.2, military design word vector matrix generation: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military design word vector matrix;
S3, corpus labeling: combining the authoritative dictionaries in the field and the corpus to determine definition rules for the military design entity types, selecting corpus data for labeling, establishing a training corpus and a test corpus, and preparing for model training, specifically comprising the following steps:
S3.1, determination of military design entity types and semantic description rules: combining the authoritative dictionaries in the field and analyzing the corpus content to determine 14 military design entity types and their semantic description rules across three major categories: entity names, time expressions, and numerical expressions;
S3.2, word label generation: manually labeling word attributes and automatically generating character labels, so that each word in the preprocessing result of step S1 is given a unified label, sentence by sentence;
S3.3, character label generation: generating character labels for the labeled text with a specific open-source toolkit and a specific labeling system;
s4, model training: according to the military design dictionary and the word vector matrix, an entity information extraction model is established, and entity information extraction model parameters are trained, and the method specifically comprises the following steps:
S4.1, text sequence segmentation: dividing the input text sequence with sentences as basic units, wherein a sentence containing n characters is expressed as $X = (x_1, x_2, \ldots, x_n)$; based on the military design dictionary and the word vector matrix established in step S2, each character $x_i$ of $X$ is converted, through the word vector matrix $V_w \in D^{w \times Z}$, into a $w$-dimensional word vector $e_i$:
$$e_i = V_w \times z_i \quad (1)$$
where $w$ is the dimension of the word vector, the dimension of the vector $z_i$ is the total number of characters $Z$ of the corpus dictionary, and $z_i$ is the vector whose i-th row is 1 and whose other rows are 0; the input sentence $X$ thereby becomes the character-embedding word vector sequence $E = (e_1, e_2, \ldots, e_n)$;
S4.2, hidden state sequence generation: taking the word vector sequence $E = (e_1, e_2, \ldots, e_n)$ generated in step S4.1 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and concatenating, position by position, the hidden state sequence output by the forward LSTM with the hidden state sequence output by the backward LSTM to obtain the complete hidden state sequence $S_{BiLSTM}$;
S4.3, optimal output label sequence generation: inputting the sequence $S_{BiLSTM}$ generated in step S4.2 into a conditional random field model to obtain a first transition matrix $A$, and denoting the label sequence of sentence $X$ as $Y = (y_1, y_2, \ldots, y_n)$; considering that the entity recognition process extracts many entity types, in order to improve feature discrimination, an exponent-taking method is adopted to construct the evaluation function of the label sequence $Y$ of sentence $X$:
wherein $S_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th label, and $A$ is the first transition matrix; during model training the evaluation function is calculated, and the label sequence at which it takes its maximum value is the optimal output label sequence;
S5, entity information extraction: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising:
S5.1, text preprocessing: preprocessing the input military design text data;
S5.2, vectorized representation: based on the military design dictionary and the word vector matrix established in step S2, vectorizing the sentence S1 to be extracted and inputting it into the trained model;
S5.3, entity information acquisition: computing on the input sentence vectors with the entity information extraction model to generate a sequence $S1_{BiLSTM}$, inputting it into the conditional random field model to obtain a second transition matrix $A1$, and denoting the label sequence of the sentence S1 as $Y1 = (y_1, y_2, \ldots, y_n)$, the evaluation function of the label sequence $Y1$ of the sentence S1 to be extracted being:
wherein $S1_{BiLSTM}$ is the hidden state sequence, $y_i$ is the i-th label, and $A1$ is the second transition matrix; the evaluation function is calculated, the label sequence at which it takes its maximum value is the optimal output label sequence, and the entity information of S1 is extracted accordingly.
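The character indexing of step S2.1 and the lookup of equation (1) can be illustrated with a small sketch. It is only an illustration under stated assumptions: the two dictionary entries, the dimension `w = 3`, the matrix values and the helper names `build_char_index`, `one_hot` and `embed` are all hypothetical and do not come from the patent. Each character is looked up by its dictionary index, so multiplying the matrix by the one-hot vector simply selects one column.

```python
# Sketch of steps S2.1 and S4.1: build a character index, then compute e = V_w x z.
# The toy entries, the dimension w and the matrix values are made up for illustration.

def build_char_index(texts):
    """Assign a numeric index to every distinct character, in first-seen order."""
    index = {}
    for text in texts:
        for ch in text:
            if ch not in index:
                index[ch] = len(index)
    return index

def one_hot(position, size):
    """Return a length-`size` vector with 1 at `position` and 0 elsewhere."""
    return [1.0 if k == position else 0.0 for k in range(size)]

def embed(char, index, V):
    """Compute e = V x z for one character; V has w rows and Z columns."""
    z = one_hot(index[char], len(index))
    return [sum(v * z_k for v, z_k in zip(row, z)) for row in V]

char_index = build_char_index(["坦克营", "炮兵营"])  # hypothetical dictionary entries
Z = len(char_index)                                  # total number of characters
w = 3                                                # hypothetical embedding dimension
# Hypothetical w x Z matrix; entry (r, k) = r + 0.1 * k so columns are easy to tell apart.
V_w = [[r + 0.1 * k for k in range(Z)] for r in range(w)]

sentence = "炮兵营"
E = [embed(ch, char_index, V_w) for ch in sentence]  # one w-dimensional vector per character
```

In practice the product is never formed explicitly; an embedding layer indexes the column directly, which gives the same result.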
2. The method for extracting entity information of military design documents combined with a dictionary according to claim 1, wherein the Chinese sentence-breaking symbols include the full stop "。" and the exclamation mark "！".
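The corpus establishment step (removing meaningless symbols, then splitting at sentence-breaking symbols) might be sketched as follows. This is a minimal illustration, not the patented implementation: the terminator set `SENTENCE_END`, the noise pattern `NOISE` and the function name `split_sentences` are assumptions chosen for the example, and what counts as a "meaningless symbol" would depend on the actual corpus.

```python
import re

# Assumed set of Chinese sentence-ending marks; adjust to the corpus at hand.
SENTENCE_END = "。！？"
# Hypothetical definition of "meaningless symbols": whitespace and a few decorative marks.
NOISE = re.compile(r"[\s\u3000★◆■●※]+")

def split_sentences(text):
    """Strip noise symbols, then split into sentences, keeping each terminator attached."""
    cleaned = NOISE.sub("", text)
    pattern = rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?"
    return re.findall(pattern, cleaned)

corpus = split_sentences("红军于6时出发。★ 蓝军何时抵达？立即报告！")
```

Keeping the terminator with its sentence preserves the original character sequence, which matters later when character-level labels are aligned with the text.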
3. The method for extracting entity information of military design documents combined with a dictionary according to claim 1, wherein the authoritative dictionaries of the field include the "Chinese Military Encyclopedia", the "Military Dictionary" and the "Concise Military Dictionary".
4. The method for extracting entity information of military design documents combined with a dictionary according to claim 1, wherein the open-source word vector matrix generation tools include word2vec and GloVe.
5. The method for extracting entity information of military design documents combined with a dictionary according to claim 1, wherein the open-source toolkits include YEDDA and brat.
6. The method for extracting entity information of military design documents combined with a dictionary according to claim 1, wherein the specific labeling systems include BIO and BIEOS.
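The difference between the BIO and BIEOS labeling systems, and the idea of deriving character labels automatically from word-level labels, can be shown with a short sketch. The function name `word_labels_to_char_tags` and the entity type names `UNIT` and `TIME` are hypothetical; the 14 entity types of the method are not listed in this text, so these stand in purely for illustration.

```python
def word_labels_to_char_tags(words, scheme="BIO"):
    """Expand word-level labels into per-character tags.

    words:  list of (text, entity_type) pairs; entity_type is None for non-entities.
    scheme: "BIO" or "BIEOS".
    Returns a list of (character, tag) pairs.
    """
    if scheme not in ("BIO", "BIEOS"):
        raise ValueError(f"unsupported scheme: {scheme}")
    tagged = []
    for text, etype in words:
        n = len(text)
        for k, ch in enumerate(text):
            if etype is None:
                tag = "O"                 # outside any entity
            elif scheme == "BIEOS" and n == 1:
                tag = f"S-{etype}"        # single-character entity
            elif k == 0:
                tag = f"B-{etype}"        # beginning of an entity
            elif scheme == "BIEOS" and k == n - 1:
                tag = f"E-{etype}"        # end of an entity
            else:
                tag = f"I-{etype}"        # inside an entity
            tagged.append((ch, tag))
    return tagged

# Hypothetical manually labeled sentence: entity types here are illustrative only.
labeled = [("坦克营", "UNIT"), ("于", None), ("6时", "TIME"), ("出发", None)]
bio_tags = [tag for _, tag in word_labels_to_char_tags(labeled, "BIO")]
bieos_tags = [tag for _, tag in word_labels_to_char_tags(labeled, "BIEOS")]
```

BIEOS marks entity ends and single-character entities explicitly, which gives the sequence model sharper boundary information at the cost of a larger label set.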
7. A device for extracting entity information of military design documents combined with a dictionary, the device comprising:
a data preprocessing module 100: preprocessing military design document data and establishing a military design entity dictionary, specifically comprising:
a corpus establishment unit 101: preprocessing the military design document data, removing meaningless symbols, and splitting the text into sentences at Chinese sentence-breaking symbols to establish a corpus;
a military design entity dictionary establishment unit 102: selecting authoritative dictionaries of the field according to the fields involved in military design documents, collecting proper nouns from them, establishing a military design entity dictionary organized into a military service lexicon, a weapon and equipment lexicon, a facility lexicon and an operation lexicon, and analyzing and annotating the semantic structure of the entities;
a word vector generation module 200: constructing a military design dictionary and a word vector matrix according to the military design corpus and the military design entity dictionary, specifically comprising:
a character statistics unit 201: counting all characters that appear in the military design entity dictionary and the authoritative dictionaries of the field, assigning a numeric index to each character to obtain the military design dictionary, and recording the total number of characters of the corpus dictionary;
a military design word vector matrix generation unit 202: training on the corpus dictionary with an open-source word vector matrix generation tool to obtain a multidimensional military design word vector matrix;
a corpus labeling module 300: determining the definition rules of military design entity types by combining the authoritative dictionaries with the corpus, selecting corpus text for labeling, and establishing a training corpus and a test corpus respectively in preparation for model training, specifically comprising:
a military design entity type and semantic description rule determination unit 301: analyzing the corpus content in combination with the authoritative dictionaries and, after consulting the opinions of several experts in the field, determining 14 military design entity types, together with their semantic description rules, in three major categories: entity names, time expressions and numeric expressions;
a word label generation unit 302: adopting a method of manually labeling word attributes and automatically generating character labels, and assigning a unified label to each word in the preprocessing result data of the data preprocessing module 100, sentence by sentence;
a character label generation unit 303: generating character labels for the labeled text by using an open-source toolkit and a specific labeling system;
a model training module 400: establishing an entity information extraction model based on the military design dictionary and the word vector matrix, and training the parameters of the entity information extraction model, specifically comprising:
a text sequence segmentation unit 401: dividing the input text sequence with sentences as basic units;
a hidden state sequence generation unit 402: taking the word vector sequence generated in the text sequence segmentation unit 401 as the input at each time step of a bidirectional long short-term memory (BiLSTM) neural network, and concatenating, position by position, the hidden state sequence output by the forward LSTM with the hidden state sequence output by the backward LSTM to obtain the complete hidden state sequence;
an optimal output label sequence generation unit 403: inputting the sequence generated in the hidden state sequence generation unit 402 into a conditional random field model to obtain the optimal output label sequence;
an entity information extraction module 500: applying the trained entity information extraction model to extract military design entity information from the text data to be predicted, specifically comprising:
a text preprocessing unit 501: preprocessing the input military design text data;
a vectorized representation unit 502: based on the military design dictionary and the word vector matrix established in the word vector generation module 200, vectorizing the sentence to be extracted and inputting it into the trained model;
an entity information acquisition unit 503: computing on the input sentence vectors with the entity information extraction model to generate a sequence, inputting the sequence into the conditional random field model, and extracting the entity information.
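How a conditional random field layer picks the optimal output label sequence from per-position scores and a transition matrix can be sketched with Viterbi decoding. This is the standard additive linear-chain formulation, offered as an assumption-laden illustration: the evaluation function of the claims is not reproduced in this text, so the scoring below should not be read as that function. The label set, the emission scores (standing in for a BiLSTM output) and the transition values are invented for the example.

```python
def viterbi_decode(emissions, transitions):
    """Find the label path with the highest total score.

    emissions:   n x K list; emissions[t][k] is the score of label k at position t
                 (for example, projected BiLSTM hidden states).
    transitions: K x K list; transitions[a][b] is the score of moving from label a to b.
    Returns (best_path, best_score), where the score of a path is the sum of its
    emission scores and transition scores.
    """
    n, K = len(emissions), len(emissions[0])
    score = list(emissions[0])      # best score of any path ending in each label
    backptr = []                    # backptr[t-1][b] = best previous label for b at t
    for t in range(1, n):
        new_score, ptr = [], []
        for b in range(K):
            best_a = max(range(K), key=lambda a: score[a] + transitions[a][b])
            ptr.append(best_a)
            new_score.append(score[best_a] + transitions[best_a][b] + emissions[t][b])
        score = new_score
        backptr.append(ptr)
    last = max(range(K), key=lambda k: score[k])
    path = [last]
    for ptr in reversed(backptr):   # walk the back-pointers from the end to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]

labels = ["O", "B", "I"]            # hypothetical label set
emissions = [[0.0, 2.0, 0.0],       # made-up scores for a three-character sentence
             [0.5, 0.0, 0.5],
             [1.0, 0.0, 1.0]]
transitions = [[0.0, 0.0, -10.0],   # O -> I is heavily penalised
               [0.0, -1.0, 1.0],    # B -> I is encouraged
               [0.0, -1.0, 0.5]]
best_path, best_score = viterbi_decode(emissions, transitions)
best_labels = [labels[k] for k in best_path]
```

The transition matrix is what lets the model rule out invalid label sequences such as an inside label directly after an outside label, which per-position classification alone cannot do.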
CN201910653281.3A 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary Active CN110598203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910653281.3A CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910653281.3A CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Publications (2)

Publication Number Publication Date
CN110598203A CN110598203A (en) 2019-12-20
CN110598203B true CN110598203B (en) 2023-08-01

Family

ID=68853045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910653281.3A Active CN110598203B (en) 2019-07-19 2019-07-19 Method and device for extracting entity information of military design document combined with dictionary

Country Status (1)

Country Link
CN (1) CN110598203B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125378A (en) * 2019-12-25 2020-05-08 同方知网(北京)技术有限公司 Closed-loop entity extraction method based on automatic sample labeling
CN113051887A (en) * 2019-12-26 2021-06-29 深圳市北科瑞声科技股份有限公司 Method, system and device for extracting announcement information elements
CN110874531B (en) * 2020-01-20 2020-07-10 湖南蚁坊软件股份有限公司 Topic analysis method and device and storage medium
CN111309925B (en) * 2020-02-10 2023-06-30 同方知网数字出版技术股份有限公司 Knowledge graph construction method for military equipment
CN111324742B (en) * 2020-02-10 2024-01-23 同方知网数字出版技术股份有限公司 Method for constructing digital human knowledge graph
CN111324745A (en) * 2020-02-18 2020-06-23 深圳市一面网络技术有限公司 Word stock generation method and device
CN111444723B (en) * 2020-03-06 2023-07-28 深圳追一科技有限公司 Information extraction method, computer device, and storage medium
CN111476034B (en) * 2020-04-07 2023-05-12 同方赛威讯信息技术有限公司 Legal document information extraction method and system based on combination of rules and models
CN111611799B (en) * 2020-05-07 2023-06-02 北京智通云联科技有限公司 Entity attribute extraction method, system and equipment based on dictionary and sequence labeling model
CN112036183B (en) * 2020-08-31 2024-02-02 湖南星汉数智科技有限公司 Word segmentation method, device, computer device and computer storage medium based on BiLSTM network model and CRF model
CN112036184A (en) * 2020-08-31 2020-12-04 湖南星汉数智科技有限公司 Entity identification method, device, computer device and storage medium based on BilSTM network model and CRF model
CN113326700B (en) * 2021-02-26 2024-05-14 西安理工大学 ALBert-based complex heavy equipment entity extraction method
CN115114917A (en) * 2021-03-17 2022-09-27 航天科工深圳(集团)有限公司 Military named entity recognition method and device based on vocabulary enhancement
CN113254594B (en) * 2021-06-21 2022-01-14 国能信控互联技术有限公司 Smart power plant-oriented safety knowledge graph construction method and system
CN113657105A (en) * 2021-08-31 2021-11-16 平安医疗健康管理股份有限公司 Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN113806481B (en) * 2021-09-17 2022-11-04 中国人民解放军国防科技大学 Operation event extraction method oriented to encyclopedic data
CN115906844B (en) * 2022-11-02 2023-08-29 中国兵器工业计算机应用技术研究所 Rule template-based information extraction method and system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102789449A (en) * 2011-05-20 2012-11-21 日电(中国)有限公司 Method and device for evaluating comment text
CN107239446A (en) * 2017-05-27 2017-10-10 中国矿业大学 An intelligence relation extraction method based on neural network and attention mechanism
CN108628824A (en) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 A kind of entity recognition method based on Chinese electronic health record
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109948152A (en) * 2019-03-06 2019-06-28 北京工商大学 A kind of Chinese text grammer error correcting model method based on LSTM

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI608367B (en) * 2012-01-11 2017-12-11 國立臺灣師範大學 Text readability measuring system and method thereof
CN105138724B (en) * 2015-07-17 2019-03-08 中国电子科技集团公司电子科学研究院 A kind of open simulation scenario edit methods and device of generic Extensible
US10839284B2 (en) * 2016-11-03 2020-11-17 Salesforce.Com, Inc. Joint many-task neural network model for multiple natural language processing (NLP) tasks
CN107122416B (en) * 2017-03-31 2021-07-06 北京大学 Chinese event extraction method
US10474709B2 (en) * 2017-04-14 2019-11-12 Salesforce.Com, Inc. Deep reinforced model for abstractive summarization
US10394958B2 (en) * 2017-11-09 2019-08-27 Conduent Business Services, Llc Performing semantic analyses of user-generated text content using a lexicon
CN109446523B (en) * 2018-10-23 2023-04-25 重庆誉存大数据科技有限公司 Entity attribute extraction model based on BiLSTM and conditional random field
CN109871538A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of Chinese electronic health record name entity recognition method
CN109858041B (en) * 2019-03-07 2023-02-17 北京百分点科技集团股份有限公司 Named entity recognition method combining semi-supervised learning with user-defined dictionary

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
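Research on Key Technologies of Named Entity Recognition and Related Information Extraction for the Military Domain; 宋瑞亮 (Song Ruiliang); China Master's Theses Full-text Database (Information Science and Technology); I138-4703 *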

Also Published As

Publication number Publication date
CN110598203A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN110598203B (en) Method and device for extracting entity information of military design document combined with dictionary
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN111222305B (en) Information structuring method and device
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN110287480B (en) Named entity identification method, device, storage medium and terminal equipment
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN110457689B (en) Semantic processing method and related device
CN109284400A (en) A kind of name entity recognition method based on Lattice LSTM and language model
CN106570179A (en) Evaluative text-oriented kernel entity identification method and apparatus
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN108829823A (en) A kind of file classification method
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN112417823B (en) Chinese text word order adjustment and word completion method and system
CN110705217B (en) Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment
CN114881043B (en) Deep learning model-based legal document semantic similarity evaluation method and system
CN114528400A (en) Unified low-sample relation extraction method and device based on multi-selection matching network
CN113869054A (en) Deep learning-based electric power field project feature identification method
CN112307756A (en) Bi-LSTM and word fusion-based Chinese word segmentation method
CN118277531A (en) Question and answer method and device based on intent splitting and title association
CN114757191B (en) Deep learning-based electric public opinion field named entity identification method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant