CN110008469A - Multilevel named entity recognition method - Google Patents
Multilevel named entity recognition method
- Publication number
- CN110008469A CN110008469A CN201910207179.0A CN201910207179A CN110008469A CN 110008469 A CN110008469 A CN 110008469A CN 201910207179 A CN201910207179 A CN 201910207179A CN 110008469 A CN110008469 A CN 110008469A
- Authority
- CN
- China
- Prior art keywords
- sequence
- entity
- text
- vector
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The present invention proposes a multi-level named entity recognition method, comprising: S1, preprocessing the data text to obtain a vocabulary C; S2, using pre-trained word vectors, combined with the information feature sequence of the text, to obtain a vector representation of the text; S3, encoding the vector representation of the text to obtain an encoded text feature vector sequence; S4, decoding the text feature vector sequence with a CRF model and labeling the entities in it; S5, taking the above information, the below information and the self-information of each labeled entity as the candidate sequence for the subsequent recognition pass; S6, inputting the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism and computing an attention vector; S7, inputting the attention vector and the text feature vector sequence into the CRF model and labeling the entities in the sequence.
Description
Technical field
The present invention relates to the field of natural language processing, and in particular to a multi-level named entity recognition method.
Background technique
Natural language processing, an intersection of computer science and artificial intelligence, has developed rapidly alongside the wider AI field. Named entity recognition (NER) is a basic task of natural language processing whose purpose is to identify the meaningful entities in a text and classify them; the entity types mainly include person names, organization names, places and other proper nouns. With the massive data generated on the internet, the NER task continues to attract attention from both academia and industry, and it is widely used in other natural language processing tasks such as machine translation, question answering and information retrieval.
Current named entity recognition methods include traditional rule-based methods, dictionary-based methods and statistics-based methods; the most representative statistical methods are the hidden Markov model (HMM) and the conditional random field (CRF). With the rise of deep learning, many neural-network-based methods have also emerged, such as named entity recognition with long short-term memory networks (LSTM), and combinations of traditional and neural methods have achieved good results.
Rule- and dictionary-based methods depend heavily on the construction of the dictionaries and rules, so they are only suitable for small, restricted-domain corpora; they struggle with large-scale corpora and are severely limited when handling new terms. Statistics-based methods rely on manual feature extraction, which consumes a great deal of manpower and time. Many current neural-network methods can remedy these shortcomings to some extent, but for uncommon words or semantically unclear vocabulary appearing in the text, the recall and accuracy of such methods still leave room for improvement.
Summary of the invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a multi-level named entity recognition method. By recognizing entities over repeated passes with a reasoning unit and a storage unit, the present invention effectively addresses the low accuracy of named entity recognition on uncommon words and semantically unclear vocabulary in practical applications, and improves the recall and accuracy on text information.
To achieve the above and other related objects, the present invention provides a multi-level named entity recognition method comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using pre-trained word vectors, combined with the information feature sequence of the text, obtain a vector representation of the text;
S3: encode the vector representation of the text to obtain an encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model and label the entities in the text feature vector sequence;
S5: take the above information, the below information and the self-information of each labeled entity as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute an attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
Optionally, preprocessing the data text to obtain the information feature sequence of the text specifically includes: first splitting a long text into sentences with the period as the marker; segmenting all sentences into words; then removing duplicate words to build the vocabulary C.
Optionally, using pre-trained word vectors combined with the information feature sequence of the text to obtain the vector representation of the text specifically includes: pre-training on a corpus to obtain word vectors; representing the input text in one-hot form as its information feature sequence; obtaining the word vector representation of each word from the vocabulary C, the information feature sequence and the pre-trained word vectors; performing word embedding on the segmented sentences to obtain the vector representation X of the text.
Optionally, the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
Optionally, decoding the text feature vector sequence with the CRF model and labeling the entities in it specifically includes: inputting the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} into the CRF model, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every span labeled "B...I...E" or "S" in the labeling result is a recognized entity e, and the entity set is denoted E = (e_1, e_2, e_3, ..., e_m), where m is the number of entities.
Optionally, the predicted label sequence is computed as follows:
Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible labels of the BIOES tagging scheme. The score of labeling x with the tag sequence y is:
score(y, x) = Σ_{i,j} λ_j · t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k · s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature. s_k(y_i, x, i) describes whether the current label node y_i labels position i of x; its value is also 0 or 1 and it is called a state feature. λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.
Exponentiating and normalizing score(y, x) gives the conditional probability p(y | x) of labeling x with y:
p(y | x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, ..., y'_n) ranging over the possible label sequences.
The model is solved with the Viterbi algorithm: the y' with the maximum probability is taken as the labeling result and denoted l, i.e. l = argmax_{y'} p(y' | x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.
Optionally, for each entity e_i (i = 1, 2, 3, ..., m) in the entity set E, combining the forward LSTM hidden output h→ and the backward LSTM hidden output h←, each entity is represented in the form V' = [v1, v2, v3, v4], where v1 is the entity's above information, v2 the entity self-information obtained from the forward LSTM, v3 the entity self-information obtained from the backward LSTM, and v4 the entity's below information. The entity sequence in the storage unit is then V = {V'_1, V'_2, V'_3, ..., V'_m'}, where m' is the number of entities already deposited into the storage unit, and this entity sequence is taken as the candidate sequence.
Optionally, the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which obtains the degree of influence of each entity entry on the text feature vector sequence H and yields the attention vector sequence S.
For the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, ..., V'_m'}, the dot product of h_i at each moment with every V'_j is computed to obtain the attention scores σ, where i = 1, 2, 3, ..., t and j = 1, 2, 3, ..., m'.
At any moment t, the attention score of the candidate sequence V with respect to h_t is:
σ_t = {h_t · V'_1, h_t · V'_2, ..., h_t · V'_m'}
The attention scores are converted into a probability distribution used as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
The candidate sequence V is weighted and summed to obtain the attention vector s of V with respect to h_t at any moment t:
s_t = Σ_j α_{t,j} · V'_j
After the results of all moments are computed, the attention vector sequence of the candidate sequence V with respect to the text feature vector sequence H is S = {s_1, s_2, s_3, ..., s_t}.
Optionally, the method further includes: computing the similarity between an entity from step S7 and the entities from step S5, and treating the entity as a new entity when the similarity is below a similarity threshold.
Optionally, the similarity is computed with the cosine similarity:
sim(a, b) = (a · b) / (|a| · |b|)
As described above, the multi-level named entity recognition method of the present invention has the following beneficial effects:
1. Compared with methods that perform named entity recognition only once, this method recognizes entities repeatedly, and the repeated recognition improves the recall of the named entity recognition task;
2. Existing methods are better suited to short texts, and their efficiency declines on long texts. The present invention designs a storage unit that keeps important entity information together with its contextual information, which makes long texts tractable, and a candidate unit that reduces the space overhead of the storage unit;
3. A reasoning unit is designed and combined with the recognition process, which improves the recognition of uncommon words and semantically unclear words in the text and raises the accuracy and recall of the system.
Brief description of the drawings
To further explain the content described in the present invention, a specific embodiment of the present invention is explained in more detail below with reference to the accompanying drawings. It should be understood that these drawings serve only as typical examples and are not to be taken as limiting the scope of the present invention.
Fig. 1 is a flow chart of the named entity recognition system;
Fig. 2 is a schematic diagram of the storage unit structure and of the candidate unit structure;
Fig. 3 is a schematic diagram of the reasoning unit structure;
Fig. 4 is a schematic diagram of the system structure;
Fig. 5 is a schematic diagram of the CRF structure;
Fig. 6 is a schematic diagram of the LSTM cell structure.
Specific embodiment
The embodiments of the present invention are illustrated below by specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed from different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in them can be combined with each other.
It should also be noted that the illustrations provided in the following embodiments only schematically explain the basic idea of the invention; the drawings show only the components related to the invention rather than the actual number, shape and size of the components, whose form, quantity and proportion may vary in actual implementation and whose layout may be more complex.
The present invention provides a multi-level named entity recognition method that recognizes entities over several passes, gradually recognizing the entities left unrecognized by the earlier named entity recognition passes. The key problems the invention solves are the following two:
1. A traditional named entity recognition system performs only a single recognition pass, so part of the vocabulary in the recognition result may remain unrecognized or be recognized incorrectly, making the recall and accuracy of named entity recognition low; this is more serious for uncommon words, such as rare person or place names, and for semantically unclear vocabulary in the text;
2. Many current named entity recognition systems handle long texts poorly; the present invention solves this by storing the relevant entity information in a storage unit.
The core idea of the invention is to save the contextual information of each recognized entity together with the entity's own information. Because the combination of an entity's contextual information and its self-information can be regarded as clause-level information, this information, used over repeated recognition passes, helps the system recognize the entities in the text that have not yet been recognized; the specific method is illustrated in the following embodiment.
An embodiment of the present invention is shown in Fig. 1, which contains the overall idea of the invention. The tagging scheme used by the present invention is BIOES: "B" denotes the beginning character of a multi-character entity, "I" a middle character of a multi-character entity, "O" a non-entity word, "E" the ending character of a multi-character entity, and "S" a single word that is an entity by itself; "PER" denotes a person name, "ORG" an organization name and "LOC" a place name.
As shown in Fig. 1, this embodiment provides a multi-level named entity recognition method comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using pre-trained word vectors, combined with the information feature sequence of the text, obtain a vector representation of the text;
S3: encode the vector representation of the text to obtain an encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model and label the entities in the text feature vector sequence;
S5: take the above information, the below information and the self-information of each labeled entity as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute an attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
In step S1, the data text is preprocessed. Specifically, a long text is first split into sentences with the period as the marker, and the processed sentences are stored one per line. All sentences are then segmented with the open-source tool jieba, splitting each sentence into words; duplicate words are removed to obtain the vocabulary, denoted C.
In step S2, pre-trained word vectors are combined with the information feature sequence of the text to obtain the vector representation of the text. Specifically, the Chinese Wikipedia corpus is first segmented with jieba, and the open-source tool word2vec is used to pre-train on the segmented corpus, yielding word vectors; let their dimension be d. The input text is first represented in one-hot form as its information feature sequence; then, from the vocabulary C and the pre-trained word vectors, the word vector representation of each word is obtained. Performing word embedding on the segmented sentences yields the vector representation X of the text, where X = {x_1, x_2, x_3, ..., x_n}, X ∈ R^{d×n}, and n is the number of words in the segmented sentence.
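As a concrete illustration of steps S1 and S2, the following sketch uses the open-source tools named above (jieba for segmentation, word2vec via gensim for pre-training); the sample text, the dimension d = 100 and all variable names are illustrative assumptions rather than values fixed by the invention, and gensim 4.x calls the dimension parameter vector_size.

```python
# Sketch of steps S1-S2, assuming jieba and gensim are installed.
import jieba
import numpy as np
from gensim.models import Word2Vec

def preprocess(long_text):
    """S1: split on the period, segment every sentence, build vocabulary C."""
    sentences = [s for s in long_text.split("。") if s]
    segmented = [jieba.lcut(s) for s in sentences]             # word segmentation
    vocab_C = sorted({w for sent in segmented for w in sent})  # duplicates removed
    return segmented, vocab_C

# Illustrative stand-in corpus; in the patent the pre-training corpus
# is the segmented Chinese Wikipedia.
segmented, vocab_C = preprocess("汤姆遇见杰瑞。李遇见蔡。")
w2v = Word2Vec(sentences=segmented, vector_size=100, min_count=1)  # d = 100

# S2: vector representation X of one sentence, X in R^{d x n} (one column per word).
X = np.stack([w2v.wv[w] for w in segmented[0]]).T
print(X.shape)  # (100, n)
```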
In step S3, the vector representation X of the text, obtained through the processing of steps S1 and S2, is fed as the input sequence into a bidirectional LSTM model. The LSTM extracts text feature information, and since the feature information of a word over a span of time often includes not only the influence of the preceding words but also that of the following ones, a bidirectional LSTM can fully extract the text feature vector sequence from both directions. The bidirectional LSTM encodes the vector representation X into the text feature vector sequence H, where H = {h_1, h_2, h_3, ..., h_t} is the hidden-layer output of the bidirectional LSTM at each moment, and each h_t is the concatenation of the forward LSTM hidden output h→_t and the backward LSTM hidden output h←_t.
The LSTM model in this step is shown in Fig. 6, and its formulas are described as follows.
The LSTM first decides which information to forget and which to retain; f_t is the output of the forget gate:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
It then decides which information to update; i_t is the output of the memory gate and C̃_t the temporary (candidate) cell state of the LSTM:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
The cell state of the LSTM is then updated, where C_t and C_{t-1} are the current cell state and the cell state of the previous moment:
C_t = f_t * C_{t-1} + i_t * C̃_t
Finally the hidden-layer information is output; o_t is the value of the output gate:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
where x_t is the input of the current moment, h_{t-1} the hidden-layer output of the previous moment and h_t the hidden-layer output of the current moment; each W is the weight matrix and each b the bias of the corresponding function; σ is the sigmoid function and tanh the hyperbolic tangent.
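For clarity, the gate equations above can be written out directly; the following minimal NumPy sketch of a single LSTM step assumes the weight matrices W and biases b are supplied from outside, with illustrative names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W and b are dicts keyed by gate name ('f', 'i', 'C', 'o')."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # memory (input) gate
    C_tilde = np.tanh(W["C"] @ z + b["C"])   # temporary (candidate) cell state
    C_t = f_t * C_prev + i_t * C_tilde       # updated cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden-layer output
    return h_t, C_t
```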
In step S4, the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} is input into the CRF model shown in Fig. 5, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word. Every span labeled "B...I...E" or "S" in the labeling result is a recognized entity e, and the entity set is denoted E = (e_1, e_2, e_3, ..., e_m), where m is the number of entities.
The predicted label sequence L in this step is computed as follows.
Let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible labels of the BIOES tagging scheme. The score of labeling x with the tag sequence y is:
score(y, x) = Σ_{i,j} λ_j · t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k · s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature. s_k(y_i, x, i) describes whether the current label node y_i labels position i of x; its value is also 0 or 1 and it is called a state feature. λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.
Exponentiating and normalizing score(y, x) gives the conditional probability p(y | x) of labeling x with y:
p(y | x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, ..., y'_n) ranging over the possible label sequences.
The model is solved with the Viterbi algorithm: the y' with the maximum probability is taken as the labeling result and denoted l, i.e. l = argmax_{y'} p(y' | x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.
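As an illustration of the Viterbi solving step, a minimal decoder for a linear-chain model might look as follows; the emission scores (per-position label scores derived from the text features) and the transition scores are assumed to be given, and none of the names below are taken from the patent.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, K) scores of each of K labels at each position;
    transitions: (K, K) scores of moving from label y_{i-1} to label y_i."""
    n, K = emissions.shape
    score = emissions[0].copy()               # best score ending in each label at position 1
    back = np.zeros((n, K), dtype=int)        # backpointers
    for t in range(1, n):
        # cand[prev, cur] = best path score ending in `prev`, then moving to `cur`
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # backtrace the highest-scoring label sequence l_1..l_n
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```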
In step S5, part of the entities have been recognized by step S4, but uncommon words or semantically unclear words may still remain unrecognized in the text; the recognized entity information is therefore stored, and the self-information of the entities together with their contextual information is used to help the system recognize these uncommon words.
The operation of step S5 is as follows:
By the rules of syntax, a noun is usually preceded or followed by a verb or a preposition, and the clause formed by a verb or preposition together with a noun is rich in textual feature information. Because the LSTM hidden output contains the information of the text at every moment, for each entity e_i (i = 1, 2, 3, ..., m) in E, the forward LSTM hidden output h→ provides the information of the moment preceding e_i, stored as v1, and the self-information of e_i, stored as v2; the backward LSTM hidden output h← provides the self-information of e_i, stored as v3, and the information of the moment following e_i, stored as v4. Each entity is thus represented in the form V' = [v1, v2, v3, v4] and deposited into the storage unit. As shown in Fig. 2, v1 is the entity's above information, v2 the entity self-information obtained from the forward LSTM, v3 the entity self-information obtained from the backward LSTM, and v4 the entity's below information. The entity sequence in the storage unit is then V = {V'_1, V'_2, V'_3, ..., V'_m'}, where m' is the number of entities already deposited into the storage unit, and this entity sequence is taken as the candidate sequence V.
Step S6: the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which obtains the degree of influence of each candidate entry on the text feature vector sequence H and yields the attention vector sequence S.
The reasoning unit of this step, shown in Fig. 3, uses the attention mechanism: the degree of attention each entity in the storage unit pays to each feature vector is computed to obtain an attention vector.
For the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, ..., V'_m'}, the dot product of each h_i (i = 1, 2, 3, ..., t) with every V'_j (j = 1, 2, 3, ..., m') is computed to obtain the attention scores σ.
At any moment t, the attention score of V with respect to h_t is:
σ_t = {h_t · V'_1, h_t · V'_2, ..., h_t · V'_m'}
The attention scores are then converted into a probability distribution used as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
Finally V is weighted and summed, giving the attention vector s of V with respect to h_t at any moment t:
s_t = Σ_j α_{t,j} · V'_j
After the results of all moments are computed, the attention vector sequence of V with respect to H is S = {s_1, s_2, s_3, ..., s_t}.
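The reasoning unit's computation can be sketched in NumPy as follows; each candidate entry V'_j is assumed to have been reduced to a single d-dimensional vector so that the dot product with h_i is defined (an illustrative assumption, e.g. by averaging v1..v4).

```python
import numpy as np

def reasoning_unit(H, V):
    """H: (t, d) text feature vectors; V: list of m' candidate vectors, each of shape (d,)."""
    V = np.stack(V)                             # (m', d)
    sigma = H @ V.T                             # attention scores: sigma[i, j] = h_i . V'_j
    sigma -= sigma.max(axis=1, keepdims=True)   # numerical stability before softmax
    alpha = np.exp(sigma)
    alpha /= alpha.sum(axis=1, keepdims=True)   # softmax over the candidates
    S = alpha @ V                               # (t, d): attention vectors s_1..s_t
    return S

# Step S7 then concatenates H and S before the shared CRF decoder:
# HS = np.concatenate([H, S], axis=1)
```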
In step S7, the text feature vector sequence H and the attention vector sequence S are concatenated in the same way as the hidden layers of the bidirectional LSTM; the concatenated vector is written [H:S]. The concatenated vector is input into the CRF model to obtain a new labeling result; the CRF model here shares its parameters with the CRF model of step S4.
In an embodiment, the multi-level named entity recognition method further includes: the new entities obtained from the new labeling of step S7 are represented, by the method of step S5, in the form V'' = [v1, v2, v3, v4], ready to be stored into the storage unit. If new entity information were stored without screening, the amount of data in the storage unit could later grow excessive and waste storage space; therefore, before new entity information is stored into the storage unit, it is first stored into the candidate unit, whose structure is shown in Fig. 2. The similarity between these new words in the candidate unit and the words in the storage unit is computed against a set threshold β; only when the similarity is below the threshold is the new entity deposited into the storage unit.
The similarity between words in this step is computed with the cosine similarity:
sim(a, b) = (a · b) / (|a| · |b|)
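A minimal sketch of this screening step, with the cosine similarity written out and β an illustrative threshold:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity sim(a, b) = (a . b) / (|a| |b|)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def admit_to_storage(new_vec, stored_vecs, beta):
    """Deposit a new entity only if it is not too similar to anything already stored."""
    return all(cosine(new_vec, v) < beta for v in stored_vecs)
```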
In an embodiment, the multi-level named entity recognition method further includes a step S9: repeat the above steps S5 to S8; when step S7 produces no new entities, the named entity recognition process ends.
The named entity recognition process above is illustrated below with an example, shown in Fig. 4:
Step 1: the input sentence is "When Tom met Jerry, Li met Cai". The sentence is first preprocessed and segmented with the jieba Chinese word segmentation tool, giving "When / Tom / met / Jerry / when / Li / met / Cai", where "/" marks the segmentation boundaries; the pre-trained word vectors then give the word vector representation of the sentence.
Step 2: the word vector representation obtained in step 1 is input into the bidirectional LSTM, which is used as the encoder; the hidden-layer output of the LSTM is exactly the encoded text feature vector sequence we require.
Step 3: the text feature vectors of step 2 are input into the CRF model of the next layer, which is used as the decoder; the text feature vectors are decoded and labeled, with the following result:
When | Tom | met | Jerry | when | Li | met | Cai
O | S-PER | O | S-PER | O | O | O | O
The result shows that the entity names "Tom (S-PER)" and "Jerry (S-PER)" are labeled correctly and recognized, while the other two words, "Li" and "Cai", are not labeled, because these two words are comparatively uncommon.
Observing the sentence "When Tom met Jerry, Li met Cai", one can see that although "Li" and "Cai" are rather uncommon words, their contextual information is very close to that of the recognized entities "Tom" and "Jerry": the word "met" appears both in "Tom met Jerry" and in "Li met Cai". In other words, "Tom met Jerry" and "Li met Cai" are similar clauses, entities whose context semantics are very similar; therefore, by storing the contextual information of "Tom" and "Jerry", their information can be used to help the system recognize the two entities "Li" and "Cai".
Step 4: for the entities "Tom" and "Jerry" recognized in step 3, the entity's above information, forward entity information, backward entity information and below information are assembled in that format into a candidate sequence, which is stored into the storage unit.
Step 5: the text feature vector sequence obtained in step 2 and the candidate sequence stored in the storage unit are input into the reasoning unit, which computes the degree of influence of the self-information and contextual information of the entities "Tom" and "Jerry" on each word of the original sentence, giving a group of attention vectors.
Step 6: the attention vectors of step 5 and the text feature vector sequence are input into the CRF model for decoding; with the attention vectors as reference, the CRF model labels the two entities "Li" and "Cai".
Step 7: "Li" and "Cai" are stored into the candidate unit, with the threshold set to β. Computing their similarity to each word in the storage unit shows that the similarity of "Li" to "Tom" and of "Cai" to "Jerry" exceeds the set threshold β, so these two words are not stored into the storage unit.
Step 8: after steps 4 to 7 above are repeated, no new entities are produced, so the named entity recognition ends. The entities finally labeled are:
When | Tom | met | Jerry | when | Li | met | Cai
O | S-PER | O | S-PER | O | S-PER | O | S-PER
Once the entities recognized by the above process are stored into the storage unit, whenever text containing information similar to the entities in the storage unit appears later, that is, text whose clauses resemble those in the storage unit, these entities can be recognized; this solves the problem of low accuracy and recall on long texts.
In the present system, optionally, a fixed number of recognition cycles can be preset in order to reduce the computational cost, instead of using "no new entities produced" as the termination condition of the named entity system. This reduces the amount of computation but lowers recall and accuracy, so in practice the trade-off must be weighed.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.
Claims (10)
1. A multi-level named entity recognition method, characterized in that the multi-level named entity recognition method comprises the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using pre-trained word vectors, combined with the information feature sequence of the text, obtain a vector representation of the text;
S3: encode the vector representation of the text to obtain an encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model and label the entities in the text feature vector sequence;
S5: take the above information, the below information and the self-information of each labeled entity as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute an attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
2. The multi-level named entity recognition method according to claim 1, characterized in that preprocessing the data text to obtain the information feature sequence of the text specifically includes: first splitting a long text into sentences with the period as the marker; segmenting all sentences into words; then removing duplicate words to build the vocabulary C.
3. The multi-level named entity recognition method according to claim 1, characterized in that using pre-trained word vectors combined with the information feature sequence of the text to obtain the vector representation of the text specifically includes: pre-training on a corpus to obtain word vectors; representing the input text in one-hot form as its information feature sequence; obtaining the word vector representation of each word from the vocabulary C, the information feature sequence and the pre-trained word vectors; performing word embedding on the segmented sentences to obtain the vector representation X of the text.
4. The multi-level named entity recognition method according to claim 3, characterized in that the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
5. The multi-level named entity recognition method according to claim 4, characterized in that decoding the text feature vector sequence with the CRF model and labeling the entities in it specifically includes: inputting the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} into the CRF model, which computes the predicted label sequence L = {l_1, l_2, l_3, ..., l_n}, where each l is the label of one word; every span labeled "B...I...E" or "S" in the labeling result is a recognized entity e, and the entity set is E = (e_1, e_2, e_3, ..., e_m), where m is the number of entities.
6. The multi-level named entity recognition method according to claim 5, characterized in that the predicted label sequence is computed as follows: let the tag sequence be Y = {y_1, y_2, y_3, ..., y_n}, where Y ranges over all possible labels of the BIOES tagging scheme; the score of labeling x with the tag sequence y is:
score(y, x) = Σ_{i,j} λ_j · t_j(y_{i-1}, y_i, x, i) + Σ_{i,k} μ_k · s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i, with value 0 or 1, called a transition feature; s_k(y_i, x, i) describes whether the current label node y_i labels position i of x, with value also 0 or 1, called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;
exponentiating and normalizing score(y, x) gives the conditional probability p(y | x) of labeling x with y:
p(y | x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, ..., y'_n) ranging over the possible label sequences;
the model is solved with the Viterbi algorithm, taking the y' with the maximum probability as the labeling result, denoted l = argmax_{y'} p(y' | x), so the predicted label sequence is L = {l_1, l_2, l_3, ..., l_n}.
7. The multi-level named entity recognition method according to claim 6, characterized in that, for each entity e_i (i = 1, 2, 3, ..., m) in the entity set E, combining the forward LSTM hidden output h→ and the backward LSTM hidden output h←, each entity is represented in the form V' = [v1, v2, v3, v4], where v1 is the entity's above information, v2 the entity self-information obtained from the forward LSTM, v3 the entity self-information obtained from the backward LSTM, and v4 the entity's below information; the entity sequence in the storage unit is then V = {V'_1, V'_2, V'_3, ..., V'_m'}, where m' is the number of entities already deposited into the storage unit, and this entity sequence is taken as the candidate sequence.
8. The multi-level named entity recognition method according to claim 7, characterized in that the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which obtains the degree of influence of each entity entry on the text feature vector sequence H and yields the attention vector sequence S;
for the text feature vector sequence H = {h_1, h_2, h_3, ..., h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, ..., V'_m'}, the dot product of h_i at each moment with every V'_j is computed to obtain the attention scores σ, where i = 1, 2, 3, ..., t and j = 1, 2, 3, ..., m';
at any moment t, the attention score of the candidate sequence V with respect to h_t is:
σ_t = {h_t · V'_1, h_t · V'_2, ..., h_t · V'_m'}
the attention scores are converted into a probability distribution used as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
the candidate sequence V is weighted and summed to obtain the attention vector s of V with respect to h_t at any moment t:
s_t = Σ_j α_{t,j} · V'_j
after the results of all moments are computed, the attention vector sequence of the candidate sequence V with respect to the text feature vector sequence H is S = {s_1, s_2, s_3, ..., s_t}.
9. The multi-level named entity recognition method according to claim 8, characterized in that the method further includes: computing the similarity between an entity from step S7 and the entities from step S5, and treating the entity as a new entity when the similarity is below a similarity threshold.
10. The multi-level named entity recognition method according to claim 9, characterized in that the similarity is computed with the cosine similarity:
sim(a, b) = (a · b) / (|a| · |b|)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910207179.0A CN110008469B (en) | 2019-03-19 | 2019-03-19 | Multilevel named entity recognition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910207179.0A CN110008469B (en) | 2019-03-19 | 2019-03-19 | Multilevel named entity recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110008469A true CN110008469A (en) | 2019-07-12 |
CN110008469B CN110008469B (en) | 2022-06-07 |
Family
ID=67167300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910207179.0A Active CN110008469B (en) | 2019-03-19 | 2019-03-19 | Multilevel named entity recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008469B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516241A (en) * | 2019-08-26 | 2019-11-29 | 北京三快在线科技有限公司 | Geographic address resolution method and device, readable storage medium and electronic equipment |
CN110688854A (en) * | 2019-09-02 | 2020-01-14 | 平安科技(深圳)有限公司 | Named entity recognition method, device and computer readable storage medium |
CN110852108A (en) * | 2019-11-11 | 2020-02-28 | 中山大学 | Joint training method, apparatus and medium for entity recognition and entity disambiguation |
CN111241832A (en) * | 2020-01-15 | 2020-06-05 | 北京百度网讯科技有限公司 | Core entity labeling method and device and electronic equipment |
CN111274815A (en) * | 2020-01-15 | 2020-06-12 | 北京百度网讯科技有限公司 | Method and device for mining entity attention points in text |
CN111581957A (en) * | 2020-05-06 | 2020-08-25 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN111832293A (en) * | 2020-06-24 | 2020-10-27 | 四川大学 | Entity and relation combined extraction method based on head entity prediction |
CN111858817A (en) * | 2020-07-23 | 2020-10-30 | 中国石油大学(华东) | BilSTM-CRF path inference method for sparse track |
CN112151183A (en) * | 2020-09-23 | 2020-12-29 | 上海海事大学 | Entity identification method of Chinese electronic medical record based on Lattice LSTM model |
CN112185572A (en) * | 2020-09-25 | 2021-01-05 | 志诺维思(北京)基因科技有限公司 | Tumor specific disease database construction system, method, electronic device and medium |
CN112307208A (en) * | 2020-11-05 | 2021-02-02 | Oppo广东移动通信有限公司 | Long text classification method, terminal and computer storage medium |
CN112836514A (en) * | 2020-06-19 | 2021-05-25 | 合肥量圳建筑科技有限公司 | Nested entity recognition method and device, electronic equipment and storage medium |
WO2021114745A1 (en) * | 2019-12-13 | 2021-06-17 | 华南理工大学 | Named entity recognition method employing affix perception for use in social media |
CN114648028A (en) * | 2020-12-21 | 2022-06-21 | 阿里巴巴集团控股有限公司 | Method and device for training label model, electronic equipment and storage medium |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
US20170109355A1 (en) * | 2015-10-16 | 2017-04-20 | Baidu Usa Llc | Systems and methods for human inspired simple question answering (hisqa) |
CN106980608A (en) * | 2017-03-16 | 2017-07-25 | 四川大学 | A kind of Chinese electronic health record participle and name entity recognition method and system |
WO2017165038A1 (en) * | 2016-03-21 | 2017-09-28 | Amazon Technologies, Inc. | Speaker verification method and system |
CN107871158A (en) * | 2016-09-26 | 2018-04-03 | 清华大学 | A kind of knowledge mapping of binding sequence text message represents learning method and device |
CN107977353A (en) * | 2017-10-12 | 2018-05-01 | 北京知道未来信息技术有限公司 | A kind of mixing language material name entity recognition method based on LSTM-CNN |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN108536754A (en) * | 2018-03-14 | 2018-09-14 | 四川大学 | Electronic health record entity relation extraction method based on BLSTM and attention mechanism |
CN109062893A (en) * | 2018-07-13 | 2018-12-21 | 华南理工大学 | A kind of product name recognition methods based on full text attention mechanism |
CN109062901A (en) * | 2018-08-14 | 2018-12-21 | 第四范式(北京)技术有限公司 | Neural network training method and device and name entity recognition method and device |
CN109359293A (en) * | 2018-09-13 | 2019-02-19 | 内蒙古大学 | Mongolian name entity recognition method neural network based and its identifying system |
- 2019-03-19 CN CN201910207179.0A patent/CN110008469B/en active Active
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170109355A1 (en) * | 2015-10-16 | 2017-04-20 | Baidu Usa Llc | Systems and methods for human inspired simple question answering (hisqa) |
WO2017165038A1 (en) * | 2016-03-21 | 2017-09-28 | Amazon Technologies, Inc. | Speaker verification method and system |
CN107871158A (en) * | 2016-09-26 | 2018-04-03 | 清华大学 | A kind of knowledge mapping of binding sequence text message represents learning method and device |
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN106980608A (en) * | 2017-03-16 | 2017-07-25 | 四川大学 | A kind of Chinese electronic health record participle and name entity recognition method and system |
CN107977353A (en) * | 2017-10-12 | 2018-05-01 | 北京知道未来信息技术有限公司 | A kind of mixing language material name entity recognition method based on LSTM-CNN |
CN108536754A (en) * | 2018-03-14 | 2018-09-14 | 四川大学 | Electronic health record entity relation extraction method based on BLSTM and attention mechanism |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN109062893A (en) * | 2018-07-13 | 2018-12-21 | 华南理工大学 | A kind of product name recognition methods based on full text attention mechanism |
CN109062901A (en) * | 2018-08-14 | 2018-12-21 | 第四范式(北京)技术有限公司 | Neural network training method and device and name entity recognition method and device |
CN109359293A (en) * | 2018-09-13 | 2019-02-19 | 内蒙古大学 | Mongolian name entity recognition method neural network based and its identifying system |
Non-Patent Citations (6)
Title |
---|
GUL KHAN SAFI QAMAS等: "基于深度神经网络的命名实体识别方法研究", 《信息网络安全》 * |
HUI-KANG YI等: "A Chinese Named Entity Recognition System with Neural Networks", 《ITM WEB OF CONFERENCES》 * |
XIAOCHENG FENG等: "Multi-Level Cross-Lingual Attentive Neural Architecture for Low Resource Name Tagging", 《TSINGHUA SCIENCE AND TECHNOLOGY》 * |
姜宇新: "基于深度学习的生物医学命名实体识别研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 * |
常量: "图像理解中的卷积神经网络", 《自动化学报》 * |
张璞: "基于深度学习的中文微博评价对象抽取方法", 《计算机工程与设计》 * |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516241B (en) * | 2019-08-26 | 2021-03-02 | 北京三快在线科技有限公司 | Geographic address resolution method and device, readable storage medium and electronic equipment |
CN110516241A (en) * | 2019-08-26 | 2019-11-29 | 北京三快在线科技有限公司 | Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment |
CN110688854A (en) * | 2019-09-02 | 2020-01-14 | 平安科技(深圳)有限公司 | Named entity recognition method, device and computer readable storage medium |
CN110852108A (en) * | 2019-11-11 | 2020-02-28 | 中山大学 | Joint training method, apparatus and medium for entity recognition and entity disambiguation |
WO2021114745A1 (en) * | 2019-12-13 | 2021-06-17 | 华南理工大学 | Named entity recognition method employing affix perception for use in social media |
CN111241832A (en) * | 2020-01-15 | 2020-06-05 | 北京百度网讯科技有限公司 | Core entity labeling method and device and electronic equipment |
CN111274815A (en) * | 2020-01-15 | 2020-06-12 | 北京百度网讯科技有限公司 | Method and device for mining entity attention points in text |
CN111274815B (en) * | 2020-01-15 | 2024-04-12 | 北京百度网讯科技有限公司 | Method and device for mining entity focus point in text |
US11775761B2 (en) | 2020-01-15 | 2023-10-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for mining entity focus in text |
CN111241832B (en) * | 2020-01-15 | 2023-08-15 | 北京百度网讯科技有限公司 | Core entity labeling method and device and electronic equipment |
CN111581957B (en) * | 2020-05-06 | 2022-04-12 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN111581957A (en) * | 2020-05-06 | 2020-08-25 | 浙江大学 | Nested entity detection method based on pyramid hierarchical network |
CN112836514A (en) * | 2020-06-19 | 2021-05-25 | 合肥量圳建筑科技有限公司 | Nested entity recognition method and device, electronic equipment and storage medium |
CN111832293B (en) * | 2020-06-24 | 2023-05-26 | 四川大学 | Entity and relation joint extraction method based on head entity prediction |
CN111832293A (en) * | 2020-06-24 | 2020-10-27 | 四川大学 | Entity and relation combined extraction method based on head entity prediction |
CN111858817B (en) * | 2020-07-23 | 2021-05-18 | 中国石油大学(华东) | BilSTM-CRF path inference method for sparse track |
CN111858817A (en) * | 2020-07-23 | 2020-10-30 | 中国石油大学(华东) | BilSTM-CRF path inference method for sparse track |
CN112151183A (en) * | 2020-09-23 | 2020-12-29 | 上海海事大学 | Entity identification method of Chinese electronic medical record based on Lattice LSTM model |
CN112185572A (en) * | 2020-09-25 | 2021-01-05 | 志诺维思(北京)基因科技有限公司 | Tumor specific disease database construction system, method, electronic device and medium |
CN112185572B (en) * | 2020-09-25 | 2024-03-01 | 志诺维思(北京)基因科技有限公司 | Tumor specific disease database construction system, method, electronic equipment and medium |
CN112307208A (en) * | 2020-11-05 | 2021-02-02 | Oppo广东移动通信有限公司 | Long text classification method, terminal and computer storage medium |
CN114648028A (en) * | 2020-12-21 | 2022-06-21 | 阿里巴巴集团控股有限公司 | Method and device for training label model, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110008469B (en) | 2022-06-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110008469A (en) | Multilevel named entity recognition method | |
CN111783462B (en) | Chinese named entity recognition model and method based on double neural network fusion | |
CN110245229B (en) | Deep learning theme emotion classification method based on data enhancement | |
CN107291693B (en) | Semantic calculation method for improved word vector model | |
CN106776581B (en) | Subjective text emotion analysis method based on deep learning | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN109284400B (en) | Named entity identification method based on Lattice LSTM and language model | |
CN111708882B (en) | Transformer-based Chinese text information missing completion method | |
CN110263325B (en) | Chinese word segmentation system | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN110232192A (en) | Electric power term names entity recognition method and device | |
CN108932226A (en) | A kind of pair of method without punctuate text addition punctuation mark | |
CN113255320A (en) | Entity relation extraction method and device based on syntax tree and graph attention machine mechanism | |
CN111666758A (en) | Chinese word segmentation method, training device and computer readable storage medium | |
CN110188175A (en) | A kind of question and answer based on BiLSTM-CRF model are to abstracting method, system and storage medium | |
CN112364623A (en) | Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method | |
CN112464669B (en) | Stock entity word disambiguation method, computer device, and storage medium | |
CN113128203A (en) | Attention mechanism-based relationship extraction method, system, equipment and storage medium | |
CN113449084A (en) | Relationship extraction method based on graph convolution | |
CN115600597A (en) | Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium | |
CN111368542A (en) | Text language association extraction method and system based on recurrent neural network | |
CN113312918B (en) | Word segmentation and capsule network law named entity identification method fusing radical vectors | |
CN112699685A (en) | Named entity recognition method based on label-guided word fusion | |
CN111428501A (en) | Named entity recognition method, recognition system and computer readable storage medium | |
CN114417874B (en) | Chinese named entity recognition method and system based on graph attention network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||