
CN110008469A - Multi-level named entity recognition method - Google Patents

Multi-level named entity recognition method

Info

Publication number
CN110008469A
CN110008469A
Authority
CN
China
Prior art keywords
sequence
entity
text
vector
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910207179.0A
Other languages
Chinese (zh)
Other versions
CN110008469B (en)
Inventor
常亮
王文凯
宾辰忠
宣闻
秦赛歌
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN201910207179.0A
Publication of CN110008469A
Application granted
Publication of CN110008469B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a multi-level named entity recognition method, comprising: S1, preprocessing the data text to obtain a vocabulary C; S2, using pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text; S3, encoding the vector representation of the text to obtain the encoded text feature vector sequence; S4, decoding the text feature vector sequence with a CRF model and labeling the entities in the text feature vector sequence; S5, taking the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass; S6, inputting the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism and computing an attention vector; S7, inputting the attention vector and the text feature vector sequence into the CRF model and labeling the entities in the sequence.

Description

Multi-level named entity recognition method
Technical field
The present invention relates to the field of natural language processing, and in particular to a multi-level named entity recognition method.
Background art
Natural language processing, an intersection of computer science and artificial intelligence, keeps developing with the rapid progress of the artificial intelligence field. Named entity recognition (NER) is a basic task of natural language processing; its purpose is to identify the meaningful entities in a text and classify them. The entity types mainly include person names, organization names, places and some other proper nouns. With the massive data generated on the internet, the named entity recognition task has been receiving growing attention from academia and industry, and it is widely applied in other natural language processing tasks such as machine translation, intelligent question answering and information retrieval.
Current named entity recognition methods include traditional rule-based methods, traditional dictionary-based methods and traditional statistics-based methods; the most representative are the statistics-based hidden Markov model (HMM) and conditional random field model (CRF). With the rise of deep learning, many neural-network-based methods have also emerged, such as named entity recognition with long short-term memory networks (LSTM), and combining traditional methods with neural methods has achieved good results.
Rule-based and dictionary-based methods depend heavily on the construction of dictionaries and rules, so they are only suitable for small-scale, restricted-domain corpora; they struggle with large-scale corpora and are very limited when handling new words. Statistics-based methods rely on manual feature extraction, which consumes a great deal of manpower and time. Many current neural-network-based methods can remedy the shortcomings of traditional methods to some extent, but for rare words or semantically unclear words appearing in the text, the recall and precision of such methods still leave room for improvement.
Summary of the invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a multi-level named entity recognition method. By recognizing repeatedly with a reasoning unit and a storage unit, the present invention effectively alleviates the low accuracy of named entity recognition on rare words and semantically unclear words in practical applications, and improves the recall and precision on text information.
To achieve the above and other related purposes, the present invention provides a multi-level named entity recognition method, comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtain the vector representation of the text;
S3: encode the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model, and label the entities in the text feature vector sequence;
S5: take the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute the attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model, and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
Optionally, preprocessing the data text and obtaining the one-hot feature sequence of the text specifically includes:
first, taking the full stop as the delimiter, splitting a long text into sentences;
segmenting all sentences into words;
then removing duplicate words to build the vocabulary C.
Optionally, using the pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text specifically includes:
pre-training on a corpus to obtain the word vectors;
expressing the input text in one-hot form as the one-hot feature sequence;
obtaining the word vector representation of each phrase through the vocabulary C, the one-hot feature sequence and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
Optionally, the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
Optionally, decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically includes:
the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} is input into the CRF model, and the predicted label sequence L = {l_1, l_2, l_3, …, l_n} is computed by the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
Optionally, the predicted label sequence is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x; its value is also 0 or 1 and it is called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence;
solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
Optionally, for each entity e_i (i = 1, 2, 3, …, m) in the entity set E, combining the forward LSTM hidden outputs and the backward LSTM hidden outputs, each entity is expressed in the form V' = [v_1, v_2, v_3, v_4], where v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity; the entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and the entity sequence in the storage unit is taken as the candidate sequence.
Optionally, the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each entity record in V on the text feature vector sequence H to obtain the attention vector sequence S;
for the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i (i = 1, 2, 3, …, t) with every V'_j (j = 1, 2, 3, …, m') is computed to obtain the attention scores σ;
at any time step t, the attention scores of the candidate sequence V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
a weighted sum over the candidate sequence V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
after computing the result for all time steps, the attention vector sequence of the candidate sequence V over the text feature vector sequence H is obtained: S = {s_1, s_2, s_3, …, s_t}.
Optionally, the method further includes: computing the similarity between the entities of step S7 and the entities of step S5; when the similarity is below a similarity threshold, the entity is treated as a new entity.
Optionally, the similarity is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
As described above, the multi-level named entity recognition method of the present invention has the following beneficial effects:
1. Compared with methods that perform named entity recognition only once, this method recognizes repeatedly, and the repeated passes improve the recall of the named entity recognition task;
2. Existing methods are relatively suited to short texts, and their effectiveness declines on longer texts; the present invention designs a storage unit that stores important entity information together with its context information and handles long texts in this way, while a candidate unit is designed to reduce the space overhead of the storage unit;
3. A reasoning unit is designed, and entity recognition is performed in combination with it; for rare words and semantically unclear words in the text, the recognition effect is improved, raising the precision and recall of the system.
Brief description of the drawings
To further explain the described content, the specific embodiments of the present invention are explained in more detail below with reference to the accompanying drawings. It should be understood that these drawings serve only as typical examples and are not to be taken as limiting the scope of the present invention.
Fig. 1 is the flow chart of the named entity recognition system;
Fig. 2 shows the structure of the storage unit and the structure of the candidate unit;
Fig. 3 is the structural diagram of the reasoning unit;
Fig. 4 is the structural diagram of the system;
Fig. 5 is the structural diagram of the CRF;
Fig. 6 is the structural diagram of the LSTM cell.
Specific embodiment
The embodiments of the present invention are illustrated below by specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
It should be noted that the illustrations provided in the following embodiments only schematically explain the basic idea of the present invention; the drawings show only the components related to the present invention rather than the component count, shapes and sizes of an actual implementation, in which the form, quantity and proportion of each component may change arbitrarily and the component layout may be more complex.
The present invention provides a multi-level named entity recognition method that recognizes repeatedly, gradually recognizing the entities in the text that went unrecognized during the previous recognition passes. The key problems solved by the invention include the following two points:
1. A traditional named entity recognition system performs only one recognition pass, so part of the vocabulary in the recognition result may remain unrecognized or be recognized incorrectly, leaving the recall and precision of named entity recognition low; this is more serious when recognizing rare words, such as rare person names and place names, and semantically unclear words in some texts;
2. Many current named entity recognition systems handle long texts poorly; the present invention solves this by storing the relevant information of entities in a storage unit.
The core idea of the present invention is to save the context information and self-information of the recognized entities: because the information composed of an entity's context information and its self-information can be regarded as clause-level information, this information, over repeated recognition passes, helps the system recognize the entities in the text that have not yet been recognized. The specific method is illustrated in the following embodiments.
The embodiment of the present invention is shown in Fig. 1, which contains the overall idea of the invention. The labeling scheme used by the present invention is the BIOES scheme: "B" denotes the first character of a multi-character entity, "I" a middle character of a multi-character entity, "O" a non-entity word, "E" the last character of a multi-character entity, and "S" a single word that is an entity by itself; "PER" denotes a person name, "ORG" an organization name and "LOC" a place name.
As shown in Fig. 1, the present embodiment provides a multi-level named entity recognition method comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtain the vector representation of the text;
S3: encode the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model, and label the entities in the text feature vector sequence;
S5: take the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute the attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model, and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
In step S1, the data text is preprocessed. Specifically: taking the full stop as the delimiter, a long text is first split into sentences, and the processed sentences are stored one per line; all sentences are then segmented with the open-source tool jieba, dividing each sentence into words; finally, duplicate words are removed to obtain the vocabulary, denoted C.
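As an illustration of step S1, the following is a minimal preprocessing sketch (the helper name preprocess is an assumption of this sketch; jieba is the segmenter named by this embodiment):

import jieba

def preprocess(long_text):
    # Split the long text into sentences at the Chinese full stop.
    sentences = [s for s in long_text.split("。") if s]
    # Segment every sentence into words with jieba.
    segmented = [list(jieba.cut(s)) for s in sentences]
    # Remove duplicate words to build the vocabulary C.
    vocab_C = sorted({w for sent in segmented for w in sent})
    return segmented, vocab_C

segmented, vocab_C = preprocess("汤姆遇到杰瑞时李遇到蔡。")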
In step S2, the pre-trained word vectors are used, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text. Specifically, the Chinese Wikipedia corpus is first segmented with jieba, and the segmented corpus is pre-trained with the open-source tool word2vec, which yields the word vectors; let the word vector dimension be d. The input text is first expressed in one-hot form as the one-hot feature sequence; then, through the vocabulary C and the pre-trained word vectors, the word vector representation of each phrase is obtained. By performing word embedding on the segmented sentences, the vector representation X of the text is obtained, where X = {x_1, x_2, x_3, …, x_n}, X ∈ R^{d×n}, and n is the number of words after sentence segmentation.
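A sketch of step S2 under stated assumptions: gensim's Word2Vec (gensim 4.x API) stands in for the word2vec open-source tool, `segmented` comes from the preprocessing sketch above, and unseen words fall back to zero vectors (a choice not specified in this embodiment):

from gensim.models import Word2Vec
import numpy as np

d = 100                                    # term-vector dimension d
w2v = Word2Vec(sentences=segmented, vector_size=d, min_count=1)

def embed(words):
    # Look up each word's pre-trained vector; zeros for out-of-vocabulary words.
    return np.stack([w2v.wv[w] if w in w2v.wv else np.zeros(d) for w in words])

X = embed(segmented[0])                    # X has shape (n, d)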
In step S3, the vector representation X of the text, obtained through the processing of the first two steps S1 and S2, is input to a bidirectional LSTM model as the input sequence. The LSTM extracts text feature information, and because the feature information of a word over a time span sometimes includes not only the influence of the preceding words but also the influence of the subsequent words, a bidirectional LSTM can fully extract the text feature vector sequence from both directions. The bidirectional LSTM encodes the vector representation X of the text into the text feature vector sequence H, where H = {h_1, h_2, h_3, …, h_t}, each h_t is the hidden-layer output of the bidirectional LSTM at that time step, and h_t is the concatenation of the forward LSTM hidden output h→_t and the backward LSTM hidden output h←_t.
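A PyTorch sketch of the step-S3 encoder (the dimensions are assumptions of this sketch): a bidirectional LSTM whose output at each time step is the concatenation of the forward and backward hidden states, matching the h_t defined above:

import torch
import torch.nn as nn

d, hidden = 100, 128
bilstm = nn.LSTM(input_size=d, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 8, d)   # one sentence of n = 8 word vectors
H, _ = bilstm(x)           # H: (1, 8, 2*hidden), the sequence {h_1, ..., h_t}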
The LSTM model of this step is shown in Fig. 6; its formulas are described as follows.
First, the LSTM decides which information to forget and which to retain; f_t denotes the output of the forget gate:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
Then the LSTM decides which information to update; i_t denotes the output of the memory gate, and C̃_t denotes the temporary cell state of the LSTM:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
Next the cell state of the LSTM is updated; C_t and C_{t-1} denote the current cell state and the cell state of the previous time step:
C_t = f_t*C_{t-1} + i_t*C̃_t
Finally the hidden-layer information is output; o_t denotes the value of the output gate:
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t*tanh(C_t)
where x_t is the input at the current time step, h_{t-1} is the hidden-layer output of the previous time step, h_t is the hidden-layer output of the current time step, W is the weight matrix of each function, b is the corresponding bias, σ is the sigmoid function, and tanh is the hyperbolic tangent.
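The gate equations above can be transcribed literally as a single time step (a sketch: the weight and bias dictionaries are stand-ins for trained parameters, and [h_{t-1}, x_t] is a vector concatenation):

import torch

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = torch.cat([h_prev, x_t])                # [h_{t-1}, x_t]
    f = torch.sigmoid(W["f"] @ z + b["f"])      # forget gate f_t
    i = torch.sigmoid(W["i"] @ z + b["i"])      # memory gate i_t
    C_tilde = torch.tanh(W["C"] @ z + b["C"])   # temporary cell state
    C = f * C_prev + i * C_tilde                # updated cell state C_t
    o = torch.sigmoid(W["o"] @ z + b["o"])      # output gate o_t
    h = o * torch.tanh(C)                       # hidden output h_t
    return h, C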
In step S4, the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} is input into the CRF model shown in Fig. 5, and the predicted label sequence L = {l_1, l_2, l_3, …, l_n} is computed by the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
The predicted label sequence L of this step is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x; its value is also 0 or 1 and it is called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.
Exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence.
Solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
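A compact Viterbi decoder consistent with the decoding formula above (a sketch under assumed score matrices: emit[t, y] plays the role of the state features and trans[y', y] of the transition features; it returns the index sequence of the most probable labels):

import numpy as np

def viterbi(emit, trans):
    T, K = emit.shape                            # T time steps, K labels
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = emit[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + trans + emit[t][None, :]
        back[t] = cand.argmax(axis=0)            # best previous label
        dp[t] = cand.max(axis=0)                 # best path score so far
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]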
In step S5: step S4 has recognized part of the entities, but rare words or semantically unclear words may still remain unrecognized in the text; therefore the recognized entity information is stored, and the entity self-information and its context information are used to help the system recognize these rare words.
Step S5 operates as follows:
according to syntactic rules, a noun is usually preceded or followed by a verb or a preposition, and the clause composed of the verb or preposition and the noun is rich in text feature information. Because the hidden-layer outputs of the LSTM contain the text information of each time step, for each entity e_i (i = 1, 2, 3, …, m) in E, combining the forward and backward LSTM hidden outputs, the information of the time step preceding e_i is stored as v_1, the self-information of e_i obtained from the forward LSTM as v_2, the self-information of e_i obtained from the backward LSTM as v_3, and the information of the time step following e_i as v_4; each entity is then expressed in the form V' = [v_1, v_2, v_3, v_4] and deposited into the storage unit. As shown in Fig. 2, v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity. The entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and this entity sequence is taken as the candidate sequence V.
Step S6: the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each candidate record on the text feature vector sequence H to obtain the attention vector sequence S.
The reasoning unit of this step is shown schematically in Fig. 3: using the attention mechanism, the degree of attention each entity in the storage unit pays to each feature vector is computed, yielding an attention vector.
For the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i (i = 1, 2, 3, …, t) with every V'_j (j = 1, 2, 3, …, m') is computed to obtain the attention scores σ.
At any time step t, the attention scores of V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
The attention scores are then converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
Finally, a weighted sum over V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
After computing the result for all time steps, the attention vector sequence of V over H is obtained: S = {s_1, s_2, s_3, …, s_t}.
In step S7, the text feature vector sequence H and the attention vector sequence S are spliced, in the same way as the hidden layers of the bidirectional LSTM are spliced; the spliced vector is expressed as [H : S]. After the spliced vector is input into the CRF model, a new labeling result is obtained; the CRF model here shares its parameters with the CRF model of step S4.
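A NumPy sketch covering the reasoning unit of step S6 and the splice of step S7 (an assumption here is that every record V'_j is flattened or projected to the same width as h_t so that the dot product is defined):

import numpy as np

def reasoning_unit(H, V):
    # H: (t, k) feature sequence; V: (m', k) candidate records.
    scores = H @ V.T                          # sigma[t, j] = h_t . V'_j
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)  # row-wise softmax -> alpha
    return alpha @ V                          # s_t = sum_j alpha[t, j] * V'_j

H = np.random.randn(8, 4)                     # toy feature sequence
V = np.random.randn(3, 4)                     # three stored candidate records
S = reasoning_unit(H, V)
HS = np.concatenate([H, S], axis=1)           # the step-S7 splice [H : S]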
In an embodiment, the multi-level named entity recognition method further includes: for the new entities obtained from the new labeling of step S7, express them with the method of step S5 in the form V'' = [v_1, v_2, v_3, v_4], in preparation for storing V'' into the storage unit. If new entity information were stored into the storage unit without screening, the amount of data in the storage unit could later become excessive and waste storage space; therefore, before new entity information is stored into the storage unit, it is first stored into the candidate unit, whose structure is shown in Fig. 2. The similarity between these new words in the candidate unit and the words in the storage unit is computed, a threshold β is set, and a new entity is deposited into the storage unit only when the similarity is below the threshold.
The similarity between words in this step is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
where A and B are the vector representations of the two words.
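A sketch of the candidate-unit screening (the function names are assumptions of this sketch; β is the free threshold of this embodiment):

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def admit_to_storage(new_vec, stored_vecs, beta):
    # A new entity enters the storage unit only if it is not too similar
    # to any record already stored there.
    return all(cosine_similarity(new_vec, v) < beta for v in stored_vecs)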
In an embodiment, the multi-level named entity recognition method further includes step S9: repeating the above steps S5 to S8; when step S7 produces no new entities, the named entity recognition process ends.
The above named entity recognition process is illustrated below with an example, as shown in Fig. 4:
Step 1: the input sentence is "When Tom encounters Jie Rui, Lee encounters Cai". The sentence is first preprocessed and segmented with the jieba Chinese word segmentation tool; the result is "When/Tom/encounters/Jie Rui/when/Lee/encounters/Cai", where "/" denotes the word boundary, and the pre-trained word vectors are used to obtain the word vector representation of the sentence.
Step 2: the word vector representation obtained in step 1 is input into the bidirectional LSTM, which serves as the encoder for the word vectors; the hidden-layer output of the LSTM is the encoded text feature vector sequence we need.
Step 3: the text feature vectors of step 2 are input into the next-layer CRF model, which serves as the decoder; the text feature vectors are decoded and labeled at the same time, giving the following labeling result:
Word:  When | Tom   | encounters | Jie Rui | when | Lee | encounters | Cai
Label: O    | S-PER | O          | S-PER   | O    | O   | O          | O
From the above result it can be seen that the entity names "Tom (S-PER)" and "Jie Rui (S-PER)" are labeled correctly and recognized, while the other two words, "Lee" and "Cai", are not labeled, because these two words are comparatively rare.
Observing the example sentence "When Tom encounters Jie Rui, Lee encounters Cai", it can be seen that although the two latter words "Lee" and "Cai" are rather rare, their context information closely matches that of the recognized entities "Tom" and "Jie Rui": the word "encounters" appears both in "Tom encounters Jie Rui" and in "Lee encounters Cai". That is, "Tom encounters Jie Rui" and "Lee encounters Cai" are similar clauses, and their entities have very similar context semantics; therefore, by storing the context information of "Tom" and "Jie Rui", their information can be used to help the system recognize the two entities "Lee" and "Cai".
Step 4: for the entities "Tom" and "Jie Rui" recognized in step 3, the preceding-context entity information, the forward entity information, the backward entity information and the following-context entity information are formatted as a candidate sequence, and the candidate sequence is stored into the storage unit.
Step 5: the text feature vector sequence obtained in step 2 and the candidate sequence stored in the storage unit are input into the reasoning unit, which computes the degree of influence of the self-information and context information of the entities "Tom" and "Jie Rui" on each word of the original sentence, giving a group of attention vectors.
Step 6: the attention vectors of step 5 and the text feature vector sequence are input into the CRF model for decoding; with these reference vectors, the CRF model labels the two entities "Lee" and "Cai".
Step 7: "Lee" and "Cai" are stored into the candidate unit, and the threshold is set to β; computing their similarity to each word in the storage unit shows that the similarities of "Lee" to "Tom" and of "Cai" to "Jie Rui" are greater than the set threshold β, so the two words are not stored into the storage unit.
Step 8: after repeating the above steps 4 to 7, no new entities are produced, so the named entity recognition ends; the entities finally labeled are:
Word:  When | Tom   | encounters | Jie Rui | when | Lee   | encounters | Cai
Label: O    | S-PER | O          | S-PER   | O    | S-PER | O          | S-PER
Once the entities recognized by the above process are stored into the storage unit, whenever information similar to the entities in the storage unit appears later in the text, that is, whenever text containing clauses similar to those in the storage unit appears, those entities can be recognized; this addresses the low precision and recall on long texts.
In the present system, optionally, to reduce the computational cost, a preset number of recognition iterations can be used instead of "no new entities produced" as the termination condition of the named entity system; this reduces the computational cost but causes a decline in recall and precision, and in practice the trade-off needs to be weighed.
The above embodiments only illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-level named entity recognition method, characterized in that the multi-level named entity recognition method comprises the following steps:
S1: preprocessing the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtaining the vector representation of the text;
S3: encoding the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decoding the text feature vector sequence with a CRF model, and labeling the entities in the text feature vector sequence;
S5: taking the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence of the subsequent recognition pass;
S6: inputting the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and computing the attention vector;
S7: inputting the attention vector and the text feature vector sequence into the CRF model, and labeling the entities in the sequence;
S8: repeating steps S5 to S7 until step S7 produces no new entities.
2. The multi-level named entity recognition method according to claim 1, characterized in that preprocessing the data text and obtaining the one-hot feature sequence of the text specifically comprises:
first, taking the full stop as the delimiter, splitting a long text into sentences;
segmenting all sentences into words;
then removing duplicate words to build the vocabulary C.
3. The multi-level named entity recognition method according to claim 1, characterized in that using the pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text specifically comprises:
pre-training on a corpus to obtain the word vectors;
expressing the input text in one-hot form as the one-hot feature sequence;
obtaining the word vector representation of each phrase through the vocabulary C, the one-hot feature sequence and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
4. The multi-level named entity recognition method according to claim 3, characterized in that the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
5. The multi-level named entity recognition method according to claim 4, characterized in that decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically comprises:
inputting the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} into the CRF model and computing the predicted label sequence L = {l_1, l_2, l_3, …, l_n} with the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
6. The multi-level named entity recognition method according to claim 5, characterized in that the predicted label sequence is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i, with value 0 or 1, called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x, with value 0 or 1, called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence;
solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
7. The multi-level named entity recognition method according to claim 6, characterized in that for each entity e_i, i = 1, 2, 3, …, m, in the entity set E, combining the forward LSTM hidden outputs and the backward LSTM hidden outputs, each entity is expressed in the form V' = [v_1, v_2, v_3, v_4], where v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity; the entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and the entity sequence in the storage unit is taken as the candidate sequence.
8. The multi-level named entity recognition method according to claim 7, characterized in that the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each entity record in V on the text feature vector sequence H to obtain the attention vector sequence S;
for the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i with every V'_j is computed to obtain the attention scores σ, where i = 1, 2, 3, …, t and j = 1, 2, 3, …, m';
at any time step t, the attention scores of the candidate sequence V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
a weighted sum over the candidate sequence V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
after computing the result for all time steps, the attention vector sequence S = {s_1, s_2, s_3, …, s_t} of the candidate sequence V over the text feature vector sequence H is obtained.
9. The multi-level named entity recognition method according to claim 8, characterized in that the method further comprises: computing the similarity between the entities of step S7 and the entities of step S5, and treating an entity as a new entity when the similarity is below the similarity threshold.
10. The multi-level named entity recognition method according to claim 9, characterized in that the similarity is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
CN201910207179.0A 2019-03-19 2019-03-19 Multilevel named entity recognition method Active CN110008469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Publications (2)

Publication Number Publication Date
CN110008469A true CN110008469A (en) 2019-07-12
CN110008469B CN110008469B (en) 2022-06-07

Family

ID=67167300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207179.0A Active CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Country Status (1)

Country Link
CN (1) CN110008469B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170109355A1 (en) * 2015-10-16 2017-04-20 Baidu Usa Llc Systems and methods for human inspired simple question answering (hisqa)
WO2017165038A1 (en) * 2016-03-21 2017-09-28 Amazon Technologies, Inc. Speaker verification method and system
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on LSTM-CNN
CN108536754A (en) * 2018-03-14 2018-09-14 四川大学 Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN109062893A (en) * 2018-07-13 2018-12-21 华南理工大学 A kind of product name recognition methods based on full text attention mechanism
CN109062901A (en) * 2018-08-14 2018-12-21 第四范式(北京)技术有限公司 Neural network training method and device and name entity recognition method and device
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUL KHAN SAFI QAMAS et al.: "Research on Named Entity Recognition Methods Based on Deep Neural Networks" (基于深度神经网络的命名实体识别方法研究), Netinfo Security (信息网络安全) *
HUI-KANG YI et al.: "A Chinese Named Entity Recognition System with Neural Networks", ITM Web of Conferences *
XIAOCHENG FENG et al.: "Multi-Level Cross-Lingual Attentive Neural Architecture for Low Resource Name Tagging", Tsinghua Science and Technology *
姜宇新: "Research on Biomedical Named Entity Recognition Based on Deep Learning" (基于深度学习的生物医学命名实体识别研究), China Master's Theses Full-text Database, Information Science and Technology (中国优秀博硕士学位论文全文数据库(硕士)信息科技辑) *
常量: "Convolutional Neural Networks in Image Understanding" (图像理解中的卷积神经网络), Acta Automatica Sinica (自动化学报) *
张璞: "Opinion Target Extraction for Chinese Microblogs Based on Deep Learning" (基于深度学习的中文微博评价对象抽取方法), Computer Engineering and Design (计算机工程与设计) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241B (en) * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address resolution method and device, readable storage medium and electronic equipment
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111274815B (en) * 2020-01-15 2024-04-12 北京百度网讯科技有限公司 Method and device for mining entity focus point in text
US11775761B2 (en) 2020-01-15 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for mining entity focus in text
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111581957B (en) * 2020-05-06 2022-04-12 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN111832293B (en) * 2020-06-24 2023-05-26 四川大学 Entity and relation joint extraction method based on head entity prediction
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and relation combined extraction method based on head entity prediction
CN111858817B (en) * 2020-07-23 2021-05-18 中国石油大学(华东) BilSTM-CRF path inference method for sparse track
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) BilSTM-CRF path inference method for sparse track
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112185572B (en) * 2020-09-25 2024-03-01 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic equipment and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110008469B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN110008469A Multi-level named entity recognition method
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN107291693B (en) Semantic calculation method for improved word vector model
CN106776581B (en) Subjective text emotion analysis method based on deep learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN110263325B (en) Chinese word segmentation system
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN110232192A (en) Electric power term names entity recognition method and device
CN108932226A Method for adding punctuation marks to unpunctuated text
CN113255320A Entity relation extraction method and device based on syntax tree and graph attention mechanism
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN110188175A Question-answer pair extraction method, system and storage medium based on a BiLSTM-CRF model
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN112464669B (en) Stock entity word disambiguation method, computer device, and storage medium
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN113312918B Legal named entity recognition method fusing radical vectors with word segmentation and a capsule network
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN111428501A (en) Named entity recognition method, recognition system and computer readable storage medium
CN114417874B (en) Chinese named entity recognition method and system based on graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant