
CN110008469A - Multi-level named entity recognition method - Google Patents

Multi-level named entity recognition method

Info

Publication number
CN110008469A
CN110008469A
Authority
CN
China
Prior art keywords
sequence
entity
text
vector
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910207179.0A
Other languages
Chinese (zh)
Other versions
CN110008469B (en)
Inventor
常亮
王文凯
宾辰忠
宣闻
秦赛歌
陈源鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guilin University of Electronic Technology
Original Assignee
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guilin University of Electronic Technology
Priority to CN201910207179.0A
Publication of CN110008469A
Application granted
Publication of CN110008469B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes a multi-level named entity recognition method, comprising: S1, preprocessing the data text to obtain a vocabulary C; S2, using pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text; S3, encoding the vector representation of the text to obtain the encoded text feature vector sequence; S4, decoding the text feature vector sequence with a CRF model and labeling the entities in the text feature vector sequence; S5, taking the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass; S6, inputting the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism and computing an attention vector; S7, inputting the attention vector and the text feature vector sequence into the CRF model and labeling the entities in the sequence.

Description

Multi-level named entity recognition method
Technical field
The present invention relates to the field of natural language processing, and in particular to a multi-level named entity recognition method.
Background art
Natural language processing, an intersection of computer science and artificial intelligence, keeps developing with the rapid progress of the artificial intelligence field. Named entity recognition (NER) is a basic task of natural language processing; its purpose is to identify the meaningful entities in a text and classify them. The entity types mainly include person names, organization names, places and some other proper nouns. With the massive data generated on the internet, the named entity recognition task has been receiving growing attention from academia and industry, and it is widely applied in other natural language processing tasks such as machine translation, intelligent question answering and information retrieval.
Current named entity recognition methods include traditional rule-based methods, traditional dictionary-based methods and traditional statistics-based methods; the most representative are the statistics-based hidden Markov model (HMM) and conditional random field model (CRF). With the rise of deep learning, many neural-network-based methods have also emerged, such as named entity recognition with long short-term memory networks (LSTM), and combining traditional methods with neural methods has achieved good results.
Rule-based and dictionary-based methods depend heavily on the construction of dictionaries and rules, so they are only suitable for small-scale, restricted-domain corpora; they struggle with large-scale corpora and are very limited when handling new words. Statistics-based methods rely on manual feature extraction, which consumes a great deal of manpower and time. Many current neural-network-based methods can remedy the shortcomings of traditional methods to some extent, but for rare words or semantically unclear words appearing in the text, the recall and precision of such methods still leave room for improvement.
Summary of the invention
In view of the above deficiencies of the prior art, the purpose of the present invention is to provide a multi-level named entity recognition method. By recognizing repeatedly with a reasoning unit and a storage unit, the present invention effectively alleviates the low accuracy of named entity recognition on rare words and semantically unclear words in practical applications, and improves the recall and precision on text information.
To achieve the above and other related purposes, the present invention provides a multi-level named entity recognition method, comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtain the vector representation of the text;
S3: encode the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model, and label the entities in the text feature vector sequence;
S5: take the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute the attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model, and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
Optionally, preprocessing the data text and obtaining the one-hot feature sequence of the text specifically includes:
first, taking the full stop as the delimiter, splitting a long text into sentences;
segmenting all sentences into words;
then removing duplicate words to build the vocabulary C.
Optionally, using the pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text specifically includes:
pre-training on a corpus to obtain the word vectors;
expressing the input text in one-hot form as the one-hot feature sequence;
obtaining the word vector representation of each phrase through the vocabulary C, the one-hot feature sequence and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
Optionally, the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
Optionally, decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically includes:
the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} is input into the CRF model, and the predicted label sequence L = {l_1, l_2, l_3, …, l_n} is computed by the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
Optionally, the predicted label sequence is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x; its value is also 0 or 1 and it is called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence;
solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
Optionally, for each entity e_i (i = 1, 2, 3, …, m) in the entity set E, combining the forward LSTM hidden outputs and the backward LSTM hidden outputs, each entity is expressed in the form V' = [v_1, v_2, v_3, v_4], where v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity; the entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and the entity sequence in the storage unit is taken as the candidate sequence.
Optionally, the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each entity record in V on the text feature vector sequence H to obtain the attention vector sequence S;
for the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i (i = 1, 2, 3, …, t) with every V'_j (j = 1, 2, 3, …, m') is computed to obtain the attention scores σ;
at any time step t, the attention scores of the candidate sequence V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
a weighted sum over the candidate sequence V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
after computing the result for all time steps, the attention vector sequence of the candidate sequence V over the text feature vector sequence H is obtained: S = {s_1, s_2, s_3, …, s_t}.
Optionally, the method further includes: computing the similarity between the entities of step S7 and the entities of step S5; when the similarity is below a similarity threshold, the entity is treated as a new entity.
Optionally, the similarity is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
As described above, the multi-level named entity recognition method of the present invention has the following beneficial effects:
1. Compared with methods that perform named entity recognition only once, this method recognizes repeatedly, and the repeated passes improve the recall of the named entity recognition task;
2. Existing methods are relatively suited to short texts, and their effectiveness declines on longer texts; the present invention designs a storage unit that stores important entity information together with its context information and handles long texts in this way, while a candidate unit is designed to reduce the space overhead of the storage unit;
3. A reasoning unit is designed, and entity recognition is performed in combination with it; for rare words and semantically unclear words in the text, the recognition effect is improved, raising the precision and recall of the system.
Brief description of the drawings
To further explain the described content, the specific embodiments of the present invention are explained in more detail below with reference to the accompanying drawings. It should be understood that these drawings serve only as typical examples and are not to be taken as limiting the scope of the present invention.
Fig. 1 is the flow chart of the named entity recognition system;
Fig. 2 shows the structure of the storage unit and the structure of the candidate unit;
Fig. 3 is the structural diagram of the reasoning unit;
Fig. 4 is the structural diagram of the system;
Fig. 5 is the structural diagram of the CRF;
Fig. 6 is the structural diagram of the LSTM cell.
Specific embodiment
The embodiments of the present invention are illustrated below by specific examples, and those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or applied through other different specific embodiments, and the details in this specification can be modified or changed in various ways based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that, in the absence of conflict, the following embodiments and the features in the embodiments can be combined with each other.
It should be noted that the illustrations provided in the following embodiments only schematically explain the basic idea of the present invention; the drawings show only the components related to the present invention rather than the component count, shapes and sizes of an actual implementation, in which the form, quantity and proportion of each component may change arbitrarily and the component layout may be more complex.
The present invention provides a multi-level named entity recognition method that recognizes repeatedly, gradually recognizing the entities in the text that went unrecognized during the previous recognition passes. The key problems solved by the invention include the following two points:
1. A traditional named entity recognition system performs only one recognition pass, so part of the vocabulary in the recognition result may remain unrecognized or be recognized incorrectly, leaving the recall and precision of named entity recognition low; this is more serious when recognizing rare words, such as rare person names and place names, and semantically unclear words in some texts;
2. Many current named entity recognition systems handle long texts poorly; the present invention solves this by storing the relevant information of entities in a storage unit.
The core idea of the present invention is to save the context information and self-information of the recognized entities: because the information composed of an entity's context information and its self-information can be regarded as clause-level information, this information, over repeated recognition passes, helps the system recognize the entities in the text that have not yet been recognized. The specific method is illustrated in the following embodiments.
The embodiment of the present invention is shown in Fig. 1, which contains the overall idea of the invention. The labeling scheme used by the present invention is the BIOES scheme: "B" denotes the first character of a multi-character entity, "I" a middle character of a multi-character entity, "O" a non-entity word, "E" the last character of a multi-character entity, and "S" a single word that is an entity by itself; "PER" denotes a person name, "ORG" an organization name and "LOC" a place name.
As shown in Fig. 1, the present embodiment provides a multi-level named entity recognition method comprising the following steps:
S1: preprocess the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtain the vector representation of the text;
S3: encode the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decode the text feature vector sequence with a CRF model, and label the entities in the text feature vector sequence;
S5: take the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence for the subsequent recognition pass;
S6: input the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and compute the attention vector;
S7: input the attention vector and the text feature vector sequence into the CRF model, and label the entities in the sequence;
S8: repeat steps S5 to S7 until step S7 produces no new entities.
In step S1, the data text is preprocessed. Specifically: taking the full stop as the delimiter, a long text is first split into sentences, and the processed sentences are stored one per line; all sentences are then segmented with the open-source tool jieba, dividing each sentence into words; finally, duplicate words are removed to obtain the vocabulary, denoted C.
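As an illustration of step S1, the following is a minimal preprocessing sketch (the helper name preprocess is an assumption of this sketch; jieba is the segmenter named by this embodiment):

import jieba

def preprocess(long_text):
    # Split the long text into sentences at the Chinese full stop.
    sentences = [s for s in long_text.split("。") if s]
    # Segment every sentence into words with jieba.
    segmented = [list(jieba.cut(s)) for s in sentences]
    # Remove duplicate words to build the vocabulary C.
    vocab_C = sorted({w for sent in segmented for w in sent})
    return segmented, vocab_C

segmented, vocab_C = preprocess("汤姆遇到杰瑞时李遇到蔡。")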
In step S2, the pre-trained word vectors are used, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text. Specifically, the Chinese Wikipedia corpus is first segmented with jieba, and the segmented corpus is pre-trained with the open-source tool word2vec, which yields the word vectors; let the word vector dimension be d. The input text is first expressed in one-hot form as the one-hot feature sequence; then, through the vocabulary C and the pre-trained word vectors, the word vector representation of each phrase is obtained. By performing word embedding on the segmented sentences, the vector representation X of the text is obtained, where X = {x_1, x_2, x_3, …, x_n}, X ∈ R^{d×n}, and n is the number of words after sentence segmentation.
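A sketch of step S2 under stated assumptions: gensim's Word2Vec (gensim 4.x API) stands in for the word2vec open-source tool, `segmented` comes from the preprocessing sketch above, and unseen words fall back to zero vectors (a choice not specified in this embodiment):

from gensim.models import Word2Vec
import numpy as np

d = 100                                    # term-vector dimension d
w2v = Word2Vec(sentences=segmented, vector_size=d, min_count=1)

def embed(words):
    # Look up each word's pre-trained vector; zeros for out-of-vocabulary words.
    return np.stack([w2v.wv[w] if w in w2v.wv else np.zeros(d) for w in words])

X = embed(segmented[0])                    # X has shape (n, d)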
In step S3, the vector representation X of the text, obtained through the processing of the first two steps S1 and S2, is input to a bidirectional LSTM model as the input sequence. The LSTM extracts text feature information, and because the feature information of a word over a time span sometimes includes not only the influence of the preceding words but also the influence of the subsequent words, a bidirectional LSTM can fully extract the text feature vector sequence from both directions. The bidirectional LSTM encodes the vector representation X of the text into the text feature vector sequence H, where H = {h_1, h_2, h_3, …, h_t}, each h_t is the hidden-layer output of the bidirectional LSTM at that time step, and h_t is the concatenation of the forward LSTM hidden output h→_t and the backward LSTM hidden output h←_t.
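A PyTorch sketch of the step-S3 encoder (the dimensions are assumptions of this sketch): a bidirectional LSTM whose output at each time step is the concatenation of the forward and backward hidden states, matching the h_t defined above:

import torch
import torch.nn as nn

d, hidden = 100, 128
bilstm = nn.LSTM(input_size=d, hidden_size=hidden,
                 batch_first=True, bidirectional=True)

x = torch.randn(1, 8, d)   # one sentence of n = 8 word vectors
H, _ = bilstm(x)           # H: (1, 8, 2*hidden), the sequence {h_1, ..., h_t}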
The LSTM model of this step is shown in Fig. 6; its formulas are described as follows.
First, the LSTM decides which information to forget and which to retain; f_t denotes the output of the forget gate:
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
Then the LSTM decides which information to update; i_t denotes the output of the memory gate, and C̃_t denotes the temporary cell state of the LSTM:
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C·[h_{t-1}, x_t] + b_C)
Next the cell state of the LSTM is updated; C_t and C_{t-1} denote the current cell state and the cell state of the previous time step:
C_t = f_t*C_{t-1} + i_t*C̃_t
Finally the hidden-layer information is output; o_t denotes the value of the output gate:
o_t = σ(W_o·[h_{t-1}, x_t] + b_o)
h_t = o_t*tanh(C_t)
where x_t is the input at the current time step, h_{t-1} is the hidden-layer output of the previous time step, h_t is the hidden-layer output of the current time step, W is the weight matrix of each function, b is the corresponding bias, σ is the sigmoid function, and tanh is the hyperbolic tangent.
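The gate equations above can be transcribed literally as a single time step (a sketch: the weight and bias dictionaries are stand-ins for trained parameters, and [h_{t-1}, x_t] is a vector concatenation):

import torch

def lstm_step(x_t, h_prev, C_prev, W, b):
    z = torch.cat([h_prev, x_t])                # [h_{t-1}, x_t]
    f = torch.sigmoid(W["f"] @ z + b["f"])      # forget gate f_t
    i = torch.sigmoid(W["i"] @ z + b["i"])      # memory gate i_t
    C_tilde = torch.tanh(W["C"] @ z + b["C"])   # temporary cell state
    C = f * C_prev + i * C_tilde                # updated cell state C_t
    o = torch.sigmoid(W["o"] @ z + b["o"])      # output gate o_t
    h = o * torch.tanh(C)                       # hidden output h_t
    return h, C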
In step S4, the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} is input into the CRF model shown in Fig. 5, and the predicted label sequence L = {l_1, l_2, l_3, …, l_n} is computed by the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
The predicted label sequence L of this step is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i; its value is 0 or 1 and it is called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x; its value is also 0 or 1 and it is called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions.
Exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence.
Solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
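A compact Viterbi decoder consistent with the decoding formula above (a sketch under assumed score matrices: emit[t, y] plays the role of the state features and trans[y', y] of the transition features; it returns the index sequence of the most probable labels):

import numpy as np

def viterbi(emit, trans):
    T, K = emit.shape                            # T time steps, K labels
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = emit[0]
    for t in range(1, T):
        cand = dp[t - 1][:, None] + trans + emit[t][None, :]
        back[t] = cand.argmax(axis=0)            # best previous label
        dp[t] = cand.max(axis=0)                 # best path score so far
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):                # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]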
In step S5: step S4 has recognized part of the entities, but rare words or semantically unclear words may still remain unrecognized in the text; therefore the recognized entity information is stored, and the entity self-information and its context information are used to help the system recognize these rare words.
Step S5 operates as follows:
according to syntactic rules, a noun is usually preceded or followed by a verb or a preposition, and the clause composed of the verb or preposition and the noun is rich in text feature information. Because the hidden-layer outputs of the LSTM contain the text information of each time step, for each entity e_i (i = 1, 2, 3, …, m) in E, combining the forward and backward LSTM hidden outputs, the information of the time step preceding e_i is stored as v_1, the self-information of e_i obtained from the forward LSTM as v_2, the self-information of e_i obtained from the backward LSTM as v_3, and the information of the time step following e_i as v_4; each entity is then expressed in the form V' = [v_1, v_2, v_3, v_4] and deposited into the storage unit. As shown in Fig. 2, v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity. The entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and this entity sequence is taken as the candidate sequence V.
Step S6: the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each candidate record on the text feature vector sequence H to obtain the attention vector sequence S.
The reasoning unit of this step is shown schematically in Fig. 3: using the attention mechanism, the degree of attention each entity in the storage unit pays to each feature vector is computed, yielding an attention vector.
For the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i (i = 1, 2, 3, …, t) with every V'_j (j = 1, 2, 3, …, m') is computed to obtain the attention scores σ.
At any time step t, the attention scores of V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
The attention scores are then converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
Finally, a weighted sum over V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
After computing the result for all time steps, the attention vector sequence of V over H is obtained: S = {s_1, s_2, s_3, …, s_t}.
In step S7, the text feature vector sequence H and the attention vector sequence S are spliced, in the same way as the hidden layers of the bidirectional LSTM are spliced; the spliced vector is expressed as [H : S]. After the spliced vector is input into the CRF model, a new labeling result is obtained; the CRF model here shares its parameters with the CRF model of step S4.
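A NumPy sketch covering the reasoning unit of step S6 and the splice of step S7 (an assumption here is that every record V'_j is flattened or projected to the same width as h_t so that the dot product is defined):

import numpy as np

def reasoning_unit(H, V):
    # H: (t, k) feature sequence; V: (m', k) candidate records.
    scores = H @ V.T                          # sigma[t, j] = h_t . V'_j
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha = e / e.sum(axis=1, keepdims=True)  # row-wise softmax -> alpha
    return alpha @ V                          # s_t = sum_j alpha[t, j] * V'_j

H = np.random.randn(8, 4)                     # toy feature sequence
V = np.random.randn(3, 4)                     # three stored candidate records
S = reasoning_unit(H, V)
HS = np.concatenate([H, S], axis=1)           # the step-S7 splice [H : S]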
In an embodiment, the multi-level named entity recognition method further includes: for the new entities obtained from the new labeling of step S7, express them with the method of step S5 in the form V'' = [v_1, v_2, v_3, v_4], in preparation for storing V'' into the storage unit. If new entity information were stored into the storage unit without screening, the amount of data in the storage unit could later become excessive and waste storage space; therefore, before new entity information is stored into the storage unit, it is first stored into the candidate unit, whose structure is shown in Fig. 2. The similarity between these new words in the candidate unit and the words in the storage unit is computed, a threshold β is set, and a new entity is deposited into the storage unit only when the similarity is below the threshold.
The similarity between words in this step is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
where A and B are the vector representations of the two words.
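A sketch of the candidate-unit screening (the function names are assumptions of this sketch; β is the free threshold of this embodiment):

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def admit_to_storage(new_vec, stored_vecs, beta):
    # A new entity enters the storage unit only if it is not too similar
    # to any record already stored there.
    return all(cosine_similarity(new_vec, v) < beta for v in stored_vecs)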
In an embodiment, the multi-level named entity recognition method further includes step S9: repeating the above steps S5 to S8; when step S7 produces no new entities, the named entity recognition process ends.
The above named entity recognition process is illustrated below with an example, as shown in Fig. 4:
Step 1: the input sentence is "When Tom encounters Jie Rui, Lee encounters Cai". The sentence is first preprocessed and segmented with the jieba Chinese word segmentation tool; the result is "When/Tom/encounters/Jie Rui/when/Lee/encounters/Cai", where "/" denotes the word boundary, and the pre-trained word vectors are used to obtain the word vector representation of the sentence.
Step 2: the word vector representation obtained in step 1 is input into the bidirectional LSTM, which serves as the encoder for the word vectors; the hidden-layer output of the LSTM is the encoded text feature vector sequence we need.
Step 3: the text feature vectors of step 2 are input into the next-layer CRF model, which serves as the decoder; the text feature vectors are decoded and labeled at the same time, giving the following labeling result:
Word:  When | Tom   | encounters | Jie Rui | when | Lee | encounters | Cai
Label: O    | S-PER | O          | S-PER   | O    | O   | O          | O
From the above result it can be seen that the entity names "Tom (S-PER)" and "Jie Rui (S-PER)" are labeled correctly and recognized, while the other two words, "Lee" and "Cai", are not labeled, because these two words are comparatively rare.
Observing the example sentence "When Tom encounters Jie Rui, Lee encounters Cai", it can be seen that although the two latter words "Lee" and "Cai" are rather rare, their context information closely matches that of the recognized entities "Tom" and "Jie Rui": the word "encounters" appears both in "Tom encounters Jie Rui" and in "Lee encounters Cai". That is, "Tom encounters Jie Rui" and "Lee encounters Cai" are similar clauses, and their entities have very similar context semantics; therefore, by storing the context information of "Tom" and "Jie Rui", their information can be used to help the system recognize the two entities "Lee" and "Cai".
Step 4: for the entities "Tom" and "Jie Rui" recognized in step 3, the preceding-context entity information, the forward entity information, the backward entity information and the following-context entity information are formatted as a candidate sequence, and the candidate sequence is stored into the storage unit.
Step 5: the text feature vector sequence obtained in step 2 and the candidate sequence stored in the storage unit are input into the reasoning unit, which computes the degree of influence of the self-information and context information of the entities "Tom" and "Jie Rui" on each word of the original sentence, giving a group of attention vectors.
Step 6: the attention vectors of step 5 and the text feature vector sequence are input into the CRF model for decoding; with these reference vectors, the CRF model labels the two entities "Lee" and "Cai".
Step 7: "Lee" and "Cai" are stored into the candidate unit, and the threshold is set to β; computing their similarity to each word in the storage unit shows that the similarities of "Lee" to "Tom" and of "Cai" to "Jie Rui" are greater than the set threshold β, so the two words are not stored into the storage unit.
Step 8: after repeating the above steps 4 to 7, no new entities are produced, so the named entity recognition ends; the entities finally labeled are:
Word:  When | Tom   | encounters | Jie Rui | when | Lee   | encounters | Cai
Label: O    | S-PER | O          | S-PER   | O    | S-PER | O          | S-PER
Once the entities recognized by the above process are stored into the storage unit, whenever information similar to the entities in the storage unit appears later in the text, that is, whenever text containing clauses similar to those in the storage unit appears, those entities can be recognized; this addresses the low precision and recall on long texts.
In the present system, optionally, to reduce the computational cost, a preset number of recognition iterations can be used instead of "no new entities produced" as the termination condition of the named entity system; this reduces the computational cost but causes a decline in recall and precision, and in practice the trade-off needs to be weighed.
The above embodiments only illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (10)

1. A multi-level named entity recognition method, characterized in that the multi-level named entity recognition method comprises the following steps:
S1: preprocessing the data text to obtain a vocabulary C;
S2: using the pre-trained word vectors, combined with the one-hot feature sequence of the text, obtaining the vector representation of the text;
S3: encoding the vector representation of the text to obtain the encoded text feature vector sequence;
S4: decoding the text feature vector sequence with a CRF model, and labeling the entities in the text feature vector sequence;
S5: taking the preceding-context information, the following-context information and the self-information of the labeled entities as the candidate sequence of the subsequent recognition pass;
S6: inputting the text feature vector sequence and the candidate sequence into a reasoning unit based on the attention mechanism, and computing the attention vector;
S7: inputting the attention vector and the text feature vector sequence into the CRF model, and labeling the entities in the sequence;
S8: repeating steps S5 to S7 until step S7 produces no new entities.
2. The multi-level named entity recognition method according to claim 1, characterized in that preprocessing the data text and obtaining the one-hot feature sequence of the text specifically comprises:
first, taking the full stop as the delimiter, splitting a long text into sentences;
segmenting all sentences into words;
then removing duplicate words to build the vocabulary C.
3. The multi-level named entity recognition method according to claim 1, characterized in that using the pre-trained word vectors, combined with the one-hot feature sequence of the text, to obtain the vector representation of the text specifically comprises:
pre-training on a corpus to obtain the word vectors;
expressing the input text in one-hot form as the one-hot feature sequence;
obtaining the word vector representation of each phrase through the vocabulary C, the one-hot feature sequence and the pre-trained word vectors;
performing word embedding on the segmented sentences to obtain the vector representation X of the text.
4. The multi-level named entity recognition method according to claim 3, characterized in that the vector representation of the text is encoded with a BiLSTM to obtain the encoded text feature vector sequence.
5. The multi-level named entity recognition method according to claim 4, characterized in that decoding the text feature vector sequence with the CRF model and labeling the entities in the text feature vector sequence specifically comprises:
inputting the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} into the CRF model and computing the predicted label sequence L = {l_1, l_2, l_3, …, l_n} with the CRF model, where l denotes the label of each word; every span labeled "BIE" or "S" in the labeling result is a labeled entity e, and the entity set is expressed as E = (e_1, e_2, e_3, …, e_m), where m denotes the number of entities.
6. The multi-level named entity recognition method according to claim 5, characterized in that the predicted label sequence is computed as follows:
let the label sequence be Y = {y_1, y_2, y_3, …, y_n}, where Y ranges over all possible label sequences of the BIOES labeling scheme; the score of labeling x with the label sequence y is:
score(y, x) = Σ_i Σ_j λ_j·t_j(y_{i-1}, y_i, x, i) + Σ_i Σ_k μ_k·s_k(y_i, x, i)
where t_j(y_{i-1}, y_i, x, i) is a feature function of the CRF model describing, given x, the transition from the previous label node y_{i-1} to the current label node y_i, with value 0 or 1, called a transition feature; s_k(y_i, x, i) indicates whether the current label node y_i is marked on x, with value 0 or 1, called a state feature; λ_j and μ_k are the weights of t_j and s_k respectively, and j, k index the feature functions;
exponentiating and normalizing score(y, x) yields the conditional probability p(y|x) of labeling x with y:
p(y|x) = exp(score(y, x)) / Z(x)
where Z(x) is the normalizing factor:
Z(x) = Σ_{y'} exp(score(y', x))
with y' = (y'_1, y'_2, y'_3, …, y'_n) denoting a possible label sequence;
solving with the Viterbi algorithm, the model takes the y' of maximum probability as the labeling result, denoted l, i.e. l = argmax_{y'} p(y'|x); the predicted label sequence is then L = {l_1, l_2, l_3, …, l_n}.
7. The multi-level named entity recognition method according to claim 6, characterized in that for each entity e_i, i = 1, 2, 3, …, m, in the entity set E, combining the forward LSTM hidden outputs and the backward LSTM hidden outputs, each entity is expressed in the form V' = [v_1, v_2, v_3, v_4], where v_1 denotes the preceding-context information of the entity, v_2 the entity self-information obtained from the forward LSTM, v_3 the entity self-information obtained from the backward LSTM, and v_4 the following-context information of the entity; the entity sequence in the storage unit is then expressed as V = {V'_1, V'_2, V'_3, …, V'_{m'}}, where m' denotes the number of entities already deposited into the storage unit, and the entity sequence in the storage unit is taken as the candidate sequence.
8. The multi-level named entity recognition method according to claim 7, characterized in that the candidate sequence V in the storage unit and the text feature vector sequence H are input into the reasoning unit, which computes the degree of influence of each entity record in V on the text feature vector sequence H to obtain the attention vector sequence S;
for the text feature vector sequence H = {h_1, h_2, h_3, …, h_t} and the candidate sequence V = {V'_1, V'_2, V'_3, …, V'_{m'}}, the dot product of h_i with every V'_j is computed to obtain the attention scores σ, where i = 1, 2, 3, …, t and j = 1, 2, 3, …, m';
at any time step t, the attention scores of the candidate sequence V with respect to h_t are computed as:
σ_t = (h_t·V'_1, h_t·V'_2, …, h_t·V'_{m'})
the attention scores are converted into a probability distribution serving as the weights α of the subsequent weighted sum:
α_t = softmax(σ_t)
a weighted sum over the candidate sequence V gives, at any time step t, the attention vector s of V with respect to h_t:
s_t = Σ_{j=1}^{m'} α_{t,j}·V'_j
after computing the result for all time steps, the attention vector sequence S = {s_1, s_2, s_3, …, s_t} of the candidate sequence V over the text feature vector sequence H is obtained.
9. The multi-level named entity recognition method according to claim 8, characterized in that the method further comprises: computing the similarity between the entities of step S7 and the entities of step S5, and treating an entity as a new entity when the similarity is below the similarity threshold.
10. The multi-level named entity recognition method according to claim 9, characterized in that the similarity is computed with cosine similarity:
sim(A, B) = (A·B) / (‖A‖·‖B‖)
CN201910207179.0A 2019-03-19 2019-03-19 Multilevel named entity recognition method Active CN110008469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910207179.0A CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Publications (2)

Publication Number Publication Date
CN110008469A true CN110008469A (en) 2019-07-12
CN110008469B CN110008469B (en) 2022-06-07

Family

ID=67167300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910207179.0A Active CN110008469B (en) 2019-03-19 2019-03-19 Multilevel named entity recognition method

Country Status (1)

Country Link
CN (1) CN110008469B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170109355A1 (en) * 2015-10-16 2017-04-20 Baidu Usa Llc Systems and methods for human inspired simple question answering (hisqa)
WO2017165038A1 (en) * 2016-03-21 2017-09-28 Amazon Technologies, Inc. Speaker verification method and system
CN107871158A (en) * 2016-09-26 2018-04-03 清华大学 A kind of knowledge mapping of binding sequence text message represents learning method and device
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN106980608A (en) * 2017-03-16 2017-07-25 四川大学 A kind of Chinese electronic health record participle and name entity recognition method and system
CN107977353A (en) * 2017-10-12 2018-05-01 北京知道未来信息技术有限公司 A kind of mixing language material name entity recognition method based on LSTM-CNN
CN108536754A (en) * 2018-03-14 2018-09-14 四川大学 Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN108536679A (en) * 2018-04-13 2018-09-14 腾讯科技(成都)有限公司 Name entity recognition method, device, equipment and computer readable storage medium
CN109062893A (en) * 2018-07-13 2018-12-21 华南理工大学 A kind of product name recognition methods based on full text attention mechanism
CN109062901A (en) * 2018-08-14 2018-12-21 第四范式(北京)技术有限公司 Neural network training method and device and name entity recognition method and device
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GUL KHAN SAFI QAMAS et al.: "Research on Named Entity Recognition Methods Based on Deep Neural Networks" (基于深度神经网络的命名实体识别方法研究), Netinfo Security (信息网络安全) *
HUI-KANG YI et al.: "A Chinese Named Entity Recognition System with Neural Networks", ITM Web of Conferences *
XIAOCHENG FENG et al.: "Multi-Level Cross-Lingual Attentive Neural Architecture for Low Resource Name Tagging", Tsinghua Science and Technology *
姜宇新: "Research on Biomedical Named Entity Recognition Based on Deep Learning" (基于深度学习的生物医学命名实体识别研究), China Master's Theses Full-text Database, Information Science and Technology (中国优秀博硕士学位论文全文数据库(硕士)信息科技辑) *
常量: "Convolutional Neural Networks in Image Understanding" (图像理解中的卷积神经网络), Acta Automatica Sinica (自动化学报) *
张璞: "Opinion Target Extraction for Chinese Microblogs Based on Deep Learning" (基于深度学习的中文微博评价对象抽取方法), Computer Engineering and Design (计算机工程与设计) *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110516241B (en) * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address resolution method and device, readable storage medium and electronic equipment
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110688854A (en) * 2019-09-02 2020-01-14 平安科技(深圳)有限公司 Named entity recognition method, device and computer readable storage medium
CN110852108A (en) * 2019-11-11 2020-02-28 中山大学 Joint training method, apparatus and medium for entity recognition and entity disambiguation
WO2021114745A1 (en) * 2019-12-13 2021-06-17 华南理工大学 Named entity recognition method employing affix perception for use in social media
CN111241832A (en) * 2020-01-15 2020-06-05 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111274815A (en) * 2020-01-15 2020-06-12 北京百度网讯科技有限公司 Method and device for mining entity attention points in text
CN111274815B (en) * 2020-01-15 2024-04-12 北京百度网讯科技有限公司 Method and device for mining entity focus point in text
US11775761B2 (en) 2020-01-15 2023-10-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for mining entity focus in text
CN111241832B (en) * 2020-01-15 2023-08-15 北京百度网讯科技有限公司 Core entity labeling method and device and electronic equipment
CN111581957B (en) * 2020-05-06 2022-04-12 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN111581957A (en) * 2020-05-06 2020-08-25 浙江大学 Nested entity detection method based on pyramid hierarchical network
CN112836514A (en) * 2020-06-19 2021-05-25 合肥量圳建筑科技有限公司 Nested entity recognition method and device, electronic equipment and storage medium
CN111832293B (en) * 2020-06-24 2023-05-26 四川大学 Entity and relation joint extraction method based on head entity prediction
CN111832293A (en) * 2020-06-24 2020-10-27 四川大学 Entity and relation combined extraction method based on head entity prediction
CN111858817B (en) * 2020-07-23 2021-05-18 中国石油大学(华东) BilSTM-CRF path inference method for sparse track
CN111858817A (en) * 2020-07-23 2020-10-30 中国石油大学(华东) BilSTM-CRF path inference method for sparse track
CN112151183A (en) * 2020-09-23 2020-12-29 上海海事大学 Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN112185572A (en) * 2020-09-25 2021-01-05 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic device and medium
CN112185572B (en) * 2020-09-25 2024-03-01 志诺维思(北京)基因科技有限公司 Tumor specific disease database construction system, method, electronic equipment and medium
CN112307208A (en) * 2020-11-05 2021-02-02 Oppo广东移动通信有限公司 Long text classification method, terminal and computer storage medium
CN114648028A (en) * 2020-12-21 2022-06-21 阿里巴巴集团控股有限公司 Method and device for training label model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110008469B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN110008469A Multi-level named entity recognition method
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN107291693B (en) Semantic calculation method for improved word vector model
CN106776581B (en) Subjective text emotion analysis method based on deep learning
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN110263325B (en) Chinese word segmentation system
CN111325029B (en) Text similarity calculation method based on deep learning integrated model
CN110232192A (en) Electric power term names entity recognition method and device
CN108932226A Method for adding punctuation marks to unpunctuated text
CN113255320A Entity relation extraction method and device based on syntax tree and graph attention mechanism
CN111666758A (en) Chinese word segmentation method, training device and computer readable storage medium
CN110188175A Question-answer pair extraction method, system and storage medium based on a BiLSTM-CRF model
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN112464669B (en) Stock entity word disambiguation method, computer device, and storage medium
CN113128203A (en) Attention mechanism-based relationship extraction method, system, equipment and storage medium
CN113449084A (en) Relationship extraction method based on graph convolution
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN113312918B Legal named entity recognition method fusing radical vectors with word segmentation and a capsule network
CN112699685A (en) Named entity recognition method based on label-guided word fusion
CN111428501A (en) Named entity recognition method, recognition system and computer readable storage medium
CN114417874B (en) Chinese named entity recognition method and system based on graph attention network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant