
CN115238693A - Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory - Google Patents

Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory Download PDF

Info

Publication number
CN115238693A
Authority
CN
China
Prior art keywords
layer
model
output
bilstm
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210809038.8A
Other languages
Chinese (zh)
Inventor
张锋
程振宁
陈婕卿
曾可
姜会珍
李大伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Anne Fox Information Consulting Co ltd
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Original Assignee
Beijing Anne Fox Information Consulting Co ltd
Peking Union Medical College Hospital Chinese Academy of Medical Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Anne Fox Information Consulting Co ltd, Peking Union Medical College Hospital Chinese Academy of Medical Sciences filed Critical Beijing Anne Fox Information Consulting Co ltd
Priority to CN202210809038.8A priority Critical patent/CN115238693A/en
Publication of CN115238693A publication Critical patent/CN115238693A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory, which improves named entity recognition accuracy by modifying the BERT-BILSTM-CRF model. The input and output of the named entity recognition model are determined as follows: taking medical text as the research object, a medical text data set with entity labels serves as the input of the model, and the output is the entity label result given after medical entity prediction is performed on the data set. To further strengthen the model's ability to extract contextual features from text, the invention, on one hand, applies a multi-word segmentation method to add local context features and, on the other hand, introduces a multi-layer bidirectional long-short term memory method that adds global context features by setting BILSTM models of different depths and introduces external knowledge from a medical dictionary; by enriching the semantic feature information available during model learning, the accuracy of the named entity recognition task is further improved.

Description

Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
Technical Field
The invention relates to the technical field of Chinese named entity recognition, in particular to a Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory.
Background
Named entity recognition is a challenging fundamental task in natural language processing. As a basic task, it plays a key role in information extraction, knowledge graph construction and the like, and in some specific domains research on named entity recognition has already found widespread and mature application. Current named entity recognition methods mainly fall into dictionary- and rule-based methods, statistical machine learning methods, and deep learning methods.
Dictionary- and rule-based methods, which recognize entities through string matching and manually constructed entity extraction rules, can achieve good accuracy on small data sets but become impractical as data sets grow. Statistical machine learning methods include hidden Markov models, support vector machines, conditional random fields and the like. Although these approaches reduce the dictionary- and rule-building workload to some extent, manually designed features and external knowledge remain unavoidable; as a result, such methods generally suit only the domain at hand and can hardly be applied directly to named entity recognition in a brand-new domain. Deep learning methods, including the BERT, CNN and BILSTM models, have seen widespread use and breakthroughs in recent years. Compared with machine learning models, deep learning can learn high-dimensional, deep feature representations, which helps improve the generalization ability of entity recognition.
Although existing deep learning research achieves good results on medical named entity recognition, the task still faces several difficulties and challenges:
(1) the single-granularity text representation of existing methods captures only the global context features of the text and lacks local context information, which hinders further improvement of model performance;
(2) the commonly adopted single BILSTM captures context features of only one specific dimension and ignores the contribution that context features of other dimensions make to model performance.
It is therefore necessary to design a Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory.
Disclosure of Invention
The invention provides a Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory, which modifies the BERT-BILSTM-CRF model widely used in named entity recognition. A multi-word segmentation module is introduced, in which each sentence of the data set is segmented into multiple words and local features are then extracted through a Word-Level BILSTM. A multi-layer BILSTM module, consisting of a BILSTM and Attention, is also introduced: by setting different hidden layer parameters for the BILSTM, text context features of different dimensions can be learned, after which Attention captures the important information. The two modules enrich the information available during model learning and thereby improve the accuracy of named entity recognition.
In order to achieve the purpose, the invention adopts the following technical scheme:
a Chinese named entity recognition method based on multi-word segmentation and multilayer bidirectional long-short term memory improves the recognition precision of named entities by modifying a BERT-BILSTM-CRF model; the method comprises the following steps:
step S1: determining input and output of the named entity recognition model: taking a medical text as a research object, taking a medical text data set with entity labels as the input of a named entity recognition model, wherein the output of the model is an entity label result given after medical entity prediction is carried out on the data set;
step S2: designing a medical named entity recognition model with multi-word segmentation and multi-layer bidirectional long-short term memory, wherein the model consists of an input layer, a word embedding layer, a semantic feature extraction layer, a CRF layer and an output layer; the model comprises a BERT pre-training language model, a bidirectional long-short term memory model BILSTM, an attention mechanism and a conditional random field CRF; the main method of the medical named entity recognition model comprises the following steps:
(1) an input layer: this layer is used to input the data set;
(2) word embedding layer: the layer encodes the characters in a text into vector representations through the BERT pre-training language model; the output after the BERT model is represented as V = [V_1, V_2, ..., V_n], where n denotes the total number of characters in the current sentence;
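BERT itself is too large to reproduce here; the following Python sketch only illustrates the interface assumed by this layer (characters in, one vector V_i per character out). The `ToyCharEmbedder` lookup, its dimension and vocabulary are illustrative assumptions, not the pre-trained model of the patent:

```python
import numpy as np

class ToyCharEmbedder:
    """Stand-in for the word embedding layer's interface: maps each
    character of a sentence to a d-dimensional vector V_i.

    A real implementation would use a pre-trained BERT encoder and
    produce contextual vectors; this fixed random lookup table is
    purely illustrative.
    """

    def __init__(self, vocab, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.table = {ch: rng.standard_normal(dim) for ch in vocab}
        self.unk = np.zeros(dim)  # fallback for out-of-vocabulary characters

    def encode(self, sentence):
        # V = [V_1, V_2, ..., V_n], one vector per character
        return [self.table.get(ch, self.unk) for ch in sentence]
```

In the claimed method, the contextual BERT vectors would replace this lookup while keeping the same characters-in, vectors-out contract.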
(3) semantic feature extraction layer: the layer combines a multi-word segmentation module and a multi-layer BILSTM module. The multi-word segmentation module mainly extracts features through a Word-Level BILSTM module, and the multi-layer BILSTM module acquires feature information of different dimensions by setting hidden layers of different sizes and captures important information with an attention mechanism; the specific process is as follows:
1) Multi-word segmentation module
The Word-Level BILSTM module is formed based on a BILSTM model; BILSTM is formed by combining forward LSTM and backward LSTM; LSTM is expressed by mathematical expressions as shown in equations 1-6:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (3)
C_t = f_t * C_{t-1} + i_t * C̃_t   (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)
h_t = o_t * tanh(C_t)   (6)
where t and t-1 denote the current and previous time steps, h denotes the hidden state, σ and tanh denote the sigmoid and tanh activation functions, W denotes a weight matrix, b denotes a bias vector, and * denotes element-wise multiplication;
the output of BILSTM is a forward LSTM output
Figure BDA0003739733730000041
And negative LSTM output
Figure BDA0003739733730000042
Is represented by
Figure BDA0003739733730000043
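Equations (1)-(6) and the forward/backward concatenation can be sketched in NumPy as follows; the stacked gate layout, small random initialization and hidden size are illustrative assumptions rather than the trained parameters of the claimed model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following Equations (1)-(6). W maps the concatenated
    [h_{t-1}, x_t] to the four stacked gate pre-activations (f, i, candidate, o)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])              # forget gate, Eq. (1)
    i = sigmoid(z[H:2*H])            # input gate, Eq. (2)
    c_tilde = np.tanh(z[2*H:3*H])    # candidate cell state, Eq. (3)
    c = f * c_prev + i * c_tilde     # cell state update, Eq. (4)
    o = sigmoid(z[3*H:4*H])          # output gate, Eq. (5)
    h = o * np.tanh(c)               # hidden state, Eq. (6)
    return h, c

def bilstm(xs, H, rng):
    """Run a forward and a backward LSTM over the sequence and
    concatenate their hidden states: h_t = [h_t^fwd ; h_t^bwd]."""
    D = xs[0].shape[0]
    W_f, b_f = 0.1 * rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
    W_b, b_b = 0.1 * rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
    fwd, h, c = [], np.zeros(H), np.zeros(H)
    for x in xs:
        h, c = lstm_step(x, h, c, W_f, b_f)
        fwd.append(h)
    bwd, h, c = [], np.zeros(H), np.zeros(H)
    for x in reversed(xs):
        h, c = lstm_step(x, h, c, W_b, b_b)
        bwd.append(h)
    bwd.reverse()
    return [np.concatenate([f_t, b_t]) for f_t, b_t in zip(fwd, bwd)]
```

Because h_t = o_t * tanh(C_t) with o_t in (0, 1), every hidden component stays strictly inside (-1, 1), which the sketch preserves.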
2) Multilayer BILSTM module
The module integrates a BILSTM and an Attention mechanism; hidden layers of different sizes are set for the BILSTM to extract context features of different dimensions; the Attention mechanism is used to distinguish the differing importance of different features;
The attention mechanism layer assigns weights to the feature vectors h_t output by the BILSTM layer, and the joint output feature vector W_{t(k)} of the t-th word across the BILSTM layer and the attention layer is computed as in Equations 7-9:

W_{t(k)} = Σ_i a_{t,i} · h_i   (7)

a_{t,i} = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j))   (8)

score(s_t, h_i) = v · tanh(w[s_t, h_i])   (9)

where a_{t,i} is the attention weight, and the score function is an alignment model that assigns a score according to how well the input and output match at time i, defining how much weight each output assigns to each input hidden state; W_{t(k)} denotes the output of the t-th word through the k-th MBA model, where k takes the values 1 and 2;
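A minimal NumPy sketch of this attention step, assuming additive (Bahdanau-style) scoring; the parameter shapes of `v` and `w` are illustrative assumptions:

```python
import numpy as np

def attention_step(s_t, hs, v, w):
    """Additive attention over BILSTM states:
    score(s_t, h_i) = v . tanh(w @ [s_t; h_i])   (Eq. 9)
    a_{t,i} = softmax over i of the scores       (Eq. 8)
    W_t = sum_i a_{t,i} * h_i                    (Eq. 7)
    """
    scores = np.array([v @ np.tanh(w @ np.concatenate([s_t, h])) for h in hs])
    e = np.exp(scores - scores.max())   # numerically stable softmax
    a = e / e.sum()                     # attention weights a_{t,i}
    W_t = sum(a_i * h for a_i, h in zip(a, hs))
    return a, W_t
```

The softmax guarantees the weights are positive and sum to one, so W_t is a convex combination of the hidden states.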
the final output O of the semantic feature extraction layer is obtained by fusing the output of the multi-word segmentation module and the output of the multi-layer BILSTM, and is expressed by a mathematical expression as formula 10:
Figure BDA0003739733730000051
the final output sequence of the layer model is [ O ] 1 ,O 2 ...,O n ];
(4) CRF layer:
the main role of this layer is to predict the labels; during training, the layer automatically learns constraints between labels and ensures that the predicted label sequence is legal; the matrix P is a score matrix, where P_{i,j} is the probability of classifying the i-th character as the j-th tag, and A_{i,j} is the state transition score from the i-th tag to the j-th tag; for an input sentence x = (x_1, x_2, ..., x_n) with tag sequence y = (y_1, y_2, ..., y_n), the score is as follows:

Score(x, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}   (11)

Score(x, y) is normalized using the Softmax function, as follows:

p(y|x) = exp(Score(x, y)) / Σ_{ỹ} exp(Score(x, ỹ))   (12)

During training, for a training sample (x, y), the log probability of the tag sequence is maximized using the following formula:

log p(y|x) = Score(x, y) - log Σ_{ỹ} exp(Score(x, ỹ))   (13)

The Viterbi algorithm is used to solve for the maximum-probability path by dynamic programming, with the formula:

Y* = argmax_{ỹ} Score(x, ỹ)   (14)

Y* is the sequence with the highest score under the scoring function, i.e. the expected output of the model, and argmax denotes maximization over all candidate tag sequences;
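The sequence score of Equation 11 and the Viterbi decoding of Equation 14 can be sketched as follows; the start/stop transition terms sometimes added to CRF layers are omitted here for brevity:

```python
import numpy as np

def crf_score(P, A, y):
    """Score(x, y) = sum A[y_i, y_{i+1}] + sum P[i, y_i]  (Eq. 11);
    start/stop transitions are omitted in this sketch."""
    s = P[0, y[0]]
    for i in range(1, len(y)):
        s += A[y[i - 1], y[i]] + P[i, y[i]]
    return s

def viterbi(P, A):
    """Y* = argmax_y Score(x, y) by dynamic programming (Eq. 14)."""
    n, K = P.shape
    dp = P[0].copy()                    # best score ending in each tag
    back = np.zeros((n, K), dtype=int)  # argmax backpointers
    for i in range(1, n):
        cand = dp[:, None] + A + P[i][None, :]   # K x K candidate scores
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    y = [int(dp.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    y.reverse()
    return y, float(dp.max())
```

For small sequences the decoded path can be checked against exhaustive enumeration of all K^n tag sequences.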
(5) output layer: the layer outputs the labeling results of all texts in the data set; the evaluation indices are the precision P, recall R and F1 value, as shown in Equations 15, 16 and 17:

P = T_P / (T_P + F_P)   (15)

R = T_P / (T_P + F_N)   (16)

F1 = 2 · P · R / (P + R)   (17)

where T_P denotes the number of medical entities the model identifies correctly, F_P denotes the number of unrelated medical entities the model identifies, and F_N denotes the number of relevant medical entities the model fails to identify; F1 is the weighted harmonic mean of P and R.
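Equations 15-17 computed over sets of predicted and gold entities; representing each entity as a (start, end, type) tuple is an illustrative assumption:

```python
def prf1(gold_entities, pred_entities):
    """Precision, recall and F1 over entity sets, Equations (15)-(17).
    T_P: entities both predicted and correct; F_P: predicted but not
    in the gold set; F_N: gold entities the model missed."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With one correct, one spurious and one missed entity, all three indices equal 0.5.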
The beneficial effects of the invention are as follows: the invention further strengthens the model's ability to extract contextual features from text. On one hand, a multi-word segmentation method adds local context features, obtains feature information of different dimensions, and uses an attention mechanism to capture important information; on the other hand, a multi-layer bidirectional long-short term memory method adds global context features by setting BILSTM models of different depths and introduces external knowledge from a medical dictionary. Enriching the semantic feature information available during model learning further improves the accuracy of the named entity recognition task.
Drawings
FIG. 1 is an overall process flow of the medical named entity recognition model of the present invention;
FIG. 2 is the overall processing procedure of the multi-segmentation module of the present invention;
FIG. 3 is an overall process of the multi-layer BILSTM module of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Specific examples are given below.
Referring to FIGS. 1-3, a Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory improves named entity recognition accuracy by modifying the BERT-BILSTM-CRF model; with reference to FIG. 1, the method comprises the following steps:
step S1: determining input and output of the named entity recognition model: taking a medical text as a research object, taking a medical text data set with entity labels as the input of a named entity recognition model, wherein the output of the model is an entity label result given after medical entity prediction is carried out on the data set;
step S2: designing a medical named entity recognition model with multi-word segmentation and multi-layer bidirectional long-short term memory, wherein the model consists of an input layer, a word embedding layer, a semantic feature extraction layer, a CRF layer and an output layer; the model comprises a BERT pre-training language model, a bidirectional long-short term memory model BILSTM, an attention mechanism and a conditional random field CRF; the main method of the medical named entity recognition model sequentially comprises the following steps:
(1) an input layer: this layer is used to input the data set;
(2) word embedding layer: the layer encodes the characters in a text into vector representations through the BERT pre-training language model; the output after the BERT model is represented as V = [V_1, V_2, ..., V_n], where n denotes the total number of characters in the current sentence;
(3) semantic feature extraction layer: the layer combines a multi-word segmentation module and a multi-layer BILSTM module. The multi-word segmentation module mainly extracts features through a Word-Level BILSTM module, and the multi-layer BILSTM module acquires feature information of different dimensions by setting hidden layers of different sizes and captures important information with an attention mechanism; the specific process is as follows:
1) Multi-word segmentation module
The Word-Level BILSTM module is formed based on a BILSTM model; the BILSTM is formed by combining a forward LSTM and a backward LSTM; LSTM is expressed by mathematical expressions as shown in equations 1-6:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (3)
C_t = f_t * C_{t-1} + i_t * C̃_t   (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)
h_t = o_t * tanh(C_t)   (6)
where t and t-1 denote the current and previous time steps, h denotes the hidden state, σ and tanh denote the sigmoid and tanh activation functions, W denotes a weight matrix, b denotes a bias vector, and * denotes element-wise multiplication;
the output of BILSTM is a forward LSTM output
Figure BDA0003739733730000081
And negative LSTM output
Figure BDA0003739733730000082
Is represented by
Figure BDA0003739733730000083
The multi-word segmentation module integrates multiple BILSTMs, as shown in FIG. 2. Taking the text "postpartum diagnosed as diabetes" as an example, the sequence after full word segmentation is represented as ['postpartum', 'diagnosed as', 'diabetes'], and local context features are captured for each word through a BILSTM model.
In FIG. 2, V_i is the character vector generated by the word embedding layer for the i-th character; Words_i is the i-th word after multi-word segmentation; W_i^j is the output representation of the j-th character as it appears within the i-th word after word-level local feature analysis; and W_i is the output of the i-th word after the Word-Level BILSTM module;
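The patent does not specify which segmenter produces the word sequence; as a stand-in, a forward-maximum-matching sketch over a hypothetical medical dictionary reproduces the running example (assuming the original text is 产后诊断为糖尿病, "postpartum diagnosed as diabetes"):

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward maximum matching: at each position greedily take the
    longest dictionary word, falling back to a single character.
    A simple illustrative stand-in for the segmenter the method assumes."""
    out, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + L]
            if L == 1 or piece in vocab:
                out.append(piece)
                i += L
                break
    return out

# Hypothetical medical dictionary covering the running example
vocab = {"产后", "诊断为", "糖尿病"}
segments = fmm_segment("产后诊断为糖尿病", vocab)  # ['产后', '诊断为', '糖尿病']
```

Each segment would then be fed to its own BILSTM for word-level local feature capture, as FIG. 2 depicts.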
2) Multilayer BILSTM module
The module integrates the BILSTM and the Attention mechanism, see FIG. 3; hidden layers with different sizes are arranged on the BILSTM to extract context features with different dimensions; the Attention mechanism is used to distinguish different degrees of importance of different features;
The attention mechanism layer assigns weights to the feature vectors h_t output by the BILSTM layer, and the joint output feature vector W_{t(k)} of the t-th word across the BILSTM layer and the attention layer is computed as in Equations 7-9:

W_{t(k)} = Σ_i a_{t,i} · h_i   (7)

a_{t,i} = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j))   (8)

score(s_t, h_i) = v · tanh(w[s_t, h_i])   (9)

where a_{t,i} is the attention weight, and the score function is an alignment model that assigns a score according to how well the input and output match at time i, defining how much weight each output assigns to each input hidden state; W_{t(k)} denotes the output of the t-th word through the k-th MBA model, where k takes the values 1 and 2;
The final output O of the semantic feature extraction layer is obtained by fusing the output of the multi-word segmentation module with the output of the multi-layer BILSTM module, expressed mathematically as Equation 10:

O_t = W_t ⊕ W_{t(1)} ⊕ W_{t(2)}   (10)

where W_t is the multi-word segmentation module output and ⊕ denotes feature fusion; the final output sequence of this layer is [O_1, O_2, ..., O_n];
(4) CRF layer:
the main role of this layer is to predict the labels; during training, the layer automatically learns constraints between labels and ensures that the predicted label sequence is legal; the matrix P is a score matrix, where P_{i,j} is the probability of classifying the i-th character as the j-th tag, and A_{i,j} is the state transition score from the i-th tag to the j-th tag; for an input sentence x = (x_1, x_2, ..., x_n) with tag sequence y = (y_1, y_2, ..., y_n), the score is as follows:

Score(x, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}   (11)

Score(x, y) is normalized using the Softmax function, as follows:

p(y|x) = exp(Score(x, y)) / Σ_{ỹ} exp(Score(x, ỹ))   (12)

During training, for a training sample (x, y), the log probability of the tag sequence is maximized using the following formula:

log p(y|x) = Score(x, y) - log Σ_{ỹ} exp(Score(x, ỹ))   (13)

The Viterbi algorithm is used to solve for the maximum-probability path by dynamic programming, with the formula:

Y* = argmax_{ỹ} Score(x, ỹ)   (14)

Y* is the sequence with the highest score under the scoring function, i.e. the expected output of the model, and argmax denotes maximization over all candidate tag sequences;
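The normalizer shared by Equations 12 and 13 is a sum over exponentially many tag sequences; it can be computed in linear time with the forward algorithm, sketched here in NumPy (start/stop transitions again omitted as an illustrative simplification):

```python
import numpy as np

def logsumexp(v):
    """Numerically stable log of a sum of exponentials."""
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def log_partition(P, A):
    """log Σ_y exp(Score(x, y)) over all tag sequences, i.e. the
    normalizer of Eq. (12) and the subtracted term of Eq. (13),
    computed by the forward algorithm in O(n · K²) time."""
    n, K = P.shape
    alpha = P[0].copy()  # log-scores of all length-1 prefixes
    for i in range(1, n):
        alpha = np.array(
            [logsumexp(alpha + A[:, k]) + P[i, k] for k in range(K)]
        )
    return logsumexp(alpha)
```

For small n and K the result can be verified against brute-force summation over every tag sequence.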
(5) output layer: the layer outputs the labeling results of all texts in the data set; the evaluation indices are the precision P, recall R and F1 value, as shown in Equations 15, 16 and 17:

P = T_P / (T_P + F_P)   (15)

R = T_P / (T_P + F_N)   (16)

F1 = 2 · P · R / (P + R)   (17)

where T_P denotes the number of medical entities the model identifies correctly, F_P denotes the number of unrelated medical entities the model identifies, and F_N denotes the number of relevant medical entities the model fails to identify; F1 is the weighted harmonic mean of P and R.
The above description is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention.

Claims (1)

1. A Chinese named entity recognition method based on multi-word segmentation and multilayer bidirectional long-short term memory is characterized in that the recognition precision of a named entity is improved by modifying a BERT-BILSTM-CRF model; the method comprises the following steps:
step S1: determining input and output of the named entity recognition model: taking a medical text as a research object, taking a medical text data set with entity labels as the input of a named entity recognition model, wherein the output of the model is an entity label result given after medical entity prediction is carried out on the data set;
step S2: designing a medical named entity recognition model with multi-word segmentation and multi-layer bidirectional long-short term memory, wherein the model consists of an input layer, a word embedding layer, a semantic feature extraction layer, a CRF layer and an output layer; the model comprises a BERT pre-training language model, a bidirectional long-short term memory model BILSTM, an attention mechanism and a conditional random field CRF; the main method of the medical named entity recognition model sequentially comprises the following steps:
(1) an input layer: this layer is used to input the data set;
(2) word embedding layer: the layer encodes the characters in a text into vector representations through the BERT pre-training language model; the output after the BERT model is represented as V = [V_1, V_2, ..., V_n], where n denotes the total number of characters in the current sentence;
(3) semantic feature extraction layer: the layer combines a multi-word segmentation module and a multi-layer BILSTM module. The multi-word segmentation module mainly extracts features through a Word-Level BILSTM module, and the multi-layer BILSTM module acquires feature information of different dimensions by setting hidden layers of different sizes and captures important information with an attention mechanism; the specific process is as follows:
1) Multi-word segmentation module
The Word-Level BILSTM module is formed based on a BILSTM model; the BILSTM is formed by combining a forward LSTM and a backward LSTM; LSTM is expressed by mathematical expressions as shown in equations 1-6:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)   (1)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)   (2)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (3)
C_t = f_t * C_{t-1} + i_t * C̃_t   (4)
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)   (5)
h_t = o_t * tanh(C_t)   (6)
where t and t-1 denote the current and previous time steps, h denotes the hidden state, σ and tanh denote the sigmoid and tanh activation functions, W denotes a weight matrix, b denotes a bias vector, and * denotes element-wise multiplication;
the output of BILSTM is a forward LSTM output
Figure FDA0003739733720000021
And negative LSTM output
Figure FDA0003739733720000022
Is represented by
Figure FDA0003739733720000023
2) Multilayer BILSTM module
The module integrates a BILSTM and an Attention mechanism; hidden layers of different sizes are set for the BILSTM to extract context features of different dimensions; the Attention mechanism is used to distinguish the differing importance of different features;
The attention mechanism layer assigns weights to the feature vectors h_t output by the BILSTM layer, and the joint output feature vector W_{t(k)} of the t-th word across the BILSTM layer and the attention layer is computed as in Equations 7-9:

W_{t(k)} = Σ_i a_{t,i} · h_i   (7)

a_{t,i} = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j))   (8)

score(s_t, h_i) = v · tanh(w[s_t, h_i])   (9)

where a_{t,i} is the attention weight, and the score function is an alignment model that assigns a score according to how well the input and output match at time i, defining how much weight each output assigns to each input hidden state; W_{t(k)} denotes the output of the t-th word through the k-th MBA model, where k takes the values 1 and 2;
The final output O of the semantic feature extraction layer is obtained by fusing the output of the multi-word segmentation module with the output of the multi-layer BILSTM module, expressed mathematically as Equation 10:

O_t = W_t ⊕ W_{t(1)} ⊕ W_{t(2)}   (10)

where W_t is the multi-word segmentation module output and ⊕ denotes feature fusion; the final output sequence of this layer is [O_1, O_2, ..., O_n];
(4) CRF layer:
the main role of this layer is to predict the labels; during training, the layer automatically learns constraints between labels and ensures that the predicted label sequence is legal; the matrix P is a score matrix, where P_{i,j} is the probability of classifying the i-th character as the j-th tag, and A_{i,j} is the state transition score from the i-th tag to the j-th tag; for an input sentence x = (x_1, x_2, ..., x_n) with tag sequence y = (y_1, y_2, ..., y_n), the score is as follows:

Score(x, y) = Σ_{i=1}^{n-1} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}   (11)

Score(x, y) is normalized using the Softmax function, as follows:

p(y|x) = exp(Score(x, y)) / Σ_{ỹ} exp(Score(x, ỹ))   (12)

During training, for a training sample (x, y), the log probability of the tag sequence is maximized using the following formula:

log p(y|x) = Score(x, y) - log Σ_{ỹ} exp(Score(x, ỹ))   (13)

The Viterbi algorithm is used to solve for the maximum-probability path by dynamic programming, with the formula:

Y* = argmax_{ỹ} Score(x, ỹ)   (14)

Y* is the sequence with the highest score under the scoring function, i.e. the expected output of the model, and argmax denotes maximization over all candidate tag sequences;
(5) output layer: the layer outputs the labeling results of all texts in the data set; the evaluation indices are the precision P, recall R and F1 value, as shown in Equations 15, 16 and 17:

P = T_P / (T_P + F_P)   (15)

R = T_P / (T_P + F_N)   (16)

F1 = 2 · P · R / (P + R)   (17)

where T_P denotes the number of medical entities the model identifies correctly, F_P denotes the number of unrelated medical entities the model identifies, and F_N denotes the number of relevant medical entities the model fails to identify; F1 is the weighted harmonic mean of P and R.
CN202210809038.8A 2022-07-11 2022-07-11 Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory Pending CN115238693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210809038.8A CN115238693A (en) 2022-07-11 2022-07-11 Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210809038.8A CN115238693A (en) 2022-07-11 2022-07-11 Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory

Publications (1)

Publication Number Publication Date
CN115238693A true CN115238693A (en) 2022-10-25

Family

ID=83671477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210809038.8A Pending CN115238693A (en) 2022-07-11 2022-07-11 Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory

Country Status (1)

Country Link
CN (1) CN115238693A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116501884A * 2023-03-31 2023-07-28 重庆大学 Medical entity identification method based on BERT-BiLSTM-CRF
CN117933259A * 2024-03-25 2024-04-26 成都中医药大学 Named entity recognition method based on local text information
CN118278507A * 2024-06-04 2024-07-02 南京大学 Method for constructing knowledge graph of biological medicine industry
CN118278507B * 2024-06-04 2024-10-01 南京大学 Method for constructing knowledge graph of biological medicine industry

Similar Documents

Publication Publication Date Title
CN110083831B (en) Chinese named entity identification method based on BERT-BiGRU-CRF
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN111401061A Method for identifying news opinion involved in case based on BERT and BiLSTM-Attention
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN115238693A (en) Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory
CN106980609A (en) A kind of name entity recognition method of the condition random field of word-based vector representation
CN113673254B (en) Knowledge distillation position detection method based on similarity maintenance
CN111414481A (en) Chinese semantic matching method based on pinyin and BERT embedding
CN107797987B (en) Bi-LSTM-CNN-based mixed corpus named entity identification method
CN110555084A (en) remote supervision relation classification method based on PCNN and multi-layer attention
Xing et al. A convolutional neural network for aspect-level sentiment classification
CN114153971B (en) Error correction recognition and classification equipment for Chinese text containing errors
CN115687626A (en) Legal document classification method based on prompt learning fusion key words
CN114386417A (en) Chinese nested named entity recognition method integrated with word boundary information
CN113987183A (en) Power grid fault handling plan auxiliary decision-making method based on data driving
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN110134950B (en) Automatic text proofreading method combining words
CN113239663B (en) Multi-meaning word Chinese entity relation identification method based on Hopkinson
CN114398488A (en) Bilstm multi-label text classification method based on attention mechanism
CN115238697A (en) Judicial named entity recognition method based on natural language processing
CN116029305A (en) Chinese attribute-level emotion analysis method, system, equipment and medium based on multitask learning
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN108536781B (en) Social network emotion focus mining method and system
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN115098673A (en) Business document information extraction method based on variant attention and hierarchical structure

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination