CN109697285B

CN109697285B - Hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation

Info

Publication number: CN109697285B
Application number: CN201811523661.7A
Authority: CN
Inventors: 王建新; 余颖; 李敏
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2022-06-21
Anticipated expiration: 2038-12-13
Also published as: CN109697285A

Abstract

The invention discloses a hierarchical BiLSTM method with enhanced semantic representation for disease coding and labeling of Chinese electronic medical records. The input electronic medical record text is preprocessed; taking Chinese word formation into account, a BiLSTM with an attention mechanism extracts a character-level feature vector representation that captures the semantic and word-formation features of individual Chinese characters. The character-level word vector representation is concatenated with the word-level vector representation obtained by word2vec training to give a word vector representation enhanced with character features. The text sequence represented by these feature word vectors is then taken as input, a second BiLSTM learns the context features of the whole electronic medical record, and an attention mechanism computes the contribution of each feature word to give a text vector representation weighted by context features, thereby improving prediction performance. The method is suited to disease label classification based on Chinese electronic medical record text and effectively improves classification performance.

Description

Hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation

Technical Field

The invention relates to the field of medical informatics, and in particular to a hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation.

Background

Electronic health records (EHRs) have become an important data resource in clinical research. Medical data storage systems store the information generated during a patient's hospital stay as digital data, which facilitates computer analysis and processing of clinical data. An electronic medical record needs a uniform labeling standard that describes the patient's disease status, so that patient information can be classified sensibly to support clinical decision-making. The International Classification of Diseases (ICD), issued and continuously updated by the World Health Organization, is an internationally used disease coding scheme that is often used to label clinical records with symptoms, signs, diseases, abnormal findings, operations and the like. At present, the tenth revision, ICD-10, is widely used in hospital information systems in China.

Labeling electronic medical records with ICD codes is an important and fundamental task for making use of them. Missing diagnosis names and ICD codes hinder the analysis and study of clinical data. Normally, ICD coding is done manually by the staff of each hospital's medical record room according to the clinical diagnosis description given by a doctor. Manual coding requires coders to know the relevant medical knowledge, coding rules and medical terminology, and it is time-consuming and labor-intensive. Automatic coding by computer can therefore provide effective assistance and improve the efficiency of ICD code labeling.

At present, most automatic disease code labeling works on clinical text data such as radiology reports, death certificates and discharge summaries. However, most research focuses on English corpora; there is little work on disease code prediction for Chinese clinical text, and the main approach there is string-level semantic comparison based on diagnosis names. Semantic similarity comparison places high demands on the quality of the diagnosis name description, and it cannot label codes automatically when the diagnosis name is missing. At present, no related research applies a neural network model to disease code labeling of Chinese electronic medical records.

Processing Chinese electronic medical record text has two characteristics. First, the records are long, and context information is hard to capture in long text. Second, unlike English, a single Chinese character carries meaning of its own; in medical terminology in particular, directions, body parts and the like are often described by a single character, so a representation that includes character features expresses the semantics of a word better.

Disclosure of Invention

The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing a hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation, which completes automatic labeling end to end and improves prediction performance.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows:

A hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation comprises the following steps:

1) segmenting the text with a Chinese word segmentation tool that loads a user-defined dictionary of clinical medical terms, removing stop words, and screening out feature words according to word frequency;

2) computing character-level and word-level vector representations of the feature words, and concatenating the character-level and word-level vectors to build a character-enhanced feature vector representation of each word;

3) obtaining the context features of the whole text from the concatenated feature word vectors, and computing the contribution of each feature word with an attention mechanism to obtain a context-feature-weighted vector representation of the whole text.

In step 1), the feature words are selected according to the following rule:

where $S_{fw}$ denotes the set of feature words, the frequency term denotes the frequency of word $w_i$, and $N_d$ denotes the total number of electronic medical record samples.

In step 2), a bidirectional LSTM fused with an attention mechanism is used to train the character-level feature vector representation of the feature words, and word2vec, a word vector method based on distributed word representations, is used to obtain their word-level vector representation.

The output of the bidirectional long short-term memory network is:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ denotes the hidden-layer output of the forward LSTM at the $t$-th unit, or time $t$, and $\overleftarrow{h_t}$ is the hidden-layer output of the backward LSTM at the $t$-th unit.

The attention mechanism is computed as:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)};$$

$$\sum_{j} \alpha_{ij} h_{ij}$$

where $h_{ij}$ is the hidden-layer output of the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, $\alpha_{ij}$ is the weight of the $j$-th character for the $i$-th word computed by the softmax function, and the weighted sum $\sum_{j} \alpha_{ij} h_{ij}$ is the context-weighted feature vector representation of the $i$-th word.

In step 3), the context-feature-weighted vector of the whole text is computed as follows: the text represented by the concatenated feature word vectors is input into a second-layer bidirectional long short-term memory network, which learns the context features of the whole text, and an attention mechanism computes the weight of each feature word to give a text feature vector weighted by context information.

The attention mechanism is computed as:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)};$$

$$v = \sum_{i} \alpha_i h_i;$$

where $h_i$ is the hidden-layer output obtained after the character-enhanced feature vector of the $i$-th word of the text sequence is processed by the BiLSTM, $W$ is a weight matrix and $b_w$ is a bias vector. When the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to complete the weight computation; $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector representation of the whole text. This vector is input into a fully connected layer, and the occurrence probability of each disease code is computed by a sigmoid function.

Compared with the prior art, the beneficial effects of the invention are as follows. In view of the characteristics of Chinese, the semantic features of individual Chinese characters are merged into the feature vector representation of words, and an attention mechanism weights the feature words that actually contribute to the input sequence, which improves disease code prediction. The method is suited to Chinese clinical text data: text features are extracted automatically by a neural network model, and automatic labeling is completed end to end.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a hierarchical BiLSTM feature learning model incorporating an attention mechanism;

FIG. 3 shows the computation of the attention mechanism: (a) $h_{ij}$ is transformed into $u_{ij}$; (b) the weight of each $u_{ij}$ is computed using the context feature vector; (c) the $h_{ij}$ are summed with these weights to obtain the feature vector representation produced by the attention mechanism;

FIG. 4 is a graph showing the results of an experiment performed in the present invention.

Detailed Description

First, preprocessing of clinical text data

The input discharge summary text is segmented with the Chinese word segmentation tool jieba and a user-defined medical lexicon. Stop words are removed, the frequencies of the remaining words are counted, the words are sorted by frequency in descending order, and feature words are selected according to the following rule:

where $S_{fw}$ denotes the set of feature words, the frequency term denotes the frequency of word $w_i$, and $N_d$ denotes the total number of electronic medical records.
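The frequency screening described above can be sketched in plain Python. The selection rule itself is not reproduced in this text, so the cut-off used here (total count divided by the number of records, compared with `min_ratio`) is a hypothetical stand-in, and the function and parameter names are illustrative only. The input is assumed to be already segmented into tokens.

```python
from collections import Counter


def select_feature_words(tokenized_docs, stop_words, min_ratio=0.01):
    """Return feature words sorted from most to least frequent.

    tokenized_docs: list of token lists, one per discharge summary
                    (assumed to be segmented beforehand).
    stop_words:     set of tokens to discard.
    min_ratio:      hypothetical threshold on count / number of records;
                    the actual rule of the method is not given in the text.
    """
    n_docs = len(tokenized_docs)
    counts = Counter(
        tok for doc in tokenized_docs for tok in doc if tok not in stop_words
    )
    # Sort by descending frequency; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count / n_docs >= min_ratio]


docs = [["患者", "左", "肺", "的"], ["患者", "肺", "炎"], ["患者", "的"]]
features = select_feature_words(docs, stop_words={"的"}, min_ratio=0.5)
```

With three toy records and a threshold of 0.5, only the tokens that occur at least twice survive the screening.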

Second, word vector representation of feature words

1) Character-based word vector representation

First, a vector representation is initialized for each character and fed into a BiLSTM fused with an attention mechanism, which is trained to give a character-level word vector representation of each feature word. The cell state $c_t$ and the output $h_t$ of each neural unit in the BiLSTM are computed as follows ($t = 1, 2, \ldots, n$, where $t$ denotes the $t$-th neural unit in the network, or the neural unit at time $t$):

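$$i_t = \operatorname{sigmoid}(W_i [x_t; h_{t-1}] + b_i) \tag{1}$$

$$f_t = \operatorname{sigmoid}(W_f [x_t; h_{t-1}] + b_f) \tag{2}$$

$$g_t = \tanh(W_g [x_t; h_{t-1}] + b_g) \tag{3}$$

$$o_t = \operatorname{sigmoid}(W_o [x_t; h_{t-1}] + b_o) \tag{4}$$

$$c_t = f_t * c_{t-1} + i_t * g_t \tag{5}$$

$$h_t = o_t * \tanh(c_t) \tag{6}$$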

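Each neural unit comprises an input gate $i$, a forget gate $f$, an output gate $o$, a candidate memory unit $g$, a cell state $c$ and a hidden state $h$, all of which are vectors. $W_i, W_f, W_g, W_o$ are weight matrices and $b_i, b_f, b_g, b_o$ are bias vectors; ";" denotes concatenation and "*" denotes element-wise multiplication. The sigmoid function is computed as

$$\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

the tanh function is computed as

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The output of the BiLSTM is

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden-layer outputs of the forward and backward LSTM at the $t$-th unit.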
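As an illustration of equations (1) to (6), the following NumPy sketch runs one LSTM step and combines a forward and a backward pass into a BiLSTM output. It assumes that the BiLSTM output at each position is the concatenation of the forward and backward hidden states. The helper names (`lstm_step`, `run_lstm`, `bilstm`, `init_params`), the random initialization and the dimensions are illustrative assumptions; no training is performed.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following equations (1)-(6)."""
    z = np.concatenate([x_t, h_prev])          # [x_t ; h_{t-1}]
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])     # input gate, eq. (1)
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])     # forget gate, eq. (2)
    g_t = np.tanh(p["W_g"] @ z + p["b_g"])     # candidate memory, eq. (3)
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])     # output gate, eq. (4)
    c_t = f_t * c_prev + i_t * g_t             # cell state, eq. (5)
    h_t = o_t * np.tanh(c_t)                   # hidden state, eq. (6)
    return h_t, c_t


def run_lstm(xs, p, hidden):
    """Run an LSTM over xs (shape: steps x input_dim) from a zero state."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return np.stack(outputs)


def bilstm(xs, fwd, bwd, hidden):
    """Concatenate forward and backward hidden states at every step."""
    forward = run_lstm(xs, fwd, hidden)
    backward = run_lstm(xs[::-1], bwd, hidden)[::-1]  # re-align to input order
    return np.concatenate([forward, backward], axis=1)


def init_params(input_dim, hidden, rng):
    """Small random weights and zero biases (illustrative only)."""
    p = {}
    for gate in "ifgo":
        p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden, input_dim + hidden))
        p[f"b_{gate}"] = np.zeros(hidden)
    return p


rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                   # 5 characters, 3-dim embeddings
fwd, bwd = init_params(3, 4, rng), init_params(3, 4, rng)
H = bilstm(xs, fwd, bwd, hidden=4)             # shape (5, 8)
```

Each row of `H` holds the forward state in its first four entries and the backward state in its last four, so every position sees context from both directions.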

2) Application of attention mechanism

The attention mechanism is computed as:

$$u_{ij} = \tanh(W_c h_{ij} + b_c) \tag{7}$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)} \tag{8}$$

$$\sum_{j} \alpha_{ij} h_{ij} \tag{9}$$

where $h_{ij}$ is the hidden-layer output of the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the $j$-th character for the $i$-th word computed by the softmax function. The weighted sum (9) is the context-weighted feature vector representation of the $i$-th word.
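The character-level attention step (a tanh projection of each hidden state, a softmax weight against a context vector, then a weighted sum of the hidden states) can be sketched in NumPy as below. The softmax over dot products with the context vector is the usual form of this kind of attention and is assumed here; the function name `attention_pool` and all dimensions are illustrative.

```python
import numpy as np


def attention_pool(H, W, b, u_ctx):
    """Attention-weighted pooling of hidden states.

    H:     (n, d) hidden states, one row per character (or per word).
    W, b:  projection weights (a, d) and bias (a,).
    u_ctx: (a,) context feature vector, randomly initialized in the method.
    Returns (alpha, pooled): weights of shape (n,) and a vector of shape (d,).
    """
    U = np.tanh(H @ W.T + b)            # projected states u, shape (n, a)
    scores = U @ u_ctx                  # similarity to the context vector
    scores = scores - scores.max()      # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # softmax weights
    pooled = alpha @ H                  # weighted sum of the hidden states
    return alpha, pooled


rng = np.random.default_rng(1)
H = rng.normal(size=(4, 6))             # e.g. a 4-character word, 6-dim states
W, b, u_c = rng.normal(size=(5, 6)), np.zeros(5), rng.normal(size=5)
alpha, word_vec = attention_pool(H, W, b, u_c)
```

The same function applies unchanged at the word level, with one row per feature word and a word-level context vector in place of `u_c`.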

3) The character-level word vector obtained by training is concatenated with the word vector generated by word2vec to give a word feature vector enhanced with character-level context features.
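A minimal NumPy sketch of this concatenation follows. The order of the two parts (character-level first, word2vec second), the dimensions and the function name are assumptions made for illustration; the word2vec vectors are taken as given rather than trained here.

```python
import numpy as np


def enhance_word_vectors(char_level, word_level):
    """Concatenate character-level and word2vec vectors word by word.

    char_level: (n_words, d_c) vectors from the character-level BiLSTM.
    word_level: (n_words, d_w) vectors from word2vec.
    Returns an array of shape (n_words, d_c + d_w).
    """
    char_level = np.asarray(char_level, dtype=float)
    word_level = np.asarray(word_level, dtype=float)
    if char_level.shape[0] != word_level.shape[0]:
        raise ValueError("both inputs need one row per feature word")
    return np.concatenate([char_level, word_level], axis=1)


char_vecs = np.ones((3, 8))     # 3 feature words, 8-dim character-level part
w2v_vecs = np.zeros((3, 100))   # 3 feature words, 100-dim word2vec part
enhanced = enhance_word_vectors(char_vecs, w2v_vecs)   # shape (3, 108)
```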

Third, context feature extraction

The sequence of character-enhanced feature vectors is input into the second-layer BiLSTM fused with an attention mechanism to extract the context features of the text. The BiLSTM neural unit and the context feature weighting are computed in the same way as for the character-level word vector representation; the specific formulas are:

$$u_i = \tanh(W h_i + b_w) \tag{10}$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)} \tag{11}$$

$$v = \sum_{i} \alpha_i h_i \tag{12}$$

where $h_i$ is the hidden-layer output obtained after the character-enhanced feature vector of the $i$-th word of the text sequence is processed by the BiLSTM, $W$ is a weight matrix and $b_w$ is a bias vector. When the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to complete the weight computation; $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector representation of the whole text. This vector is input into a fully connected layer, and the occurrence probability of each disease code is computed by a sigmoid function.
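The final step, a fully connected layer followed by a sigmoid that gives one probability per disease code, can be sketched as follows. The decision threshold of 0.5, the toy weights and the three code strings are assumptions for illustration; the text only states that a sigmoid yields the occurrence probability of each code.

```python
import numpy as np


def predict_codes(v, W_out, b_out, code_names, threshold=0.5):
    """Multi-label prediction from the document vector v.

    W_out: (n_codes, d) weights of the fully connected layer.
    b_out: (n_codes,) bias.
    threshold: hypothetical cut-off for assigning a code.
    Returns (probs, picked): per-code probabilities and the assigned codes.
    """
    logits = W_out @ v + b_out
    probs = 1.0 / (1.0 + np.exp(-logits))     # independent sigmoid per code
    picked = [c for c, p in zip(code_names, probs) if p >= threshold]
    return probs, picked


v = np.array([1.0, 0.0])                      # toy document vector
W_out = np.array([[2.0, 0.0], [-2.0, 0.0], [1.0, 0.0]])
b_out = np.zeros(3)
codes = ["I10", "E11.9", "J18.9"]             # illustrative code labels
probs, picked = predict_codes(v, W_out, b_out, codes)
```

Because each code gets its own sigmoid rather than sharing a softmax, several codes can be assigned to one record, which matches the multi-label setting of the task.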

Fourth, experimental verification

1) Procedure of experiment

To verify the effectiveness of the method, experiments were carried out on real Chinese electronic medical record clinical data. The data set comprises 7732 discharge summaries and 1177 ICD-10 disease code labels. An ICD-10 code here is a six-character code with a decimal point, composed of letters and digits and beginning with a letter; its first three characters form the primary code, which indicates the disease category. The average length of a discharge summary is 610 words, and each discharge summary corresponds to 3.6 disease codes on average.
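Mapping a full code to its primary code amounts to keeping the first three characters. The sketch below assumes the full codes are plain strings; the sample strings and the function name are illustrative and are not taken from the data set.

```python
def primary_code(icd10_code):
    """Return the three-character primary code of a full ICD-10 code."""
    return icd10_code.strip().upper()[:3]


full_codes = ["I25.101", "I25.103", "j18.900"]      # illustrative strings
primary_labels = sorted({primary_code(c) for c in full_codes})
```

Grouping by primary code merges full codes that share a disease category, which is why the number of primary labels is smaller than the number of full labels.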

The experiments were run on a server with 256 GB of memory and an NVIDIA GeForce Titan X Pascal CUDA GPU. We split the data set into training and test sets at a 9:1 ratio and repeated the evaluation over ten random splits. The evaluation metrics are the micro-averaged precision (P), recall (R) and the F1 score that combines them, together with the Hamming loss, which measures mislabeling from the per-sample perspective. A higher F1 score and a lower Hamming loss indicate better model performance.
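The four metrics can be computed from binary label matrices as in the following NumPy sketch. It uses the standard definitions of micro-averaged precision, recall and F1 (counts pooled over all samples and codes) and of Hamming loss (fraction of sample-code positions predicted wrongly); the function name and the toy matrices are illustrative.

```python
import numpy as np


def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, F1 and Hamming loss.

    y_true, y_pred: (n_samples, n_codes) arrays of 0/1 labels.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)        # codes correctly assigned
    fp = np.sum(~y_true & y_pred)       # codes assigned but not in the truth
    fn = np.sum(y_true & ~y_pred)       # true codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    hamming = np.mean(y_true != y_pred)  # share of wrong label positions
    return precision, recall, f1, hamming


y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 0], [0, 1, 0]]
p, r, f1, hloss = micro_metrics(y_true, y_pred)
```

With many possible codes and only a few per record, most label positions are correct zeros, so Hamming loss values are small even for a weak model.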

2) Results of the experiment

Since related research indicates that deep learning methods outperform traditional machine learning methods, we mainly compare against other common neural network models. The results are shown in Table 1, where MA-BiLSTM denotes our model and D2V+CNN is a method from related research that currently achieves the best results on the public English data set MIMIC-III. The experimental results show that MA-BiLSTM outperforms the other neural network models on all evaluation metrics, and that a BiLSTM combined with an attention mechanism can effectively capture the context features of long text and improve prediction performance.

TABLE 1 comparative experimental results

Model	Micro_P (95% CI)	Micro_R (95% CI)	Micro_F1 (95% CI)	hLoss (95% CI)
CBOW	0.614 (±6.43e-03)	0.522 (±5.30e-03)	0.564 (±4.52e-03)	0.00248 (±3.14e-05)
CNN	0.647 (±6.67e-03)	0.509 (±6.51e-03)	0.569 (±4.71e-03)	0.00237 (±3.52e-05)
D2V+CNN	0.661 (±9.57e-03)	0.514 (±8.74e-03)	0.579 (±7.14e-03)	0.00231 (±3.70e-05)
MA-BiLSTM	0.704 (±1.13e-02)	0.586 (±5.84e-03)	0.639 (±4.45e-03)	0.00204 (±3.47e-05)

To analyze the effect of each module of the model, we designed an ablation experiment; the results are shown in Table 2. The results show that representing the words of the text with word vectors alone or character vectors alone lowers prediction performance, so the character-enhanced word vector representation does give a better text feature representation. The attention mechanism plays an important role in the model: removing it clearly degrades performance.

Prediction was performed on both the full ICD-10 codes and the primary codes; the 7732 samples correspond to 488 primary codes. The results are shown in FIG. 4. Prediction on the primary codes reaches a precision of 80.5%, so the method can effectively assist the medical record room staff in disease code labeling.

TABLE 2 model ablation experimental results

Claims

1. A hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation, characterized by comprising the following steps:

1) segmenting the discharge summary text with a Chinese word segmentation tool that loads a user-defined dictionary of clinical medical terms, removing stop words, and screening out feature words according to word frequency;

2) computing character-level and word-level vector representations of the feature words, and concatenating the character-level and word-level vectors to build a character-enhanced feature vector representation of each word; wherein a BiLSTM fused with an attention mechanism is used to train the character-level feature vector representation of the feature words, and word2vec, a word vector method based on distributed word representations, is used to obtain the word-level vector representation of the feature words;

3) obtaining the word vector representation sequence of the whole text from the concatenated feature words, and computing the contribution of each feature word with an attention mechanism to obtain a context-feature-weighted vector representation of the whole text; that is, inputting the text represented by the concatenated feature word vectors into a second-layer bidirectional long short-term memory network, learning the context features of the whole text, and computing the weight of each feature word with the attention mechanism to obtain a text feature vector weighted by context information;

the attention mechanism is computed as:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)};$$

$$v = \sum_{i} \alpha_i h_i;$$

where $h_i$ is the hidden-layer output obtained after the character-enhanced feature vector of the $i$-th word of the text sequence is processed by the BiLSTM, $W$ is a weight matrix, and $b_w$ is a bias vector; when the attention mechanism is applied, a word-level document context feature vector $u_w$ is introduced and randomly initialized to complete the weight computation; $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector representation of the whole text; this vector is input into a fully connected layer, and the occurrence probability of each disease code is computed by a sigmoid function.

2. The hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation according to claim 1, wherein in step 1), the feature words are selected according to the following rule:

where $S_{fw}$ denotes the set of feature words, the frequency term denotes the frequency of word $w_i$, and $N_d$ denotes the total number of electronic medical record samples.

3. The hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation according to claim 1, wherein the output of the BiLSTM is:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where $\overrightarrow{h_t}$ denotes the hidden-layer output of the forward LSTM at the $t$-th unit, or time $t$, and $\overleftarrow{h_t}$ is the hidden-layer output of the backward LSTM at the $t$-th unit.

4. The hierarchical BiLSTM Chinese electronic medical record disease coding and labeling method for enhancing semantic representation according to claim 1, wherein in step 2), the attention mechanism is computed as:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)};$$

$$\sum_{j} \alpha_{ij} h_{ij}$$

where $h_{ij}$ is the hidden-layer output of the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, $\alpha_{ij}$ is the weight of the $j$-th character for the $i$-th word computed by the softmax function, and the weighted sum $\sum_{j} \alpha_{ij} h_{ij}$ is the context-weighted feature vector representation of the $i$-th word.