CN112287665B

CN112287665B - Chronic disease data analysis method and system based on natural language processing and integrated training

Info

Publication number: CN112287665B
Application number: CN202011116445.8A
Authority: CN
Inventors: 亓晋; 张及棠; 孙雁飞; 闫文卿
Original assignee: Nanjing Nanyou Institute Of Information Technovation Co ltd; Nanjing University of Posts and Telecommunications
Current assignee: Nanjing Nanyou Institute Of Information Technovation Co ltd; Nanjing University of Posts and Telecommunications
Priority date: 2020-10-19
Filing date: 2020-10-19
Publication date: 2024-05-03
Anticipated expiration: 2040-10-19
Also published as: CN112287665A

Abstract

The invention discloses a chronic disease data analysis method and system based on natural language processing and integrated training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module. The data preprocessing module extracts chronic disease data from an external chronic disease database, generates the corresponding word vectors, and quantizes them for use as training samples. The data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory network for training, obtains hidden vectors, and passes the hidden vectors to a conditional probability field to calculate character labels. The data training module performs classification training to extract a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments, and passes the model to the data visualization module for statistical analysis; the result is then passed to an external user interface module for presentation. In this way, a large amount of unstructured electronic medical data is fully used, and natural language processing is combined with an integrated-training neural network, to carry out cause analysis and condition prediction for chronic diseases, so that treatment can be matched to the symptoms.

Description

Chronic disease data analysis method and system based on natural language processing and integrated training

Technical Field

The invention relates to the technical field of chronic disease data analysis, in particular to a chronic disease data analysis method and system based on natural language processing and integrated training.

Background

Chronic non-communicable diseases (chronic diseases) are a public health problem. Their etiology is complex and unclear, many factors influence the effect of treatment, and they are difficult to cure. According to research by the World Health Organization, 60% of the causes of chronic disease depend on an individual's lifestyle, and they are also related to factors such as heredity, medical conditions, social conditions and climate. Within lifestyle, unhealthy diet, insufficient physical activity, tobacco use and harmful use of alcohol are the four major risk factors for chronic diseases.

The diagnosis and treatment of chronic diseases produce large amounts of electronic medical data, most of which are unstructured text, and this makes the data difficult to analyze. To address this challenge, natural language processing techniques have been used domestically and abroad to identify unstructured content in electronic medical data, such as medical concepts and patient symptom descriptions. Currently, three kinds of methods are mainly used for symptom identification: dictionary- or rule-based methods, machine-learning-based methods, and deep-learning-based methods. Among these, deep-learning-based methods handle the symptom recognition problem well, with an average F-value of 92.31% in trials on a large number of samples.

However, current electronic medical data analysis platforms cannot effectively use the large amount of electronic medical data to analyze the causes of chronic diseases or to predict how a condition will develop, and so cannot support treatment matched to the symptoms of chronic disease patients.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention aims to provide a chronic disease data analysis method and system based on natural language processing and integrated training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein: the data preprocessing module extracts chronic disease data from an external chronic disease database, generates the corresponding word vectors, and quantizes the word vectors for use as training samples; the data recognition module inputs the word vectors of the training samples into the bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and passes the hidden vectors to the conditional probability field to calculate the character label of each piece of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment; the data training module receives the character labels of the chronic disease data, inputs them into the integrated learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment; the data visualization module performs statistical analysis on the association model using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module, where it is presented as a statistical chart.
Through these measures, a large amount of unstructured electronic medical data is fully used, and natural language processing is combined with integrated-training neural network technology to analyze the causes of chronic diseases and predict their course, so that treatment can be matched to the symptoms for the benefit of chronic disease patients.

Therefore, the invention provides a chronic disease data analysis system based on natural language processing and integrated training, which comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein:

The data preprocessing module is used for extracting the chronic disease data from an external chronic disease database and generating the corresponding word vector W = (w₁, w₂, w₃, …, w_m), where each item w_i of the word vector W corresponds to one character item in the chronic disease data, and for quantizing the word vector W to obtain its dense representation for use as a training sample, wherein: the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

The data recognition module is used for inputting the word vectors of the training samples into the bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, passing the hidden vectors to the conditional probability field to calculate the character label of each piece of chronic disease data, and storing the resulting character labels under three major categories: symptom, pathology and treatment;

The data recognition module comprises six vectors: a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d. The initial value of each of these vectors at t=0 is 0, and the time-dimension information of the six vectors comes from the time of occurrence of the word vectors of the training samples. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, and the hidden vectors H = (h₁, h₂, h₃, …, h_m) of the m items of chronic disease data are obtained. Hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:

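$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$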

Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise (dot) product;

The data training module is used for receiving the character labels of the chronic disease data, inputting the character labels into the integrated learning network for classification training, and extracting an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment;

The data visualization module is used for receiving the ternary association model, performing statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then passing the target relationship model to an external user interface module, where it is presented as a statistical chart.
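The TF-IDF analysis performed by the data visualization module can be illustrated with a short Python sketch. It assumes that each record has already been reduced to a list of (symptom, index, treatment) triples; the triple values and the function name `tf_idf` are invented for illustration, and the weighting tf × ln(N/df) is one common TF-IDF variant, not necessarily the one used by the invention.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Score every term of every document with tf * ln(N / df).

    documents: list of lists of hashable terms (here: association triples).
    Returns one dict per document, mapping term -> TF-IDF score.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical triples; the names carry no medical meaning.
triple_a = ("symptom_a", "index_a", "treatment_a")
triple_b = ("symptom_b", "index_b", "treatment_b")
records = [
    [triple_a, triple_a, triple_b],  # record 0
    [triple_b],                      # record 1
]
scores = tf_idf(records)
```

In this toy input, `triple_b` occurs in every record and therefore scores zero, while `triple_a` is specific to record 0 and receives a positive score. Scores of this kind could then be ranked or plotted as the statistical chart shown in the user interface.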

The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, which comprises the following steps:

S1, data preprocessing, wherein the data preprocessing module extracts m items of chronic disease data from an external chronic disease database and generates the corresponding word vector W = (w₁, w₂, w₃, …, w_m), each item w_i of the word vector W corresponding to one character item in the chronic disease data, and quantizes the word vector W to obtain its dense representation for use as a training sample, wherein: the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

S2, data recognition, wherein a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in the data recognition module, and the word vectors of the training samples are input into the bidirectional long short-term memory network for training to obtain the hidden vectors H = (h₁, h₂, h₃, …, h_m) of the m items of chronic disease data;

Hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:

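$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$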

Wherein: W, U and B are respectively the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise (dot) product; the initial value of each of the vectors h, i, f, o, c and d at t=0 is 0, and the time-dimension information of the six vectors comes from the time of occurrence of the word vectors of the training samples;

S3, label calculation, wherein the data recognition module passes the hidden vectors to the conditional probability field to calculate the character label of each piece of chronic disease data, the labels being recorded as M = {(l₁, p₁, q₁), (l₂, p₂, q₂), …, (l_m, p_m, q_m)} and distinguishing three major categories: symptoms, pathology and treatment;

The conditional probability field is used for calculating the conditional probability between two given sequences in the hidden vectors, and for extracting the chronic disease label l, the occurrence condition p and the conditional probability q of those hidden vectors whose conditional probability is larger than a given threshold; specifically, the data recognition module counts, over all training samples, the transition conditional probabilities from tag class i to tag class j across consecutive time steps;

S4, data training, wherein the data training module receives the chronic disease data character labels, inputs them into the integrated learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment;

S5, data visualization, wherein the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module to be presented as a statistical chart;

S6, presenting the statistical analysis result in the user interface as a statistical chart.
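Step S3 counts transition conditional probabilities between tag classes over consecutive time steps and keeps those above a threshold. A minimal Python sketch of that counting is given below, assuming each training sample is already a sequence of tag classes. The function names, the example sequences and the 0.5 threshold are illustrative assumptions, and a count-based estimate of this kind is only one ingredient of a full conditional-probability-field model.

```python
from collections import Counter


def transition_probabilities(tag_sequences):
    """Estimate P(next tag = j | current tag = i) from consecutive time steps."""
    pair_counts = Counter()   # how often tag i is followed by tag j
    from_counts = Counter()   # how often tag i is followed by anything
    for tags in tag_sequences:
        for current, following in zip(tags, tags[1:]):
            pair_counts[(current, following)] += 1
            from_counts[current] += 1
    return {
        (i, j): count / from_counts[i]
        for (i, j), count in pair_counts.items()
    }


def filter_by_threshold(probabilities, threshold):
    """Keep only the transitions whose probability exceeds the threshold."""
    return {pair: q for pair, q in probabilities.items() if q > threshold}


# Two hypothetical label sequences over the three tag classes.
sequences = [
    ["symptom", "symptom", "pathology", "treatment"],
    ["symptom", "pathology", "pathology", "treatment"],
]
probabilities = transition_probabilities(sequences)
frequent = filter_by_threshold(probabilities, threshold=0.5)
```

With these two sequences, the transitions symptom to pathology and pathology to treatment each have an estimated probability of 2/3 and are the only ones that pass the 0.5 threshold.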

Further, step S4 of the chronic disease data analysis method based on natural language processing and integrated training comprises the following sub-steps:

S401, stratifying the model M = {(l₁, p₁, q₁), (l₂, p₂, q₂), …, (l_m, p_m, q_m)} representing the chronic disease data character labels into k sets D₁, D₂, …, D_k of similar size, and taking 50% of them as the test set M_c and 50% as the training set M_t;

S402, on the training set, defining an integrated learning network and setting a plurality of different primary learning algorithms;

training the primary learning algorithms on the initial training set M_t by k-fold cross-validation to obtain a plurality of different primary learners;

processing the initial training set M_t with the primary learners to obtain a secondary data set M_v;

taking multi-response linear regression as the meta-learning algorithm, and generating the meta learner with the best prediction performance from the secondary data set M_v;

S403, classifying the chronic disease data model M = {(l₁, p₁, q₁), (l₂, p₂, q₂), …, (l_m, p_m, q_m)} with the optimal meta learner, and extracting from it an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment.
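The sub-steps S401 to S403 resemble what is commonly called stacking. The following Python sketch shows one way such a scheme could be arranged, under several assumptions that are not in the source: toy two-dimensional numeric features stand in for the label triples, the two primary learners (nearest centroid and one nearest neighbour) are arbitrary choices, the 50/50 split is done before the k folds are formed, and the multi-response linear regression is fitted with a small ridge term for numerical stability.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def stratified_half_split(samples, rng):
    """S401 (simplified): per class, half the samples for training, half for test."""
    by_class = defaultdict(list)
    for x, y in samples:
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        half = len(items) // 2
        train += items[:half]
        test += items[half:]
    return train, test


def fit_centroid(data):
    """Primary learner 1: predict the class with the nearest centroid."""
    groups = defaultdict(list)
    for x, y in data:
        groups[y].append(x)
    cents = {y: [sum(col) / len(xs) for col in zip(*xs)] for y, xs in groups.items()}
    return lambda x: min(cents, key=lambda y: sq_dist(x, cents[y]))


def fit_1nn(data):
    """Primary learner 2: predict the class of the single nearest neighbour."""
    return lambda x: min(data, key=lambda s: sq_dist(x, s[0]))[1]


def one_hot(label, n_classes):
    return [1.0 if c == label else 0.0 for c in range(n_classes)]


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_mlr(features, labels, n_classes, ridge=1e-3):
    """Meta learner: one linear regression per class, predict the largest response."""
    X = [f + [1.0] for f in features]  # append a bias term
    d = len(X[0])
    gram = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    weights = []
    for c in range(n_classes):
        rhs = [sum(r[i] * (1.0 if y == c else 0.0) for r, y in zip(X, labels))
               for i in range(d)]
        weights.append(solve(gram, rhs))
    return lambda f: max(range(n_classes),
                         key=lambda c: sum(w * v for w, v in zip(weights[c], f + [1.0])))


def train_stacking(train, n_classes, k, fitters):
    """S402: k-fold primary predictions form the secondary set for the meta learner."""
    folds = [train[i::k] for i in range(k)]
    meta_x, meta_y = [], []
    for i, held_out in enumerate(folds):
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        learners = [fit(rest) for fit in fitters]
        for x, y in held_out:
            meta_x.append([v for l in learners for v in one_hot(l(x), n_classes)])
            meta_y.append(y)
    meta = fit_mlr(meta_x, meta_y, n_classes)
    final = [fit(train) for fit in fitters]  # refit primaries on all of M_t
    return lambda x: meta([v for l in final for v in one_hot(l(x), n_classes)])


# Synthetic, well-separated clusters standing in for the three label classes.
rng = random.Random(0)
centers = {0: (0.0, 0.0), 1: (5.0, 5.0), 2: (0.0, 5.0)}
samples = [([cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)], y)
           for y, (cx, cy) in centers.items() for _ in range(20)]
train, test = stratified_half_split(samples, rng)
classify = train_stacking(train, n_classes=3, k=5, fitters=[fit_centroid, fit_1nn])
accuracy = sum(classify(x) == y for x, y in test) / len(test)
```

The held-out predictions gathered in `meta_x` play the role of the secondary data set M_v, and `classify` corresponds to applying the meta learner in S403. Because the primary learners never see a fold before predicting it, the meta learner is trained on predictions that reflect out-of-sample behaviour.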

The invention has the following beneficial effects:

Firstly, the data preprocessing stage generates the word vector W = (w₁, w₂, w₃, …, w_m) from unstructured data in the external chronic disease database, each item w_i of the word vector W corresponding to one character in the chronic disease data, and quantizes the word vector W to obtain its dense representation for use as a training sample; this ensures that the large amount of electronic medical data generated during the diagnosis and treatment of chronic diseases, both structured and unstructured, can be represented effectively and processed further;

Secondly, in the label calculation stage of the data recognition module, the character label of each chronic disease data sample is calculated automatically from the hidden vectors obtained in the previous stage by means of the conditional probability field, and the three major categories of symptoms, pathology and treatment are distinguished; this makes it easier for the neural network to learn the internal connections among the three categories, addresses the complex cause analysis and condition prediction associated with chronic diseases, and allows treatment to be matched to the symptoms for the benefit of patients;

Finally, the invention also outputs a trained chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment, from which statistical results in a variety of forms, such as statistical charts, can readily be drawn in the user interface and presented to medical staff and decision makers.

Drawings

Figure 1 is a schematic diagram of the composition and structure of the chronic disease data analysis system based on natural language processing and integrated training;

Figure 2 is a flow chart of the chronic disease data analysis method based on natural language processing and integrated training;

Figure 3 is a flow chart of step S4 of the chronic disease data analysis method based on natural language processing and integrated training.

Detailed Description

The present invention will be further described with reference to the drawings and examples, which are only for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.

According to Baidu Baike, a long short-term memory network (LSTM, Long Short-Term Memory) is a recurrent neural network specifically designed to solve the long-term dependence problem of the general RNN (recurrent neural network). The LSTM paper was first published in 1997. Owing to its unique structure, LSTM is suited to processing and predicting important events separated by very long intervals and delays in a time series. A bidirectional long short-term memory network (BiLSTM) is equivalent to replacing the ordinary RNN units in a bidirectional recurrent neural network (BiRNN) with LSTM units, and its structure comprises at least an input gate, a forget gate and an output gate.
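To make the gate structure concrete, the following Python sketch implements one recurrence step following equations (1) to (6) of this document and runs it in both directions over a sequence. It uses only the standard library; the parameter layout (`W_i`, `U_i`, `B_i` and so on stored in a dictionary), the random initialisation and the choice to concatenate the forward and backward hidden vectors are assumptions made for illustration, and no training is shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of equations (1)-(6)."""
    i_t = [sigmoid(v) for v in affine(p["W_i"], x_t, p["U_i"], h_prev, p["B_i"])]    # (1)
    f_t = [sigmoid(v) for v in affine(p["W_f"], x_t, p["U_f"], h_prev, p["B_f"])]    # (2)
    d_t = [math.tanh(v) for v in affine(p["W_c"], x_t, p["U_c"], h_prev, p["B_c"])]  # (3)
    c_t = [f * c + i * d for f, c, i, d in zip(f_t, c_prev, i_t, d_t)]               # (4)
    o_t = [sigmoid(v) for v in affine(p["W_o"], x_t, p["U_o"], h_prev, p["B_o"])]    # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                               # (6)
    return h_t, c_t


def run_lstm(xs, params, hidden_size):
    """Run the recurrence over a sequence, starting from all-zero h and c."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return outputs


def run_bilstm(xs, forward_params, backward_params, hidden_size):
    """Concatenate the forward pass with a backward pass over the reversed input."""
    forward = run_lstm(xs, forward_params, hidden_size)
    backward = run_lstm(xs[::-1], backward_params, hidden_size)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def init_params(input_size, hidden_size, rng):
    """Small random weights and zero biases for the four gate equations."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in "ifco":
        params["W_" + gate] = matrix(hidden_size, input_size)
        params["U_" + gate] = matrix(hidden_size, hidden_size)
        params["B_" + gate] = [0.0] * hidden_size
    return params


rng = random.Random(0)
forward_params = init_params(3, 2, rng)
backward_params = init_params(3, 2, rng)
inputs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5]]  # two made-up dense input vectors
hidden = run_bilstm(inputs, forward_params, backward_params, hidden_size=2)
```

Each element of `hidden` holds the forward hidden vector followed by the backward one, so its length is twice `hidden_size`. Since every component of h_t is a sigmoid value multiplied by a tanh value, it always lies strictly between -1 and 1.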

The invention relates to a chronic disease data analysis system based on natural language processing and integrated training, whose composition and structure are shown at reference numeral 3 in Figure 1 and which comprises a data preprocessing module 301, a data recognition module 302, a data training module 303 and a data visualization module 304, wherein:

The data preprocessing module 301 is configured to extract m items of chronic disease data from the external chronic disease database 1 and generate the corresponding word vector W = (w₁, w₂, w₃, …, w_m), where each item w_i of the word vector W corresponds to one character item in the chronic disease data, and to quantize the word vector W to obtain its dense representation for use as a training sample, wherein: the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

The data recognition module 302 is configured to input the word vectors of the training samples into the bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and to pass the hidden vectors to the conditional probability field to calculate the character label of each piece of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment. The data recognition module 302 includes a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d. The initial value of each of the vectors h, i, f, o, c and d at t=0 is 0, and the time-dimension information of the six vectors comes from the time of occurrence of the word vectors of the training samples. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, and the hidden vectors H = (h₁, h₂, h₃, …, h_m) of the m items of chronic disease data are obtained. Hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:

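$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$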

The data training module 303 is configured to receive the character labels of the chronic disease data, input them into the integrated learning network for classification training, and extract an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment;

The data visualization module 304 is configured to receive the ternary association model, perform statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and pass the target relationship model to the external user interface 2 to be presented as a statistical chart.
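As an illustration of the preprocessing performed by module 301, the following Python sketch maps each character item of a record to an index and then to a dense vector. The function names, the example strings, the embedding dimension of 4 and the random initial embedding values are all assumptions for illustration; the source does not specify how the dense representation is obtained.

```python
import random


def build_vocab(records):
    """Assign an integer index to every non-space character; 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for text in records:
        for ch in text:
            if not ch.isspace():
                vocab.setdefault(ch, len(vocab))
    return vocab


def init_embeddings(vocab_size, dim, seed=0):
    """One small random dense vector per vocabulary entry (untrained)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def to_dense(text, vocab, embeddings):
    """Turn a record into the sequence of dense vectors of its character items."""
    indices = [vocab.get(ch, 0) for ch in text if not ch.isspace()]
    return [embeddings[i] for i in indices]


# Made-up descriptive text standing in for records from the external database.
records = ["fatigue and thirst", "raised blood glucose", "insulin therapy"]
vocab = build_vocab(records)
embeddings = init_embeddings(len(vocab), dim=4)
sample = to_dense("thirst", vocab, embeddings)
```

Here `sample` is a sequence of six four-dimensional vectors, one per character of the input, and repeated characters share the same vector. A sequence of this shape is what a recurrent network such as the one in module 302 would consume.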

The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, the flow of which is shown in Figure 2 and which comprises the following steps:

S1, data preprocessing, wherein the data preprocessing module extracts chronic disease data from an external chronic disease database and generates the corresponding word vector W = (w₁, w₂, w₃, …, w_m), each item w_i of the word vector W corresponding to one character item in the chronic disease data, and quantizes the word vector W to obtain its dense representation for use as a training sample, wherein: the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

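$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$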

The conditional probability field is used for calculating the conditional probability between two given sequences corresponding to the hidden vectors, and for extracting the chronic disease label l, the occurrence condition p and the conditional probability q of those hidden vectors whose conditional probability is larger than a given threshold; specifically, the data recognition module counts, over all training samples, the transition conditional probabilities from tag class i to tag class j across consecutive time steps;

Wherein:

The detailed flow of step S4 is shown in FIG. 3, and the method comprises the following sub-steps:

The embodiments disclosed above are preferred embodiments of the present invention, but the invention is not limited to them; those skilled in the art will readily appreciate from the foregoing description that various extensions and modifications can be made without departing from the spirit of the present invention.

Claims

1. A chronic disease data analysis system based on natural language processing and integrated training, characterized by comprising a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein:

The data preprocessing module is used for extracting m items of chronic disease data from an external chronic disease database and generating the corresponding word vector W = (w₁, w₂, w₃, …, w_m), each item w_i of the word vector W corresponding to one character item in the chronic disease data, and for quantizing the word vector W to obtain its dense representation for use as a training sample; the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

The data recognition module comprises six vectors: a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d, wherein the initial value of each of these vectors at t=0 is 0, the time-dimension information of the six vectors comes from the time of occurrence of the word vectors of the training samples, the word vectors of the training samples are input into the bidirectional long short-term memory network for training, and the hidden vectors H = (h₁, h₂, h₃, …, h_m) of the m items of chronic disease data are obtained; hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:

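$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$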

The data training module is used for receiving the character labels of the chronic disease data, inputting the character labels into the integrated learning network for classification training, and extracting an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment;

2. A chronic disease data analysis method based on natural language processing and integrated training, characterized by comprising the following steps:

S1, data preprocessing, wherein the data preprocessing module extracts m items of chronic disease data from an external chronic disease database and generates the corresponding word vector W = (w₁, w₂, w₃, …, w_m), each item w_i of the word vector W corresponding to one character item in the chronic disease data, and quantizes the word vector W to obtain its dense representation for use as a training sample, wherein: the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of a chronic disease;

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$

$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

Wherein: W, U and B are respectively the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise (dot) product; the initial value of each of the vectors h, i, f, o, c and d at t=0 is 0, and the time-dimension information of the six vectors comes from the time of occurrence of the word vectors of the training samples;

The conditional probability field is used for calculating the conditional probability between two given sequences in the hidden vectors, and for extracting the chronic disease label l, the occurrence condition p and the conditional probability q of those hidden vectors whose conditional probability is larger than a given threshold; specifically, the data recognition module counts, over all training samples, the transition conditional probabilities from tag class i to tag class j across consecutive time steps;

3. The chronic disease data analysis method based on natural language processing and integrated training as claimed in claim 2, wherein S4 comprises the following sub-steps:

S401, stratifying the model M = {(l₁, p₁, q₁), (l₂, p₂, q₂), …, (l_m, p_m, q_m)} representing the chronic disease data character labels into k sets D₁, D₂, …, D_k of similar size, and taking 50% of them as the test set M_c and 50% as the training set M_t;

S403, classifying the chronic disease data model M = {(l₁, p₁, q₁), (l₂, p₂, q₂), …, (l_m, p_m, q_m)} with the optimal meta learner, and extracting from it an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatment.