CN112287665B - Chronic disease data analysis method and system based on natural language processing and integrated training - Google Patents
- Publication number
- CN112287665B (application CN202011116445.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- training
- slow disease
- module
- vectors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Medical Treatment And Welfare Office Work (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
The invention discloses a chronic disease data analysis method and system based on natural language processing and integrated (ensemble) training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module. The data preprocessing module extracts chronic disease data from an external chronic disease database, generates the corresponding word vectors, and vectorizes them for use as training samples. The data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory network for training, obtains hidden vectors, and passes the hidden vectors to a conditional random field (CRF) to compute character labels. The data training module performs classification training to extract a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatments, and passes it to the data visualization module for statistical analysis and then to an external user interface module for presentation. The invention thereby makes full use of large amounts of unstructured electronic medical data, combining natural language processing with an ensemble-trained neural network, to analyse the causes of chronic diseases and predict their course, so that treatment can be matched to the patient's condition.
Description
Technical Field
The invention relates to the technical field of chronic disease data analysis, in particular to a chronic disease data analysis method and system based on natural language processing and integrated training.
Background
Chronic non-communicable disease (chronic disease) is a public health problem. Its etiology is complex and unclear, many factors influence treatment outcome, and it is difficult to cure. According to research by the World Health Organization, about 60% of the causes of chronic disease depend on individual lifestyle, with heredity, medical conditions, social conditions, climate and other factors also playing a part. Within lifestyle, an unreasonable diet, insufficient physical activity, tobacco use and harmful alcohol use are the four major risk factors for chronic disease.
The diagnosis and treatment of chronic diseases produce large amounts of electronic medical data, most of it unstructured text, which makes such data difficult to analyse. To address this challenge, natural language processing techniques have been used in China and abroad to identify unstructured items in electronic medical data, such as medical concepts and patient symptom descriptions. Three kinds of method are currently used for symptom identification: dictionary- or rule-based methods, machine learning-based methods, and deep learning-based methods. The deep learning-based methods have addressed the symptom recognition problem, reaching an average F-value of 92.31% in trials on a large number of samples.
However, current electronic medical data analysis platforms cannot effectively use large amounts of electronic medical data to analyse the causes of chronic diseases and predict their course, so they cannot be applied to matching treatment to the condition of chronic disease patients.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a chronic disease data analysis method and system based on natural language processing and integrated training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein: the data preprocessing module extracts chronic disease data from an external chronic disease database, generates the corresponding word vectors, and vectorizes them for use as training samples; the data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and passes the hidden vectors to a conditional random field (CRF) to compute a character label for each item of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment; the data training module receives the character labels of the chronic disease data, inputs them into an ensemble learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment; the data visualization module performs statistical analysis on the ternary association model using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module to be presented as statistical charts.
These measures achieve the technical purpose of making full use of large amounts of unstructured electronic medical data, combining natural language processing with ensemble-trained neural network technology, to analyse the causes of chronic diseases and predict their course, so that treatment can be matched to the condition and chronic disease patients are better served.
Accordingly, the invention provides a chronic disease data analysis system based on natural language processing and integrated training, comprising a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein:
The data preprocessing module extracts chronic disease data from an external chronic disease database and generates the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then vectorized into a dense representation and used as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathological indexes and treatment of chronic diseases;
The data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, passes the hidden vectors to a conditional random field (CRF) to compute a character label for each item of chronic disease data, and stores the resulting character labels in three major categories: symptoms, pathology and treatment;
The data recognition module comprises six vectors: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d. Each of these vectors has an initial value of 0 at t = 0, and the time-dimension information of the six vectors comes from the time at which each word vector of the training samples occurs. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, giving the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data. Hereafter the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
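$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$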
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product;
The data training module receives the character labels of the chronic disease data, inputs them into an ensemble learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
The data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module to be presented as statistical charts.
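The recursion in equations (1) to (6) above can be sketched for a single direction and a single time step as follows. This is a minimal illustration in plain Python, not the patented implementation: the names `lstm_step`, `pre_activation` and `random_params`, the layer sizes, and the random initial weights are all assumptions made for the example. Following equation (3), the candidate `d_t` uses the weights indexed by `c`.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def pre_activation(params, gate, x_t, h_prev):
    """Compute W x_t + U h_{t-1} + B for the named gate."""
    W, U, B = params[gate]
    return [a + b + c for a, b, c in zip(matvec(W, x_t), matvec(U, h_prev), B)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def lstm_step(x_t, h_prev, c_prev, params):
    """Apply equations (1)-(6) once and return (h_t, c_t)."""
    i_t = sigmoid(pre_activation(params, "i", x_t, h_prev))                 # (1)
    f_t = sigmoid(pre_activation(params, "f", x_t, h_prev))                 # (2)
    d_t = [math.tanh(a) for a in pre_activation(params, "c", x_t, h_prev)]  # (3)
    c_t = [f * c + i * d for f, c, i, d in zip(f_t, c_prev, i_t, d_t)]      # (4)
    o_t = sigmoid(pre_activation(params, "o", x_t, h_prev))                 # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                      # (6)
    return h_t, c_t

def random_params(n_in, n_hidden, seed=0):
    """Hypothetical initialiser: small random W, U and zero B for gates i, f, c, o."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
    return {g: (mat(n_hidden, n_in), mat(n_hidden, n_hidden), [0.0] * n_hidden) for g in "ifco"}

params = random_params(n_in=3, n_hidden=2)
h, c = [0.0, 0.0], [0.0, 0.0]            # h and c start at 0 when t = 0
hidden_vectors = []
for x_t in [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]:
    h, c = lstm_step(x_t, h, c, params)
    hidden_vectors.append(h)             # collects H = (h_1, ..., h_m)
```

The loop feeds three toy input vectors in order, so `hidden_vectors` plays the role of H for m = 3. Training the weights is outside the scope of this sketch.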
The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, comprising the following steps:
S1, data preprocessing: the data preprocessing module extracts m items of chronic disease data from an external chronic disease database and generates the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then vectorized into a dense representation and used as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathological indexes and treatment of chronic diseases;
S2, data recognition: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in the data recognition module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
Hereafter the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
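$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$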
Wherein: W, U and B are respectively the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product; each of the vectors h, i, f, o, c and d has an initial value of 0 at t = 0, and the time-dimension information of the six vectors comes from the time at which each word vector of the training samples occurs;
S3, label calculation: the data recognition module passes the hidden vectors to a conditional random field (CRF) to compute a character label for each item of chronic disease data, recorded as $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$; the labels distinguish three major categories: symptoms, pathology and treatment;
The CRF computes the conditional probability between two given sequences in the hidden vectors and, for each hidden vector whose conditional probability exceeds a given threshold, extracts the chronic disease label l, the occurrence condition p and the conditional probability q; specifically, the data recognition module counts, over all training samples, the conditional probabilities of transitions from label class i to label class j across consecutive time steps;
S4, data training: the data training module receives the character labels of the chronic disease data, inputs them into an ensemble learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module to be presented as statistical charts;
S6, the statistical analysis results are presented in the user interface as statistical charts.
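The transition statistics mentioned in step S3 can be illustrated with a simple count-based estimate. The sketch below is not a full conditional random field: it only counts how often label class j follows label class i in consecutive time steps and keeps the entries above a threshold. The function names, the toy label sequences, and the mapping of each kept entry to a triple (l, p, q), with l as the following label and p as the preceding label, are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def transition_probabilities(label_sequences):
    """Estimate P(next label = j | current label = i) from consecutive label pairs."""
    pair_counts = defaultdict(Counter)
    for seq in label_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            pair_counts[cur][nxt] += 1
    probs = {}
    for cur, counter in pair_counts.items():
        total = sum(counter.values())
        probs[cur] = {nxt: n / total for nxt, n in counter.items()}
    return probs

def above_threshold(probs, threshold):
    """Keep (label, condition, probability) triples whose probability exceeds the threshold."""
    return [(nxt, cur, q)
            for cur, row in probs.items()
            for nxt, q in row.items()
            if q > threshold]

# Toy label sequences, invented for the example.
sequences = [
    ["symptom", "symptom", "pathology", "treatment"],
    ["symptom", "pathology", "pathology", "treatment"],
]
probs = transition_probabilities(sequences)
M = above_threshold(probs, threshold=0.5)
```

With these two toy sequences, "pathology" follows "symptom" in two of three transitions and "treatment" follows "pathology" in two of three, so both triples pass a threshold of 0.5.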
Further, step S4 of the chronic disease data analysis method based on natural language processing and integrated training comprises the following sub-steps:
S401: the model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ representing the character labels of the chronic disease data is divided by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size; 50% of the sets are taken as the test set $M_c$ and 50% as the training set $M_t$;
S402: on the training set, an ensemble learning network is defined and several different primary learning algorithms are set up;
The primary learning algorithms are trained on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
The primary learners are applied to the initial training set $M_t$ to obtain a secondary data set $M_v$;
Multi-response linear regression is used as the meta-learning algorithm, and a meta-learner with the best prediction performance is generated from the secondary data set $M_v$;
S403: the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ is classified by the optimal meta-learner, and an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment, is extracted from it.
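Sub-steps S402 and S403 describe what is commonly called stacking: primary learners produce out-of-fold predictions, which form a secondary data set on which a multi-response linear regression meta-learner is fitted. The sketch below illustrates that flow under several assumptions: the patent does not name the primary algorithms, so a nearest-centroid and a 1-nearest-neighbour classifier stand in for them; the 2-D toy features are invented; and the per-class least-squares fit uses plain gradient descent. All function names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid_learner(X, y):
    """Stand-in primary learner 1: predict the class with the closest mean vector."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    cents = {c: [sum(col) / len(rows) for col in zip(*rows)] for c, rows in groups.items()}
    return lambda x: min(cents, key=lambda c: sq_dist(x, cents[c]))

def nearest_neighbour_learner(X, y):
    """Stand-in primary learner 2: predict the class of the closest training point."""
    return lambda x: min(zip(X, y), key=lambda p: sq_dist(x, p[0]))[1]

def one_hot(preds, classes):
    """Encode the primary predictions as indicator features plus a bias term."""
    return [1.0 if p == c else 0.0 for p in preds for c in classes] + [1.0]

def out_of_fold_features(X, y, learners, classes, k):
    """Secondary data set: each sample is predicted by models that never saw it."""
    meta = [None] * len(X)
    for fold in range(k):
        train = [j for j in range(len(X)) if j % k != fold]
        models = [fit([X[j] for j in train], [y[j] for j in train]) for fit in learners]
        for j in range(fold, len(X), k):
            meta[j] = one_hot([m(X[j]) for m in models], classes)
    return meta

def fit_mlr(Z, y, classes, lr=0.1, epochs=500):
    """Multi-response linear regression: one least-squares model per class indicator."""
    W = {c: [0.0] * len(Z[0]) for c in classes}
    for c in classes:
        for _ in range(epochs):
            grad = [0.0] * len(Z[0])
            for z, yi in zip(Z, y):
                err = sum(w * v for w, v in zip(W[c], z)) - (1.0 if yi == c else 0.0)
                for d, v in enumerate(z):
                    grad[d] += err * v / len(Z)
            W[c] = [w - lr * g for w, g in zip(W[c], grad)]
    return W

def stacked_predict(x, models, W, classes):
    """Classify x with the meta-learner applied to the primary predictions."""
    z = one_hot([m(x) for m in models], classes)
    return max(classes, key=lambda c: sum(w * v for w, v in zip(W[c], z)))

random.seed(0)
classes = ["symptom", "pathology", "treatment"]
centres = {"symptom": (0.0, 0.0), "pathology": (5.0, 0.0), "treatment": (0.0, 5.0)}
X, y = [], []
for _ in range(10):                       # toy, well-separated 2-D features
    for c in classes:
        X.append([centres[c][0] + random.uniform(-1, 1), centres[c][1] + random.uniform(-1, 1)])
        y.append(c)
X_t, y_t, X_c, y_c = X[0::2], y[0::2], X[1::2], y[1::2]   # 50% training, 50% test

learners = [centroid_learner, nearest_neighbour_learner]
M_v = out_of_fold_features(X_t, y_t, learners, classes, k=5)
W = fit_mlr(M_v, y_t, classes)
models = [fit(X_t, y_t) for fit in learners]
accuracy = sum(stacked_predict(x, models, W, classes) == t for x, t in zip(X_c, y_c)) / len(y_c)
```

Because the toy clusters are far apart, both stand-in learners classify every held-out point correctly, so the example shows the data flow rather than any benefit of stacking. On real label triples the features and primary learners would need to be chosen to suit the data.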
The invention has the following beneficial effects:
Firstly, the data preprocessing stage generates a word vector $W = (w_1, w_2, w_3, \ldots, w_m)$ from unstructured data in an external chronic disease database, where each item $w_i$ of $W$ corresponds to one character in the chronic disease data, and vectorizes $W$ into a dense representation for use as a training sample; this ensures that the large amounts of structured and unstructured electronic medical data generated during the diagnosis and treatment of chronic diseases can be represented effectively and processed in later stages;
Secondly, in the label calculation stage of the data recognition module, the character label of each chronic disease data sample is computed automatically from the hidden vectors obtained in the previous stage using conditional random field theory, distinguishing the three major classes of symptoms, pathology and treatment; this makes it easier for the neural network to learn the internal connections among the three classes, effectively addresses the complex cause analysis and course prediction involved in chronic diseases, and allows treatment to be matched to the condition, which benefits patients;
Finally, the invention also outputs a trained chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment, from which statistical results in a variety of forms, such as statistical charts, can readily be drawn in the user interface and presented to medical staff and decision makers.
Drawings
Figure 1 is a schematic diagram of the composition and structure of the chronic disease data analysis system based on natural language processing and integrated training;
Figure 2 is a flowchart of the chronic disease data analysis method based on natural language processing and integrated training;
Figure 3 is a flowchart of step S4 of the chronic disease data analysis method based on natural language processing and integrated training.
Detailed Description
The present invention will be further described with reference to the drawings and examples, which are only for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.
According to Baidu Baike, the long short-term memory network (LSTM, Long Short-Term Memory) is a recurrent neural network over time, designed specifically to solve the long-term dependence problem of the general RNN (recurrent neural network). The LSTM paper was first published in 1997, and because of its distinctive structure LSTM is well suited to processing and predicting important events separated by very long intervals and delays in a time series. The bidirectional long short-term memory network (BiLSTM) is equivalent to replacing the ordinary RNN unit in a bidirectional recurrent neural network (BiRNN) with an LSTM unit, whose structure comprises at least an input gate, a forget gate and an output gate.
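The bidirectional arrangement described above, one recurrent pass from left to right and another from right to left with the two states joined at each position, can be sketched independently of the cell type. In the example below the cell is passed in as a function; `toy_cell` is a one-unit tanh recurrence used as a stand-in, not an LSTM unit, and all names are assumptions made for illustration.

```python
import math

def run(inputs, step, h0):
    """Feed inputs through a recurrent cell and collect the state after each one."""
    h, states = h0, []
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(inputs, forward_step, backward_step, h0):
    """Concatenate, position by position, a left-to-right and a right-to-left pass."""
    forward = run(inputs, forward_step, h0)
    backward = run(inputs[::-1], backward_step, h0)[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Stand-in cell (an assumption, not an LSTM): a one-unit tanh recurrence.
toy_cell = lambda x, h: [math.tanh(0.5 * h[0] + x[0])]
states = bidirectional([[0.2], [-0.1], [0.7]], toy_cell, toy_cell, [0.0])
```

Each entry of `states` holds the forward state followed by the backward state for that position, so every position sees context from both sides of the sequence.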
The chronic disease data analysis system based on natural language processing and integrated training according to the invention has the structure shown as item 3 in Figure 1 and comprises a data preprocessing module 301, a data recognition module 302, a data training module 303 and a data visualization module 304, wherein:
The data preprocessing module 301 is configured to extract m items of chronic disease data from the external chronic disease database 1 and generate the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then vectorized into a dense representation and used as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathological indexes and treatment of chronic diseases;
The data recognition module 302 is configured to input the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and to pass the hidden vectors to a conditional random field (CRF) to compute a character label for each item of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment. Specifically, the data recognition module 302 includes a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d; each of the vectors h, i, f, o, c and d has an initial value of 0 at t = 0, and the time-dimension information of the six vectors comes from the time at which each word vector of the training samples occurs. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, giving the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data. Hereafter the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
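$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$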
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product;
The data training module 303 is configured to receive the character labels of the chronic disease data, input them into an ensemble learning network for classification training, and extract an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
The data visualization module 304 is configured to receive the ternary association model, perform statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and pass the target relationship model to the external user interface 2 to be presented as statistical charts.
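The TF-IDF weighting used by the data visualization module can be sketched as follows. The text does not specify what plays the role of a term or a document, so the example assumes that each record is a list of association items and that a term is one such item; the item names are placeholders. The sketch uses the common definitions tf = count / record length and idf = log(N / document frequency), which is one of several TF-IDF variants.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (tf = count/length, idf = log(N/df))."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({t: (n / len(doc)) * math.log(n_docs / doc_freq[t])
                       for t, n in counts.items()})
    return scores

# Placeholder records: each inner list stands for the items found in one record.
records = [
    ["item_A", "item_A", "item_B"],
    ["item_A", "item_C"],
    ["item_C", "item_C", "item_C"],
]
weights = tf_idf(records)
# Rank the items of the first record by weight, highest first.
ranked = sorted(weights[0].items(), key=lambda kv: kv[1], reverse=True)
```

Items that are frequent within one record but rare across records receive the highest weights, which is what makes the ranking usable as input to a statistical chart.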
The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, the flow of which is shown in Figure 2, comprising the following steps:
S1, data preprocessing: the data preprocessing module extracts chronic disease data from an external chronic disease database and generates the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then vectorized into a dense representation and used as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathological indexes and treatment of chronic diseases;
S2, data recognition: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in the data recognition module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
Hereafter the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
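$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$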
Wherein: W, U and B are respectively the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product; each of the vectors h, i, f, o, c and d has an initial value of 0 at t = 0, and the time-dimension information of the six vectors comes from the time at which each word vector of the training samples occurs;
S3, label calculation: the data recognition module passes the hidden vectors to a conditional random field (CRF) to compute a character label for each item of chronic disease data, recorded as $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$; the labels distinguish three major categories: symptoms, pathology and treatment;
The CRF computes the conditional probability between two given sequences corresponding to the hidden vectors and, for each hidden vector whose conditional probability exceeds a given threshold, extracts the chronic disease label l, the occurrence condition p and the conditional probability q; specifically, the data recognition module counts, over all training samples, the conditional probabilities of transitions from label class i to label class j across consecutive time steps;
S4, data training: the data training module receives the character labels of the chronic disease data, inputs them into an ensemble learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then passes the target relationship model to an external user interface module to be presented as statistical charts;
S6, the statistical analysis results are presented in the user interface as statistical charts.
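Step S1 above, turning each character of a text record into an item of a word vector and then into a dense representation, can be sketched as a vocabulary lookup followed by an embedding-table lookup. This is an assumption about what the dense representation looks like: the text does not say how it is obtained, so the table here is simply filled with small random numbers, and the function names, dimension and sample strings are invented for the example.

```python
import random

def build_vocab(records):
    """Assign each distinct character an integer index in order of first appearance."""
    vocab = {}
    for text in records:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def make_embeddings(vocab_size, dim, seed=0):
    """Hypothetical embedding table: one small random dense vector per character."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def vectorize(text, vocab, table):
    """Map a record to its sequence of dense vectors, one per character item."""
    return [table[vocab[ch]] for ch in text]

# Invented sample records standing in for descriptive text from the database.
records = ["thirst and fatigue", "fasting glucose raised", "diet control advised"]
vocab = build_vocab(records)
table = make_embeddings(len(vocab), dim=8)
sample = vectorize(records[0], vocab, table)   # one training sample: a list of 8-d vectors
```

In practice the embedding table would be learned or loaded from a pretrained model rather than drawn at random; the sketch only shows the shape of the data handed to the recognition step.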
Wherein:
The detailed flow of step S4 is shown in FIG. 3, and the method comprises the following sub-steps:
S401, the model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ representing the chronic disease data character labels is divided by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size; 50% of the sets are taken as the test set $M_c$ and the other 50% as the training set $M_t$;
S402, on the training set, an integrated (ensemble) learning network is defined and several different primary learning algorithms are set;
each primary learning algorithm is trained on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
the primary learners are applied to the initial training set $M_t$ to obtain a secondary data set $M_v$;
a multi-response linear regression process is taken as the meta learning algorithm, and the meta learner with the best prediction performance is generated from the secondary data set $M_v$;
S403, the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ is classified by the optimal meta learner, and an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment, is extracted from it.
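Sub-steps S401 and S402 can be sketched as below, under stated assumptions. The sketch covers only the stratified division into k sets, the 50/50 split, and the construction of the secondary data set from held-out predictions. The two primary learners (`fit_majority`, `fit_nearest_mean`), the use of q as the only feature, the target categories and all sample values are hypothetical, and fitting the multi-response linear regression meta learner on the secondary data set is left out.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Deal sample indices round-robin, label by label, into k folds so the
    folds have similar size and similar label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        for idx in by_label[label]:
            folds[position % k].append(idx)
            position += 1
    return folds


def fit_majority(rows):
    """Hypothetical primary learner 1: always predict the most frequent label."""
    most_common = Counter(label for _, label in rows).most_common(1)[0][0]
    return lambda q: most_common


def fit_nearest_mean(rows):
    """Hypothetical primary learner 2: predict the label whose mean q is closest."""
    values = defaultdict(list)
    for q, label in rows:
        values[label].append(q)
    means = {label: sum(v) / len(v) for label, v in values.items()}
    return lambda q: min(means, key=lambda label: abs(means[label] - q))


def out_of_fold_predictions(rows, k, fitters):
    """Build the secondary data set: for every sample, the predictions of
    primary learners trained without that sample's fold, plus the true label."""
    folds = stratified_folds([label for _, label in rows], k)
    secondary = [None] * len(rows)
    for held_out in folds:
        held = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held]
        models = [fit(train) for fit in fitters]
        for i in held_out:
            q, label = rows[i]
            secondary[i] = ([model(q) for model in models], label)
    return secondary


# Hypothetical samples: (conditional probability q, target category).
M = [(0.91, "symptom"), (0.88, "symptom"), (0.95, "symptom"), (0.85, "symptom"),
     (0.55, "treatment"), (0.60, "treatment"), (0.52, "treatment"), (0.58, "treatment")]
k = 4
folds = stratified_folds([label for _, label in M], k)
M_c = [M[i] for fold in folds[: k // 2] for i in fold]   # 50% of the sets: test
M_t = [M[i] for fold in folds[k // 2:] for i in fold]    # 50% of the sets: training
M_v = out_of_fold_predictions(M_t, 2, [fit_majority, fit_nearest_mean])
```

Each row of `M_v` pairs the primary learners' held-out predictions with the true label, which is the input a meta learner would be trained on. Producing those predictions out of fold keeps the meta learner from seeing outputs of models that had already memorised the sample.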
The embodiments disclosed above are preferred embodiments and do not limit the invention; those skilled in the art will readily appreciate from the foregoing description that various extensions and modifications can be made without departing from the spirit of the present invention.
Claims (3)
1. A chronic disease data analysis system based on natural language processing and integrated training, characterized by comprising a data preprocessing module, a data identification module, a data training module and a data visualization module, wherein:
The data preprocessing module is used for extracting m items of chronic disease data from an external chronic disease database to generate the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, each item $w_i$ of which corresponds to one character item in the chronic disease data, and for quantizing the word vector W to obtain its dense representation for use as a training sample; the chronic disease data comprise descriptive text on the symptoms, biochemical pathological indexes and treatment of the chronic disease;
The data identification module is used for inputting the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain the hidden vectors of the chronic disease data, for having the conditional probability field receive the hidden vectors and calculate the character label of each item of chronic disease data, and for storing the resulting character labels under three major categories: symptom area, pathology area and treatment area;
the data identification module comprises six vectors: a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d; each of the six vectors has an initial value of 0 at t = 0, and their time dimension information comes from the occurrence time of the word vectors of the training samples; the word vectors of the training samples are input into the bidirectional long short-term memory network for training to obtain the hidden vector $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data; hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise (dot) product;
the data training module is used for receiving the character labels of the chronic disease data, inputting them into the integrated (ensemble) learning network for classification training, and extracting an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
the data visualization module is used for receiving the ternary association model, performing statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then transmitting the target relationship model to an external user interface module for presentation as a statistical chart.
2. A chronic disease data analysis method based on natural language processing and integrated training, characterized by comprising the following steps:
S1, data preprocessing: the data preprocessing module extracts m items of chronic disease data from an external chronic disease database to generate the corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, each item $w_i$ of which corresponds to one character item in the chronic disease data, and quantizes the word vector W to obtain its dense representation for use as a training sample, wherein the chronic disease data are descriptive text on the symptoms, biochemical pathological indexes and treatment of the chronic disease;
S2, data identification: a hidden gate h, an input gate i, a forgetting gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in the data identification module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vector $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
hereafter, the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, $\sigma$ is the sigmoid function, and $\odot$ is the element-wise (dot) product; each of the vectors h, i, f, o, c and d has an initial value of 0 at t = 0, and the time dimension information of the six vectors comes from the occurrence time of the word vectors of the training samples;
S3, label calculation: in the data identification module, the conditional probability field receives the hidden vectors and calculates the character label of each item of chronic disease data; the character labels are denoted $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ and distinguish three major categories: symptoms, pathology and treatment;
The conditional probability field calculates the conditional probability between two given sequences in the hidden vector and, for each hidden vector whose conditional probability exceeds a given threshold, extracts the chronic disease label l, the occurrence condition p and the conditional probability q; specifically, the data identification module counts, over all training samples, the transition conditional probabilities from tag class i to tag class j across consecutive time steps;
S4, data training: the data training module receives the chronic disease data character labels, inputs them into the integrated (ensemble) learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then transmits the target relationship model to an external user interface module for presentation as a statistical chart;
S6, the statistical analysis result is presented in the user interface as a statistical chart.
3. The chronic disease data analysis method based on natural language processing and integrated training as claimed in claim 2, wherein S4 comprises the following sub-steps:
S401, the model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ representing the chronic disease data character labels is divided by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size; 50% of the k sets are taken as the test set $M_c$ and the other 50% as the training set $M_t$;
S402, on the training set, an integrated (ensemble) learning network is defined and several different primary learning algorithms are set;
each primary learning algorithm is trained on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
the primary learners are applied to the initial training set $M_t$ to obtain a secondary data set $M_v$;
a multi-response linear regression process is taken as the meta learning algorithm, and the meta learner with the best prediction performance is generated from the secondary data set $M_v$;
S403, the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ is classified by the optimal meta learner, and an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment, is extracted.
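For readers who want to trace recursion formulas (1) to (6) numerically, the following is a minimal single-direction sketch in plain Python. It is an illustration under assumptions, not the claimed implementation: the helper names (`sigmoid`, `affine`, `lstm_step`, `zeros`), the layer sizes and all weight values are invented for the example, and a bidirectional network would additionally run a second pass over the reversed sequence.

```python
import math


def sigmoid(values):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def affine(W, x, U, h, B):
    """Compute W x + U h + B with list-of-lists matrices."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + b
        for w_row, u_row, b in zip(W, U, B)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One recursion step following formulas (1) to (6)."""
    def gate(name):
        W, U, B = params[name]
        return affine(W, x_t, U, h_prev, B)

    i_t = sigmoid(gate("i"))                                            # (1)
    f_t = sigmoid(gate("f"))                                            # (2)
    d_t = [math.tanh(v) for v in gate("c")]                             # (3)
    c_t = [f * c + i * d for f, c, i, d in zip(f_t, c_prev, i_t, d_t)]  # (4)
    o_t = sigmoid(gate("o"))                                            # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # (6)
    return h_t, c_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# Hypothetical sizes and weights: all zero except the bias B_c.
hidden, inputs = 2, 3
params = {g: (zeros(hidden, inputs), zeros(hidden, hidden), [0.0] * hidden)
          for g in "ifo"}
params["c"] = (zeros(hidden, inputs), zeros(hidden, hidden), [1.0, -1.0])

h, c = [0.0] * hidden, [0.0] * hidden  # initial values are 0 at t = 0
for x_t in [[0.2, 0.1, 0.0], [0.5, 0.4, 0.3]]:
    h, c = lstm_step(x_t, h, c, params)
```

Because every weight matrix is zero here, each sigmoid gate evaluates to 0.5 and $d_t$ equals $\tanh(B_c)$, so the cell state after two steps is $0.75 \tanh(B_c)$. That makes the recursion easy to check by hand.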
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116445.8A CN112287665B (en) | 2020-10-19 | 2020-10-19 | Chronic disease data analysis method and system based on natural language processing and integrated training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116445.8A CN112287665B (en) | 2020-10-19 | 2020-10-19 | Chronic disease data analysis method and system based on natural language processing and integrated training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112287665A (en) | 2021-01-29 |
CN112287665B (en) | 2024-05-03 |
Family
ID=74497464
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116445.8A Active CN112287665B (en) | 2020-10-19 | 2020-10-19 | Chronic disease data analysis method and system based on natural language processing and integrated training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112287665B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118136206A (en) * | 2024-05-07 | 2024-06-04 | 江苏法迈生医学科技有限公司 | Chronic disease prediction method in full course management system based on big data |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107358948A (en) * | 2017-06-27 | 2017-11-17 | 上海交通大学 | Language in-put relevance detection method based on attention model |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | The electronic health record multi-tag classification method with character representation is extracted based on symptom |
CN110060773A (en) * | 2019-04-22 | 2019-07-26 | 东华大学 | Alzheimer's disease progression of the disease forecasting system based on two-way LSTM |
CN110444259A (en) * | 2019-06-06 | 2019-11-12 | 昆明理工大学 | Traditional Chinese medical electronic case history entity relationship extracting method based on entity relationship mark strategy |
CN110569511A (en) * | 2019-09-22 | 2019-12-13 | 河南工业大学 | Electronic medical record feature extraction method based on hybrid neural network |
CN111222340A (en) * | 2020-01-15 | 2020-06-02 | 东华大学 | Breast electronic medical record entity recognition system based on multi-standard active learning |
CN111428036A (en) * | 2020-03-23 | 2020-07-17 | 浙江大学 | Entity relationship mining method based on biomedical literature |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160328526A1 (en) * | 2015-04-07 | 2016-11-10 | Accordion Health, Inc. | Case management system using a medical event forecasting engine |
US9949714B2 (en) * | 2015-07-29 | 2018-04-24 | Htc Corporation | Method, electronic apparatus, and computer readable medium of constructing classifier for disease detection |
US10565305B2 (en) * | 2016-11-18 | 2020-02-18 | Salesforce.Com, Inc. | Adaptive attention model for image captioning |
Non-Patent Citations (2)
Title |
---|
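Application of the BiLSTM-CRF model to named entity recognition in Chinese electronic medical records; 王若佳; 魏思仪; 王继民; 文献与数据学报 (No. 02); full text * |
Medical temporal phrase recognition based on a BLSTM network; 张顺利; 王应军; 姬东鸿; 计算机应用研究 (No. 04); full text * |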
Also Published As
Publication number | Publication date |
---|---|
CN112287665A (en) | 2021-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109669994B (en) | Construction method and system of health knowledge map | |
Zheng et al. | The fusion of deep learning and fuzzy systems: A state-of-the-art survey | |
CN109460473B (en) | Electronic medical record multi-label classification method based on symptom extraction and feature representation | |
CN107516110B (en) | Medical question-answer semantic clustering method based on integrated convolutional coding | |
CN111540468B (en) | ICD automatic coding method and system for visualizing diagnostic reasons | |
CN114564565B (en) | Depth semantic recognition model for public security event analysis and construction method thereof | |
CN113040711B (en) | Cerebral apoplexy incidence risk prediction system, equipment and storage medium | |
CN111382272A (en) | Electronic medical record ICD automatic coding method based on knowledge graph | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN113553440B (en) | Medical entity relationship extraction method based on hierarchical reasoning | |
CN110277167A (en) | The Chronic Non-Communicable Diseases Risk Forecast System of knowledge based map | |
CN109036577A (en) | Diabetic complication analysis method and device | |
Falissard et al. | A deep artificial neural network− based model for prediction of underlying cause of death from death certificates: algorithm development and validation | |
Johnson et al. | Hcpcs2vec: Healthcare procedure embeddings for medicare fraud prediction | |
CN114492444A (en) | Chinese electronic medical case medical entity part-of-speech tagging method | |
CN112287665B (en) | Chronic disease data analysis method and system based on natural language processing and integrated training | |
Marerngsit et al. | A two-stage text-to-emotion depressive disorder screening assistance based on contents from online community | |
CN113360643A (en) | Electronic medical record data quality evaluation method based on short text classification | |
CN110633368A (en) | Deep learning classification method for early colorectal cancer unstructured data | |
Cheng et al. | Combining knowledge extension with convolution neural network for diabetes prediction | |
CN116403706A (en) | Diabetes prediction method integrating knowledge expansion and convolutional neural network | |
CN114582449A (en) | Electronic medical record named entity standardization method and system based on XLNet-BiGRU-CRF model | |
CN116434951A (en) | Disease early warning method, device, electronic equipment, storage medium and program product | |
Kour et al. | Hybrid LSTM-TCN Model for Predicting Depression using Twitter Data | |
Falissard et al. | A deep artificial neural network based model for underlying cause of death prediction from death certificates |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||