
CN112287665B - Chronic disease data analysis method and system based on natural language processing and integrated training - Google Patents

Info

Publication number
CN112287665B
Authority
CN
China
Prior art keywords
data
training
chronic disease
module
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011116445.8A
Other languages
Chinese (zh)
Other versions
CN112287665A (en
Inventor
亓晋
张及棠
孙雁飞
闫文卿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Nanyou Institute Of Information Technovation Co ltd
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Nanyou Institute Of Information Technovation Co ltd
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Nanyou Institute Of Information Technovation Co ltd and Nanjing University of Posts and Telecommunications
Priority to CN202011116445.8A
Publication of CN112287665A
Application granted
Publication of CN112287665B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses a chronic disease data analysis method and system based on natural language processing and integrated training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module. The data preprocessing module extracts chronic disease data from an external chronic disease database, generates corresponding word vectors, and converts them to a dense numerical representation for use as training samples. The data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory (BiLSTM) network for training to obtain hidden vectors, and passes the hidden vectors to a conditional random field to calculate character labels. The data training module performs classification training to extract a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments, and passes this model to the data visualization module for statistical analysis; the result is then sent to an external user interface module for presentation. The invention thereby makes full use of large amounts of unstructured electronic medical data, combining natural language processing with an integrated-training neural network to analyze the causes of chronic diseases and predict their course, so that medication can be matched to the condition.

Description

Chronic disease data analysis method and system based on natural language processing and integrated training
Technical Field
The invention relates to the technical field of chronic disease data analysis, in particular to a chronic disease data analysis method and system based on natural language processing and integrated training.
Background
Chronic non-communicable diseases (chronic diseases) are a public health problem. Their etiology is complex and often unclear, many factors influence treatment efficacy, and they are difficult to cure. According to World Health Organization research, about 60% of the causes of chronic disease depend on an individual's lifestyle, with heredity, medical conditions, social conditions, climate and other factors also playing a part. Among lifestyle factors, unhealthy diet, insufficient physical activity, tobacco use and harmful use of alcohol are the four major risk factors for chronic disease.
The diagnosis and treatment of chronic diseases produce large amounts of electronic medical data, most of it unstructured text, which makes the data difficult to analyze. To address this challenge, natural language processing techniques have been used domestically and internationally to identify information in unstructured electronic medical data, such as medical concepts and patient symptom descriptions. Three kinds of methods are currently used for symptom recognition: dictionary- or rule-based methods, machine learning methods, and deep learning methods. Deep learning methods handle the symptom recognition problem well, with an average F-value of 92.31% reported in large-sample trials.
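For reference, and assuming that the F-value quoted above is the usual balanced F-measure (the source does not define it), the measure combines precision P and recall R, which are in turn computed from the counts of true positives (TP), false positives (FP) and false negatives (FN):

```latex
F = \frac{2PR}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}
```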
However, current electronic medical data analysis platforms cannot effectively use large amounts of electronic medical data to analyze the causes of chronic diseases or predict their course, and therefore cannot support condition-specific medication for patients with chronic diseases.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a chronic disease data analysis method and system based on natural language processing and integrated training. The system comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein: the data preprocessing module extracts chronic disease data from an external chronic disease database, generates corresponding word vectors, and converts the word vectors to a dense numerical representation for use as training samples; the data recognition module inputs the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and passes the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment; the data training module receives the character labels, inputs them into an integrated (ensemble) learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments; the data visualization module performs statistical analysis on the ternary association model using the TF-IDF algorithm to obtain the required target relationship model, and then sends the target relationship model to an external user interface module, where it is presented as statistical charts.
These measures achieve the technical aim of making full use of large amounts of unstructured electronic medical data, combining natural language processing with integrated-training neural network technology to analyze the causes of chronic diseases and predict their course, so that medication can be matched to the condition for the benefit of chronic disease patients.
Accordingly, the invention provides a chronic disease data analysis system based on natural language processing and integrated training, which comprises a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein:
The data preprocessing module is used for extracting chronic disease data from an external chronic disease database and generating a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then converted to a dense representation for use as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of chronic diseases;
The data recognition module is used for inputting the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, passing the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, and storing the resulting character labels under three major categories: symptom, pathology and treatment;
The data recognition module comprises six vectors: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d. Each of these vectors has an initial value of 0 at t=0, and the time index of the six vectors follows the order in which the word vectors of the training samples occur. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, yielding the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data. Hereafter the bidirectional long short-term memory network is referred to simply as the LSTM network; its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product;
The data training module is used for receiving the character labels of the chronic disease data, inputting them into the integrated (ensemble) learning network for classification training, and extracting an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments;
The data visualization module is used for receiving the ternary association model, performing statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then sending the target relationship model to an external user interface module, where it is presented as statistical charts.
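The recursion in formulas (1) to (6) of the data recognition module can be sketched in plain Python as follows. This is a minimal illustration of a single forward LSTM cell only, not the patented implementation: the function names, the layout of `params` (a gate name mapped to a `(W, U, B)` triple) and the use of nested lists for matrices are assumptions made for the example, and training and the backward direction are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, B):
    """Return W x + U h + B, with matrices given as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + b
        for w_row, u_row, b in zip(W, U, B)
    ]

def lstm_step(x_t, h_prev, c_prev, params):
    """Apply formulas (1)-(6) once.

    params maps 'i', 'f', 'c', 'o' to a (W, U, B) triple (hypothetical layout).
    """
    def gate(name, activation):
        W, U, B = params[name]
        return [activation(z) for z in affine(W, x_t, U, h_prev, B)]

    i_t = gate("i", sigmoid)        # (1) input gate
    f_t = gate("f", sigmoid)        # (2) forget gate
    d_t = gate("c", math.tanh)      # (3) second auxiliary gate, weights W_c, U_c, B_c
    c_t = [f * c + i * d            # (4) first auxiliary gate
           for f, c, i, d in zip(f_t, c_prev, i_t, d_t)]
    o_t = gate("o", sigmoid)        # (5) output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # (6) hidden vector
    return h_t, c_t

def run_lstm(xs, params, hidden_size):
    """Run the cell over a sequence, starting from h = c = 0 at t = 0."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    hidden = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hidden.append(h)
    return hidden
```

Calling `run_lstm(xs, params, hidden_size)` on a list of dense input vectors returns one hidden vector per time step, corresponding to $h_1, \ldots, h_m$ for the forward direction.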
The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, which comprises the following steps:
S1, data preprocessing: a data preprocessing module extracts m items of chronic disease data from an external chronic disease database and generates a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then converted to a dense representation for use as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of chronic diseases;
S2, data recognition: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in a data recognition module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
Hereafter the bidirectional long short-term memory network is referred to simply as the LSTM network; its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product; the initial value of each of the vectors h, i, f, o, c and d is 0 at t=0, and the time index of the six vectors follows the order in which the training-sample word vectors occur;
S3, label calculation: the data recognition module passes the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, and records the labels as $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$; the labels distinguish three major categories: symptoms, pathology and treatment;
The conditional random field is used to calculate the conditional probability between two given sequences in the hidden vectors, and to extract the chronic disease label l, the occurrence condition p and the conditional probability q for those hidden vectors whose conditional probability exceeds a given threshold; specifically, the data recognition module counts, over all training samples, the conditional probability of a transition from label class i to label class j between consecutive time steps;
S4, data training: a data training module receives the character labels of the chronic disease data, inputs them into an integrated (ensemble) learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then sends the target relationship model to an external user interface module, where it is presented as statistical charts;
S6, the statistical analysis results are presented in the user interface as statistical charts.
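The counting described in step S3 can be sketched as follows. This is only a count-based estimate of label transition probabilities plus the threshold filter on (l, p, q) triples; it is not a trained conditional random field, and the function names and the sample label sequences are hypothetical.

```python
from collections import Counter

def transition_probabilities(tag_sequences):
    """Estimate P(next label = j | current label = i) from consecutive time steps."""
    pair_counts = Counter()
    from_counts = Counter()
    for tags in tag_sequences:
        for cur, nxt in zip(tags, tags[1:]):
            pair_counts[(cur, nxt)] += 1
            from_counts[cur] += 1
    return {(i, j): n / from_counts[i] for (i, j), n in pair_counts.items()}

def filter_labels(candidates, threshold):
    """Keep the (l, p, q) triples whose conditional probability q exceeds the threshold."""
    return [(l, p, q) for (l, p, q) in candidates if q > threshold]

# Hypothetical label sequences for two training samples.
sequences = [
    ["symptom", "symptom", "pathology", "treatment"],
    ["symptom", "pathology", "pathology", "treatment"],
]
probs = transition_probabilities(sequences)
```

Here `probs[("symptom", "pathology")]` is 2/3, because two of the three transitions that leave `symptom` go to `pathology`.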
Further, step S4 of the chronic disease data analysis method based on natural language processing and integrated training comprises the following substeps:
S401: partitioning the model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ representing the chronic disease character labels by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size, taking 50% of the sets as the test set $M_c$ and the other 50% as the training set $M_t$;
S402: on the training set, defining an integrated (ensemble) learning network and setting up several different primary learning algorithms;
Training the primary learning algorithms on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
Applying the primary learners to the initial training set $M_t$ to obtain a secondary data set $M_v$;
Using multi-response linear regression as the meta-learning algorithm, and generating from the secondary data set $M_v$ the meta-learner with the best prediction performance;
S403: classifying the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ with the optimal meta-learner, and extracting from it an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments.
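Substeps S402 and S403 describe a stacking-style ensemble, which can be sketched as follows. The source does not name the primary learning algorithms or the input features, so several things here are assumptions: the two toy primary learners (nearest centroid and 1-nearest neighbour), the numeric feature vectors, the one-hot encoding of primary predictions, and the fitting of the multi-response linear regression by gradient descent.

```python
def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid_fit(X, y):
    """Primary learner 1 (assumed): nearest class centroid."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
        counts[label] = counts.get(label, 0) + 1
    cents = {c: [v / counts[c] for v in s] for c, s in sums.items()}
    return lambda x: min(cents, key=lambda c: dist2(x, cents[c]))

def one_nn_fit(X, y):
    """Primary learner 2 (assumed): 1-nearest neighbour."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: dist2(x, p[0]))[1]

def one_hot(label, classes):
    return [1.0 if label == c else 0.0 for c in classes]

def out_of_fold_features(X, y, fitters, classes, k):
    """Secondary data set M_v: predictions made while each row was held out."""
    n = len(X)
    meta = [None] * n
    for fold in range(k):
        held = [i for i in range(n) if i % k == fold]
        kept = [i for i in range(n) if i % k != fold]
        models = [fit([X[i] for i in kept], [y[i] for i in kept]) for fit in fitters]
        for i in held:
            row = []
            for m in models:
                row += one_hot(m(X[i]), classes)
            meta[i] = row
    return meta

def fit_mlr(Z, y, classes, epochs=500, lr=0.1):
    """Multi-response linear regression: one least-squares regressor per class,
    fitted here by batch gradient descent; predict the class with the largest response."""
    dim = len(Z[0]) + 1                      # +1 for the bias input
    Zb = [row + [1.0] for row in Z]
    weights = {c: [0.0] * dim for c in classes}
    for c in classes:
        w = weights[c]
        targets = [1.0 if label == c else 0.0 for label in y]
        for _ in range(epochs):
            grad = [0.0] * dim
            for row, t in zip(Zb, targets):
                err = sum(wj * zj for wj, zj in zip(w, row)) - t
                for j, zj in enumerate(row):
                    grad[j] += err * zj
            for j in range(dim):
                w[j] -= lr * grad[j] / len(Zb)
    def predict(z):
        zb = z + [1.0]
        return max(classes, key=lambda c: sum(wj * zj for wj, zj in zip(weights[c], zb)))
    return predict

def fit_stacking(X, y, fitters, k=3):
    classes = sorted(set(y))
    Z = out_of_fold_features(X, y, fitters, classes, k)
    meta = fit_mlr(Z, y, classes)
    models = [fit(X, y) for fit in fitters]  # refit the primary learners on all of M_t
    def predict(x):
        z = []
        for m in models:
            z += one_hot(m(x), classes)
        return meta(z)
    return predict

# Hypothetical, well-separated toy features for the three label classes.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0],
     [10.0, 0.0], [10.0, 1.0], [11.0, 0.0], [11.0, 1.0]]
y = ["symptom"] * 4 + ["pathology"] * 4 + ["treatment"] * 4
classify = fit_stacking(X, y, [centroid_fit, one_nn_fit], k=3)
```

The meta-learner is trained only on out-of-fold predictions, so it sees how each primary learner behaves on data that learner was not fitted on.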
The invention has the following beneficial effects:
Firstly, the data preprocessing step generates a word vector $W = (w_1, w_2, w_3, \ldots, w_m)$ from the unstructured data in the external chronic disease database, each item $w_i$ of $W$ corresponding to one character in the chronic disease data, and converts $W$ to a dense representation for use as a training sample; this ensures that the large amounts of structured and unstructured electronic medical data generated during the diagnosis and treatment of chronic diseases can be represented effectively and processed in later steps;
Secondly, in the label calculation step of the data recognition module, the conditional random field automatically calculates the character label of each chronic disease data sample from the hidden vectors obtained in the previous step, distinguishing the three major categories of symptoms, pathology and treatment. This makes it easier for the neural network to learn the relationships among the three, supports the complex cause analysis and course prediction involved in chronic diseases, and allows medication to be matched to the condition, to the benefit of patients;
Finally, the invention also outputs a trained chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments, from which the user interface can draw statistical charts and other varied statistical results for presentation to medical staff and decision makers.
Drawings
Figure 1 is a schematic diagram of the composition and structure of the chronic disease data analysis system based on natural language processing and integrated training;
Figure 2 is a flow chart of the chronic disease data analysis method based on natural language processing and integrated training;
Figure 3 is a flow chart of step S4 of the chronic disease data analysis method based on natural language processing and integrated training.
Detailed Description
The present invention will be further described with reference to the drawings and examples, which are only for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention.
According to Baidu Baike, the long short-term memory network (LSTM, Long Short-Term Memory) is a recurrent neural network designed specifically to solve the long-term dependency problem of ordinary RNNs (recurrent neural networks). The LSTM paper was first published in 1997; owing to its distinctive structure, the LSTM is well suited to processing and predicting important events separated by very long intervals and delays in a time series. A bidirectional long short-term memory network (BiLSTM) is equivalent to a bidirectional recurrent neural network (BiRNN) in which the ordinary RNN units are replaced with LSTM units; its structure comprises at least an input gate, a forget gate and an output gate.
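The bidirectional arrangement described above can be sketched independently of the cell type. In the minimal example below the recurrent cell is abstracted as a function from an input and the previous hidden state to the new hidden state; this is a simplification, since an LSTM cell also carries its state c, and the function names and the toy running-sum cell are assumptions for illustration.

```python
def run_bidirectional(xs, step_fwd, step_bwd, h0):
    """Run one recurrent cell left-to-right and another right-to-left,
    then concatenate the two hidden states at every position."""
    fwd, h = [], h0
    for x in xs:
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                      # realign with the original positions
    return [f + b for f, b in zip(fwd, bwd)]

# Toy cell (hypothetical): the hidden state is a running sum of the inputs.
running_sum = lambda x, h: [h[0] + x]
states = run_bidirectional([1, 2, 3], running_sum, running_sum, [0])
# states == [[1, 6], [3, 5], [6, 3]]
```

Each output position therefore sees both the prefix and the suffix of the sequence, which is what allows a character label to depend on context on either side.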
The chronic disease data analysis system based on natural language processing and integrated training of the invention has the composition and structure shown at 3 in Figure 1, and comprises a data preprocessing module 301, a data recognition module 302, a data training module 303 and a data visualization module 304, wherein:
The data preprocessing module 301 is configured to extract m items of chronic disease data from the external chronic disease database 1 and generate a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then converted to a dense representation for use as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of chronic diseases;
The data recognition module 302 is configured to input the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, and to pass the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, the labels distinguishing three major classes: symptoms, pathology and treatment. The data recognition module 302 comprises a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d; each of the vectors h, i, f, o, c and d has an initial value of 0 at t=0, and the time index of the six vectors follows the order in which the word vectors of the training samples occur. The word vectors of the training samples are input into the bidirectional long short-term memory network for training, yielding the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data. Hereafter the bidirectional long short-term memory network is referred to simply as the LSTM network; its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product;
The data training module 303 is configured to receive the character labels of the chronic disease data, input them into the integrated (ensemble) learning network for classification training, and extract an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments;
The data visualization module 304 is configured to receive the ternary association model, perform statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and send the target relationship model to the external user interface 2, where it is presented as statistical charts.
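The TF-IDF weighting used by the data visualization module 304 can be sketched as follows. The source does not say how the ternary association model is turned into terms and documents, so this example assumes that each record is one "document" whose terms are its symptom, index and treatment identifiers; the identifiers and the particular TF-IDF variant (relative term frequency times log of inverse document frequency) are likewise assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of term lists. Returns one {term: weight} dict per document,
    with tf = count / len(document) and idf = log(N / document frequency)."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))            # count each term once per document
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical records, one (symptom, index, treatment) triple each.
records = [
    ["symptom_A", "index_X", "treatment_P"],
    ["symptom_A", "index_X", "treatment_Q"],
    ["symptom_B", "index_Y", "treatment_P"],
]
weights = tf_idf(records)
```

Terms that occur in few records, such as `treatment_Q` here, receive higher weights than terms shared by most records, which is what makes the weighting useful for ranking associations before they are charted.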
The invention also provides a chronic disease data analysis method based on natural language processing and integrated training, the flow of which is shown in Figure 2, comprising the following steps:
S1, data preprocessing: a data preprocessing module extracts chronic disease data from an external chronic disease database and generates a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is then converted to a dense representation for use as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of chronic diseases;
S2, data recognition: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in a data recognition module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
Hereafter the bidirectional long short-term memory network is referred to simply as the LSTM network; its recursive calculation formulas are as follows:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i)$ (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f)$ (2)
$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c)$ (3)
$c_t = f_t \odot c_{t-1} + i_t \odot d_t$ (4)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o)$ (5)
$h_t = o_t \odot \tanh(c_t)$ (6)
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product; the initial value of each of the vectors h, i, f, o, c and d is 0 at t=0, and the time index of the six vectors follows the order in which the training-sample word vectors occur;
S3, label calculation: the data recognition module passes the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, and records the labels as $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$; the labels distinguish three major categories: symptoms, pathology and treatment;
The conditional random field is used to calculate the conditional probability between two given sequences corresponding to the hidden vectors, and to extract the chronic disease label l, the occurrence condition p and the conditional probability q for those hidden vectors whose conditional probability exceeds a given threshold; specifically, the data recognition module counts, over all training samples, the conditional probability of a transition from label class i to label class j between consecutive time steps;
S4, data training: a data training module receives the character labels of the chronic disease data, inputs them into an integrated (ensemble) learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it using the TF-IDF algorithm to obtain the required target relationship model, and then sends the target relationship model to an external user interface module, where it is presented as statistical charts;
S6, the statistical analysis results are presented in the user interface as statistical charts.
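Step S1 of the flow above, in which each character item $w_i$ is mapped to a dense vector, can be sketched as follows. The source does not specify how the dense representation is obtained, so the random embedding table below is purely a placeholder assumption (in practice such a table would normally be learned or pre-trained), and the function names and sample strings are hypothetical.

```python
import random

def build_vocab(records):
    """Assign an integer id to every distinct character, in first-seen order."""
    vocab = {}
    for text in records:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def make_embeddings(vocab_size, dim, seed=0):
    """Placeholder dense vectors; a real table would be learned or pre-trained."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]

def encode(text, vocab, table):
    """Map each character w_i of a record to its dense vector."""
    return [table[vocab[ch]] for ch in text]

vocab = build_vocab(["abca", "cd"])          # stand-ins for record text
table = make_embeddings(len(vocab), dim=4)
sample = encode("abca", vocab, table)        # list of four 4-dimensional vectors
```

The resulting list of vectors is the kind of input sequence that the recurrence in formulas (1) to (6) consumes one time step at a time.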
Wherein:
The detailed flow of step S4 is shown in Figure 3 and comprises the following substeps:
S401: partitioning the model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ representing the chronic disease character labels by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size, taking 50% of the sets as the test set $M_c$ and the other 50% as the training set $M_t$;
S402: on the training set, defining an integrated (ensemble) learning network and setting up several different primary learning algorithms;
Training the primary learning algorithms on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
Applying the primary learners to the initial training set $M_t$ to obtain a secondary data set $M_v$;
Using multi-response linear regression as the meta-learning algorithm, and generating from the secondary data set $M_v$ the meta-learner with the best prediction performance;
S403: classifying the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ with the optimal meta-learner, and extracting from it an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathology indexes and treatments.
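The partitioning in substep S401 can be sketched as follows, reading the division as a stratified split on the label l so that every set $D_i$ contains a similar mix of symptom, pathology and treatment labels. That reading, the round-robin assignment, the function names and the sample triples are all assumptions for illustration.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split M into k sets D_1..D_k of similar size, spreading each label class evenly."""
    by_class = defaultdict(list)
    for item in labels:
        by_class[item[0]].append(item)       # item = (l, p, q); stratify on l
    folds = [[] for _ in range(k)]
    position = 0
    for cls in sorted(by_class):
        for item in by_class[cls]:
            folds[position % k].append(item)  # deal items round-robin
            position += 1
    return folds

def half_split(folds):
    """Use half of the sets as the test set M_c and the rest as the training set M_t."""
    half = len(folds) // 2
    test = [item for fold in folds[:half] for item in fold]
    train = [item for fold in folds[half:] for item in fold]
    return test, train

# Hypothetical label triples: four per class, with placeholder conditions and probabilities.
M = [(cls, "cond%d" % n, 0.9)
     for cls in ("symptom", "pathology", "treatment")
     for n in range(4)]
folds = stratified_folds(M, 4)
M_c, M_t = half_split(folds)
```

With twelve triples and k = 4, each set holds one triple of each class, and the test and training sets each receive six triples.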
The embodiments disclosed above are preferred embodiments of the invention, which is not limited to them; those skilled in the art will readily appreciate from the foregoing description that various extensions and modifications can be made without departing from the spirit of the invention.

Claims (3)

1. A chronic disease data analysis system based on natural language processing and integrated training, characterized by comprising a data preprocessing module, a data recognition module, a data training module and a data visualization module, wherein:
The data preprocessing module is used for extracting m items of chronic disease data from an external chronic disease database and generating a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, where each item $w_i$ of $W$ corresponds to one character item in the chronic disease data; the word vector $W$ is converted to a dense representation for use as a training sample; the chronic disease data are descriptive text covering the symptoms, biochemical pathology indexes and treatment of chronic diseases;
The data recognition module is used for inputting the word vectors of the training samples into a bidirectional long short-term memory network for training to obtain hidden vectors of the chronic disease data, passing the hidden vectors to a conditional random field to calculate a character label for each item of chronic disease data, and storing the resulting character labels under three major categories: symptom, pathology and treatment;
The data recognition module comprises six vectors: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d, wherein each of these vectors has an initial value of 0 at t=0, and the time index of the six vectors follows the order in which the word vectors of the training samples occur; the word vectors of the training samples are input into the bidirectional long short-term memory network for training, yielding the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data; hereafter the bidirectional long short-term memory network is referred to simply as the LSTM network, and its recursive calculation formulas are as follows:
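$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
Wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product;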
the data training module is used for receiving the character labels of the chronic disease data, inputting them into an integrated (ensemble) learning network for classification training, and extracting an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
the data visualization module is used for receiving the ternary association model, performing statistical analysis on it with the TF-IDF algorithm to obtain the required target relationship model, and then transmitting the target relationship model to an external user interface module, where it is presented as a statistical chart.
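As an illustration only, the recursion in formulas (1) to (6) of claim 1 can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the patented implementation: the helper names `sigmoid`, `affine`, `lstm_step` and `run_lstm` and the layout of `params` are hypothetical, weights are stored as lists of rows, and only a single forward pass is shown. A bidirectional network would presumably run a second pass over the reversed sequence and combine the two sets of hidden vectors, which the claims do not spell out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, B, x, h):
    """Compute W x + U h + B with matrices stored as lists of rows."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + b
        for w_row, u_row, b in zip(W, U, B)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """Apply formulas (1)-(6) once.

    params (hypothetical layout) maps "i", "f", "c", "o" to a tuple (W, U, B).
    """
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # (1)
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # (2)
    d_t = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]  # (3)
    c_t = [f * c + i * d
           for f, c, i, d in zip(f_t, c_prev, i_t, d_t)]             # (4)
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]               # (6)
    return h_t, c_t


def run_lstm(xs, params, hidden_size):
    """Run the recursion over a sequence, starting from h = c = 0 at t = 0."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    hidden = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hidden.append(h)
    return hidden
```

Calling `run_lstm(xs, params, hidden_size)` on a list of dense input vectors returns one hidden vector per time step, which corresponds to the sequence H described in the claim.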
2. A chronic disease data analysis method based on natural language processing and integrated training, characterized by comprising the following steps:
S1, data preprocessing: the data preprocessing module extracts m items of chronic disease data from an external chronic disease database to generate a corresponding word vector $W = (w_1, w_2, w_3, \ldots, w_m)$, each item $w_i$ of which corresponds to one character item in the chronic disease data, and quantizes the word vector W to obtain its dense representation as a training sample, wherein the chronic disease data are descriptive text covering the symptoms, biochemical pathological indexes and treatment of chronic diseases;
S2, data identification: a hidden gate h, an input gate i, a forget gate f, an output gate o, a first auxiliary gate c and a second auxiliary gate d are defined in the data identification module, and the word vectors of the training samples are input into a bidirectional long short-term memory network for training to obtain the hidden vectors $H = (h_1, h_2, h_3, \ldots, h_m)$ of the m items of chronic disease data;
hereafter the bidirectional long short-term memory network is abbreviated as the LSTM network, and its recursive calculation formulas are as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + B_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + B_f) \tag{2}$$
$$d_t = \tanh(W_c x_t + U_c h_{t-1} + B_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot d_t \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + B_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
wherein: W, U and B are the connection weights of the corresponding gates of the LSTM network, σ is the sigmoid function, and ⊙ denotes the element-wise product; the initial value of each of the vectors h, i, f, o, c and d is 0 at t=0, and the time-dimension information of the 6 vectors comes from the occurrence time of the word vectors of the training samples;
S3, label calculation: the data identification module passes the hidden vectors to a conditional probability field, which calculates the character label of each item of chronic disease data; the labels are recorded as $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$ and distinguish three major categories: symptoms, pathology and treatment;
The conditional probability field is used for calculating the conditional probability between two given sequences in the hidden vectors, and for extracting the chronic disease label l, the occurrence condition p and the conditional probability q of those hidden vectors whose conditional probability exceeds a given threshold; specifically, the data identification module counts, over all training samples, the conditional probability of a transition from label class i to label class j within one consecutive time step;
S4, data training: the data training module receives the character labels of the chronic disease data, inputs them into an integrated (ensemble) learning network for classification training, and extracts an effective chronic disease data association model, namely a ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment;
S5, data visualization: the data visualization module receives the ternary association model, performs statistical analysis on it with the TF-IDF algorithm to obtain the required target relationship model, and then transmits the target relationship model to an external user interface module for presentation as a statistical chart;
S6, presenting the statistical analysis result in the user interface as a statistical chart.
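The transition statistic described in step S3 (the conditional probability of moving from label class i to label class j in one time step, kept only when it exceeds a threshold) can be illustrated with a simple empirical count. This is a hedged sketch, not the conditional probability field of the claims: the function names `transition_probabilities` and `above_threshold`, the example label names and the threshold value are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict


def transition_probabilities(label_sequences):
    """Estimate P(next label = j | current label = i) from labelled sequences."""
    counts = defaultdict(Counter)
    for seq in label_sequences:
        # Each adjacent pair is one transition within a consecutive time step.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, next_counts in counts.items():
        total = sum(next_counts.values())
        probs[prev] = {nxt: n / total for nxt, n in next_counts.items()}
    return probs


def above_threshold(probs, threshold):
    """Keep only the transitions (i, j, q) whose probability q exceeds the threshold."""
    return [
        (i, j, q)
        for i, row in probs.items()
        for j, q in row.items()
        if q > threshold
    ]


# Hypothetical toy data: two labelled sequences over the three label classes.
sequences = [
    ["symptom", "symptom", "pathology", "treatment"],
    ["symptom", "pathology", "pathology", "treatment"],
]
probs = transition_probabilities(sequences)
kept = above_threshold(probs, threshold=0.5)
```

In this toy data, two of the three transitions out of "symptom" lead to "pathology", so that transition is estimated at 2/3 and survives a threshold of 0.5, while the "symptom" to "symptom" transition (1/3) is dropped.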
3. The chronic disease data analysis method based on natural language processing and integrated training according to claim 2, wherein S4 comprises the following sub-steps:
S401: the character-label model representing the chronic disease data, $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$, is partitioned by stratified sampling into k sets $D_1, D_2, \ldots, D_k$ of similar size; 50% of the k sets are extracted as a test set $M_c$ and 50% as a training set $M_t$;
S402: on the training set, an integrated (ensemble) learning network is defined and several different primary learning algorithms are set;
the primary learning algorithms are trained on the initial training set $M_t$ by k-fold cross-validation to obtain several different primary learners;
the primary learners are applied to the initial training set $M_t$ to obtain a secondary data set $M_v$;
a multi-response linear regression process is taken as the meta-learning algorithm, and a meta learner with optimal prediction performance is generated from the secondary data set $M_v$;
S403: the optimal meta learner classifies the chronic disease data model $M = \{(l_1, p_1, q_1), (l_2, p_2, q_2), \ldots, (l_m, p_m, q_m)\}$, and an effective chronic disease data association model is extracted, namely the ternary association model among chronic disease symptoms, biochemical pathological indexes and treatment.
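The data flow of sub-steps S401 and S402 (stratified partition into k folds, then building a secondary data set from the primary learners' predictions) can be sketched as follows. This is an assumption-laden illustration rather than the claimed method: the function names, the two toy primary learners (`fit_majority`, `fit_nearest`) and the scalar features are hypothetical, the out-of-fold construction is one common way to realize k-fold stacking, and the multi-response linear regression meta learner of S402 is deliberately left out.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds of similar size with similar class mix."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for y in sorted(by_class):
        # Deal the indices of each class round-robin across the folds.
        for idx in by_class[y]:
            folds[pos % k].append(idx)
            pos += 1
    return folds


def out_of_fold_predictions(X, y, folds, fit_fns):
    """Build the secondary data set: row n holds every primary learner's
    prediction for sample n, made by a model that never saw sample n."""
    secondary = [[None] * len(fit_fns) for _ in X]
    for held_out in folds:
        held = set(held_out)
        train_idx = [n for n in range(len(X)) if n not in held]
        X_tr = [X[n] for n in train_idx]
        y_tr = [y[n] for n in train_idx]
        for j, fit in enumerate(fit_fns):
            predict = fit(X_tr, y_tr)
            for n in held_out:
                secondary[n][j] = predict(X[n])
    return secondary


def fit_majority(X, y):
    """Toy primary learner: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def fit_nearest(X, y):
    """Toy primary learner: 1-nearest-neighbour on a scalar feature."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


# Hypothetical toy data: one scalar feature per sample, two label classes.
X = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
folds = stratified_folds(y, k=2)
secondary = out_of_fold_predictions(X, y, folds, [fit_majority, fit_nearest])
```

Each row of `secondary` plays the role of one sample of the secondary data set: it holds the primary learners' predictions and would be paired with the true label when fitting a meta learner.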
CN202011116445.8A 2020-10-19 2020-10-19 Chronic disease data analysis method and system based on natural language processing and integrated training Active CN112287665B (en)

Publications (2)

Publication Number Publication Date
CN112287665A CN112287665A (en) 2021-01-29
CN112287665B CN112287665B (en) 2024-05-03

Family

ID=74497464

Country Status (1)

Country Link
CN (1) CN112287665B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118136206A (en) * 2024-05-07 2024-06-04 江苏法迈生医学科技有限公司 Chronic disease prediction method in full course management system based on big data

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358948A (en) * 2017-06-27 2017-11-17 上海交通大学 Language in-put relevance detection method based on attention model
CN109460473A (en) * 2018-11-21 2019-03-12 中南大学 The electronic health record multi-tag classification method with character representation is extracted based on symptom
CN110060773A (en) * 2019-04-22 2019-07-26 东华大学 Alzheimer's disease progression of the disease forecasting system based on two-way LSTM
CN110444259A (en) * 2019-06-06 2019-11-12 昆明理工大学 Traditional Chinese medical electronic case history entity relationship extracting method based on entity relationship mark strategy
CN110569511A (en) * 2019-09-22 2019-12-13 河南工业大学 Electronic medical record feature extraction method based on hybrid neural network
CN111222340A (en) * 2020-01-15 2020-06-02 东华大学 Breast electronic medical record entity recognition system based on multi-standard active learning
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328526A1 (en) * 2015-04-07 2016-11-10 Accordion Health, Inc. Case management system using a medical event forecasting engine
US9949714B2 (en) * 2015-07-29 2018-04-24 Htc Corporation Method, electronic apparatus, and computer readable medium of constructing classifier for disease detection
US10565305B2 (en) * 2016-11-18 2020-02-18 Salesforce.Com, Inc. Adaptive attention model for image captioning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
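Research on the application of the BiLSTM-CRF model to named entity recognition in Chinese electronic medical records; Wang Ruojia; Wei Siyi; Wang Jimin; Journal of Library and Data (No. 02); full text *
Medical temporal phrase recognition based on a BLSTM network; Zhang Shunli; Wang Yingjun; Ji Donghong; Application Research of Computers (No. 04); full text *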

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant