CN118974839A - Patient pooling based on machine learning model - Google Patents
- Publication number: CN118974839A
- Application number: CN202380032232.2A
- Authority: CN (China)
- Legal status: Pending
Abstract
Techniques are disclosed herein for facilitating clinical decisions for a patient based on identifying a patient group having attributes similar to those of the patient. The patient group may be identified using information from a predictive machine learning model that makes clinical predictions for the patient. At least some of the attributes of the patient group may be output to support clinical decisions. The attributes may include, for example, profile data of the patient, results of one or more laboratory tests of the patient, biopsy image data of the patient, molecular biomarkers of the patient, tumor site of the patient, and tumor stage of the patient.
Description
Cross Reference to Related Applications
The present application claims priority from U.S. patent application Ser. No. 63/362,373, filed on April 1, 2022, the disclosure of which is incorporated herein by reference in its entirety.
Background
Predictive machine learning models trained using real-world clinical data offer great potential for providing patients and their clinicians with patient-specific information regarding diagnosis, prognosis, or optimal treatment protocols. A machine learning model may be trained to make clinical predictions of a patient's medical outcome, such as, for example, the patient's survival probability as a function of time since diagnosis (e.g., of advanced cancer), a new patient's survival time since diagnosis, other types of prognosis, and the like. Predictions may be provided to a patient, for example, to increase the patient's ability to plan for their future, which may increase the patient's quality of life.
Many machine learning models are developed using extensive data for multiple types of patients with different conditions, treatment modalities and prognosis appropriate for those conditions. These models may not be sufficiently trained with data applicable to a particular patient type and/or may be weighted according to data that may be of low value for a particular patient type, resulting in inaccurate prognosis and/or suggested treatment. Accordingly, there is a need for improved methods of enhancing patient care using machine learning models.
Disclosure of Invention
Techniques for facilitating clinical decisions for a patient based on identifying patient groups having attributes similar to those of the patient are disclosed herein. Patient groups may be identified using information from predictive machine learning models that make clinical predictions for the patient. At least some of the attributes of the patient group may be output to support clinical decisions. Attributes may include, for example, profile data of the patient, results of one or more laboratory tests of the patient, biopsy image data of the patient, molecular biomarkers of the patient, tumor site of the patient, and tumor stage of the patient.
In particular, the clinical decision support system may employ a machine learning model to make clinical predictions for a patient based on patient attributes. For example, the machine learning model may include a Random Survival Forest (RSF) model that predicts the survival probability of the patient as a function of time elapsed since diagnosis. The clinical decision support system may also identify a patient group (e.g., a "similarity-based patient pool") having specific attributes similar to those of the patient. The similarity-based patient pool may include patients having a health condition comparable to that of the patient, and may be identified based on those patients that are similar to the patient with respect to a subset of the attributes having the highest relevance to the clinical prediction made by the machine learning model (e.g., the probability of survival at a particular point in time since diagnosis). The clinical decision support system may then obtain information on the attributes of the similarity-based patient pool.
The clinical decision support system may output a predicted survival probability for the new patient and the attributes of the new patient determined to have the highest relevance to the prediction. With similarity-based patient pooling, the clinical decision support system may also output a summary of survival functions for the similarity-based patient pool and attributes of patients in the pool. This facilitates a comparison between the attributes of the new patient and the attributes of the similarity-based patient pool, focusing on the attributes that have the highest correlation with the survival prediction for the new patient. Studying the relationship between attributes and survival in a similarity-based patient pool can help clinicians determine an action plan (e.g., a treatment) to increase the survival probability of the new patient.
In some embodiments, a computer-implemented method of facilitating clinical decisions includes: receiving first data corresponding to a plurality of features of a first patient, each feature representing an attribute of a plurality of attributes; inputting the first data to a machine learning model to generate a result of a clinical prediction for the first patient, the machine learning model being associated with a plurality of feature importance metrics defining a degree of correlation of each of the plurality of features with the clinical prediction; obtaining second data corresponding to the plurality of features of each patient in a patient group based on a degree of similarity between the first patient and the patient group with respect to at least some of the plurality of features, the degree of similarity based on the first data, the second data, and the plurality of feature importance metrics; generating content based on the result of the clinical prediction and at least a portion of the second data; and outputting the content so that a clinical decision may be made for the first patient based on the content.
In some embodiments, the plurality of attributes includes at least one of: patient profile data, results of one or more laboratory tests of the first patient, biopsy image data of the first patient, molecular biomarkers of the first patient, tumor site of the first patient, or tumor stage of the first patient.
In some embodiments, the clinical prediction comprises at least one of: the probability of survival of the first patient at a predetermined time since the first patient was diagnosed with the tumor, the survival time of the first patient since the first patient was diagnosed with the tumor, or the outcome of the treatment.
In some embodiments, the machine learning model comprises a random survival forest model comprising a plurality of decision trees, each configured to process a subset of the first data to generate a cumulative survival probability; and wherein the survival probability of the patient at a predetermined time is determined based on an average of the cumulative survival probabilities output by the plurality of decision trees.
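The averaging step of this embodiment can be illustrated with a short sketch. The names and toy numbers here are hypothetical, not from the claims; each inner list stands for one decision tree's predicted cumulative survival probabilities at a shared set of time points.

```python
# Hypothetical sketch: a random survival forest's ensemble prediction is the
# average of the per-tree cumulative survival curves at each time point.

def average_survival(tree_curves):
    """Average per-tree cumulative survival probabilities at each time point.

    tree_curves: list of lists; tree_curves[i][t] is tree i's predicted
    cumulative survival probability at time index t.
    """
    n_trees = len(tree_curves)
    n_times = len(tree_curves[0])
    return [sum(curve[t] for curve in tree_curves) / n_trees
            for t in range(n_times)]

# Three toy trees predicting survival at, say, 500, 1000, and 1500 days:
curves = [
    [0.9, 0.7, 0.5],
    [0.8, 0.6, 0.4],
    [1.0, 0.8, 0.6],
]
ensemble = average_survival(curves)  # approximately [0.9, 0.7, 0.5]
```

A real RSF would derive each per-tree curve from the survival statistics of the training samples falling into the same leaf as the patient; the averaging itself is this simple.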
In some embodiments, the patient group is a first patient group; wherein the first patient group is selected from the second patient group; and wherein the machine learning model is trained based on patient data of the second patient group.
In some embodiments, the method further comprises: ranking the plurality of features based on a degree of correlation of each of the plurality of features with the clinical prediction; determining a subset of the plurality of features based on the ranking; and determining the first patient group based on a degree of similarity between the first patient and the first patient group in terms of a subset of the plurality of features.
In some embodiments, the first patient group is selected from the second patient group based on a degree of similarity between the first patient and the first patient group with respect to the subset of the plurality of features exceeding a threshold.
In some embodiments, the first patient group is selected from the second patient group based on selecting a threshold number of patients having a highest degree of similarity to the first patient with respect to the subset of the plurality of features.
In some embodiments, the method further comprises: calculating a weighted aggregate similarity degree based on summing the scaled similarity degrees for each of at least some of the plurality of features, each similarity degree scaled by a weight based on the relevance of the feature; and identifying a patient group based on the weighted aggregate similarity degree between the first patient and each patient in the patient group.
In some embodiments, a feature importance metric of a feature is determined based on a relationship between errors in results of clinical predictions generated by the machine learning model for a second patient in the first patient group; wherein the results of the clinical predictions are generated from a plurality of values of the feature of the second patient; and wherein each error is calculated based on comparing the result of the clinical prediction to an actual clinical outcome of the second patient.
In some embodiments, the content includes at least one of: median survival time of the first patient group, or Kaplan-Meier survival curve of the first patient group.
In some embodiments, the content includes values for one or more of a first subset of the plurality of features of the first patient, the first patient group, and the second patient group.
In some embodiments, a computer product includes a computer-readable medium storing a plurality of instructions for controlling a computer system to perform the operations of any of the above methods.
In some embodiments, a system comprises a computer product as described herein; and one or more processors configured to execute instructions stored on the computer-readable medium.
In some embodiments, the system comprises means for performing any of the methods described herein.
In some embodiments, the system is configured for performing any of the methods described herein.
In some embodiments, the system includes modules that individually perform the steps of any of the methods described herein.
These and other exemplary embodiments are described in detail below. For example, other embodiments relate to systems, apparatuses, and computer-readable media associated with the methods described herein.
The nature and advantages of examples of the disclosure may be better understood with reference to the following detailed description and the accompanying drawings.
Drawings
A detailed description is given with reference to the accompanying drawings.
Figs. 1A, 1B, and 1C illustrate exemplary techniques for facilitating clinical decisions based on clinical predictions, according to certain aspects of the present disclosure.
Fig. 2A, 2B, 2C, 2D, 2E, and 2F illustrate an improved clinical decision system capable of implementing machine learning based patient pooling in accordance with certain aspects of the present disclosure.
Fig. 3 illustrates a method of performing machine learning based patient pooling operations in accordance with certain aspects of the present disclosure.
FIG. 4 illustrates an exemplary computer system that may be used to implement the techniques disclosed herein.
Fig. 5 illustrates an example of how patient data from a patient pool may be used.
Fig. 6 illustrates another example of how patient data from a patient pool may be used.
Detailed Description
As described above, predictive machine learning models may be trained to make clinical predictions of medical outcomes for new patients. A new patient may be any living patient, i.e., a patient for whom clinical decisions are being made. For example, a Random Survival Forest (RSF) model may be trained on previous patients' data and survival statistics to predict survival probabilities for new patients as a function of time since diagnosis (e.g., of advanced cancer). Predictions may be provided to new patients, for example, to improve their ability to plan for the future, which can in turn improve their quality of life.
While the clinical predictions provided by predictive machine learning models may provide valuable information to new patients, the clinical prediction results themselves may not provide insight into how to improve the prognosis of new patients. For example, a prediction of a patient having a certain likelihood of survival at a particular point in time may not provide information about potential clinical decisions to improve the likelihood of survival of the patient at that point in time.
On the other hand, the medical courses of previous patients (whose data and survival statistics are used to train a predictive machine learning model) can provide valuable insight into potential clinical decisions to improve the prognosis of a new patient. For example, a machine learning model such as an RSF model may output a prediction of the probability that a new patient will survive to a particular point in time since diagnosis. There may be a first patient group (e.g., group A) whose survival probability over time is similar to that predicted by the model for the new patient, and a second patient group (e.g., group B) whose survival probability is much lower than that predicted for the new patient. If group A shares a biomarker with the new patient and group B does not, it can be determined that the biomarker may be correlated with the survival probability of the new patient. Treatment decisions can then be made to target the biomarker. However, while predictive machine learning models may play a role in predicting a patient's prognosis based on patient attributes, machine learning models typically do not identify other patient groups whose medical outcomes are similar to the patient's predicted prognosis. Beyond providing the clinical prediction itself, machine learning models typically do not provide additional information that can be used to improve the prognosis of a patient.
Disclosed herein are techniques for facilitating clinical decisions for a new patient based on identifying a patient group having attributes similar to those of the new patient (hereinafter, a "similarity-based patient pool"). A predictive machine learning model is provided to make clinical predictions for new patients who are alive or whose future survival is unknown. The similarity-based patient pool may be identified from the previous patient groups whose data and survival statistics were used to train the predictive machine learning model. At least some of the attributes of the similarity-based patient pool may be output to support clinical decisions. Attributes may include, for example, profile data of the patient, results of one or more laboratory tests of the patient, biopsy image data of the patient, molecular biomarkers of the patient, tumor site of the patient, and tumor stage of the patient.
In some examples, the clinical decision support system may employ a machine learning model to make clinical predictions for a new patient based on the patient's attributes. For example, a Random Survival Forest (RSF) model may be used to predict the patient's survival probability as a function of time elapsed since diagnosis. The clinical decision support system may also identify a similarity-based patient pool having specific attributes similar to those of the new patient. The similarity-based patient pool may be identified based on patients sharing similar values with the new patient for a subset of attributes determined to have the highest relevance to the clinical prediction made by the machine learning model (e.g., the probability of survival at a particular point in time since diagnosis). The clinical decision support system may output attributes and medical outcomes of the similarity-based patient pool, as well as attributes and clinical predictions of the patient, to facilitate clinical decisions for the patient. In some examples, the similarity-based patient pool may include patients whose attributes and survival statistics are included in the training data used to train the machine learning model. In some examples, the similarity-based patient pool may also include patients whose data were not used to train the machine learning model.
In particular, the clinical decision support system may receive first data corresponding to attributes of the new patient. The attributes may include various profile information such as the patient's age and gender. Each attribute may be represented as a feature, which may include one or more vectors for input into the machine learning model. In some examples, an attribute may be represented as a plurality of features. Attributes may also include the patient's medical history (e.g., which treatment(s) the patient has received), the patient's habits (e.g., whether the patient smokes), and categories of laboratory test results for the patient (e.g., white blood cell count, hemoglobin, platelet count, hematocrit, red blood cell count, creatinine, lymphocyte count, protein, bilirubin, calcium, sodium, potassium, and glucose measurements). The attributes may also indicate measurements of various biomarkers for different cancer types, such as Estrogen Receptor (ER), Progesterone Receptor (PR), Human Epidermal Growth Factor Receptor 2 (HER2), Epidermal Growth Factor Receptor (EGFR or HER1), ALK (Anaplastic Lymphoma Kinase) for lung cancer, the KRAS gene for lung and colorectal cancer, the BRAF gene for colorectal cancer, and the like. The attribute data may be processed by the clinical decision support system, or processed prior to input into the clinical decision support system, to create a plurality of features containing the attribute information in a format (e.g., vectors) that can be interpreted by the machine learning model.
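As a sketch of how such attributes might be turned into model-ready features, the following hypothetical encoding (names, attribute keys, and encoding choices are all illustrative assumptions, not the patent's implementation) passes numeric labs through, maps binary biomarker statuses to 0/1, and one-hot encodes a categorical habit:

```python
# Hypothetical attribute-to-feature-vector encoding; keys and categories are
# illustrative only.

BIOMARKERS = ["er", "pr", "her2", "egfr"]
SMOKING_CATEGORIES = ["never", "former", "current"]

def encode_patient(attrs):
    # Numeric attributes pass through as floats.
    features = [float(attrs["age"]), float(attrs["hemoglobin"])]
    # Binary biomarker statuses become 0/1 indicators.
    features += [1.0 if attrs[m] == "positive" else 0.0 for m in BIOMARKERS]
    # A categorical habit is one-hot encoded.
    features += [1.0 if attrs["smoking"] == c else 0.0
                 for c in SMOKING_CATEGORIES]
    return features

patient = {"age": 60, "hemoglobin": 13.5, "er": "positive", "pr": "negative",
           "her2": "negative", "egfr": "positive", "smoking": "former"}
vec = encode_patient(patient)
# vec → [60.0, 13.5, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

A production system would of course handle missing values, units, and many more attribute types, but the resulting object is the same kind of numeric vector described above.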
The clinical decision support system may include a machine learning model that may be trained based on data from previous patients to make clinical predictions for new patients. Predictions may be made by inputting the attributes of the new patient into the machine learning model. For example, the machine learning model may include an RSF model that may output a predicted survival function based on the first data as the clinical prediction. The survival function may be used to obtain the probability that a new patient survives for a predetermined time (e.g., 500 days, 1000 days, 1500 days, etc.) after the new patient is diagnosed with a medical condition (e.g., advanced cancer). Relatedly, the hazard function provides the risk of death as a function of time, assuming survival to that time, and the cumulative hazard function (CHF) provides the cumulative risk as a function of time. The survival function over time may be used to generate a patient-specific survival curve for the new patient.
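For reference, the standard relationships among the survival function, the hazard (risk) function, and the cumulative hazard function mentioned above can be written as:

```latex
S(t) = \Pr(T > t), \qquad
h(t) = \lim_{\Delta t \to 0}
       \frac{\Pr(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}, \qquad
H(t) = \int_0^t h(u)\,du, \qquad
S(t) = e^{-H(t)}.
```

The last identity is why a model that outputs a predicted cumulative hazard can be converted directly into a predicted survival probability at any time point.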
As part of the training operation, a plurality of feature importance metrics associated with the machine learning model may also be obtained, wherein a feature importance metric defines the correlation of a feature with the clinical prediction (e.g., survival at a particular point in time). In one example, out-of-bag (OOB) samples, comprising samples of the training patient data not used to construct a given decision tree of the RSF model, may be input to that decision tree to calculate a prediction error such as a concordance index (c-index). Then, for those samples, the values of a feature may be permuted, and the prediction error of each decision tree may be recalculated on the permuted values. The raw importance score of the feature may be calculated based on the average difference in prediction error across the trees between the original and permuted values. A higher raw importance score indicates that the feature is more relevant to the predicted survival function, while a lower raw importance score indicates that the feature is less relevant. At the end of the training operation, the features may be ranked based on their importance scores, with features having higher relevance ranked higher.
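The permutation step just described (shuffle one feature's values and measure how much the prediction error worsens) can be sketched as follows. The scorer and data here are toy stand-ins; in an RSF the score would be a per-tree c-index computed on OOB samples rather than this simple accuracy-style function.

```python
# Hypothetical permutation-importance sketch; names are illustrative only.
import random

def permutation_importance(score_fn, X, y, feature_idx, n_repeats=10, seed=0):
    """Average drop in score after permuting one feature column of X."""
    rng = random.Random(seed)
    baseline = score_fn(X, y)
    drops = []
    for _ in range(n_repeats):
        permuted = [row[:] for row in X]               # copy rows
        col = [row[feature_idx] for row in permuted]
        rng.shuffle(col)                               # permute the feature's values
        for row, v in zip(permuted, col):
            row[feature_idx] = v
        drops.append(baseline - score_fn(permuted, y))
    return sum(drops) / n_repeats                      # raw importance score

# Toy scorer: fraction of samples where the sign of feature 0 matches the label.
def score(X, y):
    return sum(1 for row, label in zip(X, y) if (row[0] > 0) == label) / len(y)

X = [[1, 5], [2, 3], [-1, 4], [-2, 8]]
y = [True, True, False, False]
imp0 = permutation_importance(score, X, y, feature_idx=0)  # informative feature
imp1 = permutation_importance(score, X, y, feature_idx=1)  # uninformative feature
```

Permuting the uninformative feature leaves the score unchanged (importance 0), while permuting the informative one degrades it, which is exactly the signal used to rank features.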
Based on the attributes of the new patient, the clinical decision support system may identify, from a patient database, a group of patients that are similar to the new patient in terms of the highest-ranked features. This group may be referred to as a similarity-based patient pool. The first step in forming the pool is to calculate, based on the highest-ranked features, the similarity between the new patient and each patient in the database. Patients are then selected for the pool based on a criterion, two examples of which follow. In a first example, a patient may be selected if the patient's similarity to the new patient exceeds a threshold. In a second example, the patients in the database are ranked by their similarity to the new patient, and a predetermined number of the highest-ranked patients are selected. Thus, the similarity-based patient pool can be considered similar to the new patient not only in terms of overall health status, but specifically in terms of the features of highest relevance to the clinical prediction.
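The two-step pooling just described, a weighted similarity over the top-ranked features followed by top-k selection, might be sketched like this. The per-feature similarity measure, the weights, and all names are illustrative assumptions, not the patent's implementation.

```python
# Hypothetical similarity-based pooling sketch; all names are illustrative.

def weighted_similarity(new_patient, other, weights):
    """Aggregate per-feature similarity, each scaled by its importance weight.

    Per-feature similarity here is 1 / (1 + |difference|); any bounded
    similarity measure could be substituted.
    """
    total = 0.0
    for feature, w in weights.items():
        total += w / (1.0 + abs(new_patient[feature] - other[feature]))
    return total

def select_pool(new_patient, database, weights, k=2):
    """Return the k database patients most similar to the new patient."""
    ranked = sorted(database,
                    key=lambda p: weighted_similarity(new_patient, p, weights),
                    reverse=True)
    return ranked[:k]

# Toy example: the two highest-ranked features and their importance weights.
weights = {"egfr": 0.7, "age": 0.3}
new_patient = {"id": "new", "egfr": 1, "age": 60}
database = [
    {"id": "a", "egfr": 1, "age": 62},
    {"id": "b", "egfr": 0, "age": 60},
    {"id": "c", "egfr": 1, "age": 61},
    {"id": "d", "egfr": 0, "age": 80},
]
pool = select_pool(new_patient, database, weights, k=2)
# Patients "a" and "c" share the heavily weighted EGFR value, so they form the pool.
```

The threshold variant described above would simply replace the top-k cut with a filter such as `weighted_similarity(...) > threshold`.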
The clinical decision support system may then output the attributes and medical outcomes of the similarity-based patient pool, as well as the attributes and clinical predictions of the new patient. This may help to facilitate clinical decisions for new patients. For example, the clinical decision support system may output a summary of attributes of the similarity-based patient pool, as well as a comparison of attributes (particularly attributes corresponding to the highest ranked features) between the new patient and the similarity-based patient pool. The output of clinical decision support allows the clinician to study relevant attributes and determine an action plan (e.g., treatment) to increase the survival probability of new patients.
As an illustrative example, a feature corresponding to a biomarker attribute (e.g., Epidermal Growth Factor Receptor (EGFR) status) may be one of the highest-ranked features of the RSF model. Suppose the new patient is EGFR-positive; the clinical decision support system can then output the EGFR-positive results of the similarity-based patient pool. If the predicted survival function for the new patient is more similar to the predicted survival functions of EGFR-positive patients in the pool than to those of EGFR-negative patients, it can be determined that EGFR-targeted therapy may be used to increase the survival probability of the new patient.
Using the disclosed techniques, a similarity-based patient pool can be identified that not only has a health condition similar to the new patient's, but is also similar in terms of the attributes/conditions that are most relevant to the clinical prediction. The association of the attributes with the clinical prediction makes it more likely that the medical journeys of the patients in the similarity-based patient pool can provide insight into potential treatments that can improve the prognosis of the new patient. These insights may be supported by statistics and medical histories of a relatively large patient population. For example, specific biomarkers shared between the similarity-based patient pool and the new patient may be studied to determine whether a targeted therapy can improve the survival probability of the new patient.
I. Examples of clinical predictions and applications
Fig. 1A and 1B illustrate examples of clinical predictions that may be provided by embodiments of the present disclosure. Fig. 1A illustrates a mechanism to predict a cumulative survival probability for a patient with respect to the time since a diagnosis of cancer was made, while fig. 1B illustrates an exemplary application of survival probability prediction. Referring to fig. 1A, a graph 100, which provides a study of survival statistics between patients with a cancer (e.g., lung cancer), illustrates an example of a Kaplan-Meier (K-M) plot. The patients may receive a particular treatment. The K-M plot shows the change in the cumulative survival probability of a patient group with respect to the time measured since the patients were diagnosed with cancer. In the case of patients receiving treatment, the K-M plot also shows the cumulative survival probability reflecting the patients' response to the treatment. Over time, some patients may die, and the survival probability decreases. Some other patients may be censored from the plot because of events unrelated to the studied event (e.g., they moved to a different state and therefore changed hospitals). A censoring event is indicated by a diagonal tick mark in the K-M plot. The length of each horizontal line represents the time interval during which no death occurred, and the estimated survival at a given point represents the cumulative probability of surviving to that time.
In fig. 1A, graph 100 includes two K-M plots of cumulative survival probabilities for different cohorts A and B of patients (e.g., cohorts of patients having different characteristics, receiving different treatments, etc.). From fig. 1A, the median survival time of cohort A (the time at which the cumulative survival probability first decreases below 50%) is about 11 months, while the median survival time of cohort B is about 6.5 months. In addition, the probability of a patient in cohort A surviving at least 8 months is about 70% (0.7), while the probability of a patient in cohort B surviving at least 8 months is about 30% (0.3).
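As an illustration of how a K-M estimate of this kind is computed, the following Python sketch implements the estimator and the median survival time described above; the follow-up times and censoring flags are hypothetical, not taken from the figure.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimator: returns (time, cumulative survival) pairs.

    times  -- follow-up time for each patient (e.g., months since diagnosis)
    events -- 1 if the patient died at that time, 0 if censored
    """
    # Sort by follow-up time; at tied times, process deaths before censorings
    order = sorted(range(len(times)), key=lambda i: (times[i], -events[i]))
    at_risk = len(times)
    survival = 1.0
    curve = []
    for i in order:
        if events[i]:  # a death: survival drops by a factor (1 - 1/at_risk)
            survival *= 1.0 - 1.0 / at_risk
            curve.append((times[i], survival))
        # a censored patient leaves the risk set without a drop in survival
        at_risk -= 1
    return curve

def median_survival(curve):
    """First time at which the cumulative survival falls below 50%."""
    for t, s in curve:
        if s < 0.5:
            return t
    return None  # median not reached within the follow-up period

# Hypothetical cohort: follow-up times in months, 0 = censored
times = [2, 4, 4, 6, 7, 9, 11, 12]
events = [1, 1, 0, 1, 1, 1, 1, 0]
curve = kaplan_meier(times, events)
```

Processing deaths before censoring events at tied times follows the usual K-M convention; for this hypothetical cohort the median survival time is 7 months.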
Fig. 1B illustrates an exemplary application of survival prediction for a patient. As shown in fig. 1B, data 102 of a patient 103 may be input to a clinical decision support tool 104 to generate a survival prediction 106. The data 102 may include different attributes such as, for example, archival data, medical history data, biomarkers, laboratory test result data, and the like. The clinical decision support tool 104 may generate various information 108, based on the survival prediction 106, to assist the clinician in administering care/treatment to the patient 103. For example, to facilitate care of the patient 103, the clinical decision support tool 104 may generate information 108 indicating, for example, the life expectancy of the patient. The information 108 may facilitate discussion between the clinician and the patient 103 regarding evaluation of the patient's prognosis and treatment options, and planning of the patient's life events. Two illustrative examples are given below. If the clinical decision support tool 104 predicts that the patient 103 has a relatively high probability of surviving 5 years, the patient 103 may decide to receive an aggressive treatment that is more physically demanding and has more serious side effects. But if the clinical decision support tool 104 indicates that the patient 103 has a relatively low probability of surviving 5 years, the patient 103 may decide to either forego treatment or receive an alternative treatment, and plan end-of-life care and life events.
While survival predictions may provide valuable information to patients and clinicians, survival predictions themselves may not provide insight into how to improve a patient's prognosis. For example, a prediction of a patient 103 having a certain probability of survival beyond a particular point in time may not provide information about potential treatments to increase the likelihood of survival of the patient at that point in time.
Making clinical decisions is a complex task in which a clinician must infer a diagnosis or treatment plan. The goal of clinicians is to match patients with the best treatments based on their education, study, and personal experience. They typically operate on a per-patient basis, and there is no digital solution at hand that can assist them in fully exploiting the potential of medical knowledge obtained from Real World Data (RWD). On the other hand, the growing volume of RWD provides opportunities for supplementing decisions with evidence-based, population-level information. Patient similarity is a fundamental component of identifying the most effective and least effective treatments based on RWD from similar individuals with comparable health conditions.
FIG. 1C illustrates an example of making clinical decisions and clinical predictions based on RWD. FIG. 1C shows a graph 120 combining a K-M plot 122 of a first patient group (labeled "group B" in FIG. 1C), a K-M plot 124 of a second patient group (labeled "group A" in FIG. 1C), and a survival prediction 126 for patient 103. The survival prediction 126 for the patient 103 may be a function of time, wherein the predicted cumulative survival probability decreases over time. As shown in fig. 1C, the predicted cumulative survival probability function 126 of the patient 103 has a higher similarity to the K-M plot 124 of group A than to the K-M plot 122 of group B.
Graph 130 shows an exemplary distribution of positive epidermal growth factor receptor (EGFR) results among patient 103, the group A patients (corresponding to K-M plot 124), and the group B patients (corresponding to K-M plot 122). Patient 103 (corresponding to the predicted cumulative survival probability function 126) has positive EGFR, so the bar for patient 103 in graph 130 is 100%. Approximately 60% of the patients in group A have positive EGFR results, while less than 5% of the patients in group B have positive EGFR results, as shown in graph 130. It should also be noted that the cumulative survival curve 124 overlaps the curve 126, while the curve 122 is much lower.
From graphs 120 and 130, it can be determined that the group A patients, whose cumulative survival curve is similar to that of patient 103 (as evidenced by the similarity between K-M plot 124 and the predicted outcome 126), have an EGFR positivity rate of about 60%. In contrast, group B, whose K-M plot 122 shows a much lower survival probability than the predicted outcome 126 for patient 103, has an EGFR positivity rate of only about 5%. This may indicate that the presence of EGFR may be an important factor in determining the survival probability of patient 103. Further studies, such as studying EGFR-targeted therapies, may then be performed based on this observation.
While such observations are useful and may provide insight into treatment options to increase the probability of survival of the patient 103, observations generally cannot be derived solely from the survival probability predictions 106. For example, the prediction does not identify other patients with similar survival statistics. The prediction results also do not identify other patients having similar health conditions as patient 103.
II. Similarity-based patient pooling using machine learning models
Fig. 2A illustrates an example of a clinical decision support system 200 that makes clinical predictions for patients and identifies a similarity-based patient pool (based on patient attributes involved in the clinical predictions). As shown in fig. 2A, the clinical decision support system 200 includes a clinical prediction module 202, a patient pool determination module 204, and a portal 205. The clinical prediction module 202 may include a machine learning prediction model 206. The clinical prediction module 202 may receive patient data 208 corresponding to a plurality of features of a patient 210 and use the machine learning prediction model 206 to make clinical predictions 212 for the patient based on the patient data 208. Patient 210 may be a new patient. The characteristics of the patient data 208 may represent various attributes of the patient 210, including, for example, archival data 208a, medical history data 208b, biomarkers 208c, laboratory test result data 208d, and the like. Clinical predictions 212 may include, for example, patient survival probabilities. The probability of survival may indicate a likelihood that the patient survives for a predetermined time (e.g., 500 days, 1000 days, 1500 days, etc.) after the patient is diagnosed with the medical condition (e.g., advanced cancer). The clinical decision support system 200 may be a software system executing on a computer system, such as computer system 10 of fig. 4.
Further, the patient pool determination module 204 may be coupled with a patient database 214 that stores patient data for a group of patients. As described below, the patient data of the patient database 214 may be used to train the machine learning predictive model 206. The patient pool determination module 204 may identify, from the patient database 214, a patient pool having attributes similar to those of the patient 210, along with its patient data 216. The patient pool determination module 204 may identify the patient pool based on the patient 210 being similar to those patients in terms of a subset of the attributes that have the highest relevance to the clinical predictions made by the machine learning predictive model 206. The clinical decision support system may then obtain the patient data 216 corresponding to the patient pool from the patient database 214. The portal 205 may perform additional processing of the patient data 216 (e.g., a comparison between the patient data 216 of the patient pool and the patient data 208 of the patient 210).
FIG. 2B illustrates a table 220 providing examples of attributes, including archival data 208a, medical history data 208b, biomarkers 208c, and laboratory test result data 208d. For example, the archival data 208a may include various categories of information such as age, gender, and race. The medical history data 208b may include various categories of information such as diagnostic results, including cancer stage, histology, the Charlson comorbidity index (CCI), which predicts the risk of mortality based on the presence of specific comorbid conditions, and the Eastern Cooperative Oncology Group (ECOG) performance status, which describes a patient's level of functioning in terms of self-care, daily activity, and physical ability. The medical history data 208b may also include other information such as the patient's habits (e.g., whether the patient smokes). Laboratory test result data 208d may include different categories of laboratory test results for a patient, such as white blood cell count, hemoglobin, platelet count, hematocrit, red blood cell count, creatinine, lymphocyte count, and measurements of protein, bilirubin, calcium, sodium, potassium, alkaline phosphatase, carbon dioxide, monocytes, chloride, lactate dehydrogenase, glucose, and the like. Biomarker data 208c may include measurements of various biomarkers for different cancer types, such as estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), epidermal growth factor receptor (EGFR or HER1), anaplastic lymphoma kinase (ALK) for lung cancer, the KRAS gene for lung cancer and colorectal cancer, the BRAF gene for colorectal cancer, and the like. It is appreciated that other attributes of clinical data not shown in fig. 2B, such as biopsy image data, may also be provided to the machine learning prediction model 206 for clinical prediction.
In table 220, each attribute may be represented by a continuous numerical feature, a binary feature (with a value of one or zero), or a one-hot vector indicating one of a set of possible categories for the attribute. For example, age may be represented as a continuous numerical feature. As another example, the attribute corresponding to the test result for biomarker ER may be one-hot encoded. Such an attribute may be associated with the following data categories: biomarker result positive, biomarker result negative, biomarker result failed, and biomarker not tested. The one-hot encoding may generate four features, each corresponding to one of the above categories. For each patient, only one of the four features (the feature corresponding to the category of the patient's attribute) will have a value of 1, and the other three will have a value of 0. The following table illustrates this with an example of four patients, each with a different category of the ER biomarker attribute.
Table 1: exemplary properties of biomarker ER
A. Random survival forest
The machine learning predictive model 206 of fig. 2A may be implemented using various techniques, such as a Random Survival Forest (RSF) model. Fig. 2C illustrates an example of the RSF model 230. As shown in fig. 2C, the random survival forest model 230 may include a plurality of decision trees, including, for example, decision trees 232 and 234. Each decision tree may include a plurality of nodes, including a root node (e.g., root node 232a of decision tree 232, root node 234a of decision tree 234) and child nodes (e.g., child nodes 232b, 232c, 232d, and 232e of decision tree 232, child nodes 234b and 234c of decision tree 234). Each parent node having child nodes (e.g., nodes 232a, 232b, and 234a) may be associated with predetermined classification criteria to classify a patient into one of its child nodes. Child nodes without children of their own are terminal nodes, which include nodes 232d and 232e (of decision tree 232) and nodes 234b and 234c (of decision tree 234). A node survival value is calculated at each terminal node of each tree. When used to predict the survival of the patient 210, the patient 210 is assigned to a terminal node of each tree based on the data 208 of the patient 210. For example, the decision tree 232 may output a cumulative survival probability value 236, while the decision tree 234 may output a cumulative survival probability value 238. The survival probability of the patient 210 may be calculated by averaging the node survival values of the terminal nodes to which the patient is assigned. For example, the average survival probability value 240 may be calculated based on an average of the survival probability values 236 and 238 and the survival probability values output by other decision trees.
Each decision tree may be assigned to process a different subset of features. For example, as shown in FIG. 2C, patient data 242 includes a set of features {S0, S1, S2, S3, S4, ..., Sn}. Each feature may represent an attribute shown in fig. 2B, or any other attribute described herein. Decision tree 232 may be assigned to process features S0 and S1, decision tree 234 may be assigned to process feature S2, while other decision trees may be assigned to process other feature subsets. A parent node in a decision tree may then compare the subset of patient data 242 corresponding to one or more of the assigned features to one or more thresholds to classify the patient 210 into one of its child nodes. For example, referring to decision tree 232, if the patient data for feature S0 exceeds a threshold x0, root node 232a may classify the patient into child node 232b; otherwise, the patient is assigned to terminal node 232c. Child node 232b may further classify the patient into one of terminal nodes 232d or 232e based on the patient data for feature S1. The decision tree 232 may output a cumulative survival probability of 10%, 20%, or 30%, depending on which terminal node the patient is classified into based on features S0 and S1. Similarly, decision tree 234 may output a cumulative survival probability of 50% or 90%, depending on which terminal node the patient is classified into based on feature S2.
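A minimal sketch of how an RSF combines terminal-node outputs, assuming trees are represented as nested dictionaries; the tree structures, thresholds, and survival values below are hypothetical stand-ins for decision trees 232 and 234 of fig. 2C.

```python
def predict_tree(node, features):
    """Route a patient down one survival tree to a terminal-node survival value.

    A parent node is a dict {"feature": k, "threshold": x, "left": ..., "right": ...};
    a terminal node is simply a float cumulative survival probability.
    """
    while isinstance(node, dict):
        k, x = node["feature"], node["threshold"]
        # Exceeding the threshold routes right, otherwise left
        node = node["right"] if features[k] > x else node["left"]
    return node

def predict_forest(trees, features):
    """Average the terminal-node survival values over all trees."""
    return sum(predict_tree(t, features) for t in trees) / len(trees)

# Hypothetical forest mirroring fig. 2C: one tree splits on S0 then S1,
# the other on S2; terminal values echo the 10/20/30% and 50/90% examples
tree_232 = {"feature": 0, "threshold": 0.5,
            "left": 0.30,  # terminal node reached when S0 <= x0
            "right": {"feature": 1, "threshold": 1.0, "left": 0.10, "right": 0.20}}
tree_234 = {"feature": 2, "threshold": 2.0, "left": 0.50, "right": 0.90}

p = predict_forest([tree_232, tree_234], {0: 0.8, 1: 0.4, 2: 3.1})
```

With these hypothetical inputs, the patient reaches the 0.10 terminal node of the first tree and the 0.90 terminal node of the second, so the forest's averaged survival probability is 0.50.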
The RSF model 230 of fig. 2C may be constructed to determine the cumulative survival probability up to a predetermined time since diagnosis (e.g., 1 year, 3 years, 5 years, etc.). A plurality of RSF models may be included in the machine learning prediction model 206. Referring back to fig. 2A, the clinical prediction module 202 may receive a time 222 as an input for determining the cumulative survival probability. The clinical prediction module 202 may then select the RSF model trained for the time 222 to calculate the cumulative survival probability up to the time 222.
B. Training operations
A training operation may be performed to generate each decision tree in the RSF model, the subset of features assigned to each decision tree, the classification criteria at each parent node of a decision tree, and the output value at each terminal node. Fig. 2D illustrates an example of a training operation. The training operation may be performed by a training module 250, which may be part of, or external to, the clinical decision support tool 200 (fig. 2A). The training operation may be performed with patient data from a larger patient population of the patient database 214. As described above, an RSF model may be trained to determine the cumulative survival probability up to a predetermined time since diagnosis. The training data for training a particular RSF model to determine the cumulative survival probability truncated at a predetermined time since diagnosis (e.g., 1, 3, or 5 years) may include deaths and censoring events up to the predetermined time, with all patients still alive at the predetermined time being censored. In some examples, the training data may include deaths and censoring events over a longer period (e.g., 5 years of death and censoring data for an RSF model that outputs cumulative survival probabilities up to 3 years).
In particular, the patient database 214 may store the attributes of patients shown in table 220. The training module 250 performs a random sampling-with-replacement process on the patient data 252 for the root node of each tree in the RSF model. The process of random sampling with replacement is commonly referred to as "bootstrapping", and is also referred to as "bagging" because all trees are combined/aggregated to form a random forest. Each tree is also assigned a random subset of the features. The root node (and each parent node thereafter) may then be split into child nodes during a recursive node splitting process as part of the training operation. During node splitting, a node comprising a subset of patients may be split into two child nodes based on a threshold for a subset of the features. The feature at each split and its threshold are selected to maximize the difference in survival probability between the two child nodes (e.g., based on a log-rank test).
As an example, during training of the decision tree 232 of fig. 2C, it may be determined that dividing the bootstrap samples of the patient data into two groups based on the feature S0 and the threshold x0 maximizes the difference in survival probability between the two groups, such that selecting a different feature (e.g., S1) or setting a different threshold for feature S0 would result in a smaller difference between the survival probabilities of the two groups, and the child node 232b may be generated accordingly. This process may then be repeated for child node 232b to generate additional child nodes until, for example, the number of patients in a particular child node reaches a threshold minimum. Once the splitting process has stopped, all nodes without child nodes become terminal nodes. For example, in at least one of the terminal nodes 232d and 232e, the number of patients reaches the threshold minimum, and thus the node splitting operation stops at these nodes. The output at each of these terminal nodes may be calculated from the outcome data of the patients classified into that terminal node. The training operation may be repeated to generate decision trees for outputting survival probabilities at different times, such that the RSF model 230 may output a survival function that predicts the survival probability of the patient at different times.
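The split search described above can be sketched as follows. As a hedged simplification, the separation measure here is the absolute difference in death rates between the two candidate child nodes; an actual RSF implementation would use the log-rank test statistic, and the patient records below are hypothetical.

```python
def best_split(patients, features, min_node_size=3):
    """Pick the (feature, threshold) pair that best separates survival outcomes.

    patients is a list of dicts with per-feature values plus a "dead" flag (0/1).
    Separation is the absolute difference in death rates between the two
    candidate child nodes -- a simplified stand-in for the log-rank statistic.
    """
    def death_rate(group):
        return sum(p["dead"] for p in group) / len(group)

    best = None
    for k in features:
        for thr in sorted({p[k] for p in patients}):
            left = [p for p in patients if p[k] <= thr]
            right = [p for p in patients if p[k] > thr]
            if len(left) < min_node_size or len(right) < min_node_size:
                continue  # splitting stops when a child would be too small
            sep = abs(death_rate(left) - death_rate(right))
            if best is None or sep > best[0]:
                best = (sep, k, thr)
    return best  # (separation, feature, threshold), or None if no valid split

# Hypothetical node: six patients; those with s0 > 3 all died
patients = [{"s0": v, "dead": d}
            for v, d in zip([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])]
best = best_split(patients, ["s0"])
```

Here the search selects the threshold 3 on feature "s0", which separates a group with a 0% death rate from a group with a 100% death rate.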
Referring to fig. 2E, the training module 250 may also determine feature importance metrics 260 associated with the machine learning predictive model 206. The feature importance metrics 260 may define the relevance of each feature by studying the effect of each feature on the error of the machine learning predictive model 206. Feature importance metrics 260 may be determined for the machine learning predictive model 206 for survival probability predictions up to a predetermined time (e.g., 3 years), and different feature importance metrics 260 may be determined for survival probability predictions up to different predetermined times (e.g., 3 years, 5 years, 7 years, etc.).
In one embodiment, to determine the feature importance metrics 260, the training module 250 may obtain a set of out-of-bag (OOB) samples of the patient data 252 from the patient database 214. The OOB samples of each tree may include the samples of patient data not included in the bootstrap samples of that tree in fig. 2D. For a given feature, the values of the feature may be randomly permuted across the OOB samples, and a prediction error rate 262 may be obtained from each decision tree that processes the OOB samples with the permuted values of the feature. For this feature, the raw importance score 264 may be calculated based on, for example, the average, over the decision trees, of the resulting increase in the prediction error rate. This process may be repeated for each feature to calculate a respective raw importance score 264 for each feature. A high raw importance score may indicate that the feature is more relevant to the survival prediction, while a low raw importance score may indicate that the feature is less relevant. At the end of the training operation, the features may be ranked based on their importance scores, with features having higher relevance ranked higher.
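The OOB-based importance computation can be sketched as a permutation test (the standard random-forest approach, assumed here); the error function and samples below are hypothetical stand-ins for the model's OOB prediction error.

```python
import random

def permutation_importance(predict_error, oob_samples, features, seed=0):
    """Raw importance of each feature: increase in OOB error after permuting it.

    predict_error(samples) -- returns the model's prediction error on samples
    oob_samples            -- list of dicts mapping feature name -> value
    """
    rng = random.Random(seed)
    baseline = predict_error(oob_samples)
    scores = {}
    for k in features:
        values = [s[k] for s in oob_samples]
        rng.shuffle(values)  # permute feature k across the OOB samples
        permuted = [dict(s, **{k: v}) for s, v in zip(oob_samples, values)]
        scores[k] = predict_error(permuted) - baseline
    return scores  # higher score -> feature more relevant to the prediction

# Hypothetical OOB set and an error function that depends only on feature "a"
oob = [{"a": i, "b": 0} for i in range(10)]
def mse(samples):
    return sum((s["a"] - i) ** 2 for i, s in enumerate(samples)) / len(samples)
scores = permutation_importance(mse, oob, ["a", "b"])
```

Permuting the irrelevant feature "b" leaves the error unchanged, so its score is zero, while permuting "a" can only increase this error, yielding a non-negative score.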
In one example, training module 250 may calculate prediction error rate 262 based on calculating a consistency index (c-index). The consistency index of the OOB samples may be calculated based on a pairwise comparison of the estimated values of the model's cumulative hazard function (CHF) and the actual times of death between patients in the OOB samples. For each pair of patients, if at a given point in time the relative survival probabilities of the pair match the order of the pair's death times, the pair is concordant; otherwise, the pair is discordant. For example, if the CHF estimate for a first patient of the pair is higher than the CHF estimate for a second patient of the pair, and the first patient dies before the second patient, the pair is concordant. Otherwise, the pair is discordant. The c-index can be calculated based on the following equation:

c-index = (number of concordant pairs + 0.5 × number of tied pairs) / (number of permissible pairs) (Equation 1)
The prediction error rate can then be calculated as one minus the c-index (1 − c-index). Since the probability of survival may vary over time, the prediction error rate, and thus the raw importance score, may also vary over time. Thus, as shown in fig. 2E, the feature importance metric 260 may include different raw importance scores 264 for different times 266.
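The pairwise concordance computation can be sketched as follows; the risk scores stand in for the model's CHF estimates, and the handling of ties and non-comparable pairs is a simplifying convention rather than a definitive implementation.

```python
def concordance_index(risk, death_time):
    """c-index over all comparable patient pairs.

    A pair is concordant when the patient with the higher risk (e.g., CHF
    estimate) dies earlier; ties in risk count as half-concordant.
    """
    concordant, permissible = 0.0, 0
    n = len(risk)
    for i in range(n):
        for j in range(i + 1, n):
            if death_time[i] == death_time[j]:
                continue  # simplification: skip pairs with tied death times
            permissible += 1
            # the earlier death should go with the higher risk score
            first, second = (i, j) if death_time[i] < death_time[j] else (j, i)
            if risk[first] > risk[second]:
                concordant += 1.0
            elif risk[first] == risk[second]:
                concordant += 0.5
    return concordant / permissible

# Hypothetical example: risks perfectly ordered with death times
error_rate = 1.0 - concordance_index([3.0, 2.0, 1.0], [1, 2, 3])
```

Because every pair here is concordant, the c-index is 1.0 and the prediction error rate (1 − c-index) is 0.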
C. Similarity-based patient pooling
Based on the feature importance metrics 260 and the patient data 208 of the patient 210, the patient pool determination module 204 may identify a patient pool and its patient data 216 from the patient database 214 that have similar attributes to the patient 210. Fig. 2F illustrates exemplary internal components and operation of the patient pool determination module 204. As shown in fig. 2F, the patient pool determination module 204 includes a feature weight selection module 270 and a similarity determination module 272.
Feature weight selection module 270 may rank the features by feature importance value 260 and select the x features with the highest feature importance values (x may be a predetermined number, e.g., 20, or may be based on a rule, e.g., all features whose importance values are greater than the average importance value across all features). The top-ranked set of x features may be denoted as E. The feature weight selection module 270 may then fit an RSF using only the features in E and recalculate the feature importance values for these features based on this new RSF. These new raw feature importance values, denoted Ik for feature k, are scaled to obtain a weight wk for each feature according to the following equation:

wk = Ik / Σj∈E Ij, such that Σk∈E wk = 1 and wk > 0 for k ∈ E (Equation 2)
The scaled feature importance values wk are then used as weights in the similarity determination module 272.
The similarity determination module 272 may then identify patients similar to the patient 210 in the patient database 214 based on the scaled feature importance values/weights 274. The similarity determination module 272 may determine a weighted aggregate similarity s(xi, xj) between two patients xi and xj based on the following equation:

s(xi, xj) = Σk wk sijk (Equation 3)
In Equation 3, sijk represents the degree of similarity between the two patients xi and xj in terms of feature k, while wk is the scaled feature importance value 274. The degrees of similarity of more important features are thus associated with greater weights. In the case where feature k is represented by a binary value or a one-hot encoded vector, the degree of similarity sijk is one if the feature k of both patients is one, or if the one-hot vectors match exactly; otherwise sijk is zero. Further, in the case where the value of feature k is taken from a range Rk, the degree of similarity sijk can be calculated based on the following equation, where xik denotes the value of feature k for patient xi:

sijk = 1 − |xik − xjk| / Rk (Equation 4)
The similarity determination module 272 may calculate the weighted aggregate similarity s(xi, xj) between the patient 210 and each patient represented in the patient database 214 using Equation 3, and select a similarity-based patient pool based on the weighted aggregate similarity. The similarity-based patient pool can be considered similar to the patient not only in terms of health status, but also in terms of the features that are most relevant to the clinical prediction. In one example, the similarity determination module 272 may select the similarity-based patient pool based on the degree of similarity with the patient 210, calculated according to Equations 3 and 4, exceeding the similarity threshold 280. In another example, the similarity determination module 272 may select, as the similarity-based patient pool, a predetermined number of patients with the highest degree of similarity to the patient 210, the number being defined based on the pool size threshold 282.
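Equations 3 and 4 and the top-N pool selection can be sketched as follows; the feature names, weights, and ranges are hypothetical, and the binary rule follows the text (similarity is one only when both patients' feature is one).

```python
def similarity(patient_a, patient_b, weights, ranges):
    """Weighted aggregate similarity s(xi, xj) = sum_k wk * sijk (Equation 3)."""
    total = 0.0
    for k, w in weights.items():
        if k in ranges:
            # Continuous feature with known range Rk (Equation 4)
            s_ijk = 1.0 - abs(patient_a[k] - patient_b[k]) / ranges[k]
        else:
            # Binary/one-hot feature: similarity is one only when both are one
            s_ijk = 1.0 if patient_a[k] == 1 and patient_b[k] == 1 else 0.0
        total += w * s_ijk
    return total

def top_n_pool(new_patient, database, weights, ranges, n):
    """Select the n database patients most similar to the new patient."""
    ranked = sorted(database,
                    key=lambda p: similarity(new_patient, p, weights, ranges),
                    reverse=True)
    return ranked[:n]

# Hypothetical features: one continuous (age, range 50) and one binary (EGFR)
weights = {"age": 0.5, "egfr_positive": 0.5}  # scaled importance values, sum to 1
ranges = {"age": 50.0}
new_patient = {"age": 60.0, "egfr_positive": 1}
database = [
    {"age": 35.0, "egfr_positive": 0},
    {"age": 60.0, "egfr_positive": 0},
    {"age": 58.0, "egfr_positive": 1},
]
pool = top_n_pool(new_patient, database, weights, ranges, n=1)
```

With these hypothetical weights, the third database patient (EGFR-positive, age close to the new patient's) scores highest and forms the pool.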
The similarity determination module 272 may then obtain the attributes and medical outcomes of the similarity-based patient pool and output them as part of the patient data 216 to facilitate clinical decisions for the patient 210. For example, referring back to fig. 1C, the similarity determination module 272 may identify a similarity-based patient pool, for which the K-M curve 124 may be generated, as well as the patient data 216. Portal 205 may compare the patient data 216 of the patient pool, the patient data 208, and the patient data of the training patient set in the patient database 214 for each feature present in the respective sets of data, and output the comparison. Based on the comparison, the clinician can determine that the EGFR positivity of the patient pool is much higher than that of the training patient set, and can conduct further studies on EGFR (e.g., EGFR-targeted therapies) to improve the patient's survival.
III. Method
Fig. 3 illustrates an example of a method 300 of facilitating clinical decisions. The method 300 may be performed by, for example, the clinical decision support tool 200 of fig. 2A.
In step 302, the clinical decision support tool may receive first data corresponding to a plurality of features of a first patient (e.g., a new patient), wherein each feature represents an attribute of a plurality of attributes. The first data may be entered through a computer interface, such as portal 205, or directly from a patient database, such as patient database 214. The first patient may be a new patient such as patient 210.
Referring to fig. 2A, the first data may include patient data 208 corresponding to attributes of the new patient. The attributes may include various archival information, such as the age and gender of the patient. Each attribute may be represented as one or more features, each of which may be represented as a vector for input into the machine learning model. The attributes may also include a patient's medical history (e.g., which treatment(s) the patient has received), the patient's habits (e.g., whether the patient smokes), and categories of laboratory test results for the patient (e.g., white blood cell count, hemoglobin, platelet count, hematocrit, red blood cell count, creatinine, lymphocyte count, and measurements of protein, bilirubin, calcium, sodium, potassium, and glucose). The attributes may also indicate measurements of various biomarkers for different cancer types, such as estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2), epidermal growth factor receptor (EGFR or HER1), anaplastic lymphoma kinase (ALK) for lung cancer, the KRAS gene for lung and colorectal cancer, the BRAF gene for colorectal cancer, and the like.
In step 304, the clinical decision support tool may input the first data to a machine learning model to generate a result of the clinical prediction for the first patient, the machine learning model being associated with a plurality of feature importance metrics defining a relevance of each of the plurality of features to the clinical prediction.
Referring to fig. 2A, the clinical decision support tool may include a machine learning predictive model 206 to make clinical predictions 212 based on the patient data. The machine learning model 206 may include the RSF model 230 of fig. 2C, which may output a predicted survival function as a clinical prediction based on the patient data. Clinical predictions 212 may include, for example, a patient survival probability. The survival probability may indicate a likelihood that the patient survives for a predetermined time (e.g., 500 days, 1000 days, 1500 days, etc.) after the patient is diagnosed with a medical condition (e.g., advanced cancer). As described above, in some examples, the machine learning predictive model 206 may include a plurality of RSF models configured to predict the survival probability up to different predetermined times. Based on an input time, one of the RSF models may be selected to predict the survival probability up to that time.
In addition, the machine learning predictive model 206 is also associated with a plurality of feature importance metrics, such as feature importance metric 260. Referring to fig. 2E, feature importance metrics 260 may define the relevance of each feature by studying its impact on the error of the machine learning predictive model 206, and may be determined by the training module 250 based on a set of out-of-bag (OOB) samples of the patient data 252 from the patient database 214. The OOB samples may include samples of patient data not involved in the bagging process used to build the RSF model 230. For a given feature, the values of the feature may be randomly permuted across the OOB samples, and a prediction error rate may be obtained from each decision tree that processes the OOB samples with the permuted values of the feature. The prediction error rate may be calculated based on the consistency index (c-index) of Equation 1 above. For this feature, the raw importance score may be calculated based on, for example, the average, over the decision trees, of the resulting increase in the prediction error rate. This process may be repeated for each feature to calculate a respective raw importance score for each feature. A high raw importance score may indicate that the feature is more relevant to the survival prediction, while a low raw importance score may indicate that the feature is less relevant. At the end of the training operation, the features may be ranked based on their importance scores, with features having higher relevance ranked higher.
In step 306, the clinical decision support tool may obtain second data corresponding to the plurality of features of each patient in the patient group based on a degree of similarity between the first patient and the patient group in terms of at least some of the plurality of features, the degree of similarity based on the first data, the second data, and the plurality of feature importance metrics.
Specifically, the second data may be obtained by the patient pool determination module 204 of the clinical decision support tool 200, which includes a feature weight selection module 270 and a similarity determination module 272. Referring to fig. 2F, the feature weight selection module 270 may rank the features by their feature importance metrics 260 and select the x features with the highest feature importance values (x may be a predetermined number, e.g., 20, or may be based on a rule, e.g., all features whose importance values are greater than the average importance value across all features). The top-ranked set of x features may be denoted as E. The feature weight selection module 270 may then fit an RSF using only the features in E and recalculate the feature importance values of these features based on this new RSF. The degree of similarity between the first patient and the other patients may be calculated based on equations 3 and 4 above. In a first example, a patient may be selected based on the patient's degree of similarity to the first patient exceeding a threshold. In a second example, the patients in the database are ranked according to their degree of similarity to the first patient, and a predetermined number of the highest-ranked patients are selected.
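The weighted similarity and top-k selection can be sketched as below. The per-feature similarity used here (1 minus the absolute difference on a normalized scale) is an illustrative stand-in for the patent's equations 3 and 4, which are not reproduced in this excerpt; all names and data are assumptions.

```python
import numpy as np

def weighted_similarity(x_new, x_other, weights):
    """Aggregate similarity: per-feature similarities scaled by feature
    importance weights and summed (weights normalized to sum to 1)."""
    per_feature = 1.0 - np.abs(np.asarray(x_new, float) - np.asarray(x_other, float))
    w = np.asarray(weights, dtype=float)
    return float(np.dot(w / w.sum(), per_feature))

def select_pool(x_new, cohort, weights, top_k):
    """Rank database patients by similarity to the new patient and keep the
    top_k most similar (the second selection rule described in the text)."""
    sims = [weighted_similarity(x_new, row, weights) for row in cohort]
    order = sorted(range(len(cohort)), key=lambda i: sims[i], reverse=True)
    return order[:top_k]

# Features assumed scaled to [0, 1]; feature 0 is twice as important.
cohort = [[0.10, 0.20], [0.90, 0.90], [0.12, 0.25]]
pool = select_pool([0.10, 0.20], cohort, weights=[2.0, 1.0], top_k=2)
```

The first selection rule maps onto the same machinery: keep every index whose similarity exceeds a threshold instead of slicing the top k.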
In step 308, the clinical decision support tool may generate content based on the results of the clinical prediction and at least a portion of the second data.
Specifically, in some examples, the content may include summary statistics of the patient pool (the patient group), such as the median survival time, Kaplan-Meier (K-M) survival curves of the patient pool, and the like. In some examples, a comparison may be made between the patient data of the patient group, the patient data of the first patient, and the patient data of a training patient set (e.g., the patients represented by the training data set used to train the machine learning model) to generate a comparison result.
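These pool summaries could be produced with off-the-shelf tools (e.g., lifelines); as a self-contained sketch, the product-limit (Kaplan-Meier) estimator and a median-survival summary look like the following. Function names and the toy data are illustrative, not from the patent.

```python
def kaplan_meier(times, events):
    """Product-limit estimator for a patient pool.  Returns (time, S(t))
    pairs at each distinct event time; events is True for deaths and
    False for censored observations."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    s, curve, i = 1.0, [], 0
    while i < len(order):
        t = times[order[i]]
        deaths = censored = 0
        while i < len(order) and times[order[i]] == t:
            if events[order[i]]:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths:
            s *= 1.0 - deaths / at_risk
            curve.append((t, s))
        at_risk -= deaths + censored
    return curve

def median_survival(curve):
    """First time the K-M curve drops to 0.5 or below (a common pool
    summary statistic); None if it never does."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None
```

A tool could render the curve for the patient pool next to the curve for the full training patient set to support the comparison described above.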
In step 310, the clinical decision support tool may output the content so that a clinical decision can be made for the first patient based on the content. For example, referring back to fig. 1C, the content may indicate that the EGFR positivity of the patient pool is much higher than that of the training patient set, and further studies on EGFR (e.g., treatment targeting EGFR) may be performed to improve the patient's survival.
IV. computer system
Any of the computer systems mentioned herein may utilize any suitable number of subsystems. An example of such subsystems is shown in computer system 10 of fig. 4. In some embodiments, the computer system comprises a single computer device, wherein the subsystems may be components of the computer device. In other embodiments, a computer system may include multiple computer devices, each being a subsystem with internal components. Computer systems may include desktop and portable computers, tablet computers, mobile phones, and other mobile devices. In some embodiments, a cloud infrastructure (e.g., Amazon Web Services), a graphics processing unit (GPU), or the like may be used to implement the disclosed techniques.
The subsystems shown in fig. 4 are interconnected via a system bus 75. Additional subsystems such as a printer 74, keyboard 78, storage device(s) 79, monitor 76 (which is coupled to display adapter 82), and others are shown. Peripherals and input/output (I/O) devices, which couple to I/O controller 71, can be connected to the computer system by any number of means known in the art, such as an input/output (I/O) port 77 (e.g., USB). For example, I/O port 77 or external interface 81 (e.g., Ethernet, Wi-Fi, etc.) can be used to connect computer system 10 to a wide area network such as the Internet, a mouse input device, or a scanner. The interconnection via system bus 75 allows the central processor 73 to communicate with each subsystem and to control the execution of a plurality of instructions from system memory 72 or the storage device(s) 79 (e.g., a fixed disk, such as a hard drive, or an optical disk), as well as the exchange of information between subsystems. The system memory 72 and/or the storage device(s) 79 may comprise a computer-readable medium. Another subsystem is a data collection device 85, such as a camera, microphone, accelerometer, and the like. Any of the data mentioned herein can be output from one component to another component and can be output to the user.
The computer system may include multiple identical components or subsystems, connected together, for example, through external interfaces 81 or through internal interfaces. In some embodiments, the computer systems, subsystems, or devices may communicate over a network. In this case, one computer may be regarded as a client and another computer may be regarded as a server, wherein each computer may be regarded as a part of the same computer system. The client and server may each include multiple systems, subsystems, or components.
Aspects of the embodiments may be implemented in a modular or integrated manner, in the form of control logic, using hardware (e.g., an application specific integrated circuit or field programmable gate array) and/or using computer software with a general programmable processor. As used herein, a processor includes a single-core processor, a multi-core processor on the same integrated chip, or multiple processing units on a single circuit board or networked. Based on the disclosure and teachings provided herein, one of ordinary skill in the art will know and appreciate other ways and/or methods of implementing embodiments of the invention using hardware and combinations of hardware and software.
Any of the software components or functions described in this application may be implemented as software code executed by a processor using any suitable computer language, such as, for example, java, C, C++, C#, objective-C, swift, or a scripting language, such as Perl or Python, using, for example, conventional or object-oriented techniques. The software code may be stored as a series of instructions or commands on a computer readable medium for storage and/or transmission. Suitable non-transitory computer readable media may include Random Access Memory (RAM), read Only Memory (ROM), magnetic media such as a hard disk drive or floppy disk, or optical media such as Compact Disk (CD) or DVD (digital versatile disk), flash memory, etc. The computer readable medium may be any combination of such storage or transmission means.
Such programs may also be encoded and transmitted using carrier signals adapted for transmission via a wired, optical, and/or wireless network conforming to various protocols including the internet. As such, a computer readable medium may be created using a data signal encoded with such a program. The computer readable medium encoded with the program code may be packaged with a compatible device or provided separately from other devices (e.g., downloaded over the internet). Any such computer-readable medium may reside on or within a single computer product (e.g., a hard drive, CD, or entire computer system), and may reside on or within different computer products within a system or network. The computer system may include a monitor, printer, or other suitable display for providing any of the results mentioned herein to the user.
V. examples
Fig. 5 illustrates one example of how patient data 500 from a patient pool may be used. Building the patient data from a patient pool involves cohort construction, and any suitable type of cohort construction method may be used, such as the similarity-based patient pooling method described herein. For example, the methods of generating patient data from similarity-based patient pools described in connection with figs. 2A and 2F may be used in the examples described herein. Other methods of cohort construction may also be applicable.
As shown in fig. 5, patient data 500 from a patient pool may be accessed, processed, and/or used by a disease journey information tool 502 to automatically extract useful data about a patient's disease journey from the patient's electronic health records and/or other patient databases. In some embodiments, the disease journey information tool 502 may include a patient care information extraction module 504, a patient health extraction module 506, and an additional patient treatment and services module 508. Various embodiments of the disease journey information tool 502 may include any combination of one or more of the modules described herein.
In some embodiments, the patient care information extraction module 504 includes an algorithm for extracting information about the order in which the patients in the cohort received care. This extracted information may be displayed to a user to help the user learn from the disease journeys of a particular cohort. This extracted information may also be used for risk factor analysis of the cohort.
In some embodiments, the patient health extraction module 506 may include an algorithm for extracting information about the health status of patients from the patient data. For example, metrics of health status include patient-reported outcomes or experiences, such as reported symptoms, disabilities, various aspects of health status, health perceptions, and the like. In some embodiments, the user may be provided with a more customized display that may be used to view patient health status at both the individual level and the cohort level. For example, the user may be provided with overall statistics on how the identified patients in the cohort report on their treatment.
In some embodiments, the additional patient treatment and services module 508 may include algorithms for extracting information about additional non-medical services (i.e., non-medicinal services) such as, for example, rehabilitation, psychotherapy, physical therapy, and occupational therapy. In some embodiments, this extracted information may be used to determine which additional non-medical interventions benefit a particular patient cohort, such as rehabilitation clinics, psychotherapy, physical therapy, and/or occupational therapy.
Fig. 6 illustrates another example of how patient data 600 from a patient pool may be used. As shown in fig. 6, patient data 600 from a patient pool may be accessed, processed, and used by a suggestion tool 602 to suggest common terminology for entry into the data fields of a spreadsheet or record. In some embodiments, the suggestion tool 602 may include a common Electronic Medical Record (EMR) term extraction module 604 and a common diagnostic test extraction module 606. Various embodiments of the suggestion tool 602 may include any combination of one or more of the modules described herein.
In some embodiments, the common EMR term extraction module 604 may include an algorithm for extracting commonly used terms for data fields in an EMR system, and these extracted common terms may then be used as suggestions for a user who is filling in an EMR. For example, in some embodiments, one or more of the data fields in the EMR being filled in by the user may be automatically populated with a preselected text field corresponding to the most frequent commonly used term extracted from the cohort. In some embodiments, when the text field is selected, the user may instead be provided with an ordered list of commonly used terms, where the list is ordered based on how frequently the terms appear in the cohort. In some embodiments, common EMR terms may be extracted from the EMRs of the pooled patient pool. In other embodiments, common EMR terms may be extracted from a broader EMR data set formed from a larger patient pool. In some embodiments, common EMR terms may be extracted from one or more EMR data sets.
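A frequency-ranked term suggestion of the kind described could look like the sketch below. The function name, the field, and the example entries are all invented for illustration; a real module 604 would read the cohort's EMR data set.

```python
from collections import Counter

def suggest_terms(cohort_field_values, prefix="", limit=5):
    """Rank the terms entered for one EMR data field across a cohort by
    frequency, optionally filtered by what the user has typed so far."""
    counts = Counter(v.strip().lower() for v in cohort_field_values)
    ranked = [term for term, _ in counts.most_common()
              if term.startswith(prefix.lower())]
    return ranked[:limit]

# Invented values for a medication field across a cohort's EMRs.
entries = ["Metformin", "metformin ", "Insulin", "metformin", "Insulin glargine"]
top = suggest_terms(entries)             # most frequent term first
matches = suggest_terms(entries, "ins")  # narrowed as the user types
```

The first return value supports the auto-fill behavior (take `top[0]`); the second supports the ordered suggestion list shown when the text field is selected.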
In some embodiments, the common diagnostic test extraction module 606 may include algorithms for extracting common diagnostic tests for a diagnostic test suggestion system. In some embodiments, the data set used to extract the common diagnostic tests may be limited to data from the cohort's patient pool. In some embodiments, the methods of generating patient data from similarity-based patient pools described in connection with figs. 2A and 2F may be used to construct a cohort having characteristics similar to those of a particular patient. The suggestion system identifies diagnostic tests performed for the patients in the cohort. The system may use the extracted information to suggest such tests for consideration for the particular patient.
In some embodiments, the algorithms used by the data extraction module described herein may be, but are not limited to, process mining algorithms, deep learning algorithms, and sequence alignment methods.
Any of the methods described herein may be performed in whole or in part with a computer system comprising one or more processors, which may be configured to perform the steps. Thus, embodiments may be directed to computer systems configured to perform the steps of any of the methods described herein, potentially with different components performing respective steps or respective groups of steps. Although presented as numbered steps, the steps of the methods herein may be performed at the same time or in a different order. Additionally, portions of these steps may be used with portions of steps from other methods. Also, all or portions of a step may be optional. Additionally, any of the steps of any of the methods may be performed with modules, units, circuits, or other means for performing these steps.
The particular details of the particular embodiments may be combined in any suitable manner without departing from the spirit and scope of the embodiments of the invention. However, other embodiments of the invention may be directed to specific embodiments relating to each individual aspect, or specific combinations of these individual aspects.
The foregoing description of the exemplary embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the above teaching.
A recitation of "a," "an," or "the" is intended to mean "one or more" unless specifically indicated to the contrary. The use of "or" is intended to mean an "inclusive or," and not an "exclusive or," unless specifically indicated to the contrary. Reference to a "first" component does not necessarily require that a second component be provided. Moreover, reference to a "first" or a "second" component does not limit the referenced component to a particular location unless expressly stated.
All patents, patent applications, publications, and descriptions mentioned herein are incorporated by reference in their entirety for all purposes. None is admitted to be prior art.
Claims (18)
1. A computer-implemented method of facilitating clinical decisions, the computer-implemented method comprising:
receiving first data corresponding to a plurality of features of a first patient, each feature representing an attribute of a plurality of attributes;
inputting the first data to a machine learning model to generate a result of a clinical prediction for the first patient, the machine learning model being associated with a plurality of feature importance metrics defining a relevance of each of the plurality of features to the clinical prediction;
obtaining second data corresponding to the plurality of features of each patient in a patient group based on a degree of similarity between the first patient and the patient group with respect to at least some of the plurality of features, the degree of similarity being based on the first data, the second data, and the plurality of feature importance metrics;
generating content based on the results of the clinical prediction and at least a portion of the second data; and
outputting the content for making a clinical decision for the first patient based on the content.
2. The method of claim 1, wherein the plurality of attributes comprises at least one of: patient profile data, results of one or more laboratory tests of the first patient, biopsy image data of the first patient, molecular biomarkers of the first patient, tumor sites of the first patient, or tumor stage of the first patient.
3. The method of claim 1, wherein the plurality of attributes comprises one or more attributes representing measurements of biomarkers for different cancer types.
4. The method of claim 1, wherein the clinical prediction comprises at least one of: the probability of survival of the first patient at a predetermined time since the first patient was diagnosed with a tumor, the survival time of the first patient since the first patient was diagnosed with a tumor, or the outcome of treatment.
5. The method of claim 4, wherein the machine learning model comprises a random forest survival model comprising f decision trees, each decision tree configured to process a subset of the first data to generate a cumulative survival probability; and
wherein the survival probability of the first patient at the predetermined time is determined based on an average of the cumulative survival probabilities output by the plurality of decision trees.
6. The method of claim 1, wherein the patient group is a first patient group;
Wherein the first patient group is selected from a second patient group; and
Wherein the machine learning model is trained based on patient data of the second patient group.
7. The method as recited in claim 6, further comprising:
ranking the plurality of features based on the relevance of each feature of the plurality of features to the clinical prediction;
determining a subset of the plurality of features based on the ranking; and
determining the first patient group based on a degree of similarity between the first patient and the first patient group with respect to the subset of the plurality of features.
8. The method of claim 7, wherein the first patient group is selected from the second patient group based on the degree of similarity between the first patient and the first patient group with respect to the subset of the plurality of features exceeding a threshold.
9. The method of claim 7, wherein the first patient group is selected from the second patient group based on selecting a threshold number of patients having a highest degree of similarity to the first patient with respect to the subset of the plurality of features.
10. The method as recited in claim 1, further comprising:
calculating a weighted aggregate similarity degree based on summing the scaled similarity degrees for each of the at least some of the plurality of features, each similarity degree scaled by a weight based on the relevance of the feature; and
identifying the patient group based on the weighted aggregate similarity degree between the first patient and each patient in the patient group.
11. The method of claim 1, wherein a feature importance metric of a feature is determined based on a relationship between errors in results of clinical predictions generated by the machine learning model for a second patient in the first patient group;
Wherein the result of clinical prediction is generated from a plurality of values of the characteristic of the second patient; and
Wherein the error is calculated based on comparing the result of the clinical prediction to an actual clinical outcome of the second patient.
12. The method of claim 6, wherein the content comprises at least one of: the median survival time of the first patient group, or the Kaplan-Meier survival curve of the first patient group.
13. The method of claim 6, wherein the content comprises values for one or more of the first subset of the plurality of features of the first patient, the first patient group, and the second patient group.
14. A computer product comprising a computer readable medium storing a plurality of instructions for controlling a computer system to perform the operations of any of the methods described above.
15. A system, comprising:
the computer product of claim 14; and
One or more processors configured to execute instructions stored on the computer-readable medium.
16. A system comprising means for performing any of the methods described above.
17. A system configured to perform any of the above methods.
18. A system comprising modules that individually perform the steps of any of the methods described above.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US63/362373 | 2022-04-01 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118974839A (en) | 2024-11-15
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11664126B2 (en) | Clinical predictor based on multiple machine learning models | |
Wosiak et al. | Integrating Correlation‐Based Feature Selection and Clustering for Improved Cardiovascular Disease Diagnosis | |
Linden et al. | Modeling time‐to‐event (survival) data using classification tree analysis | |
Verburg et al. | Comparison of regression methods for modeling intensive care length of stay | |
US7809660B2 (en) | System and method to optimize control cohorts using clustering algorithms | |
JP2022514162A (en) | Systems and methods for designing clinical trials | |
JP2018524137A (en) | Method and system for assessing psychological state | |
US11915827B2 (en) | Methods and systems for classification to prognostic labels | |
Harris et al. | Advances in conceptual and methodological issues in symptom cluster research: a 20-year perspective | |
US20210397996A1 (en) | Methods and systems for classification using expert data | |
EP3797423A1 (en) | System and method for integrating genotypic information and phenotypic measurements for precision health assessments | |
Mlakar et al. | Mining telemonitored physiological data and patient-reported outcomes of congestive heart failure patients | |
Mansouri et al. | Predicting hospital length of stay of neonates admitted to the NICU using data mining techniques | |
Guzman-Castillo et al. | A tutorial on selecting and interpreting predictive models for ordinal health-related outcomes | |
Hong et al. | Predicting risk of mortality in pediatric ICU based on ensemble step-wise feature selection | |
US20130253892A1 (en) | Creating synthetic events using genetic surprisal data representing a genetic sequence of an organism with an addition of context | |
JP6737519B1 (en) | Program, learning model, information processing device, information processing method, and learning model generation method | |
CN118974839A (en) | Patient pooling based on machine learning model | |
WO2023187139A1 (en) | Patient pooling based on machine learning model | |
JP6777351B2 (en) | Programs, information processing equipment and information processing methods | |
Santos | Breast cancer survival prediction using machine learning and gene expression profiles | |
US20230253115A1 (en) | Methods and systems for predicting in-vivo response to drug therapies | |
Mariam et al. | Unsupervised clustering of longitudinal clinical measurements in electronic health records | |
US12142380B2 (en) | Method and an apparatus for building a longevity profile | |
CN118352007B (en) | Disease data analysis method and system based on crowd queue multiunit study data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |