

Deep reinforcement learning-based lung cancer patient medication prediction method and device

Info

Publication number
CN117275661A
Authority
CN
China
Prior art keywords: patient, medication, data, lung cancer, state
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311567874.0A
Other languages
Chinese (zh)
Other versions
CN117275661B (en)
Inventor
常云青
冯秀芳
董云云
白玉洁
张源榕
杨炳乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Application filed by Taiyuan University of Technology
Priority to CN202311567874.0A
Publication of CN117275661A
Application granted
Publication of CN117275661B
Legal status: Active

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 20/00 - ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
    • G16H 20/10 - ICT specially adapted for therapies or health-improving plans relating to drugs or medications, e.g. for ensuring correct administration to patients
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/092 - Reinforcement learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 10/00 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H 10/60 - ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H - HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/50 - ICT specially adapted for medical diagnosis, medical simulation or medical data mining for simulation or modelling of medical disorders
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Pathology (AREA)
  • Chemical & Material Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Medicinal Chemistry (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a method and a device for predicting the medication of a lung cancer patient based on deep reinforcement learning, belonging to the technical field of lung cancer patient medication prediction. The technical problem to be solved is to provide such a deep reinforcement learning-based medication prediction method and device for lung cancer patients. The technical scheme adopted is as follows: collecting lung cancer patient data information, extracting the vital signs and related medical history of a lung cancer patient within a period of time and preprocessing them to construct a patient data set; constructing a patient-based environment model from the collected data, the model being used to simulate the reward mechanism of drug effects on the patient's body and comprising a patient state, a medication action space, a reward function, a transition model and an initial state; and constructing a network model comprising an online network and a target network for calculating the adjustment value of each possible drug regimen for the current state of the patient. The method is applied to medication prediction for lung cancer patients.

Description

Deep reinforcement learning-based lung cancer patient medication prediction method and device
Technical Field
The invention provides a method and a device for predicting the medication of a lung cancer patient based on deep reinforcement learning, and belongs to the technical field of lung cancer patient medication prediction.
Background
Deep reinforcement learning is a technique that combines deep learning with reinforcement learning; by simulating and learning from behaviors and their outcomes in an environment, it can optimize an agent's decision process. In the field of personalized medicine it can be applied to medical diagnosis, treatment scheme design, health management and the like, providing more accurate and effective medical services for patients.
Personalized medicine is a medical model that starts from the individual differences of patients and is based on each patient's unique genetic, physiological and psychological characteristics. Traditional medical models, by contrast, usually consider only general rules and ignore the individual differences between patients, so their prediction effect is poor: they can neither discover a patient's disease pattern nor predict the optimal medicine adjustment scheme. For lung cancer patients in particular, the medication situation must be predicted by analyzing a large amount of medical data together with the individual patient's lung cancer characteristics, and the traditional medical model currently in use cannot meet this prediction requirement.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and solves the following technical problem: providing a method and a device for predicting the medication of a lung cancer patient based on deep reinforcement learning.
In order to solve the technical problems, the invention adopts the following technical scheme: a lung cancer patient medication prediction method based on deep reinforcement learning comprises the following medication prediction steps:
step S1: collecting lung cancer patient data information;
step S2: extracting vital signs and related medical histories of a patient suffering from lung cancer for a period of time and preprocessing the vital signs and the related medical histories to construct a patient data set;
step S3: constructing a patient-based environment model using the collected data for simulating the reward mechanism of drug effects on the patient's body, comprising: a patient state, a medication action space, a reward function, a transition model, and an initial state;
step S4: setting up a network model comprising an online network and a target network, and calculating an adjustment value of each medication scheme under the current state of a patient;
step S5: taking the collected patient history treatment data as input, and outputting a predicted drug adjustment scheme;
step S6: updating the network parameters by using a stochastic gradient descent method;
step S7: through constant interaction with the environment, training and learning are carried out multiple times to achieve the goal of reward maximization, and the medication type and dosage adjustment scheme suitable for the patient is predicted and output.
The specific method for collecting the data information of the lung cancer patient in the step S1 comprises the following steps:
step S11: collecting personal basic information, medical history, physiological data and drug treatment scheme data of a lung cancer patient;
step S12: collecting data on the relationship between the medication type, dose and treatment effect of the lung cancer patient, comprising: data on the changes in the patient's lung tumor size when the patient takes different drug types and different doses.
The specific method for constructing the patient data set in the step S2 is as follows:
step S21: screening lung cancer patient data of a set age group;
step S22: preprocessing the screened data, including: removing repeated data, processing missing values and processing abnormal values;
step S23: storing the data obtained in the previous step and then dividing it into a training set and a testing set at a ratio of 8:2.
The specific method for constructing the environment model based on the patient in the step S3 is as follows:
step S31: determining a patient state space S, including tumor size, pathological stage and physiological index data;
features defining the patient's visit status include: demographics, medical history, disease risk, historical medication, laboratory data and physical measurements; a state space is established from this information to obtain a multidimensional state vector;
step S32: determining a medicine action space A, including adjustment of medicine types and dosages thereof;
according to the type and dose of the patient's historical medication, the medication adjustment scheme is determined, wherein the action space A comprises a four-dimensional vector: 0 no prescription change, 1 increase the drug dose, 2 decrease the drug dose, 3 replace the drug;
step S33: determining a reward function R: a reasonable reward function is designed according to the patient's condition, drug dose and treatment-effect factors, feeding back the changes of the body's various indexes when the patient takes different drugs and doses so as to mark tumor improvement or deterioration;
step S34: establishing, based on the patient's historical medication data, a transition probability model P(s_{t+1} | s_t, a_t) for each medication action in the current patient state, i.e. calculating the probability P of transitioning to the next state s_{t+1} after taking medication strategy a_t in the current patient state s_t, and using an ε-greedy strategy to balance exploitation and exploration, maximizing the expected benefit at the current moment;
step S35: for the state of the patient at the beginning of the treatment, an initial state is determined based on the patient's basic condition and medical history.
The specific method for constructing the network model to calculate the adjustment value of the medication scheme of the patient in the step S4 is as follows:
step S41: setting up an online network for calculating the adjustment value of each personalized medication scheme under the current physical state of the patient, and updating the optimal scheme according to the adjustment value;
the parameter weight of the online network is updated in the process of each iteration to minimize the difference between the predicted value and the target value in the current state;
step S42: calculating a target value according to the action in the current state and the maximum adjustment value in the next state;
step S43: constructing a neural network model to calculate an action-value function, denoted Q(s, a), which estimates the cumulative reward of each adjustment scheme in the current treatment state, i.e. the expected return value Q caused by the change of the patient's physical indexes after taking medication strategy a in state s;
step S44: training the DQN model using two structurally identical neural networks, an online network Q(s, a; θ) and a target network Q(s, a; θ⁻), the online network being used for obtaining the optimal dosing action decision a* = argmax_a Q(s, a; θ) and being trained with the loss function L;
wherein s is the current state of the patient, a is the patient's current medication strategy, θ is the online network parameter, argmax_a selects the action a with the maximum Q value in the given state s, Q(s, a; θ) is the expected return value of the patient's physical-index change in the online network, Q(s, a; θ⁻) is the expected return value of the patient's physical-index change in the target network, and L is the loss function computing the difference between the two for network training;
the expected action value y = r + γ·max_{a'} Q(s', a'; θ⁻) is estimated using the target network to calculate the loss function L = (y - Q(s, a; θ))², and the target network parameter θ⁻ is updated by slowly tracking the online network parameter θ in each training iteration, finally obtaining the optimal personalized medicine adjustment scheme suitable for the patient.
The specific method for outputting the predicted drug adjustment scheme in the step S5 is as follows:
step S51: forming a high-dimensional vector from five kinds of information (the patient's condition, physiological indexes, laboratory examination results, imaging examination results and medication status) as the input data of the model;
step S52: the output of the model is a drug adjustment regimen based on the patient's condition, covering four options: 0 indicates no prescription change, 1 indicates increasing the drug dose, 2 indicates decreasing the drug dose, and 3 indicates replacing the drug.
The device for realizing the lung cancer patient medication prediction method based on deep reinforcement learning comprises an acquisition computer for collecting lung cancer patient data information, a data server for collecting and storing the data information, and a prediction server for building a network model, training learning and outputting a prediction scheme.
Compared with the prior art, the invention has the following beneficial effects: the invention provides a prediction method and device for personalized medication of lung cancer patients, based mainly on deep reinforcement learning with a policy optimization algorithm; a personalized environment model is built by collecting the patients' historical medication data, and the agent is trained to learn interactively with the environment, so that the future optimal medication scheme of a lung cancer patient is obtained by adjustment and prediction. The deep reinforcement learning method adopted by the invention has the perception capability of deep learning and the decision capability of reinforcement learning, can integrate perception, learning and decision into the same framework, and is suited to solving high-dimensional, time-series-based decision problems, so the agent can be trained to learn an optimal drug adjustment scheme from a cancer patient's treatment history, providing more accurate and effective medical services for lung cancer patients.
It should be noted that, all actions for acquiring signals, information or data in the present application are performed under the condition of conforming to the corresponding data protection rule policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a structural diagram of the drug prediction model employing the policy optimization algorithm;
FIG. 2 is a flow chart of the steps of the medication prediction method of the present invention.
Detailed Description
As shown in FIG. 1 and FIG. 2, the method for predicting the medication of a lung cancer patient based on deep reinforcement learning and a policy optimization algorithm adopted by the invention specifically comprises the following steps:
step S1: collecting lung cancer patient data information;
step S2: extracting vital signs and related medical histories of a patient suffering from lung cancer within 6 months, and preprocessing the vital signs and the related medical histories to construct a patient data set;
step S3: constructing a patient-based environment model using the collected data, the model being used to simulate the reward mechanism of drug effects on the patient's body and comprising a patient state, a medication action space, a reward function, a transition model and an initial state;
step S4: constructing a network model comprising an online network and a target network, the network model being used to calculate the Q value of each possible drug scheme adjustment under the current state of the patient;
step S5: taking the collected historical treatment data of the patient as input, and outputting a predicted drug adjustment scheme;
step S6: updating the network parameters by using a stochastic gradient descent method;
step S7: the agent performs training and learning multiple times and outputs the optimal personalized medicine adjustment scheme.
Further, the collecting lung cancer patient data information in step S1 includes:
step S11: collecting personal basic information, medical history, physiological data, drug treatment scheme and other relevant data of a lung cancer patient;
step S12: and collecting the relation data between the medicine types, the dosages and the treatment effects of the lung cancer patients.
Further, the personal basic information described in step S11 includes the patient's age, sex, race, smoking status and the like; the medical history includes complications, cancer complications, hospitalization history, emergency treatment history and the like; the physiological data include systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate, weight, height, BMI and the like; and the drug treatment regimens include chemotherapy, targeted therapy, immunotherapy and anti-vascular therapy.
Further, the relationship between the drug type, the dose and the therapeutic effect in step S12 includes the change of the physical index of the patient when the patient takes different drug types and doses, which is specifically represented by the change of the lung tumor size of the lung cancer patient.
Further, extracting and preprocessing vital signs of a lung cancer patient within 6 months as described in step S2 includes:
step S21: screening the data of lung cancer patients above 18 and below 75 years old;
step S22: preprocessing the screened data, with operations including removing duplicate data, processing missing values (for a missing physical measurement, the missing value is replaced with the value of the nearest data point of the same patient; if it is still missing, it is estimated using the median of that variable's observations over all patients without missing data) and processing outliers, so as to ensure the integrity and accuracy of the data;
step S23: storing the data obtained in the previous step and dividing it into a training set and a testing set at a ratio of 8:2 for training and evaluating the effectiveness of the drug prediction model.
Further, the constructing the patient-based environment model using the collected data in step S3 includes:
step S31: determining the patient state space S: the state space refers to the pathological condition of the patient, including tumor size, pathological stage, physiological indexes and the like. In the present invention, the features defining the patient's visit status include demographics, medical history, disease risk, historical medication, laboratory data and physical measurements; a state space is established from this information, yielding a 20-dimensional state vector;
step S32: determining the medication action space A: the action space refers to the actions that the agent, acting as an intelligent doctor, can take, i.e. adjustments of the drug type and its dose. According to the type and dose of the patient's historical medication, the invention determines the medication adjustment scheme, consisting of a four-dimensional vector: 0 no prescription change, 1 increase the drug dose, 2 decrease the drug dose, 3 replace the drug;
step S33: determining the reward function R: the reward function refers to the feedback reward obtained after the agent takes the corresponding action in the current state in the reinforcement learning algorithm. According to factors such as the patient's condition, the drug dose and the treatment effect, a reasonable reward function is designed to feed back the changes of the body's indexes when the patient takes different drugs and doses, marking tumor improvement or deterioration and assisting doctors in providing drug adjustment schemes for the patient;
step S34: establishing a transition model: the transition model refers to the transition probability between the state space and the action space, i.e. the probability of transitioning to the next state after taking a certain action in the current state. The invention establishes a probability model P(s_{t+1} | s_t, a_t) for transitioning under each medication action in the current patient state according to the patient's historical medication data, and uses an ε-greedy strategy to trade off exploitation and exploration, maximizing the expected benefit at the current moment. For example, if at time step t the patient has k possible medication selection strategies a_1, ..., a_k, the ε-greedy policy can be expressed as:

a_t = argmax_a Q_t(a) with probability 1 - ε, or a uniformly random medication strategy with probability ε;

wherein a_t denotes the medication strategy selected at time t, Q_t(a) denotes the cumulative reward caused by the change of the patient's physical indexes under a given strategy a, argmax_a picks the medication strategy with the maximum action value, and ε is the exploration-rate parameter: with probability ε a random medication strategy is selected, while with probability 1 - ε the greedy choice, i.e. the medication strategy with the maximum action value, is taken.
Step S35: determining an initial state: the initial state refers to a state of the patient at the time of starting treatment, and is determined according to the basic condition of the patient and the medical history.
Further, the network model in step S4 is a deep reinforcement learning model based on a policy optimization algorithm, specifically used to solve decision problems in high-dimensional spaces; the specific construction steps are as follows:
step S41: an online network is built for calculating the Q value of each personalized medicine adjustment scheme under the current physical state of the patient, and the optimal scheme is updated according to the Q value. Parameters (weights) of the online network are updated in the process of each iteration to minimize the gap between the predicted Q value and the target Q value in the current state;
step S42: the target network is used to estimate the target Q value, i.e. the target Q value is calculated from the action in the current state and the maximum Q value in the next state. The parameters of the target network are updated slowly compared with those of the online network in order to keep the target Q value stable: during each iteration they track the online network rather than being trained directly.
A neural network model is constructed to calculate the action-value function Q(s, a), which estimates the cumulative reward of each adjustment scheme at the current visit status. To train the model, two structurally identical neural networks are used: an online network Q(s, a; θ) and a target network Q(s, a; θ⁻). The online network is used to obtain the optimal drug administration action decision a* = argmax_a Q(s, a; θ) and is trained accordingly. The expected action value y = r + γ·max_{a'} Q(s', a'; θ⁻) is estimated using the target network to calculate the loss function L, and the target network parameters θ⁻ are updated by slowly tracking the online network parameters θ, finally obtaining the optimal personalized medicine adjustment scheme suitable for the patient.
Further, the step S5 takes the collected treatment data of the patient as input, and outputs a predicted drug adjustment scheme, which specifically includes:
step S51: forming a high-dimensional vector from five kinds of information (the patient's condition, physiological indexes, laboratory examination results, imaging examination results and medication status) as the input data of the model;
step S52: the output of the model is a drug adjustment scheme based on the patient's condition, specifically comprising the four options 0 no prescription change, 1 increase the drug dose, 2 decrease the drug dose and 3 replace the drug.
Further, step S6 updates the network parameters using stochastic gradient descent. In addition, to alleviate the problems of correlated data and non-stationary distributions, an experience replay mechanism is introduced: the transition sample (s_t, a_t, r_t, s_{t+1}) obtained from the agent's interaction with the environment at each time step is stored in a buffer and randomly sampled from it; in this way differences in data distribution can be mitigated, smoothing the training distribution over many past behaviors.
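For illustration, a minimal sketch of such an experience replay buffer follows (Python is an assumption; the patent discloses no implementation, and the class and field names are hypothetical):

```python
import random
from collections import deque, namedtuple

# One transition sample (s_t, a_t, r_t, s_{t+1}) as described above.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer D storing transitions for random sampling."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random draws break the temporal correlation between
        # consecutive visits and smooth the training distribution.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```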
Further, in step S7, the agent continuously interacts with the environment and, by learning the policy, achieves the goal of maximizing the reward and predicts the optimal medicine and dose adjustment scheme suitable for the patient, so that doctors can make more intelligent medication adjustments for the patient and the survival rate and quality of life of lung cancer patients can be improved.
In order to make the technical problems to be solved, the technical solutions and the beneficial effects clearer, the invention is further described in detail below through exemplary embodiments of the application with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. The technical scheme of the invention is described in detail below with reference to the examples and drawings, but the scope of protection is not limited thereto.
This embodiment is based on the drug prediction model structure shown in FIG. 1; through the step flow of the technical scheme shown in FIG. 2, personalized drug adjustment can be performed for a lung cancer patient according to the patient's actual situation. The processing steps include:
step S1: collecting lung cancer patient data information;
the data used in the experiment was collected from a hospital in a region covering 6124 lung cancer patients, including 632565 outpatient visits over a period of time, and the personal basic information, medical history, physiological data, drug treatment regimen and other relevant data of the patients visited during the period and the relationship data between the type of drug taken by the patients, the dosage and the treatment effect were recorded.
Step S2: vital signs and related medical history of lung cancer patients within 6 months are extracted and preprocessed to construct a patient dataset:
The collected lung cancer patient data are screened (e.g. patients aged 18 to 75), and the screened data are preprocessed, including: removing duplicate data; processing missing values (for a missing physical measurement, the missing value is replaced with the value of the nearest data point of the same patient, and if it is still missing it is estimated using the median of that variable's observations over all patients without missing data); and processing outliers, so as to ensure the integrity and accuracy of the data. Vital sign data are normalized so that all features lie in the same scale range, avoiding the influence of overly large weight differences between features on model performance. The preprocessed data are stored and then divided into a training set and a testing set at a ratio of 8:2 for training and evaluating the effectiveness of the drug prediction model.
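A non-limiting sketch of this preprocessing pipeline is given below (Python/pandas is an assumption; the column names patient_id, visit_time, age and the vital-sign fields are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_dataset(df: pd.DataFrame):
    """Screen, clean, normalize and split the raw visit records."""
    # Screen patients aged 18 to 75.
    df = df[(df["age"] >= 18) & (df["age"] <= 75)]
    # Remove duplicate records.
    df = df.drop_duplicates()
    # Missing physical measurements: nearest data point of the same
    # patient first, then fall back to the cohort median.
    df = df.sort_values(["patient_id", "visit_time"])
    vitals = ["sbp", "dbp", "heart_rate", "weight", "height", "bmi"]  # assumed columns
    df[vitals] = df.groupby("patient_id")[vitals].transform(lambda s: s.ffill().bfill())
    df[vitals] = df[vitals].fillna(df[vitals].median())
    # Normalize vital signs to a common scale.
    df[vitals] = (df[vitals] - df[vitals].min()) / (df[vitals].max() - df[vitals].min())
    # Divide into training and test sets at a ratio of 8:2.
    train, test = train_test_split(df, test_size=0.2, random_state=0)
    return train, test
```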
Step S3: constructing a patient-based environment model by using the collected data, wherein the model is used for simulating a reward mechanism of a drug effect on a patient body and comprises a patient state, a drug action space, a reward function, a transfer model and an initial state:
step S31: determining the patient state space S: the state is the machine's perception of the environment, i.e. the specific environment in which the agent is located, and all possible states form the state space. In this embodiment the features defining the patient's visit status include: demographics, medical history, disease risk, historical medication, laboratory data and physical measurements; continuous variables are normalized to a common scale, binary variables are expressed as 0 or 1, and other categorical variables are converted into multiple binary variables using one-hot encoding; finally, a state space is established from this information, giving a 20-dimensional state vector;
step S32: determining the medication action space A: the action space refers to the specific actions the agent can take in the current environment; in this embodiment it consists of a four-dimensional vector: 0 no prescription change, 1 increase the drug dose, 2 decrease the drug dose, 3 replace the drug, where no prescription change means using the same medication and dose as the previous prescription, and increasing or decreasing the drug dose means adjusting the dose of the medication taken by the patient in the currently input state;
step S33: determining the reward function R: the reward function in reinforcement learning is used to evaluate the feedback reward obtained by the agent after executing an action; it maps environment feedback to a scalar value, and the goal of reinforcement learning is for the agent to learn an optimal policy through interaction with the environment so as to maximize the accumulated reward;
in the present embodiment, the reward function is set as follows:
r_t = R(s_t, a_t, s_{t+1}) = w_1·s_r + w_2·tox_t + w_3·cost_t

wherein s_t is the patient state at the current time t, a_t is the dosing action executed by the agent at time t, s_{t+1} is the patient state at time t+1, and r_t is the reward the agent obtains for transitioning from state s_t to state s_{t+1} after executing dosing action a_t; s_r represents the patient's survival in the current state; tox_t represents the toxic side effects of the drug on the patient in the current state, guiding the agent to avoid selecting drugs harmful to the patient; cost_t represents the cost of the selected drug in the current state, guiding the agent to avoid selecting overly expensive drugs; and w_1, w_2, w_3 are weight coefficients, set to 1, -0.5 and -0.5 respectively (a code sketch of this reward follows step S35 below). The DQN model is trained to optimize the cumulative reward, which equals the current reward plus the expected cumulative reward of the next visit multiplied by the discount factor γ, i.e. G_t = r_t + γ·G_{t+1}, so that the model can estimate the impact of current actions on both short-term and long-term results;
step S34: establishing the transition model:
the transition model refers to the transition probability between states and actions, i.e. the probability of transitioning to the next state after taking a certain action in the current state; an ε-greedy strategy is used to balance exploitation and exploration: exploitation maximizes the expected benefit at the current moment, while exploration may bring about the maximization of total benefit in the long term. ε-greedy is a common strategy in reinforcement learning, meaning that with a very small positive probability ε an action is selected at random from the available actions, while with the remaining probability 1 - ε the action with the greatest action value is selected. For example, at time step t the agent has k possible actions, denoted a_1, ..., a_k; letting Q_t(a) denote the action value of action a, the ε-greedy policy can be expressed as (see also the sketch after step S35):

a_t = argmax_a Q_t(a) with probability 1 - ε, or an action selected uniformly at random with probability ε;
step S35: determining the initial state: the initial state refers to the state of the patient at the time of starting treatment, and is determined according to the patient's basic condition and medical history.
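To make steps S33 and S34 concrete, a minimal sketch of the reward computation and the ε-greedy selection follows (Python is an assumption; how survival s_r, toxicity and cost are scored from raw records is not disclosed, so they enter as plain inputs):

```python
import numpy as np

# Weight coefficients from this embodiment: w1 = 1, w2 = -0.5, w3 = -0.5.
W1, W2, W3 = 1.0, -0.5, -0.5

def reward(s_r, tox, cost):
    """r_t = w1*s_r + w2*tox + w3*cost: survival is rewarded, while
    toxic side effects and drug cost are penalized."""
    return W1 * s_r + W2 * tox + W3 * cost

def epsilon_greedy(q_values, epsilon):
    """Return a random action with probability epsilon (exploration),
    otherwise the action with the largest Q value (exploitation)."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# Example: Q values over the four actions (0 keep, 1 increase, 2 decrease, 3 replace).
action = epsilon_greedy(np.array([0.1, 0.4, -0.2, 0.0]), epsilon=0.1)
```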
Step S4: constructing a network model comprising an online network and a target network, wherein the network model is used for calculating the Q value adjusted by each possible drug scheme under the current state of a patient;
A fully connected neural network with 2 hidden layers is built, each hidden layer containing 64 neurons, with batch normalization and the Leaky-ReLU activation function; the input layer has 20 dimensions and the output layer has 4 dimensions, corresponding to the state vector and the action space respectively; the learning rate α is set to 0.001, the batch size to 256, and the target network update parameter τ to 0.01; to control the stability of the model, the discount factor is set to γ = 0.5; the model is trained with the Adam optimizer for at most 100,000 iterations; the online network and the target network have the same structure but different parameter values; an experience replay mechanism is used to store all experience and randomly draw samples from it for training, reducing estimation-error and high-variance problems; and the target network parameters are updated from the online network parameters at regular intervals to improve stability and speed up model convergence.
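A sketch of this network in PyTorch (the framework is an assumption; the sizes follow the embodiment: 20-dimensional input, two hidden layers of 64 units with batch normalization and Leaky-ReLU, 4-dimensional output, Adam with learning rate 0.001):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q network: 20-d state in, one Q value per action out."""

    def __init__(self, state_dim=20, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.BatchNorm1d(64), nn.LeakyReLU(),
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.LeakyReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

online_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(online_net.state_dict())  # start from identical parameters
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)  # learning rate 0.001
```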
Step S5: the extracted clinical information of the patient is input into a network as a state space, and the network is subjected to continuous iterative updating to finally output actions, wherein the final output actions comprise four options of 0 no-prescription change, 1 increase of medicine dosage, 2 decrease of medicine dosage and 3 replacement of medicine.
Step S6: the random gradient descent method is used for updating network parameters, and the specific steps include:
step S61: calculating the return of each state-action pair, generating a transition sampleWherein->Is in state->Execution of action down->The obtained prize value forms an empirical replay memory of size N>
Step S62: initializing online network parametersAnd target network parameters->
Step S63: playback of memory from experienceExtracting a batch of history samples;
step S64: selecting an optimal action for each state transition process
Step S65: calculating expected action value from target network
Step S66: calculating action value of current drug adjustment by online network
Step S67: calculating a Q loss value L, and repeating S64-S67 for each sample in each batch of samples;
step S68: updating parameter values by a loss value L training network
Step S69: updating parameter values
Step S7: the intelligent agent continuously optimizes during the period of training and learning for a plurality of times, adjusts the related weight parameters, gradually increases the accumulated rewards, finally predicts the future drug adjustment scheme of the patient according to different conditions of different patients, changes the drug or adjusts the drug dosage, and assists doctors in adjusting the drug scheme for the patient.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the invention, not to limit it; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (7)

1. A lung cancer patient medication prediction method based on deep reinforcement learning is characterized in that: the method comprises the following medicine use prediction steps:
step S1: collecting lung cancer patient data information;
step S2: extracting vital signs and related medical histories of a patient suffering from lung cancer for a period of time and preprocessing the vital signs and the related medical histories to construct a patient data set;
step S3: constructing a patient-based environment model using the collected data for simulating the reward mechanism of drug effects on the patient's body, comprising: a patient state, a medication action space, a reward function, a transition model, and an initial state;
step S4: setting up a network model comprising an online network and a target network, and calculating an adjustment value of each medication scheme under the current state of a patient;
step S5: taking the collected patient history treatment data as input, and outputting a predicted drug adjustment scheme;
step S6: updating the network parameters by using a stochastic gradient descent method;
step S7: through constant interaction with the environment, training and learning are carried out multiple times to achieve the goal of reward maximization, and the medication type and dosage adjustment scheme suitable for the patient is predicted and output.
2. The method for predicting medication for a lung cancer patient based on deep reinforcement learning of claim 1, wherein the method comprises the steps of: the specific method for collecting the data information of the lung cancer patient in the step S1 comprises the following steps:
step S11: collecting personal basic information, medical history, physiological data and drug treatment scheme data of a lung cancer patient;
step S12: collecting data on the relationship between the medication type, dose and treatment effect of the lung cancer patient, comprising: data on the changes in the patient's lung tumor size when the patient takes different drug types and different doses.
3. The method for predicting medication for a lung cancer patient based on deep reinforcement learning according to claim 2, wherein the method comprises the following steps: the specific method for constructing the patient data set in the step S2 is as follows:
step S21: screening lung cancer patient data of a set age group;
step S22: preprocessing the screened data, including: removing repeated data, processing missing values and processing abnormal values;
step S23: storing the data obtained in the previous step and then dividing it into a training set and a testing set at a ratio of 8:2.
4. The method for predicting medication for a lung cancer patient based on deep reinforcement learning according to claim 3, wherein the method comprises the steps of: the specific method for constructing the environment model based on the patient in the step S3 is as follows:
step S31: determining a patient state space S, including tumor size, pathological stage and physiological index data;
features defining the patient's visit status include: demographics, medical history, disease risk, historical medication, laboratory data and physical measurements; a state space is established from this information to obtain a multidimensional state vector;
step S32: determining a medicine action space A, including adjustment of medicine types and dosages thereof;
according to the type and dose of the patient's historical medication, the medication adjustment scheme is determined, wherein the action space A comprises a four-dimensional vector: 0 no prescription change, 1 increase the drug dose, 2 decrease the drug dose, 3 replace the drug;
step S33: determining a reward function R: a reasonable reward function is designed according to the patient's condition, drug dose and treatment-effect factors, feeding back the changes of the body's various indexes when the patient takes different drugs and doses so as to mark tumor improvement or deterioration;
step S34: establishing, based on the patient's historical medication data, a transition probability model P(s_{t+1} | s_t, a_t) for each medication action in the current patient state, i.e. calculating the probability P of transitioning to the next state s_{t+1} after taking medication strategy a_t in the current patient state s_t, and using an ε-greedy strategy to balance exploitation and exploration, maximizing the expected benefit at the current moment;
step S35: for the state of the patient at the beginning of the treatment, an initial state is determined based on the patient's basic condition and medical history.
5. The method for predicting medication for a lung cancer patient based on deep reinforcement learning of claim 4, wherein the method comprises the steps of: the specific method for constructing the network model to calculate the adjustment value of the medication scheme of the patient in the step S4 is as follows:
step S41: setting up an online network for calculating the adjustment value of each personalized medication scheme under the current physical state of the patient, and updating the optimal scheme according to the adjustment value;
the parameter weight of the online network is updated in the process of each iteration to minimize the difference between the predicted value and the target value in the current state;
step S42: calculating a target value according to the action in the current state and the maximum adjustment value in the next state;
step S43: constructing a neural network model to calculate an action-value function, denoted Q(s, a), which estimates the cumulative reward of each adjustment scheme in the current treatment state, i.e. the expected return value Q caused by the change of the patient's physical indexes after taking medication strategy a in state s;
step S44: training the DQN model using two structurally identical neural networks, an online network Q(s, a; θ) and a target network Q(s, a; θ⁻), the online network being used for obtaining the optimal dosing action decision a* = argmax_a Q(s, a; θ) and being trained with the loss function L;
wherein s is the current state of the patient, a is the patient's current medication strategy, θ is the online network parameter, argmax_a selects the action a with the maximum Q value in the given state s, Q(s, a; θ) is the expected return value of the patient's physical-index change in the online network, Q(s, a; θ⁻) is the expected return value of the patient's physical-index change in the target network, and L is the loss function computing the difference between the two for network training;
the expected action value y = r + γ·max_{a'} Q(s', a'; θ⁻) is estimated using the target network to calculate the loss function L = (y - Q(s, a; θ))², and the target network parameter θ⁻ is updated by slowly tracking the online network parameter θ in each training iteration, finally obtaining the optimal personalized medicine adjustment scheme suitable for the patient.
6. The method for predicting medication for a lung cancer patient based on deep reinforcement learning of claim 5, wherein the method comprises the steps of: the specific method for outputting the predicted drug adjustment scheme in the step S5 is as follows:
step S51: forming a high-dimensional vector from five kinds of information (the patient's condition, physiological indexes, laboratory examination results, imaging examination results and medication status) as the input data of the model;
step S52: the output of the model is a drug adjustment regimen based on the patient's condition, covering four options: 0 indicates no prescription change, 1 indicates increasing the drug dose, 2 indicates decreasing the drug dose, and 3 indicates replacing the drug.
7. A device for use in implementing the deep reinforcement learning-based lung cancer patient medication prediction method of claim 1, characterized in that: the system comprises a collecting computer for collecting lung cancer patient data information, a data server for collecting and storing the data information, and a prediction server for building a network model, training and learning and outputting a prediction scheme.
CN202311567874.0A 2023-11-23 2023-11-23 Deep reinforcement learning-based lung cancer patient medication prediction method and device Active CN117275661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311567874.0A CN117275661B (en) 2023-11-23 2023-11-23 Deep reinforcement learning-based lung cancer patient medication prediction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311567874.0A CN117275661B (en) 2023-11-23 2023-11-23 Deep reinforcement learning-based lung cancer patient medication prediction method and device

Publications (2)

Publication Number Publication Date
CN117275661A (en) 2023-12-22
CN117275661B CN117275661B (en) 2024-02-09

Family

ID=89220067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311567874.0A Active CN117275661B (en) 2023-11-23 2023-11-23 Deep reinforcement learning-based lung cancer patient medication prediction method and device

Country Status (1)

Country Link
CN (1) CN117275661B (en)



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071388A (en) * 2019-06-10 2020-12-11 郑州大学第一附属医院 Intelligent medicine dispensing and preparing method based on deep learning
WO2021226064A1 (en) * 2020-05-04 2021-11-11 University Of Louisville Research Foundation, Inc. Artificial intelligence-based systems and methods for dosing of pharmacologic agents
WO2022067189A1 (en) * 2020-09-25 2022-03-31 Linus Health, Inc. Systems and methods for machine-learning-assisted cognitive evaluation and treatment
CN112420154A (en) * 2020-11-25 2021-02-26 深圳市华嘉生物智能科技有限公司 New coronary medication suggestion method based on deep learning neural network
CN113257416A (en) * 2020-12-09 2021-08-13 浙江大学 COPD patient personalized management and tuning method, device and equipment based on deep learning
CN113255735A (en) * 2021-04-29 2021-08-13 平安科技(深圳)有限公司 Method and device for determining medication scheme of patient
CN113270189A (en) * 2021-05-19 2021-08-17 复旦大学附属肿瘤医院 Tumor treatment aid decision-making method based on reinforcement learning
CN114388095A (en) * 2021-12-22 2022-04-22 中山大学 Sepsis treatment strategy optimization method, system, computer device and storage medium
CN114330566A (en) * 2021-12-30 2022-04-12 中山大学 Method and device for learning sepsis treatment strategy
CN114783571A (en) * 2022-04-06 2022-07-22 北京交通大学 Traditional Chinese medicine dynamic diagnosis and treatment scheme optimization method and system based on deep reinforcement learning
CN115050451A (en) * 2022-08-17 2022-09-13 合肥工业大学 Automatic generation system for clinical sepsis medication scheme
CN115985514A (en) * 2023-01-09 2023-04-18 重庆大学 Septicemia treatment system based on dual-channel reinforcement learning
CN115831340A (en) * 2023-02-22 2023-03-21 安徽省立医院(中国科学技术大学附属第一医院) ICU (intensive care unit) breathing machine and sedative management method and medium based on inverse reinforcement learning
CN116453706A (en) * 2023-06-14 2023-07-18 之江实验室 Hemodialysis scheme making method and system based on reinforcement learning
CN117010476A (en) * 2023-08-11 2023-11-07 电子科技大学长三角研究院(衢州) Multi-agent autonomous decision-making method based on deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
PANAGIOTIS SYMEONIDIS等: "Deep Reinforcement Learning for Medicine Recommendation", 《2022 IEEE 22ND INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING (BIBE)》, pages 85 - 90 *
傅群: "Research on individualized administration of tacrolimus in kidney transplant recipients based on machine learning models", China Master's Theses Full-text Database, Medicine and Health Sciences, no. 02, pp. 1-75 *
吴青 et al.: "Dose recommendation of levetiracetam for BECT treatment based on a deep Q network", Chinese Journal of Modern Applied Pharmacy, vol. 39, no. 12, pp. 1585-1590 *
董云云: "Research on auxiliary diagnosis methods for lung cancer based on medical images and gene data", China Doctoral Dissertations Full-text Database, Medicine and Health Sciences, no. 1, pp. 072-136 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117637188A (en) * 2024-01-26 2024-03-01 四川省肿瘤医院 Tumor chemotherapy response monitoring method, medium and system based on digital platform
CN117637188B (en) * 2024-01-26 2024-04-09 四川省肿瘤医院 Tumor chemotherapy response monitoring method, medium and system based on digital platform
CN118280512A (en) * 2024-04-05 2024-07-02 泰昊乐生物科技有限公司 Personalized treatment scheme recommendation method and system based on artificial intelligence
CN118039062A (en) * 2024-04-12 2024-05-14 四川省肿瘤医院 Individualized chemotherapy dose remote control method based on big data analysis

Also Published As

Publication number Publication date
CN117275661B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN117275661B (en) Deep reinforcement learning-based lung cancer patient medication prediction method and device
CN109599177B (en) Method for predicting medical treatment track through deep learning based on medical history
CN110880362B (en) Large-scale medical data knowledge mining and treatment scheme recommending system
US9370689B2 (en) System and methods for providing dynamic integrated wellness assessment
CN109087706B (en) Human health assessment method and system based on sleep big data
JP7019127B2 (en) Insulin assessment based on reinforcement learning
CN111105860A (en) Intelligent prediction, analysis and optimization system for accurate motion big data for chronic disease rehabilitation
CN111798954A (en) Drug combination recommendation method based on time attention mechanism and graph convolution network
Javad et al. A reinforcement learning–based method for management of type 1 diabetes: exploratory study
CN116453706B (en) Hemodialysis scheme making method and system based on reinforcement learning
US20200203020A1 (en) Digital twin of a person
US20210089965A1 (en) Data Conversion/Symptom Scoring
JP6962854B2 (en) Water prescription system and water prescription program
CN114732402B (en) Diabetes digital health management system based on big data
Wang et al. Prediction models for glaucoma in a multicenter electronic health records consortium: the sight outcomes research collaborative
US11887736B1 (en) Methods for evaluating clinical comparative efficacy using real-world health data and artificial intelligence
Oroojeni Mohammad Javad et al. Reinforcement learning algorithm for blood glucose control in diabetic patients
CN116525117B (en) Data distribution drift detection and self-adaption oriented clinical risk prediction system
Dogaru et al. Big Data and Machine Learning Framework in Healthcare
US20230386656A1 (en) Computerized system for the repeated determination of a set of at least one control parameters of a medical device
Mohanty et al. A classification model based on an adaptive neuro-fuzzy inference system for disease prediction
CN118588226B (en) Training method, optimizing method and device of antiepileptic medicinal strategy optimizing model
Rad et al. Optimizing Blood Glucose Control through Reward Shaping in Reinforcement Learning
Rodriguez Leon et al. Prediction of Blood Glucose Levels in Patients with Type 1 Diabetes via LSTM Neural Networks
Ranganathan et al. Intelligent Inhalation Therapy for Cystic Fibrosis Using IoT and Machine Learning Solutions

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant