CN117116370A - Chemical reaction yield prediction method and electronic equipment - Google Patents
- Publication number
- CN117116370A (CN202311018103.6A)
- Authority
- CN
- China
- Prior art keywords
- model
- training
- reaction
- super
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/10—Analysis or design of chemical reactions, syntheses or processes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Crystallography & Structural Chemistry (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Chemical Kinetics & Catalysis (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Analytical Chemistry (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the technical field of chemical reaction yield prediction, and in particular discloses a chemical reaction yield prediction method and electronic equipment. The method comprises: constructing an instantiation model; encoding sample data to generate reaction vectors; acquiring a hyperparameter combination and initializing the instantiation model with it to obtain an initialization model; inputting the reaction vectors into the initialization model to obtain model parameters and the corresponding evaluation indexes; reporting the evaluation indexes to a hyperparameter adjustment tool; screening out an excellent model according to the evaluation indexes of the initialization models; and storing the model according to the hyperparameter combination and the model parameters of the excellent model. By training and storing a plurality of machine learning models, the invention can obtain a model better adapted to the chemical reaction, which improves the accuracy of chemical yield prediction. Because the hyperparameter combinations need no manual adjustment, the efficiency and quality of model training are further improved, and the method has high practical and popularization value.
Description
Technical Field
The invention relates to the technical field of chemical reaction yield prediction, in particular to a chemical reaction yield prediction method and electronic equipment.
Background
A chemical reaction yield prediction model training system aims to improve the accuracy of chemical reaction yield prediction so that reaction conditions can be optimized, yields improved, and costs reduced. Such a system can address the following problems:
1. Reducing cost: predicting the reaction yield makes it easier to select the optimal reaction conditions, which lowers the reaction cost and makes chemical synthesis more economical; it also reduces trial-and-error cost, improves the efficiency of chemical experiments, and lowers their failure rate, saving time and cost.
2. Improving yield: optimizing the reaction conditions, the choice of catalyst and ligand, and the like raises the reaction yield and makes the chemical synthesis process more efficient.
Existing methods for constructing a yield prediction model and predicting yield use a random forest algorithm: a factor mapping module maps the factors in the training data that may affect the yield into a plurality of factor sets, and a model construction module builds a random forest model from the processed training data and weights.
However, the existing approach has the following shortcomings: (1) the descriptors of reactants, products and reaction conditions are of a single kind; (2) the precision of the machine learning model used, a random forest model, is limited; (3) the hyperparameters of a machine learning model strongly influence its performance, and adjusting them manually is inefficient.
Those skilled in the art therefore need a new solution to the above problems.
Disclosure of Invention
Aiming at the technical problems in the prior art, the invention provides a chemical reaction yield prediction method and electronic equipment.
The invention includes a chemical reaction yield prediction method, comprising:
acquiring configuration information of a user, and constructing an instantiation model according to the configuration information;
reading sample data of chemical yield, and encoding the sample data according to a preset reaction vector generation strategy to generate a reaction vector;
acquiring a hyperparameter combination from a preset hyperparameter adjustment tool, and initializing the instantiation model according to the hyperparameter combination to obtain an initialization model;
inputting the reaction vector into the initialization model for operation to obtain model parameters, and obtaining corresponding evaluation indexes according to a preset evaluation strategy;
reporting the evaluation indexes to the hyperparameter adjustment tool;
acquiring a hyperparameter combination from the hyperparameter adjustment tool again and repeating the subsequent steps until a preset stopping condition is reached;
screening out an excellent model according to a preset model screening strategy and the evaluation indexes corresponding to the obtained initialization models;
and storing the model according to the hyperparameter combination and the model parameters of the excellent model.
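The acquire, train, evaluate and report cycle described above can be sketched as a simple loop. This is only an illustrative sketch: `RandomSearchTuner`, `run_trials` and `toy_objective` are hypothetical names, random sampling stands in for whatever search algorithm the adjustment tool actually uses, and the toy objective replaces real model training.

```python
import random
import time


class RandomSearchTuner:
    """Hypothetical stand-in for the hyperparameter adjustment tool."""

    def __init__(self, search_space, seed=0):
        self.search_space = search_space
        self.rng = random.Random(seed)
        self.history = []  # (hyperparameters, metric) pairs reported so far

    def suggest(self):
        # Draw one candidate value per hyperparameter.
        return {name: self.rng.choice(values)
                for name, values in self.search_space.items()}

    def report(self, params, metric):
        self.history.append((params, metric))


def run_trials(tuner, train_and_evaluate, max_trials, max_seconds):
    """Repeat suggest -> train -> evaluate -> report until a stop condition holds."""
    start = time.monotonic()
    results = []
    while len(results) < max_trials and time.monotonic() - start < max_seconds:
        params = tuner.suggest()
        model_params, metric = train_and_evaluate(params)
        tuner.report(params, metric)
        results.append({"hyperparameters": params,
                        "model_parameters": model_params,
                        "metric": metric})
    return results


def toy_objective(params):
    # Placeholder for real training: dummy model parameters, and a metric
    # that is highest (zero) at depth=4, rate=0.1.
    metric = -abs(params["depth"] - 4) - abs(params["rate"] - 0.1)
    return {"fitted": True}, metric


space = {"depth": [2, 4, 6, 8], "rate": [0.01, 0.1, 0.3]}
tuner = RandomSearchTuner(space, seed=42)
trials = run_trials(tuner, toy_objective, max_trials=20, max_seconds=5.0)
best = max(trials, key=lambda t: t["metric"])
print(best["hyperparameters"], best["metric"])
```

Both stop conditions (a trial budget and a wall-clock budget) are assumptions chosen to mirror the number N of training models and the training duration mentioned in the configuration information.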
Further, obtaining configuration information of the user, and constructing an instantiation model according to the configuration information, including:
reading configuration information of a user from a set path in a project folder; the configuration information at least comprises model names, the number N of training models and training duration;
and reading model data from the project folder according to the model names in the configuration information, and acquiring the hyperparameter search space of the corresponding model.
Further, reading sample data of chemical yield, and encoding the sample data according to a preset reaction vector generation strategy to generate a reaction vector; comprising the following steps:
reading sample data, and dividing the sample data into a training sample set and a test sample set according to a set proportion; the training sample set and the test sample set both contain reactants, products and reaction conditions;
and sequentially encoding reactants, products and reaction conditions in the training sample set and the test sample set according to a reaction vector generation strategy to generate a reaction vector corresponding to each sample data.
Further, when a hyperparameter combination is obtained from the preset hyperparameter adjustment tool, the hyperparameter combination is an optimal parameter set found by the hyperparameter adjustment tool by traversing the hyperparameter search space with a preset search algorithm; or, the hyperparameter combination is an optimal parameter set found by the hyperparameter adjustment tool by traversing the hyperparameter search space with a preset search algorithm in combination with the evaluation indexes.
Further, inputting the reaction vector into an initialization model for operation to obtain model parameters, and obtaining corresponding evaluation indexes according to a preset evaluation strategy; comprising the following steps:
inputting the training set reaction vector into an initialization model for training to obtain model parameters;
generating a training model according to the model parameters;
inputting the reaction vector of the test set into a training model for testing to obtain a prediction result;
and evaluating the training model according to the evaluation strategy, the test set reaction vector and the prediction result to obtain an evaluation index.
Further, evaluating the training model according to the evaluation strategy, the test set reaction vector and the prediction result to obtain an evaluation index comprises:
when the training model is a classification model,
calculating the model accuracy $\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$, wherein TP represents the number of true positives (the real situation is a positive case and the predicted result is a positive case), FN represents the number of false negatives (the real situation is a positive case and the predicted result is a negative case), FP represents the number of false positives (the real situation is a negative case and the predicted result is a positive case), and TN represents the number of true negatives (the real situation is a negative case and the predicted result is a negative case);
calculating the model precision $\mathrm{Precision} = \frac{TP}{TP + FP}$;
calculating the model recall $\mathrm{Recall} = \frac{TP}{TP + FN}$;
calculating the model performance parameter $F_1 = \frac{2PR}{P + R}$, wherein P represents the model Precision and R represents the model Recall;
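calculating the model AUC value: $\mathrm{AUC} = \frac{\sum I(pred_{pos} > pred_{neg})}{pos_{num} \times neg_{num}}$, wherein the sum runs over all pairs of one positive example and one negative example, $I(\cdot)$ equals 1 when its condition holds and 0 otherwise, $pred_{pos}$ represents the predicted value of a positive example, $pred_{neg}$ represents the predicted value of a negative example, $pos_{num}$ represents the number of positive examples of the real situation, and $neg_{num}$ represents the number of negative examples.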
Further, evaluating the training model according to the evaluation strategy, the test set reaction vector and the prediction result to obtain an evaluation index comprises:
when the training model is a regression model,
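calculating the mean absolute error of the model $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$, wherein $y_i$ represents the predicted label value of the i-th sample, $\hat{y}_i$ represents the true label value of the i-th sample, and n represents the total number of samples;
calculating the maximum error of the model $\mathrm{MaxError} = \max_{1 \le i \le n} \left| y_i - \hat{y}_i \right|$;
calculating the root mean square error of the model $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$;
calculating the coefficient of determination of the model $R^2 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2}{\sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2}$, wherein $\bar{y}$ represents the average of the n true labels;
calculating the Pearson correlation coefficient of the model $r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}$, wherein x and y represent the value of the real result and the value of the predicted result, respectively, and $m_x$ and $m_y$ represent the average values of x and y, respectively;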
calculating the Kendall correlation coefficient of the model $\tau = \frac{N_1 - N_2}{\frac{1}{2} n (n - 1)}$, wherein $N_1$ represents the number of sample pairs for which the real situation and the predicted result are consistent in order, and $N_2$ represents the number of sample pairs for which they are inconsistent.
Further, the method also comprises the following steps:
if the type of the instantiation model is a classification model, processing the reaction vectors through a preset oversampling algorithm to obtain minority-class reaction vectors;
inputting the original reaction vectors together with the minority-class reaction vectors into the initialization model for operation;
and/or,
performing dimension reduction operation on the reaction vector through a preset dimension reduction algorithm to obtain a low-dimension reaction vector;
and inputting the low-dimensional reaction vector into an initialization model for operation.
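As a rough illustration of the oversampling branch, the sketch below balances a classification data set by duplicating randomly chosen minority-class reaction vectors. The source does not name the oversampling algorithm, so plain random oversampling is an assumption here, and `random_oversample` is a hypothetical helper.

```python
import random
from collections import Counter


def random_oversample(vectors, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_vectors, out_labels = list(vectors), list(labels)
    for cls, count in counts.items():
        pool = [v for v, y in zip(vectors, labels) if y == cls]
        # Classes already at the target size need no extra copies.
        for _ in range(target - count):
            out_vectors.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_vectors, out_labels


# Four samples "with yield" (label 1) and one "without yield" (label 0).
vectors = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 1, 1]]
labels = [1, 1, 1, 1, 0]
balanced_vectors, balanced_labels = random_oversample(vectors, labels)
print(Counter(balanced_labels))
```

The original vectors stay at the front of the returned list, so the added minority-class copies can be inspected or dropped separately.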
Further, the evaluation indexes comprise a main evaluation index and a secondary evaluation index; screening an excellent model according to a preset model screening strategy and an evaluation index corresponding to the obtained initialization model; comprising the following steps:
screening a plurality of training models with top ranking according to the main evaluation index;
and screening an optimal training model from the training models according to the secondary evaluation indexes to serve as an excellent model.
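The two-stage screening can be sketched as follows: rank by the main evaluation index, keep the top few, then choose among them by the secondary index. The function name, the `top_k` parameter and the choice of $R^2$ and MAE as main and secondary indexes are illustrative assumptions.

```python
def screen_excellent_model(candidates, primary, secondary, top_k=3,
                           primary_higher_is_better=True,
                           secondary_higher_is_better=True):
    """Shortlist the top_k candidates by the main index, then pick by the secondary index."""
    if not candidates:
        raise ValueError("no candidate models to screen")
    shortlist = sorted(candidates, key=lambda c: c[primary],
                       reverse=primary_higher_is_better)[:top_k]
    pick = max if secondary_higher_is_better else min
    return pick(shortlist, key=lambda c: c[secondary])


candidates = [
    {"name": "trial_0", "r2": 0.81, "mae": 6.2},
    {"name": "trial_1", "r2": 0.85, "mae": 5.9},
    {"name": "trial_2", "r2": 0.84, "mae": 5.1},
    {"name": "trial_3", "r2": 0.60, "mae": 4.0},
]
best = screen_excellent_model(candidates, primary="r2", secondary="mae",
                              top_k=2, secondary_higher_is_better=False)
print(best["name"])  # trial_2
```

Here `trial_3` has the lowest MAE overall but is excluded because it does not make the shortlist on the main index.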
The invention also includes an electronic device comprising:
a memory for storing a computer program;
and a processor for implementing the chemical reaction yield prediction method when executing the computer program.
According to the chemical reaction yield prediction method and the electronic equipment, an instantiation model is constructed from the acquired configuration information of the user, and the read sample data of chemical yield is encoded to generate reaction vectors. A hyperparameter combination is acquired from a preset hyperparameter adjustment tool and used to initialize the instantiation model, giving an initialization model; the reaction vectors are input into the initialization model to obtain model parameters, and the corresponding evaluation indexes are obtained according to a preset evaluation strategy. As long as the preset stopping condition has not been met, a new hyperparameter combination is acquired from the hyperparameter adjustment tool and the operation is repeated to obtain further model parameters and evaluation indexes. Once the stopping condition is met, an excellent model is screened out according to a preset model screening strategy and stored, and the stored model can be used to predict the yield of the chemical reaction. By training and storing a plurality of machine learning models, the invention can obtain a model better adapted to the chemical reaction, which improves the accuracy of chemical yield prediction. Each hyperparameter combination is chosen by taking the evaluation indexes of the models already trained into account and needs no manual adjustment, so the efficiency and quality of model training are further improved, and the method has high practical and popularization value.
Drawings
To describe the embodiments of the invention or the prior-art solutions more clearly, the drawings used in that description are briefly introduced below. The drawings described below show only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart showing steps of a chemical reaction yield prediction method according to an embodiment of the present invention;
FIG. 2 is a flowchart showing a chemical reaction yield prediction method according to an embodiment of the present invention;
FIG. 3 is a flowchart showing a chemical reaction yield prediction method according to an embodiment of the present invention;
FIG. 4 is a flowchart showing a chemical reaction yield prediction method according to an embodiment of the present invention;
FIG. 5 is a structural composition diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The present invention is directed to a chemical reaction yield prediction method which, as shown in FIG. 1, comprises:
Step S10: acquiring configuration information of the user, and constructing an instantiation model according to the configuration information.
Before this step is carried out, model data for a plurality of machine learning models is created, and one or more specified instantiation models are constructed from those models according to the user's configuration information. The machine learning models may include XGBoost (eXtreme Gradient Boosting), random forest, SVM (Support Vector Machine), KNN (K-Nearest Neighbor), and the like, and fall into two main types: regression models and classification models. A classification model may be used to predict qualitatively whether a chemical reaction has a yield, and a regression model may be used to predict quantitatively the magnitude of the yield of a chemical reaction.
Specifically, as shown in FIG. 2, step S10 includes:
Step S101: reading the user's configuration information from a set path in the project folder. The configuration information at least comprises model names, the number N of training models and the training duration.
The project folder in the embodiment of the invention contains model data for various machine learning models. The user's configuration information is read from the set path to obtain the user's configuration requirements, such as the model names designated by the user, the number N of training models to be obtained by training, and the total training duration. The configuration information may also carry other settings, including but not limited to: (1) the kind of model (classification model/regression model); (2) whether the model needs to be stored after training; (3) whether the model is only trained and not tested; (4) whether the model is only tested and not trained; (5) model parameter adjustment information; (6) the number of parallel model training tasks; (7) a time stamp; (8) the number of models to save after parameter adjustment; (9) the training/test set proportion; (10) a random number seed; (11) sample data related parameters (data set name, data set file name), etc.
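A configuration of this kind could be parsed and validated roughly as below. The source does not give a file format or key names, so JSON and every key shown (`model_name`, `num_models`, `training_duration_s`, and the optional settings) are assumptions made purely for illustration.

```python
import json

# Hypothetical optional settings with defaults; none of these names come from the source.
DEFAULTS = {
    "model_type": "regression",   # or "classification"
    "save_model": True,
    "train_only": False,
    "test_only": False,
    "parallel_trials": 1,
    "models_to_save": 1,
    "test_ratio": 0.2,
    "random_seed": 0,
}
# The three items the configuration must at least contain.
REQUIRED = ("model_name", "num_models", "training_duration_s")


def load_config(text):
    """Parse a JSON configuration string, check required keys, and fill in defaults."""
    user = json.loads(text)
    missing = [key for key in REQUIRED if key not in user]
    if missing:
        raise KeyError(f"missing required configuration keys: {missing}")
    config = {**DEFAULTS, **user}
    if config["train_only"] and config["test_only"]:
        raise ValueError("train_only and test_only cannot both be set")
    return config


example = '{"model_name": "xgboost", "num_models": 5, "training_duration_s": 3600, "test_ratio": 0.3}'
config = load_config(example)
print(config["model_name"], config["test_ratio"], config["random_seed"])
# xgboost 0.3 0
```

In practice the string would be read from the set path in the project folder; taking text as input keeps the sketch independent of any particular folder layout.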
Step S102: reading model data from the project folder according to the model names in the configuration information, and acquiring the hyperparameter search space of the corresponding model.
The corresponding model data is read according to the model name in the configuration information, and the hyperparameter search space of the model is acquired for use in subsequent model training and testing.
After the instantiation model has been constructed, the next step is performed.
Step S20: reading sample data of chemical yield, and encoding the sample data according to a preset reaction vector generation strategy to generate a reaction vector.
The sample data of chemical yield in this embodiment can be chosen by the user; like the configuration information, the stored sample data is read from a set path in the project folder. The sample data is then processed into reaction vectors that the instantiation model can read and process.
Specifically, as shown in FIG. 3, step S20 includes:
Step S201: reading sample data, and dividing the sample data into a training sample set and a test sample set according to a set proportion; both the training sample set and the test sample set contain reactants, products, and reaction conditions.
The set proportion for dividing the training sample set and the test sample set may be included in the user's configuration information obtained in step S10; if it is not, the division is performed according to a proportion preset by the system. For example, with 10000 samples in total and a ratio of 8:2, 8000 samples go to the training sample set and the remaining 2000 to the test sample set. The division may be random or follow another scheme, which is not specifically limited here.
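The 8:2 example can be reproduced with a seeded random split. This is a minimal sketch of one possible division scheme (random shuffling); `split_samples` and its parameters are hypothetical names.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the samples and cut it into training and test sets."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10000))  # stand-ins for 10000 reaction records
train, test = split_samples(samples, train_ratio=0.8, seed=42)
print(len(train), len(test))  # 8000 2000
```

Using a fixed seed corresponds to the random number seed that the configuration information may carry.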
Each sample in the training sample set and the test sample set contains reactants, products, and reaction conditions. All sample data used under the same model belong to the same chemical reaction, but the types of reactants and products differ from sample to sample, as do the reaction conditions: for example, whether a catalyst is used and its type and amount, whether a ligand or solvent is used and their types and amounts, and the manner in which the ligand and solvent are added.
Step S202: sequentially encoding the reactants, products and reaction conditions in the training sample set and the test sample set according to the reaction vector generation strategy to generate a reaction vector corresponding to each sample.
The reaction vector generation strategy of the embodiment of the invention comprises the following steps:
calculating the molecular fingerprints of the reactants and products in the training sample set and the test sample set according to a molecular fingerprint algorithm, and splicing the molecular fingerprints into a reaction vector;
and encoding all the reaction conditions in the training sample set and the test sample set according to a specific reaction condition encoding scheme, and integrating the codes into the reaction vector to obtain the reaction vector corresponding to the sample data.
Optional molecular fingerprint algorithms in this step include Morgan2, Morgan3 and the like, and the reaction conditions are encoded using one-hot encoding.
The algorithms used by the reaction vector generation strategy can be selected according to the chemical reaction; that is, different types of models and different encodings of the reaction vector can be adopted when obtaining models for different chemical reactions.
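The splicing and one-hot steps can be sketched as below. Molecular fingerprints are taken as precomputed bit lists (in practice they would come from a cheminformatics toolkit), and the condition vocabularies, the 4-bit fingerprints and the function names are all illustrative assumptions.

```python
def one_hot(value, vocabulary):
    """Return a list with a 1 at the position of value in vocabulary and 0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown condition value: {value!r}")
    return [1 if item == value else 0 for item in vocabulary]


def build_reaction_vector(reactant_fps, product_fps, conditions, vocabularies):
    """Concatenate reactant and product fingerprints, then append one-hot condition codes."""
    vector = []
    for fingerprint in reactant_fps + product_fps:
        vector.extend(fingerprint)
    # Iterate over condition names in sorted order so every sample uses the same layout.
    for name in sorted(vocabularies):
        vector.extend(one_hot(conditions[name], vocabularies[name]))
    return vector


vocabularies = {"catalyst": ["none", "Pd", "Ni"], "solvent": ["DMF", "THF"]}
vector = build_reaction_vector(
    reactant_fps=[[1, 0, 0, 1], [0, 1, 0, 0]],   # toy 4-bit fingerprints
    product_fps=[[1, 1, 0, 0]],
    conditions={"catalyst": "Pd", "solvent": "THF"},
    vocabularies=vocabularies,
)
print(vector)
# [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1]
```

Real Morgan fingerprints are typically hundreds or thousands of bits long; the layout logic is the same.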
Step S30: acquiring a hyperparameter combination from a preset hyperparameter adjustment tool, and initializing the instantiation model according to the hyperparameter combination to obtain an initialization model.
The preset hyperparameter adjustment tool in this embodiment uses a search algorithm to find an optimal hyperparameter combination; the search algorithm can be TPE (Tree-structured Parzen Estimator), Grid Search, BOHB (Bayesian Optimization and Hyperband), or another algorithm. The hyperparameter adjustment tool can be NNI (Neural Network Intelligence), which adjusts the hyperparameters automatically when there are many hyperparameters and manual adjustment would be time-consuming.
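Of the search algorithms listed, grid search is the simplest to show: it enumerates every combination in the hyperparameter search space. The sketch below is a plain-Python illustration, not the tool's own implementation; the hyperparameter names `n_estimators` and `max_depth` are hypothetical.

```python
from itertools import product


def grid_candidates(search_space):
    """Yield every hyperparameter combination in the search space, one dict at a time."""
    names = sorted(search_space)
    for values in product(*(search_space[name] for name in names)):
        yield dict(zip(names, values))


space = {"n_estimators": [100, 300], "max_depth": [3, 5, 7]}
combos = list(grid_candidates(space))
print(len(combos))  # 6
print(combos[0])    # {'max_depth': 3, 'n_estimators': 100}
```

The number of combinations grows multiplicatively with each added hyperparameter, which is one reason model-based methods such as TPE or BOHB are often preferred for larger search spaces.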
Step S40: inputting the reaction vector into an initialization model for operation, obtaining model parameters, and obtaining corresponding evaluation indexes according to a preset evaluation strategy.
Specifically, as shown in FIG. 4, step S40 includes:
Step S401: inputting the training set reaction vectors into the initialization model for training to obtain model parameters.
Because the invention aims to predict the yield of the chemical reaction, the information corresponding to reactants, products and reaction conditions in the training set reaction vectors is used as the input of the initialization model and the information corresponding to the yield is used as its output, and the initialization model is trained on this basis. The model parameters of the initialization model are obtained once training is complete.
Step S402: generating a training model according to the model parameters.
After the model parameters are obtained in step S401, the corresponding training model is generated from the model parameters and the initialization model.
Step S403: inputting the test set reaction vectors into the training model for testing to obtain a prediction result.
The reaction vector data corresponding to reactants, products and reaction conditions in the test set reaction vectors is used as the input of the training model, and the training model predicts the yield of the chemical reaction to obtain the prediction result.
It should be noted that the test set reaction vectors also carry the real result (also called the real situation) for each sample.
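Steps S401 to S403 amount to a fit-then-predict cycle. The sketch below uses a tiny k-nearest-neighbour regressor as a stand-in for whichever model the configuration names; the class, the toy two-dimensional reaction vectors and the yield values are illustrative assumptions, not data from the source.

```python
import math


class KNNRegressor:
    """Minimal k-nearest-neighbour regressor used only to illustrate train/test flow."""

    def __init__(self, k=3):
        self.k = k  # hyperparameter supplied by the tuning step
        self.vectors = None
        self.yields = None

    def fit(self, vectors, yields):
        # For KNN the learned "model parameters" are simply the stored training data.
        self.vectors = [list(v) for v in vectors]
        self.yields = list(yields)
        return self

    def predict(self, vectors):
        predictions = []
        for query in vectors:
            # Sort training samples by Euclidean distance and average the k closest yields.
            nearest = sorted(
                (math.dist(query, v), y) for v, y in zip(self.vectors, self.yields)
            )[: self.k]
            predictions.append(sum(y for _, y in nearest) / len(nearest))
        return predictions


train_x = [[0, 0], [0, 1], [1, 0], [1, 1]]   # toy training set reaction vectors
train_y = [10.0, 20.0, 30.0, 40.0]           # yields in percent
test_x = [[0, 0], [1, 1]]                    # toy test set reaction vectors

model = KNNRegressor(k=1).fit(train_x, train_y)
predicted = model.predict(test_x)
print(predicted)  # [10.0, 40.0]
```

The predictions are then compared with the real yields stored alongside the test set reaction vectors to compute the evaluation indexes.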
Step S404: evaluating the training model according to the evaluation strategy, the test set reaction vector and the prediction result to obtain an evaluation index.
The evaluation index in this step is obtained by combining the real situation and the predicted situation of the test set reaction vectors, and it reflects the prediction effect, or prediction capability, of the training model.
Specifically, the evaluation strategy in the embodiment of the invention comprises the following:
When the training model is a classification model:
(1) Calculating the model accuracy $\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$, wherein TP represents the number of true positives (the real situation is a positive case and the predicted result is a positive case), FN represents the number of false negatives (the real situation is a positive case and the predicted result is a negative case), FP represents the number of false positives (the real situation is a negative case and the predicted result is a positive case), and TN represents the number of true negatives (the real situation is a negative case and the predicted result is a negative case).
The model Accuracy has a value range of [0,1] and represents the proportion of correct predictions among all test samples.
(2) Calculating the model precision $\mathrm{Precision} = \frac{TP}{TP + FP}$.
The model Precision has a value range of [0,1] and represents the proportion of actually positive samples among all samples predicted to be positive; in yield prediction, for example, it is the proportion of samples that actually have yield among the samples predicted to have yield.
(3) Calculating the model recall $\mathrm{Recall} = \frac{TP}{TP + FN}$.
The model Recall has a value range of [0,1] and represents the proportion of samples predicted to be positive among the samples that are actually positive; in yield prediction, for example, it is the proportion of samples predicted to have yield among the samples that actually have yield.
(4) Calculating the model performance parameter $F_1 = \frac{2PR}{P + R}$, wherein P represents the model Precision and R represents the model Recall.
The model performance parameter $F_1$ has a value range of [0,1] and is the harmonic mean of Precision and Recall. The higher the value of $F_1$, the more positive examples the model predicts correctly and the better the model performance.
(5) Calculating the model AUC value: $\mathrm{AUC} = \frac{\sum I(pred_{pos} > pred_{neg})}{pos_{num} \times neg_{num}}$, wherein the sum runs over all pairs of one positive example and one negative example, $I(\cdot)$ equals 1 when its condition holds and 0 otherwise, $pred_{pos}$ represents the predicted value of a positive example, $pred_{neg}$ represents the predicted value of a negative example, $pos_{num}$ represents the number of positive examples of the real situation, and $neg_{num}$ represents the number of negative examples.
The model AUC has a value range of [0,1] and means: given a randomly chosen positive sample and a randomly chosen negative sample, the probability that the model outputs a higher probability of being positive for the positive sample than for the negative sample. If samples are classified completely at random, the AUC should be close to 0.5; a trained model generally has AUC > 0.5; if AUC = 0.5, the classification effect is the same as random guessing. In yield prediction, the higher the AUC, the less often a reaction that actually has no yield is falsely reported as having yield.
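The classification indexes above can be computed directly from the confusion counts, and the AUC from pairwise comparisons of predicted scores. The sketch below is a plain-Python illustration on made-up labels; the function names are hypothetical, and counting tied scores as 0.5 in the AUC is a common convention assumed here rather than something the source states.

```python
def classification_metrics(y_true, y_pred):
    """Compute Accuracy, Precision, Recall and F1 from binary labels (1 = has yield)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive sample scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # Ties contribute 0.5 (assumed convention).
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)              # every value is 0.75 for this toy data
print(auc(y_true, scores))  # 0.9375
```

In this toy data TP = 3, FN = 1, FP = 1 and TN = 3, and 15 of the 16 positive/negative pairs are ordered correctly.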
When the training model is a regression model:
(1) Calculating the mean absolute error of the model: $MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $y_i$ represents the predicted label value of the i-th sample, $\hat{y}_i$ represents the true label value of the i-th sample, and n represents the total number of samples.
The mean absolute error MAE takes values in [0, +∞); the closer the value is to 0, the smaller the error.
(2) Calculating the maximum error of the model: $MaxError = \max_{1 \le i \le n} |y_i - \hat{y}_i|$
The maximum error MaxError takes values in [0, +∞); the closer the value is to 0, the smaller the error.
(3) Calculating the root mean square error of the model: $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
The root mean square error RMSE takes values in [0, +∞); the closer the value is to 0, the smaller the error.
(4) Calculating the coefficient of determination of the model: $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}$, where $\bar{y}$ represents the mean of the n true labels.
The coefficient of determination $R^2$ takes values in (-∞, 1]; the closer the value is to 1, the better.
(5) Calculating the Pearson correlation coefficient of the model: $r = \frac{\sum_{i} (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_{i} (x_i - m_x)^2 \sum_{i} (y_i - m_y)^2}}$, where x and y represent the true values and the predicted values, respectively, and $m_x$ and $m_y$ represent the means of x and y, respectively.
The Pearson correlation coefficient r takes values in [-1,1]. The larger the absolute value of the coefficient, the stronger the correlation between the two groups of data; a positive value indicates positive correlation, a negative value indicates negative correlation, and 0 indicates no correlation.
(6) Calculating the Kendall correlation coefficient of the model, where $N_1$ represents the number of cases in which the true values and the predicted results are consistent, and $N_2$ represents the number of cases in which they are inconsistent.
The Kendall correlation coefficient τ takes values in [-1,1]. The larger the absolute value of the coefficient, the stronger the correlation between the two groups of data; a positive value indicates positive correlation, a negative value indicates negative correlation, and 0 indicates no correlation.
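By way of illustration only, the regression indexes above can be sketched in Python using the standard library. The function names are hypothetical. The Kendall coefficient is computed here in its tau-a form (concordant pairs minus discordant pairs, divided by the total number of pairs), which is an assumption because the exact form used in the embodiment is not stated.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MaxError, RMSE and R^2 for paired true and predicted yields."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "max_error": max(abs(e) for e in errors),
        "rmse": math.sqrt(ss_res / n),
        "r2": 1.0 - ss_res / ss_tot,                 # undefined if all true labels are equal
    }


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

MAE, MaxError and RMSE are lower-is-better indexes, whereas $R^2$ and the two correlation coefficients are higher-is-better, so any ranking logic needs to know the direction of the index it sorts by.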
After the evaluation index of the corresponding model is obtained, step S50 is performed.
Step S50: reporting the evaluation index to the hyperparameter adjustment tool.
Step S60: acquiring a hyperparameter combination from the hyperparameter adjustment tool again, and executing the subsequent steps until a preset stop condition is reached.
When the hyperparameter adjustment tool provides a hyperparameter combination again in this step, it does so by taking into account the evaluation index obtained in the previous step. Hyperparameters are parameters set before the machine learning model is trained, not model parameters obtained through training, so they need to be optimized: a set of optimal hyperparameters is selected for the model to improve the performance and effect of model learning. In the present invention, hyperparameter optimization is based on the evaluation index of the previous model.
Therefore, when a hyperparameter combination is acquired from the preset hyperparameter adjustment tool in this step, the combination is either the optimal parameter set found by the tool after traversing the hyperparameter search space with a preset search algorithm, or the optimal parameter set found by the tool after traversing the hyperparameter search space with a preset search algorithm in combination with the evaluation index.
If the preset stop condition has not been met, a hyperparameter combination is acquired from the hyperparameter adjustment tool again and execution continues from step S30: the instantiation model is initialized according to the newly acquired hyperparameter combination to obtain an initialization model, the reaction vector is input into the initialization model for operation to obtain model parameters, the corresponding evaluation index is obtained according to the preset evaluation strategy, and finally the evaluation index is reported to the hyperparameter adjustment tool.
In the model training process of the embodiment of the invention, training and testing may be performed on a model with a single name, or models with several different names may be trained and tested in parallel. When each model is trained, the number N of training models set in the user's configuration information may be used as the preset stop condition, or the total training duration set in the user's configuration information may be used, or another preset stop condition may be used. Once the stop condition is met, the hyperparameter adjustment tool no longer provides hyperparameter combinations, and no initialization model is generated for a new hyperparameter combination.
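The acquire, train, evaluate and report loop with its stop conditions can be sketched as follows. This is a minimal sketch under stated assumptions, not the tool used in the embodiment: `RandomSearchTuner`, `run_trials` and `toy_train_and_evaluate` are hypothetical names, random search stands in for the preset search algorithm, and the toy scoring function merely replaces real model training and evaluation.

```python
import random
import time


class RandomSearchTuner:
    """Hypothetical stand-in for the hyperparameter adjustment tool."""

    def __init__(self, search_space, seed=0):
        self.search_space = search_space
        self.rng = random.Random(seed)
        self.history = []  # (hyperparameter combination, evaluation index)

    def suggest(self):
        # Random search ignores the history; a smarter tool would use it.
        return {name: self.rng.choice(values)
                for name, values in self.search_space.items()}

    def report(self, params, score):
        self.history.append((params, score))


def run_trials(tuner, train_and_evaluate, max_trials=None, max_seconds=None):
    """Repeat suggest -> train/evaluate -> report until a stop condition holds."""
    if max_trials is None and max_seconds is None:
        raise ValueError("set max_trials and/or max_seconds")
    start = time.monotonic()
    trials = 0
    while True:
        if max_trials is not None and trials >= max_trials:
            break  # number N of training models reached
        if max_seconds is not None and time.monotonic() - start >= max_seconds:
            break  # total training duration used up
        params = tuner.suggest()
        score = train_and_evaluate(params)
        tuner.report(params, score)
        trials += 1
    return tuner.history


def toy_train_and_evaluate(params):
    # Placeholder for training and evaluation; returns a score in (0, 1].
    return 1.0 / (1.0 + abs(params["max_depth"] - 6)
                  + abs(params["learning_rate"] - 0.1))
```

A call such as `run_trials(RandomSearchTuner({"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}), toy_train_and_evaluate, max_trials=20)` returns the history of combinations and scores, from which the subsequent screening step can select.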
Step S70: screening out an excellent model according to a preset model screening strategy and the evaluation indexes corresponding to the obtained initialization models.
A plurality of training models are obtained through the above steps, each with its corresponding evaluation index, and an excellent model is selected from them according to the preset model screening strategy. The evaluation index in this embodiment includes a main evaluation index and a secondary evaluation index, both of which are set by those skilled in the art. For example, the default main evaluation index is the model performance parameter $F_1$ for a classification model and the root mean square error for a regression model; the default secondary evaluation index is the AUC for a classification model and the coefficient of determination $R^2$ for a regression model. Those skilled in the art may reset the main and secondary evaluation indexes as circumstances require, and the present invention is not limited in this respect.
Screening out an excellent model according to the preset model screening strategy and the evaluation indexes corresponding to the obtained initialization models specifically comprises the following steps:
Step S701: screening out a plurality of top-ranked training models according to the main evaluation index.
For example, if N training models belonging to the classification model are obtained in total, the training models ranked in the top 5% by the model performance parameter $F_1$ are screened out.
Step S702: screening out the optimal training model from the plurality of training models according to the secondary evaluation index to serve as the excellent model.
For example, from the training models ranked in the top 5% by the model performance parameter $F_1$, the training model with the best AUC value is selected as the excellent model.
If the user's configuration information specifies a number of excellent models, the corresponding number of training models are selected as excellent models according to their ranking by evaluation index.
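The two-stage screening of steps S701 and S702 can be illustrated with the following sketch. The function name `screen_models`, the dictionary layout of a trial, and the rounding rule (at least one model is always kept) are assumptions made for the example only.

```python
import math


def screen_models(trials, primary, secondary, top_fraction=0.05,
                  primary_higher_is_better=True,
                  secondary_higher_is_better=True):
    """Keep the top fraction by the main index, then pick the best model
    among them by the secondary index."""
    ranked = sorted(trials, key=lambda t: t["metrics"][primary],
                    reverse=primary_higher_is_better)
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    shortlist = ranked[:keep]
    pick = max if secondary_higher_is_better else min
    return pick(shortlist, key=lambda t: t["metrics"][secondary])
```

For a classification task the call would be `screen_models(trials, "f1", "auc")`; for a regression task with the default indexes it would be `screen_models(trials, "rmse", "r2", primary_higher_is_better=False)`, because a smaller error is better.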
Step S80: storing the model according to the hyperparameter combination and the model parameters of the excellent model.
The stored model can be used to predict the yield of the same chemical reaction.
Specifically, before step S40, the method in the embodiment of the present invention further includes:
if the type of the instantiation model is a classification model, processing the reaction vector through a preset oversampling algorithm to obtain minority-class reaction vectors;
and inputting the original reaction vector and the minority-class reaction vectors into the initialization model together for operation.
For the training task of a classification model, the yield classes may be imbalanced during training, that is, the amounts of data in the different classes of the training sample set differ greatly. In that case, the reaction vectors can be input into an oversampling algorithm, which generates additional data for the class that originally had fewer samples, that is, constructs minority-class reaction vectors. The original reaction vectors and the generated minority-class reaction vectors are then input into the initialization model together for operation.
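As one possible illustration of the oversampling step, the sketch below uses plain random oversampling, that is, duplicating randomly chosen minority-class vectors until every class is as large as the largest one. This is a deliberately simple substitute: the embodiment only refers to a preset oversampling algorithm and does not name it, and the function name `random_oversample` is hypothetical.

```python
import random
from collections import Counter


def random_oversample(vectors, labels, seed=0):
    """Duplicate minority-class samples at random until all classes have
    as many samples as the largest class; originals are kept in place."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_vectors, new_labels = list(vectors), list(labels)
    for label, count in counts.items():
        pool = [v for v, l in zip(vectors, labels) if l == label]
        for _ in range(target - count):
            new_vectors.append(list(rng.choice(pool)))  # copy of an existing vector
            new_labels.append(label)
    return new_vectors, new_labels
```

Oversampling should be applied to the training sample set only; duplicating samples in the test sample set would distort the evaluation indexes.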
And/or, the method further comprises the following steps: performing a dimension reduction operation on the reaction vector through a preset dimension reduction algorithm to obtain a low-dimensional reaction vector; and inputting the low-dimensional reaction vector into the initialization model for operation.
If the data in the training sample set and the test sample set have too many feature dimensions, a preset dimension reduction algorithm can be applied to the reaction vector; for example, a low-dimensional reaction vector is obtained by applying the PCA algorithm. This removes redundant features, extracts the main features of the data and improves the efficiency of model training.
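A minimal pure-Python sketch of PCA-based dimension reduction is given below for illustration. It finds the leading principal components by power iteration with deflation, which is one of several ways to compute PCA and is chosen here only to avoid external dependencies; the function name `pca_project` and its parameters are hypothetical.

```python
import math
import random


def pca_project(vectors, n_components=1, iterations=200, seed=0):
    """Project row vectors onto their leading principal components."""
    rng = random.Random(seed)
    n, d = len(vectors), len(vectors[0])
    means = [sum(row[j] for row in vectors) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in vectors]
    # Sample covariance matrix (d x d); requires at least two samples.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    components = []
    for _component in range(n_components):
        # Random unit start vector, so it is not orthogonal to the answer.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in the remaining directions
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflate so the next pass finds the next component.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(r[j] * c[j] for j in range(d)) for c in components]
            for r in centered]
```

The sign of each component is arbitrary, so projections may come out negated between runs with different seeds. In practice the components would be fitted on the training sample set and then reused to project the test sample set.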
The corresponding model parameters are then obtained through the above steps, the corresponding evaluation index is obtained according to the preset evaluation strategy, and the subsequent steps are executed.
The embodiment of the present invention further provides an electronic device 200, as shown in Fig. 5, comprising: a memory 201 for storing a computer program; and a processor 202 for implementing the chemical reaction yield prediction method of the above embodiment when executing the computer program. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The memory may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM) and random access memory (RAM), for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk.
According to the chemical reaction yield prediction method and the electronic device of the invention, an instantiation model is constructed from the acquired configuration information of the user, and the read sample data of chemical yield is encoded to generate reaction vectors. A hyperparameter combination is acquired from the preset hyperparameter adjustment tool and used to initialize the instantiation model, yielding an initialization model; the reaction vectors are input into the initialization model for operation to obtain model parameters, and the corresponding evaluation index is obtained according to the preset evaluation strategy. Until the preset stop condition is met, a hyperparameter combination is acquired from the hyperparameter adjustment tool again and the operation is repeated to obtain further model parameters and evaluation indexes. Once the stop condition is met, an excellent model is screened out according to the preset model screening strategy and stored, and the stored model can be used to predict the yield of the chemical reaction. By training and storing a plurality of machine learning models, the invention can obtain a model that is better adapted to the chemical reaction, which improves the accuracy of chemical yield prediction. Because each hyperparameter combination is obtained by taking into account the evaluation indexes of the existing models, no manual tuning is needed, which further improves the efficiency and quality of model training; the invention therefore has high value for use and popularization.
The invention has been further described above with reference to specific embodiments. It should be understood that this detailed description is not to be construed as limiting the spirit and scope of the invention; various modifications made to the described embodiments by those skilled in the art after reading this specification fall within the scope of protection of the invention.
Claims (10)
1. A method for predicting yield of a chemical reaction, the method comprising:
acquiring configuration information of a user, and constructing an instantiation model according to the configuration information;
reading sample data of chemical yield, and encoding the sample data according to a preset reaction vector generation strategy to generate a reaction vector;
acquiring a hyperparameter combination from a preset hyperparameter adjustment tool, and initializing the instantiation model according to the hyperparameter combination to obtain an initialization model;
inputting the reaction vector into the initialization model for operation, obtaining model parameters, and obtaining corresponding evaluation indexes according to a preset evaluation strategy;
reporting the evaluation index to the hyperparameter adjustment tool;
acquiring a hyperparameter combination from the hyperparameter adjustment tool again, and executing the subsequent steps until a preset stop condition is reached;
screening out an excellent model according to a preset model screening strategy and the evaluation indexes corresponding to the obtained initialization models;
and storing the model according to the hyperparameter combination and the model parameters of the excellent model.
2. The method of claim 1, wherein obtaining configuration information of a user and constructing an instantiation model based on the configuration information, comprises:
reading configuration information of a user from a set path in a project folder; the configuration information at least comprises model names, the number N of training models and training duration;
and reading model data from the project folder according to the model names in the configuration information, and acquiring the hyperparameter search space of the corresponding model.
3. The method for predicting chemical reaction yield according to claim 2, wherein reading sample data of chemical yield and encoding the sample data according to a preset reaction vector generation strategy to generate a reaction vector comprises:
reading sample data, and dividing the sample data into a training sample set and a test sample set according to a set proportion; the training sample set and the test sample set both contain reactants, products and reaction conditions;
and sequentially encoding reactants, products and reaction conditions in the training sample set and the test sample set according to the reaction vector generation strategy to generate a reaction vector corresponding to each sample data.
4. A chemical reaction yield prediction method according to claim 3, wherein when a hyperparameter combination is acquired from the preset hyperparameter adjustment tool, the hyperparameter combination is an optimal parameter set found by the hyperparameter adjustment tool after traversing the hyperparameter search space through a preset search algorithm; or, the hyperparameter combination is an optimal parameter set found by the hyperparameter adjustment tool after traversing the hyperparameter search space through a preset search algorithm in combination with the evaluation index.
5. The method for predicting chemical reaction yield according to claim 4, wherein inputting the reaction vector into the initialization model for operation, obtaining model parameters, and obtaining corresponding evaluation indexes according to a preset evaluation strategy comprises:
inputting the training set reaction vector into the initialization model for training to obtain the model parameters;
generating a training model according to the model parameters;
inputting the test set reaction vector into the training model for testing to obtain a prediction result;
and evaluating the training model according to the evaluation strategy, the test set reaction vector and the prediction result to obtain the evaluation index.
6. The method of predicting yield of a chemical reaction according to claim 5, wherein evaluating the training model based on the evaluation strategy, the test set reaction vector, and the prediction result to obtain the evaluation index comprises:
when the training model is a classification model,
calculating the model accuracy $Accuracy = \frac{TP + TN}{TP + FN + FP + TN}$, wherein TP represents the number of true positives, for which the actual case is positive and the predicted result is positive; FN represents the number of false negatives, for which the actual case is positive and the predicted result is negative; FP represents the number of false positives, for which the actual case is negative and the predicted result is positive; and TN represents the number of true negatives, for which the actual case is negative and the predicted result is negative;
calculating the model precision $Precision = \frac{TP}{TP + FP}$;
calculating the model recall $Recall = \frac{TP}{TP + FN}$;
calculating the model performance parameter $F_1 = \frac{2PR}{P + R}$, wherein P represents the model Precision and R represents the model Recall;
calculating the model AUC value, wherein $pred_{pos}$ represents the number of positive examples in the predicted results and $pred_{neg}$ represents the number of negative examples in the predicted results; $pos_{num}$ represents the number of actual positive examples and $neg_{num}$ represents the number of actual negative examples.
7. The method of predicting yield of a chemical reaction according to claim 5, wherein evaluating the training model based on the evaluation strategy, the test set reaction vector, and the prediction result to obtain the evaluation index comprises:
when the training model is a regression model,
calculating the mean absolute error of the model $MAE = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$, wherein $y_i$ represents the predicted label value of the i-th sample, $\hat{y}_i$ represents the true label value of the i-th sample, and n represents the total number of samples;
calculating the maximum error of the model $MaxError = \max_{1 \le i \le n} |y_i - \hat{y}_i|$;
calculating the root mean square error of the model $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$;
calculating the coefficient of determination of the model $R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}$, wherein $\bar{y}$ represents the mean of the n true labels;
calculating the Pearson correlation coefficient of the model $r = \frac{\sum_{i} (x_i - m_x)(y_i - m_y)}{\sqrt{\sum_{i} (x_i - m_x)^2 \sum_{i} (y_i - m_y)^2}}$, wherein x and y represent the true values and the predicted values, respectively, and $m_x$ and $m_y$ represent the means of x and y, respectively;
calculating the Kendall correlation coefficient of the model, wherein $N_1$ represents the number of cases in which the true values and the predicted results are consistent, and $N_2$ represents the number of cases in which they are inconsistent.
8. The method for predicting yield of a chemical reaction of claim 1, further comprising:
if the type of the instantiation model is a classification model, processing the reaction vector through a preset oversampling algorithm to obtain a minority reaction vector;
inputting the original reaction vector and the minority reaction vector into the initialization model together for operation;
and/or,
performing dimension reduction operation on the reaction vector through a preset dimension reduction algorithm to obtain a low-dimension reaction vector;
and inputting the low-dimensional reaction vector into the initialization model for operation.
9. The method of predicting yield of a chemical reaction according to claim 5, wherein said evaluation index comprises a main evaluation index and a secondary evaluation index, and screening out an excellent model according to a preset model screening strategy and the evaluation indexes corresponding to the obtained initialization models comprises:
screening out a plurality of top-ranked training models according to the main evaluation index;
and screening out an optimal training model from the plurality of training models according to the secondary evaluation index to serve as the excellent model.
10. An electronic device, comprising:
a memory for storing a computer program;
a processor for implementing a chemical reaction yield prediction method according to any one of claims 1 to 9 when executing said computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311018103.6A CN117116370A (en) | 2023-08-11 | 2023-08-11 | Chemical reaction yield prediction method and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117116370A (en) | 2023-11-24 |
Family
ID=88812055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311018103.6A Pending CN117116370A (en) | 2023-08-11 | 2023-08-11 | Chemical reaction yield prediction method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117116370A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||