WO2023115570A1 - Management method and device for a machine learning model, computer equipment, and storage medium
- Publication number
- WO2023115570A1 (PCT/CN2021/141344)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- the present invention relates to the technical field of artificial intelligence, in particular to a management method, device, computer equipment and storage medium of a machine learning model.
- the machine learning model itself is highly adaptable, and it is possible to find problem domains and corresponding solutions in various situations, but it is still difficult to provide complete solutions for many specific problems in long-term research fields.
- Machine learning can solve the placement of high-demand SKUs (Stock Keeping Units, which refer to the collection of product sales attributes) through modeling, and can also provide long-tail demand forecasts for future warehousing needs, but it cannot fully plan the storage cycle and placement of all commodities as a whole.
- an embodiment of the present application provides a method for managing a machine learning model, including:
- Establish an association relationship between the process node, the data set, and the model; wherein the association relationship can be used to indicate the data set and the model that the process node is determined to call when it is executed.
- Before determining the data set and model corresponding to the configuration information of the process node, the method further includes:
- the dataset is stored in a dataset repository.
- Before determining the data set and model corresponding to the configuration information of the process node, the method further includes:
- the model carries a model description for determining a usage scenario of the model.
- the association relationship where the updated data set and/or any updated model is located is updated.
- updating all models associated with the data set includes:
- updating the association relationship of the updated data set and/or any updated model includes:
- If the second preset condition is satisfied, update the association relationship of the updated data set and/or any updated model; otherwise, do not perform the operation of updating that association relationship.
- the overall effect of the process is represented by a weighted sum of the execution results of each process node.
- an embodiment of the present application provides a management device for a machine learning model, including:
- a screening module configured to determine a data set and a model corresponding to configuration information of a process node; wherein the process node is any process node in the process, and the configuration information includes execution conditions pointing to the data set and the model;
- an association module configured to establish an association relationship between the process node, the data set, and the model; wherein the association relationship can be used to indicate the data set and the model that the process node is determined to call when it is executed.
- the device for managing the machine learning model also includes:
- an acquisition module configured to obtain a data set and store the data set in a data set library.
- the acquisition module is further configured to acquire a model and store the model in a model library; wherein the model carries a model description for determining a usage scenario of the model.
- the device for managing the machine learning model also includes:
- a preprocessing module configured to perform versioning processing on the data sets in the data set library and the models in the model library through the DVC tool;
- the preprocessing module is also used to establish the association relationship between a single data set and all related models for the versioned data sets and models.
- the device for managing the machine learning model also includes:
- An update module configured to update all models associated with the data set when the data set is updated
- the updating module is further configured to update the associated relationship of the updated data set and/or any updated model according to the updated data set and all updated models.
- the update module includes:
- a training module configured to use the updated data set as input to train all models associated with the data set
- a judging module configured to judge whether the training effect of the currently trained model satisfies the first preset condition; if it is satisfied, the currently trained model is updated; otherwise, the operation of updating the currently trained model is not performed.
- the update module also includes:
- An execution module configured to invoke the updated data set and all updated models, and execute all process nodes associated with the updated data set and/or any updated model
- the judging module is further configured to judge whether the overall effect of the process satisfies the second preset condition after all the process nodes are executed; if it is satisfied, the association relationship of the updated data set and/or any updated model is updated; otherwise, the operation of updating that association relationship is not performed.
- an embodiment of the present application provides a computer device, including: a processor, a memory, and a computer program stored on the memory, the processor is coupled to the memory, and the processor executes the computer program when working.
- the computer program is used to realize the management method of the machine learning model as described above.
- an embodiment of the present application provides a computer-readable storage medium; the storage medium stores computer instructions, and when the computer instructions are executed by a computer, the computer executes the method for managing a machine learning model described above.
- By matching the data sets and models required for the execution of each process node in the whole process, and establishing the association relationship between each process node, data set, and model to form a closed loop, data production and data use can be closely linked in an environment where multiple models share data sets. This improves the efficiency and effect of data use, and in turn the management efficiency of long-cycle processes.
- FIG. 1 is a schematic flowchart of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 2 is a schematic flowchart of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 3 is a schematic flowchart of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 4 is a schematic flowchart of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 5 is a schematic flowchart of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 6 is a schematic flowchart of a machine learning model management system used for molecular property prediction in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a system framework corresponding to a machine learning model management method in an embodiment of the present application.
- Fig. 8 is a schematic structural diagram of a machine learning model management device in an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 10 is a schematic structural diagram of a method for managing a machine learning model in an embodiment of the present application.
- Fig. 11 is a schematic structural diagram of a machine learning model management method in an embodiment of the present application.
- Machine learning: a branch of artificial intelligence.
- The research history of artificial intelligence follows a natural and clear thread from focusing on “reasoning”, to focusing on “knowledge”, and then to focusing on “learning”.
- Machine learning is a way to realize artificial intelligence, that is, a means of solving problems in artificial intelligence.
- Graph database: a database that uses a graph structure for semantic queries, using nodes, edges, and attributes to represent and store data.
- The key concept of the system is the graph, which directly associates data items in storage with collections of data nodes and edges representing relationships between nodes.
- Data version management: a software engineering technique that ensures that the same program files edited by different people stay synchronized during the software development process.
- Data warehouse: a central repository of integrated data from one or more disparate sources, storing current and historical data together.
- Model: a file that, after being trained, can recognize a specific type of pattern.
- The embodiments of the present application provide a machine learning model management method, device, computer equipment, and storage medium.
- The management method matches the data sets and models required for the execution of each process node in the entire process, and establishes the association relationship between each process node, data set, and model to form a closed loop.
- Data production and data use can thus be closely linked, improving the efficiency and effect of data use and thereby the management efficiency of long-cycle processes.
- The management method can use a data set update as the trigger condition: when a data set or a model is updated, it triggers an update of the data set, model, and process association and an update of the calculation results, providing a basis for model training, process effect evaluation, or data quality evaluation. The association can be represented as a DAG (Directed Acyclic Graph), a data structure commonly used in the computer field; because of the properties of its topological structure, it is often used in algorithm scenarios such as dynamic programming, shortest-path search in navigation, and data compression.
- an embodiment of the present application provides a method for managing a machine learning model, including step S100 and step S200 .
- Step S100 Determine the data set and model corresponding to the configuration information of the process node.
- The process node is any process node in the process, and the configuration information includes execution conditions pointing to the data set and the model.
- A whole process includes several process nodes, and each process node has different configuration requirements depending on its application scenario.
- A process may involve many models. These models differ considerably from one another: their training processes, output results, and output cycles all differ. Within a whole process, many models may also use the same data set.
- An appropriate data set is the key to an effective model; executing a process node effectively requires selecting the data sets and models that match it.
- the configuration information of the process node is used to determine which data sets and models need to be used when the process node is executed.
- Step S200 Establish an association relationship between the process node, the data set, and the model.
- association relationship can be used to indicate the data set and the model that are determined to be invoked when the process node is executed.
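Steps S100 and S200 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the registries, the `conditions` field, and the helper names `matches` and `associate` are hypothetical and do not appear in the application, which does not specify how execution conditions are encoded.

```python
# Hypothetical registries; names and meta fields are illustrative only.
DATASETS = [
    {"name": "ds_mol_20210421", "meta": {"domain": "molecule", "task": "property"}},
    {"name": "ds_sales_2021", "meta": {"domain": "retail", "task": "demand"}},
]
MODELS = [
    {"name": "wf_mol_property_0721", "meta": {"domain": "molecule", "task": "property"}},
    {"name": "demand_forecaster", "meta": {"domain": "retail", "task": "demand"}},
]


def matches(item, conditions):
    """Return True when every execution condition equals the item's meta value."""
    return all(item["meta"].get(key) == value for key, value in conditions.items())


def associate(node, datasets, models):
    """Step S100: select matching data sets and models. Step S200: record the association."""
    chosen_datasets = [d["name"] for d in datasets if matches(d, node["conditions"])]
    chosen_models = [m["name"] for m in models if matches(m, node["conditions"])]
    if not chosen_datasets or not chosen_models:
        raise LookupError(f"no data set or model satisfies node {node['name']!r}")
    return {"node": node["name"], "datasets": chosen_datasets, "models": chosen_models}


# A hypothetical process node whose configuration points at molecular property data.
node = {"name": "toxicity_screening", "conditions": {"domain": "molecule", "task": "property"}}
association = associate(node, DATASETS, MODELS)
```

The returned dictionary plays the role of the association relationship: it names the data sets and models that the node would call when executed.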
- The data set, model, and process are the subjects of the relationship and serve as the points of the DAG; the corresponding versions of the data set, model, and process serve as the connections, that is, the edges of the DAG, from which the DAG is constructed.
- With each subject of the relationship as a point and the corresponding version as an edge, a record in the graph is established.
- G represents the entire undirected graph
- ds_mol_20210421 represents a data set stored in a certain system
- wf_mol_property_0721 represents a stored model
- G(ds_mol_20210421, wf_mol_property_0721) represents a set of relations
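As a minimal sketch of such a record, the following Python keeps points and version-labelled edges in memory. The class `RelationGraph`, its method names, and the version label "v1" are assumptions made for illustration; the application stores these records in a graph database rather than in Python objects.

```python
class RelationGraph:
    """Toy in-memory graph: subjects are points, versions label the edges."""

    def __init__(self):
        self.points = {}   # point name -> kind ("dataset", "model", "process")
        self.edges = []    # (source, target, version) tuples

    def add_point(self, name, kind):
        self.points[name] = kind

    def relate(self, source, target, version):
        # Both endpoints must already be registered as points.
        if source not in self.points or target not in self.points:
            raise KeyError("both endpoints must be added as points first")
        self.edges.append((source, target, version))

    def relations_of(self, name):
        """Return every edge that touches the given point."""
        return [edge for edge in self.edges if name in edge[:2]]


G = RelationGraph()
G.add_point("ds_mol_20210421", "dataset")
G.add_point("wf_mol_property_0721", "model")
# "v1" is a hypothetical version label standing in for a real version number.
G.relate("ds_mol_20210421", "wf_mol_property_0721", version="v1")
```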
- Establishing the association relationship between the process node, the data set, and the model includes using a scheduling tool to create and change a process, and to establish model relationships on different nodes in the process.
- The scheduling tool can be selected from Luigi (a Python-based tool that helps build complex pipelines of batch tasks), Airflow (an open source scheduling tool written in Python, originally from Airbnb), or Celery (a distributed asynchronous message task queue, through which asynchronous processing of tasks can be easily realized).
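To show what a scheduling tool contributes without depending on any of the tools named above, the sketch below uses Python's standard-library `graphlib` (Python 3.9+) as a stand-in. It is not the API of any of those tools. The node names are borrowed from the property research steps mentioned later in the text, while the dependencies between them and the model names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical process: each key is a process node, each value is the set of
# nodes that must run before it.
PROCESS = {
    "activity": set(),
    "physicochemical": {"activity"},
    "metabolic": {"physicochemical"},
    "toxicological": {"physicochemical"},
    "druggability": {"metabolic", "toxicological"},
}

# Hypothetical model assigned to each node.
NODE_MODELS = {
    "activity": "model_activity",
    "physicochemical": "model_physchem",
    "metabolic": "model_metabolic",
    "toxicological": "model_tox",
    "druggability": "model_druggability",
}


def run_process(process, node_models):
    """Visit nodes in dependency order and record which model each node would call."""
    executed = []
    for node in TopologicalSorter(process).static_order():
        executed.append((node, node_models[node]))
    return executed


executed = run_process(PROCESS, NODE_MODELS)
order = [node for node, _ in executed]
```

Changing the process amounts to editing `PROCESS`; a real scheduler would additionally handle retries, parallelism, and persistence.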
- the method for managing a machine learning model further includes step S111 and step S112 .
- Step S111 Obtain a data set.
- Step S112 Store the data set in a data set library.
- The acquired data sets are all structured data, so mature file version management software can be used directly.
- DVC (Data Version Control) is a data version management tool, and git is an open source distributed version control system.
- Machine learning experiment data, that is, data sets, need to be managed: the training data and test data used may change between experiments.
- In addition to the config (configuration), the data is also very important.
- DVC tools can easily store data on many storage systems, such as local disks, SSH (Secure Shell, a network protocol) servers, or cloud systems such as AWS S3 (Amazon Web Services Simple Storage Service) and GCP (Google Cloud Platform).
- the data managed by the DVC tool can be easily shared with other users using this storage system.
- DVC tools and git can realize file version management based on local or remote data warehouses. Meta information is then added to the versioned data set and stored in a structured database for easy querying when needed. The meta information records the basic information of the data set, including creation time, creator, and size, and also includes some manually assigned tags, such as the number of features and the total number of entries. Based on this meta information, the data set required for model training or process node execution can be matched quickly. Since the DVC tool supports both local file system storage and remote file object storage, it can meet the needs of data management in different scenarios. If the data set has encryption requirements, the files can also be encrypted with the AES algorithm when stored.
- the data sets in the data set library can be managed including version management, data compression and storage, data extraction, integrity verification, and caching.
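Two of the functions listed above, compressed storage and integrity verification, can be illustrated with the Python standard library. This is a sketch under assumptions: the function names `store` and `extract` and the record layout are hypothetical, and the application does not state which compression format or checksum routine is actually used.

```python
import gzip
import hashlib


def store(raw: bytes) -> dict:
    """Compress the payload and keep an MD5 digest of the uncompressed bytes."""
    return {
        "md5": hashlib.md5(raw).hexdigest(),
        "size": len(raw),
        "blob": gzip.compress(raw),
    }


def extract(record: dict) -> bytes:
    """Decompress the payload and verify it against the stored digest."""
    raw = gzip.decompress(record["blob"])
    if hashlib.md5(raw).hexdigest() != record["md5"]:
        raise ValueError("integrity check failed: digest mismatch")
    return raw


payload = b"smiles,solubility\nCCO,1.10\n"   # hypothetical data set content
record = store(payload)
restored = extract(record)
```

MD5 is used here only because the text lists an MD5 code among the stored attributes; it detects accidental corruption but is not suitable as a security measure.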
- the method for managing a machine learning model further includes step S121 and step S122 .
- Step S121 Obtain a model.
- Step S122 Store the model in a model library.
- the model carries a model description
- the model description refers to a textual description for the model to be trained, and is used to determine the usage scenario of the model.
- Model serialization refers to converting the model object existing in memory into a data stream or file object that can be stored in persistent storage through software engineering means. The specific implementation of model serialization depends on the way the model object exists.
- Take a model trained by the Python machine learning framework sklearn as an example: through the pickle module that comes with Python, the model can be stored as a pickle object and saved in a file. The file can be transferred to other hardware with the same environment, deserialized into a model object by the pickle module, and used in the new environment. Verifying the training effect of model training needs to be completed through CI tools, such as the Jenkins continuous integration tool.
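The round trip described above can be sketched with the standard `pickle` module. To keep the example self-contained, a plain dictionary of learned parameters stands in for a trained sklearn estimator; that substitution, the file name, and the `predict` helper are assumptions for illustration. A real estimator object would be passed to `pickle.dump` in the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: a dict of learned parameters (hypothetical values).
model = {"kind": "linear", "coef": [0.5, -1.25], "intercept": 2.0}


def predict(m, features):
    """Evaluate the stand-in linear model on one feature vector."""
    return sum(c * x for c, x in zip(m["coef"], features)) + m["intercept"]


# Serialize: convert the in-memory object into a file in persistent storage.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

# Deserialize: in practice this would happen on another machine with the same environment.
with open(path, "rb") as fh:
    restored_model = pickle.load(fh)

prediction = predict(restored_model, [2.0, 4.0])
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.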
- The DVC tool is simple to install and easy to use: dvc push uploads data, and dvc pull downloads it quickly.
- dvc add generates a new file; for example, dvc add data.sql generates data.sql.dvc (kilobyte-level).
- git uploads the file data.sql.dvc, and according to this .dvc file the DVC tool can pull the corresponding file (for example, the data file saved at the remote end). If you need a specific version of the data, model, and code, you only need to git checkout the version number and then run dvc pull.
- the machine learning models used in each process node and each stage of a whole process are connected in series through data sets and comprehensively managed to improve the efficiency of data use, optimize model performance, and discover potential problems in a timely manner.
- the method for managing a machine learning model further includes step S131 and step S132.
- Step S131 Perform versioning processing on the datasets in the dataset repository and the models in the model repository through the DVC tool.
- Step S132 For the versioned data sets and models, establish the association relationship between a single data set and all related models.
- the datasets stored in the dataset library can be used for training the models in the model library, and for calling when process nodes are executed. Since there may be many models using the same data set in the whole process, it is necessary to establish the association relationship between a single data set and all related models in advance to improve the use efficiency of the data set.
- the method for managing a machine learning model further includes step S300 and step S400 .
- Step S300 When the data set is updated, update all models associated with the data set.
- Step S400 According to the updated data set and all the updated models, update the association relationship of the updated data set and/or any updated model.
- Based on the training, verification, and iteration of multiple models on a shared data set, a process is built according to the application scenario. Through the process, the models required by different application scenarios and at different stages are integrated in a DAG, and the data set update is used as the trigger condition to iteratively update the model itself and the process-data set-model relationship in which the model is located.
- The update of the data set is also used as a trigger condition to calculate the overall effect of the whole process.
- The impact of a data set change on each stage can be fed back and analyzed, providing directional guidance for stage optimization.
- step S300 includes step S310 , step S320 , step S321 and step S322 .
- Step S310 Using the updated data set as input to train all models associated with the data set.
- Step S320 Determine whether the training effect of the currently trained model satisfies the first preset condition.
- The criterion for whether a model has been trained successfully, that is, the first preset condition, is set according to the actual application scenario and training purpose of each model.
- Step S321 When satisfied, update the currently trained model.
- Step S322 When not satisfied, do not perform the operation of updating the currently trained model.
- In that case, the current model update is terminated, and an update report is output indicating the reason for terminating the current model update.
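Steps S310 to S322 amount to a retrain-then-gate loop, sketched below. Everything concrete here is assumed for illustration: the toy training function, the score threshold standing in for the first preset condition, and the report wording are not taken from the application.

```python
def retrain_all(dataset, models, train, first_condition):
    """Retrain every model associated with the data set; keep a result only if it passes."""
    report = {}
    for name, current in list(models.items()):
        candidate, score = train(current, dataset)          # step S310
        if first_condition(name, score):                    # step S320
            models[name] = candidate                        # step S321
            report[name] = "updated"
        else:                                               # step S322
            report[name] = (
                f"not updated: score {score:.2f} did not meet the first preset condition"
            )
    return report


def toy_train(current, dataset):
    """Hypothetical trainer: bumps the version and scores by mean label minus a bias."""
    candidate = {"version": current["version"] + 1, "bias": current["bias"]}
    score = sum(dataset) / len(dataset) - current["bias"]
    return candidate, score


models = {
    "solubility": {"version": 1, "bias": 0.0},
    "toxicity": {"version": 1, "bias": 0.3},
}
updated_dataset = [0.8, 0.9, 1.0]
report = retrain_all(updated_dataset, models, toy_train, lambda name, score: score >= 0.75)
```

The report dictionary corresponds to the update report: models that fail the condition keep their previous version, and the reason is recorded.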
- step S400 includes step S410, step S420, step S421 and step S422.
- Step S410 call the updated data set and all updated models, and execute all process nodes associated with the updated data set and/or any updated model.
- When a data set is updated, its association relationship also needs to be updated synchronously. Similarly, when all associated models are updated due to the update of the data set, the association relationships of the updated models also need to be updated. It can be understood that among the association relationships (process-data set-model) that need to be updated, in some both the data set and the model are updated, and in some only the model is updated.
- Step S420 Determine whether the overall effect of the process satisfies the second preset condition after all the process nodes are executed.
- The overall effect of the process is a final value obtained by summing the effect of each node in the DAG object multiplied by its empirical weight.
- Taking molecular property prediction as an example, the target can be preset to be the prediction accuracy rate; each step (step1-step6) is then assigned a proportion of the target, such as 10%, 20%, 20%, 25%, 15%, 10%, so that an overall accuracy rate based on the current training set can be obtained from the training results and the verification set.
- Setting these proportions requires some experience at the beginning, but as data accumulates and the model effect improves, the proportions can be adjusted according to the final effect to achieve the desired overall effect.
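The weighted evaluation can be written out directly. The weights below are the example proportions from the text; the per-step accuracies are hypothetical values chosen only to make the arithmetic concrete.

```python
# Example proportions for step1-step6, taken from the text.
WEIGHTS = [0.10, 0.20, 0.20, 0.25, 0.15, 0.10]


def overall_effect(step_scores, weights=WEIGHTS):
    """Weighted sum of per-node effects; weights are expected to sum to 1."""
    if len(step_scores) != len(weights):
        raise ValueError("one score per step is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, step_scores))


# Hypothetical accuracy of each step on the verification set.
step_accuracy = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60]
overall = overall_effect(step_accuracy)
# 0.09 + 0.16 + 0.17 + 0.175 + 0.1425 + 0.06 = 0.7975 (up to floating-point rounding)
```

Adjusting the proportions over time, as the text suggests, only requires passing a different `weights` list.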
- Step S421 when satisfied, update the association relationship where the updated data set and/or any updated model is located.
- Step S422 If it is not satisfied, do not perform the operation of updating the associated relationship of the updated data set and/or any updated model.
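Steps S420 to S422 can be read as a guarded commit: the candidate association replaces the stored one only when the overall effect passes. The sketch below assumes a simple numeric threshold as the second preset condition; the function name, version labels, and threshold are hypothetical.

```python
def commit_association(associations, node, candidate, overall, second_condition):
    """Replace the stored association only if the overall effect passes the threshold."""
    if overall >= second_condition:           # step S420
        associations[node] = dict(candidate)  # step S421
        return True
    return False                              # step S422: keep the old association


associations = {
    "toxicity_screening": {"dataset": "ds_mol:v1", "model": "tox_model:v1"},
}
candidate = {"dataset": "ds_mol:v2", "model": "tox_model:v2"}

rejected = commit_association(associations, "toxicity_screening", candidate, 0.70, 0.75)
after_reject = dict(associations["toxicity_screening"])
accepted = commit_association(associations, "toxicity_screening", candidate, 0.80, 0.75)
```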
- Version management of data sets, models, and processes is carried out in combination with DVC tools. By connecting data sets and models to build a process based on the application scenario, the sensitivity of the entire process to data changes is improved, effective data sets can be managed uniformly, and the change in final quality brought about by a change at each node can be traced.
- the drug development process is a long-term development process, which involves quite a few small links.
- A property prediction process is established for the key drug-like properties that determine the success rate of small-molecule drug development, including solubility, potency, selectivity, bioavailability, permeability, and toxicity.
- the selection of these properties is not entirely a choice of advantages and disadvantages, but a comprehensive trade-off is required.
- the data set for its model training is mainly derived from the property determination of various small molecule drug molecules accumulated in experiments.
- The version number of the data set is generated by the DVC tool and obtained through the push operation of the DVC tool. The default name is in the form name+md5; here it is mol_property:87h5d41998yniinkfj348haj01ghp41urqp.
- The same tools (DVC tool + git + aws S3) are used to manage the first version of the model trained by data scientists or AI algorithm experts. The command line of the DVC tool is likewise used to upload the model file to the pre-configured aws S3 bucket, yielding a series of name+md5 encoded versions. Pre-configuration mainly concerns the DVC tool: the tool must be told which aws S3 address to use, this address must be applied for in advance, and access permissions must be set locally.
- The version names obtained above can simply be recorded in text or in a jupyter-notebook, but for subsequent management and display they are stored uniformly in a database.
- The relational database postgresDB (PostgreSQL) is selected for storage; a version table is established, and the corresponding version name is stored in the table as the primary key.
- Descriptive information about data sets and models is stored in meta-related fields, such as the size of the data set, fields, source, maintainer, build date, etc.
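A version table of this shape can be sketched with Python's built-in `sqlite3`, used here purely as a stand-in so the example runs without a PostgreSQL server. The column names and all row values other than the version name quoted in the text are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for the relational database
conn.execute(
    """
    CREATE TABLE version (
        version_name TEXT PRIMARY KEY,   -- name+md5 style version name
        kind         TEXT NOT NULL,      -- 'dataset' or 'model'
        size_bytes   INTEGER,
        source       TEXT,
        maintainer   TEXT,
        build_date   TEXT,
        node_id      TEXT                -- graph node _id, backfilled later
    )
    """
)

row = (
    "mol_property:87h5d41998yniinkfj348haj01ghp41urqp",  # version name from the text
    "dataset",
    1048576,           # hypothetical size
    "lab assays",      # hypothetical source
    "data-team",       # hypothetical maintainer
    "2021-04-21",      # hypothetical build date
    None,              # node_id not yet known
)
conn.execute("INSERT INTO version VALUES (?, ?, ?, ?, ?, ?, ?)", row)
conn.commit()

stored = conn.execute(
    "SELECT kind, maintainer FROM version WHERE version_name = ?", (row[0],)
).fetchone()
```

Because the version name is the primary key, inserting the same version twice is rejected by the database, which keeps each version recorded exactly once.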
- arangodb provided by aws S3 is used for relationship management.
- A node is created for each stored version name, and the name and required attributes are written into the graph database, yielding a series of node objects. For data management, to ensure data integrity and traceability, the attributes generally include but are not limited to: total file size, total number of files, MD5 code, uploader, maintainer, upload time, data description, and main features.
- Each object has _id (_id is generated by arangodb, which guarantees the global uniqueness of each node) as a unique index. This _id can be backfilled into the version table as one of the version management attributes.
- the obtained node objects are used to construct the edge relationship, that is, the node objects are connected through graph management.
- The connection relationship here uses the version number obtained during version management to establish edge connections. Since there is a one-way dependency between the data set and the model, the edges currently established are the relationships between a single data set {m} and all related models {n}, represented by the set E{m,n}.
- the set of edge relations can be stored back in the version table, or it can not be stored. As long as there is a node_id, all the edge relationships and nodes connected to this node can be queried.
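The edge set E{m,n} and the lookup by node id can be illustrated with plain Python sets. This is not the graph database's query interface; the node ids, version label, and helper names are hypothetical.

```python
def build_edges(dataset_id, model_ids, version):
    """E{m,n}: one directed edge from data set m to each related model n."""
    return {(dataset_id, model_id, version) for model_id in model_ids}


def connected(edges, node_id):
    """Return, in sorted order, every edge that starts or ends at the given node."""
    return sorted(edge for edge in edges if node_id in edge[:2])


# Hypothetical node ids in a "collection/key" style.
m = "datasets/mol_property"
n = ["models/solubility", "models/toxicity", "models/pka"]
E = build_edges(m, n, version="v1")

from_dataset = connected(E, m)
from_model = connected(E, "models/toxicity")
```

Given only a node id, all incident edges can be recovered from the edge set, which is why storing the edge set back in the version table is optional.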
- the process of drug small molecule property research can be divided into the following steps: activity, physicochemical properties, metabolic properties, toxicological properties and druggability.
- Tasks within these steps include, for example, drug activity screening, pKa dissociation prediction, lipid-water partition coefficient prediction, and drug toxicity screening.
- The relationship {Dataset, DAG, Model} -> {result} can also be established.
- Once this relationship is established, process management can report the change in results brought by each data set change or model change, providing a basis for model training or data quality assessment.
- the version management of the data set, the model and the process is carried out in combination with the DVC tool.
- a process based on the application scenario is built, so that the update of the data set can be fed back to the overall training process and overall quality faster.
- from the changes in the results of the node tasks in the DAG, the impact of data set changes on each stage can be fed back and analysed, providing directional guidance for stage optimization.
- the management method of the machine learning model provides a solution for version management of the data set + model + process in combination with the DVC tool.
- please refer to FIG. 7.
- a system framework as shown in FIG. 7 is built, including: a process management module, a data management module, a model management module and a relationship management module, as well as a data set storage module, a model storage module and an association relationship storage module.
- the data set management module is used to manage data files through the DVC tool+git+aws S3, and store the data files in the data set storage module connected with the data set management module.
- the data set management module is used in the data management module, and provides functions such as data storage, data integrity verification, data multi-version management, and data acquisition through data version management.
- the model management module uses the same tools (DVC tool + git + aws S3) to version the models trained by data scientists or AI algorithm experts, and stores the models in the model storage module connected to the model management module.
- the model management module provides partial or complete model management functions in combination with the life cycle of the machine learning model, mainly including functions such as model storage, model version management, and model verification.
- the process management module is used to construct the research and development process based on the application scenarios, so as to obtain the processes applicable to each application scenario, wherein the process management module is connected with the data set management module and the model management module respectively.
- for long R&D cycles, the process management module constructs the relationships between the models and data used at different stages of the complete R&D process, with functions including process construction, data set addition and maintenance, model addition and maintenance, and process result management.
- the relationship management module is used, once file version management is complete, to associate the versioned files through a graph database.
- here, the arangodb provided by aws S3 is used for relationship management.
- the relationship management module is based on the process management module, using a graph database or other methods to manage the relationship between processes and data sets, processes and models, and data sets and models.
- the relationship management module is connected to the data set management module, the process management module and the model management module respectively; it is used to construct the association relationships between the process and the data set, the process and the model, and the data set and the model, and to store these association relationships in an association relationship storage module connected to the relationship management module.
- the present application provides a machine learning model management device, including a screening module 10 and an association module 20 .
- the screening module 10 is configured to determine a data set and a model corresponding to configuration information of process nodes.
- the process node is any process node in the process
- the configuration information includes execution conditions pointing to the data set and the model.
- a whole process includes several process nodes, and each process node has different configuration requirements based on different application scenarios.
- however, these models differ considerably from one another: the training processes and output results differ, and so do the output cycles. Within a whole process there are also cases where many models use the same data set.
- the appropriate data set is the key to the effectiveness of the model; the effective execution of process nodes requires the selection of data sets and models that match the process nodes.
- the configuration information of the process node is used to determine which data sets and models need to be used when the process node is executed.
- the association module 20 is configured to establish an association relationship between the process node, the data set, and the model.
- association relationship can be used to indicate the data set and the model that are determined to be invoked when the process node is executed.
- the data set, model, and process are taken as the main body of the relationship, as the point in the DAG, and the corresponding versions of the data set, model, and process are used as the connection, and the DAG is constructed as the edge in the DAG.
- the subject of the relationship is a point, and the corresponding version is an edge, and a record in the graph is established.
- G represents the entire undirected graph
- ds_mol_20210421 represents a data set stored in a certain system
- wf_mol_property_0721 represents a stored model
- G(ds_mol_20210421, wf_mol_property_0721) represents a set of relations
- establishing the association relationship between the process node, the data set, and the model includes using a scheduling tool to create and change a process, and to establish model relationships on different nodes in the process.
- the scheduling tool can be selected from luigi (a Python-based tool that helps build complex streaming and batch task management systems), airflow (a scheduling tool written in Python and open-sourced by Airbnb), or celery (a distributed asynchronous message task queue developed in Python, with which asynchronous task processing can be implemented easily).
- the management device of the machine learning model further includes:
- the acquisition module 01 is configured to acquire a data set and store the data set in a data set library.
- the acquired data sets are all structured data, which can directly use mature file version management software.
- the DVC (data version control) tool and git (an open-source distributed version control system) are usually used together to manage machine learning experiment data, that is, data sets.
- the training data/test data used when each model is trained may change, so when reproducing experimental results it is important to use not only the same code and config (configuration) but also the same data.
- DVC tools can easily store data on many storage systems, such as local disks, SSH (Secure Shell, a network protocol) servers, or cloud systems such as aws S3 (Amazon Web Services Simple Storage Service) and GCP (Google Cloud Platform).
- the data managed by the DVC tool can be easily shared with other users using this storage system.
- DVC tools and git can realize file version management based on local/remote data warehouses. Meta information is then added to the versioned data set and stored in a structured database for easy querying when needed. The meta information records the basic information of the data set, including creation time, creator and size, as well as some manual labels, such as the number of features and the total number of entries; from this meta information, the data set required for model training or process node execution can be matched quickly. Since the DVC tool supports both local file system storage and remote file object storage, it can meet data management needs in different scenarios. If the data set must be encrypted, the files can also be encrypted before storage, for example with the AES algorithm.
- the data sets in the data set library can be managed including version management, data compression and storage, data extraction, integrity verification, and caching.
- the acquisition module 01 is further configured to acquire a model, and store the model in a model library.
- the model carries a model description for determining a usage scenario of the model.
- the model description refers to a textual description of the model to be trained, and is used to determine the usage scenario of the model.
- model serialization refers to converting, by software engineering means, a model object that exists in memory into a data stream or file object that can be kept in persistent storage. The specific implementation of model serialization depends on the form in which the model object exists.
- take a model trained with the Python machine learning framework sklearn as an example: with Python's built-in pickle module, the model can be stored as a pickle object and saved to a file. The file can be transferred to other hardware with the same environment, deserialized into a model object by the pickle module, and used in the new environment. Verifying the training effect of model training is done with a CI tool, such as the jenkins continuous integration tool.
- DVC is combined with git to version data, models and code: installation is simple (pip install dvc), use is convenient (dvc push; dvc pull), and it is fast.
- after dvc add, a new file is generated; for example, dvc add data.sql generates data.sql.dvc (KB-sized).
- git uploads the file data.sql.dvc.
- DVC can pull the corresponding file (such as a data file saved at the remote end) according to the .dvc file. To obtain a specific version of the data, model and code, it is enough to git checkout the version number and then run dvc pull.
- the machine learning models used in each process node and each stage of a whole process are connected in series through data sets and comprehensively managed to improve the efficiency of data use, optimize model performance, and discover potential problems in a timely manner.
- the management device of the machine learning model further includes:
- the preprocessing module 02 is configured to perform versioning processing on the datasets in the dataset repository and the models in the model repository through the DVC tool.
- the preprocessing module 02 is also used to establish the association relationship between a single data set and all relevant models for the versioned data sets and models.
- the datasets stored in the dataset library can be used for training the models in the model library, and for calling when process nodes are executed. Since there may be many models using the same data set in the whole process, it is necessary to establish the association relationship between a single data set and all related models in advance to improve the use efficiency of the data set.
- the management device of the machine learning model further includes:
- An updating module 30, configured to update all models associated with the data set when the data set is updated.
- the updating module 30 is further configured to update the associated relationship of the updated data set and/or any updated model according to the updated data set and all updated models.
- for the training, verification and iteration of multiple models on a shared data set, a process based on the application scenario is built; through the process, the models required by different application scenarios and at different stages are integrated into one DAG, and the data set update is used as the trigger condition to iteratively update the model itself and the process-data set-model relationship in which the model participates.
- the update of the data set is used as a trigger condition to calculate the overall effect of the whole process.
- the impact of the data set change on each stage can be fed back and analysed, so as to provide directional guidance for stage optimization.
- the updating module 30 includes:
- a training module 31 configured to use the updated data set as input to train all models associated with the data set.
- the judging module 32 is used to judge whether the training effect of the currently trained model satisfies the first preset condition; if it is satisfied, the currently trained model is prompted to be updated, otherwise the operation of updating the currently trained model is not performed.
- the criterion for whether a model is adequately trained, that is, the first preset condition, is set according to the actual application scenario and training purpose of each model. If the training effect of the currently trained model does not meet the first preset condition, the current model update is terminated, and an update report is output to indicate the reason for terminating it.
- the update module further includes:
- the execution module 33 is configured to invoke the updated data set and all updated models, and execute all process nodes associated with the updated data set and/or any updated model.
- when a data set is updated, the association relationships it participates in also need to be updated synchronously. Similarly, when all associated models are updated because of the data set update, the association relationships of the updated models also need to be updated. It can be understood that, among the association relationships (process-data set-model) that need to be updated, in some both the data set and the model are updated, and in others only the model is updated.
- the judging module 32 is also used to judge whether the overall effect of the process satisfies the second preset condition after all the process nodes are executed; if it is satisfied, it prompts the update of the association relationships of the updated data set and/or any updated model; otherwise, the operation of updating those association relationships is not performed.
- the overall effect of the process is evaluated as a final value obtained from the effect of each node in the DAG object combined with empirical weights.
- for molecular property prediction, the target can be preset as the prediction accuracy; each step (step1-step6) is then assigned a proportion of the target, such as 10%, 20%, 20%, 25%, 15%, 10%, so that an overall accuracy based on the current training set can be obtained from the training results and the validation set.
- setting these proportions requires some experience at the beginning, but as data accumulates and the model effect improves, the proportions can be adjusted according to the final effect to achieve the desired overall effect.
- version management of data sets, models and processes is carried out with DVC tools. By connecting data sets and models to build a process based on the application scenario, the sensitivity of the entire process to data changes is improved, effective data sets can be managed in a unified way, and the final quality change brought about by a change at each node can be traced.
- One embodiment of the present application provides a computer device, including: a processor, a memory, and a computer program stored on the memory; the processor is coupled to the memory, and, when working, the processor executes the computer program to implement the management method of the machine learning model as described above.
- An embodiment of the present application provides a computer-readable storage medium; the computer storage medium stores computer instructions which, when executed by a computer, cause the computer to execute the above-mentioned machine learning model management method.
- all or part may be implemented by software, hardware, firmware or any combination thereof.
- when implemented using software, it may be implemented in whole or in part in the form of a computer program product.
- the computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
- the computer can be a general purpose computer, a special purpose computer, a computer network or other programmable devices.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (e.g. coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g. infrared, radio, microwave) means.
- the computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device such as a server or a data center integrated with one or more available media.
- the available medium may be a magnetic medium (for example: floppy disk, hard disk, magnetic tape), an optical medium (for example: Digital Versatile Disc (DVD)) or a semiconductor medium (for example: Solid State Disk (SSD)), etc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
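The present invention relates to the technical field of artificial intelligence, and in particular to a management method and device for machine learning models, a computer device, and a storage medium. The method includes: determining a data set and a model corresponding to configuration information of a process node, where the process node is any process node in a process and the configuration information includes execution conditions pointing to the data set and the model; and establishing an association relationship between the process node and the data set and the model, where the association relationship can be used to indicate the data set and the model that the process node is determined to invoke when it is executed. By matching, across the whole process, the data sets and models that each process node needs at execution time, and by establishing association relationships between each process node and its data sets and models to form a closed loop, the invention ties data production closely to data use in an environment where multiple models share data sets, improves the efficiency and effect of data use, and thereby improves the management efficiency of long-cycle processes.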
Description
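The present invention relates to the technical field of artificial intelligence, and in particular to a management method and device for machine learning models, a computer device, and a storage medium.
Machine learning models are highly adaptable, and a problem domain and a corresponding solution can be found for them in many settings. For specific problems in long-cycle research fields, however, they are still hard to offer as a complete solution.
On this basis, in studying and practising the prior art, the inventors of the present application found that research on machine learning generally concentrates on optimizing the model itself. Although there is engineering practice in model management, there has been little research or practice on training, validating and iterating multiple models on shared data sets. Taking intelligent warehousing as an example, machine learning can, through modelling, solve the placement problem for high-demand SKUs (Stock Keeping Unit, the unit of inventory, i.e. a set of sales attributes of a product) and can forecast the future warehousing demand of long-tail items, but it cannot plan the storage cycle, placement and so on of all goods as a whole.
As another example, in industries such as pharmaceuticals, materials and chemicals, machine learning models have been applied in many different directions. Research in these fields has long cycles and broad scope, so models end up being used in multiple scenarios. Because these models differ considerably from one another, with different training processes, output results and output cycles, machine learning model management under the prior art tends to disconnect data production from data use and lowers the efficiency and effect of data use.
[Summary of the Invention]
On this basis, to address the prior-art problem that, in long-cycle research fields, data production is disconnected from data use and the efficiency and effect of data use are reduced, it is necessary to provide a management method and device for machine learning models, a computer device, and a storage medium.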
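In a first aspect, an embodiment of the present application provides a management method for machine learning models, including:
determining a data set and a model corresponding to configuration information of a process node, where the process node is any process node in a process and the configuration information includes execution conditions pointing to the data set and the model;
establishing an association relationship between the process node and the data set and the model, where the association relationship can be used to indicate the data set and the model that the process node is determined to invoke when it is executed.
Optionally, before determining the data set and the model corresponding to the configuration information of the process node, the method further includes:
acquiring a data set;
storing the data set in a data set library.
Optionally, before determining the data set and the model corresponding to the configuration information of the process node, the method further includes:
acquiring a model;
storing the model in a model library, where
the model carries a model description used to determine the usage scenario of the model.
Optionally, after storing the data set in the data set library and storing the model in the model library, the method further includes:
versioning the data sets in the data set library and the models in the model library with the DVC tool;
for the versioned data sets and models, establishing an association relationship between a single data set and all related models.
Optionally, after establishing the association relationship between the process node and the data set and the model, the method further includes:
when the data set is updated, updating all models associated with the data set;
according to the updated data set and all updated models, updating the association relationships in which the updated data set and/or any updated model participates.
Optionally, updating all models associated with the data set when the data set is updated includes:
training all models associated with the data set with the updated data set as input;
judging whether the training effect of the currently trained model satisfies a first preset condition;
if so, updating the currently trained model; otherwise, not performing the operation of updating the currently trained model.
Optionally, updating, according to the updated data set and all updated models, the association relationships in which the updated data set and/or any updated model participates includes:
invoking the updated data set and all updated models, and executing all process nodes associated with the updated data set and/or any updated model;
judging whether, after all the process nodes have been executed, the overall effect of the process satisfies a second preset condition;
if so, updating the association relationships in which the updated data set and/or any updated model participates; otherwise, not performing the operation of updating those association relationships.
Optionally, the overall effect of the process is expressed as a weighted sum of the execution results of the process nodes.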
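In a second aspect, an embodiment of the present application provides a management device for machine learning models, including:
a screening module, configured to determine a data set and a model corresponding to configuration information of a process node, where the process node is any process node in a process and the configuration information includes execution conditions pointing to the data set and the model;
an association module, configured to establish an association relationship between the process node and the data set and the model, where the association relationship can be used to indicate the data set and the model that the process node is determined to invoke when it is executed.
Optionally, the management device for machine learning models further includes:
an acquisition module, configured to acquire a data set and store the data set in a data set library.
Optionally, the acquisition module is further configured to acquire a model and store the model in a model library, where the model carries a model description used to determine the usage scenario of the model.
Optionally, the management device for machine learning models further includes:
a preprocessing module, configured to version the data sets in the data set library and the models in the model library with the DVC tool;
the preprocessing module is further configured to establish, for the versioned data sets and models, an association relationship between a single data set and all related models.
Optionally, the management device for machine learning models further includes:
an updating module, configured to update all models associated with the data set when the data set is updated;
the updating module is further configured to update, according to the updated data set and all updated models, the association relationships in which the updated data set and/or any updated model participates.
Optionally, the updating module includes:
a training module, configured to train all models associated with the data set with the updated data set as input;
a judging module, configured to judge whether the training effect of the currently trained model satisfies a first preset condition; if so, it prompts the update of the currently trained model; otherwise, the operation of updating the currently trained model is not performed.
Optionally, the updating module further includes:
an execution module, configured to invoke the updated data set and all updated models, and to execute all process nodes associated with the updated data set and/or any updated model;
the judging module is further configured to judge whether, after all the process nodes have been executed, the overall effect of the process satisfies a second preset condition; if so, it prompts the update of the association relationships in which the updated data set and/or any updated model participates; otherwise, the operation of updating those association relationships is not performed.
In a third aspect, an embodiment of the present application provides a computer device, including: a processor, a memory, and a computer program stored on the memory; the processor is coupled to the memory and, when working, executes the computer program to implement the management method for machine learning models described above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium; the computer storage medium stores computer instructions which, when executed by a computer, cause the computer to execute the management method for machine learning models described above.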
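One of the above technical solutions has the following advantages and beneficial effects:
In the embodiments of the present application, the data sets and models that each process node needs at execution time are matched across the whole process, and association relationships between each process node and its data sets and models are established to form a closed loop. In an environment where multiple models share data sets, this ties data production closely to data use, improves the efficiency and effect of data use, and thereby improves the management efficiency of long-cycle processes.
The present application describes the embodiments with reference to the accompanying drawings. The drawings are used only to describe the embodiments, for the purpose of illustration. Without departing from the principles of the present application, those skilled in the art can readily derive other embodiments from the steps described below.
FIG. 1 is a schematic flowchart of the machine learning model management method in an embodiment of the present application.
FIG. 2 is a schematic flowchart of the machine learning model management method in an embodiment of the present application.
FIG. 3 is a schematic flowchart of the machine learning model management method in an embodiment of the present application.
FIG. 4 is a schematic flowchart of the machine learning model management method in an embodiment of the present application.
FIG. 5 is a schematic flowchart of the machine learning model management method in an embodiment of the present application.
FIG. 6 is a schematic flowchart of the machine learning model management system applied to molecular property prediction in an embodiment of the present application.
FIG. 7 is a schematic diagram of the system framework corresponding to the machine learning model management method in an embodiment of the present application.
FIG. 8 is a schematic structural diagram of the machine learning model management device in an embodiment of the present application.
FIG. 9 is a schematic structural diagram of the machine learning model management method in an embodiment of the present application.
FIG. 10 is a schematic structural diagram of the machine learning model management method in an embodiment of the present application.
FIG. 11 is a schematic structural diagram of the machine learning model management method in an embodiment of the present application.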
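The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in those embodiments. It should be understood that the specific embodiments described here only explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second" and so on in the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
A reference to an "embodiment" in this document means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase at various places in the specification does not necessarily refer to the same embodiment each time, nor to an independent or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art understand, both explicitly and implicitly, that the embodiments described in this document may be combined with other embodiments.
Some of the technical terms used in this document are explained below for the benefit of those skilled in the art.
1. Machine learning: a branch of artificial intelligence. The history of artificial intelligence research follows a natural and clear line from a focus on "reasoning", to a focus on "knowledge", to a focus on "learning". Machine learning is one way of realizing artificial intelligence, that is, of solving problems in artificial intelligence by means of machine learning.
2. Graph database: a database that uses graph structures for semantic queries, using nodes, edges and attributes to represent and store data. The key concept of such a system is the graph, which directly relates stored data items to a collection of data nodes and of edges representing the relationships between nodes.
3. Data version management: a software engineering technique that ensures that the same program file edited by different people is kept synchronized during software development.
4. Data warehouse: a central repository of integrated data from one or more different sources, storing current and historical data together.
5. Model: a file that, after training, can recognize specific types of patterns.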
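Because machine learning models are highly adaptable and a problem domain and corresponding solution can be found for them in many settings, they have been applied in many different directions in fields with long research cycles, such as pharmaceuticals, materials and chemicals. The preparation of new drugs and new materials in particular makes heavy use of deep learning models. However, research in these fields has long cycles and broad scope. New drug research, for example, needs to study both the interaction between drug molecules and receptors and the physical properties of the molecules themselves, such as solubility and toxicity, so models end up being used in multiple scenarios. Models used in multiple scenarios differ considerably from one another, with different training processes, output results and output cycles, and the prior art lacks a solution for training, validating and iterating multiple models on shared data sets. As a result, even when many models in a complete process use the same data set, data production is disconnected from data use, data is used inefficiently, and the effect is not evident.
On this basis, in long-cycle research fields there are multi-dimensional models that depend on common data sets; each of these models, whether scenario-based or process-based, solves a particular problem. In the management method and device for machine learning models, the computer device and the storage medium provided by the embodiments of the present application, the management method matches, across the whole process, the data sets and models that each process node needs at execution time, and establishes association relationships between each process node and its data sets and models to form a closed loop. In an environment where multiple models share data sets, this ties data production closely to data use, improves the efficiency and effect of data use, and thereby improves the management efficiency of long-cycle processes. Further, the management method can use a data set update as the trigger condition: when a data set or a model is updated, it triggers an update of the association relationships among data sets, models and processes, and of the computed results, providing a basis for model training, process effect evaluation or data quality assessment. The association relationships may be held, for example, in a DAG (Directed Acyclic Graph), a data structure commonly used in computing whose topology makes it well suited to algorithmic scenarios such as dynamic programming, shortest-path search in navigation and data compression.
To aid understanding of the technical solutions provided by the embodiments of the present application, some embodiments of the present application are described in detail below.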
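First aspect.
As shown in FIG. 1, an embodiment of the present application provides a management method for machine learning models, including step S100 and step S200.
Step S100: determine a data set and a model corresponding to configuration information of a process node.
Here, the process node is any process node in a process, and the configuration information includes execution conditions pointing to the data set and the model.
In one embodiment, a whole process (the complete process, or the process as a whole) includes several process nodes, and each process node has different configuration requirements depending on its application scenario. For example, new drug research needs to study both the interaction between drug molecules and receptors and the physical properties of the molecules themselves, such as solubility and toxicity, so machine learning models are used in multiple different application scenarios. However, these models differ considerably from one another, with different training processes, output results and output cycles. Within a whole process there are also cases where many models use the same data set. For long-cycle research fields, therefore, a suitable data set is the key to an effective model, and the effective execution of a process node requires selecting the data set and model that match it. The configuration information of a process node is used to determine which data sets and which models are needed when the process node is executed.
Step S200: establish an association relationship between the process node and the data set and the model.
Here, the association relationship can be used to indicate the data set and the model that the process node is determined to invoke when it is executed.
In one embodiment, in order to link each process node with its data sets and models into a closed loop, and to tie data production closely to data use and improve the efficiency and effect of data use in an environment where multiple models share data sets, a DAG is constructed with the data sets, models and processes as the subjects of the relationships, serving as the vertices of the DAG, and with the respective versions of the data sets, models and processes as the connections, serving as the edges of the DAG.
In this way, the models can be chained into the DAG through the process nodes of the whole process.
For example, with the relationship subjects as vertices and their respective versions as edges, one record in the graph is established. A fully built workflow contains one or more undirected graphs (Graph), each with multiple node and edge relationships. Let G denote the entire undirected graph, ds_mol_20210421 a data set stored in some system, and wf_mol_property_0721 a stored model. Then G(ds_mol_20210421,wf_mol_property_0721) represents one relationship, and the corresponding versions are stored on it; for example, G(ds_mol_20210421,wf_mol_property_0721)=(3.12_beta,2.41_alpha) indicates that the versions of data set ds_mol_20210421 and model wf_mol_property_0721 when the connection was established were 3.12_beta and 2.41_alpha respectively.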
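A minimal sketch of the relation record G(ds_mol_20210421,wf_mol_property_0721)=(3.12_beta,2.41_alpha) may help here. It uses a plain in-memory dictionary as a stand-in for the relation store; the `Relation` alias and the `link` helper are hypothetical names introduced for illustration and are not part of the described system.

```python
from typing import Dict, Tuple

# Hypothetical in-memory stand-in for the relation store: each key is a
# (data set, model) pair and each value is the pair of versions recorded
# when the two were connected.
Relation = Dict[Tuple[str, str], Tuple[str, str]]


def link(graph: Relation, dataset: str, model: str,
         dataset_version: str, model_version: str) -> None:
    """Record the versions of a data set and a model at connection time."""
    graph[(dataset, model)] = (dataset_version, model_version)


G: Relation = {}
link(G, "ds_mol_20210421", "wf_mol_property_0721", "3.12_beta", "2.41_alpha")
print(G[("ds_mol_20210421", "wf_mol_property_0721")])
```

Keeping the versions on the relation rather than on the vertices means the same data set and model can be linked again later under newer versions without losing the earlier record, provided the key is extended or the old value is archived first.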
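It should be noted that establishing the association relationship between the process node and the data set and the model includes using a scheduling tool to create and change processes and to establish the model relationships on the different nodes of a process. The scheduling tool may be selected from luigi (a Python-based tool that helps build complex streaming and batch task management systems), airflow (a scheduling tool written in Python and open-sourced by Airbnb), or celery (a distributed asynchronous message task queue developed in Python, with which asynchronous task processing can be implemented easily).
As shown in FIG. 2, in one embodiment, before step S100, the management method for machine learning models further includes step S111 and step S112.
Step S111: acquire a data set.
Step S112: store the data set in a data set library.
The acquired data sets are all structured data, so mature file version management software can be used directly.
For example, the DVC (data version control) tool and git (an open-source distributed version control system) are usually used together to manage machine learning experiment data, that is, data sets. The training data/test data used when each model is trained may change, so when reproducing experimental results it is important to use not only the same code and config (configuration) but also the same data. The DVC tool can easily store data on many storage systems, such as local disks, SSH (Secure Shell, a network protocol) servers, or cloud systems such as aws S3 (Amazon Web Services Simple Storage Service) and GCP (Google Cloud Platform). Data managed by the DVC tool can easily be shared with other users of the same storage system.
With the DVC tool and git, file version management based on local/remote data warehouses can be realized. Meta information is then added to the versioned data set and stored in a structured database for easy querying when needed. The meta information records the basic information of the data set, including creation time, creator and size, as well as some manual labels, such as the number of features and the total number of entries; from this meta information, the data set required for model training or process node execution can be matched quickly. Since the DVC tool supports both local file system storage and remote file object storage, it can meet data management needs in different scenarios. If the data set must be encrypted, the files can also be encrypted before storage, for example with the AES algorithm.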
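As a rough illustration of matching a data set by its meta information, the sketch below keeps each record as a plain dictionary and filters on the manual labels. The field names (`features`, `entries`), the sample records and the `match` helper are assumptions made for this example only; the text does not specify a schema.

```python
from typing import Dict, List

# Hypothetical meta records: basic information plus manual labels.
catalog: List[Dict] = [
    {"name": "ds_small", "version": "v1", "creator": "alice",
     "size_bytes": 10_000, "labels": {"features": 8, "entries": 500}},
    {"name": "ds_large", "version": "v3", "creator": "bob",
     "size_bytes": 900_000, "labels": {"features": 32, "entries": 40_000}},
]


def match(records: List[Dict], **minimums: int) -> List[Dict]:
    """Return the records whose labels meet every requested minimum."""
    return [
        record for record in records
        if all(record["labels"].get(key, 0) >= value
               for key, value in minimums.items())
    ]


# Pick data sets with at least 16 features and 1,000 entries.
selected = match(catalog, features=16, entries=1_000)
print([record["name"] for record in selected])
```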
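In this way, the data sets in the data set library can be managed in terms of version management, data compression and storage, data extraction, integrity verification, caching and so on.
As shown in FIG. 2, in one embodiment, before step S100, the management method for machine learning models further includes step S121 and step S122.
Step S121: acquire a model.
Step S122: store the model in a model library.
Here, the model carries a model description, which is a textual description of the model to be trained and is used to determine the usage scenario of the model.
Across different application scenarios, data sets may be shared, and models may likewise be reused. To facilitate later optimization and iteration of the model itself and of the process-data set-model association relationships in which it participates, the model needs to be version-managed (for example through a data warehouse + the DVC tool + version meta information). During version management the model is serialized. Model serialization means converting, by software engineering means, a model object that exists in memory into a data stream or file object that can be kept in persistent storage. The specific implementation depends on the form in which the model object exists. Take a model trained with the Python machine learning framework sklearn as an example: with Python's built-in pickle module, the model can be stored as a pickle object and saved to a file; the file can be transferred to other hardware with the same environment, deserialized into a model object by the pickle module, and used in the new environment. Verifying the training effect of model training, in turn, is done with a CI tool, such as the jenkins continuous integration tool.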
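The pickle round trip described above can be sketched as follows. To keep the example free of third-party dependencies, a plain dictionary of made-up parameters stands in for the trained model; an sklearn estimator would be passed to `pickle.dump` and recovered with `pickle.load` in the same way. The `predict` helper and the parameter values are illustrative assumptions.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model object (hypothetical parameters).
model = {"kind": "linear", "coef": [0.5, -1.0], "intercept": 0.25}


def predict(m, features):
    """Evaluate the stand-in linear model on one feature vector."""
    return sum(c * x for c, x in zip(m["coef"], features)) + m["intercept"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "model.pkl")
    # Serialize: in-memory object -> file in persistent storage.
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    # Deserialize: file -> in-memory object, as on another machine
    # that has the same environment.
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(predict(restored, [2.0, 1.0]))
```

Because unpickling can execute arbitrary code, model files should only be loaded from trusted storage, which is one more reason to keep them under version control with recorded checksums.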
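Based on the above embodiment, DVC is combined with git to version data, models and code: installation is simple (pip install dvc); use is convenient (dvc push; dvc pull); and it is fast. After dvc add, a new file is generated; for example, dvc add data.sql generates data.sql.dvc (KB-sized). git uploads the data.sql.dvc file, and DVC can pull the corresponding file (for example a data file stored at the remote end) according to the .dvc file. To obtain a specific version of the data, model and code, it is enough to git checkout the version number and then run dvc pull.
The machine learning models used at each process node and each stage of a whole process are chained together through the data sets and managed as a whole, in order to improve the efficiency of data use, optimize model performance, and discover potential problems in time.
As shown in FIG. 2, in one embodiment, after step S112 and step S122, the management method for machine learning models further includes step S131 and step S132.
Step S131: version the data sets in the data set library and the models in the model library with the DVC tool.
Step S132: for the versioned data sets and models, establish an association relationship between a single data set and all related models.
The data sets stored in the data set library can be used to train the models in the model library and can be invoked when process nodes are executed. Because many models in the whole process may use the same data set, the association relationship between a single data set and all related models needs to be established in advance to improve the efficiency with which the data set is used.
As shown in FIG. 3, in one embodiment, after step S200, the management method for machine learning models further includes step S300 and step S400.
Step S300: when the data set is updated, update all models associated with the data set.
Step S400: according to the updated data set and all updated models, update the association relationships in which the updated data set and/or any updated model participates.
Given that the prior art lacks engineering practice in model version management and iteration, in one embodiment, for the training, validation and iteration of multiple models on a shared data set, a process based on the application scenario is built. Through the process, the models required by different application scenarios and at different stages are integrated into one DAG, and a data set update is used as the trigger condition for iteratively updating the model itself and the process-data set-model association relationships in which the model participates. Specifically, with the data set update as the trigger condition, the overall effect of the whole process is calculated; from the changes in the results of the node tasks in the DAG, the impact of the data set change on each stage can be fed back and analysed, providing directional guidance for stage optimization.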
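Please refer to FIG. 4.
As shown in FIG. 4, in one embodiment, step S300 includes step S310, step S320, step S321 and step S322.
Step S310: train all models associated with the data set with the updated data set as input.
It can be understood that when a data set is updated, the multiple models mapped to it also need to be updated synchronously through training and iteration.
Step S320: judge whether the training effect of the currently trained model satisfies a first preset condition.
In one embodiment, the criterion for whether a model is adequately trained, that is, the first preset condition, is set according to the actual application scenario and training purpose of each model.
Step S321: if it is satisfied, update the currently trained model.
Step S322: if it is not satisfied, do not perform the operation of updating the currently trained model.
In one embodiment, if the training effect of the currently trained model does not satisfy the first preset condition, the current model update is terminated, and an update report is output indicating the reason for terminating it.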
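Steps S310 to S322 can be sketched as a small gate around retraining. In this sketch the first preset condition is assumed to be a simple minimum score, each model is represented only by its current score, and `retrain` is a hypothetical callback standing in for the real training job; none of these choices is prescribed by the text.

```python
from typing import Callable, Dict, List


def update_models(models: Dict[str, float],
                  retrain: Callable[[str], float],
                  threshold: float) -> List[str]:
    """Retrain every model tied to an updated data set (S310) and accept the
    new version only if its score meets the assumed first preset condition,
    score >= threshold (S320-S322). Returns one report line per model."""
    report = []
    for name in models:
        score = retrain(name)
        if score >= threshold:
            models[name] = score  # S321: update the model
            report.append(f"{name}: updated (score={score:.2f})")
        else:
            # S322: keep the old model and state why the update stopped.
            report.append(
                f"{name}: update terminated, score {score:.2f} "
                f"below threshold {threshold:.2f}"
            )
    return report


# Hypothetical scores produced by retraining on the updated data set.
new_scores = {"pka_model": 0.91, "logp_model": 0.60}
current = {"pka_model": 0.80, "logp_model": 0.70}
for line in update_models(current, lambda name: new_scores[name], threshold=0.75):
    print(line)
```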
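As shown in FIGS. 4-5, in one embodiment, step S400 includes step S410, step S420, step S421 and step S422.
Step S410: invoke the updated data set and all updated models, and execute all process nodes associated with the updated data set and/or any updated model.
In one embodiment, when a data set is updated, the association relationships in which it participates also need to be updated synchronously. Likewise, when the update of a data set causes all of its associated models to be updated, the association relationships in which the updated models participate also need to be updated. It can be understood that, among the association relationships (process-data set-model) that need to be updated, in some both the data set and the model have been updated, and in others only the model has been updated.
Step S420: judge whether, after all the process nodes have been executed, the overall effect of the process satisfies a second preset condition.
Here, the overall effect of the process is expressed as a weighted sum of the execution results of the process nodes.
For example, in the process of drug small-molecule property research shown in FIG. 6, the overall effect of the process is evaluated as a final value obtained from the effect of each node in the DAG object combined with empirical weights. For molecular property prediction, the target can be preset as the prediction accuracy; each step (step1-step6) is then assigned a proportion of the target, such as 10%, 20%, 20%, 25%, 15%, 10%, so that an overall accuracy based on the current training set can be obtained from the training results and the validation set. Setting these proportions requires some experience at the beginning, but as data accumulates and the model effect improves, the proportions can be adjusted according to the final effect to achieve the desired overall effect.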
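The weighted evaluation can be written out directly. The weights below are the example proportions from the text (10%, 20%, 20%, 25%, 15%, 10%); the per-step accuracies are made-up numbers used only to show the arithmetic, and `overall_effect` is a hypothetical helper name.

```python
import math
from typing import Sequence


def overall_effect(weights: Sequence[float], effects: Sequence[float]) -> float:
    """Weighted sum of the per-node effects; weights must sum to 1."""
    if len(weights) != len(effects):
        raise ValueError("one weight is required per process node")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * e for w, e in zip(weights, effects))


# Example proportions for step1-step6, taken from the text.
weights = [0.10, 0.20, 0.20, 0.25, 0.15, 0.10]
# Hypothetical validation accuracies of the six steps.
accuracies = [0.90, 0.80, 0.85, 0.70, 0.75, 0.95]

print(round(overall_effect(weights, accuracies), 4))
```

With these sample numbers the terms are 0.09, 0.16, 0.17, 0.175, 0.1125 and 0.095, which sum to an overall accuracy of 0.8025. Comparing that value against the second preset condition is what decides whether the association relationships are updated.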
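Step S421: if it is satisfied, update the association relationships in which the updated data set and/or any updated model participates.
Step S422: if it is not satisfied, do not perform the operation of updating the association relationships in which the updated data set and/or any updated model participates.
Version management of data sets, models and processes is carried out with the DVC tool. By connecting data sets and models to build a process based on the application scenario, the sensitivity of the process as a whole to data changes is improved, effective data sets can be managed in a unified way, and the final quality change brought about by a change at each node can be traced.
To help those skilled in the art further understand the technical solutions provided by the embodiments of the present application, the process of drug small-molecule property research is described in detail below as a specific embodiment of the present application.
Drug development is a long-cycle R&D process involving a considerable number of small steps. A property prediction process is established for the key drug-like properties that determine the success rate of small-molecule drug development, including solubility, potency, selectivity, bioavailability, permeability and toxicity. These properties cannot be ranked simply as better or worse and have to be weighed together. The data sets used for model training come mainly from experimentally accumulated property measurements of various small-molecule drugs.
First, the data files are managed with the DVC tool + git + aws S3. Once aws S3 access permissions for the DVC tool have been configured locally, a local data file can be uploaded to aws S3 through the DVC command line, yielding a definite version number (the version number is generated by the DVC tool and obtained through the dvc push operation). The default name takes the form name+md5, here mol_property:87h5d41998yniinkfj348haj01ghp41urqp.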
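To illustrate the name+md5 naming form only, the following sketch derives a version label from a name and the MD5 digest of the file content using the standard `hashlib` module. The `version_id` helper is hypothetical, and this is not a reproduction of how the DVC tool computes or formats its own identifiers.

```python
import hashlib


def version_id(name: str, payload: bytes) -> str:
    """Build a 'name:md5' label from the content of a data or model file."""
    return f"{name}:{hashlib.md5(payload).hexdigest()}"


first = version_id("mol_property", b"smiles,solubility\nCCO,1.2\n")
second = version_id("mol_property", b"smiles,solubility\nCCO,1.3\n")
print(first)
print(first != second)  # any change in content yields a different label
```

The same digest can double as the integrity check mentioned earlier: recomputing it after download and comparing it with the stored label reveals a corrupted or tampered file.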
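The same tools (DVC tool + git + aws S3) are used for first-version management of the models trained by data scientists or AI algorithm experts. The model files are likewise uploaded through the DVC command line to a pre-configured aws S3 bucket, yielding a series of name+md5 encoded versions. The pre-configuration mainly concerns the DVC tool: the tool must be told which aws S3 address to use, and that address must be requested in advance and have its access permissions set up locally.
The version names obtained above could simply be recorded in a text file or a jupyter-notebook, but for later management and display they are stored uniformly in a database. Here the relational database PostgreSQL is chosen: a version table is created and each version name is stored in it as the primary key. For data sets and models that need additional description, the description is stored in meta-related fields, for example the size, fields, source, maintainer and creation date of a data set.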
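A version table of this kind can be sketched with the standard-library `sqlite3` module, used here purely as a stand-in for the relational database named in the text so that the example runs without a server. The table name, column names and sample row are assumptions; only the idea of using the version name as the primary key with extra meta columns comes from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE versions (
        version_name TEXT PRIMARY KEY,  -- the name+md5 label
        kind         TEXT NOT NULL,     -- 'dataset' or 'model'
        size_bytes   INTEGER,
        source       TEXT,
        maintainer   TEXT,
        created_on   TEXT
    )
    """
)

# Hypothetical row; the version label is the example given in the text.
label = "mol_property:87h5d41998yniinkfj348haj01ghp41urqp"
conn.execute(
    "INSERT INTO versions VALUES (?, ?, ?, ?, ?, ?)",
    (label, "dataset", 1_048_576, "assay results", "alice", "2021-04-21"),
)

row = conn.execute(
    "SELECT kind, maintainer FROM versions WHERE version_name = ?", (label,)
).fetchone()
print(row)
```

Because the version name is the primary key, inserting the same label twice is rejected by the database, which guards against registering one version under two sets of meta information.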
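Once file version management is complete, a graph database is needed to associate the versioned files; here, relationship management is performed with the arangodb provided by aws S3. First, a node is created for each stored version name, and the name and the required attributes are inserted into the graph database to obtain a series of node objects. For data management, in order to ensure data integrity, traceability and similar properties, the attributes generally include but are not limited to: total file size, total number of files, MD5 code, uploader, maintainer, upload time, data description and main features. Each object has an _id as its unique index (the _id is generated by arangodb and guarantees the global uniqueness of each node). This _id can be backfilled into the version table as one of the version management attributes. After the nodes have been built, the node objects obtained are used to construct the edge relationships, that is, the node objects are connected through graph management. The connections here use the version numbers obtained during version management to establish the edges. Since the dependency between data set and model is one-way, the edges currently established represent the relationship between a single data set {m} and all related models {n}, denoted by the set E{m,n}. The set of edge relationships can be stored back in the version table, or left unstored. Given any node _id, all the edge relationships and nodes connected to that node can be queried.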
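The node and edge bookkeeping described here can be mimicked in a few lines of plain Python. This is an in-memory sketch and does not use the ArangoDB client API: the `VersionGraph` class, its method names and the format of the generated ids are all hypothetical, chosen only to show how E{m,n} and the "query everything connected to a node id" lookup fit together.

```python
import itertools
from collections import defaultdict
from typing import Dict, Set


class VersionGraph:
    """In-memory stand-in for the graph store (not the ArangoDB API)."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.nodes: Dict[str, dict] = {}
        self.edges: Dict[str, Set[str]] = defaultdict(set)  # data set -> models

    def add_node(self, name: str, **attributes) -> str:
        """Insert a node and return its generated unique id."""
        node_id = f"versions/{next(self._counter)}"
        self.nodes[node_id] = {"name": name, **attributes}
        return node_id

    def add_edge(self, dataset_id: str, model_id: str) -> None:
        """Record the one-way dependency of a model on a data set."""
        self.edges[dataset_id].add(model_id)

    def connected(self, node_id: str) -> Set[str]:
        """Return every node joined to node_id by an edge, in either direction."""
        outgoing = set(self.edges.get(node_id, ()))
        incoming = {src for src, dsts in self.edges.items() if node_id in dsts}
        return outgoing | incoming


graph = VersionGraph()
dataset = graph.add_node("mol_property", kind="dataset", uploader="alice")
pka = graph.add_node("pka_model", kind="model")
tox = graph.add_node("tox_model", kind="model")
graph.add_edge(dataset, pka)
graph.add_edge(dataset, tox)
print(sorted(graph.connected(dataset)))
```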
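Next, the R&D process is constructed. The process of drug small-molecule property research can be divided into the following steps: activity, physicochemical properties, metabolic properties, toxicological properties and druggability. In the course of researching these properties, groups of machine learning models with a certain predictive effect have gradually been formed for properties such as drug activity screening, dissociation (pKa) prediction, lipid-water partition coefficient prediction and drug toxicity screening.
The steps of the R&D process are then converted into scripts that the process executor can recognize and run. When airflow is used to build the process, a new process corresponds to one DAG, and each step is placed in an Operator (BashOperator is used throughout here). The whole DAG object thus contains six Operators: the first five are triggered and run independently, while the last, a comprehensive screening node, makes its prediction from the results of the first five nodes and has to wait until all of them have finished. This yields a DAG script file, which is likewise placed in the version repository through file version management and receives a version-related number.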
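The dependency shape of this six-node DAG, five independent steps feeding one final node, can be checked without a scheduler by using the standard-library `graphlib` module (Python 3.9 or later). This is not Airflow code, and the step names are assumptions derived from the properties listed in the text.

```python
from graphlib import TopologicalSorter

# Hypothetical task names for the five independent steps.
independent_steps = [
    "activity",
    "physicochemical",
    "metabolism",
    "toxicology",
    "druggability",
]

# Map each task to the set of tasks it must wait for.
dependencies = {step: set() for step in independent_steps}
dependencies["comprehensive_screening"] = set(independent_steps)

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# The first ready batch holds the tasks that can start immediately.
first_batch = set(sorter.get_ready())
sorter.done(*first_batch)
# Once all five are done, the final node becomes ready.
second_batch = set(sorter.get_ready())

print(sorted(first_batch))
print(sorted(second_batch))
```

In a real scheduler the same structure would be declared by making the final task downstream of the other five; the sketch only confirms that nothing but the screening node has to wait.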
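Once the process has been converted into a DAG script file, the relationships among the process, the data and the models are constructed. The relationship between the process and the models is already expressed in the writing of the DAG script file, mainly as the models that each Operator actually runs. Here the corresponding model versions are also written into the DAG script file as variables, while the data is resolved to its current version by querying the version table. This yields a new group of edge relationships, comprising the current version of the DAG, DAG(v), the specified version of the model, Model(v), and the current version of the data set, DataSet(v), which form the two edge sets {DAG(v),Model(v)} and {DAG(v),Dataset(v)}. On the basis of these two sets, the DAG update and the update of computed results that are triggered when a data set or a model is updated can be established.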
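One way to read the two edge sets is as a lookup for "which DAG versions must be rerun when this data set or model changes". The sketch below stores {DAG(v),Model(v)} and {DAG(v),Dataset(v)} as dictionaries of (name, version) pairs; the concrete names, versions and the `dags_using` helper are hypothetical.

```python
from typing import Dict, Set, Tuple

Versioned = Tuple[str, str]  # (name, version)

# {DAG(v), Model(v)}: models pinned in each DAG script version.
dag_models: Dict[Versioned, Set[Versioned]] = {
    ("dag_mol_property", "v1"): {("pka_model", "v2"), ("tox_model", "v1")},
    ("dag_materials", "v4"): {("band_gap_model", "v3")},
}

# {DAG(v), Dataset(v)}: data sets resolved for each DAG version.
dag_datasets: Dict[Versioned, Set[Versioned]] = {
    ("dag_mol_property", "v1"): {("ds_mol", "v3")},
    ("dag_materials", "v4"): {("ds_crystal", "v1")},
}


def dags_using(name: str, *edge_sets: Dict[Versioned, Set[Versioned]]) -> Set[Versioned]:
    """Return every DAG version linked to any version of the named item."""
    affected = set()
    for edges in edge_sets:
        for dag, members in edges.items():
            if any(member_name == name for member_name, _ in members):
                affected.add(dag)
    return affected


# A new version of ds_mol arrives: find the DAGs whose results are stale.
print(dags_using("ds_mol", dag_models, dag_datasets))
```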
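For the result of each DAG run, the relationship {Dataset,DAG,Model}->{result} can likewise be established. Through process management, this relationship records the change in results brought about by each data set change or model change, providing a basis for model training or data quality assessment.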
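The {Dataset,DAG,Model}->{result} relationship can be pictured as a mapping from a version triple to the result of that run, which makes the effect of a single change easy to read off. The version labels and accuracy values below are invented for illustration, and `record` and `change` are hypothetical helper names.

```python
from typing import Dict, Tuple

RunKey = Tuple[str, str, str]  # (data set version, DAG version, model version)

history: Dict[RunKey, float] = {}


def record(dataset_v: str, dag_v: str, model_v: str, result: float) -> None:
    """Store the overall result of one DAG run under its version triple."""
    history[(dataset_v, dag_v, model_v)] = result


def change(before: RunKey, after: RunKey) -> float:
    """Difference in result between two recorded runs."""
    return history[after] - history[before]


record("ds_v3", "dag_v1", "model_v2", 0.80)
record("ds_v4", "dag_v1", "model_v2", 0.84)  # only the data set changed

gain = change(("ds_v3", "dag_v1", "model_v2"), ("ds_v4", "dag_v1", "model_v2"))
print(round(gain, 2))
```

Because only the data set version differs between the two keys, the difference can be attributed to the data set update, which is the kind of evidence the text proposes to use for data quality assessment.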
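For example, in the process of drug small-molecule property research shown in FIG. 6, the overall effect of the process is evaluated as a final value obtained from the effect of each node in the DAG object combined with empirical weights. For molecular property prediction, the target can be preset as the prediction accuracy; each step (step1-step6) is then assigned a proportion of the target, such as 10%, 20%, 20%, 25%, 15%, 10%, so that an overall accuracy based on the current training set can be obtained from the training results and the validation set. Setting these proportions requires some experience at the beginning, but as data accumulates and the model effect improves, the proportions can be adjusted according to the final effect to achieve the desired overall effect.
In the embodiment of the drug small-molecule property research process above, version management of the data sets, models and process is carried out with the DVC tool. By connecting data sets and models to build a process based on the application scenario, data set updates feed back more quickly into the overall training process and overall quality. From the changes in the results of the node tasks in the DAG, the impact of data set changes on each stage can be fed back and analysed, providing directional guidance for stage optimization.
In the above embodiment, the management method for machine learning models provides a solution for version management of data set + model + process in combination with the DVC tool.
To help those skilled in the art further understand the technical solutions provided by the embodiments of the present application, please refer to FIG. 7.
For version management of data set + model + process, in one embodiment a system framework as shown in FIG. 7 is built, including: a process management module, a data management module, a model management module and a relationship management module, as well as a data set storage module, a model storage module and an association relationship storage module.
The data set management module is used to manage data files through the DVC tool + git + aws S3 and to store the data files in the data set storage module connected to the data set management module.
For example, the data set management module is used in the data management module and, through version management of the data, provides functions such as data storage on ingestion, data integrity verification, multi-version data management and data retrieval.
The model management module uses the same tools (DVC tool + git + aws S3) to version the models trained by data scientists or AI algorithm experts, and stores the models in the model storage module connected to the model management module.
For example, the model management module follows the life cycle of the machine learning model and provides partial or complete model management functions, mainly including model storage, model version management and model verification.
The process management module is used to construct R&D processes based on application scenarios, so as to obtain processes applicable to each application scenario; the process management module is connected to the data set management module and the model management module respectively.
For example, for long R&D cycles, the process management module constructs the relationships between the models and data used at different stages of the complete R&D process, with functions including process construction, data set addition and maintenance, model addition and maintenance, and process result management.
The relationship management module is used, once file version management is complete, to associate the versioned files through a graph database; here, relationship management is performed with the arangodb provided by aws S3.
For example, the relationship management module builds on the process management module and uses a graph database or other means to manage the association relationships between processes and data sets, processes and models, and data sets and models.
It can be understood that the relationship management module is connected to the data set management module, the process management module and the model management module respectively; it is used to construct the association relationships between the process and the data set, the process and the model, and the data set and the model, and to store these association relationships in the association relationship storage module connected to the relationship management module.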
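Second aspect.
As shown in FIG. 8, based on the same inventive concept, the present application provides a machine learning model management apparatus, comprising a screening module 10 and an association module 20.
The screening module 10 is configured to determine the dataset and model that correspond to the configuration information of a workflow node.
The workflow node is any workflow node in a workflow, and the configuration information includes execution conditions that point to the dataset and the model.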
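In one embodiment, a full workflow (also called a complete workflow or the workflow as a whole) comprises several workflow nodes, each with different configuration requirements depending on its application scenario. For example, new drug research must study both the interaction between drug molecules and receptors and the physical properties of the molecules themselves, such as solubility and toxicity, so machine learning models are used in many different application scenarios. These models differ considerably from one another: their training procedures and outputs differ, as do their output cycles. Within one full workflow, many models may also use the same dataset. For research fields with long cycles, a suitable dataset is therefore the key to an effective model, and executing a workflow node effectively requires selecting the dataset and model that match it. The configuration information of a workflow node determines which datasets and models are needed when the node is executed.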
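The association module 20 is configured to establish an association between the workflow node and the dataset and model.
The association can be used to indicate which dataset and model the workflow node calls when it is executed.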
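In one embodiment, to link workflow nodes, datasets and models into a closed loop, and to tie data production closely to data use when multiple models share datasets, thereby improving the efficiency and effect of data use, a DAG is built in which the datasets, models and workflows are the relationship subjects and serve as the vertices, and their respective versions serve as the connections, i.e. the edges.
In this way, the models can be chained into the DAG through the workflow nodes of the full workflow.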
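For example, with the relationship subjects as vertices and their versions as edges, one record is created in the graph. A completed workflow contains one or more undirected graphs (Graph), each with multiple vertices and edge relationships. Let G denote the whole undirected graph, let ds_mol_20210421 denote a dataset stored in the system, and let wf_mol_property_0721 denote a stored model. Then G(ds_mol_20210421, wf_mol_property_0721) represents one relationship, on which the corresponding versions are stored. For instance, G(ds_mol_20210421, wf_mol_property_0721) = (3.12_beta, 2.41_alpha) means that the versions of dataset ds_mol_20210421 and model wf_mol_property_0721 at the time the connection was established were 3.12_beta and 2.41_alpha respectively.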
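The notation G(a, b) = (version of a, version of b) can be mimicked with a small in-memory structure. This is only a sketch of the idea, not the graph database used in the embodiment; the class and method names are assumptions, while the identifiers and versions are the ones from the example above.

```python
class RelationGraph:
    """Undirected relations between named objects, each storing a version pair."""

    def __init__(self):
        self._edges = {}

    def link(self, a, b, version_a, version_b):
        """Record the versions of a and b at the time they were connected."""
        self._edges[(a, b)] = (version_a, version_b)

    def versions(self, a, b):
        """Return (version of a, version of b), whichever order was stored."""
        if (a, b) in self._edges:
            return self._edges[(a, b)]
        version_b, version_a = self._edges[(b, a)]
        return (version_a, version_b)


G = RelationGraph()
G.link("ds_mol_20210421", "wf_mol_property_0721", "3.12_beta", "2.41_alpha")

pair = G.versions("ds_mol_20210421", "wf_mol_property_0721")
reverse_pair = G.versions("wf_mol_property_0721", "ds_mol_20210421")
```

Looking a relation up in either order reflects the undirected nature of the graph described in the text.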
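It should be noted that establishing the association between the workflow node and the dataset and model includes using a scheduling tool to create and change workflows and to establish the model relationships on the different nodes of a workflow. The scheduling tool may be luigi (a Python-based tool for building complex pipelines of batch jobs), airflow (a scheduling tool written in Python and open-sourced by Airbnb), or celery (a distributed asynchronous task queue developed in Python that makes asynchronous task processing easy).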
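As shown in FIG. 9, in one embodiment, the machine learning model management apparatus further comprises:
An acquisition module 01, configured to obtain datasets and store them in a dataset library.
All obtained datasets are structured data, so mature file version management software can be used directly.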
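For example, the DVC (data version control) tool is usually used together with git (an open-source distributed version control system) to manage machine learning experiment data, i.e. datasets. The training and test data used by each model may change between training runs, so when reproducing experimental results it is essential to use not only the same code and config (configuration) but also the same data. The DVC tool can easily store data on many storage systems, such as a local disk, an SSH (Secure Shell, a network protocol) server, or cloud systems such as AWS S3 (Amazon Web Services Simple Storage Service) and GCP (Google Cloud Platform). Data managed by the DVC tool can easily be shared with other users of the same storage system.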
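With the DVC tool and git, file version management based on a local or remote data repository can be achieved. The versioned dataset is then stored, together with its meta information, in a structured database so that it can be queried when needed. The meta information records basic information about the dataset, including creation time, creator and size, as well as some manual labels such as the number of features and the total number of entries; it allows the dataset needed for model training or workflow node execution to be matched quickly. Because the DVC tool supports both local file system storage and remote object storage, it can meet data management needs in different scenarios. If a dataset also needs to be encrypted, it can be encrypted, for example with the AES algorithm, before being stored.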
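To illustrate how meta information in a structured database can be used to match a dataset, the sketch below keeps the fields listed above in an in-memory SQLite table and queries it by feature count and minimum row count. The table layout, column names and sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE dataset_meta (
        name TEXT, version TEXT, created_at TEXT, creator TEXT,
        size_bytes INTEGER, n_features INTEGER, n_rows INTEGER,
        PRIMARY KEY (name, version)
    )
    """
)

# Hypothetical meta records for versioned datasets.
conn.executemany(
    "INSERT INTO dataset_meta VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("ds_mol_20210421", "3.12_beta", "2021-04-21", "alice", 1048576, 200, 5000),
        ("ds_mol_20210421", "3.11", "2021-04-01", "alice", 524288, 200, 800),
        ("ds_tox", "1.0", "2021-05-01", "bob", 2048, 50, 3000),
    ],
)


def find_datasets(connection, n_features, min_rows):
    """Return (name, version) of datasets matching the requested shape."""
    cursor = connection.execute(
        "SELECT name, version FROM dataset_meta "
        "WHERE n_features = ? AND n_rows >= ? ORDER BY name, version",
        (n_features, min_rows),
    )
    return cursor.fetchall()


matches = find_datasets(conn, n_features=200, min_rows=1000)
```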
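In this way, the datasets in the dataset library can be managed in terms of version management, data compression and storage, data extraction, integrity verification, caching and so on.
As shown in FIG. 9, in one embodiment, the acquisition module 01 is further configured to obtain models and store them in a model library.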
Each model carries a model description used to determine the usage scenario of the model.
The model description is a textual description of the model to be trained.
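Just as datasets are shared across application scenarios, models are reused. To facilitate later optimization and iteration of the model itself and of the workflow-dataset-model association in which it sits, models need to be versioned (for example with a data repository, the DVC tool and version meta information). During versioning the model is serialized. Model serialization means using software engineering techniques to convert a model object in memory into a data stream or file object that can be written to persistent storage. The concrete implementation depends on the form of the model object. Taking a model trained with the Python machine learning framework sklearn as an example: with Python's built-in pickle module, the model can be saved as a pickle object and written to a file; the file can be transferred to other hardware with the same environment, deserialized back into a model object with the pickle module, and used in the new environment. Verifying the training effect of a model, in turn, is done with a CI tool such as the Jenkins continuous integration tool.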
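The serialization round trip described above can be sketched with the standard library alone. To keep the example self-contained, a fitted statistics.NormalDist stands in for a trained sklearn model (an assumption made here; an sklearn estimator would be pickled the same way), and the sample values and helper names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from statistics import NormalDist


def save_model(model, path):
    """Serialize an in-memory model object to a file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Deserialize a model object from a file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# "Train" the stand-in model by fitting it to hypothetical measurements.
model = NormalDist.from_samples([0.8, 1.1, 0.9, 1.3, 1.0])

with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.pkl"
    save_model(model, model_file)
    restored = load_model(model_file)   # usable like the original object
```

The resulting file is what would then be placed under version control. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.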
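Based on the above embodiments, data, models and code are versioned by combining the DVC tool with git. Installation is simple (pip install dvc) and usage is convenient (dvc push; dvc pull). It is also fast: dvc add generates a new file, e.g. dvc add data.sql generates data.sql.dvc (on the order of kilobytes); git uploads this data.sql.dvc file, and from the .dvc file the DVC tool can pull the corresponding file (for example a data file kept on the remote). To obtain a specific version of the data, model and code, simply git checkout the version number and then run dvc pull.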
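The workflow nodes of a full workflow and the machine learning models used at each stage are linked through datasets and managed together, which improves the efficiency of data use, optimizes model performance and reveals potential problems promptly.
As shown in FIG. 9, in one embodiment, the machine learning model management apparatus further comprises:
A preprocessing module 02, configured to version the datasets in the dataset library and the models in the model library with the DVC tool.
The preprocessing module 02 is further configured to establish, for the versioned datasets and models, the association between each single dataset and all of its related models.
The datasets stored in the dataset library are available for training the models in the model library and for being called when workflow nodes are executed. Because many models in a full workflow may use the same dataset, the association between each single dataset and all of its related models needs to be established in advance to improve the efficiency of dataset use.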
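As shown in FIG. 10, in one embodiment, the machine learning model management apparatus further comprises:
An update module 30, configured to update all models associated with the dataset when the dataset is updated.
The update module 30 is further configured to update, according to the updated dataset and all updated models, the associations in which the updated dataset and/or any updated model is involved.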
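Because the prior art lacks engineering practice for model version management and iteration, in one embodiment the training, validation and iteration of multiple models on shared datasets proceeds as follows: workflows are built based on application scenarios; through the workflow, the models required by different application scenarios and at different stages are integrated into one DAG; and a dataset update serves as the trigger condition for iteratively updating the models themselves and the workflow-dataset-model associations in which they sit. Specifically, with the dataset update as the trigger, the overall effect of the full workflow is computed, and by observing changes in the results of the node tasks in the DAG, the effect of the dataset change on each stage can be analyzed, which gives directional guidance for optimizing each stage.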
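As shown in FIG. 11, in one embodiment, the update module 30 comprises:
A training module 31, configured to train all models associated with the dataset using the updated dataset as input.
It can be understood that when a dataset is updated, the multiple models mapped to it also need to be updated in step, through training and iteration.
A judgment module 32, configured to judge whether the training effect of the model currently being trained satisfies a first preset condition; if it does, the model currently being trained is updated, and otherwise the update of that model is not performed.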
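Whether a model counts as adequately trained, i.e. the first preset condition, is set according to the actual application scenario and training purpose of each model. If the training effect of the model currently being trained does not satisfy the first preset condition, the current model update is terminated and an update report is output stating why it was terminated.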
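The gating by the first preset condition can be sketched as follows. The text does not fix what the condition is, so a minimum validation score is assumed here purely for illustration; the function, the report type and the stand-in training and evaluation callables are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UpdateReport:
    """Outcome of one attempted model update."""
    model_name: str
    updated: bool
    reason: str


def try_update_model(model_name, train, evaluate, min_score):
    """Retrain a model and accept it only if it meets the assumed threshold."""
    candidate = train()
    score = evaluate(candidate)
    if score >= min_score:
        reason = f"score {score:.2f} meets threshold {min_score:.2f}"
        return candidate, UpdateReport(model_name, True, reason)
    # Condition not met: keep the old model and report why the update stopped.
    reason = f"score {score:.2f} is below threshold {min_score:.2f}"
    return None, UpdateReport(model_name, False, reason)


# Stand-in callables: real ones would train on and validate against the updated dataset.
accepted, ok_report = try_update_model(
    "solubility_model", lambda: "candidate-weights", lambda m: 0.91, min_score=0.85
)
rejected, fail_report = try_update_model(
    "toxicity_model", lambda: "candidate-weights", lambda m: 0.70, min_score=0.85
)
```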
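As shown in FIG. 11, in one embodiment, the update module 30 further comprises:
An execution module 33, configured to call the updated dataset and all updated models and to execute all workflow nodes associated with the updated dataset and/or any updated model.
When a dataset is updated, the associations in which it is involved also need to be updated. Likewise, because a dataset update causes all of its associated models to be updated, the associations involving those updated models need to be updated too. It can be understood that among the associations (workflow-dataset-model) to be updated, some have both the dataset and the model updated, while others have only the model updated.
The judgment module 32 is further configured to judge, after all workflow nodes have been executed, whether the overall effect of the workflow satisfies a second preset condition; if it does, the associations in which the updated dataset and/or any updated model is involved are updated, and otherwise that update is not performed.
The overall effect of the workflow is expressed as a weighted sum of the execution results of the workflow nodes.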
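A minimal sketch of the second check: after all affected nodes have run, their results are combined into a weighted sum, and the new associations are adopted only if that sum passes. A fixed minimum value is assumed as the second preset condition, and the node names, weights and version labels are hypothetical.

```python
def overall_effect(node_results, weights):
    """Weighted sum of the per-node execution results."""
    return sum(weights[node] * node_results[node] for node in weights)


def commit_relations(current, candidate, node_results, weights, min_overall):
    """Adopt the candidate associations only if the overall effect passes."""
    score = overall_effect(node_results, weights)
    if score >= min_overall:
        return candidate, score
    return current, score   # condition not met: keep the existing associations


weights = {"step1": 0.5, "step2": 0.5}
current = {"ds_mol": "v1", "solubility_model": "v1"}
candidate = {"ds_mol": "v2", "solubility_model": "v2"}

kept, low_score = commit_relations(
    current, candidate, {"step1": 0.6, "step2": 0.7}, weights, min_overall=0.8
)
adopted, high_score = commit_relations(
    current, candidate, {"step1": 0.9, "step2": 0.9}, weights, min_overall=0.8
)
```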
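As an example, in the workflow for studying the properties of small drug molecules shown in FIG. 6, the overall effect of the workflow is evaluated as a final value obtained from the effect of each node in the DAG object combined with empirical weights. For molecular property prediction, the preset target can be prediction accuracy, and each step (step1 to step6) can be assigned a share of that target, for example 10%, 20%, 20%, 25%, 15% and 10%. From the training results and the validation set, an overall accuracy under the current training set can then be obtained. Setting these shares requires some experience at the outset, but as data accumulates and model performance improves, the shares can be adjusted according to the final effect so that the overall effect meets expectations.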
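Using the DVC tool for version management of datasets, models and workflows, connecting datasets with models, and building workflows based on application scenarios makes the workflow as a whole more sensitive to data changes, facilitates unified management of valid datasets, and makes it possible to trace the change in final quality caused by changes at each node.
Third aspect.
One embodiment of the present application provides a computer device comprising a processor, a memory and a computer program stored in the memory; the processor is coupled to the memory and, in operation, executes the computer program to implement the machine learning model management method described above.
Fourth aspect.
One embodiment of the present application provides a computer-readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the machine learning model management method described above.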
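The above embodiments may be implemented wholly or partly in software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product comprises one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (for example coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (for example infrared, radio or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (for example a floppy disk, hard disk or magnetic tape), an optical medium (for example a digital versatile disc (DVD)) or a semiconductor medium (for example a solid state disk (SSD)).
The above are embodiments provided by the present application and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within its scope of protection.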
Claims (17)
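- A machine learning model management method, characterized by comprising: determining the dataset and model that correspond to the configuration information of a workflow node, wherein the workflow node is any workflow node in a workflow and the configuration information includes execution conditions pointing to the dataset and the model; and establishing an association between the workflow node and the dataset and model, wherein the association can be used to indicate the dataset and model that the workflow node calls when it is executed.
- The machine learning model management method according to claim 1, characterized in that, before determining the dataset and model that correspond to the configuration information of the workflow node, the method further comprises: obtaining a dataset; and storing the dataset in a dataset library.
- The machine learning model management method according to claim 2, characterized in that, before determining the dataset and model that correspond to the configuration information of the workflow node, the method further comprises: obtaining a model; and storing the model in a model library, wherein the model carries a model description used to determine the usage scenario of the model.
- The machine learning model management method according to claim 3, characterized in that, after storing the dataset in the dataset library and storing the model in the model library, the method further comprises: versioning the datasets in the dataset library and the models in the model library with the DVC tool; and establishing, for the versioned datasets and models, the association between each single dataset and all of its related models.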
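- The machine learning model management method according to any one of claims 1 to 4, characterized in that, after establishing the association between the workflow node and the dataset and model, the method further comprises: when the dataset is updated, updating all models associated with the dataset; and updating, according to the updated dataset and all updated models, the associations in which the updated dataset and/or any updated model is involved.
- The machine learning model management method according to claim 5, characterized in that updating all models associated with the dataset when the dataset is updated comprises: training all models associated with the dataset using the updated dataset as input; judging whether the training effect of the model currently being trained satisfies a first preset condition; and, if it does, updating the model currently being trained, and otherwise not performing the update of that model.
- The machine learning model management method according to claim 6, characterized in that updating, according to the updated dataset and all updated models, the associations in which the updated dataset and/or any updated model is involved comprises: calling the updated dataset and all updated models and executing all workflow nodes associated with the updated dataset and/or any updated model; judging, after all workflow nodes have been executed, whether the overall effect of the workflow satisfies a second preset condition; and, if it does, updating the associations in which the updated dataset and/or any updated model is involved, and otherwise not performing that update.
- The machine learning model management method according to claim 7, characterized in that the overall effect of the workflow is expressed as a weighted sum of the execution results of the workflow nodes.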
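- A machine learning model management apparatus, characterized by comprising: a screening module, configured to determine the dataset and model that correspond to the configuration information of a workflow node, wherein the workflow node is any workflow node in a workflow and the configuration information includes execution conditions pointing to the dataset and the model; and an association module, configured to establish an association between the workflow node and the dataset and model, wherein the association can be used to indicate the dataset and model that the workflow node calls when it is executed.
- The machine learning model management apparatus according to claim 9, characterized by further comprising: an acquisition module, configured to obtain a dataset and store the dataset in a dataset library.
- The machine learning model management apparatus according to claim 10, characterized in that the acquisition module is further configured to obtain a model and store the model in a model library, wherein the model carries a model description used to determine the usage scenario of the model.
- The machine learning model management apparatus according to claim 11, characterized by further comprising: a preprocessing module, configured to version the datasets in the dataset library and the models in the model library with the DVC tool; the preprocessing module being further configured to establish, for the versioned datasets and models, the association between each single dataset and all of its related models.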
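- The machine learning model management apparatus according to any one of claims 9 to 12, characterized by further comprising: an update module, configured to update all models associated with the dataset when the dataset is updated; the update module being further configured to update, according to the updated dataset and all updated models, the associations in which the updated dataset and/or any updated model is involved.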
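- The machine learning model management apparatus according to claim 13, characterized in that the update module comprises: a training module, configured to train all models associated with the dataset using the updated dataset as input; and a judgment module, configured to judge whether the training effect of the model currently being trained satisfies a first preset condition and, if it does, to cause the model currently being trained to be updated, and otherwise not to perform the update of that model.
- The machine learning model management apparatus according to claim 14, characterized in that the update module further comprises: an execution module, configured to call the updated dataset and all updated models and to execute all workflow nodes associated with the updated dataset and/or any updated model; the judgment module being further configured to judge, after all workflow nodes have been executed, whether the overall effect of the workflow satisfies a second preset condition and, if it does, to cause the associations in which the updated dataset and/or any updated model is involved to be updated, and otherwise not to perform that update.
- A computer device, characterized by comprising: a processor, a memory and a computer program stored in the memory, wherein the processor is coupled to the memory and, in operation, executes the computer program to implement the machine learning model management method according to any one of claims 1 to 8.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed by a computer, cause the computer to perform the machine learning model management method according to any one of claims 1 to 8.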
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2021/141344 WO2023115570A1 (zh) | 2021-12-24 | 2021-12-24 | Machine learning model management method, apparatus, computer device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023115570A1 true WO2023115570A1 (zh) | 2023-06-29 |
Family
ID=86901122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/141344 WO2023115570A1 (zh) | Machine learning model management method, apparatus, computer device and storage medium | 2021-12-24 | 2021-12-24 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2023115570A1 (zh) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21968699 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |