
CN111695702B - Training method, device, equipment and storage medium of molecular generation model - Google Patents

Training method, device, equipment and storage medium of molecular generation model

Info

Publication number
CN111695702B
Authority
CN
China
Prior art keywords
node
graph
molecular
features
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010546027.6A
Other languages
Chinese (zh)
Other versions
CN111695702A (en)
Inventor
徐挺洋
余俊驰
荣钰
黄俊洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010546027.6A priority Critical patent/CN111695702B/en
Publication of CN111695702A publication Critical patent/CN111695702A/en
Application granted granted Critical
Publication of CN111695702B publication Critical patent/CN111695702B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N99/00 Subject matter not provided for in other groups of this subclass
    • G06N99/007 Molecular computers, i.e. using inorganic molecules

Landscapes

  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a training method for a molecular generation model, comprising the following steps: obtaining a base molecule and a target molecule; encoding the base molecule through the encoding layer to obtain first graph node features and first tree node features, and encoding the target molecule to obtain second graph node features and second tree node features; matching the first graph node features with the second graph node features through the alignment layer to obtain first similarity features, and matching the first tree node features with the second tree node features to obtain second similarity features; through the generation layer, generating graph node features according to the first similarity features and the first graph node features, and generating tree node features according to the second similarity features and the first tree node features; decoding the graph node features and the tree node features through the decoding layer, respectively, to obtain a predicted molecule; and updating the model parameters based on the difference between the predicted molecule and the target molecule. With the trained model, a high-property molecule that retains part of the structure of the base molecule can be generated.

Description

Training method, device, equipment and storage medium of molecular generation model
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a training method, apparatus, device and computer readable storage medium for molecular generation models.
Background
Artificial intelligence (AI) is a comprehensive technology of computer science; by studying the design principles and implementation methods of various intelligent machines, it enables machines to perceive, reason and make decisions. Artificial intelligence is a comprehensive discipline that touches a wide range of fields, such as natural language processing and machine learning/deep learning; it is believed that, as technology develops, artificial intelligence will be applied in ever more fields and become increasingly valuable.
Applied to molecular generation, artificial intelligence can, through machine reasoning and decision-making, generate high-property molecules from low-property base molecules, reducing the time cost of finding high-property molecules manually. However, the training process of molecular generation models in the related art is cumbersome and time-consuming, and because those models cannot exploit the structural similarity between molecules, the generated molecules fail to retain the structural information of the original molecules, so their rationality is low.
Disclosure of Invention
The embodiment of the application provides a training method, device, equipment and storage medium for a molecular generation model, which can improve the training efficiency of the model, so that the trained model can generate predicted molecules that retain structural information of the base molecule.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a training method for a molecular generation model, wherein the molecular generation model comprises an encoding layer, an alignment layer, a generation layer and a decoding layer; the method comprises the following steps:
obtaining a molecular pair sample containing a base molecule and a target molecule, wherein the molecular pair sample is a biological molecular pair sample or a drug molecular pair sample;
wherein the molecular properties of the target molecule are higher than those of the base molecule, the base molecule is represented by a first molecular graph and a first molecular tree, and the target molecule is represented by a second molecular graph and a second molecular tree;
encoding, through the encoding layer, the base molecule in the molecular pair sample to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree, and encoding the target molecule in the molecular pair sample to obtain second graph node features of the second molecular graph and second tree node features of the second molecular tree;
matching the first graph node features with the second graph node features through the alignment layer to obtain first similarity features of the first molecular graph and the second molecular graph, and matching the first tree node features with the second tree node features to obtain second similarity features of the first molecular tree and the second molecular tree;
generating, through the generation layer, graph node features of the predicted molecule according to the first similarity features and the first graph node features, and generating tree node features of the predicted molecule according to the second similarity features and the first tree node features;
decoding the graph node features and the tree node features through the decoding layer, respectively, to obtain the predicted molecule represented by a molecular graph and a molecular tree;
and obtaining the difference between the predicted molecule and the target molecule, and updating the model parameters of the molecular generation model based on the difference.
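The training pipeline above (encode, align, generate, decode, update) can be sketched end to end as follows. This is a minimal illustrative sketch in NumPy, not the embodiment's implementation: the linear encoder, the softmax alignment, the unit-variance sampling, and all dimensions and names are hypothetical stand-ins for the graph/tree networks described below.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(node_feats, W):
    # Toy "encoding layer": one linear map per node (stand-in for a GNN).
    return node_feats @ W

def align(h_base, h_target):
    # Alignment layer: soft-attention match of base nodes against target nodes,
    # spliced with the base node features to form a similarity feature.
    scores = h_base @ h_target.T
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return np.concatenate([h_base, attn @ h_target], axis=1)

def generate(sim_feat, rng):
    # Generation layer: reparameterised Gaussian sample around the similarity
    # feature (mean = sim_feat, unit variance here, purely for illustration).
    mu, log_var = sim_feat, np.zeros_like(sim_feat)
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

# Hypothetical sizes: 5 base nodes, 6 target nodes, 4-d features.
W = rng.standard_normal((4, 4))
h_base = encode(rng.standard_normal((5, 4)), W)
h_target = encode(rng.standard_normal((6, 4)), W)
sim = align(h_base, h_target)   # (5, 8): base feature spliced with its match
z = generate(sim, rng)          # latent node features of the predicted molecule
print(sim.shape, z.shape)
```

A decoder would then map `z` back to a molecular graph, and the gap between the decoded molecule and the target molecule would drive the parameter update.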
The embodiment of the application provides a molecular generation method based on a molecular generation model, wherein the molecular generation model comprises an encoding layer, a generation layer and a decoding layer; the method comprises:
obtaining a base molecule, wherein the base molecule is a biological molecule or a drug molecule and is represented by a first molecular graph and a first molecular tree;
encoding the base molecule through the encoding layer to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree;
generating, through the generation layer, graph node features of the predicted molecule based on the first graph node features, and generating tree node features of the predicted molecule according to a standard Gaussian distribution and the first tree node features;
decoding the graph node characteristics and the tree node characteristics through the decoding layer respectively to obtain the predicted molecules;
wherein the molecular generation model is trained using the training method of the molecular generation model provided by the embodiment of the application.
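At inference time no target molecule is available, so the latent part of the generated features is drawn from a standard Gaussian rather than from an aligned similarity feature, as the method above states. A minimal sketch (the shapes, names, and the splice-then-decode arrangement are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

def generate_at_inference(h_base, latent_dim, rng):
    # Without a target molecule there is no alignment feature, so the latent
    # part is sampled from a standard Gaussian N(0, 1) and spliced onto the
    # encoded base-molecule node features.
    z = rng.standard_normal((h_base.shape[0], latent_dim))
    return np.concatenate([h_base, z], axis=1)

h_base = rng.standard_normal((5, 4))   # hypothetical encoded base-molecule nodes
feat = generate_at_inference(h_base, latent_dim=4, rng=rng)
print(feat.shape)
```

Because the base features are kept in the splice, the decoded molecule can retain part of the base structure while the Gaussian sample introduces variation.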
The embodiment of the application provides a training device of a molecular generation model, wherein the molecular generation model comprises a coding layer, an alignment layer, a generation layer and a decoding layer, and the device comprises:
the first acquisition module is used for acquiring a molecular pair sample containing basic molecules and target molecules, wherein the molecular pair sample is a biological molecular pair sample or a drug molecular pair sample;
wherein the molecular property of the target molecule is higher than that of the base molecule, the base molecule is represented by a first molecular diagram and a first molecular tree, and the target molecule is represented by a second molecular diagram and a second molecular tree;
the first encoding module is used for encoding, through the encoding layer, the base molecule in the molecular pair sample to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree, and encoding the target molecule in the molecular pair sample to obtain second graph node features and second tree node features;
the alignment module is used for matching the first graph node features with the second graph node features through the alignment layer to obtain first similarity features of the first molecular graph and the second molecular graph, and matching the first tree node features with the second tree node features to obtain second similarity features of the first molecular tree and the second molecular tree;
the first generation module is used for generating graph node characteristics of the predicted molecules according to the first similarity characteristics and the first graph node characteristics through the generation layer and generating tree node characteristics of the predicted molecules according to the second similarity characteristics and the first tree node characteristics;
the first decoding module is used for respectively decoding the graph node characteristics and the tree node characteristics through the decoding layer to obtain the prediction molecules represented by a molecular graph and a molecular tree;
And the updating module is used for acquiring the difference between the predicted molecule and the target molecule and updating the model parameters of the molecular generation model based on the difference.
In the above scheme, the first encoding module is further configured to encode, through a graph encoding network in the encoding layer, at least two first nodes in the first molecular graph to obtain coding features of the at least two first nodes, and take the coding features of the at least two first nodes as the first graph node features; wherein the first molecular graph corresponds to the molecular structure topology of the base molecule, and the at least two first nodes correspond to the constituent elements of the base molecule;
and to encode, through a tree encoding network in the encoding layer, at least two second nodes in the first molecular tree to obtain coding features of the at least two second nodes, and take the coding features of the at least two second nodes as the first tree node features; wherein the first molecular tree is constructed based on the molecular structure of the base molecule, with the constituent elements of the base molecule as the second nodes.
In the above solution, the first encoding module is further configured to perform, for each first node in the first molecular graph, the following operations:
acquiring, when at least two edges connected to the first node exist, the edge coding features of the at least two edges connected to the first node;
summing the edge coding features of the at least two edges to obtain a first edge aggregation feature;
and generating the node coding features of the first node based on the attribute features of the first node and the first edge aggregation feature.
In the above solution, the first encoding module is further configured to, for each of at least two edges connected to the first node, perform the following operations:
when the edge connects the first node and a neighbor node and at least two edges connected to the neighbor node exist, acquiring the attribute features of the at least two edges connected to the neighbor node;
summing the attribute features of the at least two edges connected to the neighbor node to obtain a second edge aggregation feature;
and generating the edge coding features of the edge based on the attribute features of the first node, the attribute features of the neighbor node and the second edge aggregation feature.
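The two aggregation steps above (edge attributes summed around a neighbor to encode an edge, then incident-edge codes summed to encode a node) can be illustrated as follows. The combination functions are assumed to be plain sums for brevity; in the embodiment they would be learned transformations, and the graph shown is a hypothetical triangle.

```python
import numpy as np

def edge_feature(v, u, node_attr, edge_attr, adj):
    # Edge (v, u): aggregate the attribute features of all edges touching the
    # neighbour u (second edge aggregation feature), then combine with the
    # attribute features of both endpoints (here by summing).
    agg = sum(edge_attr[frozenset((u, w))] for w in adj[u])
    return node_attr[v] + node_attr[u] + agg

def node_feature(v, node_attr, edge_attr, adj):
    # Node v: sum the coding features of its incident edges (first edge
    # aggregation feature), then combine with the node's own attributes.
    agg = sum(edge_feature(v, u, node_attr, edge_attr, adj) for u in adj[v])
    return node_attr[v] + agg

# Hypothetical triangle graph with 2-d attribute features.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
node_attr = {v: np.ones(2) * (v + 1) for v in adj}
edge_attr = {frozenset(e): np.ones(2) for e in [(0, 1), (0, 2), (1, 2)]}
h0 = node_feature(0, node_attr, edge_attr, adj)
print(h0)
```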
In the above scheme, the first graph node features include the coding features of at least two first nodes in the first molecular graph, and the second graph node features include the coding features of at least two third nodes in the second molecular graph;
the alignment module is further configured to obtain, based on the first graph node features and the second graph node features, first similarities from each first node in the first molecular graph to the at least two third nodes in the second molecular graph, and second similarities from each third node in the second molecular graph to the at least two first nodes in the first molecular graph;
and to aggregate, according to the first similarities and the second similarities, the coding features in the first graph node features and the second graph node features, so as to obtain the first similarity features of the first molecular graph and the second molecular graph.
In the above scheme, the alignment module is further configured to aggregate, according to the first similarities, the coding features of the at least two first nodes in the first graph node features and the at least two third nodes in the second graph node features, so as to obtain similarity features from the first molecular graph to the second molecular graph;
to aggregate, according to the second similarities, the coding features of the at least two first nodes in the first graph node features and the at least two third nodes in the second graph node features, so as to obtain similarity features from the second molecular graph to the first molecular graph;
and to splice the similarity features from the first molecular graph to the second molecular graph with the similarity features from the second molecular graph to the first molecular graph, so as to obtain the first similarity features of the first molecular graph and the second molecular graph.
In the above scheme, the alignment module is further configured to perform a weighted summation of the coding features of the at least two third nodes in the second molecular graph according to the first similarities from each first node in the first molecular graph to the at least two third nodes in the second molecular graph, so as to obtain a first aggregate feature corresponding to each first node in the first molecular graph;
to splice the coding features of each first node in the first molecular graph with the corresponding first aggregate feature, so as to obtain a first splicing feature corresponding to each first node in the first molecular graph;
and to sum the first splicing features corresponding to each first node in the first molecular graph, so as to obtain the similarity features from the first molecular graph to the second molecular graph.
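The weighted-summation, splicing and summing steps above amount to a soft-attention match between the two graphs. A hypothetical NumPy sketch follows; softmax weighting is an assumed choice (the embodiment only requires similarity-based weights), and all dimensions are illustrative.

```python
import numpy as np

def graph_to_graph_similarity(h_x, h_y):
    # For each node of graph X, weight the nodes of graph Y by (softmax of)
    # pairwise similarity, splice each X node with its aggregate, then sum
    # the splicing features over the X nodes.
    scores = h_x @ h_y.T                           # pairwise node similarities
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    agg = attn @ h_y                               # first aggregate feature per X node
    spliced = np.concatenate([h_x, agg], axis=1)   # first splicing features
    return spliced.sum(axis=0)                     # X-to-Y similarity feature

rng = np.random.default_rng(2)
h1, h2 = rng.standard_normal((5, 4)), rng.standard_normal((6, 4))
s12 = graph_to_graph_similarity(h1, h2)            # first graph -> second graph
s21 = graph_to_graph_similarity(h2, h1)            # second graph -> first graph
first_similarity = np.concatenate([s12, s21])      # splice of both directions
print(first_similarity.shape)
```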
In the above solution, the first generation module is further configured to obtain, based on the first similarity features, a mean and a variance corresponding to the first similarity features;
to obtain, based on the mean and the variance corresponding to the first similarity features, a Gaussian distribution corresponding to the first similarity features;
to sample from the Gaussian distribution to obtain sampling features;
and to splice the sampling features with the first graph node features, so as to obtain the graph node features of the predicted molecule.
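The mean/variance estimation, Gaussian sampling and splicing steps above can be sketched with the usual reparameterisation trick. The linear mean and log-variance heads, the broadcast of one sample to all nodes, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

def generate_graph_node_features(sim_feat, h_base, W_mu, W_logvar, rng):
    # Read a mean and (log-)variance off the similarity feature, draw a sample
    # from the resulting Gaussian, and splice the sample onto the first graph
    # node features of each base node.
    mu = sim_feat @ W_mu
    log_var = sim_feat @ W_logvar
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)  # reparameterised sample
    return np.concatenate([h_base, np.tile(z, (h_base.shape[0], 1))], axis=1)

# Hypothetical sizes: 16-d similarity feature, 5 base nodes, 4-d features, 3-d latent.
sim_feat = rng.standard_normal(16)
h_base = rng.standard_normal((5, 4))
W_mu, W_logvar = rng.standard_normal((16, 3)), rng.standard_normal((16, 3))
feats = generate_graph_node_features(sim_feat, h_base, W_mu, W_logvar, rng)
print(feats.shape)
```

Sampling through a differentiable transform of standard noise is what lets the update step backpropagate through the generation layer.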
In the above scheme, the updating module is further configured to obtain the probability that the predicted molecule is the same as the target molecule;
acquiring, based on the difference, the information divergence between the posterior probability distribution of the sampling features and the standard Gaussian distribution;
determining a value of a variational loss function based on the probability and the information divergence;
acquiring a center representation of the base molecule, a center representation of the predicted molecule and a center representation of the target molecule;
determining a value of a hidden loss function based on the center representations of the base molecule, the predicted molecule and the target molecule;
summing the value of the variational loss function and the value of the hidden loss function to obtain the value of the loss function of the molecular generation model;
updating model parameters of the molecular generation model based on the value of the loss function of the molecular generation model.
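The loss described above (variational loss plus hidden loss) can be illustrated as follows. The closed-form KL term between a diagonal Gaussian and the standard Gaussian is standard; the exact form of the hidden loss over the three center representations is not specified here, so the margin-style form below is only an assumption for illustration.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions (closed form).
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def total_loss(recon_prob, mu, log_var, c_base, c_pred, c_target):
    # Variational loss: negative log-probability of reproducing the target,
    # plus the information divergence of the sampling posterior.
    variational = -np.log(recon_prob) + kl_to_standard_normal(mu, log_var)
    # Hidden loss (assumed form): pull the predicted molecule's center towards
    # the target center and away from the base center.
    hidden = np.linalg.norm(c_pred - c_target) - np.linalg.norm(c_pred - c_base)
    return variational + hidden

mu, log_var = np.zeros(3), np.zeros(3)   # posterior equals N(0, 1): KL term is 0
loss = total_loss(0.5, mu, log_var,
                  c_base=np.array([1.0, 0.0]),
                  c_pred=np.array([0.0, 0.0]),
                  c_target=np.array([0.0, 0.0]))
print(loss)
```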
In the above scheme, the first graph node features include the coding features of at least two first nodes in the first molecular graph, and the first tree node features include the coding features of at least two second nodes in the first molecular tree;
the updating module is further configured to acquire a first average of the coding features of the at least two first nodes in the first molecular graph and a second average of the coding features of the at least two second nodes in the first molecular tree;
and to splice the first average with the second average to obtain the center representation of the base molecule.
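The center representation above is simply the splice of the two per-node averages; a small worked example (the feature values are arbitrary illustrations):

```python
import numpy as np

def center_representation(graph_node_feats, tree_node_feats):
    # Average the graph-node coding features and the tree-node coding features,
    # then splice the two averages into one center vector for the molecule.
    return np.concatenate([graph_node_feats.mean(axis=0),
                           tree_node_feats.mean(axis=0)])

g = np.array([[1.0, 2.0], [3.0, 4.0]])   # hypothetical graph-node coding features
t = np.array([[0.0, 2.0], [4.0, 6.0]])   # hypothetical tree-node coding features
c = center_representation(g, t)
print(c)
```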
In the above scheme, the first decoding module is further configured to process the graph node features through a gated recurrent network to obtain the information vectors passed between nodes;
for any decoded node, to obtain, based on the information vectors passed between the node and other nodes, the probability of adding a new node; and,
when the probability is determined to be higher than a probability threshold, to determine the type of the new node according to the information vectors passed between the node and other nodes, so as to add a new node of that type at the node.
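A single decoding decision, as described above, first thresholds the probability of adding a node and then, if a node is to be added, picks its type. A hypothetical sketch: the information vector would come from the gated recurrent network, but here it is a plain vector, and the sigmoid/argmax heads and the number of node types are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_step(msg, W_add, W_type, threshold=0.5):
    # msg: information vector passed to the current node from other nodes.
    p_add = sigmoid(msg @ W_add)          # probability of adding a new node
    if p_add <= threshold:
        return p_add, None                # stop expanding at this node
    logits = msg @ W_type
    return p_add, int(np.argmax(logits))  # type of the new node to attach

rng = np.random.default_rng(4)
msg = rng.standard_normal(6)
W_add = rng.standard_normal(6)
W_type = rng.standard_normal((6, 3))      # 3 hypothetical node types
p, node_type = decode_step(msg, W_add, W_type)
print(p, node_type)
```

Repeating this step node by node grows the molecular graph or tree until no node's add-probability exceeds the threshold.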
A molecular generation device based on a molecular generation model, the molecular generation model comprising: an encoding layer, a generating layer, and a decoding layer, the apparatus comprising:
the second acquisition module is used for acquiring basic molecules, wherein the basic molecules are biological molecules or drug molecules and are represented by a first molecular graph and a first molecular tree;
the second encoding module is used for encoding the base molecule through the encoding layer to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree;
the second generation module is used for generating graph node characteristics of the predicted molecules based on the first graph node characteristics through the generation layer and generating tree node characteristics of the predicted molecules according to standard Gaussian distribution and the first tree node characteristics;
the second decoding module is used for respectively decoding the graph node characteristics and the tree node characteristics through the decoding layer to obtain the prediction molecules;
the molecular generation model is obtained by training based on the training method of the molecular generation model provided by the embodiment of the application.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the training method of the molecular generation model provided by the embodiment of the application when executing the executable instructions stored in the memory.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the molecular generation method based on the molecular generation model when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for causing a processor to execute the training method of the molecular generation model provided by the embodiment of the application.
The embodiment of the application provides a computer readable storage medium storing executable instructions which, when executed by a processor, implement the molecular generation method based on the molecular generation model provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
1) A molecular pair sample containing a base molecule and a target molecule is obtained, and the molecular generation model is trained on this sample; since the molecular properties of the target molecule are higher than those of the base molecule, the trained model can generate, from a base molecule, a predicted molecule whose molecular properties are higher than those of the base molecule, thereby optimizing molecular properties;
2) The first graph node features are matched with the second graph node features through the alignment layer to obtain first similarity features of the first molecular graph and the second molecular graph, and the first tree node features are matched with the second tree node features to obtain second similarity features of the first molecular tree and the second molecular tree; the generation layer then generates the graph node features of the predicted molecule according to the first similarity features and the first graph node features, and the tree node features of the predicted molecule according to the second similarity features and the first tree node features. In this way, the structural similarity between the base molecule and the target molecule can be explored and combined with the base molecule when generating the predicted molecule, so that predicted molecules generated by the trained model retain part of the structural information of the base molecule.
Drawings
FIG. 1 is a schematic diagram of an implementation scenario of a training method of a molecular generation model according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 3 is a flow chart of a training method of a molecular generation model according to an embodiment of the present application;
FIG. 4A is a schematic diagram of a molecular structure of a basic molecule according to an embodiment of the present application;
FIG. 4B is a schematic diagram of the molecular structure of a target molecule according to an embodiment of the present application;
FIG. 5A is a schematic diagram of a first molecular diagram provided by an embodiment of the present application;
FIG. 5B is a schematic diagram of a first molecular tree provided by an embodiment of the present application;
FIG. 6 is a schematic flow chart of a molecular generation method based on a molecular generation model according to an embodiment of the present application;
FIG. 7 is a flow chart of a training method of a molecular generation model according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a molecular generation method based on a molecular generation model according to an embodiment of the present application;
fig. 9 is a schematic diagram of a composition structure of a training device for molecular generation model according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are used merely to distinguish similar objects and do not imply a specific ordering; it should be understood that "first", "second" and "third" may, where permitted, be interchanged in a specific order or sequence, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Molecule: an ensemble of atoms whose constituent atoms are joined together in a certain bonding order and spatial arrangement; this bonding order and spatial arrangement is referred to as the molecular structure.
2) Molecular properties: the physical properties, chemical properties, etc. of a molecule. For example, drug molecules with polar groups have a strong affinity for water, can attract water molecules or dissolve in water, and the surfaces of solid materials formed by such drug molecules are easily wetted by water; these drug molecules are hydrophilic. Certain kinds of drug molecules cause direct or indirect damage after contacting or entering a living organism; such drug molecules are biologically toxic (biologically harmful). Molecular properties depend not only on the kind and number of constituent atoms but also on the structure of the molecule.
3) Gaussian distribution, i.e. normal distribution: if a random variable X obeys a probability distribution with location parameter μ and scale parameter σ, and its probability density function is f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)), this random variable is called a Gaussian (normal) random variable, and the distribution it obeys is called a Gaussian (normal) distribution, denoted X ~ N(μ, σ²).
4) Standard Gaussian distribution: when μ = 0 and σ = 1, the Gaussian distribution becomes the standard Gaussian distribution N(0, 1).
5) Information divergence, also known as Kullback-Leibler divergence or relative entropy, is an asymmetric measure of the difference between two probability distributions; in information theory, the information divergence is equivalent to the difference between the information entropies of the two probability distributions.
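The asymmetry of the information divergence can be checked directly with the standard closed form for two univariate Gaussians (this closed form is well known; the parameter values below are arbitrary examples):

```python
import numpy as np

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in closed form. Note that it
    # is asymmetric: swapping the two distributions generally changes the value.
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2) - 0.5)

a = kl_gaussians(0.0, 1.0, 1.0, 2.0)
b = kl_gaussians(1.0, 2.0, 0.0, 1.0)
print(a, b)                               # asymmetric: a != b
print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # identical distributions: divergence 0
```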
Based on the above explanation of the terms involved in the embodiments of the present application, the implementation scenario of the training method of the molecular generation model will be described first. Referring to fig. 1, fig. 1 is a schematic diagram of an implementation scenario of the training method of the molecular generation model provided in the embodiments of the present application. To support an exemplary application, the terminals include a terminal 200-1 and a terminal 200-2: the terminal 200-1 is located on the developer side and is used to control the training of the molecular generation model, and the terminal 200-2 is located on the user side and is used to request the generation of predicted molecules corresponding to base molecules. The terminals are connected to the server 100 through the network 300, which may be a wide area network, a local area network, or a combination of the two, with data transmission over wireless or wired links.
A terminal 200-1 for transmitting training instructions for the molecular generation model to the server 100;
A server 100 for obtaining a molecular pair sample comprising a base molecule and a target molecule; wherein the molecular properties of the target molecule are higher than those of the base molecule, the base molecule is represented by a first molecular graph and a first molecular tree, and the target molecule is represented by a second molecular graph and a second molecular tree; encoding, through the encoding layer, the base molecule in the molecular pair sample to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree, and encoding the target molecule in the molecular pair sample to obtain second graph node features and second tree node features; matching the first graph node features with the second graph node features through the alignment layer to obtain first similarity features of the first molecular graph and the second molecular graph, and matching the first tree node features with the second tree node features to obtain second similarity features of the first molecular tree and the second molecular tree; generating, through the generation layer, graph node features of the predicted molecule according to the first similarity features and the first graph node features, and tree node features of the predicted molecule according to the second similarity features and the first tree node features; decoding the graph node features and the tree node features through the decoding layer, respectively, to obtain the predicted molecule represented by a molecular graph and a molecular tree; and obtaining the difference between the predicted molecule and the target molecule and updating the model parameters of the molecular generation model based on the difference.
After the molecular generation model is trained, the terminal 200-2 is configured to send a molecular generation instruction for instructing generation of a predicted molecule corresponding to the base molecule;
Here, the base molecule is a molecule with a lower molecular property; by sending the molecule generation instruction, the terminal 200-2 requests the generation of a predicted molecule whose molecular property is higher than that of the base molecule.
The server 100 is configured to generate a predicted molecule according to the base molecule by training the obtained molecule generation model in response to the molecule generation instruction, and return the predicted molecule to the terminal 200-2.
In some embodiments, the server 100 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms. The terminal (e.g., terminal 200-1) may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
In practical application, the molecular generation model provided by the embodiment of the application can be applied to the fields of structural biology and medicine, and drug discovery, molecular optimization, molecular generation and the like are realized through the molecular generation model.
Taking drug discovery as an example, both the base molecule and the target molecule in the molecular pair samples used for training the molecular generation model are drug molecules, and the target molecule is superior to the base molecule with respect to the desired drug property; the molecular generation model is then trained based on these molecular pair samples.
After training to obtain a molecular generation model, drug discovery can be performed through the molecular generation model. Illustratively, a drug discovery application client is provided on the terminal, through which a user can input a base molecule, and the terminal transmits a molecule generation instruction of a predicted molecule corresponding to the base molecule to the server 100 through the network 300; after receiving the instruction, the server 100 extracts basic molecules in the instruction, and generates predicted molecules according to the basic molecules; wherein the inputted basic molecule is a drug molecule.
The hardware structure of the electronic device implementing the training method of the molecular generation model provided by the embodiment of the application is described in detail below, and the electronic device includes, but is not limited to, a server or a terminal. Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device provided by an embodiment of the present application. The electronic device shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components of the electronic device are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. However, for clarity of illustration, the various buses are all labeled as the bus system 440 in fig. 2.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be read only memory (ROM, Read Only Memory) and the volatile memory may be random access memory (RAM, Random Access Memory). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
network communication module 452 for reaching other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 include: Bluetooth, Wireless Fidelity (WiFi), and universal serial bus (USB, Universal Serial Bus), etc.;
a presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the training device for the molecular generation model provided in the embodiments of the present application may be implemented in software, and fig. 2 shows the training device 455 for the molecular generation model stored in the memory 450, which may be software in the form of a program and a plug-in, and includes the following software modules: the first acquisition module 4551, the first encoding module 4552, the alignment module 4553, the first generation module 4554, the first decoding module 4555 and the update module 4556. These modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented.
The functions of the respective modules will be described hereinafter.
In other embodiments, the training apparatus for molecular generation models provided by the embodiments of the present application may be implemented in hardware, and by way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the training method for molecular generation models provided by the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSP, programmable logic device (PLD, Programmable Logic Device), complex programmable logic device (CPLD, Complex Programmable Logic Device), field programmable gate array (FPGA, Field-Programmable Gate Array), or other electronic components.
Based on the above description of the implementation scenario and the electronic device of the training method for the molecular generation model according to the embodiment of the present application, the training method for the molecular generation model provided by the embodiment of the present application is described below.
The molecular generation model provided by the embodiment of the application comprises a coding layer, an alignment layer, a generation layer and a decoding layer, referring to fig. 3, fig. 3 is a flow diagram of a training method of the molecular generation model provided by the embodiment of the application; in some embodiments, the training method of the molecular generation model may be implemented by a server or a terminal alone or cooperatively, and in an embodiment of the present application, the training method of the molecular generation model provided by the embodiment of the present application includes:
step 301: the server obtains a molecular pair sample comprising a base molecule and a target molecule.
Here, in practical applications, the molecular pair sample may be a biological molecular pair sample or a drug molecular pair sample, that is, the base molecule and the target molecule may be biological molecules (such as protein molecules) or drug molecules (such as drug molecules with polar groups).
In some embodiments, the molecular properties of the target molecule are higher than the molecular properties of the base molecule, the base molecule being represented by a first molecular graph and a first molecular tree, the target molecule being represented by a second molecular graph and a second molecular tree.
In practical implementation, the molecular generation model provided by the embodiment of the application is used for optimizing molecular properties, that is, a base molecule with a lower molecular property is input into the molecular generation model, the base molecule is processed accordingly, and a predicted molecule with a molecular property higher than that of the base molecule is output. Thus, in the molecular pair samples used to train the molecular generation model, the molecular property of the target molecule should be higher than that of the base molecule. Here, a higher molecular property means that the target molecule has a significant improvement over the base molecule with respect to the desired property; for example, if the desired property is hydrophilicity, the hydrophilicity of the target molecule should be superior to that of the base molecule.
It should be noted that the basic molecule and the target molecule should be similar in structure, so that the target molecule can be generated based on the basic molecule, that is, the structural similarity between the basic molecule and the target molecule should reach the similarity threshold, for example, fig. 4A is a schematic molecular structure of the basic molecule provided by the embodiment of the present application, fig. 4B is a schematic molecular structure of the target molecule provided by the embodiment of the present application, see fig. 4A and 4B, where a part of the structure in the dashed frame in fig. 4A is the same as a part of the structure in the dashed frame in fig. 4B.
In order to accurately describe the molecular structures of the base molecule and the target molecule, the embodiment of the application adopts a graph structure and a tree structure to represent the base molecule and the target molecule, that is, the base molecule is represented by a first molecular graph and a first molecular tree, and the target molecule is represented by a second molecular graph and a second molecular tree.
The graph structure and the tree structure are composed of nodes and edges for connecting the nodes, wherein in the graph structure, the relation between the nodes can be arbitrary, and any two data elements in the graph can be related; in the tree structure, there is a significant hierarchical relationship between data elements, and the data elements on each layer may be related to multiple elements (i.e., child nodes thereof) in the next layer, but only one element (i.e., parent nodes thereof) in the previous layer.
In practical implementation, the graph structure corresponds to the topology of the molecular structure, the nodes in the graph structure correspond to the constituent elements (such as atoms) of the molecule, and the edges correspond to the bonds between the constituent elements, for example, taking the molecular structure in fig. 4A as an example, fig. 5A is a schematic diagram of a first molecular diagram provided in an embodiment of the present application, see fig. 4A and 5A, where the molecular structure in fig. 4A corresponds to the first molecular diagram in fig. 5A, and the constituent elements in fig. 4A correspond to the nodes in the first molecular diagram in fig. 5A.
After the graph structure of the molecule is obtained, some nodes in the graph structure can be contracted into single nodes according to chemical concepts to generate a corresponding tree structure; for example, in the graph structure, one benzene ring comprises 6 nodes, and these six nodes can be contracted into a single node.
Taking the molecular structure in fig. 4A as an example, fig. 5B is a schematic diagram of a first molecular tree provided by an embodiment of the present application. Referring to fig. 5A and 5B, some nodes of the first molecular graph in fig. 5A are contracted into single nodes to obtain the first molecular tree in fig. 5B; for example, the five-membered ring in the first molecular graph is contracted into one node, resulting in node 501 in fig. 5B; the six-membered ring in the first molecular graph is likewise contracted into one node, resulting in node 502 in fig. 5B.
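As a concrete illustration of this contraction, the following sketch collapses the nodes of one ring in a molecular graph into a single node, which is the basic operation used to build the molecular tree; the edge-set representation and the helper name are illustrative assumptions, not the patent's implementation:

```python
# Hypothetical sketch: collapse the nodes of one ring into a single node to
# turn a molecular graph into a tree-like structure. The edge-set encoding
# and function name are illustrative, not taken from the patent.
def contract_ring(edges, ring_nodes):
    """Replace every ring node by one representative node; in-ring edges vanish."""
    rep = min(ring_nodes)
    new_edges = set()
    for u, v in edges:
        u2 = rep if u in ring_nodes else u
        v2 = rep if v in ring_nodes else v
        if u2 != v2:                      # edges inside the ring collapse away
            new_edges.add((min(u2, v2), max(u2, v2)))
    return new_edges

# A benzene-like six-membered ring (atoms 0..5) with one substituent atom 6.
ring = {0, 1, 2, 3, 4, 5}
edges = {(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5), (0, 6)}
print(contract_ring(edges, ring))         # {(0, 6)}: the whole ring became node 0
```

Applying this operation to every ring of the graph yields the tree structure in which each contracted node keeps only its edges to nodes outside the ring.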
Step 302: and encoding the basic molecules in the sample through the encoding layer to obtain first graph node characteristics of the first molecular graph and first tree node characteristics of the first molecular tree, and encoding the target molecules in the sample to obtain second graph node characteristics of the second molecular graph and second tree node characteristics of the second molecular tree.
Because the base molecule is represented by the first molecular graph and the first molecular tree, in actual implementation, the first molecular graph and the first molecular tree are input into the coding layer, and the coding layer encodes them to obtain the first graph node features of the first molecular graph and the first tree node features of the first molecular tree; similarly, the second molecular graph and the second molecular tree are input into the coding layer, and the coding layer encodes them to obtain the second graph node features of the second molecular graph and the second tree node features of the second molecular tree.
In some embodiments, the base molecule in the molecular pair sample may be encoded by: coding at least two first nodes in the first molecular graph through a graph coding network in the coding layer to obtain coding features of the at least two first nodes, and taking the coding features of the at least two first nodes as the first graph node features; wherein the first molecular graph corresponds to the molecular structure topology of the base molecule, and the at least two first nodes correspond to constituent elements constituting the base molecule; coding at least two second nodes in the first molecular tree through a tree coding network in the coding layer to obtain coding features of the at least two second nodes, and taking the coding features of the at least two second nodes as the first tree node features; the first molecular tree is constructed by taking constituent elements of the base molecule as second nodes based on the molecular structure of the base molecule.
Here, the coding layer includes two coding networks, namely a graph coding network and a tree coding network; the graph coding network is used for processing data of the graph structure, namely the first molecular graph, and the tree coding network is used for processing data of the tree structure, namely the first molecular tree.
In actual practice, the graph coding network will generate one coding feature for each first node in the first molecular graph, and the tree coding network will generate one coding feature for each second node in the first molecular tree. The graph coding network and the tree coding network may use the same type of network to implement the coding process, such as message passing neural networks (MPNNs, Message Passing Neural Networks), a graph convolutional network (GCN, Graph Convolutional Network), or GraphSAGE (Graph SAmple and aggreGatE).
In some embodiments, at least two first nodes in the first molecular diagram may be encoded by:
for each first node in the first molecular graph, performing the following operations: acquiring edge coding features of at least two edges connected with the first node when the at least two edges connected with the first node exist; summing the edge coding features of at least two edges to obtain a first edge aggregation feature; and generating node coding features of the first node based on the attribute features and the first edge aggregation features of the first node.
In practical implementation, all edges connected with the first node can be encoded first to obtain edge encoding features of all edges connected with the first node, then the edge encoding features are aggregated into a feature, namely a first edge aggregation feature, and then the node encoding feature of the first node is generated by combining the first edge aggregation feature and the attribute feature of the first node.
Here, the attribute feature of the first node may be determined according to the element category corresponding to the first node; the attribute features of first nodes corresponding to different elements are different.
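For instance, one simple way to realize such element-dependent attribute features is a one-hot encoding over an element vocabulary; the patent does not specify the encoding, so the vocabulary and function below are illustrative assumptions:

```python
# Hypothetical sketch: attribute features determined by element category via
# one-hot encoding. The element vocabulary is an illustrative assumption.
ELEMENTS = ["C", "N", "O", "S"]

def atom_attribute(element):
    """Return a distinct attribute feature for each element category."""
    return [1.0 if e == element else 0.0 for e in ELEMENTS]

print(atom_attribute("N"))   # [0.0, 1.0, 0.0, 0.0]
```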
In some embodiments, the edge connecting the first node may be encoded by:
for each of at least two edges connecting a first node, performing the following: when the edge is an edge connecting the first node and the neighbor node and at least two edges connecting the neighbor node exist, acquiring attribute characteristics of at least two edges connecting the neighbor node; summing the attribute characteristics of at least two edges connected with the neighbor nodes to obtain a second edge aggregation characteristic; and generating edge coding features of the edges based on the attribute features of the first node, the attribute features of the neighboring nodes and the second edge aggregation features.
Here, the attribute characteristics of an edge may be determined according to elements corresponding to two nodes to which the edge is connected. In practical implementation, for the edges between the first node and the neighboring node, the attribute features of all edges connected with the neighboring node are acquired, and then the attribute features are aggregated into a feature, namely a second edge aggregation feature, and the edge coding feature of the edge is generated by combining the attribute features of the first node, the attribute features of the neighboring node and the second edge aggregation feature.
In some embodiments, when the number of layers of the graph coding network is at least two, then the coding characteristics of the first node of the t-th layer network output in the graph coding network may be determined by:
When t=1, the edge coding feature of the edge connecting the first node i and the first node j output by the layer-t network is obtained according to formula (1):

$$m_{ij}^{(1)} = f_1\Big(f_i,\ f_j,\ \sum_{k \in N(j)\setminus\{i\}} f_{jk}\Big) \tag{1}$$

wherein $m_{ij}^{(1)}$ is the edge coding feature of the edge connecting the first node i and the first node j obtained through the layer-t network, the first node j being a neighbor node of the first node i; $f_1$ represents a neural network; $f_i$ is the attribute feature of the first node i; $f_j$ is the attribute feature of the first node j; $f_{jk}$ is the attribute feature of the edge between the first node j and the first node k, the first node k being a neighbor node of the first node j;

then, the coding feature of the first node output by the layer-t network is obtained according to formula (2):

$$h_i^{(1)} = f_2\Big(f_i,\ \sum_{j \in N(i)} m_{ij}^{(1)}\Big) \tag{2}$$

wherein $f_2$ represents a neural network, and $h_i^{(1)}$ represents the coding feature of the first node i output by the layer-t network;
when t>1, the edge coding feature of the edge connecting the first node i and the first node j output by the layer-t network is obtained according to formula (3):

$$m_{ij}^{(t)} = f_1\Big(h_i^{(t-1)},\ h_j^{(t-1)},\ \sum_{k \in N(j)\setminus\{i\}} m_{jk}^{(t-1)}\Big) \tag{3}$$

wherein $m_{ij}^{(t)}$ is the edge coding feature of the edge connecting the first node i and the first node j obtained through the layer-t network, the first node j being a neighbor node of the first node i; $f_1$ represents a neural network; $h_i^{(t-1)}$ is the coding feature of the first node i obtained through the layer t-1 network; $h_j^{(t-1)}$ is the coding feature of the first node j obtained through the layer t-1 network; $m_{jk}^{(t-1)}$ is the coding feature of the edge between the first node j and the first node k obtained through the layer t-1 network, the first node k being a neighbor node of the first node j;

the coding feature of the first node output by the layer-t network is then obtained according to formula (4):

$$h_i^{(t)} = f_2\Big(h_i^{(t-1)},\ \sum_{j \in N(i)} m_{ij}^{(t)}\Big) \tag{4}$$

wherein $f_2$ represents a neural network, $h_i^{(t)}$ represents the coding feature of the first node i output by the layer-t network, and $h_i^{(t-1)}$ is the coding feature of the first node i obtained through the layer t-1 network.
Here, the coding feature of the first node i obtained through the last layer of the network may be denoted as $x_i^G = h_i^{(T)}$, and the first graph node features may be written as $X^G = \{x_1^G, x_2^G, \dots, x_{n_G}^G\}$, where $n_G$ represents the number of first nodes in the first molecular graph.
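The iteration described by formulas (1)-(4) can be sketched as follows; this is a minimal illustration in which the learned networks $f_1$ and $f_2$ are replaced by simple element-wise stand-ins, and all names, dimensions, and toy inputs are assumptions rather than the patent's implementation:

```python
import numpy as np

def encode_graph(x, e, adj, T=2):
    """Sketch of formulas (1)-(4). x: {node: attribute vector}, e: {(u, v): edge
    attribute vector, both directions present}, adj: {node: neighbor list}."""
    d = len(next(iter(x.values())))
    zero = np.zeros(d)
    f1 = lambda a, b, c: np.tanh(a + b + c)   # stand-in for neural network f1
    f2 = lambda a, b: np.tanh(a + b)          # stand-in for neural network f2
    # t = 1: edge messages built from raw attribute features, formula (1)
    m = {(i, j): f1(x[i], x[j],
                    sum((e[(j, k)] for k in adj[j] if k != i), zero))
         for i in adj for j in adj[i]}
    # node codes aggregate incoming edge messages, formula (2)
    h = {i: f2(x[i], sum((m[(i, j)] for j in adj[i]), zero)) for i in adj}
    for _ in range(T - 1):                    # t > 1: formulas (3) and (4)
        m = {(i, j): f1(h[i], h[j],
                        sum((m[(j, k)] for k in adj[j] if k != i), zero))
             for i in adj for j in adj[i]}
        h = {i: f2(h[i], sum((m[(i, j)] for j in adj[i]), zero)) for i in adj}
    return h                                  # final coding features x_i^G

# A three-atom chain 0-1-2 with 2-dimensional attribute features.
x = {0: np.ones(2), 1: np.ones(2), 2: np.ones(2)}
adj = {0: [1], 1: [0, 2], 2: [1]}
e = {(0, 1): np.ones(2), (1, 0): np.ones(2),
     (1, 2): np.ones(2), (2, 1): np.ones(2)}
h = encode_graph(x, e, adj)
print(sorted(h))   # one coding feature per first node
```

Note that, as in formula (3), the messages of layer t are built from the node and edge codes of layer t-1, so each additional layer widens the neighborhood a node's coding feature can see.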
In practical implementation, the same coding method as that used for the first molecular graph can be adopted to encode the first molecular tree, the second molecular graph and the second molecular tree, so as to obtain the first tree node features, the second graph node features and the second tree node features, respectively.
Step 303: The first graph node features are matched with the second graph node features through the alignment layer to obtain a first similarity feature of the first molecular graph and the second molecular graph, and the first tree node features are matched with the second tree node features to obtain a second similarity feature of the first molecular tree and the second molecular tree.
Here, the structural difference between different molecules mainly depends on the kinds of atoms in the molecules and the connection modes between the atoms. By matching the first graph node features with the second graph node features and matching the first tree node features with the second tree node features, the structural similarity between the base molecule and the target molecule is obtained, so that the structural information of the molecules can be better exploited.
In some embodiments, when the first graph node features include the coding features of at least two first nodes in the first molecular graph and the second graph node features include the coding features of at least two third nodes in the second molecular graph, the first similarity feature between the first molecular graph and the second molecular graph may be obtained by: based on the first graph node features and the second graph node features, acquiring a first similarity from each first node in the first molecular graph to the at least two third nodes in the second molecular graph, and a second similarity from each third node in the second molecular graph to the at least two first nodes in the first molecular graph; and according to the first similarity and the second similarity, aggregating the coding features in the first graph node features and the second graph node features to obtain the first similarity feature of the first molecular graph and the second molecular graph.
In practical implementation, for any first node in the first molecular diagram and any third node in the second molecular diagram, the two-way similarity between the two nodes needs to be calculated.
Illustratively, for the ith first node in the first molecular graph, its coding feature is denoted as $x_i^G$; for the qth third node in the second molecular graph, its coding feature is denoted as $y_q^G$. First, the first similarity from the ith first node in the first molecular graph to the qth third node in the second molecular graph is obtained according to formula (5), here written as a normalized Gaussian kernel over the node coding features:

$$w_{iq} = \frac{\exp\big(-\|x_i^G - y_q^G\|^2 / \sigma^2\big)}{\sum_{q'=1}^{n_{G'}} \exp\big(-\|x_i^G - y_{q'}^G\|^2 / \sigma^2\big)} \tag{5}$$

wherein $w_{iq}$ is the first similarity from the ith first node in the first molecular graph to the qth third node in the second molecular graph, $\sigma$ represents a standard deviation, and $n_{G'}$ is the number of third nodes in the second molecular graph;

then, the second similarity from the qth third node in the second molecular graph to the ith first node in the first molecular graph is obtained according to formula (6):

$$w_{qi} = \frac{\exp\big(-\|y_q^G - x_i^G\|^2 / \sigma^2\big)}{\sum_{i'=1}^{n_G} \exp\big(-\|y_q^G - x_{i'}^G\|^2 / \sigma^2\big)} \tag{6}$$

wherein $w_{qi}$ is the second similarity from the qth third node in the second molecular graph to the ith first node in the first molecular graph, and $\sigma$ represents a standard deviation.
By the above method, the bidirectional similarity between each first node in the first molecular graph and the at least two third nodes in the second molecular graph can be obtained; after the bidirectional similarity is obtained, the coding features of the first nodes in the first molecular graph and of the third nodes in the second molecular graph are aggregated according to the bidirectional similarity to obtain the first similarity feature of the first molecular graph and the second molecular graph.
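The bidirectional matching can be sketched numerically as follows; the normalized Gaussian-kernel form is an assumption consistent with the standard deviation σ mentioned in the text, not a verbatim reproduction of the patent's formulas, and the names and toy inputs are illustrative:

```python
import numpy as np

def bidirectional_similarity(X, Y, sigma=1.0):
    """X: (nG, d) first-node codes, Y: (nG', d) third-node codes.
    Returns w_iq (each first node over the third nodes) and w_qi (the reverse)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared distances
    k = np.exp(-d2 / sigma ** 2)                          # Gaussian kernel
    w_iq = k / k.sum(axis=1, keepdims=True)               # normalize over q
    w_qi = (k / k.sum(axis=0, keepdims=True)).T           # normalize over i
    return w_iq, w_qi

X = np.array([[0.0, 0.0], [1.0, 0.0]])              # two first nodes
Y = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0]])  # three third nodes
w_iq, w_qi = bidirectional_similarity(X, Y)
print(w_iq.shape, w_qi.shape)   # (2, 3) (3, 2)
```

Because each direction is normalized over a different node set, $w_{iq}$ and $w_{qi}$ generally differ, which is what makes the similarity bidirectional.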
Accordingly, when the first tree node features include the coding features of at least two second nodes in the first molecular tree, and the second tree node features include the coding features of at least two fourth nodes in the second molecular tree, the second similarity feature between the first molecular tree and the second molecular tree can be obtained by: based on the first tree node features and the second tree node features, acquiring a third similarity from each second node in the first molecular tree to the at least two fourth nodes in the second molecular tree, and a fourth similarity from each fourth node in the second molecular tree to the at least two second nodes in the first molecular tree; and according to the third similarity and the fourth similarity, aggregating the coding features in the first tree node features and the second tree node features to obtain the second similarity feature of the first molecular tree and the second molecular tree.
In practical implementation, for each second node in the first molecular tree and each fourth node in the second molecular tree, the bidirectional similarity between the two nodes needs to be calculated. Here, the same method as that used for calculating the bidirectional similarity between the nodes in the first molecular graph and the nodes in the second molecular graph may be employed to calculate the bidirectional similarity between the nodes in the first molecular tree and the nodes in the second molecular tree, and further obtain the second similarity feature of the first molecular tree and the second molecular tree.
In some embodiments, the first similarity feature of the first molecular graph and the second molecular graph may be obtained by:

According to the first similarity, the coding features of the at least two first nodes in the first graph node features and the coding features of the at least two third nodes in the second graph node features are aggregated to obtain the similarity feature from the first molecular graph to the second molecular graph; according to the second similarity, the coding features of the at least two first nodes in the first graph node features and the coding features of the at least two third nodes in the second graph node features are aggregated to obtain the similarity feature from the second molecular graph to the first molecular graph; and the similarity feature from the first molecular graph to the second molecular graph and the similarity feature from the second molecular graph to the first molecular graph are spliced to obtain the first similarity feature of the first molecular graph and the second molecular graph.
Here, aggregation refers to aggregating the coding features of the at least two first nodes in the first graph node features and the coding features of the at least two third nodes in the second graph node features into one feature; in actual implementation, the aggregation of features can be realized by summation, averaging, or similar operations.
In practical implementation, assuming the obtained similarity feature from the first molecular graph to the second molecular graph is $h^{G \to G'}$ and the similarity feature from the second molecular graph to the first molecular graph is $h^{G' \to G}$, the resulting first similarity feature may be expressed as $h^{G} = \big[h^{G \to G'};\ h^{G' \to G}\big]$, where $[\cdot\,;\,\cdot]$ denotes splicing.
Accordingly, the second similarity feature of the first molecular tree and the second molecular tree may be obtained by:

According to the third similarity, the coding features of the at least two second nodes in the first tree node features and the coding features of the at least two fourth nodes in the second tree node features are aggregated to obtain the similarity feature from the first molecular tree to the second molecular tree; according to the fourth similarity, the coding features of the at least two second nodes in the first tree node features and the coding features of the at least two fourth nodes in the second tree node features are aggregated to obtain the similarity feature from the second molecular tree to the first molecular tree; and the similarity feature from the first molecular tree to the second molecular tree and the similarity feature from the second molecular tree to the first molecular tree are spliced to obtain the second similarity feature of the first molecular tree and the second molecular tree.
In actual implementation, assuming the obtained similarity feature from the first molecular tree to the second molecular tree is $h^{T \to T'}$ and the similarity feature from the second molecular tree to the first molecular tree is $h^{T' \to T}$, the resulting second similarity feature may be expressed as $h^{T} = \big[h^{T \to T'};\ h^{T' \to T}\big]$.
In some embodiments, the similarity feature from the first molecular graph to the second molecular graph may be obtained by:
According to the first similarity from each first node in the first molecular graph to the at least two third nodes in the second molecular graph, weighted summation is respectively performed on the coding features of the at least two third nodes in the second molecular graph to obtain the first aggregation feature corresponding to each first node in the first molecular graph; the coding feature of each first node in the first molecular graph is spliced with its corresponding first aggregation feature to obtain the first splicing feature corresponding to each first node in the first molecular graph; and the first splicing features corresponding to the first nodes in the first molecular graph are summed to obtain the similarity feature from the first molecular graph to the second molecular graph.
In actual implementation, for each first node in the first molecular graph, the following operation may be performed: weighted summation is performed on the coding features of the third nodes in the second molecular graph according to the first similarity from the first node to each third node in the second molecular graph, where the weight corresponding to the coding feature of a third node in the second molecular graph is the first similarity from the first node in the first molecular graph to that third node in the second molecular graph.
Exemplarily, when the first similarity from the ith first node in the first molecular graph to the qth third node in the second molecular graph is $w_{iq}$, the weight corresponding to the coding feature of the qth third node in the second molecular graph is $w_{iq}$. The first aggregation feature corresponding to the ith first node in the first molecular graph can then be obtained as $\hat{x}_i = \sum_{q=1}^{n_{G'}} w_{iq}\, y_q^G$, that is, the coding features of all the third nodes in the second molecular graph are weighted and summed according to the corresponding weights.

The above operation is performed on each first node in the first molecular graph to obtain the first aggregation feature corresponding to each first node; after the first aggregation features corresponding to all the first nodes in the first molecular graph are obtained, the coding feature of each first node in the first molecular graph is spliced with its corresponding first aggregation feature. For example, for the ith first node in the first molecular graph, its coding feature $x_i^G$ and its corresponding first aggregation feature $\hat{x}_i$ are spliced to obtain the first splicing feature $\big[x_i^G;\ \hat{x}_i\big]$.
That is, for each first node in the first score graph, a corresponding first stitching feature is obtained, and then all the obtained first stitching features are aggregated by summing, so as to obtain similarity features from the first score graph to the second score graph.
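The aggregation just described (weighted sum per first node, splice, then sum) can be sketched in a few lines; the function name and the numpy representation are illustrative, not part of the patent:

```python
import numpy as np

def graph_similarity_feature(h_x, h_y, w):
    """Similarity feature from a first graph to a second graph (a sketch).

    h_x: (n_x, d) coding features of the first graph's nodes.
    h_y: (n_y, d) coding features of the second graph's nodes.
    w:   (n_x, n_y) first similarities w[i, q] from node i to node q.
    For each node i, the second graph's node features are weight-summed
    by w[i, :] (the first aggregation feature), spliced onto h_x[i], and
    the spliced rows are summed into one vector.
    """
    agg = w @ h_y                                  # (n_x, d) aggregation
    spliced = np.concatenate([h_x, agg], axis=1)   # (n_x, 2d) splicing
    return spliced.sum(axis=0)                     # (2d,) summed feature
```

The reverse-direction feature is obtained by swapping the roles of the two graphs and transposing the similarity matrix.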
In practical implementation, the similarity feature from the second molecular graph to the first molecular graph, the similarity feature from the first molecular tree to the second molecular tree, and the similarity feature from the second molecular tree to the first molecular tree can be acquired in a similar manner.
In some embodiments, the similarity feature from the first molecular graph to the second molecular graph may be obtained through equation (7):

$$m_{G_x \to G_y} = g\left(\left\{\, h_i^{G_x} \oplus \sum_{q} w_{iq}\, h_q^{G_y} \,\right\}_i\right) \tag{7}$$

where $g$ represents the aggregation (here, summation over all first nodes), $\oplus$ represents splicing, $h_i^{G_x}$ is the coding feature of the $i$-th first node in the first molecular graph, $w_{iq}$ is the first similarity from the $i$-th first node in the first molecular graph to the $q$-th third node in the second molecular graph, and $h_q^{G_y}$ is the coding feature of the $q$-th third node in the second molecular graph.

Correspondingly, the similarity feature from the second molecular graph to the first molecular graph is obtained through equation (8):

$$m_{G_y \to G_x} = g\left(\left\{\, h_q^{G_y} \oplus \sum_{i} w_{qi}\, h_i^{G_x} \,\right\}_q\right) \tag{8}$$

where $w_{qi}$ is the second similarity from the $q$-th third node in the second molecular graph to the $i$-th first node in the first molecular graph.

Similarly, the similarity feature from the first molecular tree to the second molecular tree is obtained through equation (9):

$$m_{T_x \to T_y} = g\left(\left\{\, h_i^{T_x} \oplus \sum_{q} w^{T}_{iq}\, h_q^{T_y} \,\right\}_i\right) \tag{9}$$

where $h_i^{T_x}$ is the coding feature of the $i$-th second node in the first molecular tree, $w^{T}_{iq}$ is the third similarity from the $i$-th second node in the first molecular tree to the $q$-th fourth node in the second molecular tree, and $h_q^{T_y}$ is the coding feature of the $q$-th fourth node in the second molecular tree.

Correspondingly, the similarity feature from the second molecular tree to the first molecular tree is obtained through equation (10):

$$m_{T_y \to T_x} = g\left(\left\{\, h_q^{T_y} \oplus \sum_{i} w^{T}_{qi}\, h_i^{T_x} \,\right\}_q\right) \tag{10}$$

where $w^{T}_{qi}$ is the fourth similarity from the $q$-th fourth node in the second molecular tree to the $i$-th second node in the first molecular tree.
Step 304: and generating graph node characteristics of the predicted molecules according to the first similarity characteristics and the first graph node characteristics through the generation layer, and generating tree node characteristics of the predicted molecules according to the second similarity characteristics and the first tree node characteristics.
In practical implementation, in order to preserve the structural similarity between the predicted molecule and the base molecule, the first similarity feature and the first graph node feature are combined at the graph structure level to generate the graph node feature of the predicted molecule; and at the tree structure level, the second similarity feature and the first tree node feature are combined to generate the tree node feature of the predicted molecule.
In some embodiments, graph node features of the predicted molecules may be generated by: based on the first similarity characteristics, acquiring a mean value and a variance corresponding to the first similarity characteristics; based on the mean and variance corresponding to the first similarity feature, obtaining Gaussian distribution corresponding to the first similarity feature; sampling from Gaussian distribution to obtain sampling characteristics; and splicing the sampling feature with the first graph node feature to obtain the graph node feature of the predicted molecule.
In practical implementation, the reparameterization (resampling) trick of the variational autoencoder may be used: it is assumed that there is a distribution specific to the first similarity feature, and further that this distribution is a Gaussian distribution determined by two sets of parameters, the mean and the variance; to determine the distribution specific to the first similarity feature, the mean and the variance need to be obtained, and here they may be obtained by fitting with a neural network.

Here, suppose the sampling feature is $z_G$; the sampling feature then follows a Gaussian distribution, i.e. $z_G \sim \mathcal{N}\big(\mu(m_G),\, \sigma(m_G)^2\big)$, where $m_G$ represents the first similarity feature and $\mu(\cdot)$, $\sigma(\cdot)$ are the fitted mean and standard deviation.

After the Gaussian distribution specific to the first similarity feature is obtained, the sampling feature is sampled from the obtained Gaussian distribution. The sampling feature includes a plurality of feature items, each feature item corresponding to one first node in the first molecular graph. When the sampling feature is spliced with the first graph node feature, specifically, each feature item in the sampling feature is spliced with the coding feature of the corresponding first node in the first graph node feature, so as to obtain the graph node feature of the predicted molecule. For example, when the first graph node feature is represented as $\{h_1^{G_x}, \dots, h_{n_G}^{G_x}\}$, the graph node feature of the predicted molecule can be expressed as $\{h_i^{G_x} \oplus z_{G,i}\}_{i=1}^{n_G}$.
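A minimal sketch of this resampling-and-splicing step, assuming the fitted mean and log-variance are already available as arrays (all names and shapes are illustrative):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
h_nodes = np.ones((5, 16))   # first graph node feature, one row per first node
mu = np.zeros((5, 4))        # fitted mean, one feature item per node
log_var = np.zeros((5, 4))   # fitted log-variance

z = reparameterize(mu, log_var, rng)               # sampling feature
pred_nodes = np.concatenate([h_nodes, z], axis=1)  # splice item by node
```

Sampling noise rather than the value itself keeps the draw differentiable with respect to the fitted mean and variance.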
Similarly, tree node features of the predicted molecule may be generated in the following manner: based on the second similarity feature, acquiring a mean and a variance corresponding to the second similarity feature; obtaining a Gaussian distribution corresponding to the second similarity feature based on the mean and the variance; sampling from the Gaussian distribution to obtain a sampling feature; and splicing the sampling feature with the first tree node feature to obtain the tree node feature of the predicted molecule.

Here, the manner of acquiring the Gaussian distribution corresponding to the second similarity feature is similar to that of acquiring the Gaussian distribution corresponding to the first similarity feature, and the manner in which the sampling feature is spliced with the first tree node feature is similar to the manner in which the sampling feature is spliced with the first graph node feature.

Here, suppose the sampling feature is $z_T$; the sampling feature then follows a Gaussian distribution, i.e. $z_T \sim \mathcal{N}\big(\mu(m_T),\, \sigma(m_T)^2\big)$, where $m_T$ represents the second similarity feature. When the first tree node feature is represented as $\{h_1^{T_x}, \dots, h_{n_T}^{T_x}\}$, the tree node feature of the predicted molecule can be expressed as $\{h_i^{T_x} \oplus z_{T,i}\}_{i=1}^{n_T}$.
Step 305: and respectively decoding the graph node characteristics and the tree node characteristics through a decoding layer to obtain predicted molecules represented by the molecular graph and the molecular tree.
In practical implementation, the graph node features and the tree node features may be decoded in a depth-first manner, respectively.
In some embodiments, the graph node features may be decoded as follows: processing the graph node features through a gated recurrent network to obtain information vectors passed between nodes; and for any decoded node, obtaining the probability of adding a new node based on the information vectors passed between the node and other nodes, and when the probability is higher than a probability threshold, determining the type of the new node according to the information vectors passed between the node and other nodes, so as to add the new node at that node according to its type.

In actual implementation, after the graph node features of the predicted molecule are obtained, the graph node features are input into the gated recurrent network, and the information vectors passed between any node and other nodes can be iterated through the gated recurrent network.
For example, for the $u$-th node in the molecular graph of the predicted molecule, the information vector passed from the $u$-th node to the $v$-th node in the molecular graph can be obtained through equation (11):

$$h_{uv} = \mathrm{GRU}\big(f_u^{G},\ \{h_{wu}\}_{w \in N(u)\setminus\{v\}}\big) \tag{11}$$

where $h_{uv}$ is the information vector passed between the $u$-th node and the $v$-th node in the molecular graph, $f_u^{G}$ represents the coding feature of the $u$-th node in the molecular graph, and $h_{wu}$ represents the information vector passed between the $w$-th node and the $u$-th node, the $w$-th node being a neighbor node of the $u$-th node other than the $v$-th node.
When the $t$-th iteration reaches the $u$-th node, the node is denoted $u_t$; the probability of adding a new node is then obtained according to equations (12) and (13):

$$z_{u_t} = \tau\big(W_1\, z_{u_{t-1}} + W_2\, h_{u_t}\big) \tag{12}$$

$$p_{u_t} = s\big(W_3\, z_{u_t}\big) \tag{13}$$

where $\tau(\cdot)$ represents the ReLU function, $s(\cdot)$ represents the sigmoid function, and $W_1$, $W_2$, $W_3$ are weights in the network; $z_{u_t}$ represents the output of the ReLU layer when the $t$-th iteration reaches the $u$-th node; $z_{u_{t-1}}$ represents the output of the ReLU layer when the $(t-1)$-th iteration reaches the $u$-th node; $h_{u_t}$ represents the information vector passed to the $u$-th node when the $t$-th iteration reaches it; and $p_{u_t}$ represents the probability of adding a new node when the $t$-th iteration reaches the $u$-th node.
Here, when the probability of adding the new node reaches the probability threshold, determining to add the new node; otherwise, the decoding is ended. Here, the probability threshold is preset, for example, may be set to 0.5.
When it is determined to add a new node, the type of the new node, i.e., the element to which it corresponds, needs to be determined. In actual implementation, the type of the new node can be determined by equation (14):

$$r_v = \mathrm{softmax}\big(U\, h_{uv}\big) \tag{14}$$

where $r_v$ indicates the type of the new node, $U$ is a weight in the network, and $h_{uv}$ is the information vector passed between the $u$-th node and the $v$-th node in the molecular graph.
After determining the new node, the current node may be extended, i.e., the new node is added as a neighbor node of the current node.
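The expansion decision of equations (12)-(14) can be sketched as below; the single-layer forms and the weight arrays are assumptions for illustration, not the patent's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expand_node(h_uv, w_expand, W_type, threshold=0.5):
    """Decide whether to attach a new node at the current node and,
    if so, pick its element type (hypothetical stand-in weights).

    Returns None when the add probability falls below the threshold
    (decoding stops at this node), otherwise the argmax type index.
    """
    p_add = sigmoid(float(w_expand @ np.maximum(0.0, h_uv)))
    if p_add < threshold:
        return None
    return int(np.argmax(softmax(W_type @ h_uv)))
```

The 0.5 threshold mirrors the preset probability threshold mentioned above.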
Step 306: and obtaining the difference between the predicted molecule and the target molecule, and updating the model parameters of the molecular generation model based on the difference.
Training aims to make the predicted molecule as similar as possible to the target molecule. In practical implementation, the value of the loss function can be determined according to the difference between the predicted molecule and the target molecule, and it is judged whether the value of the loss function exceeds a preset threshold; when it does, an error signal of the molecular generation model is determined based on the loss function, the error information is back-propagated through the model, and the model parameters of each layer are updated during propagation.

Here, back propagation is described: since there is an error between the predicted molecule output by the molecular generation model and the target molecule, this error is calculated and propagated backward from the output layer through the hidden layers until it reaches the input layer; during back propagation, the values of the model parameters are adjusted according to the error, and the above process is iterated until convergence.
In some embodiments, the model parameters of the molecular generation model may be updated as follows: obtaining the probability that the predicted molecule is the same as the target molecule; acquiring the information divergence between the posterior probability distribution of the sampling feature and the standard Gaussian distribution; determining the value of the variational loss function based on the probability and the information divergence; acquiring the center representation of the base molecule, the center representation of the predicted molecule, and the center representation of the target molecule; determining the value of the latent loss function based on these center representations; summing the value of the variational loss function and the value of the latent loss function to obtain the value of the loss function of the molecular generation model; and updating the model parameters of the molecular generation model based on the value of the loss function.

Here, the loss function is divided into two parts, namely the variational loss function and the latent loss function. The variational loss function constrains the probability that the predicted molecule is the same as the target molecule and the information divergence between the posterior probability distribution of the sampling feature and the standard Gaussian distribution; that is, the probability that the predicted molecule is the same as the target molecule should be as high as possible, and the posterior probability distribution of the sampling feature should be as close as possible to the standard Gaussian distribution.
When the posterior probability distribution of the sampling feature is consistent with the standard gaussian distribution, the sampling feature may be directly sampled from the standard gaussian distribution in the application.
In some embodiments, a variational loss function as shown in equation (15) may be employed:

$$L(\theta, \phi) = -\,\mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] + \mathrm{KL}\big(q_\phi(z \mid x, y)\,\big\|\,p(z \mid x)\big) \tag{15}$$

where $L(\theta, \phi)$ represents the variational loss function and $\mathbb{E}$ represents the mathematical expectation; $z$ comprises the sampling feature for the first similarity feature and the sampling feature for the second similarity feature, $x$ represents the base molecule, and $y$ represents the target molecule; $p_\theta(y \mid x, z)$ represents the probability that the generated predicted molecule is identical to the target molecule under the conditions of $x$ and $z$; $q_\phi(z \mid x, y)$ represents the posterior probability distribution of $z$, i.e., the probability of $z$ obtained under the conditions of $x$ and $y$; $p(z \mid x)$ represents the standard Gaussian distribution; and $\mathrm{KL}$ represents the information divergence.
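For diagonal Gaussians the KL term has a closed form, so a single-sample estimate of this objective is easy to compute; this sketch assumes the distribution parameters are given as mean and log-variance arrays:

```python
import numpy as np

def kl_diag_gauss(mu_q, log_var_q, mu_p, log_var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) for diagonal Gaussians."""
    var_q, var_p = np.exp(log_var_q), np.exp(log_var_p)
    return 0.5 * float(np.sum(
        log_var_p - log_var_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def variational_loss(log_p_y_given_xz, mu_q, log_var_q, mu_p, log_var_p):
    """Single-sample estimate of -E[log p(y|x,z)] + KL(q(z|x,y) || p(z|x))."""
    return -log_p_y_given_xz + kl_diag_gauss(mu_q, log_var_q, mu_p, log_var_p)
```

When the posterior equals the prior the KL term vanishes and only the reconstruction term remains.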
In practical implementation, the latent loss function constrains the distances between the centers of the base molecule, the target molecule, and the predicted molecule.

In some embodiments, a latent loss function as shown in equation (16) may be employed:

$$L_{latent} = \max\big(0,\ \lVert G_x - G_{\hat y}\rVert - \lVert G_x - G_y\rVert\big) + \max\big(0,\ \lVert G_x - G_y\rVert - \gamma\big) + \max\big(0,\ \beta - \lVert G_x - G_{y'}\rVert\big) \tag{16}$$

where $x$ is the base molecule, $y$ is the target molecule, $\hat y$ is the predicted molecule, and $y'$ is a randomly selected molecule with low similarity to the molecular structure of $x$; correspondingly, $G_x$ denotes the center of $x$, $G_{\hat y}$ the center of $\hat y$, $G_y$ the center of $y$, and $G_{y'}$ the center of $y'$; $\gamma$ and $\beta$ are hyperparameters.

Through equation (16), the center distance between the base molecule and the predicted molecule is constrained to be smaller than the center distance between the base molecule and the target molecule; the center distance between the base molecule and the target molecule to be smaller than $\gamma$; and the center distance between the base molecule and the molecule with low similarity to be larger than $\beta$.
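The three center-distance constraints just described can be written as hinge penalties; the Euclidean norm and the exact hinge form here are assumptions consistent with those constraints:

```python
import numpy as np

def latent_loss(g_x, g_pred, g_y, g_neg, gamma, beta):
    """Hinge penalties on center distances (a sketch).

    g_x, g_pred, g_y, g_neg: center representations of the base,
    predicted, target, and dissimilar molecules respectively.
    """
    d = lambda a, b: float(np.linalg.norm(a - b))
    return (max(0.0, d(g_x, g_pred) - d(g_x, g_y))   # predicted closer than target
            + max(0.0, d(g_x, g_y) - gamma)          # target within gamma
            + max(0.0, beta - d(g_x, g_neg)))        # dissimilar beyond beta
```

The loss is zero exactly when all three constraints are satisfied.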
In some embodiments, when the first graph node feature includes the coding features of at least two first nodes in the first molecular graph and the first tree node feature includes the coding features of at least two second nodes in the first molecular tree, the center representation of the base molecule may be obtained as follows:

acquiring a first average value of the coding features of the at least two first nodes in the first molecular graph and a second average value of the coding features of the at least two second nodes in the first molecular tree; and splicing the first average value and the second average value to obtain the center representation of the base molecule.

In actual implementation, when the first graph node feature is represented as $\{h_1^{G_x}, \dots, h_{n_G}^{G_x}\}$ and the first tree node feature as $\{h_1^{T_x}, \dots, h_{n_T}^{T_x}\}$, the center representation of the base molecule can be obtained by equation (17):

$$G_x = \frac{1}{n_G}\sum_{i=1}^{n_G} h_i^{G_x} \ \oplus\ \frac{1}{n_T}\sum_{i=1}^{n_T} h_i^{T_x} \tag{17}$$

Accordingly, when the second graph node feature is represented as $\{h_1^{G_y}, \dots, h_{n'_G}^{G_y}\}$ and the second tree node feature as $\{h_1^{T_y}, \dots, h_{n'_T}^{T_y}\}$, the center representation of the target molecule can be obtained by equation (18):

$$G_y = \frac{1}{n'_G}\sum_{i=1}^{n'_G} h_i^{G_y} \ \oplus\ \frac{1}{n'_T}\sum_{i=1}^{n'_T} h_i^{T_y} \tag{18}$$
the center representation of the predicted molecule and the center representation of the molecule having a low degree of molecular structural similarity to the base molecule can be obtained in the same manner as described above.
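Computing a center representation as described (mean of graph node encodings spliced with mean of tree node encodings) is a one-liner; the function name is illustrative:

```python
import numpy as np

def center_representation(h_graph, h_tree):
    """Center of a molecule: the mean of its graph node encodings
    spliced with the mean of its tree node encodings.

    h_graph: (n_G, d_G) graph node features; h_tree: (n_T, d_T) tree
    node features.  Returns a (d_G + d_T,) vector.
    """
    return np.concatenate([h_graph.mean(axis=0), h_tree.mean(axis=0)])
```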
After the training is performed to obtain the molecular generation model, the generation of the molecules can be realized based on the molecular generation model obtained by the training. Fig. 6 is a schematic flow chart of a molecular generation method based on a molecular generation model according to an embodiment of the present application, referring to fig. 6, the molecular generation method based on a molecular generation model according to an embodiment of the present application includes:
step 601: the server obtains the base molecule.
Here, the base molecule is represented by a first molecular graph and a first molecular tree. In practical implementation, a user can input the base molecule through the terminal; after the input is completed, the terminal automatically acquires the base molecule and sends a molecule generation request corresponding to the base molecule to the server, so that the server obtains the base molecule and the corresponding molecule generation request.
In practical application, the molecular generation method based on the molecular generation model provided by the embodiment of the application can be applied to drug discovery, molecular optimization, molecular generation and the like in the fields of structural biology and medicine. For example, a user may input a known basic drug molecule to generate a new drug molecule via a molecular generation model, wherein the molecular properties of the new drug molecule are higher than the basic drug molecule.
Step 602: and encoding the basic molecules in the sample through the encoding layer to obtain first graph node characteristics of the first molecular graph and first tree node characteristics of the first molecular graph.
In actual implementation, the first molecular graph and the first molecular tree are input into a coding layer, and the coding layer is used for coding the first molecular graph and the first molecular tree to obtain the first graph node characteristic and the first tree node characteristic of the first molecular graph.
In some embodiments, encoding at least two first nodes in the first molecular graph through a graph encoding network in the encoding layer to obtain encoding features of the at least two first nodes, and taking the encoding features of the at least two first nodes as first graph node features; and coding at least two second nodes in the first sub-tree through a tree coding network in the coding layer to obtain coding features of the at least two second nodes, and taking the coding features of the at least two second nodes as first tree node features.
Step 603: generating, by the generation layer, graph node features of the predicted molecules based on the first graph node features, and generating tree node features of the predicted molecules based on the first tree node features.
In actual implementation, sampling features corresponding to the node features of the first graph and sampling features corresponding to the node features of the first tree are obtained from standard Gaussian distribution; splicing the sampling features corresponding to the first graph node features with the first graph node features to obtain graph node features of the predicted molecules; and splicing the sampling features corresponding to the first tree node features with the first tree node features to obtain the tree node features of the predicted molecules.
Step 604: and respectively decoding the graph node characteristics and the tree node characteristics through a decoding layer to obtain predicted molecules.
The molecular generation model is obtained by training based on the training method of the molecular generation model.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
Fig. 7 is a flow chart of a training method of a molecular generation model according to an embodiment of the present application, and in the following, with reference to fig. 7, the training method of a molecular generation model according to an embodiment of the present application will be described, and referring to fig. 7, the training method of a molecular generation model according to an embodiment of the present application includes:
Step 701: constructing a molecular pair sample.
Here, a molecular pair sample is screened from the dataset, including a base molecule x and a target molecule y, where the molecular property of the base molecule x is lower than that of the target molecule y, and the structural similarity between the base molecule x and the target molecule y is greater than a similarity threshold. That the molecular property of the base molecule x is lower than that of the target molecule y means that the target molecule shows a significant improvement over the base molecule in the desired property.
In practical implementation, the basic molecule x and the target molecule y are represented by a graph structure and a tree structure, that is, the basic molecule is represented by a first molecular diagram and a first molecular tree, and the target molecule is represented by a second molecular diagram and a second molecular tree.
In practical application, for any first node $i$ in the first molecular graph, its attribute feature may be expressed as $f_i$; for any two adjacent first nodes $i$ and $j$ in the first molecular graph, the attribute feature of the edge connecting nodes $i$ and $j$ may be expressed as $f_{ij}$.
Accordingly, the nodes and edges in the first sub-tree, the second sub-tree, and the second sub-tree may also be represented in a similar manner.
Step 702: the molecules encode the sample.
In practical implementation, the basic molecule and the target molecule are respectively input into corresponding coding networks, wherein each coding network comprises two sub-coding networks: graph coding networks and tree coding networks. Here, the graph coding network is used for processing the data of the graph structure; the tree coding network is used for processing the data of the tree structure.
Here, encoding the first molecular graph through the graph coding network is described as an example. In practical application, the first molecular graph includes at least two first nodes, and the graph coding network has at least two layers, so that the coding features of the first nodes output by the $t$-th layer of the graph coding network can be determined as follows.

When $t = 1$, the edge coding feature of the edge connecting the first node $i$ and the first node $j$ obtained through the layer-$t$ network is acquired according to the formula

$$h_{ij}^{(1)} = f_1\Big(f_i,\ f_j,\ \sum_{k \in N(j)} f_{jk}\Big)$$

where $h_{ij}^{(1)}$ is the edge coding feature of the edge connecting the first node $i$ and the first node $j$ obtained through the layer-$1$ network, the first node $j$ being a neighbor node of the first node $i$; $f_1$ represents a neural network; $f_i$ is the attribute feature of the first node $i$; $f_j$ is the attribute feature of the first node $j$; and $f_{jk}$ is the attribute feature of the edge between the first node $j$ and the first node $k$, the first node $k$ being a neighbor node of the first node $j$.

Then, the coding feature of the first node obtained through the layer-$t$ network is acquired according to the formula

$$h_i^{(1)} = f_2\Big(f_i,\ \sum_{j \in N(i)} h_{ij}^{(1)}\Big)$$

where $f_2$ represents a neural network and $h_i^{(1)}$ represents the coding feature of the first node $i$ output by the layer-$1$ network.

When $t > 1$, the edge coding feature of the edge connecting the first node $i$ and the first node $j$ obtained through the layer-$t$ network is acquired according to the formula

$$h_{ij}^{(t)} = f_1\Big(h_i^{(t-1)},\ h_j^{(t-1)},\ \sum_{k \in N(j)} h_{jk}^{(t-1)}\Big)$$

where $h_{ij}^{(t)}$ is the edge coding feature of the edge connecting the first node $i$ and the first node $j$ obtained through the layer-$t$ network, the first node $j$ being a neighbor node of the first node $i$; $f_1$ represents a neural network; $h_i^{(t-1)}$ is the coding feature of the first node $i$ obtained through the layer-$(t-1)$ network; $h_j^{(t-1)}$ is the coding feature of the first node $j$ obtained through the layer-$(t-1)$ network; and $h_{jk}^{(t-1)}$ is the coding feature of the edge between the first node $j$ and the first node $k$ obtained through the layer-$(t-1)$ network, the first node $k$ being a neighbor node of the first node $j$.

The coding feature of the first node obtained through the layer-$t$ network is then acquired according to the formula

$$h_i^{(t)} = f_2\Big(h_i^{(t-1)},\ \sum_{j \in N(i)} h_{ij}^{(t)}\Big)$$

where $f_2$ represents a neural network, $h_i^{(t)}$ represents the coding feature of the first node $i$ output by the layer-$t$ network, and $h_i^{(t-1)}$ is the coding feature of the first node $i$ obtained through the layer-$(t-1)$ network.

Here, the coding feature of the first node $i$ obtained through the last layer may be denoted as $h_i^{G_x}$; the first graph node feature is then $\{h_1^{G_x}, \dots, h_{n_G}^{G_x}\}$, where $n_G$ represents the number of first nodes in the first molecular graph.

In practical implementation, the first molecular tree, the second molecular graph, and the second molecular tree can be encoded in a similar manner; correspondingly, the first tree node feature $\{h_1^{T_x}, \dots, h_{n_T}^{T_x}\}$, the second graph node feature $\{h_1^{G_y}, \dots\}$, and the second tree node feature $\{h_1^{T_y}, \dots\}$ can be obtained.
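A toy version of such a layered encoder, with the edge-feature handling omitted and dense weight matrices standing in for the networks f1/f2 (both simplifications are assumptions for illustration):

```python
import numpy as np

def encode_graph(X, A, weights):
    """Layered message-passing encoder sketch.

    X: (n, d) node attribute features; A: (n, n) symmetric 0/1 adjacency
    matrix; weights: one (d, d) matrix per layer.  Each layer sums the
    neighbour encodings into each node and applies a nonlinear
    transform; the last layer's output is the graph node feature matrix.
    """
    H = X
    for W in weights:
        H = np.tanh((H + A @ H) @ W)   # combine self and neighbour messages
    return H
```

Stacking layers widens each node's receptive field by one hop per layer, which is why the patent requires at least two layers.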
Step 703: and obtaining the similarity characteristics between two molecules in the molecular pair sample through a node alignment strategy.
First, the molecular centers are aligned.
Here, the center representation $G_x$ of the base molecule x is acquired according to equation (17), and the center representation $G_y$ of the target molecule y is acquired according to equation (18).

In order to constrain the center of the resulting predicted molecule, a molecule y' with a low degree of molecular structural similarity to the base molecule is selected from the dataset. Here, the center representation of the predicted molecule $\hat y$ and the center representation of the molecule y' can be obtained in the same manner as described above.

The distances between the centers of the base molecule x, the target molecule y, the predicted molecule $\hat y$, and the molecule y' are constrained using a contrastive strategy:

$$L_{latent} = \max\big(0,\ \lVert G_x - G_{\hat y}\rVert - \lVert G_x - G_y\rVert\big) + \max\big(0,\ \lVert G_x - G_y\rVert - \gamma\big) + \max\big(0,\ \beta - \lVert G_x - G_{y'}\rVert\big)$$

where $G_x$ denotes the center of x, $G_{\hat y}$ the center of $\hat y$, $G_y$ the center of y, and $G_{y'}$ the center of y'; $\gamma$ and $\beta$ are hyperparameters.
Then, the node alignment of the graph structure level and the tree structure level is performed.
Here, for any first node in the first molecular graph and any third node in the second molecular graph, the bidirectional similarity between the two nodes needs to be calculated.

For example, for the $i$-th first node in the first molecular graph and the $q$-th third node in the second molecular graph, the first similarity $w_{iq}$ from the $i$-th first node in the first molecular graph to the $q$-th third node in the second molecular graph is acquired, and the second similarity $w_{qi}$ from the $q$-th third node in the second molecular graph to the $i$-th first node in the first molecular graph is acquired.

In this way, the bidirectional similarity between each first node in the first molecular graph and the at least two third nodes in the second molecular graph can be obtained; after the bidirectional similarities are obtained, the coding features of the nodes in the first and second molecular graphs are aggregated according to the bidirectional similarities, so as to obtain the first similarity feature of the first and second molecular graphs.

Specifically, the similarity feature $m_{G_x \to G_y}$ from the first molecular graph to the second molecular graph is obtained according to equation (7), and the similarity feature $m_{G_y \to G_x}$ from the second molecular graph to the first molecular graph is obtained according to equation (8); $m_{G_x \to G_y}$ and $m_{G_y \to G_x}$ are then spliced to obtain the first similarity feature $m_G = m_{G_x \to G_y} \oplus m_{G_y \to G_x}$ of the first and second molecular graphs.

Accordingly, in the same manner, the similarity feature $m_{T_x \to T_y}$ from the first molecular tree to the second molecular tree is obtained according to equation (9), and the similarity feature $m_{T_y \to T_x}$ from the second molecular tree to the first molecular tree is obtained according to equation (10); they are then spliced to obtain the second similarity feature $m_T = m_{T_x \to T_y} \oplus m_{T_y \to T_x}$.
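The section does not reproduce the similarity function itself; one common choice, shown purely as an assumption, is dot-product attention normalized separately in each direction:

```python
import numpy as np

def softmax_rows(s):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def bidirectional_similarity(h_x, h_y):
    """Directed node-alignment weights between two graphs.

    h_x: (n_x, d), h_y: (n_y, d).  Returns w_xy (n_x, n_y) whose rows
    sum to 1 over second-graph nodes, and w_yx (n_y, n_x) whose rows
    sum to 1 over first-graph nodes.
    """
    scores = h_x @ h_y.T
    return softmax_rows(scores), softmax_rows(scores.T)
```

Because each direction is normalized over a different axis, w_xy and w_yx are genuinely different matrices, which is what makes the similarity bidirectional.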
Step 704: and sampling the similarity characteristics to obtain sampling characteristics, and generating graph node characteristics and tree node characteristics of the predicted molecules.
Here, the similarity features are sampled in the latent space using the reparameterization trick of the variational autoencoder to obtain the sampling features, namely:

$$z_G \sim \mathcal{N}\big(\mu(m_G),\, \sigma(m_G)^2\big), \qquad z_T \sim \mathcal{N}\big(\mu(m_T),\, \sigma(m_T)^2\big)$$

Then, the sampling feature $z_G$ is spliced with the first graph node feature to obtain the graph node feature of the predicted molecule $\{h_i^{G_x} \oplus z_{G,i}\}$, and the sampling feature $z_T$ is spliced with the first tree node feature to obtain the tree node feature of the predicted molecule $\{h_i^{T_x} \oplus z_{T,i}\}$.
When the sampling feature is spliced with the first graph node feature, specifically, each feature item in the sampling feature is respectively spliced with the coding feature of each first node in the first graph node feature.
Step 705: and decoding the graph node characteristics and the tree node characteristics of the predicted molecules to obtain the predicted molecules.
Here, decoding of the graph node feature is described as an example.
For the $u$-th node in the molecular graph of the predicted molecule, the information vector passed from the $u$-th node to the $v$-th node in the molecular graph can be obtained through

$$h_{uv} = \mathrm{GRU}\big(f_u^{G},\ \{h_{wu}\}_{w \in N(u)\setminus\{v\}}\big)$$

where $h_{uv}$ is the information vector passed between the $u$-th node and the $v$-th node in the molecular graph, $f_u^{G}$ represents the coding feature of the $u$-th node in the molecular graph, and $h_{wu}$ represents the information vector passed between the $w$-th node and the $u$-th node, the $w$-th node being a neighbor node of the $u$-th node other than the $v$-th node.

When the $t$-th iteration reaches the $u$-th node, the node is denoted $u_t$; the probability of adding a new node is then obtained according to

$$z_{u_t} = \tau\big(W_1\, z_{u_{t-1}} + W_2\, h_{u_t}\big), \qquad p_{u_t} = s\big(W_3\, z_{u_t}\big)$$

where $\tau(\cdot)$ represents the ReLU function, $s(\cdot)$ represents the sigmoid function, and $W_1$, $W_2$, $W_3$ are weights in the network; $z_{u_t}$ represents the output of the ReLU layer when the $t$-th iteration reaches the $u$-th node; $z_{u_{t-1}}$ represents the output of the ReLU layer when the $(t-1)$-th iteration reaches the $u$-th node; $h_{u_t}$ represents the information vector passed to the $u$-th node when the $t$-th iteration reaches it; and $p_{u_t}$ represents the probability of adding a new node when the $t$-th iteration reaches the $u$-th node.
Here, when the probability of adding the new node reaches the probability threshold, determining to add the new node; otherwise, the decoding is ended. Here, the probability threshold is preset, for example, may be set to 0.5.
When it is determined to add a new node, the type of the new node needs to be determined, specifically by

$$r_v = \mathrm{softmax}\big(U\, h_{uv}\big)$$

where $r_v$ indicates the type of the new node, $U$ is a weight in the network, and $h_{uv}$ is the information vector passed between the $u$-th node and the $v$-th node in the molecular graph.
Step 706: and calculating the value of the loss function based on the predicted molecule and the target molecule, and updating the model parameters according to the value of the loss function.
After the predicted molecule is obtained, it is expected to be as similar as possible to the target molecule in the original molecule pair; the evidence lower bound of the conditional variational autoencoder can be used to obtain the variational loss function:

$$L(\theta, \phi) = -\,\mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] + \mathrm{KL}\big(q_\phi(z \mid x, y)\,\big\|\,p(z \mid x)\big)$$

where $L(\theta, \phi)$ represents the variational loss function and $\mathbb{E}$ represents the mathematical expectation; $z$ represents the sampling feature, $x$ represents the base molecule, and $y$ represents the target molecule; $p_\theta(y \mid x, z)$ represents the probability that the generated predicted molecule is identical to the target molecule under the conditions of $x$ and $z$; $q_\phi(z \mid x, y)$ represents the posterior probability distribution of $z$; and $p(z \mid x)$ represents the standard Gaussian distribution.
In practical application, the loss function of the whole molecular generation model is L = L(θ, φ) + L_latent. The value of the loss function can be calculated from the predicted molecule and the target molecule, and the model parameters are then updated according to this value until convergence, at which point the trained molecular generation model is obtained.
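The total loss above can be sketched with scalar placeholders. This is a minimal illustration assuming a diagonal Gaussian posterior parameterized by (mu, logvar); the reconstruction term and the hidden (latent) loss are passed in as already-computed values, since their exact forms are defined by the model.

```python
import math

def kl_std_normal(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions:
    # 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2) per dimension.
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def total_loss(recon_nll, mu, logvar, latent_loss):
    # L = L(theta, phi) + L_latent: the variational term is the reconstruction
    # negative log-likelihood plus the KL divergence to the standard Gaussian prior.
    return recon_nll + kl_std_normal(mu, logvar) + latent_loss
```

A posterior equal to the prior (mu = 0, logvar = 0) contributes zero KL, leaving only the reconstruction and latent terms.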
After the molecular generation model is obtained through training, molecules can be generated based on it. Fig. 8 is a schematic flow chart of a molecular generation method based on a molecular generation model according to an embodiment of the present application; referring to fig. 8, the method includes:
step 801: obtaining a base molecule.
In practical implementation, a user can input a base molecule through the terminal; after the input is completed, the terminal acquires the base molecule and sends a corresponding molecule generation request to the server, so that the server obtains the base molecule and the corresponding request.
Here, the base molecule is represented by a first molecular diagram and a first molecular tree.
Step 802: the base molecule is encoded.
In practical implementation, the process of encoding the base molecule is the same as the encoding process used during training, and yields the first graph node features and the first tree node features.
Step 803: sampling features from the standard Gaussian distribution to generate graph node features and tree node features of the predicted molecules.
Here, two sampling features are obtained by sampling from two standard Gaussian distributions, respectively. One sampling feature is spliced with the first graph node features to obtain the graph node features of the predicted molecule, and the other sampling feature is spliced with the first tree node features to obtain the tree node features of the predicted molecule.
Step 804: and decoding the graph node characteristics and the tree node characteristics of the predicted molecules to obtain the predicted molecules.
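Steps 801-804 can be summarized in a short sketch. This is illustrative only: `decoder` stands in for the trained decoding layer, feature vectors are plain lists, and splicing is list concatenation.

```python
import random

def generate(graph_feats, tree_feats, z_dim, decoder, rng=random):
    """Generation at inference time (steps 801-804, simplified).

    graph_feats / tree_feats : encoded features of the base molecule
    z_dim                    : dimensionality of each sampled feature
    decoder                  : callable standing in for the decoding layer
    """
    # Step 803: sample two features from standard Gaussian distributions.
    z_g = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    z_t = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]
    # Splice each sample with the corresponding encoded feature.
    pred_graph_feats = graph_feats + z_g
    pred_tree_feats = tree_feats + z_t
    # Step 804: decode the spliced features into the predicted molecule.
    return decoder(pred_graph_feats, pred_tree_feats)
```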
The embodiment of the application has the following beneficial effects:
1. It overcomes the drawback of conventional molecule generation methods, in which it is difficult to maintain structural similarity between the new molecule and the original molecule when generating new molecules;
2. It has broad application prospects: new drug molecules with better molecular properties can be generated from existing drug molecules, and the molecular properties can be improved to a greater extent than with molecular generation models in the related art.
Continuing with the description below of the exemplary structure of the training device 455 for molecular generation model according to the embodiment of the present application, fig. 9 is a schematic structural diagram of the training device for molecular generation model according to the embodiment of the present application, and referring to fig. 9, the molecular generation model includes an encoding layer, an alignment layer, a generation layer, and a decoding layer, where the software modules of the training device 455 for molecular generation model may include:
A first acquisition module 4551 for acquiring a molecular pair sample comprising a base molecule and a target molecule;
wherein the molecular property of the target molecule is higher than that of the base molecule, the base molecule is represented by a first molecular diagram and a first molecular tree, and the target molecule is represented by a second molecular diagram and a second molecular tree;
the first encoding module 4552 is configured to encode, through the encoding layer, the base molecule in the molecular pair sample to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree, and to encode the target molecule in the molecular pair sample to obtain second graph node features of the second molecular graph and second tree node features of the second molecular tree;
an alignment module 4553, configured to match, through the alignment layer, the first graph node features with the second graph node features to obtain a first similarity feature of the first molecular graph and the second molecular graph, and to match the first tree node features with the second tree node features to obtain a second similarity feature of the first molecular tree and the second molecular tree;
a first generating module 4554 configured to generate, by the generating layer, graph node features of a predicted molecule according to the first similarity feature and the first graph node feature, and generate tree node features of a predicted molecule according to the second similarity feature and the first tree node feature;
A first decoding module 4555 configured to decode, by the decoding layer, the graph node feature and the tree node feature, respectively, to obtain the predicted molecules represented by a molecular graph and a molecular tree;
and an updating module 4556 configured to obtain a difference between the predicted molecule and the target molecule, and update a model parameter of the molecular generation model based on the difference.
In some embodiments, the first encoding module 4552 is further configured to encode, through a graph encoding network in an encoding layer, at least two first nodes in the first molecular graph, to obtain encoding features of the at least two first nodes, and use the encoding features of the at least two first nodes as the first graph node features; wherein the first molecular diagram corresponds to a molecular structure topology of the base molecule, and the at least two first nodes correspond to constituent elements constituting the base molecule;
coding at least two second nodes in the first molecular tree through a tree coding network in the coding layer to obtain coding features of the at least two second nodes, and taking the coding features of the at least two second nodes as the first tree node features; wherein the first molecular tree is constructed based on the molecular structure of the base molecule with constituent elements of the base molecule as second nodes.
In some embodiments, the first encoding module 4552 is further configured to, for each first node in the first molecular graph, perform the following operations:
acquiring edge coding features of at least two edges connected with the first node when the at least two edges connected with the first node exist;
summing the edge coding features of the at least two edges to obtain a first edge aggregation feature;
and generating node coding features of the nodes based on the attribute features of the first nodes and the first edge aggregation features.
In some embodiments, the first encoding module 4552 is further configured to, for each of at least two edges connecting the first node, perform the following operations:
when the edge is an edge connecting the first node and a neighbor node, acquiring attribute characteristics of at least two edges connecting the neighbor node;
summing the attribute characteristics of at least two edges connected with the neighbor nodes to obtain a second edge aggregation characteristic;
and generating edge coding features of the edges based on the attribute features of the first node, the attribute features of the neighbor nodes and the second edge aggregation features.
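The two-level aggregation described above (second edge aggregation features feeding edge coding features, which in turn feed node coding features) can be sketched with scalar features. This is an illustrative simplification: real features are vectors and the combination functions are learned, whereas here combination is plain addition.

```python
def encode_nodes(node_attrs, edge_attrs, adj):
    """Graph-encoding sketch with scalar features.

    node_attrs : {node: attribute feature (float)}
    edge_attrs : {(u, v): attribute feature (float)}, both directions present
    adj        : {node: list of neighbor nodes}
    """
    # Edge coding feature of (u, v): combines u's attributes, neighbor v's
    # attributes, and the second edge aggregation feature (sum of attributes
    # of the edges connecting the neighbor v).
    edge_code = {}
    for (u, v), _ in edge_attrs.items():
        second_agg = sum(edge_attrs[(v, w)] for w in adj[v])
        edge_code[(u, v)] = node_attrs[u] + node_attrs[v] + second_agg
    # Node coding feature of u: its attribute feature plus the first edge
    # aggregation feature (sum of edge coding features of its edges).
    return {u: node_attrs[u] + sum(edge_code[(u, v)] for v in adj[u])
            for u in adj}
```

For a single bond between two atoms with attributes 1.0 and 2.0 and edge attribute 0.5 in each direction, both edge coding features are 3.5 and the node coding features are 4.5 and 5.5.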
In some embodiments, the first graph node features include coding features of at least two first nodes in a first molecular graph, and the second graph node features include coding features of at least two third nodes in a second molecular graph;
The alignment module 4553 is further configured to obtain, based on the first graph node features and the second graph node features, a first similarity from each first node in the first molecular graph to at least two third nodes in the second molecular graph, and a second similarity from each third node in the second molecular graph to at least two first nodes in the first molecular graph;
and to aggregate, according to the first similarity and the second similarity, the coding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features, to obtain the first similarity feature of the first molecular graph and the second molecular graph.
In some embodiments, the alignment module 4553 is further configured to aggregate, according to the first similarity, the coding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features, to obtain a similarity feature from the first molecular graph to the second molecular graph;
to aggregate, according to the second similarity, the coding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features, to obtain a similarity feature from the second molecular graph to the first molecular graph;
and to splice the similarity feature from the first molecular graph to the second molecular graph with the similarity feature from the second molecular graph to the first molecular graph, to obtain the first similarity feature of the first molecular graph and the second molecular graph.
In some embodiments, the alignment module 4553 is further configured to perform, according to the first similarity from each first node in the first molecular graph to at least two third nodes in the second molecular graph, a weighted summation of the coding features of the at least two third nodes in the second molecular graph, to obtain a first aggregation feature corresponding to each first node in the first molecular graph;
to splice the coding feature of each first node in the first molecular graph with its corresponding first aggregation feature, to obtain a first splicing feature corresponding to each first node in the first molecular graph;
and to sum the first splicing features corresponding to the first nodes in the first molecular graph, to obtain the similarity feature from the first molecular graph to the second molecular graph.
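The weighted summation, splicing, and summing steps above amount to an attention-style alignment between the two graphs. A minimal sketch, assuming dot-product similarities normalized with a softmax (the normalization choice is an assumption; the patent only specifies a similarity-weighted sum):

```python
import math

def align(feats_a, feats_b):
    """Similarity feature from graph A to graph B.

    feats_a, feats_b : lists of node feature vectors (lists of floats).
    """
    def dot(x, y):
        return sum(p * q for p, q in zip(x, y))

    spliced = []
    for fa in feats_a:
        # First similarities from this node to every node of B, softmax-normalized.
        sims = [math.exp(dot(fa, fb)) for fb in feats_b]
        total = sum(sims)
        weights = [s / total for s in sims]
        # First aggregation feature: similarity-weighted sum of B's coding features.
        agg = [sum(w * fb[d] for w, fb in zip(weights, feats_b))
               for d in range(len(feats_b[0]))]
        # First splicing feature: the node's own feature concatenated with it.
        spliced.append(fa + agg)
    # Sum the splicing features over A's nodes -> A-to-B similarity feature.
    return [sum(vec[d] for vec in spliced) for d in range(len(spliced[0]))]
```

The B-to-A direction is obtained by swapping the arguments; splicing the two results gives the first similarity feature of the pair.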
In some embodiments, the first generating module 4554 is further configured to obtain, based on the first similarity feature, a mean and a variance corresponding to the first similarity feature;
based on the mean value and the variance corresponding to the first similarity feature, obtaining Gaussian distribution corresponding to the first similarity feature;
Sampling from the Gaussian distribution to obtain sampling characteristics;
and splicing the sampling feature with the first graph node feature to obtain the graph node feature of the predicted molecule.
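In practice, sampling from the Gaussian parameterized by the mean and variance is typically implemented with the reparameterization trick so that the sampling step stays differentiable during training. A minimal sketch (names hypothetical; features are plain lists and splicing is concatenation):

```python
import math

def reparameterize(mu, logvar, eps):
    # z = mu + sigma * eps, with eps ~ N(0, I) supplied by the caller,
    # so gradients can flow through mu and logvar.
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]

def graph_node_feature(first_graph_feat, mu, logvar, eps):
    # Sample from the Gaussian corresponding to the first similarity feature,
    # then splice the sample with the first graph node feature.
    return first_graph_feat + reparameterize(mu, logvar, eps)
```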
In some embodiments, the updating module 4556 is further configured to obtain a probability that the predicted molecule is the same as the target molecule;
acquiring information divergence between posterior probability distribution and standard Gaussian distribution of the sampling characteristics;
determining a value of a variation loss function based on the probability and the information divergence;
acquiring a center representation of a base molecule, a center representation of a predicted molecule and a center representation of a target molecule;
determining a value of a concealment loss function based on the central representation of the base molecule, the central representation of the predicted molecule, and the central representation of the target molecule;
summing the value of the variation loss function and the value of the hidden loss function to obtain the value of the loss function of the molecular generation model;
updating model parameters of the molecular generation model based on the value of the loss function of the molecular generation model.
In some embodiments, the first graph node features include coding features of at least two first nodes in the first molecular graph, and the first tree node features include coding features of at least two second nodes in the first molecular tree;
the updating module 4556 is further configured to obtain a first average value of the coding features of the at least two first nodes in the first molecular graph and a second average value of the coding features of the at least two second nodes in the first molecular tree;
and splicing the first average value and the second average value to obtain the center representation of the basic molecule.
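The center representation described above can be sketched directly. This is an illustrative sketch: averaging is element-wise over the node feature vectors and splicing is concatenation of the two averages.

```python
def center_representation(graph_node_feats, tree_node_feats):
    """Center representation of a molecule.

    graph_node_feats : coding features of the graph nodes (lists of floats)
    tree_node_feats  : coding features of the tree nodes (lists of floats)
    """
    def mean(vectors):
        # Element-wise average over the node feature vectors.
        n = len(vectors)
        return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]
    # Splice the graph-side average with the tree-side average.
    return mean(graph_node_feats) + mean(tree_node_feats)
```

The same routine, applied to the features of the base, predicted, and target molecules, yields the three center representations used by the hidden loss.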
In some embodiments, the first decoding module 4555 is further configured to process the graph node features through a gated recurrent network to obtain information vectors transferred between nodes;
for any decoded node, based on the information vector transferred between the node and other nodes, obtaining the probability of adding a new node, and
and when the probability is determined to be higher than a probability threshold, determining the type of a new node according to the information vector transmitted between the node and other nodes, so as to add the new node at the node according to the type of the new node.
A molecular generation device based on a molecular generation model, the molecular generation model comprising: an encoding layer, a generating layer, and a decoding layer, the apparatus comprising:
a second acquisition module for acquiring a base molecule, the base molecule being represented by a first molecular graph and a first molecular tree;
The second coding module is used for coding the base molecule through the coding layer to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree;
the second generation module is used for generating graph node characteristics of the predicted molecules based on the first graph node characteristics through the generation layer and generating tree node characteristics of the predicted molecules according to standard Gaussian distribution and the first tree node characteristics;
the second decoding module is used for respectively decoding the graph node characteristics and the tree node characteristics through the decoding layer to obtain the prediction molecules;
the molecular generation model is obtained by training based on the training method of the molecular generation model provided by the embodiment of the application.
Embodiments of the present application provide a computer readable storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform a method provided by embodiments of the present application, for example, as shown in fig. 3.
In some embodiments, the computer readable storage medium may be an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM, or may be any of various devices including one of or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (12)

1. A method of training a molecular generation model, the molecular generation model comprising a coding layer, an alignment layer, a generation layer, and a decoding layer, the method comprising:
obtaining a molecular pair sample containing basic molecules and target molecules, wherein the molecular pair sample is a biological molecular pair sample or a drug molecular pair sample;
wherein the molecular properties of the target molecule are better than those of the base molecule, the base molecule is represented by a first molecular diagram and a first molecular tree, and the target molecule is represented by a second molecular diagram and a second molecular tree;
encoding at least two first nodes in the first molecular graph through a graph encoding network in the encoding layer to obtain encoding features of the at least two first nodes, and taking the encoding features of the at least two first nodes as first graph node features, wherein the first molecular graph corresponds to a molecular structure topology of the base molecule, and the at least two first nodes correspond to constituent elements constituting the base molecule; encoding at least two second nodes in the first molecular tree through a tree encoding network in the encoding layer to obtain encoding features of the at least two second nodes, and taking the encoding features of the at least two second nodes as first tree node features, wherein the first molecular tree is constructed based on the molecular structure of the base molecule with constituent elements of the base molecule as second nodes; and encoding the target molecule in the molecular pair sample to obtain second graph node features of the second molecular graph and second tree node features of the second molecular tree, wherein the second graph node features comprise encoding features of at least two third nodes in the second molecular graph;
acquiring, through the alignment layer and based on the first graph node features and the second graph node features, a first similarity from each first node in the first molecular graph to at least two third nodes in the second molecular graph and a second similarity from each third node in the second molecular graph to at least two first nodes in the first molecular graph; aggregating, according to the first similarity and the second similarity, the encoding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features to obtain a first similarity feature of the first molecular graph and the second molecular graph; and matching the first tree node features with the second tree node features to obtain a second similarity feature of the first molecular tree and the second molecular tree;
acquiring a mean value and a variance corresponding to the first similarity feature based on the first similarity feature through the generation layer, acquiring a Gaussian distribution corresponding to the first similarity feature based on the mean value and the variance corresponding to the first similarity feature, sampling from the Gaussian distribution to obtain a sampling feature, splicing the sampling feature and the first graph node feature to obtain a graph node feature of a predicted molecule, and generating a tree node feature of the predicted molecule according to the second similarity feature and the first tree node feature;
Decoding the graph node characteristics and the tree node characteristics through the decoding layer respectively to obtain the predicted molecules represented by a molecular graph and a molecular tree;
and obtaining the difference between the predicted molecule and the target molecule, and updating the model parameters of the molecular generation model based on the difference.
2. The method of claim 1, wherein the encoding at least two first nodes in the first molecular diagram comprises:
for each first node in the first molecular graph, performing the following operations:
acquiring edge coding features of at least two edges connected with the first node when the at least two edges connected with the first node exist;
summing the edge coding features of the at least two edges to obtain a first edge aggregation feature;
and generating node coding features of the first node based on the attribute features of the first node and the first edge aggregation features.
3. The method of claim 2, wherein the obtaining edge coding features of at least two edges connecting the first node comprises:
for each of at least two edges connecting the first node, performing the following:
When the edge is an edge connecting the first node and a neighbor node and at least two edges connecting the neighbor node exist, acquiring attribute characteristics of the at least two edges connecting the neighbor node;
summing the attribute characteristics of at least two edges connected with the neighbor nodes to obtain a second edge aggregation characteristic;
and generating edge coding features of the edges based on the attribute features of the first node, the attribute features of the neighbor nodes and the second edge aggregation features.
4. The method of claim 1, wherein the aggregating, according to the first similarity and the second similarity, the encoding features of at least two first nodes in the first graph node features and of at least two third nodes in the second graph node features to obtain the first similarity feature of the first molecular graph and the second molecular graph comprises:
aggregating, according to the first similarity, the encoding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features to obtain a similarity feature from the first molecular graph to the second molecular graph;
aggregating, according to the second similarity, the encoding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features to obtain a similarity feature from the second molecular graph to the first molecular graph;
and splicing the similarity feature from the first molecular graph to the second molecular graph with the similarity feature from the second molecular graph to the first molecular graph to obtain the first similarity feature of the first molecular graph and the second molecular graph.
5. The method of claim 4, wherein the aggregating, according to the first similarity, the encoding features of at least two first nodes in the first graph node features and of at least two third nodes in the second graph node features comprises:
performing, according to the first similarity from each first node in the first molecular graph to the at least two third nodes in the second molecular graph, a weighted summation of the encoding features of the at least two third nodes in the second molecular graph to obtain a first aggregation feature corresponding to each first node in the first molecular graph;
splicing the encoding feature of each first node in the first molecular graph with the corresponding first aggregation feature to obtain a first splicing feature corresponding to each first node in the first molecular graph;
and summing the first splicing features corresponding to the first nodes in the first molecular graph to obtain the similarity feature from the first molecular graph to the second molecular graph.
6. The method of claim 1, wherein updating model parameters of the molecular generation model based on the differences comprises:
based on the difference, obtaining the probability that the predicted molecule is the same as the target molecule;
acquiring information divergence between posterior probability distribution and standard Gaussian distribution of the sampling characteristics;
determining a value of a variation loss function based on the probability and the information divergence;
acquiring a center representation of a base molecule, a center representation of a predicted molecule and a center representation of a target molecule;
determining a value of a concealment loss function based on the central representation of the base molecule, the central representation of the predicted molecule, and the central representation of the target molecule;
summing the value of the variation loss function and the value of the hidden loss function to obtain the value of the loss function of the molecular generation model;
updating model parameters of the molecular generation model based on the value of the loss function of the molecular generation model.
7. The method of claim 6, wherein,
the first graph node features comprise encoding features of at least two first nodes in the first molecular graph, and the first tree node features comprise encoding features of at least two second nodes in the first molecular tree;
The obtaining a central representation of the base molecule comprises:
acquiring a first average value of the encoding features of the at least two first nodes in the first molecular graph and a second average value of the encoding features of the at least two second nodes in the first molecular tree;
and splicing the first average value and the second average value to obtain the center representation of the basic molecule.
8. The method of claim 1, wherein the decoding the graph node feature comprises:
processing the graph node features through a gated recurrent network to obtain information vectors transferred between nodes;
for any decoded node, based on the information vector transferred between the node and other nodes, obtaining the probability of adding a new node, and
and when the probability is determined to be higher than a probability threshold, determining the type of a new node according to the information vector transmitted between the node and other nodes, so as to add the new node at the node according to the type of the new node.
9. A molecular generation method based on a molecular generation model, wherein the molecular generation model comprises: an encoding layer, a generating layer, and a decoding layer, the method comprising:
Obtaining basic molecules, wherein the basic molecules are biological molecules or drug molecules and are represented by a first molecular graph and a first molecular tree;
encoding the base molecule through the encoding layer to obtain first graph node features of the first molecular graph and first tree node features of the first molecular tree;
generating graph node characteristics of the predicted molecules based on the first graph node characteristics through the generation layer, and generating tree node characteristics of the predicted molecules according to the first tree node characteristics;
decoding the graph node characteristics and the tree node characteristics through the decoding layer respectively to obtain the predicted molecules;
wherein the molecular generation model is trained based on the training method of any one of claims 1-8.
10. A training device for a molecular generation model, the molecular generation model comprising an encoding layer, an alignment layer, a generation layer, and a decoding layer, the device comprising:
the first acquisition module is used for acquiring a molecular pair sample containing basic molecules and target molecules, wherein the molecular pair sample is a biological molecular pair sample or a drug molecular pair sample;
wherein the molecular properties of the target molecule are better than those of the base molecule, the base molecule is represented by a first molecular diagram and a first molecular tree, and the target molecule is represented by a second molecular diagram and a second molecular tree;
The first coding module is used for coding at least two first nodes in the first molecular graph through a graph coding network in the coding layer to obtain coding features of the at least two first nodes, and taking the coding features of the at least two first nodes as first graph node features, wherein the first molecular graph corresponds to a molecular structure topology of the base molecule, and the at least two first nodes correspond to constituent elements constituting the base molecule; coding at least two second nodes in the first molecular tree through a tree coding network in the coding layer to obtain coding features of the at least two second nodes, and taking the coding features of the at least two second nodes as first tree node features, wherein the first molecular tree is constructed based on the molecular structure of the base molecule with constituent elements of the base molecule as second nodes; and coding the target molecule in the molecular pair sample to obtain second graph node features of the second molecular graph and second tree node features of the second molecular tree, wherein the second graph node features comprise coding features of at least two third nodes in the second molecular graph;
The alignment module is used for acquiring, through the alignment layer and based on the first graph node features and the second graph node features, a first similarity from each first node in the first molecular graph to at least two third nodes in the second molecular graph and a second similarity from each third node in the second molecular graph to at least two first nodes in the first molecular graph; aggregating, according to the first similarity and the second similarity, the coding features of the at least two first nodes in the first graph node features and of the at least two third nodes in the second graph node features to obtain a first similarity feature of the first molecular graph and the second molecular graph; and matching the first tree node features with the second tree node features to obtain a second similarity feature of the first molecular tree and the second molecular tree;
the first generation module is used for acquiring a mean value and a variance corresponding to the first similarity feature based on the first similarity feature through the generation layer, acquiring Gaussian distribution corresponding to the first similarity feature based on the mean value and the variance corresponding to the first similarity feature, sampling from the Gaussian distribution to obtain sampling features, splicing the sampling features with the first graph node feature to obtain graph node features of a predicted molecule, and generating tree node features of the predicted molecule according to the second similarity feature and the first tree node feature;
The first decoding module is used for respectively decoding the graph node characteristics and the tree node characteristics through the decoding layer to obtain the prediction molecules represented by a molecular graph and a molecular tree;
and the updating module is used for acquiring the difference between the predicted molecule and the target molecule and updating the model parameters of the molecular generation model based on the difference.
11. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing a training method of a molecular generation model according to any one of claims 1 to 8 when executing executable instructions stored in said memory.
12. A computer readable storage medium storing executable instructions for implementing a method of training a molecular generation model according to any one of claims 1 to 8 when executed by a processor.
CN202010546027.6A 2020-06-16 2020-06-16 Training method, device, equipment and storage medium of molecular generation model Active CN111695702B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010546027.6A CN111695702B (en) 2020-06-16 2020-06-16 Training method, device, equipment and storage medium of molecular generation model

Publications (2)

Publication Number Publication Date
CN111695702A CN111695702A (en) 2020-09-22
CN111695702B (en) 2023-11-03

Family

ID=72481135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010546027.6A Active CN111695702B (en) 2020-06-16 2020-06-16 Training method, device, equipment and storage medium of molecular generation model

Country Status (1)

Country Link
CN (1) CN111695702B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220189578A1 (en) * 2020-12-14 2022-06-16 International Business Machines Corporation Interpretable molecular generative models
CN112509644B (en) * 2020-12-18 2024-09-20 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN112735540B (en) * 2020-12-18 2024-01-05 深圳先进技术研究院 Molecular optimization method, system, terminal equipment and readable storage medium
CN112530516B (en) * 2020-12-18 2023-12-26 深圳先进技术研究院 Metabolic pathway prediction method, system, terminal equipment and readable storage medium
CN112580789B (en) * 2021-02-22 2021-06-25 支付宝(杭州)信息技术有限公司 Training graph coding network, and method and device for predicting interaction event
CN113241114A (en) * 2021-03-24 2021-08-10 辽宁大学 LncRNA-protein interaction prediction method based on graph convolution neural network
CN113300968B (en) * 2021-04-14 2022-07-15 浙江工业大学 Method for determining node decision threshold in bidirectional molecular communication network based on network coding
CN113838541B (en) * 2021-09-29 2023-10-10 脸萌有限公司 Method and apparatus for designing ligand molecules
CN114566233A (en) * 2022-02-21 2022-05-31 北京百度网讯科技有限公司 Method, device, electronic device and storage medium for molecular screening
CN114937478B (en) * 2022-05-18 2023-03-10 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115171807B (en) * 2022-09-07 2022-12-06 合肥机数量子科技有限公司 Molecular coding model training method, molecular coding method and molecular coding system

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2019157228A1 (en) * 2018-02-09 2019-08-15 D-Wave Systems Inc. Systems and methods for training generative machine learning models
CN110459274A (en) * 2019-08-01 2019-11-15 南京邮电大学 A kind of small-molecule drug virtual screening method and its application based on depth migration study
CN110634539A (en) * 2019-09-12 2019-12-31 腾讯科技(深圳)有限公司 Artificial intelligence-based drug molecule processing method and device and storage medium
CN110993037A (en) * 2019-10-28 2020-04-10 浙江工业大学 Protein activity prediction device based on multi-view classification model
CN111063398A (en) * 2019-12-20 2020-04-24 吉林大学 Molecular discovery method based on graph Bayesian optimization
CN111243658A (en) * 2020-01-07 2020-06-05 西南大学 Biomolecular network construction and optimization method based on deep learning

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US11417415B2 (en) * 2018-08-10 2022-08-16 International Business Machines Corporation Molecular representation

Non-Patent Citations (1)

Title
Machine learning-based screening of anti-fibrotic traditional Chinese medicine compounds; Wang Xiting et al.; Journal of Beijing University of Chinese Medicine; Vol. 42, No. 1; full text *

Similar Documents

Publication Publication Date Title
CN111695702B (en) Training method, device, equipment and storage medium of molecular generation model
Li et al. Autonomous GIS: the next-generation AI-powered GIS
US20230123077A1 (en) Automated cloud data and technology solution delivery using machine learning and artificial intelligence modeling
KR102308002B1 (en) Method and apparatus for generating information
Skolik et al. Equivariant quantum circuits for learning on weighted graphs
US20200142957A1 (en) Learning property graph representations edge-by-edge
US20200257982A1 (en) Categorical feature encoding for property graphs by vertex proximity
Nodehi et al. Estimation of parameters in multivariate wrapped models for data on a p-torus
US20230297853A1 (en) System and method for the latent space optimization of generative machine learning models
Turgut et al. A framework proposal for machine learning-driven agent-based models through a case study analysis
Bakurov et al. General purpose optimization library (GPOL): a flexible and efficient multi-purpose optimization library in Python
US20220083907A1 (en) Data generation and annotation for machine learning
Guo et al. Graph neural networks: Graph transformation
Chen et al. Predicting drug-target interaction via self-supervised learning
Lee et al. River networks: An analysis of simulating algorithms and graph metrics used to quantify topology
US11947503B2 (en) Autoregressive graph generation machine learning models
Neuhäuser et al. Learning the effective order of a hypergraph dynamical system
US20230138367A1 (en) Generation of graphical user interface prototypes
CN114334040A (en) Molecular diagram reconstruction model training method and device and electronic equipment
Thompson et al. The contextual lasso: Sparse linear models via deep neural networks
Larsen et al. A simulated annealing algorithm for maximum common edge subgraph detection in biological networks
CN117542405A (en) Saccharide binding site-oriented prediction method, apparatus, device, and storage medium
CN117149982A (en) Question-answering processing method, device, equipment and storage medium based on artificial intelligence
Sosík et al. Morphogenetic and homeostatic self-assembled systems
US20220207362A1 (en) System and Method For Multi-Task Learning Through Spatial Variable Embeddings

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40028896
Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant