US20220165359A1 - Generating anti-infective design spaces for selecting drug candidates - Google Patents
- Publication number
- US20220165359A1 (application US17/319,839)
- Authority
- US
- United States
- Prior art keywords
- sequences
- updated
- candidate drug
- activities
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013461 design Methods 0.000 title claims abstract description 154
- 230000002924 anti-infective effect Effects 0.000 title claims description 8
- 229940000406 drug candidate Drugs 0.000 title description 16
- 239000003814 drug Substances 0.000 claims abstract description 458
- 229940079593 drug Drugs 0.000 claims abstract description 439
- 150000001875 compounds Chemical class 0.000 claims abstract description 400
- 238000000034 method Methods 0.000 claims abstract description 261
- 230000000694 effects Effects 0.000 claims abstract description 200
- 238000010801 machine learning Methods 0.000 claims abstract description 178
- 108090000765 processed proteins & peptides Proteins 0.000 claims abstract description 83
- 230000008569 process Effects 0.000 claims abstract description 40
- 238000012545 processing Methods 0.000 claims description 160
- 108090000623 proteins and genes Proteins 0.000 claims description 60
- 102000004169 proteins and genes Human genes 0.000 claims description 44
- 230000003993 interaction Effects 0.000 claims description 33
- 239000000126 substance Substances 0.000 claims description 28
- 230000015654 memory Effects 0.000 claims description 27
- 238000004458 analytical method Methods 0.000 claims description 26
- 230000000845 anti-microbial effect Effects 0.000 claims description 18
- 230000002519 immunomodulatory effect Effects 0.000 claims description 12
- 230000000840 anti-viral effect Effects 0.000 claims description 11
- 230000001093 anti-cancer Effects 0.000 claims description 9
- 230000000843 anti-fungal effect Effects 0.000 claims description 8
- 230000009467 reduction Effects 0.000 claims description 8
- 230000027455 binding Effects 0.000 claims description 7
- 230000001070 adhesive effect Effects 0.000 claims description 6
- 239000012620 biological material Substances 0.000 claims description 6
- 238000000354 decomposition reaction Methods 0.000 claims description 6
- 238000000513 principal component analysis Methods 0.000 claims description 6
- 239000000853 adhesive Substances 0.000 claims description 5
- 230000003110 anti-inflammatory effect Effects 0.000 claims description 5
- 230000003227 neuromodulating effect Effects 0.000 claims description 4
- 238000003786 synthesis reaction Methods 0.000 claims description 4
- 238000004220 aggregation Methods 0.000 claims description 3
- 230000001078 anti-cholinergic effect Effects 0.000 claims description 3
- 230000001090 anti-dopaminergic effect Effects 0.000 claims description 3
- 230000002436 anti-noradrenergic effect Effects 0.000 claims description 3
- 230000001705 anti-serotonergic effect Effects 0.000 claims description 3
- 239000011230 binding agent Substances 0.000 claims description 3
- 229920001222 biopolymer Polymers 0.000 claims description 3
- 239000002274 desiccant Substances 0.000 claims description 3
- 238000005538 encapsulation Methods 0.000 claims description 3
- 239000010408 film Substances 0.000 claims description 3
- 239000008394 flocculating agent Substances 0.000 claims description 3
- 239000003446 ligand Substances 0.000 claims description 3
- 239000011159 matrix material Substances 0.000 claims description 3
- 230000002263 peptidergic effect Effects 0.000 claims description 3
- 230000000144 pharmacologic effect Effects 0.000 claims description 3
- 230000012846 protein folding Effects 0.000 claims description 3
- 108020003175 receptors Proteins 0.000 claims description 3
- 102000005962 receptors Human genes 0.000 claims description 3
- 239000000565 sealant Substances 0.000 claims description 3
- 230000015572 biosynthetic process Effects 0.000 claims description 2
- 230000001766 physiological effect Effects 0.000 claims description 2
- 230000011664 signaling Effects 0.000 claims description 2
- 238000013473 artificial intelligence Methods 0.000 description 74
- 239000013598 vector Substances 0.000 description 66
- 230000006870 function Effects 0.000 description 65
- 235000018102 proteins Nutrition 0.000 description 42
- 238000012549 training Methods 0.000 description 42
- 230000001364 causal effect Effects 0.000 description 37
- 239000004615 ingredient Substances 0.000 description 36
- 238000013528 artificial neural network Methods 0.000 description 35
- 201000010099 disease Diseases 0.000 description 34
- 208000037265 diseases, disorders, signs and symptoms Diseases 0.000 description 34
- 230000037361 pathway Effects 0.000 description 22
- 235000001014 amino acid Nutrition 0.000 description 20
- 238000009826 distribution Methods 0.000 description 19
- 102000004196 processed proteins & peptides Human genes 0.000 description 19
- 208000037170 Delayed Emergence from Anesthesia Diseases 0.000 description 18
- 150000001413 amino acids Chemical class 0.000 description 18
- 230000000306 recurrent effect Effects 0.000 description 15
- 238000010586 diagram Methods 0.000 description 14
- 230000001225 therapeutic effect Effects 0.000 description 14
- 238000002474 experimental method Methods 0.000 description 12
- 230000036541 health Effects 0.000 description 12
- 238000012986 modification Methods 0.000 description 12
- 230000004048 modification Effects 0.000 description 12
- 230000008901 benefit Effects 0.000 description 10
- 238000004891 communication Methods 0.000 description 10
- 238000007796 conventional method Methods 0.000 description 10
- 238000005457 optimization Methods 0.000 description 10
- 230000000704 physical effect Effects 0.000 description 10
- 239000012634 fragment Substances 0.000 description 9
- 238000004519 manufacturing process Methods 0.000 description 9
- 238000005070 sampling Methods 0.000 description 9
- 238000012360 testing method Methods 0.000 description 9
- 230000009471 action Effects 0.000 description 8
- 239000004599 antimicrobial Substances 0.000 description 8
- 210000004027 cell Anatomy 0.000 description 8
- 238000003860 storage Methods 0.000 description 8
- 238000006243 chemical reaction Methods 0.000 description 7
- 208000024891 symptom Diseases 0.000 description 7
- 238000004422 calculation algorithm Methods 0.000 description 6
- 230000000875 corresponding effect Effects 0.000 description 6
- 230000007246 mechanism Effects 0.000 description 6
- 230000004044 response Effects 0.000 description 6
- 241000894006 Bacteria Species 0.000 description 5
- 102000014434 POLO box domains Human genes 0.000 description 5
- 108050003399 POLO box domains Proteins 0.000 description 5
- 230000004913 activation Effects 0.000 description 5
- 230000002555 anti-neurodegenerative effect Effects 0.000 description 5
- 230000008859 change Effects 0.000 description 5
- 230000002596 correlated effect Effects 0.000 description 5
- 230000001472 cytotoxic effect Effects 0.000 description 5
- 238000007876 drug discovery Methods 0.000 description 5
- 208000015181 infectious disease Diseases 0.000 description 5
- 239000003607 modifier Substances 0.000 description 5
- 238000011282 treatment Methods 0.000 description 5
- 108700026220 vif Genes Proteins 0.000 description 5
- 238000004617 QSAR study Methods 0.000 description 4
- 235000004279 alanine Nutrition 0.000 description 4
- 230000009286 beneficial effect Effects 0.000 description 4
- 230000010339 dilation Effects 0.000 description 4
- 238000010606 normalization Methods 0.000 description 4
- 238000005295 random walk Methods 0.000 description 4
- 239000007787 solid Substances 0.000 description 4
- 238000013519 translation Methods 0.000 description 4
- 108010015899 Glycopeptides Proteins 0.000 description 3
- 102000002068 Glycopeptides Human genes 0.000 description 3
- 206010019233 Headaches Diseases 0.000 description 3
- 241001465754 Metazoa Species 0.000 description 3
- 241000700605 Viruses Species 0.000 description 3
- 230000004931 aggregating effect Effects 0.000 description 3
- 238000013459 approach Methods 0.000 description 3
- 229960000074 biopharmaceutical Drugs 0.000 description 3
- 238000004364 calculation method Methods 0.000 description 3
- 239000004020 conductor Substances 0.000 description 3
- 238000013500 data storage Methods 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 3
- 231100000869 headache Toxicity 0.000 description 3
- 238000013537 high throughput screening Methods 0.000 description 3
- 238000000338 in vitro Methods 0.000 description 3
- 238000001727 in vivo Methods 0.000 description 3
- 230000000670 limiting effect Effects 0.000 description 3
- 150000002632 lipids Chemical class 0.000 description 3
- 230000003990 molecular pathway Effects 0.000 description 3
- 210000002569 neuron Anatomy 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 210000000056 organ Anatomy 0.000 description 3
- 230000036961 partial effect Effects 0.000 description 3
- 230000001105 regulatory effect Effects 0.000 description 3
- 238000009738 saturating Methods 0.000 description 3
- 238000004088 simulation Methods 0.000 description 3
- 238000010200 validation analysis Methods 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 238000012800 visualization Methods 0.000 description 3
- 108700042778 Antimicrobial Peptides Proteins 0.000 description 2
- 102000044503 Antimicrobial Peptides Human genes 0.000 description 2
- 206010060968 Arthritis infective Diseases 0.000 description 2
- BSYNRYMUTXBXSQ-UHFFFAOYSA-N Aspirin Chemical compound CC(=O)OC1=CC=CC=C1C(O)=O BSYNRYMUTXBXSQ-UHFFFAOYSA-N 0.000 description 2
- 238000007476 Maximum Likelihood Methods 0.000 description 2
- 206010028980 Neoplasm Diseases 0.000 description 2
- 229960001138 acetylsalicylic acid Drugs 0.000 description 2
- 150000001295 alanines Chemical class 0.000 description 2
- 125000003275 alpha amino acid group Chemical group 0.000 description 2
- 230000006399 behavior Effects 0.000 description 2
- 230000002457 bidirectional effect Effects 0.000 description 2
- 238000004166 bioassay Methods 0.000 description 2
- 230000008827 biological function Effects 0.000 description 2
- 201000011510 cancer Diseases 0.000 description 2
- 239000003795 chemical substances by application Substances 0.000 description 2
- color Substances 0.000 description 2
- 238000004590 computer program Methods 0.000 description 2
- 230000001143 conditioned effect Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 231100000433 cytotoxic Toxicity 0.000 description 2
- 230000003247 decreasing effect Effects 0.000 description 2
- 238000009510 drug design Methods 0.000 description 2
- 230000007613 environmental effect Effects 0.000 description 2
- 238000001914 filtration Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000002401 inhibitory effect Effects 0.000 description 2
- 238000003064 k means clustering Methods 0.000 description 2
- 238000004895 liquid chromatography mass spectrometry Methods 0.000 description 2
- 230000013011 mating Effects 0.000 description 2
- 230000002503 metabolic effect Effects 0.000 description 2
- 244000005700 microbiome Species 0.000 description 2
- 230000035772 mutation Effects 0.000 description 2
- 239000005445 natural material Substances 0.000 description 2
- 230000003285 pharmacodynamic effect Effects 0.000 description 2
- 239000004033 plastic Substances 0.000 description 2
- 229920003023 plastic Polymers 0.000 description 2
- 229920001184 polypeptide Polymers 0.000 description 2
- 230000002787 reinforcement Effects 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000012216 screening Methods 0.000 description 2
- 230000006403 short-term memory Effects 0.000 description 2
- 150000003384 small molecules Chemical class 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 238000012706 support-vector machine Methods 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 210000001519 tissue Anatomy 0.000 description 2
- 231100000027 toxicology Toxicity 0.000 description 2
- KZMAWJRXKGLWGS-UHFFFAOYSA-N 2-chloro-n-[4-(4-methoxyphenyl)-1,3-thiazol-2-yl]-n-(3-methoxypropyl)acetamide Chemical compound S1C(N(C(=O)CCl)CCCOC)=NC(C=2C=CC(OC)=CC=2)=C1 KZMAWJRXKGLWGS-UHFFFAOYSA-N 0.000 description 1
- 208000031295 Animal disease Diseases 0.000 description 1
- 108010053481 Antifreeze Proteins Proteins 0.000 description 1
- 208000023275 Autoimmune disease Diseases 0.000 description 1
- 208000031462 Bovine Mastitis Diseases 0.000 description 1
- 206010051548 Burn infection Diseases 0.000 description 1
- 101100153586 Caenorhabditis elegans top-1 gene Proteins 0.000 description 1
- 206010054212 Cardiac infection Diseases 0.000 description 1
- 206010007882 Cellulitis Diseases 0.000 description 1
- 241000193163 Clostridioides difficile Species 0.000 description 1
- 208000035473 Communicable disease Diseases 0.000 description 1
- 208000011231 Crohn disease Diseases 0.000 description 1
- 108010069514 Cyclic Peptides Proteins 0.000 description 1
- 102000001189 Cyclic Peptides Human genes 0.000 description 1
- 201000003883 Cystic fibrosis Diseases 0.000 description 1
- 201000004624 Dermatitis Diseases 0.000 description 1
- 108010016626 Dipeptides Proteins 0.000 description 1
- 206010059866 Drug resistance Diseases 0.000 description 1
- 108090000790 Enzymes Proteins 0.000 description 1
- 102000004190 Enzymes Human genes 0.000 description 1
- 102000010834 Extracellular Matrix Proteins Human genes 0.000 description 1
- 108010037362 Extracellular Matrix Proteins Proteins 0.000 description 1
- 241000233866 Fungi Species 0.000 description 1
- 208000022559 Inflammatory bowel disease Diseases 0.000 description 1
- 206010022678 Intestinal infections Diseases 0.000 description 1
- 208000036209 Intraabdominal Infections Diseases 0.000 description 1
- QNAYBMKLOCPYGJ-REOHCLBHSA-N L-alanine Chemical compound C[C@H](N)C(O)=O QNAYBMKLOCPYGJ-REOHCLBHSA-N 0.000 description 1
- 201000009906 Meningitis Diseases 0.000 description 1
- 101100370075 Mus musculus Top1 gene Proteins 0.000 description 1
- 108010025020 Nerve Growth Factor Proteins 0.000 description 1
- 102000007072 Nerve Growth Factors Human genes 0.000 description 1
- 206010051295 Neurological infection Diseases 0.000 description 1
- 206010033078 Otitis media Diseases 0.000 description 1
- 102000000470 PDZ domains Human genes 0.000 description 1
- 108050008994 PDZ domains Proteins 0.000 description 1
- 206010034668 Peritoneal infections Diseases 0.000 description 1
- 229920000037 Polyproline Polymers 0.000 description 1
- 102000029797 Prion Human genes 0.000 description 1
- 108091000054 Prion Proteins 0.000 description 1
- 206010057190 Respiratory tract infections Diseases 0.000 description 1
- 102000014400 SH2 domains Human genes 0.000 description 1
- 108050003452 SH2 domains Proteins 0.000 description 1
- 102000000395 SH3 domains Human genes 0.000 description 1
- 108050008861 SH3 domains Proteins 0.000 description 1
- 206010062255 Soft tissue infection Diseases 0.000 description 1
- 206010048669 Terminal state Diseases 0.000 description 1
- 206010048038 Wound infection Diseases 0.000 description 1
- 239000004480 active ingredient Substances 0.000 description 1
- 230000001464 adherent effect Effects 0.000 description 1
- 239000003242 anti bacterial agent Substances 0.000 description 1
- 230000003214 anti-biofilm Effects 0.000 description 1
- 230000003474 anti-emetic effect Effects 0.000 description 1
- 230000000118 anti-neoplastic effect Effects 0.000 description 1
- 230000000389 anti-prion effect Effects 0.000 description 1
- 229940088710 antibiotic agent Drugs 0.000 description 1
- 239000002111 antiemetic agent Substances 0.000 description 1
- 229940121375 antifungal agent Drugs 0.000 description 1
- 239000003443 antiviral agent Substances 0.000 description 1
- 229940121357 antivirals Drugs 0.000 description 1
- 208000006673 asthma Diseases 0.000 description 1
- 208000010668 atopic eczema Diseases 0.000 description 1
- 238000013476 bayesian approach Methods 0.000 description 1
- 230000004071 biological effect Effects 0.000 description 1
- 230000031018 biological processes and functions Effects 0.000 description 1
- 239000012677 causal agent Substances 0.000 description 1
- 210000003850 cellular structure Anatomy 0.000 description 1
- 210000000349 chromosome Anatomy 0.000 description 1
- 238000007621 cluster analysis Methods 0.000 description 1
- 230000001149 cognitive effect Effects 0.000 description 1
- 239000003086 colorant Substances 0.000 description 1
- 239000002537 cosmetic Substances 0.000 description 1
- 239000013078 crystal Substances 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000001079 digestive effect Effects 0.000 description 1
- 230000005750 disease progression Effects 0.000 description 1
- 238000004821 distillation Methods 0.000 description 1
- 238000005553 drilling Methods 0.000 description 1
- 230000008406 drug-drug interaction Effects 0.000 description 1
- 229940124645 emergency medicine Drugs 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 210000002744 extracellular matrix Anatomy 0.000 description 1
- 230000002068 genetic effect Effects 0.000 description 1
- 230000003054 hormonal effect Effects 0.000 description 1
- 210000000987 immune system Anatomy 0.000 description 1
- 208000027866 inflammatory disease Diseases 0.000 description 1
- 208000014674 injury Diseases 0.000 description 1
- 150000002617 leukotrienes Chemical class 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 230000003211 malignant effect Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000010534 mechanism of action Effects 0.000 description 1
- 238000002483 medication Methods 0.000 description 1
- 238000002844 melting Methods 0.000 description 1
- 230000008018 melting Effects 0.000 description 1
- XZWYZXLIPXDOLR-UHFFFAOYSA-N metformin Chemical compound CN(C)C(=N)NC(N)=N XZWYZXLIPXDOLR-UHFFFAOYSA-N 0.000 description 1
- 229960003105 metformin Drugs 0.000 description 1
- 230000003641 microbiacidal effect Effects 0.000 description 1
- 230000000116 mitigating effect Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 108091005601 modified peptides Proteins 0.000 description 1
- 230000004001 molecular interaction Effects 0.000 description 1
- 238000012544 monitoring process Methods 0.000 description 1
- 210000005155 neural progenitor cell Anatomy 0.000 description 1
- 230000000926 neurological effect Effects 0.000 description 1
- 230000000324 neuroprotective effect Effects 0.000 description 1
- 239000002858 neurotransmitter agent Substances 0.000 description 1
- 239000003305 oil spill Substances 0.000 description 1
- 230000000399 orthopedic effect Effects 0.000 description 1
- 244000045947 parasite Species 0.000 description 1
- 239000002245 particle Substances 0.000 description 1
- 239000000816 peptidomimetic Substances 0.000 description 1
- 230000003239 periodontal effect Effects 0.000 description 1
- 231100000614 poison Toxicity 0.000 description 1
- 230000007096 poisonous effect Effects 0.000 description 1
- 229930001118 polyketide hybrid Natural products 0.000 description 1
- 125000003308 polyketide hybrid group Chemical group 0.000 description 1
- 239000003910 polypeptide antibiotic agent Substances 0.000 description 1
- 108010026466 polyproline Proteins 0.000 description 1
- 230000002685 pulmonary effect Effects 0.000 description 1
- 230000009257 reactivity Effects 0.000 description 1
- 238000011084 recovery Methods 0.000 description 1
- 230000002829 reductive effect Effects 0.000 description 1
- 230000000246 remedial effect Effects 0.000 description 1
- 206010039073 rheumatoid arthritis Diseases 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 230000004936 stimulating effect Effects 0.000 description 1
- 238000001356 surgical procedure Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 230000008733 trauma Effects 0.000 description 1
- 208000001072 type 2 diabetes mellitus Diseases 0.000 description 1
- 230000003827 upregulation Effects 0.000 description 1
- 208000019206 urinary tract infection Diseases 0.000 description 1
- 230000002792 vascular Effects 0.000 description 1
- 230000035899 viability Effects 0.000 description 1
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2428—Query predicate definition using graphical user interfaces, including menus and forms
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/248—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/14—Digital output to display device ; Cooperation and interconnection of the display device with other functional units
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/451—Execution arrangements for user interfaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B10/00—ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B45/00—ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/20—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H20/00—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance
- G16H20/10—ICT specially adapted for therapies or health-improving plans, e.g. for handling prescriptions, for steering therapy or for monitoring patient compliance relating to drugs or medications, e.g. for ensuring correct administration to patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/60—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices
- G16H40/67—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the operation of medical equipment or devices for remote operation
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/20—ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/40—ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/60—ICT specially adapted for the handling or processing of medical references relating to pathologies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/04—Manufacturing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/18—Legal services
- G06Q50/184—Intellectual property management
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H40/00—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices
- G16H40/20—ICT specially adapted for the management or administration of healthcare resources or facilities; ICT specially adapted for the management or operation of medical equipment or devices for the management or administration of healthcare resources or facilities, e.g. managing hospital staff or surgery rooms
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Definitions
- This disclosure relates generally to drug discovery. More specifically, this disclosure relates to generating anti-infective design spaces for selecting drug candidates.
- Therapeutics may refer to a branch of medicine concerned with the treatment of disease and the action of remedial agents (e.g., drugs).
- Therapeutics includes, but is not limited to, the field of ethical pharmaceuticals. Entities in the therapeutics industry may discover, develop, produce, and market drugs for use as medications to be administered or self-administered to patients. Goals of administering or self-administering the drugs may include curing the patient of a disease, causing an active disease to enter a state of remission, vaccinating the patient by stimulating the immune system to better protect against the disease, and/or alleviating, mitigating or ameliorating a symptom.
- Existing drug discoveries may be based on any combination of human design, high-throughput screening, synthetic products and natural substances.
- a method includes generating a design space for a protein (e.g., peptide) for an application (e.g., drug application, industrial application, veterinary application, environmental recovery application (e.g., oil spill, plastics in waterways and oceans), etc.).
- the application may refer to a chemical application (e.g., drug) for which the protein is designed.
- the generating includes identifying sequences for the peptide, and updating the sequences by determining, for each of the sequences, a respective set of activities pertaining to the application.
- the updating produces updated sequences each having updated respective activities.
- the method includes generating, based on the updated sequences, a solution space within the design space.
- the solution space includes a target subset of the updated sequences.
- the method includes performing, using a machine learning model to process the solution space, trials to identify a candidate drug compound that represents a sequence having a level of activity that exceeds a threshold level, and transmitting information describing the candidate drug compound to a computing device
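The flow described above can be sketched in miniature. This is a hypothetical illustration only, not the claimed implementation: `score_activity` stands in for the trained machine learning model, and the sequences, top-N cutoff, and threshold are invented for the example.

```python
# Hypothetical sketch of the claimed flow: enumerate a design space of peptide
# sequences, score each one's activity, carve out a solution space (the
# "target subset"), and flag a candidate whose activity exceeds a threshold.

def score_activity(sequence: str) -> float:
    """Placeholder for an ML-predicted activity level in [0, 1)."""
    return (sum(ord(c) for c in sequence) % 100) / 100.0

def generate_solution_space(sequences, top_n=3):
    """Keep the top-N sequences by predicted activity."""
    scored = [(score_activity(s), s) for s in sequences]
    scored.sort(reverse=True)
    return scored[:top_n]

def find_candidate(solution_space, threshold=0.5):
    """Return the first (sequence, activity) pair above the threshold, if any."""
    for activity, seq in solution_space:
        if activity > threshold:
            return seq, activity
    return None

design_space = ["GIGKFLHS", "KWKLFKKI", "FLPIIAKL", "MAGAININ"]  # invented examples
solution = generate_solution_space(design_space)
candidate = find_candidate(solution)
```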
- a system may include a memory device storing instructions and a processing device communicatively coupled to the memory device.
- the processing device may execute the instructions to perform one or more operations of any method disclosed herein.
- a tangible, non-transitory computer-readable medium may store instructions and a processing device may execute the instructions to perform one or more operations of any method disclosed herein.
- Couple and its derivatives refer to any direct or indirect communication between two or more elements, independent of whether those elements are in physical contact with one another.
- the term “or” is inclusive, meaning and/or.
- phrases “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like.
- translation may refer to any operation performed wherein data is input in one format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation and data is output in a different format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation, wherein the data output has a similar or identical meaning, semantically or otherwise, to the data input.
- Translation as a process includes but is not limited to substitution (including macro substitution), encryption, hashing, encoding, decoding or other mathematical or other operations performed on the input data.
- controller means any device, system or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely.
- phrases “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed.
- “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.
- various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable storage medium.
- application and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code.
- computer readable program code includes any type of computer code, including source code, object code, and executable code.
- computer readable storage medium includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), solid state drive (SSD), or any other type of memory.
- a “non-transitory” computer readable storage medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals.
- a non-transitory computer readable storage medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
- The phrases "candidate drugs" and "candidate drug compounds" may be used interchangeably herein.
- FIG. 1A illustrates a high-level component diagram of an illustrative system architecture according to certain embodiments of this disclosure
- FIG. 1C illustrates first components of an architecture of the creator module according to certain embodiments of this disclosure
- FIG. 1D illustrates second components of the architecture of the creator module according to certain embodiments of this disclosure
- FIG. 1E illustrates an architecture of a variational autoencoder according to certain embodiments of this disclosure
- FIG. 1F illustrates an architecture of a generative adversarial network used to generate candidate drugs according to certain embodiments of this disclosure
- FIG. 1G illustrates types of encodings to represent certain types of drug information according to certain embodiments of this disclosure
- FIG. 1H illustrates an example of concatenating numerous encodings into a candidate drug according to certain embodiments of this disclosure
- FIG. 1I illustrates an example of using a variational autoencoder to generate a latent representation of a candidate drug according to certain embodiments of this disclosure
- FIG. 2 illustrates a data structure storing a biological context representation according to certain embodiments of this disclosure
- FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure
- FIG. 4 illustrates example operations of a method for generating and classifying a candidate drug compound according to certain embodiments of this disclosure
- FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation of a plurality of drug compounds according to certain embodiments of this disclosure
- FIG. 6 illustrates example operations of a method for translating the first data structure of FIGS. 5A-5D into a second data structure having a second format according to certain embodiments of this disclosure
- FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5D into the second data structure having the second format according to certain embodiments of this disclosure
- FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure.
- FIG. 9 illustrates example operations of a method for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure
- FIG. 10A illustrates example operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure
- FIG. 10B illustrates another example of operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure
- FIG. 11 illustrates example operations of a method for using several machine learning models in an artificial intelligence engine architecture to generate peptides according to certain embodiments of this disclosure
- FIG. 12 illustrates example operations of a method for performing a benchmark analysis according to certain embodiments of this disclosure
- FIG. 13 illustrates example operations of a method for slicing a latent representation based on a shape of the latent representation according to certain embodiments of this disclosure
- FIG. 14 illustrates a high-level flow diagram for a therapeutics tool implementing business intelligence according to certain embodiments of this disclosure
- FIG. 15 illustrates an example user interface for using query parameters to generate a solution space that includes protein sequences according to certain embodiments of this disclosure
- FIG. 16 illustrates an example user interface for tracking information pertaining to trials according to certain embodiments of this disclosure
- FIG. 17 illustrates an example user interface for presenting performance metrics of machine learning models that perform trials according to certain embodiments of this disclosure
- FIG. 18 illustrates an example user interface for a candidate dashboard screen according to certain embodiments of this disclosure
- FIG. 19 illustrates example operations of a method for generating a design space for a peptide for an application according to certain embodiments of this disclosure
- FIG. 20 illustrates example operations of a method for comparing performance metrics of machine learning models according to certain embodiments of this disclosure
- FIG. 21 illustrates example operations of a method for presenting a design space and a solution space within a graphical user interface of a therapeutics tool according to certain embodiments of this disclosure
- FIG. 22 illustrates example operations of a method for receiving and presenting of one or more results of performing a selected trial using a machine learning model according to certain embodiments of this disclosure
- FIG. 23 illustrates example operations of a method for using a business intelligence screen to select a desired target product profile for sequences according to certain embodiments of this disclosure.
- FIG. 24 illustrates an example computer system according to certain embodiments of this disclosure.
- conventional techniques for searching for candidate drugs use limited design spaces.
- the design space may refer to parameterization of limits and constraints in a drug space where candidate drug compounds may be designed.
- a design space may also refer to a multidimensional combination and interaction of input variables (e.g., material attributes) and process parameters that have been demonstrated to provide assurance of quality.
- An example of such a fact may include a certain biomedical activity known to be linked to an alpha-helix physical structure of a peptide, where conventional techniques may search for other activities that may result from a peptide having the alpha-helix physical structure.
- Such a limited design space may limit the results obtained.
- it is desirable to enlarge the design space to account for other information such as drug sequence information, drug activity information, drug semantic information, drug chemical information, drug physical information, and so forth.
- enlarging the design space may increase the complexity of searching the design space.
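One plausible (hypothetical, not taken from the disclosure) way to read the definition above is a design space as a set of limits and constraints that a candidate sequence must satisfy; the specific keys and values here are assumptions for illustration.

```python
# Illustrative parameterization of a design space as limits and constraints
# on candidate sequences. The field names and bounds are invented examples.

design_space = {
    "min_length": 5,
    "max_length": 50,
    "allowed_residues": set("ACDEFGHIKLMNPQRSTVWY"),  # the 20 standard amino acids
    "required_activity": "antimicrobial",
}

def in_design_space(sequence: str, space) -> bool:
    """Check a candidate sequence against the design-space constraints."""
    return (space["min_length"] <= len(sequence) <= space["max_length"]
            and set(sequence) <= space["allowed_residues"])
```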
- aspects of the present disclosure generally relate to an artificial intelligence engine for generating candidate drugs.
- the artificial intelligence engine may enlarge the design space to include the combination of drug information (e.g., structural, physical, semantic, activity, sequence, chemical, attributes expressed in solubility data, properties expressed in solubility data, related structures, related drugs, chemical synthesis, biological synthesis, intellectual property data, clinical data, market data, etc.).
- the architecture of the AI engine may include various computational techniques that reduce the computational complexity of using a large design space, thereby saving computing resources (e.g., reducing computing time, reducing processing resources, reducing memory resources, etc.).
- the disclosed architecture may generate superior candidate drugs that include desirable features (e.g., structure, semantics, activity, sequence, clinical outcomes, etc.) found in the larger design space as compared to conventional techniques using the smaller design space.
- the artificial intelligence (AI) engine may use a combination of rational algorithmic discovery and machine learning models (e.g., generative deep learning methods) to produce enhanced therapeutics that may treat any suitable target disease or medical condition.
- the AI engine may discover, translate, design, generate, create, develop, formulate, classify, or test candidate drug compounds that exhibit desired activity (e.g., antimicrobial, immunomodulatory, cytotoxic, neuromodulatory, etc.) in design spaces for target diseases or medical conditions.
- Such candidate drug compounds that exhibit desired activity in a design space may effectively treat the disease or medical condition associated with that design space.
- a selected candidate drug compound that effectively treats the disease or medical condition may be formulated into an actual drug for administration and may be tested in a lab or at a clinical stage.
- the disclosed embodiments may enable rational discovery of drug compounds for a larger design space at a larger scale, higher accuracy, or higher efficiency than conventional techniques.
- the AI engine may use various machine learning models to discover, translate, design, generate, create, develop, formulate, classify, or test candidate drug compounds.
- Each of the various machine learning models may perform certain specific operations.
- the types of machine learning models may include various neural networks that perform deep learning, computational biology, or algorithmic discovery. Examples of such neural networks may include generative adversarial networks, recurrent neural networks, convolutional neural networks, fully connected neural networks, etc., as described further below; and such networks may also additionally employ methods of or incorporating causal inference, including counterfactuals, in the process of discovery.
- a biological context representation of a set of drug compounds may be generated.
- the biological context representation may be a continuous representation of a biological setting that is updated as knowledge is acquired or data is updated.
- the biological context representation may be stored in a first data structure having a format (e.g., a knowledge graph) that includes both various nodes pertaining to health artifacts and various relationships connecting the nodes.
- the nodes and relationships may form logical structures having subjects and predicates. For example, one logical structure between two nodes having a relation may be “Genes are associated with Diseases” where “Genes” and “Diseases” are the subjects of the logical structure and “are associated with” is the relation.
- the knowledge graph may encompass actual knowledge, rather than simply statistical inferences, pertaining to a biological setting.
- the information in the knowledge graph may be continuously or periodically updated and the information may be received from various sources curated by the AI engine.
- the knowledge in the biological context representation goes well beyond “dumb” data that just includes quantities of a value because the knowledge represents the relationships between or among numerous different types of data, as well as any or all of direct, indirect, causal, counterfactual or inferred relationships.
- the biological context representation may not be stored, and instead, based on the stream of knowledge included in the biological context representation, may be streamed from data sources into the AI engine that generates the machine learning models.
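The node-and-relationship structure described above (subjects joined by a relation, e.g. "Genes are associated with Diseases") can be sketched as a minimal triple store. This is an assumption about one simple representation, not the patent's stated implementation.

```python
# Minimal sketch of the described biological context representation as
# subject-relation-object triples. Entity names are invented examples.

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()

    def add(self, subject, relation, obj):
        self.triples.add((subject, relation, obj))

    def related(self, subject, relation):
        """All objects linked to `subject` by `relation`."""
        return {o for s, r, o in self.triples if s == subject and r == relation}

kg = KnowledgeGraph()
kg.add("Gene:BRCA1", "is_associated_with", "Disease:BreastCancer")
kg.add("Drug:Aspirin", "treats", "Condition:Headache")
```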
- the biological context representation may be used to generate candidate drug compounds by translating the first data format to a second data structure having a second format (e.g., a vector).
- the second format may be more computationally efficient or suitable for generating candidate drug compounds that include sequences of ingredients that provide desired activity in a design space.
- “Ingredients” as used herein may refer, without limitation, to substances, compounds, elements, activities (such as the application or removal of electrical charge or a magnetic field for a specific maximum, minimum or discrete amount of time), and mixtures.
- the second format may enable generating views of the levels of activity provided by the sequence of ingredients in a certain design space, as described further below.
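The disclosure does not fix a particular vector encoding; one common instance of translating a sequence into a numeric "second format" is one-hot encoding of amino acids, sketched here as an assumption for illustration.

```python
# Assumed example of the graph-to-vector translation: one-hot encode each
# amino acid so downstream models can operate on numeric vectors.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence: str):
    """Flattened one-hot vector: len(sequence) * 20 entries."""
    vec = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        vec.extend(row)
    return vec

v = one_hot("ACD")  # 3 residues * 20 letters = 60 entries
```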
- the AI engine may include at least one machine learning model that is trained to use causal inference to generate candidate drug compounds.
- One of the challenges with discovering new therapeutics may include determining whether certain ingredients may be causal agents with respect to certain activity in a design space.
- the sheer number of possible sequences of ingredients may be extraordinarily large due to mathematical combinatorics, such that identifying a cause and effect relationship between ingredients and activity may be impossible or, at best, extremely unlikely, to identify without the disclosed embodiments.
- (By analogy, with public-key encryption it is theoretically possible to discover and unlock a private key, but doing this would presently require all the computing power in the world to work longer than the age of the universe: this is an example of what is mathematically possible, but impossible within human time frames and computing power.
- Identifying a cause-and-effect relationship between ingredients and activity, while a different problem, may be similarly mathematically possible, but impossible within human time frames and computer power.)
- the disclosed embodiments may enable the efficient solving of the task of generating candidate drug compounds at scale.
- Causal inference may refer to a process, based on conditions of an occurrence of an effect, of drawing a conclusion about a causal connection.
- Causal inference may analyze a response of an effect variable when a cause is changed.
- Causation may be defined thusly: a variable X is a cause of Y if Y "listens" to X and determines its response based on what it "hears."
- the process of causal inference in the field of AI may be particularly beneficial for generating and testing candidate drug compounds for certain diseases or medical conditions because of the use of what are termed counterfactuals.
- a counterfactual posits and examines conditions contrary to what has actually occurred in reality. For example, if someone takes aspirin for a headache, the headache may go away.
- counterfactuals may refer to calculating alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.
- a counterfactual may enable determining whether a response should stay the same or instead change if something in a sequence does not occur. For example, one counterfactual may include asking: “Would a certain level of activity be the same if a certain ingredient is not included in a sequence of a candidate drug compound?”
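The counterfactual question above can be approximated in code by predicting activity with and without an ingredient and comparing. This is a hedged toy sketch: `predict_activity` is an invented stand-in for a trained model, and real counterfactual inference would use a causal model rather than simple ablation.

```python
# Toy sketch of the counterfactual: "would the activity be the same if a
# certain ingredient were not included in the sequence?"

def predict_activity(ingredients: tuple) -> float:
    """Invented stand-in: activity rises if a hypothetical key ingredient is present."""
    base = 0.2
    if "ingredient_X" in ingredients:
        base += 0.5
    return base + 0.01 * len(ingredients)

def counterfactual_effect(ingredients, removed):
    """Change in predicted activity if `removed` were absent."""
    factual = predict_activity(tuple(ingredients))
    counterfactual = predict_activity(tuple(i for i in ingredients if i != removed))
    return factual - counterfactual

effect = counterfactual_effect(["ingredient_X", "ingredient_Y"], "ingredient_X")
# A large positive effect suggests the ingredient drives the activity.
```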
- the embodiments may provide technical benefits, such as reducing resources consumed (e.g., time, processing, memory, network bandwidth) by reducing a number of candidate drug compounds that may be considered for classification as a selected candidate drug compound by another machine learning model.
- one application for the AI engine to design, discover, develop, formulate, create, or test candidate drug compounds may pertain to peptide therapeutics.
- a peptide may refer to a compound consisting of two or more amino acids linked in a chain. Example peptides may include dipeptides, tripeptides, tetrapeptides, etc.
- a polypeptide may refer to a long, continuous, and unbranched peptide chain.
- a cyclic peptide may refer to a polypeptide which contains a circular sequence of bonded amino acids.
- a modified peptide may refer to a synthesized peptide that undergoes a modification to a side chain, C-terminus, or N-terminus.
- Peptides may be simple to manufacture at discovery scale, include drug-like characteristics of small molecules, include safety and high specificity of biologics, or provide greater administration flexibility than some other biologics.
- the AI engine may efficiently use a biological context representation of a set of drug compounds and one or more machine learning models to generate a set of candidate drug compounds and classify one of the set of candidate drug compounds as a selected candidate drug compound.
- Some embodiments may use causal inference to remove one or more potential candidate drug compounds from classification, thereby reducing the computational complexity and processing burden of classifying a selected candidate drug compound.
- benchmark analysis may be performed for each type of machine learning model that generates candidate drugs.
- the benchmark analysis may score various parameters of the machine learning models that generate the candidate drugs.
- the various parameters may refer to candidate drug novelty, candidate drug uniqueness, candidate drug similarity, candidate drug validity, etc.
- the scores may be used to recursively tune the machine learning models over time to cause one or more of the parameters to increase for the machine learning models.
- some of the machine learning models may vary in their effectiveness as it pertains to some of the parameters.
- the benchmark analysis may score the candidate drug compounds generated by the machine learning models, rank the machine learning models that generate the highest scoring candidate drug compounds, or select the machine learning models producing the highest scoring candidate drug compounds.
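The benchmark parameters named above (novelty, uniqueness, validity) admit simple set-based definitions. The metric formulas below are common conventions assumed for illustration, not definitions taken from the disclosure.

```python
# Assumed example definitions of three benchmark parameters for a batch of
# generated candidate sequences.

VALID_CHARS = set("ACDEFGHIKLMNPQRSTVWY")  # standard amino-acid letters

def uniqueness(generated):
    """Fraction of generated sequences that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    """Fraction of generated sequences not seen during training."""
    return sum(s not in training_set for s in generated) / len(generated)

def validity(generated):
    """Fraction of sequences built only from standard amino-acid letters."""
    return sum(set(s) <= VALID_CHARS for s in generated) / len(generated)

gen = ["ACDK", "ACDK", "WXYZ", "GGGG"]  # invented examples
train = {"ACDK"}
scores = {"uniqueness": uniqueness(gen),
          "novelty": novelty(gen, train),
          "validity": validity(gen)}
```

Scores like these could then be used, as the text describes, to rank models or to recursively tune them over time.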
- certain markets e.g., anti-infective, animal, industrial, etc.
- certain markets may prefer, based on a type of data those markets generate, to use certain machine learning models that generate high scores for a subset of parameters.
- the subset of machine learning models that generate the high scores for the subset of parameters may be combined into a package and transmitted to a third party. That is, some embodiments enable custom tailoring of machine learning model packages for particular needs of third parties based on their data.
- additional benefits of the embodiments disclosed herein may include using the AI engine to produce algorithmically designed drug compounds that have been validated in vivo and in vitro and that provide (i) a broad-spectrum activity against greater than, e.g., 900 multi-drug resistant bacteria, (ii) at least, e.g., a 2-to-10 times improvement in exposure time required to generate a drug resistance profile, (iii) effectiveness across, e.g., four key animal infection models (both Gram-positive and Gram-negative bacteria), or (iv) effectiveness against, e.g., biofilms.
- the embodiments disclosed herein may not only apply to the anti-infective market (e.g., for prosthetic joint infections, urinary tract infections, intra-abdominal or peritoneal infections, otitis media, cardiac infections, respiratory infections including but not limited to sequelae from diseases such as cystic fibrosis, neurological infections (e.g., meningitis), dental infections (including periodontal), other organ infections, digestive and intestinal infections (e.g., C. difficile), other physiological system infections, wound and soft tissue infections (e.g., cellulitis), etc.), but to numerous other suitable markets or industries.
- the embodiments may be used in the animal health/veterinary industry, for example, to treat certain animal diseases (e.g., bovine mastitis).
- the embodiments may be used for industrial applications, such as anti-biofouling, or generating optimized control action sequences for machinery.
- the embodiments may also benefit a market for new therapeutic indications, such as those for eczema, inflammatory bowel disease, Crohn's Disease, rheumatoid arthritis, asthma, auto-immune diseases and disease processes in general, inflammatory disease progressions or processes, or oncology treatments and palliatives.
- the video game industry may also benefit from the disclosed techniques to improve the AI used for generating sequences of decisions that non-player characters (NPC) make during gameplay.
- the knowledge graph may include multiple states of: player characters, non-player characters, levels, settings, actions, results of the actions, and so forth, and one or more machine learning models may use the techniques described herein to generate optimized sequences of decisions for NPCs to make during gameplay when the states are encountered.
- the integrated circuit/chip industry may also benefit from the disclosed techniques to improve the mask works generation and routing processes used for generating the most efficient, highest performance, lowest power, lowest heat generating systems on a chip or solid state devices.
- the knowledge graph may include configurations of mask works and routings of systems on chips or solid state drives, as well as their associated properties (e.g., efficiency, performance, power consumption, operating temperature, etc.).
- the disclosed techniques may generate one or more machine learning models trained using the knowledge graph to generate optimized mask works or routings to achieve desired properties. Accordingly, it should be understood that the disclosed embodiments may benefit any market or industry associated with a sequence (e.g., items, objects, decisions, actions, ingredients, etc.) that can be optimized.
- FIGS. 1A through 14, discussed below, and the various embodiments used to describe the principles of this disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure.
- FIG. 1A illustrates a high-level component diagram of an illustrative system architecture 100 according to certain embodiments of this disclosure.
- the system architecture 100 may include a computing device 102 communicatively coupled to a computing system 116 .
- the computing system 116 may be a real-time software platform, include privacy software or protocols, or include security software or protocols.
- Each of the computing device 102 and components included in the computing system 116 may include one or more processing devices, memory devices, or network interface cards.
- the network interface cards may enable communication via a wireless protocol for transmitting data over short distances, such as Bluetooth, ZigBee, NFC, etc.
- Network 112 may be a public network (e.g., connected to the Internet via wired (Ethernet) or wireless (WiFi)), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof.
- network 112 may also comprise a node or nodes on the Internet of Things (IoT).
- the computing device 102 may be any suitable computing device, such as a laptop, tablet, smartphone, or computer.
- the computing device 102 may include a display capable of presenting a user interface of an application 118 .
- the application 118 may be implemented in computer instructions stored on the one or more memory devices of the computing device 102 and executable by the one or more processing devices of the computing device 102 .
- the application 118 may present various screens to a user that present various views (e.g., topographical heatmaps) including measures, gradients, or levels of certain types of activity and optimized sequences of selected candidate drug compounds, information pertaining to the selected candidate drug compounds or other candidate drug compounds, options to modify the sequence of ingredients in the selected candidate drug compound, and so forth, as described in more detail below.
- the computing device 102 may also include instructions stored on the one or more memory devices that, when executed by the one or more processing devices of the computing device 102 , perform operations of any of the methods described herein.
- the computing system 116 may include one or more servers 128 that form a distributed computing system, which may include a cloud computing system.
- Each of the servers 128 may be a rackmount server, a router, a personal computer, a portable digital assistant, a mobile phone, a laptop computer, a tablet computer, a camera, a video camera, a netbook, a desktop computer, a media center, any other device capable of functioning as a server, or any combination of the above.
- Each of the servers 128 may include one or more processing devices, memory devices, data storage, or network interface cards.
- the servers 128 may be in communication with one another via any suitable communication protocol.
- the servers 128 may execute an artificial intelligence (AI) engine 140 that uses one or more machine learning models 132 to perform at least one of the embodiments disclosed herein.
- the computing system 116 may also include a database 150 that stores data, knowledge, and data structures used to perform various embodiments.
- the database 150 may store a knowledge graph containing the biological context representation described further below.
- the database 150 may store the structures of generated candidate drug compounds, the structures of selected candidate drug compounds, and information pertaining to the selected candidate drug compounds (e.g., activity for certain types of ingredients, sequences of ingredients, test results, correlations, semantic information, structural information, physical information, chemical information, etc.).
- the database 150 may be hosted on one or more of the servers 128 .
- the computing system 116 may include a training engine 130 capable of generating one or more machine learning models 132 .
- the training engine 130 may, in some embodiments, be included in the AI engine 140 executing on the server 128 .
- the AI engine 140 may use the training engine 130 to generate the machine learning models 132 trained to perform inferencing operations.
- the machine learning models 132 may be trained to discover, translate, design, generate, create, develop, classify, or test candidate drug compounds, among other things.
- the one or more machine learning models 132 may be generated by the training engine 130 and may be implemented in computer instructions executable by one or more processing devices of the training engine 130 or the servers 128 .
- the training engine 130 may train the one or more machine learning models 132 .
- the one or more machine learning models 132 may be used by any of the modules in the AI engine 140 architecture depicted in FIG. 2 .
- the training engine 130 may be a rackmount server, a router, a personal computer, a portable digital assistant, a smartphone, a laptop computer, a tablet computer, a netbook, a desktop computer, an Internet of Things (IoT) device, any other desired computing device, or any combination of the above.
- the training engine 130 may be cloud-based, be a real-time software platform, include privacy software or protocols, or include security software or protocols.
- the training engine 130 may train the one or more machine learning models 132 .
- the training engine 130 may use a base data set of biological context representation (e.g., physical properties data, peptide activity data, microbe data, antimicrobial data, anti-neurodegenerative compound data, pro-neuroplasticity compound data, clinical outcome data, etc.) for a set of drug compounds.
- the biological context representation may include sequences of ingredients for the drug compounds.
- the results may include information indicating levels of certain types of activity associated with certain design spaces. In one embodiment, the results may include causal inference information pertaining to whether certain ingredients in the drug compounds are correlated with or determined by certain effects (e.g., activity levels) in the design space.
- the one or more machine learning models 132 may refer to model artifacts created by the training engine 130 using training data that includes training inputs and corresponding target outputs.
- the training engine 130 may find patterns in the training data wherein such patterns map the training input to the target output and generate the machine learning models 132 that capture these patterns.
- the training engine 130 may reside on server 128 .
- the artificial intelligence engine 140 , the database 150 , or the training engine 130 may reside on the computing device 102 .
- the one or more machine learning models 132 may comprise, e.g., a single level of linear or non-linear operations (e.g., a support vector machine (SVM)) or the machine learning models 132 may be a deep network, i.e., a machine learning model comprising multiple levels of non-linear operations.
- deep networks are neural networks, including generative adversarial networks, convolutional neural networks, recurrent neural networks with one or more hidden layers, and fully connected neural networks (e.g., each artificial neuron may transmit its output signal to the input of the remaining neurons, as well as to itself).
- the machine learning model may include numerous layers or hidden layers that perform calculations (e.g., dot products) using various neurons.
- one or more of the machine learning models 132 may be trained to use causal inference and counterfactuals.
- the machine learning model 132 trained to use causal inference may accept one or more inputs, such as (i) assumptions, (ii) queries, and (iii) data.
- the machine learning model 132 may be trained to output one or more outputs, such as (i) a decision as to whether a query may be answered, (ii) an objective function (also referred to as an estimand) that provides an answer to the query for any received data, and (iii) an estimated answer to the query and an estimated uncertainty of the answer, where the estimated answer is based on the data and the objective function, and the estimated uncertainty reflects the quality of data (i.e., a measure which takes into account the degree or salience of incorrect data or missing data).
- the assumptions may also be referred to as constraints and may be simplified into statements used in the machine learning model 132 .
- the queries may refer to scientific questions for which the answers are desired.
- the answers estimated using causal inference by the machine learning model may include optimized sequences of ingredients in selected candidate drug compounds.
- certain causal diagrams may be generated, as well as logical statements, and patterns may be detected. For example, one pattern may indicate that “there is no path connecting ingredient D and activity P,” which may translate to a statistical statement “D and P are independent.” If alternative calculations using counterfactuals contradict or do not support that statistical statement, then the machine learning model 132 or the biological context representation may be updated.
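- A statistical statement such as "D and P are independent" can be checked directly against data. The sketch below is a stdlib-only illustration; the ingredient/activity data, the permutation-test design, and the 0.05 threshold are assumptions for the example, not taken from the disclosure:

```python
import random
from statistics import mean

def correlation(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def independence_pvalue(d, p, trials=500, seed=0):
    """Permutation test: how often does a random re-pairing show at
    least as much |correlation| as the observed pairing?"""
    rng = random.Random(seed)
    observed = abs(correlation(d, p))
    hits = 0
    for _ in range(trials):
        shuffled = p[:]
        rng.shuffle(shuffled)
        if abs(correlation(d, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Toy data: ingredient D present (1) or absent (0) vs. activity level P.
d = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p = [0.10, 0.20, 0.15, 0.12, 0.18, 0.90, 0.80, 0.95, 0.85, 0.88]
print(independence_pvalue(d, p))  # small value contradicts "D and P are independent"
```

If the p-value is small, the data contradict the independence statement, which under the scheme above would prompt an update to the machine learning model 132 or the biological context representation.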
- another machine learning model 132 may be used to compute a degree of fitness which represents a degree to which the data is compatible with the assumptions used by the machine learning model that uses causal inference.
- a generative adversarial network (GAN) may generate a set of candidate drug compounds without using causal inference.
- the GAN may generate a set of candidate drug compounds using causal inference.
- a GAN refers to a class of deep learning algorithms including two neural networks, a generator and a discriminator, that both compete with one another to achieve a goal.
- the generator goal may include generating candidate drug compounds, including compatible/incompatible sequences of ingredients, and effective/ineffective sequences of ingredients, etc. that the discriminator classifies as feasible candidate drug compounds, including compatible and effective sequences of ingredients that may produce desired activity levels for a design space.
- the generator may use causal inference, including counterfactuals, to calculate numerous alternative scenarios that indicate whether a certain result (e.g., activity level) still follows when any element or aspect of a sequence changes.
- the generator may be a neural network based on Markov models (e.g., Deep Markov Models), which may perform causal inference.
- one or more of the counterfactuals used during the causal inference may be determined and provided by the scientist module.
- the discriminator goal may include distinguishing candidate drug compounds which include undesirable sequences of ingredients from candidate drug compounds which include desirable sequences of ingredients.
- the generator initially generates candidate drug compounds and continues to generate better candidate drug compounds after each iteration until the generator eventually begins to generate candidate drug compounds that are valid drug compounds which produce certain levels of activity within a design space.
- a candidate drug compound may be “valid” when it produces a certain level of effectiveness (e.g., above a threshold activity level as determined by a standard (e.g., regulatory entity)) in a design space.
- the discriminator may receive real drug compound information from a dataset and the candidate drug compounds generated by the generator.
- “Real drug compound,” as used in this disclosure, may refer to a drug compound that has been approved by any regulatory (governmental) body or agency. The generator obtains the results from the discriminator and applies the results in order to generate better (e.g., valid) candidate drug compounds.
- the two neural networks, the generator and the discriminator may be trained simultaneously.
- the discriminator may receive an input and then output a scalar indicating whether a candidate drug compound is an actual or viable drug compound.
- the discriminator may resemble an energy function that outputs a low value (e.g., close to 0) when input is a valid drug compound and a positive value when the input is not a valid drug compound (e.g., if it includes an incorrect sequence of ingredients for certain activity levels pertaining to a design space).
- the generator function may be denoted as G(V), where V is generally a vector randomly sampled in a standard distribution (e.g., Gaussian).
- the vector may be any suitable dimension and may be referred to as an embedding herein.
- the role of the generator is to produce candidate drug compounds to train the discriminator function (D(Y)) to output the values indicating the candidate drug compound is valid (e.g., a low value), where Y is generally a vector referred to as an embedding and where, further, Y may include candidate drug compounds or real drug compounds.
- During training, the discriminator is presented with a valid drug compound and adjusts its parameters (e.g., weights and biases) to output a value indicative of the validity of the candidate drug compounds that produce real activity levels in certain design spaces.
- the discriminator may receive a modified candidate drug compound (e.g., modified using counterfactuals) generated by the generator and adjust its parameters to output a value indicative of whether the modified candidate drug compound provides the same or a different activity level in the design space.
- the discriminator may use a gradient of an objective function to increase the value of the output.
- the discriminator may be trained as an unsupervised “density estimator,” i.e., a contrast function produces a low value for desired data (e.g., candidate drug compounds that include sequences producing desired levels of certain types of activity in a design space) and higher output for undesired data (e.g., candidate drug compounds that include sequences producing undesirable levels of certain types of activity in a design space).
- the generator may receive the gradient of the discriminator with respect to each modified candidate drug compound it produces. The generator uses the gradient to train itself to produce modified candidate drug compounds that the discriminator determines include sequences producing desired levels of certain types of activity in a design space.
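- The adversarial loop described above can be sketched at toy scale. In this illustration a "compound" is a single scalar feature, the generator and discriminator are one-parameter affine/logistic functions with hand-derived gradients, and all data and learning rates are invented for the example; a real system would use deep networks:

```python
import math
import random

def sigmoid(u):
    """Numerically safe logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    z = math.exp(u)
    return z / (1.0 + z)

class ToyGAN:
    """Generator g(z) = a*z + b maps Gaussian noise to samples;
    discriminator D(y) = sigmoid(w*y + c) scores how real a sample looks."""
    def __init__(self):
        self.a, self.b = 1.0, 0.0      # generator parameters
        self.w, self.c = 0.1, 0.0      # discriminator parameters

    def g(self, z):
        return self.a * z + self.b

    def d(self, y):
        return sigmoid(self.w * y + self.c)

    def step(self, x_real, z, lr=0.05):
        # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
        fake = self.g(z)
        u_r, u_f = self.w * x_real + self.c, self.w * fake + self.c
        gr = sigmoid(u_r) - 1.0        # d/du of -log D(real)
        gf = sigmoid(u_f)              # d/du of -log(1 - D(fake))
        self.w -= lr * (gr * x_real + gf * fake)
        self.c -= lr * (gr + gf)
        # Generator update (non-saturating): push D(fake) toward 1.
        fake = self.g(z)
        u_f = self.w * fake + self.c
        gg = (sigmoid(u_f) - 1.0) * self.w   # d/dfake of -log D(fake)
        self.a -= lr * gg * z
        self.b -= lr * gg

rng = random.Random(0)
gan = ToyGAN()
for _ in range(2000):
    x_real = rng.gauss(5.0, 0.5)   # "real drug compound" feature
    z = rng.gauss(0.0, 1.0)        # embedding V sampled from a Gaussian
    gan.step(x_real, z)

samples = [gan.g(rng.gauss(0.0, 1.0)) for _ in range(200)]
print(sum(samples) / len(samples))  # drifts toward the real mean of 5
```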
- Recurrent neural networks include the functionality, in the context of a hidden layer, to process information sequences and store information about previous computations. As such, recurrent neural networks may have or exhibit a "memory." Recurrent neural networks may include connections between nodes that form a directed graph along a temporal sequence. Keeping and analyzing information about previous states enables recurrent neural networks to process sequences of inputs to recognize patterns (e.g., sequences of ingredients and correlations with certain types of activity levels). Recurrent neural networks may be similar to Markov chains. For example, Markov chains may refer to stochastic models describing sequences of possible events in which the probability of any given event depends only on the state information contained in the previous event. Thus, Markov chains also use an internal memory to store at least the state of the previous event. These models may be useful in determining causal inference, such as whether an event at a current node changes as a result of the state of a previous node changing.
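- The Markov property described above can be sketched with a tiny two-state chain. The states and transition probabilities here are invented for illustration (e.g., whether an added ingredient keeps or changes an activity level):

```python
import random

# Hypothetical transition probabilities between activity states.
TRANSITIONS = {
    "active":   {"active": 0.8, "inactive": 0.2},
    "inactive": {"active": 0.3, "inactive": 0.7},
}

def next_state(state, rng):
    """The next event depends only on the current state (Markov property)."""
    r, cumulative = rng.random(), 0.0
    for candidate, prob in TRANSITIONS[state].items():
        cumulative += prob
        if r < cumulative:
            return candidate
    return candidate  # guard against floating-point rounding

def simulate(start, steps, rng):
    """Walk the chain: only the previous state is kept in memory."""
    sequence = [start]
    for _ in range(steps):
        sequence.append(next_state(sequence[-1], rng))
    return sequence

print(simulate("active", 5, random.Random(0)))
```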
- the set of candidate drug compounds generated may be input into another machine learning model 132 trained to classify one of the set of candidate drug compounds as a selected candidate drug compound.
- the classifier may be trained to rank the set of candidate drug compounds using any suitable ranking technique (e.g., a non-parametric technique).
- one or more clustering techniques may be used to cluster the set of candidate drug compounds.
- the machine learning model 132 may also perform objective optimization techniques while clustering.
- the objective optimization may include using a minimization or maximization function for each candidate drug compound in the clusters.
- a cluster may refer to a group of data objects similar to one another within the same cluster, but dissimilar to the objects in the other clusters.
- Cluster analysis may be used to classify the data into relative groups (clusters).
- clustering may include K-means clustering where “K” defines the number of clusters. Performing K-means clustering may comprise specifying the number of clusters, specifying the cluster seeds, assigning each point to a centroid, and adjusting the centroid.
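- The K-means steps just listed can be sketched in a few lines of Python. The 1-D points and sample data are illustrative:

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain 1-D K-means following the steps above: seed the centroids,
    assign each point to its nearest centroid, then adjust the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # specify the cluster seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):  # adjust each centroid
            if members:
                centroids[i] = sum(members) / len(members)
    return sorted(centroids)

# Two well-separated groups of activity scores; K=2 recovers their means.
scores = [0.1, 0.2, 0.15, 0.9, 0.95, 1.0]
print(kmeans(scores, 2))  # approximately [0.15, 0.95]
```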
- Additional clustering techniques may include hierarchical clustering and density based spatial clustering.
- Hierarchical clustering may be used to identify the groups in the set of candidate drug compounds where there is no set number of clusters to be generated. As a result, a tree-based representation of the objects in the various groups may be generated.
- Density-based spatial clustering may be used to identify clusters of any shape in a dataset having noise and outliers. This form of clustering also does not require specifying the number of clusters to be generated.
- FIG. 1B illustrates an architecture of the artificial intelligence engine according to certain embodiments of this disclosure.
- the architecture may include a biological context representation 200 , a creator module 151 , a descriptor module 152 , a scientist module 153 , a reinforcer module 154 , and a conductor module 155 .
- the architecture may provide a platform that improves its machine learning models over time by using benchmark analysis to produce enhanced candidate drug compounds for target design spaces.
- the platform may also continuously or continually learn new information from literature, clinical trials, studies, research, or any suitable data source about drug compounds. The newly learned information may be used to continuously or continually train the machine learning models to evolve with evolving information.
- the biological context representation 200 may be implemented in a general manner such that it can be applied to solve different types of problems across different markets.
- the underlying structure of the biological context representation 200 may include nodes and relationships between the nodes. There may be semantic information, activity information, structural information, chemical information, pathway information, and so forth represented in the biological context representation 200 .
- the biological context representation 200 may include any number of layers of information (e.g., five layers of information). The first layer may pertain to molecular structure and physical property information, the second layer may pertain to molecule-to-molecule interactions, the third layer may pertain to molecule pathway interactions, the fourth layer may pertain to molecule cell profile associations, and the fifth layer may pertain to therapeutics (including those using biologics) and indications relevant for molecules.
- the biological context representation 200 is discussed further below with reference to FIGS. 2 and 5 .
- various encodings may be selected to preferentially represent certain types of data. For example, to effectively capture common backbone structures of molecules, Morgan fingerprints may be used to describe physical properties of the candidate drug compounds. The encodings are discussed further below with reference to FIG. 1G .
- the creator module 151 may include one or more generative machine learning models trained to generate new candidate drug compounds.
- the new candidate drug compounds are then added to the biological context representation 200 .
- the term “creator module” and “generative model” may be used interchangeably herein.
- Each node in the biological context representation 200 may be a candidate drug compound (e.g., a peptide candidate).
- the generative machine learning modules included in the creator module 151 may be of different types and perform different functions.
- the different types and different functions may include a variational autoencoder, structured transformer, Mini Batch Discriminator, dilation, self-attention, upsampling, loss, and the like.
- the variational autoencoder may simultaneously train two machine learning models: an inference model qφ(z|x) and a generative model pθ(x|z).
- both the inference model and the generative model may be conditioned on a chosen attribute of the sequences.
- Both models may be jointly optimized using a tractable variational Bayesian approach which maximizes an evidence lower bound (ELBO).
- conditionals may be parameterized in terms of two sub-networks: an encoder that computes embeddings from structure-based features and edge features, and a decoder that autoregressively predicts each amino acid letter s_i given the preceding sequence and the structural embeddings from the encoder.
- Mode collapse occurs in generative adversarial networks when the generator generates a limited diversity of samples, or even the same sample, regardless of the input.
- some embodiments implement a Mini Batch Discriminator (MBD) approach. MBDs each work as an extra layer in the network that computes the standard deviation across the batch of examples (the batch contains only real drug compounds or only candidate drug compounds). If the batch contains a small variety of examples, the standard deviation will be low, and the discriminator will be able to use this information to lower the score for each example in the batch.
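- A minimal sketch of the MBD idea on toy feature vectors. The data and the single aggregated statistic are simplifications; real implementations append the statistic as an extra feature map inside the network:

```python
from statistics import pstdev

def minibatch_stddev_feature(batch):
    """Appends the average per-position standard deviation across the
    batch as one extra feature, so the discriminator can see how varied
    the batch is (a low value hints at mode collapse)."""
    positions = list(zip(*batch))                       # transpose features
    avg_std = sum(pstdev(p) for p in positions) / len(positions)
    return [example + [avg_std] for example in batch]

diverse = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
print(minibatch_stddev_feature(collapsed)[0][-1])  # 0.0 — no variety at all
print(minibatch_stddev_feature(diverse)[0][-1])    # positive — varied batch
```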
- convolution filters may be capable of detecting local features, but they have limitations when it comes to relationships separated by long distances. Accordingly, some embodiments implement convolution filters with dilation. By introducing gaps into convolution kernels, such techniques increase the receptive field without increasing the number of parameters. Dilation rate may be applied to one convolution filter in each residual block of a generator or a discriminator. In this way, by the last layer of the generative adversarial network, filters may include a large enough receptive field to learn relationships separated by long-distances. Residual blocks are discussed further below with reference to FIG. 1F .
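- The receptive-field effect of dilation can be seen in a toy 1-D convolution. This is a pure-Python sketch; real layers operate on tensors with many filters:

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution with gaps of (dilation - 1) between kernel
    taps: the receptive field grows without adding parameters."""
    span = (len(kernel) - 1) * dilation + 1     # effective receptive field
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(kernel[i] * signal[start + i * dilation]
                       for i in range(len(kernel))))
    return out

x = [1, 2, 3, 4, 5, 6]
print(dilated_conv1d(x, [1, 1], dilation=1))  # adjacent taps: [3, 5, 7, 9, 11]
print(dilated_conv1d(x, [1, 1], dilation=2))  # taps two apart: [4, 6, 8, 10]
```

The same two-tap kernel covers a span of 2 at dilation 1 and a span of 3 at dilation 2, which is how stacked dilated filters reach long-distance relationships by the last layer.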
- the architecture of the generative adversarial network disclosed herein implements a self-attention mechanism.
- the self-attention mechanism may include a number of layers that highlight different areas of importance across the entire sequence and allow the discriminator to determine whether parts in distant portions of the protein are consistent with each other.
- some embodiments implement techniques best suited for protein generation. For example, nearest-neighbor interpolation, transposed convolution, and sub-pixel convolution may be used. Sub-pixel shuffle convolution may be used to increase resolution of a design space during candidate drug compound generation. Any combination of these techniques may be used in the upsampling layers. In some embodiments, transposed convolution by itself may be used for all upsampling layers.
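- Two of the upsampling variants can be illustrated in 1-D. This is a toy sketch; actual layers work on multi-channel tensors, and transposed convolution is omitted here for brevity:

```python
def nearest_neighbor_upsample(seq, factor):
    """Nearest-neighbor interpolation: repeat each element `factor` times."""
    return [v for v in seq for _ in range(factor)]

def subpixel_shuffle(channels):
    """1-D pixel shuffle: interleave r channels of length n into one
    sequence of length n*r, trading channel depth for resolution."""
    return [ch[i] for i in range(len(channels[0])) for ch in channels]

print(nearest_neighbor_upsample([1, 2], 3))  # [1, 1, 1, 2, 2, 2]
print(subpixel_shuffle([[1, 3], [2, 4]]))    # [1, 2, 3, 4]
```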
- the loss function is a component that aids in the successful performance of a neural network.
- Various losses, such as non-saturating, non-saturating with R1 regularization, hinge, hinge with relativistic average, Wasserstein, and Wasserstein with gradient penalty, may be used.
- the non-saturating loss with R1 regularization may be used for the generative adversarial network.
- the descriptor module 152 may include one or more machine learning models trained to generate descriptions for each of the candidate drug compounds generated by the creator module 151 .
- the descriptor module 152 may be trained to use different encodings to represent the different types of information included in the candidate drug compound.
- the descriptor module 152 may populate the information in the candidate drug compound with ordinal values, cardinal values, categorical values, etc. depending on the type of information.
- the descriptor module 152 may include a classifier that analyzes the candidate drug compound and determines whether it is a cancer peptide, an antimicrobial peptide, or a different peptide.
- the descriptor module 152 describes the structure and the physiochemical properties of the candidate drug compound.
- the reinforcer module 154 may include one or more machine learning models trained to analyze, based on the descriptions, the structure and the physiochemical properties of the candidate drug compounds in the biological context representation 200 . Based on the analysis, the reinforcer module 154 may identify a set of experiments to perform on the candidate drug compounds to elicit certain desired data (e.g., activity effectiveness, biomedical features, etc.). The identification may be performed by matching a pattern of the structure and physiochemical properties of the candidate drug compounds with the structure and physiochemical properties of other drug compounds and determining which experiments were performed on the other drug compounds to elicit desired data. The experiments may include in vitro or in vivo experiments. Further, the reinforcer module 154 may identify experiments that should not be performed for the candidate drug compounds if a determination is made that those experiments yield useless data for drug compounds.
- the conductor module 155 may include one or more machine learning models trained to perform inference queries on the data stored in the biological context representation 200 .
- the inference queries may pertain to performing queries to improve the quality of the data in the biological context representation 200 .
- An inference query refers to the process of identifying a first node and a second node similar to the first node, and obtaining data from the second node to fill a data gap in the first node.
- An inference query may be executed to search for another node having similarities to the node with the gap and may fill the gap with the data from the other node.
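- A minimal sketch of such a gap-filling query over dictionary "nodes". The peptide attributes and the overlap-count similarity measure are invented for illustration:

```python
def similarity(a, b):
    """Count of attributes two nodes share with equal values."""
    shared = set(a) & set(b)
    return sum(1 for key in shared if a[key] == b[key])

def infer_missing(nodes, target, field):
    """Find the node most similar to `target` that has `field`, and use
    its value to fill the gap in `target`."""
    candidates = [n for n in nodes if field in n and n is not target]
    best = max(candidates, key=lambda n: similarity(n, target))
    return best[field]

peptide_a = {"class": "antimicrobial", "charge": "+2"}  # missing "target"
peptide_b = {"class": "antimicrobial", "charge": "+2", "target": "membrane"}
peptide_c = {"class": "anticancer", "charge": "0", "target": "DNA"}
print(infer_missing([peptide_a, peptide_b, peptide_c], peptide_a, "target"))
# -> "membrane": copied from the most similar node that has the field
```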
- the scientist module 153 may include one or more machine learning models trained to perform benchmark analysis to evaluate various parameters of the creator module 151 .
- the scientist module 153 may generate scores for the candidate drug compounds generated by the creator module 151.
- the benchmark analysis may be used to electronically and recursively optimize the creator module 151 to generate candidate drug compounds having improved scores in subsequent generation rounds.
- There may be several types of benchmarks (e.g., distribution learning benchmarks, goal-directed benchmarks, etc.) used by the scientist module 153 to evaluate one or more parameters (e.g., validity, uniqueness, novelty, Frechet ChemNet Distance (FCD), internal diversity, Kullback-Leibler (KL) divergence, similarity, rediscovery, isomer capability, median compounds, etc.) of the generative machine learning models used by the creator module 151.
- One type of benchmark used by the scientist module 153 may include a distribution learning benchmark.
- the distribution learning benchmark evaluates, when given a set of molecules, how well the creator module 151 generates new molecules which follow the same chemical distribution. For example, when provided with therapeutic peptides, the distribution learning benchmark evaluates how well the creator module 151 generates other therapeutic peptides having similar chemical distributions.
- the distribution learning benchmark may include generating a score for an ability of the creator module 151 to generate valid candidate drug compounds, a score for an ability of the creator module 151 to generate unique candidate drug compounds, a score for an ability of the creator module 151 to generate novel candidate drug compounds, a Frechet ChemNet Distance (FCD) score for the creator module 151 , an internal diversity score for the creator module 151 , a KL divergence score for the creator module 151 , and so forth.
- the validity score may be determined as a ratio of valid candidate drug compounds to non-valid candidate drug compounds of generated candidate drug compounds. In some embodiments, the ratio may be determined from a certain number (e.g., 10,000) of candidate drug compounds. In some embodiments, candidate drug compounds may be considered valid if their representation (e.g., simplified molecular-input line-entry system (SMILES)) can be successfully parsed using any suitable parser.
- the uniqueness score may be determined by sampling candidate drug compounds generated by the creator module 151 until a certain number (e.g., 10,000) of valid molecules are obtained, identifying each by a canonical representation (e.g., a canonical SMILES string).
- the uniqueness score may be determined as the number of different representations divided by the certain number (e.g., 10,000).
- the novelty score may be determined by generating candidate drug compounds until a certain number (e.g., 10,000) of different representations (e.g., canonical SMILES strings) are obtained and computing the ratio of candidate drug compounds (including real drug compounds) not present in the training dataset.
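- The validity, uniqueness, and novelty scores described above reduce to simple set arithmetic. A toy sketch, in which the strings and the trivial "parser" are placeholders for real canonical-SMILES handling:

```python
def validity_score(candidates, is_parseable):
    """Fraction of generated candidates whose representation parses."""
    return sum(1 for c in candidates if is_parseable(c)) / len(candidates)

def uniqueness_score(valid_candidates):
    """Distinct canonical representations divided by total sampled."""
    return len(set(valid_candidates)) / len(valid_candidates)

def novelty_score(valid_candidates, training_set):
    """Fraction of distinct candidates absent from the training data."""
    distinct = set(valid_candidates)
    return len(distinct - set(training_set)) / len(distinct)

# Toy "SMILES-like" strings; the stand-in parser just rejects empty strings.
generated = ["CCO", "CCO", "CCN", "", "CCC"]
valid = [c for c in generated if c]
print(validity_score(generated, bool))  # 4 of 5 parse: 0.8
print(uniqueness_score(valid))          # 3 distinct of 4 sampled: 0.75
print(novelty_score(valid, ["CCO"]))    # 2 of 3 distinct are new
```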
- the Frechet ChemNet Distance (FCD) score may be determined by selecting a random subset of a certain number (e.g., 10,000) of drug compounds from the training dataset, and generating candidate drug compounds using the creator module 151 until a certain number (e.g., 10,000) of valid candidate drug compounds are obtained.
- the FCD between the subset of the drug compounds and the candidate drug compounds may be determined.
- the FCD may consider chemically and biologically relevant information about drug compounds, and also measure the diversity of the set via the distribution of generated candidate drug compounds.
- the FCD may detect if generated candidate drug compounds are diverse, and the FCD may detect if generated candidate drug compounds have similar chemical and biological properties as real drug compounds.
- the internal diversity score may assess the chemical diversity within a set of generated candidate drug compounds, for example as IntDiv_p(G) = 1 − ((1/|G|^2) Σ_{m1,m2 ∈ G} T(m1, m2)^p)^(1/p), where T(m1, m2) is the Tanimoto similarity between molecule 1, m1, and molecule 2, m2, variable G is the set of candidate drug compounds, and variable p is the set number of groups being tested. By contrast, the similarity-to-nearest-neighbor (SNN) measure relates to external diversity.
- the internal diversity score may consider dissimilarity between generated candidate drug compounds.
- the internal diversity score may be used to detect mode collapse in certain generative models. For example, mode collapse may occur when the generative model produces a limited variety of candidate drug compounds while ignoring some areas of a design space. A higher score for the internal diversity corresponds to higher diversity in the set of candidate drug compounds generated.
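- The Tanimoto-based internal diversity score can be sketched with set-valued fingerprints (a simplification of real bit-vector fingerprints; p = 1 here):

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(fp1 & fp2) / len(fp1 | fp2)

def internal_diversity(fingerprints, p=1):
    """IntDiv_p = 1 - (mean over all ordered pairs of T(m1, m2)^p)^(1/p);
    higher values indicate a more chemically diverse set."""
    n = len(fingerprints)
    total = sum(tanimoto(a, b) ** p
                for a in fingerprints for b in fingerprints)
    return 1.0 - (total / (n * n)) ** (1.0 / p)

identical = [{1, 2, 3}] * 3          # every candidate has the same fingerprint
varied = [{1, 2}, {3, 4}, {5, 6}]    # pairwise-disjoint fingerprints
print(internal_diversity(identical))  # 0.0 — the mode-collapse signature
print(internal_diversity(varied))     # ≈ 0.667 — diverse set
```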
- the KL divergence score may be determined by calculating physiochemical descriptors for both the candidate drug compounds and the real drug compounds. Further, a determination may be made of the distribution of maximum nearest neighbor similarities on fingerprints (e.g., extended connectivity fingerprint of up to four bonds (ECFP4)) for both the candidate drug compounds and the real drug compounds. The distribution of these descriptors may be determined via kernel density estimation for continuous descriptors, or as a histogram for discrete descriptors.
- the KL divergence D_KL,i may be determined for each descriptor i, and aggregated to determine the KL divergence score S via S = (1/k) Σ_i exp(−D_KL,i), where k is the number of descriptors.
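One way to compute and aggregate the per-descriptor divergences is sketched below on discrete histograms, assuming the common convention of averaging exp(−D_KL,i), so that identical distributions score 1:

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Discrete KL divergence D_KL(P || Q) over aligned histogram bins.
    A small epsilon guards against empty bins in Q."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def kl_score(divergences):
    """Aggregate per-descriptor divergences into a [0, 1] score:
    S = (1/k) * sum_i exp(-D_KL_i); identical distributions score 1."""
    return sum(math.exp(-d) for d in divergences) / len(divergences)

p = [0.5, 0.5]
print(kl_divergence(p, p))   # → 0.0
print(kl_score([0.0, 0.0]))  # → 1.0
```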
- the isomer capability score may be determined by whether molecules may be generated that correspond to a target molecular formula (for example C7H8N2O2).
- the isomers for a given molecular formula can in principle be enumerated, but except for small molecules this number will in general be very large.
- the isomer capability score represents fully-determined tasks that assess the flexibility of the creator module to generate molecules following a simple pattern (which is a priori unknown).
- a second type of benchmark may include a goal-directed benchmark.
- the goal-direct benchmark may evaluate whether the creator module 151 generates a best possible candidate drug compound to satisfy a pre-defined goal (e.g., activity level in a design space).
- a resulting benchmark score may be calculated as a weighted average of the candidate drug compound scores.
- the candidate drug compounds with the best benchmark scores may be assigned a larger weight.
- generative models of the creator module 151 may be tuned to deliver a few candidate drug compounds with top scores, while also generating candidate drug compounds with satisfactory scores.
- the resulting benchmark score may be calculated as the mean of these average scores.
- the resulting benchmark score may be a combination of the top-1, top-10, and top-100 scores, in which the resulting benchmark score is determined by averaging the top-1 score, the mean of the top-10 scores, and the mean of the top-100 scores.
- s is an n-dimensional (e.g., 100-dimensional) vector of candidate drug compound scores s_i, 1 ≤ i ≤ 100, sorted in decreasing order (i.e., s_i ≥ s_j for i < j).
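The combination of top-1, top-10, and top-100 averages described above can be sketched as:

```python
def benchmark_score(scores):
    """Combine top-1, top-10, and top-100 averages into one benchmark score,
    assuming `scores` holds at least 100 candidate scores."""
    s = sorted(scores, reverse=True)
    top = lambda k: sum(s[:k]) / k
    return (top(1) + top(10) + top(100)) / 3.0

# a uniform list of scores collapses all three averages to the same value
print(benchmark_score([0.5] * 100))  # → 0.5
```

Because a single outstanding compound only lifts the top-1 term, a generator is rewarded both for a few top-scoring candidates and for satisfactory scores across the broader set, as described above.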
- the goal-directed benchmark may include generating a score for an ability of the creator module 151 to generate candidate drug compounds similar to a real drug compound, a score for an ability of the creator module 151 to rediscover the potential viability of previously-known drug compounds (e.g., using a drug which is prescribed for certain conditions for a new condition or disease), and the like.
- the similarity score may be determined using nearest neighbor scoring, fragment similarity scoring, scaffold similarity scoring, SMARTS scoring, and the like.
- Nearest neighbor scoring (e.g., NNS(G, R)) may be used.
- the score corresponds to the Tanimoto similarity when considering the fingerprint r and may be determined by the following relationship:
- NNS(G, R) = (1/|G|) Σ_{m_G ∈ G} max_{m_R ∈ R} T(m_G, m_R)
- m_R and m_G are representations of the real drug compounds (R) and the candidate drug compounds (G) as bit strings (e.g., digital fingerprints, e.g., outputs of hash functions, etc.).
- the resulting score reflects how similar candidate drug compounds are to real drug compounds in terms of chemical structures encoded in these fingerprints.
- Morgan fingerprints may be used with a radius of a configurable value (e.g., 2) and an encoding with a configurable number of bits (e.g., 1024). The radius and encoding bits may be configured to produce desirable results in a biochemical space.
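A sketch of the NNS relationship above, assuming fingerprints are represented as Python sets of "on" bit indices (an illustrative simplification of Morgan bit vectors):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def nearest_neighbor_score(gen_fps, real_fps):
    """NNS(G, R): average, over generated compounds, of the Tanimoto
    similarity to the most similar real compound."""
    return sum(max(tanimoto(g, r) for r in real_fps) for g in gen_fps) / len(gen_fps)

real = [{1, 2, 3}, {4, 5}]
gen = [{1, 2, 3}, {9}]   # one exact match, one compound unlike any real one
print(nearest_neighbor_score(gen, real))  # → 0.5
```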
- the similarity score may be determined using fragment similarity scoring, which itself may be defined as the cosine distance between vectors of fragment frequencies.
- the distance is determined as Frag(G, R) = cos(f_G, f_R), where f_G and f_R are the vectors of fragment frequencies for the candidate and real drug compounds, respectively.
- Candidate drug compounds and real drug compounds may be fragmented using any suitable decomposition algorithm.
- the fragment similarity scoring score represents the similarity of the set of candidate drug compounds and the set of real drug compounds at the level of chemical fragments.
- the similarity score may be determined using scaffold similarity scoring, which may be determined in a similar way to the fragment similarity scoring.
- the scaffold similarity scoring may be determined as a cosine similarity between the vectors S G and S R that represent frequencies of scaffolds in a set of candidate drug compounds (G) and a set of real drug compound (R).
- the scaffold similarity scoring score may be determined by the relationship Scaf(G, R) = cos(S_G, S_R).
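Both the fragment and scaffold scores reduce to a cosine similarity between frequency vectors. A sketch, with frequencies held in dictionaries keyed by a hypothetical canonical fragment or scaffold identifier:

```python
import math

def cosine_similarity(freq_g, freq_r):
    """Cosine similarity between two frequency dictionaries (e.g., fragment
    or scaffold counts keyed by a canonical identifier)."""
    keys = set(freq_g) | set(freq_r)
    dot = sum(freq_g.get(k, 0) * freq_r.get(k, 0) for k in keys)
    norm_g = math.sqrt(sum(v * v for v in freq_g.values()))
    norm_r = math.sqrt(sum(v * v for v in freq_r.values()))
    if norm_g == 0 or norm_r == 0:
        return 0.0
    return dot / (norm_g * norm_r)

# identical fragment frequencies -> similarity of 1
print(round(cosine_similarity({"frag_a": 3, "frag_b": 1},
                              {"frag_a": 3, "frag_b": 1}), 6))  # → 1.0
```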
- the similarity score may be determined using SMARTS scoring.
- SMARTS scoring may be implemented according to the relationship SMARTS(s, b).
- the SMARTS scoring may evaluate whether the SMARTS pattern s is present in a candidate drug compound.
- b is a Boolean value indicating whether the SMARTS pattern should be present (true) or absent (false).
- a score of 1 is returned if the SMARTS pattern is found. If the pattern is not found, then a score of 0, for false, is returned.
- a goal-directed benchmark may include determining a rediscovery score for the creator module 151 .
- certain real drug compounds may be removed from the training dataset and the creator module 151 may be retrained using the modified training set lacking the removed real drug compounds. If the creator module 151 is able to generate (“rediscover”) a candidate drug compound that is identical or substantially similar to the removed real drug compounds, then a high rediscovery score may be assigned. Such a technique may be used to validate the creator module 151 is effectively trained or tuned.
- a Gaussian modifier may be implemented to target a specific value of some property, while giving high scores when the underlying value is close to the target. It may be adjustable as desired.
- a minimum Gaussian modifier may correspond to the right half of a Gaussian function and values smaller than a threshold may be given a full score, while values larger than the threshold decrease continuously to zero.
- a maximum Gaussian modifier may correspond to a left half of the Gaussian function and values larger than the threshold are given a full score, while values smaller than the threshold decrease continuously to zero.
- a threshold modifier may attribute a full score to values above a given threshold, while values smaller than the threshold decrease linearly to zero.
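The three score modifiers described above can be sketched directly from their definitions:

```python
import math

def gaussian(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def min_gaussian_modifier(x, threshold, sigma):
    """Full score at or below the threshold; the right half of a
    Gaussian (decaying toward zero) above it."""
    return 1.0 if x <= threshold else gaussian(x, threshold, sigma)

def max_gaussian_modifier(x, threshold, sigma):
    """Full score at or above the threshold; the left half of a
    Gaussian (decaying toward zero) below it."""
    return 1.0 if x >= threshold else gaussian(x, threshold, sigma)

def threshold_modifier(x, threshold):
    """Full score at or above the threshold; linear decrease to zero below it."""
    return 1.0 if x >= threshold else max(0.0, x / threshold)

print(min_gaussian_modifier(1.0, 2.0, 1.0))  # → 1.0
print(max_gaussian_modifier(3.0, 2.0, 1.0))  # → 1.0
print(threshold_modifier(1.0, 2.0))          # → 0.5
```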
- the competing generative models may include a random sampling, best of dataset method, SMILES genetic algorithm (GA), graph GA, graph Monte-Carlo tree search (MCTS), SMILES long short-term memory (LSTM), character-level recurrent neural networks (CharRNN), variational autoencoder, adversarial autoencoder, Latent generative adversarial network (LatentGAN), junction tree variational autoencoder (JT-VAE), and objective-reinforced generative adversarial network (ORGAN).
- this baseline samples at random the requested number of molecules (candidate drug compounds) from the dataset. Random sampling may provide a lower bound for the goal-directed benchmarks, because no optimization is performed to obtain the returned molecules. Random sampling may provide an upper bound for the distribution learning benchmarks, because the molecules returned may be taken directly from the original distribution.
- one goal of de novo molecular design is to explore unknown parts of the biochemical space, generating new candidate drug compounds with better properties than the drug compounds already known.
- the best of dataset baseline scores the entire dataset, including the candidate drug compounds, with a provided scoring function and returns the highest scoring molecules. This effectively provides a lower bound for the goal-directed benchmarks, which the creator module 151 should exceed by creating better candidate drug compounds than the real or candidate drug compounds provided.
- this technique may evolve string molecular representations using mutations exploiting the SMILES context-free grammar.
- each molecule may be represented by a certain number (e.g., 300) of genes.
- an offspring of a certain number (e.g., 600) of new molecules may be generated by randomly mutating the population molecules.
- these new molecules may be merged with the current population and a new generation is chosen by selecting the top scoring molecules overall. This process may be repeated a certain number of times (e.g., 1000) or until progress has stopped for a certain number (e.g., 5) of consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.
- this GA involves molecule evolution at the graph level. For each goal-directed benchmark a certain number (e.g., 100) of highest scoring molecules in the dataset are selected as the initial population. During each epoch, a mating pool of a certain number (e.g., 200) of molecules is sampled with replacement from the population, using scores as weights. This pool may contain many repeated molecules if their score is high. A new population of a certain number (e.g., 100) is then generated by iteratively choosing two molecules at random from the mating pool and applying a crossover operation. With probability of, e.g., 0.5 (i.e., 100/200), a mutation is also applied to the offspring molecule. This process is repeated a certain number (e.g., 1000) of times or until progress has stopped for a certain number (e.g., 5) of consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.
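The GA loop described above (weighted mating-pool sampling, crossover, optional mutation, elitist selection) can be sketched on toy string "molecules"; the target string, alphabet, and scoring function below are illustrative stand-ins, not the patent's molecular representation:

```python
import random

random.seed(0)

TARGET = "ACDEG"     # hypothetical target "molecule" for illustration
ALPHABET = "ACDEFG"

def score(mol):
    """Toy scoring function: fraction of positions matching the target."""
    return sum(a == b for a, b in zip(mol, TARGET)) / len(TARGET)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(mol):
    i = random.randrange(len(mol))
    return mol[:i] + random.choice(ALPHABET) + mol[i + 1:]

# initial population standing in for the 100 highest scoring dataset molecules
population = ["".join(random.choice(ALPHABET) for _ in range(5)) for _ in range(100)]
for _ in range(50):
    # mating pool of 200, sampled with replacement using scores as weights
    pool = random.choices(population, weights=[score(m) + 1e-6 for m in population], k=200)
    offspring = []
    for _ in range(100):
        child = crossover(random.choice(pool), random.choice(pool))
        if random.random() < 0.5:  # mutation applied with probability 0.5
            child = mutate(child)
        offspring.append(child)
    # new generation: top scoring molecules overall
    population = sorted(population + offspring, key=score, reverse=True)[:100]

print(max(score(m) for m in population))  # best score approaches 1.0
```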
- the statistics used during sampling may be computed on the training dataset. For this baseline, no initial population is selected for the goal-directed benchmarks. Each new molecule may be generated by running a certain number (e.g., 40) of simulations, starting from a base molecule. At each step, a certain number (e.g., 25) of children are considered and the sampling stops when reaching a certain number (e.g., 60) of atoms. The best-scoring molecule found during the sampling may be returned. A population of a certain number (e.g., 100) of molecules is generated at each epoch.
- This process may be repeated a certain number (e.g., 1000) of times or until progress has stopped for a certain number (e.g., 5) of consecutive epochs.
- the generation starts from a base molecule and a new molecule is generated with the same parameters.
- compared to the goal-directed benchmarks, the only difference is that no scoring function is provided, so the first molecule to reach a terminal state is returned instead of the highest scoring molecule.
- the technique is a baseline model, consisting of an LSTM neural network which predicts the next character of partial SMILES strings.
- a SMILES LSTM may be used with 3 layers and a hidden size of 1024.
- during each of a certain number (e.g., 20) of fine-tuning epochs, the model may generate a certain number (e.g., 8192) of molecules, and a certain number (e.g., 1024) of the top scoring molecules may be used to fine-tune the model parameters.
- the model may generate the requested number of molecules.
- the character-level recurrent neural network (CharRNN) technique treats the task of generating SMILES as a language model attempting to learn the statistical structure of SMILES syntax by training on a large corpus of SMILES strings.
- the CharRNN parameters may be optimized using maximum likelihood estimation (MLE).
- CharRNN may be implemented using LSTM RNN cells stacked into a certain number of layers (e.g., 3 layers) with a certain number of hidden dimensions (e.g., 600 hidden dimensions).
- a variational autoencoder (VAE) may map a higher-dimensional data representation (e.g., a vector) into a lower-dimensional space.
- the lower-dimensional space is called the latent space, which is often a continuous vector space with normally distributed latent representation.
- the latent representation of the data may contain all the important information needed to represent an original data point.
- the latent representation represents the features of the original data point. In other words, one or more machine learning models may learn the data features of the original data point and simplify its representation to make it more efficient to analyze.
- VAE parameters may be optimized to encode and decode data by minimizing the reconstruction loss while also minimizing a KL-divergence term arising from the variational approximation, such that the KL-divergence term may loosely be interpreted as a regularization term. Since molecules are discrete objects, a properly trained VAE defines an invertible continuous representation of a molecule.
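The VAE objective described above (reconstruction loss plus a KL-divergence regularizer) can be sketched for a diagonal-Gaussian posterior. The closed-form KL term against a standard-normal prior is standard; the squared-error reconstruction term is an illustrative choice:

```python
import math

def vae_loss(x, x_recon, mu, log_var):
    """Reconstruction loss (squared error) plus the KL divergence of a
    diagonal Gaussian q(z|x) = N(mu, sigma^2) from the prior N(0, 1):
    KL = -1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2)."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_recon))
    kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + kl

# perfect reconstruction with a posterior equal to the prior gives zero loss
print(vae_loss([1.0, 2.0], [1.0, 2.0], mu=[0.0, 0.0], log_var=[0.0, 0.0]))  # → 0.0
```

Driving `mu` away from zero (or `log_var` away from zero) increases only the KL term, which is why it acts as a regularizer keeping the latent space close to the prior.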
- the encoder may implement a bidirectional Gated Recurrent Unit (GRU) with a linear output layer.
- the decoder may be a 3-layer GRU RNN of 512 hidden dimensions with intermediate dropout layers, the layers having a dropout probability of 0.2.
- Training may be performed with a batch size of a certain number (e.g., 128), utilizing a gradient clipping of 50 and a KL-term weight of 1, and further optimized with a learning rate of 0.0003 across 50 epochs. Other training parameters may be used to perform the embodiments disclosed herein.
- in an adversarial autoencoder (AAE), unlike in a VAE, the KL divergence term is avoided by training a discriminator network to predict whether a given sample came from the latent space of the autoencoder (AE) or from a prior distribution. Parameters may be optimized to minimize the reconstruction loss and to minimize the discriminator loss.
- the AAE model may consist of an encoder with a 1-layer bidirectional LSTM with 380 hidden dimensions, a decoder with a 2-layer LSTM with 640 hidden dimensions and a shared embedding of size 32.
- the latent space may have 640 dimensions.
- the discriminator network is a 2-layer fully connected neural network with 640 and 256 nodes, respectively, utilizing the ELU activation function. Training may be performed with a batch size of 128, with an optimizer using a learning rate of 0.001 across 25 epochs. Other training parameters may be used to perform the embodiments disclosed herein.
- the LatentGAN technique encodes SMILES strings into latent vector representations of size 512.
- a Wasserstein Generative Adversarial network with Gradient Penalty may be trained to generate latent vectors resembling that of the training set, which are then decoded using a heteroencoder.
- ORGAN (objective-reinforced generative adversarial network) is a sequence-generation model based on adversarial training that aims at generating discrete sequences that emulate a data distribution, while using reinforcement learning to bias the generation process towards some desired objective rewards.
- ORGAN incorporates at least 2 networks: a generator network and a discriminator network.
- the goal of the generator network is to create candidate drug compounds indistinguishable from the empirical data distribution of real drug compounds.
- the discriminator learns to distinguish candidate drug compounds from real data samples. Both models are trained in alternation.
- the gradient must be back-propagated between the generator and discriminator networks.
- Reinforcement uses an N-depth Monte Carlo tree search, and the reward is a weighted sum of probabilities from the discriminator and objective reward.
- Both the generator and discriminator may be pre-trained for 250 and 50 epochs, respectively, and then jointly trained for 100 epochs utilizing an optimizer with a learning rate of 0.0001.
- the learning rate may refer to a hyperparameter of a neural network, and the learning rate may be a number that determines an amount of change (e.g., weights, hidden layers, etc.) to make to a machine learning model in response to an estimated error.
- Bayesian optimization may be used to determine the optimal learning rate during training of a particular neural network.
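The role of the learning rate described above can be illustrated with a plain gradient-descent update (an illustrative rate of 0.1 is used here rather than the 0.0001 mentioned above, purely to keep the arithmetic readable):

```python
def gradient_step(weights, grads, learning_rate):
    """One gradient-descent update: each weight moves against its gradient,
    scaled by the learning rate (the amount of change per step)."""
    return [w - learning_rate * g for w, g in zip(weights, grads)]

w = [1.0, -2.0]
g = [10.0, -10.0]
print(gradient_step(w, g, 0.1))  # → [0.0, -1.0]
```

A smaller learning rate makes each update more conservative (slower but more stable convergence); a larger one risks overshooting the minimum, which is why the rate is a natural target for Bayesian optimization.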
- validity and uniqueness of candidate drug compounds may be used as rewards.
- the scientist module 153 may also include one or more machine learning models trained to perform causal inference using counterfactuals.
- the causal inference, as described herein, may be used to determine whether the creator module 151 actually generated a candidate drug compound, including a desired activity in such candidate, or whether the result arose because of noisy data (e.g., scarce or incorrect data).
- FIG. 1C illustrates first components of an architecture of the creator module 151 according to certain embodiments of this disclosure.
- a candidate design space 156 and data 157 may be included in the biological context representation 200; the space 156 and data 157 include the various sequences of the candidate drug compounds or real drug compounds.
- the creator module 151 may populate the candidate design space 156 .
- the candidate design space 156 may include a vast amount of information retrieved from numerous sources or generated by the AI engine 140 .
- the candidate design space 156 may include information pertaining to antimicrobial peptides, anticancer peptides, peptidomimetics, uProteins and aCRFs, non-ribosomal peptides, and general peptides that are retrieved via genomic screening, literature research, or computationally designed using the AI engine 140 .
- the candidate design space 156 may be updated each time the creator module 151 generates a new candidate drug compound.
- the candidate design space 156 may also be updated continuously or continually as new literature is published or genomic screenings are performed.
- the creator module 151 may also use data 157 to generate the candidate drug compounds.
- the data 157 may be generated or provided by the descriptor module 152 .
- the data may be received from any suitable source.
- the data may include molecular information pertaining to chemistry/biochemistry, targets, networks, cells, clinical trials, and markets (e.g., analyses, results, etc.) resulting from performing simulations or experiments.
- the creator module 151 may encode the candidate design space 156 and the data 157 into various encodings.
- an attention message-passing neural network may be used to encode molecular graphs.
- An initial set of states may be constructed, one for each node in a molecular graph. Then, each node may be allowed to exchange information, to “message” with its neighboring nodes. Each message may be a vector describing an atom of a molecule from the atom's perspective in the molecule. After one such step, each node state will contain an awareness of its immediate neighborhood. Repeating the step makes each node aware of its second-order neighborhood, and so forth.
- an attention layer may be used to identify interesting features of a molecule.
- a message that occurs more than a threshold number of times may be weighted more heavily than a message that occurs fewer than the threshold number of times. Any suitable weighting may be configured to cause a message to stand out more.
- the attention mechanism may aggregate the messages with their weights.
- the techniques may be able to scale to remain computationally efficient as the number of messages increases.
- Such a technique may be beneficial because it reduces resource (e.g., processing, memory) consumption when performing computations with a large design space, including information in that design space pertaining to structure, semantic, sequence, physiochemical properties, etc.
- M_t is the message function, a_t is the attention function, and U_t is the node update function.
- N(v) is the set of neighbors of node v in graph G.
- h_v^(t) is the hidden state of node v at time t, and m_v^(t) is a corresponding message vector.
- at each time step, messages are passed to a node from its neighbors and aggregated as the message vector m_v^(t) from its surrounding environment. Then the hidden state h_v^(t) is updated by the message vector.
- ŷ is a resulting fixed-length feature vector generated for the graph, and R is a readout function invariant to node ordering, a feature allowing the MPNN framework to be invariant to graph isomorphism.
- the graph feature vector ŷ is then passed to a fully connected layer to give a prediction. All functions M_t, U_t, and R are neural networks, and their weights are learned during training.
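The message-passing steps above can be sketched on a toy graph. This sketch replaces the learned message, update, attention, and readout functions with plain sums, so it shows only the information flow, not a trained MPNN:

```python
def message_passing(adjacency, hidden, steps=2):
    """Toy message passing: at each step, every node aggregates (sums) the
    hidden states of its neighbors and adds the result to its own state.
    `adjacency` maps node -> list of neighbors; `hidden` maps node -> float."""
    for _ in range(steps):
        messages = {
            v: sum(hidden[w] for w in neighbors)   # aggregate neighbor states
            for v, neighbors in adjacency.items()
        }
        hidden = {v: hidden[v] + messages[v] for v in hidden}  # node update
    return hidden

def readout(hidden):
    """Order-invariant readout: sum of final node states."""
    return sum(hidden.values())

# a 3-node path graph: a - b - c
adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
hidden = {"a": 1.0, "b": 0.0, "c": 2.0}
print(readout(message_passing(adjacency, hidden)))  # → 15.0
```

After one step each node reflects its immediate neighborhood; after two steps, its second-order neighborhood, matching the description above.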
- a “Candidates Only Data” encoding 158 may encode just the information from the candidate design space
- a “Candidates and Simulated Data” encoding 159 may encode information from the candidate design space 156 and the simulated data from the data 157
- a “Candidates with All Data” encoding 160 may encode information from the candidate design space 156 and both the simulated and experimental data from the data 157
- a “Heterologous Networks” encoding 161 may be generated using the “Candidates with All Data” encoding 160 .
- the encodings 158 , 159 , 160 , and 161 may include information pertaining to molecular structure, physiochemical properties, semantics, and so forth.
- Each of the encodings 158 , 159 , 160 , and 161 may be input into a separate machine learning model trained to generate an embedding.
- ML Model A, ML Model B, ML Model C, and ML Model D may be included in a “Single Candidate Embedding” Layer.
- “Candidates Only Data” encoding 158 may be input into ML Model A, which outputs a “Candidate Embedding” 162 .
- “Candidates and Simulated Data” encoding 159 may be input into ML Model B, which outputs a “Candidate and Simulated Data Embedding” 163 .
- “Candidates with All Data” encoding 160 may be input into ML Model C, which outputs “Candidate with All Data Embedding” 164 .
- “Heterologous Networks” encoding 161 may be input into ML Model D, which outputs “Graph and Network Embedding” 165 .
- the embeddings 162 , 163 , 164 , and 165 may represent information pertaining to a single candidate drug compound.
- FIG. 1D illustrates second components of the architecture of the creator module 151 according to certain embodiments of this disclosure.
- the encodings 158 , 159 , 160 , and 161 are input into ML Model F, which is trained to output a candidate drug compound based on the encodings 158 , 159 , 160 , and 161 .
- the embeddings 162 , 163 , 164 , and 165 are input into ML Model G, which is trained to output a candidate drug compound based on the embeddings 162 , 163 , 164 , and 165 .
- the “Heterologous Networks” 161 may be input into ML Model I, which is trained to output a candidate drug compound based on the “Heterologous Networks” 161 .
- the embeddings 162 , 163 , 164 , and 165 are also input into ML Model E in a “Knowledge Landscape Embedding” layer 167 .
- the ML Model E is trained to output a “Latent Representation” based on the embeddings 162 , 163 , 164 , and 165 .
- the “Latent Representation” 168 may include an “Activity Landscape” 169 and a “Continuous Representation” 170 .
- the “Continuous Representation” 170 may include information (e.g., structural, semantic, etc.) pertaining to all of the molecules (e.g., real drug compounds and candidate drug compounds), and the “Activity Landscape” 169 may include activity information for all of the molecules.
- the ML Model E may be a variational autoencoder that receives the embeddings 162 , 163 , 164 , and 165 and outputs lower-dimensional embeddings that are machine-readable and less computationally expensive for processing. The lower-dimensional embeddings may be used to generate the “Latent Representation” 168 .
- An architecture of the variational autoencoder is described further below with reference to FIG. 1E .
- the “Latent Representation” 168 is input into the ML Model H.
- ML Model H may be any suitable type of machine learning model described herein.
- ML Model H may be trained to analyze the “Latent Representation” 168 and generate a candidate drug compound.
- the “Latent Representation” 168 may include multiple dimensions (e.g., tens, hundreds, thousands) and may have a particular shape. The shape may be rectangular, cube, cuboid, spherical, an amorphous blob, conical, or any suitable shape having any number of dimensions.
- the ML Model H may be a generative adversarial network, as described herein.
- the ML Model H may determine a shape of the “Latent Representation” 168 and may determine an area of the shape from which to obtain a slice based on “interesting” aspects of that area.
- An interesting aspect may be a peak, valley, a flat portion, or any combination thereof.
- the ML Model H may use an attention mechanism to determine what is “interesting” and what is not.
- the interesting aspect may be indicative of a desirable feature, such as a desirable activity for a particular disease or medical condition.
- the slice may include a combination of a portion of any of the information included in the “Latent Representation” 168 , such as the structural information, physiochemical properties, semantic information, and so forth.
- the information included in the slice may be represented as an eigenvector that includes any number of dimensions from the “Latent Representation” 168 .
- the terms “slice” and “candidate drug compound” may be used interchangeably.
- the slice may be visually presented on a display screen, as shown in FIG. 8A .
- a decoder may be used to transform the slice from the lower-dimensional vector to a higher-dimensional vector, which may be analyzed to determine what information is included in that slice. For example, the decoder may obtain a set of coordinates from the higher-dimensional vector which may be back-calculated to determine what information (e.g., structural, physiochemical, semantic, etc.) they represent.
- Each of the candidate drug compounds generated by the ML Model F, ML Model G, ML Model H, and ML Model I may be ranked and one of the candidate drug compounds may be classified as a selected candidate drug compound, as described herein. Further, the candidate drug compounds may be input into one or more machine learning models trained to perform benchmark analysis, as described herein. Based on the benchmark analysis, any of the machine learning models in the creator module 151 may be optimized (e.g., tuning weights, adding or removing hidden layers, changing an activation function, etc.) to modify a parameter (e.g., uniqueness, validity, novelty, etc.) score for the machine learning models when generating subsequent candidate drug compounds.
- FIG. 1E illustrates an architecture of a variational autoencoder machine learning model according to certain embodiments of this disclosure.
- the variational autoencoder may include an input layer, an encoder layer, a latent layer, a decoder layer, and an output layer.
- the input layer may receive fingerprints of drug compounds or candidate drug compounds represented as higher-dimensional vectors, as well as associated drug concentration(s).
- the encoder layer may include one or more hidden layers, activation functions, and the like.
- the encoder layer may receive the fingerprint and drug concentration from the input layer and may perform operations to translate the higher-dimensional vectors into lower-dimensional vectors, as described herein.
- the latent layer may receive the lower-dimensional vectors and represent them in the “Latent Representation” 168 .
- the latent layer may input the “Latent Representation” 168 into the ML Model H, which is a generative adversarial network including a generator and a discriminator, as described herein.
- the architecture of the generator and the discriminator is discussed further below with reference to FIG. 1F .
- the generator generates candidate drug compounds, and the discriminator analyzes the candidate drug compounds to determine whether they are valid or not.
- the GI in FIG. 1F may refer to a general inference layer and the GI layer may generate the candidate drug compounds.
- the candidate drug compounds output by the latent layer may be input into the decoder layer where the lower-dimensional vectors are translated back into the higher-dimensional vectors.
- the decoder layer may include one or more hidden layers, activation functions, and the like.
- the decoder layer may output the fingerprints and the drug concentration.
- the output fingerprint and drug concentration may be analyzed to determine how closely they match the input fingerprint and drug concentration. If the output and input substantially match, the variational autoencoder may be properly trained. If the output and the input do not substantially match, one or more layers of the variational autoencoder may be tuned (e.g., modify weights, add or remove hidden layers).
- FIG. 1F illustrates an architecture of a generative adversarial network used to generate candidate drugs according to certain embodiments of this disclosure. As depicted, there is an architecture for the discriminator, discriminator residual block, generator, and generator residual block.
- the discriminator architecture may receive a sequence (e.g., candidate drug compound) as an input.
- the discriminator architecture may include an arrangement of blocks in a particular order that improves computational efficiency when processing the sequence to determine whether the sequence is valid or not.
- the particular order of blocks includes a first residual block, a self-attention block, a second residual block, a third residual block, a fourth residual block, a fifth residual block, and a sixth residual block.
- the discriminator may output a score (e.g., 0 or 1) for whether the received sequence is valid or not.
- the discriminator residual block architecture may receive an input filtered into two processing pathways.
- a first processing pathway performs a convolution operation on the input.
- the second processing pathway performs several operations, including a convolution operation, a batch normalization operation, a leaky rectified linear unit (leaky ReLU) operation, another convolution operation, and another batch normalization operation.
- the leaky ReLU operation may perform a threshold operation, where any input value less than zero is multiplied by a fixed scalar, for example.
- the output from the first and second processing pathways is summed and then output.
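The leaky ReLU and the two-pathway summation described above can be sketched as follows; the `transform` argument is an illustrative stand-in for the convolution/batch-normalization pathway:

```python
def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: inputs below zero are scaled by a fixed factor
    instead of being clamped to zero."""
    return x if x >= 0 else negative_slope * x

def residual_block(inputs, transform):
    """Two-pathway residual block: the first (identity-like) pathway is
    summed elementwise with a transformed version of the same input."""
    return [x + t for x, t in zip(inputs, transform(inputs))]

# the transform stands in for the conv / batch-norm / leaky-ReLU pathway
out = residual_block([1.0, -2.0], lambda xs: [leaky_relu(x) for x in xs])
print([round(v, 4) for v in out])  # → [2.0, -2.02]
```

The residual (summed) connection lets gradients flow around the transform pathway, which is what makes deep stacks of such blocks trainable.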
- the generator architecture may receive a noise (e.g., biological context representation 200 ) as an input.
- the generator architecture may include an arrangement of blocks in a particular order that improves computational efficiency when processing the noise to generate a sequence (e.g., candidate drug compound).
- the particular order of blocks includes a first residual block, a second residual block, a third residual block, a fourth residual block, a fifth residual block, a self-attention block, and a sixth residual block.
- the generator may output a generated sequence (e.g., a candidate drug compound).
- the generator residual block architecture may receive an input filtered into two processing pathways.
- a first processing pathway performs a deconvolution operation on the input.
- the second processing pathway performs several operations, including a convolution operation, a batch normalization operation, a leaky ReLU operation, a deconvolution operation, and another batch normalization operation.
- the output from the first and second processing pathways is summed and then output.
- FIG. 1G illustrates types of encodings to represent certain types of drug information according to certain embodiments of this disclosure.
- a table 180 includes three columns labeled “Encoding”, “Compressed?”, and “Information”.
- the “Encoding” column includes rows storing a type of encoding used to represent a certain type of information; the “Compressed?” column includes rows storing an indication of whether the encoding in that row is compressed; and the “Information” column includes rows storing a type of information represented by the encoding in each respective row.
- the descriptor module 152 may include a machine learning module trained to analyze a candidate drug compound and identify various structural properties, physiochemical properties, and the like.
- the descriptor module 152 may be trained to represent the type of structural and physiochemical properties using an encoding that increases computational efficiency and to store a description including the encodings at a node representing the candidate drug compound. During processing, the encodings may be aggregated for each candidate drug compound.
- For example, using an alphanumeric string, SMILES (Simplified Molecular Input Line Entry System) encoding spells out a molecular structure from a beginning portion to an ending portion.
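For instance, the SMILES string for ethanol is "CCO". A simple (hypothetical) way to turn such a string into a numeric encoding is to one-hot encode each character against a character vocabulary; the vocabulary below is invented for illustration:

```python
import numpy as np

smiles = "CCO"  # ethanol, written in SMILES notation
vocab = sorted(set("CNOclnos()=#123"))  # toy character vocabulary

def one_hot_smiles(s, vocab):
    # Each character becomes one row with a single 1 in the column
    # for that character; the full matrix encodes the string.
    index = {ch: i for i, ch in enumerate(vocab)}
    mat = np.zeros((len(s), len(vocab)))
    for row, ch in enumerate(s):
        mat[row, index[ch]] = 1.0
    return mat

enc = one_hot_smiles(smiles, vocab)
```

Each row of `enc` sums to one, so the matrix shape directly reflects the string length and vocabulary size.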
- Morgan Fingerprints may be useful for temporal molecular structures, and the descriptor module 152 may include a machine learning module trained to output a compressed vector. Morgan Fingerprints may include the isomer for a particular molecule and common backbone structures for molecules.
- SMILES, Morgan Fingerprints, InChI, One-Hot, N-gram, Graph-based Graphic Processing Unit Nearest Neighbor Search (GGNN), Gene regulatory network (GRN), Message-Passing Neural Network (MPNN), and Knowledge Graph (Structural/Semantic) encodings represent structural information of molecules (drug compounds).
- the Morgan Fingerprints, GGNN, GRN, and MPNN encodings are also compressed to improve computations, while the SMILES, InChI, One-Hot, N-gram, and Knowledge Graph encodings are not compressed.
- Quantitative structure-activity relationship (QSAR) encodings may represent physiochemical properties of molecules. These encodings may not be compressed.
- the QSAR encoding may include the type of activity the molecule provides (e.g., and without limitation to a particular physiological or anatomical organ or organs, state or states, or to a particular disease process: antiviral, antimicrobial, antifungal, antiemetic, antineoplastic, anti-inflammatory, leukotriene inhibitory, neurotransmitter inhibitory, etc.).
- the encodings selected for each type of information may optimize the computations when considering such a large design space with information pertaining to structure, physiochemical properties, and semantic information.
- the large design space referred to may include not only a string of amino acid sequences, and physiochemical properties, but also the semantic information, such as system biology and ontological information, including relationships between nodes, molecular pathways, molecular interactions, molecular family, and the like.
- FIG. 1H illustrates an example of concatenating (merging) numerous encodings into a candidate drug compound according to certain embodiments of this disclosure.
- a concatenated vector 191 may represent an embedding for a candidate drug compound.
- an ensemble learning approach may be implemented by using different types of techniques to generate unique encodings and merge those unique encodings to improve generated candidate drug compounds.
- various encoding techniques may be used to represent different types of information.
- the different types of information (e.g., structural, semantic, etc.) may be represented by unique encodings.
- molecular graphs and Morgan Fingerprints may represent structural and physical molecular information.
- Activity data may represent molecular structural knowledge or molecular physiochemical knowledge.
- a knowledge graph may represent molecular semantic knowledge.
- An attention message passing neural network (AMPNN) or long short-term memory (LSTM) may receive the molecular graph and Morgan Fingerprints as input and output the structural/physical information represented by 1s and 0s.
- One-hot encoding may receive the activity data as input and output the structural knowledge represented by 1s and 0s.
- AMPNN may receive a knowledge graph as input and output semantic knowledge represented by 1s and 0s.
- the resulting concatenated vector 191 is a combination of each type of information for a single candidate drug compound. Accordingly, the single candidate drug compound may include better properties and more robust information than conventional techniques.
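A minimal sketch of the concatenation itself, with made-up encoder outputs (the sizes and bit patterns below are placeholders, not real model outputs):

```python
import numpy as np

# Hypothetical encoder outputs for one candidate drug compound.
structural = np.array([1, 0, 1, 1])   # e.g., from an AMPNN/LSTM over a molecular graph
activity = np.array([0, 1, 0])        # e.g., from one-hot encoded activity data
semantic = np.array([1, 1, 0, 0, 1])  # e.g., from an AMPNN over a knowledge graph

# The concatenated vector combines every information type end to end,
# so a single vector carries structural, activity, and semantic knowledge.
concatenated = np.concatenate([structural, activity, semantic])
```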
- FIG. 1I illustrates an example of using a variational autoencoder (VAE) to generate a Latent Representation 168 of a candidate drug compound according to certain embodiments of this disclosure.
- the concatenated vector 191 (e.g., embedding) may be higher-dimensional prior to being input to the VAE.
- the VAE may be trained to translate the higher-dimensional concatenated vector 191 to a lower-dimensional concatenated vector that represents the Latent Representation 168 .
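The compression step can be sketched with the VAE reparameterization trick; the weights below are random placeholders (a trained VAE would learn them), and the 12-to-3 dimensions are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_logvar):
    # Toy encoder: two linear maps produce the mean and log-variance of
    # the latent distribution (real VAE encoders are deep networks).
    return x @ w_mu, x @ w_logvar

def reparameterize(mu, logvar, rng):
    # Sample the latent vector as mu + sigma * eps (the VAE trick).
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

x = rng.standard_normal(12)          # higher-dimensional concatenated vector
w_mu = rng.standard_normal((12, 3))
w_logvar = rng.standard_normal((12, 3))
mu, logvar = encode(x, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)  # lower-dimensional latent representation
```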
- FIG. 2 illustrates a data structure storing a biological context representation 200 according to certain embodiments of this disclosure.
- Biology is context-dependent and dynamic. For example, the same molecule can manifest multiple, potentially competing, phenotypes. Further, data on an existing drug labeled as antimicrobial can suggest a null behavior in applications against different microbes or even against the same microbes but in different contexts, e.g., temperature, pressure, environmental, contextual, comorbid. To accurately predict candidate drug compounds that provide desirable activity levels in design spaces, the machine learning models 132 are trained to handle evolving knowledge maps of biology and drug compounds. Further, conventional techniques for discovery and generating drug compounds may be ineffective for biological data because such data is non-Euclidean.
- the biological context representation 200 generated by the disclosed techniques may be used to graphically model the continually or continuously modifying biological and drug compound knowledge. That is, the biology may be represented as graphs within a comprehensive knowledge graph (e.g., biological context representation 200 ), where the graphs have complex relationships and interdependencies between nodes.
- the biological context representation 200 may be stored in a first data structure having a first format.
- the first format may be a graph, an array, a linked list, or any suitable data format capable of storing the biological context representation.
- FIG. 2 illustrates various types of data received from various sources, including physical properties data 202 , peptide activity data 204 , microbe data 206 , antimicrobial compound data 208 , clinical outcome data 210 , evidence-based guidelines 212 , disease association data 214 , pathway data 216 , compound data 218 , gene interaction data 220 , anti-neurodegenerative compound data 222 , or pro-neuroplasticity compound data 224 .
- the example data may be curated by the AI engine 140 or a person having a certain degree (e.g., a degree in data science, molecular biology, microbiology, etc.), certification, license (e.g., a medical doctor license (M.D. or D.O.)), or credential.
- the data in the biological context representation 200 may be retrieved from any suitable data source (e.g., digital libraries, websites, databases, files, or the like). These examples are not meant to be limiting.
- the example types of data are also not meant to be limiting and other types of data may be stored within the biological context representation without departing from the scope of this disclosure.
- the various data included in the biological context representation 200 may be linked based on one or more relationships between or among the data, in order to represent knowledge pertaining to the biological context or drug compound.
- the physical properties data 202 includes physical properties exhibited by the drug compound.
- the physical properties may refer to characteristics that provide a physical description of the drug such as color, particle size, crystalline structure, melting point, and solubility.
- the physical properties data 202 may also include chemical property data, such as the structure, form, and reactivity of a substance.
- biological data may also be included (e.g., anti-neurodegenerative compound data, pro-neuroplasticity compound data, anti-cancer data) in the biological context representation 200 .
- the peptide activity data 204 may include various types of activity exhibited by the drug.
- the activity may be hormonal, antimicrobial, immunomodulatory, cytotoxic, neurological, and the like.
- a peptide may refer to a short chain of amino acids linked by peptide bonds.
- the microbe data 206 may include information pertaining to cellular structure (e.g., unicellular, multicellular, etc.) of a microscopic organism.
- the microbes may refer to bacteria, parasites, fungi, viruses, prions, or any combination of these, etc.
- the antimicrobial compound data 208 may include information pertaining to agents that kill microbes or stop their growth. This data may include classifications based on the microorganisms against which the antimicrobial compound acts (e.g., antibiotics act against bacteria but not against viruses; antivirals act against viruses but not against bacteria). The antimicrobial compound may also be classified according to function (e.g., microbicidal, meaning “that which kills, vitiates, inactivates or otherwise impairs the activity of certain microbes”).
- the clinical outcome data 210 may include information pertaining to the administration of a drug compound to a subject in a clinical setting. For example, upon or subsequent to administration of the drug compound, the outcome may be a prevented disease, cured disease, treated symptom, etc.
- the evidence-based guidelines 212 may include information pertaining to guidelines based upon clinical studies for acceptable treatment or therapeutics for certain diseases or medical conditions.
- Evidence-based guidelines data 212 may include data specific to various specialties within healthcare such as, for example, obstetrics, anesthesiology, hepatology, gastroenterology, neurology, pulmonology, orthopedics, pediatrics, trauma care (including but not limited to burns and post-burn infections), histology, oncology, ophthalmology, endocrinology, rheumatology, internal medicine, surgery (including reconstructive (plastic) and cosmetic), vascular medicine, emergency medicine, radiology, psychiatry, cardiology, urology, gynecology, genetics, and dermatology.
- the evidence-based guidelines 212 include systematically developed statements to assist practitioner and patient decisions about appropriate health care (e.g., types of drugs to prescribe for treatment) for specific clinical circumstances.
- the disease association data 214 may include information about which disease or medical condition the drug compounds are associated with.
- the drug compound Metformin may be associated with the disease type 2 diabetes.
- the pathway data 216 may include information pertaining to the relationships or paths, within a design space, between ingredients (e.g., chemicals) and activity levels.
- the compound data 218 may include information pertaining to the compound such as the sequence of ingredients (e.g., type, amount, etc.) in the compound.
- the compound data 218 can include data specific to the various types of drug compounds that are designed, defined, developed, or distributed.
- the gene interaction data 220 may include information pertaining to which gene the drug compound or a disease may interact with.
- the anti-neurodegenerative compound data 222 may include information pertaining to characteristics of anti-neurodegenerative compounds, such as their physical and chemical properties and activities on portions of tissue.
- the activity may include anti-inflammatory or neuro-protective actions.
- the pro-neuroplasticity compound data 224 may include information pertaining to characteristics of pro-neuroplasticity compound, such as their physical and chemical properties and activities on portions of tissue. For example, the activity may enhance the capacity of motor systems by upregulation of neurotrophins.
- FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure.
- a flow diagram 300 begins with obtaining heterogeneous datasets, such as the biological context representation 200 .
- Heterogeneous datasets may refer to populations or samples of data that are different (e.g., as opposed to homogenous datasets where the data is the same).
- the heterogeneous datasets may include compound data (e.g., peptide sequence data), clinical outcome data, or activity data (in vitro and in vivo activity), as well as any other suitable data depicted in FIG. 2 .
- the data structure storing the heterogeneous datasets may be translated to a second data structure having a second format (e.g., a 2-dimensional vector) that the AI engine 140 may use to generate the candidate drug compounds.
- the next step in the flow diagram 300 includes training the one or more machine learning models 132 using the heterogeneous datasets.
- a machine learning model may use causal inference and counterfactuals when generating the set of candidate drug compounds.
- a GAN may be used in conjunction with causal inference to generate the set of candidate drug compounds.
- a certain number (e.g., over 100,000 candidate drug compounds) of novel candidate drug compounds may be generated in a set. That is, each candidate drug compound in the set of candidate drug compounds is intended to be unique.
- the next step in the flow diagram 300 includes inputting the set of candidate drug compounds into one or more machine learning models 132 trained to classify the set of candidate drug compounds.
- the machine learning models 132 may perform supervised or unsupervised filtering. In some embodiments, the machine learning models 132 may perform clustering to rank the various candidate drug compounds to classify one candidate drug compound as a selected candidate drug compound. In some embodiments, the machine learning models 132 may output a subset (e.g., 1,000 to 10,000, or more, or fewer) of candidate drug compounds.
- the next step in the flow diagram 300 may include performing experimental validation by validating whether each candidate drug compound in the subset of candidate drug compounds provides the desired level of certain types of activity in a design space.
- the results of the experimental validation may be fed back into the heterogeneous dataset to reinforce and expand the experimental dataset.
- the next step in the flow diagram 300 may include performing peptide drug optimization.
- the optimizations may include performing gradient descent or ascent using the sequence of ingredients in the candidate drug compounds to attempt to increase or decrease certain activity levels in a design space.
- the results of the peptide drug optimization may be fed back into the heterogeneous datasets to reinforce and expand the experimental dataset.
- FIG. 3B illustrates another high-level flow diagram 310 according to some embodiments.
- a heterogeneous network of biology may be included in a knowledge graph of a biological context representation 200 .
- Various paths or meta-paths may be expressed between nodes in the biological context representation 200 .
- the meta-paths may include indications for compound upregulates, pathway participates, disease associations, gene interactions, and compound data.
- the biological context representation 200 may be translated from a first format (e.g., knowledge graph) to a format (e.g., vector) that may be processed by the AI engine 140 .
- the AI engine 140 may use one or more machine learning models to traverse the knowledge graph by performing random walks until a corpus of random walks is generated, wherein such random walks include the indications associated with the meta-paths representing sequences of ingredients.
- the corpus of random walks may be referred to as a set of candidate drug compounds.
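Under the assumption of a toy triple store (the node names and relations below are invented for illustration), such a random-walk corpus might be generated like this:

```python
import random

# Miniature knowledge graph: node -> list of (relation, neighbor) edges.
graph = {
    "CompoundA": [("upregulates", "Gene1"), ("participates", "Pathway1")],
    "Gene1": [("interacts", "Gene2"), ("associates", "DiseaseX")],
    "Pathway1": [("participates", "Gene2")],
    "Gene2": [("associates", "DiseaseX")],
    "DiseaseX": [],
}

def random_walk(graph, start, steps, rng):
    # Record node/relation/node... steps, stopping early at dead ends.
    walk = [start]
    node = start
    for _ in range(steps):
        edges = graph.get(node, [])
        if not edges:
            break
        relation, node = rng.choice(edges)
        walk.extend([relation, node])
    return walk

rng = random.Random(42)
corpus = [random_walk(graph, "CompoundA", 3, rng) for _ in range(5)]
```

Each walk alternates nodes and meta-path relations, so a walk of N steps yields a sequence of 2N + 1 tokens at most.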
- a generative adversarial network using causal inference may be used to generate the set of candidate drug compounds.
- the set of candidate drug compounds may be stored in a higher-dimensional vector.
- the AI engine 140 may compress the higher-dimensional vector of the set of candidate drug compounds into a lower-dimensional vector of the set of candidate drug compounds, depicted as biological embeddings in FIG. 3B .
- the lower-dimensional vector may include fewer dimensions (e.g., 2, 3, . . . N) than the higher-dimensional vector (e.g., greater than N).
- the nodes may be organized by the meta-path indicators and by dimension.
- the lower-dimensional vector of the set of candidate drug compounds may be input to one or more machine learning models 132 trained to perform classification.
- the classification techniques may include using clustering to filter out candidate drug compounds that produce undesirable levels of types of activity.
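As a hedged sketch of the clustering step (the embeddings and two-cluster setup below are synthetic, and a production system would use a richer clustering method):

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    # Minimal k-means: alternate assigning points to the nearest center
    # and moving each center to the mean of its assigned points.
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated hypothetical activity profiles: the low-activity
# cluster could then be filtered out of the candidate set.
pts = np.vstack([np.zeros((5, 2)), np.full((5, 2), 10.0)])
labels, centers = kmeans(pts, k=2)
```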
- views presenting the levels of types of activity of each candidate drug compound in a design space may be generated using the lower-dimensional vectors. These views may also be presented to a user via the computing device 102 .
- the machine learning models 132 may output a candidate drug compound classified as a selected candidate drug compound based on the clustering.
- the selected candidate drug compound may include an optimized sequence of ingredients that provides the most desirable levels of a certain type of activity in a design space.
- FIG. 4 illustrates example operations of a method 400 for generating and classifying a candidate drug compound according to certain embodiments of this disclosure.
- the method 400 is performed by processing logic that may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a specialized machine), or a combination of both.
- the method 400 or each of their individual functions, routines, subroutines, or operations may be performed by one or more processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- the method 400 may be performed by a single processing thread.
- the method 400 may be performed by two or more processing threads, each thread implementing one or more individual functions, routines, subroutines, or operations of the method.
- one or more accelerators may be used to increase the performance of a processing device by offloading various functions, routines, subroutines, or operations from the processing device.
- One or more operations of the method 400 may be performed by the training engine 130 of FIG. 1 .
- the method 400 is depicted and described as a series of operations. However, operations in accordance with this disclosure can occur in various orders or concurrently, and with other operations not presented and described herein. For example, the operations depicted in the method 400 may occur in combination with any other operation of any other method disclosed herein. Furthermore, not all illustrated operations may be required to implement the method 400 in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the method 400 could alternatively be represented as a series of interrelated states via a state diagram or events.
- the processing device may generate a biological context representation 200 of a set of drug compounds.
- the biological context representation 200 may include a first data structure having a first format (e.g., a knowledge graph).
- the biological context representation 200 may include, for each drug compound of the set of drug compounds, one or more relationships between or among, without limitation, (i) physical properties data 202 , (ii) peptide activity data 204 , (iii) microbe data 206 , (iv) antimicrobial compound data 208 , (v) clinical outcome data 210 , (vi) evidence-based guidelines 212 , (vii) disease association data 214 , (viii) pathway data 216 , (ix) compound data 218 , (x) gene interaction data 220 , (xi) anti-neurodegenerative compound data 222 , (xii) pro-neuroplasticity compound data 224 , or some combination thereof.
- the processing device may translate, by the artificial intelligence engine 140 , the first data structure having the first format to a second data structure having a second format.
- the translating may include converting the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector) according to a specific set of rules executed by the artificial intelligence engine 140 .
- the translating may be performed by one or more of the machine learning models 132 .
- a recurrent neural network may perform at least a portion of the translating.
- the translating may include obtaining a higher-dimensional vector and compressing the higher-dimensional vector into a lower-dimensional vector (e.g., two-dimensional, three-dimensional, four-dimensional), referred to as an embedding herein.
- one or more embeddings may be created from the first data structure having the first format.
- the lower-dimensional vector may have at least one fewer dimension than the higher-dimensional vector.
- the processing device may generate, based on the second data structure having the second format, a set of candidate drug compounds.
- the generating may be performed by one or more of the machine learning models 132 .
- a generative adversarial network may perform the generating of the set of candidate drug compounds.
- the set of candidate drug compounds may be associated with design spaces pertaining to antimicrobial, anticancer, antibiofilm, or the like.
- a biofilm may include any syntrophic consortium of microorganisms in which cells stick to each other and often also to a surface. These adherent cells may become embedded within an extracellular matrix that is composed of extracellular polymeric substances (EPS).
- the processing device may classify a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound.
- the classifying may be performed by one or more of the machine learning models 132 .
- a classifier trained using supervised or unsupervised learning may perform the classifying.
- the classifier may use clustering techniques to rank and classify the selected candidate drug compound.
- the processing device may generate a set of views including a representation of a design space.
- the design space may be antimicrobial.
- the processing device may cause the set of views to be presented on a computing device (e.g., computing device 102 ).
- the representation of the design space may pertain to, without limitation, (i) antimicrobial activity, (ii) immunomodulatory activity, (iii) neuromodulatory activity, (iv) cytotoxic activity, or some combination thereof.
- Each view of the set of views may present an optimized sequence representing the selected candidate drug compound.
- the optimized sequence in each view may be generated using any suitable optimization technique.
- the optimization technique may include maximizing or minimizing an objective function by systematically selecting input values from a domain of values and computing the value using the objective function.
- the domain of values may include a subset of values from a Euclidean space.
- the subset of values may satisfy one or more constraints, equalities, or inequalities.
- a value that minimizes or maximizes the objective function may be referred to as an optimal solution.
- Certain values in the subset may result in a gradient of the objective function being zero. Those certain values may be at stationary points, where the first derivative of the objective function at those points is zero.
- the gradient of a scalar-valued differentiable function (e.g., the objective function) of several variables at a point p is the vector whose components are the partial derivatives of the objective function at p. If the gradient is not a zero vector at a certain point p, then a direction of the gradient is the direction of fastest increase of the objective function at the certain point p.
- Gradients may be used in gradient descent, which refers to a first-order iterative optimization algorithm for finding the local minimum of an objective function. To find the local minimum, gradient descent may proceed by performing operations proportional to the negative of the gradient of the objective function at a current point. In some embodiments, the optimized sequence may be found for a candidate drug compound by performing gradient descent in the design space. Additionally, gradient ascent, which is the opposite of gradient descent, may determine a local maximum of the objective function at various points in the design space.
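A minimal numeric sketch of gradient descent on a made-up one-dimensional objective (real design spaces are high-dimensional and the objective is model-derived):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Step proportional to the negative of the gradient at the current
    # point; flipping the sign of the step gives gradient ascent.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the local (and global) minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```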
- the views generated may include a topographical heatmap, itself including indicators for the least activity at points in the design space and the most activity at points in the design space.
- the indicator associated with the most activity may represent a local maximum obtained using gradient ascent.
- the indicator associated with the least activity may represent a local minimum obtained using gradient descent.
- the optimal sequence may be generated by navigating points between the local minima and local maxima.
- the optimized sequence may be overlaid on the indicators, which range from at least one least active property to at least one most active property.
- the processing device may cause the selected candidate drug compound to be formulated. In some embodiments, the processing device may cause the selected candidate drug compound to be created, manufactured, developed, synthesized, or the like. In some embodiments, the processing device may cause the selected candidate drug compound to be presented on a computing device (e.g., computing device 102 ).
- the selected candidate drug compound may include one or more active ingredients (e.g., chemicals) at a specified amount.
- FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation 200 of a plurality of drug compounds according to certain embodiments of this disclosure.
- the first data format may include a knowledge graph.
- the biological context representation 200 may capture an entire biological context by integrating every known association or relationship for each drug compound into a comprehensive knowledge graph.
- FIG. 5A presents the biological context representation 200 including biomedical and domain knowledge on peptide activity, microbes, antimicrobial compounds, clinical outcomes, and any relevant information depicted in FIG. 2 .
- a table 500 may include rows representing various categories (A, B, C, D, and E) pertaining to a biological context for each drug compound and columns representing sub-categories (1, 2, 3, 4, and 5).
- the table includes subcategories for category A (A1 2D fingerprints, A2 3D fingerprints, A3 Scaffolds, A4 Structure Keys, A5 Physicochemical) and for category B (B1 Mechanism).
- Charts 502 , 504 , and 506 represent characteristics for each subcategory.
- the characteristics include, for chart 502, the size of molecules; for chart 504, the complexity of variables; and for chart 506, the correlation with mechanism of action.
- Another chart 508 may represent the various characteristics of the subcategories using an indicator (such as a range of colors from 0 to 1) to express the values of the characteristics in relation to each other.
- FIG. 5B illustrates a different representation 520 of characteristics for several subcategories (e.g., A 1 , B 1 , C 5 , D 1 , and E 3 ) across different subject matter areas (e.g., neurology and psychiatry, infectious disease, gastroenterology, cardiology, ophthalmology, oncology, endocrinology, pulmonary, rheumatology, and malignant hematology).
- the representation 520 provides an even more granular representation of the biological context representation 200 than does the chart 508 .
- Flowchart 530 represents the process for generating candidate drugs as described further herein.
- FIG. 5C illustrates a knowledge graph 540 representing the biological context representation 200 .
- the knowledge graph 540 may refer to a cognitive map.
- the knowledge graph 540 represents a graph traversed by the AI engine 140 , when generating candidate drug compounds having desired levels of certain types of activity in a design space.
- Individual nodes in the knowledge graph 540 represent a health artifact (health-related information) or relationship (predicate) gleaned and curated from numerous data sources.
- the knowledge represented in the knowledge graph 540 may be improved over time as the machine learning models discover new associations, correlations, or relationships.
- the nodes and relationships may form logical structures that represent knowledge (e.g., Genes, Participates, and Pathways).
- FIG. 5D illustrates another representation of the knowledge graph 540 that more clearly identifies all the various relationships among the nodes.
- FIG. 6 illustrates example operations of a method 600 for translating the first data structure of FIGS. 5A-5B to a second data structure according to certain embodiments of this disclosure.
- Method 600 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 600 are implemented in computer instructions that are stored on a memory device and executed by a processing device.
- the method 600 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 600 may be performed in some combination with any of the operations of any of the methods described herein.
- the method 600 may include operation 404 from the previously described method 400 depicted in FIG. 4 .
- the processing device may translate, by the artificial intelligence engine 140 , the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector).
- the method 600 in FIG. 6 includes operations 602 and 604 .
- the processing device may obtain a higher-dimensional vector from the biological context representation 200 . This process is further illustrated in FIG. 7 .
- the processing device may compress the higher-dimensional vector to a lower-dimensional vector.
- the compressing may be performed by a first machine learning model 132 trained to perform deep autoencoding via a recurrent neural network configured to output the lower-dimensional vector.
- the processing device may train the first machine learning model 132 by using a second machine learning model 132 to recreate the first data structure having the first format.
- the second machine learning model 132 is trained to perform a decoding operation to recreate the first data structure having the first format.
- the decoding operation may be performed on the second data structure having the second data format (e.g., two-dimensional vector).
- FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5B to the second data structure according to certain embodiments of this disclosure.
- Aggregated biological data may be difficult to model and format correctly for an AI engine to process.
- Aspects of the present disclosure overcome the hurdle of modeling and formatting the aggregated biological data to enable the AI engine 140 to generate candidate drug compounds accurately and efficiently.
- a higher-dimensional vector 700 may be obtained from the biological context representation 200 .
- the higher-dimensional vector is compressed to a lower-dimensional vector 702 .
- the recurrent neural network performing autoencoding is trained using another machine learning model 132 that recreates the higher-dimensional vector 704 . If the other machine learning model 132 is unable to recreate higher-dimensional vector 704 from the lower-dimensional vector 702 , then the other machine learning model 132 provides feedback to the recurrent neural network performing autoencoding in order to update its weights, biases, or any suitable parameters.
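- The compress-recreate-feedback loop described above can be sketched with a toy linear autoencoder. This is a minimal illustration only: the disclosed embodiments use a recurrent neural network for deep autoencoding, and the data, dimensions, and learning rate here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for higher-dimensional vectors obtained from the
# biological context representation.
X = rng.normal(size=(64, 16))                 # 64 samples, 16 dimensions

W_enc = rng.normal(scale=0.1, size=(16, 2))   # compress 16 -> 2 (encoder)
W_dec = rng.normal(scale=0.1, size=(2, 16))   # recreate 2 -> 16 (decoder)

def loss(X, W_enc, W_dec):
    Z = X @ W_enc          # lower-dimensional vector (embedding)
    X_hat = Z @ W_dec      # attempted recreation of the higher-dimensional vector
    return np.mean((X - X_hat) ** 2)

initial = loss(X, W_enc, W_dec)
lr = 0.01
for _ in range(200):
    Z = X @ W_enc
    X_hat = Z @ W_dec
    err = X_hat - X                           # reconstruction error fed back
    W_dec -= lr * (Z.T @ err) / len(X)        # update decoder weights
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)  # update encoder weights

final = loss(X, W_enc, W_dec)
print(final < initial)  # reconstruction improves as weights are updated
```

When the decoder cannot recreate the input, the reconstruction error is propagated back to both models, mirroring the feedback described above.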
- FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure.
- FIG. 8A illustrates a view 800 including antimicrobial activity
- FIG. 8B illustrates a view 802 including immunomodulatory activity
- FIG. 8C illustrates a view 804 including cytotoxic activity.
- Each view presents a topographical heatmap where one axis is for sequence parameter y and the other axis is for sequence parameter x.
- Each view includes an indicator ranging from a least active property to a most active property.
- each view includes an optimized sequence 806 for a selected candidate drug compound classified by the classifier (machine learning model 132 ). These views may be presented to the user on a computing device 102 . Further, the selected candidate drug compound 806 may be formulated, generated, created, manufactured, developed, or tested.
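- The idea of locating an optimized sequence on a topographical heatmap can be sketched as follows. The activity surface here is a hypothetical synthetic function; the real views in FIGS. 8A-8C plot antimicrobial, immunomodulatory, and cytotoxic activities measured or predicted for actual sequences.

```python
import numpy as np

# Hypothetical activity surface over two sequence parameters (x, y).
x = np.linspace(0, 1, 50)
y = np.linspace(0, 1, 50)
X, Y = np.meshgrid(x, y)
activity = np.exp(-((X - 0.6) ** 2 + (Y - 0.4) ** 2) / 0.05)

# The "optimized sequence" marker corresponds to the most active grid point.
iy, ix = np.unravel_index(np.argmax(activity), activity.shape)
best_x, best_y = x[ix], y[iy]
print(best_x, best_y)  # near (0.6, 0.4), the peak of the activity surface
```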
- FIG. 9 illustrates example operations of a method 900 for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure.
- Method 900 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 ).
- one or more operations of the method 900 are implemented in computer instructions that are stored on a memory device and executed by a processing device.
- the method 900 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 900 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may receive, from the artificial intelligence engine 140 , a candidate drug compound generated by the artificial intelligence engine 140 .
- the processing device may generate a view including the candidate drug compound overlaid on a representation of a design space.
- the view may present a topographical heatmap of the representation of the design space.
- the topographical heatmap may include the candidate drug compound overlaid on indicators ranging from at least one least active property to at least one most active property.
- although a topographical heatmap is depicted as an example in the view, other suitable visual elements (e.g., graphs, charts, two-dimensional density plots, three-dimensional density plots, etc.) may be used to depict the representation of the design space.
- the processing device may present the view on a display screen of a computing device (e.g., computing device 102 ).
- FIG. 10A illustrates example operations of a method 1000 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure.
- Method 1000 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 1000 are implemented in computer instructions that are stored on a memory device and executed by a processing device.
- the method 1000 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1000 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may perform one or more modifications pertaining to the biological context representation 200 , the second data structure having the second format, or some combination thereof.
- the processing device may use causal inference to determine whether the one or more modifications provide one or more desired performance results.
- using causal inference may further include using 1006 counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof.
- a counterfactual may refer to determining whether the desired performance still results if something does not occur during the calculation. For example, in a scenario, a person may improve their health after taking a medication. The counterfactual may be used in causal inference to calculate an alternative scenario to see whether the person's health improved without taking the medication. If the person's health still improved without taking the medication it may be inferred that the medication did not cause the health of the person to improve.
- the medication is correlated with causing the health of the person to improve. There may, however, be other factors involved in conjunction with taking the medication that actually cause the health of the person to improve.
- FIG. 10B illustrates another example of operations of method 1050 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure.
- Method 1050 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 1050 are implemented in computer instructions that are stored on a memory device and executed by a processing device.
- the method 1050 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1050 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may generate a set of candidate drug compounds by performing a modification using causal inference based on a counterfactual.
- the counterfactual may include removing an ingredient from a sequence of ingredients to determine whether a candidate drug compound provides the same level or type of activity it previously provided when the ingredient was included in the sequence. If the same level or type of activity is still provided after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is not correlated with the level or type of activity. If the same level or type of activity is not present after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is correlated with the level or type of activity.
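- The ingredient-removal counterfactual above can be sketched as follows. The activity function here is a toy stand-in; in the disclosed system, the level or type of activity would be predicted by trained machine learning models 132.

```python
# Toy activity rule (hypothetical): only ingredients "A" and "C" contribute.
def activity(sequence):
    return sum(1 for ingredient in sequence if ingredient in {"A", "C"})

def is_correlated(sequence, ingredient):
    """Apply the counterfactual: remove the ingredient and check whether the
    level of activity changes. A change implies the ingredient is correlated
    with the activity; no change implies it is not."""
    counterfactual = [i for i in sequence if i != ingredient]
    return activity(counterfactual) != activity(sequence)

sequence = ["A", "B", "C"]
print(is_correlated(sequence, "B"))  # False: removing B leaves activity unchanged
print(is_correlated(sequence, "A"))  # True: removing A lowers activity
```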
- the processing device may classify a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound, as previously described herein.
- FIG. 11 illustrates example operations of a method 1100 for using several machine learning models in an artificial intelligence engine architecture to generate peptides according to certain embodiments of this disclosure.
- Method 1100 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 1100 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 1100 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1100 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may generate, via a creator module 151 , a candidate drug compound including a sequence for the candidate drug compound.
- the sequence for the candidate drug compound includes a concatenated vector that may include drug compound sequence information, drug compound activity information, drug compound structure information, and drug compound semantic information.
- the candidate drug compound may be generated using a GAN.
- the processing device may use an attention message passing neural network including an attention mechanism that identifies and assigns a weight to a desired feature in a portion of the knowledge graph.
- the desired feature may be included in the candidate drug compound as drug compound semantic information, drug compound structural information, drug compound activity information, or some combination thereof.
- the creator module 151 may generate the candidate drug compound by performing ensemble learning by concatenating a set of encodings.
- the encodings may each be respective sequences represented in a vector.
- a first encoding of the set of encodings may pertain to drug compound sequence information.
- a second encoding of the set of encodings may pertain to drug compound structural information.
- a third encoding of the set of encodings may pertain to peptide activity information.
- a fourth encoding of the set of encodings may pertain to drug compound semantic information.
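- The concatenation of the four encodings can be sketched as follows. The vectors and their lengths are hypothetical; the real encodings would be produced by separately trained models for each information type.

```python
import numpy as np

# Hypothetical fixed-length encodings for one candidate drug compound.
seq_enc      = np.array([0.1, 0.4, 0.3])   # drug compound sequence information
struct_enc   = np.array([0.7, 0.2])        # drug compound structural information
activity_enc = np.array([0.9])             # peptide activity information
semantic_enc = np.array([0.5, 0.6])        # drug compound semantic information

# Ensemble learning by concatenation: one vector carrying all four encodings.
candidate = np.concatenate([seq_enc, struct_enc, activity_enc, semantic_enc])
print(candidate.shape)  # (8,)
```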
- the creator module 151 may generate the candidate drug compound using an autoencoder machine learning model trained to receive a higher-dimensional vector encoding representing the candidate drug compound and output a lower-dimensional vector embedding representing the candidate drug compound.
- the creator module 151 may generate a latent representation using the lower-dimensional vector embedding representing the candidate drug compound.
- the processing device may include, via the creator module 151 , the candidate drug compound as a node in a knowledge graph (e.g., biological context representation 200 ).
- the knowledge graph may include a first layer including structure and physical properties of molecules, a second layer including molecule-to-molecule interactions, a third layer including molecular pathway interactions, a fourth layer including molecular cell profile associations, and a fifth layer including molecular therapeutics and indications.
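- The five-layer structure of the knowledge graph can be sketched schematically as follows; the entries at each layer are hypothetical placeholders.

```python
# Schematic of the five-layer knowledge graph described above.
knowledge_graph = {
    "layer_1_structure_and_physical_properties": {"molecule_X": {"mass": 1203.4}},
    "layer_2_molecule_to_molecule_interactions": [("molecule_X", "molecule_Y")],
    "layer_3_molecular_pathway_interactions": [("molecule_X", "pathway_P")],
    "layer_4_molecular_cell_profile_associations": [("molecule_X", "cell_line_C")],
    "layer_5_molecular_therapeutics_and_indications": [("molecule_X", "indication_I")],
}
print(len(knowledge_graph))  # five layers
```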
- Indications may refer to drug indications, i.e., the disease that gives a valid reason for clinicians to administer a specific drug.
- the processing device may generate, via a descriptor module 152 , a description of the candidate drug compound at the node in the knowledge graph.
- the description may include drug compound sequence information, drug compound structural information, drug compound activity information, and drug compound semantic information.
- the processing device may perform, via a scientist module 153 , a benchmark analysis of a parameter of the creator module 151 .
- the scientist module 153 may perform causal inference using the candidate drug compound in a design space pertaining to biomedical activity (e.g., antimicrobial, anticancer, etc.) to determine if the candidate drug compound still provides a desired effect regarding the type of biomedical activity if the candidate drug compound, or the design space, is changed.
- the processing device may modify, based on the benchmark analysis, the creator module 151 to change the parameter in a desired way during a subsequent benchmark analysis.
- Changing the parameter in a desired way may refer to changing a value of the parameter in a desired way.
- Changing the value of the parameter in the desired way may refer to increasing or decreasing the value of the parameter.
- a self-improving AI engine 140 is disclosed that increasingly generates better candidate drug compounds over time by recursively updating the creator module 151 based on baselines.
- “change the parameter” means change a value of the parameter as desired (e.g., either increase or decrease).
- the processing device may generate, via a reinforcer module 154 based on the candidate drug compound and the description, experiments that produce desired data for the candidate drug compound.
- the experiments may be generated in response to the candidate drug compound and the description being similar to a real drug compound and another description of the real drug compound.
- the reinforcer module 154 may determine that certain experiments for the real drug compound elicited desired data and may select those experiments to perform for the candidate drug compound.
- the processing device may perform the experiments (e.g., by running simulations) to collect data pertaining to the candidate drug compound.
- the processing device may determine, based on the data, an effectiveness of the candidate drug compound.
- FIG. 12 illustrates example operations of a method 1200 for performing a benchmark analysis according to certain embodiments of this disclosure.
- Method 1200 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 1200 are implemented in computer instructions that are stored on a memory device and executed by a processing device.
- the method 1200 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1200 may be performed in some combination with any of the operations of any of the methods described herein.
- the method 1200 includes additional operations included in block 1108 of FIG. 11 .
- the processing device generates, via the scientist module 153 , a score for a parameter of the creator module 151 that generated the candidate drug compound.
- the parameter may include a validity of the candidate drug compound, uniqueness of the candidate drug compound, novelty of the candidate drug compound, similarity of the candidate drug compound to another candidate drug compound, or some combination thereof.
- the processing device may rank a set of creator modules 151 based on the score, where the set of creator modules comprises the creator module. For example, other creator modules in the set of creator modules may be scored based on the candidate drug compounds they generated. The set of creator modules may be ranked for each respective category from highest scoring to lowest scoring or vice versa.
- the processing device may determine which creator module 151 of the set of creator modules performs better for each respective parameter.
- the scores of the parameters for each of the set of creator modules 151 may be presented on a display screen of a computing device.
- the best performing creator modules for each parameter may also be presented on the display screen.
- the processing device may tune the set of creator modules 151 to cause the set of creator modules 151 to receive higher scores for certain parameters during subsequent benchmark analysis.
- the tuning may optimize certain weights, activation functions, hidden layer number, loss, and the like of one or more generative modules included in the creator modules.
- the processing device may select, based on the parameters, a subset of the set of creator modules 151 to use to generate subsequent candidate drug compounds having desired parameter scores. For example, it may be desired to generate drug candidate compounds that result in a high uniqueness score.
- the creator module(s) 151 associated with high uniqueness scores may be selected in the subset of creator modules 151 .
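- The scoring, ranking, and subset-selection steps above can be sketched as follows. The module names and scores are hypothetical; the real scores would come from the scientist module's benchmark analysis.

```python
# Hypothetical per-parameter benchmark scores for a set of creator modules.
scores = {
    "creator_a": {"validity": 0.9, "uniqueness": 0.4, "novelty": 0.7},
    "creator_b": {"validity": 0.6, "uniqueness": 0.8, "novelty": 0.5},
    "creator_c": {"validity": 0.7, "uniqueness": 0.9, "novelty": 0.6},
}

def rank_by(parameter):
    """Rank the set of creator modules from highest to lowest score."""
    return sorted(scores, key=lambda m: scores[m][parameter], reverse=True)

def select_subset(parameter, threshold):
    """Select creator modules whose score meets a desired threshold."""
    return [m for m in scores if scores[m][parameter] >= threshold]

print(rank_by("uniqueness"))              # ['creator_c', 'creator_b', 'creator_a']
print(select_subset("uniqueness", 0.75))  # ['creator_b', 'creator_c']
```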
- the processing device may transmit the subset of the set of creator modules as a package to a third-party to be used with data of the third-party.
- the subset of the set of creator modules may be trained to process a type of the data of the third-party.
- Other modules such as the reinforcer module, the descriptor module, the scientist module, and the conductor module may be included in the package delivered to the third-party.
- a knowledge graph including data pertaining to the third-party may be included in the package. In such a way, the disclosed techniques may provide custom tailored packages that may be used by the third party to perform the embodiments disclosed herein.
- FIG. 13 illustrates example operations of a method 1300 for slicing a latent representation based on a shape of the latent representation according to certain embodiments of this disclosure.
- Method 1300 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as server 128 executing the artificial intelligence engine 140 ).
- one or more operations of the method 1300 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 1300 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1300 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may determine a shape of the multi-dimensional, continuous representation of the set of candidates.
- the processing device may determine, based on the shape, a slice to obtain from the multi-dimensional, continuous representation of the set of candidates.
- the processing device may determine, using a decoder, which dimensions are included in the slice. The dimensions may pertain to peptide sequence information, peptide structural information, peptide activity information, peptide semantic information, or some combination thereof.
- the processing device may determine, based on the dimensions, an effectiveness of a biomedical feature of the slice.
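- The slicing step above can be sketched with an array-based latent representation. The data, the number of dimensions, and the one-dimension-per-information-type mapping are hypothetical simplifications; in the disclosed system a decoder determines which information types the sliced dimensions carry.

```python
import numpy as np

# Hypothetical latent representation: 100 candidates embedded in 4 dimensions.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 4))
dimension_names = ["sequence", "structure", "activity", "semantic"]

# A "slice" keeps a subset of dimensions of the continuous representation.
slice_dims = [1, 2]
latent_slice = latent[:, slice_dims]
print([dimension_names[d] for d in slice_dims])  # ['structure', 'activity']
print(latent_slice.shape)                        # (100, 2)
```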
- FIG. 14 illustrates a high-level flow diagram for a therapeutics tool implementing, incorporating or using business intelligence according to certain embodiments of this disclosure.
- a business intelligence screen may be presented in a graphical user interface on the computing device 102 .
- the computing device 102 may be operated by a person assigned to a development team, business intelligence team, or the like.
- the user interface may include various graphical elements (e.g., buttons, slider bars, radio buttons, input boxes, etc.) that enable the user to enter, select, configure, etc. a desired target product profile 1400 for sequences (e.g., peptide).
- the target product profile may include pharmacology data 1402 (e.g., drug brand name (if applicable), drug generic name, drug dose, clinical trial information and results, toxicology, stability, safety, efficacy, dose cost, etc.), pharmacokinetic data, pharmacodynamic data, activity data, manufacturing data 1404 (e.g., liquid chromatography mass spectrometry (LCMS) data, ability to be manufactured, scalability in production, etc.), compliance data, biological data 1406 (e.g., metabolic information (e.g., half-life, LD50, etc.), sequence data, pathway, interactions, indications, symptoms, genes, etc.), or some combination thereof.
- the target product profile may be entered, selected, configured, etc. via the user interface.
- the computing device 102 or the artificial intelligence engine 140 may select or filter the design space to present a solution space which includes sequences that match (e.g., partially or exactly) the target product profile.
- the sequences may be selected, based on the target product profile, from a library of sequences.
- the library of sequences may be generated by one or more machine learning models 132 of the artificial intelligence engine 140 performing the techniques described herein.
- the artificial intelligence engine 140 may attempt to generate sequences having features pertinent to the target product profile.
- the dynamically generated sequences may be added to the library of sequences and may be presented on the user interface of the computing device 102 .
- the sequences that match the target product profile may include a list of candidate drug compounds (e.g., peptide candidates) or relevant candidate drug compound features.
- the features may include biomedical ontological relations, terms, characteristics, descriptors, or the like or non-biomedical ontological relations, terms, characteristics, descriptors, or the like.
- the features may include levels of structural (e.g., physical, chemical, biological, etc.) information, semantic information, activity, classes of activity, indications (e.g., clinical outcomes), genes, symptoms, interactions, folding properties, wave properties, stabilities of modification, sequence information (e.g., location or number of amino acids in a strand), and so forth.
- the user may use one or more graphical elements presented on the graphical user interface to select one or more of the sequences. Selecting the one or more sequences may cause another user interface, such as a candidate dashboard screen, to present additional data pertaining to the one or more selected sequences. In some embodiments, selecting the one or more sequences may cause the one or more sequences to be manufactured, produced, synthesized, or the like.
- the first portion 1502 includes various graphical elements to enable a user to select certain information, features, identifiers, query parameters, etc. that may be used to filter, constrain, build, generate, etc. the solution space within a design space for proteins for particular applications.
- the design space may include up to every conceivable or known (e.g., facts) configuration of sequences of proteins (e.g., peptides) in certain biochemical or biomedical applications (e.g., antimicrobial, anti-cancer, anti-viral, anti-fungal, anti-prion, immunomodulatory, neuromodulatory, a physiological effect caused by a signaling peptide, etc.).
- the design space may be created based on the knowledge graph that includes ontological data pertaining to sequences of proteins for up to every conceivable or known configuration of sequences of proteins.
- a resolution of the design space may be modified by identifying, as a first order, features or activities pertaining to the sequences.
- the term “resolution” may refer to the process of reducing, partitioning or separating something into its components (e.g., features or activities pertaining to the sequences).
- one graphical element 1508 may include a dropdown box that enables entering, selecting, configuring, etc. one or more query parameters.
- the query parameters may include desirable sequence parameters associated with features, activities, properties, biomedically-related ontological relations, terms, characteristics, descriptors, or the like or non-biomedically-related ontological relations, terms, characteristics, descriptors, or the like.
- the query parameters may be used in any combination to generate different visualizations of solution spaces having sequences.
- if one query parameter is of interest to a user, a one-dimensional visualization of sequences related to that one query parameter may be presented in the first portion 1502 .
- if "n" (where "n" is a positive integer) query parameters are of interest to a user, then an n-dimensional visualization of the sequences related to the n query parameters may be presented.
- the solution spaces that are generated or presented may be saved in the database 150 .
- the artificial intelligence engine 140 may distill, based on the selected query parameters, the design space into the solution space 1506 . For example, the distillation process may include selecting sequences as candidate drug compounds that produce activities (e.g., query parameters) exceeding a certain threshold level.
- the solution space 1506 may be generated to include those candidate drug compounds.
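- The distillation of a design space into a solution space by thresholding on a query parameter can be sketched as follows. The sequences and activity values are hypothetical placeholders; the real values come from the trained machine learning models 132.

```python
# Hypothetical design space mapping candidate sequences to predicted activities.
design_space = {
    "GIGKFLHSAK": {"antimicrobial": 0.91, "cytotoxic": 0.10},
    "KWKLFKKIEK": {"antimicrobial": 0.45, "cytotoxic": 0.30},
    "FLPIIAKLLG": {"antimicrobial": 0.78, "cytotoxic": 0.05},
}

def solution_space(query_parameter, threshold):
    """Keep sequences whose queried activity exceeds a certain threshold level."""
    return [seq for seq, activities in design_space.items()
            if activities[query_parameter] > threshold]

print(solution_space("antimicrobial", 0.5))  # ['GIGKFLHSAK', 'FLPIIAKLLG']
```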
- the user interface 1500 enables a user to modify the query parameters to essentially tune the solution space presented such that desired sequences having particular features pertaining to the query parameters are depicted at least one of efficiently, accurately, and in a condensed visual format.
- Such a technique is beneficial because it distills a large (typically, very large) amount of data in the knowledge graph down into a visually comprehensible format, thereby increasing explainability and understandability.
- Due to the improved user interface 1500 , a user's experience using the computing device may be enhanced because the user does not have to switch between or among multiple user interfaces or perform multiple queries to find different solution spaces.
- the enhanced user interface 1500 may save computing resources by using the query parameters to enable data reduction from a large protein design space to salient sequences in the solution space 1506 .
- the disclosed machine learning models may be trained to generate results (e.g., solution space 1506 ) superior to those results produced by conventional techniques. Additionally, the results produced using the disclosed techniques may have been previously computationally infeasible using conventional techniques.
- the second portion 1504 may include more granularly detailed data pertaining to the solution space 1506 and the sequences included therein.
- the second portion 1504 includes a legend and various windows pertaining to interactions, associations, and proteins.
- the legend includes information pertaining to polo-box domain (e.g., the PDZ domain, SH3 domain, WW domain, WH1 domain, TK domain, PTP domain, PTB domain, SH2 domain, etc.), binding site (e.g., C-terminus, polyproline, phosphosite, etc.), interaction information, and network information.
- the various information is color-coded and correlated with the color-coded clusters presented in the first portion.
- some of the information (e.g., polo-box domain and binding sites) in the legend are associated with different shapes to differentiate each type of information's graphics.
- the interaction information in the legend depicts how the various selections of polo-box domain information interact with each other, and the network information in the legend depicts how various clusters are connected in a network. Depicting the solution space using these techniques may provide an enhanced user interface by distilling a large amount of complex biochemical information about candidate drug compounds into a format easily understandable to a target user (e.g., peptide designer, business intelligence user). To make decisions pertaining to selecting candidate drug compounds without drilling down into additional screens, the user may view the user interface 1500 , thereby saving computing resources and enhancing the user's experience using the computing device 102 .
- the window depicts a likelihood of pairwise interactions between two proteins. For example, “Protein 1 ” Q8IXW0 and “Protein 2 ” Q96RU3 have a probability of 0.52 of interacting.
- the window includes certain information pertaining to ontological terms concerning biological functions in subgraphs associated with the query that caused the solution space to be generated.
- the window, including protein information includes various graphical elements (e.g., input boxes) to enable the entering of information pertaining to descriptions of the protein or ontological terms related to the protein.
- the user interface 1500 may include one or more graphical elements 1512 configured to enable selecting one or more of the sequences in the solution space.
- the user may use the graphical element 1512 to select a sequence to view additional information pertaining to the selected sequence, to cause the selected sequence to be manufactured, produced, synthesized, etc.
- when a sequence selected is in the solution space, a user may be shown the topographical heatmap depicted in FIGS. 8A-8C .
- the sequence 806 depicted in FIG. 8A has a particular path along a traversal or feature map, where the path is specific to the query parameter entered (e.g., number of alanine amino acids).
- Each point on the traversal may be associated with a particular level of activity measured by one or more trained machine learning models 132 that generate the sequence 806 .
- selecting a sequence in the solution space 1506 may cause another user interface 1800 to be presented, such as a candidate dashboard screen in FIG. 18 .
- FIG. 16 illustrates an example user interface 1600 for tracking information pertaining to trials according to certain embodiments of this disclosure.
- the trial information includes columns for a name of the trial (computation run), a tag indicating whether the trial is a test only, a creation date (start time of execution), a runtime length, a sweep, an encoder identifier (architecture of machine learning model), a number of training data, a number of validation data, an accuracy, an epoch, a human_iou (human intersection over union), and an iou (intersection over union).
- a feature classification metric may also be user defined.
- a feature may refer to a descriptor that a machine learning model 132 is learning to classify.
- one such feature may be “stability” and a machine learning model 132 may classify the following: if a peptide sequence is a stable sequence.
- the feature classification metric would be “stability” in that example.
- Other metrics may include accuracy, precision, intersection over union, or the like.
- the trial information may be useful to a protein designer by enabling the protein designer to determine which trials are more successful than other trials, more accurate than other trials, and the like. Further, the trial information may enable the protein designer to generate new trials that include beneficial features of previous trials.
- FIG. 17 illustrates an example user interface 1700 for presenting performance metrics of machine learning models that perform trials according to certain embodiments of this disclosure.
- the performance metrics may include process graphics processing unit (GPU) usage (%), process GPU power usage (%), process GPU memory allocated (%), process GPU time spent accessing memory (%), and process GPU temperature (in degrees, e.g., Celsius).
- Each metric may include a graph that includes representations (e.g., lines) associated with respective machine learning models.
- the graph may include an X axis corresponding to the time or time elapsed or other time measure, and a Y axis corresponding to a value amount (e.g., a cost value).
- the representations for each machine learning model may be overlaid on the graph to enable a comparison of how each machine learning model performed for a particular metric.
- the performance metrics may be used to assign a cost value to each of the machine learning models.
- the cost may refer to how many resources (processor, memory, network, etc.) are used by the machine learning model during performance of trials, temperatures of components caused by the machine learning model during performance of trials, energy utilization, memory utilization, processor utilization, and other direct and indirect measures of money and non-money cost, among others.
- Assigning a cost may include computing, e.g., a weighted value or average as the sum of nodes traversed on a graph, or as the expected value or other mathematical or statistical measure related to such cost.
- the disclosed techniques may enable saving computing resources by evaluating and assigning costs to certain machine learning models that perform better than other machine learning models.
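- The cost assignment described above can be sketched as follows. This is an illustrative sketch only, not an implementation from the disclosure; the metric names and weights are hypothetical stand-ins for the per-trial resource measurements discussed above.

```python
# Hypothetical sketch: assigning a weighted cost value to each machine
# learning model from its measured per-trial resource metrics.

def model_cost(metrics, weights):
    """Weighted sum of resource metrics for one model (lower is better)."""
    return sum(weights[name] * value for name, value in metrics.items())

# Hypothetical per-trial measurements (percentages) for two models.
model_a = {"gpu_usage": 80.0, "gpu_memory": 60.0, "gpu_temp": 70.0}
model_b = {"gpu_usage": 55.0, "gpu_memory": 40.0, "gpu_temp": 65.0}
weights = {"gpu_usage": 0.5, "gpu_memory": 0.3, "gpu_temp": 0.2}

cost_a = model_cost(model_a, weights)   # 0.5*80 + 0.3*60 + 0.2*70 = 72.0
cost_b = model_cost(model_b, weights)   # 0.5*55 + 0.3*40 + 0.2*65 = 52.5
cheaper = "model_b" if cost_b < cost_a else "model_a"
```

- In practice, the weights would reflect which direct and indirect costs (money and non-money) matter most for a given deployment.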
- FIG. 18 illustrates an example user interface 1800 for a candidate dashboard screen according to certain embodiments of this disclosure.
- the candidate dashboard screen includes selected information (e.g., chemical, physical, structural, semantic, etc.) about a candidate drug compound and, preferably, all of the available information thereabout.
- the user interface 1800 may enable a user to see a snapshot of all data (e.g., structure, correlation heatmap, related trials, trial result data, external references (aliases, synonyms, etc.)) related to a particular candidate drug compound.
- the user interface 1800 may be presented when a user selects a sequence in the solution space 1506 presented in FIG. 15 .
- the user interface 1800 includes two-dimensional 1804 and three-dimensional 1802 energy correlations.
- the energy correlations may correspond to energy functions associated with each position in a domain.
- a given energy correlation represents a correlation between each position of a protein in relation to all the other positions in the protein.
- the energy correlation may represent indications (e.g., color coded sections) pertaining to stability as the stability affects a specific function.
- An amino acid in context with the adjacent amino acids may affect the local folding properties of the peptide.
- Energy correlation values are inversely related (although the degree of relation may vary) to the strength of a specific amino acid (or amino acid modification) at a specific position in a peptide chain for a peptide designed for a specific function.
- FIG. 19 illustrates example operations of a method 1900 for generating a design space for a peptide for an application according to certain embodiments of this disclosure.
- Method 1900 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 , server 128 executing the artificial intelligence engine 140 , etc.).
- one or more operations of the method 1900 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 1900 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 1900 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may generate a design space for a peptide for an application.
- the application may include at least one of the following functional biomaterials (e.g., adhesives, sealants, binders, chelates, diagnostic reporters, or some combination thereof) and structural biomaterials (e.g., biopolymers, encapsulation films, flocculants, desiccants, or some combination thereof): anti-infective, anti-cancer, antimicrobial, antiviral, anti-fungal, anti-inflammatory, anti-cholinergic, anti-dopaminergic, anti-serotonergic, anti-noradrenergic, and anti-prionic.
- the processing device may generate the design space by (i) identifying 1904 a set of sequences for the peptide, and (ii) updating 1906 the set of sequences by determining, for each of the set of sequences, a respective set of activities (e.g., immunomodulatory activity, receptor binding activity, self-aggregation, cell-penetrating activity, anti-viral activity, peptidergic activity, cell-permeating, or the like) pertaining to the application. Updating the set of sequences may produce an updated set of sequences, wherein each sequence in the updated set has an updated respective set of activities.
- the processing device may generate, based on the updated set of sequences each having the updated respective set of activities, a solution space within the design space.
- the solution space may include a target subset of the updated set of sequences, wherein each sequence in the target subset has the updated respective set of activities.
- the processing device may receive a query parameter selected, generated, or transmitted from a user interface presented on the computing device 102 .
- the processing device may use the query parameter to generate the solution space. For example, using a machine learning model trained to measure, based on the query parameter, a level of the updated respective set of activities, the processing device may generate the solution space within the design space.
- One or more query parameters may be selected as constraints to be used to generate the solution space. Essentially, the query parameters may be used to create bounds of the solution space within the design space.
- the query parameters may be selected, generated, or transmitted from a user interface presented on the computing device 102 and transmitted to the artificial intelligence engine 140 . Based on the query parameters, the artificial intelligence engine 140 may use one or more machine learning models to generate the solution space within the design space.
- the query parameter may include sequence parameters pertaining to biomedically-related ontological relations, terms, characteristics, descriptors, or the like or non-biomedically-related ontological relations, terms, characteristics, descriptors, or the like.
- the biomedical ontology terms may include indications, genes, symptoms, alanine properties, etc.
- the non-biomedical ontology terms may include physical descriptors and characteristics, such as interactions (e.g., adhesive), folding properties (e.g., aggregating versus loose), wave properties (e.g., fluorescent, luminescent, iridescent), stability of modification (e.g., glycopeptides, lipid peptides, chelates, lasso peptides), etc.
- the processing device may receive a desired threshold level of a target activity for the query parameter; the threshold level is configured such that a sequence must exceed it in order to be included in the target subset of the solution space.
- the desired threshold level may be any suitable value, percentage, measurement, quantity, etc.
- a user may select a number of alanines (e.g., 5) as the query parameter and specify the desired threshold level of a target activity (e.g., immunomodulatory activity). Accordingly, the processing device may return a target subset of sequences having 5 alanines that exceed the desired threshold level of immunomodulatory activity.
- the processing device may perform dimension reduction to identify the target subset. Said reduction may be performed via a machine learning model that uses the query parameter and the updated set of sequences, with an algorithm such as uniform manifold approximation and projection (UMAP).
- a UMAP-based technique may use a Riemannian manifold, which refers to a real, smooth manifold M equipped with a positive-definite inner product g_p on the tangent space T_p M at each point p. The family g_p of inner products is called a Riemannian metric.
- a Riemannian metric enables defining several geometric notions on the Riemannian manifold, such as an angle at an intersection, length of a curve, area of a surface and higher-dimensional analogues (e.g., volume, etc.), extrinsic curvature of sub-manifolds, and intrinsic curvature of the manifold itself.
- UMAP may assume that data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
- the UMAP-based technique may involve certain initial assumptions such as: (i) there exists a manifold on which the data (e.g., candidate drug compounds) would be uniformly distributed; (ii) the underlying manifold of interest is locally connected; or (iii) preserving the topological structure of this manifold is the primary goal. Based on the assumptions, the UMAP-based technique may construct a graph by: (i) constructing a weighted k-neighbor graph; (ii) applying some transform on the edges to local distances; and (iii) dealing with the inherent asymmetry of the k-neighbor graph. The UMAP-based technique may perform graph layout procedures including: (i) defining an objective function that preserves desired characteristics of this k-neighbor graph; and (ii) finding a low-dimensional representation which optimizes this objective function.
- one or more other techniques may be used, such as linear decomposition, principal component analysis (PCA), kernel PCA, matrix factorization, generalized discriminant analysis, linear discriminant analysis, autoencoding, or some combination thereof.
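- As an illustration of one of the alternative techniques named above, principal component analysis can be sketched in a few lines with NumPy. This is a minimal, hedged sketch: the feature matrix here is random stand-in data, not real sequence descriptors from the disclosure.

```python
# Hypothetical sketch: PCA via singular value decomposition, reducing a
# (sequences x descriptors) feature matrix to a 2-D embedding.
import numpy as np

def pca_reduce(X, n_components=2):
    """Project rows of X onto the top n_components principal axes."""
    Xc = X - X.mean(axis=0)                    # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T            # low-dimensional embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))   # 100 candidate sequences x 16 descriptors
embedding = pca_reduce(X)        # shape (100, 2)
```

- UMAP itself follows the graph-construction and layout steps described above rather than a linear projection, but the input/output shape of the reduction is the same.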
- the processing device may receive a selection of a sequence from the target subset of sequences in the solution space.
- the selection may be made using a graphical element of a user interface presented on the computing device 102 , and the selection may be transmitted from the computing device 102 to the artificial intelligence engine 140 .
- the processing device may provide information pertaining to the sequence for presentation in a user interface on the computing device 102 .
- the information may include at least classes of proteins, protein-to-protein interactions, protein-ligand interactions, protein homology and phylogeny, sequence and structure motifs, chemical and physical stability measures, pharmacological associations, systems biology attributes, protein folding descriptors or constraints, or some combination thereof.
- the processing device using a machine learning model 132 to process the solution space, may perform one or more trials.
- the one or more trials are configured to identify a candidate drug compound that represents a sequence having at least one level of activity that exceeds one or more threshold levels.
- the one or more threshold levels may be predetermined or configured by a user (e.g., peptide designer).
- the one or more threshold levels may be a value, percentage, amount, etc. that the candidate drug compound exhibits with respect to antiviral activity.
- the processing device may transmit information describing the candidate drug compound to a computing device 102 .
- the computing device 102 may be operated by a drug candidate designer (e.g., protein, peptide, etc.) interested in sequences that exhibit certain activity for an application.
- the computing device 102 may also be operated by a business user interested in sequences that have certain target product profiles (e.g., pertaining to manufacturing, pharmacology, etc.).
- the processing device may provide the solution space to the computing device 102 for presentation as a topographical map in a user interface of the computing device 102 .
- the topographical map may include a set of indications that, for a sequence, each represent a level of activity at a given point on the topographical map.
- FIGS. 8A-8C depict examples of topographical heatmaps that may be presented on the user interface of the computing device 102 . As depicted, FIG. 8A illustrates a view 800 including antimicrobial activity, FIG. 8B illustrates a view 802 including immunomodulatory activity, and FIG. 8C illustrates a view 804 including cytotoxic activity.
- Each view presents a topographical heatmap where one axis is for sequence parameter y and the other axis is for sequence parameter x.
- Each view includes an indicator (e.g., color code) ranging from a least active property to a most active property.
- each view includes an optimized sequence 806 for a selected candidate drug compound classified by the classifier (machine learning model 132 ). These views may be presented to the user on a computing device 102 . Further, an optimized sequence may be selected, generated or transmitted in or via the user interface using a graphical element (e.g., button, mouse cursor, etc.). The selected sequence may cause another user interface (e.g., candidate dashboard in FIG. 18 ) that provides additional information pertaining to the sequence to be presented. In some embodiments, selecting the sequence may cause the sequence to be formulated, generated, created, manufactured, developed, or tested.
- FIG. 20 illustrates example operations of a method 2000 for comparing performance metrics of machine learning models according to certain embodiments of this disclosure.
- Method 2000 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 , server 128 executing the artificial intelligence engine 140 , etc.).
- one or more operations of the method 2000 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 2000 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 2000 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may determine one or more metrics of the machine learning model that performs one or more trials.
- the one or more metrics may include memory usage, graphic processing unit temperature, power usage, processor usage, central processing usage, or some combination thereof.
- FIG. 17 presents examples of the one or more metrics used to analyze the machine learning model that performs the one or more trials.
- the processing device compares the one or more metrics to one or more second metrics of a second machine learning model that performs the one or more trials.
- the comparison may illuminate which of the machine learning model or the second machine learning model performs better than the other.
- the machine learning model may perform the same trials but consume fewer processor or memory resources. Accordingly, the machine learning model may be used to subsequently perform those trials, and the second machine learning model may be pruned from selection or tuned (e.g., adjusting weights, bias, levels of hidden nodes, etc.) to improve its metrics.
- the disclosed techniques provide a technical benefit of enabling the continuous or continual monitoring of the performance of the machine learning models and, preferably, further optimizing which machine learning models perform trials to improve metrics (e.g., processor usage, power usage, graphic processing unit temperature, etc.).
- FIG. 21 illustrates example operations of a method 2100 for presenting a design space and a solution space within a graphical user interface of a therapeutics tool according to certain embodiments of this disclosure.
- Method 2100 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 , server 128 executing the artificial intelligence engine 140 , etc.).
- one or more operations of the method 2100 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 2100 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 2100 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may present, in a first screen of a graphical user interface (GUI) of a therapeutic tool, a design space for a protein for an application.
- the therapeutic tool is a peptide therapeutic design tool, a peptide business intelligence tool, or both.
- the protein is a peptide.
- the design space may include a set of sequences each containing a respective set of activities pertaining to the application. As described herein, the design space may be generated based on a knowledge graph pertaining to peptides.
- the design space may be presented as a two-dimensional (2D) elevation map, a three-dimensional (3D) shape, or an n-dimensional (nD) mathematical representation.
- the processing device may receive, via a graphical element (e.g., button, input box, radio button, dropdown list, slider, etc.) in the first screen, a selection of one or more query parameters of the design space.
- the one or more query parameters may include a sequence parameter pertaining to biomedical ontology terms or non-biomedical ontology terms. The biomedically-related ontological relations, terms, characteristics, descriptors, etc. may pertain to indications, genes, symptoms, or the like.
- the non-biomedically-related ontological relations, terms, characteristics, descriptors, etc. may pertain to physical characteristics, descriptors, or some combination thereof.
- Example physical characteristics and descriptors may include information pertaining to interactions (e.g., adhesive properties), folding properties (e.g., aggregating versus loose), wave properties (e.g., fluorescent, luminescent, iridescent, etc.), measures of stability of modification (e.g., with respect to glycopeptides, lipid peptides, chelates, lasso peptides, etc.), and the like.
- the processing device may present, in a second screen of the GUI, a solution space that includes a subset of the set of sequences, each sequence containing the respective set of activities.
- the subset of the set of sequences is selected based on the one or more query parameters.
- the solution space may be generated within the design space by one or more machine learning models 132 trained to measure, based on the one or more query parameters, a respective level of one or more of the respective set of activities of each of the set of sequences in the subset of sequences.
- the query parameters essentially create the bounds of the solution space within the design space. Generating the solution space may include grouping or binning, based on the query parameter, sequences as possible or not possible.
- “Possible,” as used herein, means constructible in reality, economically feasible, chemically feasible, biologically feasible, or otherwise reasonably feasible. “Not possible,” as used herein, means not able to be constructed in reality, economically infeasible, chemically infeasible, biologically infeasible, or otherwise reasonably infeasible.
- the machine learning model 132 may be a variational autoencoder, as described herein. In some embodiments, the machine learning model 132 may be any suitable machine learning model capable of performing decomposition methods.
- the solution space is presented as a topographical map in the GUI.
- the topographical map may include a set of indications, wherein each set of indications represents a level of activity for a sequence associated with a given point on the topographical map.
- the second screen may include a first portion presenting one or more clusters (e.g., color-coded) representing the subset of the set of sequences. As shown in FIG. 15 , the first portion may depict how, in a network, the clusters are organized and interact with each other.
- the one or more color-coded clusters may represent, using an energy correlation, each sequence in the subset.
- the energy correlation may include a correlation between each position of each sequence in the subset and other positions of other sequences in the subset.
- the term “energy correlation” may refer to stability as it affects a specific function of the subset of sequences, or it may also refer to, e.g., a strength of an amino acid in a sequence relative to a strength of another amino acid at a different position in the sequence. For example, an amino acid in context with an adjacent amino acid affects the local folding properties of a peptide.
- Energy correlation values are, to some degree, inversely related to a strength of a specific amino acid (or amino acid modification), where the amino acid is located at a specific position in the peptide chain.
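- The position-by-position energy correlation described above can be sketched numerically. This is an illustrative, hedged sketch with random stand-in data; real per-position energies would come from sampled conformations or an energy function, not a random generator.

```python
# Hypothetical sketch: computing a position-by-position "energy
# correlation" matrix for a peptide, i.e., how the energy at each
# position co-varies with the energy at every other position across
# an ensemble of sampled conformations.
import numpy as np

rng = np.random.default_rng(1)
n_conformations, n_positions = 200, 12
energies = rng.normal(size=(n_conformations, n_positions))

# corr[i, j] is the correlation between position i and position j;
# such a matrix can back the 2-D/3-D correlation views and heatmaps.
corr = np.corrcoef(energies, rowvar=False)   # shape (12, 12)
```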
- the first portion visually represents high-level general information pertaining to the set of sequences in the solution space.
- the visual representation of the solution space may provide an enhanced user interface to a protein designer. For example, by visually depicting the interactions of the clusters representing the set of sequences in a network, a protein designer may be provided with a vast amount of information cognitively understandable by a user in a single user interface without the user's having to view numerous user interfaces to perform additional queries as to how sequences interact with other sequences in a network.
- the second screen may include a second portion presenting data pertaining to the subset of the set of sequences represented by the one or more clusters.
- the data presented in the second portion may be more granular and detailed than the data in the clusters presented in the first portion of the second screen.
- the second portion may include a legend and various windows, including detailed data, as described above with reference to FIG. 15 .
- the detailed data may enable a protein designer to drill down to understand very specific information about the clusters presented in the solution space.
- the specific information may pertain to polo-box domains (PBD), binding sites, interactions, network, associations, biological functions, and the like.
- the detailed data may describe one or more objects associated with the subset of the set of sequences.
- the one or more objects may include a candidate drug compound, an activity, a drug, a gene, a pathway, a physical descriptor, an interaction (e.g., adhesive, etc.), a folding property (e.g., aggregating versus loose), a wave property (e.g., fluorescent, luminescent, iridescent, etc.), a stability of modification (e.g., glycopeptides, lipid peptides, chelates, lasso peptides, etc.), or some combination thereof.
- the processing device may receive, using a graphical element (e.g., button, mouse cursor, input box, dropdown list, slider, radio button, etc.) of the second screen, a selection of a sequence from the subset of the set of sequences. The selection may be based on the sequence being previously untraversed. To that end, the processing device may store each sequence included in the subset presented in the solution space and may track whether the sequence has been generated or traversed before. The processing device may store an indicator (e.g., flag) with each sequence in the database 150 , and the indicator may represent whether the respective sequence has been traversed or is or remains untraversed.
- the sequence traversed may be presented in a first manner (e.g., with a particular color) while the sequence untraversed may be presented in a second manner (e.g., with a different color than the first manner).
- the second screen may provide a graphical element that enables filtering to view only the sequences traversed or, alternatively, untraversed.
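- The traversed/untraversed flag tracking and filtering described above can be sketched as follows. This is a hedged, in-memory stand-in for the database 150; the sequence identifiers are hypothetical.

```python
# Hypothetical sketch: tracking whether each sequence in the solution
# space has been traversed, and filtering the view on that flag.
traversed_flags = {"SEQ-001": True, "SEQ-002": False, "SEQ-003": False}

def mark_traversed(seq_id):
    """Set the indicator (flag) once a sequence has been traversed."""
    traversed_flags[seq_id] = True

def filter_sequences(traversed):
    """Return only traversed (or only untraversed) sequence IDs,
    e.g., to render them in a first or second manner (color)."""
    return [s for s, t in traversed_flags.items() if t == traversed]

mark_traversed("SEQ-002")
```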
- the processing device may present, in the second screen, additional information pertaining to the sequence.
- the additional information may include a candidate drug compound, an interaction, an activity, a drug, a gene, a pathway, or some combination thereof.
- the processing device may receive, using a graphical element of the second screen, a selection of a sequence from the subset of the set of sequences.
- the processing device may present, in a third screen, a candidate dashboard (e.g., candidate dashboard screen of FIG. 18 ) including information pertaining to the selected sequence.
- the information may pertain to a structure of the sequence, a correlation heatmap, experimental data, a list of probabilistic scores generated by one or more inference models, external data related to the sequence (e.g., all related external data to a specific peptide, such as database IDs, aliases, synonyms, etc.), or some combination thereof.
- the list of probabilistic scores may be represented as violin plots detailing a success probability of the sequence in a specific function (e.g., activity such as anti-viral, anti-microbial, anti-fungal, anti-prionic, etc.) across a set of conditions (e.g., query parameters).
- the processing device may receive, in the GUI, one or more parameters pertaining to one or more machine learning models 132 of the artificial intelligence engine 140 .
- the one or more parameters may refer to hyperparameters and may pertain to one or more constraints (e.g., epochs, batch sizes, attention, processor usage, memory usage, execution time, etc.) for the one or more machine learning models to implement when using the solution space to perform one or more trials.
- the processing device may receive, using a graphical element of the second screen, a selection of a sequence from the subset of the set of sequences.
- the processing device may cause the sequence to be manufactured, synthesized, or produced.
- FIG. 22 illustrates example operations of a method 2200 for receiving and presenting of one or more results of performing a selected trial using a machine learning model according to certain embodiments of this disclosure.
- Method 2200 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 , server 128 executing the artificial intelligence engine 140 , etc.).
- one or more operations of the method 2200 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 2200 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 2200 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may receive a selection of a trial configured to be performed by a machine learning model 132 .
- the machine learning model may use the solution space generated, as described with reference to FIG. 23 .
- the trial may include traversing the solution space according to a specific route, a random route, or a combination of a specific route and a random route.
- the traversal may result in points having different activities in the solution space.
- the points may represent a sequence and may be referred to as a candidate drug compound herein.
- the traversal may specify a particular location of a point as a starting point or a particular location of a destination point.
- the traversal may or may not specify the route to traverse to get from the starting point to the destination point.
- the traversal may just specify a starting point or a destination point, and the machine learning model 132 may randomly traverse the solution space to generate different sequences having different activities.
- the one or more machine learning models 132 may be trained to perform maximization functions or minimization functions.
- the machine learning model may measure level of activity at some or all of the points on the surface of the solution space and perform a maximization function by traversing the points having the maximum level of activity relative to other points in proximity.
- the machine learning model may measure level of activity at some or all of the points on the surface of the solution space and perform a minimization function by traversing the points having the minimum level of activity relative to other proximate points.
- the machine learning model may be trained to perform a combination of minimization and maximization functions while performing the traversals.
- the selection of the trial may be transmitted to the artificial intelligence engine 140 .
- the artificial intelligence engine 140 may use the one or more machine learning models 132 to perform the selected trial using the solution space.
- the processing device of the computing device 102 may receive, from the artificial intelligence engine 140 , one or more results of performing the trial.
- the one or more results may (i) provide a location of a point reached in the solution space after performing a traversal of the solution space defined by the trial, or (ii) provide a metric of one or more of the machine learning models 132 used by the artificial intelligence engine 140 to perform the trial.
- the metric may pertain to the process graphic processing unit (GPU) usage (%), the process GPU power usage (%), the process GPU memory allocated (%), the process GPU time spent accessing memory (%), and the process GPU temperature (degrees, e.g., Celsius) (as shown in FIG. 17 ).
- the one or more results may be presented on a user interface of the computing device 102 .
- the one or more results may be compared to select the one or more machine learning models that reached or came closest to a desired point in the solution space, took a desired route (or as close to the desired route as possible) during traversal to the point, generated a desired sequence having desired activity levels, consumed the least or a lesser amount of processor resources, generated the lowest or a lower temperature for the graphic processing unit, consumed the least or a lesser amount of memory resources, or some combination thereof.
- the machine learning models not selected may be subsequently tuned to attempt to improve their results when subsequently performing the same or different trials.
- FIG. 23 illustrates example operations of a method 2300 for using a business intelligence screen to select a desired target product profile for sequences according to certain embodiments of this disclosure.
- Method 2300 includes operations performed by processors of a computing device (e.g., any component of FIG. 1 , such as computing device 102 , server 128 executing the artificial intelligence engine 140 , etc.).
- one or more operations of the method 2300 are implemented in computer instructions stored on a memory device and executed by a processing device.
- the method 2300 may be performed in the same or a similar manner as described above in regard to method 400 .
- the operations of the method 2300 may be performed in some combination with any of the operations of any of the methods described herein.
- the processing device may receive, from a graphical element of a business intelligence screen of the graphical user interface (GUI), a target product profile.
- the target product profile may include pharmacology data, pharmacokinetic data, activity data, manufacturing data (e.g., cost to manufacture, requirements for manufacturing, etc.), compliance data, clinical trial data, or some combination thereof.
- the target product profile may be transmitted to the artificial intelligence engine 140 .
- the artificial intelligence engine 140 may execute one or more machine learning models 132 trained to generate or search for sequences that match the target product profile to within a certain threshold level (e.g., percentage, partial, exact, etc.).
- the processing device may receive, from the artificial intelligence engine 140 , a second subset of the set of sequences.
- the second subset of the set of sequences may be selected based on the target product profile.
- the processing device may present, in the GUI, the second subset of the set of sequences.
- the GUI may include one or more graphical elements that enable the user to drill down to view detailed data pertaining to one or more of the sequences matching (partially or exactly) the target product profile.
- the GUI may include a graphical element that enables selecting one or more sequences to manufacture, produce, synthesize, or the like.
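The profile-matching step described above (sequences matching the target product profile "to within a certain threshold level") can be sketched as follows. The profile fields, sequence attributes, and matching rule are invented for illustration; in the disclosed embodiments the matching is performed by the trained machine learning models 132.

```python
# Hypothetical sketch: select the "second subset" of sequences whose
# attributes satisfy at least a threshold fraction of the target
# product profile's fields. Field names and values are illustrative.

def profile_match(sequence_attrs, target_profile):
    """Fraction of target-profile fields the sequence satisfies."""
    hits = sum(1 for k, v in target_profile.items()
               if sequence_attrs.get(k) == v)
    return hits / len(target_profile)

def select_subset(sequences, target_profile, threshold):
    """Return ids of sequences matching the profile at >= threshold."""
    return [s["id"] for s in sequences
            if profile_match(s, target_profile) >= threshold]

target = {"activity": "anti-infective", "route": "oral", "cost_tier": "low"}
sequences = [
    {"id": "seq-1", "activity": "anti-infective", "route": "oral", "cost_tier": "high"},
    {"id": "seq-2", "activity": "anti-infective", "route": "oral", "cost_tier": "low"},
    {"id": "seq-3", "activity": "cytotoxic", "route": "topical", "cost_tier": "low"},
]
subset = select_subset(sequences, target, threshold=2 / 3)
```

A threshold of 1.0 corresponds to an exact match, while lower thresholds admit the partial matches that the drill-down elements above are meant to surface.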
- FIG. 24 illustrates an example computer system 2400 , which can perform any one or more of the methods described herein, in accordance with one or more aspects of the present disclosure.
- computer system 2400 may correspond to the computing device 102 (e.g., user computing device), one or more servers 128 of the computing system 116 , the training engine 130 , or any suitable component of FIG. 1 .
- the computer system 2400 may be capable of executing application 118 or the one or more machine learning models 132 of FIG. 1 .
- the computer system may be connected (e.g., networked) to other computer systems in a LAN, an intranet, an extranet, or the Internet.
- the computer system may operate in the capacity of a server in a client-server network environment.
- the computer system may be a personal computer (PC), a tablet computer, a wearable (e.g., wristband), a set-top box (STB), a personal digital assistant (PDA), a mobile phone, a camera, a video camera, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device.
- the computer system 2400 includes a processing device 2402 , a volatile memory 2404 (e.g., random access memory (RAM)), a non-volatile memory 2406 (e.g., read-only memory (ROM), flash memory, solid state drives (SSDs)), and a data storage device 2408 , which communicate with each other via a bus 2410 .
- Processing device 2402 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 2402 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets.
- the processing device 2402 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a system on a chip, a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like.
- the processing device 2402 may include more than one processing device, and each of the processing devices may be the same or different types.
- the processing device 2402 may include or be communicatively coupled to one or more accelerators 2403 configured to offload various data-processing tasks from the processing device 2402 .
- the processing device 2402 is configured to execute instructions for performing any of the operations and steps discussed herein.
- the computer system 2400 may further include a network interface device 2412 .
- the network interface device 2412 may be configured to communicate data via any suitable communication protocol.
- the network interface device 2412 may enable wireless (e.g., WiFi, Bluetooth, ZigBee, etc.) or wired (e.g., Ethernet, etc.) communications.
- the computer system 2400 also may include a video display 2414 (e.g., a liquid crystal display (LCD), a light-emitting diode (LED), an organic light-emitting diode (OLED), a quantum LED, a cathode ray tube (CRT), a shadow mask CRT, an aperture grille CRT, or a monochrome CRT), one or more input devices 2416 (e.g., a keyboard or a mouse), and one or more speakers 2418 .
- the video display 2414 and the input device(s) 2416 may be combined into a single component or device (e.g., an LCD touch screen).
- the data storage device 2408 may include a computer-readable medium 2420 on which the instructions 2422 embodying any one or more of the methods, operations, or functions described herein are stored.
- the instructions 2422 may also reside, completely or at least partially, within the main memory 2404 or within the processing device 2402 during execution thereof by the computer system 2400 . As such, the main memory 2404 and the processing device 2402 also constitute computer-readable media.
- the instructions 2422 may further be transmitted or received over a network via the network interface device 2412 .
- While the computer-readable storage medium 2420 is shown in the illustrative examples to be a single medium, the term “computer-readable storage medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that store the one or more sets of instructions.
- the term “computer-readable storage medium” shall also be taken to include any medium capable of storing, encoding, or carrying a set of instructions for execution by the machine, where such set of instructions cause the machine to perform any one or more of the methodologies of the present disclosure.
- the term “computer-readable storage medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
- generating a design space for a peptide for an application, wherein the generating comprises:
- Clause 2 The method of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter.
- the solution space within the design space, wherein the solution space comprises the target subset of the plurality of sets of the updated plurality of sequences, and each sequence of the updated plurality of sequences in the target subset comprises the updated respective plurality of activities that are modified in view of the query parameter.
- receiving the query parameter further comprises receiving the query parameter from a graphical element of a user interface presenting the design space.
- biomaterials comprising adhesives, sealants, binders, chelates, diagnostic reporters, or some combination thereof, and
- biomaterials comprising biopolymers, encapsulation films, flocculants, desiccants, or some combination thereof.
- the information comprises at least classes of:
- the solution space to the computing device for presentation as a topographical map in a user interface of the computing device, wherein the topographical map comprises a plurality of indications that each represent a level of activity for a sequence at a given point on the topographical map.
- the updated respective plurality of activities comprises immunomodulatory activity, receptor binding activity, self-aggregation, cell-penetrating activity, anti-viral activity, peptidergic activity, or some combination thereof.
- the one or more metrics comprise memory usage, graphic processing unit temperature, power usage, processor usage, central processing unit temperature, or some combination thereof;
- a tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to:
- Clause 14 The computer-readable medium of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter.
- the solution space within the design space, wherein the solution space comprises the target subset of the plurality of sets of the updated plurality of sequences, and each sequence of the updated plurality of sequences in the target subset comprises the updated respective plurality of activities that are modified in view of the query parameter.
- Clause 17 The computer-readable medium of any preceding clause, wherein the receiving the query parameter further comprises receiving the query parameter from a graphical element of a user interface presenting the design space.
- a system comprising:
- a memory device storing instructions
- a processing device communicatively coupled to the memory device, wherein the processing device executes the instructions to:
- Clause 20 The system of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter.
- a method for presenting, on a computing device, a graphical user interface (GUI) of a therapeutic tool comprising:
- a first portion presenting one or more color-coded clusters representing the subset of the plurality of sequences
- a second portion presenting data pertaining to the subset of the plurality of sequences represented by the one or more color-coded clusters, wherein the data describes one or more objects associated with the subset of the plurality of sequences, and the one or more objects comprise a candidate drug compound, an activity, an interaction, a drug, a gene, a pathway, a physical descriptor, a characteristic, a folding property, a wave property, a stability of modification, or some combination thereof.
- Clause 23 The method of any preceding clause, wherein the one or more color-coded clusters represent, using an energy correlation, each sequence in the subset, and the energy correlation comprises a correlation between each position of each sequence in the subset and other positions of other sequences in the subset.
- Clause 24 The method of any preceding clause, wherein the solution space is presented as a topographical map in the GUI, wherein the topographical map comprises a plurality of indications that each represent a level of activity for a sequence associated with a given point on the topographical map.
- Clause 26 The method of any preceding clause, wherein the solution space is generated within the design space by one or more machine learning models trained to measure, based on the query parameter, a respective level of one or more of the respective plurality of activities of each of the plurality of sequences in the subset, wherein the query parameter comprises a sequence parameter.
- responsive to the selection of the sequence, presenting, in the second screen, additional information pertaining to the sequence, wherein the additional information comprises a candidate drug compound, an interaction, an activity, a drug, a gene, a pathway, or some combination thereof.
- a candidate dashboard comprising information pertaining to the sequence, wherein the information pertains to a structure of the sequence, a correlation heatmap, experimental data, a list of probabilistic scores generated by inference models, external data related to the sequence, or some combination thereof.
- a metric of a machine learning model used by the artificial intelligence engine to perform the trial, wherein the metric pertains to memory usage, graphic processing unit temperature, power usage, processor usage, central processing unit temperature, or some combination thereof.
- a target product profile comprises pharmacology data, pharmacokinetic data, pharmacodynamic data, activity data, manufacturing data, compliance data, clinical trial data, or some combination thereof;
- Clause 32 The method of any preceding clause, wherein the therapeutic tool is a peptide therapeutic tool.
- Clause 34 The method of any preceding clause, wherein the one or more query parameters comprise a plurality of biomedical ontology terms, a plurality of non-biomedical ontology terms, or some combination thereof.
- a tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to:
- GUI graphical user interface
- a first portion presenting one or more color-coded clusters representing the subset of the plurality of sequences
- a second portion presenting data pertaining to the subset of the plurality of sequences represented by the one or more color-coded clusters, wherein the data describes one or more objects associated with the subset of the plurality of sequences, and the one or more objects comprise a candidate drug compound, an activity, an interaction, a drug, a gene, a pathway, a physical descriptor, a characteristic, a folding property, a wave property, a stability of modification, or some combination thereof.
- a memory device storing instructions
- the processing device executes the instructions to:
- a design space for a protein for an application, wherein the design space comprises a plurality of sequences each containing a respective plurality of activities pertaining to the application;
- a solution space that includes a subset of the plurality of sequences each containing the respective plurality of activities, wherein the subset of the plurality of sequences is selected based on the one or more query parameters.
Description
- This application claims the benefit of U.S. Provisional Application Ser. No. 63/117,068, filed Nov. 23, 2020, titled “Generating Anti-Infective Design Spaces for Selecting Drug Candidates”. The provisional application is incorporated by reference herein as if reproduced in full below.
- This disclosure relates generally to drug discovery. More specifically, this disclosure relates to generating anti-infective design spaces for selecting drug candidates.
- Therapeutics may refer to a branch of medicine concerned with the treatment of disease and the action of remedial agents (e.g., drugs). Therapeutics includes, but is not limited to, the field of ethical pharmaceuticals. Entities in the therapeutics industry may discover, develop, produce, and market drugs for use as medications to be administered or self-administered to patients. Goals of administering or self-administering the drugs may include curing the patient of a disease, causing an active disease to enter a state of remission, vaccinating the patient by stimulating the immune system to better protect against the disease, and/or alleviating, mitigating or ameliorating a symptom. Existing drug discoveries may be based on any combination of human design, high-throughput screening, synthetic products and natural substances.
- In one aspect, a method includes generating a design space for a protein (e.g., peptide) for an application (e.g., drug application, industrial application, veterinary application, environmental recovery application (e.g., oil spill, plastics in waterways and oceans), etc.). The application may refer to a chemical application (e.g., drug) for which the protein is designed. The generating includes identifying sequences for the peptide, and updating the sequences by determining, for each of the sequences, a respective set of activities pertaining to the application. The updating produces updated sequences each having updated respective activities. The method includes generating, based on the updated sequences, a solution space within the design space. The solution space includes a target subset of the updated sequences. The method includes performing, using a machine learning model to process the solution space, trials to identify a candidate drug compound that represents a sequence having a level of activity that exceeds a threshold level, and transmitting information describing the candidate drug compound to a computing device.
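As a rough illustration of the flow in this aspect (design space, solution space, and threshold-based candidate selection), consider the following sketch. The sequences and activity levels are invented, and simple scalar thresholding stands in for the machine-learning trials described above.

```python
# Hypothetical sketch of the aspect above: a design space maps each
# sequence to per-activity levels; the solution space is a target
# subset of that design space; a candidate drug compound is a
# sequence whose activity level exceeds a threshold.

design_space = {
    "GLFDIVKKV": {"antimicrobial": 0.42, "cytotoxic": 0.10},
    "KWKLFKKIG": {"antimicrobial": 0.88, "cytotoxic": 0.05},
    "ALWKTMLKK": {"antimicrobial": 0.67, "cytotoxic": 0.31},
}

def solution_space(space, activity, floor):
    """Target subset: sequences whose activity level is at least `floor`."""
    return {seq: acts for seq, acts in space.items()
            if acts[activity] >= floor}

def best_candidate(space, activity, threshold):
    """Sequence whose activity level exceeds `threshold`, if any."""
    viable = [(acts[activity], seq) for seq, acts in space.items()
              if acts[activity] > threshold]
    return max(viable)[1] if viable else None

subset = solution_space(design_space, "antimicrobial", 0.5)
candidate = best_candidate(subset, "antimicrobial", 0.8)
```

In the disclosed method the solution space is produced by machine learning models rather than a fixed cutoff, but the shape of the computation (narrow the design space, then select by activity level against a threshold) is the same.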
- In another aspect, a system may include a memory device storing instructions and a processing device communicatively coupled to the memory device. The processing device may execute the instructions to perform one or more operations of any method disclosed herein.
- In another aspect, a tangible, non-transitory computer-readable medium may store instructions and a processing device may execute the instructions to perform one or more operations of any method disclosed herein.
- Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
- Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document. The term “couple” and its derivatives refer to any direct or indirect communication between two or more elements, independent of whether those elements are in physical contact with one another. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both direct and indirect communication. The terms “transmit,” “receive,” and “communicate,” as well as derivatives thereof, encompass both communication with remote systems and communication within a system, including reading and writing to different portions of a memory device. The terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation. The term “or” is inclusive, meaning and/or. The phrase “associated with,” as well as derivatives thereof, means to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, have a relationship to or with, or the like. The term “translate” may refer to any operation performed wherein data is input in one format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation and data is output in a different format, representation, language (computer, purpose-specific, such as drug design or integrated circuit design), structure, appearance or other written, oral or representable instantiation, wherein the data output has a similar or identical meaning, semantically or otherwise, to the data input. 
Translation as a process includes but is not limited to substitution (including macro substitution), encryption, hashing, encoding, decoding or other mathematical or other operations performed on the input data. The same means of translation performed on the same input data will consistently yield the same output data, while a different means of translation performed on the same input data may yield different output data which nevertheless preserves all or part of the meaning or function of the input data, for a given purpose. Notwithstanding the foregoing, in a mathematically degenerate case, a translation can output data identical to the input data. The term “controller” means any device, system or part thereof that controls at least one operation. Such a controller may be implemented in hardware or a combination of hardware and software and/or firmware. The functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. The phrase “at least one of,” when used with a list of items, means that different combinations of one or more of the listed items may be used, and only one item in the list may be needed. For example, “at least one of: A, B, and C” includes any of the following combinations: A, B, C, A and B, A and C, B and C, and A and B and C.
- Moreover, various functions described below can be implemented or supported by one or more computer programs, each of which is formed from computer readable program code and embodied in a computer readable storage medium. The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in a suitable computer readable program code. The phrase “computer readable program code” includes any type of computer code, including source code, object code, and executable code. The phrase “computer readable storage medium” includes any type of medium capable of being accessed by a computer, such as read only memory (ROM), random access memory (RAM), a hard disk drive, a compact disc (CD), a digital video disc (DVD), solid state drive (SSD), or any other type of memory. A “non-transitory” computer readable storage medium excludes wired, wireless, optical, or other communication links that transport transitory electrical or other signals. A non-transitory computer readable storage medium includes media where data can be permanently stored and media where data can be stored and later overwritten, such as a rewritable optical disc or an erasable memory device.
- The terms “candidate drugs” and “candidate drug compounds” may be used interchangeably herein.
- Definitions for other certain words and phrases are provided throughout this patent document. Those of ordinary skill in the art should understand that in many if not most instances, such definitions apply to prior as well as future uses of such defined words and phrases.
- For a more complete understanding of this disclosure and its advantages, reference is now made to the following description, taken in conjunction with the accompanying drawings, in which:
- FIG. 1A illustrates a high-level component diagram of an illustrative system architecture according to certain embodiments of this disclosure;
- FIG. 1B illustrates an architecture of the artificial intelligence engine according to certain embodiments of this disclosure;
- FIG. 1C illustrates first components of an architecture of the creator module according to certain embodiments of this disclosure;
- FIG. 1D illustrates second components of the architecture of the creator module according to certain embodiments of this disclosure;
- FIG. 1E illustrates an architecture of a variational autoencoder according to certain embodiments of this disclosure;
- FIG. 1F illustrates an architecture of a generative adversarial network used to generate candidate drugs according to certain embodiments of this disclosure;
- FIG. 1G illustrates types of encodings to represent certain types of drug information according to certain embodiments of this disclosure;
- FIG. 1H illustrates an example of concatenating numerous encodings into a candidate drug according to certain embodiments of this disclosure;
- FIG. 1I illustrates an example of using a variational autoencoder to generate a latent representation of a candidate drug according to certain embodiments of this disclosure;
- FIG. 2 illustrates a data structure storing a biological context representation according to certain embodiments of this disclosure;
- FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure;
- FIG. 4 illustrates example operations of a method for generating and classifying a candidate drug compound according to certain embodiments of this disclosure;
- FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation of a plurality of drug compounds according to certain embodiments of this disclosure;
- FIG. 6 illustrates example operations of a method for translating the first data structure of FIGS. 5A-5D into a second data structure having a second format according to certain embodiments of this disclosure;
- FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5D into the second data structure having the second format according to certain embodiments of this disclosure;
- FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure;
- FIG. 9 illustrates example operations of a method for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure;
- FIG. 10A illustrates example operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure;
- FIG. 10B illustrates another example of operations of a method for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure;
- FIG. 11 illustrates example operations of a method for using several machine learning models in an artificial intelligence engine architecture to generate peptides according to certain embodiments of this disclosure;
- FIG. 12 illustrates example operations of a method for performing a benchmark analysis according to certain embodiments of this disclosure;
- FIG. 13 illustrates example operations of a method for slicing a latent representation based on a shape of the latent representation according to certain embodiments of this disclosure;
- FIG. 14 illustrates a high-level flow diagram for a therapeutics tool implementing business intelligence according to certain embodiments of this disclosure;
- FIG. 15 illustrates an example user interface for using query parameters to generate a solution space that includes protein sequences according to certain embodiments of this disclosure;
- FIG. 16 illustrates an example user interface for tracking information pertaining to trials according to certain embodiments of this disclosure;
- FIG. 17 illustrates an example user interface for presenting performance metrics of machine learning models that perform trials according to certain embodiments of this disclosure;
- FIG. 18 illustrates an example user interface for a candidate dashboard screen according to certain embodiments of this disclosure;
- FIG. 19 illustrates example operations of a method for generating a design space for a peptide for an application according to certain embodiments of this disclosure;
- FIG. 20 illustrates example operations of a method for comparing performance metrics of machine learning models according to certain embodiments of this disclosure;
- FIG. 21 illustrates example operations of a method for presenting a design space and a solution space within a graphical user interface of a therapeutics tool according to certain embodiments of this disclosure;
- FIG. 22 illustrates example operations of a method for receiving and presenting one or more results of performing a selected trial using a machine learning model according to certain embodiments of this disclosure;
- FIG. 23 illustrates example operations of a method for using a business intelligence screen to select a desired target product profile for sequences according to certain embodiments of this disclosure; and
FIG. 24 illustrates an example computer system according to certain embodiments of this disclosure.
- Conventional drug discoveries based on human design, high-throughput screening, or natural substances may be inefficient, riven with noise, limited in application, not efficacious, dangerous or poisonous, or not defensible. Further, certain diseases (e.g., prosthetic joint infections) have no corresponding existing therapeutic, or have only therapeutics that provide temporary results against which the disease is refractory. One reason for the lack of an existing therapeutic may be that conventional drug discovery techniques are incapable of discovering the therapeutic needed to treat those diseases. By "treat," we mean, inter alia, that the disease at hand is cured and is not refractory to treatment. The amount of knowledge, data, assumptions, and queries used to discover a therapeutic to treat such a disease may be unattainable, overwhelming, or inefficiently determined, such that conventional drug discovery techniques cannot overcome these obstacles. Improvement is desired in the field of therapeutics.
- Further, conventional techniques for searching for candidate drugs use limited design spaces. For example, some conventional techniques focus on a fact about drugs, where such facts constrain the design space that is searched. The design space may refer to parameterization of limits and constraints in a drug space where candidate drug compounds may be designed. A design space may also refer to a multidimensional combination and interaction of input variables (e.g., material attributes) and process parameters that have been demonstrated to provide assurance of quality. An example of such a fact may include a certain biomedical activity known to be linked to an alpha-helix physical structure of a peptide, where conventional techniques may search for other activities that may result from a peptide having the alpha-helix physical structure. Such a limited design space may limit the results obtained. Thus, it is desirable to enlarge the design space to account for other information such as drug sequence information, drug activity information, drug semantic information, drug chemical information, drug physical information, and so forth. However, enlarging the design space may increase the complexity of searching the design space.
- Accordingly, aspects of the present disclosure generally relate to an artificial intelligence engine for generating candidate drugs. By using various encoding types that enable performing searches in the design space in an efficient manner, the artificial intelligence (AI) engine may enlarge the design space to include the combination of drug information (e.g., structural, physical, semantic, activity, sequence, chemical, attributes expressed in solubility data, properties expressed in solubility data, related structures, related drugs, chemical synthesis, biological synthesis, intellectual property data, clinical data, market data, etc.). The architecture of the AI engine may include various computational techniques that reduce the computational complexity of using a large design space, thereby saving computing resources (e.g., reducing computing time, reducing processing resources, reducing memory resources, etc.). At the same time, the disclosed architecture may generate superior candidate drugs that include desirable features (e.g., structure, semantics, activity, sequence, clinical outcomes, etc.) found in the larger design space as compared to conventional techniques using the smaller design space.
- The artificial intelligence (AI) engine may use a combination of rational algorithmic discovery and machine learning models (e.g., generative deep learning methods) to produce enhanced therapeutics that may treat any suitable target disease or medical condition. The AI engine may discover, translate, design, generate, create, develop, formulate, classify, or test candidate drug compounds that exhibit desired activity (e.g., antimicrobial, immunomodulatory, cytotoxic, neuromodulatory, etc.) in design spaces for target diseases or medical conditions. Such candidate drug compounds that exhibit desired activity in a design space may effectively treat the disease or medical condition associated with that design space. In some embodiments, a selected candidate drug compound that effectively treats the disease or medical condition may be formulated into an actual drug for administration and may be tested in a lab or at a clinical stage.
- In general, the disclosed embodiments may enable rational discovery of drug compounds for a larger design space at a larger scale, with higher accuracy, or with higher efficiency than conventional techniques. The AI engine may use various machine learning models to discover, translate, design, generate, create, develop, formulate, classify, or test candidate drug compounds. Each of the various machine learning models may perform certain specific operations. The types of machine learning models may include various neural networks that perform deep learning, computational biology, or algorithmic discovery. Examples of such neural networks may include generative adversarial networks, recurrent neural networks, convolutional neural networks, fully connected neural networks, etc., as described further below; and such networks may also additionally employ methods of or incorporating causal inference, including counterfactuals, in the process of discovery.
- In some embodiments, a biological context representation of a set of drug compounds may be generated. The biological context representation may be a continuous representation of a biological setting that is updated as knowledge is acquired or data is updated. The biological context representation may be stored in a first data structure having a format (e.g., a knowledge graph) that includes both various nodes pertaining to health artifacts and various relationships connecting the nodes. The nodes and relationships may form logical structures having subjects and predicates. For example, one logical structure between two nodes having a relation may be “Genes are associated with Diseases” where “Genes” and “Diseases” are the subjects of the logical structure and “are associated with” is the relation. In such a way, the knowledge graph may encompass actual knowledge, rather than simply statistical inferences, pertaining to a biological setting.
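As an illustrative sketch only (the class and method names below are hypothetical, not from the disclosure), the logical structures described above can be stored as subject–relation–object triples and queried by subject node:

```python
# Minimal sketch of a knowledge graph of logical structures, where each
# triple has subjects ("Genes", "Diseases") joined by a relation.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.triples = set()
        self.by_subject = defaultdict(set)

    def add(self, subject, relation, obj):
        """Add a logical structure such as ('Genes', 'are associated with', 'Diseases')."""
        self.triples.add((subject, relation, obj))
        self.by_subject[subject].add((relation, obj))

    def relations_of(self, subject):
        """Return all (relation, object) pairs recorded for a subject node."""
        return self.by_subject[subject]

kg = KnowledgeGraph()
kg.add("Genes", "are associated with", "Diseases")
kg.add("Drugs", "target", "Proteins")
```

In this toy form, updating the biological context representation as knowledge is acquired amounts to adding or replacing triples.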
- The information in the knowledge graph may be continuously or periodically updated and the information may be received from various sources curated by the AI engine. The knowledge in the biological context representation goes well beyond “dumb” data that just includes quantities of a value because the knowledge represents the relationships between or among numerous different types of data, as well as any or all of direct, indirect, causal, counterfactual or inferred relationships. In some embodiments, the biological context representation may not be stored, and instead, based on the stream of knowledge included in the biological context representation, may be streamed from data sources into the AI engine that generates the machine learning models.
- The biological context representation may be used to generate candidate drug compounds by translating the first data format to a second data structure having a second format (e.g., a vector). The second format may be more computationally efficient or suitable for generating candidate drug compounds that include sequences of ingredients that provide desired activity in a design space. “Ingredients” as used herein may refer, without limitation, to substances, compounds, elements, activities (such as the application or removal of electrical charge or a magnetic field for a specific maximum, minimum or discrete amount of time), and mixtures. Further, the second format may enable generating views of the levels of activity provided by the sequence of ingredients in a certain design space, as described further below.
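As a hedged illustration of translating the first format into a second, more computationally efficient format, the sketch below one-hot encodes a sequence of ingredient symbols into a flat numeric vector; the ingredient alphabet and function name are assumptions for illustration, not the disclosed encoding:

```python
# Hypothetical translation from a sequence of ingredients (first format)
# into a fixed-length numeric vector (second format) via one-hot encoding.
VOCAB = ["A", "C", "D", "E", "G"]  # illustrative ingredient alphabet

def encode(sequence, vocab=VOCAB):
    """Return a flat one-hot vector: one block of len(vocab) per ingredient."""
    vec = []
    for item in sequence:
        vec.extend(1.0 if item == v else 0.0 for v in vocab)
    return vec

v = encode(["A", "G"])  # 2 ingredients x 5 vocabulary entries = length 10
```

Vector encodings of this general kind are what make similarity searches and machine learning over the design space tractable.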
- At a high level, the AI engine may include at least one machine learning model that is trained to use causal inference to generate candidate drug compounds. One of the challenges with discovering new therapeutics may include determining whether certain ingredients may be causal agents with respect to certain activity in a design space. The sheer number of possible sequences of ingredients may be extraordinarily large due to mathematical combinatorics, such that identifying a cause and effect relationship between ingredients and activity may be impossible or, at best, extremely unlikely, to identify without the disclosed embodiments. (For example, in public-key encryption, it is theoretically possible to discover and unlock a private key, but doing this would presently require all the computing power in the world to work longer than the age of the universe: this is an example of what is mathematically possible, but impossible within human time frames and computing power. Identifying a cause-and-effect relationship between ingredients and activity, while a different problem, may be similarly mathematically possible, but impossible within human time frames and computer power.) Based on advances in computing hardware (e.g., graphic processing unit processing cores) and the AI techniques using causal inference described herein, the disclosed embodiments may enable the efficient solving of the task of generating candidate drug compounds at scale.
- Causal inference may refer to a process, based on conditions of an occurrence of an effect, of drawing a conclusion about a causal connection. Causal inference may analyze a response of an effect variable when a cause is changed. Causation may be defined thusly: a variable X is a cause of Y if Y "listens" to X and determines its response based on what it "hears." The process of causal inference in the field of AI may be particularly beneficial for generating and testing candidate drug compounds for certain diseases or medical conditions because of the use of what are termed counterfactuals. A counterfactual posits and examines conditions contrary to what has actually occurred in reality. For example, if someone takes aspirin for a headache, the headache may go away. The counterfactual asks what would have happened if the person had not taken aspirin, i.e., would the headache still have gone away, or would it have remained or even gotten worse? Accordingly, counterfactuals may refer to calculating alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof. A counterfactual may enable determining whether a response should stay the same or instead change if something in a sequence does not occur. For example, one counterfactual may include asking: "Would a certain level of activity be the same if a certain ingredient is not included in a sequence of a candidate drug compound?"
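The counterfactual question quoted above can be sketched with a toy structural model; the response function and its coefficients here are invented purely for illustration (a real system would learn or estimate them from data):

```python
# Toy counterfactual: "would activity be the same if ingredient 'X'
# were not included in the sequence?"
def activity(sequence):
    # Invented response function: ingredient "X" contributes 0.5 to the
    # activity level, every other ingredient contributes 0.1.
    return sum(0.5 if s == "X" else 0.1 for s in sequence)

def counterfactual_effect(sequence, ingredient):
    """Difference between the factual activity and the counterfactual
    activity computed with `ingredient` removed from the sequence."""
    factual = activity(sequence)
    counterfactual = activity([s for s in sequence if s != ingredient])
    return factual - counterfactual

effect = counterfactual_effect(["A", "X", "B"], "X")
```

A nonzero effect suggests the ingredient is a causal agent for the activity in this toy model; an effect of zero suggests the response would stay the same without it.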
- By simulating numerous alternative scenarios to further optimize and hone the accuracy of a sequence of ingredients in the candidate drug compounds, such techniques may enable reducing the number of viable candidate drug compounds. As a result, the embodiments may provide technical benefits, such as reducing resources consumed (e.g., time, processing, memory, network bandwidth) by reducing a number of candidate drug compounds that may be considered for classification as a selected candidate drug compound by another machine learning model.
- In some embodiments, one application for the AI engine to design, discover, develop, formulate, create, or test candidate drug compounds may pertain to peptide therapeutics. A peptide may refer to a compound consisting of two or more amino acids linked in a chain. Example peptides may include dipeptides, tripeptides, tetrapeptides, etc. A polypeptide may refer to a long, continuous, and unbranched peptide chain. A cyclic peptide may refer to a polypeptide which contains a circular sequence of bonded amino acids. A modified peptide may refer to a synthesized peptide that undergoes a modification to a side chain, C-terminus, or N-terminus. Peptides may be simple to manufacture at discovery scale, include drug-like characteristics of small molecules, include safety and high specificity of biologics, or provide greater administration flexibility than some other biologics.
- The disclosed techniques provide numerous benefits over conventional techniques for designing, developing, or testing candidate drug compounds. For example, the AI engine may efficiently use a biological context representation of a set of drug compounds and one or more machine learning models to generate a set of candidate drug compounds and classify one of the set of candidate drug compounds as a selected candidate drug compound. Some embodiments may use causal inference to remove one or more potential candidate drug compounds from classification, thereby reducing the computational complexity and processing burden of classifying a selected candidate drug compound.
- In addition, benchmark analysis may be performed for each type of machine learning model that generates candidate drugs. The benchmark analysis may score various parameters of the machine learning models that generate the candidate drugs. The various parameters may refer to candidate drug novelty, candidate drug uniqueness, candidate drug similarity, candidate drug validity, etc. The scores may be used to recursively tune the machine learning models over time to cause one or more of the parameters to increase for the machine learning models. In some embodiments, some of the machine learning models may vary in their effectiveness as it pertains to some of the parameters. In addition, to generate subsequent candidate drugs, the benchmark analysis may score the candidate drugs generated by the machine learning models, rank the machine learning models that generate the highest scoring candidate drugs, or select the machine learning models producing the highest scoring candidate drugs.
- Also, certain markets (e.g., anti-infective, animal, industrial, etc.) may prefer, based on a type of data those markets generate, to use certain machine learning models that generate high scores for a subset of parameters. Accordingly, in some embodiments, the subset of machine learning models that generate the high scores for the subset of parameters may be combined into a package and transmitted to a third party. That is, some embodiments enable custom tailoring of machine learning model packages for particular needs of third parties based on their data.
- Further, additional benefits of the embodiments disclosed herein may include using the AI engine to produce algorithmically designed drug compounds that have been validated in vivo and in vitro and that provide (i) a broad-spectrum activity against greater than, e.g., 900 multi-drug resistant bacteria, (ii) at least, e.g., a 2-to-10 times improvement in exposure time required to generate a drug resistance profile, (iii) effectiveness across, e.g., four key animal infection models (both Gram-positive and Gram-negative bacteria), or (iv) effectiveness against, e.g., biofilms.
- It should be noted that the embodiments disclosed herein may not only apply to the anti-infective market (e.g., for prosthetic joint infections, urinary tract infections, intra-abdominal or peritoneal infections, otitis media, cardiac infections, respiratory infections including but not limited to sequelae from diseases such as cystic fibrosis, neurological infections (e.g., meningitis), dental infections (including periodontal), other organ infections, digestive and intestinal infections (e.g., C. difficile), other physiological system infections, wound and soft tissue infections (e.g., cellulitis), etc.), but to numerous other suitable markets or industries. For example, the embodiments may be used in the animal health/veterinary industry, for example, to treat certain animal diseases (e.g., bovine mastitis). Also, the embodiments may be used for industrial applications, such as anti-biofouling, or generating optimized control action sequences for machinery. The embodiments may also benefit a market for new therapeutic indications, such as those for eczema, inflammatory bowel disease, Crohn's Disease, rheumatoid arthritis, asthma, auto-immune diseases and disease processes in general, inflammatory disease progressions or processes, or oncology treatments and palliatives. The video game industry may also benefit from the disclosed techniques to improve the AI used for generating sequences of decisions that non-player characters (NPC) make during gameplay. For example, the knowledge graph may include multiple states of: player characters, non-player characters, levels, settings, actions, results of the actions, and so forth, and one or more machine learning models may use the techniques described herein to generate optimized sequences of decisions for NPCs to make during gameplay when the states are encountered. 
The integrated circuit/chip industry may also benefit from the disclosed techniques to improve the mask works generation and routing processes used for generating the most efficient, highest performance, lowest power, lowest heat generating systems on a chip or solid state devices. For example, the knowledge graph may include configurations of mask works and routings of systems on chips or solid state drives, as well as their associated properties (e.g., efficiency, performance, power consumption, operating temperature, etc.). The disclosed techniques may generate one or more machine learning models trained using the knowledge graph to generate optimized mask works or routings to achieve desired properties. Accordingly, it should be understood that the disclosed embodiments may benefit any market or industry associated with a sequence (e.g., items, objects, decisions, actions, ingredients, etc.) that can be optimized.
FIGS. 1A through 14, discussed below, and the various embodiments used to describe the principles of this disclosure are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. -
FIG. 1A illustrates a high-level component diagram of an illustrative system architecture 100 according to certain embodiments of this disclosure. In some embodiments, the system architecture 100 may include a computing device 102 communicatively coupled to a computing system 116. The computing system 116 may be a real-time software platform, include privacy software or protocols, or include security software or protocols. Each of the computing device 102 and components included in the computing system 116 may include one or more processing devices, memory devices, or network interface cards. The network interface cards may enable communication via a wireless protocol for transmitting data over short distances, such as Bluetooth, ZigBee, NFC, etc. Additionally, the network interface cards may enable communicating data via a wired protocol over short or long distances, and in one example, the computing device 102 and the computing system 116 may communicate with a network 112. Network 112 may be a public network (e.g., connected to the Internet via wired (Ethernet) or wireless (WiFi)), a private network (e.g., a local area network (LAN) or wide area network (WAN)), or a combination thereof. In some embodiments, network 112 may also comprise a node or nodes on the Internet of Things (IoT). - The
computing device 102 may be any suitable computing device, such as a laptop, tablet, smartphone, or computer. The computing device 102 may include a display capable of presenting a user interface of an application 118. The application 118 may be implemented in computer instructions stored on the one or more memory devices of the computing device 102 and executable by the one or more processing devices of the computing device 102. The application 118 may present various screens to a user that present various views (e.g., topographical heatmaps) including measures, gradients, or levels of certain types of activity and optimized sequences of selected candidate drug compounds, information pertaining to the selected candidate drug compounds or other candidate drug compounds, options to modify the sequence of ingredients in the selected candidate drug compound, and so forth, as described in more detail below. The computing device 102 may also include instructions stored on the one or more memory devices that, when executed by the one or more processing devices of the computing device 102, perform operations of any of the methods described herein. - In some embodiments, the
computing system 116 may include one or more servers 128 that form a distributed computing system, which may include a cloud computing system. The servers 128 may be a rackmount server, a router, a personal computer, a portable digital assistant, a mobile phone, a laptop computer, a tablet computer, a camera, a video camera, a netbook, a desktop computer, a media center, any other device capable of functioning as a server, or any combination of the above. Each of the servers 128 may include one or more processing devices, memory devices, data storage, or network interface cards. The servers 128 may be in communication with one another via any suitable communication protocol. The servers 128 may execute an artificial intelligence (AI) engine 140 that uses one or more machine learning models 132 to perform at least one of the embodiments disclosed herein. The computing system 116 may also include a database 150 that stores data, knowledge, and data structures used to perform various embodiments. For example, the database 150 may store a knowledge graph containing the biological context representation described further below. Further, the database 150 may store the structures of generated candidate drug compounds, the structures of selected candidate drug compounds, and information pertaining to the selected candidate drug compounds (e.g., activity for certain types of ingredients, sequences of ingredients, test results, correlations, semantic information, structural information, physical information, chemical information, etc.). Although depicted separately from the server 128, in some embodiments, the database 150 may be hosted on one or more of the servers 128. - In some embodiments, the
computing system 116 may include a training engine 130 capable of generating one or more machine learning models 132. Although depicted separately from the AI engine 140, the training engine 130 may, in some embodiments, be included in the AI engine 140 executing on the server 128. In some embodiments, the AI engine 140 may use the training engine 130 to generate the machine learning models 132 trained to perform inferencing operations. The machine learning models 132 may be trained to discover, translate, design, generate, create, develop, classify, or test candidate drug compounds, among other things. The one or more machine learning models 132 may be generated by the training engine 130 and may be implemented in computer instructions executable by one or more processing devices of the training engine 130 or the servers 128. To generate the one or more machine learning models 132, the training engine 130 may train the one or more machine learning models 132. The one or more machine learning models 132 may be used by any of the modules in the AI engine 140 architecture depicted in FIG. 2. - The
training engine 130 may be a rackmount server, a router, a personal computer, a portable digital assistant, a smartphone, a laptop computer, a tablet computer, a netbook, a desktop computer, an Internet of Things (IoT) device, any other desired computing device, or any combination of the above. The training engine 130 may be cloud-based, be a real-time software platform, include privacy software or protocols, or include security software or protocols. - To generate the one or more
machine learning models 132, the training engine 130 may train the one or more machine learning models 132. The training engine 130 may use a base data set of biological context representation (e.g., physical properties data, peptide activity data, microbe data, antimicrobial data, anti-neurodegenerative compound data, pro-neuroplasticity compound data, clinical outcome data, etc.) for a set of drug compounds. For example, the biological context representation may include sequences of ingredients for the drug compounds. The results may include information indicating levels of certain types of activity associated with certain design spaces. In one embodiment, the results may include causal inference information pertaining to whether certain ingredients in the drug compounds are correlated with or determined by certain effects (e.g., activity levels) in the design space. - The one or more
machine learning models 132 may refer to model artifacts created by the training engine 130 using training data that includes training inputs and corresponding target outputs. The training engine 130 may find patterns in the training data wherein such patterns map the training input to the target output and generate the machine learning models 132 that capture these patterns. Although depicted separately from the server 128, in some embodiments, the training engine 130 may reside on server 128. Further, in some embodiments, the artificial intelligence engine 140, the database 150, or the training engine 130 may reside on the computing device 102. - As described in more detail below, the one or more
machine learning models 132 may comprise, e.g., a single level of linear or non-linear operations (e.g., a support vector machine (SVM)) or the machine learning models 132 may be a deep network, i.e., a machine learning model comprising multiple levels of non-linear operations. Examples of deep networks are neural networks, including generative adversarial networks, convolutional neural networks, recurrent neural networks with one or more hidden layers, and fully connected neural networks (e.g., each artificial neuron may transmit its output signal to the input of the remaining neurons, as well as to itself). For example, the machine learning model may include numerous layers or hidden layers that perform calculations (e.g., dot products) using various neurons. In some embodiments, one or more of the machine learning models 132 may be trained to use causal inference and counterfactuals. - For example, the
machine learning model 132 trained to use causal inference may accept one or more inputs, such as (i) assumptions, (ii) queries, and (iii) data. The machine learning model 132 may be trained to output one or more outputs, such as (i) a decision as to whether a query may be answered, (ii) an objective function (also referred to as an estimand) that provides an answer to the query for any received data, and (iii) an estimated answer to the query and an estimated uncertainty of the answer, where the estimated answer is based on the data and the objective function, and the estimated uncertainty reflects the quality of data (i.e., a measure which takes into account the degree or salience of incorrect data or missing data). The assumptions may also be referred to as constraints and may be simplified into statements used in the machine learning model 132. The queries may refer to scientific questions for which the answers are desired. - The answers estimated using causal inference by the machine learning model may include optimized sequences of ingredients in selected candidate drug compounds. As the machine learning model estimates answers (e.g., candidate drug compounds), certain causal diagrams may be generated, as well as logical statements, and patterns may be detected. For example, one pattern may indicate that "there is no path connecting ingredient D and activity P," which may translate to a statistical statement "D and P are independent." If alternative calculations using counterfactuals contradict or do not support that statistical statement, then the
machine learning model 132 or the biological context representation may be updated. For example, another machine learning model 132 may be used to compute a degree of fitness which represents a degree to which the data is compatible with the assumptions used by the machine learning model that uses causal inference. There are certain techniques that may be employed by the other machine learning model 132 to reduce the uncertainty and increase the degree of compatibility. The techniques may include those for maximum likelihood, propensity scores, confidence indicators, or significance tests, among others. - In some embodiments, a generative adversarial network (GAN) may generate a set of candidate drug compounds without using causal inference. In some embodiments, the GAN may generate a set of candidate drug compounds using causal inference. A GAN refers to a class of deep learning algorithms including two neural networks, a generator and a discriminator, that both compete with one another to achieve a goal. For example, regarding candidate drug compound generation, the generator goal may include generating candidate drug compounds, including compatible/incompatible sequences of ingredients, and effective/ineffective sequences of ingredients, etc., that the discriminator classifies as feasible candidate drug compounds, including compatible and effective sequences of ingredients that may produce desired activity levels for a design space. In one embodiment, the generator may use causal inference, including counterfactuals, to calculate numerous alternative scenarios that indicate whether a certain result (e.g., activity level) still follows when any element or aspect of a sequence changes. For example, the generator may be a neural network based on Markov models (e.g., Deep Markov Models), which may perform causal inference. In some embodiments, one or more of the counterfactuals used during the causal inference may be determined and provided by the scientist module.
The discriminator goal may include distinguishing candidate drug compounds which include undesirable sequences of ingredients from candidate drug compounds which include desirable sequences of ingredients.
- In some embodiments, the generator initially generates candidate drug compounds and continues to generate better candidate drug compounds after each iteration until the generator eventually begins to generate candidate drug compounds that are valid drug compounds which produce certain levels of activity within a design space. A candidate drug compound may be “valid” when it produces a certain level of effectiveness (e.g., above a threshold activity level as determined by a standard (e.g., regulatory entity)) in a design space. In order to classify the candidate drug compounds as a valid drug compound or invalid candidate drug compound, the discriminator may receive real drug compound information from a dataset and the candidate drug compounds generated by the generator. “Real drug compound,” as used in this disclosure, may refer to a drug compound that has been approved by any regulatory (governmental) body or agency. The generator obtains the results from the discriminator and applies the results in order to generate better (e.g., valid) candidate drug compounds.
- General details regarding the GAN are now discussed. The two neural networks, the generator and the discriminator, may be trained simultaneously. The discriminator may receive an input and then output a scalar indicating whether a candidate drug compound is an actual or viable drug compound. In some embodiments, the discriminator may resemble an energy function that outputs a low value (e.g., close to 0) when input is a valid drug compound and a positive value when the input is not a valid drug compound (e.g., if it includes an incorrect sequence of ingredients for certain activity levels pertaining to a design space).
- There are two functions that may be used, the generator function (G(V)) and the discriminator function (D(Y)). The generator function may be denoted as G(V), where V is generally a vector randomly sampled from a standard distribution (e.g., Gaussian). The vector may be any suitable dimension and may be referred to as an embedding herein. The role of the generator is to produce candidate drug compounds to train the discriminator function (D(Y)) to output values indicating that a candidate drug compound is valid (e.g., a low value), where Y is generally a vector referred to as an embedding and where, further, Y may include candidate drug compounds or real drug compounds.
- During training, the discriminator is presented with a valid drug compound and adjusts its parameters (e.g., weights and biases) to output a value indicative of the validity of the candidate drug compounds that produce real activity levels in certain design spaces. Next, the discriminator may receive a modified candidate drug compound (e.g., modified using counterfactuals) generated by the generator and adjust its parameters to output a value indicative of whether the modified candidate drug compound provides the same or a different activity level in the design space.
- The discriminator may use a gradient of an objective function to increase the value of the output. The discriminator may be trained as an unsupervised “density estimator,” i.e., a contrast function produces a low value for desired data (e.g., candidate drug compounds that include sequences producing desired levels of certain types of activity in a design space) and higher output for undesired data (e.g., candidate drug compounds that include sequences producing undesirable levels of certain types of activity in a design space). The generator may receive the gradient of the discriminator with respect to each modified candidate drug compound it produces. The generator uses the gradient to train itself to produce modified candidate drug compounds that the discriminator determines include sequences producing desired levels of certain types of activity in a design space.
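A highly simplified, non-gradient sketch of the G(V)/D(Y) interplay described above follows; the peptide alphabet, the "desired motif," and the energy values are illustrative assumptions, not the disclosed implementation:

```python
# Toy generator/discriminator pair: G maps a random vector V to a candidate
# sequence, and D acts as an energy function that outputs a low value (0.0)
# for a "valid" candidate and a positive value otherwise.
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # illustrative peptide alphabet

def G(v):
    """Generator: map each component of the random vector V to an ingredient."""
    return "".join(ALPHABET[int(abs(x)) % len(ALPHABET)] for x in v)

def D(y, motif="KK"):
    """Discriminator as an energy function: 0.0 when the candidate contains
    a (toy) desired motif, 1.0 otherwise."""
    return 0.0 if motif in y else 1.0

rng = random.Random(0)
v = [rng.gauss(0, 5) for _ in range(8)]  # V sampled from a Gaussian
candidate = G(v)
energy = D(candidate)
```

In a real GAN, the generator would use the discriminator's gradient (rather than a fixed motif check) to improve its candidates at each iteration.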
- Recurrent neural networks include the functionality, in the context of a hidden layer, to process information sequences and store information about previous computations. As such, recurrent neural networks may have or exhibit a “memory.” Recurrent neural networks may include connections between nodes that form a directed graph along a temporal sequence. Keeping and analyzing information about previous states enables recurrent neural networks to process sequences of inputs to recognize patterns (e.g., such as sequences of ingredients and correlations with certain types of activity level). Recurrent neural networks may be similar to Markov chains. For example, Markov chains may refer to stochastic models describing sequences of possible events in which the probability of any given event depends only on the state information contained in the previous event. Thus, Markov chains also use an internal memory to store at least the state of the previous event. These models may be useful in determining causal inference, such as whether an event at a current node changes as a result of the state of a previous node changing.
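The Markov property described above, in which the probability of the next event depends only on the state of the previous event, can be sketched as follows; the states and transition probabilities are invented for illustration:

```python
# Toy first-order Markov chain over ingredient states.
import random

TRANSITIONS = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.5, "C": 0.5},
    "C": {"C": 1.0},  # absorbing state
}

def step(state, rng):
    """Sample the next state using only the current state (the Markov property)."""
    r = rng.random()
    cumulative = 0.0
    for nxt, p in TRANSITIONS[state].items():
        cumulative += p
        if r < cumulative:
            return nxt
    return nxt  # guard against floating-point rounding

rng = random.Random(0)
chain = ["A"]
for _ in range(5):
    chain.append(step(chain[-1], rng))
```

Because each transition depends only on the previous state, changing the state of one node lets one examine whether the events that follow it change, which is the flavor of causal question mentioned above.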
- The set of candidate drug compounds generated may be input into another machine learning model 132 trained to classify one of the set of candidate drug compounds as a selected candidate drug compound. The classifier may be trained to rank the set of candidate drug compounds using any suitable ranking (e.g., non-parametric) technique. For example, in some embodiments, one or more clustering techniques may be used to cluster the set of candidate drug compounds. To classify the selected candidate drug compound, the machine learning model 132 may also perform objective optimization techniques while clustering. To classify the selected candidate drug compound having desired levels of certain types of activity, the objective optimization may include using a minimization or maximization function for each candidate drug compound in the clusters. - A cluster may refer to a group of data objects similar to one another within the same cluster, but dissimilar to the objects in the other clusters. Cluster analysis may be used to classify the data into relative groups (clusters). One example of clustering may include K-means clustering, where "K" defines the number of clusters. Performing K-means clustering may comprise specifying the number of clusters, specifying the cluster seeds, assigning each point to the nearest centroid, and adjusting the centroids.
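The K-means steps named above (specify the cluster count, seed the centroids, assign points, adjust centroids) can be sketched as follows; this is a minimal illustration on hypothetical one-dimensional activity scores, not the implementation of the machine learning model 132:

```python
# Minimal K-means sketch: seed centroids, assign each point to the nearest
# centroid, then recompute (adjust) the centroids, repeating a fixed number
# of iterations. Data values are hypothetical.
def kmeans(points, seeds, iterations=10):
    centroids = list(seeds)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                        # assignment step
            idx = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[idx].append(p)
        for i, members in enumerate(clusters):  # centroid-adjustment step
            if members:
                centroids[i] = sum(members) / len(members)
    return centroids, clusters

points = [0.1, 0.2, 0.15, 0.9, 1.0, 0.95]
centroids, clusters = kmeans(points, seeds=[0.0, 1.0])
```

With these toy points the two centroids settle at 0.15 and 0.95, splitting the data into its two obvious groups.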
- Additional clustering techniques may include hierarchical clustering and density-based spatial clustering. Hierarchical clustering may be used to identify the groups in the set of candidate drug compounds where there is no set number of clusters to be generated. As a result, a tree-based representation of the objects in the various groups may be generated. Density-based spatial clustering may be used to identify clusters of any shape in a dataset having noise and outliers. This form of clustering also does not require specifying the number of clusters to be generated.
-
FIG. 1B illustrates an architecture of the artificial intelligence engine according to certain embodiments of this disclosure. The architecture may include a biological context representation 200, a creator module 151, a descriptor module 152, a scientist module 153, a reinforcer module 154, and a conductor module 155. The architecture may provide a platform that improves its machine learning models over time by using benchmark analysis to produce enhanced candidate drug compounds for target design spaces. The platform may also continuously or continually learn new information from literature, clinical trials, studies, research, or any suitable data source about drug compounds. The newly learned information may be used to continuously or continually train the machine learning models to evolve with evolving information. - The
biological context representation 200 may be implemented in a general manner such that it can be applied to solve different types of problems across different markets. The underlying structure of the biological context representation 200 may include nodes and relationships between the nodes. There may be semantic information, activity information, structural information, chemical information, pathway information, and so forth represented in the biological context representation 200. The biological context representation 200 may include any number of layers of information (e.g., five layers of information). The first layer may pertain to molecular structure and physical property information, the second layer may pertain to molecule-to-molecule interactions, the third layer may pertain to molecule pathway interactions, the fourth layer may pertain to molecule cell profile associations, and the fifth layer may pertain to therapeutics (including those using biologics) and indications relevant for molecules. The biological context representation 200 is discussed further below with reference to FIGS. 2 and 5. - Further, to increase processing efficiency, the various encodings may be selected to preferentially represent certain types of data. For example, to effectively capture common backbone structures of molecules, Morgan fingerprints may be used to describe physical properties of the candidate drug compounds. The encodings are discussed further below with reference to
FIG. 1G. - Although just one creator module 151 is depicted, there may be any suitable number of creator modules 151. Each of the creator modules 151 may include one or more generative machine learning models trained to generate new candidate drug compounds. The new candidate drug compounds are then added to the biological context representation 200. To that end, the terms "creator module" and "generative model" may be used interchangeably herein. Each node in the biological context representation 200 may be a candidate drug compound (e.g., a peptide candidate). - The generative machine learning modules included in the
creator module 151 may be of different types and perform different functions. The different types and different functions may include a variational autoencoder, structured transformer, Mini Batch Discriminator, dilation, self-attention, upsampling, loss, and the like. Each of these generative machine learning model types and functions is briefly explained below. - Regarding the variational autoencoder, it may simultaneously train two machine learning models, an inference model qφ(z|x) and a generative model pθ(x|z)pθ(z) for data x and a latent variable z. In some embodiments, both the inference model and the generative model may be conditioned on a chosen attribute of the sequences. Both models may be jointly optimized using a tractable variational Bayesian approach which maximizes an evidence lower bound (ELBO).
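The evidence lower bound referred to above takes the standard variational form (a standard identity, stated here for reference; it is not specific to this disclosure):

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
  \;=\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
  \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z)\right)
```

Maximizing this bound jointly trains the inference model qφ(z|x) and the generative model pθ(x|z)pθ(z).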
- Regarding the structured transformer, it may perform autoregressive decomposition to decompose the joint probability distribution of the sequence given the structure, p(s|x), autoregressively as:
p(s|x) = Π_i p(s_i|x, s_<i) - The conditional probability p(s_i|x, s_<i) of amino acid s_i at position i is conditioned on both the input structure x and the preceding amino acids s_<i={s_1, . . . , s_i−1}. These conditionals may be parameterized in terms of two sub-networks: an encoder that computes embeddings from structure-based features and edge features, and a decoder that autoregressively predicts amino acid letter s_i given the preceding sequence and structural embeddings from the encoder.
- Mode collapse occurs in generative adversarial networks when the generator generates a limited diversity of samples, or even the same sample, regardless of the input. To overcome mode collapse, some embodiments implement a Mini Batch Discriminator (MBD) approach. MBDs each work as an extra layer in the network that computes the standard deviation across the batch of examples (the batch contains only real drug compounds or only candidate drug compounds). If the batch contains a small variety of examples, the standard deviation will be low, and the discriminator will be able to use this information to lower the score for each example in the batch. To further reduce mode collapse occurrence, some embodiments balance the sampling frequency of the training dataset clusters.
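The mini-batch-discrimination idea above can be sketched numerically (an illustrative reduction, not the patent's network layer): compute the per-position standard deviation across a batch and collapse it into one diversity feature, which is near zero for a collapsed batch:

```python
import statistics

# Sketch of mini-batch discrimination: a low batch-wide standard deviation
# signals low sample variety, which a discriminator can penalize.
def batch_diversity_feature(batch):
    """batch: list of equal-length feature vectors, one per example."""
    positions = zip(*batch)
    return sum(statistics.pstdev(col) for col in positions) / len(batch[0])

collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # near-identical samples
varied = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
```

A real MBD layer appends such a statistic as an extra feature map, but the signal it carries is the same: collapsed batches score zero, varied batches score higher.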
- Regarding dilation, convolution filters may be capable of detecting local features, but they have limitations when it comes to relationships separated by long distances. Accordingly, some embodiments implement convolution filters with dilation. By introducing gaps into convolution kernels, such techniques increase the receptive field without increasing the number of parameters. A dilation rate may be applied to one convolution filter in each residual block of a generator or a discriminator. In this way, by the last layer of the generative adversarial network, filters may have a large enough receptive field to learn relationships separated by long distances. Residual blocks are discussed further below with reference to
FIG. 1F. - Regarding self-attention, different areas of a protein have different associations and effects on overall protein behavior. Accordingly, the architecture of the generative adversarial network disclosed herein implements a self-attention mechanism. The self-attention mechanism may include a number of layers that highlight different areas of importance across the entire sequence and allow the discriminator to determine whether parts in distant portions of the protein are consistent with each other.
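The receptive-field growth from the dilation technique described above can be checked with standard convolution arithmetic (the kernel sizes and dilation rates below are assumed for illustration; stride 1 is assumed throughout):

```python
# Each stacked convolution with kernel size k and dilation d adds
# (k - 1) * d positions to the receptive field (stride 1 assumed).
def receptive_field(layers):
    rf = 1
    for kernel_size, dilation in layers:
        rf += (kernel_size - 1) * dilation
    return rf

plain = receptive_field([(3, 1)] * 4)              # four undilated 3-wide filters
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])  # doubling dilation
```

With the same parameter count, the undilated stack sees 9 positions while the dilated stack sees 31, which is why dilation helps capture long-distance relationships.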
- Regarding upsampling, some embodiments implement techniques best suited for protein generation. For example, nearest-neighbor interpolation, transposed convolution, and sub-pixel convolution may be used. Sub-pixel shuffle convolution may be used to increase resolution of a design space during candidate drug compound generation. Any combination of these techniques may be used in the upsampling layers. In some embodiments, transposed convolution by itself may be used for all upsampling layers.
- Regarding the loss function, it is a component that aids in the successful performance of a neural network. Various losses, such as non-saturating, non-saturating with R1 regularization, hinge, hinge with relativistic average, and Wasserstein and Wasserstein with gradient penalty losses, may be used. In some embodiments, due to performance increases, the non-saturating loss with R1 regularization may be used for the generative adversarial network.
- Details pertaining to the architecture of the
creator module 151 are described below with reference to FIGS. 1C-1I. - The
descriptor module 152 may include one or more machine learning models trained to generate descriptions for each of the candidate drug compounds generated by the creator module 151. The descriptor module 152 may be trained to use different encodings to represent the different types of information included in the candidate drug compound. The descriptor module 152 may populate the information in the candidate drug compound with ordinal values, cardinal values, categorical values, etc. depending on the type of information. For example, the descriptor module 152 may include a classifier that analyzes the candidate drug compound and determines whether it is a cancer peptide, an antimicrobial peptide, or a different peptide. The descriptor module 152 describes the structure and the physiochemical properties of the candidate drug compound. - The
reinforcer module 154 may include one or more machine learning models trained to analyze, based on the descriptions, the structure and the physiochemical properties of the candidate drug compounds in the biological context representation 200. Based on the analysis, the reinforcer module 154 may identify a set of experiments to perform on the candidate drug compounds to elicit certain desired data (e.g., activity effectiveness, biomedical features, etc.). The identification may be performed by matching a pattern of the structure and physiochemical properties of the candidate drug compounds with the structure and physiochemical properties of other drug compounds and determining which experiments were performed on the other drug compounds to elicit desired data. The experiments may include in vitro or in vivo experiments. Further, the reinforcer module 154 may identify experiments that should not be performed for the candidate drug compounds if a determination is made that those experiments yield useless data for drug compounds. - The
conductor module 155 may include one or more machine learning models trained to perform inference queries on the data stored in the biological context representation 200. The inference queries may pertain to performing queries to improve the quality of the data in the biological context representation 200. For example, there may be a gap in data in one of the nodes (e.g., candidate drug compounds) stored in the biological context representation 200. An inference query refers to the process of identifying a first node and a second node similar to the first node, and to obtaining data from the second node to fill a data gap in the first node. An inference query may be executed to search for another node having similarities to the node with the gap and may fill the gap with the data from the other node. - The
scientist module 153 may include one or more machine learning models trained to perform benchmark analysis to evaluate various parameters of the creator module 151. In some embodiments, the scientist module 153 may generate scores for the candidate drug compounds generated by the creator module 151. The benchmark analysis may be used to electronically and recursively optimize the creator module 151 to generate candidate drug compounds having improved scores in subsequent generation rounds. There may be several types of benchmarks (e.g., distribution learning benchmarks, goal-directed benchmarks, etc.) used by the scientist module 153 to evaluate generative machine learning models used by the creator module 151. As described herein, one or more parameters (e.g., validity, uniqueness, novelty, Frechet ChemNet Distance (FCD), internal diversity, Kullback-Leibler (KL) divergence, similarity, rediscovery, isomer capability, median compounds, etc.) of the creator module 151 may be scored during benchmark analysis. The benchmark analysis may also be used to electronically and recursively optimize the creator module 151 to improve scores of the parameters in subsequent generation rounds. Any combination of the benchmarks described below may be used to evaluate the creator module 151. - One type of benchmark used by the
scientist module 153 may include a distribution learning benchmark. The distribution learning benchmark evaluates, when given a set of molecules, how well the creator module 151 generates new molecules which follow the same chemical distribution. For example, when provided with therapeutic peptides, the distribution learning benchmark evaluates how well the creator module 151 generates other therapeutic peptides having similar chemical distributions. - The distribution learning benchmark may include generating a score for an ability of the
creator module 151 to generate valid candidate drug compounds, a score for an ability of thecreator module 151 to generate unique candidate drug compounds, a score for an ability of thecreator module 151 to generate novel candidate drug compounds, a Frechet ChemNet Distance (FCD) score for thecreator module 151, an internal diversity score for thecreator module 151, a KL divergence score for thecreator module 151, and so forth. Each of the distribution learning benchmarks is now discussed. - The validity score may be determined as a ratio of valid candidate drug compounds to non-valid candidate drug compounds of generated candidate drug compounds. In some embodiments, the ratio may be determined from a certain number (e.g., 10,000) of candidate drug compounds. In some embodiments, candidate drug compounds may be considered valid if their representation (e.g., simplified molecular-input line-entry system (SMILES)) can be successfully parsed using any suitable parser.
- The uniqueness score may be determined by sampling candidate drug compounds generated by the
creator module 151 until a certain number (e.g., 10,000) of valid molecules are identified by identical representations (e.g., canonical SMILES strings). The uniqueness score may be determined as the number of different representations divided by the certain number (e.g., 10,000). - The novelty score may be determined by generating candidate drug compounds until a certain number (e.g., 10,000) of different representations (e.g., canonical SMILES strings) are obtained and computing the ratio of candidate drug compounds (including real drug compounds) not present in the training dataset.
- The Frechet ChemNet Distance (FCD) score may be determined by selecting a random subset of a certain number (e.g., 10,000) of drug compounds from the training dataset, and generating candidate drug compounds using the
creator module 151 until a certain number (10,000) of valid candidate drug compounds are obtained. The FCD between the subset of the drug compounds and the candidate drug compounds may be determined. The FCD may consider chemically and biologically relevant information about drug compounds, and also measure the diversity of the set via the distribution of generated candidate drug compounds. The FCD may detect if generated candidate drug compounds are diverse, and the FCD may detect if generated candidate drug compounds have similar chemical and biological properties as real drug compounds. The FCD score (“S”) is determined using the following relationship: S=exp(−0.2*FCD). - The internal diversity score may assess the chemical diversity within a set of generated candidate drug compounds (“GROUP”). The internal diversity score may be determined using the following relationship:
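The validity, uniqueness, novelty, and FCD scores described above can be sketched as follows. This is a hedged illustration: `is_valid` stands in for a real parser (such as RDKit's SMILES parser), the sample strings are toy values, and the FCD value itself would come from a separate ChemNet-based computation.

```python
import math

# Sketches of the distribution-learning scores, assuming inputs are
# already-canonicalized representation strings.
def validity_score(samples, is_valid):
    return sum(1 for s in samples if is_valid(s)) / len(samples)

def uniqueness_score(valid_samples):
    return len(set(valid_samples)) / len(valid_samples)

def novelty_score(generated, training_set):
    distinct = set(generated)
    return sum(1 for s in distinct if s not in training_set) / len(distinct)

def fcd_score(fcd):
    return math.exp(-0.2 * fcd)  # S = exp(-0.2 * FCD), as stated above

generated = ["CCO", "CCO", "CCN", "bad("]       # toy samples; "bad(" unparsable
valid = [s for s in generated if "(" not in s]  # stand-in validity check
```

For these toy samples: 3 of 4 parse (validity 0.75), 2 of 3 valid strings are distinct (uniqueness 2/3), and one of the two distinct strings is absent from the training set (novelty 0.5).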
-
- In the equation in [0067], T(m1, m2) is the Tanimoto Similarity (SNN) between
molecule 1, m1, andmolecule 2, m2. Variable G is the set of candidate drug compounds and variable p is the set number of groups being tested. While SNN measures the dissimilarity to external diversity, the internal diversity score may consider dissimilarity between generated candidate drug compounds. The internal diversity score may be used to detect mode collapse in certain generative models. For example, mode collapse may occur when the generative model produces a limited variety of candidate drug compounds while ignoring some areas of a design space. A higher score for the internal diversity corresponds to higher diversity in the set of candidate drug compounds generated. - The KL divergence score may be determined by calculating physiochemical descriptors for both the candidate drug compounds and the real drug compounds. Further, a determination may be made of the distribution of maximum nearest neighbor similarities on fingerprints (e.g., extended connectivity fingerprint of up to four bonds (ECFP4)) for both the candidate drug compounds and the real drug compounds. The distribution of these descriptors may be determined via kernel density estimation for continuous descriptors, or as a histogram for discrete descriptors. The KL divergence DKL,i may be determined for each descriptor i, and is aggregated to determine the KL divergence score S via:
-
- Where k is the number of descriptors (e.g., k=9).
- The isomer capability score may be determined by whether molecules may be generated that correspond to a target molecular formula (for example C7H8N2O2). The isomers for a given molecular formula can in principle be enumerated, but except for small molecules this number will in general be very large. The isomer capability score represents fully-determined tasks that assess the flexibility of the creator module to generate molecules following a simple pattern (which is a priori unknown).
- A second type of benchmark may include a goal-directed benchmark. The goal-direct benchmark may evaluate whether the
creator module 151 generates a best possible candidate drug compound to satisfy a pre-defined goal (e.g., activity level in a design space). A resulting benchmark score may be calculated as a weighted average of the candidate drug compound scores. In some embodiments, the candidate drug compounds with the best benchmark scores may be assigned a larger weight. As such, generative models of the creator module 151 may be tuned to deliver a few candidate drug compounds with top scores, while also generating candidate drug compounds with satisfactory scores. For each of the goal-directed benchmarks, one or several average scores may be determined for the given number of top candidate drug compounds and then the resulting benchmark score may be calculated as the mean of these average scores. For example, the resulting benchmark score may be a combination of the top-1, top-10, and top-100 scores, in which the resulting benchmark score is determined by the following relationship:
S = (1/3)(s_1 + (1/10) Σ_{i=1..10} s_i + (1/100) Σ_{i=1..100} s_i)
- The goal-directed benchmark may include generating a score for an ability of the
creator module 151 to generate candidate drug compounds similar to a real drug compound, a score for an ability of thecreator module 151 to rediscover the potential viability of previously-known drug compounds (e.g., using a drug which is prescribed for certain conditions for a new condition or disease), and the like. - The similarity score may be determined using nearest neighbor scoring, fragment similarity scoring, scaffold similarity scoring, SMARTS scoring, and the like. Nearest neighbor scoring (e.g., nns(G, R)) may refer to a scoring function that determines the similarity of the candidate drug compound to a target real drug compound g. The score corresponds to the Tanimoto similarity when considering the fingerprint r and may be determined by the following relationship:
-
- Where mR and mG are representations of the real drug compounds (R) and the candidate drug compounds (G) as bit strings (e.g., digital fingerprints, e.g., outputs of hash functions, etc.). The resulting score reflects how similar candidate drug compounds are to real drug compounds in terms of chemical structures encoded in these fingerprints. In some embodiments, Morgan fingerprints may be used with a radius of a configurable value (e.g., 2) and an encoding with a configurable number of bits (e.g., 1024). The radius and encoding bits may be configured to produce desirable results in a biochemical space.
- The similarity score may be determined using fragment similarity scoring, which itself may be defined as the cosine distance between vectors of fragment frequencies. For a set of candidate drug compounds (G), its fragment frequency vectorfG has a size equal to the size of all chemical fragments in the dataset, and elements of fG represent frequencies with which the corresponding fragments appear in G. The distance is determined by the following relationship:
-
Frag(G, R)=1−cos(fG, fR)
- The similarity score may be determined using scaffold similarity scoring, which may be determined in a similar way to the fragment similarity scoring. For example, the scaffold similarity scoring may be determined as a cosine similarity between the vectors SG and SR that represent frequencies of scaffolds in a set of candidate drug compounds (G) and a set of real drug compound (R). The scaffold similarity scoring score may be determined by the following relationship:
-
Frag(G,R)=1−cos(sGsR). - The similarity score may be determined using SMARTS scoring. SMARTS scoring may be implemented according to the relationship: SMART (a, b). The SMARTS scoring may evaluate whether the SMARTS pattern s is present in a candidate drug compound. $b$ is a Boolean value indicating whether the SMARTS pattern should be present (true) or absent (false). When the pattern is desired, a score of 1, for true, is returned if the SMARTS pattern is found. If the pattern is not found, then a score of 0, for false, is returned.
- In some embodiments, a goal-directed benchmark may include determining a rediscovery score for the
creator module 151. In some embodiments, certain real drug compounds may be removed from the training dataset and the creator module 151 may be retrained using the modified training set lacking the removed real drug compounds. If the creator module 151 is able to generate ("rediscover") a candidate drug compound that is identical or substantially similar to the removed real drug compounds, then a high rediscovery score may be assigned. Such a technique may be used to validate that the creator module 151 is effectively trained or tuned. - Various modifiers may be used to modify the scores for the various benchmarks discussed above. For example, a Gaussian modifier may be implemented to target a specific value of some property, while giving high scores when the underlying value is close to the target. It may be adjustable as desired. A minimum Gaussian modifier may correspond to the right half of a Gaussian function, and values smaller than a threshold may be given a full score, while values larger than the threshold decrease continuously to zero. A maximum Gaussian modifier may correspond to the left half of the Gaussian function, and values larger than the threshold are given a full score, while values smaller than the threshold decrease continuously to zero. A threshold modifier may attribute a full score to values above a given threshold, while values smaller than the threshold decrease linearly to zero.
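The score modifiers just described can be sketched as simple functions (the exact parameterizations, including the sigma values, are assumptions for illustration):

```python
import math

# Sketches of the Gaussian, min/max Gaussian, and threshold modifiers.
def gaussian_modifier(x, target, sigma):
    return math.exp(-((x - target) ** 2) / (2 * sigma ** 2))

def min_gaussian_modifier(x, threshold, sigma):
    # full score below the threshold, Gaussian decay above it
    return 1.0 if x <= threshold else gaussian_modifier(x, threshold, sigma)

def max_gaussian_modifier(x, threshold, sigma):
    # full score above the threshold, Gaussian decay below it
    return 1.0 if x >= threshold else gaussian_modifier(x, threshold, sigma)

def threshold_modifier(x, threshold):
    # full score above the threshold, linear decay toward zero below it
    return 1.0 if x >= threshold else max(x, 0.0) / threshold
```

Each modifier maps a raw property value into [0, 1], so benchmark scores remain comparable regardless of the underlying property's units.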
- There are a variety of competing generative models that may be used to evaluate the performance of the
creator module 151. For example, the competing generative models may include a random sampling, best of dataset method, SMILES genetic algorithm (GA), graph GA, graph Monte-Carlo tree search (MCTS), SMILES long short-term memory (LSTM), character-level recurrent neural networks (CharRNN), variational autoencoder, adversarial autoencoder, Latent generative adversarial network (LatentGAN), junction tree variational autoencoder (JT-VAE), and objective-reinforced generative adversarial network (ORGAN). Each of these competing generative models will now be discussed briefly. - Regarding random sampling, this baseline samples at random the requested number of molecules (candidate drug compounds) for the dataset. Random sampling may provide a lower bound for the goal-directed benchmarks, because no optimization is performed to obtain the returned molecules. Random sampling may provide an upper bound for the distribution learning benchmarks, because the molecules returned may be taken directly for the original distribution.
- Regarding best of dataset method (or “best of dataset” herein), one goal of de novo molecular design is to explore unknown parts of the biochemical space, generating new candidate drug compounds with better properties than the drug compounds already known. The best of dataset scores the entire generated dataset including the candidate drug compounds with a provided scoring function and returns the highest scoring molecules. This effectively provides a lower bound for the goal-directed benchmarks that enables the
creator module 151 to create better candidate drug compounds than the real or candidate drug compounds provided. - Regarding SMILES GA, this technique may evolve string molecular representations using mutations exploiting the SMILES context-free grammar. For each goal-directed benchmark, a certain number (e.g., 300) of highest scoring molecules in the dataset may be selected as an initial population. In this example, each molecule is represented by 300 genes. During each epoch an offspring of a certain number (e.g., 600) of new molecules may be generated by randomly mutating the population molecules. After deduplication and scoring, these new molecules may be merged with the current population and a new generation is chosen by selecting the top scoring molecules overall. This process may be repeated a certain number of times (e.g., 1000) or until progress has stopped for a certain number (e.g., 5) of consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.
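The mutate-score-select loop of a genetic algorithm like the SMILES GA above can be sketched on plain strings (a toy: the alphabet, scoring function, and population sizes are assumptions, and single-character mutation stands in for grammar-aware SMILES mutation):

```python
import random

# Toy GA loop: mutate a population of strings, deduplicate, score, and keep
# the top scorers as the next generation (elitist selection).
def evolve(population, score, alphabet, epochs=30, offspring=20, seed=0):
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(epochs):
        children = []
        for _ in range(offspring):
            parent = rng.choice(pop)
            i = rng.randrange(len(parent))
            children.append(parent[:i] + rng.choice(alphabet) + parent[i + 1:])
        merged = list(dict.fromkeys(pop + children))  # deduplicate, keep order
        merged.sort(key=score, reverse=True)
        pop = merged[:len(population)]                # next generation
    return pop

target = "CCNCC"  # hypothetical goal string
score = lambda s: sum(a == b for a, b in zip(s, target))
best = evolve(["AAAAA", "ACACA", "NNNNN", "CANCA"], score, alphabet="ACN")[0]
```

Because the previous generation is merged with its offspring before selection, the best score never decreases across epochs, mirroring how the GA baselines retain top-scoring molecules between generations.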
- Regarding graph GA, this GA involves molecule evolution at the graph level. For each goal-directed benchmark a certain number (e.g., 100) of highest scoring molecules in the dataset are selected as the initial population. During each epoch, a mating pool of a certain number (e.g., 200) of molecules is sampled with replacement from the population, using scores as weights. This pool may contain many repeated molecules if their score is high. A new population of a certain number (e.g., 100) is then generated by iteratively choosing two molecules at random from the mating pool and applying a crossover operation. With probability of, e.g., 0.5 (i.e., 100/200), a mutation is also applied to the offspring molecule. This process is repeated a certain number (e.g., 1000) of times or until progress has stopped for a certain number (e.g., 5) of consecutive epochs. Distribution-learning benchmarks do not apply to this baseline.
- Regarding graph MCTS, the statistics used during sampling may be computed on the training dataset. For this baseline, no initial population is selected for the goal-directed benchmarks. Each new molecule may be generated by running a certain number (e.g., 40) of simulations, starting from a base molecule. At each step, a certain number (e.g., 25) of children are considered and the sampling stops when reaching a certain number (e.g., 60) of atoms. The best-scoring molecule found during the sampling may be returned. A population of a certain number (e.g., 100) of molecules is generated at each epoch. This process may be repeated a certain number (e.g., 1000) of times or until progress has stopped for a certain number (e.g., 5) of consecutive epochs. For the distribution learning benchmark, the generation starts from a base molecule and a new molecule is generated with the same parameters. As for the goal-directed benchmarks, the only difference is that no scoring function is provided, so the first molecule to reach a terminal state is returned instead of the highest scoring molecule.
- Regarding SMILES LSTM, the technique is a baseline model, consisting of an LSTM neural network which predicts the next character of partial SMILES strings. In some embodiments, a SMILES LSTM may be used with 3 layers of hidden size of 1024. For the goal-directed benchmarks, a certain number (e.g., 20) of iterations of hill-climbing may be performed; at each step the model generated a certain number (e.g., 8192) of molecules and a certain number (e.g., 1024) of the top scoring molecules may be used to fine-tune the model parameters. For the distribution-learning benchmark, the model may generate the requested number of molecules.
- Regarding character-level recurrent neural networks (CharRNN), the technique treats the task of generating SMILES as a language model attempting to learn the statistical structure of SMILES syntax by training it on a large corpus of SMILES. The CharRNN parameters may be optimized using maximum likelihood estimation (MLE). In some embodiments, CharRNN may be implemented using LSTM RNN cells stacked into a certain number of layers (e.g., 3 layers) with a certain number of hidden dimensions (e.g., 600 hidden dimensions). In some embodiments, to prevent overfitting, a dropout layer may be added between intermediate layers with a certain dropout probability (e.g., p=0.2). Training may be performed with a batch size of a certain number (e.g., 64) using an optimizer.
- Regarding a variational autoencoder (VAE), it is a framework for training two neural networks, an encoder and a decoder, to learn a mapping from a higher-dimensional data representation (e.g., vector) into a lower-dimensional data representation and from the lower-dimensional data representation back to the higher-dimensional data representation. The lower-dimensional space is called the latent space, which is often a continuous vector space with a normally distributed latent representation. The latent representation of the data may contain all the important information needed to represent an original data point. The latent representation represents the features of the original data point. In other words, one or more machine learning models may learn the data features of the original data point and simplify its representation to make it more efficient to analyze. VAE parameters may be optimized to encode and decode data by minimizing the reconstruction loss while also minimizing a KL-divergence term arising from the variational approximation, such that the KL-divergence term may loosely be interpreted as a regularization term. Since molecules are discrete objects, a properly trained VAE defines an invertible continuous representation of a molecule.
- In some embodiments, aspects from both implementations may be combined. The encoder may implement a bidirectional Gated Recurrent Unit (GRU) with a linear output layer. The decoder may be a 3-layer GRU RNN of 512 hidden dimensions with intermediate dropout layers, the layers having a dropout probability of 0.2. Training may be performed with a batch size of a certain number (e.g., 128), utilizing a gradient clipping of 50 and a KL-term weight of 1, and further optimized with a learning rate of 0.0003 across 50 epochs. Other training parameters may be used to perform the embodiments disclosed herein.
- Regarding adversarial autoencoders (AAE), they combine the idea of a VAE with that of adversarial training as found in a GAN. In an AAE, the KL-divergence term is avoided by training a discriminator network to predict whether a given sample came from the latent space of the autoencoder (AE) or from a prior distribution. Parameters may be optimized to minimize the reconstruction loss and to minimize the discriminator loss. The AAE model may consist of an encoder with a 1-layer bidirectional LSTM with 380 hidden dimensions, a decoder with a 2-layer LSTM with 640 hidden dimensions, and a shared embedding of size 32. The latent space is of 640 dimensions, and the discriminator network is a 2-layer fully connected neural network with 640 and 256 nodes, respectively, utilizing the ELU activation function. Training may be performed with a batch size of 128, with an optimizer using a learning rate of 0.001 across 25 epochs. Other training parameters may be used to perform the embodiments disclosed herein.
- Regarding LatentGAN, the technique encodes SMILES strings into latent vector representations of size 512. A Wasserstein generative adversarial network with gradient penalty may be trained to generate latent vectors resembling those of the training set, which are then decoded using a heteroencoder.
- Regarding a junction tree variational autoencoder (JT-VAE), the model generates molecular graphs in two phases. The model first generates a tree-structured scaffold over chemical substructures, and then combines them into a molecule with a graph message passing network. This approach enables incrementally expanding molecules while maintaining chemical validity at every step.
- Regarding an objective-reinforced generative adversarial network (ORGAN), the model is a sequence-generation model based on adversarial training that aims at generating discrete sequences that emulate a data distribution while using reinforcement learning to bias the generation process towards some desired objective rewards. ORGAN incorporates at least two networks: a generator network and a discriminator network. The goal of the generator network is to create candidate drug compounds indistinguishable from the empirical data distribution of real drug compounds. The discriminator network learns to distinguish candidate drug compounds from real data samples. Both models are trained in alternation.
- To properly train a GAN, the gradient must be back-propagated between the generator and discriminator networks. Reinforcement uses an N-depth Monte Carlo tree search, and the reward is a weighted sum of probabilities from the discriminator and objective reward. Both the generator and discriminator may be pre-trained for 250 and 50 epochs, respectively, and then jointly trained for 100 epochs utilizing an optimizer with a learning rate of 0.0001. The learning rate may refer to a hyperparameter of a neural network, and the learning rate may be a number that determines an amount of change (e.g., weights, hidden layers, etc.) to make to a machine learning model in response to an estimated error. Bayesian optimization may be used to determine the optimal learning rate during training of a particular neural network. In some embodiments, validity and uniqueness of candidate drug compounds may be used as rewards.
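The weighted-sum reward described above can be sketched as follows; the weighting hyperparameter `lam` and the function name are illustrative assumptions:

```python
def organ_reward(d_prob, objective_reward, lam=0.5):
    """Weighted sum of the discriminator's probability that a generated
    sequence is real and an objective reward (e.g., validity or uniqueness
    of the candidate drug compound). `lam` balances realism vs. objective."""
    return lam * d_prob + (1.0 - lam) * objective_reward

# e.g., a valid candidate (objective reward 1.0) that the discriminator
# finds only moderately realistic (probability 0.5):
print(organ_reward(d_prob=0.5, objective_reward=1.0))  # 0.75
```

In the full ORGAN setup this reward is estimated per partial sequence via the Monte Carlo rollouts mentioned above and used to update the generator by policy gradient.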
- The scientist module 153 may also include one or more machine learning models trained to perform causal inference using counterfactuals. The causal inference, as described herein, may be used to determine whether the creator module 151 actually generated a candidate drug compound including a desired activity, or whether that determination resulted from noisy data (e.g., scarce or incorrect data). -
FIG. 1C illustrates first components of an architecture of the creator module 151 according to certain embodiments of this disclosure. A candidate design space 156 and data 157 may be included in the biological context representation 200; the space 156 and the data 157 include the various sequences of the candidate drug compounds or real drug compounds. In some embodiments, the creator module 151 may populate the candidate design space 156. The candidate design space 156 may include a vast amount of information retrieved from numerous sources or generated by the AI engine 140. The candidate design space 156 may include information pertaining to antimicrobial peptides, anticancer peptides, peptidomimetics, uProteins and aCRFs, non-ribosomal peptides, and general peptides that are retrieved via genomic screening, literature research, or computationally designed using the AI engine 140. The candidate design space 156 may be updated each time the creator module 151 generates a new candidate drug compound. The candidate design space 156 may also be updated continuously or continually as new literature is published or genomic screenings are performed. - The
creator module 151 may also use data 157 to generate the candidate drug compounds. In some embodiments, the data 157 may be generated or provided by the descriptor module 152. In some embodiments, the data may be received from any suitable source. The data may include molecular information pertaining to chemistry/biochemistry, targets, networks, cells, clinical trials, and markets (e.g., analysis, results, etc.) that results from performing simulations or experiments. - The
creator module 151 may encode the candidate design space 156 and the data 157 into various encodings. In some embodiments, an attention message-passing neural network may be used to encode molecular graphs. An initial set of states may be constructed, one for each node in a molecular graph. Then, each node may be allowed to exchange information, to "message" with its neighboring nodes. Each message may be a vector describing an atom of a molecule from the atom's perspective in the molecule. After one such step, each node state will contain an awareness of its immediate neighborhood. Repeating the step makes each node aware of its second-order neighborhood, and so forth. During the message-passing stage and based on the total number of occurrences of a message, an attention layer may be used to identify interesting features of a molecule. A certain weight (e.g., heavy, light) may be assigned to a message that occurs more or fewer than a threshold number of times, thereby causing that message to stand out more when the messages are aggregated. For example, a message that occurs a very small number of times (e.g., less than a threshold) may be more likely to include a desirable feature as opposed to a message that occurs a large number of times. In another example, a message that occurs more than a threshold number of times may be weighted more heavily than a message that occurs fewer than the threshold number of times. Any suitable weighting may be configured to cause a message to stand out more. - Using a summation function to reduce the size of the messages and increase computational efficiency, the attention mechanism may aggregate the messages with their weights. In such a way, the techniques may be able to scale to remain computationally efficient as the number of messages increases.
Such a technique may be beneficial because it reduces resource (e.g., processing, memory) consumption when performing computations with a large design space, including information in that design space pertaining to structure, semantic, sequence, physiochemical properties, etc.
- After a chosen number of “messaging rounds”, all the context-aware node states are collected and converted to a summary representing the whole graph. All the transformations in the steps above may be carried out with machine learning models (e.g., neural networks), yielding a machine learning model that can be trained with known techniques to optimize the summary representation for the current task. The following relationships may be used by the attention message-passing neural network:
- m_v^(t+1) = Σ_{w∈N(v)} A_t(h_v^(t), h_w^(t)) M_t(h_v^(t), h_w^(t)); h_v^(t+1) = U_t(h_v^(t), m_v^(t+1)); ŷ = R({h_v^(T) | v ∈ G})
- M_t is the message function, A_t is the attention function, U_t is the node update function, N(v) is the set of neighbors of node v in graph G, h_v^(t) is the hidden state of node v at time t, and m_v^(t) is a corresponding message vector. For each atom v, messages will be passed from its neighbors and aggregated as the message vector m_v^(t) from its surrounding environment. Then the hidden state h_v^(t) is updated by the message vector.
- ŷ is a resulting fixed-length feature vector generated for the graph, and R is a readout function invariant to node ordering, a feature allowing the MPNN framework to be invariant to graph isomorphism. The graph feature vector ŷ is then passed to a fully connected layer to give a prediction. All functions M_t, U_t, and R are neural networks, and their weights are learned during training.
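The messaging rounds above can be sketched with scalar node states and uniform attention weights; this is a deliberate simplification, since in the actual framework M_t, A_t, U_t, and R are learned neural networks:

```python
# One attention-weighted message-passing round on a toy 3-atom chain graph.
graph = {0: [1], 1: [0, 2], 2: [1]}   # adjacency: N(v) for each node v
h = {0: 1.0, 1: 0.0, 2: 3.0}          # initial (scalar) node states h_v

def message_round(graph, h):
    new_h = {}
    for v, neighbors in graph.items():
        a = 1.0 / len(neighbors)                 # uniform attention, sums to 1
        m = sum(a * h[w] for w in neighbors)     # aggregated message m_v
        new_h[v] = 0.5 * h[v] + 0.5 * m          # simple update U_t: blend state and message
    return new_h

h1 = message_round(graph, h)
print(h1[1])                 # node 1 is now "aware" of both neighbors: 1.0
y = sum(h1.values())         # readout R: a sum is invariant to node ordering
```

After one round each node reflects its immediate neighborhood; repeating the round propagates second-order neighborhood information, as described above.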
- As depicted, a "Candidates Only Data" encoding 158 may encode just the information from the candidate design space, a "Candidates and Simulated Data" encoding 159 may encode information from the candidate design space 156 and the simulated data from the data 157, and a "Candidates with All Data" encoding 160 may encode information from the candidate design space 156 and both the simulated and experimental data from the data 157. Further, a "Heterologous Networks" encoding 161 may be generated using the "Candidates with All Data" encoding 160. - Each of the encodings 158, 159, 160, and 161 may be input into a respective machine learning model trained to output a corresponding embedding. - "Candidates Only Data"
encoding 158 may be input into ML Model A, which outputs a "Candidate Embedding" 162. "Candidates and Simulated Data" encoding 159 may be input into ML Model B, which outputs a "Candidate and Simulated Data Embedding" 163. "Candidates with All Data" encoding 160 may be input into ML Model C, which outputs "Candidate with All Data Embedding" 164. "Heterologous Networks" encoding 161 may be input into ML Model D, which outputs "Graph and Network Embedding" 165. -
FIG. 1D illustrates second components of the architecture of the creator module 151 according to certain embodiments of this disclosure. As depicted, the encodings may be the encodings described with reference to FIG. 1C. - The
embeddings 162, 163, 164, and 165 may be combined in layer 167. The ML Model E is trained to output a "Latent Representation" based on the embeddings. - The "Latent Representation" 168 may include an "Activity Landscape" 169 and a "Continuous Representation" 170. The "Continuous Representation" 170 may include information (e.g., structural, semantic, etc.) pertaining to all of the molecules (e.g., real drug compounds and candidate drug compounds), and the "Activity Landscape" 169 may include activity information for all of the molecules. In some embodiments, the ML Model E may be a variational autoencoder that receives the
embeddings, as described with reference to FIG. 1E. - The "Latent Representation" 168 is input into the ML Model H. ML Model H may be any suitable type of machine learning model described herein. ML Model H may be trained to analyze the "Latent Representation" 168 and generate a candidate drug compound. The "Latent Representation" 168 may include multiple dimensions (e.g., tens, hundreds, thousands) and may have a particular shape. The shape may be rectangular, cubic, cuboid, spherical, amorphous, conical, or any suitable shape having any number of dimensions. The ML Model H may be a generative adversarial network, as described herein. The ML Model H may determine a shape of the "Latent Representation" 168 and may determine an area of the shape from which to obtain a slice based on "interesting" aspects of that area. An interesting aspect may be a peak, a valley, a flat portion, or any combination thereof. The ML Model H may use an attention mechanism to determine what is "interesting" and what is not. The interesting aspect may be indicative of a desirable feature, such as a desirable activity for a particular disease or medical condition. The slice may include a combination of a portion of any of the information included in the "Latent Representation" 168, such as the structural information, physiochemical properties, semantic information, and so forth. The information included in the slice may be represented as an eigenvector that includes any number of dimensions from the "Latent Representation" 168. The terms "slice" and "candidate drug compound" may be used interchangeably. The slice may be visually presented on a display screen, as shown in
FIG. 8A. - A decoder may be used to transform the slice from the lower-dimensional vector to a higher-dimensional vector, which may be analyzed to determine what information is included in that slice. For example, the decoder may obtain a set of coordinates from the higher-dimensional vector, which may be back-calculated to determine what information (e.g., structural, physiochemical, semantic, etc.) they represent.
- Each of the candidate drug compounds generated by the ML Model F, ML Model G, ML Model H, and ML Model I may be ranked and one of the candidate drug compounds may be classified as a selected candidate drug compound, as described herein. Further, the candidate drug compounds may be input into one or more machine learning models trained to perform benchmark analysis, as described herein. Based on the benchmark analysis, any of the machine learning models in the
creator module 151 may be optimized (e.g., tuning weights, adding or removing hidden layers, changing an activation function, etc.) to modify a parameter (e.g., uniqueness, validity, novelty, etc.) score for the machine learning models when generating subsequent candidate drug compounds. -
FIG. 1E illustrates an architecture of a variational autoencoder machine learning model according to certain embodiments of this disclosure. In some embodiments, the variational autoencoder may include an input layer, an encoder layer, a latent layer, a decoder layer, and an output layer. The input layer may receive fingerprints of drug compounds or candidate drug compounds represented as higher-dimensional vectors, as well as associated drug concentration(s). The encoder layer may include one or more hidden layers, activation functions, and the like. The encoder layer may receive the fingerprint and drug concentration from the input layer and may perform operations to translate the higher-dimensional vectors into lower-dimensional vectors, as described herein. The latent layer may receive the lower-dimensional vectors and represent them in the "Latent Representation" 168. The latent layer may input the "Latent Representation" 168 into the ML Model H, which is a generative adversarial network including a generator and a discriminator, as described herein. The architecture of the generator and the discriminator is discussed further below with reference to FIG. 1F. The generator generates candidate drug compounds, and the discriminator analyzes the candidate drug compounds to determine whether they are valid or not. The GI in FIG. 1F may refer to a general inference layer, and the GI layer may generate the candidate drug compounds. - The candidate drug compounds output by the latent layer may be input into the decoder layer where the lower-dimensional vectors are translated back into the higher-dimensional vectors. The decoder layer may include one or more hidden layers, activation functions, and the like. The decoder layer may output the fingerprints and the drug concentration. The output fingerprint and drug concentration may be analyzed to determine how closely they match the input fingerprint and drug concentration.
If the output and input substantially match, the variational autoencoder may be properly trained. If the output and the input do not substantially match, one or more layers of the variational autoencoder may be tuned (e.g., modify weights, add or remove hidden layers).
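The match check can be sketched as a reconstruction-error threshold; the mean-squared-error metric and the tolerance value below are assumed placeholders:

```python
def reconstruction_error(original, reconstructed):
    """Mean squared error between the input and output fingerprint vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)

def needs_tuning(original, reconstructed, tolerance=0.01):
    """If the error exceeds the (assumed) tolerance, the input and output do
    not substantially match, and the autoencoder's layers should be tuned."""
    return reconstruction_error(original, reconstructed) > tolerance

print(needs_tuning([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))  # False: perfect reconstruction
```

In practice the tolerance and error metric would be chosen to suit the fingerprint representation and drug-concentration scale.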
-
FIG. 1F illustrates an architecture of a generative adversarial network used to generate candidate drugs according to certain embodiments of this disclosure. As depicted, there is an architecture for the discriminator, discriminator residual block, generator, and generator residual block. - The discriminator architecture may receive a sequence (e.g., candidate drug compound) as an input. The discriminator architecture may include an arrangement of blocks in a particular order that improves computational efficiency when processing the sequence to determine whether the sequence is valid or not. For example, the particular order of blocks includes a first residual block, a self-attention block, a second residual block, a third residual block, a fourth residual block, a fifth residual block, and a sixth residual block. The discriminator may output a score (e.g., 0 or 1) for whether the received sequence is valid or not.
- The discriminator residual block architecture may receive an input filtered into two processing pathways. A first processing pathway performs a convolution operation on the input. The second processing pathway performs several operations, including a convolution operation, a batch normalization operation, a leaky rectified linear unit (leaky ReLU) operation, a convolution operation, and another batch normalization operation. The leaky ReLU operation may perform a threshold operation, where any input value less than zero is multiplied by a fixed scalar, for example. The outputs from the first and second processing pathways are summed and then output.
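The two-pathway residual block can be sketched with scalar values and simple callables standing in for the learned convolutions (batch normalization omitted for brevity; all functions here are illustrative stand-ins):

```python
def leaky_relu(x, slope=0.2):
    # Values below zero are multiplied by a fixed scalar, as described above.
    return x if x >= 0 else slope * x

def residual_block(x, conv1, conv2, skip_conv):
    """Two pathways: a skip transform of the input, and a main pathway of
    conv -> leaky ReLU -> conv. The pathway outputs are summed."""
    skip = skip_conv(x)
    main = conv2(leaky_relu(conv1(x)))
    return skip + main

out = residual_block(2.0,
                     conv1=lambda v: v * 3,     # stand-in for first convolution
                     conv2=lambda v: v + 1,     # stand-in for second convolution
                     skip_conv=lambda v: v)     # stand-in for skip-path transform
print(out)  # skip = 2.0; main = leaky_relu(6.0) + 1 = 7.0; total = 9.0
```

The skip pathway lets gradients flow past the main pathway unchanged, which is the usual motivation for the summed two-pathway design.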
- The generator architecture may receive noise (e.g., the biological context representation 200) as an input. The generator architecture may include an arrangement of blocks in a particular order that improves computational efficiency when processing the noise to generate a sequence (e.g., a candidate drug compound). For example, the particular order of blocks includes a first residual block, a second residual block, a third residual block, a fourth residual block, a fifth residual block, a self-attention block, and a sixth residual block. The generator may output the generated sequence.
- The generator residual block architecture may receive an input filtered into two processing pathways. A first processing pathway performs a deconvolution operation on the input. The second processing pathway performs several operations, including a convolution operation, a batch normalization operation, a leaky ReLU operation, a deconvolution operation, and another batch normalization operation. The outputs from the first and second processing pathways are summed and then output.
-
FIG. 1G illustrates types of encodings to represent certain types of drug information according to certain embodiments of this disclosure. A table 180 includes three columns labeled "Encoding", "Compressed?", and "Information". The "Encoding" column includes rows storing a type of encoding used to represent a certain type of information; the "Compressed?" column includes rows storing an indication of whether the encoding in that row is compressed; and the "Information" column includes rows storing a type of information represented by the encoding in each respective row. The descriptor module 152 may include a machine learning module trained to analyze a candidate drug compound and identify various structural properties, physiochemical properties, and the like. The descriptor module 152 may be trained to represent the type of structural and physiochemical properties using an encoding that increases computational efficiency and to store a description including the encodings at a node representing the candidate drug compound. During processing, the encodings may be aggregated for each candidate drug compound. - For example, using an alphanumeric string, SMILES encoding spells out molecular structure from a beginning portion to an ending portion. Morgan Fingerprints may be useful for temporal molecular structures and the
descriptor module 152 may include a machine learning module trained to output a compressed vector. Morgan Fingerprints may include the isomer for a particular molecule, and common backbone structures for molecules. - As depicted, SMILES, Morgan Fingerprints, InChI, One-Hot, N-gram, Graph-based Graphic Processing Unit Nearest Neighbor Search (GGNN), Gene regulatory network (GRN), Message-Passing Neural Network (MPNN), and Knowledge Graph (Structural/Semantic) encodings represent structural information of molecules (drug compounds). The Morgan Fingerprints, GGNN, GRN, and MPNN are also compressed to improve computations, while the SMILES, InChI, One-Hot, N-gram, and the Knowledge Graph are not compressed.
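As an illustration of the One-Hot encoding listed in the table, a SMILES string can be expanded over a character vocabulary; the small vocabulary below is an assumption for the sketch, whereas a real vocabulary would cover the full SMILES syntax:

```python
# Illustrative one-hot encoding of a SMILES string over a toy vocabulary.
VOCAB = ["C", "N", "O", "(", ")", "=", "1"]
INDEX = {ch: i for i, ch in enumerate(VOCAB)}

def one_hot_smiles(smiles):
    """Each character becomes a row with a single 1 at its vocabulary index."""
    matrix = []
    for ch in smiles:
        row = [0] * len(VOCAB)
        row[INDEX[ch]] = 1
        matrix.append(row)
    return matrix

encoded = one_hot_smiles("CCO")  # ethanol
print(len(encoded), len(encoded[0]))  # 3 rows (characters) x 7 columns (vocabulary)
```

Unlike Morgan Fingerprints, this representation is not compressed: its size grows linearly with the string length and the vocabulary size, consistent with the "Compressed?" column of the table.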
- Quantitative structure-activity relationship (QSAR), Z-descriptors, and the Knowledge Graph encodings may represent physiochemical properties of molecules. These encodings may not be compressed. The QSAR encoding may include the type of activity the molecule provides (e.g., and without limitation to a particular physiological or anatomical organ or state, or to a particular disease process: antiviral, antimicrobial, antifungal, antiemetic, antineoplastic, anti-inflammatory, leukotriene inhibitory, neurotransmitter inhibitory, etc.). The encodings selected for each type of information may optimize the computations when considering such a large design space with information pertaining to structure, physiochemical properties, and semantic information. The large design space referred to may include not only a string of amino acid sequences, and physiochemical properties, but also the semantic information, such as system biology and ontological information, including relationships between nodes, molecular pathways, molecular interactions, molecular family, and the like.
-
FIG. 1H illustrates an example of concatenating (merging) numerous encodings into a candidate drug compound according to certain embodiments of this disclosure. A concatenated vector 191 may represent an embedding for a candidate drug compound. In some embodiments, an ensemble learning approach may be implemented by using different types of techniques to generate unique encodings and merge those unique encodings to improve generated candidate drug compounds. As depicted, various encoding techniques may be used to represent different types of information. The different types of information (e.g., structural, semantic, etc.) may be represented by unique encodings. For example, molecular graphs and Morgan Fingerprints may represent structural and physical molecular information. Activity data (e.g., QSAR) may represent molecular structural knowledge or molecular physiochemical knowledge, and a knowledge graph may represent molecular semantic knowledge. An attention message passing neural network (AMPNN) or long short-term memory (LSTM) may receive the molecular graph and Morgan Fingerprints as input and output the structural/physical information represented by 1s and 0s. One-hot may receive the activity data as input and output the structural knowledge represented by 1s and 0s. AMPNN may receive a knowledge graph as input and output semantic knowledge represented by 1s and 0s. The resulting concatenated vector 191 is a combination of each type of information for a single candidate drug compound. Accordingly, the single candidate drug compound may include better properties and more robust information than conventional techniques. -
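The merge itself can be sketched as a plain concatenation of the per-source binary vectors; all vector values and lengths below are placeholders:

```python
# Sketch of merging the different per-molecule encodings into a single
# embedding vector, as in the concatenated vector 191.
structural = [1, 0, 1, 1]        # e.g., AMPNN/LSTM output over the molecular graph
activity   = [0, 1, 0]           # e.g., one-hot output over the QSAR activity data
semantic   = [1, 1, 0, 0, 1]     # e.g., AMPNN output over the knowledge graph

concatenated_vector = structural + activity + semantic
print(len(concatenated_vector))  # 12: one vector carrying all three kinds of information
```

Because each sub-encoding occupies a fixed span of the result, downstream models can still attribute dimensions back to their source (structural, activity, or semantic).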
FIG. 1I illustrates an example of using a variational autoencoder (VAE) to generate a Latent Representation 168 of a candidate drug compound according to certain embodiments of this disclosure. The concatenated vector 191 (e.g., embedding) may be higher-dimensional prior to being input to the VAE. The VAE may be trained to translate the higher-dimensional concatenated vector 191 to a lower-dimensional concatenated vector that represents the Latent Representation 168. -
FIG. 2 illustrates a data structure storing a biological context representation 200 according to certain embodiments of this disclosure. Biology is context-dependent and dynamic. For example, the same molecule can manifest multiple, potentially competing, phenotypes. Further, data on an existing drug labeled as antimicrobial can suggest a null behavior in applications against different microbes or even against the same microbes but in different contexts, e.g., temperature, pressure, environmental, contextual, comorbid. To accurately predict candidate drug compounds that provide desirable activity levels in design spaces, the machine learning models 132 are trained to handle evolving knowledge maps of biology and drug compounds. Further, conventional techniques for discovering and generating drug compounds may be ineffective for biological data because such data is non-Euclidean. - In some embodiments, the
biological context representation 200 generated by the disclosed techniques may be used to graphically model the continually or continuously changing biological and drug compound knowledge. That is, the biology may be represented as graphs within a comprehensive knowledge graph (e.g., biological context representation 200), where the graphs have complex relationships and interdependencies between nodes. - The
biological context representation 200 may be stored in a first data structure having a first format. The first format may be a graph, an array, a linked list, or any suitable data format capable of storing the biological context representation. In particular, FIG. 2 illustrates various types of data received from various sources, including physical properties data 202, peptide activity data 204, microbe data 206, antimicrobial compound data 208, clinical outcome data 210, evidence-based guidelines 212, disease association data 214, pathway data 216, compound data 218, gene interaction data 220, anti-neurodegenerative compound data 222, or pro-neuroplasticity compound data 224. - These example data may be curated by the
AI engine 140 or a person having a certain degree (e.g., a degree in data science, molecular biology, microbiology, etc.), certification, license (e.g., a licensed medical doctor (e.g., M.D. or D.O.)), or credential. Further, the data in the biological context representation 200 may be retrieved from any suitable data source (e.g., digital libraries, websites, databases, files, or the like). These examples are not meant to be limiting. Thus, the example types of data are also not meant to be limiting and other types of data may be stored within the biological context representation without departing from the scope of this disclosure. Further, the various data included in the biological context representation 200 may be linked based on one or more relationships between or among the data, in order to represent knowledge pertaining to the biological context or drug compound. - The
physical properties data 202 includes physical properties exhibited by the drug compound. The physical properties may refer to characteristics that provide a physical description of the drug such as color, particle size, crystalline structure, melting point, and solubility. In some instances, the physical properties data 202 may also include chemical property data, such as the structure, form, and reactivity of a substance. In some embodiments, biological data may also be included (e.g., anti-neurodegenerative compound data, pro-neuroplasticity compound data, anti-cancer data) in the biological context representation 200. - The
peptide activity data 204 may include various types of activity exhibited by the drug. For example, the activity may be hormonal, antimicrobial, immunomodulatory, cytotoxic, neurological, and the like. A peptide may refer to a short chain of amino acids linked by peptide bonds. - The
microbe data 206 may include information pertaining to cellular structure (e.g., unicellular, multicellular, etc.) of a microscopic organism. The microbes may refer to bacteria, parasites, fungi, viruses, prions, or any combination of these, etc. - The
antimicrobial compound data 208 may include information pertaining to agents that kill microbes or stop their growth. This data may include classifications based on the microorganisms against which the antimicrobial compound acts (e.g., antibiotics act against bacteria but not against viruses; antivirals act against viruses but not against bacteria). The antimicrobial compound may also be classified according to function (e.g., microbicidal, meaning “that which kills, vitiates, inactivates or otherwise impairs the activity of certain microbes”). - The
clinical outcome data 210 may include information pertaining to the administration of a drug compound to a subject in a clinical setting. For example, upon or subsequent to administration of the drug compound, the outcome may be a prevented disease, cured disease, treated symptom, etc. - The evidence-based
guidelines 212 may include information pertaining to guidelines based upon clinical studies for acceptable treatment or therapeutics for certain diseases or medical conditions. Evidence-based guidelines data 212 may include data specific to various specialties within healthcare such as, for example, obstetrics, anesthesiology, hepatology, gastroenterology, neurology, pulmonology, orthopedics, pediatrics, trauma care (including but not limited to burns and post-burn infections), histology, oncology, ophthalmology, endocrinology, rheumatology, internal medicine, surgery (including reconstructive (plastic) and cosmetic), vascular medicine, emergency medicine, radiology, psychiatry, cardiology, urology, gynecology, genetics, and dermatology. In the example described herein, the evidence-based guidelines 212 include systematically developed statements to assist practitioner and patient decisions about appropriate health care (e.g., types of drugs to prescribe for treatment) for specific clinical circumstances. - The
disease association data 214 may include information about which disease or medical condition the drug compounds are associated with. For example, the drug compound Metformin may be associated with the disease type 2 diabetes. - The
pathway data 216 may include information pertaining to the relationships or paths, within a design space, between ingredients (e.g., chemicals) and activity levels. - The
compound data 218 may include information pertaining to the compound such as the sequence of ingredients (e.g., type, amount, etc.) in the compound. In the therapeutics industry, for example, the compound data 218 can include data specific to the various types of drug compounds that are designed, defined, developed, or distributed. - The
gene interaction data 220 may include information pertaining to which gene the drug compound or a disease may interact with. - The
anti-neurodegenerative compound data 222 may include information pertaining to characteristics of anti-neurodegenerative compounds, such as their physical and chemical properties and activities on portions of tissue. For example, the activity may include anti-inflammatory or neuro-protective actions. - The
pro-neuroplasticity compound data 224 may include information pertaining to characteristics of pro-neuroplasticity compounds, such as their physical and chemical properties and activities on portions of tissue. For example, the activity may enhance the capacity of motor systems by upregulation of neurotrophins. -
FIGS. 3A-3B illustrate a high-level flow diagram according to certain embodiments of this disclosure. Regarding FIG. 3A, a flow diagram 300 begins with obtaining heterogeneous datasets, such as the biological context representation 200. Heterogeneous datasets may refer to populations or samples of data that are different (e.g., as opposed to homogeneous datasets where the data is the same). The heterogeneous datasets may include compound data (e.g., peptide sequence data), clinical outcome data, or activity data (in vitro and in vivo activity), as well as any other suitable data depicted in FIG. 2. - The data structure storing the heterogeneous datasets may be translated to a second data structure having a second format (e.g., a 2-dimensional vector) that the
AI engine 140 may use to generate the candidate drug compounds. The next step in the flow diagram 300 includes training the one or more machine learning models 132 using the heterogeneous datasets. The one or more machine learning models 132 (e.g., generative models) may generate a set of candidate drug compounds based on the heterogeneous datasets. As described herein, a machine learning model may use causal inference and counterfactuals when generating the set of candidate drug compounds. Further, a GAN may be used in conjunction with causal inference to generate the set of candidate drug compounds. In some embodiments, a certain number (e.g., over 100,000) of novel candidate drug compounds may be generated in a set. That is, each candidate drug compound in the set of candidate drug compounds is intended to be unique. - The next step in the flow diagram 300 includes inputting the set of candidate drug compounds into one or more
machine learning models 132 trained to classify the set of candidate drug compounds. The machine learning models 132 may perform supervised or unsupervised filtering. In some embodiments, the machine learning models 132 may perform clustering to rank the various candidate drug compounds and to classify one candidate drug compound as a selected candidate drug compound. In some embodiments, the machine learning models 132 may output a subset (e.g., 1,000 to 10,000, or more, or fewer) of candidate drug compounds. - The next step in the flow diagram 300 may include performing experimental validation by validating whether each candidate drug compound in the subset of candidate drug compounds provides the desired level of certain types of activity in a design space. The results of the experimental validation may be fed back into the heterogeneous datasets to reinforce and expand the experimental dataset.
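The filter-and-rank step described above can be sketched as follows. This is a minimal illustration, not the disclosed classifier: the candidate names, activity profiles, cytotoxicity threshold, and target profile are all assumptions made for the example.

```python
import math

# Hypothetical activity profiles for candidate compounds in a design space:
# (antimicrobial, immunomodulatory, cytotoxic) activity levels in [0, 1].
candidates = {
    "cand_1": (0.9, 0.7, 0.1),
    "cand_2": (0.4, 0.2, 0.8),
    "cand_3": (0.8, 0.6, 0.2),
}

# Desired profile: high antimicrobial/immunomodulatory, low cytotoxicity.
target = (1.0, 1.0, 0.0)

def distance(a, b):
    """Euclidean distance between two activity profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Filter out candidates whose cytotoxic activity exceeds a threshold,
# then rank the remainder by closeness to the target profile and select
# the best-ranked candidate drug compound.
subset = {k: v for k, v in candidates.items() if v[2] <= 0.5}
selected = min(subset, key=lambda k: distance(subset[k], target))
```

In practice the ranking would operate on the lower-dimensional embeddings rather than raw activity tuples, but the select-a-subset-then-rank flow is the same.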
- The next step in the flow diagram 300 may include performing peptide drug optimization. The optimizations may include performing gradient descent or ascent using the sequence of ingredients in the candidate drug compounds to attempt to increase or decrease certain activity levels in a design space. The results of the peptide drug optimization may be fed back into the heterogeneous datasets to reinforce and expand the experimental dataset.
-
FIG. 3B illustrates another high-level flow diagram 310 according to some embodiments. As depicted, a heterogeneous network of biology may be included in a knowledge graph of a biological context representation 200. Various paths or meta-paths may be expressed between nodes in the biological context representation 200. For example, the meta-paths may include indications for compound upregulates, pathway participates, disease associations, gene interactions, and compound data. - The
biological context representation 200 may be translated from a first format (e.g., knowledge graph) to a format (e.g., vector) that may be processed by the AI engine 140. The AI engine 140 may use one or more machine learning models to traverse the knowledge graph by performing random walks until a corpus of random walks is generated, wherein such random walks include the indications associated with the meta-paths representing sequences of ingredients. The corpus of random walks may be referred to as a set of candidate drug compounds. A generative adversarial network using causal inference may be used to generate the set of candidate drug compounds. The set of candidate drug compounds may be stored in a higher-dimensional vector. - The
AI engine 140 may compress the higher-dimensional vector of the set of candidate drug compounds into a lower-dimensional vector of the set of candidate drug compounds, depicted as biological embeddings in FIG. 3B. In some embodiments, the lower-dimensional vector may include fewer dimensions (e.g., 2, 3, . . . N) than the higher-dimensional vector (e.g., greater than N). As depicted, the nodes may be organized by the meta-path indicators and by dimension. - To output a subset of candidate drug compounds, the lower-dimensional vector of the set of candidate drug compounds may be input to one or more
machine learning models 132 trained to perform classification. The classification techniques may include using clustering to filter out candidate drug compounds that produce undesirable levels of types of activity. In some embodiments, to enable the AI engine 140 to perform the classification, views presenting the levels of types of activity of each candidate drug compound in a design space may be generated using the lower-dimensional vectors. These views may also be presented to a user via the computing device 102. The machine learning models 132 may output a candidate drug compound classified as a selected candidate drug compound based on the clustering. For example, the selected candidate drug compound may include an optimized sequence of ingredients that provides the most desirable levels of a certain type of activity in a design space. -
FIG. 4 illustrates example operations of a method 400 for generating and classifying a candidate drug compound according to certain embodiments of this disclosure. The method 400 is performed by processing logic that may include hardware (circuitry, dedicated logic, etc.), software (such as is run on a general-purpose computer system or a specialized machine), or a combination of both. The method 400 or each of its individual functions, routines, subroutines, or operations may be performed by one or more processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In certain implementations, the method 400 may be performed by a single processing thread. Alternatively, the method 400 may be performed by two or more processing threads, each thread implementing one or more individual functions, routines, subroutines, or operations of the method. In some embodiments, one or more accelerators may be used to increase the performance of a processing device by offloading various functions, routines, subroutines, or operations from the processing device. One or more operations of the method 400 may be performed by the training engine 130 of FIG. 1. - For simplicity of explanation, the
method 400 is depicted and described as a series of operations. However, operations in accordance with this disclosure can occur in various orders or concurrently, and with other operations not presented and described herein. For example, the operations depicted in the method 400 may occur in combination with any other operation of any other method disclosed herein. Furthermore, not all illustrated operations may be required to implement the method 400 in accordance with the disclosed subject matter. In addition, those skilled in the art will understand and appreciate that the method 400 could alternatively be represented as a series of interrelated states via a state diagram or events. - At 402, the processing device may generate a
biological context representation 200 of a set of drug compounds. The biological context representation 200 may include a first data structure having a first format (e.g., a knowledge graph). The biological context representation 200 may include, for each drug compound of the set of drug compounds, one or more relationships between or among, without limitation, (i) physical properties data 202, (ii) peptide activity data 204, (iii) microbe data 206, (iv) antimicrobial compound data 208, (v) clinical outcome data 210, (vi) evidence-based guidelines 212, (vii) disease association data 214, (viii) pathway data 216, (ix) compound data 218, (x) gene interaction data 220, (xi) anti-neurodegenerative compound data 222, (xii) pro-neuroplasticity compound data 224, or some combination thereof. - At 404, the processing device may translate, by the
artificial intelligence engine 140, the first data structure having the first format to a second data structure having a second format. The translating may include converting the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector) according to a specific set of rules executed by the artificial intelligence engine 140. In some embodiments, the translating may be performed by one or more of the machine learning models 132. For example, a recurrent neural network may perform at least a portion of the translating. - The translating may include obtaining a higher-dimensional vector and compressing the higher-dimensional vector into a lower-dimensional vector (e.g., two-dimensional, three-dimensional, four-dimensional), referred to herein as an embedding. In some embodiments, one or more embeddings may be created from the first data structure having the first format. There may be any suitable number of dimensions of the embeddings. When used for classifying candidate drug compounds, the number of dimensions may be selected based on a desired performance in processing the embeddings. The lower-dimensional vector may have at least one fewer dimension than the higher-dimensional vector.
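The compression from a higher-dimensional vector to a lower-dimensional embedding can be sketched with a truncated SVD standing in for the trained encoder; this is a hedged, linear approximation of the autoencoding described here, and the synthetic data matrix and two-dimensional embedding size are assumptions for the example.

```python
import numpy as np

# Synthetic stand-in for higher-dimensional vectors: 50 samples, 10 dims.
rng = np.random.default_rng(0)
high_dim = rng.normal(size=(50, 10))

# Truncated SVD as a linear stand-in for the encoder.
U, S, Vt = np.linalg.svd(high_dim, full_matrices=False)
k = 2                                  # embedding dimensionality
embedding = U[:, :k] * S[:k]           # lower-dimensional embeddings

# "Decoder": project back up and measure reconstruction error, analogous
# to the feedback signal used to train the compressing model.
reconstructed = embedding @ Vt[:k]
error = np.linalg.norm(high_dim - reconstructed)
```

A trained autoencoder would learn a nonlinear version of this mapping, but the shape of the computation — encode down, decode up, compare against the original — is the same.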
- At 406, the processing device may generate, based on the second data structure having the second format, a set of candidate drug compounds. In some embodiments, the generating may be performed by one or more of the
machine learning models 132. For example, a generative adversarial network may perform the generating of the set of candidate drug compounds. In some embodiments, the set of candidate drug compounds may be associated with design spaces pertaining to antimicrobial, anticancer, antibiofilm, or the like. A biofilm may include any syntrophic consortium of microorganisms in which cells stick to each other and often also to a surface. These adherent cells may become embedded within an extracellular matrix that is composed of extracellular polymeric substances (EPS). - At 408, the processing device may classify a candidate drug compound from the set of candidate drug compounds as a selected candidate drug compound. In some embodiments, the classifying may be performed by one or more of the
machine learning models 132. For example, a classifier trained using supervised or unsupervised learning may perform the classifying. In some embodiments, the classifier may use clustering techniques to rank and classify the selected candidate drug compound. - In some embodiments, the processing device may generate a set of views including a representation of a design space. The design space may be antimicrobial. The processing device may cause the set of views to be presented on a computing device (e.g., computing device 102). The representation of the design space may pertain to, without limitation, (i) antimicrobial activity, (ii) immunomodulatory activity, (iii) neuromodulatory activity, (iv) cytotoxic activity, or some combination thereof. Each view of the set of views may present an optimized sequence representing the selected candidate drug compound.
- The optimized sequence in each view may be generated using any suitable optimization technique. The optimization technique may include maximizing or minimizing an objective function by systematically selecting input values from a domain of values and computing the value of the objective function. The domain of values may include a subset of values from a Euclidean space. The subset of values may satisfy one or more constraints, equalities, or inequalities. A value that minimizes or maximizes the objective function may be referred to as an optimal solution. Certain values in the subset may result in a gradient of the objective function being zero. Those certain values may be at stationary points, where the first derivative of the objective function is zero. The gradient of a scalar-valued differentiable function (e.g., the objective function) of several variables, evaluated at a point p, is a vector whose components are the partial derivatives of the objective function at p. If the gradient is not a zero vector at a certain point p, then the direction of the gradient is the direction of fastest increase of the objective function at the certain point p.
- Gradients may be used in gradient descent, which refers to a first-order iterative optimization algorithm for finding the local minimum of an objective function. To find the local minimum, gradient descent may proceed by performing operations proportional to the negative of the gradient of the objective function at a current point. In some embodiments, the optimized sequence may be found for a candidate drug compound by performing gradient descent in the design space. Additionally, gradient ascent, which is the algorithm opposite to gradient descent, may determine a local maximum of the objective function at various points in the design space.
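The gradient-descent procedure just described can be sketched over a two-variable objective. The quadratic objective, learning rate, and step count below are illustrative assumptions, not the disclosed activity surface; gradient ascent would simply negate the update.

```python
# Made-up activity surface with its minimum at (1, -2).
def objective(x, y):
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

# Analytic gradient of the objective at (x, y).
def gradient(x, y):
    return (2.0 * (x - 1.0), 2.0 * (y + 2.0))

def gradient_descent(x, y, lr=0.1, steps=200):
    """Repeatedly step proportional to the negative gradient."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)
```

With a step size small enough for the curvature of the objective, the iterates converge toward the stationary point where the gradient is the zero vector.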
- The views generated may include a topographical heatmap, itself including indicators for the least activity at points in the design space and the most activity at points in the design space. The indicator associated with the most activity may represent a local maximum obtained using gradient ascent. The indicator associated with the least activity may represent a local minimum obtained using gradient descent. The optimized sequence may be generated by navigating points between the local minima and local maxima. The optimized sequence may be overlaid on the indicators ranging from at least one least active property to at least one most active property.
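Locating the most-active and least-active indicators on a discretized design space can be sketched as follows; the activity function and grid spacing are illustrative assumptions, not values from the disclosure.

```python
# Made-up activity surface with its peak at (2, 3).
def activity(x, y):
    return -(x - 2) ** 2 - (y - 3) ** 2 + 10.0

# Discretize the design space on a 0.5-spaced grid over [0, 5] x [0, 5].
xs = [i * 0.5 for i in range(11)]
ys = [j * 0.5 for j in range(11)]
points = [(x, y, activity(x, y)) for x in xs for y in ys]

# Indicators for the heatmap: the most- and least-active grid points.
most_active = max(points, key=lambda p: p[2])
least_active = min(points, key=lambda p: p[2])
```

A heatmap view would color every grid point by its activity value and mark these two extrema; an optimized sequence could then be traced between them.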
- In some embodiments, the processing device may cause the selected candidate drug compound to be formulated. In some embodiments, the processing device may cause the selected candidate drug compound to be created, manufactured, developed, synthesized, or the like. In some embodiments, the processing device may cause the selected candidate drug compound to be presented on a computing device (e.g., computing device 102). The selected candidate drug compound may include one or more active ingredients (e.g., chemicals) at a specified amount.
-
FIGS. 5A-5D provide illustrations of generating a first data structure including a biological context representation 200 of a plurality of drug compounds according to certain embodiments of this disclosure. The first data format may include a knowledge graph. The biological context representation 200 may capture an entire biological context by integrating every known association or relationship for each drug compound into a comprehensive knowledge graph. -
FIG. 5A presents the biological context representation 200 including biomedical and domain knowledge on peptide activity, microbes, antimicrobial compounds, clinical outcomes, and any relevant information depicted in FIG. 2. A table 500 may include rows representing various categories (A, B, C, D, and E) pertaining to a biological context for each drug compound and columns representing sub-categories (1, 2, 3, 4, and 5). For example, the table includes subcategories for category A: A1 2D fingerprints, A2 3D fingerprints, A3 Scaffolds, A4 Structure Keys, A5 Physicochemical; for category B: B1 Mechanism of action, B2 Metabolic Genes, B3 Crystals, B4 Binding, B5 High-throughput Screening bioassays; for category C: C1 S. molecular Roles, C2 S. molecular Pathways, C3 Signaling Pathways, C4 Biological Processes, C5 Interactome; for category D: D1 Transcript, D2 Cancer Cell lines, D3 Chromosome Genetics, D4 Morphology, D5 Cell bioassays; and for category E: E1 Therapeutic Areas, E2 Indications, E3 Side effects, E4 Disease & Toxicology, E5 Drug-drug interaction.
Charts chart 502 include the size of molecules, for chart 504 the complexity of variables, and for 506 the correlation with mechanism of action. Anotherchart 508 may represent the various characteristics of the subcategories using an indicator (such as a range of colors from 0 to 1) to express the values of the characteristics in relation to each other. -
FIG. 5B illustrates a different representation 520 of characteristics for several subcategories (e.g., A1, B1, C5, D1, and E3) across different subject matter areas (e.g., neurology and psychiatry, infectious disease, gastroenterology, cardiology, ophthalmology, oncology, endocrinology, pulmonary, rheumatology, and malignant hematology). Accordingly, the representation 520 provides an even more granular representation of the biological context representation 200 than does the chart 508. Flowchart 530 represents the process for generating candidate drugs as described further herein. -
FIG. 5C illustrates a knowledge graph 540 representing the biological context representation 200. The knowledge graph 540 may refer to a cognitive map. In particular, the knowledge graph 540 represents a graph traversed by the AI engine 140 when generating candidate drug compounds having desired levels of certain types of activity in a design space. Individual nodes in the knowledge graph 540 represent a health artifact (health-related information) or relationship (predicate) gleaned and curated from numerous data sources. Further, the knowledge represented in the knowledge graph 540 may be improved over time as the machine learning models discover new associations, correlations, or relationships. The nodes and relationships may form logical structures that represent knowledge (e.g., Genes, Participates, and Pathways). FIG. 5D illustrates another representation of the knowledge graph 540 that more clearly identifies all the various relationships among the nodes. -
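The node-and-relationship structure described for the knowledge graph can be sketched as subject-predicate-object triples; the node and predicate names below are illustrative assumptions, not entries from the actual graph.

```python
# Each triple links a health artifact (node) to another node through a
# relationship (predicate), e.g., genes participate in pathways.
triples = [
    ("gene_A", "participates", "pathway_P"),
    ("compound_X", "upregulates", "gene_A"),
    ("compound_X", "associates", "disease_D"),
]

def neighbors(graph, node):
    """Return the nodes reachable from `node` in one hop."""
    return {obj for subj, _, obj in graph if subj == node}

reachable = neighbors(triples, "compound_X")
```

Traversals such as the random walks described above amount to repeatedly hopping through these triples, and new triples can be appended as the machine learning models discover new associations.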
FIG. 6 illustrates example operations of a method 600 for translating the first data structure of FIGS. 5A-5B to a second data structure according to certain embodiments of this disclosure. Method 600 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 600 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 600 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 600 may be performed in some combination with any of the operations of any of the methods described herein. - The
method 600 may include operation 404 from the previously described method 400 depicted in FIG. 4. For example, at 404 in the method 600, the processing device may translate, by the artificial intelligence engine 140, the first data structure having the first format (e.g., knowledge graph) to the second data structure having the second format (e.g., vector). The method 600 in FIG. 6 includes operations 602, 604, and 606. - At 602, the processing device may obtain a higher dimensional vector from the
biological context representation 200. This process is further illustrated in FIG. 7. - At 604, the processing device may compress the higher-dimensional vector to a lower-dimensional vector. The compressing may be performed by a first
machine learning model 132 trained to perform deep autoencoding via a recurrent neural network configured to output the lower-dimensional vector. - At 606, the processing device may train the first
machine learning model 132 by using a second machine learning model 132 to recreate the first data structure having the first format. The second machine learning model 132 is trained to perform a decoding operation to recreate the first data structure having the first format. The decoding operation may be performed on the second data structure having the second data format (e.g., two-dimensional vector). -
FIG. 7 provides illustrations of translating the first data structure of FIGS. 5A-5B to the second data structure according to certain embodiments of this disclosure. Aggregated biological data may be difficult to model and format correctly for an AI engine to process. Aspects of the present disclosure overcome the hurdle of modeling and formatting the aggregated biological data to enable the AI engine 140 to generate candidate drug compounds accurately and efficiently. - As depicted, a higher-dimensional vector 700 may be obtained from the biological context representation 200. Using a recurrent neural network performing autoencoding, the higher-dimensional vector is compressed to a lower-dimensional vector 702. The recurrent neural network performing autoencoding is trained using another machine learning model 132 that recreates the higher-dimensional vector 704. If the other machine learning model 132 is unable to recreate the higher-dimensional vector 704 from the lower-dimensional vector 702, then the other machine learning model 132 provides feedback to the recurrent neural network performing autoencoding in order to update its weights, biases, or any suitable parameters. -
FIGS. 8A-8C provide illustrations of views of a selected candidate drug compound according to certain embodiments of this disclosure. As depicted, FIG. 8A illustrates a view 800 including antimicrobial activity, FIG. 8B illustrates a view 802 including immunomodulatory activity, and FIG. 8C illustrates a view 804 including cytotoxic activity. Each view presents a topographical heatmap where one axis is for sequence parameter y and the other axis is for sequence parameter x. Each view includes an indicator ranging from a least active property to a most active property. Further, each view includes an optimized sequence 806 for a selected candidate drug compound classified by the classifier (machine learning model 132). These views may be presented to the user on a computing device 102. Further, the selected candidate drug compound 806 may be formulated, generated, created, manufactured, developed, or tested. -
FIG. 9 illustrates example operations of a method 900 for presenting a view including a selected candidate drug compound according to certain embodiments of this disclosure. Method 900 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102). In some embodiments, one or more operations of the method 900 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 900 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 900 may be performed in some combination with any of the operations of any of the methods described herein. - At 902, the processing device may receive, from the
artificial intelligence engine 140, a candidate drug compound generated by the artificial intelligence engine 140. -
- At 906, the processing device may present the view on a display screen of a computing device (e.g., computing device 102).
-
FIG. 10A illustrates example operations of amethod 1000 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure.Method 1000 includes operations performed by processors of a computing device (e.g., any component ofFIG. 1 , such asserver 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of themethod 1000 are implemented in computer instructions that are stored on a memory device and executed by a processing device. Themethod 1000 may be performed in the same or a similar manner as described above in regard tomethod 400. The operations of themethod 1000 may be performed in some combination with any of the operations of any of the methods described herein. - At 1002, the processing device may perform one or more modifications pertaining to the
biological context representation 200, the second data structure having the second format, or some combination thereof. - At 1004, the processing device may use causal inference to determine whether the one or more modifications provide one or more desired performance results. In some embodiments, using causal inference may further include using 1006 counterfactuals to calculate alternative scenarios based on past actions, occurrences, results, regressions, regression analyses, correlations, or some combination thereof. A counterfactual may refer to determining whether the desired performance still results if something does not occur during the calculation. For example, in a scenario, a person may improve their health after taking a medication. The counterfactual may be used in causal inference to calculate an alternative scenario to see whether the person's health improved without taking the medication. If the person's health still improved without taking the medication it may be inferred that the medication did not cause the health of the person to improve. However, if the person's health did not improve without taking the medication, it may be inferred that the medication is correlated with causing the health of the person to improve. There may, however, be other factors involved in conjunction with taking the medication that actually cause the health of the person to improve.
-
FIG. 10B illustrates another example of operations ofmethod 1050 for using causal inference during the generation of candidate drug compounds according to certain embodiments of this disclosure.Method 1050 includes operations performed by processors of a computing device (e.g., any component ofFIG. 1 , such asserver 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of themethod 1050 are implemented in computer instructions that are stored on a memory device and executed by a processing device. Themethod 1050 may be performed in the same or a similar manner as described above in regard tomethod 400. The operations of themethod 1050 may be performed in some combination with any of the operations of any of the methods described herein. - At 1052, the processing device may generate a set of candidate drug compounds by performing a modification using causal inference based on a counterfactual. For example, the counterfactual may include removing an ingredient from a sequence of ingredients to determine whether a candidate drug compound provides the same level or type of activity it previously provided when the ingredient was included in the sequence. If the same level or type of activity is still provided after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is not correlated with the level or type of activity. If the same level or type of activity is not present after application of the counterfactual (e.g., removal of the ingredient), then the processing device may use causal inference to determine that the ingredient is correlated with the level or type of activity.
- At 1054, the processing device may classify a candidate dug compound from the set of candidate drug compounds as a selected candidate drug compound, as previously described herein.
-
FIG. 11 illustrates example operations of amethod 1100 for using several machine learning models in an artificial intelligence engine architecture to generate peptides according to certain embodiments of this disclosure.Method 1100 includes operations performed by processors of a computing device (e.g., any component ofFIG. 1 , such asserver 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of themethod 1100 are implemented in computer instructions stored on a memory device and executed by a processing device. Themethod 1100 may be performed in the same or a similar manner as described above in regard tomethod 400. The operations of themethod 1100 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 1102, the processing device may generate, via a creator module 151, a candidate drug compound including a sequence for the candidate drug compound. The sequence for the candidate drug compound includes a concatenated vector that may include drug compound sequence information, drug compound activity information, drug compound structure information, and drug compound semantic information.
- In some embodiments, the
creator module 151 may generate the candidate drug compound by performing ensemble learning by concatenating a set of encodings. The encodings may each represent respective sequences in a vector. A first encoding of the set of encodings may pertain to drug compound sequence information. A second encoding of the set of encodings may pertain to drug compound structural information. A third encoding of the set of encodings may pertain to peptide activity information. A fourth encoding of the set of encodings may pertain to drug compound semantic information. - In some embodiments, the
creator module 151 may generate the candidate drug compound using an autoencoder machine learning model trained to receive a higher-dimensional vector encoding representing the candidate drug compound and output a lower-dimensional vector embedding representing the candidate drug compound. The creator module 151 may generate a latent representation using the lower-dimensional vector embedding representing the candidate drug compound. - At
block 1104, the processing device may include, via the creator module 151, the candidate drug compound as a node in a knowledge graph (e.g., biological context representation 200). In some embodiments, the knowledge graph may include a first layer including structure and physical properties of molecules, a second layer including molecule-to-molecule interactions, a third layer including molecular pathway interactions, a fourth layer including molecular cell profile associations, and a fifth layer including molecular therapeutics and indications. Indications may refer to drug indications, that is, the disease that gives a valid reason for clinicians to administer a specific drug. - At
block 1106, the processing device may generate, via a descriptor module 152, a description of the candidate drug compound at the node in the knowledge graph. The description may include drug compound sequence information, drug compound structural information, drug compound activity information, and drug compound semantic information. - At
block 1108, based on the description, the processing device may perform, via a scientist module 153, a benchmark analysis of a parameter of the creator module 151. In some embodiments, the scientist module 153 may perform causal inference using the candidate drug compound in a design space pertaining to biomedical activity (e.g., antimicrobial, anticancer, etc.) to determine if the candidate drug compound still provides a desired effect regarding the type of biomedical activity if the candidate drug compound, or the design space, is changed. - At
block 1110, the processing device may modify, based on the benchmark analysis, the creator module 151 to change the parameter in a desired way during a subsequent benchmark analysis. Changing the parameter in a desired way may refer to changing a value of the parameter in a desired way, i.e., increasing or decreasing the value of the parameter. Accordingly, a self-improving AI engine 140 is disclosed that increasingly generates better candidate drug compounds over time by recursively updating the creator module 151 based on baselines. - In some embodiments, the processing device may generate, via a
reinforcer module 154 based on the candidate drug compound and the description, experiments that produce desired data for the candidate drug compound. The experiments may be generated in response to the candidate drug compound and the description being similar to a real drug compound and another description of the real drug compound. For example, the reinforcer module 154 may determine that certain experiments for the real drug compound elicited desired data and may select those experiments to perform for the candidate drug compound. The processing device may perform the experiments (e.g., by running simulations) to collect data pertaining to the candidate drug compound. The processing device may determine, based on the data, an effectiveness of the candidate drug compound. -
FIG. 12 illustrates example operations of a method 1200 for performing a benchmark analysis according to certain embodiments of this disclosure. Method 1200 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 1200 are implemented in computer instructions that are stored on a memory device and executed by a processing device. The method 1200 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 1200 may be performed in some combination with any of the operations of any of the methods described herein. - The
method 1200 includes additional operations included in block 1108 of FIG. 11. At block 1202, the processing device generates, via the scientist module 153, a score for a parameter of the creator module 151 that generated the candidate drug compound. The parameter may include a validity of the candidate drug compound, uniqueness of the candidate drug compound, novelty of the candidate drug compound, similarity of the candidate drug compound to another candidate drug compound, or some combination thereof. - At
block 1204, the processing device may rank a set of creator modules 151 based on the score, where the set of creator modules comprises the creator module. For example, other creator modules in the set of creator modules may be scored based on the candidate drug compounds they generated. The set of creator modules may be ranked for each respective category from highest scoring to lowest scoring or vice versa. - At
block 1206, the processing device may determine which creator module 151 of the set of creator modules performs better for each respective parameter. The scores of the parameters for each of the set of creator modules 151 may be presented on a display screen of a computing device. The best performing creator modules for each parameter may also be presented on the display screen. - At
block 1208, the processing device may tune the set of creator modules 151 to cause the set of creator modules 151 to receive higher scores for certain parameters during subsequent benchmark analyses. The tuning may optimize certain weights, activation functions, numbers of hidden layers, loss functions, and the like of one or more generative modules included in the creator modules. - At
block 1210, the processing device may select, based on the parameters, a subset of the set of creator modules 151 to use to generate subsequent candidate drug compounds having desired parameter scores. For example, it may be desired to generate candidate drug compounds that result in a high uniqueness score. The creator module(s) 151 associated with high uniqueness scores may be selected in the subset of creator modules 151. - At
block 1212, the processing device may transmit the subset of the set of creator modules as a package to a third party to be used with data of the third party. The subset of the set of creator modules may be trained to process a type of the data of the third party. Other modules, such as the reinforcer module, the descriptor module, the scientist module, and the conductor module, may be included in the package delivered to the third party. Also, a knowledge graph including data pertaining to the third party may be included in the package. In this way, the disclosed techniques may provide custom-tailored packages that may be used by the third party to perform the embodiments disclosed herein. -
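The scoring, ranking, and selection flow of blocks 1202 through 1210 can be sketched in Python. The module names, parameters, and score values below are hypothetical; the disclosure does not prescribe a particular data layout:

```python
# Hypothetical per-parameter scores for a set of creator modules; the
# parameters mirror those named at block 1202 (validity, uniqueness, novelty).
scores = {
    "creator_a": {"validity": 0.92, "uniqueness": 0.40, "novelty": 0.75},
    "creator_b": {"validity": 0.81, "uniqueness": 0.88, "novelty": 0.60},
    "creator_c": {"validity": 0.77, "uniqueness": 0.52, "novelty": 0.90},
}

def rank_creators(scores, parameter):
    """Rank creator modules from highest to lowest score for one parameter."""
    return sorted(scores, key=lambda module: scores[module][parameter], reverse=True)

def select_subset(scores, parameter, minimum):
    """Select the subset of creator modules whose score for the given
    parameter meets a minimum, as in block 1210."""
    return [m for m in rank_creators(scores, parameter) if scores[m][parameter] >= minimum]
```

For instance, `select_subset(scores, "uniqueness", 0.8)` would keep only `creator_b`, mirroring the example of selecting the creator modules associated with high uniqueness scores.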
FIG. 13 illustrates example operations of a method 1300 for slicing a latent representation based on a shape of the latent representation according to certain embodiments of this disclosure. Method 1300 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as server 128 executing the artificial intelligence engine 140). In some embodiments, one or more operations of the method 1300 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 1300 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 1300 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 1302, the processing device may determine a shape of the multi-dimensional, continuous representation of the set of candidates. At block 1304, the processing device may determine, based on the shape, a slice to obtain from the multi-dimensional, continuous representation of the set of candidates. At block 1306, the processing device may determine, using a decoder, which dimensions are included in the slice. The dimensions may pertain to peptide sequence information, peptide structural information, peptide activity information, peptide semantic information, or some combination thereof. At block 1308, the processing device may determine, based on the dimensions, an effectiveness of a biomedical feature of the slice. -
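As an illustration of blocks 1304 and 1306, a slice can be taken from a multi-dimensional representation by retaining only selected named dimensions. The dimension names and coordinate values below are hypothetical stand-ins for a decoded latent representation:

```python
def slice_representation(points, dim_names, keep):
    """Keep only the named dimensions of each point in a representation.
    points: list of coordinate tuples; dim_names: one name per coordinate;
    keep: the dimension names to retain in the slice."""
    indices = [dim_names.index(name) for name in keep]
    return [tuple(point[i] for i in indices) for point in points]
```

With `dim_names = ["sequence", "structure", "activity", "semantic"]`, slicing on `["structure", "activity"]` keeps the second and third coordinate of every point, analogous to the decoder identifying which dimensions a slice contains.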
FIG. 14 illustrates a high-level flow diagram for a therapeutics tool implementing, incorporating or using business intelligence according to certain embodiments of this disclosure. A business intelligence screen may be presented in a graphical user interface on the computing device 102. The computing device 102 may be operated by a person assigned to a development team, business intelligence team, or the like. The user interface may include various graphical elements (e.g., buttons, slider bars, radio buttons, input boxes, etc.) that enable the user to enter, select, configure, etc. a desired target product profile 1400 for sequences (e.g., peptide). The target product profile may include pharmacology data 1402 (e.g., drug brand name (if applicable), drug generic name, drug dose, clinical trial information and results, toxicology, stability, safety, efficacy, dose cost, etc.), pharmacokinetic data, pharmacodynamic data, activity data, manufacturing data 1404 (e.g., liquid chromatography mass spectrometry (LCMS) data, ability to be manufactured, scalability in production, etc.), compliance data, biological data 1406 (e.g., metabolic information (e.g., half-life, LD50, etc.), sequence data, pathway, interactions, indications, symptoms, genes, etc.), or some combination thereof. In some embodiments, while the user interface is presenting a design space for proteins, the target product profile may be entered, selected, configured, etc. via the user interface. The computing device 102 or the artificial intelligence engine 140 may select or filter the design space to present a solution space which includes sequences that match (e.g., partially or exactly) the target product profile. - The sequences may be selected, based on the target product profile, from a library of sequences. The library of sequences may be generated by one or more
machine learning models 132 of the artificial intelligence engine 140 performing the techniques described herein. In some embodiments, if a certain number of sequences (e.g., 0, 5, 10, etc.) are found or not found to have a matching target product profile, then the artificial intelligence engine 140 may attempt to generate sequences having features pertinent to the target product profile. The dynamically generated sequences may be added to the library of sequences and may be presented on the user interface of the computing device 102. - The sequences that match the target product profile may include a list of candidate drug compounds (e.g., peptide candidates) or relevant candidate drug compound features. The features may include biomedical ontological relations, terms, characteristics, descriptors, or the like or non-biomedical ontological relations, terms, characteristics, descriptors, or the like. For example, the features may include levels of structural (e.g., physical, chemical, biological, etc.) information, semantic information, activity, classes of activity, indications (e.g., clinical outcomes), genes, symptoms, interactions, folding properties, wave properties, stabilities of modification, sequence information (e.g., location or number of amino acids in a strand), and so forth. The user may use one or more graphical elements presented on the graphical user interface to select one or more of the sequences. Selecting the one or more sequences may cause another user interface, such as a candidate dashboard screen, to present additional data pertaining to the one or more selected sequences. In some embodiments, selecting the one or more sequences may cause the one or more sequences to be manufactured, produced, synthesized, or the like.
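The partial-or-exact profile matching described above can be sketched as a filter over a sequence library. The attribute names and library entries are invented for illustration; a real target product profile would carry the pharmacology, manufacturing, and biological fields listed above:

```python
def matches_profile(candidate, target, partial=True):
    """True if a candidate's attributes match a target product profile.
    Exact matching requires every target attribute to match; partial
    matching requires at least one to match."""
    hits = [candidate.get(key) == value for key, value in target.items()]
    return any(hits) if partial else all(hits)

def select_from_library(library, target, partial=True):
    """Select sequences from a library whose attributes match the profile."""
    return [entry["name"] for entry in library if matches_profile(entry, target, partial)]
```

A partial match widens the solution space (any shared attribute qualifies), while an exact match narrows it to sequences satisfying the full profile.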
-
FIG. 15 illustrates an example user interface 1500 for using query parameters to generate a solution space including protein sequences according to certain embodiments of this disclosure. The user interface 1500 includes a first portion 1502 and a second portion 1504. The first portion includes a landscape view of a solution space 1506 within a design space. Various color-coded clusters may represent the sequences included in the solution space. The sequences are visualized as interacting with each other via connections in a network. Information pertaining to the sequences may be stored in eigenvectors and presented in any number of applicable dimensions. - The
first portion 1502 includes various graphical elements to enable a user to select certain information, features, identifiers, query parameters, etc. that may be used to filter, constrain, build, generate, etc. the solution space within a design space for proteins for particular applications. The design space may include up to every conceivable or known configuration of sequences of proteins (e.g., peptides) in certain biochemical or biomedical applications (e.g., antimicrobial, anti-cancer, anti-viral, anti-fungal, anti-prion, immunomodulatory, neuromodulatory, a physiological effect caused by a signaling peptide, etc.). The design space may be created based on the knowledge graph that includes ontological data pertaining to sequences of proteins for up to every conceivable or known configuration of sequences of proteins. A resolution of the design space may be modified by identifying, as a first order, features or activities pertaining to the sequences. The term "resolution" may refer to the process of reducing, partitioning or separating something into its components (e.g., features or activities pertaining to the sequences).
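One way to picture this first-order resolution of a design space into per-sequence activities is the sketch below. The activity scoring function is a toy stand-in for a trained model (here, the fraction of arginine residues as a crude proxy for cell-penetrating activity, an assumption made purely for illustration):

```python
def resolve_design_space(sequences, activity_fns):
    """Pair each sequence with a respective set of activities, each computed
    by a (hypothetical) per-activity scoring function."""
    return [
        {"sequence": seq,
         "activities": {name: fn(seq) for name, fn in activity_fns.items()}}
        for seq in sequences
    ]

# Toy stand-in scorer: arginine fraction as a proxy for cell penetration.
toy_fns = {"cell_penetrating": lambda s: s.count("R") / len(s)}
resolved = resolve_design_space(["RRGK", "GGGG"], toy_fns)
```

Each entry of `resolved` is one component of the resolved design space: a sequence together with its respective set of activities.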
- For example, one
graphical element 1508 may include a dropdown box that enables entering, selecting, configuring, etc. one or more query parameters. Although a dropdown box is shown, any suitable graphical element may be used. The query parameters may include desirable sequence parameters associated with features, activities, properties, biomedically-related ontological relations, terms, characteristics, descriptors, or the like or non-biomedically-related ontological relations, terms, characteristics, descriptors, or the like. The query parameters may be used in any combination to generate different visualizations of solution spaces having sequences. If just one query parameter is of interest to a user (e.g., protein engineer, protein designer, peptide engineer, peptide designer, etc.), then a one-dimensional visualization of sequences related to that one query parameter may be presented in the first portion 1502. If "n" (where "n" is a positive integer) query parameters are of interest to a user, then an n-dimensional visualization of the sequences related to the n query parameters may be presented. The solution spaces that are generated or presented may be saved in the database 150. The artificial intelligence engine 140 may distill, based on the selected query parameters, the design space into the solution space 1506. For example, the distillation process may include selecting sequences as candidate drug compounds that produce activities (e.g., query parameters) exceeding a certain threshold level. The solution space 1506 may be generated to include those candidate drug compounds. - The
user interface 1500 enables a user to modify the query parameters to essentially tune the solution space presented such that desired sequences having particular features pertaining to the query parameters are depicted efficiently, accurately, and in a condensed visual format. Such a technique is beneficial because it distills a large (typically, very large) amount of data in the knowledge graph down into a visually comprehensible format, thereby increasing explainability and understandability. Due to the improved user interface 1500, a user's experience using the computing device may be enhanced because the user does not have to switch between or among multiple user interfaces or perform multiple queries to find different solution spaces. The enhanced user interface 1500 may save computing resources by using the query parameters to enable data reduction from a large protein design space to salient sequences in the solution space 1506. Further, the disclosed machine learning models may be trained to generate results (e.g., solution space 1506) superior to those results produced by conventional techniques. Additionally, the results produced using the disclosed techniques may have been previously computationally infeasible using conventional techniques. - The
second portion 1504 may include more granularly detailed data pertaining to the solution space 1506 and the sequences included therein. For example, the second portion 1504 includes a legend and various windows pertaining to interactions, associations, and proteins. The legend includes information pertaining to polo-box domain (e.g., the PDZ domain, SH3 domain, WW domain, WH1 domain, TK domain, PTP domain, PTB domain, SH2 domain, etc.), binding site (e.g., C-terminus, polyproline, phosphosite, etc.), interaction information, and network information. The various information is color-coded and correlated with the color-coded clusters presented in the first portion. Additionally, some of the information (e.g., polo-box domain and binding sites) in the legend is associated with different shapes to differentiate each type of information's graphics. The interaction information in the legend depicts how the various selections of polo-box domain information interact with each other, and the network information in the legend depicts how various clusters are connected in a network. Depicting the solution space using these techniques may provide an enhanced user interface by distilling a large amount of complex biochemical information about candidate drug compounds into a format easily understandable to a target user (e.g., peptide designer, business intelligence user). To make decisions pertaining to selecting candidate drug compounds without drilling down into additional screens, the user may view the user interface 1500, thereby saving computing resources and enhancing the user's experience using the computing device 102. The interactions window depicts a likelihood of pairwise interactions between two proteins. For example, "Protein 1" Q8IXW0 and "Protein 2" Q96RU3 have a probability of 0.52 of interacting.
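The pairwise probabilities shown in the interactions window are naturally order-independent, so they can be stored keyed by an unordered pair of accessions. The pair and probability below come from the example above; the storage scheme itself is an illustrative assumption:

```python
# Interaction probabilities keyed by an unordered pair of protein accessions,
# so lookup("Q8IXW0", "Q96RU3") equals lookup("Q96RU3", "Q8IXW0").
interactions = {frozenset({"Q8IXW0", "Q96RU3"}): 0.52}

def lookup(protein_1, protein_2, default=0.0):
    """Return the stored probability that two proteins interact."""
    return interactions.get(frozenset({protein_1, protein_2}), default)
```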
The associations window includes certain information pertaining to ontological terms concerning biological functions in subgraphs associated with the query that caused the solution space to be generated. The protein information window includes various graphical elements (e.g., input boxes) to enable the entering of information pertaining to descriptions of the protein or ontological terms related to the protein. - The
user interface 1500 may include one or more graphical elements 1512 configured to enable selecting one or more of the sequences in the solution space. The user may use the graphical element 1512 to select a sequence to view additional information pertaining to the selected sequence, or to cause the selected sequence to be manufactured, produced, synthesized, etc. For example, if a selected sequence is in the solution space, a user may be shown the topographical heatmap depicted in FIGS. 8A-8C. The sequence 806 depicted in FIG. 8A has a particular path along a traversal or feature map, where the path is specific to the query parameter entered (e.g., number of alanine amino acids). Each point on the traversal may be associated with a particular level of activity measured by one or more trained machine learning models 132 that generate the sequence 806. In some embodiments, selecting a sequence in the solution space 1506 may cause another user interface 1800 to be presented, such as a candidate dashboard screen in FIG. 18. -
FIG. 16 illustrates an example user interface 1600 for tracking information pertaining to trials according to certain embodiments of this disclosure. The trial information includes columns for a name of the trial (computation run), a tag indicating whether the trial is a test only, a creation date (start time of execution), a runtime length, a sweep, an encoder identifier (architecture of machine learning model), a number of training data, a number of validation data, an accuracy, an epoch, a human_iou (human intersection over union), and an iou (intersection over union). Further, a feature classification metric may also be user defined. A feature may refer to a descriptor that a machine learning model 132 is learning to classify. For example, one such feature may be "stability," and a machine learning model 132 may classify whether a peptide sequence is a stable sequence. The feature classification metric would be "stability" in that example. Other metrics may include accuracy, precision, intersection over union, or the like. The trial information may be useful to a protein designer by enabling the protein designer to determine which trials are more successful than other trials, more accurate than other trials, and the like. Further, the trial information may enable the protein designer to generate new trials that include beneficial features of previous trials. -
FIG. 17 illustrates an example user interface 1700 for presenting performance metrics of machine learning models that perform trials according to certain embodiments of this disclosure. As depicted, the performance metrics may include process graphic processing unit (GPU) usage (%), process GPU power usage (%), process GPU memory allocated (%), process GPU time spent accessing memory (%), and process GPU temperature (degrees, e.g., Celsius). Each metric may include a graph that includes representations (e.g., lines) associated with respective machine learning models. The graph may include an X axis corresponding to the time or time elapsed or other time measure, and a Y axis corresponding to a value amount (e.g., a cost value). The representations for each machine learning model may be overlaid on the graph to enable a comparison of how each machine learning model performed for a particular metric. - The performance metrics may be used to assign a cost value to each of the machine learning models. The cost may refer to how many resources (processor, memory, network, etc.) are used by the machine learning model during performance of trials, temperatures of components caused by the machine learning model during performance of trials, energy utilization, memory utilization, processor utilization, and other direct and indirect measures of money and non-money cost, among others. Assigning a cost (e.g., a weighted value or average as the sum of nodes traversed on a graph or as the expected value or other mathematical or statistical measure related to such cost) to each of the machine learning models may enable generating sequences that traverse the solution space to a desired location in the cheapest way possible. Accordingly, the disclosed techniques may enable saving computing resources by evaluating and assigning costs to certain machine learning models that perform better than other machine learning models.
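A minimal sketch of assigning a weighted cost to each model from its resource metrics follows; the metric names, values, and weights are illustrative assumptions rather than the disclosed cost formula:

```python
def model_cost(metrics, weights):
    """Weighted cost of one model computed from its resource metrics."""
    return sum(weights[name] * metrics[name] for name in weights)

def cheapest_model(all_metrics, weights):
    """Pick the model whose weighted resource cost is lowest."""
    return min(all_metrics, key=lambda model: model_cost(all_metrics[model], weights))
```

With hypothetical GPU usage, memory, and temperature readings, the cheapest model is the one a scheduler would prefer for subsequent trials.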
-
FIG. 18 illustrates an example user interface 1800 for a candidate dashboard screen according to certain embodiments of this disclosure. The candidate dashboard screen includes selected information (e.g., chemical, physical, structural, semantic, etc.) about a candidate drug compound and, preferably, all of the available information thereabout. The user interface 1800 may enable a user to see a snapshot of all data (e.g., structure, correlation heatmap, related trials, trial result data, external references (aliases, synonyms, etc.)) related to a particular candidate drug compound. The user interface 1800 may be presented when a user selects a sequence in the solution space 1506 presented in FIG. 15. - The
user interface 1800 includes two-dimensional 1804 and three-dimensional 1802 energy correlations. The energy correlations may correspond to energy functions associated with each position in a domain. A given energy correlation represents a correlation between each position of a protein in relation to all the other positions in the protein. The energy correlation may represent indications (e.g., color coded sections) pertaining to stability as the stability affects a specific function. An amino acid in context with the adjacent amino acids may affect the local folding properties of the peptide. Energy correlation values are inversely related (although the degree of relation may vary) to the strength of a specific amino acid (or amino acid modification) at a specific position in a peptide chain for a peptide designed for a specific function. -
FIG. 19 illustrates example operations of a method 1900 for generating a design space for a peptide for an application according to certain embodiments of this disclosure. Method 1900 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102, server 128 executing the artificial intelligence engine 140, etc.). In some embodiments, one or more operations of the method 1900 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 1900 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 1900 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 1902, the processing device may generate a design space for a peptide for an application. The application may include functional biomaterials (e.g., adhesives, sealants, binders, chelates, diagnostic reporters, or some combination thereof), structural biomaterials (e.g., biopolymers, encapsulation films, flocculants, desiccants, or some combination thereof), or at least one of the following: anti-infective, anti-cancer, antimicrobial, antiviral, anti-fungal, anti-inflammatory, anti-cholinergic, anti-dopaminergic, anti-serotonergic, anti-noradrenergic, and anti-prionic. The processing device may generate the design space by (i) identifying 1904 a set of sequences for the peptide, and (ii) updating 1906 the set of sequences by determining, for each of the set of sequences, a respective set of activities (e.g., immunomodulatory activity, receptor binding activity, self-aggregation, cell-penetrating activity, anti-viral activity, peptidergic activity, cell-permeating, or the like) pertaining to the application. Updating the set of sequences may produce an updated set of sequences, wherein each sequence in the updated set has an updated respective set of activities. - At
block 1908, the processing device may generate, based on the updated set of sequences each having the updated respective set of activities, a solution space within the design space. The solution space may include a target subset of the updated set of sequences, wherein each sequence in the target subset has the updated respective set of activities. - In some embodiments, the processing device may receive a query parameter selected, generated, or transmitted from a user interface presented on the
computing device 102. The processing device may use the query parameter to generate the solution space. For example, using a machine learning model trained to measure, based on the query parameter, a level of the updated respective set of activities, the processing device may generate the solution space within the design space. One or more query parameters may be selected as constraints to be used to generate the solution space. Essentially, the query parameters may be used to create bounds of the solution space within the design space. The query parameters may be selected, generated, or transmitted from a user interface presented on the computing device 102 and transmitted to the artificial intelligence engine 140. Based on the query parameters, the artificial intelligence engine 140 may use one or more machine learning models to generate the solution space within the design space. - The query parameter may include sequence parameters pertaining to biomedically-related ontological relations, terms, characteristics, descriptors, or the like or non-biomedically-related ontological relations, terms, characteristics, descriptors, or the like. For example, the biomedical ontology terms may include indications, genes, symptoms, alanine properties, etc. The non-biomedical ontology terms may include physical descriptors and characteristics, such as interactions (e.g., adhesive), folding properties (e.g., aggregating versus loose), wave properties (e.g., fluorescent, luminescent, iridescent), stability of modification (e.g., glycopeptides, lipid peptides, chelates, lasso peptides), etc.
- In some embodiments, in addition to the query parameter, the processing device may receive a desired threshold level of a target activity for the query parameter, with such threshold level configured such that the target subset of sequences must exceed the threshold level in order to be included in the solution space. The desired threshold level may be any suitable value, percentage, measurement, quantity, etc. For example, a user may select a number of alanines (e.g., 5) as the query parameter and specify the desired threshold level of a target activity (e.g., immunomodulatory activity). Accordingly, the processing device may return a target subset of sequences having 5 alanines that exceed the desired threshold level of immunomodulatory activity.
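The alanine example above can be sketched as a filter over a set of scored sequences. The sequences and activity values are invented for illustration; in practice the activity levels would come from the trained machine learning models:

```python
def filter_target_subset(sequences, n_alanines, activity, threshold):
    """Return sequences with exactly n_alanines 'A' residues whose named
    activity exceeds the desired threshold level."""
    return [
        entry["sequence"] for entry in sequences
        if entry["sequence"].count("A") == n_alanines
        and entry["activities"][activity] > threshold
    ]

# Hypothetical scored sequences in the design space.
scored = [
    {"sequence": "AAAAAGK", "activities": {"immunomodulatory": 0.8}},
    {"sequence": "AAAGK",   "activities": {"immunomodulatory": 0.9}},  # only 3 alanines
    {"sequence": "AAAAAWK", "activities": {"immunomodulatory": 0.2}},  # below threshold
]
```

Querying for 5 alanines with an immunomodulatory threshold of 0.5 keeps only the first sequence, mirroring the target-subset behavior described above.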
- In some embodiments, the processing device may perform dimension reduction to identify the target subset. Said reduction may be performed via a machine learning model using the query parameter and the updated set of sequences, using an algorithm such as uniform manifold approximation and projection (UMAP). UMAP, a nonlinear dimensionality reduction technique, may scale well on sparse data. A UMAP-based technique may use a Riemannian manifold, which refers to a real, smooth manifold M equipped with a positive-definite inner product gp on the tangent space TpM at each point p. The family gp of inner products is called a Riemannian metric. A Riemannian metric enables defining several geometric notions on the Riemannian manifold, such as an angle at an intersection, length of a curve, area of a surface and higher-dimensional analogues (e.g., volume, etc.), extrinsic curvature of sub-manifolds, and intrinsic curvature of the manifold itself. UMAP may assume that data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately locally constant.
- The UMAP-based technique may involve certain initial assumptions such as: (i) there exists a manifold on which the data (e.g., candidate drug compounds) would be uniformly distributed; (ii) the underlying manifold of interest is locally connected; or (iii) preserving the topological structure of this manifold is the primary goal. Based on the assumptions, the UMAP-based technique may construct a graph by: (i) constructing a weighted k-neighbor graph; (ii) applying some transform on the edges to local distances; and (iii) dealing with the inherent asymmetry of the k-neighbor graph. The UMAP-based technique may perform graph layout procedures including: (i) defining an objective function that preserves desired characteristics of this k-neighbor graph; and (ii) finding a low-dimensional representation which optimizes this objective function.
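The first graph-construction step, building a weighted k-neighbor graph and then resolving its inherent asymmetry, can be sketched in plain Python. This is a toy stand-in, not UMAP itself: the exponential edge weighting and the probabilistic union w1 + w2 - w1*w2 (one common fuzzy-union symmetrization) are illustrative assumptions:

```python
import math

def knn_graph(points, k):
    """Directed k-nearest-neighbor graph; edge weights in (0, 1] decay
    exponentially with Euclidean distance."""
    edges = {}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            edges[(i, j)] = math.exp(-d)  # closer neighbors get weight near 1
    return edges

def symmetrize(edges):
    """Fuzzy-union symmetrization w = w1 + w2 - w1*w2, dealing with the
    asymmetry of the directed k-neighbor graph."""
    out = {}
    for (i, j), w1 in edges.items():
        if (j, i) in out:
            continue  # this pair was already merged from the other direction
        w2 = edges.get((j, i), 0.0)
        out[(i, j)] = w1 + w2 - w1 * w2
    return out
```

A real implementation would then optimize a low-dimensional layout that preserves this graph, which is the second stage described above.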
- In some embodiments, one or more other techniques may be used, such as linear decomposition, principal component analysis (PCA), kernel PCA, matrix factorization, generalized discriminant analysis, linear discriminant analysis, autoencoding, or some combination thereof.
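Of the listed alternatives, PCA is the simplest to sketch. The following minimal NumPy implementation (via SVD, with invented toy data) is illustrative of linear dimension reduction generally, not the disclosed implementation:

```python
import numpy as np

def pca_reduce(features, n_components=2):
    """Project (n_samples, n_features) data onto its top principal components."""
    centered = features - features.mean(axis=0)           # center each feature
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T                 # low-dimensional embedding

# Toy data: four samples described by three illustrative sequence features.
X = np.array([[2.0, 0.1, 1.0],
              [4.0, 0.2, 1.1],
              [6.0, 0.1, 0.9],
              [8.0, 0.2, 1.0]])
embedding = pca_reduce(X, n_components=2)
```

The first embedding axis captures the most variance, so sequences that differ most strongly spread out along it.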
- In some embodiments, the processing device may receive a selection of a sequence from the target subset of sequences in the solution space. The selection may be made using a graphical element of a user interface presented on the
computing device 102, and the selection may be transmitted from the computing device 102 to the artificial intelligence engine 140. In response to receiving the selection of the sequence, the processing device may provide information pertaining to the sequence for presentation in a user interface on the computing device 102. The information may include at least classes of proteins, protein-to-protein interactions, protein-ligand interactions, protein homology and phylogeny, sequence and structure motifs, chemical and physical stability measures, pharmacological associations, systems biology attributes, protein folding descriptors or constraints, or some combination thereof. - At
block 1910, the processing device, using a machine learning model 132 to process the solution space, may perform one or more trials. The one or more trials are configured to identify a candidate drug compound that represents a sequence having at least one level of activity that exceeds one or more threshold levels. The one or more threshold levels may be predetermined or configured by a user (e.g., peptide designer). For example, the one or more threshold levels may be a value, percentage, amount, etc. that the candidate drug compound exhibits with respect to antiviral activity. - At
block 1912, the processing device may transmit information describing the candidate drug compound to a computing device 102. The computing device 102 may be operated by a drug candidate designer (e.g., protein, peptide, etc.) interested in sequences that exhibit certain activity for an application. The computing device 102 may also be operated by a business user interested in sequences that have certain target product profiles (e.g., pertaining to manufacturing, pharmacology, etc.). - In some embodiments, the processing device may provide the solution space to the
computing device 102 for presentation as a topographical map in a user interface of the computing device 102. The topographical map may include a set of indications that, for a sequence, each represent a level of activity at a given point on the topographical map. FIGS. 8A-8C depict examples of topographical heatmaps that may be presented on the user interface of the computing device 102. As depicted, FIG. 8A illustrates a view 800 including antimicrobial activity, FIG. 8B illustrates a view 802 including immunomodulatory activity, and FIG. 8C illustrates a view 804 including cytotoxic activity. Each view presents a topographical heatmap where one axis is for sequence parameter y and the other axis is for sequence parameter x. Each view includes an indicator (e.g., color code) ranging from a least active property to a most active property. Further, each view includes an optimized sequence 806 for a selected candidate drug compound classified by the classifier (machine learning model 132). These views may be presented to the user on a computing device 102. Further, an optimized sequence may be selected, generated or transmitted in or via the user interface using a graphical element (e.g., button, mouse cursor, etc.). The selected sequence may cause another user interface (e.g., candidate dashboard in FIG. 18) that provides additional information pertaining to the sequence to be presented. In some embodiments, selecting the sequence may cause the sequence to be formulated, generated, created, manufactured, developed, or tested. -
FIG. 20 illustrates example operations of a method 2000 for comparing performance metrics of machine learning models according to certain embodiments of this disclosure. Method 2000 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102, server 128 executing the artificial intelligence engine 140, etc.). In some embodiments, one or more operations of the method 2000 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 2000 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 2000 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 2002, the processing device may determine one or more metrics of the machine learning model that performs one or more trials. The one or more metrics may include memory usage, graphic processing unit temperature, power usage, processor usage, central processing unit usage, or some combination thereof. FIG. 17 presents examples of the one or more metrics used to analyze the machine learning model that performs the one or more trials. - At
block 2004, the processing device compares the one or more metrics to one or more second metrics of a second machine learning model that performs the one or more trials. The comparison may reveal which of the machine learning model and the second machine learning model performs better. For example, the machine learning model may perform the same trials but consume fewer processor resources or memory resources. Accordingly, the machine learning model may be used to subsequently perform those trials, and the second machine learning model may be pruned from being selected or may be tuned (e.g., by adjusting weights, biases, levels of hidden nodes, etc.) to improve its metrics. As a result, the disclosed techniques provide a technical benefit of enabling the continuous or continual monitoring of the performance of the machine learning models and, preferably, further optimizing which machine learning models perform trials to improve metrics (e.g., processor usage, power usage, graphic processing unit temperature, etc.). -
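One way to realize such a metric-based comparison is sketched below. The metric names, the illustrative values, and the simple lower-is-better vote are assumptions for illustration, not the disclosed selection logic.

```python
# Per-trial resource metrics for two models (illustrative values);
# for each metric, lower is better.
model_a = {"memory_mb": 512, "gpu_temp_c": 71, "power_w": 180, "cpu_pct": 35}
model_b = {"memory_mb": 768, "gpu_temp_c": 83, "power_w": 210, "cpu_pct": 35}

def wins(m1, m2):
    # Number of metrics on which m1 consumes strictly less than m2.
    return sum(m1[k] < m2[k] for k in m1)

# Keep the model that wins on more metrics for subsequent trials; the
# other may be pruned from selection or tuned before being retried.
if wins(model_a, model_b) >= wins(model_b, model_a):
    selected, pruned = "model_a", "model_b"
else:
    selected, pruned = "model_b", "model_a"
```

A production comparison would likely weight the metrics (e.g., trading power draw against memory footprint) rather than counting simple wins.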
FIG. 21 illustrates example operations of a method 2100 for presenting a design space and a solution space within a graphical user interface of a therapeutics tool according to certain embodiments of this disclosure. Method 2100 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102, server 128 executing the artificial intelligence engine 140, etc.). In some embodiments, one or more operations of the method 2100 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 2100 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 2100 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 2102, the processing device may present, in a first screen of a graphical user interface (GUI) of a therapeutic tool, a design space for a protein for an application. In some embodiments, the therapeutic tool is a peptide therapeutic design tool, a peptide business intelligence tool, or both. In some embodiments, the protein is a peptide. The design space may include a set of sequences each containing a respective set of activities pertaining to the application. As described herein, the design space may be generated based on a knowledge graph pertaining to peptides. The design space may be presented as a two-dimensional (2D) elevation map, a three-dimensional (3D) shape, or an n-dimensional (nD) mathematical representation. - At
block 2104, the processing device may receive, via a graphical element (e.g., button, input box, radio button, dropdown list, slider, etc.) in the first screen, a selection of one or more query parameters of the design space. The one or more query parameters may include a sequence parameter pertaining to biomedical ontology terms or non-biomedical ontology terms. The biomedically-related ontological relations, terms, characteristics, descriptors, etc. may pertain to indications, function (e.g., catalyze a chemical reaction (e.g., enzyme) or control a structure of water (antifreeze proteins)), activity (e.g., anti-viral, anti-microbial, anti-cancer, anti-fungal, anti-prionic, etc.), genes, symptoms, or some combination thereof. The non-biomedically-related ontological relations, terms, characteristics, descriptors, etc. may pertain to physical characteristics, descriptors, or some combination thereof. Example physical characteristics and descriptors may include information pertaining to interactions (e.g., adhesive properties), folding properties (e.g., aggregating versus loose), wave properties (e.g., fluorescent, luminescent, iridescent, etc.), measures of stability of modification (e.g., with respect to glycopeptides, lipid peptides, chelates, lasso peptides, etc.), and the like. - At block 2106, the processing device may present, in a second screen of the GUI, a solution space that includes a subset of the set of sequences, each sequence containing the respective set of activities. The subset of the set of sequences is selected based on the one or more query parameters. In some embodiments, the solution space may be generated within the design space by one or more
machine learning models 132 trained to measure, based on the one or more query parameters, a respective level of one or more of the respective set of activities of each of the set of sequences in the subset of sequences. The query parameters essentially create the bounds of the solution space within the design space. Generating the solution space may include grouping or binning, based on the query parameter, sequences as possible or not possible. "Possible," as used herein, means constructible in reality, economically feasible, chemically feasible, biologically feasible, or otherwise reasonably feasible. "Not possible," as used herein, means not able to be constructed in reality, economically infeasible, chemically infeasible, biologically infeasible, or otherwise reasonably infeasible. In some embodiments, the machine learning model 132 may be a variational autoencoder, as described herein. In some embodiments, the machine learning model 132 may be any suitable machine learning model capable of performing decomposition methods. - In some embodiments, the solution space is presented as a topographical map in the GUI. The topographical map may include a set of indications, wherein each set of indications represents a level of activity for a sequence associated with a given point on the topographical map. In some embodiments, the second screen may include a first portion presenting one or more clusters (e.g., color-coded) representing the subset of the set of sequences. As shown in
FIG. 15, the first portion may depict how, in a network, the clusters are organized and interact with each other. - In some embodiments, the one or more color-coded clusters may represent, using an energy correlation, each sequence in the subset. The energy correlation may include a correlation between each position of each sequence in the subset and other positions of other sequences in the subset. The term "energy correlation" may refer to stability as it affects a specific function of the subset of sequences, or it may also refer to, e.g., a strength of an amino acid in a sequence relative to a strength of another amino acid at a different position in the sequence. For example, an amino acid in context with an adjacent amino acid affects the local folding properties of a peptide. Energy correlation values are, to some degree, inversely related to a strength of a specific amino acid (or amino acid modification), where the amino acid is located at a specific position in the peptide chain.
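The "possible"/"not possible" binning described above, bounded by the query parameters, can be sketched as a simple feasibility filter. The property names, values, and thresholds below are hypothetical stand-ins for the model-derived measurements.

```python
# Candidate sequences with model-estimated properties (illustrative values).
sequences = [
    {"id": "seq1", "activity": 0.91, "synthesis_cost": 1.2, "stable": True},
    {"id": "seq2", "activity": 0.40, "synthesis_cost": 0.8, "stable": True},
    {"id": "seq3", "activity": 0.87, "synthesis_cost": 9.5, "stable": False},
]

def is_possible(seq, min_activity=0.5, max_cost=5.0):
    # "Possible": constructible and economically, chemically, and
    # biologically feasible under the query-parameter bounds.
    return (seq["activity"] >= min_activity
            and seq["synthesis_cost"] <= max_cost
            and seq["stable"])

# The query parameters bound the solution space within the design space:
# only sequences binned as "possible" remain.
solution_space = [s["id"] for s in sequences if is_possible(s)]
```

Here only one of the three candidates survives both the activity bound and the feasibility checks, mirroring how the query parameters carve a solution space out of the larger design space.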
- Thus, the first portion visually represents high-level general information pertaining to the set of sequences in the solution space. The visual representation of the solution space may provide an enhanced user interface to a protein designer. For example, by visually depicting the interactions of the clusters representing the set of sequences in a network, a protein designer may be provided with a vast amount of information cognitively understandable by a user in a single user interface without the user's having to view numerous user interfaces to perform additional queries as to how sequences interact with other sequences in a network.
- The second screen may include a second portion presenting data pertaining to the subset of the set of sequences represented by the one or more clusters. The data presented in the second portion may be more granular and detailed than the data in the clusters presented in the first portion of the second screen. The second portion may include a legend and various windows, including detailed data, as described above with reference to
FIG. 15. The detailed data may enable a protein designer to drill down to understand very specific information about the clusters presented in the solution space. The specific information may pertain to polo-box domains (PBD), binding sites, interactions, network, associations, biological functions, and the like. The detailed data may describe one or more objects associated with the subset of the set of sequences. The one or more objects may include a candidate drug compound, an activity, a drug, a gene, a pathway, a physical descriptor, an interaction (e.g., adhesive, etc.), a folding property (e.g., aggregating versus loose), a wave property (e.g., fluorescent, luminescent, iridescent, etc.), a stability of modification (e.g., glycopeptides, lipid peptides, chelates, lasso peptides, etc.), or some combination thereof. - In some embodiments, the processing device may receive, using a graphical element (e.g., button, mouse cursor, input box, dropdown list, slider, radio button, etc.) of the second screen, a selection of a sequence from the subset of the set of sequences. The selection may be based on the sequence being previously untraversed. To that end, the processing device may store each sequence included in the subset presented in the solution space and may track whether the sequence has been generated or traversed before. The processing device may store an indicator (e.g., flag) with each sequence in the
database 150, and the indicator may represent whether the respective sequence has been traversed or remains untraversed. In some embodiments, a traversed sequence may be presented in a first manner (e.g., with a particular color) while an untraversed sequence may be presented in a second manner (e.g., with a different color than the first manner). In some embodiments, the second screen may provide a graphical element that enables filtering to view only traversed or, alternatively, only untraversed sequences. Responsive to the selection of the sequence, the processing device may present, in the second screen, additional information pertaining to the sequence. The additional information may include a candidate drug compound, an interaction, an activity, a drug, a gene, a pathway, or some combination thereof. - In some embodiments, the processing device may receive, using a graphical element of the second screen, a selection of a sequence from the subset of the set of sequences. The processing device may present, in a third screen, a candidate dashboard (e.g., the candidate dashboard screen of
FIG. 18) including information pertaining to the selected sequence. The information may pertain to a structure of the sequence, a correlation heatmap, experimental data, a list of probabilistic scores generated by one or more inference models, external data related to the sequence (e.g., all related external data to a specific peptide, such as database IDs, aliases, synonyms, etc.), or some combination thereof. In some embodiments, the list of probabilistic scores may be represented as violin plots detailing a success probability of the sequence in a specific function (e.g., activity such as anti-viral, anti-microbial, anti-fungal, anti-prionic, etc.) across a set of conditions (e.g., query parameters). - In some embodiments, the processing device may receive, in the GUI, one or more parameters pertaining to one or more
machine learning models 132 of the artificial intelligence engine 140. The one or more parameters may refer to hyperparameters and may pertain to one or more constraints (e.g., epochs, batch sizes, attention, processor usage, memory usage, execution time, etc.) for the one or more machine learning models to implement when using the solution space to perform one or more trials. - In some embodiments, the processing device may receive, using a graphical element of the second screen, a selection of a sequence from the subset of the set of sequences. The processing device may cause the sequence to be manufactured, synthesized, or produced.
-
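The traversed/untraversed bookkeeping described above can be sketched with a per-sequence flag; the record layout below is an assumption for illustration, not the disclosed schema of database 150.

```python
# One record per sequence in the solution space, each carrying a
# traversal flag (stored alongside the sequence, e.g., in a database).
records = {
    "seq1": {"traversed": True},
    "seq2": {"traversed": False},
    "seq3": {"traversed": False},
}

def filter_by_traversal(recs, traversed):
    # Sequence IDs matching the requested traversal state, e.g., for a
    # GUI filter that colors traversed and untraversed sequences differently.
    return sorted(seq_id for seq_id, rec in recs.items()
                  if rec["traversed"] == traversed)

shown_traversed = filter_by_traversal(records, True)
shown_untraversed = filter_by_traversal(records, False)
```

Flipping a record's flag once the model (or the designer) visits a sequence is all that is needed to keep the two filtered views current.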
FIG. 22 illustrates example operations of a method 2200 for receiving and presenting one or more results of performing a selected trial using a machine learning model according to certain embodiments of this disclosure. Method 2200 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102, server 128 executing the artificial intelligence engine 140, etc.). In some embodiments, one or more operations of the method 2200 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 2200 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 2200 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 2202, the processing device may receive a selection of a trial configured to be performed by a machine learning model 132. The machine learning model may use the solution space generated as described with reference to FIG. 23. The trial may include traversing the solution space according to a specific route, a random route, or a combination of a specific route and a random route. The traversal may result in points having different activities in the solution space. The points may represent a sequence and may be referred to as a candidate drug compound herein. The traversal may specify a particular location of a point as a starting point or a particular location of a destination point. The traversal may or may not specify the route to traverse to get from the starting point to the destination point. In some embodiments, the traversal may just specify a starting point or a destination point, and the machine learning model 132 may randomly traverse the solution space to generate different sequences having different activities. While traversing the surface of the solution space, the one or more machine learning models 132 may be trained to perform maximization functions or minimization functions. For example, the machine learning model may measure the level of activity at some or all of the points on the surface of the solution space and perform a maximization function by traversing the points having the maximum level of activity relative to other points in proximity. In some embodiments, the machine learning model may measure the level of activity at some or all of the points on the surface of the solution space and perform a minimization function by traversing the points having the minimum level of activity relative to other proximate points. In some embodiments, the machine learning model may be trained to perform a combination of minimization and maximization functions while performing the traversals. - The selection of the trial may be transmitted to the
artificial intelligence engine 140. The artificial intelligence engine 140 may use the one or more machine learning models 132 to perform the selected trial using the solution space. At block 2204, the processing device of the computing device 102 may receive, from the artificial intelligence engine 140, one or more results of performing the trial. The one or more results may (i) provide a location of a point reached in the solution space after performing a traversal of the solution space defined by the trial, or (ii) provide a metric of one or more of the machine learning models 132 used by the artificial intelligence engine 140 to perform the trial. The metric may pertain to the process graphic processing unit (GPU) usage (%), the process GPU power usage (%), the process GPU memory allocated (%), the process GPU time spent accessing memory (%), and the process GPU temperature (degrees, e.g., Celsius) (as shown in FIG. 17). The one or more results may be presented on a user interface of the computing device 102. The one or more results may be compared to select the one or more machine learning models that reached or came closest to a desired point in the solution space, took a desired route (or as close to the desired route as possible) during traversal to the point, generated a desired sequence having desired activity levels, consumed the least or a lesser amount of processor resources, generated the lowest or a lower temperature for the graphic processing unit, consumed the least or a lesser amount of memory resources, or some combination thereof. The machine learning models not selected may be subsequently tuned to attempt to improve their results when subsequently performing the same or different trials. -
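The maximization-style traversal described above can be sketched as a greedy walk that repeatedly steps to the most active neighboring point until no neighbor improves on the current one. The grid of activity levels and the starting point are illustrative assumptions.

```python
# Activity levels at points of a small solution-space grid (illustrative).
grid = [
    [0.1, 0.2, 0.3],
    [0.2, 0.5, 0.6],
    [0.3, 0.4, 0.9],
]

def maximize(grid, start):
    # Greedy traversal: from the starting point, step to the neighbor with
    # the highest activity until no neighbor improves the current point.
    r, c = start
    route = [start]
    while True:
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)
                     and 0 <= r + dr < len(grid)
                     and 0 <= c + dc < len(grid[0])]
        best = max(neighbors, key=lambda p: grid[p[0]][p[1]])
        if grid[best[0]][best[1]] <= grid[r][c]:
            return route  # local maximum: a candidate drug compound
        r, c = best
        route.append(best)

route = maximize(grid, (0, 0))
```

A minimization function would simply flip the comparison; a combined trial could alternate between the two while recording each point visited as the traversal's route.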
FIG. 23 illustrates example operations of a method 2300 for using a business intelligence screen to select a desired target product profile for sequences according to certain embodiments of this disclosure. Method 2300 includes operations performed by processors of a computing device (e.g., any component of FIG. 1, such as computing device 102, server 128 executing the artificial intelligence engine 140, etc.). In some embodiments, one or more operations of the method 2300 are implemented in computer instructions stored on a memory device and executed by a processing device. The method 2300 may be performed in the same or a similar manner as described above in regard to method 400. The operations of the method 2300 may be performed in some combination with any of the operations of any of the methods described herein. - At
block 2302, the processing device may receive, from a graphical element of a business intelligence screen of the graphical user interface (GUI), a target product profile. The target product profile may include pharmacology data, pharmacokinetic data, activity data, manufacturing data (e.g., cost to manufacture, requirements for manufacturing, etc.), compliance data, clinical trial data, or some combination thereof. The target product profile may be transmitted to the artificial intelligence engine 140. The artificial intelligence engine 140 may execute one or more machine learning models 132 trained to generate or search for sequences that match the target product profile to within a certain threshold level (e.g., percentage, partial, exact, etc.). - At
block 2304, the processing device may receive, from the artificial intelligence engine 140, a second subset of the set of sequences. The second subset of the set of sequences may be selected based on the target product profile. - At
block 2306, the processing device may present, in the GUI, the second subset of the set of sequences. The GUI may include one or more graphical elements that enable the user to drill down to view detailed data pertaining to one or more of the sequences matching (partially or exactly) the target product profile. The GUI may include a graphical element that enables selecting one or more sequences to manufacture, produce, synthesize, or the like. -
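Matching sequences against a target product profile to within a threshold can be sketched as a field-by-field score. The profile fields, the candidate values, and the 75% threshold are illustrative assumptions, not the disclosed matching criteria.

```python
# Desired target product profile (field names and values are illustrative).
target = {"oral_bioavailable": True, "half_life": "long",
          "activity": "anti-viral", "cost_tier": "low"}

candidates = {
    "seq1": {"oral_bioavailable": True, "half_life": "long",
             "activity": "anti-viral", "cost_tier": "high"},
    "seq2": {"oral_bioavailable": False, "half_life": "short",
             "activity": "anti-fungal", "cost_tier": "low"},
}

def match_fraction(profile, target):
    # Fraction of target-profile fields the candidate matches exactly.
    return sum(profile[k] == target[k] for k in target) / len(target)

# Keep candidates matching the target profile to within the threshold;
# these form the second subset presented in the GUI.
threshold = 0.75
second_subset = sorted(seq for seq, profile in candidates.items()
                       if match_fraction(profile, target) >= threshold)
```

Partial matches (e.g., three of four fields) can thus still surface in the GUI, with the drill-down views revealing which fields fell short of the profile.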
FIG. 24 illustrates example computer system 2400 which can perform any one or more of the methods described herein, in accordance with one or more aspects of the present disclosure. In one example, computer system 2400 may correspond to the computing device 102 (e.g., user computing device), one or more servers 128 of the computing system 116, the training engine 130, or any suitable component of FIG. 1. The computer system 2400 may be capable of executing application 118 or the one or more machine learning models 132 of FIG. 1. The computer system may be connected (e.g., networked) to other computer systems in a LAN, an intranet, an extranet, or the Internet. The computer system may operate in the capacity of a server in a client-server network environment. The computer system may be a personal computer (PC), a tablet computer, a wearable (e.g., wristband), a set-top box (STB), a personal digital assistant (PDA), a mobile phone, a camera, a video camera, or any device capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that device. Further, while only a single computer system is illustrated, the term "computer" shall also be taken to include any collection of computers that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein. - The
computer system 2400 includes a processing device 2402, a volatile memory 2404 (e.g., random access memory (RAM)), a non-volatile memory 2406 (e.g., read-only memory (ROM), flash memory, solid state drives (SSDs)), and a data storage device 1108, which communicate with each other via a bus 2410. -
Processing device 2402 represents one or more general-purpose processing devices such as a microprocessor, central processing unit, or the like. More particularly, the processing device 2402 may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or a processor implementing other instruction sets or processors implementing a combination of instruction sets. The processing device 2402 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a system on a chip, a field programmable gate array (FPGA), a digital signal processor (DSP), a network processor, or the like. The processing device 2402 may include more than one processing device, and each of the processing devices may be of the same or different types. The processing device 2402 may include or be communicatively coupled to one or more accelerators 2403 configured to offload various data-processing tasks from the processing device 2402. The processing device 2402 is configured to execute instructions for performing any of the operations and steps discussed herein. - The
computer system 2400 may further include a network interface device 2412. The network interface device 2412 may be configured to communicate data via any suitable communication protocol. In some embodiments, the network interface device 2412 may enable wireless (e.g., WiFi, Bluetooth, ZigBee, etc.) or wired (e.g., Ethernet, etc.) communications. The computer system 2400 also may include a video display 2414 (e.g., a liquid crystal display (LCD), a light-emitting diode (LED), an organic light-emitting diode (OLED), a quantum LED, a cathode ray tube (CRT), a shadow mask CRT, an aperture grille CRT, or a monochrome CRT), one or more input devices 2416 (e.g., a keyboard or a mouse), and one or more speakers 2418 (e.g., a speaker). In one illustrative example, the video display 2414 and the input device(s) 2416 may be combined into a single component or device (e.g., an LCD touch screen). - The
data storage device 2416 may include a computer-readable medium 2420 on which the instructions 2422 embodying any one or more of the methods, operations, or functions described herein are stored. The instructions 2422 may also reside, completely or at least partially, within the main memory 2404 or within the processing device 2402 during execution thereof by the computer system 2400. As such, the main memory 2404 and the processing device 2402 also constitute computer-readable media. The instructions 2422 may further be transmitted or received over a network via the network interface device 2412. - While the computer-
readable storage medium 2420 is shown in the illustrative examples to be a single medium, the term "computer-readable storage medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that store the one or more sets of instructions. The term "computer-readable storage medium" shall also be taken to include any medium capable of storing, encoding, or carrying a set of instructions for execution by the machine, where such set of instructions causes the machine to perform any one or more of the methodologies of the present disclosure. The term "computer-readable storage medium" shall accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media. - None of the description in this application should be read as implying that any particular element, step, or function is an essential element that must be included in the claim scope. The scope of patented subject matter is defined only by the claims. Moreover, none of the claims is intended to invoke 35 U.S.C. § 112(f) unless the exact words "means for" are followed by a participle.
- Consistent with the above disclosure, the examples of systems and methods enumerated in the following clauses are specifically contemplated and are intended as a non-limiting set of examples.
-
Clause 1. A method comprising: - generating a design space for a peptide for an application, wherein the generating comprises:
- identifying a plurality of sequences for the peptide; and
- updating the plurality of sequences by determining, for each of the plurality of sequences, a respective plurality of activities pertaining to the application, wherein the updating produces an updated plurality of sequences each having an updated respective plurality of activities;
- generating, based on the updated plurality of sequences each having the updated respective plurality of activities, a solution space within the design space, wherein the solution space comprises a target subset of the updated plurality of sequences each having the updated respective plurality of activities;
- performing, using a machine learning model to process the solution space, one or more trials to identify a candidate drug compound that represents a sequence having at least one level of activity that exceeds one or more threshold levels; and
- transmitting information describing the candidate drug compound to a computing device.
-
Clause 2. The method of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter. -
Clause 3. The method of any preceding clause, further comprising: - receiving the query parameter; and
- generating, based on the query parameter and the updated plurality of sequences each having the updated respective plurality of activities, the solution space within the design space, wherein the solution space comprises the target subset of the plurality of sets of the updated plurality of sequences, and each sequence of the updated plurality of sequences in the target subset comprises the updated respective plurality of activities that are modified in view of the query parameter.
-
Clause 4. The method of any preceding clause, wherein the generating the solution space within the design space further comprises performing, using the query parameter and the updated plurality of sequences each having the updated respective plurality of activities, uniform manifold approximation and projection (UMAP) for dimension reduction to identify the target subset. -
Clause 5. The method of any preceding clause, wherein the receiving the query parameter further comprises receiving the query parameter from a graphical element of a user interface presenting the design space. -
Clause 6. The method of any preceding clause, further comprising: - receiving the query parameter and a desired threshold level of a target activity for the query parameter that the target subset is to exceed in order to be included in the solution space.
- Clause 7. The method of any preceding clause, wherein the application comprises at least one of:
- anti-infective,
- anti-cancer,
- antimicrobial,
- anti-viral,
- anti-fungal,
- anti-inflammatory,
- anti-cholinergic,
- anti-dopaminergic,
- anti-serotonergic,
- anti-noradrenergic,
- anti-prionic,
- functional biomaterials comprising adhesives, sealants, binders, chelates, diagnostic reporters, or some combination thereof, and
- structural biomaterials comprising biopolymers, encapsulation films, flocculants, desiccants, or some combination thereof.
-
Clause 8. The method of any preceding clause, further comprising: - receiving a selection of a sequence from the target subset; and
- providing information pertaining to the sequence, wherein the information comprises at least classes of:
- proteins,
- protein-to-protein interactions,
- protein-ligand interactions,
- protein homology and phylogeny,
- sequence and structure motifs,
- chemical and physical stability,
- pharmacological associations,
- systems biology,
- protein folding, or
- some combination thereof.
- Clause 9. The method of any preceding clause, further comprising:
- providing the solution space to the computing device for presentation as a topographical map in a user interface of the computing device, wherein the topographical map comprises a plurality of indications that each represent a level of activity for a sequence at a given point on the topographical map.
- Clause 10. The method of any preceding clause, further comprising causing the candidate drug compound to be manufactured.
- Clause 11. The method of any preceding clause, wherein the updated respective plurality of activities comprises immunomodulatory activity, receptor binding activity, self-aggregation, cell-penetrating activity, anti-viral activity, peptidergic activity, or some combination thereof.
- Clause 12. The method of any preceding clause, further comprising:
- determining one or more metrics of the machine learning model that performs the one or more trials, wherein the one or more metrics comprise memory usage, graphic processing unit temperature, power usage, processor usage, central processing unit temperature, or some combination thereof; and
- comparing the one or more metrics to one or more second metrics of a second machine learning model that performs the one or more trials.
-
Clause 13. A tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to: - generate a design space for a peptide for an application, wherein the generating comprises:
- identifying a plurality of sequences for the peptide; and
- updating the plurality of sequences by determining, for each of the plurality of sequences, a respective plurality of activities pertaining to the application, wherein the updating produces an updated plurality of sequences each having an updated respective plurality of activities;
- generate, based on the updated plurality of sequences each having the updated respective plurality of activities, a solution space within the design space, wherein the solution space comprises a target subset of the updated plurality of sequences each having the updated respective plurality of activities;
- perform, using a machine learning model to process the solution space, one or more trials to identify a candidate drug compound that represents a sequence having at least one level of activity that exceeds one or more threshold levels; and
- transmit information describing the candidate drug compound to a computing device.
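The design-space, solution-space, and trial steps of Clause 13 can be sketched end to end. The peptide sequences, the lysine-fraction scoring function, and the argmax "trial" below are hypothetical stand-ins for the trained machine learning models the clause contemplates.

```python
# Hedged sketch of the Clause 13 pipeline (hypothetical sequences and scores).

def generate_design_space(sequences, score_fn):
    # Step 1: annotate each sequence with its activities for the application.
    return {seq: score_fn(seq) for seq in sequences}

def generate_solution_space(design_space, thresholds):
    # Step 2: keep the target subset whose activities meet every threshold.
    return {seq: acts for seq, acts in design_space.items()
            if all(acts[k] >= v for k, v in thresholds.items())}

def perform_trials(solution_space, key="antimicrobial"):
    # Step 3: a stand-in "trial" that picks the sequence maximizing one activity.
    return max(solution_space, key=lambda s: solution_space[s][key])

sequences = ["GIGKFLHS", "KWKLFKKI", "FLPIIAKL"]
score = lambda s: {"antimicrobial": s.count("K") / len(s),
                   "toxicity": s.count("F") / len(s)}

design_space = generate_design_space(sequences, score)
solution_space = generate_solution_space(design_space, {"antimicrobial": 0.1})
candidate = perform_trials(solution_space)   # sequence with the highest score
```

The final step, transmitting information describing the candidate to a computing device, would serialize `candidate` and its activity record.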
- Clause 14. The computer-readable medium of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter.
- Clause 15. The computer-readable medium of any preceding clause, wherein the processing device is further to:
- receive the query parameter; and
- generate, based on the query parameter and the updated plurality of sequences each having the updated respective plurality of activities, the solution space within the design space, wherein the solution space comprises the target subset of the plurality of sets of the updated plurality of sequences, and each sequence of the updated plurality of sequences in the target subset comprises the updated respective plurality of activities that are modified in view of the query parameter.
- Clause 16. The computer-readable medium of any preceding clause, wherein the generating the solution space within the design space further comprises performing, using the query parameter and the updated plurality of sequences each having the updated respective plurality of activities, uniform manifold approximation and projection (UMAP) for dimension reduction to identify the target subset.
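The dimension-reduction step of Clause 16 can be sketched as projecting high-dimensional activity vectors to 2-D and selecting the sequences nearest a query point. To keep the example dependency-free, UMAP itself (e.g. `UMAP(n_components=2).fit_transform(X)` from the umap-learn package) is swapped for a plain PCA projection; the feature matrix, query point, and neighborhood size are hypothetical.

```python
import numpy as np

# Sketch of dimension-reduction-based subset identification (Clause 16),
# with PCA standing in for UMAP so the example needs only NumPy.

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))          # 50 sequences x 8 activity features

def project_2d(X):
    # Center the data and project onto the top two right-singular vectors.
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:2].T              # 2-D embedding

def target_subset(emb, query_point, k=5):
    # Keep the k sequences whose embeddings lie closest to the query parameter.
    d = np.linalg.norm(emb - query_point, axis=1)
    return np.argsort(d)[:k]

emb = project_2d(X)
subset = target_subset(emb, query_point=np.zeros(2), k=5)
```

UMAP would replace `project_2d` when preserving local neighborhood structure matters more than global variance.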
- Clause 17. The computer-readable medium of any preceding clause, wherein the receiving the query parameter further comprises receiving the query parameter from a graphical element of a user interface presenting the design space.
- Clause 18. The computer-readable medium of any preceding clause, wherein the processing device is further to:
- receive the query parameter and a desired threshold level of a target activity for the query parameter that the target subset is to exceed in order to be included in the solution space.
- Clause 19. A system comprising:
- a memory device storing instructions; and
- a processing device communicatively coupled to the memory device, the processing device executes the instructions to:
- generate a design space for a peptide for an application, wherein the generating comprises:
- identifying a plurality of sequences for the peptide; and
- updating the plurality of sequences by determining, for each of the plurality of sequences, a respective plurality of activities pertaining to the application, wherein the updating produces an updated plurality of sequences each having an updated respective plurality of activities;
- generate, based on the updated plurality of sequences each having the updated respective plurality of activities, a solution space within the design space, wherein the solution space comprises a target subset of the updated plurality of sequences each having the updated respective plurality of activities;
- perform, using a machine learning model to process the solution space, one or more trials to identify a candidate drug compound that represents a sequence having at least one level of activity that exceeds one or more threshold levels; and
- transmit information describing the candidate drug compound to a computing device.
- Clause 20. The system of any preceding clause, wherein the generating the solution space within the design space is performed by a second machine learning model trained to measure, based on a query parameter, a level of the updated respective plurality of activities, wherein the query parameter comprises a sequence parameter.
- Clause 21. A method for presenting, on a computing device, a graphical user interface (GUI) of a therapeutic tool, the method comprising:
- presenting, in a first screen of the GUI, a design space for a protein for an application, wherein the design space comprises a plurality of sequences each containing a respective plurality of activities pertaining to the application;
- receiving, via a graphical element in the first screen, a selection of one or more query parameters of the design space; and
- presenting, in a second screen of the GUI, a solution space that includes a subset of the plurality of sequences each containing the respective plurality of activities, wherein the subset of the plurality of sequences is selected based on the one or more query parameters.
- Clause 22. The method of any preceding clause, wherein the second screen comprises:
- a first portion presenting one or more color-coded clusters representing the subset of the plurality of sequences, and
- a second portion presenting data pertaining to the subset of the plurality of sequences represented by the one or more color-coded clusters, wherein the data describes one or more objects associated with the subset of the plurality of sequences, and the one or more objects comprise a candidate drug compound, an activity, an interaction, a drug, a gene, a pathway, a physical descriptor, a characteristic, a folding property, a wave property, a stability of modification, or some combination thereof.
- Clause 23. The method of any preceding clause, wherein the one or more color-coded clusters represent, using an energy correlation, each sequence in the subset, and the energy correlation comprises a correlation between each position of each sequence in the subset and other positions of other sequences in the subset.
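The position-to-position correlation underlying the "energy correlation" of Clause 23 can be sketched by encoding an aligned subset of sequences with a per-residue property and computing a Pearson correlation matrix over positions. The sequences and the property scale below are hypothetical (the scale is loosely modeled on a hydrophobicity index).

```python
import numpy as np

# Sketch of the position-position correlation of Clause 23 over a small,
# hypothetical aligned subset of sequences.

SCALE = {"A": 1.8, "K": -3.9, "L": 3.8, "F": 2.8,
         "G": -0.4, "W": -0.9, "I": 4.5}   # hypothetical per-residue property

aligned = ["AKLF", "GALW", "AKIF", "GAIW"]  # aligned, equal-length sequences
M = np.array([[SCALE[aa] for aa in seq] for seq in aligned])

corr = np.corrcoef(M.T)   # (positions x positions) correlation matrix
# corr[i, j] relates variation at position i to variation at position j;
# a cluster view could color sequences by the strongest correlated pairs.
```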
- Clause 24. The method of any preceding clause, wherein the solution space is presented as a topographical map in the GUI, wherein the topographical map comprises a plurality of indications that each represent a level of activity for a sequence associated with a given point on the topographical map.
- Clause 25. The method of any preceding clause, wherein the design space is generated based on a knowledge graph pertaining to peptides and the design space is presented as a two-dimensional (2D) elevation map, a three-dimensional (3D) shape or an n-dimensional (nD) mathematical representation.
- Clause 26. The method of any preceding clause, wherein the solution space is generated within the design space by one or more machine learning models trained to measure, based on the query parameter, a respective level of one or more of the respective plurality of activities of each of the plurality of sequences in the subset, wherein the query parameter comprises a sequence parameter.
- Clause 27. The method of any preceding clause, further comprising:
- receiving, using a graphical element of the second screen, a selection of a sequence from the subset of the plurality of sequences, wherein the selection is based on the sequence being previously untraversed; and
- responsive to the selection of the sequence, presenting, in the second screen, additional information pertaining to the sequence, wherein the additional information comprises a candidate drug compound, an interaction, an activity, a drug, a gene, a pathway, or some combination thereof.
- Clause 28. The method of any preceding clause, further comprising:
- receiving, using a graphical element of the second screen, a selection of a sequence from the subset of the plurality of sequences; and
- presenting, in a third screen of the GUI, a candidate dashboard comprising information pertaining to the sequence, wherein the information pertains to a structure of the sequence, a correlation heatmap, experimental data, a list of probabilistic scores generated by inference models, external data related to the sequence, or some combination thereof.
- Clause 29. The method of any preceding clause, further comprising:
- receiving a selection of a trial configured to be performed by a machine learning model, wherein the machine learning model uses the solution space; and
- receiving, from an artificial intelligence engine, one or more results of performing the trial, wherein the one or more results:
- provide a location of a point reached in the solution space after performing a traversal of the solution space defined by the trial, and
- provide a metric of a machine learning model used by the artificial intelligence engine to perform the trial, wherein the metric pertains to memory usage, graphic processing unit temperature, power usage, processor usage, central processing unit temperature, or some combination thereof.
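The trial-result payload recited in Clause 29, a point reached in the solution space plus runtime metrics of the model that performed the trial, can be sketched as a small record type. The field names and values are hypothetical.

```python
from dataclasses import dataclass, field

# Sketch of a Clause 29 trial result: the location reached in the solution
# space after the traversal, plus metrics of the model that ran the trial.

@dataclass
class TrialResult:
    trial_id: str
    location: tuple          # coordinates of the point reached in the solution space
    metrics: dict = field(default_factory=dict)  # e.g. memory usage, GPU temperature

result = TrialResult(
    trial_id="trial-007",                         # hypothetical identifier
    location=(12.4, -3.1),
    metrics={"memory_usage_mb": 2048, "gpu_temp_c": 68},
)
```

An artificial intelligence engine returning such records lets the GUI plot the traversal endpoint and surface the model's resource costs side by side.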
- Clause 30. The method of any preceding clause, further comprising:
- receiving, from a graphical element of a business intelligence screen of the GUI, a target product profile, wherein the target product profile comprises pharmacology data, pharmacokinetic data, pharmacodynamic data, activity data, manufacturing data, compliance data, clinical trial data, or some combination thereof;
- receiving, from an artificial intelligence engine, a second subset of the plurality of sequences, wherein the second subset of the plurality of sequences is selected based on the target product profile; and
- presenting, in the GUI, the second subset of the plurality of sequences.
- Clause 31. The method of any preceding clause, further comprising:
- receiving, in the GUI, one or more parameters pertaining to one or more machine learning models of an artificial intelligence engine, wherein the one or more parameters pertain to one or more constraints for the one or more machine learning models to implement when performing one or more trials using the solution space.
- Clause 32. The method of any preceding clause, wherein the therapeutic tool is a peptide therapeutic tool.
- Clause 33. The method of any preceding clause, wherein the protein is a peptide.
- Clause 34. The method of any preceding clause, wherein the one or more query parameters comprise a plurality of biomedical ontology terms, a plurality of non-biomedical ontology terms, or some combination thereof.
- Clause 35. The method of any preceding clause, wherein the plurality of biomedical ontology terms pertain to indications, genes, symptoms, or some combination thereof, and the plurality of non-biomedical ontology terms pertain to characteristics, descriptors, or some combination thereof.
- Clause 36. The method of any preceding clause, further comprising:
- receiving, using a graphical element of the second screen, a selection of a sequence from the subset of the plurality of sequences; and
- causing the sequence to be manufactured, synthesized, or produced.
- Clause 37. A tangible, non-transitory computer-readable medium storing instructions that, when executed, cause a processing device to:
- present, in a first screen of a graphical user interface (GUI), a design space for a protein for an application, wherein the design space comprises a plurality of sequences each containing a respective plurality of activities pertaining to the application;
- receive, via a graphical element in the first screen, a selection of one or more query parameters of the design space; and
- present, in a second screen of the GUI, a solution space that includes a subset of the plurality of sequences each containing the respective plurality of activities, wherein the subset of the plurality of sequences is selected based on the one or more query parameters.
- Clause 38. The computer-readable medium of any preceding clause, wherein the second screen comprises:
- a first portion presenting one or more color-coded clusters representing the subset of the plurality of sequences, and
- a second portion presenting data pertaining to the subset of the plurality of sequences represented by the one or more color-coded clusters, wherein the data describes one or more objects associated with the subset of the plurality of sequences, and the one or more objects comprise a candidate drug compound, an activity, an interaction, a drug, a gene, a pathway, a physical descriptor, a characteristic, a folding property, a wave property, a stability of modification, or some combination thereof.
- Clause 39. The computer-readable medium of any preceding clause, wherein the one or more color-coded clusters represent, using an energy correlation, each sequence in the subset, and the energy correlation comprises a correlation between each position of each sequence in the subset and other positions of other sequences in the subset.
- Clause 40. A system comprising:
- a memory device storing instructions; and
- a processing device communicatively coupled to the memory device, the processing device executes the instructions to:
- present, in a first screen of a graphical user interface (GUI), a design space for a protein for an application, wherein the design space comprises a plurality of sequences each containing a respective plurality of activities pertaining to the application;
- receive, via a graphical element in the first screen, a selection of one or more query parameters of the design space; and
- present, in a second screen of the GUI, a solution space that includes a subset of the plurality of sequences each containing the respective plurality of activities, wherein the subset of the plurality of sequences is selected based on the one or more query parameters.
Claims (20)
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/319,839 US20220165359A1 (en) | 2020-11-23 | 2021-05-13 | Generating anti-infective design spaces for selecting drug candidates |
US17/404,810 US11436246B2 (en) | 2020-11-23 | 2021-08-17 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/404,657 US11424008B2 (en) | 2020-11-23 | 2021-08-17 | Generating anti-infective design spaces for selecting drug candidates |
PCT/US2021/057328 WO2022108733A1 (en) | 2020-11-23 | 2021-10-29 | Generating anti-infective design spaces for selecting drug candidates |
US17/892,701 US12087404B2 (en) | 2020-11-23 | 2022-08-22 | Generating anti-infective design spaces for selecting drug candidates |
US17/902,438 US11967400B2 (en) | 2020-11-23 | 2022-09-02 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202063117068P | 2020-11-23 | 2020-11-23 | |
US202063117083P | 2020-11-23 | 2020-11-23 | |
US17/319,839 US20220165359A1 (en) | 2020-11-23 | 2021-05-13 | Generating anti-infective design spaces for selecting drug candidates |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/319,923 Continuation US11403316B2 (en) | 2020-11-23 | 2021-05-13 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
Related Child Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/319,923 Continuation US11403316B2 (en) | 2020-11-23 | 2021-05-13 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/404,810 Continuation US11436246B2 (en) | 2020-11-23 | 2021-08-17 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/404,657 Continuation US11424008B2 (en) | 2020-11-23 | 2021-08-17 | Generating anti-infective design spaces for selecting drug candidates |
Publications (1)
Publication Number | Publication Date |
---|---|
US20220165359A1 true US20220165359A1 (en) | 2022-05-26 |
Family
ID=81658299
Family Applications (8)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/319,839 Pending US20220165359A1 (en) | 2020-11-23 | 2021-05-13 | Generating anti-infective design spaces for selecting drug candidates |
US17/319,923 Active US11403316B2 (en) | 2020-11-23 | 2021-05-13 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/404,657 Active US11424008B2 (en) | 2020-11-23 | 2021-08-17 | Generating anti-infective design spaces for selecting drug candidates |
US17/404,810 Active US11436246B2 (en) | 2020-11-23 | 2021-08-17 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/878,365 Active US11848076B2 (en) | 2020-11-23 | 2022-08-01 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/892,701 Active US12087404B2 (en) | 2020-11-23 | 2022-08-22 | Generating anti-infective design spaces for selecting drug candidates |
US17/902,438 Active US11967400B2 (en) | 2020-11-23 | 2022-09-02 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US18/543,611 Pending US20240203529A1 (en) | 2020-11-23 | 2023-12-18 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
Family Applications After (7)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/319,923 Active US11403316B2 (en) | 2020-11-23 | 2021-05-13 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/404,657 Active US11424008B2 (en) | 2020-11-23 | 2021-08-17 | Generating anti-infective design spaces for selecting drug candidates |
US17/404,810 Active US11436246B2 (en) | 2020-11-23 | 2021-08-17 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/878,365 Active US11848076B2 (en) | 2020-11-23 | 2022-08-01 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US17/892,701 Active US12087404B2 (en) | 2020-11-23 | 2022-08-22 | Generating anti-infective design spaces for selecting drug candidates |
US17/902,438 Active US11967400B2 (en) | 2020-11-23 | 2022-09-02 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
US18/543,611 Pending US20240203529A1 (en) | 2020-11-23 | 2023-12-18 | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates |
Country Status (2)
Country | Link |
---|---|
US (8) | US20220165359A1 (en) |
WO (1) | WO2022108733A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190139647A1 (en) * | 2016-06-27 | 2019-05-09 | Koninklijke Philips N.V. | Evaluation of decision tree using ontology |
CN116665763A (en) * | 2023-05-18 | 2023-08-29 | 中南大学 | Metabolism path deducing method based on multi-view multi-tag learning |
CN116721777A (en) * | 2023-08-10 | 2023-09-08 | 中国医学科学院药用植物研究所 | Neural network-based drug efficacy evaluation method, device, equipment and medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11581060B2 (en) * | 2019-01-04 | 2023-02-14 | President And Fellows Of Harvard College | Protein structures from amino-acid sequences using neural networks |
US11782918B2 (en) * | 2020-12-11 | 2023-10-10 | International Business Machines Corporation | Selecting access flow path in complex queries |
CN116400892B (en) * | 2023-06-07 | 2023-09-15 | 南京国睿信维软件有限公司 | Unified analysis and display method based on MBSE heterogeneous model |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210081804A1 (en) * | 2017-05-30 | 2021-03-18 | GTN Ltd. | Tensor network machine learning system |
Family Cites Families (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7742877B1 (en) * | 1999-07-22 | 2010-06-22 | Becton, Dickinson & Company | Methods, apparatus and computer program products for formulating culture media |
US6849403B1 (en) | 1999-09-08 | 2005-02-01 | Exact Sciences Corporation | Apparatus and method for drug screening |
US6432409B1 (en) * | 1999-09-14 | 2002-08-13 | Antigen Express, Inc. | Hybrid peptides modulate the immune response |
JP4902925B2 (en) | 2000-02-08 | 2012-03-21 | シンヴェント エイエス | Novel genes encoding nystatin polyketide synthase and their handling and use |
US20040115726A1 (en) | 2001-09-14 | 2004-06-17 | Renpei Nagashima | Method, system, apparatus and device for discovering and preparing chemical compounds for medical and other uses. |
US20050032119A1 (en) | 2001-04-02 | 2005-02-10 | Astex Technology Ltd. | Crystal structure of cytochrome P450 |
WO2002093318A2 (en) | 2001-05-15 | 2002-11-21 | Psychogenics Inc. | Systems and methods for monitoring behavior informatics |
US8793073B2 (en) | 2002-02-04 | 2014-07-29 | Ingenuity Systems, Inc. | Drug discovery methods |
US20050084907A1 (en) | 2002-03-01 | 2005-04-21 | Maxygen, Inc. | Methods, systems, and software for identifying functional biomolecules |
GB2409916A (en) * | 2003-07-04 | 2005-07-13 | Intellidos Ltd | Joining query templates to query collated data |
US20050060305A1 (en) | 2003-09-16 | 2005-03-17 | Pfizer Inc. | System and method for the computer-assisted identification of drugs and indications |
US20060052943A1 (en) | 2004-07-28 | 2006-03-09 | Karthik Ramani | Architectures, queries, data stores, and interfaces for proteins and drug molecules |
US20060106545A1 (en) | 2004-11-12 | 2006-05-18 | Jubilant Biosys Ltd. | Methods of clustering proteins |
WO2007107879A2 (en) | 2006-03-23 | 2007-09-27 | Novasaid Ab | Methods for building atomic models of protein molecules and determining drug candidates using mgst1 |
WO2009092800A1 (en) | 2008-01-24 | 2009-07-30 | Novasaid Ab | Protein structure and method of using protein structure |
US20100082599A1 (en) * | 2008-09-30 | 2010-04-01 | Goetz Graefe | Characterizing Queries To Predict Execution In A Database |
US20120296090A1 (en) | 2011-04-04 | 2012-11-22 | The Methodist Hospital Research Institute | Drug Repositioning Methods For Targeting Breast Tumor Initiating Cells |
US20130252280A1 (en) | 2012-03-07 | 2013-09-26 | Genformatic, Llc | Method and apparatus for identification of biomolecules |
US9169287B2 (en) | 2013-03-15 | 2015-10-27 | Massachusetts Institute Of Technology | Solid phase peptide synthesis processes and associated systems |
US9695214B2 (en) | 2013-03-15 | 2017-07-04 | Massachusetts Institute Of Technology | Solid phase peptide synthesis processes and associated systems |
HUE048104T2 (en) | 2013-09-27 | 2020-05-28 | Codexis Inc | Structure based predictive modeling |
JP6667447B2 (en) | 2013-11-15 | 2020-03-18 | ヒンジ セラピューティクス,インコーポレイテッド | Computer-aided modeling for treatment design. |
US20150371009A1 (en) | 2014-06-19 | 2015-12-24 | Jake Yue Chen | Drug identification models and methods of using the same to identify compounds to treat disease |
CN108289925A (en) | 2015-09-17 | 2018-07-17 | 麻省理工学院 | Solid-phase peptide synthesis and related system |
EP3350197A4 (en) | 2015-09-17 | 2019-04-24 | Massachusetts Institute of Technology | Methods and systems for solid phase peptide synthesis |
US20170147743A1 (en) | 2015-11-23 | 2017-05-25 | University Of Miami | Rapid identification of pharmacological targets and anti-targets for drug discovery and repurposing |
US10776712B2 (en) * | 2015-12-02 | 2020-09-15 | Preferred Networks, Inc. | Generative machine learning systems for drug design |
US20190018933A1 (en) * | 2016-01-15 | 2019-01-17 | Preferred Networks, Inc. | Systems and methods for multimodal generative machine learning |
WO2017132550A1 (en) * | 2016-01-28 | 2017-08-03 | The Brigham And Women's Hospital, Inc. | Detection of an antibody against a pathogen |
US11774944B2 (en) | 2016-05-09 | 2023-10-03 | Strong Force Iot Portfolio 2016, Llc | Methods and systems for the industrial internet of things |
US20190279737A1 (en) | 2016-11-24 | 2019-09-12 | Industry-University Cooperation Foundation Hanyang University | Method of discovering new drug candidate targeting disorder-to-order transition region and apparatus for discovering new drug candidate |
EP3568782A1 (en) | 2017-01-13 | 2019-11-20 | Massachusetts Institute Of Technology | Machine learning based antibody design |
US20180372724A1 (en) | 2017-06-26 | 2018-12-27 | The Regents Of The University Of California | Methods and apparatuses for prediction of mechanism of activity of compounds |
KR101991725B1 (en) | 2017-07-06 | 2019-06-21 | 부경대학교 산학협력단 | Methods for target-based drug screening through numerical inversion of quantitative structure-drug performance relationships and molecular dynamics simulation |
US11587644B2 (en) | 2017-07-28 | 2023-02-21 | The Translational Genomics Research Institute | Methods of profiling mass spectral data using neural networks |
US20190095584A1 (en) | 2017-09-26 | 2019-03-28 | International Business Machines Corporation | Mechanism of action derivation for drug candidate adverse drug reaction predictions |
WO2019161204A1 (en) * | 2018-02-19 | 2019-08-22 | Protabit LLC | Platform for protein storage, analysis and engineering |
WO2019191777A1 (en) * | 2018-03-30 | 2019-10-03 | Board Of Trustees Of Michigan State University | Systems and methods for drug design and discovery comprising applications of machine learning with differential geometric modeling |
WO2020009916A1 (en) | 2018-07-03 | 2020-01-09 | Yale University | System and method for using microbiome to de-risk drug development |
US11680063B2 (en) | 2018-09-06 | 2023-06-20 | Insilico Medicine Ip Limited | Entangled conditional adversarial autoencoder for drug discovery |
US11393560B2 (en) | 2018-11-13 | 2022-07-19 | Recursion Pharmaceuticals, Inc. | Systems and methods for high throughput compound library creation |
MX2021007556A (en) | 2018-12-21 | 2021-09-10 | Biontech Us Inc | Method and systems for prediction of hla class ii-specific epitopes and characterization of cd4+ t cells. |
US20210217498A1 (en) | 2018-12-24 | 2021-07-15 | Medirita | Data processing apparatus and method for predicting effectiveness and safety of new drug candidate substance |
CA3127965A1 (en) | 2019-02-11 | 2020-08-20 | Flagship Pioneering Innovations Vi, Llc | Machine learning guided polypeptide analysis |
WO2020208555A1 (en) | 2019-04-09 | 2020-10-15 | Eth Zurich | Systems and methods to classify antibodies |
US20200327963A1 (en) | 2019-04-11 | 2020-10-15 | Accenture Global Solutions Limited | Latent Space Exploration Using Linear-Spherical Interpolation Region Method |
US20200392178A1 (en) | 2019-05-15 | 2020-12-17 | International Business Machines Corporation | Protein-targeted drug compound identification |
US11651841B2 (en) | 2019-05-15 | 2023-05-16 | International Business Machines Corporation | Drug compound identification for target tissue cells |
US11152125B2 (en) | 2019-06-06 | 2021-10-19 | International Business Machines Corporation | Automatic validation and enrichment of semantic relations between medical entities for drug discovery |
GB201909925D0 (en) | 2019-07-10 | 2019-08-21 | Benevolentai Tech Limited | Identifying one or more compounds for targeting a gene |
WO2021035097A1 (en) | 2019-08-21 | 2021-02-25 | Fountain Therapeutics, Inc. | Cell age classification and drug screening |
EP4018020A4 (en) | 2019-08-23 | 2023-09-13 | Geaenzymes Co. | Systems and methods for predicting proteins |
US20220348903A1 (en) | 2019-09-13 | 2022-11-03 | The University Of Chicago | Method and apparatus using machine learning for evolutionary data-driven design of proteins and other sequence defined biomolecules |
KR102110176B1 (en) | 2019-10-11 | 2020-05-13 | 주식회사 메디리타 | Method and apparatus for deriving new drug candidate substance |
CN113129999B (en) | 2019-12-31 | 2024-06-18 | 高丽大学校产学协力团 | New drug candidate substance output method and device, model construction method and recording medium |
US20210287763A1 (en) | 2020-03-16 | 2021-09-16 | Innoplexus Ag | System and method for selecting a set of candidate drug compounds |
US11174289B1 (en) | 2020-05-21 | 2021-11-16 | International Business Machines Corporation | Artificial intelligence designed antimicrobial peptides |
CN111753543B (en) | 2020-06-24 | 2024-03-12 | 北京百度网讯科技有限公司 | Medicine recommendation method, device, electronic equipment and storage medium |
CN111816252B (en) | 2020-07-21 | 2021-08-31 | 腾讯科技(深圳)有限公司 | Drug screening method and device and electronic equipment |
JP2023536118A (en) | 2020-07-28 | 2023-08-23 | フラッグシップ パイオニアリング イノベーションズ シックス,エルエルシー | Deep learning for novel antibody affinity maturation (correction) and property improvement |
US20220036968A1 (en) | 2020-07-30 | 2022-02-03 | Frontier Medicines Corporation | Processing biophysical screening data and identifying and characterizing protein sites for drug discovery |
CN111755078B (en) | 2020-07-30 | 2022-09-23 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
US11127488B1 (en) | 2020-09-25 | 2021-09-21 | Accenture Global Solutions Limited | Machine learning systems for automated pharmaceutical molecule screening and scoring |
US11615324B2 (en) | 2020-12-16 | 2023-03-28 | Ro5 Inc. | System and method for de novo drug discovery |
CN114822717A (en) | 2021-01-28 | 2022-07-29 | 腾讯科技(深圳)有限公司 | Artificial intelligence-based drug molecule processing method, device, equipment and storage medium |
KR102472724B1 (en) | 2021-01-29 | 2022-12-01 | 주식회사 인세리브로 | A method and an apparatus for designing target-specific drug combining deep-learning algorithm and water pharmacophore model |
US20220328140A1 (en) | 2021-04-05 | 2022-10-13 | Applied BioMath, LLC | Methods and apparatus for therapeutic feasibility assessment using quantitative systems pharmacology and rule-based reasoning systems |
US11587643B2 (en) | 2021-05-07 | 2023-02-21 | Peptilogics, Inc. | Methods and apparatuses for a unified artificial intelligence platform to synthesize diverse sets of peptides and peptidomimetics |
CN113409884B (en) | 2021-06-30 | 2022-07-22 | 北京百度网讯科技有限公司 | Training method of sequencing learning model, sequencing method, device, equipment and medium |
US20230034559A1 (en) | 2021-07-18 | 2023-02-02 | Sunstella Technology Corporation | Automated prediction of clinical trial outcome |
US11450407B1 (en) | 2021-07-22 | 2022-09-20 | Pythia Labs, Inc. | Systems and methods for artificial intelligence-guided biomolecule design and assessment |
CN113838536B (en) | 2021-09-13 | 2022-06-10 | 烟台国工智能科技有限公司 | Translation model construction method, product prediction model construction method and prediction method |
US20230083769A1 (en) | 2021-09-14 | 2023-03-16 | City University Of Hong Kong | Machine learning based method of screening potential drug candidate, and a method thereof |
US20230094323A1 (en) | 2021-09-15 | 2023-03-30 | Korea Advanced Institute Of Science And Technology | System and method for optimizing general purpose biological network for drug response prediction using meta-reinforcement learning agent |
US20230098833A1 (en) | 2021-09-17 | 2023-03-30 | The University Of Hong Kong | Deepdrug: an expert-led directed graph neural networking drug-repurposing framework for identification of a lead combination of drugs protecting against alzheimer's disease and related disorders |
US20230098285A1 (en) | 2021-09-24 | 2023-03-30 | Seoul National University R&Db Foundation | Apparatus and method for generating a protein-drug interaction prediction model for predicting protein-drug interaction and determining its uncertainty, and protein-drug interaction prediction apparatus and method |
WO2023049466A2 (en) | 2021-09-27 | 2023-03-30 | Marwell Bio Inc. | Machine learning for designing antibodies and nanobodies in-silico |
CN114187979A (en) | 2022-02-15 | 2022-03-15 | 北京晶泰科技有限公司 | Data processing, model training, molecular prediction and screening method and device thereof |
- 2021
- 2021-05-13 US US17/319,839 patent/US20220165359A1/en active Pending
- 2021-05-13 US US17/319,923 patent/US11403316B2/en active Active
- 2021-08-17 US US17/404,657 patent/US11424008B2/en active Active
- 2021-08-17 US US17/404,810 patent/US11436246B2/en active Active
- 2021-10-29 WO PCT/US2021/057328 patent/WO2022108733A1/en active Application Filing
2022
- 2022-08-01 US US17/878,365 patent/US11848076B2/en active Active
- 2022-08-22 US US17/892,701 patent/US12087404B2/en active Active
- 2022-09-02 US US17/902,438 patent/US11967400B2/en active Active
2023
- 2023-12-18 US US18/543,611 patent/US20240203529A1/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210081804A1 (en) * | 2017-05-30 | 2021-03-18 | GTN Ltd. | Tensor network machine learning system |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190139647A1 (en) * | 2016-06-27 | 2019-05-09 | Koninklijke Philips N.V. | Evaluation of decision tree using ontology |
US11769599B2 (en) * | 2016-06-27 | 2023-09-26 | Koninklijke Philips N.V. | Evaluation of decision tree using ontology |
CN116665763A (en) * | 2023-05-18 | 2023-08-29 | 中南大学 | Metabolism path deducing method based on multi-view multi-tag learning |
CN116721777A (en) * | 2023-08-10 | 2023-09-08 | 中国医学科学院药用植物研究所 | Neural network-based drug efficacy evaluation method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
WO2022108733A1 (en) | 2022-05-27 |
US20220164342A1 (en) | 2022-05-26 |
US12087404B2 (en) | 2024-09-10 |
US11403316B2 (en) | 2022-08-02 |
US20220164343A1 (en) | 2022-05-26 |
US20220399082A1 (en) | 2022-12-15 |
US20220415446A1 (en) | 2022-12-29 |
US11848076B2 (en) | 2023-12-19 |
US20230037376A1 (en) | 2023-02-09 |
US11424008B2 (en) | 2022-08-23 |
US11436246B2 (en) | 2022-09-06 |
US20220165360A1 (en) | 2022-05-26 |
US20240203529A1 (en) | 2024-06-20 |
US11967400B2 (en) | 2024-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11462304B2 (en) | Artificial intelligence engine architecture for generating candidate drugs | |
US12087404B2 (en) | Generating anti-infective design spaces for selecting drug candidates | |
Aguilera-Mendoza et al. | Automatic construction of molecular similarity networks for visual graph mining in chemical space of bioactive peptides: an unsupervised learning approach | |
US11587643B2 (en) | Methods and apparatuses for a unified artificial intelligence platform to synthesize diverse sets of peptides and peptidomimetics | |
Huang et al. | Machine learning applications for therapeutic tasks with genomics data | |
US20240344123A1 (en) | Methods and apparatuses for generating peptides by synthesizing a portion of a design space to identify peptides having non-canonical amino acids | |
US20220059196A1 (en) | Artificial intelligence engine for generating candidate drugs using experimental validation and peptide drug optimization | |
Wu et al. | Sega: Structural entropy guided anchor view for graph contrastive learning | |
US20220384058A1 (en) | Methods and apparatuses for using artificial intelligence trained to generate candidate drug compounds based on dialects | |
WO2022236126A1 (en) | Methods and apparatuses for a unified artificial intelligence platform to synthesize diverse sets of peptides and peptidomimetics | |
US20240379191A1 (en) | Generating enhanced graphical user interfaces for presentation of anti-infective design spaces for selecting drug candidates | |
Mukaidaisi | Protein-Ligand Binding Affinity Directed Multi-Objective Drug Design Based on Fragment Representation Methods | |
Santos | RINALDO-RatIoNAL Drug design wOrkbench | |
Winter | Unsupervised Learning of Molecular Representations for Drug Development | |
Hamad Al Nuaimi | Streaming Feature Grouping and Selection (Sfgs) For Big Data Classification | |
Haywood | Artificial Intelligence for Chemical Synthesis: Improving the Workflow of Medicinal Chemists using Computer-Aided Synthesis Planning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| AS | Assignment | Owner name: PEPTILOGICS, INC., PENNSYLVANIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LEE, FRANCIS;STECKBECK, JONATHAN D., DR.;HOLSTE, HANNES;SIGNING DATES FROM 20220829 TO 20220908;REEL/FRAME:061046/0973 |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCV | Information on status: appeal procedure | Free format text: NOTICE OF APPEAL FILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |