CN108959843B - Computer screening method of chemical small molecule drug of target RNA - Google Patents

Info

Publication number
CN108959843B
Authority
CN
China
Prior art keywords
rna
small molecule
small molecules
screening
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810573816.1A
Other languages
Chinese (zh)
Other versions
CN108959843A (en
Inventor
崔庆华
周源
曾攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jianmu Technology Co Ltd
Original Assignee
Beijing Jianmu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jianmu Technology Co Ltd
Priority to CN201810573816.1A (CN108959843B)
Priority to PCT/CN2018/090267 (WO2019232748A1)
Publication of CN108959843A
Application granted
Publication of CN108959843B
Legal status: Active
Anticipated expiration

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Pharmaceuticals Containing Other Organic And Inorganic Compounds (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a computer screening method for chemical small molecule drugs that target RNA, comprising the following steps: (1) collecting and collating a training data set, (2) mining the features used by the prediction method, (3) creating the prediction method and model, and (4) verifying the prediction method and model. The invention can be used for computer screening of chemical small molecules that target RNA and provides new solutions for RNA-based prevention and treatment of major diseases.

Description

Computer screening method of chemical small molecule drug of target RNA
Technical Field
The invention relates to a computer screening method of a drug, in particular to a computer screening method of a chemical small molecule drug of a target RNA.
Background
Genes (DNA) are the repository of genetic material and direct the construction of proteins, which are regarded as the molecules that ultimately perform specific biological functions, while RNA has been regarded as an intermediate that links DNA and protein. Traditionally, therefore, research has focused on proteins (including protein-coding DNA), and RNA has received little attention. Traditional drug development is mainly based on protein targets; for example, more than 95% of the drugs recorded in the DrugBank database have proteins as their targets. However, most proteins are not druggable, and only about 400 proteins have been targeted so far, so developing drugs that target other kinds of molecules is an urgent priority for disease prevention and treatment. In recent years, with the Human Genome Project and the ENCODE project, it was surprisingly found that protein-coding DNA accounts for only about 2% of human DNA; most of the remaining 98% is transcribed into RNA but not translated into protein, and such RNA is called non-coding RNA (ncRNA). With the rapid development of high-throughput technologies such as RNA-Seq, a large number of non-coding RNAs have been discovered; in humans, for example, about 4,000 miRNAs (microRNAs) and about 10,000 long non-coding RNAs (lncRNAs) have been found. Research shows that these RNA molecules have important biological functions and are closely related to disease. Even messenger RNA is not limited to relaying information between DNA and protein but has various important functions at the RNA level. RNA is increasingly recognized as a potential key target for disease intervention, and the development of RNA-targeting drugs is attracting wide attention.
One large class of molecules with potential as RNA-targeting drugs is RNA or DNA itself (referred to here as nucleic acids, to distinguish them from the "RNA targets"), such as small interfering RNA (siRNA), antisense oligonucleotides (ASO), miRNA and aptamers. For example, Miravirsen, a nucleic acid drug from Roche for the treatment of hepatitis C that targets the human liver-specific miRNA miR-122, has entered phase 2 clinical trials. However, nucleic acid drugs have inherent disadvantages: off-target effects, a tendency to trigger immune reactions as exogenous macromolecules, poor stability, and difficulty entering cells. These disadvantages, especially the latter two, severely limit the druggability of nucleic acids. For example, siRNA is degraded within as little as a few minutes of entering the blood circulation; this poor stability is one of the major obstacles to nucleic acid drug development. In addition, over hundreds of millions of years of evolution, cells have evolved a lipid bilayer membrane to resist harmful external substances; it keeps exogenous nucleic acids out of cells and makes the target RNA difficult to regulate, which is the other main obstacle to developing nucleic acid drugs. Thus, alongside continued intensive research into nucleic acid-based RNA-targeting drugs, the international scientific community has begun to turn its attention to other RNA-targeting strategies, among which chemical small molecules have begun to emerge. In drug development, chemical small molecules are organic molecules with a molecular weight of less than 900 daltons.
Chemical small molecules are stable and enter cells easily, which largely overcomes the shortcomings of nucleic acid drugs. Historically, small molecules have already been used successfully to target RNA; streptomycin and tetracycline, for example, target bacterial RNA. However, a major bottleneck currently hindering the field is the lack of computational methods for screening chemical small molecules that target RNA. Research groups worldwide, including the applicant's, have made attempts in this area, such as miREnvironment, a bioinformatics database and prediction platform for miRNA-environmental factor (mostly chemical small molecule) associations based on miRNA or mRNA transcriptomes, and SM2miR, a database of small molecule-miRNA associations with an accompanying prediction algorithm. These methods, however, essentially predict "functional" associations between miRNAs and small molecules and are not true predictions of drugs that target miRNAs. The Kuntz laboratory has attempted to apply the "protein-small molecule" docking software Dock 6.0 to "RNA-small molecule" docking, but this approach has significant drawbacks: 1) it depends on the RNA tertiary structure, yet most RNA tertiary structures are unknown, and RNA tertiary structures are less rigid and more flexible than protein tertiary structures; 2) Dock 6 was designed for "protein-small molecule" docking, and the physicochemical properties and force field parameters of RNA differ greatly from those of proteins, so Dock 6 is not suitable for RNA.
Recently, the Disney laboratory first experimentally identified chemical small molecules that bind small RNA fragments containing hairpin loops and bulges, and then used these interaction data to design the prediction algorithm Inforna. However, the algorithm is suitable only for small RNA fragments, not for large RNA molecules, which are more numerous, more complex, and act through mechanisms different from those of small RNAs. In addition, because the Inforna data, programs and servers are not public, its accuracy is unclear. In summary, each of the current preliminary attempts has shortcomings, the problem of screening RNA-targeting drugs is far from solved, and new computational methods are needed.
According to the above analysis, screening for drugs that directly target a molecule would seem to require the spatial structure and force field of that molecule, yet few RNA molecules have known spatial structures and RNA force fields are poorly characterized. This appears to be a contradiction that is difficult to reconcile.
Disclosure of Invention
The invention aims to address the shortcomings of the prior art by providing a computer screening method for chemical small molecule drugs that target RNA. The method uses information derived from the RNA sequence together with the physicochemical properties of chemical small molecules to construct a random forest model, making the screening of RNA-targeting chemical small molecules more convenient and effective. In the present invention, chemical small molecules are organic molecules with a molecular weight of less than 900 daltons.
The purpose of the invention is realized by the following technical scheme:
A computer screening method for chemical small molecule drugs that target RNA comprises the following steps: (1) collecting and collating a data set, (2) mining the features used for training the prediction method, (3) creating the prediction method and model, and (4) verifying the prediction method and model.
Preferably, collecting and collating the data set comprises the following steps:
(a) retrieving from the PDB (Protein Data Bank) database structures consisting only of RNA and small molecules, and extracting from them the corresponding information, including whether the RNA and small molecule interact and the specific positions at which they interact, as the training data set;
(b) collecting RNA-small molecule interactions outside the PDB database from the SMMRNA (Small Molecule Modulators of RNA) database and from literature reports as the test data set.
Preferably, mining the features used for training the prediction method comprises the following steps:
(a) extracting RNA-related features covering sequence, structure and function;
(b) calculating the physicochemical properties of the small molecules, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), molar refractivity (MR), molecular weight (MW), and topological polar surface area (TPSA).
Preferably, the RNA-related features include: nucleotide type, functional site, nucleotide distance sum (NDS) curve, nucleotide frequency and pairing status.
Creating the prediction method and model comprises: using a balanced random forest (BRF) model to establish a computational method for predicting RNA-chemical small molecule interactions.
Since small molecules usually bind only local regions of RNA, the RNA is first converted into fragments. Among the resulting fragments, those bound by small molecules (positive samples) are far fewer than the unbound fragments (negative samples); a balanced random forest (BRF) model is therefore used to create the computational method for predicting RNA-chemical small molecule interactions.
Preferably, creating the computational method for predicting RNA-chemical small molecule interactions with the balanced random forest model comprises: dividing the negative samples in the training data set into several parts so as to reduce the difference in size between each part of negative samples and the positive samples, pairing each part with the positive samples to train a model, and aggregating the outputs of the models.
Verifying the prediction method and model comprises: evaluating the performance of the model obtained in step (3).
Preferably, the performance evaluation comprises: cross validation using a training data set and/or independent validation using a test data set.
Preferably, the performance evaluation comprises: selecting 5 positive and 5 negative predictions for biological validation.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA in a high-throughput screening platform.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA in computer screening of compounds with RNA as the target.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA with the PDB database.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA in the following: a high-throughput screening platform; computer screening of compounds with RNA as the target; and/or the PDB database; and/or the SMMRNA database; and/or the miRNA-based environmental factor development platform miREnvironment; and/or targeted drug screening; and/or the prevention and treatment of major diseases.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA in targeted drug screening. Using the method, the chemical small molecules kaempferol and quercetin were predicted to target lncSHGL.
The invention further provides use of the computer screening method for chemical small molecule drugs targeting RNA in the prevention and treatment of major diseases. A new lncRNA, lncSHGL, was previously discovered; it plays a key role in hepatic glucose and lipid metabolism and is a new drug target for intervention in metabolic diseases such as fatty liver and diabetes. Using the method, kaempferol and quercetin were predicted to bind lncSHGL, which makes these two chemical small molecules potential drugs for the prevention and treatment of fatty liver and diabetes.
The invention has the beneficial effects that:
RNA is a novel class of disease intervention target, and screening chemical small molecule drugs against it is an important problem. Given the scarcity of RNA spatial structure data, the flexibility of RNA structures and the lack of a well-characterized force field, the invention analyzes RNA sequence features and small molecule physicochemical properties and, on that basis, creates a machine learning (random forest) method for screening chemical small molecules that target RNA. The invention can be used for computer screening of chemical small molecules that target RNA and provides new solutions for RNA-based prevention and treatment of major diseases.
The invention provides a new idea, a new strategy and a new method for screening the target RNA medicament.
Description of the drawings:
FIG. 1. Nucleotide distances calculated from the RNA sequence (the sequence is first used to predict the secondary structure, from which distances are then calculated) are highly correlated with nucleotide distances calculated from the spatial structure;
FIG. 2. AK098656 is expressed with high specificity in vascular smooth muscle cells;
FIG. 3. After AK098656 gene transfer, both the systolic pressure (a) and the diastolic pressure (b) of rats are significantly increased;
FIG. 4. Cross-validation results of the created computational method (a) and test results on the independent data set derived from SMMRNA and the literature (b).
Detailed Description
The following examples and experimental examples are intended to illustrate the present invention, but are not intended to limit the scope of the present invention. The present invention will be further described with reference to specific examples and experimental examples.
Example 1:
1. Collection and collation of RNA-chemical small molecule interaction data
1) Training data set
Structures consisting only of RNA chains and small molecules were retrieved from the PDB database, and the downloaded PDB structure data were cleaned to serve as the source of the training data set. A PDB structure was discarded if all the small molecules it contains are metal ions or solvent molecules from buffers commonly used in structural biology, or if the RNA chain it contains is no longer than 20 nucleotides. Next, RNA-small molecule interaction information was extracted from the retained PDB structures. Since 4.0 angstroms is approximately the turning point between the weakest hydrogen bonds and the strongest van der Waals forces, 4.0 angstroms was taken as the threshold for judging interaction between a small molecule and RNA: an interaction is considered to exist if the closest distance between an atom of the nucleotide and an atom of the small molecule is less than 4.0 angstroms. Because the PDB structures used as the source of the training data set contain few non-interacting RNA-small molecule pairs, such pairs were generated artificially. First, the small molecules involved in all the PDB structures were collected and the Euclidean distances between their physicochemical properties were calculated. Then, for each small molecule that interacts with the RNA chain in a given PDB structure, the remaining small molecules were ranked by their physicochemical Euclidean distance to it. To minimize the chance of generating false-negative RNA-small molecule pairs, the intersection of the small molecules ranked between the 80th and 90th percentiles of physicochemical Euclidean distance was selected and paired with the RNA to form the non-interacting RNA-small molecule pairs.
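The 4.0 angstrom contact rule can be illustrated with a minimal sketch. The input layout (one list of atom coordinates per nucleotide) and the function names are assumptions made for illustration; parsing of PDB files is not shown, and this is not the applicant's code.

```python
import math

CUTOFF = 4.0  # angstroms, the threshold stated in the text


def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between two lists of (x, y, z) atoms."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def interacting_nucleotides(nucleotides, ligand_atoms, cutoff=CUTOFF):
    """Indices of nucleotides with any atom closer than `cutoff` to the ligand.

    `nucleotides` is a list of atom-coordinate lists, one per nucleotide
    (hypothetical layout chosen for this sketch).
    """
    return [i for i, atoms in enumerate(nucleotides)
            if min_distance(atoms, ligand_atoms) < cutoff]


# Toy coordinates, not taken from any real structure.
nucleotides = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],    # nucleotide 0: closest atom is 3.0 A away
    [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)],  # nucleotide 1: closest atom is 5.5 A away
]
ligand = [(4.5, 0.0, 0.0)]
print(interacting_nucleotides(nucleotides, ligand))  # [0]
```

The strict `<` comparison follows the wording "less than 4.0 angstroms".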
2) Independent test data set
Interacting RNA-small molecule pairs and possible non-interacting pairs were collected manually from the literature as test data, and new RNA-small molecule interaction data not included in the PDB database were obtained from the SMMRNA database.
2. Calculation of RNA-related characteristics and small molecule physicochemical properties
On the one hand, RNA-related features are extracted from the perspectives of sequence, structure and function. Specifically, the following features are extracted for each nucleotide in sequence order:
(1) the nucleotide type (A, U, C, G or N);
(2) whether it pairs with another nucleotide;
(3) whether it is a functional site predicted by the Rsite2 algorithm previously proposed by the applicant;
(4) the normalized nucleotide distance sum (NNDS) of this nucleotide, computed from geometric distances in the secondary structure:
$$\mathrm{NNDS} = \frac{\sum_{j} \mathrm{dist}(nt_i - nt_j)}{\sum_{j} \mathrm{dist}(nt_{centroid} - nt_j)}$$
where $nt_i$, $nt_j$ and $nt_{centroid}$ are the coordinate vectors of the nucleotide under consideration, of any nucleotide in the RNA, and of the RNA center, respectively; the Euclidean distance is used when calculating nucleotide distances.
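As a minimal sketch of the NNDS definition, the function below takes one coordinate tuple per nucleotide and divides each nucleotide's distance sum by the distance sum of the centroid. How coordinates are derived from a secondary structure is not shown, and the function name `nnds` is hypothetical.

```python
import math


def nnds(coords):
    """Normalized NDS per nucleotide.

    For nucleotide i: sum of Euclidean distances to every nucleotide j,
    divided by the same sum taken from the centroid of all coordinates.
    Assumes the coordinates are not all identical (non-zero denominator).
    """
    n = len(coords)
    dim = len(coords[0])
    centroid = tuple(sum(c[k] for c in coords) / n for k in range(dim))
    centroid_sum = sum(math.dist(centroid, c) for c in coords)
    return [sum(math.dist(ci, cj) for cj in coords) / centroid_sum
            for ci in coords]


# Three nucleotides on a line: the middle one coincides with the centroid.
print(nnds([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]))  # [1.5, 1.0, 1.5]
```

A nucleotide near the center of the structure gets a value close to 1, and peripheral nucleotides get larger values.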
Subsequently, because the RNA is fragmented, features (1) to (3) are placed into the feature vector of the corresponding fragment, and feature (4) is averaged over the fragment and assigned to it. Whether a fragment interacts with the small molecule is determined by whether the nucleotide at the center of the fragment interacts with the small molecule. Positions in a fragment that extend beyond either end of the RNA sequence are missing values and are filled by default with N for feature (1), with default values for features (2) and (3), and with the normalized NDS value of the first or last nucleotide for feature (4). In addition, the frequencies of single nucleotides and of nucleotide triplets are counted within each fragment. The RNA secondary structures used to determine nucleotide pairing status come from several sources: extraction from the PDB structure using RNApdbee (http://rnapdbee.cs.put.poznan.pl/), manual annotation according to the relevant literature, and prediction from the RNA sequence using RNAfold.
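The fragmentation step might be sketched as follows, assuming an odd fragment length so that each fragment has a single central nucleotide. Only the sequence feature and the frequency counts are shown; the function names and the example sequence are illustrative, not from the source.

```python
from collections import Counter


def fragments(seq, length):
    """One fragment per nucleotide, centered on it and padded with 'N'.

    Assumes `length` is odd so the fragment has a single center.
    Returns a list of (center_index, fragment) pairs.
    """
    half = length // 2
    padded = "N" * half + seq + "N" * half
    return [(i, padded[i:i + length]) for i in range(len(seq))]


def kmer_frequencies(fragment, k):
    """Relative frequency of every k-mer in the fragment (k=1 or k=3 here)."""
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def label(center_index, interacting_positions):
    """A fragment is positive if its central nucleotide contacts the ligand."""
    return 1 if center_index in interacting_positions else 0


for center, frag in fragments("GGAUCC", 5):
    print(center, frag, label(center, {2, 3}))
```

With the toy sequence above, the first fragment is `NNGGA` and the last is `UCCNN`, showing the N padding at both ends.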
On the other hand, the chemical small molecule structure files include Structure Data Format (SDF) files obtained directly from the PDB database and Simplified Molecular Input Line Entry System (SMILES) files retrieved from the PubChem database of NCBI (https://pubchem.ncbi.nlm.nih.gov/). The physicochemical properties of each small molecule were then calculated from its structure file using the Open Babel software package, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), molecular weight (MW), molar refractivity (MR) and topological polar surface area (TPSA). These indices can be obtained directly by counting or by summing the known physicochemical contributions of small molecule fragments. For example, for a small molecule containing n fragments, the TPSA contribution of each fragment can be looked up and the total calculated as a sum weighted by the number of occurrences of each fragment:
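$$\mathrm{TPSA} = \sum_{i=1}^{n} N_i \cdot \mathrm{TPSA}_i$$
where $N_i$ is the number of occurrences of fragment $i$ and $\mathrm{TPSA}_i$ is its tabulated contribution.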
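The fragment-based summation for TPSA can be sketched in a few lines. The fragment names and contribution values below are illustrative placeholders chosen for this example; in the source the properties are computed with Open Babel, which is not called here.

```python
# Illustrative per-fragment polar surface contributions in square angstroms.
# These names and numbers are assumptions for the sketch, not a lookup table
# taken from the source.
CONTRIBUTIONS = {
    "hydroxyl_O": 20.23,
    "carbonyl_O": 17.07,
    "primary_amine_N": 26.02,
}


def tpsa_from_fragments(fragment_counts, contributions=CONTRIBUTIONS):
    """Sum of fragment contributions weighted by how often each fragment occurs."""
    return sum(count * contributions[name]
               for name, count in fragment_counts.items())


# A molecule with one hydroxyl oxygen and one carbonyl oxygen.
print(round(tpsa_from_fragments({"hydroxyl_O": 1, "carbonyl_O": 1}), 2))  # 37.3
```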
3. method for creating RNA-chemical small molecule interaction prediction
1) Calculating RNA-chemical small molecule interaction tendency fraction
Since RNA interacts with chemical small molecules only locally, the applicant proposes fragmenting the RNA. The RNA-related features input into the model are therefore computed on RNA sequence fragments, and the model directly predicts whether each fragment interacts with a chemical small molecule. The fragment-level predictions are then integrated at the level of the RNA molecule to give an overall assessment of its tendency to interact with the small molecule. To do this, the fragments in the RNA sequence that are predicted to interact with the small molecule are first identified. For each such fragment, the proportion of predicted interacting fragments among the fragment itself and up to 5 adjacent fragments on each side is calculated. The fragments are then sorted by this proportion, and the ratio of the mean proportion to the mean distance between the fragments' central sequence positions is calculated. The higher the ratio, the more densely the fragments able to interact with the small molecule are distributed along the RNA molecule. This interaction tendency score is called the DRIP (Drug-RNA Interaction Predictor) score.
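The description of the DRIP score leaves some details open, so the following is only one possible reading, written as a sketch: the local proportion is taken over a window of up to 5 fragments on each side, the distance is taken between consecutive predicted-positive fragment centers, and RNAs with fewer than two predicted-positive fragments are given a score of 0. All three choices are assumptions, as is the function name.

```python
def drip_score(predicted, window=5):
    """One interpretation of the fragment-to-molecule aggregation.

    `predicted` holds 0/1 fragment-level predictions in sequence order.
    Returns mean local density of positives divided by the mean gap between
    consecutive positive fragment centers (assumed reading of the text).
    """
    positives = [i for i, p in enumerate(predicted) if p]
    if len(positives) < 2:
        return 0.0  # assumption: the score is undefined without two positives
    densities = []
    for i in positives:
        lo = max(0, i - window)
        hi = min(len(predicted), i + window + 1)
        densities.append(sum(predicted[lo:hi]) / (hi - lo))
    gaps = [b - a for a, b in zip(positives, positives[1:])]
    # Sorting the densities would not change their mean, so it is omitted.
    return (sum(densities) / len(densities)) / (sum(gaps) / len(gaps))


clustered = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scattered = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
print(drip_score(clustered) > drip_score(scattered))  # True
```

Under this reading, tightly clustered positive fragments score higher than the same number of positives spread along the RNA, which matches the stated intent of the score.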
2) Creating RNA-chemical small molecule interaction prediction models
Because fragments that do not interact with small molecules far outnumber fragments that do, a balanced random forest (BRF) model is used: the negative samples are divided into several parts and each part is paired with the positive samples. The difference in size between each part of negative samples and the positive samples is reduced as far as possible, but the negative samples are divided into at most 10 parts to avoid making the model overly complex. The random forest models are built with the R package randomForest.
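The partitioning idea can be sketched independently of any particular classifier. In the sketch below the number of parts is the negative-to-positive ratio capped at 10, each part is paired with all positives, and sub-model outputs are averaged. The source builds the forests with the R package randomForest; here the sub-models are stand-in callables, and the averaging rule and function names are assumptions.

```python
import random


def split_negatives(positives, negatives, max_parts=10, seed=0):
    """Shuffle the negatives and split them into k near-equal parts.

    k is the negative/positive ratio, capped at `max_parts` (10 in the text).
    """
    k = max(1, min(max_parts, len(negatives) // len(positives)))
    shuffled = list(negatives)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def balanced_training_sets(positives, negatives, max_parts=10, seed=0):
    """Pair every part of the negatives with the full positive set."""
    return [(list(positives), part)
            for part in split_negatives(positives, negatives, max_parts, seed)]


def ensemble_score(models, x):
    """Average the sub-model outputs (one possible aggregation rule)."""
    return sum(model(x) for model in models) / len(models)


positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(45)]
sets = balanced_training_sets(positives, negatives)
print(len(sets), [len(neg) for _, neg in sets])  # 4 [12, 11, 11, 11]
```

Each `(positives, part)` pair would then be used to train one random forest, and the forests' outputs combined at prediction time.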
A random forest is an ensemble classification model made up of many decision trees. Each decision tree is trained on a subset of the samples, and its paths from the root node to the leaf nodes indicate how the conditions θ(xi) on the values of different features should be combined according to the weights w to classify the selected subset of samples. The random forest model then predicts the classification vector y by integrating a series of decision trees:
[Equation image not available in this text: random forest prediction of y aggregated over the decision trees]
Because the model integrates many features and has many adjustable parameters during construction, it is optimized step by step. First, because the RNA is fragmented, the influence of different RNA fragment lengths on model performance is tested. After the fragment length is adjusted, the features are screened. In a trained random forest model, the importance score of a single feature is expressed as its Gini importance (GI): the goodness of classification of each split κ on the feature in each tree is expressed as the Gini impurity i(κ), and the Gini impurities over all trees T are then summed to obtain the overall importance score of the feature:
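$$\mathrm{GI} = \sum_{T} \sum_{\kappa} i(\kappa)$$
where the inner sum runs over the splits $\kappa$ on the feature within a tree and the outer sum runs over all trees $T$.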
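To make the quantity behind Gini importance concrete, the sketch below computes the Gini impurity of a set of labels and the impurity decrease produced by one split. In common random forest implementations, a feature's Gini importance accumulates such decreases over all splits on that feature in all trees; the function names here are illustrative and no forest is trained.

```python
from collections import Counter


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease when `parent` is split into `left` and `right`."""
    n = len(parent)
    return (gini(parent)
            - len(left) / n * gini(left)
            - len(right) / n * gini(right))


# A perfect split of a balanced node removes all impurity.
print(gini([0, 0, 1, 1]), gini_decrease([0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5 0.5
```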
Next, the influence of different feature combinations on model performance is tested. The combinations include retaining all features, removing each group of RNA-related features in turn, and normalizing the small molecule physicochemical properties by molecular weight and then either retaining or removing molecular weight itself. After the feature combination is selected, the ratio of positive to negative fragments in the data set is adjusted. Because this ratio differs from one small molecule to another, model predictions can be biased; the ratio of negative to positive fragments for every small molecule is therefore brought to the same level by manipulating the negative fragments, and the ratio is doubled stepwise from 10:1 up to 640:1. Where a small molecule has too few negative fragments, the gap is filled with pseudo-negative fragments generated by randomly sampling existing negative fragments and randomly mutating one nucleotide; apart from the sequence, all other features of a pseudo-negative fragment are kept identical to those of the original negative fragment. Where a small molecule has surplus negative fragments, the RNA sequence fragments are clustered with the CD-HIT tool, both among the negative fragments and between negative and positive fragments; negative fragments similar to positive fragments are then preferentially retained according to the clustering result, and redundancy among the negative fragments is reduced so that the retained negative fragments are as representative as possible. Then, with the ratio of positive to negative fragments per small molecule controlled, the influence of different RNA fragment lengths on model performance is compared again. Finally, the number of classification trees in the random forest model is increased from 100 to 1000 in steps of 100, and the number of trees is selected by comparison.
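The parameter grids and the pseudo-negative padding described above can be sketched as follows. The CD-HIT based reduction of surplus negatives is not shown, only the sequence is handled (the text keeps the other features of the source fragment unchanged), and the function names and mutation alphabet are assumptions made for illustration.

```python
import random

# Grids stated in the text: negative:positive ratios 10:1 ... 640:1 (doubling)
# and 100 ... 1000 trees in steps of 100.
RATIOS = [10 * 2 ** i for i in range(7)]
TREE_COUNTS = list(range(100, 1001, 100))


def mutate_one(fragment, rng, alphabet="AUCG"):
    """Copy of `fragment` with exactly one randomly chosen nucleotide changed."""
    pos = rng.randrange(len(fragment))
    choices = [base for base in alphabet if base != fragment[pos]]
    return fragment[:pos] + rng.choice(choices) + fragment[pos + 1:]


def pad_negatives(negatives, target, seed=0):
    """Fill a shortage of negatives with single-mutation pseudo-negatives."""
    rng = random.Random(seed)
    padded = list(negatives)
    while len(padded) < target:
        padded.append(mutate_one(rng.choice(negatives), rng))
    return padded


print(RATIOS)       # [10, 20, 40, 80, 160, 320, 640]
print(TREE_COUNTS)  # [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(pad_negatives(["AUCGA", "GGCAU"], 6))
```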
4. Verification of created RNA-chemical small molecule interaction prediction method
5-fold cross-validation is performed on the training data set, and prediction performance is evaluated mainly by sensitivity, specificity and the Matthews correlation coefficient (MCC), which are defined as follows:
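$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.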
Since these evaluation indices depend on a specific classifier threshold, ROC curves are also plotted and the area under the curve (AUC) is used to evaluate the predictor comprehensively.
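The AUC is independent of any single threshold. A minimal sketch, using the rank-based (Mann-Whitney) formulation rather than explicitly tracing the ROC curve: it is the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The scores below are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    Equivalent to the probability that a randomly chosen positive is scored
    above a randomly chosen negative (ties count 0.5).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch but would be replaced by a sort-based computation on large data sets.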
The created method is run on a separate test data set to assess its accuracy.
All drug small molecule structure data were downloaded from DrugBank (https://www.drugbank.ca/), and the models obtained with the different parameter settings during optimization were applied to this drug library to screen for small molecules that may interact with AK098656. Five positive and five negative predictions were selected for further biological validation. The GE BIACORE intermolecular interaction analyzer was used to verify the predicted positive and negative results, because it accepts a wide range of sample types (including chemical small molecules and RNA), requires no molecular labeling, works in real time, and is highly sensitive (weak and transient molecular interactions can be monitored).
Example 2:
1. Collection and collation of RNA-chemical small molecule interaction data
A reliable, validated set of RNA-chemical small molecule interaction data is the basis for creating a computational method for screening RNA-targeting chemical small molecules. To this end, the applicant downloaded the relevant data from the PDB database and analyzed and collated them as the training data set. In addition, to verify the proposed prediction method, new RNA-chemical small molecule interaction pairs not included in the PDB were obtained from the SMMRNA (Small Molecule Modulators of RNA) database, new experimentally confirmed RNA-chemical small molecule interaction pairs were manually retrieved from the published literature, and the SMMRNA and literature results were used together as the independent test data set.
2. Calculation of RNA-related characteristics and small molecule physicochemical Properties
Based on the RNA-chemical small molecule interaction data of the training data set, RNA-related features such as nucleotide type, functional site, nucleotide distance sum (NDS) curve, nucleotide frequency and pairing status were extracted from the perspectives of sequence, structure and function. Structure files were extracted from the chemical small molecule structure data, and physicochemical properties were calculated, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), molar refractivity (MR), molecular weight (MW) and topological polar surface area (TPSA).
3. Creation of RNA-chemical Small molecule interaction prediction method
Because RNA fragments that do not interact with chemical small molecules far outnumber fragments that do, the computational method for predicting RNA-chemical small molecule interactions was established with a balanced random forest (BRF) model, in which the negative samples are divided into several parts and each part is paired with the positive samples. In addition, because the model integrates many features and has many adjustable parameters during construction, it was optimized step by step.
4. Verification of RNA-chemical small molecule interaction prediction method
To verify the accuracy of the RNA-chemical small molecule interaction prediction method, 5-fold cross validation is carried out on the training data set and the predictive performance of the random forest model is evaluated by its AUC value. The model is then run on the independent test data set and again evaluated by AUC. Finally, the method is applied to lncRNA-AK098656, a vascular smooth muscle-specific lncRNA previously discovered by the applicant, and 5 positive and 5 negative prediction results are selected for biological verification.
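For reference, the two evaluation ingredients named above can be sketched in plain Python. The AUC is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The function names and the fold assignment scheme are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    # Each fold serves once as the held-out part; the rest is used for training.
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]
```

The pairwise form is quadratic in the number of samples, which is acceptable for data sets of a few hundred pairs; a sort-based version would be preferable for larger ones.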
Example 3:
In support of research on computational screening of chemical small molecule drugs that target RNA, the applicant has created miREnvironment, a platform for miRNA-related environmental factors (mostly chemical small molecules) (Cui et al., Bioinformatics 2011). Small molecules that interfere with proteins generally act on the proteins' functional sites, so determining the functional sites of an RNA is an important basis for interfering with a target RNA. The applicant has successively proposed methods for predicting RNA functional sites, such as Rsite and Rsite2 (Cui et al., Scientific Reports 2015, 2016), SRAMP (Cui et al., Nucleic Acids Res 2016, m6A methylation site prediction) and PPUS (Cui et al., Bioinformatics 2015, pseudouridine site prediction). The applicant found that functional sites derived from RNA sequence and from RNA spatial structure show significant consistency and correlation (FIG. 1). This indicates that the RNA sequence contains information about RNA spatial structure, and further suggests that, given the extreme scarcity of RNA spatial structure data and the lack of a known RNA force field, RNA sequence features can be used to predict the chemical small molecules that interact with an RNA.
Example 4:
The vascular smooth muscle-specific lncRNA AK098656 (FIG. 2) was confirmed to be significantly elevated in the blood of hypertensive patients, and blood pressure rose significantly in rats into which the AK098656 gene had been transferred (Jin L et al., Hypertension 2018, 71(2): 262-.
Example 5:
The applicant collected and curated more than 300 RNA-chemical small molecule interaction pairs from the PDB database, and obtained more than 100 further pairs from the SMMRNA database and the literature. Analysis shows that some RNA sequence features, such as triplet frequency and Rsite2 sites, are associated with chemical small molecule interaction, and that some small molecule physicochemical properties, such as the octanol/water partition coefficient and topological polar surface area, are associated with RNA interaction. A prediction method, DRIP, was preliminarily constructed based on random forests. Five-fold cross validation gave an AUC of 0.818, and the AUC on the independent test data set derived from SMMRNA and the literature was 0.829 (FIG. 4), indicating that the method predicts RNA-chemical small molecule interactions with reasonable accuracy.

Claims (8)

1. A computer screening method for chemical small molecule drugs targeting RNA, characterized by comprising the following steps: (1) collecting and collating a data set, (2) mining features for training a prediction method, (3) creating a prediction method and model, and (4) verifying the prediction method and model; wherein,
the step (1) of collecting and collating the data set comprises the following steps:
(a) retrieving from the PDB database structures consisting only of RNA and small molecules, and extracting from them the corresponding information, including whether the RNA and the small molecule interact and the specific positions at which they interact, for use as a training data set; the training data set is screened in turn by a first, a second and a third screening condition; wherein,
first screening condition: if all the small molecules contained in a PDB structure are metal ions or solvent molecules from the buffer solutions used in structural biology research, or if the RNA chain contained in the PDB structure is no longer than 20 nucleotides, the structure is not retained;
second screening condition: RNA-small molecule interaction information is extracted from the PDB structure, using 4.0 angstroms as the threshold for judging interaction between a small molecule and the RNA; if the shortest distance between atoms of the RNA and atoms of the small molecule is less than 4.0 angstroms, the two are considered to interact and the subsequent operations are carried out;
third screening condition: for the one or more small molecules that interact with the RNA chain in each PDB structure, the small molecules are ranked by the Euclidean distance between their physicochemical properties and those of the other small molecules contained in the structure, and the intersection of the small molecules falling within 80-90% of the physicochemical Euclidean distance in descending order is selected;
(b) collecting RNA-small molecule interaction data outside the PDB database from the SMMRNA database and literature reports as a test data set;
the step (2) of mining features for training a prediction method comprises the following steps:
(a) extracting RNA sequence fragment related characteristics;
(b) calculating the physicochemical properties of the small molecules;
the step (3) of creating a prediction method and model comprises the following steps: creating a balanced random forest model configured to take as input RNA sequence fragment-related features and physicochemical property features of small molecules;
and training the random forest model according to the training data set.
2. The computer screening method for chemical small molecule drugs targeting RNA according to claim 1, characterized in that: the step (2) of mining features for training a prediction method comprises the following steps:
(a) the extracted RNA sequence fragment-related features comprise sequence-, structure- and function-related features;
(b) the calculated physicochemical properties of the small molecules comprise the number of hydrogen bond acceptors, the number of hydrogen bond donors, the octanol/water partition coefficient, molar refractivity, molecular weight and topological polar surface area.
3. The computer screening method for chemical small molecule drugs targeting RNA according to claim 2, characterized in that: the related features include: nucleotide class, functional site, nucleotide distance sum (NDS) curve, nucleotide frequency and pairing state.
4. The computer screening method for chemical small molecule drugs targeting RNA according to claim 1, characterized in that: the RNA-chemical small molecule interaction prediction method is established with a balanced random forest model as follows: the negative samples in the training data set are divided into a plurality of subsets to reduce the difference in number between each negative subset and the positive samples, each subset is paired with the positive samples for model training, and the output results of the models are combined.
5. The computer screening method for chemical small molecule drugs targeting RNA according to claim 1, characterized in that: the step (4) of verifying the prediction method and model comprises: evaluating the performance of the model obtained in step (3).
6. The computer screening method for chemical small molecule drugs targeting RNA according to claim 5, characterized in that: the performance evaluation comprises: cross validation using the training data set and/or independent validation using the test data set.
7. The computer screening method for chemical small molecule drugs targeting RNA according to claim 5, characterized in that: the performance evaluation comprises: selecting 5 positive and 5 negative prediction results for biological validation.
8. Use of the computer screening method for chemical small molecule drugs targeting RNA according to claim 1 in the following fields: in a high-throughput screening platform; and/or in computer screening of compounds with RNA as the target; and/or in the PDB database; and/or in the SMMRNA database; and/or in the miRNA-based environmental factor platform miREnvironment; and/or in targeted drug screening; and/or in the prevention and treatment of major diseases.
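The 4.0 angstrom contact criterion of the second screening condition in claim 1 can be illustrated with a short sketch. It assumes atom coordinates have already been parsed from a PDB structure into (x, y, z) tuples in angstroms; the function names and the per-residue dictionary layout are hypothetical and are not part of the claimed method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Coord = Tuple[float, float, float]
CUTOFF_ANGSTROM = 4.0  # threshold stated in the second screening condition


def min_distance(rna_atoms: Sequence[Coord], ligand_atoms: Sequence[Coord]) -> float:
    """Shortest distance between any RNA atom and any small molecule atom."""
    return min(math.dist(a, b) for a in rna_atoms for b in ligand_atoms)


def interacts(rna_atoms: Sequence[Coord], ligand_atoms: Sequence[Coord],
              cutoff: float = CUTOFF_ANGSTROM) -> bool:
    """True if the closest atom pair is strictly nearer than the cutoff."""
    return min_distance(rna_atoms, ligand_atoms) < cutoff


def contact_residues(residues: Dict[int, List[Coord]], ligand_atoms: Sequence[Coord],
                     cutoff: float = CUTOFF_ANGSTROM) -> List[int]:
    """Indices of nucleotides with at least one atom within the cutoff of the ligand."""
    return sorted(i for i, atoms in residues.items() if interacts(atoms, ligand_atoms, cutoff))
```

The per-residue variant gives the specific interaction positions on the RNA, which is the information the training data set records alongside the interaction itself.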
CN201810573816.1A 2018-06-06 2018-06-06 Computer screening method of chemical small molecule drug of target RNA Active CN108959843B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810573816.1A CN108959843B (en) 2018-06-06 2018-06-06 Computer screening method of chemical small molecule drug of target RNA
PCT/CN2018/090267 WO2019232748A1 (en) 2018-06-06 2018-06-07 Computer screening method for chemical small molecule medication targeting rna

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810573816.1A CN108959843B (en) 2018-06-06 2018-06-06 Computer screening method of chemical small molecule drug of target RNA

Publications (2)

Publication Number Publication Date
CN108959843A CN108959843A (en) 2018-12-07
CN108959843B true CN108959843B (en) 2021-07-06

Family

ID=64493024

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810573816.1A Active CN108959843B (en) 2018-06-06 2018-06-06 Computer screening method of chemical small molecule drug of target RNA

Country Status (2)

Country Link
CN (1) CN108959843B (en)
WO (1) WO2019232748A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111081316A (en) * 2020-03-25 2020-04-28 元码基因科技(北京)股份有限公司 Method and device for screening new coronary pneumonia candidate drugs

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222178A (en) * 2011-03-31 2011-10-19 清华大学深圳研究生院 Method for screening and/or designing medicines aiming at multiple targets
CN107075515A (en) * 2013-11-22 2017-08-18 米纳治疗有限公司 C/EBP α compositions and application method
CN107058521A (en) * 2017-03-17 2017-08-18 中国科学院北京基因组研究所 A kind of detecting system for detecting human immunity state
CN107893078A (en) * 2017-11-28 2018-04-10 西安交通大学 Target siRNA, expression vector and virion and its pharmacy application of synaptotagmin 11

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587510A (en) * 2008-05-23 2009-11-25 中国科学院上海药物研究所 Method for predicting compound carcinogenic toxicity based on complex sampling and improvement decision forest algorithm
US20100138205A1 (en) * 2008-10-10 2010-06-03 Los Alamos National Security, Llc Stochastic molecular binding simulation
CN106548196A (en) * 2016-10-20 2017-03-29 中国科学院深圳先进技术研究院 A kind of random forest sampling approach and device for non-equilibrium data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
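Small molecules against RNA targets attract big backers; Asher Mullard et al.; Nature Reviews Drug Discovery; 20171128; Vol. 16; pp. 813-815 *
Prediction and virtual screening of inhibitors of the breast cancer target protein HEC1 based on molecular descriptors and machine learning methods; He Bing et al.; Acta Physico-Chimica Sinica (《物理化学学报》); 20150930; Vol. 31 (No. 9); pp. 1795-1802 *
Analysis and modeling of miRNA-environmental factor interactions based on network pharmacology; Cui Qinghua et al.; 《中国药理通讯》; 20121231; Vol. 29 (No. 3); p. 18 *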

Also Published As

Publication number Publication date
WO2019232748A1 (en) 2019-12-12
CN108959843A (en) 2018-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20201027
Address after: Fc108-05, basement 1, building 1, yard 13, Dazhongsi, Haidian District, Beijing 100098
Applicant after: Beijing Jianmu Technology Co., Ltd
Address before: 100191 Peking University Health Science Center, Haidian District, Xueyuan Road, 38, Beijing
Applicant before: Peking University
GR01 Patent grant