CN108959843B - Computer screening method for chemical small molecule drugs targeting RNA

Info

Publication number: CN108959843B
Application number: CN201810573816.1A
Authority: CN
Inventors: 崔庆华; 周源; 曾攀
Original assignee: Beijing Jianmu Technology Co Ltd
Current assignee: Beijing Jianmu Technology Co Ltd
Priority date: 2018-06-06
Filing date: 2018-06-06
Publication date: 2021-07-06
Anticipated expiration: 2038-06-06
Also published as: WO2019232748A1; CN108959843A

Abstract

The invention discloses a computer screening method for chemical small molecule drugs targeting RNA, which comprises the following steps: (1) collecting and sorting a training data set, (2) mining the features used by the prediction method, (3) creating a prediction method and model, and (4) verifying the prediction method and model. The invention can be used for computer screening of chemical small molecules targeting RNA, and provides a new solution for RNA-based prevention and treatment of major diseases.

Description

Computer screening method for chemical small molecule drugs targeting RNA

Technical Field

The invention relates to a computer screening method for drugs, and in particular to a computer screening method for chemical small molecule drugs targeting RNA.

Background

Genes (DNA) are the carriers of genetic material and direct the construction of proteins, which are regarded as the molecules that ultimately perform specific biological functions, while RNA has been regarded as an intermediate molecule linking DNA and protein. Traditionally, therefore, most attention has been focused on proteins (including protein-coding DNA), and little on RNA. Traditional drug development is mainly based on protein targets; for example, more than 95% of the drugs recorded in the DrugBank database have proteins as their targets. However, most proteins are not druggable, and only about 400 proteins can be targeted so far, so developing drugs that target other kinds of molecules is an urgent priority for disease prevention and treatment. In recent years, with the Human Genome Project and the ENCODE project, it has been surprisingly found that in humans protein-coding DNA accounts for only about 2% of total DNA, and that most of the remaining 98% is transcribed into RNA but not translated into protein; such RNA is called non-coding RNA (ncRNA). With the rapid development of high-throughput technologies such as RNA-Seq, a large number of non-coding RNAs have been found; in humans, for example, thousands of miRNAs (microRNAs) and tens of thousands of long non-coding RNAs (lncRNAs) have been found. Research shows that these RNA molecules have important biological functions and are closely related to disease. Even messenger RNA is not limited to relaying information between DNA and protein, but has various important functions at the RNA level. RNA is therefore increasingly recognized as a potential key target for disease intervention, and the development of RNA-targeting drugs is attracting wide attention.

One large class of molecules with potential as RNA-targeting drugs is RNA or DNA itself (referred to herein as nucleic acids, to distinguish them from "RNA targets"), such as small interfering RNAs (siRNAs), antisense oligonucleotides (ASOs), miRNAs, and aptamers. For example, Roche's nucleic acid drug Miravirsen for the treatment of hepatitis C, which targets the human liver-specific miRNA miR-122, has entered phase 2 clinical trials. However, nucleic acid drugs have inherent disadvantages, such as off-target effects, a tendency to provoke immune reactions as exogenous macromolecules, poor stability, and difficulty entering cells. These disadvantages, especially the latter two, severely limit the druggability of nucleic acids. For example, siRNA is degraded within as little as a few minutes of entering the blood circulation; this poor stability is one of the major obstacles to nucleic acid drug development. In addition, over hundreds of millions of years of evolution, cells have evolved lipid bilayer membranes to resist invasion by harmful external substances; these membranes prevent exogenous nucleic acids from entering cells and so make it difficult to regulate the target RNA, which is another main obstacle to developing nucleic acids into drugs. Thus, while intensive research into nucleic acid-based RNA-targeting drugs continues, the international scientific community has also begun to turn its attention to other possible RNA-targeting strategies, among which chemical small molecules are beginning to stand out. In drug development, chemical small molecules are organic molecules with molecular weights below 900 daltons.

Chemical small molecules are stable and enter cells easily, which largely overcomes the disadvantages of nucleic acid drugs, and historically small molecules have been used successfully to target RNA; streptomycin and tetracycline, for example, target bacterial RNA. However, a major bottleneck that seriously hinders the field at present is the lack of computational methods for screening chemical small molecules that target RNA. Research groups internationally, including the applicant's, have made attempts in this area, such as miREnvironment, a bioinformatics database and prediction platform for miRNA-environmental factor (mostly chemical small molecule) associations based on the miRNA or mRNA transcriptome, and SM2miR, a database of small molecule-miRNA associations with an accompanying prediction algorithm. These methods, however, essentially predict "functional" associations between miRNAs and small molecules and are not true predictions of drugs targeting miRNAs. The Kuntz laboratory has attempted to apply the "protein-small molecule" docking software DOCK 6.0 to "RNA-small molecule" docking, but the approach has significant drawbacks: 1) it depends on the RNA tertiary structure, yet most RNA tertiary structures are unknown, and RNA tertiary structure is less rigid and more flexible than protein tertiary structure; 2) DOCK 6 was designed for "protein-small molecule" docking, and the physicochemical properties and force field parameters of RNA differ greatly from those of proteins, so DOCK 6 cannot simply be used for RNA.

Recently, the Disney laboratory first experimentally identified chemical small molecules that bind small RNA fragments containing hairpin loops and bulges, and then used these interaction data to design the prediction algorithm Inforna. However, the algorithm is suitable only for small RNA fragments, not for large RNA molecules, which are more numerous and more complex and whose mechanisms of action differ from those of small RNAs. In addition, because the Inforna data, program, and server have not been made public, its accuracy is unclear. In summary, each of these preliminary attempts has shortcomings, the problem of screening RNA-targeting drugs is far from solved, and new computational methods are needed.

From the above analysis, the spatial structure and force field of the molecules appear indispensable for screening directly targeted molecular drugs, yet few RNA molecules have known spatial structures and the RNA force field is not well characterized; this seems to be a contradiction that is difficult to reconcile.

Disclosure of Invention

In view of the shortcomings of the prior art, the invention aims to provide a computer screening method for chemical small molecule drugs targeting RNA. The method uses information derived from the RNA sequence together with the physicochemical properties of chemical small molecules to construct a random forest model, and helps to screen chemical small molecules targeting RNA more conveniently and effectively. In the present invention, chemical small molecules are organic molecules with a molecular weight below 900 daltons.

The purpose of the invention is realized by the following technical scheme:

A computer screening method for chemical small molecule drugs targeting RNA comprises the following steps: (1) collecting and sorting a data set, (2) mining the features used for training the prediction method, (3) creating the prediction method and model, and (4) verifying the prediction method and model.

Preferably, the step of collecting and collating the data set comprises the steps of:

(a) retrieving structures consisting only of RNA and small molecules from the PDB (Protein Data Bank) database, and extracting from them the corresponding information, including whether the RNA and the small molecules interact and the specific positions at which they interact, for use as a training data set;

(b) collecting RNA-small molecule interactions not included in the PDB database from the SMMRNA (Small Molecule Modulators of RNA) database and from literature reports, for use as a test data set.

Preferably, mining the features used for training the prediction method comprises the following steps:

(a) extracting RNA-related features from the sequence, structure and function perspectives;

(b) calculating the physicochemical properties of the small molecules, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), the molar refractivity (MR), the molecular weight (MW), and the topological polar surface area (TPSA).

Preferably, the RNA-related features include: nucleotide class, functional site, nucleotide distance and NDS (nucleotide distance) curve, nucleotide frequency, and pairing status.

Creating the prediction method and model comprises the following step: adopting a Balanced Random Forest (BRF) model to establish a calculation method for RNA-chemical small molecule interaction prediction.

Since small molecules usually bind only to local regions of RNA, the RNA is first divided into fragments. Among the resulting fragments, however, those bound by small molecules (positive samples) are far fewer than unbound fragments (negative samples), so a Balanced Random Forest (BRF) model is used to create the calculation method for RNA-chemical small molecule interaction prediction.

Preferably, creating the calculation method for RNA-chemical small molecule interaction prediction with the balanced random forest model comprises: dividing the negative samples in the training data set into several parts so as to reduce the difference in number between each part of negative samples and the positive samples, pairing each part with the positive samples for model training, and aggregating the outputs of the resulting models.

Verifying the prediction method and model comprises the following step: evaluating the performance of the model obtained in step (3).

Preferably, the performance evaluation comprises: cross-validation using the training data set and/or independent validation using the test data set.

Preferably, the performance evaluation comprises: selecting 5 positive and 5 negative predictions for biological validation.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA in a high-throughput screening platform.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA in computer screening of compounds with RNA as the target.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA with the PDB database.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA in the following fields: in a high-throughput screening platform; and/or in computer screening of compounds with RNA as the target; and/or with the PDB database; and/or with the SMMRNA database; and/or with the miRNA-environmental factor platform miREnvironment; and/or in targeted drug screening; and/or in the prevention and treatment of major diseases.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA in targeted drug screening. Using the method, the chemical small molecules kaempferol and quercetin are predicted to target lncSHGL.

The invention also provides the use of the computer screening method for chemical small molecule drugs targeting RNA in the prevention and treatment of major diseases. The applicant previously discovered a new lncRNA, lncSHGL, which plays a key role in hepatic glucose and lipid metabolism and is a new drug target for intervening in metabolic diseases such as fatty liver and diabetes. Using the method, kaempferol and quercetin are predicted to bind lncSHGL, so these two chemical small molecules are potential drugs for the prevention and treatment of fatty liver and diabetes.

The beneficial effects of the invention are as follows:

Addressing the important problem of screening chemical small molecule drugs for RNA, a novel class of disease intervention targets, and in view of limitations such as the scarcity of RNA spatial structure data, the flexibility of RNA structure, and the unknown RNA force field, the invention creates a machine learning-based (random forest) calculation method for screening chemical small molecules targeting RNA, built on an analysis of RNA sequence features and small molecule physicochemical properties. The invention can be used for computer screening of chemical small molecules targeting RNA, and provides a new solution for RNA-based prevention and treatment of major diseases.

The invention provides a new idea, a new strategy and a new method for screening RNA-targeting drugs.

Description of the drawings:

FIG. 1. Nucleotide distances calculated from RNA sequences (the sequence is first used to predict the secondary structure, from which distances are then calculated) are highly correlated with nucleotide distances calculated from the spatial structure;

FIG. 2. AK098656 is expressed with high specificity in vascular smooth muscle cells;

FIG. 3. After AK098656 gene transfer, both systolic pressure (a) and diastolic pressure (b) of rats are significantly increased;

FIG. 4. Cross-validation results of the created calculation method (a) and test results on the independent data sets derived from SMMRNA and the literature (b).

Detailed Description

The following examples and experimental examples are intended to illustrate the present invention, but are not intended to limit the scope of the present invention. The present invention will be further described with reference to specific examples and experimental examples.

Example 1:

1. Collection and collation of RNA-chemical small molecule interaction data

1) Training data set

Structures consisting only of RNA chains and small molecules are retrieved from the PDB database, and the downloaded PDB structure data are cleaned to serve as the source of the training data set. A PDB structure is discarded if all the small molecules it contains are metal ions or solvent molecules from buffers commonly used in structural biology, or if the RNA chain it contains is no longer than 20 nucleotides. Next, RNA-small molecule interaction information is extracted from the retained PDB structures. Since 4.0 angstroms is approximately the crossover point between the weakest hydrogen bonds and the strongest van der Waals interactions, 4.0 angstroms is taken as the threshold for judging whether a small molecule interacts with RNA: an interaction is considered to exist if the closest distance between an atom of the nucleotide and an atom of the small molecule is less than 4.0 angstroms. Because the PDB structures used as the source of the training data set contain few RNA-small molecule pairs without interaction, non-interacting pairs are generated artificially. First, the small molecules involved in all PDB structures are collected and the Euclidean distances between their physicochemical properties are calculated. Then, for each PDB structure, the remaining small molecules are ranked by their Euclidean distance in physicochemical properties to each of the one or more small molecules that interact with the RNA chain in that structure. To minimize the chance of generating false-negative RNA-small molecule pairs, the intersection of the small molecules whose distance ranking falls between the 80th and 90th percentiles is selected to form the artificial non-interacting RNA-small molecule pairs.
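The two selection rules in this paragraph can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the input layout (lists of atomic coordinates, a dictionary of property vectors) and the reading of "between the 80th and 90th percentiles" as a slice of the distance ranking are all assumptions made for illustration.

```python
import math

CUTOFF = 4.0  # angstroms, the interaction threshold stated in the text


def interacting_nucleotides(nucleotide_atoms, ligand_atoms, cutoff=CUTOFF):
    """Indices of nucleotides with any atom closer than `cutoff` to the ligand.

    nucleotide_atoms: one list of (x, y, z) tuples per nucleotide (assumed layout).
    ligand_atoms: list of (x, y, z) tuples for the small molecule.
    """
    hits = []
    for idx, atoms in enumerate(nucleotide_atoms):
        closest = min(math.dist(a, b) for a in atoms for b in ligand_atoms)
        if closest < cutoff:  # strictly less than 4.0 angstroms
            hits.append(idx)
    return hits


def decoy_candidates(bound, others, props, lo=0.8, hi=0.9):
    """Molecules lying in the lo..hi band of the distance ranking for every bound ligand.

    props maps a molecule name to its vector of physicochemical properties.
    Slicing the ranking by quantile is an assumed reading of the text.
    """
    selected = None
    for ligand in bound:
        ranked = sorted(others, key=lambda m: math.dist(props[m], props[ligand]))
        band = set(ranked[int(lo * len(ranked)):int(hi * len(ranked))])
        selected = band if selected is None else selected & band
    return selected or set()
```

Whether the property vectors are scaled before the Euclidean distance is taken is not stated in the text; the sketch uses them as given.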

2) Independent test data set

RNA-small molecule interactions and possible non-interacting RNA-small molecule pairs are collected manually from the literature as a test data set, and new RNA-small molecule interaction data not included in the PDB database are obtained from the SMMRNA database.

2. Calculation of RNA-related characteristics and small molecule physicochemical properties

In one aspect, RNA-related features are extracted from the sequence, structure and function perspectives. Specifically, the following features are extracted for each nucleotide in turn:

(1) the nucleotide species itself (A, U, C, G and N);

(2) whether it forms a pair with another nucleotide;

(3) whether it is a functional site predicted by the Rsite2 algorithm previously proposed by the applicant;

(4) the normalized NDS value (NNDS) of this nucleotide, i.e. its normalized geometric distance in the secondary structure:

$$\mathrm{NNDS} = \frac{\sum_{j} \mathrm{dist}(nt_i - nt_j)}{\sum_{j} \mathrm{dist}(nt_{centroid} - nt_j)}$$

where nt_i, nt_j and nt_centroid are the coordinate vectors of the nucleotide under consideration, of any nucleotide in the RNA, and of the RNA centroid, respectively; the Euclidean distance is used when calculating nucleotide distances.
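As an illustration of the NNDS formula, the following Python sketch computes the score from a list of nucleotide coordinates. The function name and the input layout are hypothetical, and the centroid is assumed to be the arithmetic mean of the nucleotide coordinates; the text does not specify how the coordinates are derived from the secondary structure.

```python
import math


def nnds(coords, i):
    """Normalized nucleotide distance score for nucleotide i.

    coords: one coordinate tuple per nucleotide, e.g. (x, y) positions in a
    secondary-structure layout (assumed input; dimension is not fixed here).
    """
    dim = len(coords[0])
    # Assumption: the RNA centroid is the mean of all nucleotide coordinates.
    centroid = tuple(sum(c[k] for c in coords) / len(coords) for k in range(dim))
    numerator = sum(math.dist(coords[i], c) for c in coords)
    denominator = sum(math.dist(centroid, c) for c in coords)
    if denominator == 0:
        raise ValueError("all nucleotides share the same coordinates")
    return numerator / denominator
```

For three collinear nucleotides at x = 0, 2 and 4, the middle nucleotide coincides with the centroid and scores 1, while the terminal nucleotides score higher, so larger values indicate more peripheral positions.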

Subsequently, because the RNA is fragmented, features (1) to (3) are placed into the feature vector of the corresponding fragment, and feature (4) is averaged over the fragment and assigned to it. Whether a fragment interacts with the small molecule is determined by whether the nucleotide at the center of the fragment interacts with the small molecule. Positions in a fragment that extend beyond either end of the RNA sequence are missing values and are filled by default: feature (1) with N, features (2) and (3) with their respective default values, and feature (4) with the normalized NDS value of the first or last nucleotide. In addition, the frequencies of single nucleotides and of nucleotide triplets are counted within each fragment. The RNA secondary structure used to determine nucleotide pairing status comes from several sources: extraction from the PDB structure using RNApdbee (http://rnapdbee.cs.put.poznan.pl/), manual annotation according to the relevant literature, and prediction from the RNA sequence using RNAfold.
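The fragmentation and frequency counting can be sketched as follows. This is a simplified illustration covering only the sequence feature: the function names are hypothetical, and the window length of 7 is an arbitrary placeholder, since the fragment length is a parameter that the method tunes.

```python
from collections import Counter


def fragments(seq, length=7, pad="N"):
    """One fixed-length window centred on each nucleotide of seq.

    Positions beyond either end of the sequence are filled with `pad`,
    following the default fill for the nucleotide-species feature.
    """
    if length % 2 == 0:
        raise ValueError("length must be odd so that a central nucleotide exists")
    half = length // 2
    padded = pad * half + seq + pad * half
    return [padded[i:i + length] for i in range(len(seq))]


def kmer_frequencies(fragment, k):
    """Relative frequency of each k-mer (k=1 for nucleotides, k=3 for triplets)."""
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}
```

Each fragment would then be labelled positive or negative according to whether its central nucleotide interacts with the small molecule.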

In another aspect, the chemical small molecule structure files include Structure Data Format (SDF) files obtained directly from the PDB database and Simplified Molecular Input Line Entry System (SMILES) files retrieved from NCBI's PubChem database (https://pubchem.ncbi.nlm.nih.gov/). The Open Babel software package is then used to calculate the physicochemical properties of each small molecule from its structure file, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), the molecular weight (MW), the molar refractivity (MR), and the topological polar surface area (TPSA). These indices are obtained either directly by counting or by integrating the physicochemical properties of known small molecule fragments. For example, for a small molecule containing n fragments, the TPSA contribution of each fragment can be looked up and the total obtained as a sum weighted by the number of occurrences of each fragment.
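The weighted summation described for TPSA can be written out as below. This is a reconstruction from the wording of the paragraph, not a formula taken from the text; the symbols $n_k$ (number of occurrences of fragment type $k$) and $\mathrm{TPSA}_k$ (tabulated contribution of fragment type $k$) are notation assumed here.

$$\mathrm{TPSA} = \sum_{k=1}^{n} n_k \cdot \mathrm{TPSA}_k$$

Under this reading, a molecule with two occurrences of a fragment contributing 20 and one occurrence of a fragment contributing 9 (both values in square angstroms) would have a TPSA of $2 \times 20 + 1 \times 9 = 49$.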

3. Method for creating RNA-chemical small molecule interaction prediction

1) Calculating RNA-chemical small molecule interaction tendency fraction

Since RNA interacts with chemical small molecules only locally, the applicant proposes fragmenting the RNA. The RNA-related features input into the model are therefore obtained from RNA sequence fragments, and the model directly predicts whether each RNA sequence fragment interacts with the chemical small molecule; the fragment-level predictions are then integrated at the level of the RNA molecule to give an overall assessment of the tendency of the RNA molecule to interact with the chemical small molecule. To do so, the fragments in the RNA sequence predicted to possibly interact with the chemical small molecule are first identified. For each such fragment, the proportion of fragments predicted to interact is calculated among the fragment itself and up to 5 adjacent fragments on each of its left and right sides. The fragments are then sorted by this proportion, and the ratio of the mean proportion to the mean distance between the fragment centers in the sequence is calculated. The higher this ratio, the more densely the RNA sequence fragments that can interact with the chemical small molecule are distributed on the RNA molecule. This interaction tendency score is called the DRIP (Drug-RNA Interaction Predictor) score.

2) Creating RNA-chemical small molecule interaction prediction models

Because fragments in the data set that do not interact with small molecules far outnumber those that do, a Balanced Random Forest (BRF) model is adopted, in which the negative samples are divided into several parts and each part is paired with the positive samples. The difference in number between the negative samples in each part and the positive samples is kept as small as possible, and, to avoid excessively increasing model complexity, the negative samples are divided into at most 10 parts. The random forest model is constructed using the R package randomForest.
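The partitioning scheme can be sketched in Python as below. The sketch is written in Python for illustration only (the model itself is built with the R package randomForest), the function names are hypothetical, and two details are assumptions: the number of parts is chosen as the rounded negative-to-positive ratio capped at 10, and the per-part model outputs are combined by simple averaging.

```python
import random


def split_negatives(negatives, n_positive, max_parts=10, seed=0):
    """Shuffle the negatives and deal them into near-equal parts.

    The part count is chosen so that each part is as close in size to the
    positive set as possible, capped at max_parts (10 in the text).
    """
    if n_positive <= 0:
        raise ValueError("need at least one positive sample")
    negatives = list(negatives)
    n_parts = max(1, min(max_parts, round(len(negatives) / n_positive)))
    random.Random(seed).shuffle(negatives)
    return [negatives[i::n_parts] for i in range(n_parts)]


def ensemble_score(models, x):
    """Average the scores of the per-part models for one sample x.

    Each model is any callable returning a score; averaging is an assumed
    way of aggregating the outputs of the models.
    """
    return sum(model(x) for model in models) / len(models)
```

Each part returned by `split_negatives` would be combined with the full positive set to train one random forest, and `ensemble_score` then aggregates the resulting forests at prediction time.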

A random forest is an ensemble classification model made up of multiple decision trees. Each decision tree is trained on a subset of the samples, and each path from the root node to a leaf node indicates how the value conditions θ(x_i) of different features should be combined, according to the weights w, to classify the selected subset of samples. Finally, the random forest model predicts the classification vector y by integrating this series of decision trees.

Given the many features integrated in the model and the many adjustable parameters in its construction, optimization is carried out step by step. First, because the RNA is fragmented, the influence of different RNA sequence fragment lengths on model performance is tested. After the fragment length has been adjusted, the features are screened. In a trained random forest model, the importance score of a single feature is expressed as its Gini importance (GI): the goodness of classification of each split κ on that feature in each tree is expressed as the Gini impurity i(κ), and the Gini impurities over all trees T are then summed to obtain the overall importance score of the feature.
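The summation over splits and trees can be written compactly as below. This is a reconstruction from the wording of the paragraph, using the symbols κ, T and i(κ) from the text and an assumed feature index f; in the usual definition of Gini importance, the quantity summed for each split is the decrease in Gini impurity that the split produces.

$$\mathrm{GI}(f) = \sum_{T} \; \sum_{\kappa \in T,\ \kappa \text{ splits on } f} i(\kappa)$$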

Next, the influence of different feature combinations on model performance is tested. The combinations comprise: retaining all features; removing each group of RNA-related features in turn; and normalizing the small molecule physicochemical properties by molecular weight and then either retaining or removing the molecular weight itself. After the feature combination has been selected, the proportion of positive and negative fragments in the data set is adjusted. Because the ratio of negative to positive fragments differs from one small molecule to another, which biases the model's predictions, the ratio for every small molecule is brought to the same level by manipulating the negative fragments (those that do not interact with the small molecule); the ratio is doubled step by step from 10:1 up to 640:1. When a small molecule has too few negative fragments, the gap is filled with pseudo-negative fragments, each generated by randomly sampling an existing negative fragment and randomly mutating one of its nucleotides; apart from the sequence, all features of an artificial pseudo-negative fragment are kept identical to those of the original negative fragment. When a small molecule has a surplus of negative fragments, the CD-HIT tool is used to cluster the RNA sequence fragments, both among the negative fragments and between negative and positive fragments; based on the clustering result, negative fragments similar to positive fragments are preferentially retained and redundancy among the negative fragments is reduced, so that the retained negative fragments are as representative as possible. Then, with the ratio of positive to negative fragments for each small molecule controlled, the influence of different RNA sequence fragment lengths on model performance is compared again. Finally, the number of classification trees in the random forest model is increased from 100 to 1000 in steps of 100, and the best number of trees is selected by comparison.
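The parameter schedules and the pseudo-negative generation described above can be sketched in Python as follows. The names are hypothetical, the mutation alphabet is assumed to be A, C, G and U, and only the sequence is handled here; the clustering step with CD-HIT is not reproduced.

```python
import random

# Negative:positive ratios tested, doubling from 10:1 to 640:1.
NEG_TO_POS_RATIOS = [10 * 2 ** k for k in range(7)]
# Numbers of classification trees tested, 100 to 1000 in steps of 100.
TREE_COUNTS = list(range(100, 1001, 100))


def pseudo_negative(fragment, rng, alphabet="ACGU"):
    """Copy of a negative fragment with exactly one nucleotide changed."""
    pos = rng.randrange(len(fragment))
    choices = [base for base in alphabet if base != fragment[pos]]
    return fragment[:pos] + rng.choice(choices) + fragment[pos + 1:]


def fill_negatives(negatives, target, seed=0):
    """Top up a list of negative fragments to `target` entries.

    Extra entries are pseudo-negatives derived from randomly sampled
    existing negatives; the original fragments are kept unchanged.
    """
    if not negatives:
        raise ValueError("at least one real negative fragment is required")
    rng = random.Random(seed)
    filled = list(negatives)
    while len(filled) < target:
        filled.append(pseudo_negative(rng.choice(negatives), rng))
    return filled
```

In a full implementation, each pseudo-negative would also inherit the non-sequence features of the fragment it was derived from, as the text specifies.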

4. Verification of created RNA-chemical small molecule interaction prediction method

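5-fold cross-validation is performed on the training data set. Prediction performance is evaluated mainly by sensitivity, specificity and the Matthews correlation coefficient (MCC), which are defined from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Since these evaluation indices depend on a specific classifier threshold, ROC curves are also plotted and the area under the curve (AUC) is used to evaluate the predictor comprehensively.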
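The evaluation indices can be computed as in the following Python sketch. The function names and the default threshold of 0.5 are illustrative assumptions; the AUC is computed here through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half), which equals the area under the ROC curve.

```python
import math


def confusion_metrics(labels, scores, threshold=0.5):
    """Sensitivity, specificity and MCC at one classifier threshold.

    labels: 1 for interacting, 0 for non-interacting; scores: model outputs.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


def auc(labels, scores):
    """Threshold-free AUC from pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is adequate for a sketch; a sorted-rank implementation would be preferable for large data sets.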

The created method is run on the independent test data set to assess its accuracy.

Structure data for all small molecule drugs are downloaded from DrugBank (https://www.drugbank.ca/), and the models obtained with the different parameter settings during optimization are applied to this library to screen for small molecules that may interact with AK098656. Five positive and five negative predictions are selected for further biological validation. The GE BIACORE molecular interaction analyzer is used to verify the predicted positive and negative results, because it accepts a wide range of sample types (including chemical small molecules and RNA), requires no molecular labeling, works in real time, and is highly sensitive (it can detect weak and transient molecular interactions).

Example 2:

1. Collection and collation of RNA-chemical small molecule interaction data

A set of reliable, experimentally verified RNA-chemical small molecule interaction data is the basis for creating a calculation method for screening chemical small molecules targeting RNA. To this end, the applicant downloaded the relevant data from the PDB database and analyzed and collated them as a training data set. In addition, to verify the proposed prediction method, new RNA-chemical small molecule interaction pairs not included in the PDB were obtained from the SMMRNA (Small Molecule Modulators of RNA) database, new experimentally confirmed RNA-chemical small molecule interaction pairs were retrieved manually from the published literature, and the SMMRNA and literature results were used together as an independent test data set.

2. Calculation of RNA-related features and small molecule physicochemical properties

Based on the RNA-chemical small molecule interaction data of the training data set, RNA-related features such as nucleotide class, functional site, nucleotide distance and NDS curve, nucleotide frequency and pairing status are extracted from the sequence, structure and function perspectives. Structure files are extracted from the chemical small molecule structure data, and physicochemical properties are calculated, including the number of hydrogen bond acceptors (HBA), the number of hydrogen bond donors (HBD), the octanol/water partition coefficient (logP), the molar refractivity (MR), the molecular weight (MW), and the topological polar surface area (TPSA).

3. Creation of RNA-chemical Small molecule interaction prediction method

Because RNA fragments in the data set that do not interact with chemical small molecules far outnumber those that do, a Balanced Random Forest (BRF) model, in which the negative samples are divided into several parts that are each paired with the positive samples, is used to establish the calculation method for RNA-chemical small molecule interaction prediction. In addition, given the many features integrated in the model and the many adjustable parameters in its construction, optimization is carried out step by step.

4. Verification of RNA-chemical small molecule interaction prediction method

To verify the accuracy of the created RNA-chemical small molecule interaction prediction method, 5-fold cross-validation is performed on the training data set and the prediction performance of the random forest model is evaluated by the AUC value. The method is then run on the independent test data set and likewise evaluated by the AUC value. Finally, the method is applied to lncRNA-AK098656, a vascular smooth muscle-specific lncRNA previously discovered by the applicant, and 5 positive and 5 negative predictions are selected for biological verification.

Example 3:

In research on computational methods for screening chemical small molecule drugs targeting RNA, the applicant has created miREnvironment, a platform for miRNA-environmental factor (mostly chemical small molecule) associations (Cui et al., Bioinformatics 2011). Small molecules generally interfere with proteins at the proteins' functional sites, so determining the functional sites of RNA is an important basis for interfering with target RNA. The applicant has successively proposed methods for predicting RNA functional sites, including Rsite and Rsite2 (Cui et al., Scientific Reports 2015, 2016), SRAMP (Cui et al., Nucleic Acids Res 2016; m6A methylation site prediction), and PPUS (Cui et al., Bioinformatics 2015; pseudouridine site prediction). The applicant found that functional sites obtained from the RNA sequence and from the spatial structure are significantly consistent and correlated (FIG. 1). This indicates that the RNA sequence contains information about the RNA spatial structure, and further suggests that, when RNA spatial structure data are extremely scarce and the RNA force field is unknown, RNA sequence features can be used to predict the chemical small molecules that interact with the RNA.

Example 4:

A vascular smooth muscle-specific lncRNA, AK098656 (FIG. 2), was verified to be significantly elevated in the blood of hypertensive patients, and blood pressure rose significantly in rats into which the AK098656 gene had been introduced (Jin L et al. Hypertension 2018,71(2): 262-.

Example 5:

The applicant collected and curated more than 300 RNA-chemical small molecule interaction pairs from the PDB database, and obtained more than 100 further pairs from the SMMRNA database and the literature. Analysis shows that some RNA sequence features, such as triplet frequency and Rsite2 sites, are associated with chemical small molecule interaction, and that some small molecule physicochemical properties, such as the octanol/water distribution coefficient and the topological polar surface area, are associated with RNA interaction. A prediction method, DRIP, was preliminarily constructed based on random forests. 5-fold cross validation gave an AUC of 0.818, and the AUC reached 0.829 on the independent test data set derived from SMMRNA and the literature (FIG. 4), indicating that the created method predicts RNA-chemical small molecule interactions with a certain degree of accuracy.
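As an illustration of one of the sequence features named above, the sketch below computes a triplet frequency vector for an RNA fragment. The exact definition used by the applicant is not given here, so this assumes overlapping trinucleotide windows over the alphabet A, C, G, U, normalized by the number of valid windows; the names `TRIPLETS` and `triplet_frequencies` are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"
# All 64 possible triplets in a fixed order, so every fragment maps to the same layout.
TRIPLETS = ["".join(p) for p in product(ALPHABET, repeat=3)]

def triplet_frequencies(seq):
    """Return the 64 overlapping-triplet frequencies of an RNA fragment."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows with ambiguous bases such as N
            counts[tri] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]
```

A vector of this kind would be concatenated with the other RNA features and with the small molecule physicochemical properties, which are typically computed with a cheminformatics toolkit, to form the model input.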

Claims

1. A computer screening method for chemical small molecule drugs targeting RNA, characterized in that it comprises the following steps: (1) collecting and sorting a data set, (2) mining features used for training a prediction method, (3) creating a prediction method and a model, and (4) verifying the prediction method and the model; wherein,

the step (1) of collecting and collating the data sets comprises the steps of:

(a) retrieving and acquiring structures only consisting of RNA and small molecules from a PDB database, and extracting corresponding information from the structures, wherein the corresponding information comprises the interaction condition of the RNA and the small molecules and the specific interaction position of the RNA and the small molecules, and the information is used as a training data set; the training data set is sequentially screened through a first screening condition, a second screening condition and a third screening condition; wherein,

first screening condition: if all the small molecules contained in the PDB structure are metal ions or solvent molecules of the buffer solutions used in structural biology research, or if the length of the RNA chain contained in the PDB structure does not exceed 20 nucleotides, the structure is not retained;

second screening conditions: extracting RNA-small molecule interaction information from the PDB structure; adopting 4.0 angstroms as a threshold value for judging the interaction between the small molecules and the RNA; if the nearest distance between the RNA and the atoms of the small molecule is less than 4.0 angstroms, the RNA and the atoms of the small molecule are considered to have interaction, and subsequent operation is carried out;

third screening condition: for the one or more small molecules that interact with an RNA chain contained in each PDB structure, ranking the small molecules by the Euclidean distance of physicochemical properties between them and the small molecules contained in the structure, and selecting the intersection of the small molecules whose physicochemical-property Euclidean distances fall within 80-90% in descending order;

(b) collecting interaction data of RNA and small molecules outside the PDB database from the SMMRNA database and literature reports as a test data set;

the step (2) of mining features for training a prediction method comprises the following steps:

(a) extracting RNA sequence fragment related characteristics;

(b) calculating the physicochemical properties of the small molecules;

the step (3) of creating a prediction method and a model comprises the following steps: creating a balanced random forest model configured to receive as input the RNA sequence fragment-related features and the physicochemical property features of the small molecules;

and training the random forest model according to the training data set.

2. The in silico screening method of RNA-targeted chemical small molecule drugs of claim 1, wherein: the step (2) of mining features for training a prediction method comprises the following steps:

(a) the extracted RNA sequence fragment-related features comprise features relating to sequence, structure and function;

(b) the calculated physicochemical properties of the small molecules comprise the number of hydrogen bond acceptors, the number of hydrogen bond donors, the octanol/water distribution coefficient, the molar refractivity, the molecular weight and the topological polar surface area.

3. The in silico screening method of RNA-targeted chemical small molecule drugs according to claim 2, characterized in that: the relevant features include: nucleotide class, functional site, nucleotide distance and NDS curve, nucleotide frequency and pairing status.

4. The in silico screening method of RNA-targeted chemical small molecule drugs of claim 1, wherein: the RNA-chemical small molecule interaction prediction method is established with a balanced random forest model by the following steps: dividing the negative samples in the training data set into a plurality of parts to reduce the difference in number between each part of the negative samples and the positive samples, pairing each part with the positive samples to perform model training, and summarizing the output results of the models.

5. The in silico screening method of RNA-targeted chemical small molecule drugs of claim 1, wherein: the step (4) of verifying the prediction method and the model comprises the following step: evaluating the performance of the model obtained in step (3).

6. The in silico screening method of RNA-targeted chemical small molecule drugs according to claim 5, characterized in that: the performance evaluation comprises the following steps: cross validation using the training data set and/or independent validation using the test data set.

7. The in silico screening method of RNA-targeted chemical small molecule drugs according to claim 5, characterized in that: the performance evaluation comprises the following step: selecting 5 positive and 5 negative prediction results for biological validation.

8. Use of the computer screening method of the RNA-targeted chemical small molecule drug according to claim 1 in: a high-throughput screening platform; and/or computer screening of compounds with RNA as the target; and/or the PDB database; and/or the SMMRNA database; and/or the miRNA-based environmental factor development platform miREnvironment; and/or targeted drug screening; and/or the prevention and treatment of major diseases.
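By way of non-limiting illustration, the 4.0 angstrom distance test of the second screening condition in claim 1 can be sketched as below. The sketch assumes that atom coordinates have already been parsed from a PDB structure into plain (x, y, z) tuples; the parsing step is omitted, and the names `min_distance`, `interacts` and `contacting_residues` are hypothetical. The comparison is strict, matching the "less than 4.0 angstroms" wording.

```python
import math

CONTACT_CUTOFF = 4.0  # angstroms, threshold of the second screening condition

def min_distance(atoms_a, atoms_b):
    """Smallest Euclidean distance between any atom of A and any atom of B."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)

def interacts(rna_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """True if the nearest RNA-ligand atom pair is closer than the cutoff."""
    return min_distance(rna_atoms, ligand_atoms) < cutoff

def contacting_residues(rna_residues, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Return the ids of RNA residues with at least one atom within the cutoff.

    rna_residues maps a residue id to the list of (x, y, z) coordinates of its atoms.
    """
    return sorted(
        rid for rid, atoms in rna_residues.items()
        if min_distance(atoms, ligand_atoms) < cutoff
    )
```

The residue ids returned by `contacting_residues` correspond to the specific interaction positions recorded in step (1)(a), while `interacts` gives the per-structure yes/no decision.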