CN106874704A - Method for identifying key regulators in a gene co-regulation network based on a linear model - Google Patents
- Publication number: CN106874704A
- Application number: CN201710004254.4A
- Authority: CN (China)
- Prior art keywords: gene, regulator, expression, linear model, value
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
Abstract
The invention discloses a method for identifying key regulators in a gene co-regulation network based on a linear model. Using gene expression profile data and gene regulation relationship data, the method identifies key regulators in the gene co-regulation network by constructing a linear model that predicts the expression of known disease genes. The method is simple to implement: key regulators in the gene co-regulation network can be identified relatively accurately from gene expression profile data and gene regulation relationships alone. Experiments confirm that the identified regulators have important biological significance, so the method has theoretical significance and practical value for research on disease mechanisms.
Description
Technical Field
The invention belongs to the field of computational biology, and relates to a method for identifying key regulators in a gene co-regulation network based on a linear model.
Background
In the post-genome era, understanding the functions of genes, non-coding RNAs, proteins and other related biomolecules, and thereby revealing the mechanisms by which biological processes are realized, has become one of the most important research goals in computational systems biology and bioinformatics. Within this field, the study of gene regulation is a particularly important subject. Understanding how gene expression is regulated plays an important role in understanding the mechanisms of biological processes and disease development. In eukaryotes, there are two important classes of regulatory factors: transcription factors (TFs) and microRNAs (miRNAs), which regulate the expression level of a target gene at the transcriptional level and the post-transcriptional level, respectively. Transcription factors are a class of proteins with specific functions that turn on the transcription of a gene by binding to its promoter region. miRNAs are gene regulatory elements discovered in recent years; they are endogenous non-coding RNAs with regulatory functions found in eukaryotes, about 20-25 nucleotides in length. Transcription factors and miRNAs play important roles in the regulation of gene expression, which runs through a wide variety of biological activities and disease processes. Building on this, studies have found that transcription factors and miRNAs interact extensively and regulate cooperatively, forming a complex co-regulation network. The co-regulation network includes the regulation of miRNAs by transcription factors, the regulation of target genes by transcription factors, and the regulation of transcription factors and target genes by miRNAs. These regulatory relationships reflect each stage of the life process and of the functional execution of cellular molecules, so the co-regulation network contains richer biological information than a single network.
Therefore, effective identification of key regulators on the co-regulatory network is important for clinical treatment of diseases and drug design, which may provide a new approach for treatment of human diseases.
With the rapid development of high-throughput technology, large amounts of genomic, transcriptomic, proteomic and other omics data have been generated, providing new opportunities for research on biomolecular function. Earlier algorithms for identifying key nodes focused mainly on identifying key proteins in protein interaction networks. Studying transcriptional regulatory networks is more difficult than studying protein interaction networks. First, reliable transcriptional regulatory network data are still hard to obtain. Second, because of the functional characteristics of existing transcriptional regulatory networks, their topological properties differ greatly from those of protein interaction networks, and the directionality of regulation makes the topology of a regulatory network more complex. Identifying key regulators in a regulatory network is therefore more complex than identifying key proteins. In recent years, research on regulatory networks has increased, and many computational methods for identifying key regulators in regulatory networks have been proposed. The main ones are based on information flow models (RWR), ranking algorithms (PageRank), classifier construction (SVM), regularized least-squares classification, Bayesian networks, and regression-based models. However, the existing methods all have shortcomings to some degree, such as an inability to process large data sets, excessive time complexity, or accuracy that needs improvement. In 2015, Alexandra et al. proposed the MIPRIP method, which uses a linear model to identify key regulators in a regulatory network; their experimental results show that a linear model-based method can effectively identify regulators with important biological significance.
However, that method considers only the relationship between transcription factors and genes and does not consider the interaction and cooperative regulation among regulators in the co-regulation network; its identification accuracy also needs improvement.
Therefore, there is a need to design a method for identifying key regulators in a gene co-regulation network based on a linear model.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for identifying key regulators in a gene co-regulation network based on a linear model. Using only gene expression profile data and gene regulation relationships, the method can identify biologically significant key regulators in the gene co-regulation network more accurately.
The technical solution of the invention is as follows:
A method for identifying key regulators in a gene co-regulation network based on a linear model comprises the following steps:
step 1) constructing a gene co-regulation network:
inputting gene expression profile data, gene regulation relationships and protein-protein interaction (PPI) data, filtering out the action relationship pairs that involve nodes without expression profile data, and establishing a gene co-regulation network (GCN), wherein the GCN comprises three types of nodes: regulator miRNA (microRNA), regulator TF and gene, and wherein the following action edges exist among the nodes: miRNA-gene, TF-gene and gene-gene;
if any two nodes in the gene co-regulation network GCN have an action relationship, the edge weight is 1; otherwise, the edge weight is 0;
step 2) calculating, for the known disease genes, the activity values of the regulator miRNAs, the regulator TFs and the adjacent genes respectively;
the activity values are the influence values of the miRNAs, TFs and adjacent genes on the known disease genes;
step 3) in the constructed gene co-regulation network GCN, constructing a linear model from the gene expression profile data and the activity values of the regulators and adjacent genes obtained in step 2), predicting the expression of the known disease genes, and obtaining the predicted expression values of the known disease genes;
step 4) converting the linear model constructed in step 3) into an optimization problem that minimizes the difference between the predicted and real expression values of the known disease genes, solving the optimization problem by mixed integer linear programming, and finally identifying the key regulators in the gene co-regulation network.
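Step 1) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `build_gcn` and `edge_weight`, the dictionary layout, and the node names `GENE_A`, `GENE_B`, `GENE_C` and `hsa-mir-000` are hypothetical and chosen only for illustration. The sketch keeps an edge only when both endpoints have expression data and assigns weight 1 to every kept edge and 0 to everything else.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (source node, target gene)


def build_gcn(expressed: Set[str],
              mirna_gene: Iterable[Edge],
              tf_gene: Iterable[Edge],
              gene_gene: Iterable[Edge]) -> Dict[str, Dict[Edge, int]]:
    """Keep only edges whose two endpoints both have expression data.

    Every retained edge gets weight 1, matching the 0/1 edge weights
    described in step 1).
    """
    network: Dict[str, Dict[Edge, int]] = {}
    for edge_type, edges in (("miRNA-gene", mirna_gene),
                             ("TF-gene", tf_gene),
                             ("gene-gene", gene_gene)):
        network[edge_type] = {
            (u, v): 1 for u, v in edges if u in expressed and v in expressed
        }
    return network


def edge_weight(network: Dict[str, Dict[Edge, int]],
                edge_type: str, u: str, v: str) -> int:
    """Return 1 if the action edge exists in the network, otherwise 0."""
    return network[edge_type].get((u, v), 0)


# Hypothetical toy input: hsa-mir-000 and GENE_C have no expression data.
expressed = {"hsa-mir-586", "FOXA2", "GENE_A", "GENE_B"}
gcn = build_gcn(
    expressed,
    mirna_gene=[("hsa-mir-586", "GENE_A"), ("hsa-mir-000", "GENE_A")],
    tf_gene=[("FOXA2", "GENE_B")],
    gene_gene=[("GENE_A", "GENE_B"), ("GENE_A", "GENE_C")],
)
```

In this toy input, the edges involving `hsa-mir-000` and `GENE_C` are dropped because those nodes lack expression data.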
Further, the linear model constructed for predicting the expression of known disease genes is as follows:
wherein i represents a known disease gene, and m, t and g represent a regulator miRNA, a regulator TF and a neighboring gene of the known disease gene i, respectively;
$g'_{i,s}$ represents the predicted expression value of the known disease gene i in sample s; $\beta_0$ is the additive offset of the linear model; M, T and G denote the miRNA set, the TF set and the gene set; $\beta_m$, $\beta_t$ and $\beta_g$ are the optimization parameters of m, t and g respectively, which are computed directly by the optimizer when the optimization problem is solved in step 4);
$es_{m,i}$, $ts_{t,i}$ and $gs_{g,i}$ represent the weights of the action edges between m, t, g and i respectively, each taking the value 0 or 1;
$act_{m,s}$, $act_{t,s}$ and $act_{g,s}$ represent the activity values of m, t and g in sample s, respectively;
the sample s refers to the data of one observed individual with the known disease.
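The formula itself did not survive in this text. A reconstruction that is consistent with the symbol definitions above (an offset plus one weighted sum each for miRNAs, TFs and neighboring genes) would be the following; it should be read as an assumption rather than as the patent's exact notation.

```latex
g'_{i,s} = \beta_0
  + \sum_{m \in M} \beta_m \, es_{m,i} \, act_{m,s}
  + \sum_{t \in T} \beta_t \, ts_{t,i} \, act_{t,s}
  + \sum_{g \in G} \beta_g \, gs_{g,i} \, act_{g,s}
```

Because each edge weight is 0 or 1, only regulators and neighbors that are actually connected to disease gene i in the GCN contribute to its predicted expression.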
Further, minimizing the difference between the predicted and real expression values of the genes transforms the linear model into an optimization problem, expressed as:
wherein $g_{i,s}$ and $g'_{i,s}$ represent the real expression value and the predicted expression value of disease gene i in sample s, respectively, and O and S represent the set of known disease genes and the full sample set of the disease, respectively;
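The objective is not reproduced in this text. One form consistent with the definitions above is sketched below. The use of the absolute difference is an assumption: the text only says "difference", and an absolute error is chosen here because it can be linearized inside a mixed integer linear program.

```latex
\min_{\beta} \; \sum_{i \in O} \sum_{s \in S} \left| \, g_{i,s} - g'_{i,s} \, \right|
```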
the optimization problem is solved with the Gurobi optimizer; during solving, the number of times each regulator is selected by the optimizer is recorded, all regulators are ranked by their selection counts, and the top 50 regulators are taken as the final candidate regulators.
After the Gurobi optimizer is installed, the Gurobi function can be called directly to solve the optimization problem simply by loading the Gurobi package in the R language. The function takes three inputs: the optimization model, timeLimit and OutputFlag. The optimization model is obtained by converting the constructed linear model into an optimization problem that minimizes the difference between the predicted and real expression values of the known disease genes; timeLimit is generally set to 600, and OutputFlag takes the value 0 by default. To obtain a series of models of different sizes, linear models are constructed under a constraint on the number of regulators per gene: for each known disease gene, the number of regulators is set to 1 through k in turn, and a linear model is constructed for each setting.
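The selection-count ranking can be sketched in a few lines of Python. This is an illustration only: the function name `rank_regulators`, the tie-breaking by name, and the toy selection lists are assumptions, and the optimizer runs that would produce the selections are not shown.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_regulators(selections: Iterable[Iterable[str]],
                    top_n: int = 50) -> List[Tuple[str, int]]:
    """Count how often each regulator was selected and return the top_n.

    `selections` holds one collection of regulator names per solved model.
    Ties are broken alphabetically so the ranking is deterministic
    (the tie-breaking rule is an assumption, not taken from the source).
    """
    counts: Counter = Counter()
    for chosen in selections:
        counts.update(set(chosen))  # count each regulator once per model
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical selections from models with 1, 2 and 3 regulators.
runs = [
    ["hsa-mir-586"],
    ["hsa-mir-586", "FOXA2"],
    ["hsa-mir-586", "FOXA2", "hsa-mir-206"],
]
top = rank_regulators(runs, top_n=2)
```

With these toy selections, `hsa-mir-586` is chosen in all three models and `FOXA2` in two, so they occupy the first two ranks.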
Further, the activity values of the regulator miRNA, the regulator TF and the adjacent genes are calculated by the following two methods:
1) calculating the activity values of the regulator miRNA and the regulator TF:
first, the reference expression values of all target genes of the regulator r are calculated:
wherein r represents a regulator, namely a regulator miRNA or a regulator TF; the reference expression value of a target gene $g_t$ of regulator r is the average of the expression values of $g_t$ over all samples in which the expression level of regulator r tends to 0; e(r) -> 0 indicates that the expression level of regulator r tends to 0;
the reference expression value of a target gene is its expression value when no regulatory effect is exerted on it;
second, the difference between the reference expression value of the target gene and its real expression value under the influence of the regulator, i.e. the expression level change of the target gene, is calculated as:
wherein $y_{g_t,s}$ represents the real expression value of target gene $g_t$ of regulator r in sample s, and the result is the change in the expression level of target gene $g_t$;
third, a simple linear model is constructed from the expression level changes of the target genes and solved for the activity value $act_{r,s}$ of the regulator:
wherein G' represents the target gene set of regulator r, and the two sums are, respectively, the sum of the expression level changes and the sum of the reference expression values over the target gene set of regulator r;
2) calculating the activity value of the adjacent genes, which is obtained as the cumulative effect of the expression influence of an adjacent gene on all the genes it acts on, namely:
wherein N represents the total number of genes in sample s, $gs_{g,i}$ represents the weight of the action of gene g on gene i in sample s, and $g_{i,s}$ represents the expression value of gene i in sample s, where sample s is the data of one observed individual with the known disease.
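The four formulas referred to in this passage are missing from the text. The block below is a reconstruction from the verbal definitions only. The symbols $\bar{y}_{g_t}$, $\Delta y_{g_t,s}$ and $S_0$ are introduced here for illustration, and the sign of the change and the ratio form of the activity value are assumptions.

```latex
\begin{aligned}
\bar{y}_{g_t} &= \frac{1}{|S_0|} \sum_{s \in S_0} y_{g_t,s},
  \qquad S_0 = \{\, s : e(r) \to 0 \,\} \\
\Delta y_{g_t,s} &= y_{g_t,s} - \bar{y}_{g_t} \\
act_{r,s} &= \frac{\sum_{g_t \in G'} \Delta y_{g_t,s}}{\sum_{g_t \in G'} \bar{y}_{g_t}} \\
act_{g,s} &= \sum_{i=1}^{N} gs_{g,i} \, g_{i,s}
\end{aligned}
```

The first three lines cover the regulator miRNA and TF activity; the last line is the cumulative-effect activity of an adjacent gene.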
Further, the activity values of the regulators and the adjacent genes obtained in step 2) are normalized before being used to construct the linear model in step 3).
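The activity calculations of step 2) can be sketched in Python as follows. This is a sketch under stated assumptions, not the patent's code: the function names, the nested-dictionary data layout, and the toy values are hypothetical; the samples in which the regulator's expression "tends to 0" are passed in by the caller because the text gives no threshold; the activity is taken as the ratio of summed expression changes to summed reference values. Normalization is not shown because the text does not specify which normalization is used.

```python
from typing import Dict, Iterable, Tuple

Expression = Dict[str, Dict[str, float]]  # expression[gene][sample] -> value


def regulator_activity(targets: Iterable[str],
                       expression: Expression,
                       low_samples: Iterable[str],
                       sample: str) -> float:
    """Activity of a regulator in `sample`, assuming a ratio form.

    `low_samples` are the samples in which the regulator's own expression
    is close to 0; how "close" is decided is left to the caller.
    """
    low = list(low_samples)
    reference_sum = 0.0
    change_sum = 0.0
    for gene in targets:
        # Reference expression: mean over samples with (near) no regulation.
        reference = sum(expression[gene][s] for s in low) / len(low)
        reference_sum += reference
        # Expression change: real value minus reference value (sign assumed).
        change_sum += expression[gene][sample] - reference
    return change_sum / reference_sum


def neighbor_activity(gene: str,
                      expression_s: Dict[str, float],
                      weights: Dict[Tuple[str, str], int]) -> float:
    """Cumulative effect: sum of 0/1 edge weight times expression value."""
    return sum(weights.get((gene, other), 0) * value
               for other, value in expression_s.items())


# Hypothetical toy data.
expression = {
    "GENE_A": {"s1": 2.0, "s2": 4.0, "s3": 6.0},
    "GENE_B": {"s1": 1.0, "s2": 1.0, "s3": 4.0},
}
act_r = regulator_activity(["GENE_A", "GENE_B"], expression, ["s1", "s2"], "s3")

expression_s3 = {"GENE_A": 2.0, "GENE_B": 3.5, "GENE_C": 1.0}
ppi = {("GENE_A", "GENE_B"): 1, ("GENE_A", "GENE_C"): 1}
act_g = neighbor_activity("GENE_A", expression_s3, ppi)
```

In the toy data the reference values are 3.0 and 1.0, the changes in sample `s3` are 3.0 and 3.0, so the regulator activity is 6.0 / 4.0; the neighbor activity of `GENE_A` is 3.5 + 1.0.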
Advantageous effects
The invention provides a method (co-BOTLM) for identifying key regulators in a gene co-regulation network based on a linear model. Using gene expression profile data and gene regulation relationships, it identifies the key regulators in the gene co-regulation network by constructing a linear model that predicts the expression of known disease genes.
Compared with the existing linear model-based method for identifying key regulators, the co-BOTLM method has the following advantages:
1) the method is applied to a co-regulation network, which contains richer biological information than a single network, so the identified regulators have greater biological significance;
2) protein interaction data (PPI information) are added, taking into account that the expression of a gene may be affected by its neighboring genes;
3) a new method is introduced to calculate the activity values of the regulators and the adjacent genes, which effectively improves the accuracy of cancer gene expression prediction. The method is simple to implement, and the key regulators in the gene co-regulation network can be identified accurately from gene expression profile data and gene regulation relationships alone.
Experiments show that co-BOTLM can effectively identify key regulators in a gene co-regulation network and that the identified key regulators have important biological significance. Its accuracy is also higher than that of the compared method. The experimental results are compared and analyzed in detail in the examples.
Drawings
FIG. 1 is a flow chart of the co-BOTLM of the present invention.
Detailed Description
The invention will be described in further detail below with reference to the following figures and specific examples:
Example 1:
Identification model of key regulators in a gene co-regulation network based on a linear model
The invention defines the key regulators in a gene co-regulation network as the regulators that strongly influence disease gene expression in the co-regulation network; they are identified by constructing a linear model from gene expression profile data and gene regulation relationships to predict the expression of known disease genes.
To describe the linear model-based key regulator identification model clearly, the inventors define the model as follows:
a linear model is constructed to predict the expression of known disease genes, as follows:
the linear model-based key regulator identification model aims to identify the regulators that strongly affect the expression of disease genes in the co-regulation network. The key nodes in the gene co-regulation network are identified by constructing a linear model from gene expression profile data and gene regulation relationships to predict the expression of known disease genes.
The whole process of the key regulator identification method in the gene co-regulation network based on the linear model is shown in figure 1. Firstly, inputting gene expression profile data, gene regulation relation and PPI data. The method co-BOTLM can be divided into 4 sub-processes:
1) constructing a gene co-regulation network;
2) considering that the expression of the gene may be affected by the regulator and the adjacent gene, activity values of miRNA, TF and the adjacent gene (i.e., influence values of miRNA, TF and the adjacent gene on the known disease gene) are calculated for the known disease gene, respectively;
3) constructing a linear model by using the expression profile data of the genes in the obtained gene co-regulation network, and predicting the expression of the genes with known diseases;
4) converting the linear model into an optimization problem according to the minimization of the difference between the gene prediction expression value and the real expression value, solving the optimization problem based on a mixed integer linear programming idea (MILP), finally identifying a key regulator in the gene co-regulation network, and ending the whole identification process;
the optimization problem is solved with the Gurobi optimizer; during solving, the number of times each regulator is selected by the optimizer is recorded, all regulators are ranked by their selection counts, and the top 50 regulators are taken as the final candidate regulators.
After the Gurobi optimizer is installed, the Gurobi function can be called directly to solve the optimization problem simply by loading the Gurobi package in the R language. The function takes three inputs: the optimization model, timeLimit and OutputFlag. timeLimit is generally set to 600 and OutputFlag takes the value 0 by default; the optimization model is obtained by converting the constructed linear model into an optimization problem that minimizes the difference between the predicted and real expression values of the known disease genes. To obtain a series of models of different sizes, linear models are constructed under a constraint on the number of regulators per gene: for each known disease gene, the number of regulators is set to 1 through k in turn, and a linear model is constructed for each setting. In this example, k is 5 (repeated experiments showed that the results are best when k is 5).
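The text does not spell out the mixed integer linear program. A standard way to write such a problem for one known disease gene i, with a limit k on the number of selected regulators, is sketched below. The binary indicators $z_r$, the error variables $e_{i,s}$, the bound $B$, and the choice of an upper limit (rather than an equality) are all assumptions introduced for illustration; $g_{i,s}$ and $g'_{i,s}$ are the real and predicted expression values of gene i in sample s, and $\beta_r$ is the coefficient of regulator r.

```latex
\begin{aligned}
\min_{\beta,\, z,\, e} \quad & \sum_{s \in S} e_{i,s} \\
\text{s.t.} \quad
  & -e_{i,s} \le g_{i,s} - g'_{i,s} \le e_{i,s}, && s \in S, \\
  & -B \, z_r \le \beta_r \le B \, z_r, && r \in M \cup T, \\
  & \sum_{r \in M \cup T} z_r \le k, \qquad z_r \in \{0, 1\}.
\end{aligned}
```

Here $z_r = 1$ means regulator r is selected for the model of gene i; counting how often each $z_r$ equals 1 across all solved models gives the selection counts used for ranking.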
Validity verification method of key regulator identification method in gene co-regulation network based on linear model
To verify the effectiveness of the co-BOTLM method, it was applied to an ovarian cancer data set. The experimental data included ovarian cancer sample data, gene regulatory relationships, PPI data and known ovarian cancer-associated disease genes. The ovarian cancer sample data were downloaded from the TCGA database, giving 385 samples in total. After filtering out genes whose expression values were too small in absolute value or showed no obvious differential expression across the samples, an ovarian cancer expression profile data set containing 559 miRNAs and 12456 genes was obtained. The action relationship data comprise miRNA-gene, TF-gene and PPI data, downloaded from the MicroCosm website, the ENCODE database and the BioGRID database, respectively. By mapping the ovarian cancer expression profile data set and the action relationships onto each other, a miRNA-TF gene co-regulation network was constructed. The network comprises three types of nodes (12381 genes, 559 miRNAs and 75 TFs) and three types of action relationships between the nodes (59660 gene-gene, 241722 miRNA-gene and 9877 TF-gene). For the known ovarian cancer-related disease genes, 379 genes were downloaded from the DDOC database; after filtering out the disease genes without expression profile data or regulatory relationships, 123 genes remained.
In this example, a three-fold cross-validation experiment was performed and the prediction accuracy of the co-BOTLM method was compared with that of the MIPRIP method proposed by Alexandra et al. The Pearson correlation coefficient (PCC) was used to measure the similarity between the disease gene expression predicted by co-BOTLM and the real expression data: the higher the PCC, the higher the similarity, and hence the more accurate the linear model constructed by co-BOTLM and the more reliable the experimental result. The PCC values in the example were calculated with the cor function in the R language. Feature analysis and functional enrichment analysis were also carried out on the regulators identified by the co-BOTLM method.
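The PCC between predicted and real expression values can be sketched in plain Python as follows. The patent computes it with R's cor function; this Python version and its toy input vectors are illustrative assumptions only.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical predicted and real expression values of one disease gene.
predicted = [1.0, 2.0, 3.0, 4.0]
observed = [1.1, 1.9, 3.2, 3.8]
pcc = pearson(predicted, observed)
```

A value close to 1 indicates that the predicted expression tracks the real expression closely across samples.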
1. Analyzing experimental results and verifying algorithm effectiveness
Table 1: Top 20 ranked regulators in the miRNA-TF gene co-regulation network
No. | Identified key regulators | Number of target genes | Number of optimizer selections |
1 | hsa-mir-106a* | 377 | 50 |
2 | hsa-mir-586 | 508 | 43 |
3 | hsa-mir-423-5p | 496 | 38 |
4 | hsa-mir-515-3p | 512 | 34 |
5 | hsa-mir-181a-2* | 496 | 34 |
6 | hsa-mir-768-3p | 530 | 32 |
7 | hsa-mir-663 | 480 | 32 |
8 | hsa-mir-539 | 382 | 31 |
9 | hsa-mir-206 | 477 | 30 |
10 | hsa-mir-509-3p | 552 | 30 |
11 | hsa-mir-362-3p | 512 | 25 |
12 | hsa-mir-378* | 519 | 24 |
13 | hsa-mir-520c-3p | 566 | 24 |
14 | hsa-mir-33a | 523 | 24 |
15 | hsa-mir-29a* | 495 | 23 |
16 | hsa-mir-193a-3p | 496 | 23 |
17 | hsa-mir-601 | 484 | 23 |
18 | FOXA2 | 169 | 23 |
19 | hsa-mir-26b | 466 | 22 |
20 | hsa-mir-30b | 541 | 22 |
In the example, the three-fold cross-validation experiment gave an average PCC value of 0.535, which shows that the gene expression values predicted by the linear model of the invention are highly similar to the real expression values. The linear model constructed by the co-BOTLM method is therefore accurate, and the key regulators in the network can be identified effectively. After the experiment, all regulators were ranked by the number of times the optimizer selected them, and the top 50 regulators were taken as the candidate key regulators in this example. Table 1 above lists the top 20 regulators. Every regulator other than FOXA2 regulates more than 300 genes, and many of these genes have been confirmed to be associated with ovarian cancer. FOXA2 has fewer target genes because too little TF experimental data are available. This indicates that the identified regulators are related to the expression of a large number of genes in the ovarian cancer gene co-regulation network, including known ovarian cancer disease genes, and thus play a critical role in the co-regulation network.
2. Experimental comparison of co-BOTLM with the MIPRIP method to verify the accuracy of the algorithm
Table 2: PCC values of the MIPRIP experimental results
Fold | k=1 | k=2 | k=3 | k=4 | k=5 |
1 | 0.3329907 | 0.4312150 | 0.4436449 | 0.4731776 | 0.4893458 |
2 | 0.3195237 | 0.4221495 | 0.4500000 | 0.4687850 | 0.4851402 |
3 | 0.3214019 | 0.4341121 | 0.4571028 | 0.4768224 | 0.4916822 |
Note: rows 1-3 represent the three folds of the cross-validation experiment; columns 1-5 represent the number k of regulators used to construct the linear model.
Table 3: PCC values of the co-BOTLM method experimental results
No. | 1 | 2 | 3 | 4 | 5 |
1 | 0.5018750 | 0.5709821 | 0.5940179 | 0.6112500 | 0.6227679 |
2 | 0.4858036 | 0.5575893 | 0.5869643 | 0.6025893 | 0.6164286 |
3 | 0.4956250 | 0.5518750 | 0.5691964 | 0.5918750 | 0.6059821 |
Both the MIPRIP method and the co-BOTLM method of the present invention identify key regulators of specific diseases on the basis of linear models. However, they differ in three respects: 1) the MIPRIP method is applied to a regulatory network, whereas the co-BOTLM method is applied to a co-regulatory network; because transcription factors and miRNAs interact extensively and regulate cooperatively, the co-regulatory network contains richer biological information than a single regulatory network; 2) among the factors affecting the expression of disease genes, the co-BOTLM method considers the possible effect of adjacent genes in addition to transcription factors and miRNAs; 3) the two methods calculate the activity values of transcription factors and miRNAs differently. Since the MIPRIP method is applied to a regulatory network and does not consider the co-regulation relationships in the network, the transcription factors are treated as ordinary genes in the comparative experiment of this example. Tables 2 and 3 show the PCC values obtained with the MIPRIP method and the co-BOTLM method, respectively. The co-BOTLM method obtains higher PCC values: its average PCC value is 0.571, whereas the average PCC value of the MIPRIP method is 0.433. The gene expression values predicted by the co-BOTLM method are therefore more similar to the real expression values, which indirectly shows that the co-BOTLM method is more accurate and that the key regulators it identifies are more reliable.
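The two averages quoted above can be checked directly from Tables 2 and 3. The following sketch copies the table values verbatim; the variable and function names are illustrative.

```python
# PCC values from Table 2 (MIPRIP): rows are folds 1-3, columns are k = 1..5.
miprip = [
    [0.3329907, 0.4312150, 0.4436449, 0.4731776, 0.4893458],
    [0.3195237, 0.4221495, 0.4500000, 0.4687850, 0.4851402],
    [0.3214019, 0.4341121, 0.4571028, 0.4768224, 0.4916822],
]

# PCC values from Table 3 (co-BOTLM), same layout.
cobotlm = [
    [0.5018750, 0.5709821, 0.5940179, 0.6112500, 0.6227679],
    [0.4858036, 0.5575893, 0.5869643, 0.6025893, 0.6164286],
    [0.4956250, 0.5518750, 0.5691964, 0.5918750, 0.6059821],
]


def mean_pcc(table):
    """Average over every fold and every k in the table."""
    values = [v for row in table for v in row]
    return sum(values) / len(values)


print(f"MIPRIP mean PCC:   {mean_pcc(miprip):.3f}")
print(f"co-BOTLM mean PCC: {mean_pcc(cobotlm):.3f}")
```

In both tables the PCC also increases with k in every fold, so the gap between the methods is present at each model size and not only on average.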
3. Functional enrichment analysis of the experimental results to verify their validity
Table 4: GO enrichment analysis of the top 10 regulators
No.: regulator ranking; enriched GO terms: the top 3 GO terms ranked by P-value (the smaller the better); GO number: number of GO terms with P-value < 0.05; P-value: < 0.05 indicates significant enrichment.
Table 5: KEGG pathway enrichment analysis of the top 10 regulators
No.: regulator ranking; enriched KEGG pathways: the top 3 KEGG pathways ranked by P-value (the smaller the better); KEGG number: number of KEGG pathways with P-value < 0.05; P-value: < 0.05 indicates significant enrichment.
To verify that the key regulators identified by the co-BOTLM method of the invention are biologically significant, this example uses the GOstats package in R to perform GO enrichment analysis and KEGG pathway enrichment analysis on the identified key regulators. Tables 4 and 5 show the GO and KEGG pathway enrichment results for the top 10 regulators, respectively.
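Enrichment P-values of this kind are commonly obtained from a hypergeometric test. The sketch below shows that test in plain Python as an illustration only: it is not the GOstats implementation, and the function name and the example counts are assumptions.

```python
from math import comb


def hypergeom_enrichment_p(universe, annotated, targets, overlap):
    """P(X >= overlap) when `targets` genes are drawn from `universe` genes,
    of which `annotated` carry the GO term or belong to the KEGG pathway."""
    upper = min(annotated, targets)
    total = comb(universe, targets)
    tail = sum(
        comb(annotated, k) * comb(universe - annotated, targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10000 genes in total, 200 annotated with a term,
# 500 target genes of a regulator, 25 of which carry the term.
p_value = hypergeom_enrichment_p(10000, 200, 500, 25)
print(p_value < 0.05)
```

A term counts as enriched for a regulator when this P-value falls below the 0.05 threshold used in Tables 4 and 5.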
It is clear from Table 4 that most of the top 10 regulators identified by the co-BOTLM method of the invention are enriched in more than 300 GO terms, among which the more frequently enriched GO terms include cellular components, cellular processes, cell death, and negative regulation of dendritic cell differentiation. This indicates that the identified regulators participate in a large number of cell-related life processes. The numbers of GO terms enriched for hsa-mir-515-3p and hsa-mir-768-3p are below 100, probably because few target genes of these two miRNAs match the GOstats library, and Jiang et al. demonstrated in 2016 that hsa-mir-768-3p has a potential prognostic function in ovarian cancer because its down-regulation is linked to MEK/ERK-mediated enhancement of protein synthesis in melanoma cells. Similarly, it is evident from Table 5 that most of the top 10 regulators are enriched in at least 5 KEGG pathways, among which the more frequently enriched pathways include pathways in cancer, signaling pathways, the ErbB signaling pathway and the like. This shows that the identified regulators are involved in a large number of cancer-related and signaling pathways and are closely related to cancer. In conclusion, the experimentally identified regulators are involved in a large number of biological processes, especially those associated with cellular activity and cancer, and are therefore of great biological significance.
Claims (5)
1. A method for identifying key regulators in a gene co-regulation network based on a linear model is characterized by comprising the following steps:
step 1) constructing a gene co-regulation network:
inputting gene expression profile data, gene regulation relations and protein interaction data, filtering out the action relation pairs whose nodes have no expression profile data, and establishing a gene co-regulation network GCN, wherein the gene co-regulation network GCN comprises three types of nodes: regulator miRNA, regulator TF and gene, and three types of action edges between the nodes: miRNA-gene, TF-gene and gene-gene;
if any two points in the gene co-regulation network GCN have an action relation, the edge weight is 1, otherwise, the edge weight is 0;
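The network construction of step 1) can be sketched as follows. This is an illustrative reading only: the function names, the node identifiers and the choice to store edges as ordered pairs are assumptions that do not come from the claims.

```python
from itertools import chain


def build_gcn(expressed, mirna_gene, tf_gene, gene_gene):
    """Return {(source, target): 1} for every action relation pair whose two
    nodes both have expression profile data; all other pairs are filtered out."""
    edges = {}
    for source, target in chain(mirna_gene, tf_gene, gene_gene):
        if source in expressed and target in expressed:
            edges[(source, target)] = 1
    return edges


def edge_weight(edges, source, target):
    """Edge weight is 1 if an action relation exists, otherwise 0."""
    return edges.get((source, target), 0)


# Hypothetical identifiers; "geneZ" has no expression profile data.
expressed = {"mirX", "tfY", "geneA", "geneB"}
gcn = build_gcn(
    expressed,
    mirna_gene=[("mirX", "geneA"), ("mirX", "geneZ")],
    tf_gene=[("tfY", "geneA")],
    gene_gene=[("geneA", "geneB")],
)
print(edge_weight(gcn, "mirX", "geneA"), edge_weight(gcn, "mirX", "geneZ"))
```

Storing only the existing edges keeps the 0/1 weights implicit: any pair that is absent from the dictionary has weight 0.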
step 2) respectively calculating activity values of a regulator miRNA, a regulator TF and adjacent genes of known disease genes;
step 3) in the constructed gene co-regulation network GCN, constructing a linear model by using gene expression profile data and activity values of the regulator and the adjacent genes obtained in the step 2), predicting the expression of the known disease genes, and obtaining the predicted expression value of the known disease genes;
step 4) converting the linear model constructed in the step 3) into an optimization problem that minimizes the difference between the predicted expression value and the real expression value of the known disease genes, solving the optimization problem based on the mixed integer linear programming idea, and finally identifying the key regulators in the gene co-regulation network.
2. The method for identifying key regulators in a linear model-based gene co-regulation network according to claim 1, wherein the linear model expression constructed for predicting the expression of known disease genes is as follows:
wherein i represents a known disease gene, m, t and g represent a regulator miRNA, a regulator TF and a neighboring gene of the known disease gene i respectively;
$g'_{i,s}$ represents the predicted expression value of the known disease gene i in sample s; $\beta_0$ is the additional weight of the linear model; M, T and G represent the miRNA set, the TF set and the gene set; $\beta_m$, $\beta_t$ and $\beta_g$ respectively represent the optimization parameters of m, t and g, and are calculated directly by the optimizer when the optimization problem is processed in the step 4);
$es_{m,i}$, $ts_{t,i}$ and $gs_{g,i}$ respectively represent the action edge weights between m, t, g and i, each taking the value 0 or 1;
$act_{m,s}$, $act_{t,s}$ and $act_{g,s}$ respectively represent the activity values of m, t and g in a sample s;
the sample s refers to data of a certain observed individual with a known disease.
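The symbol definitions of claim 2 are consistent with a linear model of the following form. The expression below is a reconstruction from those definitions and should be read as an assumption, since the formula itself is not reproduced in this text.

```latex
g'_{i,s} = \beta_0
  + \sum_{m \in M} \beta_m \, es_{m,i} \, act_{m,s}
  + \sum_{t \in T} \beta_t \, ts_{t,i} \, act_{t,s}
  + \sum_{g \in G} \beta_g \, gs_{g,i} \, act_{g,s}
```

Under this reading, a regulator or adjacent gene contributes to the prediction for disease gene i only when its edge weight to i is 1.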
3. The method for identifying key regulators in a linear model-based gene co-regulation network according to claim 2, wherein the linear model is transformed into an optimization problem according to minimization of the difference between the predicted expression value and the true expression value of the gene, which is expressed as:
wherein $g_{i,s}$ and $g'_{i,s}$ respectively represent the real expression value and the predicted expression value of the disease gene i in a sample s, and O and S respectively represent the known disease gene set and the total sample set of the disease;
solving the optimization problem with a Gurobi optimizer, recording the number of times each regulator is selected by the optimizer while solving the optimization problem, ranking all regulators by the number of selections, and taking the top 50 ranked regulators as the final candidate regulators.
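The counting and ranking step of claim 3 can be sketched as follows. The sketch assumes that each optimizer run yields the list of regulators it selected; the function name, the tie-breaking rule and the example data are illustrative and are not part of the claim.

```python
from collections import Counter


def rank_regulators(selections, top_n=50):
    """Count how often each regulator was selected across optimizer runs and
    return the top_n (regulator, count) pairs, most frequently selected first."""
    counts = Counter()
    for chosen in selections:
        counts.update(set(chosen))  # count each regulator at most once per run
    # Ties are broken alphabetically so the ranking is deterministic (an assumption).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical selections from three optimizer runs.
runs = [["mirA", "tfB"], ["tfB", "mirC"], ["tfB", "mirA"]]
print(rank_regulators(runs, top_n=2))
```

With `top_n=50` this corresponds to the candidate list described in the claim.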
4. The method for identifying key regulators in a linear model-based gene co-regulation network according to any one of claims 1-3, wherein the activity values of the regulator miRNA, the regulator TF and the adjacent genes are calculated by the following two methods respectively:
1) calculating the activity values of the regulator miRNA and the regulator TF:
first, the reference expression values of all target genes of the regulator r are calculated:
wherein r represents a regulator, namely a regulator miRNA or a regulator TF; the reference expression value of a target gene $g_t$ of the regulator r is the average of the expression values of gene $g_t$ over all samples in which the expression level of the regulator r tends to 0; E(r) -> 0 indicates that the expression level of the regulator r tends to 0;
secondly, calculating the difference between the reference expression value of the target gene and its real expression value under the influence of the regulator, namely the expression level change value of the target gene, as follows:
wherein the two quantities respectively represent the true expression value of the target gene $g_t$ of the regulator r in the sample s and the expression level change of the target gene $g_t$ of the regulator r;
thirdly, constructing a simple linear model from the expression level change values of the target genes, and solving the activity value $act_{r,s}$ of the regulator:
wherein G' represents the target gene set of the regulator r, and the two sums are respectively the sum of the expression level change values and the sum of the reference expression values over the target gene set of the regulator r;
2) calculating the activity value of the adjacent genes, and solving by adopting the cumulative effect based on the expression influence of the adjacent genes on all action genes, namely:
wherein N represents the total number of genes in the sample s, $gs_{g,i}$ represents the action weight of the gene g on the gene i in the sample s, and $g_{i,s}$ represents the expression value of the gene i in the sample s; the sample s is the data of a certain observed individual with the known disease.
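The activity calculations of claim 4 can be sketched as follows. The formulas themselves are not reproduced in this text, so the forms used here are assumptions inferred from the wording: the change value is taken as the true value minus the reference value, the regulator activity as the ratio of the summed change values to the summed reference values, and the adjacent gene activity as a weighted sum. The threshold `eps` for "tends to 0" and all names and data are also hypothetical.

```python
def reference_expression(target_expr, regulator_expr, eps=1e-6):
    """Mean expression of a target gene over the samples in which the
    regulator's expression level is close to 0 (|value| <= eps)."""
    low = [s for s, value in regulator_expr.items() if abs(value) <= eps]
    if not low:
        raise ValueError("no sample with regulator expression near 0")
    return sum(target_expr[s] for s in low) / len(low)


def regulator_activity(targets_expr, regulator_expr, sample, eps=1e-6):
    """Assumed form: sum of expression changes divided by sum of reference values."""
    refs = {
        gene: reference_expression(expr, regulator_expr, eps)
        for gene, expr in targets_expr.items()
    }
    total_ref = sum(refs.values())
    if total_ref == 0:
        raise ValueError("sum of reference expression values is 0")
    total_change = sum(targets_expr[gene][sample] - refs[gene] for gene in refs)
    return total_change / total_ref


def neighbor_activity(gene, sample, edges, expr):
    """Assumed form: sum over genes i of gs_{g,i} * g_{i,s}."""
    return sum(edges.get((gene, i), 0) * values[sample] for i, values in expr.items())


# Hypothetical data: the regulator is silent in s1 and s2 and expressed in s3.
regulator_expr = {"s1": 0.0, "s2": 0.0, "s3": 2.0}
targets_expr = {
    "geneA": {"s1": 4.0, "s2": 6.0, "s3": 2.5},
    "geneB": {"s1": 1.0, "s2": 3.0, "s3": 1.0},
}
print(regulator_activity(targets_expr, regulator_expr, "s3"))
print(neighbor_activity("geneA", "s3", {("geneA", "geneB"): 1}, targets_expr))
```

Under these assumptions a negative activity means the target genes are, on the whole, expressed below their reference level in that sample, as would be expected for a repressing regulator.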
5. The method for identifying key regulators in the linear model-based gene co-regulation network according to claim 4, wherein the activity values of the regulators and adjacent genes obtained in the step 2) are normalized and then used for constructing the linear model in the step 3).
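Claim 5 does not state which normalization is used. The sketch below uses z-score normalization purely as an example of one common choice; the function name and the values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant series carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical activity values of one regulator across three samples.
print(zscore([1.0, 2.0, 3.0]))
```

Normalizing puts the activity values of miRNAs, TFs and adjacent genes on a comparable scale before they enter the same linear model.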
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710004254.4A CN106874704B (en) | 2017-01-04 | 2017-01-04 | A method for identifying key regulators in a gene co-regulation network based on a linear model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106874704A (en) | 2017-06-20 |
CN106874704B (en) | 2019-02-19 |
Family
ID=59164588
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710004254.4A Active CN106874704B (en) | 2017-01-04 | 2017-01-04 | A method for identifying key regulators in a gene co-regulation network based on a linear model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106874704B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030104463A1 (en) * | 2001-12-03 | 2003-06-05 | Siemens Aktiengesellschaft | Identification of pharmaceutical targets |
CN101719194A (en) * | 2009-12-03 | 2010-06-02 | 上海大学 | Artificial gene regulatory network simulation method |
CN101719195A (en) * | 2009-12-03 | 2010-06-02 | 上海大学 | Inference method of stepwise regression gene regulatory network |
Non-Patent Citations (2)
Title |
---|
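YING LIN et al.: "Transcription factor and miRNA", Scientific Reports * |
XU Yan et al.: "Integrated analysis of gene expression and copy number variation to identify cancer driver genes and regulator miRNAs", Progress in Modern Biomedicine * |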
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107391962A (en) * | 2017-09-05 | 2017-11-24 | 武汉古奥基因科技有限公司 | The method of gene or site to disease regulation relationship is analysed based on multigroup credit |
CN107679367A (en) * | 2017-09-20 | 2018-02-09 | 湖南大学 | A kind of common regulated and control network functional module recognition methods and system based on the network node degree of association |
CN107679367B (en) * | 2017-09-20 | 2020-02-21 | 湖南大学 | Method and system for identifying co-regulation network function module based on network node association degree |
CN109308934A (en) * | 2018-08-20 | 2019-02-05 | 唐山照澜海洋科技有限公司 | A kind of gene regulatory network construction method based on integration characteristic importance and chicken group's algorithm |
CN111304200A (en) * | 2020-02-11 | 2020-06-19 | 山东大学 | CeRNA (cellular ribonucleic acid) regulation and control network for regulating and controlling osteointegration around rat implant with hyperlipidemia and application of network |
CN111304200B (en) * | 2020-02-11 | 2022-04-15 | 山东大学 | CeRNA (cellular ribonucleic acid) regulation and control network for regulating and controlling osteointegration around rat implant with hyperlipidemia and application of network |
CN111613268A (en) * | 2020-05-27 | 2020-09-01 | 中山大学 | Method for determining gene expression regulation mechanism based on single cell transcriptome data |
CN111613268B (en) * | 2020-05-27 | 2023-02-24 | 中山大学 | Method for determining gene expression regulation mechanism based on single cell transcriptome data |
CN111833964A (en) * | 2020-06-24 | 2020-10-27 | 华中农业大学 | Method for mining superior locus of Bayesian network optimized by integer linear programming |
CN112102876A (en) * | 2020-09-27 | 2020-12-18 | 西安交通大学 | Method for automatically modeling gene circuit and transcription regulation and control relation |
CN115798600A (en) * | 2023-02-03 | 2023-03-14 | 北京灵迅医药科技有限公司 | Genome data analysis method, apparatus, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN106874704B (en) | 2019-02-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |