CN106874704A

CN106874704A - Method for identifying key regulators in a gene co-regulation network based on a linear model

Info

Publication number: CN106874704A
Application number: CN201710004254.4A
Authority: CN
Inventors: 王伟胜; 曾亚菲; 骆嘉伟; 刘智明; 蔡洁
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2017-01-04
Filing date: 2017-01-04
Publication date: 2017-06-20
Anticipated expiration: 2037-01-04
Also published as: CN106874704B

Abstract

The invention discloses a method for identifying key regulators in a gene co-regulation network based on a linear model. Using gene expression profile data and gene regulation relationship data, the method identifies key regulators in the gene co-regulation network by constructing a linear model that predicts the expression of known disease genes. The invention is simple to implement: key regulators in the gene co-regulation network can be identified relatively accurately from gene expression profile data and gene regulation relationships alone. Experiments confirm that the identified regulators have important biological significance, so the method has theoretical significance and practical value for research on disease mechanisms.

Description

Method for identifying key regulators in gene co-regulation network based on linear model

Technical Field

The invention belongs to the field of computational biology, and relates to a method for identifying key regulators in a gene co-regulation network based on a linear model.

Background

In the post-genome era, understanding the functions of genes, non-coding RNAs, proteins and other related biomolecules, and thereby revealing the mechanisms by which biological processes are realized, has become one of the most important research goals in computational systems biology and bioinformatics. Among these topics, the study of gene regulation is particularly important: understanding the regulation mechanism of gene expression plays an important role in understanding the mechanisms of biological processes and disease development. In eukaryotes, there are two important classes of regulatory factors, transcription factors (TFs) and microRNAs (miRNAs), which regulate the expression level of a target gene at the transcriptional level and the post-transcriptional level respectively. Transcription factors are a class of proteins with specific functions that turn on the transcription of a gene by binding to its promoter region. miRNAs are gene regulatory elements discovered in recent years; they are endogenous non-coding RNAs with regulatory functions found in eukaryotes, about 20-25 nucleotides in length. Both transcription factors and miRNAs play important roles in the regulation of gene expression, which runs through a variety of biological activities and disease processes. Building on this, studies have found that transcription factors and miRNAs interact widely and regulate cooperatively, forming a complex co-regulation network. The co-regulation network comprises the regulation of miRNAs by transcription factors, the regulation of target genes by transcription factors, and the regulation of transcription factors and target genes by miRNAs. These regulatory relationships reflect every stage of the life process and the functional execution of cellular molecules, so the co-regulation network contains richer biological information than a single network. Therefore, effective identification of key regulators in the co-regulation network is important for the clinical treatment of diseases and for drug design, and may provide a new approach for the treatment of human diseases.

With the rapid development of high-throughput technology, large amounts of genomics, transcriptomics, proteomics and other omics data have been generated, providing new opportunities for research on biomolecular function. Earlier algorithms for identifying key nodes mainly focused on identifying key proteins in protein interaction networks. Evolutionary studies of transcriptional regulatory networks are more difficult than those of protein interaction networks. First, credible transcriptional regulatory network data are still difficult to obtain. Second, in the transcriptional regulatory networks that do exist, the functional characteristics of the network make its topological characteristics differ greatly from those of a protein interaction network, and the directionality of regulation makes those topological characteristics more complex. Thus, identifying key regulators in a regulatory network is more complex than identifying key proteins. In recent years, research on regulatory networks has been increasing, and many computational methods for identifying key regulators in regulatory networks have been proposed, mainly: information flow models (RWR), ranking algorithms (PageRank), classifier construction (SVM), regularized least-squares classification, Bayesian networks, regression-based models, and the like. However, the existing methods all have shortcomings to some degree, such as an inability to process large data, excessive time complexity, or accuracy that needs improvement. In 2015, Alexandra et al. proposed the MIPRIP method, which uses a linear model to identify key regulators in a regulatory network; their experimental results show that a linear model-based method can effectively identify regulators with important biological significance. However, that method only considers the relationship between transcription factors and genes and does not consider the interaction and cooperative regulation among regulators in a co-regulation network, and its identification precision also needs improvement.

Therefore, there is a need to design a method for identifying key regulators in a gene co-regulation network based on a linear model.

Disclosure of Invention

The invention aims to solve the technical problem of providing a method for identifying key regulators in a gene co-regulation network based on a linear model. The method for identifying the key regulators in the gene co-regulation network based on the linear model can identify the key regulators with biological significance in the gene co-regulation network more accurately only according to gene expression profile data and gene regulation relation.

The technical solution of the invention is as follows:

a method for identifying key regulators in a gene co-regulation network based on a linear model comprises the following steps:

step 1) constructing a gene co-regulation network:

inputting gene expression profile data, gene regulation relationships and protein-protein interaction (PPI) data, filtering out action relationship pairs that contain nodes without expression profile data, and establishing a gene co-regulation network (GCN), wherein the GCN comprises three types of nodes: regulator miRNA (microRNA), regulator TF and gene, and wherein three types of action edges exist among the nodes: miRNA-gene, TF-gene and gene-gene;

if any two points in the gene co-regulation network GCN have an action relation, the edge weight is 1, otherwise, the edge weight is 0;
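Step 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_gcn` and `edge_weight`, the node names in the usage example, and the choice to treat gene-gene (PPI) edges as undirected are hypothetical and are not taken from the patent.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]


def build_gcn(expressed: Set[str],
              mirna_gene: Iterable[Edge],
              tf_gene: Iterable[Edge],
              gene_gene: Iterable[Edge]) -> Dict[str, Set[Edge]]:
    """Drop every action pair that has an endpoint without expression data."""
    def keep(pairs: Iterable[Edge]) -> Set[Edge]:
        return {(a, b) for a, b in pairs if a in expressed and b in expressed}

    return {
        "miRNA-gene": keep(mirna_gene),
        "TF-gene": keep(tf_gene),
        "gene-gene": keep(gene_gene),
    }


def edge_weight(gcn: Dict[str, Set[Edge]], kind: str, a: str, b: str) -> int:
    """Return 1 if an action relation exists between a and b, otherwise 0."""
    edges = gcn[kind]
    if (a, b) in edges:
        return 1
    # Assumption: PPI (gene-gene) edges are undirected, so check both orders.
    if kind == "gene-gene" and (b, a) in edges:
        return 1
    return 0


# Hypothetical toy input: "miR-2" and "BRCA1" have no expression profile data.
gcn = build_gcn(
    expressed={"miR-1", "TP53", "MYC"},
    mirna_gene=[("miR-1", "TP53"), ("miR-2", "MYC")],
    tf_gene=[("MYC", "TP53")],
    gene_gene=[("TP53", "MYC"), ("TP53", "BRCA1")],
)
```

In this toy network the pairs involving `miR-2` and `BRCA1` are filtered out, and every remaining pair has weight 1.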

step 2) respectively calculating activity values of a regulator miRNA, a regulator TF and adjacent genes of known disease genes;

activity values, i.e., the influence values of miRNA, TF, and adjacent genes on known disease genes;

step 3) in the constructed gene co-regulation network GCN, constructing a linear model by using gene expression profile data and activity values of the regulator and the adjacent genes obtained in the step 2), predicting the expression of the known disease genes, and obtaining the predicted expression value of the known disease genes;

step 4) converting the linear model constructed in the step 3) into an optimization problem by minimizing the difference between the predicted expression value and the real expression value of the known disease genes, solving the optimization problem based on mixed integer linear programming (MILP), and finally identifying the key regulators in the gene co-regulation network.

Further, the linear model constructed for predicting the expression of known disease genes is expressed as follows:

$$g'_{i,s} = \beta_0 + \sum_{m=1}^{M} \beta_m \cdot es_{m,i} \cdot act_{m,s} + \sum_{t=1}^{T} \beta_t \cdot ts_{t,i} \cdot act_{t,s} + \sum_{g=1}^{G} \beta_g \cdot gs_{g,i} \cdot act_{g,s}$$

wherein i represents a known disease gene, and m, t and g represent a regulator miRNA, a regulator TF and an adjacent gene of the known disease gene i respectively;

$g'_{i,s}$ represents the predicted expression value of the known disease gene i in sample s; $\beta_0$ is the additional weight (additive offset) of the linear model; M, T and G represent the miRNA set, the TF set and the gene set; $\beta_m$, $\beta_t$ and $\beta_g$ represent the optimization parameters of m, t and g respectively, and are calculated directly by the optimizer when the optimization problem is processed in the step 4);

$es_{m,i}$, $ts_{t,i}$ and $gs_{g,i}$ represent the action edge weights between m, t, g and i respectively, and each takes the value 0 or 1;

$act_{m,s}$, $act_{t,s}$ and $act_{g,s}$ represent the activity values of m, t and g in sample s respectively;

the sample s refers to the data of a certain observed individual with a known disease.
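As a minimal sketch of how the predicted expression value is evaluated for one disease gene in one sample, the three sums can be merged into a single sum over all candidate nodes. The function name `predict_expression`, the dictionary layout and the numbers below are hypothetical; in the method itself the coefficients come from the optimizer in step 4).

```python
from typing import Dict


def predict_expression(beta0: float,
                       betas: Dict[str, float],
                       edge_weights: Dict[str, int],
                       activities: Dict[str, float]) -> float:
    """Evaluate beta0 + sum over nodes of beta * edge_weight * activity.

    The keys of `betas` cover miRNAs, TFs and adjacent genes alike, so the
    three sums of the linear model collapse into one loop. A node that has
    no action edge to the disease gene (weight 0) contributes nothing.
    """
    total = beta0
    for node, beta in betas.items():
        total += beta * edge_weights.get(node, 0) * activities[node]
    return total


# Hypothetical coefficients, edge weights and activity values for one sample.
predicted = predict_expression(
    beta0=0.5,
    betas={"miR-1": -0.2, "TF-A": 0.4, "gene-X": 0.1},
    edge_weights={"miR-1": 1, "TF-A": 1, "gene-X": 0},
    activities={"miR-1": 1.5, "TF-A": 2.0, "gene-X": 3.0},
)
```

Here `gene-X` has edge weight 0, so only the miRNA and TF terms move the prediction away from the offset.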

Further, the linear model is transformed into an optimization problem that minimizes, over the known disease gene set O and the total sample set S of the disease, the difference between the real expression value $g_{i,s}$ and the predicted expression value $g'_{i,s}$ of disease gene i in sample s;

The optimization problem is solved with the Gurobi optimizer. While the problem is being solved, the number of times each regulator is selected by the optimizer is recorded; all regulators are then ranked by their selection counts, and the top 50 ranked regulators are taken as the final candidate regulators.
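The counting and ranking step can be sketched as follows. The function name `rank_regulators` and the input format (one list of selected regulators per solved model) are assumptions made for illustration; the regulator names are taken from Table 1 only as labels, and the counts are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_regulators(selected_per_model: Iterable[Iterable[str]],
                    top_n: int = 50) -> List[Tuple[str, int]]:
    """Count how often each regulator was selected and return the top_n."""
    counts: Counter = Counter()
    for selected in selected_per_model:
        # A regulator counts at most once per solved model.
        counts.update(set(selected))
    return counts.most_common(top_n)


# Hypothetical selections from four solved models.
models = [
    ["hsa-mir-586", "FOXA2"],
    ["hsa-mir-586"],
    ["hsa-mir-586", "hsa-mir-206"],
    ["FOXA2"],
]
top2 = rank_regulators(models, top_n=2)
```

With `top_n=50` the same call would yield the candidate list described above.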

After the Gurobi optimizer is installed, the Gurobi function can be called directly from the R language once the Gurobi package has been loaded. The function has three input parameters: the optimization model, timeLimit and OutputFlag. The optimization model is obtained by converting the constructed linear model into an optimization problem that minimizes the difference between the predicted expression value and the real expression value of the known disease genes; timeLimit generally takes the value 600, and OutputFlag takes the default value 0. To obtain a series of models of different sizes, linear models are constructed under a constraint on the number of regulators of a gene: for each known disease gene, the number of regulators is set to 1 to k in turn, and a linear model is constructed for each setting.
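One standard way to write such a problem as a mixed integer linear program is sketched below. It assumes that the "difference" is an absolute error and that the cap on the number of regulators is enforced through binary selection variables; the auxiliary variables $e_{i,s}$, the indicators $b_r$ and the bound $B$ are introduced here for illustration only and are not named in the patent.

```latex
\begin{aligned}
\min_{\beta,\, e,\, b} \quad & \sum_{i \in O} \sum_{s \in S} e_{i,s} \\
\text{s.t.} \quad
& e_{i,s} \ge g_{i,s} - g'_{i,s}, \qquad e_{i,s} \ge g'_{i,s} - g_{i,s}
  && \forall i \in O,\ s \in S \\
& -B\, b_r \le \beta_r \le B\, b_r, \qquad b_r \in \{0, 1\}
  && \forall r \in M \cup T \cup G \\
& \sum_{r \in M \cup T \cup G} b_r \le k
\end{aligned}
```

The first pair of constraints makes each $e_{i,s}$ an upper bound on the absolute error, which keeps the objective linear. The second pair forces $\beta_r = 0$ whenever regulator $r$ is not selected, and the last constraint limits a model to at most $k$ selected regulators.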

Further, the activity values of the regulator miRNA, the regulator TF and the adjacent gene are calculated by the following two methods, respectively:

1) calculating the activity values of the regulator miRNA and the regulator TF:

first, the reference expression values of all target genes of the regulator r are calculated:

$$y^{b}_{r,g_t} = E\left(y_{r,g_t} \mid e(r) \to 0\right)$$

wherein r represents a regulator, namely a regulator miRNA or a regulator TF; $y^{b}_{r,g_t}$, the reference expression value of target gene $g_t$ of regulator r, is the average of the expression values of gene $g_t$ over all samples in which the expression level of regulator r tends to 0; $e(r) \to 0$ indicates that the expression level of the regulator r tends to 0;

the reference expression value of the target gene refers to the expression value of the target gene when no regulation effect is exerted;

secondly, the expression level change value of each target gene is calculated as the difference between the reference expression value of the target gene and its real expression value after the influence of the regulator, wherein $y_{g_t,s}$ represents the real expression value of target gene $g_t$ of regulator r in the sample s;

thirdly, a simple linear model is constructed from the expression level change values of the target genes and solved for the activity value $act_{r,s}$ of the regulator, wherein G' represents the target gene set of regulator r, and the model uses the sum of the expression level change values and the sum of the reference expression values over the target gene set G';

2) calculating the activity value of the adjacent genes, which is solved as the cumulative effect of the expression influence of an adjacent gene on all the genes it acts on, wherein N represents the total number of genes in the sample s, $gs_{g,i}$ represents the action weight of gene g on gene i in the sample s, and $g_{i,s}$ represents the expression value of gene i in the sample s; the sample s is the data of a certain observed individual with a known disease.
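The two activity calculations can be sketched as below, with several assumptions stated up front because the formulas themselves are not reproduced in this text: "expression tends to 0" is approximated by a threshold `eps`; the expression level change is taken as observed minus reference; the regulator activity is taken as the ratio of the summed changes to the summed reference values; and the adjacent-gene activity is taken as the weighted sum of the expression values of the genes it acts on. All function names and numbers are hypothetical.

```python
from typing import Dict, List


def reference_expression(target_expr: List[float],
                         regulator_expr: List[float],
                         eps: float = 0.1) -> float:
    """Mean target expression over samples where the regulator is near 0."""
    values = [y for y, e in zip(target_expr, regulator_expr) if abs(e) <= eps]
    if not values:
        raise ValueError("no sample with regulator expression near 0")
    return sum(values) / len(values)


def regulator_activity(targets_expr: Dict[str, List[float]],
                       regulator_expr: List[float],
                       s: int,
                       eps: float = 0.1) -> float:
    """ASSUMED form: sum of expression changes / sum of reference values."""
    refs = {g: reference_expression(y, regulator_expr, eps)
            for g, y in targets_expr.items()}
    # ASSUMED sign: change = observed expression minus reference expression.
    change = sum(targets_expr[g][s] - refs[g] for g in refs)
    return change / sum(refs.values())


def neighbor_activity(weights: Dict[str, int],
                      expr_s: Dict[str, float]) -> float:
    """ASSUMED form: sum over genes i of gs_{g,i} * g_{i,s}."""
    return sum(w * expr_s[i] for i, w in weights.items())


# Hypothetical data: three samples; the regulator is silent in the first two.
reg = [0.0, 0.05, 2.0]
targets = {"t1": [4.0, 6.0, 2.0], "t2": [10.0, 10.0, 7.0]}
act_r = regulator_activity(targets, reg, s=2)
act_g = neighbor_activity({"a": 1, "b": 0, "c": 1},
                          {"a": 2.0, "b": 5.0, "c": 3.0})
```

In the toy data both targets drop by 3 units in the third sample relative to their reference values of 5 and 10, which gives a negative activity for the regulator in that sample under the assumed sign convention.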

Further, the activity values of the regulators and the adjacent genes obtained in the step 2) are normalized before being used to construct the linear model in the step 3).
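The text does not say which normalization is used. As one possibility, a z-score normalization across samples could look like the sketch below; the choice of z-score and the function name `zscore` are assumptions, not part of the described method.

```python
from statistics import mean, pstdev
from typing import List


def zscore(values: List[float]) -> List[float]:
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All activity values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical activity values of one regulator across three samples.
normalized = zscore([1.0, 2.0, 3.0])
```

Normalizing each regulator's activity values this way would put miRNA, TF and adjacent-gene activities on a comparable scale before the coefficients are fitted.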

Advantageous effects

The invention provides a method (co-BOTLM) for identifying key regulators in a gene co-regulation network based on a linear model, which utilizes gene expression profile data and gene regulation relation to predict the expression of known disease genes by constructing the linear model to complete the identification of the key regulators in the gene co-regulation network.

Compared with the existing method for identifying the key regulators based on the linear model, the co-BOTLM method has the following advantages:

1) the method is applied to a co-regulation network, the co-regulation network contains richer biological information than a single network, so that the identified regulators have more important biological significance;

2) adding protein interaction data (PPI information) taking into account that the expression of a gene may be affected by a neighboring gene;

3) a new method is introduced to calculate the activity values of the regulators and the adjacent genes, which effectively improves the accuracy of cancer gene expression prediction. The method is simple to implement, and the key regulators in the gene co-regulation network can be identified accurately from the gene expression profile data and the gene regulation relationships alone.

Experiments show that co-BOTLM can effectively identify key regulators in a gene co-regulation network, and that the identified key regulators have important biological significance. Its accuracy is also higher than that of other methods. The experimental results are compared and analyzed in detail in the examples.

Drawings

FIG. 1 is a flow chart of the co-BOTLM of the present invention.

Detailed Description

The invention will be described in further detail below with reference to the following figures and specific examples:

example 1:

identification model of key regulators in gene co-regulation network based on linear model

The invention defines the key regulators in a gene co-regulation network as the regulators that strongly influence disease gene expression in the co-regulation network, identified by constructing a linear model from gene expression profile data and gene regulation relationships to predict the expression of known disease genes.

To clearly describe the model for identifying key regulators in a linear model-based gene co-regulation network, the inventors define the relevant concepts of this model as follows:

The linear model for predicting the expression of known disease genes is constructed as follows:

The key regulator identification model for a linear model-based gene co-regulation network aims to identify the regulators that strongly affect the expression of disease genes in the co-regulation network. The key nodes in the gene co-regulation network are identified by constructing a linear model from gene expression profile data and gene regulation relationships to predict the expression of known disease genes.

The whole process of the method for identifying key regulators in a gene co-regulation network based on a linear model is shown in FIG. 1. First, gene expression profile data, gene regulation relationships and PPI data are input. The co-BOTLM method can then be divided into 4 sub-processes:

1) constructing a gene co-regulation network;

2) considering that the expression of the gene may be affected by the regulator and the adjacent gene, activity values of miRNA, TF and the adjacent gene (i.e., influence values of miRNA, TF and the adjacent gene on the known disease gene) are calculated for the known disease gene, respectively;

3) constructing a linear model by using the expression profile data of the genes in the constructed gene co-regulation network, and predicting the expression of the known disease genes;

4) converting the linear model into an optimization problem by minimizing the difference between the predicted expression value and the real expression value of the genes, solving the optimization problem based on mixed integer linear programming (MILP), and finally identifying the key regulators in the gene co-regulation network, which ends the whole identification process.

After the Gurobi optimizer is installed, the Gurobi function can be called directly from the R language once the Gurobi package has been loaded. The function has three input parameters: the optimization model, timeLimit and OutputFlag. The timeLimit generally takes the value 600, the OutputFlag takes the default value 0, and the optimization model is obtained by converting the constructed linear model into an optimization problem that minimizes the difference between the predicted expression value and the real expression value of the known disease genes. To obtain a series of models of different sizes, linear models are constructed under a constraint on the number of regulators of a gene: for each known disease gene, the number of regulators is set to 1 to k in turn, and a linear model is constructed for each setting. In this example, k is 5 (repeated experiments showed that the results are best when k is 5).

Validity verification method of key regulator identification method in gene co-regulation network based on linear model

To verify the effectiveness of the co-BOTLM method, it was applied to an ovarian cancer data set. The experimental data set included: ovarian cancer sample data, gene regulatory relationships, PPI data, and known ovarian cancer-associated disease genes. The ovarian cancer sample data were downloaded from the TCGA database, giving 385 samples in total. After filtering out genes whose expression values were too small in absolute value or that showed no obvious differential expression across the samples, an ovarian cancer expression profile data set containing 559 miRNAs and 12456 genes was obtained. The action relationship data comprise miRNA-gene, TF-gene and PPI data, downloaded from the MicroCosm website, the ENCODE database and the BioGRID database respectively. By mapping the ovarian cancer expression profile data set and the action relationships onto each other, a miRNA-TF gene co-regulation network was finally constructed. The network comprises three types of nodes: 12381 genes, 559 miRNAs and 75 TFs. The action relationships between the nodes number 59660 for gene-gene, 241722 for miRNA-gene and 9877 for TF-gene. For the known ovarian cancer-related disease genes, 379 genes were downloaded from the DDOC database; after filtering out the disease genes without expression profile data or regulatory relationships, 123 genes remained.

In this example, a three-fold cross validation experiment is performed, and the prediction precision of the co-BOTLM method is compared with that of the MIPRIP method proposed by Alexandra et al. The Pearson correlation coefficient (PCC) is used to measure the similarity between the disease gene expression data predicted by the co-BOTLM method and the real expression data. The higher the PCC value, the higher the similarity, the more accurate the linear model constructed by the co-BOTLM method, and hence the more precise the experimental result. The PCC values in the examples are calculated using the cor function in the R language. In addition, characteristic analysis and functional enrichment analysis are carried out on the regulators identified by the co-BOTLM method.
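For readers without R at hand, the Pearson correlation coefficient between predicted and real expression values can be computed with a few lines of Python. This is an illustrative equivalent of the calculation, not the code used in the example; the function name `pearson` and the numbers are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical real and predicted expression values of one disease gene.
real = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
pcc = pearson(real, predicted)
```

A PCC close to 1 means the predicted values rise and fall with the real ones across samples, which is the sense in which the example treats a higher PCC as a more accurate model.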

1. Analyzing experimental results and verifying algorithm effectiveness

Table 1: Top 20 ranked regulators in the miRNA-TF gene co-regulation network

No.	Identified key regulators	Number of target genes	Number of optimizer selections
1	hsa-mir-106a*	377	50
2	hsa-mir-586	508	43
3	hsa-mir-423-5p	496	38
4	hsa-mir-515-3p	512	34
5	hsa-mir-181a-2*	496	34
6	hsa-mir-768-3p	530	32
7	hsa-mir-663	480	32
8	hsa-mir-539	382	31
9	hsa-mir-206	477	30
10	hsa-mir-509-3p	552	30
11	hsa-mir-362-3p	512	25
12	hsa-mir-378*	519	24
13	hsa-mir-520c-3p	566	24
14	hsa-mir-33a	523	24
15	hsa-mir-29a*	495	23
16	hsa-mir-193a-3p	496	23
17	hsa-mir-601	484	23
18	FOXA2	169	23
19	hsa-mir-26b	466	22
20	hsa-mir-30b	541	22

In this example, the three-fold cross validation experiment gives an average PCC value of 0.535, which shows that the gene expression values predicted by the linear model of the invention are highly similar to the real expression values. The linear model constructed by the co-BOTLM method is therefore accurate, and the key regulators in the network can be identified effectively. After the experiment, all regulators are ranked by the number of times the optimizer selected them, and the first 50 regulators are taken as candidate key regulators in this example. Table 1 above lists the top 20 regulators. Every regulator other than FOXA2 regulates more than 300 genes, and many of them have been confirmed to be associated with ovarian cancer. FOXA2 has fewer target genes because little experimental TF data is available. This indicates that the identified regulators play a role in the ovarian cancer gene co-regulation network: they may be related to the expression of a large number of genes, including known ovarian cancer disease genes, and thus have a critical role in the co-regulation network.

2. Experimental comparison of the co-BOTLM method with the MIPRIP method, and verification of algorithm accuracy

Table 2: PCC values of the MIPRIP method experimental results

No.	1	2	3	4	5
1	0.3329907	0.4312150	0.4436449	0.4731776	0.4893458
2	0.3195237	0.4221495	0.4500000	0.4687850	0.4851402
3	0.3214019	0.4341121	0.4571028	0.4768224	0.4916822

Note: rows 1-3 represent the folds of the three-fold cross validation experiment; columns 1-5 represent the value k of the number of regulators used to construct the linear model.

Table 3: PCC values of the co-BOTLM method experimental results

No.	1	2	3	4	5
1	0.5018750	0.5709821	0.5940179	0.6112500	0.6227679
2	0.4858036	0.5575893	0.5869643	0.6025893	0.6164286
3	0.4956250	0.5518750	0.5691964	0.5918750	0.6059821

The MIPRIP method and the co-BOTLM method of the present invention both identify key regulators of specific diseases based on linear models; however, there are three differences: 1) the MIPRIP method is applied to a regulatory network, whereas the co-BOTLM method is applied to a co-regulation network; because transcription factors and miRNAs interact widely and regulate cooperatively, the co-regulation network contains richer biological information than a single network; 2) among the factors affecting the expression of disease genes, the co-BOTLM method considers the possible effect of adjacent genes in addition to transcription factors and miRNAs; 3) the MIPRIP method and the co-BOTLM method calculate the activity values of the transcription factors and miRNAs differently. Since the MIPRIP method is applied to a regulatory network and does not consider the co-regulation relationships in the network, transcription factors are treated as ordinary genes in the comparison experiment of this example. Tables 2 and 3 show the PCC values obtained from the experimental results of the MIPRIP method and the co-BOTLM method respectively. The tables show that the co-BOTLM method obtains higher PCC values: its average PCC value is 0.571, whereas the average PCC value of the MIPRIP method is 0.433. The gene expression values predicted by the co-BOTLM method are thus more similar to the real expression values, which indirectly shows that the co-BOTLM method has higher precision and that the key regulators it identifies are more reliable.
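The two averages quoted above can be checked directly against the values in Tables 2 and 3. The short script below is only a verification aid; the variable and function names are arbitrary.

```python
from typing import List

# PCC values copied from Table 2 (MIPRIP) and Table 3 (co-BOTLM);
# rows are cross-validation folds, columns are k = 1..5.
miprip = [
    [0.3329907, 0.4312150, 0.4436449, 0.4731776, 0.4893458],
    [0.3195237, 0.4221495, 0.4500000, 0.4687850, 0.4851402],
    [0.3214019, 0.4341121, 0.4571028, 0.4768224, 0.4916822],
]
cobotlm = [
    [0.5018750, 0.5709821, 0.5940179, 0.6112500, 0.6227679],
    [0.4858036, 0.5575893, 0.5869643, 0.6025893, 0.6164286],
    [0.4956250, 0.5518750, 0.5691964, 0.5918750, 0.6059821],
]


def table_mean(rows: List[List[float]]) -> float:
    """Average over every cell of a table."""
    values = [v for row in rows for v in row]
    return sum(values) / len(values)


miprip_avg = round(table_mean(miprip), 3)
cobotlm_avg = round(table_mean(cobotlm), 3)
print(miprip_avg, cobotlm_avg)  # prints 0.433 0.571
```

In every cell of the two tables the co-BOTLM value is also higher than the MIPRIP value for the same fold and the same k.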

3. Functional enrichment analysis of the experimental results, and verification of result validity

Table 4: top 10 regulator GO enrichment analysis

No.: regulator ranking; enriched GO terms: the top 3 GO terms ranked by P-value (smaller is better); GO number: the number of GO terms with P-value < 0.05; P-value: a value < 0.05 indicates significant enrichment.

Table 5: top 10 regulator KEGG pathway enrichment analysis

No.: regulator ranking; enriched KEGG pathways: the top 3 KEGG pathways ranked by P-value (smaller is better); KEGG number: the number of KEGG pathways with P-value < 0.05; P-value: a value < 0.05 indicates significant enrichment.

In order to verify that the key regulators identified by the co-BOTLM method in the invention are biologically significant, in this example, GOstats in the R language is used to perform GO enrichment analysis and KEGG pathway enrichment analysis on the identified key regulators respectively. Table 4 and table 5 show the GO and KEGG pathway enrichment analysis results for the top 10 regulators, respectively.

Table 4 shows that most of the top 10 regulators identified by the co-BOTLM method of the present invention are enriched in more than 300 GO terms. The more frequently enriched GO terms include cellular component assembly, cellular processes, cell death, and negative regulation of dendritic cell differentiation, which indicates that the identified regulators participate in a large number of cell-related life processes. hsa-mir-515-3p and hsa-mir-768-3p are each enriched in fewer than 100 GO terms, probably because the target genes of these two miRNAs match the GOstats library less well. Jiang et al. showed in 2016 that hsa-mir-768-3p has a potential prognostic function in ovarian cancer, because its down-regulation is linked to MEK/ERK-mediated enhancement of protein synthesis in melanoma cells. Similarly, Table 5 shows that most of the top 10 regulators are enriched in at least 5 KEGG pathways, among which the more frequently enriched include conservation, pathways in cancer, signaling pathways, and the ErbB signaling pathway. This shows that the identified regulators are involved in a large number of cancer and signaling pathways and are closely related to cancer. In conclusion, the experimentally identified regulators are involved in a large number of biological processes, especially those associated with cellular activity and cancer, and thus are of great biological significance.

Claims

1. A method for identifying key regulators in a gene co-regulation network based on a linear model is characterized by comprising the following steps:

step 1) constructing a gene co-regulation network:

inputting gene expression profile data, gene regulation relationships and protein interaction data, filtering out action relationship pairs that contain nodes without expression profile data, and establishing a gene co-regulation network GCN, wherein the gene co-regulation network GCN comprises three types of nodes: regulator miRNA, regulator TF and gene, and action edges exist between the nodes: miRNA-gene, TF-gene and gene-gene;

2. The method for identifying key regulators in a linear model-based gene co-regulation network according to claim 1, wherein the linear model expression constructed for predicting the expression of known disease genes is as follows:

$$g'_{i,s} = \beta_0 + \sum_{m=1}^{M} \beta_m \cdot es_{m,i} \cdot act_{m,s} + \sum_{t=1}^{T} \beta_t \cdot ts_{t,i} \cdot act_{t,s} + \sum_{g=1}^{G} \beta_g \cdot gs_{g,i} \cdot act_{g,s}$$

wherein $g'_{i,s}$ represents the predicted expression value of the known disease gene i in sample s; $\beta_0$ is the additional weight of the linear model; M, T and G represent the miRNA set, the TF set and the gene set; $\beta_m$, $\beta_t$ and $\beta_g$ represent the optimization parameters of m, t and g respectively, and are calculated directly by the optimizer when the optimization problem is processed in the step 4);

3. The method for identifying key regulators in a linear model-based gene co-regulation network according to claim 2, wherein the linear model is transformed into an optimization problem according to minimization of the difference between the predicted expression value and the true expression value of the gene, which is expressed as:

4. The method for identifying key regulators in a linear model-based gene co-regulation network according to any one of claims 1-3, wherein the activity values of the regulator miRNA, the regulator TF and the adjacent genes are calculated by the following two methods respectively:

1) calculating the activity values of the regulator miRNA and the regulator TF:

$$y^{b}_{r,g_t} = E\left(y_{r,g_t} \mid e(r) \to 0\right)$$

wherein $y_{g_t,s}$ represents the true expression value of target gene $g_t$ of regulator r in the sample s, and the expression level change value represents the change in expression level of target gene $g_t$ of regulator r;

2) calculating the activity value of the adjacent genes, and solving by adopting the cumulative effect based on the expression influence of the adjacent genes on all action genes, namely:

5. The method for identifying key regulators in the linear model-based gene co-regulation network according to claim 4, wherein the activity values of the regulators and adjacent genes obtained in the step 2) are normalized and then used for constructing the linear model in the step 3).