Abstract
Transcriptome association studies have helped prioritize many causal genes for detailed study and have thereby aided the development of therapeutic strategies for multiple diseases. However, prioritizing only the causal genes does not always offer sufficient guidance for downstream analysis. In this paper, we therefore propose to perform association studies from another perspective: we aim to prioritize genes with a tradeoff between the pursuit of causal evidence and the interest of the genes in the pathway. We introduce a new method for transcriptome association studies that incorporates the information of gene regulatory networks. In addition to building the regularization directly into variable selection methods, we also expect the method to report p-values for the associated genes, because p-values have been empirically proven trustworthy by geneticists. Thus, we introduce a high-dimensional variable selection method with the following two merits: it has flexible modeling power that allows domain experts to consider the structure of covariates, so that prior knowledge, such as a gene regulatory network, can be integrated; and it calculates p-values in a practical manner widely accepted by geneticists, so that the identified covariates can be directly assessed with statistical guarantees. With simulations, we demonstrate the empirical strength of our method against other high-dimensional variable selection methods. We further apply our method to Alzheimer’s disease, where it identifies interesting sets of genes.
Appendices
A Additional Simulation Experiments
Different Strengths of the Regulation. We further study how the strength of regulation affects the performance of our method. We model this change in strength by varying the parameter r in the data generation process, while the rest of the configuration remains the same as in the data generation process. We continue to focus on the intermediate level of the previous example, where we set \(v=16\).
Similarly, we repeat the experiments three times and plot the ROC curves, with the standard deviation shown as the shaded areas in Fig. 2.
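The shaded band described above can be computed pointwise once every repeat has been evaluated on a shared grid of false positive rates. The following is a minimal sketch under that assumption, not the authors' plotting code: the function name `roc_band`, the use of the sample standard deviation, and the toy numbers are all hypothetical.

```python
from statistics import mean, stdev

def roc_band(tpr_runs):
    """Summarize repeated ROC curves sampled on a shared FPR grid.

    tpr_runs holds one list of true positive rates per repeat (assumed
    layout). Returns the pointwise mean plus a lower and an upper curve
    at one standard deviation, clipped to the valid range [0, 1].
    """
    centre = [mean(col) for col in zip(*tpr_runs)]
    spread = [stdev(col) for col in zip(*tpr_runs)]  # sample std (assumption)
    lower = [max(0.0, c - s) for c, s in zip(centre, spread)]
    upper = [min(1.0, c + s) for c, s in zip(centre, spread)]
    return centre, lower, upper

# Three hypothetical repeats evaluated at FPR = 0, 0.5, 1.
runs = [[0.0, 0.5, 1.0], [0.0, 0.7, 1.0], [0.0, 0.6, 1.0]]
centre, lower, upper = roc_band(runs)
```

The `lower` and `upper` curves are what a plotting library would fill between to draw the shaded area.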
As Fig. 2 shows, our method is on par with previous hypothesis testing methods over most correlation levels. When \(r=1\), the regulated genes are distributed in the same way as the TF, although they are associated with smaller effect sizes. Both LMM and KMM are good enough to uncover the associated genes in this case. When r is smaller (0.5 or 0.3), the regulated genes are less dependent on the TF, and the hypothesis testing methods all perform similarly, probably because the network structure offers no advantage when the regulated genes are more independent of the TF. However, when \(r=0.7\), the KMM method shows a clear advantage over the other methods. In summary, our proposed method can outperform other methods when there is a strong correlation between the TF and the regulated genes (but not so strong that the regulated genes and the TF are identically distributed). We believe this is the most frequently seen scenario in real-world data. In the other scenarios, our method does not perform worse than other methods, so there is no loss in using our method in general. In fact, if one calculates the area under the ROC curve for Fig. 2, our method performs the best in all four tested scenarios, although its advantages in the other three scenarios are marginal.
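The area under the ROC curve mentioned above can be computed directly from per-gene scores without drawing the curve. The following is a minimal sketch, not the authors' evaluation code: it assumes each gene has a binary ground-truth label and a score where larger means stronger evidence (for example, the negative logarithm of a p-value). The function name `roc_auc` and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen associated gene
    receives a higher score than a randomly chosen null gene, with
    ties counted as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: two associated genes, two null genes.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of genes, which is fine for illustration; a sort-based rank computation would be preferable for large gene sets.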
Misspecified Network Structure. Finally, because our method is built upon knowledge of the network structure, we are interested in what happens if the network structure is misspecified, since in practice we may not always be able to obtain a network structure faithful to the underlying regulatory mechanism. To simulate this, we introduce another hyperparameter q in the data generation process. When we generate the network structure N, we drop each edge in the network structure with probability \(1-q\). The rest of the data generation configuration is the same as the general one introduced in the preceding text.
Again, we repeat the experiments three times and plot the ROC curves, with the standard deviation shown as the shaded areas in Fig. 3.
As Fig. 3 shows, our method is surprisingly robust to misspecification of the prior network structure. When \(q=1\), the input network is faithful to the underlying regulatory network, and the KMM method clearly outperforms the competing methods. Interestingly, the advantage of the KMM method persists even when half of the edges of the input network are missing (\(q=0.5\)). When \(q=0.3\), which means that 70% of the edges of the underlying regulatory network are missing from the input network for the model, the proposed method starts to perform similarly to the previous hypothesis testing methods. Even in this case, the calculated area under the ROC curve of KMM is higher than that of the competing methods, although this advantage cannot be observed in the ROC curves.
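The edge-dropping step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' simulation code: the network N is represented here as a list of (TF, gene) pairs, and the function name `drop_edges`, the `seed` argument, and the toy edge list are hypothetical.

```python
import random

def drop_edges(edges, q, seed=0):
    """Return a misspecified copy of the network.

    Each edge is kept independently with probability q, that is,
    dropped with probability 1 - q. The seed makes a run reproducible.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so q = 1 keeps every edge
    # and q = 0 removes every edge.
    return [edge for edge in edges if rng.random() < q]

# Hypothetical toy network: one TF regulating five genes.
true_network = [("TF1", f"gene{i}") for i in range(5)]
observed_network = drop_edges(true_network, q=0.5, seed=1)
```

Only the network passed to the model is altered in this sketch; the expression data would still be generated from `true_network`.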
B Covariate Regression
To demonstrate the successful correction of these factors, we compared the Spearman’s correlation between the expressions and the covariates before and after the correction. Figure 4 shows this comparison across the three different compartments studied in this work, and we can see that the correlation between each gene and the age covariates drops significantly after the regression.
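The before-and-after comparison can be illustrated for a single gene and a single covariate. The sketch below is a simplification, not the authors' pipeline: it assumes ordinary least squares on one covariate (age), and the helper names and toy values are hypothetical.

```python
def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def regress_out(expression, covariate):
    """Residuals of a least-squares fit of expression on one covariate."""
    mx = sum(covariate) / len(covariate)
    my = sum(expression) / len(expression)
    vx = sum((a - mx) ** 2 for a in covariate)
    slope = sum((a - mx) * (b - my) for a, b in zip(covariate, expression)) / vx
    return [b - my - slope * (a - mx) for a, b in zip(covariate, expression)]

# Hypothetical toy data: expression of one gene rising with age.
age = [60, 65, 70, 75, 80, 85]
expr = [1.0, 1.4, 2.1, 2.4, 3.2, 3.3]
before = spearman(age, expr)
after = spearman(age, regress_out(expr, age))
```

Least squares removes the linear (Pearson) association exactly, so the Spearman correlation of the residuals is typically small but need not be exactly zero.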
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H., Lopez, O.L., Wu, W., Xing, E.P. (2022). Gene Set Priorization Guided by Regulatory Networks with p-values through Kernel Mixed Model. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science, vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_7
DOI: https://doi.org/10.1007/978-3-031-04749-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science, Computer Science (R0)