Gene Set Priorization Guided by Regulatory Networks with p-values through Kernel Mixed Model

Haohan Wang⁹,
Oscar L. Lopez¹⁰,
Wei Wu¹¹ &
…
Eric P. Xing¹²

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13278)

Included in the following conference series:

International Conference on Research in Computational Molecular Biology

Abstract

Transcriptome association studies have helped prioritize many causal genes for detailed study and have thus furthered the development of therapeutic strategies for multiple diseases. However, prioritizing only the causal gene does not always seem to offer sufficient guidance for downstream analysis. In this paper, we therefore propose to perform association studies from another perspective: we aim to prioritize genes with a tradeoff between the pursuit of causality evidence and the interest of the genes in the pathway. We introduce a new method for transcriptome association studies that incorporates the information of gene regulatory networks. In addition to building the regularization directly into variable selection methods, we also expect the method to report p-values for the associated genes, because p-values have been empirically proven trustworthy by geneticists. Thus, we introduce a high-dimensional variable selection method with the following two merits: it has flexible modeling power that allows domain experts to consider the structure of the covariates, so that prior knowledge, such as the gene regulatory network, can be integrated; and it calculates p-values in a practical manner widely accepted by geneticists, so that the identified covariates can be directly assessed with statistical guarantees. With simulations, we demonstrate the empirical strength of our method against other high-dimensional variable selection methods. We further apply our method to Alzheimer’s disease, where it identifies interesting sets of genes.


Author information

Authors and Affiliations

Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Haohan Wang
Alzheimer’s Disease Research Center, University of Pittsburgh Medical Center, Pittsburgh, USA
Oscar L. Lopez
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Wei Wu
Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
Eric P. Xing


Corresponding authors

Correspondence to Wei Wu or Eric P. Xing.

Editor information

Editors and Affiliations

Columbia University, New York, NY, USA
Itsik Pe'er

Appendices

A Additional Simulation Experiments

Different Strengths of the Regulation. We further study how the strength of regulation affects the performance of our method. We model this change in strength by varying the parameter r in the data generation process, while the remaining configurations stay the same. We continue to focus on the intermediate level of the previous example, where we set \(v=16\).

Similarly, we repeat the experiments three times and plot the ROC curves in Fig. 2, with the standard deviation shown as shaded areas.

As Fig. 2 shows, our method is on par with previous hypothesis testing methods over most correlation levels. When \(r=1\), the regulated genes are distributed in the same way as the TF, although they are associated with smaller effect sizes. Both LMM and KMM are good enough to uncover the associated genes in this case. When r is smaller (0.5 or 0.3), the regulated genes are less dependent on the TF and the hypothesis testing methods all perform similarly, probably because the network structure offers no advantage when the regulated genes are more independent of the TF. However, when \(r=0.7\), the KMM method starts to show a clear advantage over the other methods. In summary, our proposed method can outperform other methods when there is a strong correlation between the TF and the regulated genes (but not so strong that the regulated genes and the TF are identically distributed). We believe this is the most frequently seen scenario in real-world data. In the other scenarios, our method does not perform worse than the other methods, so there is no loss in using our method in general. In fact, if one calculates the area under the ROC curve for Fig. 2, our method performs best in all four tested scenarios, although its advantage in the other three scenarios is marginal.
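The comparison above relies on the area under the ROC curve, summarized over repeated runs. A minimal sketch of that summary follows, assuming each run yields a list of ground-truth labels (1 for a truly associated gene) and a list of scores where larger means more strongly associated, for example \(-\log_{10}\) of the p-value. The function names `roc_auc` and `summarize_auc` and the toy inputs are hypothetical and are not taken from the paper's code.

```python
import statistics


def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: 1 for a truly associated gene, 0 otherwise.
    scores: larger means "more associated" (e.g. -log10 of a p-value).
    A positive/negative pair with tied scores counts as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def summarize_auc(runs):
    """Mean and sample standard deviation of the AUC over repeated runs.

    runs: iterable of (labels, scores) pairs, one per repetition.
    Requires at least two runs, since a sample deviation is computed.
    """
    aucs = [roc_auc(labels, scores) for labels, scores in runs]
    return statistics.mean(aucs), statistics.stdev(aucs)


# Toy illustration with three repetitions (made-up scores, not paper data).
toy_runs = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]),
    ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),
    ([1, 1, 0, 0], [0.7, 0.4, 0.5, 0.1]),
]
mean_auc, std_auc = summarize_auc(toy_runs)
print(f"AUC over {len(toy_runs)} runs: {mean_auc:.3f} +/- {std_auc:.3f}")
```

The pairwise form is quadratic in the number of genes, which is acceptable for a sketch; a rank-based implementation would be preferable for genome-scale inputs.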

Misspecified Network Structure. Finally, because our method is built upon knowledge of the network structure, we are interested in what happens when the network structure is misspecified, since in practice we may not always be able to obtain a network structure faithful to the underlying regulatory mechanism. To simulate this, we introduce another hyperparameter q in the data generation process: when we generate the network structure N, we drop the edges in the network structure with probability \(1-q\). The rest of the data generation configuration is the same as the general one introduced in the preceding text.
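The edge-dropping step can be sketched as follows. This is only an illustration under stated assumptions: the network is represented as a list of (TF, gene) pairs, each edge is dropped independently of the others, and the function name `misspecify_network` and the gene labels are hypothetical rather than taken from the paper's code.

```python
import random


def misspecify_network(edges, q, seed=None):
    """Return a copy of `edges` in which each edge is kept with probability q.

    Equivalently, each edge is dropped with probability 1 - q.
    Assumption: edges are dropped independently of one another.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so q = 1 keeps every edge
    # and q = 0 keeps none.
    return [edge for edge in edges if rng.random() < q]


# Toy regulatory network: one TF regulating five genes (hypothetical names).
true_edges = [("TF1", f"gene{i}") for i in range(1, 6)]
for q in (1.0, 0.5, 0.3):
    observed = misspecify_network(true_edges, q, seed=0)
    print(q, len(observed), "of", len(true_edges), "edges kept")
```

With \(q=1\) the input network equals the generated one, and smaller q removes a larger expected fraction of edges from the network passed to the model.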

Again, we repeat the experiments three times and plot the ROC curves in Fig. 3, with the standard deviation shown as shaded areas.

As Fig. 3 shows, our method is surprisingly robust to misspecification of the prior network structure. When \(q=1\), the input network is faithful to the underlying regulatory network, and the KMM method clearly outperforms the competing methods. Interestingly, the advantage of the KMM method persists even when half of the edges of the input network are missing (\(q=0.5\)). When \(q=0.3\), which means that 70% of the edges of the underlying regulatory network are missing from the input network given to the model, the proposed method starts to perform similarly to the previous hypothesis testing methods. Even in this case, the area under the ROC curve of KMM is higher than that of the competing methods, although this advantage cannot be observed in the ROC curves.

B Covariate Regression

To demonstrate the successful correction of these factors, we compared Spearman’s correlation between the gene expressions and the covariates before and after the correction. Figure 4 shows this comparison across the three different compartments studied in this work, and we can see that the correlation between each gene and the age covariates drops significantly after the regression.
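The before-and-after check can be illustrated with a small sketch. The regression procedure is not spelled out in this appendix, so the code assumes an ordinary least-squares fit of one gene's expression on a single covariate (age) with an intercept; the helper names and the toy numbers are hypothetical and do not come from the study data.

```python
import statistics


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def regress_out(expression, covariate):
    """Residuals of a least-squares fit of expression on one covariate.

    Assumption: a single covariate plus an intercept.
    """
    mc, me = statistics.fmean(covariate), statistics.fmean(expression)
    scc = sum((c - mc) ** 2 for c in covariate)
    slope = sum((c - mc) * (e - me) for c, e in zip(covariate, expression)) / scc
    return [e - (me + slope * (c - mc)) for c, e in zip(covariate, expression)]


# Toy data: expression that increases with age plus noise (made-up values).
age = [60, 62, 65, 70, 74, 79, 83, 88]
noise = [0.3, -0.2, 0.1, -0.4, 0.5, -0.1, -0.3, 0.1]
expression = [2.0 + 0.05 * a + n for a, n in zip(age, noise)]

residual = regress_out(expression, age)
print("Spearman before:", round(spearman(age, expression), 3))
print("Spearman after: ", round(spearman(age, residual), 3))
```

Least-squares residuals have zero Pearson correlation with the covariate by construction, whereas Spearman's correlation is only expected to shrink toward zero, which is why it is informative to report it before and after the correction.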

About this paper

Cite this paper

Wang, H., Lopez, O.L., Wu, W., Xing, E.P. (2022). Gene Set Priorization Guided by Regulatory Networks with p-values through Kernel Mixed Model. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. https://doi.org/10.1007/978-3-031-04749-7_7

DOI: https://doi.org/10.1007/978-3-031-04749-7_7
Published: 29 April 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04748-0
Online ISBN: 978-3-031-04749-7
eBook Packages: Computer Science (R0)
