
1 Introduction

Alzheimer’s Disease (AD) is a severe and growing worldwide health problem. Many techniques have been developed to investigate AD: magnetic resonance imaging (MRI) is a powerful neuroimaging modality for identifying preclinical and clinical AD patients, and genome-wide association studies (GWAS) [4] have achieved great success in finding single nucleotide polymorphisms (SNPs) associated with AD. For example, APOE is a highly prevalent AD risk gene, and each copy of the adverse variant is associated with a 3-fold increase in AD risk. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) collects neuroimaging and genomic data from elderly individuals across North America. However, processing and integrating genetic data across different institutions is challenging. Institutions may wish to collaborate with one another, but legal or ethical regulations often restrict access to individual-level data in order to protect data privacy.

Some studies, such as ADNI, share genomic data publicly under certain conditions, but more commonly each participating institution is required to keep its genomic data private, so pooling all data in one place may not be feasible. To deal with this challenge, we propose a novel distributed framework, termed the Local Query Model (LQM), to perform Lasso regression analysis in a distributed manner and learn genetic risk factors without accessing others’ data. However, applying LQM for model selection, such as stability selection, can be very time-consuming on a large-scale data set. To speed up the learning process, we propose a family of distributed safe screening rules (D-SAFE and D-EDPP) that identify irrelevant features and remove them from the optimization without sacrificing accuracy. LQM is then employed on the reduced data matrix to train the model, so that each institution obtains the top risk genes for AD by stability selection on the learnt model without revealing its own data set. We evaluate our method on the ADNI GWAS data, which contains 809 subjects with 5,906,152 SNP features, an 80 GB data matrix with approximately 42 billion nonzero elements, distributed across three research institutions. Empirical evaluations demonstrate a 66-fold speedup gained by D-EDPP, compared to LQM without D-EDPP. Stability selection results show that the proposed framework ranks APOE as the top risk SNP among all features.

2 Data Processing

2.1 ADNI GWAS Data

The ADNI GWAS data contain genotype information for each of the 809 ADNI participants, consisting of 128 patients with AD, 415 with mild cognitive impairment (MCI), and 266 cognitively normal (CN) subjects. SNPs at approximately 5.9 million specific loci are recorded for each participant. We encode SNPs with the coding scheme in [7] and apply two quality control criteria, Minor Allele Frequency (MAF) \(<0.05\) and Genotype Quality (GQ) \(<45\), to retain high-quality SNP features; see [11] for details.
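As a rough illustration of this quality control step (not the exact ADNI preprocessing pipeline; the file names and array layout below are hypothetical), the filtering can be expressed as a boolean mask over SNP columns:

```python
import numpy as np

# Hypothetical per-SNP quality metrics; in practice these come from the
# ADNI genotyping files (the file names here are illustrative only).
maf = np.load("snp_maf.npy")              # minor allele frequency per SNP
gq = np.load("snp_genotype_quality.npy")  # genotype quality per SNP

# Keep SNPs passing both criteria (MAF >= 0.05 and GQ >= 45).
keep = (maf >= 0.05) & (gq >= 45)
print(f"retained {keep.sum()} of {keep.size} SNPs")
```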

2.2 Data Partition

Lasso [9] is a widely-used regression technique to find sparse representations of data, or predictive models. Standard Lasso takes the form of

$$\begin{aligned} \min _{x} \frac{1}{2}||Ax-y||^2_2+\lambda ||x||_1: x\in \mathbb {R}^p, \end{aligned}$$
(1)

where A is the genomic data matrix distributed across different institutions, y is the response vector (e.g., hippocampus volume or disease status), x is the sparse representation shared across all institutions, and \(\lambda \) is a positive regularization parameter.

Suppose that we have m participating institutions. For the ith institution, we denote its data set by \((A_i,y_i)\), where \(A_i\in \mathbb {R}^{n_i\times p}\), \(n_i\) is the number of subjects at this institution, p is the number of features, \(y_i\in \mathbb {R}^{n_i}\) is the corresponding response vector, and \(n=\sum _i^m n_i\). We assume p is the same across all m institutions. Our goal is to apply Lasso to rank risk SNPs of AD based on the distributed data sets \((A_i,y_i)\), \(i=1,2,...,m\).
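For concreteness, a minimal sketch of this row-wise partition with synthetic NumPy arrays (the per-institution subject counts follow Sect. 4.1; the feature dimension is shrunk for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_i = [326, 215, 176]   # subjects per institution, as in Sect. 4.1
p = 1000                # illustrative; the real p is about 5.9 million

# Each institution i privately holds (A_i, y_i); raw rows are never shared.
A_parts = [rng.standard_normal((n, p)) for n in n_i]
y_parts = [rng.standard_normal(n) for n in n_i]
```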

3 Methods

Figure 1 illustrates the general idea of our distributed framework. Suppose that each institution maintains the ADNI genome-wide data for a subset of the subjects. We first apply the distributed Lasso screening rule to pre-identify inactive features and remove them from the training phase. Next, we employ LQM on the reduced data matrices to perform collaborative analyses across the different institutions. Finally, each institution obtains the learnt model and performs stability selection to rank the SNPs that may collectively affect AD. Stability selection counts the frequency of nonzero entries in the solution vectors and selects the most frequent features as the top risk genes for AD. The whole learning procedure yields the same model for all institutions and preserves data privacy at each of them.

Fig. 1. The pipeline of our proposed framework.

3.1 Local Query Model

We apply a proximal gradient descent algorithm, the Iterative Shrinkage/Thresholding Algorithm (ISTA) [2], to solve problem (1). We define \(g(x;A,y)=\frac{1}{2}||Ax-y||^2_2\) as the least squares loss function. The general updating rule of ISTA is:

$$\begin{aligned} x^{k+1} = \varGamma _{\lambda t_k}(x^k-t_k \nabla g(x^k;A,y)), \end{aligned}$$
(2)

where k is the iteration number, \(t_k\) is an appropriate step size, and \(\varGamma \) is the soft thresholding operator [8] defined by \(\varGamma _\alpha (x) = sign(x)\cdot (|x|-\alpha )_{+}\).
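As a minimal, single-machine reference for (2) (not the distributed version), the soft-thresholding operator and one ISTA update can be sketched as follows, assuming a fixed step size:

```python
import numpy as np

def soft_threshold(x, alpha):
    """Soft-thresholding operator: sign(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def ista_step(x, A, y, lam, step):
    """One ISTA update (2) for the Lasso with a fixed step size."""
    grad = A.T @ (A @ x - y)    # gradient of 0.5 * ||Ax - y||_2^2
    return soft_threshold(x - step * grad, lam * step)
```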

In view of (2), to solve (1), we need to compute the gradient of the loss function, \(\nabla g = A^T (Ax-y)\). However, because the data set (A, y) is distributed across different institutions, we cannot compute the gradient directly. To address this challenge, we propose a Local Query Model to learn the model x across multiple institutions without compromising data privacy.

In our study, each institution maintains its own data set \((A_i,y_i)\) to preserve its privacy. To avoid collecting all data matrices \(A_i, i=1,2,...,m,\) together, we can rewrite problem (1) as the following equivalent formulation: \(\min _{x} \sum _i^m g_i(x; A_i,y_i)+\lambda ||x||_1: i=1,2,...,m,\) where \(g_i(x; A_i, y_i)=\frac{1}{2} ||A_i x - y_i||_2^2\) is the least squares loss.

The key of LQM lies in the following decomposition: \(\nabla g = A^T(Ax-y)=\sum _{i=1}^m A^T_i (A_ix-y_i)= \sum _{i=1}^m \nabla g_i\). We refer to each participating institution as a “local institution” and to the place where intermediate results are aggregated as the “global center”. The ith local institution computes \(\nabla g_i = A^T_i (A_ix-y_i)\) and sends this partial gradient of the loss function to the global center. After gathering all the partial gradients, the global center computes the exact gradient with respect to x by adding all the \(\nabla g_i\) together and sends the updated gradient \(\nabla g\) back to all the local institutions, which then update x.
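A minimal sketch of one LQM gradient round under this decomposition: the “global center” is modeled as a plain function that sums the partial gradients, so only the vectors \(\nabla g_i\) ever leave the local institutions. In a real deployment the loop below would be network calls (e.g., Spark tasks), but the arithmetic is exactly this sum.

```python
import numpy as np

def local_gradient(A_i, y_i, x):
    """Computed at the ith local institution; only this vector leaves the site."""
    return A_i.T @ (A_i @ x - y_i)

def lqm_gradient(A_parts, y_parts, x):
    """Global center: aggregate the partial gradients into the exact gradient."""
    partials = [local_gradient(A_i, y_i, x) for A_i, y_i in zip(A_parts, y_parts)]
    return np.sum(partials, axis=0)   # nabla g = sum_i nabla g_i
```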

The master (global center) only serves as the computation center and does not store any data sets. Although the master receives \(\nabla g_i\), it cannot reconstruct \(A_i\) and \(y_i\). Let \(\nabla g_i^k\) denote the value of \(\nabla g_i\) at the kth iteration. Suppose x is initialized to zero; then \(\nabla g_i^1=-A_i^T y_i\) and \(\nabla g_i^k= A_i^T(A_i x^k-y_i)\). The master can obtain \(A^T_i A_i x^k\) as \(\nabla g_i^k-\nabla g_i^1\), but \(A_i\) cannot be reconstructed from it, since updating and storing x only happens at the workers (local institutions). As a result, LQM properly maintains data privacy for all the institutions.

3.2 Safe Screening Rules for Lasso

The dual problem of Lasso (1) can be formulated as follows:

$$\begin{aligned} \sup _{\theta }\left\{ \frac{1}{2}||y||_2^2-\frac{\lambda ^2}{2}||\theta -\frac{y}{\lambda }||_2^2 : \left| [A]_j^T \theta \right| \le 1, j = 1,2,...,p \right\} , \end{aligned}$$
(3)

where \(\theta \) is the dual variable and \([A]_j\) denotes the jth column of A. Let \(\theta ^*(\lambda )\) be the optimal solution of problem (3) and \(x^*(\lambda )\) the optimal solution of problem (1). The Karush–Kuhn–Tucker (KKT) conditions are given by:

$$\begin{aligned} y = Ax^*(\lambda )+\lambda \theta ^*(\lambda ), \end{aligned}$$
(4)
$$\begin{aligned}{}[A]_j^T \theta ^*(\lambda ) \in \left\{ \begin{array}{ll} \text {sign} ([x^*(\lambda )]_j), &{}\quad \text {If }[x^*(\lambda )]_j \ne 0, \\ {[-1,1]}, &{}\quad \text {If } [x^*(\lambda )]_j=0, \end{array} \right. \end{aligned}$$
(5)

where \([x^*(\lambda )]_j\) denotes the jth component of \(x^*(\lambda )\). In view of the KKT condition (5), the following rule holds: \(\left| [A]_j^T \theta ^*(\lambda ) \right| <1 \Rightarrow [x^*(\lambda )]_j=0 \Rightarrow x_j \text { is an inactive feature}\).

Inactive features have zero components in the optimal solution vector \(x^*(\lambda )\), so we can remove them from the optimization without affecting the optimal value of the objective function in (1). Screening methods of this kind are called safe screening rules. SAFE [3] is one highly efficient safe screening method. In SAFE, the jth feature is discarded when

$$\begin{aligned} \left| [A]_j^Ty\right| <\lambda - ||[A]_j||_2||y||_2\frac{\lambda _{\max }-\lambda }{\lambda _{\max }}, \end{aligned}$$
(6)

where \(\lambda _{\max } = \max _j \left| [A]_j^T y \right| \). As a result, the optimization can be performed on the reduced data matrix \(\widetilde{A}\) and the original problem (1) can be reformulated as:

$$\begin{aligned} \min \limits _{\widetilde{x}} {\frac{1}{2}||\widetilde{A}\widetilde{x}-y||_2^2+ \lambda ||\widetilde{x}||_1: \widetilde{x}\in \mathbb {R}^{\widetilde{p}} } \text { and }\widetilde{A}\in \mathbb {R}^{n\times \widetilde{p}}, \end{aligned}$$
(7)

where \(\widetilde{p}\) is the number of remaining features after employing safe screening rules. The optimization is performed on a reduced feature matrix, accelerating the whole learning process significantly.

3.3 Distributed Safe Screening Rules for Lasso

As data are distributed across different institutions, we develop a family of distributed Lasso screening rules to identify and discard inactive features in a distributed environment. Suppose the ith institution holds the data set \((A_i, y_i)\); we summarize a distributed version of the SAFE screening rule (D-SAFE) as follows:

  • Step 1: \( Q_i=[A_i]^T y_i\), update \(Q = \sum _i^m Q_i\) by LQM.

  • Step 2: \( \lambda _{\max } = \max _j \left| [Q]_j \right| .\)

  • Step 3: If \(\left| [Q]_j \right| <\lambda - ||[A]_j||_2||y||_2\frac{\lambda _{\max }-\lambda }{\lambda _{\max }}\), discard the jth feature.

To compute \(||[A]_j||_2\) in Step 3, we first compute \(H_i=||[A_i]_j||_2^2\) at each institution and perform LQM to obtain \(H=\sum _i^m H_i\). Then, we have \(||[A]_j||_2=\sqrt{H}\). We compute \(||y||_2\) in Step 3 in the same way. As the data communication only involves intermediate results, D-SAFE preserves the data privacy of each institution.
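A minimal sketch of the three D-SAFE steps, with the LQM network calls replaced by explicit per-institution sums so that the snippet runs on one machine while mirroring the distributed arithmetic:

```python
import numpy as np

def d_safe_keep_mask(A_parts, y_parts, lam):
    """Return a boolean mask over features kept (i.e., not discarded) by D-SAFE."""
    # Step 1: Q = A^T y, aggregated from Q_i = A_i^T y_i (via LQM in the distributed setting).
    Q = sum(A_i.T @ y_i for A_i, y_i in zip(A_parts, y_parts))
    # Step 2: lambda_max = max_j |[Q]_j|.
    lam_max = np.max(np.abs(Q))
    # Column norms ||[A]_j||_2 and ||y||_2, aggregated as H = sum_i H_i.
    col_norms = np.sqrt(sum(np.sum(A_i ** 2, axis=0) for A_i in A_parts))
    y_norm = np.sqrt(sum(y_i @ y_i for y_i in y_parts))
    # Step 3: discard feature j if |[Q]_j| < lam - ||[A]_j||_2 ||y||_2 (lam_max - lam) / lam_max.
    threshold = lam - col_norms * y_norm * (lam_max - lam) / lam_max
    return np.abs(Q) >= threshold
```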

To tune the value of \(\lambda \), commonly used methods such as cross validation need to solve the Lasso problem along a sequence of parameters \(\lambda _0> \lambda _1>...>\lambda _{\kappa }\), which can be very time-consuming. Enhanced Dual Polytope Projection (EDPP) [10] is a highly efficient safe screening rule. Implementation details of EDPP are available on GitHub: http://dpc-screening.github.io/lasso.html.

To address the problem of data privacy, we propose a distributed Lasso screening rule, termed Distributed Enhanced Dual Polytope Projection (D-EDPP), to identify and discard inactive features along a sequence of parameter values in a distributed manner. The idea of D-EDPP is similar to LQM: to update the global variables, we apply LQM to query each local institution for intermediate results computed locally and aggregate them at the global center. After obtaining the reduced matrix for each institution, we apply LQM to solve the Lasso problem on the reduced data sets \(\widetilde{A}_i\), \(i=1,...,m\). We assume that j indicates the jth column of A, \(j=1,...,p\), where p is the number of features. We summarize the proposed D-EDPP in Algorithm 1.

To calculate R, we apply LQM: all the \(R_i\) are aggregated at the global center by \(R=\sum _i^m R_i\), and R is sent back to every institution. The same approach is used to calculate T, S, and w in D-EDPP. The calculation of \(||[A]_j||_2\) and \(||v^{\bot }_2(\lambda _{k},\lambda _{k-1})||_2\) follows the same procedure as in D-SAFE. The screening result at \(\lambda _k\) relies on the previous optimal solution \(x^*(\lambda _{k-1})\). In particular, \(\lambda _k\) equals \(\lambda _{\max }\) when k is 0, so all components of \(x^*(\lambda _0)\) are zero. When k is 1, we can therefore perform screening based on \(x^*(\lambda _0)\).

Algorithm 1. Distributed Enhanced Dual Polytope Projection (D-EDPP).
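Algorithm 1 itself is not reproduced here. As a point of reference, the sketch below implements the centralized EDPP test from [10] for one step \(\lambda_{k-1} \to \lambda_k\); it follows the notation of [10] rather than Algorithm 1, so the variable names are assumptions. In D-EDPP, every product involving A or its columns below would be computed per institution and aggregated via LQM, exactly as described above.

```python
import numpy as np

def edpp_keep_mask(A, y, lam_k, lam_prev, x_prev):
    """Centralized EDPP test [10]: which features may be kept at lambda_k,
    given the Lasso solution x_prev at lambda_{k-1}."""
    lam_max = np.max(np.abs(A.T @ y))
    theta_prev = (y - A @ x_prev) / lam_prev          # dual optimum at lambda_{k-1}, from (4)
    if lam_prev < lam_max:
        v1 = y / lam_prev - theta_prev
    else:                                             # lambda_{k-1} == lambda_max, so x_prev = 0
        j_star = np.argmax(np.abs(A.T @ y))
        v1 = np.sign(A[:, j_star] @ y) * A[:, j_star]
    v2 = y / lam_k - theta_prev
    v2_perp = v2 - ((v1 @ v2) / (v1 @ v1)) * v1
    lhs = np.abs(A.T @ (theta_prev + 0.5 * v2_perp))
    rhs = 1.0 - 0.5 * np.linalg.norm(A, axis=0) * np.linalg.norm(v2_perp)
    return lhs >= rhs    # feature j is discarded when lhs[j] < rhs[j]
```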

3.4 Local Query Model for Lasso

To further accelerate the learning process, we apply FISTA [1] to solve the Lasso problem in a distributed manner. The convergence rate of FISTA is \(O(1/k^2)\), compared to \(O(1/k)\) for ISTA, where k is the iteration number. We integrate FISTA with LQM (F-LQM) to solve the Lasso problem on the reduced matrices \(\widetilde{A}_i\). We summarize the updating rule of F-LQM in the kth iteration as follows:

  • Step 1: \(\nabla g_i^k = \widetilde{A}_i^T ( \widetilde{A}_i x^k-y_i )\), update \(\nabla g^k = \sum _i^m \nabla g_i^k\) by LQM.

  • Step 2: \(z^k = \varGamma _{\lambda t_k} (x^k - t_k \nabla g^k)\) and \(t_{k+1}=\frac{1+\sqrt{1+4t^2_k} }{2}\).

  • Step 3: \(x^{k+1} = z^k + \frac{t_k-1}{t_{k+1}} (z^k-z^{k-1})\).

The matrix \(\widetilde{A}_i\) denotes the reduced matrix for the ith institution obtained by the D-EDPP rule. We repeat this procedure until a satisfactory global model is obtained. Step 1 calculates \( \nabla g_i^k \) from the local data \(( \widetilde{A}_i, y_i)\); each institution then performs LQM to obtain the gradient \(\nabla g^k\) via the gradient decomposition in Sect. 3.1. Step 2 updates the auxiliary variable \(z^k\) and the step size \(t_k\). Step 3 updates the model x. As with LQM, the data privacy of each institution is well preserved by F-LQM.
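A minimal single-machine sketch of the F-LQM iteration on the reduced data (the per-institution gradient sums stand in for the LQM rounds of Step 1; a fixed step size is assumed rather than a line search):

```python
import numpy as np

def f_lqm(A_parts, y_parts, lam, step, n_iter=100):
    """FISTA on the reduced data, with the gradient assembled as in Step 1."""
    p = A_parts[0].shape[1]
    x = np.zeros(p)
    z_prev = np.zeros(p)
    t = 1.0
    for _ in range(n_iter):
        # Step 1: aggregate the partial gradients (one LQM round).
        grad = sum(A_i.T @ (A_i @ x - y_i) for A_i, y_i in zip(A_parts, y_parts))
        # Step 2: proximal (soft-thresholding) step and momentum update.
        v = x - step * grad
        z = np.sign(v) * np.maximum(np.abs(v) - lam * step, 0.0)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        # Step 3: extrapolation.
        x = z + (t - 1.0) / t_next * (z - z_prev)
        z_prev, t = z, t_next
    return x
```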

4 Experiment

We implement the proposed framework across three institutions on Apache Spark, a state-of-the-art, fast, and efficient platform for large-scale distributed computing. Experiments demonstrate the efficiency and effectiveness of the proposed models.

4.1 Comparison of Lasso with and Without D-EDPP Rule

We choose the volume of the lateral ventricle as the variable to be predicted, using the 717 subjects that have this label (subjects without labels are removed). The volumes of brain regions were extracted from each subject’s T1 MRI scan using FreeSurfer: http://freesurfer.net. We evaluate the efficiency of D-EDPP across three research institutions that maintain 326, 215, and 176 subjects, respectively. The subjects are stored as HDFS files. We solve the Lasso problem along a sequence of 100 parameter values equally spaced on the linear scale of \(\lambda /\lambda _{\max }\) from 1.00 to 0.05. We randomly select 0.1 million to 1 million features and solve the resulting problems with F-LQM, since [1] proved that FISTA converges faster than ISTA. We report the results in Fig. 2: Lasso with the D-EDPP rule achieves a speedup of about 66-fold compared to F-LQM without screening.
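A small sketch of how such a parameter sequence can be formed, assuming \(\lambda_{\max}\) has already been obtained (e.g., as in Step 2 of D-SAFE):

```python
import numpy as np

def lambda_sequence(lam_max, num=100, lo=0.05):
    """num values of lambda, equally spaced on the linear scale of
    lambda / lambda_max from 1.00 down to lo."""
    return np.linspace(1.0, lo, num) * lam_max
```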

Fig. 2. Running time comparison of Lasso with and without the D-EDPP rule.

4.2 Stability Selection for Top Risk Genetic Factors

We employ stability selection [6, 11] with D-EDPP+F-LQM to select the top risk SNPs from the entire GWAS data set with 5,906,152 features. We conduct four groups of trials, summarized in Table 1. In each trial, D-EDPP+F-LQM is carried out along a sequence of 100 parameter values equally spaced on the linear scale of \(\lambda /\lambda _{\max }\) from 1 to 0.05. We repeat this procedure 200 times, each time on a random subsample of 500 subjects. Table 1 shows the top 5 selected SNPs. APOE, one of the top genetic risk factors for AD [5], is ranked first in three of the four groups.
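A minimal sketch of the selection-frequency counting described above, on a single machine; `run_d_edpp_f_lqm` is a hypothetical stand-in for one D-EDPP + F-LQM pass over the \(\lambda\) sequence (returning one solution vector per \(\lambda\)), and counting nonzero entries across all solution vectors is one common variant of stability selection:

```python
import numpy as np

def rank_snps_by_stability(A, y, lambdas, run_d_edpp_f_lqm,
                           n_rounds=200, subsample=500, seed=0):
    """Rank SNPs by how often they receive a nonzero coefficient across subsampled rounds."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(A.shape[1])
    for _ in range(n_rounds):
        rows = rng.choice(A.shape[0], size=subsample, replace=False)
        # Hypothetical: one screening-and-training pass over the lambda sequence,
        # returning a (len(lambdas), p) array of solution vectors.
        solutions = np.asarray(run_d_edpp_f_lqm(A[rows], y[rows], lambdas))
        counts += np.sum(np.abs(solutions) > 1e-10, axis=0)
    return np.argsort(-counts)   # SNP indices, most frequently selected first
```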

Table 1. Top 5 selected risk SNPs associated with diagnosis and with the baseline volumes of the hippocampus, entorhinal cortex, and lateral ventricle, based on ADNI.