Abstract
Modern large-scale genetic association studies generate increasingly high-dimensional datasets. Therefore, some variable selection procedure should be performed before the application of traditional data analysis methods, both for computational efficiency and to mitigate problems related to overfitting. We describe here a “wrapper” strategy (SIZEFIT) for variable selection that uses a Random Forests classifier, coupled with various local search/optimization algorithms. We apply it to a large dataset consisting of 2,425 African-American and non-Hispanic white individuals genotyped for 4,869 single-nucleotide polymorphisms (SNPs) in a coronary heart disease (CHD) case-cohort association study (Atherosclerosis Risk in Communities), using incident CHD and plasma low-density lipoprotein (LDL) cholesterol levels as the dependent variables. We show that most SNPs can be safely removed from the dataset without compromising the predictive (classification) accuracy, with only a small number of SNPs (sometimes fewer than 100) containing any predictive signal. A statistical (SUMSTAT) approach is also applied to the dataset for comparison purposes. We describe a novel method for refining the subset of signal-containing SNPs (FIXFIT), based on an Extremal Optimization algorithm. Finally, we compare the top SNP rankings obtained by different methods and devise practical guidelines for researchers trying to generate a compact subset of predictive SNPs from genome-wide association datasets. Interestingly, there is a significant amount of overlap between seemingly very heterogeneous rankings. We conclude by constructing compact optimal predictive SNP subsets for the CHD (fewer than 150 SNPs) and LDL (fewer than 300 SNPs) phenotypes, and by comparing various rankings for two well-known positive control SNPs for LDL in the apolipoprotein E gene.
Key words: coronary heart disease, genome-wide association studies, Random Forests classifier, SNPs, variable selection
1. Introduction
The most common design for modern genome-wide (or genomic) association studies involves genotyping cases and controls for a large number of single-nucleotide polymorphisms (SNPs) spaced at some interval across the genome. With the ultimate goal of identifying disease susceptibility genes, one evaluates whether there is a difference in the distribution of genotypes at each locus between cases and controls. A straightforward approach is to carry out a large number of univariate tests, one for each locus, and to select the loci generating the set of lowest p-values (e.g., Klein et al., 2005). There are at least two problems with this approach. First, it does not take into account the nonadditive (epistatic or synergistic) interactions between loci. Second, because of the large number of loci, it is necessary to correct for multiple testing which, in turn, has a detrimental effect on the ability to detect loci with small to moderate effects.
Employing a computer science vernacular, the primary data analysis task for large-scale genomic case–control studies is classification, where one aims to predict the discrete, dependent “target” variable, such as disease status or phenotype, from numerous potentially predictive “input” variables, such as the SNPs. Many modern, scalable classifiers, including recursive partitioning methods (decision trees) and ensemble methods, belong to the broad area of data mining (DM). Although DM has been in existence for more than two decades, the interest of biomedical researchers in DM methodology was first piqued in the late 1990s, following the emergence of chip-based gene expression data (Piatetsky-Shapiro and Tamayo, 2003, and references therein).
Classification is the ultimate goal of many DM activities, and a common and logical step is often to first eliminate all of the variables that are irrelevant to the target (i.e., the primary phenotype of interest). This step is known as “pruning” or, more formally, “variable selection” or “feature set reduction” (Guyon and Elisseeff, 2003, and references therein). Variable selection is often necessary for reasons of scalability and computational efficiency. Perhaps even more importantly, it helps to minimize overfitting, whereby the classification algorithm includes spurious relationships that artificially increase the complexity of the model while in fact decreasing its predictive ability. The purpose of this study was to describe a set of variable selection algorithms (based on a Random Forests [RF] ensemble classifier) that are appropriate for large-scale genomic association studies.
2. Methods
2.1. Establishing the optimal model size
Variable selection is a crucial part of the multistep data analysis strategy shown in Figure 1. In Figure 1, the term “model” means genetic (SNP) and other variables, and some formal representation of their interactions. “Model size” refers to the number of variables contained in the model. Similarly, “model complexity” is proportional to both the model size and the density of interactions between the variables. In this communication we will concentrate on steps 1a and 1b.
The goal of the first stage of this analysis strategy (Fig. 1, step 1a) is to suggest a cut-off at which point variables (SNPs) with gradually decreasing predictive ability are no longer included in the model. It is preferable to generate an objective cut point based on the “signal-to-noise” ratio in the data itself rather than one based on a predefined criterion, such as a certain p-value. The proposed algorithm for generating an objective cut-off value is based on the formal definition of overfitting. Consider the classification error of model M over the dataset of interest (or training set), Error_training(M). Assuming that the dataset is sampled from a “true” distribution (the entire distribution of data), consider also Error_true(M). Then, model M overfits the training data if there is an alternative (typically smaller, less complex) model m such that

Error_training(M) < Error_training(m), but Error_true(M) > Error_true(m).
Consequently, it is possible to ascertain the optimal size of m by plotting Error_true(m) versus m, where m represents, for example, a sequence of models of increasing (forward selection) or decreasing (backward elimination) complexity (Guyon and Elisseeff, 2003, and references therein). This approach is known as a wrapper (a model/variable selection procedure “wrapped” around a classifier) (Kohavi and John, 1997). Because the true distribution (entire distribution of data) is unknown, one has to use resampling (bootstrapping, cross-validation) or dataset subdivision (into training and testing data) to estimate Error_true(m) values. This estimate is known as the generalization classification error estimate.
Ideally, one could perform an exhaustive search over the full model space (i.e., measure all possible combinations of SNPs and other variables for their predictive value). However, this is clearly impractical for large-scale genomic association studies and may lead to another form of overfitting (over the number of models in the model space). In fact, it is known that the “coarse” model search algorithms tend to perform better than the more exhaustive ones in the overfitting-prone domains (Reunanen, 2003). For these and other reasons, the variable selection scheme detailed in this communication consists of two separate stages. First, an approximate optimal model size is established by applying a coarse wrapper to the dataset. Once the model size is fixed, more sophisticated local search/optimization algorithms are used to further refine the model composition. The primary advantage of this two-stage approach is that it simultaneously addresses both computational complexity and overfitting.
The coarse wrapper used in the first stage of the proposed variable selection strategy is similar to the simple SNP preranking strategy suggested in Hoh et al. (2001). For brevity, we will call it SIZEFIT (because we are trying to fit an optimal-size model to the data). First, the SNPs are ranked in order of their association strength with the phenotype, one by one, using a simple univariate association test. In computer science vernacular this is known as a filter (Guyon and Elisseeff, 2003). An example of such a test is the contingency chi-square (used throughout this study), but other tests can also be used (e.g., the Gini index [Gini, 1921] or binormal separation [Forman, 2003]). The highest ranking SNP is added to the available non-SNP variables (such as sex and age), and a classifier is applied to this subset of variables. Then, the two highest ranking SNPs are added to the non-SNP variables, and a classifier is applied again. Then, the three highest ranking SNPs are added, etc. At some point the classifier's generalization classification accuracy will stop growing, because adding more (increasingly irrelevant) SNPs will only result in adding noise to the model.
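The loop below is a minimal Python sketch of this coarse wrapper, using scikit-learn's chi2 filter and RandomForestClassifier as stand-ins for the implementations used in this study; the array names (genotypes coded 0/1/2, covariates such as sex, phenotype y) and the grid of candidate model sizes are illustrative assumptions, not part of the original software.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2

def sizefit_curve(genotypes, covariates, y, sizes=(10, 20, 50, 100, 200, 500, 1000)):
    """Coarse SIZEFIT-style wrapper: prerank SNPs, then grow the model size."""
    # 1. Prerank SNPs with a univariate chi-square filter (genotype codes 0/1/2).
    chi2_stats, _ = chi2(genotypes, y)
    ranking = np.argsort(chi2_stats)[::-1]            # strongest association first

    # 2. For each candidate model size k, fit an RF on covariates + top-k SNPs and
    #    record the out-of-bag (OOB) error as a generalization error estimate.
    oob_error = {}
    for k in sizes:
        X = np.hstack([covariates, genotypes[:, ranking[:k]]])
        rf = RandomForestClassifier(n_estimators=2000, max_features="sqrt",
                                    oob_score=True, n_jobs=-1, random_state=0)
        rf.fit(X, y)
        oob_error[k] = 1.0 - rf.oob_score_
    return oob_error   # plot error vs. k; the nadir/plateau suggests the model size
```

Plotting the returned errors against k reproduces, in spirit, the SIZEFIT curves discussed in Section 3.1.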
2.2. Random Forests ensemble classifier
The RF ensemble classifier (Breiman, 2001) consists of many randomized single decision trees. In an ensemble classifier, a new observation (e.g., individual in a case–control association study) is classified separately by each tree (i.e., each tree “votes” for a class assignment), and the class getting the most votes is selected. The single decision tree induction algorithm is essentially CART (Breiman et al., 1984), a recursive partitioning algorithm that uses Gini index as a node (split) impurity measure. However, in RF implementation, each tree is grown to its maximum (i.e., it is not pruned).
To grow an RF ensemble, two mechanisms of randomization are employed: random input variable selection and bootstrapping (Efron and Tibshirani, 1993). With the former, if there are M input variables in the dataset, a number m << M (usually the square root of M) is specified when building a single tree, such that at each node m variables are selected at random out of M and the “best” of these m is used to split the node. The value of m is constant throughout the forest. Because RF computational efficiency is a function of m, not M, RF is much more scalable than other complex classifiers when the number of input variables is large (e.g., hundreds of thousands). The choice of m (in practice, the primary flexible parameter in the RF application) depends on a tradeoff between a tree's classification accuracy and intertree correlation. In our experience and simulation experiments, a “traditional” m—the square root of M—proved to be effective and robust (results not shown), in full agreement with the recent RF application literature (e.g., Svetnik et al., 2003).
The second randomization mechanism in RF is bootstrapping. When building each single tree, N observations, or individuals (out of N), are sampled at random with replacement from the original data to create a training set. An advantage of “built-in” bootstrapping is that it allows for a convenient generalization classification error estimate (with no need for cross-validation or dataset subdivision) known as the “out-of-bag” (OOB) error rate. The OOB estimate is computed as follows: although each tree is constructed using a different bootstrap sample from the original data, one-third of the observations are left out of the bootstrap sample (OOB) and are not used in the construction of the kth tree. Each observation left out in the construction of the kth tree is “fed” into the kth tree to obtain this observation's class. Thus, after the complete forest is grown, a virtual testing set classification is obtained for each observation on about one-third of the trees. Take c to be the class that received most of the votes every time a certain observation was OOB. The proportion of times that c is not equal to the true class of that observation, averaged over all the observations, is the OOB error rate.
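To make the OOB mechanism concrete, the sketch below recomputes an OOB error rate from scratch with explicit bootstrap samples and single CART-style trees (scikit-learn's DecisionTreeClassifier); it is an illustration of the procedure just described, not the implementation used in this study, and it assumes class labels coded 0..K-1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_error_rate(X, y, n_trees=500, seed=0):
    """Illustrative OOB error: bootstrap each tree, vote only with trees for which
    an observation was out-of-bag, then compare the majority vote to the true class."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = max(1, int(np.sqrt(p)))                    # variables tried at each split
    votes = np.zeros((n, len(np.unique(y))))       # OOB votes per observation/class
    for _ in range(n_trees):
        boot = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)     # ~1/3 of observations are left out
        tree = DecisionTreeClassifier(max_features=m,
                                      random_state=int(rng.integers(1 << 31)))
        tree.fit(X[boot], y[boot])
        for i, c in zip(oob, tree.predict(X[oob])):    # classify only OOB observations
            votes[i, c] += 1
    seen = votes.sum(axis=1) > 0                   # observations OOB at least once
    oob_class = votes[seen].argmax(axis=1)         # majority OOB vote per observation
    return float(np.mean(oob_class != y[seen]))    # the OOB error rate
```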
Two variable importance measures can be inferred with an RF classifier—one based on the OOB error rate and another based on the cumulative Gini index decrease (used throughout this study). The latter is computed as follows: every time a node (current sample) is split on the variable v, the Gini impurity measure for the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all the trees in the forest gives a fast variable importance metric that is usually consistent with the former measure. Importantly, the RF-generated variable importance measures can be used as alternatives to the chi-square filter when preranking the SNPs for the wrapper application.
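As one possible realization, scikit-learn exposes this cumulative Gini-decrease importance as feature_importances_; the short sketch below (with hypothetical arrays X and y) uses it to produce an RF-based SNP ranking of the kind mentioned above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_gini_ranking(X, y, n_trees=2000, seed=0):
    """Rank variables by the RF cumulative Gini-decrease importance
    (exposed in scikit-learn as feature_importances_, normalized to sum to 1)."""
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                oob_score=True, n_jobs=-1, random_state=seed)
    rf.fit(X, y)
    return np.argsort(rf.feature_importances_)[::-1]   # most important variable first

# Example use (hypothetical arrays): top_150 = rf_gini_ranking(X, y)[:150] gives an
# RF-based alternative to the chi-square filter for preranking SNPs.
```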
The RF classifier is also capable of addressing nonadditive variable interactions because many possible variable combinations are encountered repeatedly within the forest. However, it should be noted here that perfectly collinear (redundant) variables, such as two SNPs in exact linkage disequilibrium, will have the same ranking if a filter (chi-square) or RF variable importance measure is used. Therefore, the optimal model size obtained via the wrapper experiment is not robust to such redundant variables. Specifically, each of the two aforementioned hypothetical SNPs in exact linkage disequilibrium will occur in ∼50% of the trees in which either SNP is present (because once one of the SNPs is selected, the other will not be added). A detailed discussion of the applicability of RF to genetic epidemiology data can be found in Province et al. (2001); Lunetta et al. (2004); and Bureau et al. (2005).
2.3. Avoiding variable selection bias
RF is believed to be robust to overfitting over the input variables (Breiman, 2001). However, this does not necessarily mean that the OOB measure is robust to overfitting over the models (or variable subsets) during the variable selection process. The main reason is that initial SNP rankings are obtained from the full dataset. Therefore, there is an incomplete separation of the training and testing data, and the OOB estimate can “peek” at some of the information in the training data (namely, initial variable ranking), and the resulting OOB classification accuracy estimates are often overoptimistic (Ambroise and McLachlan, 2002; Svetnik et al., 2003, 2004). This phenomenon is known as the variable selection bias (VSB). Therefore, in this study we augmented the OOB estimates with alternative generalization error estimates that are not subjected to VSB (such as “nested” and “external-loop” cross-validation, in which the input variables are preranked de novo within each cross-validation fold) (Svetnik et al., 2004).
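A minimal sketch of such an external-loop cross-validation, in which the chi-square preranking is redone inside each training fold so that the held-out fold never influences variable selection, might look as follows (scikit-learn names; the fixed model size and fold count are illustrative assumptions).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2
from sklearn.model_selection import StratifiedKFold

def external_loop_cv_error(genotypes, y, model_size=150, n_splits=5, seed=0):
    """VSB-free generalization error: variables are reranked de novo in each fold."""
    errors = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in folds.split(genotypes, y):
        stats, _ = chi2(genotypes[train], y[train])      # rank on the training fold only
        top = np.argsort(stats)[::-1][:model_size]
        rf = RandomForestClassifier(n_estimators=2000, max_features="sqrt",
                                    n_jobs=-1, random_state=seed)
        rf.fit(genotypes[train][:, top], y[train])
        errors.append(np.mean(rf.predict(genotypes[test][:, top]) != y[test]))
    return float(np.mean(errors))    # unbiased counterpart of the (optimistic) OOB curve
```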
2.4. Refining the model of fixed size
Because of its univariate (preranking) nature, SIZEFIT, like similar wrappers, is unable to take full advantage of RF “interaction-friendliness.” Perhaps even more importantly, it is not necessarily robust to the choice of metric used to prerank the SNPs.
All that it does is deliver an estimate of the optimal model size (i.e., number of “signal-carrying” SNPs in the dataset). In this section, we propose a novel optimization algorithm for fitting a model of fixed size to a dataset. For brevity, we will call it FIXFIT. FIXFIT is a wrapper; hence, the models are scored using generalization classification error estimates. For the reasons outlined in the preceding section, and also because the model size is fixed (and, therefore, VSB is much less of a concern), OOB estimates were used throughout this study. FIXFIT is an “evolutionary algorithm,” namely a combination/extension of a classical “genetic algorithm” (Holland, 1992) and “extremal optimization” algorithm (Boettcher and Percus, 2000). In contrast to the majority of evolutionary algorithms, however, FIXFIT takes advantage of the RF internally generated variable importance measures (fast cumulative Gini decrease measure, specifically) to identify the “high-quality” and “low-quality” components of the model. Also, FIXFIT has a two-level hierarchy, which makes it more robust and easily parallelized.
FIXFIT works as follows. First, an optimal model size, M, is specified. Then, a randomly chosen subset of M SNPs is selected as the initial model, or “kernel.” Subsequently, FIXFIT repeatedly applies RF to the kernel plus a small (typically s = M/10) random subset of SNPs outside the kernel. The cumulative Gini decrease variable ranking is obtained for all M + s variables. The SNPs from the small random subset that rank higher than M are retained, forming a small addition to the kernel, whereas the s lowest ranking variables (including variables from possibly both the previous kernel and the small random subset) are discarded. Eventually, a number of such small additions are generated, and they are recorded in memory. The procedure is repeated until the generalization classification error of the kernel stops decreasing. Therefore, the result is a stabilized kernel (“refined” model) and a set of small additions (kernel convergence and stability are discussed in Section 3.3).
By applying FIXFIT to the data repeatedly (with different random initial kernels), multiple stabilized kernels can be generated, with their intersection making up the refined model (alternatively, we can apply RF to the union of all the stabilized kernels and small additions recorded by FIXFIT, obtain cumulative Gini decrease variable rankings, and accept top M SNPs as the refined model). The benefits of generating multiple stabilized kernels are (1) the ability to parallelize the computation by assigning each FIXFIT process to a separate CPU and (2) increased robustness. Throughout this study we typically generated two independent stabilized kernels for each dataset.
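The following is a rough Python sketch of a single FIXFIT pass as we read the description above: a fixed-size kernel is repeatedly challenged by a small random batch of outside SNPs, all candidates are reranked by RF Gini importance, and the lowest-ranked variables are dropped to restore the kernel size. It is an interpretation under stated assumptions (scikit-learn RF, a fixed cycle budget instead of an explicit convergence test), not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fixfit_kernel(genotypes, y, M=150, n_cycles=300, seed=0):
    """One FIXFIT-style run: returns the indices of the stabilized kernel of size M."""
    rng = np.random.default_rng(seed)
    p = genotypes.shape[1]
    kernel = rng.choice(p, size=M, replace=False)           # random initial kernel
    s = max(1, M // 10)                                      # size of the random addition
    for _ in range(n_cycles):
        outside = np.setdiff1d(np.arange(p), kernel)
        batch = rng.choice(outside, size=s, replace=False)   # challengers from outside
        candidates = np.concatenate([kernel, batch])
        rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                                    oob_score=True, n_jobs=-1,
                                    random_state=int(rng.integers(1 << 31)))
        rf.fit(genotypes[:, candidates], y)
        order = np.argsort(rf.feature_importances_)[::-1]    # rank by Gini decrease
        kernel = candidates[order[:M]]                       # keep top M, drop the s worst
    return kernel

# In practice, several runs with different seeds would be intersected (or their union
# reranked by RF importance) to obtain the refined model, as described above.
```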
2.5. Example application
When selecting an appropriate example application, the objective was to identify a large dataset that had both phenotypes of general interest and an embedded natural positive control. The Atherosclerosis Risk in Communities (ARIC) study is a comprehensive epidemiologic study of coronary heart disease (CHD) and its risk factors (ARIC Investigators, 1989). CHD cases (n = 1,476) were defined by documented myocardial infarction, stable and unstable angina, CHD death, and/or cardiovascular procedures. Prevalent cases were excluded. A stratified random sample of noncases from the ARIC cohort was selected as a comparison group (designated as the cohort random sample [CRS; n = 949]; for this analysis, any cases in the CRS were excluded). The CHD case group is composed of 1,142 non-Hispanic whites and 334 African-Americans. The CRS reference group consists of 618 non-Hispanic whites and 331 African-Americans. In all, 4,869 gene-based SNPs on human chromosome 19 were genotyped for this study. Approximately 1,000 annotated genes on human chromosome 19 were covered, with sequence variation of potential functional significance identified by resequencing exons, splice junctions, and nearby 5′ regulatory domains (Clark et al., 2003). Chromosome 19 was selected because it has multiple genes involved in lipid metabolism and CHD, and it provided the opportunity for an embedded positive control (apolipoprotein E [ApoE] gene). SNP genotyping was completed using the TaqMan and SNPlex genotyping platforms (Applied Biosystems, Foster City, CA). At the time of this study, datasets containing 4,869 verified and validated SNPs for African-Americans and 4,810 verified and validated SNPs for non-Hispanic whites were available. Missing genotype data were rare (detailed statistics are available from the authors upon request), and a simple frequency-based imputation algorithm was used to fill in missing genotypes.
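For illustration, a frequency-based imputation of this kind can be as simple as filling each missing genotype with the most common genotype observed at that SNP; the sketch below (assuming a hypothetical missing-value code of -1) shows one such minimal approach, which may differ in detail from the procedure actually used.

```python
import numpy as np

def impute_most_frequent(genotypes, missing=-1):
    """Fill missing genotype codes with the most frequent observed genotype per SNP."""
    genotypes = genotypes.copy()
    for j in range(genotypes.shape[1]):
        col = genotypes[:, j]
        observed = col[col != missing]
        if observed.size:                                    # skip all-missing columns
            values, counts = np.unique(observed, return_counts=True)
            col[col == missing] = values[np.argmax(counts)]  # impute with the mode
    return genotypes
```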
3. Results and Discussion
3.1. Establishing the optimal model size
Figure 2 illustrates an application of the SIZEFIT variable selection strategy (Section 2.1) to the ARIC data (Fig. 2A, African-Americans; Fig. 2B, non-Hispanic whites), with CHD case status as the target variable. Note that applying RF to every nested subset of variables [i.e., (1), (1 + 2), and (1 + 2 + 3)] is computationally prohibitive. Therefore, a “compressed” scale was used (10, 20, 30, … , 4,400, 4,800, and, finally, all variables). The SIZEFIT curve fluctuates at the origin (1 to 10 variables, data not shown), a property it shares with the Set Association graphs. Therefore, it is beneficial to “prime” the SNP variable selection process with one or more strong nongenetic predictors (e.g., sex). The lower curve (i.e., circle data points) in Figure 2 corresponds to the SNPs ranked using chi-square, with sex as the first variable in the model. For example, data point “10” means that RF was applied to the following input variable subset: sex, the highest ranking SNP, the second highest, third, fourth, fifth, sixth, seventh, eighth, and, finally, the ninth highest ranking SNP. The upper curve (i.e., triangle data points) in Figure 2 corresponds to the random variable ranking (an average of two random rankings, to be exact), with sex also fitted as the first variable in the model. Therefore, the upper curve serves as a control. The nadir, or bottom plateau, of the lower curve suggests the optimal model size. The nadir is approximately in the 70–160 SNPs range for African-Americans (Fig. 2A). Beyond that point, adding more (increasingly irrelevant) SNPs will only result in adding noise to the model. Because of the epistatic/synergistic effects (which any filter will necessarily ignore), it might be preferable to err on the safe side (the upper bound, or 160 in the above example). Once we obtain this upper bound, we can use it as a starting point for more sophisticated search algorithms aimed at refining the model of a fixed size (Sections 2.4 and 3.3). In other words, we can discard from further analyses all SNPs that rank worse than the upper bound (i.e., those ranked 161st or worse, in the above example).
RF tends to give unbalanced class prediction errors if one class has a much lower frequency than the other (Breiman, 2001). Therefore, for the non-Hispanic whites (897 cases and 499 noncases), we evaluated different weights for class errors, ranging from the default weight ratio of 1 to the class-balancing 1.8 (as depicted in Fig. 2B) to 3. The shape of the SIZEFIT curve and the location of the bottom plateau proved robust to changes in class weights (results not shown). Of course, class balance was not an issue with the African-Americans (266 cases and 271 controls) or low-density lipoprotein (LDL) phenotype (see below). In general, the OOB error estimate appears to be very stable, showing commendably narrow dispersion when a sufficient number of trees in the forest is used [min (m×20, 2000), where m is the square root of the number of variables]. For example, in the analysis shown in Figure 2B, the range was always <0.5% (measured at 10 different model size points for 10 identical RF runs with different random seeds at each point).
We have also applied the SIZEFIT procedure to the same data as in Figure 2, but with the target variable being LDL cholesterol plasma level (LDL, for brevity) instead of CHD. LDL is an important risk factor for CHD. It is especially appropriate in the context of our data because of the high density of lipid metabolism and apolipoprotein genes on chromosome 19. Specifically, two common SNPs in the APOE gene are known to influence LDL, and thus can serve as positive controls in our experiments (see Section 3.3 for results). Because LDL is a continuous variable, it was discretized into three classes of equal size, with the middle third removed from the analysis (yielding 342 African-Americans and 886 non-Hispanic whites in total, because LDL measurements were not available for some individuals). The resulting SIZEFIT curves (not shown) suggest approximately 190–400 (for African-Americans) and 200–500 (for non-Hispanic whites) optimal model size ranges.
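A minimal sketch of this discretization step (assuming ldl is a NumPy array of plasma LDL values with NaN for missing measurements) could look as follows; the exact tertile handling used in the study may differ.

```python
import numpy as np

def extreme_ldl_tertiles(ldl):
    """Keep only the lowest and highest LDL tertiles; return a mask and binary classes."""
    lo, hi = np.nanquantile(ldl, [1 / 3, 2 / 3])     # tertile boundaries
    keep = (ldl <= lo) | (ldl >= hi)                 # drop the middle third
    keep &= ~np.isnan(ldl)                           # drop missing LDL measurements
    classes = (ldl[keep] >= hi).astype(int)          # 0 = low-LDL class, 1 = high-LDL class
    return keep, classes
```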
3.2. Avoiding variable selection bias
The absolute OOB values at the SIZEFIT curve nadir might look very impressive (as low as 27.37% for African-Americans and 31.16% for non-Hispanic whites in Fig. 2), but this is likely an artifact of VSB (see Section 2.3). To obtain an unbiased generalization classification error, we also used fivefold external-loop cross-validation (as suggested in, e.g., Svetnik et al., 2004). Figure 3 depicts the average of 20 cross-validation curves for the African-American dataset (same as in Fig. 2A). There are important differences between this curve and the OOB curve in Figure 2A. First, once the cross-validation curve reaches its nadir, it remains comparatively flat, suggesting that RF indeed does not overfit over the input variables (Breiman, 2001). Second, the error never reaches below ∼41.5% (in stark contrast to the ∼27.4% minimal error achieved using the OOB estimate). These two differences indicate that a substantial amount of VSB is present in this domain, similar to that observed in the domains of drug discovery (Svetnik et al., 2004) and microarray data analysis (Díaz-Uriarte and Alvarez de Andrés, 2006). However, one important similarity between the OOB and cross-validation curves remains: the OOB curve starts going up approximately at the point (around 200 variables, in this example) where the cross-validation curve stops going down. Although the curve in Figure 3 appears to reach even lower (than at ∼200 variables) at two data points corresponding to approximately 1,000 and 4,800 variables, the difference in error values between these two data points and the one at ∼200 variables was not significant, because the range of error values was ∼0.4% in 20 cross-validation experiments. We have confirmed this correspondence for a number of different datasets (results not shown) and have yet to encounter a single case of a major discrepancy. Therefore, we concur with Braga-Neto et al. (2004) and Díaz-Uriarte and Alvarez de Andrés (2006) on the following practical guidelines for carrying out model selection:
1. Use nested or external-loop cross-validation (or some other VSB-proof method) to estimate the unbiased generalization error (∼41.5%, in this example).
2. Use either a VSB-proof method or OOB to estimate the optimal model size. OOB might be preferable because it is much more computationally efficient.
3. For any new application domain (dataset), perform at least one VSB-proof experiment to confirm that both the VSB-proof and OOB curves suggest approximately the same optimal model size.
3.3. Refining the model of fixed size; comparing different models and metrics; positive controls
Optimal model size does not guarantee optimal model composition: once the former is established, a more sophisticated variable selection technique (that relies on a more exhaustive search of the model space), such as FIXFIT (Section 2.4), can be used to further refine the model. FIXFIT aims to generate the model that (1) is robust to the choice of univariate metric used to prerank the variables and (2) accounts (at least to some extent) for nonadditive variable interactions (epistasis and collinearities). FIXFIT proved to be computationally efficient and robust when applied to the ARIC datasets, with the kernels rapidly converging to mostly stable entities (usually, with no more than ∼s/4 SNPs remaining in a state of flux). Figure 4 illustrates FIXFIT's convergence for the African-American CHD dataset (M = 150, s = 15). In Figure 4, generalization error estimate (OOB) is plotted as a function of the number of FIXFIT cycles (each cycle corresponding to 10 single kernel recalculations). About 100 cycles were sufficient to stabilize both the error and most of the kernel. However, because not all SNPs were evaluated by FIXFIT after 100 cycles, we completed 300 or more cycles for all FIXFIT experiments throughout this study. In our experience, FIXFIT converges much faster than hill climbing (results not shown), especially if a completely random subset of SNPs is used for the initial kernel.
We have applied FIXFIT to the same four datasets as in Section 3.1 (African-Americans, CHD; non-Hispanic whites, CHD; African-Americans, LDL; and non-Hispanic whites, LDL). Two series of 300 cycles of the FIXFIT algorithm were carried out for each dataset, with two OOBs measured as functions of the number of cycles. Each series was initiated with a randomly selected kernel of a fixed size (suggested by SIZEFIT) that also contained sex in addition to the SNP variables (e.g., the initial kernel for the FIXFIT application in Fig. 4 contained 149 random SNPs and sex). In all FIXFIT experiments, convergence to a sufficiently stable kernel and OOB was achieved after approximately 100 cycles. A certain amount of flux remained in the kernel after convergence (it can be seen as the narrow OOB fluctuations after the 100-cycle mark in Fig. 4). This flux is explained by both the random nature of RF and, more importantly, collinearities (many signal-carrying SNPs being in significant linkage disequilibrium). Therefore, in a typical FIXFIT application, the final output should not be just one single model, but rather the intersection (or the union) of the models recorded after the convergence point. Of course, for multiple FIXFIT applications/kernels (two in this case), the intersection of (two) intersections (or unions) should make up the final refined model. In the above example, 133 variables (132 SNPs + sex) out of the 150 remained stable (i.e., were present in the intersection of two intersections), thus comprising the refined model. Similarly, the resulting OOB error should be recorded as an average of all OOB error measurements after the convergence point, averaged over the multiple FIXFIT applications/kernels.
The actual average OOB values for the CHD datasets were ∼31.8% for African-Americans and ∼33.6% for non-Hispanic whites. However, due to VSB, these should not be interpreted as unbiased generalization accuracy estimates. Just as with SIZEFIT, the OOB error is only used here for model/variable selection. To estimate unbiased generalization accuracy of the resulting models, external-loop cross-validation or some other VSB-proof method should be used. Also, FIXFIT OOB values should not be directly compared with the SIZEFIT OOB values because there is significantly less VSB in FIXFIT experiments. Hence, the FIXFIT OOB error is a much less severe underestimate of the true generalization error than the SIZEFIT OOB error.
At this juncture, we have a number of alternative rankings and models, ranging from the simple (SNPs ranked using univariate chi-square filter) to the relatively complex (the FIXFIT model consisting of 132 SNPs and sex in the above example). Two obvious questions are (1) which model is “better” and (2) how different (similar) are these models?
There is no simple answer to the first question. If we use the generalization classification accuracy as the model selection criterion, then FIXFIT models are the best, because FIXFIT is explicitly designed to find the model of a certain size that maximizes the RF's classification accuracy. However, other model selection algorithms in conjunction with other classifiers might prove superior. Also, the fact that a certain SNP is useful for classification does not necessarily imply its biological relevance. Ideally, a comprehensive comparative study using VSB-proof generalization error estimates to score models derived using different metrics, search algorithms, and classifiers should be carried out using both simulated and real data. However, this is clearly beyond the scope of the present communication. Instead, we will concentrate here on the second question: how stable are the resulting models and how much overlap is there between the models obtained using different approaches?
In addition to SIZEFIT and FIXFIT, we have also applied the Set Association (or SUMSTAT, Hoh et al., 2001; Hoh and Ott, 2003) method to our data. Just like SIZEFIT, SUMSTAT aims to find the “objective” optimal model size (in terms of how many SNPs should be included in the model). The SNPs are preranked using chi-square metric (slightly different from our baseline implementation), and the growing models are scored using the statistical fit criterion penalized by the degrees of freedom. Therefore, the SUMSTAT curves are similar to the SIZEFIT curves in that there is a nadir, or a bottom plateau, corresponding to the best models (with the lowest p-values, or the lowest OOB error values, respectively). A SUMSTAT curve for the African-American CHD dataset is shown in Figure 5. Note that the optimal model size suggested by the SUMSTAT application (90 SNPs) is similar to that suggested by SIZEFIT (see Fig. 2A). This result is remarkable because SUMSTAT and SIZEFIT are very different methods derived from very different foundations and algorithms. This implies that the optimal model size is, to a large extent, a function of the data rather than of the variable selection method.
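For readers who want a quick baseline, the sketch below implements a sum-statistic scan in the spirit of Set Association as we understand it from Hoh et al. (2001): the top-k chi-square statistics are summed and each cumulative sum is assessed by permuting the phenotype, with the best model size taken at the smallest p-value. This is only one common formulation and is not the exact scoring (penalized fit criterion) used in the SUMSTAT runs reported here.

```python
import numpy as np
from sklearn.feature_selection import chi2

def sumstat_like_scan(genotypes, y, max_k=600, n_perm=1000, seed=0):
    """Sum-statistic scan: choose the model size k whose top-k sum is most significant."""
    rng = np.random.default_rng(seed)
    stats, _ = chi2(genotypes, y)
    observed = np.cumsum(np.sort(stats)[::-1][:max_k])       # S_k for k = 1..max_k
    exceed = np.zeros(observed.size)
    for _ in range(n_perm):                                   # permutation null for each S_k
        perm_stats, _ = chi2(genotypes, rng.permutation(y))
        exceed += np.cumsum(np.sort(perm_stats)[::-1][:max_k]) >= observed
    p_values = (exceed + 1) / (n_perm + 1)
    best_k = int(np.argmin(p_values)) + 1                     # suggested optimal model size
    return best_k, p_values
```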
Table 1 summarizes the extent of overlap among the results of different model selection methods for our four primary datasets (African Americans, CHD; non-Hispanic whites, CHD; African-Americans, LDL; non-Hispanic whites, LDL). In all, 150 (300 for LDL) and 600 variables were chosen as the approximate lower and upper optimal model size bounds, as suggested by SIZEFIT. Although 150, 300, and 600 might not be very close to the actual SIZEFIT estimates for each specific dataset, we wanted to have unified bounds so that the numbers could be readily compared across the tables. From the perspective of this study, the most important “overlap” is that of “150 FIXFIT” and “600 chi-square” (125 SNPs in Table 1A). This is because FIXFIT is computationally expensive, whereas chi-square filter is computationally trivial. A 125-strong overlap between the two implies that almost all of the SNPs suggested by applying FIXFIT to the lower bound of the optimal model size are in fact present in the larger (upper bound of the optimal model size) model obtained by just ranking SNPs using a simple chi-square metric. This has important practical implications detailed in Section 4 below. It should also be noted here that the chi-square and SUMSTAT rankings are, not surprisingly, almost identical; the difference is attributed to the additional “weighting” in SUMSTAT (Hoh et al., 2001).
Table 1.
(A) CHD phenotype

| | 150 FIXFIT (132 kernel) | 150 chi-square | 600 chi-square | 150 RF Gini | 600 RF Gini | 150 SUMSTAT | 600 SUMSTAT |
|---|---|---|---|---|---|---|---|
| 150 FIXFIT (132 kernel) | | 50 | 102 | 24 | 45 | 51 | 102 |
| 150 chi-square | 54 | | 150 | 21 | 45 | 140 | 149 |
| 600 chi-square | 125 | 150 | | 41 | 119 | 150 | 569 |
| 150 RF Gini | 22 | 49 | 79 | | 150 | 20 | 45 |
| 600 RF Gini | 56 | 91 | 202 | 150 | | 43 | 124 |
| 150 SUMSTAT | 51 | 137 | 150 | 50 | 90 | | 150 |
| 600 SUMSTAT | 118 | 150 | 553 | 78 | 200 | 150 | |
African-Americans are shown below the diagonal, non-Hispanic whites are shown above the diagonal. Model labels are self-explanatory: for example, “600 RF Gini” means 600 highest ranking SNPs as suggested by the cumulative Gini index decrease variable importance measure in RF. Similarly, “150 FIXFIT (132 kernel)” means 132 high-ranking SNPs plus sex comprising the converged kernel after applying FIXFIT algorithm to the initial kernel of 149 randomly selected SNPs plus sex. The overlap of these two rankings is 56 in Table 1A below the diagonal (CHD phenotype, African-Americans), meaning that the set of top 600 SNPs as suggested by a single baseline RF application and the set of top 132 SNPs suggested by FIXFIT have 56 SNPs in common.
CHD, coronary heart disease; RF, Random Forests; SNPs, single-nucleotide polymorphisms.
(B) LDL phenotype

| | 300 FIXFIT (258 kernel) | 300 chi-square | 600 chi-square | 300 RF Gini | 600 RF Gini | 300 SUMSTAT | 600 SUMSTAT |
|---|---|---|---|---|---|---|---|
| 300 FIXFIT (258 kernel) | | 129 | 190 | 56 | 86 | 123 | 187 |
| 300 chi-square | 160 | | 300 | 77 | 114 | 276 | 300 |
| 600 chi-square | 240 | 300 | | 117 | 187 | 300 | 560 |
| 300 RF Gini | 90 | 121 | 163 | | 300 | 74 | 113 |
| 600 RF Gini | 134 | 161 | 241 | 300 | | 111 | 181 |
| 300 SUMSTAT | 145 | 272 | 300 | 115 | 156 | | 300 |
| 600 SUMSTAT | 233 | 300 | 555 | 161 | 238 | 300 | |
African-Americans are shown below the diagonal, non-Hispanic whites are shown above the diagonal.
LDL, low-density lipoprotein; RF, Random Forests.
Finally, recall that one of the motivations behind the LDL analyses was that two common SNPs in the APOE gene known to influence LDL were present in the datasets, and thus could serve as positive controls. These two SNPs, rs7412 and rs429358, code for the well-known APOE alleles (and protein isoforms) ɛ2, ɛ3, and ɛ4. Table 2 shows the rankings for both SNPs in different LDL predictive models (data for CHD models not shown). This particular analysis was also performed for the pooled (both African-Americans and non-Hispanic whites) data. The first SNP (rs7412) was ranked highly by all metrics. The second SNP (rs429358) was ranked highly by the univariate metrics, but not by the metrics that account for the nonadditive variable interactions (RF and FIXFIT). This suggests that the second SNP's signal is somewhat redundant—if the first SNP (rs7412) is present in the sample, adding the second SNP will not increase classification accuracy with respect to LDL. (This redundancy is explained by both the LD between the two SNPs, and by the biology of APOE isoforms in the context of reverse cholesterol transport and LDL.)
Table 2.
| | Chi-square | RF Gini | SUMSTAT | FIXFIT (300) |
|---|---|---|---|---|
| African-Americans | 47 (74) | 94 (297) | 63 (104) | 133 (not in top 300) |
| Non-Hispanic whites | 1 (97) | 1 (3,154) | 1 (106) | 1 (not in top 300) |
| Combined | 1 (12) | 1 (354) | 1 (13) | 1 (not in top 300) |
Entries give the ranking of rs7412, with the ranking of rs429358 in parentheses. RF, Random Forests.
4. Conclusions and Future Work
This study presents a multistage, hierarchical data analysis strategy for large-scale genomic association studies. Its cornerstone is SIZEFIT, which is aimed at ascertaining the lower and upper bounds of the optimal model size. SIZEFIT is computationally efficient because of the high scalability (in number of variables) of the RF classifier. RF has many other attractive properties, but the primary reason it was selected as a classifier for this study was its scalability. Other highly scalable classifiers would work just (or almost) as well. After SIZEFIT is performed, the analyst has a choice. One alternative is to simply remove all SNPs ranked worse than the upper bound of the optimal model size, thus obtaining a smaller (but still sizable, up to 600 SNPs in the above examples) set of SNPs for further analyses. Another is to follow up with a more sophisticated model selection algorithm (e.g., FIXFIT). Although computationally expensive, it promises to generate a (possibly more compact) model that performs at least equally well, accounts to some extent for nonadditive variable interactions, and is more robust.
We suggest that if the intended next step in the research project is genotyping these SNPs in additional samples, the FIXFIT application (possibly using the lower SIZEFIT bound for the optimal model size) is appropriate. However, if the next step is in silico bioinformatics analysis, comparisons, meta-analyses, or some kind of descriptive modeling, retaining the full subset (up to 600 SNPs in the above examples) might prove a safer choice.
Ultimately, it depends on the analysis methods' scalability and available resources. Some descriptive analysis methods, such as MDR (Hahn et al., 2003) or CPM (Nelson et al., 2001), simply do not scale up as well as the others. Depending on the dataset, the shape of the SIZEFIT curves, and the “favorite” descriptive analysis method, the analyst might choose the lower bound, the upper bound, or even a slightly larger model size as a starting point for further analyses. We believe that Bayesian network modeling is an attractive and efficient (scaling up to hundreds of variables) descriptive modeling method (Rodin et al., 2005); it can be augmented by boosted classifiers (Cohen and Singer, 1999; Freund and Schapire, 1999), which have the additional advantage of embedded clustering (which can account for genetic heterogeneity).
Modern genotyping technologies generate datasets with more than 500,000 SNPs. In our experiments, we were able to apply SIZEFIT directly to the datasets of up to 200,000 SNPs (results not shown). Although we will continue improving our software, one way to substantially increase scalability is to analyze the data chromosome-by-chromosome and then concatenate the “pruned” SNP subsets. This is a promising strategy that we are investigating presently.
Our directions for future work include the following:
The reader will notice that there are no cross-validation error estimates for the FIXFIT applications in Section 3.3. Although not essential for the variable/model selection purposes, it would be interesting, for example, to compare such estimates to the ones in Figure 3. Unfortunately, these cross-validation error estimates are computationally very expensive. We are currently working on obtaining these estimates as part of our research on the properties of FIXFIT and extremal optimization algorithms in general. Specifically, we are interested in the presence and magnitude of any VSB in extremal optimization.
In this study, we consistently used only one nongenetic predictive variable, sex. The reasons were twofold: first, we wanted the results to be comparable to other “purely SNP” methods, such as SUMSTAT; second, we did not want to “contaminate” this pilot study by adding potentially very strong predictors. In our ongoing and future research we will be adding other (environmental, physiological, etc.) variables to our analyses. Among other things, this will tell us how robust SIZEFIT is to the inclusion of strong non-SNP predictive variables.
Along similar lines, we intend to investigate how robust SIZEFIT is to the choice of initial preranking metric. In this study, we tacitly assumed that the location of the SIZEFIT nadir is not significantly influenced by the choice of metric. Although this assumption makes sense, it needs to be tested formally.
Throughout this study, the optimal model size (as suggested by various variable selection methods) consistently proved to be a function of the data (specifically, associations between SNPs and a phenotype) rather than of the variable selection method. However, in part due to VSB, it is possible that the number of nonassociated SNPs might also influence the optimal model size. We plan to investigate this issue by performing chromosome-by-chromosome analysis (as opposed to analyzing the complete dataset) of a number of genome-wide association datasets that are currently being generated as part of our ongoing research program in the genetic epidemiology of CHD. Specifically, we will investigate whether the optimal model size is “additive.” At the same time, we plan to increase (largely via parallelization) the scalability of our analysis strategy to accommodate datasets of up to 1,000,000 SNPs (such as Affymetrix GeneChip 6.0).
It is our opinion that there is no single best metric, or modeling algorithm, or strategy for analyzing high-dimensional data with many small effects. Many alternative approaches, such as those described in this communication and elsewhere (Heidema et al., 2006, and references therein; Heidema et al., 2007), are probably valid and should be applied concertedly if and when possible. One way to objectively compare different methods, however, is to introduce artificial positive controls (similar to the natural ones, such as the APOE SNPs in Table 2) into the datasets, and see whether the methods are capable of dealing with them correctly. This is a promising approach, in part because it can be tailored to a specific dataset or a research project (as opposed to comprehensive, nonspecialized simulation studies). We are currently working on developing a software framework for artificial positive control analysis in the context of large-scale association studies.
It remains to note that all of the data, results, and software (both executables and source code, predominantly in the form of R libraries) used in this study are available directly from the authors, free of charge for nonprofit organizations/purposes.
Acknowledgments
The authors would like to thank Sergei N. Rodin, Stephen Turner, Kent Bailey, Brooke Fridley, Grigoriy Gogoshin, and D.C. Rao for many useful discussions and suggestions. This work was supported in part by the Gilson-Longenbaugh foundation, by the UTHSCH Office of Biotechnology Seed Grant Program, and by the NIH grants R01-HL072810, P50-GM065509, R37-HL051021, U01-HG004402, R01-HL87641, and R01-HL7473501. We also thank Sharvari Gujja for technical assistance. The Atherosclerosis Risk in Communities Study is carried out as a collaborative study supported by National Heart, Lung, and Blood Institute contracts N01-HC-55015, N01-HC-55016, N01-HC-55018, N01-HC-55019, N01-HC-55020, N01-HC-55021, and N01-HC-55022. The authors thank the staff and participants of the ARIC study for their important contributions.
Disclosure Statement
No competing financial interests exist.
References
- Ambroise C., McLachlan G.J. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl. Acad. Sci. U.S.A. 2002;99:6562–6566. doi: 10.1073/pnas.102102699.
- ARIC Investigators. The Atherosclerosis Risk in Communities (ARIC) study: design and objectives. Am. J. Epidemiol. 1989;129:687–702.
- Boettcher S., Percus A.G. Nature's way of optimizing. Artif. Intell. 2000;119:275–286.
- Braga-Neto U., Hashimoto R., Dougherty E.R., et al. Is cross-validation better than resubstitution for ranking genes? Bioinformatics. 2004;20:253–258. doi: 10.1093/bioinformatics/btg399.
- Breiman L. Random Forests. Mach. Learn. 2001;45:5–32.
- Breiman L., Friedman J.H., Olshen R.A., et al. Classification and Regression Trees. Wadsworth International Group; Belmont, CA: 1984.
- Bureau A., Dupuis J., Falls K., et al. Identifying SNPs predictive of phenotype using random forests. Genet. Epidemiol. 2005;28:171–182. doi: 10.1002/gepi.20041.
- Clark A.G., Glanowski S., Nielsen R., et al. Positive selection in the human genome inferred from human-chimp-mouse orthologous gene alignments. Cold Spring Harb. Symp. Quant. Biol. 2003;68:471–477. doi: 10.1101/sqb.2003.68.479.
- Cohen W.W., Singer Y. A simple, fast, and effective rule learner. Proc. AAAI-99. 1999:335–342. www-2.cs.cmu.edu/∼wcohen/pubs.html
- Díaz-Uriarte R., Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. doi: 10.1186/1471-2105-7-3.
- Efron B., Tibshirani R.J. An Introduction to the Bootstrap. Chapman and Hall; London: 1993.
- Forman G. An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 2003;3:1289–1305.
- Freund Y., Schapire R. A short introduction to boosting. J. Jpn. Soc. Artif. Intell. 1999;14:771–780.
- Gini C. Measurement of inequality of incomes. Econ. J. 1921;31:124–126.
- Guyon I., Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003;3:1157–1182.
- Hahn L.W., Ritchie M.D., Moore J.H. Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics. 2003;19:376–382. doi: 10.1093/bioinformatics/btf869.
- Heidema A.G., Boer J.M., Nagelkerke N., et al. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet. 2006;7:23. doi: 10.1186/1471-2156-7-23.
- Heidema A.G., Feskens E.J., Doevendans P.A., et al. Analysis of multiple SNPs in genetic association studies: comparison of three multi-locus methods to prioritize and select SNPs. Genet. Epidemiol. 2007;31:910–921. doi: 10.1002/gepi.20251.
- Hoh J., Ott J. Mathematical multi-locus approaches to localizing complex human trait genes. Nat. Rev. Genet. 2003;4:701–709. doi: 10.1038/nrg1155.
- Hoh J., Wille A., Ott J. Trimming, weighting and grouping SNPs in human case-control association studies. Genome Res. 2001;11:2115–2119. doi: 10.1101/gr.204001.
- Holland J.H. Adaptation in Natural and Artificial Systems. MIT Press; Cambridge, MA: 1992.
- Klein R.J., Zeiss C., Chew E.Y., et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308:385–389. doi: 10.1126/science.1109557.
- Kohavi R., John G. Wrappers for feature subset selection. Artif. Intell. 1997;97:273–324.
- Lunetta K.L., Hayward L.B., Segal J., et al. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet. 2004;5:32. doi: 10.1186/1471-2156-5-32.
- Nelson M.R., Kardia S.L., Ferrell R.E., et al. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;11:458–470. doi: 10.1101/gr.172901.
- Piatetsky-Shapiro G., Tamayo P. Microarray data mining: facing the challenges. ACM SIGKDD. 2003;5:1–5.
- Province M.A., Shannon W.D., Rao D.C. Classification methods for confronting heterogeneity. Adv. Genet. 2001;42:273–286. doi: 10.1016/s0065-2660(01)42028-1.
- Reunanen J. Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res. 2003;3:1371–1382.
- Rodin A., Brown A., Clark A.G., et al. Mining genetic epidemiology data with Bayesian networks: application to ApoE gene variants and plasma lipid levels. J. Comput. Biol. 2005;12:1–11. doi: 10.1089/cmb.2005.12.1.
- Svetnik V., Liaw A., Tong C., et al. Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003;43:1947–1958. doi: 10.1021/ci034160g.
- Svetnik V., Liaw A., Tong C., et al. Application of Breiman's Random Forest to modeling structure-activity relationships of pharmaceutical molecules. Mult. Classifier Syst. 2004;5:334–343.