Abstract
Sparse regression and classification methods are commonly applied to high-dimensional data to simultaneously build a prediction rule and select relevant predictors. The well-known lasso regression and the more recent sparse partial least squares (SPLS) approach are important examples. In such procedures, the number of identified relevant predictors typically depends on a complexity parameter that has to be adequately tuned. Most often, parameter tuning is performed via cross-validation (CV). In the context of lasso-penalized logistic regression and SPLS classification, this paper addresses three important questions related to complexity selection: (1) Does the number of folds in CV affect the results of the tuning procedure? (2) Should CV be repeated several times to yield less variable tuning results? (3) Is complexity selection robust against resampling?
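To make the setting concrete, the following minimal Python/scikit-learn sketch illustrates the kind of experiment behind questions (1) and (2): it tunes the penalty of a lasso-penalized logistic regression by CV and records how many predictors end up selected when the number of folds is varied and when CV is repeated with reshuffled folds. The simulated data, the fold counts, and the use of scikit-learn are illustrative assumptions, not the authors' original data or software.

# Illustrative sketch (assumed simulated data; not the paper's original code).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# High-dimensional setting with many more predictors than samples,
# as in the microarray applications considered in the paper.
X, y = make_classification(n_samples=100, n_features=1000, n_informative=10,
                           random_state=0)

for k in (3, 5, 10):                      # question (1): number of CV folds
    n_selected = []
    for rep in range(10):                 # question (2): repeated CV with reshuffled folds
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=rep)
        model = LogisticRegressionCV(penalty="l1", solver="liblinear",
                                     Cs=20, cv=cv, max_iter=5000).fit(X, y)
        # Number of predictors with non-zero coefficients at the tuned penalty.
        n_selected.append(int(np.count_nonzero(model.coef_)))
    print(f"{k:2d} folds: selected predictors over 10 CV repetitions: {n_selected}")

The variability of the printed counts across repetitions, and their dependence on the fold count, is exactly the kind of instability the paper investigates; question (3) would additionally redraw the sample itself (resampling) before tuning.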
Cite this paper
Boulesteix, A.-L., Richter, A., & Bernau, C. (2013). Complexity Selection with Cross-validation for Lasso and Sparse Partial Least Squares Using High-Dimensional Data. In: Lausen, B., Van den Poel, D., & Ultsch, A. (Eds.), Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_26