Abstract
Advances in analytical instrumentation have made it possible to examine thousands of genes, peptides, or metabolites in parallel. However, the cost and the time-consuming data acquisition process cause a generalized lack of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two conditions or classes by combining an analytical method with a tailored qualitative predictive model built from the available examples collected in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance estimation is based only on cross-validation (CV) results, with a strong preference for reporting results using leave-one-out (LOO). The combination of PLS-DA for high-dimensional data and small sample conditions, together with a weak validation methodology, is a recipe for unreliable estimations of model performance. In this work, we present a systematic study of the impact of the dataset size, the dimensionality, and the CV technique on PLS-DA overoptimism when performance is estimated by cross-validation. First, by using synthetic data generated from a single probability distribution and with randomly assigned binary labels, we have obtained a dataset where the true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimations of the classification accuracy (i.e., overfitting). We have characterized the CR estimator in terms of bias and variance depending on the internal CV technique used and the sample-to-dimensionality ratio.
In small sample conditions, due to the large bias and variance of the estimator, the occurrence of extremely good CRs is common. We have found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. In these conditions, the models are neither under- nor overdetermined and have a unique solution. This effect is particularly intense for LOO and peaks higher in small sample conditions. Overoptimism decreases beyond this point, where the abundance of noisy features produces a regularization effect leading to less complex models. In terms of overfitting, our study ranks CV methods as follows: bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, and K-fold; finally, the very popular LOO provides the worst results. Simulation results are further confirmed in real datasets from mass spectrometry and microarrays.
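The random-label experiment described above can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes Gaussian features, balanced random labels, a plain NIPALS PLS1 fit without variable scaling, a 0.5 decision threshold, and that the number of latent variables is chosen as the one maximizing the cross-validated CR. All function names and default parameters (`pls1_fit`, `plsda_predict`, `cv_classification_rate`, `random_label_experiment`) are hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS by NIPALS; return (x_mean, y_mean, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:  # no covariance left between X and y
            break
        w = w / norm
        t = Xk @ w                 # scores
        tt = t @ t
        p = Xk.T @ t / tt          # X loadings
        c = (yk @ t) / tt          # y loading
        Xk = Xk - np.outer(t, p)   # deflate X
        yk = yk - c * t            # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        return x_mean, y_mean, np.zeros(X.shape[1])
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef


def plsda_predict(model, X):
    """Assign class 1 when the PLS response exceeds 0.5 (assumed threshold)."""
    x_mean, y_mean, coef = model
    return ((X - x_mean) @ coef + y_mean > 0.5).astype(int)


def cv_classification_rate(X, y, n_components, folds):
    """Fraction of held-out samples classified correctly across the folds."""
    n = len(y)
    correct = 0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        model = pls1_fit(X[train], y[train], n_components)
        correct += np.sum(plsda_predict(model, X[test_idx]) == y[test_idx])
    return correct / n


def random_label_experiment(n_samples=20, n_features=50, max_components=3,
                            k=None, n_repeats=50, seed=0):
    """Best CV classification rate per repetition on pure-noise data.

    k=None gives leave-one-out; otherwise k-fold with shuffled folds.
    The true classification rate is 50% by construction.
    """
    rng = np.random.default_rng(seed)
    best = []
    for _ in range(n_repeats):
        X = rng.standard_normal((n_samples, n_features))
        y = rng.permutation(np.arange(n_samples) % 2).astype(float)
        order = rng.permutation(n_samples)
        folds = np.array_split(order, n_samples if k is None else k)
        crs = [cv_classification_rate(X, y, a, folds)
               for a in range(1, max_components + 1)]
        best.append(max(crs))  # model selection on the CV estimate itself
    return np.array(best)


if __name__ == "__main__":
    for name, k in [("LOO", None), ("5-fold", 5)]:
        cr = random_label_experiment(k=k)
        print(f"{name}: mean best CR = {cr.mean():.2f}, std = {cr.std():.2f}")
```

Comparing the mean and spread of the returned values against the known 50% rate gives a rough view of the bias and variance of the CV estimator under these assumptions. Varying `n_samples` relative to `n_features` probes the sample-to-dimensionality ratio discussed above.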
Funding
This work was partially funded by the Spanish MINECO program, under grants TEC2011-26143 (SMART-IMS) and TEC2014-59229-R (SIGVOL). The Signal and Information Processing for Sensor Systems group is a consolidated Grup de Recerca de la Generalitat de Catalunya and has support from the Departament d’Universitats, Recerca i Societat de la Informació de la Generalitat de Catalunya (expedient 2017 SGR 1721). This work has received support from the Comissionat per a Universitats i Recerca del DIUE de la Generalitat de Catalunya and the European Social Fund (ESF). Additional financial support has been provided by the Institut de Bioenginyeria de Catalunya (IBEC). IBEC is a member of the CERCA Programme/Generalitat de Catalunya.
Author information
Contributions
RR wrote the software, analyzed the data, and prepared the figures and text. LF supervised RR's code and provided useful insights. SM conceived the study and supervised the work. RR and SM contributed to writing the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and material
The microarray dataset analyzed during the current study is publicly available at http://ccb.nki.nl/data/.
Competing interests
The authors declare that they have no competing interests.
About this article
Cite this article
Rodríguez-Pérez, R., Fernández, L. & Marco, S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem 410, 5981–5992 (2018). https://doi.org/10.1007/s00216-018-1217-1
DOI: https://doi.org/10.1007/s00216-018-1217-1