Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study

  • Research Paper
  • Analytical and Bioanalytical Chemistry

Abstract

Advances in analytical instrumentation have made it possible to examine thousands of genes, peptides, or metabolites in parallel. However, the costly and time-consuming data acquisition process leads to a general scarcity of samples. From a data analysis perspective, omics data are characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two conditions or classes by combining an analytical method with a tailored qualitative predictive model built from the examples available in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance estimation is based only on cross-validation (CV) results, with a strong preference for reporting results using leave-one-out (LOO). The combination of PLS-DA for high-dimensional data and small sample conditions, together with a weak validation methodology, is a recipe for unreliable estimations of model performance. In this work, we present a systematic study of the impact of dataset size, dimensionality, and the CV technique used on PLS-DA overoptimism when performance is estimated by cross-validation. First, using synthetic data generated from the same probability distribution and with randomly assigned binary labels, we have obtained a dataset where the true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimations of the classification accuracy (i.e., overfitting). We have characterized the CR estimator in terms of bias and variance depending on the internal CV technique used and the sample-to-dimensionality ratio.
In small sample conditions, due to the large bias and variance of the estimator, the occurrence of extremely good CRs is common. We have found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. Under these conditions, the models are neither under- nor overdetermined and have a unique solution. This effect is particularly intense for LOO and peaks higher in small sample conditions. Overoptimism decreases beyond this point, where the abundance of noisy features produces a regularization effect leading to less complex models. In terms of overfitting, our study ranks CV methods as follows: Bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, and K-Fold; finally, the very popular LOO provides the worst results. Simulation results are further confirmed in real datasets from mass spectrometry and microarrays.
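
The experimental setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' software, and every name in it is hypothetical. It assumes Gaussian features, balanced labels in random order, a PLS-DA model with a single latent variable, a 0.5 decision threshold, an out-of-bag variant of the bootstrap, and accuracy pooled over all test predictions; none of these settings is given in the abstract. Bootstrapped Latin partitions are omitted. A single run may land above or below 0.5, since the abstract's claims concern the bias and variance of the estimator over many repetitions.

```python
import random


def make_null_dataset(n_samples, n_features, rng):
    """Features from one distribution, labels in random order (true CR = 50%)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]  # assumption: balanced 0/1 labels
    rng.shuffle(y)
    return X, y


def fit_pls1_one_lv(X, y):
    """Fit a one-latent-variable PLS model on mean-centred training data."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # The weight vector is proportional to X^T y (direction of maximal covariance).
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    if norm == 0.0:  # degenerate case, e.g. a single-class training set
        return x_mean, y_mean, [0.0] * p, 0.0
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)  # y-loading
    return x_mean, y_mean, w, q


def predict_class(model, x):
    """Predict 1 if the regression output exceeds 0.5, else 0."""
    x_mean, y_mean, w, q = model
    score = sum((x[j] - x_mean[j]) * w[j] for j in range(len(x)))
    return 1 if y_mean + q * score > 0.5 else 0


def loo_splits(n):
    return [([j for j in range(n) if j != i], [i]) for i in range(n)]


def kfold_splits(n, k, rng):
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    return [([i for g in folds if g is not fold for i in g], fold) for fold in folds]


def subsampling_splits(n, n_rounds, test_fraction, rng):
    n_test = max(1, round(n * test_fraction))
    splits = []
    for _ in range(n_rounds):
        idx = list(range(n))
        rng.shuffle(idx)
        splits.append((idx[n_test:], idx[:n_test]))
    return splits


def bootstrap_splits(n, n_rounds, rng):
    """Train on a resample with replacement, test on the out-of-bag samples."""
    splits = []
    for _ in range(n_rounds):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        test = [i for i in range(n) if i not in chosen]
        if test:
            splits.append((train, test))
    return splits


def cv_classification_rate(X, y, splits):
    """Fraction of correct test predictions, pooled over all splits."""
    correct = total = 0
    for train, test in splits:
        model = fit_pls1_one_lv([X[i] for i in train], [y[i] for i in train])
        for i in test:
            correct += predict_class(model, X[i]) == y[i]
            total += 1
    return correct / total


rng = random.Random(0)
X, y = make_null_dataset(n_samples=20, n_features=50, rng=rng)
n = len(y)
estimates = {
    "LOO": cv_classification_rate(X, y, loo_splits(n)),
    "5-fold": cv_classification_rate(X, y, kfold_splits(n, 5, rng)),
    "random subsampling": cv_classification_rate(X, y, subsampling_splits(n, 50, 0.2, rng)),
    "bootstrap (out-of-bag)": cv_classification_rate(X, y, bootstrap_splits(n, 50, rng)),
}
for name, cr in estimates.items():
    print(f"{name}: estimated CR = {cr:.2f} (true CR = 0.50)")
```

Repeating the last block over many seeds and over a grid of `n_samples` and `n_features` values would give an empirical distribution of each estimator, from which its bias and variance relative to the true CR of 0.5 could be read off.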

Funding

This work was partially funded by the Spanish MINECO program, under grants TEC2011-26143 (SMART-IMS) and TEC2014-59229-R (SIGVOL). The Signal and Information Processing for Sensor Systems group is a consolidated Grup de Recerca de la Generalitat de Catalunya and has support from the Departament d’Universitats, Recerca i Societat de la Informació de la Generalitat de Catalunya (expedient 2017 SGR 1721). This work has received support from the Comissionat per a Universitats i Recerca del DIUE de la Generalitat de Catalunya and the European Social Fund (ESF). Additional financial support has been provided by the Institut de Bioenginyeria de Catalunya (IBEC). IBEC is a member of the CERCA Programme/Generalitat de Catalunya.

Author information

Contributions

RR wrote the software, analyzed the data, and prepared the figures and text. LF supervised RR's code and provided useful insights. SM conceived the study and supervised the work. RR and SM contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Santiago Marco.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

The microarray dataset analyzed during the current study is publicly available at http://ccb.nki.nl/data/.

Competing interests

The authors declare that they have no competing interests.

About this article

Cite this article

Rodríguez-Pérez, R., Fernández, L. & Marco, S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem 410, 5981–5992 (2018). https://doi.org/10.1007/s00216-018-1217-1

  • DOI: https://doi.org/10.1007/s00216-018-1217-1
