
Principal Components Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms

  • Conference paper
Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2023)

Abstract

Missing data is a common problem in practice, and many imputation methods have been developed to fill in the missing entries. However, not all of them scale to high-dimensional data, especially the multiple imputation techniques, while modern data sets are increasingly high-dimensional. We therefore propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques while maintaining good imputation quality. The framework can be used even when some or all of the features with missing values are categorical, or when the number of such features is large. We also analyze the effect of using different formulations of PCA on the technique. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI to classification problems with some adjustments. Experiments on various scenarios show that PCAI and PIC work with various imputation algorithms, including state-of-the-art ones, and improve imputation speed significantly while achieving competitive mean squared error and classification accuracy compared to imputing directly on the missing data.
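
The abstract does not spell out how PCA is combined with an imputer, so the following is only a minimal sketch under one plausible reading, assumed here rather than taken from the text: the fully observed columns are compressed with PCA, and an off-the-shelf imputer is then run on the compressed columns together with the columns that contain missing values. The helper name `pca_then_impute`, the choice of scikit-learn's `KNNImputer` as the wrapped imputer, and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer


def pca_then_impute(X, n_components=5, imputer=None):
    """Hypothetical helper: reduce the fully observed columns with PCA,
    then impute on [reduced columns, columns with missing values].

    Assumes at least one column is fully observed and no column is
    entirely missing.
    """
    X = np.asarray(X, dtype=float)
    has_missing = np.isnan(X).any(axis=0)   # per-column missingness flag
    observed = X[:, ~has_missing]           # fully observed block
    partial = X[:, has_missing]             # block containing NaNs

    if imputer is None:
        imputer = KNNImputer(n_neighbors=5)  # assumed stand-in imputer

    # PCA cannot return more components than min(n_samples, n_features).
    k = min(n_components, observed.shape[0], observed.shape[1])
    reduced = PCA(n_components=k).fit_transform(observed)

    # The imputer now sees k + (#partial columns) features, not all of them.
    imputed = imputer.fit_transform(np.hstack([reduced, partial]))

    X_out = X.copy()
    X_out[:, has_missing] = imputed[:, k:]   # write back only imputed block
    return X_out


# Synthetic usage example: 30 features, the last 5 have ~20% missing entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X[:, 25:] += X[:, :5]                        # make the last block predictable
mask = rng.random((200, 5)) < 0.2
X_missing = X.copy()
X_missing[:, 25:][mask] = np.nan
X_filled = pca_then_impute(X_missing, n_components=5)
```

In this sketch the wrapped imputer works on 10 columns instead of 30, which is where any speed or memory benefit would come from; whether imputation quality is preserved depends on how much of the observed block's information the retained components capture.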


Notes

  1. https://pypi.org/project/missingpy/


Author information

Corresponding author

Correspondence to Thu Nguyen.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L. (2023). Principal Components Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_21

  • DOI: https://doi.org/10.1007/978-3-031-42430-4_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-42429-8

  • Online ISBN: 978-3-031-42430-4

  • eBook Packages: Computer Science, Computer Science (R0)
