Principal Components Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms

Thu Nguyen¹²,
Hoang Thien Ly¹³,
Michael Alexander Riegler¹²,
Pål Halvorsen¹² &
Hugo L. Hammer¹²

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1863)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

Missing data is a common problem in practice, and many imputation methods have been developed to fill in the missing entries. However, not all of them scale to high-dimensional data, especially multiple imputation techniques, while modern data sets are increasingly high-dimensional. We therefore propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) that speeds up the imputation process and alleviates the memory issues of many available imputation techniques while maintaining good imputation quality. The framework can be used even when some or all of the features with missing values are categorical, or when the number of such features is large. We also analyze the effect of using different formulations of PCA on the technique. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI to classification problems with some adjustments. Experiments on various scenarios show that PCAI and PIC work with various imputation algorithms, including state-of-the-art ones, and significantly improve imputation speed while achieving mean squared error and classification accuracy competitive with imputing directly on the missing data.
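The abstract does not spell out the steps of PCAI, so the following is only a minimal sketch under a stated assumption: the fully observed features are compressed with PCA, and the chosen imputer is then run on the reduced features joined with the features that contain missing values. The helper name `pcai_sketch`, the use of scikit-learn's `KNNImputer` as the base imputer, the number of components, and the synthetic data are all illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer


def pcai_sketch(X, imputer, n_components):
    """Hypothetical helper (assumed reading of PCAI, not the authors' code).

    Compress the fully observed columns with PCA, impute on the reduced
    columns plus the columns with missing values, and write the imputed
    values back into a copy of X.
    """
    X = np.asarray(X, dtype=float)
    has_missing = np.isnan(X).any(axis=0)   # columns with at least one NaN
    F = X[:, ~has_missing]                   # fully observed features
    M = X[:, has_missing]                    # features with missing values
    if F.shape[1] == 0:
        raise ValueError("this sketch needs at least one fully observed column")
    if M.shape[1] == 0:
        return X.copy()                      # nothing to impute

    # Reduce the observed block so the imputer sees far fewer columns.
    R = PCA(n_components=n_components).fit_transform(F)

    # Run the base imputer on [R, M]; only the M part is kept.
    imputed = imputer.fit_transform(np.hstack([R, M]))

    out = X.copy()
    out[:, has_missing] = imputed[:, R.shape[1]:]
    return out


# Synthetic low-rank data: 200 samples, 20 features, missing values
# placed only in the last 3 features (an assumption for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 4))
X_full = latent @ rng.normal(size=(4, 20)) + 0.1 * rng.normal(size=(200, 20))
mask = rng.random((200, 3)) < 0.1
X_miss = X_full.copy()
X_miss[:, -3:] = np.where(mask, np.nan, X_miss[:, -3:])

X_imputed = pcai_sketch(X_miss, KNNImputer(n_neighbors=5), n_components=4)
mse = np.mean((X_imputed[:, -3:][mask] - X_full[:, -3:][mask]) ** 2)
print(f"MSE on the imputed entries: {mse:.4f}")
```

In this sketch the imputer works on 4 + 3 columns instead of 20, which is where a speed and memory benefit would come from if the assumed reading is right; any imputer exposing `fit_transform` could be passed in place of `KNNImputer`.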


Notes

1. https://pypi.org/project/missingpy/


Author information

Authors and Affiliations

SimulaMet, Oslo, Norway
Thu Nguyen, Michael Alexander Riegler, Pål Halvorsen & Hugo L. Hammer
Warsaw University of Technology, Warsaw, Poland
Hoang Thien Ly

Corresponding author

Correspondence to Thu Nguyen.

Editor information

Editors and Affiliations

Wrocław University of Technology, Wrocław, Poland
Ngoc Thanh Nguyen
King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Siridech Boonsang
Iwate Prefectural University, Iwate, Japan
Hamido Fujita
Wrocław University of Science and Technology, Wrocław, Poland
Bogumiła Hnatkowska
National University of Kaohsiung, Kaohsiung, Taiwan
Tzung-Pei Hong
King Mongkut's Institute of Technology, Ladkrabang, Thailand
Kitsuchart Pasupa
Malaysia Japan International Institute of Technology, Kuala Lumpur, Malaysia
Ali Selamat

About this paper

Cite this paper

Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L. (2023). Principal Components Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_21

DOI: https://doi.org/10.1007/978-3-031-42430-4_21
Published: 29 September 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer Science, Computer Science (R0)
