Abstract
The problem of missing data is common in practice. Many imputation methods have been developed to fill in the missing entries. However, not all of them can scale to high-dimensional data, especially the multiple imputation techniques. Meanwhile, the data nowadays tends toward high-dimensional. Therefore, we propose Principal Component Analysis Imputation (PCAI), a simple but versatile framework based on Principal Component Analysis (PCA) to speed up the imputation process and alleviate memory issues of many available imputation techniques while maintaining good imputation quality. In addition, the frameworks can be used even when some or all of the missing features are categorical or when the number of missing features is large. We also analyze the effect of using different formulations of PCA on the technique. Next, we introduce PCA Imputation - Classification (PIC), an application of PCAI for classification problems with some adjustments. Experiments on various scenarios show that PCAI and PIC can work with various imputation algorithms, including state-of-the-art ones, and improve the imputation speed significantly while achieving competitive mean square error/classification accuracy compared to imputing directly on the missing data.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Andrews, D.T., Wentzell, P.D.: Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer. Anal. Chim. Acta 350(3), 341–352 (1997)
Audigier, V., Husson, F., Josse, J.: A principal component method to impute missing values for mixed data. Adv. Data Anal. Classif. 10(1), 5–26 (2016)
Buuren, S.v., Groothuis-Oudshoorn, K.: MICE: multivariate imputation by chained equations in r. J. Stat. Softw. 1–68 (2010)
Dear, R.E.: A principal-component missing-data method for multiple regression models. System Development Corporation (1959)
Dua, D., Graff, C.: UCI machine learning repository (2017). https://archive.ics.uci.edu/ml
Folch-Fortuny, A., Arteaga, F., Ferrer, A.: PCA model building with missing data: new proposals and a comparative study. Chemom. Intell. Lab. Syst. 146, 77–88 (2015)
Grung, B., Manne, R.: Missing values in principal component analysis. Chemom. Intell. Lab. Syst. 42(1–2), 125–139 (1998)
Guyon, I., Li, J., Mader, T., Pletscher, P.A., Schneider, G., Uhr, M.: Competitive baseline methods set new standards for the nips 2003 feature selection benchmark. Pattern Recogn. Lett. 28(12), 1438–1444 (2007)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011)
Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 11, 1957–2000 (2010)
Iodice D’Enza, A., Palumbo, F., Markos, A.: Single imputation via chunk-wise PCA. In: Chadjipadelis, T., Lausen, B., Markos, A., Lee, T.R., Montanari, A., Nugent, R. (eds.) IFCS 2019. SCDAKO, pp. 75–82. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-60104-1_9
Jenatton, R., Obozinski, G., Bach, F.: Structured sparse principal component analysis. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 366–373. JMLR Workshop and Conference Proceedings (2010)
Khan, S.I., Hoque, A.S.M.L.: SICE: an improved missing data imputation technique. J. Big Data 7(1), 1–21 (2020)
Lipton, Z.C., Kale, D.C., Wetzel, R., et al.: Modeling missing data in clinical time series with RNNs. Mach. Learn. Healthc. 56 (2016)
Mazumder, R., Hastie, T., Tibshirani, R.: Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11(Aug), 2287–2322 (2010)
Nguyen, T., Nguyen, D.H., Nguyen, H., Nguyen, B.T., Wade, B.A.: EPEM: efficient parameter estimation for multiple class monotone missing data. Inf. Sci. 567, 1–22 (2021)
Nguyen, T., Nguyen-Duy, K.M., Nguyen, D.H.M., Nguyen, B.T., Wade, B.A.: DPER: direct parameter estimation for randomly missing data. Knowl.-Based Syst. 240, 108082 (2022)
Nguyen, T., Phan, N.T., Hoang, H.V., Halvorsen, P., Riegler, M.A., Nguyen, B.T.: PMF: efficient parameter estimation for data sets with missing data in some features. SSRN 4260235
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
Qu, L., Li, L., Zhang, Y., Hu, J.: PPCA-based missing data imputation for traffic flow volume: a systematical approach. IEEE Trans. Intell. Transp. Syst. 10(3), 512–522 (2009)
Rahman, M.G., Islam, M.Z.: Missing value imputation using a fuzzy clustering-based EM approach. Knowl. Inf. Syst. 46(2), 389–422 (2016)
Ross, D.A., Lim, J., Lin, R.S., Yang, M.H.: Incremental learning for robust visual tracking. Int. J. Comput. Vis. 77(1), 125–141 (2008)
Roweis, S.: EM algorithms for PCA and SPCA. Adv. Neural Inf. Process. Syst. 10 (1997)
Rubinsteyn, A., Feldman, S.: Fancyimpute: an imputation library for python (2016). https://github.com/iskandr/fancyimpute
Sakar, C.O., et al.: A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable q-factor wavelet transform. Appl. Soft Comput. 74, 255–263 (2019)
Sportisse, A., Boyer, C., Josse, J.: Estimation and imputation in probabilistic principal component analysis with missing not at random data. Adv. Neural Inf. Process. Syst. 33, 7067–7077 (2020)
Stekhoven, D.J., Bühlmann, P.: MissForest-non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1), 112–118 (2012)
Vu, M.A., et al.: Conditional expectation for missing data imputation. arXiv preprint arXiv:2302.00911 (2023)
Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
Yoon, J., Jordon, J., Schaar, M.: Gain: missing data imputation using generative adversarial nets. In: International Conference on Machine Learning, pp. 5689–5698. PMLR (2018)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nguyen, T., Ly, H.T., Riegler, M.A., Halvorsen, P., Hammer, H.L. (2023). Principal Components Analysis Based Frameworks for Efficient Missing Data Imputation Algorithms. In: Nguyen, N.T., et al. Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2023. Communications in Computer and Information Science, vol 1863. Springer, Cham. https://doi.org/10.1007/978-3-031-42430-4_21
Download citation
DOI: https://doi.org/10.1007/978-3-031-42430-4_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-42429-8
Online ISBN: 978-3-031-42430-4
eBook Packages: Computer ScienceComputer Science (R0)