Abstract
We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal component method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Because MCA reduces dimensionality, multiple imputation using MCA (MIMCA) requires estimating only a small number of parameters, which allows the user to impute a large range of data sets. In particular, a high number of categories per variable, a high number of variables or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method provides a good point estimate of the parameters of the analysis model considered, such as the coefficients of a main effects logistic regression model, and a reliable estimate of the variability of the estimators. In addition, MIMCA has the great advantage of being substantially less time consuming on high-dimensional data sets than the other multiple imputation methods.
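The role of the non-parametric bootstrap can be illustrated with a minimal sketch. It is not the MIMCA algorithm: as a loudly stated assumption, the MCA-based imputation model is replaced here by a much simpler stand-in (per-variable category frequencies), so that only the bootstrap-then-impute structure is shown. The function name `bootstrap_impute`, the use of `None` for missing entries, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_impute(rows, m=5, seed=0):
    """Return m completed copies of `rows`.

    `rows` is a list of individuals, each a list of string categories,
    with None marking a missing entry (an assumed encoding).
    """
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    completed = []
    for _ in range(m):
        # Non-parametric bootstrap: resample individuals with replacement,
        # so each imputed data set rests on different parameter estimates.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        imputed = [list(r) for r in rows]
        for j in range(p):
            # Stand-in imputation model (NOT MCA): category frequencies of
            # variable j among the observed entries of the bootstrap sample.
            counts = Counter(r[j] for r in sample if r[j] is not None)
            if not counts:
                # The bootstrap sample may contain no observed value for
                # variable j; fall back to the original observed entries.
                counts = Counter(r[j] for r in rows if r[j] is not None)
            if not counts:
                raise ValueError(f"variable {j} has no observed value")
            cats = sorted(counts)
            weights = [counts[c] for c in cats]
            for r in imputed:
                if r[j] is None:
                    # Draw a category at random instead of taking the mode,
                    # so that imputed values carry prediction uncertainty.
                    r[j] = rng.choices(cats, weights=weights)[0]
        completed.append(imputed)
    return completed


# Toy incomplete data set: 5 individuals, 2 categorical variables.
toy = [["a", "x"], ["b", None], [None, "y"], ["a", "x"], ["b", "y"]]
completed_sets = bootstrap_impute(toy, m=3, seed=1)
```

Each of the `m` completed data sets would then be analysed separately and the results pooled. In MIMCA the frequency step would be replaced by an MCA fit, which is what lets the method exploit relationships between variables.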
Cite this article
Audigier, V., Husson, F. & Josse, J. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. Stat Comput 27, 501–518 (2017). https://doi.org/10.1007/s11222-016-9635-4
DOI: https://doi.org/10.1007/s11222-016-9635-4