MIMCA: multiple imputation for categorical variables with multiple correspondence analysis

Vincent Audigier¹,
François Husson¹ &
Julie Josse¹


Abstract

We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal component method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Because MCA reduces dimensionality, multiple imputation using MCA (MIMCA) requires estimating only a small number of parameters, which allows the user to impute a wide range of data sets. In particular, a high number of categories per variable, a high number of variables or a small number of individuals are not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to the latest works on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method provides a good point estimate of the parameters of the analysis model considered, such as the coefficients of a main effects logistic regression model, and a reliable estimate of the variability of the estimators. In addition, MIMCA has the great advantage of being substantially less time-consuming on high-dimensional data sets than the other multiple imputation methods.
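
To make the idea in the abstract concrete, the following Python sketch illustrates the general shape of the procedure: a non-parametric bootstrap of the individuals, a low-rank fit of the indicator (one-hot) matrix, and a random draw of the missing categories. It is a simplified illustration and not the authors' implementation: the low-rank step is an unregularized iterative SVD with MCA-style column scaling, the coding of missing values as -1 is an assumption, and all function names (one_hot_with_missing, impute_low_rank, mimca_sketch) are hypothetical.

```python
import numpy as np

def one_hot_with_missing(codes, n_cats):
    """Indicator matrix; rows with a missing code (-1) are NaN in that variable's block."""
    n = codes.shape[0]
    blocks = []
    for j, k in enumerate(n_cats):
        block = np.full((n, k), np.nan)
        observed = codes[:, j] >= 0
        block[observed] = np.eye(k)[codes[observed, j]]
        blocks.append(block)
    return np.hstack(blocks)

def impute_low_rank(Z, row_w, rank, n_iter=50):
    """Iteratively refit a rank-limited model and refill the missing cells (no regularization)."""
    missing = np.isnan(Z)
    # Start from the observed category proportions of each column.
    filled = np.where(missing, np.nanmean(Z, axis=0), Z)
    for _ in range(n_iter):
        p = np.maximum(row_w @ filled, 1e-12)      # weighted category proportions
        X = (filled - p) / np.sqrt(p)              # centre and scale columns, MCA-style
        _, _, Vt = np.linalg.svd(np.sqrt(row_w)[:, None] * X, full_matrices=False)
        V = Vt[:rank].T
        fitted = (X @ V @ V.T) * np.sqrt(p) + p    # project every row on the leading axes
        filled = np.where(missing, fitted, Z)      # observed cells are never changed
    return filled

def mimca_sketch(codes, n_cats, rank=2, n_imputations=5, seed=0):
    """Return a list of completed copies of `codes` (integer categories, -1 = missing)."""
    rng = np.random.default_rng(seed)
    n = codes.shape[0]
    Z = one_hot_with_missing(codes, n_cats)
    starts = np.concatenate(([0], np.cumsum(n_cats)))
    imputed_sets = []
    for _ in range(n_imputations):
        # Non-parametric bootstrap expressed as weights on the individuals.
        counts = rng.multinomial(n, np.full(n, 1.0 / n))
        fuzzy = impute_low_rank(Z, counts / n, rank)
        completed = codes.copy()
        for i, j in zip(*np.where(codes < 0)):
            # Turn the fitted block into probabilities: clip negatives, renormalise.
            probs = np.clip(fuzzy[i, starts[j]:starts[j + 1]], 0.0, None)
            total = probs.sum()
            probs = probs / total if total > 0 else np.full(n_cats[j], 1.0 / n_cats[j])
            completed[i, j] = rng.choice(n_cats[j], p=probs)
        imputed_sets.append(completed)
    return imputed_sets

# Toy usage on synthetic data: 40 individuals, 4 variables with 3 categories each.
rng = np.random.default_rng(1)
codes = rng.integers(0, 3, size=(40, 4))
codes[rng.random(codes.shape) < 0.2] = -1
imputations = mimca_sketch(codes, n_cats=[3, 3, 3, 3], rank=2, n_imputations=3)
```

Each completed data set would then be analysed separately and the results pooled, as in any multiple imputation workflow. The number of retained dimensions (rank) is a tuning choice that this sketch leaves to the user.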



Author information

Authors and Affiliations

Applied Mathematics Department, Agrocampus Ouest, 65 rue de Saint-Brieuc, 35042, Rennes Cedex, France
Vincent Audigier, François Husson & Julie Josse

Corresponding author

Correspondence to Vincent Audigier.

Appendices

Appendix 1: Simulation design: analysis models and sample characteristics

See Table 3.

Table 3 Set of the sample characteristics and of the analysis models used to perform the simulation study (Sect. 4.2) for the several data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit)

Appendix 2: Simulation study: complementary results

See Figs. 4–7.

Audigier, V., Husson, F. & Josse, J. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. Stat Comput 27, 501–518 (2017). https://doi.org/10.1007/s11222-016-9635-4

Received: 21 June 2015
Accepted: 30 January 2016
Published: 11 February 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s11222-016-9635-4