
MIMCA: multiple imputation for categorical variables with multiple correspondence analysis

Statistics and Computing

Abstract

We propose a multiple imputation method to deal with incomplete categorical data. This method imputes the missing entries using the principal component method dedicated to categorical data: multiple correspondence analysis (MCA). The uncertainty concerning the parameters of the imputation model is reflected using a non-parametric bootstrap. Multiple imputation using MCA (MIMCA) requires estimating only a small number of parameters, owing to the dimensionality reduction property of MCA. It therefore allows the user to impute a wide range of data sets. In particular, a large number of categories per variable, a large number of variables, or a small number of individuals is not an issue for MIMCA. Through a simulation study based on real data sets, the method is assessed and compared to the reference methods (multiple imputation using the loglinear model, multiple imputation by logistic regressions) as well as to more recent work on the topic (multiple imputation by random forests or by the Dirichlet process mixture of products of multinomial distributions model). The proposed method provides a good point estimate of the parameters of the analysis model considered, such as the coefficients of a main effects logistic regression model, and a reliable estimate of the variability of the estimators. In addition, MIMCA has the great advantage of being substantially less time consuming on high-dimensional data sets than the other multiple imputation methods.
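The two ingredients named in the abstract (a non-parametric bootstrap for parameter uncertainty, and a low-rank fit of the indicator-coded categorical table) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `disjunctive`, `low_rank_impute` and `mimca_sketch`, the coding of missing entries as `-1`, the use of bootstrap row weights, and the replacement of MCA by a plain weighted low-rank SVD (without MCA's column weighting or any regularization) are all assumptions made here for illustration.

```python
import numpy as np

def disjunctive(codes, n_levels):
    """Indicator-code an (n, p) integer array; a missing entry (-1) gives NaN in that variable's block."""
    blocks = []
    for j, k in enumerate(n_levels):
        block = np.full((codes.shape[0], k), np.nan)
        obs = codes[:, j] >= 0
        block[obs] = np.eye(k)[codes[obs, j]]
        blocks.append(block)
    return np.hstack(blocks)

def low_rank_impute(Z, rank, weights, n_iter=30):
    """Fill NaN cells by iterating a row-weighted rank-`rank` fit (simplified stand-in for iterative MCA)."""
    miss = np.isnan(Z)
    filled = np.where(miss, np.nanmean(Z, axis=0), Z)  # start from observed column proportions
    for _ in range(n_iter):
        mu = weights @ filled                          # weighted column means (weights sum to 1)
        centred = filled - mu
        _, _, Vt = np.linalg.svd(np.sqrt(weights)[:, None] * centred, full_matrices=False)
        V = Vt[:rank].T                                # leading right singular vectors
        fit = centred @ V @ V.T + mu                   # project every row, including zero-weight rows
        filled = np.where(miss, fit, Z)                # observed cells are never overwritten
    return filled

def mimca_sketch(codes, n_levels, rank=2, n_imputations=5, seed=0):
    """Return a list of completed copies of `codes`, one per bootstrap draw."""
    rng = np.random.default_rng(seed)
    n = codes.shape[0]
    Z = disjunctive(codes, n_levels)
    starts = np.concatenate(([0], np.cumsum(n_levels)))
    imputed_sets = []
    for _ in range(n_imputations):
        # Non-parametric bootstrap expressed as row weights: resampling counts divided by n.
        weights = rng.multinomial(n, np.full(n, 1.0 / n)) / n
        fuzzy = low_rank_impute(Z, rank, weights)
        completed = codes.copy()
        for j, k in enumerate(n_levels):
            block = fuzzy[:, starts[j]:starts[j + 1]]
            for i in np.flatnonzero(codes[:, j] < 0):
                p = np.clip(block[i], 1e-8, None)      # fitted values can fall slightly below zero
                completed[i, j] = rng.choice(k, p=p / p.sum())
        imputed_sets.append(completed)
    return imputed_sets
```

Each completed table would then be analysed separately (for example with a logistic regression) and the results pooled. Drawing categories from the fitted values, rather than taking the most likely category, keeps the imputations from understating variability.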



Author information

Corresponding author

Correspondence to Vincent Audigier.

Appendices

Appendix 1: Simulation design: analysis models and sample characteristics

See Table 3.

Table 3 Set of the sample characteristics and of the analysis models used to perform the simulation study (Sect. 4.2) for the several data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit)

Appendix 2: Simulation study: complementary results

See Figs. 4–7.

Fig. 4

Distribution of the relative bias (bias divided by the true value) over the several quantities of interest for several methods (Listwise deletion, Loglinear model, DPMPM, Normal distribution, MIMCA, FCS using logistic regressions, FCS using random forests, Full data) for different data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the relative bias observed for one coefficient
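The relative bias named in this caption can be written out as below. The symbols are notation assumed here for illustration and are not taken from the article: θ is the true value of a coefficient, θ̂_k its estimate on the k-th simulated incomplete data set, and K the number of simulated data sets.

```latex
\[
\text{relative bias}(\hat{\theta}) \;=\; \frac{\bar{\theta} - \theta}{\theta},
\qquad
\bar{\theta} \;=\; \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}_k .
\]
```

A value of 0 indicates an estimator that is unbiased on average, and the sign gives the direction of the bias relative to the true value.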

Fig. 5

Distribution of the median of the confidence interval for the several quantities of interest for several methods (Loglinear model, DPMPM, Normal distribution, MIMCA, FCS using logistic regressions, FCS using random forests, Full data) for different data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the median of the confidence interval observed for one coefficient divided by the one obtained by listwise deletion. The horizontal dashed line corresponds to a ratio of 1. Points above this line correspond to a confidence interval larger than the one obtained by listwise deletion

Fig. 6

Distribution of the median of the confidence interval for the several quantities of interest for the MIMCA algorithm for several numbers of dimensions for different data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the median of the confidence interval observed for one coefficient divided by the one obtained by listwise deletion. The horizontal dashed line corresponds to a ratio of 1. Points above this line correspond to a confidence interval larger than the one obtained by listwise deletion. The results for the number of dimensions provided by cross-validation are in grey

Fig. 7

Distribution of the relative bias (bias divided by the true value) over the several quantities of interest for the MIMCA algorithm for several numbers of dimensions for different data sets (Saheart, Galetas, Sbp, Income, Titanic, Credit). One point represents the relative bias observed for one coefficient. The results for the number of dimensions provided by cross-validation are in grey

Cite this article

Audigier, V., Husson, F. & Josse, J. MIMCA: multiple imputation for categorical variables with multiple correspondence analysis. Stat Comput 27, 501–518 (2017). https://doi.org/10.1007/s11222-016-9635-4

  • DOI: https://doi.org/10.1007/s11222-016-9635-4
