Abstract
In the framework of cluster analysis based on Gaussian mixture models, it is usually assumed that all the variables provide information about the clustering of the sample units. Several variable selection procedures are available in order to detect the structure of interest for the clustering when this structure is contained in a variable sub-vector. Currently, in these procedures a variable is assumed to play one of (up to) three roles: (1) informative, (2) uninformative and correlated with some informative variables, (3) uninformative and uncorrelated with any informative variable. A more general approach for modelling the role of a variable is proposed by taking into account the possibility that the variable vector provides information about more than one structure of interest for the clustering. This approach is developed by assuming that such information is given by non-overlapped and possibly correlated sub-vectors of variables; it is also assumed that the model for the variable vector is equal to a product of conditionally independent Gaussian mixture models (one for each variable sub-vector). Details about model identifiability, parameter estimation and model selection are provided. The usefulness and effectiveness of the described methodology are illustrated using simulated and real datasets.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.References
Anderson, T.: An Introduction to Multivariate Statistical Analysis, 3rd edn. Wiley, New York (2003)
Andrews, J.L., McNicholas, P.D.: Variable selection for clustering and classification. J. Classif. 31, 136–153 (2014)
Banfield, J.D., Raftery, A.E.: Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803–821 (1993)
Belitskaya-Levy, I.: A generalized clustering problem, with application to DNA microarrays. Stat. Appl. Genet. Mol. Biol. 5, Article 2 (2006)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000)
Biernacki, C., Govaert, G.: Choosing models in model-based clustering and discriminant analysis. J. Stat. Comput. Simul. 64, 49–71 (1999)
Bozdogan, H.: Intelligent statistical data mining with information complexity and genetic algorithms. In: Bozdogan, H. (ed.) Statistical Data Mining and Knowledge Discovery, pp. 15–56. Chapman & Hall/CRC, London (2004)
Browne, R.P., ElSherbiny, A., McNicholas, P.D.: mixture: mixture models for clustering and classification. R package version 1.4 (2015)
Brusco, M.J., Cradit, J.D.: A variable-selection heuristic for k-means clustering. Psychometrika 66, 249–270 (2001)
Campbell, N.A., Mahon, R.J.: A multivariate study of variation in two species of rock crab of the genus Leptograpsus. Aust. J. Zool. 22, 417–425 (1974)
Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recognit. 28, 781–793 (1995)
Celeux, G., Martin-Magniette, M.-L., Maugis, C., Raftery, A.E.: Letter to the editor. J. Am. Stat. Assoc. 106, 383 (2011)
Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C., Raftery, A.E.: Comparing model selection and regularization approaches to variable selection in model-based clustering. J. Soc. Fr. Statistique 155, 57–71 (2014)
Chatterjee, S., Laudato, M., Lynch, L.A.: Genetic algorithms and their statistical applications: an introduction. Comput. Stat. Data Anal. 22, 633–651 (1996)
Dang, X.H., Bailey, J.: A framework to uncover multiple alternative clusterings. Mach. Learn. 98, 7–30 (2015)
Dang, U.J., McNicholas, P.D.: Families of parsimonious finite mixtures of regression models. In: Morlini, I., Minerva, T., Vichi, M. (eds.) Statistical Models for Data Analysis, pp. 73–84. Springer, Berlin (2015)
De Sarbo, W.S., Cron, W.L.: A maximum likelihood methodology for clusterwise linear regression. J. Classif. 5, 249–282 (1988)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood for incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 39, 1–22 (1977)
Dy, J.G., Brodley, C.E.: Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845–889 (2004)
Fowlkes, E.B., Gnanadesikan, R., Kettenring, J.R.: Variable selection in clustering. J. Classif. 5, 205–228 (1988)
Fraiman, R., Justel, A., Svarc, M.: Selection of variables for cluster analysis and classification rules. J. Am. Stat. Assoc. 103, 1294–1303 (2008)
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Fraley, C., Raftery, A.E., Murphy, T.B., Scrucca, L.: mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report No. 597, Department of Statistics, University of Washington (2012)
Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. Ser. B 66, 815–849 (2004)
Frühwirth-Schnatter, S.: Finite Mixture and Markow Switching Models. Springer, New York (2006)
Galimberti, G., Montanari, A., Viroli, C.: Penalized factor mixture analysis for variable selection in clustered data. Comput. Stat. Data Anal. 53, 4301–4310 (2009)
Galimberti, G., Scardovi, E., Soffritti, G.: Using mixtures in seemingly unrelated linear regression models with non-normal errors. Stat. Comput. 26, 1025–1038 (2016)
Galimberti, G., Soffritti, G.: Model-based methods to identify multiple cluster structures in a data set. Comput. Stat. Data Anal. 52, 520–536 (2007)
Galimberti, G., Soffritti, G.: Using conditional independence for parsimonious model-based Gaussian clustering. Stat. Comput. 23, 625–638 (2013)
Gnanadesikan, R., Kettenring, J.R., Tsao, S.L.: Weighting and selection of variables for cluster analysis. J. Classif. 12, 113–136 (1995)
Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, Reading (1989)
Gordon, A.D.: Classification, 2nd edn. Chapman & Hall, Boca Raton (1999)
Grün, B., Leisch, F.: Bootstrapping finite mixture models. In: Antoch, J. (ed.) Compstat 2004. Proceedings in computational statistics, pp. 1115–1122. Phisica-Verlag/Springer, Heidelberg (2004)
Guo, J., Levina, E., Michailidis, G., Zhu, J.: Pairwise variable selection for high-dimensional model-based clustering. Biometrics 66, 793–804 (2010)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York (2009)
Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2, 193–218 (1985)
Kass, R.E., Raftery, A.E.: Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995)
Keribin, C.: Consistent estimation of the order of mixture models. Sankhyā Ser. A 62, 49–66 (2000)
Law, M.H.C., Figueiredo, M.A.T., Jain, A.K.: Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1154–1166 (2004)
Liu, T.-F., Zhang, N.L., Chen, P., Liu, A.H., Poon, L.K.M., Wang, Y.: Greedy learning of latent tree models for multidimensional clustering. Mach. Learn. 98, 301–330 (2015)
Malsiner-Walli, G., Frühwirth-Schnatter, S., Grün, B.: Model-based clustering based on sparse finite Gaussian mixtures. Stat. Comput. 26, 303–324 (2016)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection for clustering with Gaussian mixture models. Biometrics 65, 701–709 (2009a)
Maugis, C., Celeux, G., Martin-Magniette, M.-L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53, 3872–3882 (2009b)
McLachlan, G.J., Peel, D.: Finite Mixture Models. Wiley, Chichester (2000)
McLachlan, G.J., Peel, D., Bean, R.W.: Modelling high-dimensional data by mixtures of factor analyzers. Comput. Stat. Data Anal. 41, 379–388 (2003)
McNicholas, P.D., Murphy, T.B.: Parsimonious Gaussian mixture models. Stat. Comput. 18, 285–296 (2008)
McNicholas, P.D., Murphy, T.B., McDaid, A.F., Frost, D.: Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput. Stat. Data Anal. 54, 711–723 (2010)
Melnykov, V., Maitra, R.: Finite mixture models and model-based clustering. Stat. Surv. 4, 80–116 (2010)
Montanari, A., Lizzani, L.: A projection pursuit approach to variable selection. Comput. Stat. Data Anal. 35, 463–473 (2001)
Pan, W., Shen, X.: Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 8, 1145–1164 (2007)
Poon, L.K.M., Zhang, N.L., Liu, T.-F., Liu, A.H.: Model-based clustering of high-dimensional data: variable selection versus facet determination. Int. J. Approx. Reason. 54, 196–215 (2013)
Quandt, R.E., Ramsey, J.B.: Estimating mixtures of normal distributions and switching regressions. J. Am. Stat. Assoc. 73, 730–738 (1978)
R Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL:http://www.R-project.org (2015)
Raftery, A.E., Dean, N.: Variable selection for model-based cluster analysis. J. Am. Stat. Assoc. 101, 168–178 (2006)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978)
Scrucca, L.: GA: a package for genetic algorithms in R. J. Stat. Softw. 53, 1–37 (4) (2013)
Scrucca, L.: Genetic algorithms for subset selection in model-based clustering. In: Celebi, M.E., Aydin, K. (eds.) Unsupervised Learning Algorithms, pp. 55–70. Springer, Berlin (2016)
Scrucca, L., Raftery, A.E.: Improved initialisation of model-based clustering using Gaussian hierarchical partitions. Adv. Data Anal. Classif. 9, 447–460 (2015)
Scrucca, L., Raftery, A.E.: clustvarsel: a package implementing variable selection for model-based clustering in R (2014). Pre-print available at arxiv:1411.0606
Soffritti, G.: Identifying multiple cluster structures in a data matrix. Commun. Stat. Simul. 32, 1151–1177 (2003)
Soffritti, G., Galimberti, G.: Multivariate linear regression with non-normal errors: a solution based on mixture models. Stat. Comput. 21, 523–536 (2011)
Srivastava, M.S.: Methods of Multivariate Statistics. Wiley, New York (2002)
Steinley, D., Brusco, M.J.: A new variable weighting and selection procedure for k-means cluster analysis. Multivar. Behav. Res. 43, 77–108 (2008a)
Steinley, D., Brusco, M.J.: Selection of variables in cluster analysis: an empirical comparison of eight procedures. Psychometrika 73, 125–144 (2008b)
Tadesse, M.G., Sha, N., Vannucci, M.: Bayesian variable selection in clustering high-dimensional data. J. Am. Stat. Assoc. 100, 602–617 (2005)
Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002)
Viroli, C.: Dimensionally reduced model-based clustering through mixtures of factor mixture analyzers. J. Classif. 31, 363–388 (2010)
Wang, S., Zhu, J.: Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics 64, 440–448 (2008)
Witten, D.M., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc. 105, 713–726 (2010)
Xie, B., Pan, W., Shen, X.: Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics 64, 921–930 (2008)
Zeng, H., Cheung, Y.-M.: A new feature selection method for Gaussian mixture clustering. Pattern Recognit. 42, 243–250 (2009)
Zhou, H., Pan, W., Shen, X.: Penalized model-based clustering with unconstrained covariance matrices. Electron. J. Stat. 3, 1473–1496 (2009)
Zhu, X., Melnykov, V.: Manly transformation in finite mixture modeling. Comput. Stat. Data Anal. (2016). doi:10.1016/j.csda.2016.01.015
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Galimberti, G., Manisi, A. & Soffritti, G. Modelling the role of variables in model-based cluster analysis. Stat Comput 28, 145–169 (2018). https://doi.org/10.1007/s11222-017-9723-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11222-017-9723-0