Finite mixture biclustering of discrete type multivariate data

Daniel Fernández¹,² (ORCID: orcid.org/0000-0003-0012-2094), Richard Arnold², Shirley Pledger², Ivy Liu² & Roy Costilla³

Abstract

Many methods for clustering matrices of data are based on mathematical techniques such as distance-based algorithms or matrix decompositions and eigenvalues. In general, these techniques do not support statistical inference or model selection via information criteria, because there is no underlying probability model. This article summarizes some recent model-based methodologies for matrices of binary, count, and ordinal data, which are modelled under a unified statistical framework using finite mixtures to group the rows and/or columns. The model parameters can be constructed from a linear predictor of parameters and covariates through link functions. This likelihood-based one-mode and two-mode fuzzy clustering provides maximum likelihood estimation of parameters and the option of using likelihood-based information criteria for model comparison. Additionally, a Bayesian approach is presented in which the parameters and the number of clusters are estimated simultaneously from their joint posterior distribution. Visualization tools focused on ordinal data, the fuzziness of the clustering structures, and analogues of various standard plots used in multivariate analysis are presented. Finally, a set of future extensions is enumerated.
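As a rough illustration of the one-mode (row) clustering idea for binary data, the following Python sketch fits a finite mixture of independent Bernoulli columns by the EM algorithm and compares models with BIC. It is not the authors' implementation: the function names (`e_step`, `fit_row_mixture`, `bic`), the unconstrained per-cluster column probabilities (no linear predictor or link function), the toy data, and the use of the number of rows as the BIC sample size are all assumptions made for this example.

```python
import math
import random


def e_step(Y, pi, theta):
    """Posterior (fuzzy) cluster memberships and the observed-data log-likelihood."""
    Z, loglik = [], 0.0
    for y in Y:
        logp = []
        for r in range(len(pi)):
            lp = math.log(pi[r])
            for j, yj in enumerate(y):
                lp += math.log(theta[r][j]) if yj else math.log(1.0 - theta[r][j])
            logp.append(lp)
        mx = max(logp)                      # log-sum-exp for numerical stability
        w = [math.exp(lp - mx) for lp in logp]
        s = sum(w)
        Z.append([wi / s for wi in w])
        loglik += mx + math.log(s)
    return Z, loglik


def fit_row_mixture(Y, R, n_iter=200, seed=0, eps=1e-6):
    """EM for a mixture of R row clusters on a binary matrix Y (list of 0/1 rows)."""
    rng = random.Random(seed)
    n, m = len(Y), len(Y[0])
    pi = [1.0 / R] * R
    # Random starting values (an arbitrary choice for this sketch).
    theta = [[rng.uniform(0.25, 0.75) for _ in range(m)] for _ in range(R)]
    Z, loglik = e_step(Y, pi, theta)
    for _ in range(n_iter):
        # M-step: weighted proportions, clamped away from 0 and 1.
        for r in range(R):
            nr = max(sum(Z[i][r] for i in range(n)), eps)
            pi[r] = max(nr / n, eps)
            for j in range(m):
                t = sum(Z[i][r] * Y[i][j] for i in range(n)) / nr
                theta[r][j] = min(max(t, eps), 1.0 - eps)
        Z, loglik = e_step(Y, pi, theta)
    return pi, theta, Z, loglik


def bic(loglik, R, n, m):
    """BIC with (R - 1) + R * m free parameters; n rows as sample size (assumed)."""
    k = (R - 1) + R * m
    return -2.0 * loglik + k * math.log(n)


# Toy data: two groups of rows with opposite response patterns.
Y = [[1, 1, 1, 0, 0, 0]] * 6 + [[0, 0, 0, 1, 1, 1]] * 6
n, m = len(Y), len(Y[0])

pi1, theta1, Z1, ll1 = fit_row_mixture(Y, R=1)
pi2, theta2, Z2, ll2 = fit_row_mixture(Y, R=2)
bic1, bic2 = bic(ll1, 1, n, m), bic(ll2, 2, n, m)

# Hard labels are taken as the most probable cluster for each row.
labels = [max(range(2), key=lambda r: z[r]) for z in Z2]
print("BIC R=1:", round(bic1, 2), "BIC R=2:", round(bic2, 2))
print("labels:", labels)
```

The rows of `Z2` are posterior membership probabilities, which is where the "fuzzy" reading of the clustering comes from; a hard partition is only obtained afterwards by taking the most probable cluster. A two-mode (biclustering) version would add a second set of latent labels for the columns, which this sketch does not attempt.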

Acknowledgements

This work was supported by the Marsden Fund project “Dimension reduction for mixed type multivariate data” (Award Number E2987-3648), funded by the New Zealand Government and administered by the Royal Society of New Zealand.

Author information

Authors and Affiliations

Institut de Recerca Sant Joan de Déu, Parc Sanitari Sant Joan de Déu, CIBERSAM, Dr. Antoni Pujades, 42, 08830, Sant Boi de Llobregat, Barcelona, Spain
Daniel Fernández
School of Mathematics and Statistics, Victoria University of Wellington, Wellington, New Zealand
Daniel Fernández, Richard Arnold, Shirley Pledger & Ivy Liu
Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia
Roy Costilla

Corresponding author

Correspondence to Daniel Fernández.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 179 KB)

About this article

Cite this article

Fernández, D., Arnold, R., Pledger, S. et al. Finite mixture biclustering of discrete type multivariate data. Adv Data Anal Classif 13, 117–143 (2019). https://doi.org/10.1007/s11634-018-0324-3

Received: 29 November 2016
Revised: 21 February 2018
Accepted: 02 May 2018
Published: 15 May 2018
Issue Date: 08 March 2019
DOI: https://doi.org/10.1007/s11634-018-0324-3
