Abstract
This paper deals with estimation and model selection in the Latent Block Model (LBM) for categorical data. First, after providing sufficient conditions ensuring the identifiability of this model, we generalise estimation procedures and model selection criteria derived for binary data. Second, we develop Bayesian inference through Gibbs sampling with a well-calibrated non-informative prior distribution, in order to obtain the MAP estimator; this approach is proved to avoid the traps encountered by maximum likelihood methodology for the LBM. Model selection criteria are then presented; in particular, an exact expression of the integrated completed likelihood (ICL) criterion, requiring no asymptotic approximation, is derived. Finally, numerical experiments on both simulated and real data sets highlight the appeal of the proposed estimation and model selection procedures.
Notes
The data set is available from https://www.hds.utc.fr/coclustering/doku.php.
Congressional Voting Records data set is available from http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.
The code of Wyse and Friel is available at http://www.ucd.ie/statdept/jwyse.
References
Allman, E., Matias, C., Rhodes, J.: Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37, 3099–3132 (2009)
Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S., Modha, D.S.: A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J. Mach. Learn. Res. 8, 1919–1986 (2007)
Baudry, J.-P.: Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris Sud, December 2009.
Baudry, J.-P., Raftery, A.E., Celeux, G., Lo, K., Gottardo, R.: Combining mixture components for clustering. J. Comput. Gr. Stat. 19, 332–353 (2010)
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (2000)
Carreira-Perpiñán, M., Renals, S.: Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Comput. 12, 141–152 (2000)
Celeux, G., Diebolt, J.: Stochastic versions of the EM algorithm. Comput. Stat. Q. 2, 73–82 (1985)
Celisse, A., Daudin, J.-J., Latouche, P.: Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6, 1847–1899 (2012)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)
Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer Series in Statistics. Springer (2006)
Frühwirth-Schnatter, S.: Mixtures: Estimation and Applications, chapter Dealing with Label Switching under Model Uncertainty. Wiley, Chichester (2011)
Govaert, G.: Algorithme de classification d’un tableau de contingence. In First International Symposium on Data Analysis and Informatics, pp. 487–500, Versailles, 1977. INRIA.
Govaert, G.: Classification croisée. PhD thesis, Université Paris 6, France, 1983.
Govaert, G., Nadif, M.: Block clustering with Bernoulli mixture models: comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008)
Govaert, G., Nadif, M.: Latent block model for contingency table. Commun. Stat. Theory Methods 39, 416–425 (2010)
Gyllenberg, M., Koski, T., Reilink, E., Verlann, M.: Non-uniqueness in probabilistic numerical identification of bacteria. J. Appl. Probab. 31, 542–548 (1994)
Jagalur, M., Pal, C., Learned-Miller, E., Zoeller, R.T., Kulp, D.: Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinform. 8, S5 (2007)
Keribin, C.: Consistent estimation of the order of mixture models. Sankhya Ser. A 62, 49–66 (2000)
Keribin, C.: Méthodes bayésiennes variationnelles: concepts et applications en neuroimagerie. Journal de la Société Française de Statistique 151, 107–131 (2010)
Keribin, C., Brault, V., Celeux, G., Govaert, G.: Model selection for the binary latent block model. Proceedings of COMPSTAT 2012, 2012.
Keribin, C., Brault, V., Celeux, G., Govaert, G.: Estimation and Selection for the Latent Block Model on Categorical Data. Rapport de recherche RR-8264, INRIA, March 2013. URL http://hal.inria.fr/hal-00802764
Lomet, A.: Sélection de modèle pour la classification croisée de données continues. PhD thesis, Université de Technologie de Compiègne, December 2012.
Lomet, A., Govaert, G., Grandvalet, Y.: Un protocole de simulation de données pour la classification croisée. In 44èmes Journées de Statistique, Bruxelles, May 2012.
Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1, 24–45 (2004)
Mariadassou, M., Matias, C.: Convergence of the groups posterior distribution in latent or stochastic block models. arXiv preprint arXiv:1206.7101v2 (2013)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, New York (2008)
McLachlan, G.J., Peel, D.: Finite Mixture Models, 2nd edn. Wiley, New York (2000)
Meeds, E., Roweis, S.: Nonparametric Bayesian biclustering. Technical Report UTML TR 2007–001, Department of Computer Science, University of Toronto, 2007.
Rousseau, J., Mengersen, K.: Asymptotic behaviour of the posterior distribution in overfitted models. J. R. Stat. Soc. Ser. B 73, 689–710 (2011)
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
Shan, H., Banerjee, A.: Bayesian co-clustering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pp. 530–539, Washington, DC, 2008. IEEE Computer Society.
Wyse, J., Friel, N.: Block clustering with collapsed latent block models. Stat. Comput. 22, 415–428 (2012)
Acknowledgments
The authors thank the reviewers and the Associate Editor for their very helpful comments which have greatly improved this paper. C. K. has been supported by LMH (Hadamard Mathematics Labex), backed by the foundation FMJH (Fondation Mathématique Jacques Hadamard).
Appendices
Appendix A: Proof of Theorem 1
The identifiability of a parametric model requires that for any two different values \(\varvec{\theta }\ne \varvec{\theta }'\) in the parameter space, the corresponding probability distributions \({I\!P}_{\varvec{\theta }}\) and \({I\!P}_{\varvec{\theta }'}\) are different. We prove that, under the assumptions of Theorem 1, there exists a unique parameter \(\varvec{\theta }=(\varvec{\pi },\varvec{\rho },\varvec{\alpha })\), up to a permutation of row and column labels, corresponding to \({I\!P}(\mathbf{y})\), the probability distribution of a matrix \(\mathbf{y}\) having at least \(2m-1\) rows and \(2g-1\) columns. The proof is adapted from Celisse et al. (2012), who established a similar result for the Stochastic Block Model. Notice first that \(\varvec{\tau }= A \varvec{\rho }\) is the vector of probabilities \(\tau _k\) of observing a \(1\) in a cell of a row belonging to row class \(k\):
As all the coordinates of \(\varvec{\tau }\) are distinct, the \(g\times g\) matrix \(R\) with coefficients \(R_{ik}=(\tau _k)^i\), for \(0 \le i < g\) and \(1 \le k \le g\), is a Vandermonde matrix and hence invertible. Consider now \(u_i\), the probability of observing \(1\) in the first \(i\) cells of the first row of \(\mathbf{y}\):
For a given \({I\!P}(\mathbf{y})\), \(u_1, \ldots ,u_{2g-1}\) are known, and we set \(u_0=1\). Let now \(M\) be the \((g+1)\times g\) matrix defined by \(M_{ij}=u_{i+j-2}\), for all \(1\le i \le g+1\) and \(1\le j\le g\), and define \(M_i\) as the square matrix obtained by removing row \(i\) from \(M\). The coefficients of \(M\) are
We can write, with \(A_{\pi }=\hbox {Diag}(\varvec{\pi })\)
Now, \(R\), unknown at this stage, can be retrieved by noticing that the coordinates of \(\varvec{\tau }\) are the roots of the following polynomial (Celisse et al. 2012) of degree \(g\)
where \(D_k=\hbox {det} M_k\) and \(D_g\ne 0\) since \(M_g\) is a product of invertible matrices. Hence \(\varvec{\tau }\) can be determined, and \(R\) is now known. Consequently, \(\varvec{\pi }\) is uniquely defined by \(A_{\pi }=R^{-1} M_g R^{'-1}\).
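As an illustration of the moment identities used above, the following NumPy sketch (ours, not part of the original proof) builds the moments \(u_i=\sum _k \pi _k \tau _k^i\) from a known parameter, forms the matrix with entries \(u_{i+j-2}\), and recovers \(\varvec{\pi }\) through \(A_{\pi }=R^{-1} M_g R^{'-1}\). Here \(M_g\) is taken to be the square matrix formed by the first \(g\) rows of \(M\), which is one possible reading of the notation; all names are illustrative.

```python
import numpy as np

# Illustrative check of the moment identities (assumes tau has distinct
# entries and reads M_g as the square matrix of the first g rows of M).
g = 3
pi = np.array([0.5, 0.3, 0.2])            # row-class proportions
tau = np.array([0.2, 0.5, 0.8])           # tau_k = P(cell = 1 | row class k)

u = np.array([np.sum(pi * tau**i) for i in range(2 * g - 1)])    # u_0, ..., u_{2g-2}
R = np.vander(tau, g, increasing=True).T                         # R[i, k] = tau_k ** i
Mg = np.array([[u[i + j] for j in range(g)] for i in range(g)])  # Mg[i, j] = u_{i+j}

# A_pi = R^{-1} Mg (R')^{-1} should recover Diag(pi)
A_pi = np.linalg.solve(R, np.linalg.solve(R, Mg.T).T)
print(np.allclose(np.diag(A_pi), pi))     # True
```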
In the same way, \(\varvec{\rho }\) is uniquely defined by considering the probabilities \(\sigma _\ell \) of observing a \(1\) in a cell of a column of column class \(\ell \) and the probabilities \(v_j\) of observing \(1\) in the first \(j\) cells of the first column of \(\mathbf{y}\)
To determine \(\varvec{\alpha }\), we introduce for \(1\le i \le g\) and \(1\le j \le m\) the probabilities \(U_{ij}\) of observing \(1\) in the first \(i\) cells of the first row and in the first \(j\) cells of the first column
These probabilities are known and we can write, with \(S_{j\ell }=(\sigma _\ell )^{j-1},\,j=1,\ldots ,m,\,\ell =1,\ldots ,m\)
which leads to \(A=A_{\pi }^{-1}R^{-1} U S^{'-1} A_{\rho }^{-1}\), and hence defines \(\varvec{\alpha }\) in a unique manner.
The proof extends straightforwardly to the categorical case. The identification of \(\varvec{\pi }\) and \(\varvec{\rho }\) is obtained by considering the probabilities of \(y_{ij}=1\), where \(1\) is the first outcome of the multinomial distribution. Then, each \(A^h=(\alpha _{k\ell }^h)_{k=1,\ldots ,g;\ell =1,\ldots ,m}\) is successively identified by considering the events \(y_{ij}=h\), \(i=1,\ldots ,m,\,j=1,\ldots , g\).
Appendix B: VEM algorithm for categorical data
Govaert and Nadif (2008) proposed, for the binary case, a variational approximation of the EM algorithm obtained by imposing that the joint distribution of the labels factorises as \(q^{(c)}_{zw}(\mathbf{z},\mathbf{w})=q^{(c)}_z(\mathbf{z})q^{(c)}_w(\mathbf{w})\). To get simpler formulas, we denote
and
Using the variational approximation, the maximisation of the log-likelihood is replaced by the maximisation of the free energy
alternately in \(q_{z},\,q_w\) and \(\varvec{\theta }\) (see Keribin 2010). The difference between the maximum log-likelihood and the maximum free energy is the Kullback–Leibler divergence \(KL(q_{zw}||p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta }) )={I\!E}_{q_{zw}}\left[ \log \frac{q_{zw}(\mathbf{z},\mathbf{w})}{p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta })} \right] \). This can be extended to the categorical LBM, and leads to the following Variational EM (VEM) algorithm:
E step This step maximises the free energy with respect to \(q_{z}\) and \(q_w\), and leads to:
1.
Computing \(s_{ik}^{(c+1)}\) with fixed \(t^{(c)}_{j\ell }\) and \(\varvec{\theta }^{(c)}\)
$$\begin{aligned} s_{ik}^{(c+1)}= \frac{\pi _k^{(c)} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})}{\sum _{k'} \pi _{k'}^{(c)} \psi _{k'}(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k'\cdot }^{(c)}) }, \quad k=1,\ldots , g \end{aligned}$$

where \(\mathbf{y}_{i\cdot }\) denotes the row \(i\) of the matrix \(\mathbf{y}\), \(\varvec{\alpha }_{k\cdot }=(\alpha _{k1},\ldots ,\alpha _{km})\), and

$$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j t_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
2.
Computing \(t_{jl}^{(c+1)}\) with fixed \(s^{(c+1)}_{ik}\) and \(\varvec{\theta }^{(c)}\)
$$\begin{aligned} t_{j\ell }^{(c+1)}= \frac{\rho _{\ell }^{(c)} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})}{\sum _{\ell '} \rho _{\ell '}^{(c)} \phi _{\ell '}(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell '}^{(c)}) } , \quad \ell =1,\ldots ,m \end{aligned}$$

where \(\mathbf{y}_{\cdot j}\) denotes the column \(j\) of the matrix \(\mathbf{y}\), \(\varvec{\alpha }_{\cdot \ell }=(\alpha _{1\ell },\ldots ,\alpha _{g\ell })\) and
$$\begin{aligned} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})=\prod _{k,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _i s_{ik}^{(c+1)} y_{ij}^{h}}. \end{aligned}$$
M step Updating \(\varvec{\theta }^{(c+1)}\). Denoting \(s_{.k}^{(c+1)}=\sum _i s_{ik}^{(c+1)}\) and \(t_{.\ell }^{(c+1)}=\sum _j t_{j\ell }^{(c+1)}\), this leads to
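For concreteness, here is a minimal NumPy sketch of one VEM iteration for the categorical LBM. The E step implements the updates displayed above; the M step uses the weighted updates \(\pi _k=s_{.k}\) divided by the number of rows, \(\rho _\ell =t_{.\ell }\) divided by the number of columns, and \(\alpha _{k\ell }^{h}=\sum _{i,j} s_{ik}\, t_{j\ell }\, y_{ij}^{h} / (s_{.k}\, t_{.\ell })\), which are assumed here since the displayed M-step formulas are not reproduced; all function and variable names are ours.

```python
import numpy as np

def vem_step(Y, s, t, pi, rho, alpha):
    """One VEM iteration for the categorical LBM (illustrative sketch).

    Y     : (n, d) integer matrix with entries in {0, ..., r-1}
    s     : (n, g) variational row-class probabilities
    t     : (d, m) variational column-class probabilities
    pi    : (g,)   row-class proportions
    rho   : (m,)   column-class proportions
    alpha : (g, m, r) block-wise multinomial probabilities
    """
    Yh = np.stack([(Y == h).astype(float) for h in range(alpha.shape[2])], axis=-1)
    log_alpha = np.log(alpha)

    # E step (rows): s_ik proportional to pi_k * prod_{l,h} (alpha_kl^h)^(sum_j t_jl y_ij^h)
    log_s = np.log(pi) + np.einsum('jl,ijh,klh->ik', t, Yh, log_alpha)
    s = np.exp(log_s - log_s.max(axis=1, keepdims=True))
    s /= s.sum(axis=1, keepdims=True)

    # E step (columns): t_jl proportional to rho_l * prod_{k,h} (alpha_kl^h)^(sum_i s_ik y_ij^h)
    log_t = np.log(rho) + np.einsum('ik,ijh,klh->jl', s, Yh, log_alpha)
    t = np.exp(log_t - log_t.max(axis=1, keepdims=True))
    t /= t.sum(axis=1, keepdims=True)

    # M step (standard weighted updates, assumed here)
    pi = s.mean(axis=0)
    rho = t.mean(axis=0)
    alpha = np.einsum('ik,jl,ijh->klh', s, t, Yh)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return s, t, pi, rho, alpha
```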
Appendix C: SEM algorithm for categorical data
E and S step
1.
Computation of \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{z}^{(c+1)}\) according to \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\):
$$\begin{aligned} p(z_i=k|\mathbf{y}_{i\cdot },\mathbf{w}^{(c)};\varvec{\theta }^{(c)})= \frac{\pi ^{(c)}_k \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k\cdot })}{\sum _{k'} \pi ^{(c)}_{k'} \psi _{k'}(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k'\cdot }) }, \end{aligned}$$

for \(k=1,\ldots , g\) with

$$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot })=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j w_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
2.
Computation of \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{w}^{(c+1)}\) according to \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\).
M step Denoting \(z_{.k}:=\sum _i z_{ik}\) and \(w_{.\ell }:=\sum _j w_{j\ell }\), this leads to
and
Note that the formulae of VEM and SEM-Gibbs are essentially the same, except that the probabilities \(s_{ik}\) and \(t_{j\ell }\) are replaced with binary indicator values \(z_{ik}\) and \(w_{j\ell }\).
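As a complement, the sketch below illustrates the row part of the E and S step together with an M step computed on the sampled indicators; the symmetric resampling of the column labels and the update of \(\varvec{\rho }\) are omitted. The M step uses the standard counts-based updates \(\pi _k=z_{.k}\) divided by the number of rows and \(\alpha _{k\ell }^{h}=N_{k\ell }^{h}/(z_{.k} w_{.\ell })\), which are assumed here; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sem_gibbs_row_step(Y, w, pi, rho, alpha):
    """Row part of one SEM-Gibbs sweep for the categorical LBM (sketch).

    w is the (d,) vector of current column labels in {0, ..., m-1}.
    """
    g, m, r = alpha.shape
    Yh = np.stack([(Y == h).astype(float) for h in range(r)], axis=-1)
    W = np.eye(m)[w]                                    # (d, m) indicator matrix

    # E + S step (rows): sample z_i from p(z_i = k | y, w; theta)
    log_p = np.log(pi) + np.einsum('jl,ijh,klh->ik', W, Yh, np.log(alpha))
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(g, p=row) for row in p])
    Z = np.eye(g)[z]                                    # (n, g) indicator matrix

    # M step on the sampled indicators (column labels resampled symmetrically elsewhere)
    pi = Z.mean(axis=0)
    alpha = np.einsum('ik,jl,ijh->klh', Z, W, Yh)
    alpha /= alpha.sum(axis=-1, keepdims=True)
    return z, pi, alpha
```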
Appendix D: Computing ICL
In this section, the exact ICL expression is derived for categorical data. Using the independence of the \(\mathbf{z}\)s and the \(\mathbf{w}\)s conditionally on \(\varvec{\alpha }, \varvec{\pi }, \varvec{\rho }\), the integrated completed likelihood can be written
Thus
Now, according to the LBM definition,
Since the prior distribution of \(\varvec{\pi }\) is the Dirichlet distribution \(\mathcal{D}(a,\ldots ,a)\)
we have
We recognise a non-normalised Dirichlet distribution \(\mathcal{D}(z_{.1}+a,\ldots ,z_{.g}+a)\) with the normalising factor
The expression of \(p( \mathbf{z})\) follows directly from Bayes' theorem
In the same manner,
We now turn to the computation of \(p(\mathbf{y}|\mathbf{z}, \mathbf{w} )\). Since \(\varvec{\alpha }\) and \((\mathbf{z}, \mathbf{w})\) are independent, we have
and, using the conditional independence of the \(y_{ij}\) given \(\mathbf{z},\,\mathbf{w}\) and \(\varvec{\alpha }\),
Therefore
because \(p(\alpha _{k\ell })\) is the density of a Dirichlet distribution \(\mathcal{D}(b,\ldots ,b)\). Each term \(k\ell \) is a non-normalised Dirichlet density \({\mathcal {D}}(b+N_{k\ell }^1,\ldots ,b+N_{k\ell }^r)\) with the normalising factor
Thus, by Bayes' formula,
The ICL criterion, presented in Section 4.1, is then straightforwardly derived from Eqs. (3), (8), (9) and (10).
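Assuming a Dirichlet \(\mathcal{D}(a,\ldots ,a)\) prior for both \(\varvec{\pi }\) and \(\varvec{\rho }\) and a Dirichlet \(\mathcal{D}(b,\ldots ,b)\) prior for each \(\varvec{\alpha }_{k\ell }\), the exact ICL is the sum of the three collapsed terms \(\log p(\mathbf{z})\), \(\log p(\mathbf{w})\) and \(\log p(\mathbf{y}|\mathbf{z},\mathbf{w})\) derived above. The sketch below writes this sum with log-gamma functions; it is ours, and the hyperparameter values are placeholders rather than the ones recommended in the paper.

```python
import numpy as np
from scipy.special import gammaln

def exact_icl(Y, z, w, g, m, r, a=1.0, b=1.0):
    """Exact ICL log p(y, z, w) for the categorical LBM (illustrative sketch).

    z, w : hard row/column labels in {0,...,g-1} and {0,...,m-1}
    a, b : Dirichlet hyperparameters (placeholder values)
    """
    n, d = Y.shape
    Z = np.eye(g)[z]                                   # (n, g) row indicators
    W = np.eye(m)[w]                                   # (d, m) column indicators
    Yh = np.stack([(Y == h) for h in range(r)], axis=-1).astype(float)

    zk = Z.sum(axis=0)                                 # z_.k
    wl = W.sum(axis=0)                                 # w_.l
    N = np.einsum('ik,jl,ijh->klh', Z, W, Yh)          # N_kl^h block counts

    # log p(z) and log p(w): collapsed Dirichlet-multinomial terms
    log_pz = gammaln(g * a) - g * gammaln(a) + gammaln(zk + a).sum() - gammaln(n + g * a)
    log_pw = gammaln(m * a) - m * gammaln(a) + gammaln(wl + a).sum() - gammaln(d + m * a)

    # log p(y | z, w): one collapsed term per block (k, l)
    log_py = (g * m) * (gammaln(r * b) - r * gammaln(b)) \
        + gammaln(N + b).sum() \
        - gammaln(N.sum(axis=-1) + r * b).sum()

    return log_pz + log_pw + log_py
```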