[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ Skip to main content
Log in

Estimation and selection for the latent block model on categorical data

  • Published:
Statistics and Computing Aims and scope Submit manuscript

Abstract

This paper deals with estimation and model selection in the Latent Block Model (LBM) for categorical data. First, after providing sufficient conditions ensuring the identifiability of this model, we generalise estimation procedures and model selection criteria derived for binary data. Secondly, we develop Bayesian inference through Gibbs sampling and with a well calibrated non informative prior distribution, in order to get the MAP estimator: this is proved to avoid the traps encountered by the LBM with the maximum likelihood methodology. Then model selection criteria are presented. In particular an exact expression of the integrated completed likelihood criterion requiring no asymptotic approximation is derived. Finally numerical experiments on both simulated and real data sets highlight the appeal of the proposed estimation and model selection procedures.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
£29.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3

Similar content being viewed by others

Explore related subjects

Discover the latest articles, news and stories from top researchers in related subjects.

Notes

  1. The data set is available from https://www.hds.utc.fr/coclustering/doku.php.

  2. Congressional Voting Records data set is available from http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.

  3. The code of Wyse and Friel is available at http://www.ucd.ie/statdept/jwyse.

References

  • Allman, E., Mattias, C., Rhodes, J.: Identifiability of parameters in latent structure models with many observed variables. Ann. Stat. 37, 3099–3132 (2009)

  • Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S., Modha, D.S.: A generalized maximum entropy approach to bregman co-clustering and matrix approximation. J. Mach. Learn. Res. 8, 1919–1986 (2007). ISSN 1532–4435

    MATH  MathSciNet  Google Scholar 

  • Baudry, J.-P. : Sélection de modèle pour la classification non supervisée. Choix du nombre de classes. PhD thesis, Université Paris Sud, December 2009.

  • Baudry, J.-P., Raftery, A.E., Celeux, G., Lo, K., Gottardo, R.: Combining mixture components for clustering. J. Comput. Gr. Stat. 19, 332–353 (2010)

  • Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22, 719–725 (Jul 2000)

  • Carreira-Perpiñàn, M., Renals, S.: Practical identifiability of finite mixtures of multivariate bernoulli distributions. Neural Comput. 12, 141–152 (2000)

    Article  Google Scholar 

  • Celeux, G., Diebolt, J.: Stochastic versions of the em algorithm. Comput. Stat. Quat. 2, 73–82 (1985)

    Google Scholar 

  • Celisse, A., Daudin, J.-J., Latouche, P.: Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Stat. 6, 1847–1899 (2012)

    Article  MATH  MathSciNet  Google Scholar 

  • Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm (with discussion). J. R. Stat. Soc. Ser. B 39, 1–38 (1977)

    MATH  MathSciNet  Google Scholar 

  • Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer series in statistics, Springer (2006)

    MATH  Google Scholar 

  • Frühwirth-Schnatter, S.: Mixtures : Estimation and Applications, Chapter Dealing with Label Switching Under Model Uncertainty. Wiley, Chichester (2011)

    Google Scholar 

  • Govaert, G. : Algorithme de classification d’un tableau de contingence. In First international Symposium on Data Analysis and Informatics, pp. 487–500, Versailles, 1977. INRIA.

  • Govaert, G. : Classification croisée. PhD thesis, Université Paris 6, France, 1983.

  • Govaert, G., Nadif, M.: Block clustering with bernoulli mixture models: comparison of different approaches. Comput. Stat. Data Anal. 52, 3233–3245 (2008)

    Article  MATH  MathSciNet  Google Scholar 

  • Govaert, G., Nadif, M.: Latent block model for contingency table. Commun. Stat. Theory Methods 39, 416–425 (2010)

    Article  MATH  MathSciNet  Google Scholar 

  • Gyllenberg, M., Koski, T., Reilink, E., Verlann, M.: Non-uniqueness in probabilistic numerical identification of bacteria. J. Appl. Probab. 31, 542–548 (1994)

    Article  MATH  MathSciNet  Google Scholar 

  • Jagalur, M., Pal, C., Learned-Miller, E., Zoeller, R.T., Kulp, D.: Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinform. 8, S5 (2007)

    Article  Google Scholar 

  • Keribin, C.: Consistent estimation of the order of mixture models. Sankhya Ser. A 62, 49–66 (2000)

  • Keribin, C.: Méthodes bayésiennes variationnelles: concepts et applications en neuroimagerie. Journal de la Société Française de Statistique 151, 107–131 (2010)

    MathSciNet  Google Scholar 

  • Keribin, C., Brault, V., Celeux, G., Govaert, G.: Model selection for the binary latent block model. Proceedings of COMPSTAT 2012, 2012.

  • Keribin, C., Brault, V., Celeux, G., Govaert, G. : Estimation and Selection for the Latent Block Model on Categorical Data. Rapport de recherche RR-8264, INRIA, March 2013. URL http://hal.inria.fr/hal-00802764

  • Lomet, A.: Sélection de modèle pour la classification croisée de données continues. PhD thesis, Université de Technologie de Compiègne, December 2012.

  • Lomet, A., Govaert, G., Grandvalet, Y.: Un protocole de simulation de données pour la classification croisée. In 44ème journées de statistique, Bruxelles, Mai 2012.

  • Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1, 24–45 (2004)

    Article  Google Scholar 

  • Mariadassou, M., Matias, C.: Convergence of the groups posterior distribution in latent or stochastic block models. arXiv, preprint arXiv:1206.7101v2, 2013.

  • McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, Nex York (2008)

    Book  MATH  Google Scholar 

  • McLachlan, G.J., Peel, D.: Finite Mixture Models, 2nd edn. Wiley, Nex York (2000)

    Book  MATH  Google Scholar 

  • Meeds, E, Roweis, S: Nonparametric bayesian biclustering. Technical Report UTML TR 2007–001, Department of Computer Science, University of Toronto, 2007.

  • Rousseau, J., Mengersen, K.: Asymptotic behaviour of the posterior distribution in overfitted models. J. Roy. Stat. Soc. 73, 689–710 (2011)

    Article  MATH  MathSciNet  Google Scholar 

  • Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

    Article  MATH  Google Scholar 

  • Shan, H., Banerjee, A.: Bayesian co-clustering. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, pp. 530–539, Washington, DC, 2008. IEEE Computer Society.

  • Wyse, J., Friel, N.: Block clustering with collapsed latent block models. Stat. Comput. 22, 415–428 (2012)

    Article  MathSciNet  Google Scholar 

Download references

Acknowledgments

The authors thank the reviewers and the Associate Editor for their very helpful comments which have greatly improved this paper. C. K. has been supported by LMH (Hadamard Mathematics Labex), backed by the foundation FMJH (Fondation Mathématique Jacques Hadamard).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Christine Keribin.

Appendices

Appendices

Appendix A: Proof of Theorem 1

The identifiability of a parametric model requires that for any two different values \(\varvec{\theta }\ne \varvec{\theta }'\) in the parameter space, the corresponding probability distribution \({I\!P}_{\varvec{\theta }}\) and \({I\!P}_{\varvec{\theta }'}\) are different. We prove that, under assumptions of Theorem 1, there exists a unique parameter \(\varvec{\theta }=(\varvec{\pi },\varvec{\rho },\varvec{\alpha })\), up to a permutation of row and column labels, corresponding to \({I\!P}(\mathbf{y})\) the probability distribution function of matrix \(\mathbf{y}\) having at least \(2m-1\) rows and \(2g-1\) columns. The proof is adapted from Celisse et al. (2012) who set up a similar result for the Stochastic Block Model. Notice first that \(\varvec{\tau }= A \varvec{\rho }\) is the vector of the probability \(\tau _k\) to have \(1\) in a cell of a row of given row class \(k\):

$$\begin{aligned} \tau _k={I\!P}(y_{ij}=1|z_{ik}=1)=\sum _\ell \rho _\ell \alpha _{k\ell }=(A \varvec{\rho })_k. \end{aligned}$$

As all the coordinates of \(\varvec{\tau }\) are distinct, \(R\) defined as the \(g\times g\) matrix of coefficients \(R_{ik}=(\tau _k)^i\), for \(0 \le i < g\) and \(1 \le k \le g\), is Vandermonde, and hence invertible. Consider now \(u_i\), the probability to have \(1\) on the \(i\) first cells of the first row of \(\mathbf{y}\):

$$\begin{aligned} u_i&= {I\!P}(y_{11}=1,\ldots ,y_{1i}=1)\\&= \sum _{ k, \ell _1,\ldots ,\ell _i}{I\!P}(z_{1k}=1)\\&\times \prod _{j=1}^{j=i}\!\big ( {I\!P}(y_{1j}|z_{1k}=1, w_{j\ell _j}=1){I\!P}(w_{j\ell _j}=1)\big )\\&= \sum _{k} {I\!P}(z_{1k}=1)\\&\times \sum _{\ell _1} {I\!P}(y_{11}|z_{1k}=1, w_{1\ell _1}=1){I\!P}(w_{1\ell _1}=1 )\\&\times \ldots \times \sum _{\ell _i} {I\!P}(y_{1i}|z_{1k}=1, w_{i\ell _i}=1){I\!P}(w_{i\ell _i}=1 )\\&= {I\!P}(y_{11}=1,\ldots ,y_{1i}=1)= \sum _{k} \pi _k (\tau _k)^i \end{aligned}$$

With a given \({I\!P}(\mathbf{y}),\, u_1, \ldots ,u_{2g-1}\) are known, and we denote \(u_0=1\). Let now \(M\) be the \((g+1)\times g\) matrix defined by \(M_{ij}=u_{i+j-2}\), for all \(1\le i \le g+1\) and \(1\le j\le g\) and define \(M_i\) as the square matrix obtained by removing the row \(i\) from \(M\). The coefficients of \(M\) are

$$\begin{aligned} M_{ij}=u_{i+j-2}=\sum _{1\le k\le g} \tau _k^{i-1} \pi _k \tau _k^{j-1}. \end{aligned}$$

We can write, with \(A_{\pi }=\hbox {Diag}(\varvec{\pi })\)

$$\begin{aligned} M_g=RA_{\pi }R'. \end{aligned}$$

Now, \(R\), unknown at this stage, can be retrieved by noticing that the coefficients of \(\varvec{\tau }\) are the roots of the following polynomial (Celisse et al. 2012) of degree \(g\)

$$\begin{aligned} B(x)=\sum _{k=0}^g (-1)^{k+g} D_k x^k, \end{aligned}$$

where \(D_k=\hbox {det} M_k\) and \(D_g\ne 0\) as \(M_g\) is a product of invertible matrices. Hence, it is possible to determine \(\varvec{\tau }\), and \(R\) is now known. Consequently, \(\varvec{\pi }\) is defined in a unique manner by \(A_{\pi }=R^{-1} M_g R^{'-1}\).

In the same way, \(\varvec{\rho }\) is defined in a unique manner by considering the probabilities \(\sigma _\ell \) to have \(1\) in a column of column class \(\ell \) and the probabilities \(v_j\) to have \(1\) on the \(j\) first cells of the first column of \(\mathbf{y}\)

$$\begin{aligned} \sigma _\ell \!=\!{I\!P}(y_{ij}\!=\!1|w_{j\ell }=1)\;\; \hbox {and } v_j \!=\!{I\!P}(y_{11}\!=\!1,\ldots ,y_{j1}=1 ) \end{aligned}$$

To determine \(\varvec{\alpha }\), we introduce for \(1\le i \le g\) and \(1\le j \le m\) the probabilities \(U_{ij}\) to have 1 in each \(i\) first cells of the first row and in each \(j\) first cells of the first column

$$\begin{aligned} U_{ij}&= {I\!P}(y_{11}=1,\ldots ,y_{1i}=1, y_{21}=1,\ldots ,y_{j1}=1) \\&= \sum _{\overset{k,\ell _1,\ell _2,\ldots ,\ell _{i},}{k_1,\ldots ,k_{j}}} {I\!P}(y_{11}|z_{1k}=1, w_{1\ell }=1)\\&\times {I\!P}(z_{1k}=1) {I\!P}(w_{1\ell }=1)\\&\times \prod _{\lambda =2}^i{{I\!P}(y_{1\lambda }|z_{1k}=1, w_{\lambda \ell _\lambda }=1) {I\!P}(w_{\lambda \ell _\lambda }=1)}\\&\times \prod _{\eta =2}^j{{I\!P}(y_{\eta 1}|z_{\eta k_\eta }=1, w_{1\ell }=1) {I\!P}(z_{\eta k_\eta }=1)}\\&= \sum _{k,\ell } {I\!P}(y_{11}|z_{1k}=1, w_{1\ell }=1){I\!P}(z_{1k}=1)\\&\times {I\!P}(w_{1\ell }=1){I\!P}( y_{1i}|z_{1k}=1)^{i-1} {I\!P}( y_{j1}|w_{1\ell }=1)^{j-1}\\&= \sum _{k,\ell } \pi _k \tau _k^{i-1} \alpha _{k\ell } \rho _\ell \sigma _\ell ^{j-1}. \end{aligned}$$

These probabilities are known and we can write, with \(S_{j\ell }=(\sigma _\ell )^{j-1},\,j=1,\ldots m,\,\ell =1,\ldots ,m\)

$$\begin{aligned} U=RA_{\pi } A A_{\rho } S', \end{aligned}$$

and it leads to define \(A=A_{\pi }^{-1}R^{-1} U S^{'-1} A_{\rho }^{-1}\), and hence \(\varvec{\alpha }\), in a unique manner.

The proof is straightforwardly extended to the categorical case. The identification of \(\varvec{\pi }\) and \(\varvec{\rho }\) are obtained by considering the probabilities of \(y_{ij}=1\) where \(1\) is the first outcome of the multinomial distribution. Then, each \(A^h=(\alpha _{k\ell }^h)_{k=1,\ldots ,g;\ell =1,\ldots ,m}\) is successively identified by considering \(y_{ij}=\ell ,\,i=1,\ldots ,m,\,j=1,\ldots , g\).

Appendix B: VEM algorithm for categorical data

Govaert and Nadif (2008) proposed to use, for the binary case, a variational approximation of the EM algorithm by imposing that the joint distribution of the labels takes the form \(q^{(c)}_{zw}(\mathbf{z},\mathbf{w})=q^{(c)}_z(\mathbf{z})q^{(c)}_w(\mathbf{w})\). To get simpler formulas, we will denote

$$\begin{aligned} s_{ik}^{(c)}=q^{(c)}_z(z_{ik}=1) \end{aligned}$$

and

$$\begin{aligned} t_{j\ell }^{(c)}=q^{(c)}_w(w_{j\ell }=1). \end{aligned}$$

Using the variational approximation, the maximisation of the loglikelihood is replaced by the maximisation of the free energy

$$\begin{aligned} {\mathcal {F}}(q_{zw},\varvec{\theta })= {I\!E}_{q_{z}q_{w}}\left[ \log \frac{ p(\mathbf{y},\mathbf{z},\mathbf{w};\varvec{\theta })}{q_{z}(\mathbf{z})q_{w}(\mathbf{w})}\right] \end{aligned}$$

alternatively in \(q_{z},\,q_w\) and \(\varvec{\theta }\) (see Keribin 2010). The difference between the maximum loglikelihood and the maximum free energy is the Kullback divergence \(KL(q_{zw}||p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta }) )={I\!E}_{q_{zw}}\left[ \log \frac{q_{zw}(\mathbf{z},\mathbf{w})}{p(\mathbf{z},\mathbf{w}|\mathbf{y};\varvec{\theta })} \right] \). This can be extended to the categorical LBM, and leads to the following Variational EM (VEM) algorithm:

E step It consists of maximising the free energy in \(q_{z}\) and \(q_w\), and it leads to:

  1. 1.

    Computing \(s_{ik}^{(c+1)}\) with fixed \(w^{(c)}_{jl}\) and \(\varvec{\theta }^{(c)}\)

    $$\begin{aligned} s_{ik}^{(c+1)}= \frac{\pi _k^{(c)} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})}{\sum _{k'} \pi _{k'}^{(c)} \psi _{k'}^{(c)}(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k'\cdot }^{(c)}) }, k=1,\ldots , g \end{aligned}$$

    where \(\mathbf{y}_{i\cdot }\) denotes the row \(i\) of the matrix \(\mathbf{y}\), \(\varvec{\alpha }_{k\cdot }=(\alpha _{k1},\ldots ,\alpha _{km})\), and

    $$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot }^{(c)})=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j t_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
  2. 2.

    Computing \(t_{jl}^{(c+1)}\) with fixed \(s^{(c+1)}_{ik}\) and \(\varvec{\theta }^{(c)}\)

    $$\begin{aligned} t_{j\ell }^{(c+1)}= \frac{\rho _{\ell }^{(c)} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})}{\sum _{\ell '} \rho _{\ell '}^{(c)} \phi _{\ell '}(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell '}^{(c)}) } , \ell =1,\ldots ,m \end{aligned}$$

    where \(\mathbf{y}_{\cdot j}\) denotes the column \(j\) of the matrix \(\mathbf{y},\, \varvec{\alpha }_{\cdot \ell }=(\alpha _{1\ell },\ldots ,\alpha _{g\ell })\) and

    $$\begin{aligned} \phi _{\ell }(\mathbf{y}_{\cdot j};\varvec{\alpha }_{\cdot \ell }^{(c)})=\prod _{k,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _i s_{ik}^{(c+1)} y_{ij}^{h}}. \end{aligned}$$

M step Updating \(\varvec{\theta }^{(c+1)}\). Denoting \(s_{.k}^{(c+1)}=\sum _i s_{ik}^{(c+1)}\), \(t_{.\ell }^{(c+1)}=\sum _j t_{j\ell }^{(c+1)}\), it leads to

$$\begin{aligned}&\pi _k^{(c+1)}=\frac{ s_{.k}^{(c+1)}}{n},\\&\rho _{\ell }^{(c+1)}=\frac{t_{.\ell }^{(c+1)}}{d},\\&{\alpha _{k\ell }^{h}}^{(c+1)}=\frac{\sum _{i,j} s_{ik}^{(c+1)} t_{j\ell }^{(c+1)} y_{ij}^{h}}{ s_{.k}^{(c+1)} t_{.\ell }^{(c+1)}}. \end{aligned}$$

Appendix C: SEM algorithm for categorical data

E and S step

  1. 1.

    Computation of \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{z}^{(c+1)}\) according to \( p(\mathbf{z}|\mathbf{y},\mathbf{w}^{(c)};\varvec{\theta }^{(c)})\):

    $$\begin{aligned} p(z_i=k|\mathbf{y}_{i\cdot },\mathbf{w}^{(c)};\varvec{\theta }^{(c)})= \frac{\pi ^{(c)}_k \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k\cdot })}{\sum _{k'} \pi ^{(c)}_{k'} \psi ^{(c)}_{k'}(\mathbf{y}_{i\cdot };\varvec{\alpha }^{(c)}_{k'\cdot }) }, \end{aligned}$$

    for \(k=1,\ldots , g\) with

    $$\begin{aligned} \psi _k(\mathbf{y}_{i\cdot };\varvec{\alpha }_{k\cdot })=\prod _{\ell ,h} ({\alpha _{k\ell }^{h}}^{(c)})^{\sum _j w_{j\ell }^{(c)} y_{ij}^{h}} \end{aligned}$$
  2. 2.

    computation of \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\), then simulation of \(\mathbf{w}^{(c+1)}\) according to \(p(\mathbf{w}|\mathbf{y},\mathbf{z}^{(c+1)};\varvec{\theta }^{(c)})\).

M step Denoting \(z_{.k}:=\sum _i z_{ik}\) and \(w_{.\ell }:=\sum _j w_{j\ell }\), it leads to

$$\begin{aligned} \pi _k^{(c+1)}=\frac{z_{.k}^{(c+1)}}{n}, \rho _{\ell }^{(c+1)}=\frac{w_{.\ell }^{(c+1)}}{d} \end{aligned}$$

and

$$\begin{aligned} {\alpha _{k\ell }^{h}}^{(c+1)}=\frac{\sum _{i,j} z_{ik}^{(c+1)} w_{j\ell }^{(c+1)} y_{ij}^{h}}{ z_{.k}^{(c+1)} w_{.\ell }^{(c+1)}}. \end{aligned}$$

Note that the formulae of VEM and SEM-Gibbs are essentially the same, except that the probabilities \(s_{ik}\) and \(t_{j\ell }\) are replaced with binary indicator values \(z_{ik}\) and \(w_{j\ell }\).

Appendix D: Computing ICL

In this section, the exact ICL expression is derived for categorical data. Using the conditional independence of the \(\mathbf{z}\)s and the \(\mathbf{w}\)s conditionally to \(\varvec{\alpha }, \varvec{\pi }, \varvec{\rho }\), the integrated completed likelihood can be written

$$\begin{aligned} p(\mathbf{y},\mathbf{z}, \mathbf{w} )&= \int p(\mathbf{y},\mathbf{z}, \mathbf{w} , \varvec{\alpha }, \varvec{\pi }, \varvec{\rho }) p(\varvec{\alpha }) p(\varvec{\pi }) p(\varvec{\rho }) d\varvec{\alpha }d\varvec{\pi }d\varvec{\rho }\\&= \int p(\mathbf{y}|\mathbf{z}, \mathbf{w}, \varvec{\alpha }, \varvec{\pi }, \varvec{\rho }) p(\mathbf{z}, \mathbf{w} | \varvec{\alpha }, \varvec{\pi }, \varvec{\rho })\\&\times p(\varvec{\alpha }) p(\varvec{\pi }) p(\varvec{\rho })d\varvec{\alpha }d\varvec{\pi }d\varvec{\rho }\\&= \int p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) d\varvec{\alpha }\int p(\mathbf{z}| \varvec{\pi }) p(\varvec{\pi })d\varvec{\pi }\\&\times \int p(\mathbf{w}| \varvec{\rho }) p(\varvec{\rho }) d\varvec{\rho }= p(\mathbf{z} )p(\mathbf{w} )p(\mathbf{y}|\mathbf{z}, \mathbf{w} ). \end{aligned}$$

Thus

$$\begin{aligned} \text{ ICL }=\log p(\mathbf{z} )+ \log p(\mathbf{w}) + \log p(\mathbf{y}|\mathbf{z}, \mathbf{w}). \end{aligned}$$

Now, according to the LBM definition,

$$\begin{aligned} p(\mathbf{z} | \varvec{\pi })=\prod _{i,k} \pi _k^{z_{ik}}. \end{aligned}$$

Since the prior distribution of \(\varvec{\pi }\) is the Dirichlet distribution \(\mathcal{D}(a,\ldots ,a)\)

$$\begin{aligned} p(\varvec{\pi })= \frac{\Gamma (ga)}{\Gamma (a)^g} \prod _{k} \pi ^{a-1}, \end{aligned}$$

we have

$$\begin{aligned} p(\varvec{\pi }|\mathbf{z})=\frac{p(\mathbf{z}|\varvec{\pi })p(\varvec{\pi })}{\int p(\mathbf{z}|\varvec{\pi })p(\varvec{\pi }) } \propto \prod _k \pi _k^{z_{.k} +a-1}. \end{aligned}$$

We recognise a non-normalised Dirichlet distribution \(\mathcal{D}(z_{.1}+a,\ldots ,z_{.g}+a)\) with the normalising factor

$$\begin{aligned} \frac{\Gamma (n+ga)}{\prod _k\Gamma (z_{.k}+a)}. \end{aligned}$$

The expression of \(p( \mathbf{z})\) directly follows from the Bayes theorem

$$\begin{aligned} p( \mathbf{z})=\frac{\Gamma (ag)}{\Gamma (a)^g}\frac{\prod _k\Gamma (z_{.k}+a)}{\Gamma (n+ag)}. \end{aligned}$$
(8)

In the same manner,

$$\begin{aligned} p( \mathbf{w})=\frac{\Gamma (am)}{\Gamma (a)^m}\frac{\prod _\ell \Gamma (w_{.\ell }+a)}{\Gamma (d+am)}. \end{aligned}$$
(9)

We now turn to the computation of \(p(\mathbf{y}|\mathbf{z}, \mathbf{w} )\). Since \(\varvec{\alpha }\) and \((\mathbf{z}, \mathbf{w})\) are independent, we have

$$\begin{aligned} p(\varvec{\alpha }| \mathbf{y},\mathbf{z}, \mathbf{w} )=\frac{ p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) }{\int p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha }) p(\varvec{\alpha }) d\varvec{\alpha }}, \end{aligned}$$

and, using the conditional independence of \(y_{ij}\) knowing \(\mathbf{z},\,\mathbf{w}\) and \(\varvec{\alpha }\),

$$\begin{aligned} p(\mathbf{y}|\mathbf{z}, \mathbf{w} , \varvec{\alpha })&= \prod _{i,j,k,\ell } \left( \prod _h (\alpha _{k\ell }^{h})^{ y_{ij}^{h}}\right) ^{z_{ik}w_{j\ell }}\\&= \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{\sum _{i,j}z_{ik}w_{j\ell } y_{ij}^{h}}\right) \\&= \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h}\right) . \\ \end{aligned}$$

Therefore

$$\begin{aligned} p(\varvec{\alpha }| \mathbf{y},\mathbf{z}, \mathbf{w} )&\propto \prod _{k,\ell } \left( p(\alpha _{k\ell })\prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h} \right) \\&\propto \prod _{k,\ell } \left[ \left( \prod _{h}(\alpha _{k\ell }^{h})^{b-1}\right) \left( \prod _{h} (\alpha _{k\ell }^{h})^{N_{k\ell }^h} \right) \right] \\&\propto \prod _{k,\ell } \left( \prod _{h} (\alpha _{k\ell }^{h})^{b+N_{k\ell }^h-1} \right) .\\ \end{aligned}$$

because \(p(\alpha _{k\ell })\) is the density of a Dirichlet distribution \(\mathcal{D}(b,\ldots ,b)\). Each term \(k\ell \), is the density of a non-normalised Dirichlet distribution \({\mathcal {D}}(b+N_{k\ell }^1,\ldots ,b+N_{k\ell }^r)\) with the normalising factor

$$\begin{aligned} \frac{\Gamma (z_{.k}w_{.\ell }+rb)}{\prod _h \Gamma (N_{k\ell }^h +b)}. \end{aligned}$$

Thus, by the Bayes formula,

$$\begin{aligned} p(\mathbf{y}|\mathbf{z}, \mathbf{w} )= \prod _{k,\ell } \frac{\Gamma (rb)}{\Gamma (b)^r}\frac{\prod _h \Gamma (N_{k\ell }^h +b)}{\Gamma (z_{.k}w_{.\ell }+rb)}. \end{aligned}$$
(10)

And, the ICL criterion, presented in Section 4.1, is straightforwardly derived from equations (3), (8), (9) and (10).

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Keribin, C., Brault, V., Celeux, G. et al. Estimation and selection for the latent block model on categorical data. Stat Comput 25, 1201–1216 (2015). https://doi.org/10.1007/s11222-014-9472-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11222-014-9472-2

Keywords

Navigation