
Infinite Mixtures of Multivariate Normal-Inverse Gaussian Distributions for Clustering of Skewed Data

  • Published in: Journal of Classification

Abstract

Mixtures of multivariate normal inverse Gaussian (MNIG) distributions can be used to cluster data that exhibit features such as skewness and heavy tails. For cluster analysis in a traditional finite mixture model framework, the number of components either needs to be known a priori or must be estimated a posteriori, using some model selection criterion, after fitting models for a range of possible numbers of components. However, different model selection criteria can result in different numbers of components, introducing uncertainty. Here, an infinite mixture model framework, also known as a Dirichlet process mixture model, is proposed for mixtures of MNIG distributions. This approach allows the number of components to grow or decay freely from 1 to \(\infty\) (in practice, from 1 to N), and the number of components is inferred along with the parameter estimates in a Bayesian framework, thus alleviating the need for model selection criteria. We run our algorithm on simulated as well as real benchmark datasets and compare with other clustering approaches. The proposed method provides competitive results on both simulations and real data.


Data Availability

The datasets used in this manuscript are all publicly available in various R packages. The R code is available via https://github.com/yuanfang90/Infinite_MNIG_R.

References

  • Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6), 1152–1174.

  • Barndorff-Nielsen, O. E. (1997). Normal inverse Gaussian distributions and stochastic volatility modelling. Scandinavian Journal of Statistics, 24(1), 1–13.

  • Biernacki, C., Celeux, G., & Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7), 719–725.

  • Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2), 353–355.

  • Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.

  • Browne, R. P., & McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. The Canadian Journal of Statistics, 43(2), 176–198.

  • Celeux, G., Hurn, M., & Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association, 95(451), 957–970.

  • Dellaportas, P., & Papageorgiou, I. (2006). Multivariate mixtures of normals with unknown number of components. Statistics and Computing, 16(1), 57–68.

  • Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1), 1–22.

  • Diebolt, J., & Robert, C. P. (1994). Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 56(2), 363–375.

  • Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577–588.

  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209–230.

  • Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer Science & Business Media.

  • Frühwirth-Schnatter, S., & Malsiner-Walli, G. (2018). From here to infinity: Sparse finite versus Dirichlet process mixtures in model-based clustering. Advances in Data Analysis and Classification, 13, 1–32.

  • Frühwirth-Schnatter, S., & Pyne, S. (2010). Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions. Biostatistics, 11(2), 317–336.

  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). CRC Press.

  • Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

  • Görür, D., & Rasmussen, C. E. (2010). Dirichlet process Gaussian mixture models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4), 653–664.

  • Hakguder, Z., Shu, J., Liao, C., Pan, K., & Cui, J. (2018). Genome-scale microRNA target prediction through clustering with Dirichlet process mixture model. BMC Genomics, 19.

  • Hejblum, B. P., Alkhassim, C., Gottardo, R., Caron, F., & Thiébaut, R. (2019). Sequential Dirichlet process mixtures of multivariate skew t-distributions for model-based clustering of flow cytometry data. The Annals of Applied Statistics, 13(1), 638–660.

  • Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.

  • Huelsenbeck, J. P., & Andolfatto, P. (2007). Inference of population structure under a Dirichlet process model. Genetics, 175(4), 1787–1802.

  • Ishwaran, H., & James, L. F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453), 161–173.

  • Karlis, D., & Santourian, A. (2009). Model-based clustering with non-elliptically contoured distributions. Statistics and Computing, 19(1), 73–83.

  • Lartillot, N., & Philippe, H. (2004). A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Molecular Biology and Evolution, 21(6), 1095–1109.

  • Lijoi, A., Prünster, I., & Rigon, T. (2020). The Pitman-Yor multinomial process for mixture modelling. Biometrika, 107(4), 891–906.

  • Lin, T. I. (2010). Robust mixture modeling using multivariate skew t distributions. Statistics and Computing, 20, 343–356.

  • Lin, T. I., Lee, J. C., & Hsieh, W. J. (2007). Robust mixture modeling using the skew t distribution. Statistics and Computing, 17, 81–92.

  • Lin, T. I., Lee, J. C., & Yen, S. Y. (2007). Finite mixture modeling using the skew normal distribution. Statistica Sinica, 17, 909–927.

  • Lu, X., Li, Y., & Love, T. (2021). On Bayesian analysis of parsimonious Gaussian mixture models. Journal of Classification, 38(3), 576–593.

  • MacEachern, S. N., & Müller, P. (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2), 223–238.

  • Maindonald, J. H., & Braun, W. J. (2019). DAAG: Data analysis and graphics data and functions. R package version 1.22.1.

  • McNicholas, P. D. (2016). Model-based clustering. Journal of Classification, 33(3), 331–373.

  • McNicholas, S. M., McNicholas, P. D., & Browne, R. P. (2017). A mixture of variance-gamma factor analyzers. In Big and complex data analysis (pp. 369–385). Springer.

  • Medvedovic, M., & Sivaganesan, S. (2002). Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics, 18(9), 1194–1206.

  • Melnykov, V., & Maitra, R. (2010). Finite mixture models and model-based clustering. Statistics Surveys, 4, 80–116.

  • Miller, J. W., & Harrison, M. T. (2013). A simple example of Dirichlet process mixture inconsistency for the number of components. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 199–206). Curran Associates Inc.

  • Müller, P., & Mitra, R. (2013). Bayesian nonparametric inference - why and how. Bayesian Analysis, 8(2), 269–302.

  • Murray, P. M., Browne, R. P., & McNicholas, P. D. (2014). Mixtures of skew-t factor analyzers. Computational Statistics & Data Analysis, 77, 326–335.

  • Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2), 249–265.

  • O’Hagan, A., Murphy, T. B., Gormley, I. C., McNicholas, P. D., & Karlis, D. (2016). Clustering with the multivariate normal inverse Gaussian distribution. Computational Statistics & Data Analysis, 93, 18–30.

  • Onogi, A., Nurimoto, M., & Morita, M. (2011). Characterization of a Bayesian genetic clustering algorithm based on a Dirichlet process prior and comparison among Bayesian clustering methods. BMC Bioinformatics, 12, 263.

  • Protassov, R. S. (2004). EM-based maximum likelihood parameter estimation for multivariate generalized hyperbolic distributions with fixed λ. Statistics and Computing, 14(1), 67–77.

  • Pyne, S., Hu, X., Wang, K., Rossin, E., Lin, T.-I., Maier, L. M., Baecher-Allan, C., McLachlan, G. J., Tamayo, P., Hafler, D. A., De Jager, P. L., & Mesirov, J. P. (2009). Automated high-dimensional flow cytometric data analysis. Proceedings of the National Academy of Sciences, 106(27), 8519–8524.

  • Rasmussen, C. E. (2000). The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 12, 554–560.

  • Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(4), 731–792.

  • Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464.

  • Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4(2), 639–650.

  • Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4), 583–639.

  • Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4), 795–809.

  • Subedi, S., & McNicholas, P. D. (2014). Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions. Advances in Data Analysis and Classification, 8(2), 167–193.

  • Subedi, S., & McNicholas, P. D. (2021). A variational approximations-DIC rubric for parameter estimation and mixture model selection within a family setting. Journal of Classification, 38(1), 89–108.

  • Sun, J., Herazo-Maya, J., Kaminski, N., Zhao, H., & Warren, J. (2016). A Dirichlet process mixture model for clustering longitudinal gene expression data. Statistics in Medicine, 36.

  • Titterington, D. M., Smith, A. F., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. Wiley.

  • Tortora, C., ElSherbiny, A., Browne, R. P., Franczak, B. C., & McNicholas, P. D. (2018). MixGHD: Model based clustering, classification and discriminant analysis using the mixture of generalized hyperbolic distributions. R package version 2.2.

  • Tortora, C., Franczak, B. C., Browne, R. P., & McNicholas, P. D. (2019). A mixture of coalesced generalized hyperbolic distributions. Journal of Classification, 36(1), 26–57.

  • Venables, W. N., & Ripley, B. D. (2002). Modern applied statistics with S (4th ed.). Springer.

  • Vrbik, I., & McNicholas, P. (2012). Analytic calculations for the EM algorithm for multivariate skew-t mixture models. Statistics & Probability Letters, 82(6), 1169–1174.

  • Wei, X., & Li, C. (2012). The infinite Student's t-mixture for robust modeling. Signal Processing, 92(1), 224–234.

  • Wei, Y., Tang, Y., & McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130, 18–41.

  • West, M. (1992). Hyperparameter estimation in Dirichlet process mixture models. Technical report, Institute of Statistics and Decision Sciences, Duke University, Durham, NC.

  • Windham, M. P., & Cutler, A. (1992). Information ratios for validating mixture analyses. Journal of the American Statistical Association, 87(420), 1188–1192.

  • Yang, C.-Y., Ho, N., & Jordan, M. I. (2019). Posterior distribution for the number of clusters in Dirichlet process mixture models. arXiv:1905.09959.


Acknowledgements

This work was supported by the Collaboration Grants for Mathematicians from the Simons Foundation, the Discovery Grant from the Natural Sciences and Engineering Research Council of Canada and the Canada Research Chair Program.

Author Information

Corresponding author: Yuan Fang.

Ethics declarations

Conflict of Interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Mathematical Details for Posterior Distributions

1.1 Details on the Posterior Updates for the Parameters

Under the Karlis and Santourian (2009) parameterization of the MNIG distribution, the mean-variance mixture has the following structure:

$$\begin{aligned} \mathbf {X}|u \sim \mathrm {N}({\varvec{\mu }}+u{\varvec{\beta }},u{\varvec{\Sigma }}),\quad U\sim \text {IG}(1,\gamma ), \end{aligned}$$
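This mixture representation also gives a direct way to simulate MNIG draws: sample \(U\sim \text {IG}(1,\gamma )\), then sample \(\mathbf {X}|U=u\) from the conditional normal. A minimal Python sketch (the helper name is illustrative; it assumes SciPy's `invgauss`, in which IG with mean \(m\) and shape \(\lambda\) is `invgauss(mu=m/lam, scale=lam)`, so \(\text {IG}(1,\gamma )\) has mean \(1/\gamma\) and shape 1):

```python
import numpy as np
from scipy.stats import invgauss

def rmnig(n, mu, beta, Sigma, gamma, seed=None):
    """Simulate n MNIG draws via the mixture:
    U ~ IG(1, gamma), X | U = u ~ N(mu + u*beta, u*Sigma)."""
    rng = np.random.default_rng(seed)
    # IG(delta=1, gamma) has mean 1/gamma and shape delta^2 = 1.
    u = invgauss.rvs(mu=1.0 / gamma, scale=1.0, size=n, random_state=rng)
    L = np.linalg.cholesky(Sigma)                       # Sigma = L L'
    z = rng.standard_normal((n, len(mu)))
    return mu + u[:, None] * beta + np.sqrt(u)[:, None] * (z @ L.T)

x = rmnig(5000, mu=np.zeros(2), beta=np.array([1.0, -0.5]),
          Sigma=np.eye(2), gamma=2.0, seed=1)
# E[X] = mu + E[U] * beta = beta / gamma here, i.e. (0.5, -0.25)
```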

where \({\varvec{\Sigma }}\) is not restricted. The density of the MNIG distribution has the form:

$$\begin{aligned} f_{\mathbf {X}}(\mathbf {x}) = \frac{1}{2^{\frac{d-1}{2}}}\left[ \frac{\alpha }{\pi q(\mathbf {x})}\right] ^{\frac{d+1}{2}}\exp \left( {p(\mathbf {x})}\right) ~K_{\frac{d+1}{2}}(\alpha q(\mathbf {x})), \end{aligned}$$

where

$$\begin{aligned} \alpha = \sqrt{\gamma ^{2} + {\varvec{\beta }}^{\top }{\varvec{\Sigma }}^{-1}{\varvec{\beta }}},\quad p(\mathbf {x}) = \gamma + (\mathbf {x} - {\varvec{\mu }})^{\top }{\varvec{\Sigma }}^{-1}{\varvec{\beta }},\quad q(\mathbf {x}) = \sqrt{1 + (\mathbf {x}-{\varvec{\mu }})^{\top } {\varvec{\Sigma }}^{-1}(\mathbf {x}-{\varvec{\mu }})}, \end{aligned}$$

and \(K_{d}\) is the modified Bessel function of the third kind of order d.
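For reference, the density above can be evaluated directly, with `scipy.special.kv` supplying \(K_{(d+1)/2}\). This is a sketch that follows the displayed formula term by term (names are illustrative):

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the third kind, K_nu

def dmnig(x, mu, beta, Sigma, gamma):
    """Evaluate the MNIG density f(x) displayed above at a single point x."""
    d = len(mu)
    Sinv = np.linalg.inv(Sigma)
    dev = x - mu
    alpha = np.sqrt(gamma**2 + beta @ Sinv @ beta)
    p = gamma + dev @ Sinv @ beta
    q = np.sqrt(1.0 + dev @ Sinv @ dev)
    return (2.0 ** (-(d - 1) / 2) * (alpha / (np.pi * q)) ** ((d + 1) / 2)
            * np.exp(p) * kv((d + 1) / 2, alpha * q))

# Sanity check against the univariate symmetric NIG with mu=0, beta=0,
# Sigma=1, gamma=1: f(0) = e * K_1(1) / pi, approximately 0.5208.
f0 = dmnig(np.zeros(1), np.zeros(1), np.zeros(1), np.eye(1), 1.0)
```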

The joint probability density of \(\mathbf {x}\) and u is given by

$$\begin{aligned} f(\mathbf {x},u)= & {} f(\mathbf {x}|u)f(u)\\= & {} (2\pi )^{-1/2}|u{\varvec{\Sigma }}|^{-1/2}\exp \left\{ -\frac{1}{2} (\mathbf {x} - {\varvec{\mu }} - u{\varvec{\beta }})^{\top }(u{\varvec{\Sigma }})^{-1}(\mathbf {x} - {\varvec{\mu }} - u{\varvec{\beta }}) \right\} \\\times & {} \frac{1}{\sqrt{2\pi }}\exp (\gamma )u^{-3/2}\exp \left\{ -\frac{1}{2}\left( \frac{1}{u}+\gamma ^{2}u\right) \right\} \\\propto & {} u^{-\frac{d+3}{2}}|{\varvec{\Sigma }}|^{-1/2}\exp \left\{ -\frac{1}{2}\left( \frac{1 }{u}+\gamma ^{2}u-2\gamma \right) -\frac{1}{2} (\mathbf {x} - {\varvec{\mu }} - u{\varvec{\beta }})^{\top }(u{\varvec{\Sigma }})^{-1}(\mathbf {x} - {\varvec{\mu }} - u\varvec{\beta })\right\} . \end{aligned}$$

The complete-data likelihood is of the form:

$$\begin{aligned} L(\theta _{g}) \propto&\prod \limits _{g=1}^{G}\prod \limits _{i=1}^{N}\left[ \pi _{g} u_{ig}^{-\frac{d+3}{2}}|{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}} \exp \left\{ -\frac{1}{2}\left( u_{ig}^{-1}+\gamma _{g}^{2}u_{ig}-2\gamma _{g}\right) \right\} \right. \\&\left. \times \exp \left\{ -\frac{1}{2}\left( \mathbf {x}_{i} - {\varvec{\mu }}_{g} - u_{ig}{\varvec{\beta }}_{g}\right) ^{\top }(u_{ig}{\varvec{\Sigma }}_{g})^{-1}\left( \mathbf {x}_{i} - {\varvec{\mu }}_{g} - u_{ig}{\varvec{\beta }}_{g}\right) \right\} \right] \\= & {} \prod \limits _{g = 1}^{G}\left[ \pi _{g}|{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}}\exp \left\{ \gamma _{g}-{\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}\right\} \right] ^{\sum _{i = 1}^{N}{z_{ig}}}\times \prod \limits _{g = 1}^{G}\prod \limits _{i=1}^{N}\left( u_{ig}^{-\frac{d+3}{2}}\right) ^{z_{ig}}\\&\times \prod \limits _{g=1}^{G}\exp \left\{ -\frac{1}{2}\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\mathbf {x}_{i}^{\top }{\varvec{\Sigma }}_{g}^{-1}\mathbf {x}_{i}}+{\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}\sum \limits _{i = 1}^{N}{z_{ig}\mathbf {x}_{i}} + {\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\mathbf {x}_{i}}\right. \\&\left. -\frac{1}{2}\left( {\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}+{\gamma _{g}^{2}}\right) \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}} - \frac{1}{2}\left( \varvec{\mu }_{g}^{\top }\varvec{\Sigma }_{g}^{-1}\varvec{\mu }_{g}+1\right) \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}}\right\} \\= & {} \prod \limits _{g = 1}^{G}\left\{ [r(\theta _{g})]^{t_{0g}}\cdot \prod \limits _{i = 1}^{N}[h(\mathbf {x}_{i},u_{ig})]^{z_{ig}}\times \exp \left( \text {tr}\left\{ \sum \limits _{j = 1}^{5}\phi _{j}(\theta _{g})\mathbf {t}_{jg}(\mathbf {x},\mathbf {u}_{g})\right\} \right) \right\} .\\ \end{aligned}$$

Then, conjugate priors and the corresponding group-specific posterior distributions of each parameter, \(\gamma _{g}, {\varvec{\mu }}_{g}, {\varvec{\beta }}_{g}, {\varvec{\Sigma }}_{g}\), can be derived as follows.

  • Focusing on \(\gamma _{g}\), the part of the likelihood that contains \(\gamma _{g}\) is

    $$\begin{aligned} L(\gamma _{g})\propto & {} \left[ \exp \{\gamma _{g}\}\right] ^{\sum _{i = 1}^{N}{z_{ig}}}\times \exp \left\{ -\frac{1}{2}\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}}{\gamma _{g}^{2}}\right\} \\= & {} \exp \left\{ -\frac{1}{2}\left( \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}}{\gamma _{g}^{2}}-2\sum \limits _{i = 1}^{N}{z_{ig}}\gamma _{g}\right) \right\} =\exp \left\{ -\frac{1}{2}\left( a_{3}{\gamma _{g}^{2}}-2a_{0}\gamma _{g}\right) \right\} . \end{aligned}$$

    This is the functional form of a normal distribution with mean \(\frac{a_{0}}{a_{3}}\) and variance \(\frac{1}{a_{3}}\), truncated at 0 because \(\gamma _{g}\) must be positive. Therefore, a conjugate truncated normal prior is assigned to \(\gamma _{g}\), i.e.

    $$\begin{aligned} \gamma _{g} \sim \mathrm {N} \left( \frac{a_{0}^{(0)}}{a_{3}^{(0)}},\frac{1}{a_{3}^{(0)}}\right) \cdot \mathbf {1}\left( \gamma _{g} > 0\right) , \end{aligned}$$

    and the resulting posterior is truncated normal as well:

    $$\begin{aligned} \gamma _{g} \sim \mathrm {N}\left( \frac{a_{0,g}}{a_{3,g}},\frac{1}{a_{3,g}}\right) \cdot \mathbf {1}\left( \gamma _{g} > 0\right) . \end{aligned}$$
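In a Gibbs sampler, this update is a single truncated-normal draw. A sketch in Python (hypothetical helper built on `scipy.stats.truncnorm`, which takes its truncation bounds in standardized units):

```python
import numpy as np
from scipy.stats import truncnorm

def draw_gamma_g(a0_g, a3_g, rng=None):
    """Draw gamma_g from N(a0_g / a3_g, 1 / a3_g) truncated to (0, inf)."""
    m, sd = a0_g / a3_g, 1.0 / np.sqrt(a3_g)
    lo = (0.0 - m) / sd          # lower truncation point in standard units
    return truncnorm.rvs(lo, np.inf, loc=m, scale=sd, random_state=rng)

g = draw_gamma_g(a0_g=50.0, a3_g=40.0, rng=np.random.default_rng(0))
```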
  • For the precision matrix \({\varvec{\Sigma }}_{g}^{-1}\), denoted \(\mathbf {T}_{g}\) in the following derivation, the part of the likelihood that contains \(\mathbf {T}_{g}={\varvec{\Sigma }}_{g}^{-1}\) is:

    $$\begin{aligned} L(\mathbf {T}_{g})&\propto |\mathbf {T}_{g}|^{\sum _{i = 1}^{N}{z_{ig}}/2}\cdot \exp \left\{ -\frac{1}{2} \text {tr}\left[ \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\left( \mathbf {x}_{i}-{\varvec{\mu }}_{g}-u_{ig}{\varvec{\beta }}_{g}\right) \left( \mathbf {x}_{i}-{\varvec{\mu }}_{g}-u_{ig}{\varvec{\beta }}_{g}\right) ^{\top }}\mathbf {T}_{g}\right] \right\} , \end{aligned}$$

    which is the functional form of a Wishart distribution. Hence, a conjugate \(\text {Wishart}\left( a_{0}^{(0)},{\mathbf {a}_{5}^{(0)}}^{-1}\right)\) prior is assigned to \(\mathbf {T}_{g}\).

  • Next, consider the pair \(({\varvec{\mu }}_{g},{\varvec{\beta }}_{g})\); the related part of the likelihood, conditional on \(\mathbf {T}_{g} = {\varvec{\Sigma }}_{g}^{-1}\), has the functional form of a correlated Gaussian distribution:

    $$\begin{aligned} L({\varvec{\mu }}_{g},{\varvec{\beta }}_{g}|\mathbf {T}_{g})\propto & {} \exp \left\{ \sum \limits _{i = 1}^{N}{z_{ig}\left( -{\varvec{\beta }}_{g}^{\top } \mathbf {T}_{g} {\varvec{\mu }}_{g}\right) }\right. \\&\left. -\frac{1}{2} \left[ \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\left( \mathbf {x}_{i}-{\varvec{\mu }}_{g}-u_{ig}{\varvec{\beta }}_{g}\right) ^{\top } \mathbf {T}_{g}\left( \mathbf {x}_{i}-\varvec{\mu }_{g}-u_{ig}\varvec{\beta }_{g}\right) }\right] \right\} \\\propto & {} \exp \left\{ {\varvec{\beta }}_{g}^{\top } \mathbf {T}_{g}{\varvec{\mu }}_{g}\sum \limits _{i = 1}^{N}{z_{ig}}+{\varvec{\beta }}_{g}^{\top } \mathbf {T}_{g}\sum \limits _{i = 1}^{N}{z_{ig}\mathbf {x}_{i}} + {\varvec{\mu }}_{g}^{\top } \mathbf {T}_{g}\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\mathbf {x}_{i}}\right. \\&\left. -\frac{1}{2}\left( {\varvec{\beta }}_{g}^{\top } \mathbf {T}_{g}{\varvec{\beta }}_{g}\right) \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}} - \frac{1}{2}\left( {\varvec{\mu }}_{g}^{\top } \mathbf {T}_{g}{\varvec{\mu }}_{g}\right) \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}}\right\} . \end{aligned}$$

    Therefore, a multivariate normal distribution conditional on the precision matrix is assigned to \(({\varvec{\mu }}_{g},{\varvec{\beta }}_{g})\) such that

    $$\begin{aligned} \left. \left( \begin{array}{cc} {\varvec{\mu }}_{g}\\ {\varvec{\beta }}_{g}\end{array}\right) \right| \mathbf {T}_{g}\sim \mathrm {N}\left[ \left( \begin{array}{cc} {\varvec{\mu }}_{0}^{(0)}\\ {\varvec{\beta }}_{0}^{(0)}\end{array}\right) ,\left( \begin{array}{cc} \tau _{\mu }^{(0)} \mathbf {T}_{g} &{} \tau _{\mu \beta }^{(0)} \mathbf {T}_{g}\\ \tau _{\mu \beta }^{(0)} \mathbf {T}_{g} &{} \tau _{\beta }^{(0)} \mathbf {T}_{g} \end{array}\right) \right] , \end{aligned}$$

    where

    $$\begin{aligned} {\varvec{\mu }}_{0}^{(0)} = \frac{a_{3}^{(0)}\mathbf {a}_{2}^{(0)} - a_{0}^{(0)}\mathbf {a}_{1}^{(0)}}{a_{3}^{(0)}a_{4}^{(0)}-{a_{0}^{(0)}}^{2}},\quad {\varvec{\beta }}_{0}^{(0)} = \frac{a_{4}^{(0)}\mathbf {a}_{1}^{(0)} - a_{0}^{(0)}\mathbf {a}_{2}^{(0)}}{a_{3}^{(0)}a_{4}^{(0)}-{a_{0}^{(0)}}^{2}}, \quad \tau _{\mu }^{(0)} = a_{4}^{(0)},\quad \tau _{\beta }^{(0)} = a_{3}^{(0)}, \quad \tau _{\mu \beta }^{(0)} = a_{0}^{(0)}. \end{aligned}$$

    The joint prior density of \({\varvec{\mu }}_{g},{\varvec{\beta }}_{g},\) and \(\mathbf {T}_{g}\) is as follows:

    $$\begin{aligned} p({\varvec{\mu }}_{g},{\varvec{\beta }}_{g},\mathbf {T}_{g})\propto & {} \left| \begin{array}{cc} \tau _{\mu }^{(0)} \mathbf {T}_{g} &{} \tau _{\mu \beta }^{(0)} \mathbf {T}_{g}\\ \tau _{\mu \beta }^{(0)} \mathbf {T}_{g} &{} \tau _{\beta }^{(0)} \mathbf {T}_{g} \end{array}\right| ^{-\frac{1}{2}}\exp \left\{ -\frac{1}{2}\left( \begin{array}{cc} {\varvec{\mu }}_{g}-{\varvec{\mu }}_{0}^{(0)}\\ {\varvec{\beta }}_{g}-{\varvec{\beta }}_{0}^{(0)}\end{array}\right) ^{\top }\left( \begin{array}{cc} \tau _{\mu }^{(0)} \mathbf {T}_{g} &{} \tau _{\mu \beta }^{(0)} \mathbf {T}_{g}\\ \tau _{\mu \beta }^{(0)} \mathbf {T}_{g} &{} \tau _{\beta }^{(0)} \mathbf {T}_{g} \end{array}\right) \left( \begin{array}{cc} {\varvec{\mu }}_{g}-{\varvec{\mu }}_{0}^{(0)}\\ {\varvec{\beta }}_{g}-{\varvec{\beta }}_{0}^{(0)}\end{array}\right) \right\} \\&\times \left| \mathbf {T}_{g}\right| ^{\frac{(a_{0}^{(0)}+d+1)-d-1}{2}}\exp \left\{ -\frac{1}{2} \text {tr}\left( \mathbf {a}_{5}^{(0)}\mathbf {T}_{g}\right) \right\} . \end{aligned}$$

    Hence the posterior is

    $$\begin{aligned} \left. \left( \begin{array}{c} {\varvec{\mu }}_{g}\\ {\varvec{\beta }}_{g}\end{array}\right) \right| \mathbf {T}_{g}\sim \mathrm {N}\left[ \left( \begin{array}{cc} {\varvec{\mu }}_{0,g}\\ {\varvec{\beta }}_{0,g}\end{array}\right) ,\left( \begin{array}{cc} \tau _{\mu ,g} \mathbf {T}_{g} &{} \tau _{\mu \beta ,g} \mathbf {T}_{g}\\ \tau _{\mu \beta ,g} \mathbf {T}_{g} &{} \tau _{\beta ,g} \mathbf {T}_{g} \end{array}\right) \right] , \end{aligned}$$

    where

    $$\begin{aligned} {\varvec{\mu }}_{0,g}&= \frac{a_{3,g}\mathbf {a}_{2,g} - a_{0,g}\mathbf {a}_{1,g}}{a_{3,g}a_{4,g}-{a_{0,g}}^{2}};&{\varvec{\beta }}_{0,g}= \frac{a_{4,g}\mathbf {a}_{1,g} - a_{0,g}\mathbf {a}_{2,g}}{a_{3,g}a_{4,g}-{a_{0,g}}^{2}}; \end{aligned}$$

    and

    $$\begin{aligned} \tau _{\mu ,g}= & {} \tau _{\mu }^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}} = a_{4}^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}} = a_{4,g};\\ \tau _{\beta ,g}= & {} \tau _{\beta }^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}} = a_{3}^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}u_{ig}} = a_{3,g};\\ \tau _{\mu \beta ,g}= & {} \tau _{\mu \beta }^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}} = a_{0}^{(0)}+\sum \limits _{i = 1}^{N}{z_{ig}} = a_{0,g}. \end{aligned}$$
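A sketch of this conditional draw in Python, under the assumption (not stated explicitly above) that the displayed block matrix of \(\tau\)'s and \(\mathbf {T}_{g}\) acts as the joint precision, since the \(\tau\)'s accumulate with the sample size; names are illustrative:

```python
import numpy as np

def draw_mu_beta(mu0, beta0, T, tau_mu, tau_beta, tau_mubeta, rng=None):
    """Draw (mu_g, beta_g) | T_g jointly, treating the block matrix
    [[tau_mu*T, tau_mubeta*T], [tau_mubeta*T, tau_beta*T]] as the joint
    PRECISION (assumption; requires tau_mu * tau_beta > tau_mubeta**2)."""
    rng = np.random.default_rng(rng)
    d = len(mu0)
    prec = np.block([[tau_mu * T, tau_mubeta * T],
                     [tau_mubeta * T, tau_beta * T]])
    draw = rng.multivariate_normal(np.concatenate([mu0, beta0]),
                                   np.linalg.inv(prec))
    return draw[:d], draw[d:]

mu_g, beta_g = draw_mu_beta(np.zeros(2), np.zeros(2), np.eye(2),
                            tau_mu=5.0, tau_beta=4.0, tau_mubeta=2.0, rng=0)
```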

    The resulting posterior distribution of \(\mathbf {T}_{g}\) conditional on \(({\varvec{\mu }}_{g},{\varvec{\beta }}_{g})\) is of the form

    $$\begin{aligned} \mathbf {T}_{g}|({\varvec{\mu }}_{g},{\varvec{\beta }}_{g}) \sim \text {Wishart}(a_{0,g},\mathbf {V}_{0,g}), \end{aligned}$$

    where \(\mathbf {V}_{0,g}^{-1}\) has the form:

    $$\begin{aligned} \mathbf {V}_{0,g}^{-1}= & {} \mathbf {a}_{5}^{(0)} + \sum _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\mathbf {x}_{i}\mathbf {x}_{i}^{\top }}\\&+{\varvec{\mu }}_{0}^{(0)}\tau _{\mu }^{(0)}{{\varvec{\mu }}_{0}^{(0)}}^{\top }+{\varvec{\mu }}_{0}^{(0)}\tau _{\mu \beta }^{(0)}{{\varvec{\beta }}_{0}^{(0)}}^{\top }+{\varvec{\beta }}_{0}^{(0)}\tau _{\mu \beta }^{(0)}{{\varvec{\mu }}_{0}^{(0)}}^{\top }+{\varvec{\beta }}_{0}^{(0)}\tau _{\beta }^{(0)}{{\varvec{\beta }}_{0}^{(0)}}^{\top }\\&-\left( {\varvec{\mu }}_{0,g}\tau _{\mu ,g}{\varvec{\mu }}_{0,g}^{\top }+{\varvec{\mu }}_{0,g}\tau _{\mu \beta ,g}{\varvec{\beta }}_{0,g}^{\top }+{\varvec{\beta }}_{0,g}\tau _{\mu \beta ,g}{\varvec{\mu }}_{0,g}^{\top }+{\varvec{\beta }}_{0,g}\tau _{\beta ,g}{\varvec{\beta }}_{0,g}^{\top }\right) \\= & {} \mathbf {a}_{5,g}+{\varvec{\mu }}_{0}^{(0)}\tau _{\mu }^{(0)}{{\varvec{\mu }}_{0}^{(0)}}^{\top }+{\varvec{\mu }}_{0}^{(0)}\tau _{\mu \beta }^{(0)}{{\varvec{\beta }}_{0}^{(0)}}^{\top }+{\varvec{\beta }}_{0}^{(0)}\tau _{\mu \beta }^{(0)}{{\varvec{\mu }}_{0}^{(0)}}^{\top }+{\varvec{\beta }}_{0}^{(0)}\tau _{\beta }^{(0)}{{\varvec{\beta }}_{0}^{(0)}}^{\top }\\&-\left( {\varvec{\mu }}_{0,g}\tau _{\mu ,g}{\varvec{\mu }}_{0,g}^{\top }+{\varvec{\mu }}_{0,g}\tau _{\mu \beta ,g}{\varvec{\beta }}_{0,g}^{\top }+{\varvec{\beta }}_{0,g}\tau _{\mu \beta ,g}{\varvec{\mu }}_{0,g}^{\top }+{\varvec{\beta }}_{0,g}\tau _{\beta ,g}{\varvec{\beta }}_{0,g}^{\top }\right) .\\ \end{aligned}$$
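The precision update is then a single Wishart draw. A Python sketch (hypothetical helper using `scipy.stats.wishart`; the degrees of freedom must be at least the dimension):

```python
import numpy as np
from scipy.stats import wishart

def draw_T_g(a0_g, V0inv_g, rng=None):
    """Draw T_g ~ Wishart(df = a0_g, scale = V_{0,g}), where V0inv_g
    is the matrix V_{0,g}^{-1} assembled as above."""
    V0 = np.linalg.inv(V0inv_g)
    return wishart.rvs(df=a0_g, scale=V0, random_state=rng)

T = draw_T_g(a0_g=10.0, V0inv_g=10.0 * np.eye(2),
             rng=np.random.default_rng(0))
# E[T_g] = a0_g * V_{0,g} = identity in this toy call
```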

1.2 Details on Posterior Updates for Hyperparameters

1.2.1 Derivation of the Posteriors

There are six hyperparameters, \(a_{0}^{(0)}, \mathbf {a}_{1}^{(0)}, \mathbf {a}_{2}^{(0)}, a_{3}^{(0)}, a_{4}^{(0)},\) and \(\mathbf {a}_{5}^{(0)}\), common to all groups, that define the prior distributions of the parameters. Another layer of prior distributions can be placed on these hyperparameters for additional flexibility. Again, conjugate prior distributions are assigned to these hyperparameters, yielding posteriors that depend on the group-specific observations. Details of the derivation are given below.

Again, recall that the complete-data likelihood can be written in exponential-family form:

$$\begin{aligned} L(\theta _{g})\propto \prod \limits _{g = 1}^{G}\left\{ [r(\theta _{g})]^{t_{0g}}\cdot \prod \limits _{i = 1}^{N}[h(\mathbf {x}_{i},u_{ig})]^{z_{ig}}\times \exp \left( \text {tr}\left\{ \sum \limits _{j = 1}^{5}\phi _{j}(\theta _{g})\mathbf {t}_{jg}(\mathbf {x},\mathbf {u}_{g})\right\} \right) \right\} \end{aligned}$$

where

$$\begin{aligned} r({\varvec{\theta }}_{g})= & {} \pi _{g}|{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}}\exp \left\{ \gamma _{g} - {\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\right\} ,\quad \quad \quad t_{0g} = \sum \limits _{i = 1}^{N}{z_{ig}};\\ \phi _{1}({\varvec{\theta }}_{g})= & {} {\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}, \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbf {t}_{1g} (\mathbf {x},\mathbf {u}_{g})= \sum \limits _{i = 1}^{N} z_{ig}\mathbf {x}_{i}^{\top };\\ \phi _{2}({\varvec{\theta }}_{g})= & {} {\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}, \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \mathbf {t}_{2g} (\mathbf {x},\mathbf {u}_{g})= \sum \limits _{i = 1}^{N} z_{ig}u_{ig}^{-1}\mathbf {x}_{i}^{\top };\\ \phi _{3}({\varvec{\theta }}_{g})= & {} -\frac{1}{2} \left( {\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}{\varvec{\beta }}_{g}^{\top } + \frac{{\gamma _{g}^{2}}}{d}\mathbb {I}_{d}\right) ,\quad \quad \quad \quad t_{3g}(\mathbf {x},\mathbf {u}_{g}) = \sum \limits _{i = 1}^{N} z_{ig}u_{ig};\\ \phi _{4}({\varvec{\theta }}_{g})= & {} -\frac{1}{2} \left( {\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}{\varvec{\mu }}_{g}^{\top } + \frac{1}{d}\mathbb {I}_{d}\right) ,\quad \quad \quad \quad \ t_{4g} (\mathbf {x},\mathbf {u}_{g}) = \sum \limits _{i = 1}^{N} z_{ig}u_{ig}^{-1};\\ \phi _{5}({\varvec{\theta }}_{g})= & {} -\frac{1}{2} {\varvec{\Sigma }}_{g}^{-1}, \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \ \mathbf {t}_{5g} (\mathbf {x},\mathbf {u}_{g}) = \sum \limits _{i = 1}^{N} z_{ig} u_{ig}^{-1}\mathbf {x}_{i}\mathbf {x}_{i}^{\top }. \end{aligned}$$
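The statistics \(t_{0g}, \mathbf {t}_{1g}, \dots , \mathbf {t}_{5g}\) are weighted sums over the observations and can be computed in one vectorised pass. A Python sketch with illustrative array names:

```python
import numpy as np

def suff_stats(x, z, u):
    """Per-group sufficient statistics from the exponential-family form.
    x: (N, d) data; z: (N, G) hard indicators; u: (N, G) latent IG draws."""
    t0 = z.sum(axis=0)                            # sum_i z_ig
    t1 = z.T @ x                                  # sum_i z_ig x_i^T
    t2 = (z / u).T @ x                            # sum_i z_ig x_i^T / u_ig
    t3 = (z * u).sum(axis=0)                      # sum_i z_ig u_ig
    t4 = (z / u).sum(axis=0)                      # sum_i z_ig / u_ig
    t5 = np.einsum('ig,id,ie->gde', z / u, x, x)  # sum_i z_ig x_i x_i^T / u_ig
    return t0, t1, t2, t3, t4, t5

x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z, u = np.ones((3, 1)), np.ones((3, 1))
t0, t1, t2, t3, t4, t5 = suff_stats(x, z, u)
```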
  • \(a_{0}^{(0)}\) is associated with \(t_{0g}\), which relates only to \(r({\varvec{\theta }}_{g})\) in the complete-data likelihood and has the functional form of an exponential density. Therefore, an exponential prior with rate parameter \(b_{0}\) is assigned to \(a_{0}^{(0)}\):

    $$\begin{aligned} a_{0}^{(0)} \sim \text {Exp}(b_{0}),\quad p\left( a_{0}^{(0)}\right) = b_{0}\exp \left( -b_{0}a_{0}^{(0)}\right) ; \end{aligned}$$

    hence the posterior is an exponential distribution as well with rate parameter

    $$\begin{aligned} b_{0}-\sum \limits _{g=1}^{G}\log (\pi _{g})-\sum \limits _{g=1}^{G}\log \left( |{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}}\right) -\sum \limits _{g=1}^{G}\gamma _{g}+\sum \limits _{g=1}^{G}{\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}. \end{aligned}$$
    (12)
    $$\begin{aligned} p\left( a_{0}^{(0)}|r(\theta _{1}),\dots ,r(\theta _{G})\right)\propto & {} \exp \left\{ -b_{0}a_{0}^{(0)}\right\} \prod \limits _{g=1}^{G}\left[ r(\theta _{g})\right] ^{a_{0}^{(0)}}\\= & {} \exp \left\{ -b_{0}a_{0}^{(0)}\right\} \exp \left\{ \sum \limits _{g=1}^{G}\left[ \log (\pi _{g})+\log \left( |{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}}\right) +\gamma _{g}-{\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\right] a_{0}^{(0)}\right\} \\= & {} \exp \left\{ -\left[ b_{0}-\sum \limits _{g=1}^{G}\log (\pi _{g})-\sum \limits _{g=1}^{G}\log \left( |{\varvec{\Sigma }}_{g}|^{-\frac{1}{2}}\right) -\sum \limits _{g=1}^{G}\gamma _{g}+\sum \limits _{g=1}^{G}{\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\right] a_{0}^{(0)}\right\} . \end{aligned}$$
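For intuition, this update can be sketched as a single Gibbs step. The sketch below is not the authors' implementation: the function and argument names (`sample_a0`, `Sigma_inv_beta`, and so on) are hypothetical, and the per-component quantities are assumed to be precomputed from the current sweep.

```python
import numpy as np

def sample_a0(rng, b0, pi, gamma, mu, Sigma_inv_beta, log_det_Sigma):
    """Draw a_0^(0) from its exponential posterior.

    pi[g], gamma[g]   : current mixing weight and gamma_g
    mu[g]             : current component mean vector
    Sigma_inv_beta[g] : precomputed Sigma_g^{-1} beta_g
    log_det_Sigma[g]  : log |Sigma_g|
    The posterior rate is Eq. (12); note that
    -log(|Sigma_g|^{-1/2}) = +0.5 * log|Sigma_g|.
    """
    rate = (b0
            - np.sum(np.log(pi))
            + 0.5 * np.sum(log_det_Sigma)
            - np.sum(gamma)
            + sum(m @ sib for m, sib in zip(mu, Sigma_inv_beta)))
    return rng.exponential(scale=1.0 / rate)
```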
  • \(\mathbf {a}_{1}^{(0)}\) is associated with \(\mathbf {t}_{1g}\), which relates only to \(\phi _{1}(\theta _{g})\) in the likelihood and has the functional form of a multivariate Gaussian. A multivariate Gaussian prior is therefore assigned to \(\mathbf {a}_{1}^{(0)}\), yielding a multivariate Gaussian posterior:

    $$\begin{aligned} \mathbf {a}_{1}^{(0)} \sim \mathrm {N}(\mathbf {c}_{1},B_{1}),\quad p\left( \mathbf {a}_{1}^{(0)}\right) \propto \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) ^{\top } B_{1}^{-1}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) \right\} ; \end{aligned}$$
    $$\begin{aligned} p\left( \mathbf {a}_{1}^{(0)}|\phi _{1}(\theta _{1}),\dots ,\phi _{1}(\theta _{G})\right)\propto & {} \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) ^{\top } B_{1}^{-1}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) \right\} \prod \limits _{g=1}^{G}\exp \left\{ \phi _{1}(\theta _{g})^{\top }\mathbf {a}_{1}^{(0)}\right\} \\= & {} \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) ^{\top } B_{1}^{-1}\left( \mathbf {a}_{1}^{(0)}-\mathbf {c}_{1}\right) +\sum \limits _{g=1}^{G}{\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}\mathbf {a}_{1}^{(0)}\right\} \\\propto & {} \exp \left\{ -\frac{1}{2}\left[ \mathbf {a}_{1}^{(0)} - \left( \mathbf {c}_{1}+B_{1}\sum \limits _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\right) \right] ^{\top } B_{1}^{-1}\left[ \mathbf {a}_{1}^{(0)} - \left( \mathbf {c}_{1}+B_{1}\sum \limits _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\right) \right] \right\} . \end{aligned}$$

    The mean and covariance matrix of the posterior are \(\mathbf {c}_{1}+B_{1}\sum _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}\) and \(B_{1}\), respectively.
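The same conjugate step can be sketched numerically; again, the names are hypothetical and the component quantities are assumed to come from the current Gibbs sweep.

```python
import numpy as np

def sample_a1(rng, c1, B1, Sigma_invs, betas):
    """Draw a_1^(0) from its multivariate Gaussian posterior.

    Sigma_invs[g] : Sigma_g^{-1};  betas[g] : beta_g.
    The posterior covariance is B1; the posterior mean is
    c1 + B1 @ sum_g Sigma_g^{-1} beta_g.
    """
    shift = sum(S @ b for S, b in zip(Sigma_invs, betas))
    return rng.multivariate_normal(c1 + B1 @ shift, B1)
```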

  • Similar to \(\mathbf {a}_{1}^{(0)}\), \(\mathbf {a}_{2}^{(0)}\) is associated with \(\mathbf {t}_{2g}\), which also has the functional form of a multivariate Gaussian and relates only to \(\phi _{2}(\theta _{g})\) in the likelihood. A multivariate Gaussian prior is assigned to \(\mathbf {a}_{2}^{(0)}\) and results in a multivariate Gaussian posterior with mean \(\mathbf {c}_{2}+B_{2}\sum _{g=1}^{G}\varvec{\Sigma }_{g}^{-1}\varvec{\mu }_{g}\) and covariance \(B_{2}\).

    $$\begin{aligned} \mathbf {a}_{2}^{(0)} \sim \mathrm {N}(\mathbf {c}_{2},B_{2}),\quad p\left( \mathbf {a}_{2}^{(0)}\right) \propto \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) ^{\top } B_{2}^{-1}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) \right\} ; \end{aligned}$$
    $$\begin{aligned} p\left( \mathbf {a}_{2}^{(0)}|\phi _{2}(\theta _{1}),\dots ,\phi _{2}(\theta _{G})\right)\propto & {} \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) ^{\top } B_{2}^{-1}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) \right\} \prod \limits _{g=1}^{G}\exp \left\{ \phi _{2}(\theta _{g})^{\top }\mathbf {a}_{2}^{(0)}\right\} \\= & {} \exp \left\{ -\frac{1}{2}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) ^{\top } B_{2}^{-1}\left( \mathbf {a}_{2}^{(0)}-\mathbf {c}_{2}\right) +\sum \limits _{g=1}^{G}{\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}\mathbf {a}_{2}^{(0)}\right\} \\\propto & {} \exp \left\{ -\frac{1}{2}\left[ \mathbf {a}_{2}^{(0)} - \left( \mathbf {c}_{2}+B_{2}\sum \limits _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}\right) \right] ^{\top } B_{2}^{-1}\left[ \mathbf {a}_{2}^{(0)} - \left( \mathbf {c}_{2}+B_{2}\sum \limits _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}\right) \right] \right\} . \end{aligned}$$
  • \(a_{3}^{(0)}\) is associated with \(t_{3g}\), which relates only to \(\phi _{3}(\theta _{g})\) in the likelihood and has the functional form of an exponential density; hence, \(a_{3}^{(0)}\) is assigned an exponential prior, and the resulting posterior is exponential as well.

    $$\begin{aligned} a_{3}^{(0)} \sim \text {Exp}(b_{3}),\quad p\left( a_{3}^{(0)}\right) = b_{3}\exp \left( -b_{3}a_{3}^{(0)}\right) ; \end{aligned}$$
    $$\begin{aligned} p\left( a_{3}^{(0)}|\phi _{3}(\theta _{1}),\dots ,\phi _{3}(\theta _{G})\right)\propto & {} \exp \left\{ -b_{3}a_{3}^{(0)}\right\} \prod \limits _{g=1}^{G}\exp \left\{ \text {tr}\left[ \phi _{3}(\theta _{g})\right] a_{3}^{(0)}\right\} \\= & {} \exp \left\{ -\left[ b_{3} + \frac{1}{2}\sum \limits _{g=1}^{G}\left( {\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}+{\gamma _{g}^{2}}\right) \right] a_{3}^{(0)}\right\} ; \end{aligned}$$

    where, in the posterior distribution, the rate parameter is \(b_{3} + \frac{1}{2}\sum _{g=1}^{G}\left( {\varvec{\beta }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\beta }}_{g}+{\gamma _{g}^{2}}\right)\).

  • Similar to \(a_{3}^{(0)}\), \(a_{4}^{(0)}\) is assigned an exponential prior:

    $$\begin{aligned} a_{4}^{(0)} \sim \text {Exp}(b_{4}),\quad p\left( a_{4}^{(0)}\right) = b_{4}\exp \left( -b_{4}a_{4}^{(0)}\right) ; \end{aligned}$$
    $$\begin{aligned} p\left( a_{4}^{(0)}|\phi _{4}(\theta _{1}),\dots ,\phi _{4}(\theta _{G})\right)\propto & {} \exp \left\{ -b_{4}a_{4}^{(0)}\right\} \prod \limits _{g=1}^{G}\exp \left\{ \text {tr}\left[ \phi _{4}(\theta _{g})\right] a_{4}^{(0)}\right\} \\= & {} \exp \left\{ -\left[ b_{4} + \frac{1}{2}\sum \limits _{g=1}^{G}\left( {\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}+1\right) \right] a_{4}^{(0)}\right\} . \end{aligned}$$

    Hence, the posterior is an exponential distribution with rate \(b_{4} + \frac{1}{2}{\sum }_{g=1}^{G}\left( {\varvec{\mu }}_{g}^{\top }{\varvec{\Sigma }}_{g}^{-1}{\varvec{\mu }}_{g}+1\right)\).

  • \(\mathbf {a}_{5}^{(0)}\) is associated with \({\varvec{\Sigma }}_{g}^{-1}\). A Wishart prior is assigned to \(\mathbf {a}_{5}^{(0)}\); the resulting posterior is also a Wishart distribution.

    $$\begin{aligned} \mathbf {a}_{5}^{(0)} \sim \text {Wishart}(\nu _{0},{\varvec{\Lambda }}_{0}),\quad p\left( \mathbf {a}_{5}^{(0)}\right) \propto \left| \mathbf {a}_{5}^{(0)}\right| ^{\frac{\nu _{0}-d-1}{2}}\exp \left\{ -\frac{1}{2} \text {tr} \left( {\varvec{\Lambda }}_{0}^{-1}\mathbf {a}_{5}^{(0)}\right) \right\} . \end{aligned}$$

    The priors of \({\varvec{\Sigma }}_{g}^{-1}\) play the role of the likelihood in the derivation of the posterior of \(\mathbf {a}_{5}^{(0)}\):

    $$\begin{aligned} p\left( \left. \mathbf {a}_{5}^{(0)}\right| {\varvec{\Sigma }}_{1}^{-1},\dots ,{\varvec{\Sigma }}_{G}^{-1}\right)\propto & {} \left| \mathbf {a}_{5}^{(0)}\right| ^{\frac{\nu _{0}-d-1}{2}}\exp \left\{ -\frac{1}{2} \text {tr} \left( \varvec{\Lambda }_{0}^{-1}\mathbf {a}_{5}^{(0)}\right) \right\} \times \prod \limits _{g=1}^{G}\left| \mathbf {a}_{5}^{(0)}\right| ^{\frac{a_{0}^{(0)}}{2}}\exp \left\{ -\frac{1}{2} \text {tr} \left( \mathbf {a}_{5}^{(0)}\varvec{\Sigma }_{g}^{-1}\right) \right\} \\= & {} \left| \mathbf {a}_{5}^{(0)}\right| ^{\frac{\nu _{0}+G\times a_{0}^{(0)}-d-1}{2}}\exp \left\{ -\frac{1}{2} \text {tr}\left[ \left( {\varvec{\Lambda }}_{0}^{-1}+\sum \limits _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}\right) \mathbf {a}_{5}^{(0)}\right] \right\} . \end{aligned}$$

    Therefore, the posterior of \(\mathbf {a}_{5}^{(0)}\) is \(\text {Wishart}\left[ \nu _{0}+G\times a_{0}^{(0)},\left( {\varvec{\Lambda }}_{0}^{-1}+\sum _{g=1}^{G}{\varvec{\Sigma }}_{g}^{-1}\right) ^{-1}\right] .\)
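This Wishart update can be sketched as follows; the helper name is hypothetical, and `scipy.stats.wishart` parameterizes the distribution by degrees of freedom and a scale matrix, matching the form above.

```python
import numpy as np
from scipy.stats import wishart

def sample_a5(rng, nu0, Lambda0, a0, Sigma_invs):
    """Draw a_5^(0) from its Wishart posterior with
    df    = nu0 + G * a0,
    scale = (Lambda0^{-1} + sum_g Sigma_g^{-1})^{-1}.
    """
    df = nu0 + len(Sigma_invs) * a0
    scale = np.linalg.inv(np.linalg.inv(Lambda0) + sum(Sigma_invs))
    return wishart.rvs(df=df, scale=scale, random_state=rng)
```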

1.2.2 Choice of Third-Layer Hyperparameters

The third-layer hyperparameters \(b_{0},c_{1},B_{1},c_{2},B_{2},b_{3},b_{4},\nu _{0}\), and \({\varvec{\Lambda }}_{0}\) are chosen such that a sample drawn from the posterior of each hyperparameter \(a_{0}^{(0)},\mathbf {a}_{1}^{(0)},\mathbf {a}_{2}^{(0)},a_{3}^{(0)},a_{4}^{(0)}\), and \(\mathbf {a}_{5}^{(0)}\) is expected to be close to the associated term among \(t_{0g},\mathbf {t}_{1g}, \mathbf {t}_{2g}, t_{3g}, t_{4g}\) when \(g=G=1\), and to the sample covariance matrix \({\varvec{\Sigma }}_{\mathbf {x}}\), respectively. Therefore, we have the following:

When \(G=1\), all observations belong to a single component; hence \(z_{i1}=1\) for every \(i=1,\dots ,N\).

$$\begin{aligned} 1/b_{0}= & {} \mathbb {E}\left( a_{0}^{(0)}\right) \sim \sum \limits _{i = 1}^{N}{z_{ig}}=N, \quad \quad \quad \quad \quad \quad \rightarrow \quad \quad b_{0} = 1/N;\\ 1/b_{3}= & {} \mathbb {E}\left( a_{3}^{(0)}\right) \sim \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}}, \quad \quad \quad \quad \quad \quad \quad \rightarrow \quad \quad b_{3} = 1/\sum \limits _{i = 1}^{N}{u_{i}};\\ 1/b_{4}= & {} \mathbb {E}\left( a_{4}^{(0)}\right) \sim \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}},\quad \quad \quad \quad \quad \quad \ \ \rightarrow \quad \quad b_{4} = 1/\sum \limits _{i = 1}^{N}{u_{i}^{-1}};\\ \mathbf {c}_{1}= & {} \mathbb {E}\left( \mathbf {a}_{1}^{(0)}\right) \sim \sum \limits _{i = 1}^{N}{z_{ig}\mathbf {x}_{i}} \quad \quad \quad \quad \quad \quad \quad \ \ \rightarrow \quad \quad \mathbf {c}_{1} = \sum \limits _{i = 1}^{N}{\mathbf {x}_{i}}, B_{1} = {\varvec{\Sigma }}_{\mathbf {x}};\\ \mathbf {c}_{2}= & {} \mathbb {E}\left( \mathbf {a}_{2}^{(0)}\right) \sim \sum \limits _{i = 1}^{N}{z_{ig}u_{ig}^{-1}\mathbf {x}_{i}} \quad \quad \quad \quad \quad \quad \rightarrow \quad \quad \mathbf {c}_{2} = \sum \limits _{i = 1}^{N}{u_{i}^{-1}\mathbf {x}_{i}}, B_{2} = {\varvec{\Sigma }}_{u^{-1}\mathbf {x}};\\ \nu _{0}{\varvec{\Lambda }}_{0}= & {} \mathbb {E}\left( \mathbf {a}_{5}^{(0)}\right) \sim {\varvec{\Sigma }}_{\mathbf {x}}\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \rightarrow \quad \quad {\varvec{\Lambda }}_{0} = {\varvec{\Sigma }}_{\mathbf {x}}/\nu _{0}, \nu _{0} = d+1. \end{aligned}$$
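Putting these choices together, the third-layer hyperparameters can be computed directly from the data. This is a sketch under the choices listed above, with hypothetical names: `X` is the \(N\times d\) data matrix and `u` holds the current latent scales \(u_{i}\).

```python
import numpy as np

def third_layer_hyperparams(X, u):
    """Data-driven choices for the third-layer hyperparameters.

    X : (N, d) data matrix; u : (N,) current latent scales u_i.
    Sigma_x is the sample covariance of X and Sigma_{u^{-1}x}
    that of the scaled observations u_i^{-1} x_i.
    """
    N, d = X.shape
    b0 = 1.0 / N
    b3 = 1.0 / np.sum(u)
    b4 = 1.0 / np.sum(1.0 / u)
    c1 = X.sum(axis=0)
    c2 = (X / u[:, None]).sum(axis=0)
    B1 = np.cov(X, rowvar=False)            # Sigma_x
    B2 = np.cov(X / u[:, None], rowvar=False)  # Sigma_{u^{-1}x}
    nu0 = d + 1
    Lambda0 = B1 / nu0
    return b0, b3, b4, c1, c2, B1, B2, nu0, Lambda0
```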

Appendix 2: Estimated and True Parameters from the Simulation Studies

Table 9 True and estimated values for the parameters in Simulation Study 1. Numbers in parentheses are the standard errors calculated from the replications

Table 10 True and estimated values for the parameters in Simulation Study 2. Numbers in parentheses are the standard errors calculated from the replications

Table 11 True and estimated values for the parameters in Simulation Study 3. Numbers in parentheses are the standard errors calculated from the replications

Table 12 True and estimated values for the parameters in Simulation Study 4. Numbers in parentheses are the standard errors calculated from the replications

Table 13 \(L_{2}\) norms of the differences between the estimated and true parameters for the 6-dimensional dataset in Simulation Study 5. Numbers in parentheses are the standard errors calculated from the replications

Table 14 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 1. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 15 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 2. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 16 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 3. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 17 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 4. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 18 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 5. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 19 Median BIC for IMMNIG, InfGMM, and MixGHD for Simulation Study 6. Numbers in parentheses are the number of times the algorithm selected the corresponding model out of the 100 replications. For the IMMNIG and InfGMM models, the number of components was inferred within the algorithm; for the MixGHD models, BIC was used for model selection

Table 20 BIC for IMMNIG, InfGMM, and MixGHD for the three real datasets. Data were scaled before applying the three algorithms

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Fang, Y., Karlis, D. & Subedi, S. Infinite Mixtures of Multivariate Normal-Inverse Gaussian Distributions for Clustering of Skewed Data. J Classif 39, 510–552 (2022). https://doi.org/10.1007/s00357-022-09417-9
