Classification of multivariate count data with multivariate log-linear conditional Poisson distribution


Abstract

A classification model is proposed for distinguishing between several subpopulations using a multivariate count dataset. The classification rule, which minimizes the probability of misclassification, is obtained under the distributional hypothesis of a multivariate log-linear conditional Poisson distribution. A sample classification rule is defined based on the maximum likelihood estimators of the distributional parameters. This rule is based on functions associated with each one of the subpopulations, or equivalently, on the estimated posterior probabilities. Additionally, the likelihood ratio test of equality of the parameters for all the subpopulations is analyzed, providing a measure of the power to discriminate between subpopulations. Furthermore, an algorithm to determine the most suitable subset of counting variables for classification is proposed. Finally, actual and simulated datasets are considered to illustrate the application of the methodology.

References

Berkhout P, Plug E (2004) A bivariate Poisson count data model using conditional probabilities. Stat Neerlandica 58(3):349–364
Bray JR, Curtis JT (1957) An ordination of the upland forest communities of southern Wisconsin. Ecol Monogr 27(4):326–349. https://doi.org/10.2307/1942268
Chen LP (2022) Network-based discriminant analysis for multiclassification. J Classif 39:410–431. https://doi.org/10.1007/s00357-022-09414-y
Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y, Yuan J (2023) xgboost: Extreme Gradient Boosting. R package version 1.7.3.1. https://CRAN.R-project.org/package=xgboost
Fisher RA (1938) The statistical utilization of multiple measurements. Ann Eugen 8(4):376–386. https://doi.org/10.1111/j.1469-1809.1938.tb02189.x
Fushiki T (2011) Estimation of prediction error by using K-fold cross-validation. Stat Comput 21:137–146. https://doi.org/10.1007/s11222-009-9153-8
Goldstein M, Dillon WR (1978) Discrete discriminant analysis. Wiley, New York
Inouye DI, Yang E, Allen GI, Ravikumar P (2017) A review of multivariate distributions for count data derived from the Poisson distribution. WIREs Comput Stat 9(3):e1398. https://doi.org/10.1002/wics.1398
Junta de Castilla y Leon (2023) Datos abiertos de Castilla y Leon: Accidentalidad por carreteras. https://datosabiertos.jcyl.es/web/jcyl/set/es/transporte/accidentalidad-carreteras/1284967604431
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
Lachenbruch PA, Goldstein M (1979) Discriminant analysis. Biometrics 35:69–85. https://doi.org/10.2307/2529937
Liaw A, Wiener M (2002) Classification and regression by random forest. R News 2(3):18–22
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2021) cluster: Cluster Analysis Basics and Extensions. R package version 2.1.1
Majka M (2019) naivebayes: High Performance Implementation of the Naive Bayes Algorithm in R. R package version 0.9.7, https://CRAN.R-project.org/package=naivebayes
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2022) e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-12, https://CRAN.R-project.org/package=e1071
Muñoz-Pichardo JM, Pino-Mejías R, García-Heras J, Ruiz-Muñoz F, Luz González-Regalado M (2021) A multivariate Poisson regression model for count data. J Appl Stat 48(13–15):2525–2541. https://doi.org/10.1080/02664763.2021.1877637
Muñoz-Pichardo JM, Pino-Mejías R (2023) Multivariate log-linear conditional Poisson distribution. https://personal.us.es/juanm/Poissonweb/Posson.html
Oksanen J, Guillaume Blanchet F, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin P-R, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H (2020) vegan: Community Ecology Package. R package version 2.5-7. https://CRAN.R-project.org/package=vegan
PennState Eberly College of Science (2023) STAT 505 Applied Multivariate Statistical Analysis. Example: Woodyard Hammock Data. Pennsylvania State University. https://online.stat.psu.edu/stat505/lesson/14/14.1
Perez-de-la-Cruz G, Eslava-Gomez G (2019) Discriminant analysis for discrete variables derived from a tree-structured graphical model. Adv Data Anal Classif 13:855–876. https://doi.org/10.1007/s11634-019-00352-z
Peyhardi J, Fernique P, Durand JB (2021) Splitting models for multivariate count data. J Multivar Anal 181:104677
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Sajana OK, Sajesh TA (2023) Splitting models for multivariate count data. Commun Stat-Simul Comput 52(3):735–744. https://doi.org/10.1080/03610918.2020.1868512
Seber GAF (2004) Multivariate observations. Wiley-Interscience, Hoboken
Silva A, Rothstein SJ, McNicholas PD, Subedi S (2019) A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinform 20(1):394
Subedi S, Browne RP (2020) A family of parsimonious mixtures of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat 9:e310. https://doi.org/10.1002/sta4.310
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Zhao X, Zhang J, Lin W (2023) Clustering multivariate count data via Dirichlet-multinomial network fusion. Comput Stat Data Anal 179:107634

Author information

Authors and Affiliations

Departamento de Estadística e I.O., Universidad de Sevilla, Avd. Reina Mercedes s/n, 41012, Sevilla, Spain
Juan M. Muñoz-Pichardo & Rafael Pino-Mejías

Authors

Juan M. Muñoz-Pichardo
Rafael Pino-Mejías

Corresponding author

Correspondence to Juan M. Muñoz-Pichardo.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

Approximation to moments of the distribution

Consider the three-dimensional case, that is, $\underline{Y} \sim MLCP_3(\underline{\eta },{\textbf{A}})$, with

$$\underline{\eta }=(\eta _1,\eta _2,\eta _3) \qquad \text { and } \qquad {\textbf{A}}= \left( \begin{array}{ccc} 0 & 0 & 0 \\ \alpha _{21} & 0 & 0 \\ \alpha _{31} & \alpha _{32} & 0 \end{array} \right) .$$

Since the moments of the distribution are difficult to compute exactly, we approximate them by a series expansion, namely the quadratic approximation given by the second-degree Taylor polynomial. To evaluate how the moments change with the parameters that determine the statistical dependence structure between the components (${\textbf{A}}$), the polynomial approximation is taken with respect to these parameters, keeping the parameter vector $\underline{\eta }$ fixed. The proofs of the following results are collected in Muñoz-Pichardo and Pino-Mejías (2023).

1. Expected values
  (a) $E[Y_{1}] = e^{\eta _{1}}$.
  (b) $E[Y_{2}]=e^{\eta _{2}}\ \exp \left[ e^{\eta _{1}}(e^{\alpha _{21}}-1)\right]$.
  (c) The quadratic approximation of $E[Y_3]$ is given by
  $$\begin{aligned} E[Y_{3}]\approx &\,e^{\eta _{3}}\ + e^{\eta _{1}+\eta _{3}} \; \alpha _{31} + e^{\eta _{2}+\eta _{3}}\; \alpha _{32} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3} } \; \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1} + \eta _{3}}(e^{\eta _{1}}+1)\; \alpha _{31}^{2} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \alpha _{31} \; \alpha _{32} + \frac{1}{2} e^{\eta _{2}+\eta _{3}} \left( e^{\eta _{2}}+1\right) \; \alpha _{32}^{2}. \end{aligned}$$
2. Variances
  (a) $Var[Y_{1}] = e^{\eta _{1}}$.
  (b) $Var[Y_{2}] =E[Y_{2}] +\left( E[Y_2]\right) ^2 \left\{ \exp \left[ e^{\eta _{1}} (e^{\alpha _{21}}-1)^{2} \right] -1 \right\}$.
  (c) The quadratic approximation of $Var[Y_3]$ is given by
  $$\begin{aligned} Var\left[ Y_{3}\right]\approx &\,e^{\eta _{3}} + e^{\eta _{1}+\eta _{3}} \alpha _{31} + e^{\eta _{2}+\eta _{3}}\alpha _{32} \\ & +\frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\left( 1-e^{\eta _{3}}\right) \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{3}}\left[ 1+e^{\eta _{1}}+2e^{\eta _{3}}\right] \alpha _{31}^{2} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\alpha _{31}\alpha _{32} + \frac{1}{2} e^{\eta _{2}+\eta _{3}}\left[ 1+e^{\eta _{2}}+2e^{\eta _{3}}\right] \alpha _{32}^{2}. \end{aligned}$$
3. Covariances
  (a) $Cov[Y_{1},Y_{2}] = e^{\eta _{1}}(e^{\alpha _{21}}-1)E[Y_{2}]$.
  (b) The quadratic approximation of $Cov[Y_{1},Y_{3}]$ is given by
  $$\begin{aligned} Cov[Y_{1},Y_{3}] &\approx e^{\eta _{1}+\eta _{3}}\ \alpha _{31} + e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{32} \\ & \quad + \frac{1}{2} e^{\eta _{1}+\eta _{3}} \left[ 1+2e^{\eta _{1}}\ \right] \; \alpha _{31}^{2} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \;\alpha _{31}\alpha _{32}. \end{aligned}$$
  (c) The quadratic approximation of $Cov[Y_{2},Y_{3}]$ is given by
  $$\begin{aligned} Cov[Y_{2},Y_{3}] & \approx e^{\eta _{2}+\eta _{3}} \; \alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{31} \\ & \quad + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\; \alpha _{31}\alpha _{32}. \end{aligned}$$
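
The exact (non-approximated) results for $Y_1$ and $Y_2$ listed above can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes the conditional construction $Y_1 \sim \text{Poisson}(e^{\eta_1})$ and $Y_2 \mid Y_1 = y_1 \sim \text{Poisson}(e^{\eta_2 + \alpha_{21} y_1})$, which is consistent with the formulas for $E[Y_2]$, $Var[Y_2]$ and $Cov[Y_1,Y_2]$ but is not restated in this appendix. The function names, the truncation point `y_max` and the parameter values are illustrative choices.

```python
import math


def poisson_pmf(y, rate):
    """Poisson probability mass at y, evaluated on the log scale for stability."""
    return math.exp(y * math.log(rate) - rate - math.lgamma(y + 1))


def summed_moments(eta1, eta2, alpha21, y_max=100):
    """Moments of (Y1, Y2) by direct summation over the grid 0..y_max.

    Assumed model (see lead-in): Y1 ~ Poisson(exp(eta1)) and
    Y2 | Y1 = y1 ~ Poisson(exp(eta2 + alpha21 * y1)).
    """
    rate1 = math.exp(eta1)
    s1 = s2 = s22 = s12 = 0.0
    for y1 in range(y_max + 1):
        p1 = poisson_pmf(y1, rate1)
        rate2 = math.exp(eta2 + alpha21 * y1)
        for y2 in range(y_max + 1):
            p = p1 * poisson_pmf(y2, rate2)  # joint probability of (y1, y2)
            s1 += p * y1
            s2 += p * y2
            s22 += p * y2 * y2
            s12 += p * y1 * y2
    return {"E1": s1, "E2": s2, "Var2": s22 - s2 ** 2, "Cov12": s12 - s1 * s2}


def closed_form_moments(eta1, eta2, alpha21):
    """Closed-form expressions for E[Y1], E[Y2], Var[Y2] and Cov[Y1, Y2]."""
    rate1 = math.exp(eta1)
    step = math.exp(alpha21) - 1.0
    e2 = math.exp(eta2) * math.exp(rate1 * step)
    var2 = e2 + e2 ** 2 * (math.exp(rate1 * step ** 2) - 1.0)
    cov12 = rate1 * step * e2
    return {"E1": rate1, "E2": e2, "Var2": var2, "Cov12": cov12}


if __name__ == "__main__":
    # Illustrative parameter values, chosen so that the truncated sums converge.
    params = dict(eta1=0.5, eta2=0.3, alpha21=0.2)
    numeric = summed_moments(**params)
    exact = closed_form_moments(**params)
    for key in exact:
        print(f"{key}: summed={numeric[key]:.10f}  closed form={exact[key]:.10f}")
```

The summation is truncated, so the agreement is only meaningful when $e^{\eta_1}$ and $\alpha_{21}$ are small enough that the probability mass beyond `y_max` is negligible; larger parameter values require a larger grid.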

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Muñoz-Pichardo, J.M., Pino-Mejías, R. Classification of multivariate count data with multivariate log-linear conditional Poisson distribution. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00617-2

Received: 16 June 2023
Revised: 13 June 2024
Accepted: 11 November 2024
Published: 07 December 2024
DOI: https://doi.org/10.1007/s11634-024-00617-2
