Abstract
A classification model is proposed for distinguishing between several subpopulations using a multivariate count dataset. The classification rule, which minimizes the probability of misclassification, is obtained under the distributional hypothesis of a multivariate log-linear conditional Poisson distribution. A sample classification rule is defined based on the maximum likelihood estimators of the distributional parameters. This rule is based on functions associated with each one of the subpopulations, or equivalently, on the estimated posterior probabilities. Additionally, the likelihood ratio test of equality of the parameters for all the subpopulations is analyzed, providing a measure of the power to discriminate between subpopulations. Furthermore, an algorithm to determine the most suitable subset of counting variables for classification is proposed. Finally, actual and simulated datasets are considered to illustrate the application of the methodology.
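The posterior-probability form of the rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each component, given the preceding ones, is Poisson with a log-linear rate, and that the parameters of each subpopulation are already available (in practice they would be maximum likelihood estimates). The function names `mlcp_logpmf`, `posterior_probabilities` and `classify`, the prior probabilities and all numeric values are hypothetical.

```python
import math

def mlcp_logpmf(y, eta, A):
    """Log-probability of the count vector y.

    ASSUMPTION (not taken from the text): component j, given the earlier
    components, is Poisson with log-rate eta[j] + sum_{k<j} A[j][k] * y[k].
    """
    total = 0.0
    for j, yj in enumerate(y):
        log_rate = eta[j] + sum(A[j][k] * y[k] for k in range(j))
        # Poisson log-pmf: y * log(rate) - rate - log(y!)
        total += yj * log_rate - math.exp(log_rate) - math.lgamma(yj + 1)
    return total

def posterior_probabilities(y, priors, params):
    """Posterior probability of each subpopulation for observation y."""
    scores = [math.log(p) + mlcp_logpmf(y, eta, A)
              for p, (eta, A) in zip(priors, params)]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [w / norm for w in weights]

def classify(y, priors, params):
    """Index of the subpopulation with the largest posterior probability."""
    post = posterior_probabilities(y, priors, params)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical parameter values for two subpopulations (illustration only).
PARAMS = [
    ([math.log(2.0), math.log(3.0)], [[0.0, 0.0], [0.1, 0.0]]),
    ([math.log(8.0), math.log(1.0)], [[0.0, 0.0], [-0.1, 0.0]]),
]
PRIORS = [0.5, 0.5]

print(classify((2, 4), PRIORS, PARAMS))
print(posterior_probabilities((2, 4), PRIORS, PARAMS))
```

Assigning an observation to the subpopulation with the largest posterior probability is the standard way to minimize the probability of misclassification when the priors and class distributions are known.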
References
Berkhout P, Plug E (2004) A bivariate Poisson count data model using conditional probabilities. Stat Neerlandica 58(3):349–364
Bray JR, Curtis JT (1957) An ordination of the upland forest communities of southern Wisconsin. Ecol Monogr 27(4):326–349. https://doi.org/10.2307/1942268
Chen LP (2022) Network-based discriminant analysis for multiclassification. J Classif 39:410–431. https://doi.org/10.1007/s00357-022-09414-y
Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, Chen K, Mitchell R, Cano I, Zhou T, Li M, Xie J, Lin M, Geng Y, Li Y, Yuan J (2023) xgboost: Extreme Gradient Boosting. R package version 1.7.3.1. https://CRAN.R-project.org/package=xgboost
Fisher RA (1938) The statistical utilization of multiple measurements. Ann Eugen 8(4):376–386. https://doi.org/10.1111/j.1469-1809.1938.tb02189.x
Fushiki T (2011) Estimation of prediction error by using K-fold cross-validation. Stat Comput 21:137–146. https://doi.org/10.1007/s11222-009-9153-8
Goldstein M, Dillon WR (1978) Discrete discriminant analysis. Wiley, New York
Inouye DI, Yang E, Allen GI, Ravikumar P (2017) A review of multivariate distributions for count data derived from the Poisson distribution. WIREs Comput Stat 9(3):e1398. https://doi.org/10.1002/wics.1398
Junta de Castilla y Leon (2023) Datos abiertos de Castilla y Leon: Accidentalidad por carreteras. https://datosabiertos.jcyl.es/web/jcyl/set/es/transporte/accidentalidad-carreteras/1284967604431
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26
Lachenbruch PA, Goldstein M (1979) Discriminant analysis. Biometrics 35:69–85. https://doi.org/10.2307/2529937
Liaw A, Wiener M (2002) Classification and regression by random forest. R News 2(3):18–22
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2021) cluster: Cluster Analysis Basics and Extensions. R package version 2.1.1
Majka M (2019) naivebayes: High Performance Implementation of the Naive Bayes Algorithm in R. R package version 0.9.7. https://CRAN.R-project.org/package=naivebayes
Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2022) e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-12. https://CRAN.R-project.org/package=e1071
Muñoz-Pichardo JM, Pino-Mejías R, García-Heras J, Ruiz-Muñoz F, Luz González-Regalado M (2021) A multivariate Poisson regression model for count data. J Appl Stat 48(13–15):2525–2541. https://doi.org/10.1080/02664763.2021.1877637
Muñoz-Pichardo JM, Pino-Mejías R (2023) Multivariate log-linear conditional Poisson distribution. https://personal.us.es/juanm/Poissonweb/Posson.html
Oksanen J, Guillaume Blanchet F, Friendly M, Kindt R, Legendre P, McGlinn D, Minchin P-R, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Szoecs E, Wagner H (2020) vegan: Community Ecology Package. R package version 2.5-7. https://CRAN.R-project.org/package=vegan
PennState Eberly College of Science (2023) STAT 505 Applied Multivariate Statistical Analysis. Example: Woodyard Hammock Data. Pennsylvania State University. https://online.stat.psu.edu/stat505/lesson/14/14.1
Perez-de-la-Cruz G, Eslava-Gomez G (2019) Discriminant analysis for discrete variables derived from a tree-structured graphical model. Adv Data Anal Classif 13:855–876. https://doi.org/10.1007/s11634-019-00352-z
Peyhardi J, Fernique P, Durand JB (2021) Splitting models for multivariate count data. J Multivar Anal 181:104677
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Sajana OK, Sajesh TA (2023) Splitting models for multivariate count data. Commun Stat-Simul Comput 52(3):735–744. https://doi.org/10.1080/03610918.2020.1868512
Seber GAF (2004) Multivariate observations. Wiley-Interscience, Hoboken
Silva A, Rothstein SJ, McNicholas PD, Subedi S (2019) A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data. BMC Bioinform 20(1):394
Subedi S, Browne RP (2020) A family of parsimonious mixtures of multivariate Poisson-lognormal distributions for clustering multivariate count data. Stat 9:e310. https://doi.org/10.1002/sta4.310
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Zhao X, Zhang J, Lin W (2023) Clustering multivariate count data via Dirichlet-multinomial network fusion. Comput Stat Data Anal 179:107634
Appendix A
Approximation to moments of the distribution
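Consider the three-dimensional case, that is, \(\underline{Y} \sim MLCP_3(\underline{\eta },{\textbf{A}})\), with \(\underline{\eta } = (\eta _{1},\eta _{2},\eta _{3})'\) and \({\textbf{A}}\) containing the dependence parameters \(\alpha _{21}\), \(\alpha _{31}\) and \(\alpha _{32}\).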
Because the moments of the distribution are difficult to calculate exactly, we approximate them by a series expansion, namely the second-degree Taylor polynomial (quadratic approximation). To evaluate how the moments change with the parameters that determine the statistical dependence structure between the components (\({\textbf{A}}\)), the expansion is carried out with respect to these parameters about zero, keeping the parameter vector \(\underline{\eta }\) fixed. Proofs of the following results are given in Muñoz-Pichardo and Pino-Mejías (2023).
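As an illustration of the technique, the following sketch works through the two-component case. It assumes the conditional structure \(Y_{1} \sim Poisson(e^{\eta _{1}})\) and \(Y_{2} \mid Y_{1}=y_{1} \sim Poisson(e^{\eta _{2}+\alpha _{21}y_{1}})\); this structure is an assumption made here for the example and is not restated in this appendix, although it reproduces the closed form for \(E[Y_{2}]\) given in it. Conditioning on \(Y_{1}\) and using the Poisson moment generating function gives

$$\begin{aligned} E[Y_{2}] = E\left[ E[Y_{2}\mid Y_{1}]\right] = e^{\eta _{2}}\, E\left[ e^{\alpha _{21}Y_{1}}\right] = e^{\eta _{2}} \exp \left[ e^{\eta _{1}}(e^{\alpha _{21}}-1)\right] . \end{aligned}$$

Writing \(g(\alpha _{21}) = \exp \left[ e^{\eta _{1}}(e^{\alpha _{21}}-1)\right]\), we have \(g(0)=1\), \(g'(0)=e^{\eta _{1}}\) and \(g''(0)=e^{\eta _{1}}(e^{\eta _{1}}+1)\), so the second-degree Taylor polynomial in \(\alpha _{21}\) about zero is

$$\begin{aligned} E[Y_{2}] \approx e^{\eta _{2}} + e^{\eta _{1}+\eta _{2}}\; \alpha _{21} + \frac{1}{2} e^{\eta _{1}+\eta _{2}}\left( e^{\eta _{1}}+1\right) \alpha _{21}^{2}. \end{aligned}$$

At \(\alpha _{21}=0\) the expression reduces to \(e^{\eta _{2}}\), the mean of an independent Poisson component.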
1. Expected values
(a) \(E[Y_{1}] = e^{\eta _{1}}\).
(b) \(E[Y_{2}]=e^{\eta _{2}}\ \exp \left[ e^{\eta _{1}}(e^{\alpha _{21}}-1)\right]\).
(c) The quadratic approximation of \(E[Y_3]\) is given by
$$\begin{aligned} E[Y_{3}]\approx &\,e^{\eta _{3}}\ + e^{\eta _{1}+\eta _{3}} \; \alpha _{31} + e^{\eta _{2}+\eta _{3}}\; \alpha _{32} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3} } \; \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1} + \eta _{3}}(e^{\eta _{1}}+1)\; \alpha _{31}^{2} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \alpha _{31} \; \alpha _{32} + \frac{1}{2} e^{\eta _{2}+\eta _{3}} \left( e^{\eta _{2}}+1\right) \; \alpha _{32}^{2}. \end{aligned}$$
2. Variances
(a) \(Var[Y_{1}] = e^{\eta _{1}}\).
(b) \(Var[Y_{2}] =E[Y_{2}] +\left( E[Y_2]\right) ^2 \left\{ \exp \left[ e^{\eta _{1}} (e^{\alpha _{21}}-1)^{2} \right] -1 \right\}\).
(c) The quadratic approximation of \(Var[Y_3]\) is given by
$$\begin{aligned} Var\left[ Y_{3}\right]\approx &\,e^{\eta _{3}} + e^{\eta _{1}+\eta _{3}} \alpha _{31} + e^{\eta _{2}+\eta _{3}}\alpha _{32} \\ & +\frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\left( 1-e^{\eta _{3}}\right) \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{3}}\left[ 1+e^{\eta _{1}}+2e^{\eta _{3}}\right] \alpha _{31}^{2} \\ & + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\alpha _{31}\alpha _{32} + \frac{1}{2} e^{\eta _{2}+\eta _{3}}\left[ 1+e^{\eta _{2}}+2e^{\eta _{3}}\right] \alpha _{32}^{2}. \end{aligned}$$
3. Covariances
(a) \(Cov[Y_{1},Y_{2}] = e^{\eta _{1}}(e^{\alpha _{21}}-1)E[Y_{2}]\).
(b) The quadratic approximation of \(Cov[Y_{1},Y_{3}]\) is given by
$$\begin{aligned} Cov[Y_{1},Y_{3}] &\approx e^{\eta _{1}+\eta _{3}}\ \alpha _{31} + e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{32} \\ & \quad + \frac{1}{2} e^{\eta _{1}+\eta _{3}} \left[ 1+2e^{\eta _{1}}\ \right] \; \alpha _{31}^{2} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \;\alpha _{31}\alpha _{32}. \end{aligned}$$
(c) The quadratic approximation of \(Cov[Y_{2},Y_{3}]\) is given by
$$\begin{aligned} Cov[Y_{2},Y_{3}] & \approx e^{\eta _{2}+\eta _{3}} \; \alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{31} \\ & \quad + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}} \; \alpha _{21}\alpha _{32} + \frac{1}{2} e^{\eta _{1}+\eta _{2}+\eta _{3}}\; \alpha _{31}\alpha _{32} \end{aligned}$$
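The exact expressions for \(E[Y_{2}]\), \(Var[Y_{2}]\) and \(Cov[Y_{1},Y_{2}]\) can be checked numerically. The sketch below is a minimal check under an assumption that is not restated in this appendix: \(Y_{1}\) is Poisson with mean \(e^{\eta _{1}}\) and \(Y_{2}\) given \(Y_{1}=y_{1}\) is Poisson with mean \(e^{\eta _{2}+\alpha _{21}y_{1}}\). It evaluates the closed forms and compares them with the same moments obtained by conditioning on \(Y_{1}\) and summing over its probability function. The function names, the truncation point `y1_max` and the parameter values are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability P(K = k), computed on the log scale."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def closed_form_moments(eta1, eta2, alpha21):
    """E[Y2], Var[Y2] and Cov[Y1, Y2] from the closed-form expressions."""
    lam1 = math.exp(eta1)
    mean_y2 = math.exp(eta2) * math.exp(lam1 * math.expm1(alpha21))
    var_y2 = mean_y2 + mean_y2 ** 2 * math.expm1(lam1 * math.expm1(alpha21) ** 2)
    cov_y1y2 = lam1 * math.expm1(alpha21) * mean_y2
    return mean_y2, var_y2, cov_y1y2

def summed_moments(eta1, eta2, alpha21, y1_max=200):
    """Same moments by summing over Y1 (ASSUMED conditional Poisson model)."""
    lam1 = math.exp(eta1)
    m1 = m2 = m12 = 0.0  # accumulate E[Y2], E[Y2^2], E[Y1 * Y2]
    for y1 in range(y1_max + 1):
        p = poisson_pmf(y1, lam1)
        rate = math.exp(eta2 + alpha21 * y1)  # E[Y2 | Y1 = y1]
        m1 += p * rate
        m2 += p * (rate + rate ** 2)          # E[Y2^2 | Y1 = y1] for a Poisson
        m12 += p * y1 * rate
    return m1, m2 - m1 ** 2, m12 - lam1 * m1

# Hypothetical parameter values, for illustration only.
args = (math.log(2.0), math.log(3.0), 0.2)
print(closed_form_moments(*args))
print(summed_moments(*args))
```

The sum is truncated at `y1_max`, which is adequate only when the Poisson mean of \(Y_{1}\) and the value of \(\alpha _{21}\) are moderate; larger values would need a larger truncation point.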
About this article
Cite this article
Muñoz-Pichardo, J.M., Pino-Mejías, R. Classification of multivariate count data with multivariate log-linear conditional Poisson distribution. Adv Data Anal Classif (2024). https://doi.org/10.1007/s11634-024-00617-2
DOI: https://doi.org/10.1007/s11634-024-00617-2