Abstract
This paper considers a general Bayesian test for the multi-sample problem. Specifically, for M independent samples, the interest is to determine whether the M samples are generated from the same multivariate population. First, M Dirichlet processes are considered as priors for the true distributions that generated the data. Then, the concentration of the distribution of the total distance between the M posterior processes is compared to the concentration of the distribution of the total distance between the M prior processes through the relative belief ratio. The total distance between processes is defined in terms of the energy distance. Several theoretical results for the approach are derived. Several examples, including high-dimensional cases, are considered to illustrate the approach.
References
Abdelrazeq, I., Al-Labadi, L., Alzaatreh, A.: On one-sample Bayesian tests for the Mean. Statistics 54(2), 424–440 (2020)
Al-Labadi, L.: The two-sample problem via relative belief ratio. Comput. Stat. (2020). https://doi.org/10.1007/s00180-020-00988-y
Al-Labadi, L., Baskurt, Z., Evans, M.: Goodness of fit for the logistic regression model using relative belief. J. Stat. Distrib. Appl. 4(1), 1 (2017)
Al-Labadi, L., Baskurt, Z., Evans, M.: Statistical reasoning: choosing and checking the ingredients, inferences based on a measure of statistical evidence with some applications. Entropy 20(4), 289 (2018)
Al-Labadi, L., Evans, M.: Prior-based model checking. Can. J. Stat. 46(3), 380–398 (2018)
Al-Labadi, L., Fazeli Asl, F., Saberi, Z.: A Bayesian semiparametric Gaussian copula approach to a multivariate normality test. J. Stat. Comput. Simul. (2020). https://doi.org/10.1080/00949655.2020.1820504
Al-Labadi, L., Zarepour, M.: Two-sample Kolmogorov-Smirnov test using a Bayesian nonparametric approach. Math. Methods Stat. 26, 212–225 (2017)
Baringhaus, L., Franz, C.: On a new multivariate two-sample test. J. Multivar. Anal. 88, 190–206 (2004)
Bickel, P.J., Breiman, L.: Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. Ann. Probab. 11, 185–214 (1983)
Biswas, M., Ghosh, A.K.: A nonparametric two-sample test applicable to high dimensional data. J. Multivar. Anal. 123, 160–171 (2014)
Chen, Y., Hanson, T.: Bayesian nonparametric k-sample tests for censored and uncensored data. Comput. Stat. Data Anal. 71, 335–346 (2014)
Evans, M.: Measuring Statistical Evidence Using Relative Belief. Monographs on Statistics and Applied Probability, vol. 144. CRC Press, Boca Raton, FL (2015)
Fehrman, E., Muhammad, A.K., Mirkes, E.M., Egan, V., Gorban, A.N.: The five factor model of personality and evaluation of drug consumption risk. Data Sci. (2017). https://doi.org/10.1007/978-3-319-55723-6_18
Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973)
Friedman, J.H., Rafsky, L.C.: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two sample tests. Ann. Stat. 7, 697–717 (1979)
Heller, R., Jensen, S.T., Rosenbaum, P.R., Small, D.S.: Sensitivity analysis for the cross-match test, with applications in genomics. J. Am. Stat. Assoc. 105, 1005–1013 (2010)
Henze, N.: A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Stat. 16, 772–783 (1988)
Holmes, C.C., Caron, F., Griffin, J.E., Stephens, D.A.: Two-sample Bayesian nonparametric hypothesis testing. Bayesian Anal. 2, 297–320 (2015)
Ishwaran, H., Zarepour, M.: Exact and approximate sum representations for the Dirichlet process. Can. J. Stat. 30, 269–283 (2003)
Kuipers, J.B.: Quaternions and rotation sequences. Princeton University Press, Princeton (1999)
Mukherjee, S., Agarwal, D., Zhang, N.R., Bhattacharya, B.B.: Distribution-free multisample tests based on optimal matchings with applications to single cell genomics. J. Am. Stat. Assoc. 18, 1–12 (2020)
Mukhopadhyay, S., Wang, K.: A nonparametric approach to high-dimensional \(k\)-sample comparison problems. Biometrika (2020). https://doi.org/10.1093/biomet/asaa015
Oja, H.: Multivariate nonparametric methods with R: an approach based on spatial signs and ranks. Springer, New York (2010)
Oja, H., Randles, R.H.: Multivariate nonparametric tests. Stat. Sci. 19, 598–605 (2004)
Petrie, A.: Graph-theoretic multisample tests of equality in distribution for high dimensional data. Comput. Stat. Data Anal. 96, 145–158 (2016)
Rosenbaum, P.R.: An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 515–530 (2005)
Schilling, M.F.: Multivariate two-sample tests based on nearest neighbors. J. Am. Stat. Assoc. 81, 799–806 (1986)
Székely, G.: E-statistics: Energy of statistical samples. Technical Report No. 03–05, Bowling Green State University, Department of Mathematics and Statistics (2003)
Székely, G., Rizzo, M.: Testing for equal distributions in high dimension. Interstat (2004)
Székely, G., Rizzo, M.: The energy of data. Ann. Rev. Stat. Appl. 4(1), 447–479 (2017)
Tsukada, S.: High dimensional two-sample test based on the inter-point distance. Comput. Stat. 34, 599–615 (2019)
Zhang, Q., Filippi, S., Flaxman, S., Sejdinovic, D.: Bayesian kernel two-sample testing. Technical Report (2020). arXiv:2002.05550
Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Gen. 45(10), 1113–1120 (2013)
Appendices
Appendix A proof of lemma 1
To prove (i), from the properties of the Dirichlet distribution, \(E_{P^{(1)}_N}(J^{(1)}_{i,N})=E_{P^{(2)}_N}(J^{(2)}_{i,N})=1/N\) and, for \(i\ne j\), \(E_{P^{(1)}_{N}}(J^{(1)}_{i,N}J^{(1)}_{j,N})=E_{P^{(2)}_{N}}(J^{(2)}_{i,N}J^{(2)}_{j,N})=\frac{a}{(a+1)N^2}\). Since \(J^{(1)}_{i,N}\) and \(J^{(2)}_{j,N}\) are independent, we have
The result follows immediately by letting \(a\rightarrow \infty \) in (17). To prove (ii), for any \(a>0\), we have
Now, to compute the limit of (18) when \(a\rightarrow \infty \), we use the monotone convergence theorem as below. For this, let
Note that, since \(\dfrac{a}{(a+1)N^2}<\dfrac{1}{N^2}\), we have
Recalling Eq. (8), the right-hand side of (20) is \(d_{{\mathcal {E}},N}(G_{1},G_{2})\). Since \(d_{{\mathcal {E}},N}(G_{1},G_{2})\ge 0\), we have \(h(a)>0\), so \(h(a)\) is positive. On the other hand, since \(\frac{a}{a+1}<\frac{a+1}{a+2}\), we get \(h(a+1)<h(a)\), so \(h(a)\) is monotone decreasing. Also, since \(\dfrac{a}{(a+1)N^2}\rightarrow \dfrac{1}{N^2}\) as \(a\rightarrow \infty \), by letting \(a\rightarrow \infty \) in (19), we have \(h(a)\rightarrow d_{{\mathcal {E}},N}(G_{1},G_{2})\) as \(a\rightarrow \infty \). Hence, the monotone convergence theorem implies
as \(a\rightarrow \infty \). From Székely and Rizzo (2004), we have
where \(\mathbf {X}^{(1)},\mathbf {X}^{\prime ^{(1)}}\overset{i.i.d.}{\sim } G_{1}\) and \(\mathbf {X}^{(2)},\mathbf {X}^{\prime ^{(2)}}\overset{i.i.d.}{\sim } G_{2}\). By letting \(N\rightarrow \infty \) in (21), we have
Recalling Eq. (7), the right-hand side of (22) is \(d_{{\mathcal {E}}}(G_{1},G_{2})\). This completes the proof of (ii).
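The role of the two Dirichlet moments in this proof can be checked numerically. The following is a minimal sketch, assuming that the energy distance between the two N-atom approximations takes the usual weighted form \(2\sum _{i,j}J^{(1)}_{i,N}J^{(2)}_{j,N}\Vert Y^{(1)}_i-Y^{(2)}_j\Vert -\sum _{i,j}J^{(1)}_{i,N}J^{(1)}_{j,N}\Vert Y^{(1)}_i-Y^{(1)}_j\Vert -\sum _{i,j}J^{(2)}_{i,N}J^{(2)}_{j,N}\Vert Y^{(2)}_i-Y^{(2)}_j\Vert \) with weights distributed as Dirichlet\((a/N,\ldots ,a/N)\). The atom arrays `y1`, `y2` and all function names are hypothetical and introduced only for illustration.

```python
import numpy as np

def pairwise_dist(x, y):
    """Euclidean distances between all rows of x (n, d) and y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def closed_form_expectation(y1, y2, a):
    """Expected weighted energy distance given the atoms (assumed form), using
    E(J_i) = 1/N and E(J_i J_j) = a / ((a + 1) N^2) for i != j.
    Diagonal terms vanish because the distance of a point to itself is 0."""
    n = y1.shape[0]
    c = a / ((a + 1.0) * n ** 2)
    cross = pairwise_dist(y1, y2).sum() / n ** 2
    within = pairwise_dist(y1, y1).sum() + pairwise_dist(y2, y2).sum()
    return 2.0 * cross - c * within

def monte_carlo_expectation(y1, y2, a, draws=50_000, seed=0):
    """Same expectation, estimated by simulating the Dirichlet weights."""
    rng = np.random.default_rng(seed)
    n = y1.shape[0]
    alpha = np.full(n, a / n)
    j1 = rng.dirichlet(alpha, size=draws)  # weights of the first process
    j2 = rng.dirichlet(alpha, size=draws)  # independent weights of the second
    d12 = pairwise_dist(y1, y2)
    d11 = pairwise_dist(y1, y1)
    d22 = pairwise_dist(y2, y2)
    cross = np.einsum("ki,ij,kj->k", j1, d12, j2)
    w1 = np.einsum("ki,ij,kj->k", j1, d11, j1)
    w2 = np.einsum("ki,ij,kj->k", j2, d22, j2)
    return float((2.0 * cross - w1 - w2).mean())
```

Under these assumptions the closed form decreases in `a` and approaches the equally weighted quantity with coefficient \(1/N^2\), which mirrors the behaviour of \(h(a)\) described in the proof.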
Appendix B proof of theorem 1
To prove (i), by using the conjugacy property of the Dirichlet process in the proof of part (ii) of Lemma 1, we have \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G^*_1,G^*_2}\left( d_{{\mathcal {E}},N}(G^*_1,G^*_2)\right) \) as \(n_1,n_2\rightarrow \infty \). On the other hand, by the Glivenko-Cantelli theorem, when \(n_1,n_2\rightarrow \infty \), \(G^{*}_1\) and \(G^{*}_2\) converge almost surely to the true distributions of the data, \(F_1\) and \(F_2\), respectively. Thus, \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \xrightarrow {a.s.} E_{F_{1},F_{2}}(d_{{\mathcal {E}},N}(F_1,F_2))\) as \(n_1,n_2\rightarrow \infty \). As in the proof of part (ii) of Lemma 1, \(E_{F_{1},F_{2}}(d_{{\mathcal {E}},N}(F_{1},F_{2}))\rightarrow d_{{\mathcal {E}}}(F_{1},F_{2})\) as \(N\rightarrow \infty \).
To prove (ii), we have \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G^*_1,G^*_2}\left( d_{{\mathcal {E}},N}(G^*_1,G^*_2)\right) \) as \(a\rightarrow \infty \). Also, from (2), by letting \(a\rightarrow \infty \), \(G^{*}_1\) and \(G^{*}_2\) converge to \(G_1\) and \(G_2\), respectively. Hence, \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G_{1},G_{2}}(d_{{\mathcal {E}},N}(G_1,G_2))\) as \(a\rightarrow \infty \). Now, as in the proof of (i), the result follows as \(N\rightarrow \infty \).
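The two limits used in this proof can be illustrated with a small sketch. It assumes the standard conjugate update of Ferguson (1973), in which the posterior base measure is the mixture \(G^{*}=\frac{a}{a+n}G+\frac{n}{a+n}F_n\), with \(F_n\) the empirical distribution of the sample; the form of (2) is taken to be this one, which is an assumption. The names `posterior_base_weights`, `sample_posterior_base`, and `prior_sampler` are hypothetical.

```python
import numpy as np

def posterior_base_weights(a, n):
    """Mixture weights of the prior base measure G and the empirical
    distribution F_n in the assumed posterior base measure G*."""
    return a / (a + n), n / (a + n)

def sample_posterior_base(data, a, prior_sampler, size, rng):
    """Draw `size` points from G* = w_prior * G + w_data * F_n.

    data          : array of shape (n, d) holding the observed sample
    prior_sampler : callable (k, rng) -> array (k, d) of draws from G
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    w_prior, _ = posterior_base_weights(a, n)
    from_prior = rng.random(size) < w_prior
    # Start with resampled observations (draws from F_n) ...
    out = data[rng.integers(0, n, size=size)]
    # ... and overwrite the selected positions with draws from G.
    k = int(from_prior.sum())
    if k > 0:
        out[from_prior] = prior_sampler(k, rng)
    return out
```

With `a` fixed and `n` growing, the weight on \(G\) vanishes and draws from \(G^{*}\) are essentially resampled data, which is the situation in part (i); with `n` fixed and `a` growing, the weight on \(G\) tends to one, as in part (ii).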
Appendix C proof of proposition 1
By the linearity of expectation, we have
Applying part (ii) of Lemma 1 and Theorem 1 to (23) and (24) completes the proof.
Appendix D proof of corollary 1
To prove (i), from part (i) of Proposition 1, we have \(E(d_{T,N}(F^{pri}_1,\ldots , F^{pri}_M))\rightarrow d_{T}(G_1,\ldots , G_M)\), as \(a\rightarrow \infty \) and \(N\rightarrow \infty \). Now, from Székely and Rizzo (2004), \(d_{T}(G_1,\ldots , G_M)=0\) if and only if \(G_1=\cdots =G_M\), which completes the proof. The proofs of (ii) and (iii) are similar to that of (i) and are omitted.
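The characterization used here can be illustrated on samples. The sketch below assumes that the total distance is the sum of the pairwise energy distances over all pairs of the M distributions, and it uses the sample (V-statistic) version \(2\,\overline{\Vert X-Y\Vert }-\overline{\Vert X-X'\Vert }-\overline{\Vert Y-Y'\Vert }\) of the energy distance of Székely and Rizzo. Both the summation form and the function names are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_distance(x, y):
    """Average Euclidean distance between rows of x (n, d) and y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())

def energy_distance(x, y):
    """Sample energy distance between two samples given as 2-D arrays."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))

def total_energy_distance(samples):
    """Sum of pairwise energy distances over all pairs of samples
    (assumed form of the total distance)."""
    return sum(energy_distance(x, y) for x, y in combinations(samples, 2))
```

When all samples coincide, every pairwise term is zero and so is the total; as soon as one sample differs in distribution, at least one pairwise term is positive, which is the property the corollary relies on.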
Appendix E pseudocode algorithm of the multi-sample test
Appendix F relevant notations
Al-Labadi, L., Asl, F.F. & Saberi, Z. A Bayesian nonparametric multi-sample test in any dimension. AStA Adv Stat Anal 106, 217–242 (2022). https://doi.org/10.1007/s10182-021-00419-3
DOI: https://doi.org/10.1007/s10182-021-00419-3
Keywords
- Dirichlet process prior
- Energy distance
- Multi-sample hypothesis testing
- Relative belief ratio
- Simulation