A Bayesian nonparametric multi-sample test in any dimension

  • Original Paper
AStA Advances in Statistical Analysis

Abstract

This paper considers a general Bayesian test for the multi-sample problem. Specifically, for M independent samples, the interest is to determine whether the M samples are generated from the same multivariate population. First, M Dirichlet processes are considered as priors for the true distributions that generated the data. Then, the concentration of the distribution of the total distance between the M posterior processes is compared to the concentration of the distribution of the total distance between the M prior processes through the relative belief ratio. The total distance between processes is defined in terms of the energy distance. Several theoretical properties of the approach are derived. Several examples, including high-dimensional cases, are considered to illustrate the approach.

References

  • Abdelrazeq, I., Al-Labadi, L., Alzaatreh, A.: On one-sample Bayesian tests for the Mean. Statistics 54(2), 424–440 (2020)

  • Al-Labadi, L.: The two-sample problem via relative belief ratio. Comput. Stat. (2020). https://doi.org/10.1007/s00180-020-00988-y

  • Al-Labadi, L., Baskurt, Z., Evans, M.: Goodness of fit for the logistic regression model using relative belief. J. Stat. Distrib. Appl. 4(1), 1 (2017)

  • Al-Labadi, L., Baskurt, Z., Evans, M.: Statistical reasoning: choosing and checking the ingredients, inferences based on a measure of statistical evidence with some applications. Entropy 20(4), 289 (2018)

  • Al-Labadi, L., Evans, M.: Prior-based model checking. Can. J. Stat. 46(3), 380–398 (2018)

  • Al-Labadi, L., Fazeli Asl, F., Saberi, Z.: A Bayesian semiparametric Gaussian copula approach to a multivariate normality test. J. Stat. Comput. Simul. (2020). https://doi.org/10.1080/00949655.2020.1820504

  • Al-Labadi, L., Zarepour, M.: Two-sample Kolmogorov-Smirnov test using a Bayesian nonparametric approach. Math. Methods Stat. 26, 212–225 (2017)

  • Baringhaus, L., Franz, C.: On a new multivariate two-sample test. J. Multivar. Anal. 88, 190–206 (2004)

  • Bickel, P.J., Breiman, L.: Sums of functions of nearest neighbor distances, moment bounds, limit theorems and a goodness of fit test. Ann. Probab. 11, 185–214 (1983)

  • Biswas, M., Ghosh, A.K.: A nonparametric two-sample test applicable to high dimensional data. J. Multivar. Anal. 123, 160–171 (2014)

  • Chen, Y., Hanson, T.: Bayesian nonparametric k-sample tests for censored and uncensored data. Comput. Stat. Data Anal. 71, 335–346 (2014)

  • Evans, M.: Measuring Statistical Evidence Using Relative Belief. Monographs on Statistics and Applied Probability, vol. 144. CRC Press, Boca Raton (2015)

  • Fehrman, E., Muhammad, A.K., Mirkes, E.M., Egan, V., Gorban, A.N.: The five factor model of personality and evaluation of drug consumption risk. Data Sci. (2017). https://doi.org/10.1007/978-3-319-55723-6_18

  • Ferguson, T.S.: A Bayesian analysis of some nonparametric problems. Ann. Stat. 1, 209–230 (1973)

  • Friedman, J.H., Rafsky, L.C.: Multivariate generalizations of the Wald-Wolfowitz and Smirnov two sample tests. Ann. Stat. 7, 697–717 (1979)

  • Heller, R., Jensen, S.T., Rosenbaum, P.R., Small, D.S.: Sensitivity analysis for the cross-match test, with applications in genomics. J. Am. Stat. Assoc. 105, 1005–1013 (2010)

  • Henze, N.: A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann. Stat. 16, 772–783 (1988)

  • Holmes, C.C., Caron, F., Griffin, J.E., Stephens, D.A.: Two-sample Bayesian nonparametric hypothesis testing. Bayesian Anal. 2, 297–320 (2015)

  • Ishwaran, H., Zarepour, M.: Exact and approximate sum representations for the Dirichlet process. Can. J. Stat. 30, 269–283 (2003)

  • Kuipers, J.B.: Quaternions and rotation sequences. Princeton University Press, Princeton (1999)

  • Mukherjee, S., Agarwal, D., Zhang, N.R., Bhattacharya, B.B.: Distribution-free multisample tests based on optimal matchings with applications to single cell genomics. J. Am. Stat. Assoc. 18, 1–12 (2020)

  • Mukhopadhyay, S., Wang, K.: A nonparametric approach to high-dimensional \(k\)-sample comparison problems. Biometrika (2020). https://doi.org/10.1093/biomet/asaa015

  • Oja, H.: Multivariate nonparametric methods with R: an approach based on spatial signs and ranks. Springer, New York (2010)

  • Oja, H., Randles, R.H.: Multivariate nonparametric tests. Stat. Sci. 19, 598–605 (2004)

  • Petrie, A.: Graph-theoretic multisample tests of equality in distribution for high dimensional data. Comput. Stat. Data Anal. 96, 145–158 (2016)

  • Rosenbaum, P.R.: An exact distribution-free test comparing two multivariate distributions based on adjacency. J. R. Stat. Soc. Ser. B Stat. Methodol. 67, 515–530 (2005)

  • Schilling, M.F.: Multivariate two-sample tests based on nearest neighbors. J. Am. Stat. Assoc. 81, 799–806 (1986)

  • Székely, G.: E-statistics: The energy of statistical samples. Technical Report No. 03-05, Department of Mathematics and Statistics, Bowling Green State University (2003)

  • Székely, G., Rizzo, M.: Testing for equal distributions in high dimension. InterStat (2004)

  • Székely, G., Rizzo, M.: The energy of data. Ann. Rev. Stat. Appl. 4(1), 447–479 (2017)

  • Tsukada, S.: High dimensional two-sample test based on the inter-point distance. Comput. Stat. 34, 599–615 (2019)

  • Zhang, Q., Filippi, S., Flaxman, S., Sejdinovic, D.: Bayesian kernel two-sample testing. Technical Report (2020). arXiv:2002.05550

  • Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., et al.: The cancer genome atlas pan-cancer analysis project. Nat. Gen. 45(10), 1113–1120 (2013)


Author information

Correspondence to Luai Al-Labadi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Proof of Lemma 1

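To prove (i), from the properties of the Dirichlet distribution, \(E_{P^{(1)}_N}(J^{(1)}_{i,N})=E_{P^{(2)}_N}(J^{(2)}_{i,N})=1/N\) and, for \(i\ne j\), \(E_{P^{(1)}_{N}}(J^{(1)}_{i,N}J^{(1)}_{j,N})=E_{P^{(2)}_{N}}(J^{(2)}_{i,N}J^{(2)}_{j,N})=\frac{a}{(a+1)N^2}\) (the terms with \(i=j\) vanish in the sums below because \(||\mathbf {X}_i-\mathbf {X}_i||=0\)). Since \(J^{(1)}_{i,N}\) and \(J^{(2)}_{j,N}\) are independent for all \(i,j\), so that \(E(J^{(1)}_{i,N}J^{(2)}_{j,N})=1/N^2\), we have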

$$\begin{aligned} E_{P^{(1)}_{N},P^{(2)}_{N}}\left( d_{{\mathcal {E}},a,N}(F^{pri}_{1},F^{pri}_{2})\right)&=\dfrac{2}{N^{2}}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(2)}_j||-\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(1)}_j||\nonumber \\&\quad -\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(2)}_i-\mathbf {X}^{(2)}_j||. \end{aligned}$$
(17)
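
As a numerical sanity check of (17), the following minimal sketch averages a weighted energy distance over simulated weights and compares the result with the closed form. It rests on assumptions not stated in this excerpt: that \(d_{{\mathcal {E}},a,N}\) has the usual weighted form \(2\sum J^{(1)}_iJ^{(2)}_j||\cdot ||-\sum J^{(1)}_iJ^{(1)}_j||\cdot ||-\sum J^{(2)}_iJ^{(2)}_j||\cdot ||\), and that the weights follow a symmetric Dirichlet\((a/N,\ldots ,a/N)\) distribution, which is consistent with the moments quoted above. The function names, sample sizes and seed are illustrative only.

```python
import numpy as np

def pairwise_norms(x, y):
    """Matrix of Euclidean distances ||x_i - y_j||, shape (len(x), len(y))."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def closed_form_17(x1, x2, a):
    """Right-hand side of (17) for fixed atoms x1, x2 (N points each)."""
    n = len(x1)
    shrink = a / ((a + 1.0) * n ** 2)
    return (2.0 / n ** 2 * pairwise_norms(x1, x2).sum()
            - shrink * pairwise_norms(x1, x1).sum()
            - shrink * pairwise_norms(x2, x2).sum())

def monte_carlo_17(x1, x2, a, reps, rng):
    """Average the ASSUMED weighted energy distance over Dirichlet weights."""
    n = len(x1)
    alpha = np.full(n, a / n)             # assumed symmetric Dirichlet(a/N, ..., a/N)
    j1 = rng.dirichlet(alpha, size=reps)  # weights of process 1, shape (reps, N)
    j2 = rng.dirichlet(alpha, size=reps)  # independent weights of process 2
    d12 = pairwise_norms(x1, x2)
    d11 = pairwise_norms(x1, x1)
    d22 = pairwise_norms(x2, x2)
    cross = np.einsum("ri,ij,rj->r", j1, d12, j2)
    within1 = np.einsum("ri,ij,rj->r", j1, d11, j1)
    within2 = np.einsum("ri,ij,rj->r", j2, d22, j2)
    return float(np.mean(2.0 * cross - within1 - within2))

rng = np.random.default_rng(0)
N, dim, a = 20, 3, 5.0                    # illustrative values
x1 = rng.normal(size=(N, dim))            # atoms standing in for draws from G_1
x2 = rng.normal(loc=1.0, size=(N, dim))   # atoms standing in for draws from G_2
closed_form = closed_form_17(x1, x2, a)
mc_mean = monte_carlo_17(x1, x2, a, reps=50_000, rng=rng)
print(f"closed form (17): {closed_form:.3f}, Monte Carlo mean: {mc_mean:.3f}")
```

The two printed numbers should agree up to Monte Carlo error, which shrinks as `reps` grows.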

Part (i) follows immediately by letting \(a\rightarrow \infty \) in (17). To prove (ii), for any \(a>0\), we have

$$\begin{aligned} E\left( d_{{\mathcal {E}},a,N}(F^{pri}_{1},F^{pri}_{2})\right)&=E_{G_{1},G_{2}}\big \lbrace E_{P^{(1)}_{N},P^{(2)}_{N}}\big (d_{{\mathcal {E}},a,N}(F^{pri}_{1},F^{pri}_{2})\big )\big \rbrace \nonumber \\&=E_{G_{1},G_{2}}\big \lbrace \dfrac{2}{N^{2}}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(2)}_j||-\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(1)}_j||\nonumber \\& \quad -\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(2)}_i-\mathbf {X}^{(2)}_j||\big \rbrace . \end{aligned}$$
(18)

To compute the limit of (18) as \(a\rightarrow \infty \), we use the monotone convergence theorem. To this end, let

$$\begin{aligned} h(a)&=\dfrac{2}{N^{2}}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(2)}_j||-\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(1)}_j||\nonumber \\&\quad -\dfrac{a}{(a+1)N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(2)}_i-\mathbf {X}^{(2)}_j||. \end{aligned}$$
(19)

Note that, since \(\dfrac{a}{(a+1)N^2}<\dfrac{1}{N^2}\), we have

$$\begin{aligned} h(a)&>\dfrac{2}{N^{2}}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(2)}_j||-\dfrac{1}{N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(1)}_i-\mathbf {X}^{(1)}_j|| -\dfrac{1}{N^2}\sum _{i,j=1}^{N}||\mathbf {X}^{(2)}_i-\mathbf {X}^{(2)}_j||. \end{aligned}$$
(20)

Recalling Eq. (8), the right-hand side of (20) is \(d_{{\mathcal {E}},N}(G_{1},G_{2})\). Since \(d_{{\mathcal {E}},N}(G_{1},G_{2})\ge 0\), we have \(h(a)>0\), that is, h is positive. Moreover, since \(\frac{a}{a+1}<\frac{a+1}{a+2}\), we get \(h(a+1)<h(a)\), that is, h is decreasing in a. Finally, since \(\dfrac{a}{(a+1)N^2}\rightarrow \dfrac{1}{N^2}\) as \(a\rightarrow \infty \), letting \(a\rightarrow \infty \) in (19) gives \(h(a)\rightarrow d_{{\mathcal {E}},N}(G_{1},G_{2})\). Hence, the monotone convergence theorem implies

$$\begin{aligned} E_{G_{1},G_{2}}(h(a))\rightarrow E_{G_{1},G_{2}}(d_{{\mathcal {E}},N}(G_{1},G_{2})), \end{aligned}$$

as \(a\rightarrow \infty \). From Székely and Rizzo (2004), we have

$$\begin{aligned} E_{G_{1},G_{2}}(d_{{\mathcal {E}},N}(G_{1},G_{2}))&=2E||\mathbf {X}^{(1)}-\mathbf {X}^{(2)}||-E||\mathbf {X}^{(1)}-\mathbf {X}^{\prime ^{(1)}}||-E||\mathbf {X}^{(2)}-\mathbf {X}^{\prime ^{(2)}}||\nonumber \\&\quad +\dfrac{E||\mathbf {X}^{(1)}-\mathbf {X}^{\prime ^{(1)}}||+E||\mathbf {X}^{(2)}-\mathbf {X}^{\prime ^{(2)}}||}{N}, \end{aligned}$$
(21)

where \(\mathbf {X}^{(1)},\mathbf {X}^{\prime ^{(1)}}\overset{i.i.d.}{\sim } G_{1}\) and \(\mathbf {X}^{(2)},\mathbf {X}^{\prime ^{(2)}}\overset{i.i.d.}{\sim } G_{2}\). By letting \(N\rightarrow \infty \) in (21), we have

$$\begin{aligned} E_{G_{1},G_{2}}(d_{{\mathcal {E}},N}(G_{1},G_{2}))\rightarrow 2E||\mathbf {X}^{(1)}-\mathbf {X}^{(2)}||-E||\mathbf {X}^{(1)}-\mathbf {X}^{\prime ^{(1)}}||-E||\mathbf {X}^{(2)}-\mathbf {X}^{\prime ^{(2)}}||. \end{aligned}$$
(22)

Recalling Eq. (7), the right-hand side of (22) is \(d_{{\mathcal {E}}}(G_{1},G_{2})\). This completes the proof of (ii).
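
The behaviour of \(h(a)\) used in the proof of (ii) can be illustrated numerically. The sketch below evaluates (19) for one fixed pair of simulated samples and checks that h stays above its limit, decreases in a, and approaches the right-hand side of (20). The Gaussian samples, dimension, grid of a values and function names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mean_pairwise(x, y):
    """(1/N^2) * sum_{i,j} ||x_i - y_j|| for two samples of N points each."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def h(a, x1, x2):
    """h(a) from (19): the within-sample sums are shrunk by a / (a + 1)."""
    shrink = a / (a + 1.0)
    return (2.0 * mean_pairwise(x1, x2)
            - shrink * (mean_pairwise(x1, x1) + mean_pairwise(x2, x2)))

def empirical_energy(x1, x2):
    """Right-hand side of (20), i.e. the limit of h(a) as a grows."""
    return (2.0 * mean_pairwise(x1, x2)
            - mean_pairwise(x1, x1) - mean_pairwise(x2, x2))

rng = np.random.default_rng(1)
x1 = rng.normal(size=(50, 2))            # illustrative sample from G_1
x2 = rng.normal(loc=0.5, size=(50, 2))   # illustrative sample from G_2

a_grid = [1.0, 10.0, 100.0, 1000.0, 1e6]
h_values = [h(a, x1, x2) for a in a_grid]
limit = empirical_energy(x1, x2)

for a, value in zip(a_grid, h_values):
    print(f"a = {a:>9.0f}  h(a) = {value:.6f}")
print(f"limit d_E,N    = {limit:.6f}")
```

The gap \(h(a)-d_{{\mathcal {E}},N}(G_1,G_2)\) equals the sum of the two within-sample averages divided by \(a+1\), so it vanishes at rate \(1/a\).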

Appendix B: Proof of Theorem 1

To prove (i), applying the conjugacy property of the Dirichlet process in the proof of part (ii) of Lemma 1 gives \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G^*_1,G^*_2}\left( d_{{\mathcal {E}},N}(G^*_1,G^*_2)\right) \) as \(n_1,n_2\rightarrow \infty \). Moreover, by the Glivenko-Cantelli theorem, as \(n_1,n_2\rightarrow \infty \), \(G^{*}_1\) and \(G^{*}_2\) converge almost surely to the true data distributions \(F_1\) and \(F_2\), respectively. Thus, \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \xrightarrow {a.s.} E_{F_{1},F_{2}}(d_{{\mathcal {E}},N}(F_1,F_2))\) as \(n_1,n_2\rightarrow \infty \). As in the proof of part (ii) of Lemma 1, letting \(N\rightarrow \infty \) gives \(E_{F_{1},F_{2}}(d_{{\mathcal {E}},N}(F_{1},F_{2}))\rightarrow d_{{\mathcal {E}}}(F_{1},F_{2})\).

To prove (ii), we have \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G^*_1,G^*_2}\left( d_{{\mathcal {E}},N}(G^*_1,G^*_2)\right) \) as \(a\rightarrow \infty \). Also, from (2), letting \(a\rightarrow \infty \), \(G^{*}_1\) and \(G^{*}_2\) converge to \(G_1\) and \(G_2\), respectively. Hence, \(E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{1},F^{pos}_{2})\right) \rightarrow E_{G_{1},G_{2}}(d_{{\mathcal {E}},N}(G_1,G_2))\) as \(a\rightarrow \infty \). Now, as in the proof of (i), the result follows by letting \(N\rightarrow \infty \).
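
The two limits of \(G^{*}_1\) and \(G^{*}_2\) used in this proof can be made explicit. Eq. (2) is not reproduced in this excerpt; the display below assumes it is the standard conjugate update of Ferguson (1973) for a Dirichlet process prior with concentration parameter a and base measure \(G_k\). The symbol \({\hat{F}}_{n_k}\), denoting the empirical distribution of the k-th sample, is notation introduced here for illustration.

$$\begin{aligned} G^{*}_k&=\dfrac{a}{a+n_k}\,G_k+\dfrac{n_k}{a+n_k}\,{\hat{F}}_{n_k},\qquad k=1,2,\\ G^{*}_k-{\hat{F}}_{n_k}&=\dfrac{a}{a+n_k}\,\big (G_k-{\hat{F}}_{n_k}\big )\rightarrow 0\quad \text {as } n_k\rightarrow \infty \text { with } a \text { fixed},\\ G^{*}_k-G_k&=\dfrac{n_k}{a+n_k}\,\big ({\hat{F}}_{n_k}-G_k\big )\rightarrow 0\quad \text {as } a\rightarrow \infty \text { with } n_k \text { fixed}. \end{aligned}$$

Under this assumption, the second line combined with the Glivenko-Cantelli theorem (\({\hat{F}}_{n_k}\rightarrow F_k\) almost surely) gives the limit used for part (i), and the third line gives the limit used for part (ii).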

Appendix C: Proof of Proposition 1

By the linearity of expectation, we have

$$\begin{aligned} E\left( d_{T,N}(F^{pri}_1,\ldots , F^{pri}_M)\right) =\sum _{1\le i<j \le M}E\left( d_{{\mathcal {E}},a,N}(F^{pri}_{i},F^{pri}_{j})\right) , \end{aligned}$$
(23)
$$\begin{aligned} E\Big (d_{T,N}(F^{pos}_1,\ldots , F^{pos}_M)\Big )=\sum _{1\le i<j \le M}E\left( d_{{\mathcal {E}},a,N}(F^{pos}_{i},F^{pos}_{j})\right) . \end{aligned}$$
(24)

Applying part (ii) of Lemma 1 and Theorem 1 to the summands in (23) and (24) concludes the proof.
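
The pairwise structure in (23) and (24) suggests a direct way to compute a total distance between M samples. The sketch below is a simplified illustration: it uses the plain empirical energy distance between observed samples (no Dirichlet process weights and no sample-size scaling), which is an assumption made here for brevity and not the statistic of the paper. The function names and simulated data are hypothetical.

```python
import itertools
import numpy as np

def mean_dist(x, y):
    """Average Euclidean distance over all pairs (x_i, y_j)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()

def energy_distance(x, y):
    """Plain empirical energy distance between two samples."""
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def total_distance(samples):
    """Sum of pairwise energy distances over all pairs i < j."""
    return sum(energy_distance(x, y)
               for x, y in itertools.combinations(samples, 2))

rng = np.random.default_rng(2)
# Three samples from the same 5-dimensional standard normal distribution.
same = [rng.normal(size=(200, 5)) for _ in range(3)]
# Same first two samples, with the third drawn from a shifted distribution.
shifted = same[:2] + [rng.normal(loc=1.0, size=(200, 5))]

d_same = total_distance(same)
d_shifted = total_distance(shifted)
print(f"total distance, equal populations:   {d_same:.4f}")
print(f"total distance, one shifted sample:  {d_shifted:.4f}")
```

With equal populations the total distance is close to zero, up to a finite-sample bias of the kind that appears as the 1/N term in (21). A single shifted sample makes two of the three pairwise terms large.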

Appendix D: Proof of Corollary 1

To prove (i), from part (i) of Proposition 1, we have \(E(d_{T,N}(F^{pri}_1,\ldots , F^{pri}_M))\rightarrow d_{T}(G_1,\ldots , G_M)\) as \(a\rightarrow \infty \) and \(N\rightarrow \infty \). From Székely and Rizzo (2004), \(d_{T}(G_1,\ldots , G_M)=0\) if and only if \(G_1=\cdots =G_M\), which completes the proof. The proofs of (ii) and (iii) are similar to that of (i) and are omitted.

Appendix E: Pseudocode algorithm of the multi-sample test

[Figures a and b: pseudocode of the multi-sample test; the figure content is not available in this text version.]

Appendix F: Relevant notations

Table 9 Description of notations

About this article

Cite this article

Al-Labadi, L., Asl, F.F. & Saberi, Z. A Bayesian nonparametric multi-sample test in any dimension. AStA Adv Stat Anal 106, 217–242 (2022). https://doi.org/10.1007/s10182-021-00419-3

  • DOI: https://doi.org/10.1007/s10182-021-00419-3
