Variance reduction of estimators arising from Metropolis–Hastings algorithms

George Iliopoulos¹ &
Sonia Malefaki²

411 Accesses
1 Citation
2 Altmetric
1 Mention
Explore all metrics

Abstract

The Metropolis–Hastings algorithm is one of the most basic and well-studied Markov chain Monte Carlo methods. It generates a Markov chain which has as limit distribution the target distribution by simulating observations from a different proposal distribution. A proposed value is accepted with some particular probability otherwise the previous value is repeated. As a consequence, the accepted values are repeated a positive number of times and thus any resulting ergodic mean is, in fact, a weighted average. It turns out that this weighted average is an importance sampling-type estimator with random weights. By the standard theory of importance sampling, replacement of these random weights by their (conditional) expectations leads to more efficient estimators. In this paper we study the estimator arising by replacing the random weights with certain estimators of their conditional expectations. We illustrate by simulations that it is often more efficient than the original estimator while in the case of the independence Metropolis–Hastings and for distributions with finite support we formally prove that it is even better than the “optimal” importance sampling estimator.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Efficient Large Deviation Estimation Based on Importance Sampling

Article 06 July 2020

Computing marginal likelihoods via the Fourier integral theorem and pointwise estimation of posterior densities

Article 16 August 2022

Bayesian Approaches to the Design of Markov Chain Monte Carlo Samplers

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Atchadé, Y.F., Perron, F.: Improving on the independent Metropolis–Hastings algorithm. Stat. Sin. 15, 3–18 (2005)
MATH Google Scholar
Billingsley, P.: Statistical methods in Markov chains. Ann. Math. Stat. 32, 12–40 (1961)
Article MathSciNet MATH Google Scholar
Casella, G., Robert, C.: Rao–Blackwellisation of sampling schemes. Biometrika 83, 81–94 (1996)
Article MathSciNet MATH Google Scholar
Douc, R., Robert, C.P.: A vanilla Rao–Blackwellisation of Metropolis–Hastings algorithms. Ann. Stat. 39, 261–277 (2011)
Article MathSciNet MATH Google Scholar
Geyer, C.J.: Practical Markov chain Monte Carlo. Stat. Sci. 7, 473–483 (1992)
Article Google Scholar
Hastings, W.K.: Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109 (1970)
Article MATH Google Scholar
Jacob, P., Robert, C.P., Smith, M.: Using parallel computation to improve independent Metropolis–Hastings based estimation. J. Comput. Graph. Stat. 20, 616–635 (2011)
Article MathSciNet Google Scholar
Malefaki, S., Iliopoulos, G.: On convergence of properly weighted samples to the target distribution. J. Stat. Plan. Inference 138, 1210–1225 (2008)
Article MathSciNet MATH Google Scholar
Marin, J.M., Robert, C.P.: Importance sampling methods for Bayesian discrimination between embedded models. In: Chen, M.-H., Dey, D.K., Müller, P., Sun, D., Ye, K. (eds.) Frontiers of Statistical Decision Making and Bayesian Analysis (2010). Chapter 14, pp. 513–553
Google Scholar
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1091 (1953)
Article Google Scholar

Download references

Acknowledgements

The authors wish to thank the two anonymous referees for their helpful comments and suggestions which considerably improved the paper.

Author information

Authors and Affiliations

Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli & Dimitriou str., 18534, Piraeus, Greece
George Iliopoulos
Department of Engineering Sciences, University of Patras, 26500, Rio, Patras, Greece
Sonia Malefaki

Authors

George Iliopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Sonia Malefaki
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to George Iliopoulos.

Appendix: Proof of Theorem 1

(a) Assume without loss of generality that the state space is $\mathcal {X}=\{1, \ldots, m\}$ for some m≥2 and set for convenience h _k=h(k), π _k=π(k), g _k=g(k), g _kl=g(l|k) and w _k=w(k). Then,

(5)

By the ergodic theorem it holds $n^{-1}\sum_{i=1}^{n}I(X_{i}=k)\stackrel {\mathrm{a.s.}}{\longrightarrow}g_{k}$ and $\sum_{j=1}^{n}\frac{\xi_{j}}{\sum_{l=1}^{n}\xi_{l}}I(X_{j}=\ell )\stackrel{\mathrm{a.s.}}{\longrightarrow}\pi _{\ell}$. Since (5) contains only finite sums we conclude that it converges almost surely to

because $g_{k} = \kappa\sum_{i=1}^{m} \min\{\pi_{k} g_{k\ell}, \pi _{\ell }g_{\ell k}\}$. By taking h≡1 we get that $n^{-1}\sum _{i=1}^{n}\hat{w}(X_{i})\stackrel{\mathrm{a.s.}}{\longrightarrow}\kappa $ and thus

$$\hat{h}_{\hat{w}} = \frac{\sum_{i=1}^n\hat{w}(X_i)h(X_i)/n}{\sum _{i=1}^n\hat{w}(X_i)/n} \stackrel{\mathrm{a.s.}}{\longrightarrow }E_{\pi}(h). $$

(b) Let us define

$$\bar{U}_{k} = \frac{1}{n}\sum_{i=1}^{n}I(X_{i}={k}), \quad k=1,\ldots,m, $$

$$\bar{V}_{k} = \frac{1}{n}\sum_{i=1}^{n}\xi_{i}I(X_{i}={k}), \quad k=1,\ldots,m, $$

and

$$\rho_{kl} = \min \biggl\{\frac{q_{lk}}{\pi_{k}},\frac{q_{lk}}{\pi_{l}} \biggr\}, \quad k,l=1,\ldots,m. $$

Billingsley (1961) proved that

$$n^{1/2}(\bar{\boldsymbol{U}}-\boldsymbol{g})\rightarrow\mathcal {N}_m(\boldsymbol{0}, \boldsymbol{\varSigma }_{11}) $$

where $\bar{\boldsymbol{U}}= (\bar{U}_{1}, \ldots, \bar{U}_{m})^{T}$, g=(g ₁,…,g _m)^T and the ij entry of Σ ₁₁ is

Here $g_{ij}^{(n)}$ denotes the n-step transition probability from state i to state j and δ _ij is Kronecker’s delta. Using similar arguments it can be proven that

where the ij entry of the submatrix Σ ₁₂ is σ ₁₂(ij)=κw _j σ ₁₁(ij) while the ij entry of the submatrix Σ ₂₂ is σ ₂₂(ij)=δ _ij g _i κw _i(κw _i−1)+κ ² w _i w _j σ ₁₁(ij).

It is clear that the asymptotic distribution of $n^{1/2}\{\hat {h}_{\hat {w}}-E_{\pi}(h)\}$ can be found via the standard delta method. Observe that $\hat{h}_{\hat{w}}$ can be expressed as

(6)

say. By differentiation we get

$$\frac{\partial}{\partial u_i}f_1(\boldsymbol{u}, \boldsymbol{v}) \bigg|_{(\boldsymbol{u}, \boldsymbol{v})=(\boldsymbol{g}, \kappa\boldsymbol{\pi})} = w_i\bigl\{h_i- E_{\pi}(h)\bigr\} $$

and

The variance of the asymptotic normal distribution of $n^{1/2}\{\hat {h}_{\hat{w}}-E_{\pi}(h)\}$ is

where ∇_u f ₁, ∇_v f ₁ are the vectors containing the derivatives with respect to u, v, respectively, evaluated at (u,v)=(g,κ π). We will see below that the above expression is in fact as stated in the theorem. To this end, let us also consider the IS “estimator” $\hat {h}_{IS}$ and evaluate its asymptotic variance too. After some algebra we get that

$$ \hat{h}_{IS} = \sum_{k=1}^{m}\frac{h_{k}\bar{U}_{k}}{\sum _{l=1}^{m}\rho _{kl}\pi_{l}} \bigg/ \sum_{k=1}^{m}\frac{\bar{U}_{k}}{\sum _{l=1}^{m}\rho _{kl}\pi_{l}} = f_2(\bar{\boldsymbol{U}}), $$

(7)

say. By (6) and (7) we see that the only difference between the two “estimators” is the replacement of π _l by its unbiased estimate $\kappa^{-1}\bar{V}_{l}$ in $\hat {h}_{\hat{w}}$. By differentiation we get

$$\frac{\partial}{\partial u_i}f_2(\boldsymbol{u}) \bigg|_{\boldsymbol{u}=\boldsymbol{g}} = w_i\bigl\{h_i- E_{\pi}(h)\bigr\}. $$

Since the corresponding variance of the importance sampling estimator is

$$\sigma^2_{IS}(h) = \nabla_{\boldsymbol{u}}f_2^{T} \boldsymbol{ \varSigma }_{11} \nabla_{\boldsymbol{u}}f_2 \equiv \nabla_{\boldsymbol{u}}f_1^{T} \boldsymbol{ \varSigma }_{11} \nabla_{\boldsymbol{u}}f_1, $$

it is clear that

Set now $A_{1} = - 2\nabla_{\boldsymbol{u}}f_{1}^{T} \boldsymbol{\varSigma } _{12} \nabla_{\boldsymbol{v}}f_{1}$, $A_{2} = -\nabla_{\boldsymbol{u}}f_{1}^{T} \boldsymbol{\varSigma }_{22} \nabla_{\boldsymbol{u}}f_{1}$, B _n=w(X _n)h(X _n) for n=0,1,…, and b _i=w _i h _i, μ _i=E(B ₁|X ₀=i) for i=1,…,m. Consider further without loss of generality that E _π(h)=0 and note that under stationarity, it holds that E _π(h)=E _g(B _n) for all n. Then,

But

$$\sum_{i=1}^m \mu_i g_i = \sum_{i=1}^m E(B_i|X_0=i) g_i = E_g\bigl\{ E(B_1|X_0)\bigr\} = E_g(B_1) = 0 $$

and since X is reversible, i.e., it holds g _i g _ij=g _j g _ji and more generally $g_{i} g_{ij}^{(n)} = g_{j} g_{ji}^{(n)}$ for all n, we conclude that

$$A_1 = 2 \Biggl\{\sum_{i=1}^m b_i \mu_i g_i + 2 \sum _{n=1}^{\infty }\sum_{i=1}^m \sum_{j=1}^mb_i g_i \mu_j g_{ij}^{(n)} \Biggr\}. $$

Moreover,

$$\sum_{i=1}^m b_i \mu_i g_i = \sum_{i=1}^m b_i \Biggl(\sum_{k=1}^m b_kg_{ik} \Biggr) g_i = E_g(B_0B_1) $$

and for all n, we have that

$$\sum_{i=1}^m\sum _{j=1}^m b_ig_i \mu_j g_{ij}^{(n)} = \sum _{i=1}^m \sum_{j=1}^m b_ig_i \Biggl(\sum_{k=1}^m b_kg_{jk} \Biggr) g_{ij}^{(n)} = \sum_{i=1}^m \sum _{k=1}^m b_ib_kg_i \sum_{j=1}^m g_{jk} g_{ij}^{(n)} -\sum_{i=1}^m \sum_{k=1}^m b_ib_kg_i g_{ik}^{(n+1)} = E_g(B_0B_{n+1}). $$

Thus,

$$ A_1 = 2E_g(B_0 B_1) + 4 \sum_{n=1}^{\infty} E_g (B_0 B_{n+1}) = 2 \sum_{n=1}^{\infty} E_g (B_0 B_{n}) + 2 \sum _{n=1}^{\infty} E_g (B_0 B_{n+1}). $$

(8)

On the other hand,

But

and for all n,

Hence,

$$ A_{2} = \sum_{i=1}^{m} \frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}}-2E_{g}(B_{0}B_{2})-2\sum _{n=1}^{\infty}E_{g}(B_{0}B_{n+2}) = \sum_{i=1}^{m}\frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}}-2\sum _{n=1}^{\infty}E_{g}(B_{0}B_{n+1}). $$

(9)

From (8) and (9) we get

$$ \sigma^{2}_{IS}(h)-\sigma^{2}_{\hat{w}}(h) = A_{1}+A_{2} = \sum_{i=1}^{m} \frac{\mu_{i}^{2}g_{i}}{\kappa w_{i}} + 2\sum_{n=1}^{\infty }E_{g}(B_{0}B_{n}). $$

(10)

By the standard asymptotic theory for Markov chains $\sigma ^{2}_{IS}(h)=E_{g}(B_{0}^{2})+2\sum_{n=1}^{\infty}E_{g}(B_{0}B_{n})$. Since $E_{g}(B_{0}^{2})=\operatorname{Var}_{g}\{w(X_{0})h(X_{0})\}$ part (b) of the theorem follows.

(c) In order to compare the two variances note first that the first term of the sum in (10) is clearly positive. Moreover, we know that $\sum_{n=2}^{\infty}E_{g}(B_{0}B_{n}) \equiv\sum _{n=2}^{\infty}\mathrm{Cov}_{g}(B_{0},B_{n}) \geq 0$ since each residual sum of autocovariances starting from an even integer is nonnegative (see for example Geyer 1992). Thus a sufficient condition for the variances difference to be positive is E _g(B ₀ B ₁)≥0. This expectation can be expressed as a quadratic form, namely,

$$E_{g}(B_{0}B_{1}) = \sum _{i=1}^{m}\sum_{j=1}^{m}b_{i}b_{j} \tilde {g}_{ij} = \boldsymbol{b}^{T}\tilde{\boldsymbol{G}} \boldsymbol{b} $$

where $\tilde{g}_{ij}=g_{i}g_{ij}$. We will show that in the case of independence MH, the matrix $\tilde{\boldsymbol{G}}$ is nonnegative definite. Indeed, in this case,

Consider without loss of generality that

$$ q_{1}/\pi_{1}\leq\cdots\leq q_{m}/\pi_{m} $$

(11)

and set γ _ij=κ ⁻¹ q _i π _j. Then clearly $\tilde {g}_{ij}=\gamma_{i\wedge j,i\vee j}$ where i∧j=min{i,j} and i∨j=max{i,j} so $\tilde{\boldsymbol{G}}$ is nonnegative definite. To see that, proceed by induction to show that all its principal minors are nonnegative. Let $\tilde{\boldsymbol{G}}_{kk}$ denote the matrix consisting of the first k rows and columns of $\tilde{\boldsymbol{G}}$. Then, $|\tilde{\boldsymbol{G}}_{11}|\equiv\gamma_{1\wedge1}=\kappa^{-1}q_{1}\pi_{1}>0$. Suppose now that $|\tilde{\boldsymbol{G}}_{kk}|\geq 0$ for some k≥1. In order to prove that $|\tilde{\boldsymbol{G}}_{k+1,k+1}|\geq 0$, multiply the k-th row of $\tilde{\boldsymbol{G}}_{k+1,k+1}$ by π _k+1/π _k and subtract it from the (k+1)-th. Then the resulting matrix has all last row’s elements equal to zero except from the last one which is $\kappa^{-1}q_{k+1}\pi_{k+1}-\kappa^{-1}q_{k}\pi _{k+1}^{2}/\pi _{k}=\kappa^{-1}\pi_{k+1}^{2}(q_{k+1}/\pi_{k+1}-q_{k}/\pi_{k})$. By (11) this quantity is nonnegative thus, $|\tilde {\boldsymbol{G}}_{k+1,k+1}|= \kappa^{-1}\pi_{k+1}^{2}(q_{k+1}/\pi_{k+1}- q_{k}/\pi _{k})|\tilde{\boldsymbol{G}}_{kk}|\geq 0$.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Iliopoulos, G., Malefaki, S. Variance reduction of estimators arising from Metropolis–Hastings algorithms. Stat Comput 23, 577–587 (2013). https://doi.org/10.1007/s11222-012-9331-y

Download citation

Received: 25 June 2011
Accepted: 18 April 2012
Published: 16 May 2012
Issue Date: September 2013
DOI: https://doi.org/10.1007/s11222-012-9331-y