Abstract
In this paper, we analyse a proximal method based on the idea of forward–backward splitting for sampling from distributions with densities that are not necessarily smooth. In particular, we study the non-asymptotic properties of the Euler–Maruyama discretization of the Langevin equation, where the forward–backward envelope is used to deal with the non-smooth part of the dynamics. An advantage of this envelope, when compared to the widely-used Moreau–Yosida envelope and the MYULA algorithm, is that it maintains the MAP estimator of the original non-smooth distribution. We also present a number of numerical experiments that support our theoretical findings.
1 Introduction
The problem of calculating expectations with respect to a probability distribution \(\mathfrak {p}\) in \(\mathbb {R}^d\) is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields.
In practice, often the dimension d is large, which renders deterministic techniques, such as quadrature methods, computationally intractable. In contrast, probabilistic methods do not suffer from this curse of dimensionality and are often the method of choice when the dimension d is large.
In particular, Markov chain Monte Carlo (MCMC) methods are based on the construction of a Markov chain in \(\mathbb {R}^{m}\) with \(m \ge d\), for which the invariant distribution (or its suitable marginal) coincides with the target distribution \(\mathfrak {p}\) (Brooks et al. 2011).
Often, such Markov chains are based on the discretization of stochastic differential equations (SDEs). One such SDE, which is also the focus of this paper, is the (overdamped) Langevin equation
where \(\{W_t\}_t\) is the standard d-dimensional Brownian motion and \(\nabla f\) denotes the gradient of a continuously-differentiable function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\). Under mild assumptions on f, the dynamics of (1) are ergodic with respect to the distribution \(\mathfrak {p}\propto e^{-f}\). In particular, \(\mathfrak {p}\) is the invariant distribution of (1); see Milstein and Tretyakov (2007).
The discretization of (1), however, requires special care because the resulting discrete Markov chain might not be ergodic anymore (Roberts and Tweedie 1996). In addition, even if ergodic, the resulting discrete Markov chain often has a different invariant distribution than \(\mathfrak {p}\), known as the numerical invariant distribution \({\widehat{\mathfrak {p}}}\). The study of the asymptotic error between the numerical invariant distribution \({\widehat{\mathfrak {p}}}\) and the target distribution \(\mathfrak {p}\) has received considerable attention recently (Mattingly et al. 2010; Abdulle et al. 2014). In particular, Mattingly et al. (2010) investigated the effect of discretization on the convergence of the ergodic averages, and Abdulle et al. (2014) presented general order conditions to ensure that the numerical invariant distribution accurately approximates the target distribution.
Another active line of research quantifies the non-asymptotic error between the numerical invariant distribution \({\widehat{\mathfrak {p}}}\) and the target distribution \(\mathfrak {p}\). In particular, when \(\mathfrak {p}\) is a smooth and strongly log-concave distribution, Dalalyan (2017) established non-asymptotic bounds in total variation distance for the Euler–Maruyama discretization of (1), commonly known as the unadjusted Langevin algorithm (ULA). These results have also been extended to the Wasserstein distance \(\textrm{W}_{2}\) and the KL divergence in Durmus and Moulines (2017), Dalalyan (2017), Durmus and Moulines (2019), Dalalyan and Karagulyan (2019), Durmus et al. (2019), to name a few. Typically, these works study the number of iterations that the numerical integrator requires to achieve a desired accuracy, when applied to a target distribution \(\mathfrak {p}\) with a known condition number.
In fact, the above strong log-concavity of \(\mathfrak {p}\) can be substantially relaxed. In particular, using a variant of the reflection coupling, the recent work (Eberle and Majka 2019) derives non-asymptotic bounds for the ULA in the Wasserstein distance \(\textrm{W}_1\), when \(\mathfrak {p}\) is strictly log-concave outside of a ball in \(\mathbb {R}^d\). Similar results for the Wasserstein distance \(\textrm{W}_2\) have been obtained in Majka et al. (2020).
Within the class of log-concave distributions, a significant challenge for the Langevin diffusion in (1) arises when the target distribution \(\mathfrak {p}\) is not smooth and/or has a compact (convex) support in \(\mathbb {R}^d\). One approach to address this challenge is to replace the non-smooth distribution \(\mathfrak {p}\) with a smooth proxy obtained via the so-called Moreau–Yosida (MY) envelope. This new smooth density remains log-concave and thus amenable to the non-asymptotic results discussed earlier. When the support of \(\mathfrak {p}\) is also compact, proximal Monte Carlo methods have been explored in Brosse et al. (2017), Pereyra (2016), Durmus et al. (2018). It is also worth noting that (Bubeck et al. 2018) pursued a different approach for sampling from compactly-supported densities that does not involve the MY envelope.
A potential drawback of the above approach is that the MY envelope often does not maintain the maximum a posteriori (MAP) estimator. That is, the above approach alters the location at which the new (smooth) density reaches its maximum. This is a well-known issue in the context of (non-smooth) convex optimization and is often resolved by appealing to the proximal gradient method. The latter can be understood as the Euler discretization of the so-called forward–backward (FB) envelope (Stella et al. 2017).
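To fix ideas, the proximal gradient iteration \(x_{k+1}=P_{\gamma g}(x_k-\gamma \nabla f(x_k))\) can be sketched in a few lines of Python. The example below uses a lasso-type instance \(f(x)=\tfrac{1}{2}\Vert Ax-b\Vert _2^2\) and \(g=\lambda \Vert \cdot \Vert _1\), which is purely illustrative and not the setting of this paper; all function names are ours.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, lam, gamma, n_iter=200):
    # Proximal gradient iterations x_{k+1} = prox_{gamma*g}(x_k - gamma*grad f(x_k))
    # for f(x) = 0.5*||Ax - b||^2 and g(x) = lam*||x||_1.
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                           # forward (gradient) step
        x = soft_threshold(x - gamma * grad, gamma * lam)  # backward (proximal) step
    return x
```

With \(A=I\) and \(\gamma =1\), the iteration reaches its fixed point, the soft-thresholded data, after a single step.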
1.1 Contributions
This work explores and analyzes the use of the FB envelope for sampling from non-smooth and compactly-supported log-concave distributions. In analogy with the Langevin proximal Monte Carlo, we replace the non-smooth density with a smooth proxy obtained via the FB envelope. In particular, this proxy is strongly log-concave over long distances.
Crucially, the new proxy also maintains the MAP estimator under certain assumptions. However, this improvement comes at the cost of requiring additional smoothness for the smooth part of the density. Lastly, the strong convexity of the new proxy over long distances allows us to utilise the work of Eberle and Majka (2019) to obtain non-asymptotic guarantees for our method in the Wasserstein distance \(\textrm{W}_1\).
In addition to investigating the use of the FB envelope in sampling, this work has the following contributions:
- It develops a general theoretical framework for sampling from non-smooth densities, built on the notion of admissible envelopes. The MY and FB envelopes are both instances of admissible envelopes.
- It proposes a new Langevin algorithm to sample from non-smooth densities, dubbed EULA. EULA generalizes MYULA in the sense that it can work with any admissible envelope, e.g., MY or FB. EULA can also handle a family of increasingly accurate envelopes rather than a fixed envelope.
1.2 Organisation
The rest of the paper is organised as follows. Section 2 formalizes the problem of sampling from a non-smooth and compactly-supported log-concave distribution. As a proxy for this non-smooth distribution, its (smooth) Moreau–Yosida (MY) envelope is reviewed in Sect. 3. This section also explains the main limitation of MY envelope, i.e., its inaccurate MAP estimation. In Sect. 4, we introduce the forward–backward (FB) envelope which overcomes the key shortcoming of the MY envelope.
Section 5 introduces and analyses EULA, an extension of the popular ULA for sampling from a non-smooth distribution. EULA can be adapted to various envelopes. In particular, MYULA from Brosse et al. (2017) is a special case of EULA for the MY envelope. Section 6 proves the iteration complexity of EULA and Sect. 7 presents a few numerical examples to support the theory developed here.
2 Statement of the problem
Consider a compact convex set \(\textrm{K}\subset \mathbb {R}^d\). For a pair of functions \(\overline{f}:\mathbb {R}^d\rightarrow \mathbb {R}\) and \(\overline{g}:\mathbb {R}^d\rightarrow \mathbb {R}\), our objective in this work is to sample from the probability distribution
whenever the ratio above is well-defined. In order to sample from \(\mathfrak {p}\), we only have access to the gradient of \(\overline{f}\) and the proximal operator for \(\overline{g}\), to be defined later. Our assumptions on \(\textrm{K},\overline{f},\overline{g}\) are detailed below.
Assumption 1
We make the following assumptions:
1. For radii \(R\ge r>0\), assume that \(\textrm{K}\subset \mathbb {R}^d\) is a compact convex body that satisfies \(\textrm{B}(0,r)\subset \textrm{K}\subset \textrm{B}(0,R)\). Here, \(\textrm{B}(0,r)\) is the Euclidean ball of radius r centered at the origin.
2. Assume also that \(\overline{f}:\mathbb {R}^d\rightarrow \mathbb {R}\) is a convex function that is three-times continuously differentiable.
3. Assume lastly that \(\overline{g}:\mathbb {R}^d\rightarrow (-\infty ,\infty ]\) is a proper closed convex function. Moreover, we assume that \(\overline{g}\) is continuous.
A few important remarks about Assumption 1 are in order. First, in the special case when \(\overline{f}\) is a convex quadratic (Luu et al. 2021), the assumption of thrice-differentiability above is trivially met and some of the developments below are simplified. However, our more general setup here necessitates the thrice-differentiability above and leads below to more involved technical derivations.
Second, for the sake of mathematical convenience, instead of the two functions \(\overline{f},\overline{g}\), we will work throughout this manuscript with two new functions f, g, without any loss of generality.
More specifically, instead of \(\overline{f}\), we consider a convex function f that coincides with \(\overline{f}\) on the set \(\textrm{K}\), has a compact support and a continuously differentiable Hessian. For this function f, the compactness of \(\textrm{K}\) and the smoothness of \(\overline{f}\) in Assumption 1 together imply that \(f,\nabla f,\nabla ^2 f\) are all Lipschitz-continuous functions. To summarize, for the function f described above, there exist nonnegative constants \(\lambda _0,\lambda _1,\lambda _2,\lambda _3\) such that
for every \(x,y\in \mathbb {R}^d\). In particular, the compactness of \(\textrm{K}\) and the differentiability of \(\overline{f}\) in Assumption 1 imply that f is finite, i.e., \(\max _x |f(x)|<\infty \).
Instead of \(\overline{g}\), we consider the proper closed convex function g, defined as
where \(1_\textrm{K}\) is the indicator function for the set \(\textrm{K}\). That is, \(1_\textrm{K}(x)=0\) if \(x\in \textrm{K}\) and \(1_\textrm{K}(x)=\infty \) if \(x\notin \textrm{K}\). The compactness of \(\textrm{K}\) and the continuity of \({\bar{g}}\) in Assumption 1 together imply that g is finite, when its domain is limited to the set \(\textrm{K}\). Outside of the set \(\textrm{K}\), g is infinite. To summarize, the new function g is lower semi-continuous and also satisfies
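As an aside, the proximal operator of the indicator \(1_\textrm{K}\) is simply the Euclidean projection onto \(\textrm{K}\), independently of the parameter \(\gamma \). A minimal sketch for the special case \(\textrm{K}=\textrm{B}(0,R)\) (the function name is ours):

```python
import numpy as np

def prox_indicator_ball(x, R):
    # Proximal operator of the indicator 1_K for K = B(0, R).
    # For an indicator function, the prox is the Euclidean projection
    # onto K, regardless of the scaling gamma.
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x
```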
Using the new functions f and g, we rewrite the definition of \(\mathfrak {p}\) in (2) as
Above, F is often referred to as the potential associated with \(\mathfrak {p}\). The last identity above holds by construction. Indeed, on the set \(\textrm{K}\), the functions f and \(\overline{f}\) coincide. Likewise, on the set \(\textrm{K}\), the functions g and \(\overline{g}\) coincide. Outside of the set \(\textrm{K}\), however, both sides of the last equality above are infinite. In particular, F is finite on \(\textrm{K}\) and infinite outside of this set.
In view of (6b), instead of \(\overline{f}\) and \(\overline{g}\), we can work with f and g without any change to the target distribution \(\mathfrak {p}\). This will be more convenient for the proofs. As a side note, Assumption 1 implies that \(\mathfrak {p}\) is finite; see after (3) and (5)–(6). The integral in the denominator in (6a) is also finite by Assumption 1. When there is no confusion, we will overload our notation and use \(\mathfrak {p}\) to also denote the probability measure associated with the law \(\mathfrak {p}\).
Since g is not differentiable, F in (6b) is itself non-differentiable. In turn, this means that one cannot provably use gradient-based algorithms (such as the ULA) to sample from \(\mathfrak {p}\propto \exp (-F)\) (Dalalyan and Karagulyan 2019). One way to deal with this issue is to replace F with a smooth function \(F_{\gamma }\), which we will refer to as an envelope. To this envelope we can then apply the ULA. It is reasonable to require this envelope \(F_\gamma \) to fulfill the following admissibility assumptions.
Definition 1
(Admissible envelopes) For \(\gamma ^0>0\), the functions \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) are admissible envelopes of F if
1. There exists a function \(F^0:\mathbb {R}^d\rightarrow [-\infty ,\infty ]\) such that \(e^{-F^0}\) is integrable, and \(F_\gamma \) dominates \(F^0\). That is, for every \(x\in \mathbb {R}^d\) and \(\gamma \in (0, \gamma ^0)\), it holds that \(F_\gamma (x) \ge F^0(x)\).
2. \(F_\gamma :\mathbb {R}^d\rightarrow [-\infty ,\infty ]\) converges pointwise to F. That is, \( \lim _{\gamma \rightarrow 0} F_\gamma (x)= F(x)\) for every \(x\in \mathbb {R}^d\).
3. \(F_\gamma \) is \(\lambda _\gamma \)-smooth, i.e., there exists a constant \(\lambda _{\gamma }\ge 0\) such that for all \(x,y\in \mathbb {R}^d\)
$$\begin{aligned} \Vert \nabla F_\gamma (x)-\nabla F_\gamma (y)\Vert _2\le \lambda _{\gamma } \Vert x-y\Vert _2,\quad \gamma \in (0,\gamma ^0). \end{aligned}$$
If \(\{F_\gamma :\gamma \in (0, \gamma ^0)\}\) are admissible envelopes of F, we can define the corresponding probability densities: For every \(x\in \mathbb {R}^d\) and every \(\gamma \in (0,\gamma ^0)\), we define
Remark 1
Definition 1.1 implies that \(\mathfrak {p}_\gamma \) can be normalized and (7) is thus well-defined. This observation and Definition 1.2 together imply, after an application of the dominated convergence theorem, that
where the probability measures \(\mathfrak {p}\) and \(\mathfrak {p}_\gamma \) are defined in (6a) and (7), respectively. That is, \(\mathfrak {p}_\gamma \) converges weakly to \(\mathfrak {p}\) in the limit of \(\gamma \rightarrow 0\). In other words, we can use \(\mathfrak {p}_\gamma \) as a proxy for \(\mathfrak {p}\), provided that \(\gamma \) is sufficiently small. Finally, as we will see shortly, Definition 1.3 will help us establish the convergence of the ULA to an invariant distribution close to \(\mathfrak {p}_\gamma \), provided that the step size of the ULA is small.
3 Moreau–Yosida envelope and its limitation
For \(\gamma >0\), let us define
where
is the Moreau–Yosida (MY) envelope of g. Somewhat inaccurately, we will also refer to \(F_\gamma ^{MY }\) as the MY envelope of F, to distinguish \(F^{MY }_\gamma \) from its alternative later in this paper. It is well-known that \(g_\gamma \) is \(\gamma ^{-1}\)-smooth and that \(g_\gamma \) converges pointwise to g in the limit of \(\gamma \rightarrow 0\). These facts enable us to establish the admissibility of MY envelopes, as detailed below. All proofs are deferred to the appendices. We note that the result below closely relates to Durmus et al. (2018, Proposition 1). In effect, the result below restates the result in Durmus et al. (2018) in the framework of admissible envelopes.
Proposition 1
(Admissibility of MY envelopes) Suppose that Assumption 1 is fulfilled. Then \(\{F_\gamma ^{MY }:\gamma > 0\}\) are admissible envelopes of F in (6b). In particular, \(\nabla F_\gamma ^{MY }\) is \((\lambda _2+\gamma ^{-1})\)-Lipschitz continuous, and given by the expression
Above, \(\lambda _2\) was defined in (3c), and \(P_{\gamma g}:\mathbb {R}^d\rightarrow \mathbb {R}^d\) is the proximal operator associated with the function \(\gamma g\).
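For illustration, in the special case \(g=1_{\textrm{B}(0,R)}\) the proximal operator \(P_{\gamma g}\) reduces to the projection onto the ball, and the gradient of the MY envelope \(F^{MY }_\gamma =f+g_\gamma \) takes the standard form \(\nabla f(x)+\gamma ^{-1}(x-P_{\gamma g}(x))\). A Python sketch of this special case (function names are ours):

```python
import numpy as np

def project_ball(x, R):
    # Euclidean projection onto B(0, R), i.e., P_{gamma g} for g = 1_{B(0,R)}.
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def grad_F_MY(x, grad_f, gamma, R):
    # Gradient of the MY envelope specialized to g = 1_{B(0,R)}:
    # grad f(x) + (x - P_{gamma g}(x)) / gamma.
    return grad_f(x) + (x - project_ball(x, R)) / gamma
```

Note that the gradient is \((\lambda _2+\gamma ^{-1})\)-Lipschitz, consistent with Proposition 1, since the projection is 1-Lipschitz.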
Remark 2
(Connection to Nesterov’s smoothing technique) Alternatively, we can also view the MY envelope through the lens of Nesterov’s smoothing technique (Nesterov 2005). More specifically, if Assumption 1 is fulfilled, one can invoke a standard minimax theorem to verify that
where \(g^*\) is the Fenchel conjugate of g. This interpretation is central to Nesterov’s technique for minimizing the non-smooth function F in (6a).
In view of the admissibility of \(F^{MY }_{\gamma }\) by Proposition 1, we can apply the ULA to the new (smooth) potential \(F^{MY }_{\gamma }\) instead of the non-smooth F. In particular, if \(\gamma \) is sufficiently small, then \(\mathfrak {p}^{MY }_\gamma \propto \exp (-F_{\gamma }^{MY })\) would be close to the target distribution \(\mathfrak {p}\) by Remark 1. This technique is known as MYULA (Brosse et al. 2017).
However, a limitation of the MY envelope is that the minimizers of \(F^{MY }_{\gamma }\) are not necessarily the same as the minimizers of F. In turn, the MAP estimator of \(\mathfrak {p}_\gamma ^{MY }\), denoted by \(x^{MY }_\gamma \), might not coincide with the MAP estimator of \(\mathfrak {p}\), except in the limit of \(\gamma \rightarrow 0\). That is,
This observation is particularly problematic because, as we will see later, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues.
In view of this discussion, our objective is to replace the MY envelope with a new envelope that has the same minimizers as F for all sufficiently small \(\gamma \), not just in the limit of \(\gamma \rightarrow 0\).
4 Forward–backward envelope
In this section, we will study an envelope that addresses the limitations of the MY envelope. More specifically, for \(\gamma >0\), let us recall from Stella et al. (2017) that the forward–backward (FB) envelope of the function F in (6b) is defined as
where \(g_\gamma \) was defined in (8). A number of useful properties of \(F^{FB }_\gamma \) are collected below for the convenience of the reader, borrowed from Stella et al. (2017). Recall that \(P_{\gamma g}\) in (9) denotes the proximal operator associated with the function \(\gamma g\).
Proposition 2
(Properties of the FB envelope) Suppose that Assumption 1 is fulfilled. For \(\gamma \in (0,1/\lambda _2)\) and every \(x\in \mathbb {R}^d\), it holds that
1. \(F( P_{\gamma g}(x-\gamma \nabla f(x)) ) \le F_\gamma ^{FB }(x) \le F(x) \), which relates the function F to its FB envelope.
2. \(F_{\frac{\gamma }{1-\gamma \lambda _2}}^{MY }(x) \le F_\gamma ^{FB }(x) \le F_\gamma ^{MY }(x)\), which relates the MY and FB envelopes of the function F.
3. \(F_\gamma ^{FB }\) is continuously differentiable and its gradient is given by
$$\begin{aligned} \nabla F_\gamma ^{FB }(x)= \frac{(I-\gamma \nabla ^2 f(x))}{\gamma } (x - P_{\gamma g}(x-\gamma \nabla f(x))). \end{aligned}$$
4. \({{\,\mathrm{\hbox {arg min}}\,}}F^{FB }_\gamma = {{\,\mathrm{\hbox {arg min}}\,}}F\), i.e., the function F and its FB envelope have the same minimizers.
In view of Proposition 2.4, a remarkable property of the FB envelope is that the modes of the distribution \(\mathfrak {p}_\gamma ^{FB }\propto \exp (-F^{FB }_\gamma )\) coincide with the modes of the target distribution \(\mathfrak {p}\propto \exp (-F)\), for all sufficiently small \(\gamma \), rather than only in the limit of \(\gamma \rightarrow 0\). Indeed, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues. This observation signifies the advantage of the FB envelope over the MY envelope. Recall that the modes of \(\mathfrak {p}^{MY }_\gamma \) coincide with those of \(\mathfrak {p}\) only in the limit of \(\gamma \rightarrow 0\); see Sect. 3.
As a side note, let us remark that the proximal gradient descent algorithm for minimizing the (non-smooth) function F coincides with gradient descent (with variable metric) for minimizing the (smooth) function \(F^{FB }_\gamma \), whenever \(\gamma \) is sufficiently small (Stella et al. 2017). It is also easy to use Proposition 2 to check the admissibility of the FB envelopes, as summarized below.
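To make the gradient formula in Proposition 2.3 concrete, the sketch below evaluates \(\nabla F^{FB }_\gamma \) for the illustrative special case \(f(x)=\tfrac{1}{2}x^\top Qx\) (so that \(\nabla ^2 f=Q\)) and \(g=1_{\textrm{B}(0,R)}\); function names are ours, and \(\gamma \) should be taken in \((0,1/\lambda _2)\) with \(\lambda _2=\Vert Q\Vert \).

```python
import numpy as np

def project_ball(x, R):
    # P_{gamma g} for g = 1_{B(0,R)}: Euclidean projection onto the ball.
    n = np.linalg.norm(x)
    return x if n <= R else (R / n) * x

def grad_F_FB(x, Q, gamma, R):
    # Gradient of the FB envelope for f(x) = 0.5 * x^T Q x and g = 1_{B(0,R)}:
    #   (I - gamma * Q) / gamma @ (x - P_{gamma g}(x - gamma * grad f(x))).
    grad_f = Q @ x
    z = project_ball(x - gamma * grad_f, R)
    return (np.eye(len(x)) - gamma * Q) @ (x - z) / gamma
```

Consistent with Proposition 2.4, this gradient vanishes at the constrained minimizer (the origin, when \(Q\) is positive definite and \(0\in \textrm{K}\)).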
Proposition 3
(Admissibility of FB envelopes) Suppose that Assumption 1 is fulfilled. Then \(\{F_\gamma ^{FB }:\gamma \in (0,\gamma ^{FB })\}\) are admissible envelopes of F in (6b), where
Moreover,
for every \(x,y\in \mathbb {R}^d\). The second inequality above holds when \(\Vert x-y\Vert _2 \ge \rho _\gamma ^{FB }\). Lastly,
Equation (11) provides valuable information about the landscape of the FB envelope of F, which we now summarize: (11a) means that \(F^{FB }_\gamma \) is a \(\lambda _\gamma ^{FB }\)-smooth function. The smoothness of \(F^{FB }_\gamma \) in (FB) is not surprising since both f and \(g_\gamma \) are smooth functions. (Recall that \(g_\gamma \) is the MY envelope of g, which is known to be \(\gamma ^{-1}\)-smooth.) Moreover, even though \(F^{FB }_\gamma \) is not necessarily a strongly convex function, (11b) implies that \(F^{FB }_\gamma \) behaves like a strongly convex function over long distances. As detailed in the proof, (11b) holds essentially because the MY envelope of the indicator function \(1_\textrm{K}\) is the “quadratic” function \(\tfrac{1}{2\gamma }{\text {dist}}(\cdot ,\textrm{K})^2\). The latter function grows quadratically far away from the origin. Here, \({\text {dist}}(\cdot ,\textrm{K})\) is the distance to the set \(\textrm{K}\).
It is worth noting that a similar result to Proposition 3 is implicit in Brosse et al. (2017) for the MY envelope. That is, the MY envelope \(F^{MY }_\gamma \) also satisfies (11), albeit with different constants.
Remark 3
(Convergence in the Wasserstein metric) Recall from Remark 1 that \(\mathfrak {p}_\gamma ^{FB }\) converges weakly to \(\mathfrak {p}\) in the limit of \(\gamma \rightarrow 0\). This weak convergence implies convergence in the Wasserstein metric by Bou-Rabee and Eberle (2020, Lemma 2.6), i.e.,
We recall here that, for two probability measures \(\mathfrak {q}_1\) and \(\mathfrak {q}_2\) that satisfy \(\mathbb {E}_{x\sim \mathfrak {q}_1}\Vert x\Vert _2<\infty \) and \(\mathbb {E}_{y\sim \mathfrak {q}_2}\Vert y\Vert _2 <\infty \), their 1-Wasserstein or Kantorovich distance (Villani 2009) is
With some abuse of notation, throughout this work, we will occasionally replace the probability measures above with the corresponding probability distributions or random variables.
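As a quick illustration of (13): in one dimension, the optimal coupling between two empirical measures with equally many atoms simply matches the sorted samples, which yields a simple estimator of \(\textrm{W}_1\). A minimal sketch (the function name is ours):

```python
import numpy as np

def w1_empirical_1d(xs, ys):
    # Empirical W1 distance between two equal-size one-dimensional samples:
    # on the real line, the optimal coupling matches the sorted samples.
    xs, ys = np.sort(xs), np.sort(ys)
    return float(np.mean(np.abs(xs - ys)))
```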
A non-asymptotic version of Remark 3 is presented below, which bounds the Wasserstein distance between the two probability measures \(\mathfrak {p}^{FB }_\gamma \) and \(\mathfrak {p}\). The result below is an analogue of Brosse (2017, Proposition 5) for the MY envelope. The key ingredient of their result is Steiner's formula for the volume of the set \(\textrm{K}+\textrm{B}(0,t)=\{x: {\text {dist}}(x,\textrm{K})\le t\}\), where the sum is in the Minkowski sense. Essentially, our proof strategy is to use Proposition 2.2 to relate the FB and MY envelopes and then invoke Brosse (2017, Proposition 5).
Theorem 1
(Wasserstein distance between \(\mathfrak {p}^{FB }_\gamma \) and \(\mathfrak {p}\)) Suppose that Assumption 1 is fulfilled. For \(\gamma \in (0,\gamma ^{FB })\), it holds that
where
Above, \(\text {d}_i(\textrm{K})\) is the i-th intrinsic volume of \(\textrm{K}\); see Klain and Rota (1997). In particular, the d-th volume of \(\textrm{K}\) coincides with the standard volume of \(\textrm{K}\), i.e., \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\). Moreover, to keep the notation light, above we have suppressed the dependence of \(c_1\) to \(c_5\) on \(\textrm{K},f,g,\gamma \).
As a sanity check, consider the special case of \(g=1_\textrm{K}\), where \(1_\textrm{K}\) is the indicator function for the set \(\textrm{K}\). Then we can use (15) to verify that \(c_1\), \(I_1(\gamma )\), \(I_2(\gamma )\) and \(I_2(\gamma /(1-\gamma \lambda _2))\) all vanish as \(\gamma \rightarrow 0\). Consequently, both the left- and right-hand sides of (14) vanish in this limit. When \(g=1_\textrm{K}\), Theorem 1 is precisely the analogue of Brosse (2017, Proposition 5) for the FB envelope. Their work, however, does not cover the case of \(g\ne 1_\textrm{K}\).
When \(g\ne 1_\textrm{K}\) and \(\gamma \rightarrow 0\), the right-hand side of (14) converges to the nonzero value
unlike the left-hand side of (14), which converges to zero by (12). Improving (14) in the case \(g\ne 1_\textrm{K}\) appears to require highly restrictive assumptions on g, which we wish to avoid here. Loosely speaking, the technical difficulty is as follows: because the volume is obtained by integrating the indicator function, the Steiner formula is particularly suited to the case \(g=1_\textrm{K}\). When \(g\ne 1_\textrm{K}\), we are not aware of an analogue of Steiner's formula. At any rate, very small values of \(\gamma \) are often avoided in practice due to numerical stability issues. In this sense, improving (14) for very small values of \(\gamma \) might have limited practical value.
To summarize this section, we established that the FB envelopes \(\{F^{FB }_\gamma : \gamma \in (0,\gamma ^{FB })\}\) are admissible and we can thus use them as a differentiable proxy for the non-smooth function F in (6b). Crucially, the FB envelope addresses the key limitation of the MY envelope: the modes of \(\mathfrak {p}^{FB }_\gamma \propto \exp (-F^{FB }_\gamma )\) coincide with the modes of \(\mathfrak {p}\propto \exp (-F)\) for all sufficiently small \(\gamma \), rather than only in the limit of \(\gamma \rightarrow 0\).
5 EULA: envelope unadjusted Langevin algorithm
We have so far introduced two smooth envelopes for the non-smooth function F in (6b), namely, the MY envelope \(F^{MY }_\gamma \) in (MY) and the FB envelope \(F^{FB }_\gamma \) in (FB). We also described in Sect. 4 the advantage of the FB envelope over the MY envelope. To keep our discussion general, below we consider admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) for the target function F in (6b); see Definition 1. Our discussion below can be specialized to either of the envelopes by setting \(F_\gamma =F^{MY }_\gamma \) or \(F_\gamma =F^{FB }_\gamma \), respectively.
For the time being, let us fix \(\gamma \in (0,\gamma ^0)\). Unlike F, note that \(\nabla F_\gamma \) exists and is Lipschitz continuous by Definition 1.3. We can now use the ULA (Dalalyan and Karagulyan 2019) to sample from \(\mathfrak {p}_\gamma \propto \exp (-F_\gamma )\), as a proxy for the target measure \(\mathfrak {p}\propto \exp (-F)\). The k-th iteration of the resulting algorithm is
where h is the step size and \(\zeta _{k+1} \in \mathbb {R}^d\) is a standard Gaussian random vector, independent of \(\{\zeta _i\}_{i\le k}\). In particular, if we choose \(F_\gamma =F^{MY }_\gamma \), then (16) coincides with MYULA from Brosse et al. (2017).
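A single iteration of (16) can be sketched as follows, where `grad_F_gamma` is assumed to be a callable returning \(\nabla F_\gamma \) (the function name is ours):

```python
import numpy as np

def ula_step(x, grad_F_gamma, h, rng):
    # One iteration of (16):
    #   x_{k+1} = x_k - h * grad F_gamma(x_k) + sqrt(2h) * zeta_{k+1},
    # with zeta_{k+1} a standard Gaussian vector.
    zeta = rng.standard_normal(x.shape)
    return x - h * grad_F_gamma(x) + np.sqrt(2.0 * h) * zeta
```

For the toy potential \(F_\gamma (x)=\Vert x\Vert _2^2/2\) (a standard Gaussian target), long runs of this step produce samples whose mean and variance are close to 0 and 1, up to the \(O(h)\) discretization bias.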
Under standard assumptions, to be reviewed later, the Markov chain \(\{x_k\}_{k\ge 0}\) in (16) has a unique invariant probability measure, which we denote by \(\widehat{\mathfrak {p}}_{\gamma ,h}\). There are two sources of error that contribute to the difference between \(\widehat{\mathfrak {p}}_{\gamma ,h}\) and the target measure \(\mathfrak {p}\) in (6a), which we list below:
1. First, note that (16) is only intended to sample from \(\mathfrak {p}_\gamma \), as a proxy for the target distribution \(\mathfrak {p}\). That is, the first source of error is the difference between the two probability measures \(\mathfrak {p}_\gamma \) and \(\mathfrak {p}\).
2. Second, the step size h is known to contribute to the difference between the two probability measures \(\widehat{\mathfrak {p}}_{\gamma ,h}\) and \(\mathfrak {p}_\gamma \); see Dalalyan and Karagulyan (2019). This bias vanishes only in the limit of \(h\rightarrow 0\).
In fact, instead of (16), we study here a slightly more general algorithm that allows \(\gamma \) and h to vary. More specifically, for a nonincreasing sequence \(\{\gamma _k\}_{k\ge 0}\) and step sizes \(\{h_k\}_{k\ge 0}\), the k-th iteration of this more general algorithm is
where \(\zeta _{k+1} \in \mathbb {R}^d\) is a standard Gaussian random vector independent of \(\{\zeta _i\}_{i\le k}\). (EULA) stands for Envelope Unadjusted Langevin Algorithm.
In particular, if we set \(\gamma _k=\gamma \) in (EULA) for every \(k\ge 0\), then we retrieve (16). Alternatively, if \(\{\gamma _k\}_{k\ge 0}\) is a decreasing sequence, then \(F_{\gamma _k}\) becomes an increasingly better approximation of the target potential function F as k increases; see Definition 1.2. That is, (EULA) uses increasingly better approximations of the potential function F as k increases.
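The varying-parameter iteration (EULA) can be sketched as follows; `grad_F(x, gamma)` is assumed to evaluate \(\nabla F_{\gamma }(x)\) for the chosen envelope, and the function names are ours.

```python
import numpy as np

def eula(x0, grad_F, gammas, hs, rng):
    # EULA iterations:
    #   x_{k+1} = x_k - h_k * grad F_{gamma_k}(x_k) + sqrt(2 h_k) * zeta_{k+1}.
    # gammas is typically a nonincreasing sequence, so that F_{gamma_k}
    # approximates the potential F increasingly well.
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for gamma_k, h_k in zip(gammas, hs):
        zeta = rng.standard_normal(x.shape)
        x = x - h_k * grad_F(x, gamma_k) + np.sqrt(2.0 * h_k) * zeta
        traj.append(x.copy())
    return np.array(traj)
```

Setting all entries of `gammas` to a fixed \(\gamma \) recovers the fixed-envelope iteration (16).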
We next present the iteration complexity of (EULA) for admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\), where admissibility was defined in Definition 1. The result below can be specialized to both the MY and FB envelopes by setting \(F_\gamma =F^{MY }_\gamma \) or \(F_\gamma =F^{FB }_\gamma \), respectively.
Theorem 2
(Iteration complexity of (EULA)) For \(\gamma ^0>0\), consider admissible envelopes \(\{F_\gamma :\gamma \in (0,\gamma ^0)\}\) of F in (6b); see Definition 1. For \(\mu _\gamma >0\) and \(\rho _\gamma \ge 0\), we additionally assume that \(F_\gamma \) satisfies
when \(\Vert x-y\Vert _2\ge \rho _\gamma \) and \(\gamma \in (0,\gamma ^0)\). Consider two sequences \(\{\gamma _k\}_{k\ge 0}\subset (0,\gamma ^0)\) and \(\{h_k\}_{k\ge 0} \subset \mathbb {R}_+\). For the algorithm (EULA), let \(\mathfrak {q}_k\) denote the law of \(x_k\) for every integer \(k\ge 0\). That is, \(x_k \sim \mathfrak {q}_k\) for every \(k\ge 0\). Then the \(\textrm{W}_1\) distance between \(\mathfrak {q}_k\) and the target measure \(\mathfrak {p}\propto e^{-F}\) in (6a) is bounded by
for every \(k\ge 0\), provided that
Above, \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Moreover,
When \(\rho _\gamma =0\), note that (17) requires \(F_\gamma \) to be \(\mu _\gamma \)-strongly convex for every \(\gamma \in (0,\gamma ^0)\). This can happen, for example, when f itself is a strongly convex function. A shortcoming of Theorem 2 is that (18) forces \(h_k=0\) whenever \(\rho _{\gamma _k}=0\). It becomes clear from our proof that we have inherited this shortcoming from Eberle and Majka (2019); see Proposition 4. Their work also offers a more involved result that does not suffer from this issue; see Eberle and Majka (2019, Theorem 2.12). However, we opted in this work for their simplified result in Proposition 4, though this can be improved.
A particularly important special case of Theorem 2 is when we choose the FB envelope, and use a fixed \(\gamma \) and step size h.
Corollary 1
(Iteration complexity of (EULA) for the FB envelope) Suppose that Assumption 1 is fulfilled. For the algorithm (EULA), suppose that \(\gamma _k=\gamma \in (0,\gamma ^{FB })\), \(F_{\gamma _k}=F^{FB }_\gamma \) and \(h_k=h>0\) for every integer \(k\ge 0\); see (FB) and (10). In (EULA), also let \(\mathfrak {q}_k\) denote the law of \(x_k\) for every \(k\ge 0\). That is, \(x_k \sim \mathfrak {q}_k\) for every integer \(k\ge 0\). Then the \(\textrm{W}_1\) distance between \(\mathfrak {q}_k\) and the target measure \(\mathfrak {p}\propto e^{-F}\) in (6a) is bounded by
for every \(k\ge 0\), provided that \(\gamma \in (0,\gamma ^0)\), and
Above, \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Moreover,
The remaining quantities were defined in Propositions 3 and 1.
We remark that Corollary 1 for the FB envelope is the analogue of Brosse (2017, Proposition 7) for the MY envelope. However, note that Brosse (2017, Proposition 7) requires f to be strongly convex, whereas we merely assume f to be convex; see Assumption 1.
6 Proof of Theorem 2
To begin, we let \(Q_k\) denote the Markov transition kernel associated with the Markov chain \(\{x_k\}_{k\ge 0}\). This transition kernel is specified as
where \(\text {Normal}(a,B)\) is the Gaussian probability measure with mean \(a\in \mathbb {R}^d\) and covariance matrix \(B\in \mathbb {R}^{d\times d}\). Above, note that \(Q_k\) depends on both \(h_k\) and \(\gamma _k\). We also let \(\mathfrak {q}_k\) denote the law of \(x_k\), i.e., \(x_k \sim \mathfrak {q}_k\). Using the standard notation, we can now write that
To be precise, (21) is equivalent to
for every \(y\in \mathbb {R}^d\). Recall that \(\mathfrak {p}_{\gamma _{k+1}}\) serves as a proxy for the target probability measure \(\mathfrak {p}\). The \(\textrm{W}_1\) distance between \(\mathfrak {q}_{k+1}\) and \(\mathfrak {p}_{\gamma _{k}}\) can be bounded as
where (i) uses the triangle inequality. We separately control each \(\textrm{W}_1\) distance in the last line above. For the first distance, we plan to invoke Theorem 2.12 from Eberle and Majka (2019), reviewed below for the convenience of the reader. It is worth noting that a similar result to the one below appears in Majka et al. (2020, Corollary 2.4).
Proposition 4
(Eberle and Majka 2019, Theorem 2.12) Let
where \(c_0\ge 0.007\) is a universal constant specified in Eberle and Majka (2019, Eq. (6.6)). Then it holds that
where \(\textrm{W}_\Theta \) is defined similarly to (13) but with the \(\ell _2\)-norm replaced by \(\Theta \). Above, to keep the notation light, we have suppressed the dependence of \(c_6\), \(c_7\), and \(c_8\) on \(\textrm{K},f,g,h_k\). Moreover, the two metrics \(\textrm{W}_\Theta \) and \(\textrm{W}_1\) are related as
As for the second \(\textrm{W}_1\) distance in the last line of (23), the following result is standard; see the appendix for its proof.
Lemma 1
(Discretization error) It holds that
In fact, it is more common to write the left-hand side of (26) in terms of the Markov transition kernel of the corresponding Langevin diffusion, as discussed in the proof of Lemma 1. By combining Proposition 4 and Lemma 1, we can now revisit (23) and write that
where (i) uses (23), (ii) uses Proposition 4 and (25), and (iii) uses Lemma 1. Using the triangle inequality, it immediately follows that
where (i) uses the triangle inequality, (ii) uses (27), and (iii) uses (25). By unwrapping (28), we find that
where (i) and (iii) use (25), and (ii) uses (28). Lastly, we can use (29) in order to bound the \(\textrm{W}_1\) distance at iteration k to the target measure \(\mathfrak {p}\) as
where (i) uses the triangle inequality, and (ii) uses (29). This completes the proof of Theorem 2.
7 Numerical experiments
A few numerical experiments are presented below to support our theoretical contributions.
7.1 Truncated Gaussian
Our first numerical experiment deals with sampling from a truncated Gaussian distribution, restricted to a box \(K_{d} \subset \mathbb {R}^d\). For this problem, the potential \(U:\mathbb {R}^d\rightarrow \mathbb {R}\) is specified as
Here, similarly to Brosse et al. (2017), the (i, j)-th entry of the covariance matrix is given by
We now consider three scenarios, namely,
-
\(d=2\) with \(K_{2}=[0,5] \times [0,1]\),
-
\(d=10\) with \(K_{10}=[0,5] \times [0,0.5]^{9}\),
-
\(d=100\) with \(K_{100}=[0,5] \times [0,0.5]^{99}\).
Using quadrature techniques, it is possible in the two-dimensional case (\(d=2\)) to calculate the mean and the covariance of the truncated Gaussian distribution exactly, as well as the corresponding approximations obtained via the MY and FB envelopes. More specifically, Fig. 1 uses MATLAB’s integral2 command to plot the following quantities for various values of \(\gamma \):
The horizontal lines in Fig. 1 show the ground truth values obtained by MATLAB’s integral2 command, i.e.,
For small values of the parameter \(\gamma \), we observe in Fig. 1 that the FB envelope approximates the mean of the first component better than the MY envelope does. However, the FB envelope tends to overestimate the variance. This can be understood by comparing the two envelopes: the smoothing of the MY envelope is more localized than that of the FB envelope.
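Our quadrature computations use MATLAB's integral2; an equivalent check can be sketched in Python with a tensor-grid midpoint rule. The sketch below computes the mean of the first coordinate of a truncated Gaussian on the box \([0,5]\times [0,1]\), using an identity covariance purely for illustration (the covariance of Brosse et al. 2017 is not reproduced here).

```python
import numpy as np

def truncated_gaussian_mean(lo, hi, n=1000):
    """Midpoint-rule estimate of E[x_1] for a standard Gaussian restricted
    to the box [lo[0], hi[0]] x [lo[1], hi[1]] (identity covariance assumed)."""
    e1 = np.linspace(lo[0], hi[0], n + 1)
    e2 = np.linspace(lo[1], hi[1], n + 1)
    c1 = 0.5 * (e1[:-1] + e1[1:])          # cell midpoints
    c2 = 0.5 * (e2[:-1] + e2[1:])
    X1, X2 = np.meshgrid(c1, c2, indexing="ij")
    dens = np.exp(-0.5 * (X1**2 + X2**2))  # unnormalized density
    return (X1 * dens).sum() / dens.sum()  # cell areas cancel in the ratio

m1 = truncated_gaussian_mean([0.0, 0.0], [5.0, 1.0])
```

For this separable example the truncated mean has the closed form \((\varphi (0)-\varphi (5))/(\Phi (5)-\Phi (0))\approx 0.798\), which the quadrature reproduces.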
Such explicit calculations are not tractable in higher dimensions, i.e., for \(d\in \{10,100\}\). Instead, we now generate \(10^6\) samples from the truncated Gaussian distribution \(\mathfrak {p}\) by applying MYULA and FBULA. As our ground truth, we also generate \(10^{5}\) samples from \(\mathfrak {p}\) with the wall HMC (wHMC) algorithm (Pakman and Paninski 2014). In all three approaches, the initial 10% of the obtained samples are discarded as the burn-in period, and each experiment is repeated 100 times. In terms of the parameters, we set \(\gamma =0.05\) and fix \(h=0.005\) for all of our experiments.
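For reference, the MYULA baseline used in this comparison replaces the FB-envelope drift with the gradient of the MY-smoothed potential, \(\nabla f(x) + (x - P_{\gamma g}(x))/\gamma \) (Durmus et al. 2018). A minimal sketch follows, again with placeholder Gaussian data rather than the paper's exact covariance:

```python
import numpy as np

def myula_step(x, h, gamma, grad_f, prox_g, rng):
    """One MYULA step: Langevin drift on the Moreau-Yosida smoothed potential,
    grad f(x) + (x - prox_{gamma g}(x)) / gamma."""
    drift = grad_f(x) + (x - prox_g(x, gamma)) / gamma
    return x - h * drift + np.sqrt(2.0 * h) * rng.standard_normal(x.size)

# Placeholder target: standard Gaussian truncated to the box [0, 5] x [0, 1].
lo, hi = np.array([0.0, 0.0]), np.array([5.0, 1.0])
grad_f = lambda x: x                              # f(x) = ||x||^2 / 2
prox_g = lambda v, gamma: np.clip(v, lo, hi)      # projection onto the box

rng = np.random.default_rng(1)
x = np.array([1.0, 0.5])
samples = []
for k in range(20000):
    x = myula_step(x, h=0.005, gamma=0.05, grad_f=grad_f, prox_g=prox_g, rng=rng)
    if k >= 2000:                                 # discard 10% burn-in
        samples.append(x.copy())
est = np.mean(samples, axis=0)
```

Inside the box the proximal term vanishes, so the update reduces to the plain unadjusted Langevin step; the smoothing only acts near and outside the boundary.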
The results are visualized in Figs. 2 and 3. More specifically, Fig. 2 corresponds to \(d=10\) and shows the estimates for \(\mathbb {E}_\mathfrak {p}[x_i]\) for \(i\in \{1,2,3\}\), obtained by MYULA, FBULA, and wHMC. Similarly, Fig. 3 corresponds to \(d=100\). These figures indicate that, in all of these cases, FBULA provides a more accurate approximation of the expectation than MYULA.
7.2 Tomographic image reconstruction
We now study a tomographic image reconstruction problem. In this case, the true image is the Shepp–Logan phantom test image of dimension \(d=128 \times 128\) (Shepp and Logan 1974), to which we applied a Fourier operator F followed by a subsampling operator A, reducing the observed pixels by approximately 85%. Finally, zero-mean additive Gaussian noise \(\xi \) with standard deviation \(\sigma = 10^{-2}\) is added to produce an incomplete observation \(y=AFx + \xi \), where \(y \in \mathbb {C}^p\). Note that \(p < d\). With regards to the prior, we use the total-variation norm with an additional constraint on the pixel values. This leads to the following posterior distribution:
with \(\beta = 100\). Above, \(1_{[0,1]^d}\) is the convex indicator function on the unit cube, as the pixel values for this experiment are scaled to the range [0, 1]. Here \(\text {TV}(x)\) represents the total-variation pseudo-norm (Rudin et al. 1992; Chambolle 2004). Following (4) and (6b), we have that \(f(x)=\Vert y - AFx \Vert ^2 / 2\sigma ^2\) and \(g(x) = {\beta }\text {TV}(x) + 1_{\left[ 0,1\right] ^d}(x)\).
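The smooth part of this posterior is straightforward to differentiate: with a unitary Fourier operator, \(\nabla f(x) = \mathrm{Re}\big (F^{H}A^{\top }(AFx-y)\big )/\sigma ^2\). The sketch below implements this with NumPy's orthonormal FFT and a binary sampling mask; the mask, image size, and data are illustrative stand-ins for the actual 85%-subsampled operator, not the paper's experiment.

```python
import numpy as np

sigma = 1e-2
rng = np.random.default_rng(0)
n = 32                                        # small stand-in for 128 x 128
mask = rng.random((n, n)) < 0.15              # ~15% of coefficients observed

def forward(x):
    """A F x: orthonormal 2-D Fourier transform followed by subsampling."""
    return mask * np.fft.fft2(x, norm="ortho")

def f_and_grad(x, y):
    """Data fidelity f(x) = ||y - A F x||^2 / (2 sigma^2) and its gradient."""
    r = forward(x) - y
    val = 0.5 * np.sum(np.abs(r) ** 2) / sigma**2
    # adjoint of A F applied to the residual; x is real, so take the real part
    grad = np.real(np.fft.ifft2(mask * r, norm="ortho")) / sigma**2
    return val, grad

x_true = rng.random((n, n))
y = forward(x_true) + sigma * mask * (rng.standard_normal((n, n))
                                      + 1j * rng.standard_normal((n, n)))
```

A finite-difference check against this gradient is a useful sanity test before plugging it into MYULA or FBULA.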
Figure 4a shows the Shepp–Logan phantom tomography test image for this experiment and Fig. 4b shows the amplitude of the (noisy) Fourier coefficients collected in the observation vector y (in logarithmic scale). In this figure, black regions represent unobserved pixels.
We have set \(\gamma = 1/(5L_f)\), where \(L_f = 1/\sigma ^2 = 10^4\), for both MYULA and FBULA. Figure 5a shows the evolution of the values of \(\log \pi (x)\) from (32) for both MYULA and FBULA with the step size \(h=1/(L_f + 1/\gamma ) = 1.67 \times 10^{-5}\). We observe that both methods converge at a similar rate in terms of the actual computational time required to run the algorithms. However, Fig. 5b shows the evolution of the mean-squared error (MSE) between the ergodic mean of the samples and the true image x. Here, it can be seen that FBULA reaches a lower MSE than MYULA. We have also included in Fig. 6a and b the posterior mean estimated by both MYULA and FBULA.
References
Abdulle, A., Vilmart, G., Zygalakis, K.C.: High order numerical approximation of the invariant measure of ergodic SDEs. SIAM J. Numer. Anal. 52(4), 1600–1622 (2014)
Beck, A.: First-Order Methods in Optimization (2017)
Bertsekas, D.: Convex Optimization Theory (2009)
Bou-Rabee, N., Eberle, A.: Markov Chain Monte Carlo Methods (2020)
Brooks, S., Gelman, A., Jones, G., Meng, X.-L.: Handbook of Markov Chain Monte Carlo (2011)
Brosse, N., Durmus, A., Moulines, É., Pereyra, M.: Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In: Conference on Learning Theory, pp. 319–342. PMLR (2017)
Bubeck, S., Eldan, R., Lehec, J.: Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete Comput. Geom. 59(4), 757–783 (2018)
Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20(1–2), 89–97 (2004)
Dalalyan, A.S.: Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In: Conference on Learning Theory, pp. 678–689. PMLR (2017)
Dalalyan, A.S.: Theoretical guarantees for approximate sampling from smooth and log-concave densities. J. R. Stat. Soc. Ser. B Methodol. 79(3), 651–676 (2017)
Dalalyan, A.S., Karagulyan, A.: User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stoch. Process. Their Appl. 129(12), 5278–5311 (2019)
Durmus, A., Moulines, E.: Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab. 27(3), 1551–1587 (2017)
Durmus, A., Moulines, E.: High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25(4A), 2854–2882 (2019)
Durmus, A., Moulines, É., Pereyra, M.: Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sci 11(1), 473–506 (2018)
Durmus, A., Majewski, S., Miasojedow, B.: Analysis of Langevin Monte Carlo via convex optimization. J. Mach. Learn. Res. 20(73), 1–46 (2019)
Eberle, A., Majka, M.B.: Quantitative contraction rates for Markov chains on general state spaces. Electron. J. Probab. 24 (2019)
Kampf, J.: On weighted parallel volumes. Beiträge Algebra Geom. 50(2), 495–519 (2009)
Klain, D.A., Rota, G.-C.: Introduction to Geometric Probability (1997)
Luu, T.D., Fadili, J., Chesneau, C.: Sampling from non-smooth distributions through Langevin diffusion. Methodol. Comput. Appl. Probab. 23(4), 1173–1201 (2021)
Majka, M.B., Mijatovic, A., Szpruch, L.: Nonasymptotic bounds for sampling algorithms without log-concavity. Ann. Appl. Probab. 30(4), 1534–1581 (2020)
Mattingly, J.C., Stuart, A.M., Tretyakov, M.V.: Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM J. Numer. Anal. 48(2), 552–577 (2010)
Milstein, G.N., Tretyakov, M.V.: Computing ergodic limits for Langevin equations. Physica D 229(1), 81–95 (2007)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Pakman, A., Paninski, L.: Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. J. Comput. Graph. Stat. 23(2), 518–542 (2014)
Pereyra, M.: Proximal Markov chain Monte Carlo algorithms. Stat. Comput. 4(26), 745–760 (2016)
Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
Rockafellar, R.T., Wets, R.J.-B.: Variational Analysis (2009)
Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)
Shepp, L.A., Logan, B.F.: The Fourier reconstruction of a head section. IEEE Trans. Nucl. Sci. 21(3), 21–43 (1974). https://doi.org/10.1109/TNS.1974.6499235
Stella, L., Themelis, A., Patrinos, P.: Forward–backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 67(3), 443–487 (2017)
Villani, C.: Optimal Transport: Old and New (2009)
Acknowledgements
The authors would like to thank Marcelo Pereyra for valuable comments on a draft manuscript. The work of KCZ was supported by the Leverhulme Trust (RF/2020-310) and by the Engineering and Physical Sciences Research Council under grant EP/V006177/1. The work of AE was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A: Proof of Proposition 1
To establish the admissibility of the MY envelope, we verify the conditions in Definition 1. We begin by verifying that Definition 1.3 is met. Recall that the MY envelope is continuously differentiable (Rockafellar and Wets 2009, Theorem 2.26), and its gradient is given by
Here, \(P_{\gamma g}\) is the proximal operator associated with the function \(\gamma g\). Above, note that the minimizer is unique and the map \(P_{\gamma g}\) is thus well-defined. Recall also that we can use the Moreau decomposition (Beck 2017) to write that
where \(g^*\) is the Fenchel conjugate of g. Using (A3), we rewrite (A1) as
which we will use next to compute the Lipschitz constant of \(\nabla F_\gamma ^{MY}\). For \(x,y\in \mathbb {R}^d\), note that
where (i) uses the triangle inequality and (A4), and (ii) uses (3c) and the non-expansiveness of the proximal operator. We next verify that Definition 1.2 holds. In one direction, it is easy to see from (8) that \(g_\gamma \le g\) for every \(\gamma >0\). In the other direction, we fix \(x\in \mathbb {R}^d\) and distinguish two cases:
-
1.
When \(x\in \textrm{K}\), let us fix an arbitrary \(\epsilon >0\). Because \(g(x)<\infty \) and \(\min _z g(z) > -\infty \) by (5), there exists a sufficiently small \(\gamma _\epsilon >0\) such that the following holds for every \(\gamma \le \gamma _\epsilon \):
$$\begin{aligned} \min _{z\in \mathbb {R}^d}\left\{ g(z) + \frac{\epsilon ^2}{2\gamma }\right\} > g(x). \end{aligned}$$(A6)We now use the above inequality as part of the following argument,
$$\begin{aligned}&\min _{z\in \mathbb {R}^d}\left\{ g(z)+ \frac{1}{2\gamma }\Vert x-z\Vert _2^2: \Vert x-z\Vert \ge \epsilon \right\} \nonumber \\&\ge \min _z \left\{ g(z)+ \frac{\epsilon ^2}{2\gamma }\right\} \nonumber \\&\overset{(i)}{>}\ g(x) \nonumber \\&\ge g_\gamma (x) = \min _z \left\{ g(z)+ \frac{1}{2\gamma }\Vert x-z\Vert _2^2 \right\} , \end{aligned}$$(A7)which holds when \(x\in \textrm{K}\) and for every \(\gamma \le \gamma _\epsilon \). Above, (i) uses (A6). In view of (A2) and (A7), we conclude that \(\Vert x-P_{\gamma g}(x)\Vert _2\le \epsilon \), for every \(x\in \textrm{K}\) and every \(\gamma \le \gamma _\epsilon \). Since the choice of \(\epsilon \) was arbitrary, we arrive at
$$\begin{aligned} \lim _{\gamma \rightarrow 0}\Vert x - P_{\gamma g}(x)\Vert _2 = 0, \quad x\in \textrm{K}. \end{aligned}$$(A8)It immediately follows that
$$\begin{aligned}&\liminf _{\gamma \rightarrow 0} g_{\gamma }(x) - g(x)\nonumber \\&\overset{(i)}{=}\ \liminf _{\gamma \rightarrow 0} \min _{z}\left\{ g(z)+\frac{1}{2\gamma }\Vert x-z\Vert _2^2 \right\} - g(x) \nonumber \\&\overset{(ii)}{=}\ \liminf _{\gamma \rightarrow 0} \left\{ g(P_{\gamma g}(x)) + \frac{1}{2\gamma }\Vert x - P_{\gamma g}(x)\Vert _2^2\right\} - g(x) \nonumber \\&\ge \liminf _{\gamma \rightarrow 0} g(P_{\gamma g}(x)) - g(x) \nonumber \\&\overset{(iii)}{\ge }\ \liminf _{z\rightarrow x} g(z) - g(x) \nonumber \\&\overset{(iv)}{\ge }\ 0, \quad x\in \textrm{K}, \end{aligned}$$(A9)where (i) uses (MY), (ii) uses (A2), (iii) uses (A8), and (iv) uses the lower semi-continuity of g; see before Eq. (5). If we combine (A9) with the earlier observation that \(g_\gamma \le g\) for every \(\gamma \), we reach the conclusion that \(\lim _{\gamma \rightarrow 0} g_\gamma (x)=g(x)\), provided that \(x\in \textrm{K}\). Lastly, after recalling the definitions of F and \(F_\gamma ^{MY}\) in (6b) and (MY), respectively, we arrive at
$$\begin{aligned} \lim _{\gamma \rightarrow 0} F^{MY}_\gamma (x) = F(x), \quad x\in \textrm{K}. \end{aligned}$$(A10) -
2.
When \(x\notin \textrm{K}\), we write that
$$\begin{aligned} g_\gamma (x)&= \min _z\left\{ g(z)+ \frac{1}{2\gamma } \Vert x-z\Vert _2^2 \right\} \nonumber \\&\overset{(i)}{=}\ \min \left\{ g(z) + \frac{1}{2\gamma } \Vert x- z\Vert ^2_2: z\in \textrm{K}\right\} \nonumber \\&\overset{(ii)}{\ge }\ -\max _{z\in \textrm{K}} g(z) + \frac{1}{2\gamma } \min _{z\in \textrm{K}} \Vert x-z\Vert _2^2 \nonumber \\&=: -\max _{z \in \textrm{K}}g(z)+ \frac{1}{2\gamma } {\text {dist}}(x,\textrm{K})^2, \end{aligned}$$(A11)where \({\text {dist}}(x,\textrm{K})\) is the Euclidean distance from x to the set \(\textrm{K}\). Above, (i) uses the fact that g is infinite outside the set \(\textrm{K}\) by (5). Also, (ii) uses the inequality \(\min h_1+h_2 \ge \min h_1-\max h_2\) for real-valued functions \(h_1\) and \(h_2\). The maximum above is finite by (5). By sending \(\gamma \) to zero above, we immediately find that \(\lim _{\gamma \rightarrow 0} g_\gamma (x)=\infty =g(x)\), provided that \(x\notin \textrm{K}\). Recalling the definitions of F and \(F_\gamma \) in (6b) and (MY), respectively, we conclude that
$$\begin{aligned} \lim _{\gamma \rightarrow 0} F_\gamma ^{MY}(x) = F(x) = \infty ,\quad x\notin \textrm{K}. \end{aligned}$$(A12)
Together, (A10) and (A12) imply that \(\lim _{\gamma \rightarrow 0} F_\gamma (x) = F(x)\) for every \(x\in \mathbb {R}^d\).
Lastly, we now verify that Definition 1.1 is satisfied: By Assumption 1.1, \(\textrm{K}\) is enclosed inside a ball of radius R centered at the origin, which implies that
where \((a)_+:=\max (a,0)\). When \(\Vert x\Vert _2\) is sufficiently large, we can simplify (A13). In particular, (A13) immediately implies that
If \(\Vert x\Vert _2\ge 2R\), then we can use the convexity of f in Assumption 1.2 to write that
for \( \Vert x\Vert _2 \ge 2R\). Above, (i) uses (MY), (ii) uses the convexity of f, (iii) uses (A11), and (iv) uses (A14). We now set
By its construction, note that \(F^0\) satisfies \(\int e^{-F^0(x)}\, {\text {d}}x<\infty \) and \(F_\gamma ^{MY} \ge F^0\). That is, \(F^0\) satisfies the requirements in Definition 1.1. The above claim about the finite integral holds because \(F^0(x)\) is quadratic for large \(\Vert x\Vert _2\), so \(\exp (-F^0)\) decays rapidly far away from the origin. Having verified the requirements in Definition 1, the proof of Proposition 1 is now complete.
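The Moreau decomposition invoked in (A3) states that \(x = P_{\gamma g}(x) + \gamma P_{\gamma ^{-1} g^*}(x/\gamma )\). As a quick sanity check, the sketch below verifies it numerically for \(g=\Vert \cdot \Vert _1\), whose proximal operator is soft-thresholding and whose conjugate is the indicator of the \(\ell _\infty \) unit ball (so the conjugate prox is a projection); this choice of g is ours, purely for illustration.

```python
import numpy as np

def prox_l1(x, gamma):
    """Proximal operator of gamma * ||.||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_conj(x, gamma):
    """Prox of gamma^{-1} g* for g = ||.||_1: g* is the indicator of the
    l-infinity unit ball, so its prox (for any scaling) is the projection."""
    return np.clip(x, -1.0, 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(10)
gamma = 0.3
# Moreau decomposition: x = prox_{gamma g}(x) + gamma * prox_{g*/gamma}(x/gamma)
recomposed = prox_l1(x, gamma) + gamma * prox_conj(x / gamma, gamma)
```

Here `recomposed` agrees with `x` entrywise, as the decomposition predicts.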
Appendix B: Proof of Proposition 3
To establish the admissibility of the FB envelope, we verify the requirements in Definition 1. We first verify that Definition 1.3 is satisfied. To begin, for \(x\in \mathbb {R}^d\), let
for short. Above, recall that \(P_{\gamma g}\) is the proximal operator associated with the function \(\gamma g\); see (9). We can then apply the Moreau decomposition (Beck 2017) to write that
where \(g^*\) is the Fenchel conjugate of g. For future reference, we record the following observations:
Above, \(I_d\in \mathbb {R}^{d\times d}\) is the identity matrix. Also, (i) combines Assumption 1.1 and the second line of (A11). Moreover, (ii) uses (B17), (iii) uses (B15), (iv) uses (B15) and (B16), (v) uses (3c) and the non-expansiveness of the proximal operator, (vi) uses the triangle inequality and (B15), and (vii) uses (3c). Lastly, (viii) uses (3c), and (ix) uses (B20) and the assumption that \(\gamma \in (0,1/\lambda _2)\).
Recall the expression for \(\nabla F_\gamma ^{FB}\) from Proposition 2.3. We next establish that \(F_\gamma ^{FB}\) is smooth. In our argument below, without loss of generality, we assume that \(\Vert y\Vert _2\le \lambda _0\):
Above, (i) uses Proposition 2.3, (ii) uses (3d) and (B18)–(B20), and (iii) uses the assumption that \(\Vert y\Vert _2\le \lambda _0\), without loss of generality. Indeed, if both \(\Vert x\Vert _2\ge \lambda _0\) and \(\Vert y\Vert _2 \ge \lambda _0\), then \(f(x)=f(y)=\Vert \nabla f(x)\Vert _2=\Vert \nabla f(y)\Vert _2=\Vert \nabla ^2f(x)\Vert =\Vert \nabla ^2f(y)\Vert =0\) by (3), and thus (B22) still holds. This observation justifies our earlier restriction to the case where \(\Vert y\Vert _2\le \lambda _0\).
Next, we show the strong convexity of \(F_\gamma ^{FB}\) over long distances. Again, without loss of generality, we assume below that \(\Vert y\Vert _2\le \lambda _0\) and write
where (i) uses Proposition 2.3 and (ii) uses (3d), (B17), (B18), and (B20). When \(\Vert x-y\Vert _2\) is sufficiently large, the first term in the last line above dominates the second term. In particular, we have
In words, \(F^{FB}_\gamma \) behaves like a strongly convex function over long distances. It is not difficult to verify that Definition 1.2 also holds for the FB envelope: to that end, one needs to combine Proposition 1 with the relation between the MY and FB envelopes in Proposition 2.2.
Lastly, we now verify that the requirement in Definition 1.1 is met for the FB envelope: Below, suppose that \(\gamma \in (0,1/\lambda _2)\). On the one hand, Proposition 2.2 implies that
On the other hand, Proposition 1 implies that there exists \(F^0\) such that
and \(\exp (-F^0)\) is integrable. By combining the preceding two observations, we find that the FB envelope satisfies Definition 1.1 when \(\gamma \in (0,1/\lambda _2)\). This completes the proof of Proposition 3.
Appendix C: Proof of Theorem 1
For every \(x\in \mathbb {R}^d\), let us define
where \(1_\textrm{K}\) is the indicator function on the set \(\textrm{K}\). Note that the target distribution \(\mathfrak {p}\) in (6a) coincides with \(\mathfrak {q}\) in the special case where \(g=1_\textrm{K}\). Let \({\text {dist}}(x,\textrm{K})\) denote the distance from x to the set \(\textrm{K}\). For \(\gamma >0\), we also define
to be the MY envelope of the indicator function on the set \(\textrm{K}\). We denote the corresponding probability distribution by
When \(\gamma \) is sufficiently small, we may intuitively regard \(\mathfrak {q}_\gamma \) as a proxy for \(\mathfrak {q}\). We also recall our earlier notation for the convenience of the reader:
Above, the functions F and \(F^{FB}_\gamma \) were defined in (6b) and (FB), respectively. Our objective is to control the distance \(\textrm{W}_1(\mathfrak {p}^{FB}_\gamma ,\mathfrak {p})\). To begin, we use the triangle inequality to write that
Each of the following three lemmas bounds one of the terms on the right-hand side above.
Lemma 2
It holds that
where
where \(\text {d}_i(\textrm{K})\) is the i-th intrinsic volume of \(\textrm{K}\) (Klain and Rota 1997). In particular, the d-th intrinsic volume of \(\textrm{K}\) coincides with the standard volume of \(\textrm{K}\), i.e., \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\).
Lemma 3
It holds that
where \(c_2:=e^{\max _x f(x)-\min _x f(x)}\).
Lemma 4
It holds that
where
To keep the notation light, above we have suppressed the dependence of \(c_1\) through \(c_4\) on \(\textrm{K},f,g\). With Lemmas 2–4 at hand, we revisit (C29) and write
which completes the proof of Theorem 1.
Appendix D: Proof of Lemma 2
In order to upper bound the distance \(\textrm{W}_1(\mathfrak {q}_\gamma ,\mathfrak {p}_\gamma ^FB )\), we will use the following result which bounds the \(\textrm{W}_1\) distance under small perturbations.
Lemma 5
For constants \(\alpha \ge \beta \) and \(\beta '\le 1\), consider \(h_1:\mathbb {R}^d\rightarrow \mathbb {R}\) and \(h_2:\mathbb {R}^d\rightarrow \mathbb {R}\) that satisfy
for every \(x\in \mathbb {R}^d\). Then it holds that
As a sanity check, note that the right-hand side of (D36) is always nonnegative because, by assumption, \(\alpha \ge \beta \) and \(\beta '\le 1\). Moreover, if we set \(\alpha =\beta =0\) and \(\beta '=1\), the right-hand side of (D36) reduces to zero, as expected.
Let us now recall from (C27) and (6a) that \(\mathfrak {q}_\gamma \propto \exp (-f-1_{\textrm{K},\gamma })\) and \(\mathfrak {p}_\gamma ^{FB} \propto \exp (-F^{FB}_\gamma )\), respectively. Our plan is to invoke Lemma 5 with the choice of \(h_1=f+1_{\textrm{K},\gamma }\) and \(h_2=F^{FB}_\gamma \). In turn, this plan necessitates that we verify (D35) for this choice of \(h_1\) and \(h_2\). We begin by relating the two functions as
where (i) uses Proposition 2.2, (ii) uses (MY), (iii) uses (8) and the fact that g is infinite outside of \(\textrm{K}\) by (5), and (iv) uses (C26). Above, note that \(\max _{z\in \textrm{K}} g(z)\) is finite by the construction of g; see (5). In the other direction, we similarly write
where (i) uses Proposition 2.2 and (ii) uses (C26). Above, \(\min _z f(z)\) is finite by the construction of f; see (3). We now summarize (C37) and (C38): (D35) is satisfied for our choice of \(h_1=f+1_{\textrm{K},\gamma }\) and \(h_2=F^{FB}_\gamma \), i.e.,
where
With (C39) at hand, we can invoke Lemma 5 to find that
where (i) uses the definitions in (C27) and (C28), (ii) uses Lemma 5, and (iii) uses (C26) and (C39). To complete the proof, we need the following technical result, which is proved in Brosse et al. (2017) and Kampf (2009) using the well-known Steiner formula from integral geometry (Klain and Rota 1997). For completeness, a self-contained proof is given later in this work.
Lemma 6
It holds that
where \(\textrm{K}^c=\mathbb {R}^d\backslash \textrm{K}\) is the complement of the set \(\textrm{K}\) and \(\text {d}_i(\textrm{K})\) is the i-th intrinsic volume of \(\textrm{K}\) (Klain and Rota 1997). In particular, the d-th intrinsic volume of \(\textrm{K}\) coincides with the standard volume of \(\textrm{K}\), i.e., \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\).
As a sanity check for Lemma 6, note that \(I_1(\gamma )\) and \(I_2(\gamma )\) both vanish in the limit of \(\gamma \rightarrow 0\), as expected. To control the first fraction in the last line of (C40), we use Lemma 6 to write that
where (i) breaks the numerator into two integrals on \(\textrm{K}\) and its complement, and uses the fact that \({\text {dist}}(x,\textrm{K})=0\) when \(x\in \textrm{K}\). Above, (ii) uses Lemma 6. For the second fraction in the last line of (C40), we note that
where (i) breaks the numerator into two integrals on \(\textrm{K}\) and its complement, and uses the fact that \({\text {dist}}(x,\textrm{K})=0\) when \(x\in \textrm{K}\). For the third fraction in the last line of (C40), we similarly obtain that
We can now combine (C42)–(C44) to upper bound the \(\textrm{W}_1\) distance in (C40) as
which completes the proof of Lemma 2.
Appendix E: Proof of Lemma 5
Let us first recall Theorem 6.15 in Villani (2009), which we will use in the proof to link the \(\textrm{W}_1\) and TV distances of two rapidly decaying distributions.
Theorem 3
Any pair \((\mathfrak {r}_1,\mathfrak {r}_2)\) of probability distributions satisfies
We apply Theorem 3 to bound the \(\textrm{W}_1\) distance of interest as
To bound the integral on the right-hand side above, we first use the assumed relationship between \(h_1\) and \(h_2\) to write that
where (i) uses (D35). In the other direction, we can again use the relation between \(h_1\) and \(h_2\) as
where (i) uses (D35) and (ii) uses the assumptions that \(\alpha \ge \beta \) and \(\beta '\le 1\). By combining (C48) and (C49), we find that
where (i) uses (C48), (C49), and the following observation: \(a\le b\le c\) and \(a\le 0\) imply that
It immediately follows from (C50) that
where (i) combines the domains of the first and third integrals in the previous inequality and uses the fact that \(\alpha \ge \beta \) (and thus \(e^{\beta -\alpha }\le 1\)). Similarly, (ii) combines the domains of the second and third integrals in the previous inequality. We revisit (C47) and conclude that
which completes the proof of Lemma 5.
Appendix F: Proof of Lemma 6
Let \(\iota (A)=1\) if the claim A is true and \(\iota (A)=0\) otherwise. Using the fact that
we can rewrite the left-hand side of (C41b) as
Note that \(\{x:{\text {dist}}(x,\textrm{K})\le t\}\) is the tube of radius t around the set \(\textrm{K}\). We can represent this set more compactly as \(\textrm{K}+\textrm{B}(0,t)\), where the addition is in the Minkowski sense. With this in mind, we rewrite the last line above as
We can now use Steiner's formula (Klain and Rota 1997) to express the volume above as
where \(\kappa _i\) is the volume of the unit ball in \(\mathbb {R}^i\) and \(\text {d}_i(\textrm{K})\) is the i-th intrinsic volume of \(\textrm{K}\) (Klain and Rota 1997). In particular, the d-th intrinsic volume of \(\textrm{K}\) coincides with its standard volume, i.e., \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\). Substituting the above identity back into (F55), we find that
where (i) uses (F55) and (F56), (ii) uses the identity \(\kappa _i = \pi ^{\frac{i}{2}}/\Gamma (1+i/2)\), and (iii) uses the fact that \(\text {d}_d(\textrm{K}) = \text {d}(\textrm{K})\). We have thus established (C41b). Above, (ii) also uses the identity
To prove (C41a), we write that
where (i) uses (C41b) and the fact that \({\text {dist}}(x,\textrm{K})=0\) when \(x\in \textrm{K}\). To prove (C41c), we use the fact that \(\textrm{K}\subset \textrm{B}(0,R)\) by Assumption 1.1, which implies that
In turn, we use (F60) to write that
where (i) uses (F60) and the fact that \(\textrm{K}^c=\{x:{\text {dist}}(x,\textrm{K})>0\}\) and (ii) uses (F54). We can again use Steiner's formula to calculate the volume of the tube of radius t around \(\textrm{K}\), which appears in the last line above. Substituting from (F56) into the last line above, we find that
where (i) uses (F56), (F61), and the fact that \(\text {d}_d(\textrm{K})=\text {d}(\textrm{K})\). Also, (ii) uses (F58) and the identity \(\kappa _i = \pi ^{\frac{i}{2}}/\Gamma (1+i/2)\). Above, (iii) is an application of Gautschi's inequality. This establishes (C41c) and completes the proof of Lemma 6.
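Steiner's formula used in this proof can be checked numerically on a simple set. For the unit square in \(\mathbb {R}^2\), the intrinsic volumes are \(\text {d}_0=1\), \(\text {d}_1=2\), and \(\text {d}_2=1\), so the tube volume should equal \(1+4t+\pi t^2\); the grid-counting sketch below confirms this (the example set and the discretization are ours, chosen only for illustration).

```python
import numpy as np

def tube_volume(t, n=1200):
    """Grid estimate of vol({x : dist(x, K) <= t}) for K = [0, 1]^2."""
    s = np.linspace(-t, 1.0 + t, n)
    X, Y = np.meshgrid(s, s, indexing="ij")
    # distance from (X, Y) to the unit square via coordinate-wise clamping
    dx = np.maximum(np.maximum(-X, X - 1.0), 0.0)
    dy = np.maximum(np.maximum(-Y, Y - 1.0), 0.0)
    inside = np.hypot(dx, dy) <= t
    cell = (s[1] - s[0]) ** 2
    return inside.sum() * cell

t = 0.5
est = tube_volume(t)
exact = 1.0 + 4.0 * t + np.pi * t**2   # Steiner: d_0 = 1, d_1 = 2, d_2 = 1
```

The grid estimate matches the Steiner prediction up to discretization error, which shrinks with the grid spacing.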
Appendix G: Proof of Lemma 3
Recall from (C25) that \(\mathfrak {q}\propto \exp (-f-1_\textrm{K})\), where \(1_\textrm{K}\) denotes the indicator function on the set \(\textrm{K}\). Recall also from (C27) that \(\mathfrak {q}_\gamma \propto \exp (-f-1_{\textrm{K},\gamma })\), where \(1_{\textrm{K},\gamma }(x)= \frac{1}{2\gamma } {\text {dist}}(x,\textrm{K})^2\) is the MY envelope of \(1_\textrm{K}\); see (C26). We will repeatedly use these two distributions in the proof. To begin, let us invoke Theorem 3 to write that
where the first line above uses Theorem 3. For the first integral on the right-hand side of (i) above, we use the fact that \(\textrm{K}\subset \textrm{B}(0,R)\) in Assumption 1.1 to write that
where (i) uses Assumption 1.1, (ii) uses the fact that \(1_\textrm{K}(x)=1_{\textrm{K},\gamma }(x)=0\) when \(x\in \textrm{K}\), and (iii) uses the fact that
We continue to simplify the last line of (G64) as
where (i) uses (G64) and (ii) uses Lemma 6. For the second integral on the right-hand side of (i) in (G63), we can again use the definitions of \(\mathfrak {q}\) and \(\mathfrak {q}_\gamma \) to write that
where (i) uses the fact that \(\mathfrak {q}\) is supported on \(\textrm{K}\) and (ii) uses Lemma 6. With (G65) and (G66) at hand, we revisit (G63) and write that
This completes the proof of Lemma 3.
Appendix H: Proof of Lemma 4
Recall from (C25) and (6a) that \(\mathfrak {q}\propto \exp (-f-1_{\textrm{K}})\) and \(\mathfrak {p}\propto \exp (-F)\), respectively. The function F was defined in (6b). In order to bound \(\textrm{W}_1(\mathfrak {q},\mathfrak {p})\), our plan is to invoke Lemma 5 with \(h_1=f+1_\textrm{K}\) and \(h_2=F\). As before, this means that we need to verify (D35) for this choice of \(h_1\) and \(h_2\). We begin by relating these two functions as
where (i) uses (6b), and (ii) uses the fact that F and \(1_\textrm{K}\) both take infinity outside of the set \(\textrm{K}\). In the other direction, we write that
To summarize, for our choice of \(h_1=f+1_\textrm{K}\) and \(h_2=F\), (D35) is satisfied with
where
We can now invoke Lemma 5 to find that
where
Above, (i) uses Lemma 5 and (ii) uses the fact that \(\textrm{K}\subset \textrm{B}(0,R)\) by Assumption 1.1. This completes the proof of Lemma 4.
Appendix I: Proof of Lemma 1
Consider the stochastic differential equation
where \(\{B_t\}_{t\ge 0}\) is the standard Brownian motion. Above, note that the initial probability measure is \(\mathfrak {p}_{\gamma _k}\propto \exp (-F_{\gamma _k})\). From Definition 1.3, recall that \(F_{\gamma _{k}}\) is \(\lambda _{\gamma _{k}}\)-smooth and is, moreover, coercive. To be concrete, the latter means that \(F_{\gamma _{k}}\) satisfies
Therefore Theorem 3.4 in Bou-Rabee and Eberle (2020) ensures that \(\mathfrak {p}_{\gamma _{k}}\propto \exp (-F_{\gamma _{k}})\) is the invariant probability measure of (I71). In particular, because \(x_0\sim \mathfrak {p}_{\gamma _{k}}\) in (I71), it holds that
Let \(\{P_{k}\}_{k\ge 0}\) denote the Markov transition kernel associated with the Markov chain \(\{x_{H_k}\}_{k\ge 0}\), where
is the elapsed time since initialization. Using this transition kernel, we can rewrite (I72) as
Finally, we can write the quantity of interest on the left-hand side of (26) as
where we used (I74). In the remainder of the proof, we will upper bound the right-hand side above, which can be thought of as the discretization error associated with (I71). To begin, recall from (I72) that \(x_{H_{k}}\sim \mathfrak {p}_{\gamma _{k}}\), and note that
by construction. Above, (i) uses (20) and (ii) uses (I73), i.e., the fact that \(H_{k+1}-H_k=h_k\). Likewise, note that
where (i) uses (I71) and (I74). In the second identity above, \(\overset{\text {dist.}}{=}\) indicates that both sides have the same distribution, and recall that \(\zeta _{k+1}\sim \text {normal}(0,I_d)\). In view of (I76) and (I77), we revisit (I75) and write that
where (i) uses (I75), (ii) uses (13), (I76), and (I77), and (iii) uses Definition 1.3, i.e., the fact that \(F_{\gamma _{k}}\) is \(\lambda _{\gamma _{k}}\)-smooth. To estimate the expectation in the last line above, we write that
where (i) uses (I71). To estimate the first expectation in the last line of (I79), we recall from the last line of (I78) that \(t\in [H_k,H_{k+1})\) and then write
where (i) uses (I72), (ii) uses (I73), and (iii) uses Jensen’s inequality. Also, (iv) uses integration by parts and the fact that \(\mathfrak {p}_{\gamma _{k}}\propto e^{-F_{\gamma _{k}}}\). Lastly, (v) uses \(\Delta F_{\gamma _{k}}(x)={\text {trace}}(\nabla ^2 F_{\gamma _{k}}(x))\) and the fact that \(F_{\gamma _{k}}\) is \(\lambda _{\gamma _k}\)-smooth by Definition 1.3. To estimate the second expectation in the last line of (I79), note that
The above observation enables us to bound the second expectation in the last line of (I79) as
where (i) uses (I81), (ii) uses Jensen’s inequality, (iii) uses the second line of (I81), and (iv) uses (I73). We now plug the bounds (I80) and (I82) back into (I79) to obtain that
where (i) is simply (I79) and (ii) uses (I80) and (I82). By substituting the above bound back into (I78), we arrive at
where (i) uses (I83) and (ii) uses (I73). This completes the proof of Lemma 1.
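To make the scheme analyzed in this proof concrete, the following is a minimal sketch of one step of the Euler–Maruyama update, \(x_{H_{k+1}}=x_{H_k}-h_k\nabla F_{\gamma _k}(x_{H_k})+\sqrt{2h_k}\,\zeta _{k+1}\) with \(\zeta _{k+1}\sim \text {normal}(0,I_d)\), as used in (I76). The quadratic potential \(F_\gamma (x)=\Vert x\Vert ^2/2\) below is a placeholder assumption chosen only so the example is self-contained; in the paper, \(F_{\gamma _k}\) is the forward–backward envelope of Definition 1.3.

```python
import numpy as np


def euler_maruyama_step(x, grad_F, h, rng):
    """One Euler-Maruyama step for the overdamped Langevin dynamics
    dx_t = -grad_F(x_t) dt + sqrt(2) dW_t over a step of size h:
        x_next = x - h * grad_F(x) + sqrt(2 h) * zeta,  zeta ~ N(0, I_d).
    """
    zeta = rng.standard_normal(x.shape)
    return x - h * grad_F(x) + np.sqrt(2.0 * h) * zeta


# Placeholder smooth potential F_gamma(x) = ||x||^2 / 2, so grad_F(x) = x;
# its invariant measure exp(-F_gamma) is the standard Gaussian.
grad_F = lambda x: x

rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(10_000):
    x = euler_maruyama_step(x, grad_F, h=0.01, rng=rng)
```

With a constant step size h, the chain is biased away from \(\exp (-F_\gamma )\) by a discretization error of the kind bounded in Lemma 1; the bound there controls this error in terms of \(h_k\) and the smoothness parameter \(\lambda _{\gamma _k}\).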
Cite this article
Eftekhari, A., Vargas, L. & Zygalakis, K.C. The forward–backward envelope for sampling with the overdamped Langevin algorithm. Stat Comput 33, 85 (2023). https://doi.org/10.1007/s11222-023-10254-y