Abstract
This work is concerned with the use of Gaussian surrogate models for Bayesian inverse problems associated with linear partial differential equations. A particular focus is on the regime where only a small amount of training data is available. In this regime, the type of Gaussian prior used is of critical importance with respect to how well the surrogate model will perform in terms of Bayesian inversion. We extend the framework of Raissi et al. (2017) to construct PDE-informed Gaussian priors that we then use to construct different approximate posteriors. A number of numerical experiments illustrate the superiority of the PDE-informed Gaussian priors over more traditional priors.
1 Introduction
Combining complex mathematical models with observational data is an extremely challenging yet ubiquitous problem in the fields of modern applied mathematics and data science. Inverse problems, where one is interested in learning inputs to a mathematical model such as physical parameters and initial conditions given partial and noisy observations of model outputs, are hence of frequent interest. Adopting a Bayesian approach (Kaipio and Somersalo 2005; Stuart 2010), we incorporate our prior knowledge on the inputs into a probability distribution, the prior distribution, and obtain a more accurate representation of the model inputs in the posterior distribution, which results from conditioning the prior distribution on the observed data.
The posterior distribution contains all the necessary information about the characteristics of our inputs. However, in most cases the posterior is unfortunately intractable and one needs to resort to sampling methods such as Markov chain Monte Carlo (MCMC) (Robert and Casella 2004; Brooks et al. 2011) to explore it. A major challenge in the application of MCMC methods to problems of practical interest is the large computational cost associated with numerically solving the mathematical model for a given set of the input parameters. Since the generation of each sample by an MCMC method requires a solve of the governing equations, and often millions of samples are required in practical applications, this process can quickly become very costly.
One way to deal with the challenge of full Bayesian inference for complex models is the use of surrogate models, also known as emulators, meta-models or reduced order models. Instead of using the complex (and computationally expensive) model, one uses a simpler and computationally more efficient model to approximate the solution of the governing equations, which in turn is used to approximate the data likelihood. Within the statistics literature, the most commonly used type of surrogate model is a Gaussian process emulator (Rasmussen and Williams 2006; Stein 1999; Sacks et al. 1989; Kennedy and O’Hagan 2000; O’Hagan 2006; Higdon et al. 2004), but other types of surrogate models can also be used including projection-based methods (Bui-Thanh et al. 2008), generalised Polynomial Chaos (Xiu and Karniadakis 2003; Marzouk et al. 2007), sparse grid collocation (Babuska et al. 2007; Marzouk and Xiu 2009) and adaptive subspace methods (Constantine 2015; Constantine et al. 2014).
In this paper, we focus on the use of Gaussian process surrogate models for approximating the posterior distribution in inverse problems, where the forward model is related to the solution of a linear partial differential equation (PDE). In particular, we consider two different ways of using the surrogate model, emulating either the parameter-to-observation map or the negative log-likelihood. Convergence properties of the corresponding posterior approximations, as the number of design points N used to construct the surrogate model goes to infinity, have recently been studied in Stuart and Teckentrup (2018); Teckentrup (2020); Helin et al. (2023). These results put the methodology on a firm theoretical footing, and show that the error in the approximate posterior distribution can be bounded by the corresponding error in the surrogate model. Furthermore, the error in the approximate posteriors tends to zero as N tends to infinity. However, when the forward model of interest is given by a complex model such as a PDE, one normally operates in a regime where only a very limited number of design points N can be used due to constraints on computational cost. This setting is less understood and is the main setting of interest in this paper.
With a small number of design points, different modelling choices made in the derivation of the approximate posterior can have a large effect on its accuracy. In particular, the choice of Gaussian prior distribution in the emulator is crucial, as it heavily influences its accuracy. Intuitively, we want to make the Gaussian prior as informative as possible, by incorporating known information about the underlying forward model. For example, such a Gaussian prior specially tailored to solving the forward problem in linear PDEs can be found in Raissi et al. (2017). For incorporating more general constraints, we refer the reader to the recent review (Swiler et al. 2021). Other modelling choices that require careful consideration are whether we build a surrogate model for the parameter-to-observation map or the log-likelihood directly, and whether we use the full distribution of the emulator or only the mean (see e.g. Stuart and Teckentrup (2018); Lie et al. (2018)).
The focus of this paper is on computational aspects of the use of Gaussian process surrogate models in PDE inverse problems, with particular emphasis on the setting where the number of design points is limited by computational constraints. The main contributions of this paper are the following:
1. We extend the PDE-informed Gaussian process priors from Raissi et al. (2017) to enable their use in inverse problems, which requires a Gaussian process prior as a function of both the spatial variable of the PDE and the unknown parameter(s).
2. By showing that the required gradients can be computed explicitly, we establish that gradient-based MCMC samplers such as the Metropolis-adjusted Langevin algorithm (MALA) can be used to efficiently sample from the approximate posterior distributions.
3. Using a range of numerical examples, we demonstrate the isolated effects of the various modelling choices made, and thus offer valuable insights and guidance for practitioners. This includes the choice of posterior approximation in the inverse problem (e.g. emulating the parameter-to-observation map or the log-likelihood) and the choice of prior distribution for the Gaussian process emulator (e.g. black-box or PDE-constrained).
The rest of the paper is organised as follows. In Sect. 2 we set up notation with respect to the inverse problems of interest and discuss the different kinds of posterior approximations that result from using Gaussian surrogate models for the data-likelihood. We then proceed in Sect. 3 to present our main methodology, discussing how one can blend better-informed Gaussian surrogate models with inverse problems as well as presenting the MCMC algorithm that we use. A number of different numerical experiments that illustrate the computational benefits of our approach are then presented in Sect. 4, and finally Sect. 5 provides a summary and discussion of the main results.
2 Preliminaries
We now give more details about the type of inverse problems considered in this paper and discuss different aspects of Gaussian emulators, as well as the corresponding type of approximate posteriors considered in this work. At the end of this section, we summarise in Table 1 all the different notations introduced in this section.
2.1 PDE inverse problems
Consider the linear PDE
\(\mathcal {L}^{\varvec{\theta }} u(\textbf{x}) = f(\textbf{x}), \quad \textbf{x} \in D, \qquad \qquad (1a)\)
\(\mathcal {B} u(\textbf{x}) = g(\textbf{x}), \quad \textbf{x} \in \partial D, \qquad \qquad (1b)\)
posed on a domain \(D \subseteq \mathbb {R}^{{d}_{\textbf{x}}}\), where \(\mathcal {L}^{\varvec{\theta }}\) denotes a linear differential operator depending on parameters \(\varvec{\theta } \in \mathcal {T} \subseteq \mathbb {R}^{{d}_{\varvec{\theta }}}\) and the linear operator \({\mathcal {B}}\) incorporates boundary conditions. The inverse problem of interest in this paper is to infer the parameters \(\varvec{\theta }\) from the noisy data \(\textbf{y} \in \mathbb {R}^{{d}_{\textbf{y}}}\) given by
\(\textbf{y} = \mathcal {G}_{X}(\varvec{\theta }) + \varvec{\eta }, \qquad \qquad (2)\)
where \(X=\{\textbf{x}_{1},\cdots ,\textbf{x}_{{d}_{\textbf{y}}} \} \subset {\overline{D}}\) are the spatial points where we observe the solution u of our PDE, \({\mathcal {G} _{X}:\mathcal {T}\rightarrow \mathbb {R}^{{d}_{\textbf{y}}}}\) is the parameter-to-observation map defined by \(\mathcal {G}_{X}(\varvec{\theta }) = \{u(\textbf{x}_j; \varvec{\theta })\}_{j=1}^{{{d}_{\textbf{y}}}}\), and \(\varvec{\eta } \sim \mathcal {N}(0,\Gamma _{\eta })\) is an additive Gaussian noise term with covariance matrix \(\Gamma _{\eta } = \sigma _{\eta }^2 I_{{d}_{\textbf{y}}}\). Note that the assumption of Gaussianity and diagonal noise covariance is done for simplicity, but these assumptions can be relaxed (Lie et al. 2018). Likewise, the methodology generalises straightforwardly to general bounded linear observation operators applied to the PDE solution u (see the discussion in Sect. 3.1).
To solve the inverse problem we will adopt a Bayesian approach (Stuart 2010). That is, prior to observing the data \(\textbf{y}\), \(\varvec{\theta }\) is assumed to be distributed according to a prior density \(\pi _0(\varvec{\theta })\), and we are interested in the updated posterior density \(\pi (\varvec{\theta }|{\textbf{y}})\). From (2) we have \( \textbf{y}|\varvec{\theta }\sim \mathcal {N}(\mathcal {G}_{X}(\varvec{\theta }),\Gamma _{\eta })\), so the likelihood is
\(\pi (\textbf{y}|\varvec{\theta }) \propto \exp \left( -\Phi (\varvec{\theta },\textbf{y})\right), \quad \Phi (\varvec{\theta },\textbf{y}) = \tfrac{1}{2}\Vert \textbf{y}-\mathcal {G}_{X}(\varvec{\theta })\Vert ^{2}_{\Gamma _\eta }, \qquad \qquad (3)\)
where the function \(\Phi :\mathcal {T}\times \mathbb {R}^{{d}_{\textbf{y}}}\rightarrow \mathbb {R}\) is called the negative log-likelihood or potential and \(\Vert \textbf{z}\Vert _{\Gamma _\eta }:= \bigl (\textbf{z}^\textrm{T} \Gamma _\eta ^{-1} \textbf{z}\bigr )^{1/2}\) denotes the norm weighted by \(\Gamma _\eta ^{-1}\). Note that our notation for \(\Vert \textbf{z}\Vert _{\Gamma _\eta }\) follows the convention introduced in Stuart (2010). Then, by Bayes' formula, we have
\(\pi (\varvec{\theta }|{\textbf{y}}) \propto \pi (\textbf{y}|\varvec{\theta })\, \pi _0(\varvec{\theta }) \propto \exp \left( -\Phi (\varvec{\theta },\textbf{y})\right) \pi _0(\varvec{\theta }).\)
The posterior distribution \(\pi ({\varvec{\theta }}|{\textbf{y}})\) is in general intractable, and we need to resort to sampling methods such as MCMC to extract information from it. However, generating a sample typically involves evaluating the likelihood and hence the solution of the PDE (1), which can be prohibitively costly. This motivates the use of surrogate models to emulate the PDE solution, which in turn is used to approximate the posterior and hence accelerate the sampling process.
2.2 Gaussian processes
Gaussian process regression (GPR) is a flexible non-parametric model for Bayesian inference (Rasmussen and Williams 2006). Our starting point for approximating an arbitrary function \(\textbf{g}:\mathcal {T}\rightarrow \mathbb {R}^d\), for some \(d \in \mathbb {N}\), is the Gaussian process prior
\(\textbf{g}_{0}(\varvec{\theta }) \sim \text {GP}\left( \textbf{m}(\varvec{\theta }), K(\varvec{\theta },\varvec{\theta }')\right), \qquad \qquad (4)\)
where \(\textbf{m}:\mathcal {T}\rightarrow \mathbb {R}^{d}\) is a mean function and \(K(\varvec{\theta },\varvec{\theta }'):\mathcal {T}\times \mathcal {T}\rightarrow \mathbb {R}^{d \times d}\) is the matrix-valued positive definite covariance function, which represents the covariance between the different entries of \(\textbf{g}\) evaluated at \(\varvec{\theta }\) and \(\varvec{\theta }'\). Note that this prior is distinct from the prior \(\pi _0\) introduced earlier for the Bayesian inverse problem; it is the prior used for Gaussian process regression. When emulating the forward map, the function \(\textbf{g}\) corresponds to the PDE solution evaluated at \({d}_{\textbf{y}}\) different spatial points, and hence \(d={d}_{\textbf{y}}\). In contrast, \(d=1\) when directly emulating the log-likelihood.
In the case where \(d>1\) there is a number of choices that one can make for the matrix-valued covariance function (Alvarez et al. 2012). In this section, for simplicity we will assume that the matrix \(K(\varvec{\theta },\varvec{\theta }')\) takes the form
\(K(\varvec{\theta },\varvec{\theta }') = k(\varvec{\theta },\varvec{\theta }')\, I_{d},\)
for some scalar-valued covariance function \(k(\varvec{\theta },\varvec{\theta }'):\mathcal {T}\times \mathcal {T}\rightarrow \mathbb {R}\), implying that the entries of \(\textbf{g}\) are independent. We will refer to this as the baseline model. As we will see later, better emulators can be constructed by relaxing this independence assumption.
The mean function and the covariance function fully characterise our Gaussian prior. A typical choice for \(\textbf{m}\) is to set it to zero, while common choices for the covariance function \(k(\varvec{\theta },\varvec{\theta }')\) include the squared exponential covariance function (Rasmussen and Williams 2006)
\(k(\varvec{\theta },\varvec{\theta }') = \sigma ^2 \exp \left( -\frac{\Vert \varvec{\theta }-\varvec{\theta }'\Vert _2^2}{2l^2}\right) \qquad \qquad (5)\)
and the Matérn covariance function (Rasmussen and Williams 2006)
\(k(\varvec{\theta },\varvec{\theta }') = \sigma ^2\, \frac{2^{1-\nu }}{\Gamma (\nu )} \left( \frac{\sqrt{2\nu }\,\Vert \varvec{\theta }-\varvec{\theta }'\Vert _2}{l}\right) ^{\nu } K_{\nu }\left( \frac{\sqrt{2\nu }\,\Vert \varvec{\theta }-\varvec{\theta }'\Vert _2}{l}\right), \qquad \qquad (6)\)
where \(\Gamma \) denotes the gamma function and \(K_{\nu }\) the modified Bessel function of the second kind.
For both kernels, the hyperparameter \(\sigma ^2 >0\) governs the magnitude of the covariance and the hyperparameter \(l>0\) governs the length-scale at which the entries of \(\textbf{g}_0(\varvec{\theta })\) and \(\textbf{g}_0(\varvec{\theta }')\) are correlated. For the Matérn covariance function the smoothness of the entries of \(\textbf{g}_0\) depends on the hyperparameter \(\nu > 0\). In the limit \(\nu \rightarrow \infty \) we obtain the squared exponential covariance function, which gives rise to infinitely differentiable sample paths for \(\textbf{g}_0\).
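To make the two covariance functions concrete, the following short Python sketch implements the squared exponential kernel and the Matérn kernel with \(\nu = 5/2\) (a closed-form special case); the hyperparameter values and design points are illustrative only.

```python
import numpy as np

def sq_exp_kernel(theta1, theta2, sigma2=1.0, l=1.0):
    """Squared exponential covariance sigma^2 * exp(-||t1 - t2||^2 / (2 l^2))."""
    r2 = np.sum((np.atleast_1d(theta1) - np.atleast_1d(theta2)) ** 2)
    return sigma2 * np.exp(-0.5 * r2 / l**2)

def matern52_kernel(theta1, theta2, sigma2=1.0, l=1.0):
    """Matern covariance with smoothness nu = 5/2 (closed-form special case)."""
    r = np.sqrt(np.sum((np.atleast_1d(theta1) - np.atleast_1d(theta2)) ** 2))
    s = np.sqrt(5.0) * r / l
    return sigma2 * (1.0 + s + s**2 / 3.0) * np.exp(-s)

# Covariance matrix over a small set of design points Theta (values made up).
Theta = np.array([[0.5, 1/3], [0.25, 2/3], [0.75, 1/9], [0.125, 4/9]])
K = np.array([[sq_exp_kernel(a, b, sigma2=1.0, l=0.5) for b in Theta] for a in Theta])
print(K.shape)  # (4, 4), symmetric positive definite
```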
Now suppose that we are given data in the form of N distinct design points \(\Theta = \{\varvec{\theta }^i\}_{i = 1}^{N} \subseteq \mathbb {R}^{{d}_{\mathbf {\varvec{\theta }}}}\) with corresponding function values
\(\textbf{g}(\Theta ) = \{\textbf{g}(\varvec{\theta }^i)\}_{i=1}^{N}.\)
Since we have assumed that the multi-output function \(\textbf{g}_{0}\) is a Gaussian process, the vector
for any test point \(\tilde{\varvec{\theta }}\), follows a multivariate Gaussian distribution. The conditional distribution of \(\textbf{g}_0(\tilde{\varvec{\theta }})\) given the set of values \(\textbf{g}(\Theta )\) is then again Gaussian, with mean and covariance given by the standard formulas for the conditioning of Gaussian random variables (Rasmussen and Williams 2006). In particular, if we denote by \(\textbf{g}^{N}\) the Gaussian process (4) conditioned on the values \(\textbf{g}(\Theta )\), we have
\(\textbf{g}^{N}(\varvec{\theta }) \sim \text {GP}\left( \textbf{m}^{\textbf{g}}_{N}(\varvec{\theta }), K_{N}(\varvec{\theta },\varvec{\theta }')\right), \qquad \qquad (7)\)
where the predictive mean vector \(\textbf{m}^{\textbf{g}}_{N}\) and the predictive covariance matrix \(K_{N}(\varvec{\theta },\varvec{\theta }')\) are given by
\(\textbf{m}^{\textbf{g}}_{N}(\varvec{\theta }) = \textbf{m}(\varvec{\theta }) + K(\varvec{\theta },\Theta )\, K(\Theta ,\Theta )^{-1} \left( \textbf{g}(\Theta ) - \textbf{m}(\Theta )\right) \qquad \qquad (8)\)
and
\(K_{N}(\varvec{\theta },\varvec{\theta }') = K(\varvec{\theta },\varvec{\theta }') - K(\varvec{\theta },\Theta )\, K(\Theta ,\Theta )^{-1} K(\Theta ,\varvec{\theta }'), \qquad \qquad (9)\)
with \(K(\varvec{\theta },\Theta ) \in \mathbb {R}^{d\times Nd}\) the block matrix with blocks \(K(\varvec{\theta },\varvec{\theta }^{i})\), \(K(\Theta ,\Theta ) \in \mathbb {R}^{Nd\times Nd}\) the block matrix with blocks \(K(\varvec{\theta }^{i},\varvec{\theta }^{j})\), and \(\textbf{g}(\Theta ), \textbf{m}(\Theta ) \in \mathbb {R}^{Nd}\) the stacked vectors of function values and prior means at the design points.
We note that \(\textbf{g}^{N}\) is the Gaussian process posterior, but to avoid confusion with the posterior of the Bayesian inverse problem, we call it the predictive Gaussian process. In addition, for clarity of notation, we use regular font for scalar values, bold font for vectors, and capital letters for matrices (details in Table 1).
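As an illustration of the conditioning formulas above, the following sketch computes the predictive mean and variance for the baseline model (independent outputs, scalar kernel applied entrywise, zero prior mean); the training function and design points are made up for the example, and a small jitter is added for numerical stability.

```python
import numpy as np

def k_p(A, B, sigma2=1.0, l=0.5):
    """Squared exponential kernel matrix between two sets of points (rows)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / l**2)

# Design points Theta and (made-up) vector-valued observations g(Theta), one row per point.
Theta = np.array([[0.2], [0.4], [0.6], [0.8]])          # N = 4, d_theta = 1
G = np.sin(3 * Theta) @ np.ones((1, 3))                 # N x d_y with d_y = 3 (toy data)

def predict(theta_test, Theta, G, jitter=1e-10):
    """Predictive mean (d_y,) and scalar predictive variance for the baseline model."""
    K_TT = k_p(Theta, Theta) + jitter * np.eye(len(Theta))
    k_tT = k_p(np.atleast_2d(theta_test), Theta)        # 1 x N
    alpha = np.linalg.solve(K_TT, G)                    # N x d_y
    mean = (k_tT @ alpha).ravel()                       # zero prior mean assumed
    var = k_p(np.atleast_2d(theta_test), np.atleast_2d(theta_test)) \
          - k_tT @ np.linalg.solve(K_TT, k_tT.T)
    return mean, var.item()                             # predictive covariance is var * I_{d_y}

m, v = predict(np.array([0.5]), Theta, G)
print(m, v)
```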
2.3 Gaussian emulators and approximate posteriors
We now discuss two different approaches for constructing a Gaussian emulator and using it for approximating the posterior of interest. The first approach constructs an emulator for the forward map \(\mathcal {G}_{X}\), while the second approach is based on constructing an emulator directly for the log-likelihood.
2.3.1 Emulating the forward map
Given the data set \({\mathcal {G}_{X}}(\Theta ) = \{\mathcal {G}_{X}(\varvec{\theta }^i)\}_{i=1}^{N}\), we can now proceed with building our Gaussian process emulator for the forward map \(\mathcal {G}_{X}\). Using notation similar to (7), we denote the predictive Gaussian process by \(\mathcal {G}_{X}^{N}\). One then needs to decide how to incorporate the emulator into the construction of an approximate posterior. In particular, depending on what type of information we plan to utilise, different approximations will be obtained. If we use its predictive mean \(\textbf{m}^{\mathcal {G}_X}_{N}\) as a point estimator of the forward map \(\mathcal {G}_{X}\), we obtain
\(\pi ^{N,\mathcal {G}_X}_{\textrm{mean}}(\varvec{\theta }|\textbf{y}) \propto \exp \left( -\tfrac{1}{2}\Vert \textbf{y}-\textbf{m}^{\mathcal {G}_X}_{N}(\varvec{\theta })\Vert ^{2}_{\Gamma _\eta }\right) \pi _0(\varvec{\theta }). \qquad \qquad (10)\)
Alternatively, we can try to exploit the full information given by the Gaussian process by incorporating its variance in the posterior approximation. A natural way to do this is to consider the following approximation:
\(\pi ^{N,\mathcal {G}_X}_{\textrm{marginal}}(\varvec{\theta }|\textbf{y}) \propto \mathbb {E}\left[ \exp \left( -\tfrac{1}{2}\Vert \textbf{y}-\mathcal {G}^{N}_{X}(\varvec{\theta })\Vert ^{2}_{\Gamma _\eta }\right) \right] \pi _0(\varvec{\theta }) \propto \det \left( \Gamma _\eta + K_{N}(\varvec{\theta },\varvec{\theta })\right) ^{-1/2} \exp \left( -\tfrac{1}{2}\Vert \textbf{y}-\textbf{m}^{\mathcal {G}_X}_{N}(\varvec{\theta })\Vert ^{2}_{\Gamma _\eta + K_{N}(\varvec{\theta },\varvec{\theta })}\right) \pi _0(\varvec{\theta }), \qquad \qquad (11)\)
where the expectation is taken over the probability space of the Gaussian process posterior. A detailed derivation of (11) can be found in Appendix A. Comparing (11) with (10), the likelihood function in the marginal approximation is Gaussian, with the additional uncertainty \(K_{N}(\varvec{\theta },\varvec{\theta })\) from the emulator included in its covariance matrix. Hence, for a fixed parameter \(\varvec{\theta }\), the likelihood function in (11) will be less concentrated due to variance inflation. When the magnitude of \(K_{N}(\varvec{\theta },\varvec{\theta })\) is small compared to that of \(\Gamma _{\eta }\), the marginal approximation will be similar to the mean-based approximation.
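The following sketch contrasts the two resulting (unnormalised) log-likelihood approximations in (10) and (11) for a fixed \(\varvec{\theta }\), taking the predictive mean and covariance of the emulator as inputs; it is a minimal illustration of the variance inflation, with made-up numbers.

```python
import numpy as np

def log_like_mean_based(y, m, Gamma_eta):
    """Unnormalised log-likelihood using only the predictive mean of the emulator."""
    r = y - m
    return -0.5 * r @ np.linalg.solve(Gamma_eta, r)

def log_like_marginal(y, m, K, Gamma_eta):
    """Marginal approximation: Gaussian likelihood with inflated covariance Gamma + K."""
    C = Gamma_eta + K
    r = y - m
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (r @ np.linalg.solve(C, r) + logdet)

d_y = 6
y = np.ones(d_y)
m = 0.9 * np.ones(d_y)                       # predictive mean of the emulator at theta
K = 0.05 * np.eye(d_y)                       # predictive covariance (emulator uncertainty)
Gamma_eta = 1e-4 * np.eye(d_y)

print(log_like_mean_based(y, m, Gamma_eta))  # very peaked: noise covariance only
print(log_like_marginal(y, m, K, Gamma_eta)) # flatter: emulator variance inflates covariance
```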
2.3.2 Emulating the log-likelihood
Another way of building the emulator is to model the potential function \(\Phi \) directly. We can convert the data set \(\mathcal {G}_{X}(\Theta )\) into a data set of negative log-likelihood evaluations \(\varvec{\Phi }(\Theta ) = \{\Phi (\varvec{\theta }^i,\textbf{y})\}_{i = 1}^{N}\), and obtain the predictive Gaussian process \(\Phi ^{N}(\varvec{\theta })\sim \text {GP}(m^{\Phi }_{N}(\varvec{\theta }), k_{N}(\varvec{\theta },\varvec{\theta }))\). Again, if we only include the mean of the Gaussian process emulator, the posterior approximation becomes
\(\pi ^{N,\Phi }_{\textrm{mean}}(\varvec{\theta }|\textbf{y}) \propto \exp \left( -m^{\Phi }_{N}(\varvec{\theta })\right) \pi _0(\varvec{\theta }), \qquad \qquad (12)\)
while, in a similar fashion to the forward map emulation, we can take into account the covariance of our emulator to obtain the approximate posterior
\(\pi ^{N,\Phi }_{\textrm{marginal}}(\varvec{\theta }|\textbf{y}) \propto \mathbb {E}\left[ \exp \left( -\Phi ^{N}(\varvec{\theta })\right) \right] \pi _0(\varvec{\theta }) = \exp \left( -m^{\Phi }_{N}(\varvec{\theta }) + \tfrac{1}{2} k_{N}(\varvec{\theta },\varvec{\theta })\right) \pi _0(\varvec{\theta }). \qquad \qquad (13)\)
The derivation of (13) is similar to that of (11). Note that in this case the following relationship holds between the two approximate posteriors:
\(\pi ^{N,\Phi }_{\textrm{marginal}}(\varvec{\theta }|\textbf{y}) \propto \exp \left( \tfrac{1}{2} k_{N}(\varvec{\theta },\varvec{\theta })\right) \pi ^{N,\Phi }_{\textrm{mean}}(\varvec{\theta }|\textbf{y}),\)
which again illustrates a form of variance inflation for the marginal posterior approximation.
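The relationship above rests on the log-normal moment identity \(\mathbb {E}[\exp (-Z)] = \exp (-m + s^2/2)\) for \(Z \sim \mathcal {N}(m, s^2)\); the following few lines check it by Monte Carlo with arbitrary values for the predictive mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s2 = 2.0, 0.5                      # predictive mean and variance of the emulated potential
Z = rng.normal(m, np.sqrt(s2), size=10**6)
print(np.mean(np.exp(-Z)))            # Monte Carlo estimate of E[exp(-Phi^N)]
print(np.exp(-m + 0.5 * s2))          # closed form used in the marginal approximation
```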
In summary, we have two methods for approximating the true posterior: the mean-based approximation and the marginal approximation; and we have two types of emulators: the forward map emulator and the potential function emulator; thus by combination we have four types of approximation in total. The convergence properties of all these approximate posteriors were the subject of study in Stuart and Teckentrup (2018); Teckentrup (2020); Helin et al. (2023), where it was proved under suitable assumptions that all of them converge to the true posterior as \(N \rightarrow \infty \). However, in the case of small N, the difference between the approximate posteriors could be large and which one we choose is important. Furthermore, the type of Gaussian process emulator used plays an even bigger role in this case, and one would like to use a Gaussian prior that is as informative as possible. We discuss how to do this in the next section.
3 Methodology
Having described the different types of posterior approximations we will consider, in this section we discuss different modelling approaches for the prior distribution used in our Gaussian emulators. In doing so, it is important to note that the function that we are interested in emulating, in this case the forward map \(\mathcal {G}_{X}(\varvec{\theta })\), depends not only on the parameters \(\varvec{\theta }\) of our PDE, but also on the locations of the spatial observations. Thus, in terms of modelling, one would like to take this into account and build spatial correlation explicitly into the prior covariance. Note that when emulating the potential \(\Phi \) instead of the forward map \({\mathcal {G}}_{X}\), we are emulating a scalar-valued function. Since \(\Phi \) is a non-linear function of \({\mathcal {G}}_{X}\), it is not possible to extend the ideas of spatial correlation presented in this section to emulating \(\Phi \), and in particular, it is not possible to construct a PDE-informed emulator in the same way.
Introducing spatial correlation when emulating \(\mathcal {G}_{X}(\varvec{\theta })\) can be done in two different ways, the first by prescribing some explicit form of spatial correlation, and the second by using the fact that we know that our forward map is associated with the solution of a linear PDE. We do this in Sect. 3.1. It is important to note that in both cases it is possible to calculate the gradients with respect to the parameters \(\varvec{\theta }\) in a closed form, which can then be used to sample from the approximate posterior distributions using gradient-based MCMC methods such as MALA. We discuss this in more detail in Sect. 3.2.
3.1 Correlated and PDE-informed priors
We now discuss two different approaches to incorporate spatial correlation into our prior covariance function for the forward map \(\mathcal {G}_{X}(\varvec{\theta })\). Even though this is a function from the parameter space \(\mathcal {T}\) to the observation space \(\mathbb {R}^{{d}_{\textbf{y}}}\), for introducing more complicated spatial correlation it is useful to think first about the PDE solution \(u(\varvec{\theta },\textbf{x})\) as a function from \(\mathcal {T} \times {\overline{D}} \) to \(\mathbb {R}\). We introduce the prior covariance function \(k((\varvec{\theta },\textbf{x}),(\varvec{\theta }',\textbf{x}'))\) for \(u(\varvec{\theta },\textbf{x})\), and choose a separable model
\(k((\varvec{\theta },\textbf{x}),(\varvec{\theta }',\textbf{x}')) = k_{p}(\varvec{\theta },\varvec{\theta }')\, k_{s}(\textbf{x},\textbf{x}'), \qquad \qquad (14)\)
where \(k_p\) and \(k_s\) are the covariance functions for the parameters \(\varvec{\theta }\) and the spatial points \(\textbf{x}\) respectively.
Using the fact that the forward map \(\mathcal {G}_{X}\) relates to the point-wise evaluation of the function \(u(\varvec{\theta }, \textbf{x})\) for \(\textbf{x} \in X\), and assuming zero mean, we then obtain the Gaussian prior
\(\mathcal {G}_{X,0}(\varvec{\theta }) \sim \text {GP}\left( \textbf{0}, K(\varvec{\theta },\varvec{\theta }')\right) \qquad \qquad (15)\)
with
\(K(\varvec{\theta },\varvec{\theta }') = k_{p}(\varvec{\theta },\varvec{\theta }')\, K_{s}(X,X),\)
where \(K_s\) is the covariance matrix with entries \((K_{s}(X,X))_{i,j}=k_{s}(\textbf{x}_{i},\textbf{x}_{j})\), \( \textbf{x}_{i},\textbf{x}_{j} \in X \). This prior can then be conditioned on training data \({\mathcal {G}}_X(\Theta ) \), and due to the separable structure in (14), the predictive mean \(\textbf{m}_N^{{\mathcal {G}}_X}(\varvec{\theta })\) is in fact the same as for the baseline model in Sect. 2.2. See Appendix B for details.
The second way of introducing spatial correlation is explicitly taking into account that the forward map is related to a PDE solution. Given the PDE system
as described in Sect. 2, we can build a joint prior between u, f and g. In particular, if we take fixed points \(\textbf{x},\textbf{x}_{f} \in D\) and \(\textbf{x}_{g} \in \partial D\) we have that
where the above is a Gaussian process as a function of \(\varvec{\theta }\), and we have used known properties of linear operators applied to Gaussian processes (see e.g. Matsumoto and Sullivan 2023) in the derivation. The idea of a joint prior between u and f was also used in Raissi et al. (2017); Spitieris and Steinsland (2023); Cockayne et al. (2017); Pförtner et al. (2022), while Oates et al. (2019) use this explicitly in an inverse problem setting. The crucial difference is that in these works u and f were considered as functions of the spatial variable \(\textbf{x}\) only, while here we instead explicitly model the dependency of u on \(\varvec{\theta }\). We then have
where
and \(X_{g} \subset \partial D\) and \(X_{f} \subset D\) are collections of \(d_g\) and \(d_f\) points at which g and f have been evaluated, respectively. Note that the marginal prior placed on \({\mathcal {G}}_X\) is the same as in (15).
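To illustrate how the covariance blocks involving f are obtained by applying the differential operator to the kernel, the following SymPy sketch uses the simple one-dimensional operator \(\mathcal {L}^{\theta } u = -\theta \, u''\) (chosen purely for illustration; it is not one of the operators used in Sect. 4) together with a squared exponential kernel in x. The cross-covariance block between u and f, and the block between f and f, then follow by differentiating the spatial kernel in one or both arguments.

```python
import sympy as sp

x, xp, theta, thetap, l = sp.symbols("x x_p theta theta_p l", positive=True)

# Spatial part of the separable prior covariance for u (squared exponential in x).
k_s = sp.exp(-(x - xp) ** 2 / (2 * l ** 2))

# Illustrative linear operator L^theta u = -theta * u_xx applied to each argument.
k_uf = -thetap * sp.diff(k_s, xp, 2)   # Cov(u(x), f(x')) block, up to the k_p(theta, theta') factor
k_ff = -theta * sp.diff(k_uf, x, 2)    # Cov(f(x), f(x')) block, up to the k_p(theta, theta') factor

print(sp.simplify(k_uf))
print(sp.simplify(k_ff))
```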
The prior (17) can then again be conditioned on training data as in Sect. 2.2, see Appendix B for details. Note that in this case we are updating our prior on \({\mathcal {G}}_X(\varvec{\theta })\) using the observations \(g(\Theta ,X_g)\) and \(f(\Theta ,X_f)\) as well as \(\mathcal G_X(\Theta )\), essentially augmenting the space on which the emulator \(\mathcal {G}^N_{X}(\varvec{\theta })\) is trained. Since g and f are assumed known, these additional observations are cheap to obtain. It is also possible to condition on training data \(g(\Theta _g,X_g)\) and \(f(\Theta _f,X_f)\), for point sets \(\Theta _g\) and \(\Theta _f\) different to \(\Theta \), and this has been found to be beneficial in some of the numerical experiments (see Sect. 4 and Appendix D).
3.2 MCMC algorithms
To extract information from the posterior, MCMC algorithms are powerful and popular tools (Robert and Casella 2004; Brooks et al. 2011). In this work, we will consider the Metropolis-Adjusted Langevin Algorithm (MALA) (Roberts and Tweedie 1996), which is a type of MCMC algorithm that uses gradient information to accelerate the convergence of the sampling chain. Central to the idea of MALA is the over-damped Langevin stochastic differential equation (SDE)
\(\textrm{d}\varvec{\theta }(t) = \nabla \log \pi (\varvec{\theta }(t)|\textbf{y})\, \textrm{d}t + \sqrt{2}\, \textrm{d}W(t), \qquad \qquad (18)\)
where W is a standard \({d_{\varvec{\theta }}}\)-dimensional Brownian motion. Under mild conditions on the posterior \(\pi \) (Robert and Casella 2004), (18) is ergodic and has \(\pi \) as its stationary distribution, so that the probability density function of \(\varvec{\theta }(t)\) tends to \(\pi \) as \(t \rightarrow \infty \).
In practice, (18) is discretised with a simple Euler-Maruyama method with a time step \(\gamma \):
\(\varvec{\theta }_{n+1} = \varvec{\theta }_{n} + \gamma \nabla \log \pi (\varvec{\theta }_{n}|\textbf{y}) + \sqrt{2\gamma }\, \xi _n, \qquad \qquad (19)\)
with \(\xi _n \sim \mathcal {N}(0,I_{{d}_{\varvec{\theta }}})\). Even if the dynamics of (19) remain ergodic, the corresponding numerical invariant measure does not necessarily coincide with the posterior. To remove this bias, one incorporates an accept-reject mechanism (Sanz-Serna 2014). This gives rise to MALA as described in Algorithm 1.
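A minimal sketch of MALA as described in Algorithm 1 is given below; `log_post` and `grad_log_post` stand in for the approximate log-posterior and its gradient, here replaced by a standard Gaussian purely so that the sketch runs.

```python
import numpy as np

def mala(log_post, grad_log_post, theta0, gamma, n_samples, rng=None):
    """Metropolis-adjusted Langevin algorithm with fixed step size gamma."""
    rng = rng or np.random.default_rng()
    d = len(theta0)
    def log_q(xp, x):  # log density (up to a constant) of the Langevin proposal x -> xp
        diff = xp - x - gamma * grad_log_post(x)
        return -np.sum(diff ** 2) / (4.0 * gamma)
    theta = np.array(theta0, dtype=float)
    lp = log_post(theta)
    samples, accepted = [], 0
    for _ in range(n_samples):
        prop = theta + gamma * grad_log_post(theta) + np.sqrt(2 * gamma) * rng.standard_normal(d)
        lp_prop = log_post(prop)
        log_alpha = lp_prop + log_q(theta, prop) - lp - log_q(prop, theta)
        if np.log(rng.uniform()) < log_alpha:
            theta, lp = prop, lp_prop
            accepted += 1
        samples.append(theta.copy())
    return np.array(samples), accepted / n_samples

# Toy target: standard Gaussian in 2D (stands in for the approximate posterior).
log_post = lambda t: -0.5 * np.sum(t ** 2)
grad_log_post = lambda t: -t
chain, acc_rate = mala(log_post, grad_log_post, np.zeros(2), gamma=0.5, n_samples=5000)
print(acc_rate)
```

In practice the step size \(\gamma \) is tuned adaptively, as described in Sect. 4, rather than kept fixed as in this sketch.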
An advantage of using the Gaussian process emulator in the posterior is that, assuming the prior is differentiable, \(\nabla \log \pi ^N(\varvec{\theta }|{\textbf{y}})\) can be computed analytically for the mean-based and marginal approximations introduced in Sect. 2.3, which enables us to easily implement the MALA algorithm. Note that, in contrast, since the true posterior involves the (analytical or numerical) solution u to the PDE (1a)-(1b), it is usually impossible to compute these gradients analytically, and one needs to resort to numerical approximation. The following lemma gives the gradient of the different approximate posteriors. The proof can be found in Appendix C.
Lemma 1
Given the Gaussian process \(\mathcal {G}^N_X \sim \text {GP}(\textbf{m}^{\mathcal {G}_X}_N(\varvec{\theta }),K_N(\varvec{\theta },\varvec{\theta }))\) that emulates the forward map \(\mathcal {G}_X\) with data \(\mathcal {G}_X(\Theta )\), we have the gradient of the mean-based approximation of the posterior
and the gradient of the marginal approximation of the posterior
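As a concrete illustration of the mean-based case, the following sketch computes the gradient of the mean-based log-posterior by the chain rule through the predictive mean, assuming the baseline model with a squared exponential kernel, zero prior mean, and a flat prior inside its support; it then checks the result against a finite difference. This is a sketch under these assumptions, not a transcription of the formulas in Lemma 1.

```python
import numpy as np

l, sigma2 = 0.5, 1.0
def k_vec(theta, Theta):
    """Vector (k_p(theta, theta^i))_i for the squared exponential kernel."""
    d2 = np.sum((Theta - theta) ** 2, axis=1)
    return sigma2 * np.exp(-0.5 * d2 / l**2)

rng = np.random.default_rng(1)
Theta = rng.uniform(-1, 1, size=(4, 2))          # N = 4 design points, d_theta = 2
G = rng.standard_normal((4, 6))                  # made-up training outputs, d_y = 6
y = rng.standard_normal(6)
Gamma_inv = 1e4 * np.eye(6)                      # Gamma_eta = 1e-4 * I

K_TT = sigma2 * np.exp(-0.5 * ((Theta[:, None, :] - Theta[None, :, :]) ** 2).sum(-1) / l**2)
A = np.linalg.solve(K_TT + 1e-10 * np.eye(4), G)  # N x d_y coefficients

def grad_log_post_mean(theta):
    """Gradient of the mean-based log-posterior (flat prior) via the chain rule."""
    kv = k_vec(theta, Theta)                     # (N,)
    m = A.T @ kv                                 # predictive mean, (d_y,)
    D = (kv[:, None] * (Theta - theta)) / l**2   # d k_p / d theta, (N, d_theta)
    J = A.T @ D                                  # Jacobian of the predictive mean, (d_y, d_theta)
    return J.T @ (Gamma_inv @ (y - m))

# Finite-difference check of the analytic gradient.
theta0 = np.array([0.1, -0.2])
log_post = lambda t: -0.5 * (y - A.T @ k_vec(t, Theta)) @ Gamma_inv @ (y - A.T @ k_vec(t, Theta))
eps = 1e-6
fd = np.array([(log_post(theta0 + eps * e) - log_post(theta0 - eps * e)) / (2 * eps)
               for e in np.eye(2)])
print(np.allclose(grad_log_post_mean(theta0), fd, atol=1e-4))
```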
4 Numerical experiments
We now discuss a number of different numerical experiments related to inverse problems for the PDE (1a)-(1b), in various set-ups in terms of the number of spatial and parameter dimensions, as well as for different types of forward models. A common theme in all our experiments is that the number of training points N is small, as would be the case in large-scale applications in practice, where increasing the number of training points is often either very costly or infeasible. The number of training points N used will serve as a benchmark for comparing different methodologies. Throughout all our numerical experiments in Sects. 4.1-4.3, when comparing the different approaches we keep N fixed. The value of N is chosen in such a way as to ensure that significant uncertainty remains present in the emulator, which is typically the case in applications. Alternatively, one could ask what number of training points each model needs to reach a certain accuracy; however, as explained above, this is not the viewpoint taken here. Precise timings for each of the approaches are reported in Sect. 4.4.
In cases where the PDE solution is not available in closed form, we use the finite element software Firedrake (Rathgeber et al. 2016) to obtain the "true" solution. Furthermore, when using MALA we adaptively tune the step size to achieve an average acceptance probability 0.573 (Brooks et al. 2011). In all our numerical experiments, we replace the uniform prior with a smooth approximation given by the \(\lambda -\)Moreau-Yosida envelope (Bauschke et al. 2011) with \(\lambda = 10^{-3}\).
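The smoothing replaces the flat log-prior on the box (which has a non-differentiable, infinite barrier at the boundary) by a quadratic penalty on the distance to the box. A sketch of this interpretation of the \(\lambda \)-Moreau-Yosida envelope, up to an additive constant, is given below; the interpretation via the distance to the box is an assumption consistent with the standard definition of the Moreau envelope of an indicator function.

```python
import numpy as np

def smoothed_log_uniform_prior(theta, lam=1e-3, lower=-1.0, upper=1.0):
    """Moreau-Yosida-type smoothing of the uniform prior on a box, up to a constant:
    zero inside the box and -dist(theta, box)^2 / (2 * lam) outside, which is
    differentiable everywhere."""
    proj = np.clip(theta, lower, upper)          # projection onto the box
    dist2 = np.sum((theta - proj) ** 2)
    return -dist2 / (2.0 * lam)

def grad_smoothed_log_uniform_prior(theta, lam=1e-3, lower=-1.0, upper=1.0):
    """Gradient of the smoothed log-prior, usable inside MALA."""
    proj = np.clip(theta, lower, upper)
    return -(theta - proj) / lam

print(smoothed_log_uniform_prior(np.array([0.5, 0.9])))    # inside the box: 0.0
print(smoothed_log_uniform_prior(np.array([1.1, 0.0])))    # outside: steep quadratic penalty
```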
The selection of hyperparameters is crucial for the application of Gaussian process regression. In this paper, we optimise the hyperparameters by minimising the negative log marginal likelihood, which in general can be computationally intensive. We simplify this process by assuming isotropy for the length-scale in the covariance function for \(\varvec{\theta }\) as in (5) and (6), so the optimisation of the hyperparameters becomes a two-dimensional problem. Additionally, since we are operating in the small training data regime, the computational cost of evaluating the log marginal likelihood is small. To emphasise the improvement brought by the structure, and also for simplicity, in the case of the spatially correlated and the PDE-constrained models we use the same hyperparameters for the covariance function of the unknown parameter \(\varvec{\theta }\) as in the baseline model, and only optimise the hyperparameters of the spatial covariance function. In principle, these assumptions can be relaxed to achieve potentially higher accuracy in the regression. The computational timings for the optimisation of the hyperparameters can be found in Appendix D.
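A sketch of the two-dimensional hyperparameter optimisation \((\sigma ^2, l)\) by minimising the negative log marginal likelihood of a scalar GP with squared exponential kernel is shown below; the training data are synthetic, and the optimisation is performed over log-hyperparameters to keep them positive. The specific optimiser and data are illustrative choices, not those used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
Theta = rng.uniform(-1, 1, size=(10, 2))                # design points
g = np.sin(3 * Theta[:, 0]) * np.cos(2 * Theta[:, 1])   # scalar training values (synthetic)

def neg_log_marginal_likelihood(log_params):
    sigma2, l = np.exp(log_params)                      # logs keep the parameters positive
    d2 = ((Theta[:, None, :] - Theta[None, :, :]) ** 2).sum(-1)
    K = sigma2 * np.exp(-0.5 * d2 / l**2) + 1e-6 * sigma2 * np.eye(len(Theta))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, g))
    return 0.5 * g @ alpha + np.sum(np.log(np.diag(L))) # up to an additive constant

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 0.5]), method="Nelder-Mead")
sigma2_opt, l_opt = np.exp(res.x)
print(sigma2_opt, l_opt)
```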
To clarify the notation we use in our numerical experiments, we recall some of it in Table 2.
4.1 Examples in one spatial dimension
4.1.1 Two-dimensional piece-wise constant diffusion coefficient
We consider an elliptic equation with a 2-dimensional piece-wise constant diffusion coefficient; we have the following equation
where \(\kappa \) is defined as piece-wise constant over four equally spaced intervals. More precisely, we consider
Since it is not possible to solve (20) explicitly, we use Firedrake to obtain its solution.
Throughout this numerical experiment, we take the prior of the parameters to be the uniform distribution on \([-1,1]^{2}\), and we generate our data \(\textbf{y}\) according to equation (2) for \(\varvec{\theta }^{\dagger }=[0.098, 0.430]\), \({d}_{\textbf{y}}=6\) (equally spaced points in [0,1]) and noise level \(\sigma ^{2}_{\eta }=10^{-4}\). For the covariance kernels, we choose \(k_p\) to be the squared exponential kernel and \(k_s\) to be the Matérn kernel with \(\nu = \frac{5}{2}\).
For the PDE constrained model, we first test the effect of additional training data \(g(\Theta _g, X_g)\) and \(f(\Theta _f, X_f)\) on the accuracy of the emulator. In Fig. 1, we see that as \(d_{f}\) and \(\bar{N}\) increase, the accuracy of emulators gradually increases.
We now use MALA to draw \(10^{6}\) samples from each of our approximate posteriors. For all models, we have used \(N=4\) training points (chosen to be the first 4 points in the Halton sequence), while additionally for the PDE-constrained model, we have used \(\bar{N} = 10\) (chosen to be the next 10 points in the Halton sequence), \(d_{f} = 20\) and \(d_{g} = 2\). Since we do not have access to the true posterior, we consider the results obtained from a mean-based approximation with the baseline model for \(N=10^{2}\) training points as the ground truth.
As we can see in Fig. 2, all the mean-based posteriors fail to put significant posterior mass near the true parameter value \(\varvec{\theta }^{\dagger }\). The situation improves when the uncertainty of the emulator is taken into account, as we can see for the marginal approximations. Out of the three different models, the PDE-constrained one seems to perform best, since it places the most posterior mass around the true value \(\varvec{\theta }^{\dagger }\). This is further illustrated in Fig. 3, where we plot the \(\theta _{1}\) and \(\theta _{2}\) marginals for all the mean-based posterior approximations \(\pi ^{N,\mathcal {G}_X}_{\textrm{mean}}\), \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{mean}}\), \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{mean}}\) and the marginal-based posterior approximations \(\pi ^{N,\mathcal {G}_X}_{\textrm{marginal}}\), \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{marginal}}\), \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{marginal}}\). Note that the marginal plots can be misleading regarding the overall performance of the approximations: for example, in Fig. 3 (top right) the baseline model seems to be better than the PDE-constrained model, but from Fig. 2 we know that this is not true. In other words, the one-dimensional marginals can be approximated well even when the joint posterior is approximated poorly. When we increase \(d_{f}\) from 20 to 50, the accuracy of the approximation improves, as we can see in Fig. 4, where we compare the PDE-constrained approximations for the two different values of \(d_{f}\).
4.1.2 Parametric expansion for the diffusion coefficient
In this example, we study again (20), but this time instead of working with a piece-wise constant diffusion coefficient we assume that the diffusion coefficient satisfies the following parametric expansion
where \(a_n = \frac{8}{\omega ^2_n + 16}\), \(b_{n}(x) = A_n (\sin (\omega _n x) + \frac{\omega _n}{4}\cos (\omega _n x))\), \(\omega _n\) is the \(n\)-th solution of the equation \(\tan (\omega _n) = \frac{8\omega _n}{\omega ^2_n - 16}\), and \(A_n\) is a normalisation constant which makes \(\Vert b_n\Vert _{L^2(0,1)} = 1\). This choice is motivated by the fact that, for \(\{\theta _n\}_{n=1}^{d_{\varvec{\theta }}}\) i.i.d. standard normal random variables, this is a truncated Karhunen-Loève expansion of \(\log (\kappa (\varvec{\theta }, x)) \sim \text {GP}(0, \exp (-\Vert x-x'\Vert _1))\) (Ghanem and Spanos 1991).
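A sketch of how the ingredients of the expansion can be computed numerically is given below: the frequencies \(\omega _n\) are obtained by root-finding on the given transcendental equation, and \(A_n\) by numerical normalisation of \(b_n\) on [0,1]. The final assembly of the diffusion coefficient as \(\kappa (\varvec{\theta },x) = \exp \bigl (\sum _n \sqrt{a_n}\, \theta _n b_n(x)\bigr )\) is our reading of the standard truncated Karhunen-Loève form described in the text, and should be treated as an assumption.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.integrate import quad

def omegas(n_terms, search_max=60.0, n_grid=20000):
    """First n_terms positive roots of (w^2 - 16) sin(w) = 8 w cos(w),
    i.e. of tan(w) = 8 w / (w^2 - 16), found by bracketing sign changes."""
    f = lambda w: (w**2 - 16.0) * np.sin(w) - 8.0 * w * np.cos(w)
    grid = np.linspace(1e-3, search_max, n_grid)
    vals = f(grid)
    roots = []
    for i in range(n_grid - 1):
        if vals[i] * vals[i + 1] < 0:
            roots.append(brentq(f, grid[i], grid[i + 1]))
            if len(roots) == n_terms:
                break
    return np.array(roots)

def kl_ingredients(n_terms):
    w = omegas(n_terms)
    a = 8.0 / (w**2 + 16.0)
    def b(n, x):
        unnorm = lambda t: np.sin(w[n] * t) + (w[n] / 4.0) * np.cos(w[n] * t)
        A = 1.0 / np.sqrt(quad(lambda t: unnorm(t) ** 2, 0.0, 1.0)[0])  # ||b_n||_{L2(0,1)} = 1
        return A * unnorm(x)
    return w, a, b

def kappa(theta, x, w, a, b):
    """Assumed form of the truncated expansion: kappa = exp(sum_n sqrt(a_n) theta_n b_n(x))."""
    return np.exp(sum(np.sqrt(a[n]) * theta[n] * b(n, x) for n in range(len(theta))))

w, a, b = kl_ingredients(2)
print(w, a)
print(kappa(np.array([0.098, 0.430]), 0.5, w, a, b))
```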
In terms of the inverse problem setting, we are using the same parameters as before (\(\varvec{\theta }^{\dagger } = [0.098,0.430]\), \({d}_{\textbf{y}}= 6\), noise level \(\sigma ^2_{\eta } = 10^{-4}\)). The number of training points for all the emulators has been set to \(N=4\) (chosen using the Halton sequence), while in the case of the PDE-constrained emulator we have used \(\bar{N}=10\) and \(d_{f}=8\). Furthermore, throughout this numerical experiment, we take the prior of the parameters to be the uniform distribution on \([-1,1]^{2}\). For the choices of kernels, we use the squared exponential kernel for both \(k_p\) and \(k_s\).
As in the previous experiments, we produce \(10^{6}\) samples of the posteriors using MALA, and use the results obtained by a mean-based approximation with the baseline model for \(N=10^{2}\) training points as the ground truth.
We now plot in Fig. 5 the \(\theta _{1}\) and \(\theta _{2}\) marginals for the different Gaussian emulators both in the case of mean-based and marginal posterior approximations. As we can see in Fig. 5 for the mean-based posterior approximations, the baseline and spatially correlated model fail to capture the true posterior while this is not the case for the PDE-constrained model since the agreement with the true posterior is excellent. When looking at the marginal approximations in Fig. 5 (bottom left and bottom right) we can see that the marginals for the baseline and spatially correlated models move closer towards the true value \(\varvec{\theta }^{\dagger }\) and exhibit variance inflation. This is, however, not the case for the PDE-constrained model since again it is in excellent agreement with the true posterior.
4.2 Two spatial dimensions
In this example, we increase the spatial dimension from \(d_\textbf{x} = 1\) to \(d_\textbf{x} = 2\) and again use a piece-wise constant diffusion coefficient depending on two parameters. The values of the diffusion coefficient are set in a similar way to the first example, but depend only on the first component of \(\textbf{x}\):
The boundary conditions are a mixture of Neumann and Dirichlet conditions, given by
These boundary conditions define a flow cell, with no flux at the top and bottom boundary (\(x_2=0,1\)) and flow from left to right induced by the higher value of u at \(x_1=0\).
For the observation, we generate our data \(\textbf{y}\) according to equation (2) for \(\varvec{\theta }^{\dagger }=[0.098, 0.430]\) with \({d}_{\textbf{y}}=6\) (chosen to be the first 6 points in the Halton sequence) and a noise level \(\sigma ^{2}_{\eta }=10^{-5}\). In addition, for the baseline and spatially correlated models, we have used \(N=4\) training points (chosen to be the first 4 points in the Halton sequence), while additionally for the PDE-constrained model, we have used \(\bar{N} = 30\), \(d_{{f}}=30\) and \(d_g=8\), corresponding to 2 equally spaced points on each boundary. For the covariance kernels, we let \(k_p\) be the squared exponential kernel and \(k_s\) be the Matérn kernel with \(\nu = \frac{5}{2}\).
We plot the mean-based approximate marginal posteriors in Fig. 6 (top left and top right). We can see that in this case, the PDE-constrained model significantly improves the approximation accuracy, which is different from the previous piece-wise constant diffusion coefficient example in one spatial dimension. In Fig. 6 (bottom left and bottom right), we compare the marginal approximation for the three models, and we again see that the PDE-constrained model performs best.
4.3 Emulating the negative log-likelihood
As discussed in Sect. 2.3.2, we can emulate the negative log-likelihood directly with Gaussian process regression instead of emulating the forward map. Since emulation of the log-likelihood involves emulating a non-linear functional of the PDE solution u, we are not able to incorporate spatial correlation or PDE constraints in the same way. We test the performance of the mean-based approximation (12) and the marginal approximation (13) using the previous examples: problem (20) with diffusion coefficient (21) with \(d_\textbf{x} = 1\) and \(d_\textbf{x} = 2\). All parameters are kept the same as in Sect. 4.1.1 and Sect. 4.2.
In Fig. 7, we compare the mean-based approximations obtained by emulating the log-likelihood \(\Phi \) and by emulating the observation operator \({\mathcal {G}}_X\) with the baseline model. We see that the results differ markedly between the two examples. For \(d_\textbf{x}=1\), emulating the log-likelihood performs better than emulating the observations with the baseline model: the approximate posterior is closer to the true posterior. For \(d_\textbf{x} = 2\), however, its performance is much worse. Hence, emulating the log-likelihood with a small amount of data can be less reliable than emulating the forward map. If we increase the number of training points to \(N=10\) in the \(d_\textbf{x} = 2\) case, we see an improvement in accuracy in Fig. 8, but the result is still worse than emulating the forward map with the baseline model.
Similarly, we see in Fig. 9 that marginal approximations of the posterior based on emulation of the log-likelihood appear to be less reliable, but including more training points can again improve the performance.
4.4 Computational timings
In this section, we discuss computational timings. We focus on the computational gains resulting from using Gaussian process emulators instead of the PDE solution in the posterior (see Table 3) and the relative costs of sampling from the various approximate posteriors (see Tables 4, 5, 6 and 7).
Table 3 gives average computational timings comparing the evaluation of the solution of the PDE using Firedrake with using the Gaussian process surrogate model. For the baseline surrogate model, the two primary costs are (i) computing the coefficients \(\varvec{\alpha } = K(\Theta ,\Theta )^{-1} {\mathcal {G}}_X(\Theta )\), which is an offline cost and only needs to be done once, and (ii) computing the predictive mean \(\textbf{m}_N^{{\mathcal {G}}_X}(\varvec{\theta }) = K(\varvec{\theta },\Theta )\, \varvec{\alpha }\), which is the online cost and needs to be done for every new test point \(\varvec{\theta }\). We see that evaluating \(\textbf{m}_N^{{\mathcal {G}}_X}(\varvec{\theta })\) is orders of magnitude faster than evaluating \({\mathcal {G}}_X(\varvec{\theta })\).
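The offline/online split described above amounts to precomputing the coefficient matrix once and reusing it for every MCMC proposal; a minimal sketch for the baseline model with zero prior mean (training data made up):

```python
import numpy as np

def kernel_matrix(A, B, sigma2=1.0, l=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2 / l**2)

# Offline (once): factorise K(Theta, Theta) and form alpha = K(Theta, Theta)^{-1} G_X(Theta).
rng = np.random.default_rng(3)
Theta = rng.uniform(-1, 1, size=(4, 2))              # N training points
G_train = rng.standard_normal((4, 6))                # N x d_y training outputs (placeholder)
alpha = np.linalg.solve(kernel_matrix(Theta, Theta) + 1e-10 * np.eye(4), G_train)

# Online (per MCMC proposal): one small kernel evaluation and one matrix product.
def predictive_mean(theta):
    return (kernel_matrix(np.atleast_2d(theta), Theta) @ alpha).ravel()

print(predictive_mean(np.array([0.1, -0.3])))
```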
In Tables 4, 5 and 6, we compare average computational timings of drawing one sample from the approximate posterior with different models. In Table 4, we see that the mean-based approximation with the PDE-informed prior is more expensive than the one with the baseline prior, by a factor of 2–4 depending on the setting. This is to be expected, since the PDE-informed posterior mean \(\textbf{m}^{{\mathcal G}_X}_{N,X_f,X_g}\) involves matrices of larger dimensions than the baseline posterior mean \(\textbf{m}^{{{\mathcal {G}}}_X}_{N}\).
Table 5 investigates the different marginal approximations. Compared to the mean-based approximations in Table 4, we see that the marginal approximations are more expensive by a factor of around 2 for the baseline model and around 3–10 for the PDE-constrained model. Within the different marginal approximations, the spatially correlated model is not much more expensive than the baseline model, whereas, depending on the setting, the PDE-constrained model is 2–30 times more expensive.
In Table 6, we can see that emulating the log-likelihood significantly reduces the cost of sampling from the mean-based and marginal approximations, by around a factor of 20 compared to the baseline model for emulating the observations.
Finally, Table 7 shows the effective sample sizes (ESSs) obtained for the different posterior approximations with MALA. We can see that the ESSs are all comparable, implying that it is meaningful to look at the cost per sample to compare the different approximate posteriors in terms of computational cost.
5 Conclusions, discussion and actionable advice
Bayesian inverse problems for PDEs pose significant computational challenges. The application of state-of-the-art sampling methods, including MCMC methods, is typically computationally infeasible due to the large computational cost of simulating the underlying mathematical model for a given value of the unknown parameters. A solution to alleviate this problem is to use a surrogate model to approximate the PDE solution within the Bayesian posterior distribution. In this work we considered the use of Gaussian process surrogate models, which are frequently used in engineering and geo-statistics applications and offer the benefit of built-in uncertainty quantification in the variance of the emulator.
The focus of this work was on practical aspects of using Gaussian process emulators in this context, providing efficient MCMC methods and studying the effect of various modelling choices in the derivation of the approximate posterior on its accuracy and computational efficiency. We now summarise the main conclusions of our investigation.
1. Emulating log-likelihood vs emulating observations. We can construct an emulator for the negative log-likelihood \(\Phi \) or the parameter-to-observation map \({\mathcal {G}}_X\) in the likelihood (3).
   - Computational efficiency. The log-likelihood \(\Phi \) is always scalar-valued, independent of the number of observations \({d}_{\textbf{y}}\), which makes the computation of the approximate likelihood for a given value of the parameters \(\varvec{\theta }\) much cheaper than the approximate likelihood with emulated \({\mathcal {G}}_X\). The relative cost will depend on \({d}_{\textbf{y}}\).
   - Accuracy. When only limited training data are available, emulating \({\mathcal {G}}_X\) appears more reliable than emulating \(\Phi \), even with the baseline model. The major advantage of emulating \({\mathcal {G}}_X\) is that it allows us to include correlation between different observations, i.e. between the different entries of \({\mathcal {G}}_X\). This substantially increases the accuracy of the approximate posteriors, in particular if we use the PDE structure to define the correlations (see point 3 below).
2. Mean-based vs marginal posterior approximations. We can use only the mean of the Gaussian process emulator to define the approximate posterior as in (10) and (12), or we can make use of its full distribution to define the marginal approximate posteriors as in (11) and (13).
   - Computational efficiency. The mean-based approximations are faster to sample from using MALA. This is due to the simpler structure of the gradient required for the proposals. The difference in computational times depends on the prior chosen, and is greater for the PDE-constrained model.
   - Accuracy. The marginal approximations correspond to a form of variance inflation in the approximate posterior (see Sect. 2.3), representing our incomplete knowledge about the PDE solution. They thus combat over-confident predictions. In our experiments, we confirm that they typically allocate larger mass to regions around the true parameter value than the mean-based approximations.
3. Spatial correlation and PDE-constrained priors.
   - Computational efficiency. Introducing the spatially correlated model only affects the marginal approximation, and sampling from the marginal approximate posterior with the spatially correlated model is slightly slower than with the baseline model. The PDE-constrained model significantly increases the computational times for both the mean-based and marginal approximations, with the extent depending on the size of the additional training data.
   - Accuracy. Introducing spatial correlation improves the accuracy of the marginal approximation compared to the baseline model. The most accurate results are obtained with the PDE-constrained priors, which are problem specific and more informative. A benefit of the spatially correlated model is that it does not rely on the underlying PDE being linear, and easily extends to non-linear settings.
In summary, the marginal posterior approximations and the spatially correlated/ PDE-constrained prior distributions provide mechanisms for increasing the accuracy of the inference and avoiding over-confident biased predictions, without the need to increase N. This is particularly useful in practical applications, where the number of model runs N available to train the surrogate model may be very small due to constraints in time and/or cost. This does result in higher computational cost compared to mean-based approximations based on black-box priors, but may still be the preferable option if obtaining another training point is impossible or computationally very costly.
Variance inflation, as exhibited in the marginal posterior approximations considered in this work, is a known tool to improve Bayesian inference in complex models, see e.g. (Conrad et al. 2017; Calvetti et al. 2018; Fox et al. 2020). Conceptually, it is also related to including model discrepancy (Kennedy and O'Hagan 2000; Brynjarsdóttir and O'Hagan 2014). The approach to variance inflation presented in this work has several advantages. Firstly, since the variance inflation is equal to the predictive variance of the emulator, the amount of variance inflation depends on the location \(\varvec{\theta }\) in the parameter space: we introduce more uncertainty in parts of the parameter space where we have fewer training points and the emulator is possibly less accurate. Secondly, the amount of variance inflation can be tuned in a principled way using standard techniques for hyperparameter estimation in Gaussian process emulators. There is no need to choose a model for the variance inflation separately from choosing the emulator, since it is determined automatically as part of the emulator.
We did not apply optimal experimental design in this work, i.e. how we should optimally choose the locations \(\Theta \) of the training data. One would expect that using optimal design will have a large influence on the accuracy of the approximate posteriors, especially for small N. In the context of inverse problems, one usually wants to place the training points in regions of parameter space where the (approximate) posterior places significant mass (see e.g. Helin et al. (2023) and the references therein). For a fair comparison between all scenarios, and to eliminate the interplay between optimal experimental design and other modelling choices, we have chosen the training points as a space-filling design in our experiments. We expect the same conclusions to hold with optimally placed points.
Data availability
No datasets were generated or analysed during the current study.
Notes
A number of additional numerical experiments can be found in Appendix D
References
Alvarez, M.A., Rosasco, L., Lawrence, N.D.: Kernels for vector-valued functions: a review. Foundations and Trends® in Machine Learning 4(3), 195–266 (2012)
Babuska, I., Nobile, F., Tempone, R.: A stochastic collocation method for elliptic partial differential equations with random input data. SIAM J. Numerical Analysis 45, 1005–1034 (2007)
Bauschke, H.H., Burachik, R.S., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H.: Fixed-point Algorithms for Inverse Problems in Science and Engineering, vol. 49. Springer, New York, NY (2011)
Bonilla, E.V., Chai, K., Williams, C.: Multi-task Gaussian process prediction. Advances in neural information processing systems 20 (2007)
Brooks, S., Gelman, A., Jones, G., Meng, X.-L.: Handbook of Markov Chain Monte Carlo. CRC Press, Boca Raton, FL (2011)
Brynjarsdóttir, J., O’Hagan, A.: Learning about physical parameters: The importance of model discrepancy. Inverse Prob. 30(11), 114007 (2014)
Bui-Thanh, T., Willcox, K., Ghattas, O.: Model reduction for large-scale systems with high-dimensional parametric input space. SIAM J. Sci. Comput. 30(6), 3270–3288 (2008)
Calvetti, D., Dunlop, M., Somersalo, E., Stuart, A.: Iterative updating of model error for Bayesian inversion. Inverse Prob. 34(2), 025008 (2018)
Cockayne, J., Oates, C., Sullivan, T., Girolami, M.: Probabilistic numerical methods for PDE-constrained Bayesian inverse problems. AIP Conf. Proc. 1853(1), 060001 (2017)
Conrad, P.R., Girolami, M., Särkkä, S., Stuart, A., Zygalakis, K.: Statistical analysis of differential equations: introducing probability measures on numerical solutions. Stat. Comput. 27, 1065–1082 (2017)
Constantine, P.G.: Active Subspaces. Society for Industrial and Applied Mathematics, Philadelphia, PA (2015)
Constantine, P.G., Dow, E., Wang, Q.: Active subspace methods in theory and practice: applications to kriging surfaces. SIAM J. Sci. Comput. 36(4), 1500–1524 (2014)
Fox, C., Cui, T., Neumayer, M.: Randomized reduced forward models for efficient Metropolis-Hastings MCMC, with application to subsurface fluid flow and capacitance tomography. GEM-Int. J. Geomath. 11, 1–38 (2020)
Ghanem, R.G., Spanos, P.D.: Stochastic Finite Elements: a Spectral Approach. Springer, Berlin (1991)
Giordano, M., Nickl, R.: Consistency of Bayesian inference with Gaussian process priors in an elliptic inverse problem. Inverse Prob. 36(8), 085001 (2020)
Helin, T., Stuart, A.M., Teckentrup, A.L., Zygalakis, K.C.: Introduction to Gaussian process regression in Bayesian inverse problems, with new results on experimental design for weighted error measures. arXiv preprint arXiv:2302.04518 (2023)
Higdon, D., Kennedy, M., Cavendish, J.C., Cafeo, J.A., Ryne, R.D.: Combining field data and computer simulations for calibration and prediction. SIAM J. Sci. Comput. 26(2), 448–466 (2004)
Kaipio, J., Somersalo, E.: Statistical and Computational Inverse Problems. Springer, Dordrecht (2005)
Kennedy, M.C., O’Hagan, A.: Bayesian calibration of computer models. J. R. Stat. Soc. Ser. B Methodol. 63, 425–464 (2000)
Lie, H.C., Sullivan, T.J., Teckentrup, A.L.: Random forward models and log-likelihoods in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification 6(4), 1600–1629 (2018)
Marzouk, Y., Xiu, D.: A stochastic collocation approach to Bayesian inference in inverse problems. PRISM: NNSA Center for Prediction of Reliability, Integrity and Survivability of Microsystems 6 (2009)
Marzouk, Y.M., Najm, H.N., Rahn, L.A.: Stochastic spectral methods for efficient Bayesian solution of inverse problems. J. Comput. Phys. 224(2), 560–586 (2007)
Matsumoto, T., Sullivan, T.: Images of Gaussian and other stochastic processes under closed, densely-defined, unbounded linear operators. arXiv preprint arXiv:2305.03594 (2023)
Niederreiter, H.: Random Number Generation and quasi-Monte Carlo Methods. SIAM, London (1992)
Oates, C.J., Cockayne, J., Aykroyd, R.G., Girolami, M.: Bayesian probabilistic numerical methods in time-dependent state estimation for industrial hydrocyclone equipment. J. Am. Stat. Assoc. 114(528), 1518–1531 (2019)
O’Hagan, A.: Bayesian analysis of computer code outputs: a tutorial. Reliab. Eng. Syst. Saf. 91(10), 1290–1300 (2006)
Pförtner, M., Steinwart, I., Hennig, P., Wenger, J.: Physics-informed Gaussian process regression generalizes linear PDE solvers (2022)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Machine learning of linear differential equations using Gaussian processes. J. Comput. Phys. 348, 683–693 (2017)
Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA (2006)
Rathgeber, F., Ham, D.A., Mitchell, L., Lange, M., Luporini, F., McRae, A.T., Bercea, G.-T., Markall, G.R., Kelly, P.H.: Firedrake: automating the finite element method by composing abstractions. ACM Transactions on Mathematical Software (TOMS) 43(3), 1–27 (2016)
Robert, C.P., Casella, G.: Monte Carlo Statistical Methods. Springer, Berlin (2004)
Roberts, G.O., Tweedie, R.L.: Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2(4), 341–363 (1996)
Sacks, J., Welch, W.J., Mitchell, T.J., Wynn, H.P.: Design and analysis of computer experiments. Stat. Sci. 4(4), 409–423 (1989)
Sanz-Serna, J.M.: Markov Chain Monte Carlo and Numerical Differential Equations, 39–88. Springer, Cham (2014)
Spitieris, M., Steinsland, I.: Bayesian calibration of imperfect computer models using physics-informed priors. J. Mach. Learn. Res. 24(108), 1–39 (2023)
Stein, M.L.: Interpolation of Spatial Data. Springer Series in Statistics, 247. Springer, New York, NY (1999)
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer. 19, 451–559 (2010)
Stuart, A., Teckentrup, A.: Posterior consistency for Gaussian process approximations of Bayesian posterior distributions. Math. Comput. 87, 721–753 (2018)
Swiler, L.P., Gulian, M., Frankel, A.L., Safta, C., Jakeman, J.D.: Constrained Gaussian processes: A survey. Technical report, Sandia National Lab.(SNL-NM), Albuquerque, NM (United States) (2021)
Teckentrup, A.L.: Convergence of Gaussian process regression with estimated hyper-parameters and applications in Bayesian inverse problems. SIAM/ASA Journal on Uncertainty Quantification 8(4), 1310–1337 (2020)
Xiu, D., Karniadakis, G.E.: Modeling uncertainty in flow simulations via generalized polynomial chaos. J. Comput. Phys. 187(1), 137–167 (2003)
Acknowledgements
The authors would like to thank the Isaac Newton Institute for Mathematical Sciences, Cambridge, for support and hospitality during the programme Mathematical and statistical foundation of future data-driven engineering where work on this paper was undertaken. This work was supported by EPSRC grants no EP/R014604/1, EP/V006177/1, EP/X01259X/1 and EP/Y028783/1.
Contributions
All authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Appendices
Derivation of marginal likelihood
Let \(\textbf{m}_{\varvec{\theta }} = \textbf{m}_N^{\mathcal {G}_X}({\varvec{\theta }})\), \(K_{\varvec{\theta }} = K_N(\varvec{\theta },\varvec{\theta })\) and \(\Gamma _{\eta } = \sigma ^{2}_{\eta }I_{{d}_{\textbf{y}}}\). Since \(\mathcal {G}^N_{X}(\varvec{\theta }) = \textbf{m}_{\varvec{\theta }} + \varvec{\xi }\), where \(\varvec{\xi } \sim \mathcal {N}(0, K_{\varvec{\theta }})\), using the definition of the expectation we obtain
We then rewrite and simplify the exponent part in the formula. Let \(\bar{\textbf{y}} = \textbf{y}-\textbf{m}_{\varvec{\theta }}\), then
Since \(\Gamma _{\eta }\) and \(K_{\varvec{\theta }}\) are symmetric matrices, we have
where \(C = K_{\varvec{\theta }}(K_{\varvec{\theta }}+\Gamma _{\eta })^{-1}\Gamma _{\eta }\) and \(\tilde{\textbf{y}} = C\Gamma _{\eta }^{-1}\bar{\textbf{y}}\). Substituting it into the formula above, we have
We can then complete the square
We now factor out \(\exp \left( -\frac{1}{2}\left( \Vert \varvec{\xi }- \tilde{\textbf{y}}\Vert ^2_{C} \right) \right) \) from the integral, and the remaining part matches the form of Gaussian distribution up to a constant.
Hence, we obtain the explicit form of the marginal approximation.
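The closed form derived above can be checked numerically: for fixed \(\varvec{\theta }\), averaging the Gaussian likelihood over draws of the emulator \(\mathcal {G}^N_{X}(\varvec{\theta }) = \textbf{m}_{\varvec{\theta }} + \varvec{\xi }\) should match a Gaussian density in \(\textbf{y}\) with covariance \(\Gamma _{\eta } + K_{\varvec{\theta }}\). A short Monte Carlo sketch with made-up numbers:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
d_y = 3
m_theta = np.array([0.2, -0.1, 0.4])                 # predictive mean at theta (made up)
A = rng.standard_normal((d_y, d_y))
K_theta = A @ A.T / d_y + 0.1 * np.eye(d_y)          # predictive covariance at theta (made up)
Gamma_eta = 0.25 * np.eye(d_y)
y = np.array([0.25, 0.0, 0.3])

# Monte Carlo estimate of E_xi[ N(y; m_theta + xi, Gamma_eta) ] with xi ~ N(0, K_theta),
# using N(y; m + xi, Gamma) = N(y - xi; m, Gamma) to vectorise the density evaluations.
xi = rng.multivariate_normal(np.zeros(d_y), K_theta, size=100000)
mc = np.mean(multivariate_normal.pdf(y - xi, mean=m_theta, cov=Gamma_eta))

# Closed form: N(y; m_theta, Gamma_eta + K_theta).
closed = multivariate_normal.pdf(y, mean=m_theta, cov=Gamma_eta + K_theta)
print(mc, closed)                                    # should agree up to Monte Carlo error
```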
Predictive Gaussian process
1.1 Derivation
Gaussian prior given by (15)
Conditioning the Gaussian process prior (15) on the data \({\mathcal {G}}_X(\Theta )\) yields
Gaussian prior given by (17)
We can condition the joint Gaussian process prior (17) as in Sect. 2.2 on the observations \(\textbf{g}(\Theta )\), where now
After a re-ordering of the observations \(\textbf{g}(\Theta )\), this results in the conditional distribution
with \(K(\varvec{\theta },\varvec{\theta }') = k_p(\varvec{\theta },\varvec{\theta }') K_s(X,X)\) as before and
and
The marginal posterior distribution on \({\mathcal {G}}_X(\varvec{\theta })\) can then be extracted from the above joint posterior by taking the first \({d}_{\textbf{y}}\) rows of \(\textbf{m}_N^\textbf{g}\) and the first \({d}_{\textbf{y}}\) rows and columns of \(K_N^\textbf{g}\), which gives
where
B.2 Computational implementation
We have three different approaches for emulating the forward map and defining the correlation between its components, which we refer to as the independent, the spatially correlated and the PDE-constrained model, respectively. Each of them can be combined with either the mean-based or the marginal approximation of the posterior. We note here that, for the computational implementation of the spatially correlated model, the introduction of the correlation matrix does not change the predictive mean of the Gaussian process; it only affects the predictive covariance (see Theorem 2 below). This was already noted in Bonilla et al. (2007), but we include a proof for completeness. Since the spatial correlation matrix is independent of \(\varvec{\theta }\), the covariance matrix between two sets of parameters \(\Theta _1\) and \(\Theta _2\) can be written as a Kronecker product, that is,
Hence, assuming a spatial correlation of the type (14) only affects approximate posteriors that take into account the uncertainty of the emulator.
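As an illustration, a minimal sketch of the Kronecker construction in (26), assuming the observations are stacked so that the \({d}_{\textbf{y}}\) spatial outputs for each parameter value are contiguous (matching the ordering \(K_p \otimes K_s\) used in the proof below); the function name is ours.

```python
import numpy as np

def cross_covariance(K_p, K_s):
    """Covariance between two parameter sets for the spatially correlated model.

    K_p : (N1, N2) matrix with entries k_p(theta_i, theta_j').
    K_s : (d_y, d_y) spatial covariance matrix K_s(X, X).
    Returns the (N1*d_y, N2*d_y) covariance K_p (kron) K_s, assuming the d_y
    spatial outputs for each parameter value are stacked contiguously.
    """
    return np.kron(K_p, K_s)
```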
Theorem 2
Consider two Gaussian processes \(\textbf{g}_{0} (\varvec{\theta }) \sim \text {GP}(\textbf{m}(\varvec{\theta }),k_p(\varvec{\theta },\varvec{\theta }')I_{{d}_{\textbf{y}}})\) and \(\textbf{g}_{0,s} (\varvec{\theta }) \sim \text {GP}(\textbf{m}(\varvec{\theta }),k_p(\varvec{\theta },\varvec{\theta }')K_s(X,X))\), where \(K_s(X,X)\) is the covariance matrix on the set of spatial points \(X = \{\textbf{x}_i\}_{i=1}^{{d}_{\textbf{y}}}\) and \(k_p(\varvec{\theta },\varvec{\theta }')\) is scalar-valued. Conditioning both Gaussian processes on a set of training points \(\textbf{g}(\Theta ) = \{\textbf{g}(\varvec{\theta }_i)\}_{i=1}^{N}\), we denote the corresponding conditional Gaussian processes by \(\textbf{g}^N (\varvec{\theta }) \sim \text {GP}(\textbf{m}_N^{\textbf{g}}(\varvec{\theta }),K_N(\varvec{\theta },\varvec{\theta }'))\) and \(\textbf{g}^N_{s}(\varvec{\theta }) \sim \text {GP}(\textbf{m}_{N,s}^{\textbf{g}}(\varvec{\theta }),K_{N,s}(\varvec{\theta },\varvec{\theta }'))\), respectively. Then we have
where \(k_{N,p}(\varvec{\theta },\varvec{\theta }')\) is scalar-valued.
Proof
Let \(k_p(\varvec{\theta },\Theta ):= [k_p(\varvec{\theta },\varvec{\theta }^1); \dots ; k_p(\varvec{\theta },\varvec{\theta }^N)] \in \mathbb {R}^{N}\), and denote by \(K_p(\Theta ,\Theta ) \in \mathbb {R}^{N \times N}\) the matrix with entries \((K_p(\Theta ,\Theta ))_{i,j} = k_p(\varvec{\theta }^i,\varvec{\theta }^j)\). Then by (8) we have
where \(\otimes \) denotes the Kronecker product. Using properties of products and inverses of Kronecker products and the fact that \(K_s(X,X)\) is symmetric positive definite, we then have
The relationship between \(K_{N,s}(\varvec{\theta },\varvec{\theta }')\) and \(K_{N}(\varvec{\theta },\varvec{\theta }')\) can be shown in a similar way, using (9). \(\square \)
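The identity underlying Theorem 2 is easy to check numerically. The following sketch (with hypothetical sizes, random covariance matrices and, for simplicity, a zero prior mean) verifies that the conditional mean computed with \(K_p \otimes K_s\) coincides with the one computed with \(K_p \otimes I_{{d}_{\textbf{y}}}\); it is an illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_y = 4, 3                                   # hypothetical sizes

# Random symmetric positive definite K_p(Theta, Theta) and K_s(X, X).
A = rng.standard_normal((N, N)); K_p = A @ A.T + N * np.eye(N)
B = rng.standard_normal((d_y, d_y)); K_s = B @ B.T + d_y * np.eye(d_y)

k_p = rng.standard_normal(N)                    # k_p(theta, Theta)
g = rng.standard_normal(N * d_y)                # stacked training data g(Theta)

# Conditional means (zero prior mean) with and without spatial correlation.
mean_s = np.kron(k_p, K_s) @ np.linalg.solve(np.kron(K_p, K_s), g)
mean_i = np.kron(k_p, np.eye(d_y)) @ np.linalg.solve(np.kron(K_p, np.eye(d_y)), g)

assert np.allclose(mean_s, mean_i)              # predictive mean is unchanged
```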
For the PDE-constrained model, the covariance functions related to f are obtained by applying the differential operator, so the spatial covariance matrix in the joint prior (16) depends explicitly on the parameters \(\varvec{\theta }\). Its covariance matrix therefore cannot be written in the Kronecker product structure (26), and Theorem 2 does not apply. Thus, incorporating the PDE constraints into the model also affects the predictive mean, and hence the mean-based posterior changes as well.
Appendix C: Derivation of the gradient of the approximate log-posteriors
For the mean-based posterior:
For the marginal posterior:
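These gradients are what the MALA sampler used in the experiments below requires. For reference, a generic sketch of a single MALA step, taking an approximate log-posterior and its gradient as inputs (all names are ours; this is not the authors' implementation):

```python
import numpy as np

def mala_step(theta, log_post, grad_log_post, tau, rng):
    """One Metropolis-adjusted Langevin (MALA) step.

    Proposal: theta' = theta + tau * grad(theta) + sqrt(2 tau) * xi,
    accepted with the usual Metropolis-Hastings correction.
    """
    grad = grad_log_post(theta)
    prop = theta + tau * grad + np.sqrt(2.0 * tau) * rng.standard_normal(theta.shape)
    grad_prop = grad_log_post(prop)

    def log_q(x_to, x_from, g_from):
        # log density (up to a constant) of proposing x_to from x_from
        diff = x_to - x_from - tau * g_from
        return -np.dot(diff, diff) / (4.0 * tau)

    log_alpha = (log_post(prop) - log_post(theta)
                 + log_q(theta, prop, grad_prop) - log_q(prop, theta, grad))
    if np.log(rng.uniform()) < log_alpha:
        return prop, True
    return theta, False
```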
Appendix D: Further numerical experiments
D.1 Constant diffusion coefficient
We consider the following PDE in one spatial dimension
In this case the dimension of the parameter space is \(d_{\varvec{\theta }}=1\), and the solution is available in closed form. More precisely, we have
Given this explicit solution and the low dimension of the parameter space, we can compute the true and approximate posteriors on a fine grid, without resorting to Markov chain Monte Carlo sampling. We generate our observations \(\textbf{y}\) according to equation (2) for \({\varvec{\theta }}^{\dagger } = 0.314\) at a varying number of spatial points \({d}_{\textbf{y}}\) (equally spaced in [0, 1]) and at a noise level \(\sigma ^2_{\eta } = 10^{-5}\). As we can see in Fig. 10, as we increase \({d}_{\textbf{y}}\) the true posterior \(\pi (\theta |\textbf{y})\) becomes more and more concentrated around the value of \({\varvec{\theta }}^{\dagger }\), which is consistent with what the theory predicts via a Bernstein–von Mises theorem (see, e.g., Giordano and Nickl (2020) for related results).
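A minimal sketch of this grid-based computation of the true posterior (the forward map and prior density are placeholders for the problem-specific quantities; the function name is ours):

```python
import numpy as np

def grid_posterior(theta_grid, y, forward_map, prior_pdf, sigma2_eta):
    """True posterior on a uniform 1-d grid, normalised by a simple quadrature.

    forward_map : callable theta -> model predictions at the d_y observation
                  points (a placeholder for the closed-form solution above).
    prior_pdf   : callable theta -> prior density (placeholder).
    """
    log_post = np.array([
        -0.5 * np.sum((y - forward_map(t)) ** 2) / sigma2_eta + np.log(prior_pdf(t))
        for t in theta_grid
    ])
    post = np.exp(log_post - log_post.max())          # avoid overflow
    dtheta = theta_grid[1] - theta_grid[0]            # assumes a uniform grid
    return post / (post.sum() * dtheta)
```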
We now turn our attention to the different approximate posteriors discussed in Sect. 2.3, obtained with the different Gaussian priors (independent, spatially correlated and PDE-constrained).
Baseline model: In the case of the simplest emulator with independent entries, Fig. 11 illustrates how the mean-based posterior \(\pi ^{N,\mathcal {G}_{X}}_{\textrm{mean}}(\theta |\textbf{y})\) and the marginal posterior \(\pi ^{N,\mathcal {G}_{X}}_{\textrm{marginal}}(\theta |\textbf{y})\) behave as a function of the number of training points N (here \(d_{\textbf{y}}=5\)). The locations of the training points are chosen from the Halton sequence (Niederreiter 1992). Comparing Fig. 11 (top left and top right), we see that the marginal posterior is more spread out than the mean-based posterior. This is due to the variance inflation associated with the marginal posterior, which better reflects the uncertainty of the emulator. For example, in the case \(N=1\) the mean-based posterior has negligible posterior probability mass near \(\varvec{\theta }^{\dagger }\) and exhibits bimodality, since the single training point happens not to lie near the true \(\varvec{\theta }^\dagger \). Due to the variance inflation, this is not the case for the marginal posterior. Furthermore, in Fig. 11 (bottom left) we plot the Hellinger distance between the approximate posteriors and the true posterior as a function of the number of training points N. The error for the marginal posterior is slightly smaller than the error for the mean-based posterior for small N, and the two errors coincide as N increases. This can be further understood from Fig. 11 (bottom right), where we plot the average variance \(k_N(\varvec{\theta },\varvec{\theta })\) (averaged over \(\varvec{\theta }\)) of our emulator for different values of N: as expected from (11), the marginal approximation behaves in the same manner as the mean-based approximation once the average variance is of the same order as the observational noise \(\sigma ^2_{\eta }\).
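The Hellinger distances reported here can be computed directly on the grid. A short sketch (assuming both densities are already normalised on the same uniform grid), together with Halton design points generated via scipy.stats.qmc; sizes are hypothetical and the points would still need to be scaled to the parameter domain.

```python
import numpy as np
from scipy.stats import qmc

def hellinger_distance(p, q, dtheta):
    """Hellinger distance between two densities on the same uniform grid."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dtheta)

# Halton design points in [0, 1]^d for training the emulator (hypothetical sizes);
# these would then be rescaled to the parameter domain of interest.
N, d_theta = 4, 1
design = qmc.Halton(d=d_theta, seed=0).random(N)
```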
Spatially correlated model: As discussed in Appendix B, the introduction of spatial correlation does not change the predictive mean of the Gaussian processes. We therefore compare in Fig. 12 the two different marginal posteriors \(\pi ^{N,\mathcal {G}_X}_{\textrm{marginal}}\) and \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{marginal}}\), where the latter includes spatial correlation. We again choose \({d}_{\textbf{y}}=5\). As we can see in Fig. 12 (left) (here \(N=2\)), introducing spatial correlation seems to improve the accuracy of the approximate posterior, placing more mass near \(\theta ^\dagger \). The fact that the spatially correlated model has an increased variance at \(N=2\) (see Fig. 12) leads to behaviour similar to that in Fig. 11, with \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{marginal}}\) being more spread out than \(\pi ^{N,\mathcal {G}_X}_{\textrm{marginal}}\). Furthermore, as we can see in Fig. 12 (middle), as we increase the number of training points for our Gaussian process, the Hellinger distance between the true posterior and \(\pi ^{N,\mathcal {G}_X,s}_{\textrm{marginal}}\) remains smaller than that of the baseline model.
PDE-constrained model: We now compare the behaviour of the PDE-constrained model with that of the other two models, both for the mean-based approximate posterior and for the marginal posterior (again with \({d}_{\textbf{y}}=5\)). As we can see in Fig. 13 for \(N=2\), \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{mean}}\) and \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{marginal}}\) are indistinguishable from the true posterior when using \(\bar{N}=10\) and \(d_{f}=5\), showing much better approximation properties than the other two models. This is consistent with what we observe in terms of the Hellinger distance, since both \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{mean}}\) and \(\pi ^{N,\mathcal {G}_X,\textrm{PDE}}_{\textrm{marginal}}\) have similar errors over different ranges of values for \(N_{f}\). It is also worth noting that, when comparing with the Hellinger distances from Figs. 11 (bottom left) and 12 (middle), the PDE-constrained model achieves the same order of error using only half of the training points (\(N=2\) instead of \(N=4\)). Furthermore, as we can see in Fig. 13 (bottom right), the average variance of the PDE-constrained emulator converges to zero very quickly as the number of extra training points for f increases, implying that, at least in this simple example, adding the PDE knowledge leads to an extremely good approximation of the forward map.
D.2 Integral observation operator
We now investigate the proposed method with a different form of the observation operator. In terms of the PDE problem, we again study (20). However, instead of the point-wise observations \(\mathcal {G}_{X}(\varvec{\theta }) = \{u(x_j; \varvec{\theta })\}_{j=1}^{{{d}_{\textbf{y}}}}\) in (2), we observe local averages \(\mathcal {G}_{X}(\varvec{\theta }) = \{\int _{a_j}^{b_j}u(x; \varvec{\theta }) \,dx\}_{j=1}^{{{d}_{\textbf{y}}}}\) over non-overlapping intervals \([a_j,b_j] \subset [0,1]\).
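A sketch of such an integral observation operator, approximating each local integral by standard quadrature; the solution callable is a placeholder for the PDE solution \(u(\cdot ;\varvec{\theta })\), and the names are ours.

```python
import numpy as np
from scipy.integrate import quad

def integral_observations(u_solution, intervals):
    """Observe integrals of u over non-overlapping sub-intervals [a_j, b_j].

    u_solution : callable x -> u(x; theta); a placeholder for the PDE solution.
    intervals  : iterable of (a_j, b_j) pairs partitioning [0, 1].
    """
    return np.array([quad(u_solution, a, b)[0] for a, b in intervals])

# Example: d_y = 16 equally spaced sub-intervals of [0, 1].
edges = np.linspace(0.0, 1.0, 17)
intervals = list(zip(edges[:-1], edges[1:]))
```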
For the inverse problem setting, we have \(\varvec{\theta }^{\dagger } = [0.098, 0.430]\), \({d}_{\textbf{y}}= 16\) (equally spaced sub-intervals of [0, 1]) and \(\sigma _\eta ^2 = 10^{-6}\). We do not perform exact integration as in (27), but instead use the MALA algorithm to obtain our samples, drawing \(10^{6}\) samples for each approximate posterior. We treat the samples obtained by the mean-based approximation with the baseline model and \(N=10^{2}\) training points as the ground truth. In Fig. 14, we again plot the \(\theta _{1}\) and \(\theta _{2}\) marginals for all the mean-based and marginal posterior approximations. The results are similar to those of the example in Sect. 4.2, with the PDE-constrained model performing better than the other two models.
D.3 Ten-dimensional parametric expansion of the diffusion coefficient
We now increase the dimension of the diffusion coefficient in (22) from \(d_{\varvec{\theta }} = 2\) to \( d_{\varvec{\theta }} = 10\), to test the proposed method in a relatively high-dimensional space. With regard to the inverse problem setting, we set
and we increase the number of observation points to \({d}_{\textbf{y}}= 20\). The level of noise is the same as before (\(\sigma _{\eta }^2 = 10^{-4}\)). The number of training points for all emulators is again set to \(N = 4\), and for the PDE-constrained emulator we use \(\bar{N} = 50\), \(d_f = 25\) and \(d_g = 2\). For the choices of kernels, we use the squared exponential kernel for both \(k_p\) and \(k_s\).
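For reference, a minimal sketch of the squared exponential kernel in the parameterisation we assume here (a single variance and lengthscale; the paper's hyperparameterisation may differ, e.g. with separate lengthscales per dimension):

```python
import numpy as np

def squared_exponential(x, y, variance=1.0, lengthscale=1.0):
    """k(x, y) = variance * exp(-||x - y||^2 / (2 * lengthscale^2))."""
    diff = np.atleast_1d(x) - np.atleast_1d(y)
    return variance * np.exp(-0.5 * np.dot(diff, diff) / lengthscale ** 2)
```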
We now use the MALA algorithm to obtain \(10^{7}\) samples for the approximate posteriors; in this relatively high-dimensional setting, we need longer chains for the sampling algorithm to converge. Moreover, computing a suitable "ground truth" is prohibitively expensive, so we only compare the samples with the true parameter \(\varvec{\theta }^{\dagger }\). With \(N = 4\) training points, far too few for the baseline Gaussian process model to give accurate predictions, we see in Fig. 15 that the mean-based posterior approximation with the baseline model gives a reasonable approximation only for the first few variables; for the remaining variables it places essentially no posterior mass near the true values. Adding spatial correlation moves the approximation towards the true values, but it still fails to approximate the posterior correctly for the last few variables. The PDE-constrained model performs much better than the other models, placing posterior mass around the true value for all variables.
D.4 Timing of hyperparameter optimization
Table 8 reports the computational times for optimizing the hyperparameters in the covariance functions. The optimization of the hyperparameters for \(k_s\) involves repeatedly computing the inverse of the Gram matrix. In the PDE-constrained model, this matrix is significantly augmented by the PDE data, and the computation therefore takes much longer than for the other two models.
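To indicate the kind of computation being timed, the following sketch minimises the negative log marginal likelihood of the training data with scipy.optimize; build_gram and data are placeholders (for the PDE-constrained model the Gram matrix returned by such a routine is much larger, which dominates the cost), and this is an illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, build_gram, data):
    """Negative log marginal likelihood (up to a constant) of the training data.

    build_gram : placeholder callable mapping hyperparameters to the Gram matrix.
    """
    K = build_gram(np.exp(log_params))                      # positive hyperparameters
    L = np.linalg.cholesky(K + 1e-8 * np.eye(K.shape[0]))   # jitter for stability
    alpha = np.linalg.solve(L, data)
    return 0.5 * np.dot(alpha, alpha) + np.sum(np.log(np.diag(L)))

# Hypothetical usage, whose runtime would be what Table 8 measures:
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2),
#                   args=(build_gram, data), method="L-BFGS-B")
```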