Abstract
We analyse the calibration of BayesCG under the Krylov prior. BayesCG is a probabilistic numerical extension of the Conjugate Gradient (CG) method for solving systems of linear equations with a real symmetric positive definite coefficient matrix. In addition to the CG solution, BayesCG returns a posterior distribution over the solution. In this context, a posterior distribution is said to be ‘calibrated’ if the CG error is well described, in a precise distributional sense, by the posterior spread. Since BayesCG is known not to be calibrated, we introduce two related weaker notions of calibration whose departures from exact calibration can be quantified. Numerical experiments confirm that, under low-rank approximate Krylov posteriors, BayesCG is only slightly optimistic, exhibits the characteristics of a calibrated solver, and is computationally competitive with CG.
References
Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: A Bayesian conjugate gradient method (with discussion). Bayesian Anal. 14(3), 937–1012 (2019). https://doi.org/10.1214/19-BA1145. Includes 6 discussions and a rejoinder from the authors
Reid, T.W., Ipsen, I.C.F., Cockayne, J., Oates, C.J.: BayesCG as an uncertainty aware version of CG. arXiv:2008.03225 (2022)
Cockayne, J., Oates, C.J., Sullivan, T.J., Girolami, M.: Bayesian probabilistic numerical methods. SIAM Rev. 61(4), 756–789 (2019). https://doi.org/10.1137/17M1139357
Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proc. R. Soc. A 471(2179), 20150142 (2015)
Oates, C.J., Sullivan, T.J.: A modern retrospective on probabilistic numerics. Stat. Comput. 29(6), 1335–1351 (2019). https://doi.org/10.1007/s11222-019-09902-z
Cockayne, J., Ipsen, I.C.F., Oates, C.J., Reid, T.W.: Probabilistic iterative methods for linear systems. J. Mach. Learn. Res. 22(232), 1–34 (2021)
Hart, J., van Bloemen Waanders, B., Herzog, R.: Hyperdifferential sensitivity analysis of uncertain parameters in PDE-constrained optimization. Int. J. Uncertain. Quantif. 10(3), 225–248 (2020). https://doi.org/10.1615/Int.J.UncertaintyQuantification.2020032480
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering, p. 664. Springer, New York (2006)
Petra, N., Zhu, H., Stadler, G., Hughes, T.J.R., Ghattas, O.: An inexact Gauss-Newton method for inversion of basal sliding and rheology parameters in a nonlinear Stokes ice sheet model. J. Glaciol. 58(211), 889–903 (2012). https://doi.org/10.3189/2012JoG11J182
Saibaba, A.K., Hart, J., van Bloemen Waanders, B.: Randomized algorithms for generalized singular value decomposition with application to sensitivity analysis. Numer. Linear Algebra Appl. 28(4), e2364 (2021). https://doi.org/10.1002/nla.2364
Bartels, S., Cockayne, J., Ipsen, I.C.F., Hennig, P.: Probabilistic linear solvers: a unifying view. Stat. Comput. 29(6), 1249–1263 (2019). https://doi.org/10.1007/s11222-019-09897-7
Fanaskov, V.: Uncertainty calibration for probabilistic projection methods. Stat. Comput. 31(5), 56 (2021). https://doi.org/10.1007/s11222-021-10031-9
Hennig, P.: Probabilistic interpretation of linear solvers. SIAM J. Optim. 25(1), 234–260 (2015). https://doi.org/10.1137/140955501
Wenger, J., Hennig, P.: Probabilistic linear solvers for machine learning. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 6731–6742. Curran Associates Inc, Red Hook (2020)
Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952). https://doi.org/10.6028/jres.049.044
Higham, N.J.: Functions of Matrices. Theory and Computation. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2008). https://doi.org/10.1137/1.9780898717778
Cockayne, J., Oates, C.J., Ipsen, I.C.F., Girolami, M.: Supplementary material for ‘A Bayesian conjugate-gradient method’. Bayesian Anal. (2019). https://doi.org/10.1214/19-BA1145SUPP
Liesen, J., Strakoš, Z.: Krylov Subspace Methods: Principles and Analysis. Oxford University Press, Oxford (2013)
Berljafa, M., Güttel, S.: Generalized rational Krylov decompositions with an application to rational approximation. SIAM J. Matrix Anal. Appl. 36(2), 894–916 (2015). https://doi.org/10.1137/140998081
Strakoš, Z., Tichý, P.: On error estimation in the conjugate gradient method and why it works in finite precision computations. Electron. Trans. Numer. Anal. 13, 56–80 (2002)
Kressner, D., Latz, J., Massei, S., Ullmann, E.: Certified and fast computations with shallow covariance kernels. arXiv:2001.09187 (2020)
Villani, C.: Optimal Transport, Old and New. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 338, p. 973. Springer, Berlin (2009). https://doi.org/10.1007/978-3-540-71050-9
Gelbrich, M.: On a formula for the \(L^2\) Wasserstein metric between measures on Euclidean and Hilbert spaces. Math. Nachr. 147, 185–203 (1990). https://doi.org/10.1002/mana.19901470121
Berger, J.O.: Statistical Decision Theory and Bayesian Analysis. Springer, New York (1985). https://doi.org/10.1007/978-1-4757-4286-2
Stuart, A.M.: Inverse problems: a Bayesian perspective. Acta Numer 19, 451–559 (2010). https://doi.org/10.1017/S0962492910000061
Cockayne, J., Graham, M.M., Oates, C.J., Sullivan, T.J.: Testing whether a Learning Procedure is Calibrated. arXiv:2012.12670 (2021)
Ross, S.M.: Introduction to Probability Models, 9th edn. Academic Press Inc, Boston (2007)
Kaltenbach, H.-M.: A Concise Guide to Statistics. Springer Briefs in Statistics, p. 111. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-23502-3
James, G., Witten, D., Hastie, T., Tibshirani, R.: An Introduction to Statistical Learning, vol. 112, 2nd edn. Springer, New York (2021). https://doi.org/10.1007/978-1-0716-1418-1
BCSSTK14: BCS Structural Engineering Matrices (linear equations) Roof of the Omni Coliseum, Atlanta. https://math.nist.gov/MatrixMarket/data/Harwell-Boeing/bcsstruc2/bcsstk14.html
Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. The Johns Hopkins University Press, Baltimore (2013)
Hairer, E., Nørsett, S.P., Wanner, G.: Solving Ordinary Differential Equations. I, 2nd edn. Springer Series in Computational Mathematics, vol. 8. Springer, Berlin (1993)
Golub, G.H., Meurant, G.: Matrices, moments and quadrature. II. How to compute the norm of the error in iterative methods. BIT 37(3), 687–705 (1997). https://doi.org/10.1007/BF02510247
Meurant, G., Tichý, P.: Approximating the extreme Ritz values and upper bounds for the \(A\)-norm of the error in CG. Numer. Algorithms 82(3), 937–968 (2019). https://doi.org/10.1007/s11075-018-0634-8
Strakoš, Z.: On the real convergence rate of the conjugate gradient method. Linear Algebra Appl. 154–156, 535–549 (1991). https://doi.org/10.1016/0024-3795(91)90393-B
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. The Johns Hopkins University Press, Baltimore (1996)
Muirhead, R.J.: Aspects of Multivariate Statistical Theory. Wiley, New York (1982)
Ouellette, D.V.: Schur complements and statistics. Linear Algebra Appl. 36, 187–295 (1981). https://doi.org/10.1016/0024-3795(81)90232-9
Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
Campbell, S.L., Meyer, C.D.: Generalized Inverses of Linear Transformations. Classics in Applied Mathematics, vol. 56. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2009). https://doi.org/10.1137/1.9780898719048.ch0
Mathai, A.M., Provost, S.B.: Quadratic Forms in Random Variables: Theory and Applications. Marcel Dekker Inc, New York (1992)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (2003)
Giraud, L., Langou, J., Rozložník, M.: The loss of orthogonality in the Gram–Schmidt orthogonalization process. Comput. Math. Appl. 50(7), 1069–1075 (2005). https://doi.org/10.1016/j.camwa.2005.08.009
Giraud, L., Langou, J., Rozložník, M., van den Eshof, J.: Rounding error analysis of the classical Gram–Schmidt orthogonalization process. Numer. Math. 101(1), 87–100 (2005). https://doi.org/10.1007/s00211-005-0615-4
Acknowledgements
We thank the reviewers for their helpful comments and questions, which led to the addition of Sect. 6. We are also deeply indebted to Jonathan Wenger for his very careful reading of our paper and many perceptive suggestions that greatly improved the paper and widened its outlook through the addition of Sects. 5.5 and 5.6.
Funding
The work was supported in part by NSF Grant DMS-1745654 (TWR, ICFI), NSF Grant DMS-1760374 and DOE Grant DE-SC0022085 (ICFI), and the Lloyd’s Register Foundation Programme on Data Centric Engineering at the Alan Turing Institute (CJO).
Appendices
Appendix A: Auxiliary results
We present auxiliary results required for proofs in other sections.
The stability of Gaussian distributions implies that an affine transformation of a Gaussian random variable is again Gaussian.
Lemma A.1
(Stability of Gaussian Distributions [37, Section 1.2]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma })\) be a Gaussian random variable with mean \({\textbf{x}}\in {\mathbb {R}}^n\) and covariance \(\varvec{\Sigma }\in {\mathbb {R}}^{n\times n}\). If \({\textbf{y}}\in {\mathbb {R}}^n\) and \({\textbf{F}}\in {\mathbb {R}}^{n\times n}\), then
\[ {\textbf{F}}X + {\textbf{y}}\ \sim \ {{\,\mathrm{{\mathcal {N}}}\,}}\bigl({\textbf{F}}{\textbf{x}}+{\textbf{y}},\ {\textbf{F}}\varvec{\Sigma }{\textbf{F}}^T\bigr). \]
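As a quick numerical illustration (a minimal NumPy sketch, not part of the original text; the matrices, shift, and sample size are arbitrary choices), the empirical mean and covariance of the transformed samples approach \({\textbf{F}}{\textbf{x}}+{\textbf{y}}\) and \({\textbf{F}}\varvec{\Sigma }{\textbf{F}}^T\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.standard_normal(n)                   # mean of X
L = rng.standard_normal((n, n))
Sigma = L @ L.T + n * np.eye(n)              # covariance of X (symmetric positive definite)
F = rng.standard_normal((n, n))              # linear map F
y = rng.standard_normal(n)                   # shift y

# Sample X ~ N(x, Sigma) and apply the affine map F X + y
X = rng.multivariate_normal(x, Sigma, size=200_000)
T = X @ F.T + y

# Maximum deviation of empirical mean and covariance from F x + y and F Sigma F^T
print(np.max(np.abs(T.mean(axis=0) - (F @ x + y))))
print(np.max(np.abs(np.cov(T, rowvar=False) - F @ Sigma @ F.T)))
```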
The conjugacy of Gaussian distributions implies that the distribution of a Gaussian random variable conditioned on information that linearly depends on the random variable is a Gaussian distribution.
Lemma A.2
(Conjugacy of Gaussian Distributions [38, Section 6.1], [25, Corollary 6.21]) Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}},\varvec{\Sigma }_x)\) and \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{y}},\varvec{\Sigma }_y)\). The jointly Gaussian random variable \(\begin{bmatrix}X^T&Y^T\end{bmatrix}^T\) has the distribution
\[ \begin{bmatrix}X\\ Y\end{bmatrix} \sim {{\,\mathrm{{\mathcal {N}}}\,}}\left(\begin{bmatrix}{\textbf{x}}\\ {\textbf{y}}\end{bmatrix},\ \begin{bmatrix}\varvec{\Sigma }_x & \varvec{\Sigma }_{xy}\\ \varvec{\Sigma }_{xy}^T & \varvec{\Sigma }_y\end{bmatrix}\right), \]
where \(\varvec{\Sigma }_{xy} \equiv {{\,\textrm{Cov}\,}}(X,Y) = {{\,\mathrm{{\mathbb {E}}}\,}}[(X-{\textbf{x}})(Y-{\textbf{y}})^T]\), and the conditional distribution of \(X\) given \(Y\) is
\[ X\mid Y \sim {{\,\mathrm{{\mathcal {N}}}\,}}\bigl({\textbf{x}}+\varvec{\Sigma }_{xy}\varvec{\Sigma }_y^{-1}(Y-{\textbf{y}}),\ \varvec{\Sigma }_x-\varvec{\Sigma }_{xy}\varvec{\Sigma }_y^{-1}\varvec{\Sigma }_{xy}^T\bigr), \]
with \(\varvec{\Sigma }_y^{-1}\) replaced by the Moore–Penrose inverse \(\varvec{\Sigma }_y^\dagger \) if \(\varvec{\Sigma }_y\) is singular.
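The conditioning formula can be checked directly. The sketch below (NumPy, assuming a nonsingular \(\varvec{\Sigma }_y\); all matrices and the observed value are arbitrary test data, not taken from the paper) computes the conditional mean and covariance from a jointly Gaussian model:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
# Joint covariance of [X; Y], built to be symmetric positive definite
L = rng.standard_normal((2 * n, 2 * n))
Joint = L @ L.T + np.eye(2 * n)
Sx, Sxy, Sy = Joint[:n, :n], Joint[:n, n:], Joint[n:, n:]
mx, my = rng.standard_normal(n), rng.standard_normal(n)

y_obs = rng.standard_normal(n)               # a value on which Y is conditioned

# Conditional mean and covariance of X given Y = y_obs
cond_mean = mx + Sxy @ np.linalg.solve(Sy, y_obs - my)
cond_cov = Sx - Sxy @ np.linalg.solve(Sy, Sxy.T)
print(cond_mean)
print(np.linalg.eigvalsh(cond_cov).min() > 0)  # the Schur complement is positive definite
```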
We show how to transform a \({\textbf{B}}\)-orthogonal matrix into an orthogonal matrix.
Lemma A.3
Let \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite, and let \({\textbf{H}}\in {\mathbb {R}}^{n\times n}\) be a \({\textbf{B}}\)-orthogonal matrix with \({\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{H}}{\textbf{B}}{\textbf{H}}^T = {\textbf{I}}\). Then
\[ {\textbf{U}}\equiv {\textbf{B}}^{1/2}{\textbf{H}}, \]
where \({\textbf{B}}^{1/2}\) denotes the symmetric positive definite square root of \({\textbf{B}}\), is an orthogonal matrix with \({\textbf{U}}^T{\textbf{U}}= {\textbf{U}}{\textbf{U}}^T={\textbf{I}}\).
Proof
The symmetry of \({\textbf{B}}\) and the \({\textbf{B}}\)-orthogonality of \({\textbf{H}}\) imply
\[ {\textbf{U}}^T{\textbf{U}}= {\textbf{H}}^T{\textbf{B}}^{1/2}{\textbf{B}}^{1/2}{\textbf{H}}= {\textbf{H}}^T{\textbf{B}}{\textbf{H}}= {\textbf{I}}. \]
Since \({\textbf{U}}\) is square with orthonormal columns, \({\textbf{U}}\) is an orthogonal matrix [39, Definition 2.1.3]. \(\square \)
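For a concrete check, the following NumPy sketch (not part of the original text) builds a \({\textbf{B}}\)-orthonormal \({\textbf{H}}\) by Gram–Schmidt in the \({\textbf{B}}\)-inner product and verifies that \({\textbf{B}}^{1/2}{\textbf{H}}\) is orthogonal; the matrix sizes and the eigendecomposition-based square root are arbitrary implementation choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                  # symmetric positive definite

# Build a B-orthonormal H by Gram-Schmidt in the B-inner product <u, v>_B = u^T B v
X = rng.standard_normal((n, n))
H = np.zeros((n, n))
for j in range(n):
    v = X[:, j] - H[:, :j] @ (H[:, :j].T @ (B @ X[:, j]))
    H[:, j] = v / np.sqrt(v @ B @ v)
print(np.allclose(H.T @ B @ H, np.eye(n)))   # H is B-orthonormal

# Symmetric positive definite square root B^{1/2} from an eigendecomposition
w, V = np.linalg.eigh(B)
Broot = (V * np.sqrt(w)) @ V.T

U = Broot @ H                                # U = B^{1/2} H as in Lemma A.3
print(np.allclose(U.T @ U, np.eye(n)), np.allclose(U @ U.T, np.eye(n)))
```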
Definition A.4
[39, Section 7.3] The thin singular value decomposition of the rank-\(p\) matrix \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) is
\[ {\textbf{G}}= {\textbf{U}}{\textbf{D}}{\textbf{W}}^T, \]
where \({\textbf{U}}\in {\mathbb {R}}^{m\times p}\) and \({\textbf{W}}\in {\mathbb {R}}^{n\times p}\) are matrices with orthonormal columns and \({\textbf{D}}\in {\mathbb {R}}^{p\times p}\) is a diagonal matrix with positive diagonal elements. The Moore–Penrose inverse of \({\textbf{G}}\) is
\[ {\textbf{G}}^\dagger = {\textbf{W}}{\textbf{D}}^{-1}{\textbf{U}}^T. \]
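The thin SVD and the resulting Moore–Penrose inverse can be computed with standard library routines. The sketch below (NumPy, arbitrary rank-\(p\) test matrix; not part of the original text) compares the explicit formula with numpy.linalg.pinv:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p = 6, 4, 3
G = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))   # rank-p test matrix

U, d, Wt = np.linalg.svd(G, full_matrices=False)
U, D, W = U[:, :p], np.diag(d[:p]), Wt[:p, :].T                  # thin SVD of rank p

G_pinv = W @ np.linalg.inv(D) @ U.T                              # Moore-Penrose inverse
print(np.allclose(G, U @ D @ W.T))
print(np.allclose(G_pinv, np.linalg.pinv(G)))
```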
If a matrix has full column-rank or full row-rank, then its Moore–Penrose inverse can be expressed in terms of the matrix itself. Furthermore, the Moore–Penrose inverse of a product equals the product of the Moore–Penrose inverses, provided the first matrix has full column-rank and the second matrix has full row-rank.
Lemma A.5
[40, Corollary 1.4.2] Let \({\textbf{G}}\in {\mathbb {R}}^{m\times n}\) and \({\textbf{J}}\in {\mathbb {R}}^{n\times p}\) have full column and row rank respectively, so \({{\,\textrm{rank}\,}}({\textbf{G}})={{\,\textrm{rank}\,}}({\textbf{J}})=n\). The Moore–Penrose inverses of \({\textbf{G}}\) and \({\textbf{J}}\) are
\[ {\textbf{G}}^\dagger = ({\textbf{G}}^T{\textbf{G}})^{-1}{\textbf{G}}^T \qquad \text {and}\qquad {\textbf{J}}^\dagger = {\textbf{J}}^T({\textbf{J}}{\textbf{J}}^T)^{-1}, \]
respectively, and the Moore–Penrose inverse of the product equals
\[ ({\textbf{G}}{\textbf{J}})^\dagger = {\textbf{J}}^\dagger {\textbf{G}}^\dagger . \]
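A small NumPy check of these identities (arbitrary full-rank test matrices; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 6, 3, 5
G = rng.standard_normal((m, n))              # full column rank (with probability 1)
J = rng.standard_normal((n, p))              # full row rank (with probability 1)

G_pinv = np.linalg.solve(G.T @ G, G.T)       # (G^T G)^{-1} G^T
J_pinv = J.T @ np.linalg.inv(J @ J.T)        # J^T (J J^T)^{-1}

print(np.allclose(G_pinv, np.linalg.pinv(G)))
print(np.allclose(J_pinv, np.linalg.pinv(J)))
print(np.allclose(np.linalg.pinv(G @ J), J_pinv @ G_pinv))
```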
Below is an explicit expression for the mean of a quadratic form of Gaussians.
Lemma A.6
[41, Sections 3.2b.1–3.2b.3] Let \(Z\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_z,\varvec{\Sigma }_z)\) be a Gaussian random variable in \({\mathbb {R}}^n\), and \({\textbf{B}}\in {\mathbb {R}}^{n\times n}\) be symmetric positive definite. The mean of \(Z^T{\textbf{B}}Z\) is
\[ {{\,\mathrm{{\mathbb {E}}}\,}}[Z^T{\textbf{B}}Z] = {\textbf{x}}_z^T{\textbf{B}}{\textbf{x}}_z + {{\,\textrm{trace}\,}}({\textbf{B}}\varvec{\Sigma }_z). \]
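A Monte Carlo sanity check of this formula (NumPy sketch with arbitrary test data and sample size; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
xz = rng.standard_normal(n)
L = rng.standard_normal((n, n))
Sigma_z = L @ L.T + np.eye(n)                # covariance of Z
M = rng.standard_normal((n, n))
B = M @ M.T + np.eye(n)                      # symmetric positive definite

Z = rng.multivariate_normal(xz, Sigma_z, size=500_000)
empirical = np.mean(np.einsum('ij,jk,ik->i', Z, B, Z))   # sample mean of Z^T B Z
exact = xz @ B @ xz + np.trace(B @ Sigma_z)
print(empirical, exact)                      # should agree to Monte Carlo accuracy
```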
We show that the squared Euclidean norm of a zero-mean Gaussian random variable whose covariance matrix is an orthogonal projector is distributed according to a chi-squared distribution.
Lemma A.7
Let \(X\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{P}})\) be a Gaussian random variable in \({\mathbb {R}}^n\). If the covariance matrix \({\textbf{P}}\) is an orthogonal projector, that is, if \({\textbf{P}}^2 = {\textbf{P}}\) and \({\textbf{P}}= {\textbf{P}}^T\), then
\[ \Vert X\Vert _2^2 = X^TX \sim \chi _p^2, \]
where \(p = {{\,\textrm{rank}\,}}({\textbf{P}})\).
Proof
We express the projector in terms of orthonormal matrices and then use the invariance of the 2-norm under orthogonal matrices and the stability of Gaussians.
Since \({\textbf{P}}\) is an orthogonal projector, there exists \({\textbf{U}}_1\in {\mathbb {R}}^{n\times p}\) such that \({\textbf{U}}_1{\textbf{U}}_1^T = {\textbf{P}}\) and \({\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Choose \({\textbf{U}}_2\in {\mathbb {R}}^{n\times (n-p)}\) so that \({\textbf{U}}= \begin{bmatrix} {\textbf{U}}_1&{\textbf{U}}_2 \end{bmatrix}\) is an orthogonal matrix. Thus,
\[ X^TX = X^T{\textbf{U}}{\textbf{U}}^TX = ({\textbf{U}}_1^TX)^T({\textbf{U}}_1^TX) + ({\textbf{U}}_2^TX)^T({\textbf{U}}_2^TX). \tag{A1} \]
Lemma A.1 implies that \(Y= {\textbf{U}}_1^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_1^T{\textbf{P}}{\textbf{U}}_1 = {\textbf{U}}_1^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_1 = {\textbf{I}}_p\). Similarly, \(Z= {\textbf{U}}_2^TX\) is distributed according to a Gaussian distribution with mean \({\textbf{0}}\) and covariance \({\textbf{U}}_2^T{\textbf{U}}_1{\textbf{U}}_1^T{\textbf{U}}_2 = {\textbf{0}}\), hence \(Z= {\textbf{0}}\) almost surely.
Substituting \(Y\) and \(Z\) into (A1) gives \(X^TX= Y^TY+ {\textbf{0}}^T{\textbf{0}} = Y^TY\). Since \(Y\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}}_p)\), it follows that \(X^TX\sim \chi _p^2\). \(\square \)
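A Monte Carlo illustration of Lemma A.7 (NumPy sketch; the projector and sample size are arbitrary choices, not taken from the paper): the squared norms should have mean approximately \(p\) and variance approximately \(2p\).

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 8, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
P = Q @ Q.T                                  # orthogonal projector with rank(P) = p
P = (P + P.T) / 2                            # enforce exact symmetry

X = rng.multivariate_normal(np.zeros(n), P, size=200_000)
squared_norms = np.sum(X**2, axis=1)

# A chi-squared variable with p degrees of freedom has mean p and variance 2p
print(squared_norms.mean(), squared_norms.var())   # approximately 3 and 6
```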
Lemma A.8
If \({\textbf{A}}\in {\mathbb {R}}^{n\times n}\) is symmetric positive definite, and \(M\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\mu ,\varvec{\Sigma }_\mu )\) and \(N\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{x}}_\nu ,\varvec{\Sigma }_\nu )\) are independent random variables in \({\mathbb {R}}^n\), then
\[ {{\,\mathrm{{\mathbb {E}}}\,}}\bigl[(M-N)^T{\textbf{A}}(M-N)\bigr] = ({\textbf{x}}_\mu -{\textbf{x}}_\nu )^T{\textbf{A}}({\textbf{x}}_\mu -{\textbf{x}}_\nu ) + {{\,\textrm{trace}\,}}\bigl({\textbf{A}}(\varvec{\Sigma }_\mu +\varvec{\Sigma }_\nu )\bigr). \]
Proof
The random variable \(M-N\) has mean \({{\,\mathrm{{\mathbb {E}}}\,}}[M-N] = {\textbf{x}}_\mu -{\textbf{x}}_\nu \), and covariance
\[ {{\,\textrm{Cov}\,}}(M-N) = {{\,\textrm{Cov}\,}}(M) + {{\,\textrm{Cov}\,}}(N) - {{\,\textrm{Cov}\,}}(M,N) - {{\,\textrm{Cov}\,}}(N,M) = \varvec{\Sigma }_\mu + \varvec{\Sigma }_\nu , \]
where the covariances \({{\,\textrm{Cov}\,}}(M,N) ={{\,\textrm{Cov}\,}}(N,M) = {\textbf{0}}\) because \(M\) and \(N\) are independent. Now apply Lemma A.6 to \(M-N\). \(\square \)
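A Monte Carlo check of Lemma A.8 (NumPy sketch with arbitrary test data; not part of the original text):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4

def spd(M):
    """Return a symmetric positive definite matrix built from M."""
    return M @ M.T + np.eye(M.shape[0])

A = spd(rng.standard_normal((n, n)))
x_mu, x_nu = rng.standard_normal(n), rng.standard_normal(n)
S_mu, S_nu = spd(rng.standard_normal((n, n))), spd(rng.standard_normal((n, n)))

M = rng.multivariate_normal(x_mu, S_mu, size=500_000)
N = rng.multivariate_normal(x_nu, S_nu, size=500_000)
D = M - N
empirical = np.mean(np.einsum('ij,jk,ik->i', D, A, D))   # sample mean of (M-N)^T A (M-N)
exact = (x_mu - x_nu) @ A @ (x_mu - x_nu) + np.trace(A @ (S_mu + S_nu))
print(empirical, exact)                      # should agree to Monte Carlo accuracy
```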
Appendix B: Algorithms
We present algorithms for the modified Lanczos method (Sect. B.1), BayesCG with random search directions (Sect. B.2), BayesCG with covariances in factored form (Sect. B.3), and BayesCG under the Krylov prior (Sect. B.4).
B.1 Modified Lanczos method
The Lanczos method [42, Algorithm 6.15] produces an orthonormal basis for the Krylov space \({\mathcal {K}}_g({\textbf{A}},{\textbf{v}}_1)\), while the modified version in Algorithm B.1 produces an \({\textbf{A}}\)-orthonormal basis.
Algorithm B.1 reorthogonalizes the basis vectors \({\textbf{v}}_i\) with classical Gram–Schmidt performed twice; see Lines 10 and 11. This reorthogonalization technique can be implemented efficiently and produces vectors that are orthogonal to machine precision [43, 44].
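The sketch below is not Algorithm B.1 itself (whose listing is given in the paper and uses short Lanczos recurrences); it is a simplified NumPy illustration of the same idea, building an \({\textbf{A}}\)-orthonormal basis of a Krylov space with classical Gram–Schmidt reorthogonalization performed twice. The matrix sizes and the helper function name are arbitrary choices.

```python
import numpy as np

def a_orthonormal_basis(A, v, g):
    """Return a matrix V whose columns are an A-orthonormal basis of the Krylov
    space K_g(A, v), reorthogonalizing with classical Gram-Schmidt performed twice."""
    n = len(v)
    V = np.zeros((n, g))
    V[:, 0] = v / np.sqrt(v @ A @ v)
    for j in range(1, g):
        w = A @ V[:, j - 1]                          # expand the Krylov space
        for _ in range(2):                           # classical Gram-Schmidt, twice
            w = w - V[:, :j] @ (V[:, :j].T @ (A @ w))
        V[:, j] = w / np.sqrt(w @ A @ w)
    return V

rng = np.random.default_rng(8)
n, g = 20, 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                          # symmetric positive definite
V = a_orthonormal_basis(A, rng.standard_normal(n), g)
print(np.allclose(V.T @ A @ V, np.eye(g)))           # A-orthonormality of the basis
```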
B.2 BayesCG with random search directions
The version of BayesCG in Algorithm B.2 is designed to be calibrated because the search directions do not depend on \({\textbf{x}}_*\), hence the posteriors do not depend on \({\textbf{x}}_*\) either [6, Section 1.1].
After sampling an initial random search direction \({\textbf{s}}_1\sim {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{I}})\), Algorithm B.2 computes an \({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}}\)-orthonormal basis for the Krylov space \({\mathcal {K}}_m({\textbf{A}}\varvec{\Sigma }_0{\textbf{A}},{\textbf{s}}_1)\) with Algorithm B.1. Then Algorithm B.2 computes the BayesCG posteriors directly with (2) and (3) from Theorem 2.1. The numerical experiments in Sect. 5 run Algorithm B.2 with the inverse prior \(\mu _0 = {{\,\mathrm{{\mathcal {N}}}\,}}({\textbf{0}},{\textbf{A}}^{-1})\).
B.3 BayesCG with covariances in factored form
Algorithm B.3 takes as input a general prior covariance \(\varvec{\Sigma }_0\) in factored form, and subsequently maintains the posterior covariances \(\varvec{\Sigma }_m\) in factored form as well. Theorem B.1 presents the correctness proof for Algorithm B.3.
Theorem B.1
Under the conditions of Theorem 2.1, if \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) for \({\textbf{F}}_0\in {\mathbb {R}}^{n\times \ell }\) and some \(m \le \ell \le n\), then \(\varvec{\Sigma }_m = {\textbf{F}}_m{\textbf{F}}_m^T\) with
Proof
Fix m. Substituting \(\varvec{\Sigma }_0 = {\textbf{F}}_0{\textbf{F}}_0^T\) into (3) and factoring out \({\textbf{F}}_0\) on the left and \({\textbf{F}}_0^T\) on the right gives \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T\) where
Show that \({\textbf{P}}\) is a projector,
Hence \(\varvec{\Sigma }_m = {\textbf{F}}_0{\textbf{P}}{\textbf{F}}_0^T = {\textbf{F}}_0{\textbf{P}}{\textbf{P}}{\textbf{F}}_0^T = {\textbf{F}}_m{\textbf{F}}_m^T\). \(\square \)
B.4 BayesCG under the Krylov prior
We present algorithms for BayesCG under full Krylov posteriors (Sect. B.4.1) and under approximate Krylov posteriors (Sect. B.4.2).
B.4.1 Full Krylov posteriors
Algorithm B.4 computes the following: a matrix \({\textbf{V}}\) whose columns are an \({\textbf{A}}\)-orthonormal basis for \({\mathcal {K}}_g({\textbf{A}},{\textbf{r}}_0)\); the diagonal matrix \(\varvec{\Phi }\) in (5); and the posterior mean \({\textbf{x}}_m\) in (26). The output consists of the posterior mean \({\textbf{x}}_m\), and the factors \({\textbf{V}}_{m+1:g}\) and \(\varvec{\Phi }_{m+1:g}\) for the posterior covariance.
B.4.2 Approximate Krylov posteriors
Algorithm B.5 computes rank-d approximate Krylov posteriors in two main steps: (i) posterior mean and iterates \({\textbf{x}}_m\) in Lines 5–14; and (ii) factorization of the posterior covariance \({\widehat{\varvec{\Gamma }}}_m\) in Lines 16–26.
About this article
Cite this article
Reid, T.W., Ipsen, I.C.F., Cockayne, J. et al. Statistical properties of BayesCG under the Krylov prior. Numer. Math. 155, 239–288 (2023). https://doi.org/10.1007/s00211-023-01375-7