1 Introduction

L-BFGS  [15, 42, 52] is one of the most popular methods for large-scale unconstrained optimization problems. Among other applications, it is used in geosciences [43], image registration [45], computer-generated holography [58] and machine learning [28]. Despite the maturity of L-BFGS, there are still fundamental open questions, for instance whether L-BFGS converges globally on nonconvex problems.

In this paper, we study a globally convergent version of L-BFGS for the problem

$$\begin{aligned} \min _{x\in {\mathcal {X}}} \, f(x), \end{aligned}$$
(P)

where \(f:{\mathcal {X}}\rightarrow \mathbb {R}\) is continuously differentiable and bounded below, and \({\mathcal {X}}\) is a Hilbert space. The convergence analysis includes the possibility that the memory size in L-BFGS is zero, in which case the new method becomes a globalized Barzilai–Borwein (BB) method [8, 18,19,20, 56], also called spectral gradient method. Since they are suited for large-scale problems, BB-type methods have recently received renewed interest [6, 7, 12, 21, 24, 38].

To ensure global convergence on nonconvex problems, various modified L-BFGS methods have been developed. To the best of our knowledge, however, they all suffer from at least one of the following issues:

  • Global convergence is often established as \(\liminf _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\), e.g., [37, 63, 65]; this does not guarantee that cluster points of \((x_k)\) are stationary.

  • If rate of convergence results are provided, they frequently rely on assumptions whose satisfaction is unclear in the nonconvex case; examples include convergence of \((x_k)\) in [19, Section 3] and the existence of \(c>0\) such that \(y_k^T s_k > c \Vert y_k\Vert \Vert s_k\Vert \) for all k in [10, Theorem 3.2]. In particular, these properties may not hold even if \(\liminf _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\), \((x_k)\subset \mathbb {R}^N\) is bounded, and f is \(C^\infty \) with a single stationary point that satisfies sufficient optimality conditions and that is the unique global minimizer.

  • Some methods rely on prior knowledge of the local modulus of strong convexity. This comprises, for instance, methods that skip the update if \(y_k^T s_k \ge \mu \Vert s_k\Vert ^2\) is violated, where \(\mu >0\) is chosen at the beginning of the algorithm, e.g., [9, 45]. Clearly, if f is strongly convex and \(\mu \) is no larger than the modulus of strong convexity, all updates are carried out (as in L-BFGS). On the other hand, if \(\mu \) is too large, many or even all updates may be skipped, which will usually diminish the efficiency significantly. In nonconvex situations this problem appears if \((x_k)\) converges to a point near which f is strongly convex. In globalized BB methods and L-BFGS with seed matrices \(H_k^{(0)}=\gamma _k I\), this issue (also) arises when the step sizes, respectively, the scaling factors \((\gamma _k)\) are safeguarded away from zero. Ultimately, all these constructions permanently restrict the spectrum of the L-BFGS operator \(H_k\) (cf. Lemma 4.1 for a proof of this fact).

The method provided in this paper does not suffer from any of these issues, but instead comes with the following strong convergence guarantees:

  (1) every cluster point is stationary, cf. Theorem 4.2;

  (2) \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\) if \(\nabla f\) is uniformly continuous in some level set, cf. Theorem 4.1;

  (3) if the iterates cluster at a point near which f is strongly convex and near which \(\nabla f\) is Lipschitz, then they converge to this point at a linear rate and the method agrees with classical L-BFGS after a finite number of iterations, cf. Theorems 4.4 and 4.3;

  (4) under the assumptions of (3), all update pairs are eventually stored and applied, and any \(\gamma _k\) that lies between the two Barzilai–Borwein step sizes (see Definition 2.1) is eventually accepted, cf. Theorem 4.3.

To the best of our knowledge, the new method is the only L-BFGS-type method that satisfies both (1) and (3). As a consequence of (3), the new method will often inherit the excellent efficiency of L-BFGS. In fact, our numerical experiments indicate that the new method agrees entirely with L-BFGS/BB if one of its algorithmic parameters is chosen sufficiently small. This may be regarded as an explanation of why L-BFGS/BB often converges for nonconvex problems, but it is also noteworthy since some authors report that cautious updating can degrade the performance, at least for BFGS, cf. [27, p.18]. A discussion of this agreement is offered in Sect. 4.3.2, particularly in Remark 4.4.

Many existing globalizations of the Barzilai–Borwein method also permanently restrict the spectrum of \(H_k=H_k^{(0)}\), so the techniques and results in this paper are of interest in the context of BB methods, too. That being said, nonmonotone line searches, which are understood to be more effective for BB-type methods [6], are only addressed in the numerical experiments in Sect. 5, but not in the convergence analysis.

The convergence guarantees of the new method and the transition to classical L-BFGS rely on a new form and a careful calibration of the cautious updates introduced for BFGS by Li and Fukushima [41]. We discuss these points further once we have introduced the globalized L-BFGS method in Sect. 3 and its calibration in Assumption 4.3, but let us stress that the new L-BFGS method proposed in this paper is fully compatible with the techniques used in efficient numerical realizations of L-BFGS. In particular, it can still be implemented matrix free based on the well-known two-loop recursion and it has essentially the same costs per iteration as L-BFGS.

We strive to use the weakest possible assumptions in the convergence analysis. For instance, in Theorem 4.4 we prove that the new method converges globally and linearly under purely local assumptions except for the requirement that f has to be bounded below. Specifically, f needs to be continuously differentiable and bounded below, and \((x_k)\) has to have a cluster point near which f is strongly convex and \(\nabla f\) is Lipschitz. In contrast, in the existing literature for L-BFGS-type methods it is frequently assumed that \(\nabla f\) is Lipschitz continuous in the level set \(\Omega \) associated to the initial point, that \(\Omega \) is bounded and that f is twice continuously differentiable; cf. [2, 9, 10, 37, 42, 59]. Note, however, that if f is actually twice continuously differentiable, then our assumption of strong convexity near a cluster point is equivalent to the Hessian of f being positive definite at that cluster point. We avoid using this classical second order sufficient condition for optimality because we do not want to require the existence of second derivatives. The results of this work hold not only for the Wolfe–Powell conditions but also for backtracking line searches based on the Armijo condition. To the best of our knowledge, it has not been proven before that the Wolfe–Powell conditions imply global convergence if \(\nabla f\) is merely continuous, cf. Theorem 4.2. Since this result can be extended to other descent methods, it may be of interest beyond this work.

Another improvement of existing results concerns the type of linear convergence. The strongest available result for the original L-BFGS method is the classical [42, Theorem 7.1], where it is shown that \((f(x_k))\) converges q-linearly and \((x_k)\) converges r-linearly. We obtain the same rate for \((f(x_k))\), but for \((x_k)\) we prove a stronger error estimate that implies l-step q-linear convergence for all sufficiently large l. A sequence that converges l-step q-linearly for a single l is r-linearly convergent, but not vice versa. We discuss l-step q-linear convergence further in Sect. 4.3. Moreover, we show that \((\nabla f(x_k))\) satisfies a similar error estimate and we assess these theoretical results in the numerical experiments in Sect. 5. We are not aware of works on L-BFGS that involve multi-step q-linear convergence, but it frequently appears in results for Barzilai–Borwein methods [6, 19]. We establish the improved rates for our method in Theorem 4.4 under the aforementioned assumptions. In addition, we infer that if f is strongly convex in \(\Omega \), then the result also holds for classical L-BFGS, cf. Remark 4.5. This improves existing results such as [42, Theorem 7.1] by lowering its assumptions while sharpening its conclusion.

There are few works that provide a convergence analysis of L-BFGS in Hilbert space. In fact, a Hilbert space setting is only considered in our work [45] for a structured L-BFGS method and in [6, 7] for the BB method. For BFGS, characterizations of the update and convergence analyses are available in Hilbert space [26, 29, 48, 60]. Still, the techniques that we use differ from those in the literature on L-BFGS and BFGS. For instance, by not relying on traces we avoid the restriction to work with trace class operators, which would severely limit the applicability of our results for infinite dimensional \({\mathcal {X}}\) because trace class operators are compact and require \({\mathcal {X}}\) to be separable. Studying L-BFGS in Hilbert space is valuable because when the (globalized) L-BFGS method is applied to a discretization of an infinite dimensional optimization problem, its numerical performance is usually closely related to the convergence properties of the method on the infinite dimensional level, see for instance the example in [29, Section 1]. This relationship appears, among others, in the form of mesh independence [3, 7, 36]. In Sect. 5 we validate the method’s mesh independence numerically for an optimal control problem.

BFGS and L-BFGS have been modified in various ways to obtain global convergence for nonconvex objectives. For line search based methods we are aware of cautious updating [9, 40, 41, 61, 65], modified updating [37, 39, 61, 63], damped updating [2, 31, 55, 57] and modification of the line search [34, 64]. Other options for globalization include trust region approaches, cf. [10, 11, 13] and the references therein, iterative regularization [35, 44, 59] as well as robust BFGS [62]. For BB methods global and linear convergence in a nonconvex setting is shown in [19]. As pointed out before, however, there is no L-BFGS-type method available for nonconvex objectives that converges globally in the sense that every cluster point is stationary while also recovering classical L-BFGS. For the standard cautious updating from [39] we are not aware of works that show that every cluster point is stationary.

A joint implementation of the new method and standard L-BFGS is available on arXiv with the preprint of this paper.

1.1 Organization and Notation

The paper is organized as follows. To set the stage, Sect. 2 recalls the classical L-BFGS method. In Sect. 3 we present the modified L-BFGS method. Its convergence is studied in Sect. 4. Section 5 contains numerical experiments and Sect. 6 concludes with a summary.

Notation. We use \(\mathbb {N}=\{1,2,3,\ldots \}\) and \(\mathbb {N}_0=\mathbb {N}\cup \{0\}\). In common abuse of notation [29, 30] the scalar product of \(v,w\in {\mathcal {X}}\) is indicated by \(v^T w\) and the linear functional \(w\mapsto v^T w\) is denoted by \(v^T\). In particular, due to the Riesz representation theorem there is a unique element of \({\mathcal {X}}\), denoted \(\nabla f(x)\), such that \(f'(x)=\nabla f(x)^T\). Note that for \({\mathcal {X}}=\mathbb {R}^N\) this implies that the operator \(^T\) equals transposition only if the Euclidean inner product is used and similarly the gradient is not generally equal to the vector of partial derivatives. We write \(M\in \mathcal {L}({\mathcal {X}})\) if \(M:{\mathcal {X}}\rightarrow {\mathcal {X}}\) is a bounded linear operator with respect to the operator norm. An \(M\in \mathcal {L}({\mathcal {X}})\) is self-adjoint iff it satisfies \(x^T M y = (Mx)^T y\) for all \(x,y\in {\mathcal {X}}\). It is positive definite, respectively, positive semi-definite iff it is self-adjoint and there exists \(\beta >0\), respectively, \(\beta \ge 0\) such that \(x^T M x\ge \beta \Vert x\Vert ^2\) for all \(x\in {\mathcal {X}}\setminus \{0\}\). We set \(\mathcal {L}_+({\mathcal {X}}):=\{M\in \mathcal {L}({\mathcal {X}}): \,M \text { is positive definite}\}\).

2 The Classical L-BFGS Method

In this section we discuss the aspects of the classical L-BFGS method for (P) that are relevant to this work. The pseudo code of the method is given below as Algorithm LBFGS.

[Figure a: pseudo code of Algorithm LBFGS (not reproduced here)]

2.1 The L-BFGS Operator \(H_k\)

We recall the definition of the operator \(H_k\in \mathcal {L}({\mathcal {X}})\) appearing in Line 4. It is easy to see that \(|{\mathcal {I}}|\le m\) holds in Line 4 for any k, so there is \(r=r(k)\) with \(0\le r\le m\) such that \({\mathcal {I}}=\{k_0,k_1,\ldots ,k_{r-1}\}\subset \mathbb {N}_0\). If \({\mathcal {I}}=\emptyset \), we let \(H_k:=H_k^{(0)}\). Otherwise, we simplify notation by identifying \(k_j\) with j for \(j=0,\ldots ,r-1\). That is, \({\mathcal {I}}=\{0,1,\ldots ,r-1\}\) and the storage is given by \(\{(s_j,y_j)\}_{j=0}^{r-1}\) with \(y_j^T s_j>0\) for \(j=0,\ldots ,r-1\) (since \((s_k,y_k)\) is only stored if \(y_k^T s_k>0\)). Endowed with this notation, the operator \(H_k\) is obtained as \(H_k:=H_k^{(r)}\) from the seed matrix \(H_k^{(0)}\) and the storage \(\{(s_j,y_j)\}_{j\in {\mathcal {I}}}\) by means of the recursion

$$\begin{aligned} H_k^{({j+1})} = V_{j}^T H_k^{(j)} V_{j} + \rho _{j} s_{j} s_{j}^T, \qquad j=0,\ldots ,r-1, \end{aligned}$$
(1)

where \(V_j:=I-\rho _{j} y_{j} s_{j}^T\) and \(\rho _{j}:= (y_{j}^T s_{j})^{-1}\). From the Sherman–Morrison formula we infer that \(B_k=H_k^{-1}\) can be obtained by setting \(B_k:=B_k^{(r)}\) in the recursion

$$\begin{aligned} B_k^{(j+1)}:=B_k^{(j)}-\frac{B_k^{(j)}s_j (B_k^{(j)}s_j)^T}{(B_k^{(j)}s_j)^T s_{j}}+\frac{y_{j}y_{j}^T}{y_{j}^T s_{j}}, \qquad j=0,\ldots ,r-1, \end{aligned}$$
(2)

where \(B_k^{(0)}:=(H_{k}^{(0)})^{-1}\). We stress that for \({\mathcal {X}}=\mathbb {R}^N\) the transpose operator in (1) and (2) does not agree with the transposition of a vector unless the Euclidean scalar product is used on \({\mathcal {X}}\), cf. the paragraph Notation above. In practice we do not form \(H_k\), but compute the search direction \(d_k = -H_k\nabla f(x_{k})\) in a matrix free way through the two-loop recursion [53, Algorithm 7.4] (with the transpositions replaced by scalar products). This computation is very efficient, enabling the use of L-BFGS for large-scale problems.
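For illustration, here is a minimal Python sketch of the two-loop recursion, written with the scalar product passed explicitly so that it also covers non-Euclidean inner products; the function and variable names are our own, and the stored pairs are assumed to satisfy \(y_j^T s_j>0\).

```python
import numpy as np

def lbfgs_two_loop(grad, pairs, gamma, inner=np.dot):
    """Return d = -H grad via the two-loop recursion.

    pairs : stored pairs (s_j, y_j) with inner(y_j, s_j) > 0, oldest first
    gamma : scaling of the seed matrix H^(0) = gamma * I
    inner : scalar product of the (discretized Hilbert) space
    """
    u = grad.copy()
    alphas = []
    for s, y in reversed(pairs):                    # first loop: newest to oldest
        rho = 1.0 / inner(y, s)
        a = rho * inner(s, u)
        u -= a * y
        alphas.append(a)
    r = gamma * u                                   # apply the seed matrix H^(0)
    for (s, y), a in zip(pairs, reversed(alphas)):  # second loop: oldest to newest
        rho = 1.0 / inner(y, s)
        b = rho * inner(y, r)
        r += (a - b) * s
    return -r                                       # search direction d = -H grad
```

If the storage is empty, both loops are skipped and the sketch returns \(-\gamma \nabla f(x_k)\), consistent with \(H_k=H_k^{(0)}=\gamma _k I\).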

2.2 Choice of the Seed Matrix \(H_k^{(0)}\)

The most common choice of the seed matrix \(H_k^{(0)}\) is \(H_k^{(0)}=\gamma _k I\), where

$$\begin{aligned} \gamma _k = \frac{y_{k-1}^T s_{k-1}}{\Vert y_{k-1}\Vert ^2}, \end{aligned}$$
(3)

cf. [53, (7.20)], but other choices of \(\gamma _k\) and of \(H_k^{(0)}\) have also been studied, e.g., in [26, 42, 54] and [4, 47]. In particular, the following two values are well-known choices for \(\gamma _{k+1}\), where the index shift is for later reference.

Definition 2.1

Let \(k\in \mathbb {N}_0\) and let \((s_k,y_k)\in {\mathcal {X}}\times {\mathcal {X}}\) with \(y_k^T s_k>0\). We set

$$\begin{aligned} \gamma _{k+1}^-:=\frac{y_k^T s_k}{\Vert y_k\Vert ^2} \qquad \text { and }\qquad \gamma _{k+1}^+:=\frac{\Vert s_k\Vert ^{2}}{y_k^T s_k}. \end{aligned}$$

To simplify the presentation of the modified L-BFGS method in Sect. 3, we restrict attention to choices of the form \(H_k^{(0)}=\gamma _k I\) with \(\gamma _k\in [\gamma _{k}^-,\gamma _{k}^+]\). For memory size \(m=0\) this yields \(d_k=-\gamma _k\nabla f(x_k)\), which shows that in this case Algorithm LBFGS may be viewed as a globalized BB method. For BB methods, the values \(\gamma _{k}^-\) and \(\gamma _{k}^+\) were introduced in [8] and they are sometimes referred to as the two Barzilai–Borwein step sizes. The restriction \(\gamma _k\in [\gamma _{k}^-,\gamma _{k}^+]\) appears for instance in the spectral gradient methods of [21], a work that provides ample references to related BB methods.
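In code, the two values of Definition 2.1 are immediate to compute; the following minimal sketch (our own naming) assumes \(y_k^T s_k>0\) and takes the scalar product as an argument.

```python
import numpy as np

def bb_values(s, y, inner=np.dot):
    """Barzilai-Borwein values gamma^- and gamma^+ of Definition 2.1 (requires inner(y, s) > 0)."""
    ys = inner(y, s)
    gamma_minus = ys / inner(y, y)       # y^T s / ||y||^2
    gamma_plus = inner(s, s) / ys        # ||s||^2 / y^T s
    return gamma_minus, gamma_plus       # Cauchy-Schwarz gives gamma_minus <= gamma_plus
```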

Remark 2.1

The Cauchy–Schwarz inequality implies \(\gamma _{k+1}^-\le \gamma _{k+1}^+\). Furthermore, it is well-known that for twice differentiable f with positive definite Hessians, the numbers \(\gamma _{k+1}^-\) and \(\gamma _{k+1}^+\) fall within the spectrum of the inverse of the averaged Hessian \(\int _0^1\nabla ^2f(x_k+t s_k)\,\textrm{d}t\), cf. [45] for a statement in Hilbert space. If f is quadratic with positive definite Hessian B, we have \(B s_k=y_k\), hence \(\gamma _{k+1}^+\) is the inverse of a Rayleigh quotient of B and \(\gamma _{k+1}^-\) is a Rayleigh quotient of \(B^{-1}\).
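To spell the last claim out as a worked equation (our own rewriting of the remark): inserting \(y_k=Bs_k\) into Definition 2.1 yields

$$\begin{aligned} \gamma _{k+1}^+=\frac{\Vert s_k\Vert ^{2}}{y_k^T s_k}=\left( \frac{s_k^T B s_k}{\Vert s_k\Vert ^2}\right) ^{-1} \qquad \text { and }\qquad \gamma _{k+1}^-=\frac{y_k^T s_k}{\Vert y_k\Vert ^2}=\frac{y_k^T B^{-1} y_k}{\Vert y_k\Vert ^2}, \end{aligned}$$

i.e., the inverse of the Rayleigh quotient of B at \(s_k\) and the Rayleigh quotient of \(B^{-1}\) at \(y_k\), respectively.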

2.3 Choice of the Step Size \(\alpha _k\)

Most often, Algorithm LBFGS is employed with a line search that ensures the satisfaction of the Wolfe–Powell conditions or the strong Wolfe–Powell conditions. That is, for constants \(\sigma \in (0,1)\) and \(\eta \in (\sigma ,1)\), the step sizes \(\alpha _k\) are selected in such a way that they satisfy the Armijo condition

$$\begin{aligned} f(x_{k+1}) \le f(x_k) + \alpha _k\sigma \nabla f(x_k)^T d_k \end{aligned}$$
(4)

and either the curvature condition or the strong curvature condition

$$\begin{aligned} \nabla f(x_{k+1})^T d_k \ge \eta \nabla f(x_k)^T d_k, \hspace{5.0pt}\text {resp. }\hspace{5.0pt}\left|\nabla f(x_{k+1})^T d_k\right|\le \eta \left|\nabla f(x_k)^T d_k\right|. \end{aligned}$$
(5)

This entails that \(y_k^T s_k>0\), so the current secant pair \((s_k,y_k)\) is guaranteed to enter the storage in Line 7, which slightly simplifies the algorithm, e.g., [53, Algorithm 7.5], and ensures that the storage contains the most recent secant information. Since we also consider line searches that only involve the Armijo condition (4), we may have \(y_k^T s_k\le 0\). In this case the pair \((s_k,y_k)\) is not stored because using it in the construction of \(H_{k+1}\) would result in \(H_{k+1}\) not being positive definite. While this skipping of secant information seems undesirable, it can still pay off to work only with the Armijo condition [1, 45]. If step sizes are based on (4) only, we demand that they are computed by backtracking. That is, if \(\alpha _{k,i}\) does not satisfy (4), then the next trial step size \(\alpha _{k,i+1}\) is chosen from \([\beta _1\alpha _{k,i},\beta _2\alpha _{k,i}]\), where \(i=0,1,\ldots \), \(\alpha _{k,0}:=1\) and \(0<\beta _1\le \beta _2<1\) are constants. This includes step size strategies that involve interpolation [22, Section 6.3.2].
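As an illustration of the backtracking just described, the following sketch (our own naming; the constants are placeholders) accepts the first trial step size that satisfies (4) and halves the step size in each trial, i.e., it uses \(\beta _1=\beta _2=1/2\); a practical code would typically pick the next trial from \([\beta _1\alpha _{k,i},\beta _2\alpha _{k,i}]\) by interpolation.

```python
def armijo_backtracking(f, x_k, f_k, slope, d_k, sigma=1e-4, beta=0.5, max_trials=50):
    """Backtracking line search for the Armijo condition (4).

    f_k   : f(x_k)
    slope : directional derivative grad f(x_k)^T d_k (negative for a descent direction)
    Trial step sizes are 1, beta, beta^2, ..., i.e., beta_1 = beta_2 = beta.
    """
    alpha = 1.0                                      # alpha_{k,0} = 1
    for _ in range(max_trials):
        if f(x_k + alpha * d_k) <= f_k + alpha * sigma * slope:
            return alpha                             # condition (4) holds: accept
        alpha *= beta                                # next trial step size
    return alpha                                     # safeguard after max_trials reductions
```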

2.4 Effect of Update Skipping on the Storage

Observe that if no new pair enters the storage in Line 7, then no old pair will be removed in Line 7, so we allow older information to be retained. This is a design choice and other variants are conceivable, e.g., we could always remove \((s_{k-m},y_{k-m})\) from the storage and set \({\mathcal {I}}:={\mathcal {I}}\setminus \{k-m\}\) in Line 7. This would not affect the convergence analysis of this work in a meaningful way.
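A sketch of this storage handling (the precise Line 7 is part of the pseudo code in figure a; the helper below is our own illustration): a pair is appended only if \(y_k^T s_k>0\), and the oldest pair is dropped only when a new pair actually enters and the memory size m is exceeded.

```python
import numpy as np

def update_storage(storage, s, y, m, inner=np.dot):
    """Append (s, y) if it provides positive curvature; otherwise keep the storage unchanged."""
    if m > 0 and inner(y, s) > 0.0:
        storage.append((s, y))
        if len(storage) > m:
            storage.pop(0)          # remove the oldest pair only when a new one entered
    return storage
```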

3 The Modified L-BFGS Method

In this section we present the new method, point out its novelties and discuss how they relate to the cautious updating introduced by Li and Fukushima [41].

To state the pseudo code of the method, let us denote

$$\begin{aligned} q:{\mathcal {X}}\times {\mathcal {X}}\rightarrow \mathbb {R}, \qquad q(s,y):= {\left\{ \begin{array}{ll} \min \left\{ \frac{y^T s}{\Vert s\Vert ^2},\,\frac{y^T s}{\Vert y\Vert ^2}\right\} & \text { if }s\ne 0 \text { and } y\ne 0,\\ 0 & \text { else}. \end{array}\right. } \end{aligned}$$

The new method Algorithm LBFGSM (= L-BFGS, modified) reads as follows.

[Figure b: pseudo code of Algorithm LBFGSM (not reproduced here)]

The interval \([\gamma _k^-,\gamma _k^+]\cap [\omega _k,\omega _k^{-1}]\) in Line 4 is well-defined, but it may be empty. In that case, \([\omega _k,\omega _k^{-1}]\) is used. The latter is nonempty due to \(c_0\in (0,1]\).

3.1 The Two Main Novelties of Algorithm LBFGSM

While Algorithm LBFGSM still stores any pair \((s_k,y_k)\) with \(y_k^T s_k>0\), cf. Line 10, it does not necessarily use all stored pairs for constructing \(H_k\). Instead, in every iteration the updating involves only those of the stored pairs \((s_j,y_j)\), \(j\in {\mathcal {I}}\), that satisfy \(q_j\ge \omega _k\), cf. Line 6, where \(q_j=q(s_j,y_j)\). We recall that the construction of \(H_k\) is described in (1). Note that the condition \(q_j\ge \omega _k\) relates all stored pairs to the current value \(\omega _k\), so a pair \((s_j,y_j)\) in the storage may be used to form \(H_k\) in some iterations, but may be skipped in others, until it is removed from the storage. In contrast, in L-BFGS a pair within the storage is consistently used to form \(H_k\) until it is removed from the storage. The second novelty concerns the choice of \(\gamma _k\). In classical L-BFGS with Wolfe–Powell step sizes we have \(\gamma _k^->0\) for all k, so often \(\gamma _k=\gamma _k^-\) is chosen for all k. By additionally requiring \(\gamma _k\in [\omega _k,\omega _k^{-1}]\), Algorithm LBFGSM further restricts the choice of \(\gamma _k\). In particular, this may prevent the choice \(\gamma _k=\gamma _k^-\) for some k. The key point is that these two modifications allow us to bound \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \) in terms of \(\omega _k\), cf. Lemma 4.3, which is crucial for establishing the global convergence of LBFGSM in Theorems 4.1 and 4.2 (note that \(\omega _k\) is related to \(\Vert \nabla f(x_k)\Vert \)). We are not aware of other works on cautious updating that have achieved global convergence in the sense of Theorem 4.1 or 4.2 without the use of fixed bounds for \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \). Such bounds can severely degrade the performance, cf. also the comment at the end of Sect. 3.2.
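To make the selection mechanism concrete, here is a small Python sketch. It is not Algorithm LBFGSM itself (whose pseudo code is given in figure b): the value \(\omega _k\) from Line 3 is simply passed in, the stored pairs are assumed to satisfy \(y_j^T s_j>0\), and for illustration \(\gamma _k^-\) and \(\gamma _k^+\) are computed from the most recent stored pair.

```python
import numpy as np

def q(s, y, inner=np.dot):
    """The function q from Sect. 3: the smaller of the two curvature ratios, 0 if s = 0 or y = 0."""
    ss, yy = inner(s, s), inner(y, y)
    if ss == 0.0 or yy == 0.0:
        return 0.0
    ys = inner(y, s)
    return min(ys / ss, ys / yy)

def cautious_selection(storage, omega, inner=np.dot):
    """Return the stored pairs used to form H_k and an admissible interval for gamma_k.

    storage : stored pairs (s_j, y_j) with inner(y_j, s_j) > 0, oldest first
    omega   : the value omega_k from Line 3 of Algorithm LBFGSM (0 < omega <= 1)
    """
    active = [(s, y) for (s, y) in storage if q(s, y, inner) >= omega]   # cf. Line 6
    lo, hi = omega, 1.0 / omega                  # fallback interval [omega_k, omega_k^{-1}]
    if storage:                                  # intersect with [gamma_k^-, gamma_k^+]
        s, y = storage[-1]
        ys = inner(y, s)
        lo_cap = max(ys / inner(y, y), lo)       # gamma_k^- bounded below by omega_k
        hi_cap = min(inner(s, s) / ys, hi)       # gamma_k^+ bounded above by omega_k^{-1}
        if lo_cap <= hi_cap:                     # empty intersection: keep the fallback
            lo, hi = lo_cap, hi_cap
    return active, (lo, hi)                      # any gamma_k in [lo, hi] is admissible, cf. Line 4
```

Combining this selection with the two-loop recursion from Sect. 2.1 (applied to the active pairs and the seed \(H_k^{(0)}=\gamma _k I\)) yields the search direction \(d_k=-H_k\nabla f(x_k)\).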

3.2 Relation of Algorithm LBFGSM to Cautious Updating

We emphasize that the two main novelties of Algorithm LBFGSM are based on the cautious updating introduced by Li and Fukushima in [41] for the BFGS method. There, the BFGS update based on \((s_k,y_k)\) is applied if \(y_k^T s_k/\Vert s_k\Vert ^2\ge \epsilon \Vert \nabla f(x_k)\Vert ^\alpha \) for positive constants \(\epsilon \) and \(\alpha \), otherwise the update is skipped; cf. [41, (2.10)]. This condition for skipping the update has been employed many times in the literature, both for BFGS and for L-BFGS. However, to the best of our knowledge, in the existing variants once a pair \((s_j,y_j)\) is used for updating, it is also used for updating in every subsequent iteration (until it is removed from the storage in case of L-BFGS). Clearly, this is less flexible than deciding in each iteration which of the stored pairs are used for updating. Moreover, in classical cautious updating the decision whether to store a pair \((s_j,y_j)\) (and thus, to use it for updating) is based on \(\Vert \nabla f(x_j)\Vert \), whereas in Algorithm LBFGSM the decision to involve a stored pair \((s_j,y_j)\) in the updating is based on the most recent norm \(\Vert \nabla f(x_k)\Vert \) rather than older norms. Finally, involving \(\Vert \nabla f(x_k)\Vert \) in the choice of \(H_k^{(0)}\) seems to be new altogether apart from our very recent work [45]. As it turns out, the novel form of cautious updating used in Algorithm LBFGSM allows us to bound \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \) in terms of \(\Vert \nabla f(x_k)\Vert ^{-1}\), which is crucial for establishing global and linear convergence. We emphasize that the bounds for \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \) tend to infinity when \(\Vert \nabla f(x_k)\Vert \) tends to zero, cf. Lemma 4.3. This is highly desirable because it indicates that \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \) can become as large as necessary for \(k\rightarrow \infty \), whereas any preset lower bound for \(\gamma _k\) or \(q(s_k,y_k)\) uniformly bounds the spectrum of \(H_k\) and may therefore limit the ability of \(H_k\) to capture the spectrum of the inverse Hessian, which would slow down the convergence.

3.3 A Further Novelty of Algorithm LBFGSM: \(\nabla f\) Not (Globally) Lipschitz

Another novelty of Algorithm LBFGSM concerns the assumption of Lipschitz continuity of \(\nabla f\) in the level set associated with \(x_0\). Specifically, if \(\nabla f\) satisfies this assumption, then q can be replaced by the simpler function \(\hat{q}(s,y):=y^T s/\Vert s\Vert ^2\) for \(s,y\ne 0\), \(\hat{q}(s,y)=0\) else, without meaningfully affecting the convergence results of this paper. The decision criterion \(\hat{q}(s_j,y_j)\ge \omega _k\) in Line 6 of Algorithm LBFGSM then resembles the original cautious updating more clearly. However, using q instead of \(\hat{q}\) enables global convergence if \(\nabla f\) is continuous and linear convergence if \(\nabla f\) is Lipschitz continuous near cluster points, cf. Theorem 4.2 and Theorem 4.4. That is, using q instead of \(\hat{q}\) allows us to eliminate the assumption that \(\nabla f\) is Lipschitz in the entire level set for global convergence and to weaken it substantially for linear convergence. We are not aware of other works on cautious updating that have achieved this.

3.4 A Slightly Larger Interval for the Choice of \(\gamma _k\)

Without meaningfully affecting the convergence theory of this paper we could replace the interval \([\gamma _k^-,\gamma _k^+]\) in Line 4 of Algorithm LBFGSM by a larger interval \([\gamma _k^{\min },\gamma _k^{\max }]\), where \(\gamma _k^{\min }:=\min \{c_3,c_4\gamma _k^-\}\) and \(\gamma _k^{\max }:=\max \{C_3,C_4\gamma _k^+\}\) for positive constants \(c_3,c_4,C_3,C_4\) with \(c_3\le C_3\). In particular, this allows us to include the choice \(\gamma _k=1\) for all k, i.e., \(H_k^{(0)}=I\), which also appears in the literature. However, in our assessment \(H_k^{(0)}=\gamma _k I\) with \(\gamma _k\) according to (3) is used much more often, so to keep the notation light we work with \([\gamma _k^-,\gamma _k^+]\). (If \(m=0\) and \(\gamma _k=1\) for all k, LBFGSM is the method of steepest descent.)

4 Convergence Analysis

This section is devoted to the convergence properties of Algorithm LBFGSM. We start with its well-definedness in Sect. 4.1, after which we focus on the global and local convergence in Sects. 4.2 and 4.3.

4.1 Well-definedness

We recall some estimates to infer that Algorithm LBFGSM is well-defined.

Lemma 4.1

(cf. [45, Lemma 4.1]) Let \(M\in \mathbb {N}_0\) and \(\kappa _1,\kappa _2>0\). For all \(j\in \{0,\ldots ,M-1\}\) let \((s_j,y_j)\in {\mathcal {X}}\times {\mathcal {X}}\) satisfy

$$\begin{aligned} \frac{y_j^T s_j}{\Vert s_j\Vert ^2}\ge \frac{1}{\kappa _1}\qquad \text { and }\qquad \frac{y_j^T s_j}{\Vert y_j\Vert ^2}\ge \frac{1}{\kappa _2}. \end{aligned}$$

Moreover, let \(H^{(0)}\in \mathcal {L}_+({\mathcal {X}})\). Then its L-BFGS update H, obtained from the recursion (1) using \(H_k^{(0)}=H^{(0)}\) and \(r=M\), satisfies \(H\in \mathcal {L}_+({\mathcal {X}})\) and

$$\begin{aligned} \Vert H^{-1}\Vert \le \Vert (H^{(0)})^{-1}\Vert + M \kappa _2 \end{aligned}$$
(6)

as well as

$$\begin{aligned} \Vert H\Vert \le 5^M\max \bigl \{1,\Vert H^{(0)}\Vert \bigr \}\max \bigl \{1,\kappa _1^M,(\kappa _1 \kappa _2)^M\bigr \}. \end{aligned}$$
(7)
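For finite dimensional \({\mathcal {X}}=\mathbb {R}^n\) with the Euclidean scalar product, the bounds (6) and (7) are easy to check numerically by forming H densely via the recursion (1). The following sketch is meant purely as a sanity check and not as a practical implementation; the pairs are constructed (our own choice) so that both curvature conditions of the lemma hold.

```python
import numpy as np

def dense_lbfgs_operator(pairs, H0):
    """Form H from the seed H0 and the pairs via the recursion (1) (dense, for testing only)."""
    H = H0.copy()
    for s, y in pairs:                                    # j = 0, ..., M-1
        rho = 1.0 / (y @ s)
        V = np.eye(len(s)) - rho * np.outer(y, s)         # V_j = I - rho_j y_j s_j^T
        H = V.T @ H @ V + rho * np.outer(s, s)
    return H

rng = np.random.default_rng(0)
n, M, kappa1, kappa2 = 20, 5, 4.0, 4.0
A = np.diag(rng.uniform(1.0 / kappa1, kappa2, size=n))    # SPD with spectrum in [1/kappa1, kappa2]
pairs = [(s, A @ s) for s in rng.standard_normal((M, n))]  # then both curvature conditions hold
H0 = np.eye(n)
H = dense_lbfgs_operator(pairs, H0)
assert np.linalg.norm(np.linalg.inv(H), 2) <= np.linalg.norm(np.linalg.inv(H0), 2) + M * kappa2
assert np.linalg.norm(H, 2) <= 5.0**M * max(1.0, np.linalg.norm(H0, 2)) \
    * max(1.0, kappa1**M, (kappa1 * kappa2)**M)
```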

The following assumption ensures that Algorithm LBFGSM is well-defined.

Assumption 4.1

  (1) The function \(f:{\mathcal {X}}\rightarrow \mathbb {R}\) is continuously differentiable and bounded below.

  (2) The step size \(\alpha _k\) in LBFGSM either satisfies the Armijo condition (4) and is computed by backtracking for all k, or it satisfies the Wolfe–Powell conditions for all k, cf. (5).

Lemma 4.2

Let Assumption 4.1 hold. Then Algorithm LBFGSM either terminates in Line 3 with an \(x_k\) satisfying \(\nabla f(x_k)=0\) or it generates a sequence \((f(x_k))\) that is strictly monotonically decreasing and convergent.

Proof

After observing that \([\omega _k,\omega _k^{-1}]\ne \emptyset \) for all k due to \(c_0\le 1\), the proof is similar to that of [45, Lemma 4.5]. \(\square \)

Applying Lemma 4.1 to Algorithm LBFGSM yields the following result.

Lemma 4.3

Let Assumption 4.1 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Then

$$\begin{aligned} \Vert H_k^{-1}\Vert \le \left( m+1\right) \omega _k^{-1} \qquad \text { and }\qquad \Vert H_k\Vert \le 5^m\max \Bigl \{1,\omega _k^{-(2m+1)}\Bigr \} \end{aligned}$$

are satisfied for all \(k\in \mathbb {N}_0\), where m is the constant from Algorithm LBFGSM.

Proof

The acceptance criterion \(q(s_j,y_j)\ge \omega _k\) in Line 6 of Algorithm LBFGSM implies that \(y_j^T s_j/\Vert s_j\Vert ^2\ge \omega _k\) and \(y_j^T s_j/\Vert y_j\Vert ^2\ge \omega _k\) for all \((s_j,y_j)\) that are involved in forming \(H_k\). Moreover, Line 4 ensures that \(\omega _k\le \gamma _k\le \omega _k^{-1}\). Using (6), respectively, (7) we thus deduce that

$$\begin{aligned} \Vert H_k^{-1}\Vert \le \Vert (H_k^{(0)})^{-1}\Vert + m\omega _k^{-1} = \Vert I\Vert \gamma _k^{-1} + m\omega _k^{-1} \le \omega _k^{-1} + m\omega _k^{-1} \end{aligned}$$

and

$$\begin{aligned} \Vert H_k\Vert \le 5^m\max \bigl \{1,\omega _k^{-1}\bigr \}\max \bigl \{1,\omega _k^{-2m}\bigr \} \le 5^m\max \bigl \{1,\omega _k^{-(2m+1)}\bigr \}, \end{aligned}$$

establishing the assertions. \(\square \)

The previous lemma implies useful bounds.

Corollary 4.1

Let Assumption 4.1 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Then there are constants \(c,C>0\) such that

$$\begin{aligned} c\min \Bigl \{1,\Vert \nabla f(x_k)\Vert ^{c_2}\Bigr \}\le \frac{\Vert d_k\Vert }{\Vert \nabla f(x_k)\Vert }\le C\max \Bigl \{1,\Vert \nabla f(x_k)\Vert ^{-(2m+1)c_2}\Bigr \} \end{aligned}$$
(8)

as well as

$$\begin{aligned} \frac{|\nabla f(x_k)^T d_k|}{\Vert d_k\Vert } \ge c\Vert \nabla f(x_k)\Vert \min \Bigl \{1,\Vert \nabla f(x_k)\Vert ^{2(m+1)c_2}\Bigr \} \end{aligned}$$

are satisfied for all \(k\in \mathbb {N}_0\), where \(c_2\) and m are the constants from Algorithm LBFGSM.

Proof

Since \(d_k=-H_k\nabla f(x_k)\), the estimates (8) readily follow from those for \(\Vert H_k\Vert \) and \(\Vert H_k^{-1}\Vert \). Concerning the final inequality we have

$$\begin{aligned} \begin{aligned} -\nabla f(x_k)^T d_k = d_k^T H_k^{-1} d_k&\ge \Vert d_k\Vert ^2\Vert H_k\Vert ^{-1}\\&\ge c\Vert d_k\Vert \Vert \nabla f(x_k)\Vert \min \Bigl \{1,\Vert \nabla f(x_k)\Vert ^{(2m+2)c_2}\Bigr \}, \end{aligned} \end{aligned}$$

where we used the first inequality of (8) and the estimate for \(\Vert H_k\Vert \). \(\square \)

4.2 Global Convergence

In this section we establish the global convergence of Algorithm LBFGSM. In fact, we prove several types of global convergence under different assumptions.

The level set for Algorithm LBFGSM and its \(\delta \)-enlargement are given by

$$\begin{aligned} \Omega :=\Bigl \{x\in {\mathcal {X}}: \; f(x)\le f(x_0)\Bigr \}\qquad \text { and }\qquad \Omega _\delta :=\Omega +\mathbb {B}_{\delta }(0). \end{aligned}$$

The first result relies on the following assumption.

Assumption 4.2

  (1) Assumption 4.1 holds.

  (2) The gradient of f is uniformly continuous in \(\Omega \), i.e., for all sequences \((x_k)\subset \Omega \) and \((s_k)\subset {\mathcal {X}}\) satisfying \((x_k+s_k)\subset \Omega \) and \(\lim _k s_k=0\), there holds \(\lim _k\,\Vert \nabla f(x_k+s_k)-\nabla f(x_k)\Vert =0\).

  (3) If Algorithm LBFGSM uses Armijo with backtracking, then there is \(\delta >0\) such that f or \(\nabla f\) is uniformly continuous in \(\Omega _\delta \).

Remark 4.1

If \({\mathcal {X}}\) is finite dimensional and \(\Omega \) is bounded, part 2) can be dropped since it follows from the continuity of \(\nabla f\); similarly for part 3).

To prove global convergence for step sizes that satisfy the Wolfe–Powell conditions, we will use the following less stringent form of the Zoutendijk condition from [5]. In contrast to the original Zoutendijk condition, cf. e.g., [53, Theorem 3.2], it holds without Lipschitz continuity of the gradient.

Lemma 4.4

(cf. [5, Theorem 2.28]) Let Assumption 4.2 hold and let \((x_k)\) be generated by Algorithm LBFGSM with the Wolfe–Powell conditions. Then \(\lim _{k\rightarrow \infty }\frac{|\nabla f(x_k)^T d_k|}{\Vert d_k\Vert } = 0\).

Remark 4.2

By following the proof of [5, Theorem 2.28] it is not difficult to show that if we replace Assumption 4.2 by Assumption 4.1 (i.e., drop the uniform continuity of \(\nabla f\)) and additionally assume that \((x_k)_K\) converges, then \(\lim _{K\ni k\rightarrow \infty }\frac{|\nabla f(x_k)^T d_k|}{\Vert d_k\Vert } = 0\).

As a main result we establish that Algorithm LBFGSM is globally convergent under Assumption 4.2.

Theorem 4.1

Let Assumption 4.2 hold. Then Algorithm LBFGSM either terminates after finitely many iterations with an \(x_k\) that satisfies \(\nabla f(x_k)=0\) or it generates a sequence \((x_k)\) with

$$\begin{aligned} \lim _{k\rightarrow \infty }\,\Vert \nabla f(x_k)\Vert = 0. \end{aligned}$$
(9)

In particular, every cluster point of \((x_k)\) is stationary.

Proof

The case that Algorithm LBFGSM terminates is clear, so let us assume that it generates a sequence \((x_k)\). If \((x_k)\) satisfies (9), then by continuity it follows that every cluster point \(x^*\) of \((x_k)\) satisfies \(\nabla f(x^*)=0\). Hence, it only remains to establish (9). We argue by contradiction, so suppose that (9) were false. Then there is a subsequence \((x_k)_K\) of \((x_k)\) and an \(\epsilon >0\) such that

$$\begin{aligned} \Vert \nabla f(x_k)\Vert \ge \epsilon \qquad \forall k\in K. \end{aligned}$$
(10)

Since \(\alpha _k\) satisfies the Armijo condition for all \(k\ge 0\), we infer from \(-\nabla f(x_k)^T d_k = d_k^T H_k^{-1} d_k\) that

$$\begin{aligned} \begin{aligned} \sigma \sum _{k\in K}\alpha _k\frac{\Vert d_k\Vert ^2}{\Vert H_k\Vert }&\le -\sigma \sum _{k\in K}\alpha _k\nabla f(x_k)^T d_k\\&\le \sum _{k\in K} \left[ f(x_k)-f(x_{k+1})\right] \le \sum _{k=0}^\infty \left[ f(x_k)-f(x_{k+1})\right] <\infty , \end{aligned} \end{aligned}$$

where we used that \((f(x_k))\) is monotonically decreasing by Lemma 4.2. From Lemma 4.3 and (10) we obtain \(\sup _{k\in K}\Vert H_k\Vert <\infty \), which yields that the sum \(\sum _{k\in K}\alpha _k\Vert d_k\Vert ^2\) is finite. From the first inequality in (8) and (10) we infer that

$$\begin{aligned} \exists c>0: \quad \Vert d_k\Vert \ge c \quad \forall k\in K. \end{aligned}$$
(11)

Hence, \(\sum _{k\in K}\alpha _k\Vert d_k\Vert ^2<\infty \) implies \(\sum _{k\in K}\alpha _k\Vert d_k\Vert <\infty \) and \(\sum _{k\in K}\alpha _k<\infty \), thus

$$\begin{aligned} \lim _{K\ni k\rightarrow \infty }\alpha _k = \lim _{K\ni k\rightarrow \infty }s_k = 0. \end{aligned}$$
(12)

We now distinguish two cases depending on how the step sizes are determined.

Case 1: The step sizes are computed by Armijo with backtracking

From (12) and the construction of the backtracking we infer that for all sufficiently large \(k\in K\) there is \(\beta _k\in (0,\beta _1^{-1}]\) such that (4) is violated for \(\hat{\alpha }_k:=\beta _k\alpha _k\). Therefore,

$$\begin{aligned} -\hat{\alpha }_k\sigma \nabla f(x_k)^T d_k > f(x_k)-f(x_k+\hat{\alpha }_k d_k) = -\hat{\alpha }_k\nabla f(x_k+\theta _k\hat{\alpha }_kd_k)^T d_k \end{aligned}$$

for all these k, where \(\theta _k\in (0,1)\) is provided by the mean value theorem. Multiplying by \(-\hat{\alpha }_k^{-1}\) and subtracting \(\nabla f(x_k)^T d_k\) we obtain

$$\begin{aligned} (\sigma -1)\nabla f(x_k)^T d_k < \left[ \nabla f(x_k+\theta _k\hat{\alpha }_kd_k) - \nabla f(x_k)\right] ^T d_k. \end{aligned}$$

Due to \(\sigma <1\) and \(-\nabla f(x_k)^T d_k = d_k^T H_k^{-1} d_k\) there holds for all \(k\in K\) sufficiently large

$$\begin{aligned} (1-\sigma )\frac{\Vert d_k\Vert ^2}{\Vert H_k\Vert } < \Vert \nabla f(x_k+\theta _k\hat{\alpha }_kd_k)-\nabla f(x_k)\Vert \Vert d_k\Vert . \end{aligned}$$

Because of \(h:=\sup _{k\in K}\Vert H_k\Vert <\infty \) this entails for all large \(k\in K\) that

$$\begin{aligned} (1-\sigma )\Vert d_k\Vert < h\Vert \nabla f(x_k+\theta _k\hat{\alpha }_kd_k)-\nabla f(x_k)\Vert . \end{aligned}$$
(13)

As \(\theta _k\in (0,1)\) and \(\beta _k\le \beta _1^{-1}\) for all \(k\in K\), it follows from (12) that \(\theta _k\hat{\alpha }_k d_k=\theta _k\beta _k s_k\rightarrow 0\) for \(K\ni k\rightarrow \infty \). If \(\nabla f\) is uniformly continuous in \(\Omega _\delta \), then (13) implies \(\lim _{K\ni k\rightarrow \infty }\Vert d_k\Vert =0\), which contradicts (11). If f is uniformly continuous in \(\Omega _\delta \), then \(f(x_k+\theta _k\hat{\alpha }_k d_k)-f(x_k)\rightarrow 0\) for \(K\ni k\rightarrow \infty \), and since \((f(x_k))\) decreases to a limit below \(f(x_0)\), it follows that \(x_k+\theta _k\hat{\alpha }_k d_k\in \Omega \) for all large \(k\in K\); the uniform continuity of \(\nabla f\) in \(\Omega \) then yields \(\lim _{K\ni k\rightarrow \infty }\Vert d_k\Vert =0\) in (13), contradicting (11).

Case 2: The step sizes satisfy the Wolfe–Powell conditions

Combining Lemma 4.4 and Corollary 4.1 yields \(\lim _{K\ni k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\), contradicting (10). \(\square \)

Cluster points are already stationary under Assumption 4.1.

Theorem 4.2

Let Assumption 4.1 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Then every cluster point of \((x_k)\) is stationary.

Proof

The proof is almost identical to that of Theorem 4.1. The main difference is that since the subsequence \((x_k)_K\) now converges to some \(x^*\), continuity of f at \(x^*\) implies that there is \(\delta '>0\) such that \(\mathbb {B}_{\delta '}(x^*)\subset \Omega \) and for all sufficiently large \(k\in K\) there holds \(x_k\in \mathbb {B}_{\delta '}(x^*)\). Therefore, local assumptions in the vicinity of the cluster point suffice (e.g., continuity instead of uniform continuity). In Case 2 of the proof, in particular, we use Remark 4.2 and (12) instead of Lemma 4.4. \(\square \)

Remark 4.3

Theorem 4.2 shows global convergence using only continuity of the gradient and no boundedness of the level set. This result can be extended to essentially any descent method, so it may be of interest beyond this work. Surprisingly, we have not found it elsewhere for the Wolfe–Powell conditions.

If \({\mathcal {X}}\) is finite dimensional and \((x_k)\) is bounded, we obtain the conclusions of Theorem 4.1 under the weaker assumptions of Theorem 4.2.

Corollary 4.2

Let Assumption 4.1 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Let \((x_k)\) be bounded and let \({\mathcal {X}}\) be finite dimensional. Then \((x_k)\) has at least one cluster point and \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\).

Proof

The claims readily follow from Theorem 4.2. \(\square \)

In the infinite dimensional case we also want weak cluster points to be stationary. Recall from Theorem 4.1 that \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\) holds under Assumption 4.2.

Lemma 4.5

Let Assumption 4.1 hold and let \(\nabla f\) be weak-to-weak continuous. Let \((x_k)\) be generated by Algorithm LBFGSM and let \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\). Then every weak cluster point of \((x_k)\) is stationary.

Proof

As \(\Vert \cdot \Vert \) is weakly lower semicontinuous, the proof is straightforward. \(\square \)

4.3 Linear Convergence

In this section we prove that Algorithm LBFGSM converges linearly under mild assumptions and that it turns into classical L-BFGS under first order sufficient optimality conditions. We divide the section into three parts.

4.3.1 Preliminaries

The main result on linear convergence, Theorem 4.4, will show that \((f(x_k))\) converges q-linearly while \((x_k)\) and \((\nabla f(x_k))\) satisfy estimates that imply l-step q-linear convergence for all sufficiently large l.

Definition 4.1

We call \((x_k)\subset {\mathcal {X}}\) l-step q-linearly convergent for some \(l\in \mathbb {N}\), iff there exist \(x^*\in {\mathcal {X}}\), \(\bar{k}\ge 0\) and \(\kappa \in (0,1)\) such that \(\Vert x_{k+l}-x^*\Vert \le \kappa \Vert x_k-x^*\Vert \) is satisfied for all \(k\ge \bar{k}\).

For \(l=1\) this is q-linear convergence. It is easy to see that l-step q-linear convergence for some \(l\in \mathbb {N}\) implies r-linear convergence, whereas the converse is not necessarily true.
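To see that r-linear convergence does not imply l-step q-linear convergence for any l, consider for instance \(x^*=0\) and

$$\begin{aligned} x_k:={\left\{ \begin{array}{ll} 2^{-k} & \text { if } k \text { is a power of } 2,\\ 4^{-k} & \text { else}. \end{array}\right. } \end{aligned}$$

This sequence converges r-linearly to zero since \(|x_k|\le 2^{-k}\), but for any fixed \(l\in \mathbb {N}\) and \(k=2^j-l\) with j sufficiently large we have \(|x_{k+l}|/|x_k|=2^{-(k+l)}/2^{-2k}=2^{k-l}\rightarrow \infty \), so it is not l-step q-linearly convergent for any l.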

We are not aware of works on L-BFGS-type methods that use the concept of l-step q-linear convergence. However, for Barzilai–Borwein-type methods the notion appears in [6, 7, 19] in convex and nonconvex settings.

We use the following assumption to obtain linear convergence of LBFGSM.

Assumption 4.3

  1) Assumption 4.1 holds.

  2) The constant \(c_2\) in Algorithm LBFGSM satisfies \(c_2<(2m+1)^{-1}\) if the Armijo rule with backtracking is used, and \(c_2<(2m+2)^{-1}\) otherwise.

Note that the requirement on \(c_2\) in part 2) restricts the rate with which \(\omega _k\) can go to zero, cf. the definition of \(\omega _k\) in Line 3 of Algorithm LBFGSM. Due to Lemma 4.3 this limits the growth rate of the condition number of the L-BFGS operator \(H_k\). The assumption on \(c_2\) is crucial to show that if a stationary cluster point \(x^*\) belongs to a ball in which f is strongly convex, then \(x^*\) is attractive, i.e., the entire sequence \((x_k)\) converges to \(x^*\). This property is the fundamental building block for all subsequent results, including for the transition to classical L-BFGS in Theorem 4.3 and for the linear convergence in Theorem 4.4. In other words, the calibration of the cautious updates according to 2), which we have not seen elsewhere, is of vital importance from a theoretical point of view.

Lemma 4.6

Let Assumption 4.3 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Let \((x_k)\) have a cluster point \(x^*\) with a convex neighborhood in which f is strongly convex. Then \(\lim _{k\rightarrow \infty }\Vert x_k - x^*\Vert =0\).

Proof

Let \({\mathcal {N}}\) denote the neighborhood of \(x^*\). The proof consists of two parts.

Part 1: Cluster points induce vanishing steps.

First we show that for any \(\epsilon >0\) there exists \(\delta '>0\) such that for all \(k\in \mathbb {N}_0\) the implication \(\Vert x_k-x^*\Vert<\delta '\Rightarrow \Vert s_k\Vert <\epsilon \) holds true. Clearly, to prove this it suffices to consider \(\epsilon \) so small that \(\mathbb {B}_{\epsilon }(x^*)\subset {\mathcal {N}}\). Let such an \(\epsilon \) be given. From Corollary 4.1 we infer that for all \(k\in \mathbb {N}_0\) we have

$$\begin{aligned} \Vert d_k\Vert \le C\Vert \nabla f(x_k)\Vert ^{1-(2m+1)c_2}. \end{aligned}$$
(14)

The exponent in (14) is positive because of the assumption \(c_2<(2m+1)^{-1}\), hence \(\Vert d_k\Vert <\epsilon \) whenever \(\Vert x_k-x^*\Vert <\delta '\) for some sufficiently small \(\delta '>0\), where we used the continuity of \(\nabla f\) at \(x^*\) and \(\nabla f(x^*)=0\), which holds due to Theorem 4.2. If Algorithm LBFGSM uses the Armijo rule with backtracking, then \(\alpha _k\le 1\) for all k, so the desired implication follows. In the remainder of Part 1 we can therefore assume that all \(\alpha _k\), \(k\in \mathbb {N}_0\), satisfy the Wolfe–Powell conditions. The strong convexity of f in \({\mathcal {N}}\) implies the existence of \(\mu >0\) such that

$$\begin{aligned} f(x)-f(x^*)\le \frac{1}{2\mu }\Vert \nabla f(x)\Vert ^2 \end{aligned}$$

for all \(x\in {\mathcal {N}}\). In particular, this holds for \(x=x_k\) whenever \(x_k\in \mathbb {B}_{\epsilon }(x^*)\). Moreover, the Armijo condition (4) holds for all step sizes \(\alpha _k\), \(k\in \mathbb {N}_0\). Together, we have for all \(k\in \mathbb {N}_0\) with \(x_k\in \mathbb {B}_{\epsilon }(x^*)\)

$$\begin{aligned} \begin{aligned} \alpha _k\sigma \nabla f(x_k)^T H_k \nabla f(x_k)&\le f(x_k)-f(x_{k+1})\\&\le f(x_k)-f(x^*) \le \frac{1}{2\mu }\Vert \nabla f(x_k)\Vert ^2, \end{aligned} \end{aligned}$$

where we used that \(f(x_{k+1})\ge f(x^*)\) as \((f(x_k))\) is monotonically decreasing. Thus, for these k

$$\begin{aligned} \alpha _k \le \frac{\Vert H_k^{-1}\Vert }{2\sigma \mu }. \end{aligned}$$

From Lemma 4.3 we obtain that \(\Vert H_k^{-1}\Vert \le \hat{C}\omega _k^{-1}\) for all \(k\in \mathbb {N}_0\), where \(\hat{C}>0\) is a constant. Decreasing \(\delta '\) if need be, we may assume \(\delta '\le \epsilon \) and \(\omega _k^{-1}=c_1^{-1}\Vert \nabla f(x_k)\Vert ^{-c_2}\) for all \(k\in \mathbb {N}_0\) with \(x_k\in \mathbb {B}_{\delta '}(x^*)\). Combining this with (14) we obtain for all \(k\in \mathbb {N}_0\) with \(x_k\in \mathbb {B}_{\delta '}(x^*)\) that

$$\begin{aligned} \begin{aligned} \Vert s_k\Vert = \alpha _k\Vert d_k\Vert \le \frac{\Vert H_k^{-1}\Vert }{2\sigma \mu }\Vert d_k\Vert&\le \frac{C\hat{C}}{2\sigma \mu } \omega _k^{-1} \Vert \nabla f(x_k)\Vert ^{1-(2m+1)c_2}\\&= \frac{C\hat{C}}{2\sigma \mu c_1}\Vert \nabla f(x_k)\Vert ^{1-2(m+1)c_2}. \end{aligned} \end{aligned}$$

The choice of \(c_2\) in Assumption 4.3 2) implies \(1-2(m+1)c_2>0\). Thus, after decreasing \(\delta '\) if need be, there holds \(\Vert s_k\Vert <\epsilon \) for all \(k\in \mathbb {N}_0\) with \(x_k\in \mathbb {B}_{\delta '}(x^*)\).

Part 2: Convergence of the entire sequence \(\mathbf {(x_k)}\).

Let \(\epsilon '>0\) be so small that \(\mathbb {B}_{\epsilon '}(x^*)\subset {\mathcal {N}}\). We have to show that there is \(\bar{k}\ge 0\) such that \(x_k\in \mathbb {B}_{\epsilon '}(x^*)\) for all \(k\ge \bar{k}\). Due to Part 1 we find a positive \(\delta '\) such that for all \(k\in \mathbb {N}_0\) the implication \(x_k\in \mathbb {B}_{\delta '}(x^*)\Rightarrow \Vert s_k\Vert <\epsilon '/2\) holds true. Decreasing \(\delta '\) if need be, we can assume that \(\delta '\le \epsilon '/2\). It then follows that \(x_{k+1}\in \mathbb {B}_{\epsilon '}(x^*)\) for all \(k\in \mathbb {N}_0\) with \(x_k\in \mathbb {B}_{\delta '}(x^*)\). Next we use that the \(\mu \)-strongly convex function \(f\vert _{\mathbb {B}_{\epsilon '}(x^*)}\) satisfies the growth condition

$$\begin{aligned} f(x^*) + \frac{\mu }{2}\Vert x-x^*\Vert ^2 \le f(x) \end{aligned}$$
(15)

for all x in \(\mathbb {B}_{\epsilon '}(x^*)\). Let \(U:=\mathbb {B}_{\epsilon '}(x^*)\setminus \mathbb {B}_{\delta '}(x^*)\). Due to (15) we have \(f(x)-f(x^*)\ge (\delta ')^2\mu /2\) for all \(x\in U\). Since \((f(x_k))\) converges by Lemma 4.2, we find \(\hat{k}\) such that \(f(x_k)-f(x^*)<(\delta ')^2\mu /2\) for all \(k\ge \hat{k}\), hence \(x_k\not \in U\) for all \(k\ge \hat{k}\). Selecting \(\bar{k}\ge \hat{k}\) such that \(x_{\bar{k}}\in \mathbb {B}_{\delta '}(x^*)\), we obtain that \(x_{\bar{k}+1}\in \mathbb {B}_{\epsilon '}(x^*)\) and \(x_{\bar{k}+1}\not \in U\), thus \(x_{\bar{k}+1}\in \mathbb {B}_{\delta '}(x^*)\). By induction we infer that \(x_k\in \mathbb {B}_{\delta '}(x^*)\) for all \(k\ge \bar{k}\), which concludes the proof as \(\delta '\le \epsilon '/2\). \(\square \)

Next we show that \((\Vert H_k\Vert )\) and \((\Vert H_k^{-1}\Vert )\) are bounded.

Lemma 4.7

Let Assumption 4.1 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Suppose there are \(x^*\in {\mathcal {X}}\) and a convex neighborhood \({\mathcal {N}}\) of \(x^*\) such that \(\lim _{k\rightarrow \infty }x_k=x^*\), \(f\vert _{{\mathcal {N}}}\) is strongly convex and \(\nabla f\vert _{{\mathcal {N}}}\) is Lipschitz. Then \((\Vert H_k\Vert )\) and \((\Vert H_k^{-1}\Vert )\) are bounded.

Proof

Let \(L>0\) denote a Lipschitz constant of \(\nabla f\) in \({\mathcal {N}}\) and let \(\bar{k}\) be such that \(x_k\in {\mathcal {N}}\) for all \(k\ge \bar{k}\). Since \(f\vert _{{\mathcal {N}}}\) is strongly convex, there is \(\mu >0\) such that \(\nabla f\) is \(\mu \)-strongly monotone in \({\mathcal {N}}\), i.e.,

$$\begin{aligned} \bigl [ \nabla f(\hat{x})-\nabla f(x) \bigr ]^T (\hat{x}-x) \ge \mu \Vert \hat{x}-x\Vert ^2 \end{aligned}$$

for all \(x,\hat{x}\in {\mathcal {N}}\). By inserting \(\hat{x}=x_{j+1}\) and \(x=x_j\) we infer that

$$\begin{aligned} y_j^T s_j\ge \mu \Vert s_j\Vert ^2 \qquad \text { and }\qquad y_j^T s_j\ge \frac{\mu }{L^2} \Vert y_j\Vert ^2 \end{aligned}$$
(16)

for all \(j\ge \bar{k}\), where the second estimate follows from the first by the Lipschitz continuity of \(\nabla f\) in \({\mathcal {N}}\). Note that \(y_j^T s_j>0\) for all \(j\ge \bar{k}\), so in any iteration \(k\ge \bar{k}\) the pair \((s_k,y_k)\) enters the storage, cf. Line 9 and Line 10 in Algorithm LBFGSM. Therefore, at the beginning of iteration \(k\ge \bar{k}+m\) we have \({\mathcal {I}}=\{k-m,k-m+1,\ldots ,k-1\}\) for the index set of the storage (with \({\mathcal {I}}=\emptyset \) if \(m=0\)), and consequently (16) holds for all pairs \((s_j,y_j)\) in the storage whenever the iteration counter k is sufficiently large. In view of Lemma 4.1 it only remains to prove that \(H_k^{(0)}=\gamma _k I\) and its inverse are bounded independently of k, i.e., that \((\gamma _k)\) and \((\gamma _k^{-1})\) are bounded from above. Since \(\lim _{k\rightarrow \infty }\Vert \nabla f(x_k)\Vert =0\) by Theorem 4.2, we infer that \(\lim _{k\rightarrow \infty }\omega _k=0\). We now show that \((\gamma _k^-)\) is bounded away from zero and that \((\gamma _k^+)\) is bounded from above for sufficiently large k. This implies that \([\gamma _k^-,\gamma _k^+]\cap [\omega _k,\omega _k^{-1}]=[\gamma _k^-,\gamma _k^+]\) for all sufficiently large k, in turn showing that \((\gamma _k)\) and \((\gamma _k^{-1})\) are bounded, cf. Line 4 in Algorithm LBFGSM. As we have already established, there holds \(y_k^T s_k>0\) for all \(k\ge \bar{k}\). By Line 9 we thus deduce that for all \(k\ge \bar{k}\), \(\gamma _{k+1}^-\) and \(\gamma _{k+1}^+\) are computed according to Definition 2.1. Together with (16) we readily obtain that

$$\begin{aligned} \gamma _{k+1}^-= \frac{y_k^T s_k}{\Vert y_k\Vert ^2} \ge \frac{\mu }{L^2} \qquad \text { and }\qquad \gamma _{k+1}^+= \frac{\Vert s_k\Vert ^2}{y_k^T s_k} \le \frac{1}{\mu }, \end{aligned}$$

both valid for all \(k\ge \bar{k}\). This concludes the proof. \(\square \)

4.3.2 Transition to Classical L-BFGS

Before we prove the linear convergence of Algorithm LBFGSM in Theorem 4.4, let us draw some conclusions from Lemma 4.7 that shed more light on Algorithm LBFGSM. In particular, this will enable us to establish in Theorem 4.3 that Algorithm LBFGSM turns into Algorithm LBFGS when approaching a minimizer that satisfies sufficient optimality conditions.

We start by noting that in the situation of Lemma 4.7, eventually each new update pair \((s_k,y_k)\) enters the storage, all stored pairs are used to compute \(H_k\), and any \(\gamma _k\in [\gamma _k^-,\gamma _k^+]\) can be chosen.

Lemma 4.8

Under the assumptions of Lemma 4.7 there exists \(\bar{k}\in \mathbb {N}_0\) such that when arriving at Line 6 of Algorithm LBFGSM in iteration \(k\ge \bar{k}+m\) there holds \({\mathcal {I}}= \{k-m,k-m+1,\ldots ,k-1\}\) (with \({\mathcal {I}}=\emptyset \) if \(m=0\)), i.e., the storage \(\{(s_j,y_j)\}_{j\in {\mathcal {I}}}\) consists of the m most recent update pairs. Moreover, all pairs are used for the computation of \(H_k\) in Line 6 for \(k\ge \bar{k}\). Also, \(\bar{k}\) can be chosen such that for all \(k\ge \bar{k}\) we have \([\gamma _k^-,\gamma _k^+]\cap [\omega _k,\omega _k^{-1}]=[\gamma _k^-,\gamma _k^+]\) in Line 4 of Algorithm LBFGSM.

Proof

In the proof of Lemma 4.7 we already argued that \({\mathcal {I}}= \{k-m,\ldots ,k-1\}\) for \(k\ge \bar{k}+m\). Regarding the computation of \(H_k\) we recall that (16) holds for all \(j\ge \bar{k}\), so \(q(s_j,y_j)\ge \min \{\mu ,\mu /L^2\}\) for all these j and hence for all pairs in the storage in iteration \(k\ge \bar{k}+m\). We have also shown in the proof of Lemma 4.7 that \(\lim _{k\rightarrow \infty }\omega _k=0\). Thus, for all large k there holds \(q(s_j,y_j)\ge \omega _k\) for all \(j\in {\mathcal {I}}\), so all pairs in the storage are used for the computation of \(H_k\), cf. Line 6. We also recall from the proof of Lemma 4.7 that \([\gamma _k^-,\gamma _k^+]\cap [\omega _k,\omega _k^{-1}]=[\gamma _k^-,\gamma _k^+]\) for all sufficiently large k. \(\square \)

We can now argue that Algorithm LBFGSM turns into Algorithm LBFGS close to \(x^*\). More precisely, if in iteration k of Algorithm LBFGSM, where k is sufficiently large, we took a snapshot of the storage \({\mathcal {I}}\) and initialized Algorithm LBFGS with \(x_k\) and that storage, then the iterates generated subsequently by these algorithms would be identical (and so would the storages, the step sizes, etc.), regardless of the choice of constants in Algorithm LBFGSM. Of course, this assumes that the seed matrices and the step sizes are selected in the same way in both algorithms, e.g., \(\alpha _k\) is determined according to the Armijo condition with the same constant \(\sigma \) and using identical backtracking mechanisms. For ease of presentation let us assume that both algorithms choose \(H_{k}^{(0)}=\gamma _{k}^-I\) whenever possible, i.e., Algorithm LBFGS makes this choice if \(\gamma _{k}^->0\) while Algorithm LBFGSM makes this choice if \(\gamma _{k}^-\in [\omega _k,\omega _k^{-1}]\). To distinguish the quantities generated by the two algorithms, we indicate those of Algorithm LBFGS by a hat, e.g., we write \((\hat{x}_k)\) for the iterates of LBFGS and \((x_k)\) for the iterates of LBFGSM.

Theorem 4.3

Let Assumption 4.3 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Let \((x_k)\) have a cluster point \(x^*\) with a convex neighborhood \({\mathcal {N}}\) such that \(f\vert _{{\mathcal {N}}}\) is strongly convex and \(\nabla f\vert _{{\mathcal {N}}}\) is Lipschitz. Then \((x_k)\) converges to \(x^*\) and Algorithm LBFGSM eventually turns into Algorithm LBFGS. More precisely, consider the sequence \((\hat{x}_k)\) generated by Algorithm LBFGS with starting point \(\hat{x}_0:=x_{\bar{k}}\) for some \(\bar{k}\ge m\) using the initial storage \(\hat{\mathcal {I}}:=\{(s_j,y_j)\}_{j=\bar{k}-m}^{\bar{k}-1}\). Suppose that for all \(k\ge 0\), Algorithm LBFGS selects \(\hat{\alpha }_k\) in the same way as Algorithm LBFGSM selects \(\alpha _{\bar{k}+k}\). If \(\bar{k}\) is sufficiently large, then for any \(k\in \mathbb {N}_0\) we have \(x_{\bar{k}+k}=\hat{x}_k\), \(H_{\bar{k}+k}=\hat{H}_k\), \({\mathcal {I}}_{\bar{k}+k}=\hat{\mathcal {I}}_k\) and \(\alpha _{\bar{k} + k} = \hat{\alpha }_k\). Moreover, if \(\bar{k}\) is sufficiently large, then for all \(k\ge \bar{k}\) the storage \(\{(s_j,y_j)\}_{j\in {\mathcal {I}}}\) consists of the m most recent pairs, they are all used for the computation of \(H_k\), and any \(\gamma _k\in [\gamma _k^-,\gamma _k^+]\) is accepted in Line 4 of Algorithm LBFGSM.

Proof

By Lemma 4.6, \((x_k)\) converges to \(x^*\), so the assumptions of Lemma 4.7 are satisfied. All claims then follow by induction over k using Lemma 4.8. \(\square \)

Remark 4.4

We conclude the study of the relationship between LBFGS and LBFGSM with the following observations.

  1)

    Let \(x^*\) be a stationary point having a neighborhood \({\mathcal {N}}\) in which f is strongly convex and \(\nabla f\) is Lipschitz. It can be shown similarly to Lemma 4.8 that for any choice of constants \(c_0,c_1,c_2\) in Algorithm LBFGSM there is a convex neighborhood \(\hat{\mathcal {N}}\subset {\mathcal {N}}\) such that when initialized with any \(x_0\in \hat{\mathcal {N}}\), Algorithm LBFGS and Algorithm LBFGSM are identical (assuming that they choose \(H_k^{(0)}\) and \(\alpha _k\) in the same way for all k). This shows that when initialized sufficiently close to a point that satisfies sufficient optimality conditions, Algorithm LBFGS and Algorithm LBFGSM agree.

  2)

    Let \(x^*\) be a stationary point having a neighborhood \({\mathcal {N}}\) in which f is strongly convex and \(\nabla f\) is Lipschitz. Suppose that Algorithm LBFGS has generated iterates \((x_k)\) that converge to \(x^*\). Since there are only finitely many iterates outside of \({\mathcal {N}}\), it is not difficult to argue that LBFGSM agrees entirely with Algorithm LBFGS (assuming that they choose \(H_k^{(0)}\) and \(\alpha _k\) in the same way for all k) provided that one of the constants \(c_0,c_1\) is chosen sufficiently small. However, the required value of these constants depends on the starting point \(x_0\) and making it dependent on the associated level set \(\Omega \) but not on \(x_0\) would require additional assumptions. An example for such assumptions is given in 3). This shows that if Algorithm LBFGS converges to a point that satisfies sufficient optimality conditions, then it agrees with Algorithm LBFGSM for sufficiently small \(c_0\) or \(c_1\). We mention again that we allow \(m=0\) in these algorithms, so this also holds for the BB method. We infer further that for appropriate choices of its parameters, L-BFGS with the classical cautious updating from [39] also agrees with LBFGS in that situation.

  3)

    It is not difficult to show that if f is strongly convex in \(\Omega \) with Lipschitz continuous gradients, then Algorithm LBFGS and Algorithm LBFGSM agree entirely for any starting point \(x_0\in \Omega \), provided one of the constants \(c_0,c_1\) in LBFGSM is chosen sufficiently small. Here, the required values of \(c_0,c_1\) depend on \(\Omega \), but not on \(x_0\).

4.3.3 Main Result

As the main result of this work we prove that if Algorithm LBFGSM generates a sequence \((x_k)\) with a cluster point near which f is strongly convex and near which it has Lipschitz continuous gradients, then the entire sequence converges to that point and the convergence is linear. Note that Theorem 4.3 applies in this situation, guaranteeing that eventually any scaling factor \(\gamma _k\in [\gamma _k^-,\gamma _k^+]\) for the seed matrix is accepted and that the m most recent update pairs are stored and used for the computation of \(H_k\).

Theorem 4.4

Let Assumption 4.3 hold and let \((x_k)\) be generated by Algorithm LBFGSM. Suppose that \((x_k)\) has a cluster point \(x^*\), that f is strongly convex in a convex neighborhood \({\mathcal {N}}_1\) of \(x^*\), and that \(\nabla f\) is Lipschitz in a neighborhood \({\mathcal {N}}_2\subset {\mathcal {N}}_1\) of \(x^*\). Then

  • \(x^*\) is an isolated local minimizer of f;

  • \((f(x_k))\) converges q-linearly to \(f(x^*)\);

  • \((x_k)\) converges l-step q-linearly to \(x^*\) for any sufficiently large l;

  • \((\nabla f(x_k))\) converges l-step q-linearly to zero for any sufficiently large l.

More precisely, let f be \(\mu \)-strongly convex in \({\mathcal {N}}_1\) and let \(\nabla f\) be L-Lipschitz in \({\mathcal {N}}_2\). Then the numbers \(\bar{k}_i:=\min \{\hat{k}: x_k\in {\mathcal {N}}_i\;\forall k\ge \hat{k}\}\) are well-defined for \(i=1,2\), and setting

$$\begin{aligned} \nu _k := 1-\frac{2\sigma \alpha _k\mu }{\Vert H_k^{-1}\Vert } \qquad \forall k\in \mathbb {N}_0 \end{aligned}$$

we have \(\sup _k\nu _k<1\) and \((f(x_k))\) converges q-linearly to \(f(x^*)\) in that

$$\begin{aligned} f(x_{k+1})-f(x^*) \le \nu _k\bigl [f(x_k)-f(x^*)\bigr ] \qquad \forall k\ge \bar{k}_1. \end{aligned}$$
(17)

Furthermore, for any \(l\in \mathbb {N}_0\) and all \(k\ge \bar{k}_2\) there hold

$$\begin{aligned} \Vert x_{k+l}-x^*\Vert \le \sqrt{\kappa \nu ^l}\; \Vert x_k-x^*\Vert \quad \text {and}\quad \Vert \nabla f(x_{k+l})\Vert \le \kappa \sqrt{\nu ^l} \; \Vert \nabla f(x_k)\Vert , \end{aligned}$$
(18)

where \(\kappa :=L/\mu \) and \(\nu :=\sup _{k\ge \bar{k}_2}\nu _k\).

Proof

Part 1: Preliminaries

Theorem 4.2 implies that \(\nabla f(x^*)=0\). Together with the strong convexity of f it follows that \(x^*\) is the unique local and global minimizer of f in \({\mathcal {N}}_1\), hence isolated. Moreover, Lemma 4.6 implies that \((x_k)\) converges to \(x^*\) and Lemma 4.7 thus yields that \((\Vert H_k\Vert )\) and \((\Vert H_k^{-1}\Vert )\) are bounded.

Part 2: Q-linear convergence of \(\mathbf {(f(x_k))}\)

Setting \(h_k:=\Vert H_k^{-1}\Vert ^{-1}>0\) we use the Armijo condition and \(d_k = -H_k\nabla f(x_k)\) to infer for all k

$$\begin{aligned} \begin{aligned} f(x_{k+1})-f(x^*)&= f(x_{k+1})-f(x_k)+f(x_k)-f(x^*)\\&\le \sigma \alpha _k\nabla f(x_k)^T d_k + f(x_k)-f(x^*)\\&\le -\sigma \alpha _k h_k\Vert \nabla f(x_k)\Vert ^2 + f(x_k)-f(x^*). \end{aligned} \end{aligned}$$

The \(\mu \)-strong convexity of f implies

$$\begin{aligned} f(x)-f(x^*)\le \frac{1}{2\mu } \Vert \nabla f(x)\Vert ^2 \end{aligned}$$
(19)

for all \(x\in {\mathcal {N}}_1\). Thus, for all \(k\ge \bar{k}_1\) we have

$$\begin{aligned} f(x_{k+1})-f(x^*) \le \left( 1-2\sigma \alpha _k h_k\mu \right) \bigl [f(x_k)-f(x^*)\bigr ], \end{aligned}$$

which proves (17). The boundedness of \((\Vert H_k^{-1}\Vert )\) implies \(\inf _k h_k>0\). Moreover, there is \(\alpha >0\) such that \(\alpha _k\ge \alpha \) for all k, as is well-known both for the Wolfe–Powell conditions [57, (3.6)] and for backtracking with the Armijo condition [14, proof of Lemma 4.1] (here we need the Lipschitz continuity of \(\nabla f\) in \({\mathcal {N}}_2\) and the boundedness of \((\Vert H_k\Vert )\)). Together, we infer that \(\sup _{k}\nu _k<1\).
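For completeness, we note that (19), used above, follows from the \(\mu \)-strong convexity of f on the convex set \({\mathcal {N}}_1\): for any \(x\in {\mathcal {N}}_1\) we have

$$\begin{aligned} f(x)-f(x^*) \le -\nabla f(x)^T(x^*-x)-\frac{\mu }{2}\Vert x^*-x\Vert ^2 \le \max _{d\in {\mathcal {X}}}\Bigl [-\nabla f(x)^T d-\frac{\mu }{2}\Vert d\Vert ^2\Bigr ] = \frac{1}{2\mu }\Vert \nabla f(x)\Vert ^2, \end{aligned}$$

where the maximum is attained at \(d=-\nabla f(x)/\mu \).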

Part 3: Convergence of \(\mathbf {(x_k)}\) and \(\mathbf {(\nabla f(x_k))}\)

The strong convexity yields the validity of (15) for all \(x\in {\mathcal {N}}_1\). Together with (17) and the Lipschitz continuity of \(\nabla f\) in \({\mathcal {N}}_2\) we estimate for all \(k\ge \bar{k}_2\) and all \(l\in \mathbb {N}_0\)

$$\begin{aligned} \begin{aligned} \frac{\mu }{2}\Vert x_{k+l}-x^*\Vert ^2&\le f(x_{k+l})-f(x^*)\\&\le \left[ \prod _{j=k}^{k+l-1}\nu _j\right] \bigl [f(x_k)-f(x^*)\bigr ] \le \nu ^l\frac{L}{2} \Vert x_k-x^*\Vert ^2, \end{aligned} \end{aligned}$$
(20)

where we used that \(\nu _j\le \nu =\sup _{k\ge \bar{k}_2}\nu _k\) for all \(j\ge \bar{k}_2\). Evidently, this implies the left estimate in (18). Since we have established in Part 2 of the proof that \(\nu <1\), there is a minimal \(l^*\in \mathbb {N}\) such that \(\sqrt{\kappa \nu ^{l^*}}\in (0,1)\). Hence, \(\sqrt{\kappa \nu ^l}\in (0,1)\) for any \(l\ge l^*\), so the left estimate in (18) indeed shows the l-step q-linear convergence of \((x_k)\) for any \(l\ge l^*\).

For \((\nabla f(x_k))\) we infer from the Lipschitz continuity, (20) and (19) for all \(k\ge \bar{k}_2\) and all \(l\in \mathbb {N}_0\)

$$\begin{aligned} \Vert \nabla f(x_{k+l})\Vert ^2 \le L^2\Vert x_{k+l}-x^*\Vert ^2 \le \frac{2L^2}{\mu } \nu ^l\bigl [f(x_k)-f(x^*)\bigr ] \le \kappa ^2 \nu ^l \Vert \nabla f(x_k)\Vert ^2. \end{aligned}$$

This proves the right estimate in (18). The l-step q-linear convergence follows as for \((x_k)\). \(\square \)

Remark 4.5

  1)

    As mentioned in Remark 4.4 2) and 3), it can be proven that if f is strongly convex with Lipschitz continuous gradient, then Algorithm LBFGSM agrees with classical L-BFGS provided the constant \(c_0\) is sufficiently small; the same holds if Algorithm LBFGS generates a sequence \((x_k)\) that converges to some \(x^*\) near which f is strongly convex and has Lipschitz continuous gradients. Thus, in these cases Theorem 4.4 applies not only to Algorithm LBFGSM but also to classical L-BFGS. In the first case Theorem 4.4 extends and improves the standard result [42, Theorem 7.1] for classical L-BFGS, which shows only r-linear convergence of \((x_k)\), does not include a rate of convergence for \((\nabla f(x_k))\), assumes that f is twice continuously differentiable with bounded second derivatives, and covers only \({\mathcal {X}}=\mathbb {R}^n\).

  2)

    From (17) and the fact that \(f(x_{k+1})-f(x^*)<f(x_k)-f(x^*)\) for all \(k\le \bar{k}_1\) it follows that there is \(\bar{\nu }\in (0,1)\) such that \(f(x_{k+1})-f(x^*)\le \bar{\nu }\left[ f(x_k)-f(x^*)\right] \) for all \(k\in \mathbb {N}_0\). Hence, Algorithm LBFGSM generally consists of two phases: First, in the global phase, lasting from iteration \(k=0\) to at most iteration \(k=\bar{k}_2\), the objective function decays q-linearly but we have no control over the errors \(\Vert x_k-x^*\Vert \) and \(\Vert \nabla f(x_k)\Vert \). Second, in the local phase, starting at iteration \(k=\bar{k}_2\) or earlier, the errors \(\Vert x_k-x^*\Vert \) and \(\Vert \nabla f(x_k)\Vert \) become l-step q-linearly convergent for any sufficiently large l. Somewhere in between, specifically starting at or before iteration \(k=\bar{k}_1\), the errors in the objective start to satisfy (17). Note that the behavior of the algorithm in the local phase is closely related to the local condition number \(\kappa =L/\mu \) [50]. If \(\Omega \) is convex and f is strongly convex with Lipschitz gradient in \(\Omega \), then there is no global phase since \(\bar{k}_2=0\). As in 1) this also holds for classical L-BFGS.

  3)

    The estimates in (18) are still meaningful if l is such that \(\sqrt{\kappa \nu ^l}\), respectively, \(\kappa \sqrt{\nu ^l}\) is larger than one because they limit the increase of \(\Vert x_{k+l}-x^*\Vert \) in comparison to \(\Vert x_{k}-x^*\Vert \), respectively, of \(\Vert \nabla f(x_{k+l})\Vert \) in comparison to \(\Vert \nabla f(x_k)\Vert \). For \(l=1\) we infer that there is a constant \(C>0\) such that the quotients \(\Vert x_{k+1}-x^*\Vert /\Vert x_k-x^*\Vert \) and \(\Vert \nabla f(x_{k+1})\Vert /\Vert \nabla f(x_{k})\Vert \) are bounded from above by C. This is generally not true for r-linear convergence.

  4)

    Note that l-step q-linear convergence for some \(l\in \mathbb {N}\) does not generally imply j-step q-linear convergence for \(j>l\). For example, let \(0<a<b<1\) and consider the sequence \(\tau _{2n-1}:=a^n\), \(\tau _{2n}:=b^n\). This sequence converges l-step q-linearly if and only if l is even.
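    To illustrate this, the following MATLAB snippet (a minimal sketch with the arbitrary choices \(a=0.25\), \(b=0.5\)) evaluates the 1-step and 2-step quotients of \((\tau _k)\): the 2-step quotients are bounded by \(\max \{a,b\}<1\), whereas the 1-step quotients contain the unbounded ratios \((b/a)^n\).

    % Minimal check of Remark 4.5 4): tau(2n-1) = a^n, tau(2n) = b^n with 0 < a < b < 1.
    a = 0.25; b = 0.5; n = (1:20)';
    tau = zeros(40,1); tau(2*n-1) = a.^n; tau(2*n) = b.^n;
    q1 = tau(2:end)./tau(1:end-1);   % 1-step quotients: contain (b/a)^n, hence unbounded
    q2 = tau(3:end)./tau(1:end-2);   % 2-step quotients: equal a or b, hence bounded by b < 1
    fprintf('max 1-step quotient: %.3g, max 2-step quotient: %.3g\n', max(q1), max(q2));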

5 Numerical Experiments

Before presenting the numerical experiments, let us stress that in all experiments, including those not reported here, we have consistently found that LBFGSM agrees entirely with the classical L-BFGS/BB method Algorithm LBFGS for moderately small values of \(c_0\). This suggests that L-BFGS is inherently cautious, which has also been observed for BFGS with cautious updates [41]. We also recall from Remark 4.4 that for any finite set of starting points such that the classical method only converges to points that satisfy sufficient optimality conditions, there is \(c_0>0\) such that the two methods agree entirely. Additionally, let us note that whenever LBFGSM agrees entirely with LBFGS, then these two algorithms are also identical to L-BFGS/BB with standard cautious updating [39] for appropriate parameters.

In the following experiments, we choose \(c_0\) so small that LBFGSM and LBFGS agree. Since the literature contains ample numerical experiments for L-BFGS, we consider only three test problems for LBFGSM to illustrate some of the new theoretical results. Recent works that include numerical experiments for L-BFGS are, for instance, [5, 35].

The values of \(c_0,c_1,c_2\) are specified in Table 1. We use \(\gamma _k=\gamma _k^-\) for all k, cf. Definition 2.1. The computation of \(-H_k\nabla f(x_k)\) in Line 6 is realized in a matrix-free way through the two-loop recursion [53, Algorithm 7.5]. Algorithm LBFGSM terminates when it has generated an \(x_k\) that satisfies \(\Vert \nabla f(x_k)\Vert \le 10^{-9}\) in the first and third example, respectively, \(\Vert \nabla f(x_k)\Vert \le 10^{-5}\) in the second. Regarding line search strategies, we use Armijo with backtracking by the fixed factor \(\beta :=\beta _1=\beta _2=\frac{1}{2}\) in all examples. In the first and third example we additionally apply the well-known Moré–Thuente line search [49] from Poblano [23], which ensures that the strong Wolfe–Powell conditions are satisfied. In the second example we replace the Moré–Thuente line search by another one; details are discussed in that example. The parameter values of the line searches are included in Table 1. The experiments are conducted in MATLAB 2023b. The code for the first example, including LBFGSM and LBFGS, is available on arXiv with the preprint of this article.
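For reference, the search direction \(-H_k\nabla f(x_k)\) is obtained from the stored pairs by the two-loop recursion, shown here as a minimal MATLAB sketch (the argument names, the inner product handle ip and the seed scaling gamma are our choices for illustration; in Examples 1 and 2 ip is the Euclidean inner product, in Example 3 it is the weighted product described there, and gamma corresponds to \(\gamma _k\)):

function d = two_loop(g, S, Y, gamma, ip)
% Two-loop recursion: returns d = -H*g, where H is the L-BFGS operator built from
% the stored pairs S = [s_1 ... s_m], Y = [y_1 ... y_m] (oldest first) and the seed
% matrix gamma*I. The inner product is supplied as a function handle ip(u,v).
m = size(S, 2);
alpha = zeros(m, 1); rho = zeros(m, 1);
q = g;
for i = m:-1:1                      % first loop: newest to oldest pair
    rho(i)   = 1 / ip(Y(:,i), S(:,i));
    alpha(i) = rho(i) * ip(S(:,i), q);
    q = q - alpha(i) * Y(:,i);
end
r = gamma * q;                      % apply the seed matrix gamma*I
for i = 1:m                         % second loop: oldest to newest pair
    beta = rho(i) * ip(Y(:,i), r);
    r = r + (alpha(i) - beta) * S(:,i);
end
d = -r;                             % search direction d = -H*g
end

For \(m=0\) both loops are empty and \(d=-\gamma _k\nabla f(x_k)\), i.e., the scaled steepest descent direction of the BB method.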

Table 1 Parameter values for Algorithm LBFGSM and the line searches

Next we define some quantities for the evaluation of the numerical results. Suppose that Algorithm LBFGSM terminates for \(k=K\) in the modified Line 3. It has then generated iterates \(x_0,x_1,\ldots ,x_K\), step sizes \(\alpha _0,\alpha _1,\ldots ,\alpha _{K-1}\) and has taken K iterations if we count \(k=0\) as the first iteration and do not count the incomplete iteration for \(k=K\). The number of iterations in which \((s_k,y_k)\) is added to the storage is \(\#{\mathcal {P}}:=|\{k\in \{0,\ldots ,K-1\}: y_k^T s_k>0 \}|\). Note that the maximal value of \(\#{\mathcal {P}}\) is K and that for the Moré–Thuente line search there holds \(\#{\mathcal {P}}=K\). By \(\alpha _{\min }\) and \(\alpha _{\max }\) we denote the smallest, respectively, largest step size that is used during the course of Algorithm LBFGSM. Moreover, we let

$$\begin{aligned} \begin{aligned}&Q_{f}:=\max _{1\le k\le K}\left\{ \frac{f(x_k)-f^*}{f(x_{k-1})-f^*}\right\} , \qquad Q_x:=\max _{1\le k\le K}\left\{ \frac{\Vert x_k-x^*\Vert }{\Vert x_{k-1}-x^*\Vert }\right\} ,\\&Q_{g}:=\max _{1\le k\le K}\left\{ \frac{\Vert \nabla f(x_k)\Vert }{\Vert \nabla f(x_{k-1})\Vert }\right\} \end{aligned} \end{aligned}$$

denote the maximal q-factors of the respective sequences. To assess the asymptotic behavior of the q-factors, we also consider a variant in which \(\max _{1\le k\le K}\) is replaced by \(\max _{K-2\le k\le K}\), i.e., where the maximum is taken only over the final three quotients. This variant is denoted by \(Q_f^3\), \(Q_x^3\) and \(Q_g^3\), respectively.
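As an illustration, these quantities can be computed from the stored iteration history as in the following MATLAB sketch (the arrays fvals, gnorms and X holding \(f(x_k)\), \(\Vert \nabla f(x_k)\Vert \) and \(x_k\) for \(k=0,\ldots ,K\), as well as fstar and xstar, are hypothetical names for this illustration):

% Post-processing sketch: column k+1 of X is x_k, entry k+1 of fvals/gnorms is f(x_k)/||grad f(x_k)||.
ef  = fvals - fstar;                          % objective errors f(x_k) - f^*
ex  = sqrt(sum((X - xstar).^2, 1))';          % iterate errors ||x_k - x^*||
Qf  = max(ef(2:end) ./ ef(1:end-1));          % maximal q-factor of (f(x_k) - f^*)
Qx  = max(ex(2:end) ./ ex(1:end-1));          % maximal q-factor of (||x_k - x^*||)
Qg  = max(gnorms(2:end) ./ gnorms(1:end-1));  % maximal q-factor of (||grad f(x_k)||)
Qf3 = max(ef(end-2:end) ./ ef(end-3:end-1));  % variant using only the final three quotients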

5.1 Example 1: The Rosenbrock Function

As a classical example we consider the Rosenbrock function \(f:\mathbb {R}^2\rightarrow \mathbb {R}\), \(f(x):=(1-x_1)^2+100(x_2-x_1^2)^2\), with unique global minimizer \(x^*=(1,1)^T\) that is also the unique stationary point of f. It is straightforward to confirm that \(\nabla ^2 f\) is positive definite exactly on the set \(\{x\in \mathbb {R}^2: x_2<x_1^2+\tfrac{1}{200}\}\), which contains \(x^*\), so f is strongly convex in a convex neighborhood of \(x^*\). Since every level set of f is compact, \(\nabla f\) is Lipschitz continuous in \(\Omega \) regardless of the starting point. However, f is not convex in the level set associated to the starting point \(x_0=(-1.2,1)^T\) that we use, so there is no result available that guarantees convergence of the classical L-BFGS method to \(x^*\) from this starting point. For LBFGSM, in contrast, Theorem 4.2 combined with the compactness of the level sets and the uniqueness of the stationary point shows that \((x_k)\) converges to \(x^*\) for any \(x_0\in \mathbb {R}^2\), and Theorem 4.4 further implies that if \(c_2<1/(2m+2)\), convergence is either finite or at least linear. In addition, Theorem 4.3 ensures that LBFGSM turns into L-BFGS as \(x^*\) is approached. Table 2 and Fig. 1 show the numerical results of Algorithm LBFGSM.
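For reproducibility, the objective and gradient used here read as follows in MATLAB (a minimal sketch; the handles f and gradf are our names, the L-BFGS driver itself is available with the preprint, cf. above):

% Rosenbrock function and its gradient.
f     = @(x) (1 - x(1))^2 + 100*(x(2) - x(1)^2)^2;
gradf = @(x) [ -2*(1 - x(1)) - 400*x(1)*(x(2) - x(1)^2);
                200*(x(2) - x(1)^2) ];
x0 = [-1.2; 1];          % starting point used in this example
% Sanity check: f([1;1]) = 0 and gradf([1;1]) = [0;0] at the minimizer x^* = (1,1)^T.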

Table 2 Results of LBFGSM for the Rosenbrock function

Table 2 shows, among other things, that using \(m=0\) with a strong Wolfe–Powell line search can be disastrous; the convergence is comparable to that of steepest descent (not shown). Indeed, it is well understood that monotone line searches can be too restrictive for the Barzilai–Borwein method, cf. for instance the convergence analysis in [6]. For this reason, the BB method is typically combined with a non-monotone line search [20, 33, 56]. We thus applied LBFGSM for \(m=0\) with the non-monotone Armijo line search of Grippo et al. [32], yielding an improvement over monotone Armijo. Specifically, with the best choice of parameters, 71 iterations and 82 evaluations of f are required. A further observation in Table 2 is that for \(m>0\) the q-factors are usually significantly smaller during the final three iterations than in the previous iterations, suggesting an acceleration in convergence; Fig. 1 confirms this observation.
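For clarity, the non-monotone Armijo condition of [32] replaces \(f(x_k)\) in the sufficient decrease test by the maximum of the most recent objective values. The following backtracking routine is a minimal sketch (function and variable names are ours; the history fk_hist of recent values \(f(x_{k-j})\) is maintained by the caller, e.g., over the last 10 iterations):

function [alpha, xnew, fnew] = nonmonotone_armijo(f, xk, dk, slope, fk_hist, sigma, beta)
% Accept alpha once f(xk + alpha*dk) <= max(fk_hist) + sigma*alpha*slope, where
% slope = gradf(xk)'*dk < 0 and fk_hist contains the most recent objective values.
fref  = max(fk_hist);                % non-monotone reference value
alpha = 1;
xnew  = xk + alpha*dk;  fnew = f(xnew);
while fnew > fref + sigma*alpha*slope
    alpha = beta*alpha;              % backtrack by the fixed factor beta
    xnew  = xk + alpha*dk;  fnew = f(xnew);
end
end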

Fig. 1
figure 1

LBFGSM with \(m=2\) for the Rosenbrock function. The results for Algorithm LBFGS with \(m=2\) are identical

Next we comment on Fig. 1. Since f is strongly convex in a convex neighborhood of \(x^*\), the error plot for \(\Vert x_k-x^*\Vert \) in Fig. 1 in combination with Theorem 4.4 indicates that l-step q-linear convergence of \((x_k)\) and \((\nabla f(x_k))\) is guaranteed from around iteration \(\bar{k}_1=\bar{k}_2=26\) onward for Armijo and from \(\bar{k}_1=\bar{k}_2=24\) onward for Moré–Thuente; cf. also Remark 4.5 2). In this regard we note that for \(k\le \bar{k}_1\) the plot of \((\Vert \nabla f(x_k)\Vert )\) contains many pairs \((x_{k-l},x_k)\) with \(\Vert \nabla f(x_{k-l})\Vert \le \Vert \nabla f(x_k)\Vert \) for several l, ruling out l-step q-linear convergence, while for \(k\ge \bar{k}_1\) we presumably see l-step q-linear convergence for any \(l\ge 4\) for Armijo, respectively, for any \(l\ge 5\) in case of Moré–Thuente. Yet, 5-step q-linear convergence is violated for \(k=\bar{k}_1-1\) since \(\Vert \nabla f(x_{30})\Vert >\Vert \nabla f(x_{25})\Vert \) for Armijo and \(\Vert \nabla f(x_{28})\Vert >\Vert \nabla f(x_{23})\Vert \) for Moré–Thuente, where the inequalities are based on the numerical values underlying the figure.

5.2 Example 2: A Piecewise Quadratic Function

To demonstrate that LBFGSM works on objectives that are \(C^{1,1}\) but not \(C^2\), we let \(N\in \mathbb {N}\), \(d:=3N\) and consider the piecewise quadratic function \(f:\mathbb {R}^d\rightarrow \mathbb {R}\), \(f(x):=\frac{1}{2} \Vert x-b\Vert ^2+\frac{99}{2}\sum _{i=1}^{d}\max \{0,x_i\}^2\), where \(b = (z,z,\ldots ,z)^T\in \mathbb {R}^d\) with \(z=(1,-1,0)\). This objective is strongly convex, every level set is bounded, and it is \(C^1\) with \(\nabla f(x) = x-b + 99\max \{0,x\}\), where \(\max \) is applied componentwise. It is clear that \(\nabla f\) is Lipschitz in \(\mathbb {R}^d\), but not differentiable at \(x^*=(y,y,\ldots ,y)^T\) with \(y=(0.01,-1,0)\), the unique stationary point of f. As in the first example it follows from Theorem 4.2 and Theorem 4.4 that for any starting point Algorithm LBFGSM either terminates finitely or generates linearly convergent sequences. In contrast to the first example we have \(\bar{k}_1=\bar{k}_2=0\). In particular, the q-linear convergence estimate (17) holds for all k. The convergence behavior of classical L-BFGS is exactly the same, but this cannot be inferred from existing results such as [42, Theorem 7.1] since f is not twice continuously differentiable. Instead, it follows from the results of this work, cf. the discussion in Remark 4.5 1).
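In MATLAB, the objective and gradient of this example can be set up as follows (a minimal sketch with our variable names; N = 100 as in the runs below):

% Piecewise quadratic objective of Example 2 and its gradient.
N = 100; d = 3*N;
z = [1; -1; 0];
b = repmat(z, N, 1);                   % b = (z,z,...,z)^T in R^d
f     = @(x) 0.5*norm(x - b)^2 + 0.5*99*sum(max(0, x).^2);
gradf = @(x) x - b + 99*max(0, x);     % Lipschitz, but not differentiable where x_i = 0
xstar = repmat([0.01; -1; 0], N, 1);   % unique stationary point x^*
% Sanity check: norm(gradf(xstar)) = 0.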

Applying Algorithm LBFGSM with starting point \(x_0=b\) for \(N=100\) yields the results displayed in Table 3. We point out that in this example we do not use the Moré–Thuente line search but resort to a line search that ensures the weak Wolfe–Powell conditions instead. The reason for not using the Moré–Thuente line search is that it involves quadratic and cubic interpolation for f, which is not appropriate since f is only piecewise smooth.

Table 3 Results of LBFGSM for Example 2, a strongly convex and piecewise quadratic objective

While Table 3 does not reveal this information, LBFGSM finds the exact solution \(x^*\) in the displayed runs. It is also interesting that the iterates for \(m=0\) and \(m=10\) agree if the same line search is used. The combination of \(m=0\) with a non-monotone Armijo line search (not shown) does not improve the performance in this example. The gap between \(Q_f\) and 1 is much larger than in Example 1, which we attribute to the fact that (17) holds for all k.

To check for global convergence we conduct, for each of the memory sizes and line searches displayed in Table 3, \(10^5\) runs of Algorithm LBFGSM with random starting points generated by MATLAB's randn. The gradient norm is successfully decreased below \(10^{-5}\) in all runs. The average number of iterations for \(m=0\) is 98.9 for Armijo and 227.7 for Wolfe–Powell, for \(m=5\) it is 83.0 for Armijo and 82.4 for Wolfe–Powell, and for \(m=10\) it is 92.5 for Armijo and 92.3 for Wolfe–Powell.

5.3 Example 3: PDE-constrained Optimal Control

To illustrate that the results of this work are valid in infinite dimensional Hilbert space, we consider a nonconvex large-scale problem from PDE-constrained optimal control. Recently, Barzilai–Borwein-type methods have been applied to and studied for this problem class [7, 24, 38]. Numerical studies involving L-BFGS–type methods for PDE-constrained optimal control problems are available in [17, 25, 46, 51], for instance. Besides the present work, the convergence theory of L-BFGS–type methods in Hilbert space is only addressed in our paper [45]. We consider the problem

$$\begin{aligned} \min _{u\in L^2(\Omega )} \frac{1}{2}\Vert y_u-y_d\Vert ^2_{L^2(\Omega )} + \frac{\nu }{2}\Vert u\Vert ^2_{L^2(\Omega )}, \end{aligned}$$

where \(\Omega =(0,1)^2\subset \mathbb {R}^2\), \(y_d\in L^2(\Omega )\), \(\nu >0\) and \(y=y_u\) denotes the solution to the semilinear elliptic boundary value problem

$$\begin{aligned} \begin{aligned} -\Delta y + \exp (y)&= u & \text { in }\Omega ,\\ y&= 0 & \text { on }\partial \Omega . \end{aligned} \end{aligned}$$

It can be shown by standard arguments that for every \(u\in L^2(\Omega )=:U\) there is a unique weak solution \(y_u\in H_0^1(\Omega )\cap C(\bar{\Omega })=:Y\) to this PDE and that the mapping \(u\mapsto y_u\) is smooth from U to Y, cf. [16] and the references therein. Regarding \(y_u\) as a function of u, the objective \(f(u):=\frac{1}{2}\Vert y_u-y_d\Vert ^2_{L^2(\Omega )} + \frac{\nu }{2}\Vert u\Vert ^2_{L^2(\Omega )}\) defined on the Hilbert space U is smooth and it admits a global minimizer [16]. Since \(u\mapsto y_u\) is nonlinear, f is nonconvex.

We choose \(\nu =10^{-3}\) and \(y_d(x_1,x_2)=\sin (2\pi x_1)\cos (2\pi x_2)\) and we discretize the Laplacian by the classical 5-point stencil on a uniform grid with \(M+1=2^{j}+1\), \(4\le j\le 11\), points in each direction. The discretizations of the control u and the state y, denoted \(u_h\) and \(y_h\), live in \({\mathcal {X}}=\mathbb {R}^N\) with \(N=(M-1)^2\). Their entries represent function values at the \((M-1)^2\) inner nodes of the grid. We obtain the discretization \(f_h\) of f by replacing integrals \(\int _\Omega v\,\textrm{d}x\) by \(h^2\sum _{i=1}^N [v_h]_i\), where \([v_h]_i\) indicates the i-th component of \(v_h\) and \(h:=1/M\) is the mesh width of the grid. To mimic the \(L^2\) inner product, we endow \(\mathbb {R}^N\) with the scalar product \((u_h,v_h):=h^2\sum _{i=1}^N ([u_h]_i \cdot [v_h]_i)\). Note that this differs from the Euclidean inner product, which has to be taken into account, for instance when computing \(B_{k+1}\) (respectively, when applying the two-loop recursion) since formulas (1) and (2) involve the scalar product, cf. also the paragraph Notation at the end of Sect. 1. From now on all quantities are discrete, so we suppress the index h. In every iteration we compute the discrete state \(y_{u_k}\) associated to the discrete control \(u_k\) by a few iterations of a damped Newton method. Since the exact solution of the problem is unknown, we run Algorithm LBFGSM to obtain \(u_k\) satisfying \(\Vert \nabla f(u_k)\Vert \le 10^{-12}\) and use it in place of an exact solution. The desired state \(y_d\), the optimal state \(\bar{y}:=y_{\bar{u}}\) and the optimal control \(\bar{u}\) are depicted in Fig. 2. The results obtained with starting point \(u_0=0\) are displayed in Tables 4 and 5.
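To make the role of the weighted scalar product concrete, the following MATLAB fragment is a minimal sketch of the discrete inner product (the names ip and nrm are ours):

% Scalar product on R^N mimicking the L^2(Omega) inner product; h = 1/M as above.
h   = 1/M;                              % mesh width (M as chosen for the current grid)
ip  = @(u, v) h^2 * (u(:).' * v(:));
nrm = @(u) sqrt(ip(u, u));
% All scalar products in formulas (1), (2) and in the two-loop recursion are evaluated
% with ip, and the termination test checks nrm of the gradient; the gradient is
% understood as the Riesz representative of the derivative of f_h with respect to ip.

This is also why the two-loop recursion sketched at the beginning of this section carries the inner product as an argument.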

Fig. 2
figure 2

Desired state \(y_d\), optimal state \(\bar{y}=y_{\bar{u}}\), optimal control \(\bar{u}\)

Table 4 shows, for instance, that the full step is taken in all iterations for the Armijo line search and in all but one iteration for the Moré–Thuente line search. A closer inspection reveals that only the first iteration does not use a full step. Although the problem is nonconvex, we have \(y_k^T s_k>0\) in all iterations, cf. the values of \(\#{\mathcal {P}}\) in Table 4. For \(m=10\) we have \(\# \textrm{it} <m\), so any choice \(m>10\) produces exactly the same results as \(m=10\). The values of \(Q_f\), \(Q_x\) and \(Q_g\) indicate that for the Moré–Thuente line search we have q-linear convergence for the objective values, the iterates and also the gradients. As in Example 1, the last three columns of Table 4 indicate that a significant acceleration takes place during the course of the algorithm.

An important property of efficient numerical algorithms for PDE-constrained optimization is their mesh independence [3, 7, 36]. This roughly means that the number of iterations to reach a prescribed tolerance is insensitive to the mesh size. Table 5 clearly confirms the mesh independence of Algorithm LBFGSM. Finally, we mention that in this example we used \(\sigma =10^{-8}\) for the Moré–Thuente line search since \(\sigma =10^{-4}\) sometimes failed.

Table 4 Results of LBFGSM (and simultaneously standard L-BFGS/BB) for Example 3, a nonconvex optimal control problem
Table 5 Iteration numbers of LBFGSM (and standard L-BFGS/BB) for Example 3 with different mesh sizes

6 Conclusion

This work introduces the first globally convergent modification of L-BFGS that recovers classical L-BFGS near points satisfying local sufficient optimality conditions. The strong convergence guarantees of the method rely on several modifications of cautious updating, including the novel idea to decide in each iteration, based only on the most current gradient norm, which storage pairs to use and which ones to skip. The method enjoys q-linear convergence for the objective values and l-step q-linear convergence for the iterates and the gradients for all sufficiently large l. The rates of convergence rely on strong convexity and gradient Lipschitz continuity, but only near one cluster point of the iterates; the existence of second derivatives is not required. Global convergence for Wolfe–Powell line searches is shown without boundedness of the level set and using only continuity of the gradient. The rates are also valid for classical L-BFGS if its iterates converge to a point near which the objective is strongly convex with Lipschitz continuous gradient, which is for instance satisfied if the objective is strongly convex in the level set and has Lipschitz gradients near its unique stationary point. The results hold in any Hilbert space and also for memory size \(m=0\), yielding a new globalization of the Barzilai–Borwein method. Numerical experiments support the theoretical findings and suggest that for sufficiently small parameter \(c_0\), the new method and L-BFGS agree entirely. This may explain why L-BFGS is often successful for nonconvex problems.