Abstract
The first-order optimality condition of convexly constrained nonconvex nonconcave min-max optimization problems with box constraints formulates a nonmonotone variational inequality (VI), which is equivalent to a system of nonsmooth equations. In this paper, we propose a quasi-Newton subspace trust region (QNSTR) algorithm for the least squares problems defined by the smoothing approximation of nonsmooth equations. Based on the structure of the nonmonotone VI, we use an adaptive quasi-Newton formula to approximate the Hessian matrix and efficiently solve a low-dimensional strongly convex quadratic program with ellipse constraints in a subspace at each step of the QNSTR algorithm. We prove the global convergence of the QNSTR algorithm to an \(\epsilon \)-first-order stationary point of the min-max optimization problem. Moreover, we present numerical results based on the QNSTR algorithm with different subspaces for mixed generative adversarial networks in eye image segmentation using real data to show the efficiency and effectiveness of the QNSTR algorithm for solving large-scale min-max optimization problems.
1 Introduction
Min-max optimization problems have wide applications in games [29], distributional robustness optimization [27], robust machine learning [26], generative adversarial networks (GANs) [14], reinforcement learning [7], distributed optimization [32], etc. Mathematically, a convexly constrained min-max optimization problem can be written as
where \(X\subseteq {\mathbb {R}}^{n}\) and \(Y\subseteq {\mathbb {R}}^{m}\) are nonempty, closed and convex sets, \(\xi \) is an s-dimensional random vector obeying a probability distribution P with support set \(\varXi \), \(\ell :{\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{s}\rightarrow {\mathbb {R}}\) is nonconvex-nonconcave for a fixed realization of \(\xi \), i.e., \(\ell (x,y,\xi )\) is neither convex in x nor concave in y. Hence the objective function f(x, y) is also nonconvex-nonconcave in general.
In this paper, we are mainly interested in problem (1) arising from GANs [14], which reads
where \(\xi _i\) is an \({\mathbb {R}}^{s_i}\)-valued random vector with probability distribution \(P_i\) for \(i=1,2\), \(D:{\mathbb {R}}^m \times {\mathbb {R}}^{s_1} \rightarrow (0,1)\) is a discriminator, and \(G:{\mathbb {R}}^n \times {\mathbb {R}}^{s_2} \rightarrow {\mathbb {R}}^{s_1}\) is a generator.
Generally, the generator G and the discriminator D are two feedforward neural networks. For example, G can be a p-layer neural network and D can be a q-layer neural network, that is
where \(W_G^1,\cdots ,W_G^p\), \(b_G^1,\cdots ,b_G^p\) and \(W_D^1,\cdots ,W_D^q\), \(b_D^1,\cdots ,b_D^q\) are the weight matrices and bias vectors of G and D with suitable dimensions, and \(\sigma _G^1,\cdots ,\sigma _G^p\) and \(\sigma _D^1,\cdots ,\sigma _D^q\) are proper activation functions, such as ReLU, GELU, Sigmoid, etc. Denote
where \(\textrm{vec}(\cdot )\) denotes the vectorization operator. In this case, problem (2) reduces to problem (1) if we let \(\xi :=(\xi _1,\xi _2)\in \varXi \) and
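To illustrate this parameterization, the following minimal NumPy sketch builds a two-layer generator and discriminator and flattens their weights into the decision vectors x and y. The layer widths, the tanh hidden activation and the random initialization are assumptions made only for this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical sizes: data dimension s1, latent dimension s2, hidden widths h_G, h_D.
s1, s2, h_G, h_D = 784, 100, 64, 64
rng = np.random.default_rng(0)

# Weight matrices and bias vectors of a two-layer generator G and discriminator D.
params_G = dict(W1=0.1 * rng.standard_normal((h_G, s2)), b1=np.zeros(h_G),
                W2=0.1 * rng.standard_normal((s1, h_G)), b2=np.zeros(s1))
params_D = dict(W1=0.1 * rng.standard_normal((h_D, s1)), b1=np.zeros(h_D),
                W2=0.1 * rng.standard_normal((1, h_D)), b2=np.zeros(1))

def G(xi2, p):   # generator: R^{s2} -> R^{s1}
    h = np.tanh(p["W1"] @ xi2 + p["b1"])     # tanh as a stand-in smooth activation
    return sigmoid(p["W2"] @ h + p["b2"])

def D(xi1, p):   # discriminator: R^{s1} -> (0, 1)
    h = np.tanh(p["W1"] @ xi1 + p["b1"])
    return sigmoid(p["W2"] @ h + p["b2"])[0]

# x collects all generator parameters and y all discriminator parameters, so that
# problem (2) becomes a finite-dimensional min-max problem over (x, y).
x = np.concatenate([w.ravel() for w in params_G.values()])
y = np.concatenate([w.ravel() for w in params_D.values()])
print(x.size, y.size)    # the dimensions n and m
```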
Due to the nonconvexity-nonconcavity of the objective function f, problem (1) may not have a saddle point. Hence the concepts of global and local saddle points are not suitable for characterizing the optima of problem (1). Recently, motivated by practical applications, the so-called global and local minimax points are proposed in [19] to describe the global and local optima of nonconvex-nonconcave min-max optimization problems from the viewpoint of sequential games. Moreover, necessary optimality conditions for a local minimax point are studied in [19] for unconstrained min-max optimization problems. In [8, 18], optimality conditions for a local minimax point are studied for constrained min-max optimization problems.
Numerical methods for min-max optimization problems have been extensively studied. These algorithms can be divided into four classes based on the convexity or concavity of problems: the convex-concave cases (see, e.g., [28, 30, 31, 39]), the nonconvex-concave cases (see, e.g., [22, 23, 33]), the convex-nonconcave cases (see, e.g., [22, 23, 33]) and the nonconvex-nonconcave cases (see, e.g., [11, 42]).
To solve problem (1) numerically, we first apply the sample average approximation (SAA) approach to obtain a discrete form. We collect N independent identically distributed (i.i.d.) samples of \(\xi \) (e.g., generated by the Monte Carlo method), denoted by \(\xi ^1,\cdots ,\xi ^N\), and obtain a discrete counterpart of problem (1) as below:
To ease the discussion, we assume that \(\ell (\cdot ,\cdot ,\xi )\) is twice continuously differentiable with respect to (x, y) for an arbitrary \(\xi \in \varXi \) in what follows.
Let \(z:=(x^\top , y^\top )^\top \in {\mathbb {R}}^{n+m}\), \(Z:=X\times Y\subseteq {\mathbb {R}}^{n+m}\) and
Then the first-order optimality condition of a local minimax point for problem (3) can be presented as the following nonmonotone variational inequality (VI):
where \({\mathcal {N}}_{Z}(z)\) is the normal cone to the convex set Z at z, which is defined by
We call \(z^*\) a first-order stationary point of problem (3) if it satisfies (4). Due to the nonconvexity-nonconcavity, few algorithms can ensure convergence to a global or local optimal point of problem (3). In the study of the convergence of algorithms for the nonconvex-nonconcave min-max problem (3), some strong assumptions, such as the Polyak-Łojasiewicz (PL) condition [42], the existence of solutions for the corresponding Minty VI of problem (4) [11], etc., are required. In fact, such assumptions are difficult to check in practice. On the other hand, some algorithms need to estimate the Lipschitz constant of \(H_{N}(z)\), but the computation of the Lipschitz constant of \(H_{N}(z)\) may be too costly or even intractable. In this paper, without estimating the Lipschitz constant of \(H_{N}(z)\) or assuming the PL condition or the existence of solutions for the Minty VI, we propose a quasi-Newton subspace trust region (QNSTR) algorithm for solving problem (4).
The VI (4) can be equivalently reformulated as the following system of nonsmooth equations (see [12])
where \(\textrm{Proj}_Z(\cdot )\) denotes the projection operator onto Z.
Obviously, \(z^*\) is a first-order stationary point of (3) (i.e., a solution of (4) or (5)) if it is an optimal solution of the following least squares problem:
and \(r_N(z^*)=0\), where \(\Vert \cdot \Vert \) denotes the Euclidean norm.
The main contributions of this paper are summarized as follows. (i) We develop the QNSTR algorithm for solving the least squares problem (6) when X and Y are boxes. Based on the structure of the VI (4), we use a smoothing function to approximate the nonsmooth function \(F_N\), adopt an adaptive quasi-Newton formula to approximate the Hessian matrix and solve a quadratic program with ellipse constraints in a subspace at each step of the QNSTR algorithm with a relatively low computational cost. (ii) We prove the global convergence of the QNSTR algorithm to a stationary point of a smoothing least squares problem of (6), which is an \(\epsilon \)-first-order stationary point of the min-max optimization problem (3) if every element of the generalized Jacobian matrix of \(F_N\) is nonsingular at it. (iii) We apply the QNSTR algorithm to GANs in eye image segmentation with real data, which validates the effectiveness and efficiency of our approach for large-scale min-max optimization problems.
This paper is organized as follows. In Sect. 2, we introduce concepts of local minimax points and first-order optimality conditions. Moreover, we investigate the asymptotic convergence between problems (1) and (3) to build the numerical foundation for the subsequent discussion. In Sect. 3, we present the QNSTR algorithm and establish its global convergence. In Sect. 4, we apply the QNSTR algorithm to solve problem (3) with examples from eye image segmentation problems and digital handwriting image generation problems based on two different real data sets. We compare the QNSTR algorithm with some existing methods including alternating Adam. Finally, we give some concluding remarks in Sect. 5.
Notations \({\mathbb {N}}\) denotes the set of natural numbers. \(\Vert \cdot \Vert \) denotes the Euclidean norm of a vector or the norm of a matrix induced by the Euclidean norm. \(\textrm{d}(x, Y):=\inf _{y\in Y} \Vert x - y\Vert \) and \(\textrm{d}(X, Y):= \sup _{x\in X}\inf _{y\in Y} \Vert x-y\Vert \), \(X, Y \subseteq {\mathbb {R}}^{n}\).
2 First-Order Stationarity and Asymptotic Convergence
In this section, we focus on the asymptotic convergence between problems (3) and (1) with respect to global minimax points and first-order stationary points. To this end, we first give some preliminaries on how to describe the optima of a min-max optimization problem.
Definition 1
(global and local minimax points, [19, Definitions 9 & 14])
(i)
\(({\hat{x}},{\hat{y}})\in X\times Y\) is called a global minimax point of problem (1), if
$$\begin{aligned} f({\hat{x}},y) \le f({\hat{x}},{\hat{y}}) \le \max _{y'\in Y}f(x,y'),~\forall (x,y)\in X\times Y. \end{aligned}$$
(ii)
\(({\hat{x}},{\hat{y}})\in X\times Y\) is called a local minimax point of problem (1), if there exist a \(\delta _0>0\) and a function \(\tau :{\mathbb {R}}_+\rightarrow {\mathbb {R}}_+\) satisfying \(\tau (\delta )\rightarrow 0\) as \(\delta \rightarrow 0\), such that for any \(\delta \in (0,\delta _0]\) and any \((x,y)\in X\times Y\) satisfying \(\left\| x-{\hat{x}}\right\| \le \delta \) and \(\left\| y-{\hat{y}}\right\| \le \delta \), we have
$$\begin{aligned} f({\hat{x}},y) \le f({\hat{x}},{\hat{y}}) \le \max _{y'\in \{y\in Y: \left\| y-{\hat{y}}\right\| \le \tau (\delta )\}}f(x,y'). \end{aligned}$$
Remark 1
The concept of saddle points has been commonly used to characterize the optima of min-max problems. A point \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a saddle point of problem (1), if
and \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a local saddle point of problem (1) if (7) holds in a neighborhood of \(({\hat{x}},{\hat{y}})\). However, as pointed out in [19], saddle points and local saddle points may not exist in many applications of machine learning, especially in the nonconvex-nonconcave case. Also, (local) saddle points are solutions from the viewpoint of a simultaneous game, where the minimization operator and the maximization operator act simultaneously. However, many applications, such as GANs and adversarial training, seek solutions in the sense of a sequential game, where the minimization operator acts first and the maximization operator acts later. The global and local minimax points exist under some mild conditions (see [19, Proposition 11 and Lemma 16]) and also describe the optima in the sense of a sequential game.
The following lemma gives the first-order necessary optimality conditions of local minimax points for problem (1).
Lemma 1
( [18, Theorem 3.2 & Corollary 3.1]) If \(({\hat{x}},{\hat{y}})\in X\times Y\) is a local minimax point of problem (1), then we have
Definition 2
\(({\hat{x}},{\hat{y}})\in X\times Y\) is called a first-order stationary point of problem (1) if (8) holds. \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a first-order stationary point of problem (3) if (8) holds with f replaced by \({\hat{f}}_N\).
Hereafter, we will focus on finding a first-order stationary point of (3).
As for the exponential rate of convergence of the first-order and second-order stationary points of SAA for a specific GAN, one can refer to [18, Proposition 4.3]. In what follows, we mainly focus on the almost sure convergence analysis between problems (1) and (3) as N tends to infinity. If the problem is well-behaved and global minimax points are achievable, we consider the convergence of global minimax points between problems (1) and (3). Otherwise, only first-order stationary points (Definition 2) are attainable. Thus, we also need to consider the convergence of first-order stationary points between problems (1) and (3) as N tends to infinity.
Denote the optimal value, the set of global minimax points and the set of first-order stationary points of problem (1) by \(\vartheta _g\), \({\mathcal {S}}_{g}\) and \({\mathcal {S}}_{1\textrm{st}}\), respectively. Let \({\widehat{\vartheta }}_g^N\), \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \(\widehat{{\mathcal {S}}}_{1\textrm{st}}^{N}\) denote the optimal value, the set of global minimax points and the set of first-order stationary points of problem (3), respectively.
Lemma 2
Suppose that: (a) X and Y are compact sets; (b) \(\ell (x,y,\xi )\) is dominated by an integrable function for every \((x,y)\in X\times Y\). Then
w.p.1 as \(N\rightarrow \infty \).
If, further, (c) \(\left\| \nabla _x \ell (x,y,\xi )\right\| \) and \(\left\| \nabla _y \ell (x,y,\xi ) \right\| \) are dominated by an integrable function for every \((x,y)\in X\times Y\), then
w.p.1 as \(N\rightarrow \infty \).
Proof
Since the samples are i.i.d. and X and Y are compact, it is known from [36, Theorem 7.53] that the above uniform convergence results hold.\(\square \)
The following proposition gives the nonemptiness of \(\widehat{{\mathcal {S}}}_{g}^{N}\), \({\mathcal {S}}_{g}\), \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) and \({\mathcal {S}}_{\textrm{1st}}\).
Proposition 1
If conditions (a)–(c) in Lemma 2 hold, then \({\mathcal {S}}_{g}\) and \({\mathcal {S}}_{\textrm{1st}}\) are nonempty and \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are nonempty for any \(N\in {\mathbb {N}}\).
Proof
By the continuity of f(x, y) and \({\hat{f}}_{N}(x,y)\) w.r.t. (x, y) and the boundedness of X and Y, we know from [19, Proposition 11] the nonemptiness of \({\mathcal {S}}_{g}\) and \(\widehat{{\mathcal {S}}}_{g}^{N}\). Note that both \({\mathcal {S}}_{\textrm{1st}}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are solution sets of variational inequalities. Then we have from [12, Corollary 2.2.5] that \({\mathcal {S}}_{\textrm{1st}}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are nonempty.\(\square \)
Based on the uniform laws of large numbers in Lemma 2, we have the following convergence results.
Theorem 1
Let conditions (a)–(c) in Lemma 2 hold. Then
w.p.1 as \(N\rightarrow \infty \).
Proof
(10) follows from [35, Proposition 19] directly. Thus, in what follows, we only consider (9).
From Proposition 1, we know that \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \({\mathcal {S}}_{g}\) are nonempty for any \(N\in {\mathbb {N}}\). Let \(z^{N}=(x^{N}, y^{N}) \in \widehat{{\mathcal {S}}}_{g}^{N}\) and \(z^{N} \rightarrow {\bar{z}}=({\bar{x}}, {\bar{y}})\) w.p.1 as \(N\rightarrow \infty \). Then it suffices to verify that \({\bar{z}}\in {\mathcal {S}}_{g}\) w.p.1. If \(\{z^{N}\}\) is not a convergent sequence, we can choose a convergent subsequence due to the boundedness of X and Y. Denote \(\varphi (x):=\max _{y\in Y} f(x,y)\) and \({\hat{\varphi }}_{N}(x):=\max _{y\in Y} {\hat{f}}_{N}(x,y)\). Note that
w.p.1 as \(N\rightarrow \infty \), where the last convergence assertion follows from Lemma 2.
Next, we show
where \({\hbox {Proj}}_x\) denotes the projection onto the x’s space. Based on the definition of global minimax points, we have, for any \(({\hat{x}},{\hat{y}})\in {\mathcal {S}}_{g}\), that
which implies
This means \(\textrm{Proj}_{x}{\mathcal {S}}_{g}\subseteq \mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x)\). On the other hand, for any \({\hat{x}}\in \mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x)\), we let \({\hat{y}} \in \arg \max _{y\in Y}f({\hat{x}},y)\). Then it is not difficult to verify that \(({\hat{x}},{\hat{y}})\) is a global minimax point, i.e., \(\mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x) \subseteq \textrm{Proj}_{x}{\mathcal {S}}_{g}\). The equality \({\hbox {Proj}}_x \widehat{{\mathcal {S}}}_{g}^{N}=\mathop {\hbox {arg\,min}}\limits _{x\in X} {\hat{\varphi }}_{N}(x)\) can be verified similarly. Hence (12) holds.
Then (11) and (12) indicate, according to [41, Lemma 4.1], that
w.p.1 as \(N\rightarrow \infty \). We know from (13) that \({\bar{x}}\in \textrm{Proj}_x {\mathcal {S}}_{g}\).
Moreover, we know that
w.p.1 as \(N\rightarrow \infty \), where \(\vartheta _g\) and \({\widehat{\vartheta }}_g^{N}\) are optimal values of problems (1) and (3), respectively. Due to Lemma 2 and the continuity of f, we know that
Since \({\widehat{\vartheta }}_g^{N} = {\hat{f}}_{N}(x^{N},y^{N})\), we know that \(\vartheta _g = f({\bar{x}},{\bar{y}})\), which, together with \({\bar{x}}\in \textrm{Proj}_x {\mathcal {S}}_{g}\), implies that \(({\bar{x}},{\bar{y}})\in {\mathcal {S}}_{g}\).\(\square \)
Theorem 1 justifies employing problem (3) to approximately solve problem (1). In the sequel, we will focus on how to compute an \(\epsilon \)-first-order stationary point of problem (3).
3 The QNSTR Algorithm and its Convergence Analysis
In this section, we propose the QNSTR algorithm to compute an \(\epsilon \)-first-order stationary point of problem (3) with a fixed sample size N. In the remainder of this paper, let \(X=[a,b]\) and \(Y=[c,d]\), where \(a,b\in {\mathbb {R}}^n\), \(c,d\in {\mathbb {R}}^m\) with \(a< b\) and \(c< d\) in the componentwise sense. In this case, the projection in (5) has a closed form and the function \(F_N\) can be written as
where \(l,u\in {\mathbb {R}}^{n+m}\) with \(l=(a^\top , c^\top )^\top \) and \(u=(b^\top , d^\top )^\top \), “mid” is the middle operator in the componentwise sense, that is
Since X and Y are boxes, (14) can be separated into two parts and rewritten as
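To make the box-constrained case concrete, the following sketch evaluates the natural residual \(F_N(z)=z-\textrm{mid}(l, z-H_N(z), u)\) and the least squares value \(r_N(z)=\frac{1}{2}\Vert F_N(z)\Vert ^2\), using the fact that the projection onto a box is the componentwise mid operator (np.clip). The operator \(H_N\) is assembled from the partial gradients of \({\hat{f}}_N\) as in the first-order optimality condition (4); the toy objective at the end is an assumption made only for illustration.

```python
import numpy as np

def H_N(z, grad_x, grad_y, n):
    """Stacked first-order operator (grad_x f_N, -grad_y f_N) of the min-max problem."""
    x, y = z[:n], z[n:]
    return np.concatenate([grad_x(x, y), -grad_y(x, y)])

def F_N(z, grad_x, grad_y, n, l, u):
    """Natural residual z - mid(l, z - H_N(z), u); 'mid' is the componentwise median,
    i.e. the projection onto the box [l, u]."""
    q = z - H_N(z, grad_x, grad_y, n)
    return z - np.clip(q, l, u)

def r_N(z, *args):
    F = F_N(z, *args)
    return 0.5 * np.dot(F, F)

# Toy illustration (an assumed example): f_N(x, y) = x * y on the box [-1, 1]^2.
n = 1
grad_x = lambda x, y: y
grad_y = lambda x, y: x
l, u = -np.ones(2), np.ones(2)
z = np.array([0.3, -0.7])
print(F_N(z, grad_x, grad_y, n, l, u), r_N(z, grad_x, grad_y, n, l, u))
```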
3.1 Smoothing Approximation
Let \(q(z)=z-H_N(z).\) The function \(F_N\) is not differentiable at z when \(q_i(z) = l_i\) or \(q_i(z) = u_i\) for some \(1\le i\le n+m\). To overcome the difficulty in computation of the generalized Hessian of the nonsmooth function \(F_N(z)\), we consider its smoothing approximation
where \(0<\mu \le {\hat{\mu }}:=\min _{1\le i\le n+m} (u_i - l_i)\).
From (15), the smoothing function \(F_{N,\mu }(z)\) can also be represented as
where \(F^{\mathbb {1}}_{N,\mu }(z)\) and \(F^{\mathbb {2}}_{N,\mu }(z)\) are the smoothing approximations of \(F_{N}^{\mathbb {1}}(z)\) and \(F_{N}^{\mathbb {2}}(z)\), respectively.
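As an illustration, one standard smoothing for box-constrained VIs (see [3, 4]), consistent with the constant \(\kappa =\frac{\sqrt{n+m}}{8}\) used in Sect. 4.1.2, replaces the plus function by its uniform-kernel smoothing and composes it into the mid operator via \(\textrm{mid}(l,q,u)=l+(q-l)_+-(q-u)_+\). The sketch below uses this construction as an assumption; it is not necessarily the exact formula (16).

```python
import numpy as np

def plus_mu(t, mu):
    """Uniform-kernel smoothing of max(t, 0); coincides with the plus function for |t| >= mu/2."""
    return np.where(t <= -mu/2, 0.0,
           np.where(t >= mu/2, t, (t + mu/2)**2 / (2.0*mu)))

def mid_mu(l, q, u, mu):
    """Smoothed componentwise mid(l, q, u) = l + (q-l)_+ - (q-u)_+ with the plus
    function replaced by its smoothing; requires 0 < mu <= min(u - l)."""
    return l + plus_mu(q - l, mu) - plus_mu(q - u, mu)

def F_N_mu(z, H_N, l, u, mu):
    """Smoothing approximation of the natural residual F_N (H_N passed as a function of z)."""
    q = z - H_N(z)
    return z - mid_mu(l, q, u, mu)

# Componentwise error check: |plus_mu(t, mu) - max(t, 0)| <= mu / 8.
mu = 1e-2
t = np.linspace(-1, 1, 100001)
print(np.max(np.abs(plus_mu(t, mu) - np.maximum(t, 0.0))))  # about mu/8 = 1.25e-3
```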
We summarize some useful properties of the smoothing function \(F_{N,\mu }\), which can be found in [4] and [3, Section 6].
Lemma 3
Let \(F_{N,\mu }\) be a smoothing function of \(F_N\) defined in (16). Then for any \(\mu \in (0, {\hat{\mu }})\), \(F_{N,\mu }\) is continuously differentiable and has the following properties.
(i)
There is a \(\kappa >0\) such that for any \(z\in {\mathbb {R}}^{m+n}\) and \(\mu >0\),
$$\begin{aligned} \Vert F_{N,\mu }(z)-F_N(z)\Vert \le \kappa \mu . \end{aligned}$$
(ii)
For any \(z\in {\mathbb {R}}^{m+n}\), we have
$$\begin{aligned} \lim _{\mu \downarrow 0} \textrm{d}(\nabla _{z} F_{N,\mu }(z), \partial _C F_N(z))=0, \end{aligned}$$where \(\partial _CF_N(z)=\partial (F_N(z))_1 \times \partial (F_N(z))_2\times \cdots \times \partial (F_N(z))_{n+m},\) and \(\partial (F_N(z))_i\) is the Clarke generalized gradient of \((F_N(\cdot ))_i\) at z for \(i=1,\ldots , n+m\). Moreover, there exists a \({\bar{\mu }}>0\) such that for any \(\mu \in (0, {\bar{\mu }})\), we have \(\nabla F_{N,\mu }(z)\in \partial _C F_N(z).\)
In Fig. 11 in Appendix A, we show the approximation error \(\Vert F_{N,\mu }(z)-F_{N}(z)\Vert \) over \(X\times Y\) as \(\mu \downarrow 0\) with different N.
Definition 3
(\(\epsilon \)-first-order stationary points) For given \(\epsilon >0\), a point z is called an \(\epsilon \)-first-order stationary point of problem (3), if \(\left\| F_N(z)\right\| \le \epsilon \).
From (i) of Lemma 3, if \(z^*\) is an \(\epsilon \)-first-order stationary point of problem (3), i.e., \(\left\| F_N(z^*)\right\| \le \epsilon \), then for any \(\mu \in (0, \frac{\epsilon }{\kappa })\), we have
which implies \(\Vert F_{N,\mu }(z^*)\Vert \le \Vert F_N(z^*)\Vert + \epsilon \le 2\epsilon .\) On the other hand, if \(z^*\) satisfies \(\Vert F_{N,\mu }(z^*)\Vert \le \frac{\epsilon }{2}\) for some \(\epsilon >0\) and \(\mu \in (0, \frac{\epsilon }{2\kappa })\), then we have
which implies \(\Vert F_N(z^*)\Vert \le \Vert F_{N,\mu }(z^*)\Vert + \frac{\epsilon }{2}\le \epsilon \), that is, \(z^*\) is an \(\epsilon \)-first-order stationary point of problem (3).
Now we consider the smoothing least squares problem with a fixed small smoothing parameter \(\mu >0\):
Let \(J_{N,\mu }(z)\) be the Jacobian matrix of \(F_{N,\mu }(z)\). The gradient of the function \(r_{N,\mu }\) is
A vector \(z^*\) is called a first-order stationary point of problem (19) if \(\nabla r_{N,\mu }(z^*)=0\). If \(J_{N,\mu }(z^*)\) is nonsingular, then \( F_{N,\mu }(z^*)=0\). From (i) of Lemma 3, \(\Vert F_N(z^*)\Vert =\Vert F_N(z^*)-F_{N,\mu }(z^*)\Vert \le \kappa \mu \le \epsilon \) when \(\mu \in (0, \epsilon /\kappa ).\) This means that a first-order stationary point \(z^*\) of problem (19) is an \(\epsilon \)-first-order stationary point of problem (3) if \(\mu \in (0, \epsilon /\kappa )\) and \(J_{N,\mu }(z^*)\) is nonsingular. Note that \(\partial _CF_N(z)\) is a compact set for any \(z\in X\times Y\). From (ii) of Lemma 3, if all matrices in \(\partial _CF_N(z^*)\) are nonsingular, then there is \(\mu _0 >0\) such that for any \(\mu \in (0, \mu _0)\), \(J_{N,\mu }(z^*)\) is nonsingular.
If \(J_{N,\mu }(z^{*})\) is singular, the assumptions of the local convergence theorems in [10, 15] for Gauss-Newton methods to solve the least squares problem (19) fail. In the next subsection, we prove the convergence of the QNSTR algorithm for the least squares problem (19) to a stationary point of (19) without assuming the nonsingularity of \(J_{N,\mu }(z^{*})\).
3.2 The QNSTR Algorithm
In this subsection, we present the QNSTR algorithm with a fixed sample size N and a fixed smoothing parameter \(\mu \). For simplicity, in this subsection, we use F(z), J(z) and r(z) to denote \(F_{N,\mu }(z) \), \(J_{N,\mu }(z)\) and \(r_{N,\mu }(z)\), respectively. Moreover, we use \(F^{\mathbb {1}}(z)\), \(F^{\mathbb {2}}(z)\) to represent \(F^{\mathbb {1}}_{N,\mu }(z)\) and \(F^{\mathbb {2}}_{N,\mu }(z)\), respectively.
For an arbitrary point \(z_0\in X\times Y\) and a positive number \(R_0\), we denote the level set \({\bar{S}}:=\{z\in {\mathbb {R}}^{n+m}\, | \, r(z)\le r(z_0)\}\) and define a set \(S(R_{0}):=\{z\in {\mathbb {R}}^{n+m} \, |\, \Vert z-z'\Vert \le R_{0},\forall z'\in {\bar{S}}\}\). By the definition of F and the boundedness of X and Y, we have \(r(z)=\frac{1}{2}\Vert F(z)\Vert ^2 \rightarrow \infty \) if \(\Vert z\Vert \rightarrow \infty \). Hence both \({\bar{S}}\) and \(S(R_0)\) are bounded.
Let \(J_{1}(z)=\nabla F^{\mathbb {1}}(z)\) and \(J_{2}(z)=\nabla F^{\mathbb {2}}(z)\). Then from
if \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are twice differentiable at z, the Hessian matrix
can be written as
If \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are not twice differentiable at z, the generalized Hessian of r at z, denoted by \(\partial (\nabla r(z))\), is the convex hull of all \((m+n)\times (m+n)\) matrices obtained as the limit of a sequence of the form \(\nabla ^{2}r(z^k)\), where \(z^k \rightarrow z\) and \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are twice differentiable at \(z^k\) [5]. Hence from (20)–(21) and the twice continuous differentiability of \({\hat{f}}_N\), we know that there is a positive number \(M_1\) such that \(\Vert H\Vert \le M_1\) for any \(H\in \partial (\nabla r(z))\), \(z\in S(R_0)\). Moreover from [5, Proposition 2.6.5], there is a positive number \(M_2\) such that
To give a globally convergent algorithm for problem (19) without using the second derivatives, we keep the terms \(J_{1}(z)^{\top }J_{1}(z)\) and \(J_{2}(z)^{\top }J_{2}(z)\) in (20)–(21), and approximate
Specifically, the Hessian matrix at the k-th iteration point \(z_{k}\) is approximated by \(H_k\) with
where
Here the matrices \(B_{k}\) and \(C_k\) are computed by the truncated BFGS quasi-Newton formula as follows.
where
and
Here \({\bar{\epsilon }}\) and \(\gamma \) are given positive parameters, \(s^{\mathbb {1}}_k=x_{k+1}-x_k\), \(s^{\mathbb {2}}_{k}=y_{k+1}-y_{k}\),
Notice that the approximation form (23) was proposed in [47]. However, the matrix \(A_k\) in [47] is defined by using F(z) and \(\nabla F(z)\) at \(z_{k+1}\) and \(z_k\). In this paper, based on the structure of the VI in (17), we use \(F^{\mathbb {1}}(z)\), \(F^{\mathbb {2}}(z)\), \(\nabla F^{\mathbb {1}}(z)\) and \(\nabla F^{\mathbb {2}}(z)\) at \(z_{k+1}\) and \(z_k\) to define a two-block diagonal matrix \(A_{k}\) in (23).
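As a rough illustration of a truncated BFGS update of the type used for \(B_k\) and \(C_k\), the sketch below keeps the previous matrix whenever a curvature safeguard fails or the updated matrix exceeds a norm cap. The particular safeguard, and the way \({\bar{\epsilon }}\) and \(\gamma \) enter it, are assumptions for this example rather than the exact rules (24)–(27).

```python
import numpy as np

def truncated_bfgs(B, s, yv, eps_bar=1e-4, gamma=1e3):
    """One BFGS-type update of B with the pair (s, yv), skipped ("truncated") when a
    curvature safeguard fails or the update would exceed a norm cap.  Both the
    safeguard s^T yv >= eps_bar * ||s|| * ||yv|| and the cap gamma are assumptions
    standing in for the exact rules (24)-(27)."""
    ns, ny = np.linalg.norm(s), np.linalg.norm(yv)
    sy = float(s @ yv)
    if ns == 0.0 or sy < eps_bar * ns * ny:
        return B                                  # skip: keep the previous matrix
    Bs = B @ s
    B_new = B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(yv, yv) / sy
    if np.linalg.norm(B_new, 2) > gamma:
        return B                                  # skip: keep the matrix bounded
    return B_new

# B_k acts on the x-block and C_k on the y-block, and A_k = blockdiag(B_k, C_k)
# is added to J_k^T J_k as in (23).
```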
In [47], a backtracking line search is used to obtain a stationary point of the least squares problem. In this paper, we use a subspace trust-region method to solve problem (19) with global convergence guarantees. Compared with the quasi-Newton method with backtracking line search in [47], the QNSTR algorithm solves a strongly convex quadratic subproblem in a low-dimensional subspace at each step, which makes it efficient for solving large-scale min-max optimization problems with real data. See Sect. 4 for more details.
In what follows, for simplification, we use \(J_{k}\) and \(F_{k}\) to denote \(J(z_{k})\), \(F(z_{k})\), respectively.
Let \(g_{k}=\nabla r(z_{k})\). Choose \(\{d_k^1, \cdots , d_k^{L-1}\}\) such that \(V_{k}:=\left[ \begin{matrix} -g_{k}&d_{k}^{1}&\cdots&d_{k}^{L-1} \end{matrix}\right] \in {\mathbb {R}}^{(n+m)\times L}\) has L linearly independent column vectors. Let
Then, to obtain the stepsize \(\alpha _{k}\) at the iteration point \(z_k\), we solve the following strongly convex quadratic program in an L-dimensional space:
where \(\varDelta _{k}>0\) is the trust-region radius.
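Since L is small, subproblem (28) can be solved essentially exactly. The sketch below assumes (28) takes the form \(\min _{\alpha } c_k^{\top }\alpha +\frac{1}{2}\alpha ^{\top }Q_k\alpha \) subject to \(\alpha ^{\top }G_k\alpha \le \varDelta _k^2\), with \(c_k=V_k^{\top }g_k\), \(Q_k=V_k^{\top }H_kV_k\) and \(G_k=V_k^{\top }V_k\) (so the constraint is \(\Vert V_k\alpha \Vert \le \varDelta _k\)); it removes the ellipse by a Cholesky change of variables and locates the multiplier by bisection.

```python
import numpy as np

def solve_subproblem(c, Q, G, Delta, tol=1e-12):
    """Solve  min  c^T a + 0.5 a^T Q a   s.t.  a^T G a <= Delta^2,
    with Q symmetric positive definite and G = V^T V positive definite.
    The change of variables b = R a (G = R^T R) turns the ellipse into a ball,
    then the multiplier is located by bisection."""
    R = np.linalg.cholesky(G).T                  # G = R^T R, R upper triangular
    Rinv = np.linalg.inv(R)
    Qt = Rinv.T @ Q @ Rinv                       # transformed Hessian
    ct = Rinv.T @ c                              # transformed gradient
    b = np.linalg.solve(Qt, -ct)                 # unconstrained minimizer
    if np.linalg.norm(b) <= Delta:
        return Rinv @ b
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(Qt + hi*np.eye(len(c)), -ct)) > Delta:
        hi *= 2.0                                # bracket the multiplier
    while hi - lo > tol * (1.0 + hi):
        lam = 0.5 * (lo + hi)
        b = np.linalg.solve(Qt + lam*np.eye(len(c)), -ct)
        if np.linalg.norm(b) > Delta:
            lo = lam
        else:
            hi = lam
    b = np.linalg.solve(Qt + hi*np.eye(len(c)), -ct)
    return Rinv @ b                              # map back to the alpha variables
```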
A key question for solving problem (28) is how to compute \(Q_{k}\) efficiently when \(H_{k}\) is huge. In fact, \(Q_k\) can be calculated efficiently without computing and storing the full information \(H_k\). From (23), we can write \(Q_{k}\) as
For the term \(V_{k}^{\top }J_{k}^{\top }J_{k}V_{k}\) in (29), we compute \(J_{k}V_{k}\) in a columnwise way: For a sufficiently small \(\epsilon >0\),
On the other hand, \(A_{k}V_{k}\) in the term \(V_{k}^{\top }A_{k}V_{k}\) can also be computed columnwise by a series of vector-vector products.
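As an illustration of this columnwise assembly, the sketch below forms \(Q_k=V_k^{\top }J_k^{\top }J_kV_k+V_k^{\top }A_kV_k\) without ever storing \(J_k\) or \(A_k\). The forward-difference approximation \(J_kv\approx (F(z_k+\epsilon v)-F(z_k))/\epsilon \) is one natural choice for this evaluation and is used here as an assumption, and \(A_kV_k\) is applied block by block through \(B_k\) and \(C_k\).

```python
import numpy as np

def build_Q(F, z, V, Bk, Ck, n, eps=1e-6):
    """Assemble Q = V^T J^T J V + V^T A V without forming J or A = blockdiag(Bk, Ck).
    J V is approximated column by column with forward differences (an assumed formula);
    Bk is n-by-n (x-block) and Ck acts on the remaining y-block."""
    Fz = F(z)
    JV = np.column_stack([(F(z + eps * V[:, j]) - Fz) / eps
                          for j in range(V.shape[1])])
    AV = np.vstack([Bk @ V[:n, :], Ck @ V[n:, :]])     # A_k V_k, block by block
    Q = JV.T @ JV + V.T @ AV
    return 0.5 * (Q + Q.T)                             # symmetrize against rounding
```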
We give the QNSTR algorithm in Algorithm 1.
Trust-region algorithms are a class of popular numerical methods for optimization problems [6, 43]. Our QNSTR algorithm uses the special structure of the VI in (15) and (17) to construct the subproblem (28). The global convergence of the QNSTR algorithm is given in the following theorem.
Theorem 2
Suppose that X and Y are nonempty and bounded boxes and \({\hat{f}}_N\) is twice continuously differentiable. Let \(\{z_{k}\}_{k=0}^{\infty }\) be generated by Algorithm 1. Then there exists an \(M>0\) such that \(\left\| \nabla r(z) - \nabla r(z')\right\| \le M \left\| z - z'\right\| \) for any \(z,z'\in S(R_0)\) and \(\left\| H_k\right\| \le M\) for \(k\in {\mathbb {N}}\). Moreover, we have \(\lim _{k\rightarrow \infty }\Vert g_{k}\Vert =0.\)
The proof is given in Appendix B.
To end this section, we give some remarks about Theorem 2.
Remark 2
From Corollary 2.2.5 in [12], the continuity of F and the boundedness of X and Y imply that the solution set \({{{\mathcal {Z}}}}^*\) of \(F(z)=0\) is nonempty and bounded. Hence the set of minimizers of the least squares problem \(\min r(z)\) with optimal value zero coincides with the solution set of \(F(z)=0\). If all matrices in \(\partial _C F_N(z)\) for \(z\in {{{\mathcal {Z}}}}^*\) are nonsingular, we know from the compactness of \(\partial _C F_N(z)\) and (ii) of Lemma 3 that there is a \(\mu _0 >0\) such that for any \(\mu \in (0, \mu _0)\), J(z) is nonsingular and \(\sup _{\mu \in (0,\mu _0)}\left\| J(z)^{-1}\right\| \le C\) for some \(C>0\). Thus,
We have from (i) of Lemma 3, i.e., \(\left\| F_N(z) - F(z) \right\| \le \kappa \mu \) that
Thus, for any \(\epsilon >0\), we can properly select parameters \(\delta \) and \(\mu \) such that \(\left\| \nabla r(z)\right\| \le \delta \) and \(C \delta + \kappa \mu \le \epsilon \).
According to Theorem 2, we can find a point \(z_k\) such that
with a given stopping criterion parameter \(\delta \). The numerical experiments in the next section show that the QNSTR algorithm can result in an \(\epsilon \)-first-order stationary point of problem (3).
4 Numerical Experiments
In this section, we report some numerical results obtained via the QNSTR algorithm for finding an \(\epsilon \)-first-order stationary point of problem (3). Also, we compare the QNSTR algorithm with some state-of-the-art algorithms for min-max problems. All of the numerical experiments in this paper are implemented with TensorFlow 1.13.1, Python 3.6.9 and Cuda 10.0 on a server with one Tesla P100-PCIE GPU with 16 GB memory at 1.3285 GHz and a 64-bit operating system in the University Research Facility in Big Data Analytics (UBDA) of The Hong Kong Polytechnic University (UBDA website: https://www.polyu.edu.hk/ubda/).
We test our algorithm on two practical problems. One is a GAN-based image generation problem for MNIST handwriting data. The other is a mixed model for image segmentation on Digital Retinal Images for Vessel Extraction (DRIVE) data. In the first experiment, we use the notation QNSTR(L) to denote that the dimension of the subspace spanned by the columns of \(V_k\) is L in the QNSTR algorithm, and we test the efficiency of QNSTR(L) under different choices of L and directions \(\{d^{i}_{k}\}_{i=1}^{L-1}\). In the second experiment, we apply the QNSTR algorithm to a medical image segmentation problem.
To ensure that \(H_N\) is continuously differentiable, we choose the Gaussian Error Linear Unit (GELU) [16]
with \(\varSigma =10^{-4}\) as the activation function in each hidden layer in D and G, and Sigmoid
as the activation function of the output layers in D and G.
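For reference, the standard forms of these two activations are \(\textrm{GELU}(t)=t\varPhi (t)\), with \(\varPhi \) the standard normal distribution function, and \(\textrm{Sigmoid}(t)=1/(1+e^{-t})\); both are smooth, which is what makes \(H_N\) continuously differentiable. The sketch below gives only these standard forms and does not attempt to reproduce the \(\varSigma \)-dependent GELU variant used above.

```python
import numpy as np
from scipy.special import erf

def gelu(t):
    """Standard GELU: t * Phi(t), with Phi the standard normal CDF (smooth everywhere)."""
    return 0.5 * t * (1.0 + erf(t / np.sqrt(2.0)))

def sigmoid(t):
    """Logistic sigmoid, used at the output layers of D and G."""
    return 1.0 / (1.0 + np.exp(-t))
```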
We test different choices of the subspace spanned by the columns of \(V_k\). Specifically, denote
for \(L\ge 2.\) Among all the experiments in the sequel, the initial point \(z_{0}\) is the result for running 10,000 steps of the alternating Adam with step size 0.0005. The parameters in Algorithm 1 are set as \({\bar{\varDelta }}=100\), \(\varDelta _0=1\), \(\beta _1=0.5\), \(\beta _2=2\), \(\eta =0.01\), \(\zeta _{1}=0.02\), \(\zeta _{2}=0.05\). The parameters \({\bar{\epsilon }}\) and \(\gamma \) in (24) and (25) are chosen as \({\bar{\epsilon }}=10^{-4}\) and \(\gamma =10^3.\)
4.1 Numerical Performance on MNIST Data
In this subsection, we report some preliminary numerical results of the QNSTR algorithm for solving the following SAA counterpart
of problem (2) by using MNIST handwriting data. We consider a two-layer GAN, where
with different choices of dimensions \(N^1_G\) and \(N^1_D\) for the hidden outputs. Here \(\{\xi ^i_2\}\) are generated from the uniform distribution \({{{\mathcal {U}}}} (-1, 1)^{100}\). All initial weight matrices \(W^{i}_{G}\) and \(W^{i}_{D}\) for \(i=1,2\) are randomly generated from the Gaussian distribution with mean 0 and standard deviation 0.1, and all initial bias vectors \(b^{i}_{G}\) and \(b^{i}_{D}\) for \(i=1,2\) are set to be zero. X and Y are set as \([-1,1]^{n}\) and \([-1,1]^{m}\), respectively.
4.1.1 Numerical Results for SAA Problems
We first study the performance of SAA problems under different sample sizes on a GAN model with a two-layer generator with \(N_{G}^{1}=64\) and a two-layer discriminator with \(N_{D}^{1}=64\). We set \({\hat{N}}=10000\) as the benchmark to approximate the original problem (2) and let \(N=100, 500, 1000, 2000, 5000\). For each N, we solve \(F_N(z)=0\) 50 times with different samples by using the QNSTR algorithm. We stop the iteration either
or the number of iterations exceeds 5000.
In these experiments, we set \(\mu =10^{-8}\). We use \(z^{*}_{N}\) to denote the first point that satisfies (31) in the iteration for \(N=500, 1000, 2000, 5000\), and we measure its optimality by the residual
Figure 1 shows the convergence of \(\textrm{res}_{N}\) to zero as N grows. Table 1 presents the mean, the standard deviation (std) and the width of the 95% CI of \(\textrm{res}_{N}\). It shows that all values decrease as the sample size N increases. Both Fig. 1 and Table 1 validate the convergence results in Sect. 2.
4.1.2 Numerical Results for Smoothing Approximation
In this subsection, for a fixed sample size \(N=2000\), we study how the smoothing parameter \(\mu \) affects the residual \(\Vert F_{N,\mu }(z)\Vert \). All numerical results in this part are based on a GAN model which consists of a two-layer generator with \(N_{G}^{1}=64\) and a two-layer discriminator with \(N_{D}^{1}=64\). Specifically, for \(\mu =10^{-t}, t=1, 2, 4, 5, 6, 8\), we generate 50 test problems, respectively. For each \(\mu \), we solve the problem \(F_{N,\mu }(z)=0\) by the QNSTR algorithm. We stop the iteration when either the condition \(\Vert F_{N,\mu }(z_k)\Vert \le 10^{-5}\) holds or the number of iterations exceeds 5000.
We use \(z^{*}_{\mu }\) to denote the first point that satisfies (31) in the iteration, and we measure the residual of \(z_{\mu }^{*}\) by
The numerical results are presented in Fig. 2, which shows that \(\textrm{res}_{\mu }\) decreases as smoothing parameter \(\mu \) decreases. In fact, the residual \(\textrm{res}_{\mu }\) becomes stable when \(\mu \le 10^{-5}\). Note that it is not difficult to obtain \(\kappa =\frac{\sqrt{n+m}}{8}\) in (18). Also, when \(N_{D}^{1}=N_{G}^{1}=64\), we have \(n+m=107729\). If \(\Vert F_{N,\mu }(z)\Vert \le 10^{-5}\) and \(\mu =10^{-8}\), then we have \(\Vert F_{N}(z)\Vert \le \kappa \mu + \epsilon \le \frac{\sqrt{107729}}{8}\times 10^{-8} +10^{-5}\approx 1.041\times 10^{-5}\). This is consistent with the theoretical results in Sect. 3.
4.1.3 Comparison Experiments
In this subsection, we report some numerical results to compare the QNSTR algorithm with some commonly used methods. To this end, we set \(N=2000\) and \(\mu =10^{-8}\). We use \(\Vert F_{N,\mu }(z_{k})\Vert \) to measure the performance of these algorithms and apply the Fréchet Inception Distance (FID) score to measure the quality of the images generated by the generator G trained by different algorithms. We terminate all these algorithms when one of the following three cases holds: \(\Vert F_{N,\mu }(z_{k})\Vert \le 10^{-5}\), \(\Vert \nabla F_{N,\mu }(z_{k})^\top F_{N,\mu }(z_{k})\Vert \le 10^{-8}\), or the number of iterations exceeds 5000.
We present the numerical results in Figs. 3, 4, 5. It is not difficult to observe from these figures that the QNSTR algorithm outperforms the compared methods.
We also compare the QNSTR algorithm with some commonly used algorithms for training GANs, including simultaneous and alternate gradient descent-ascent (GDA) [9, 45], simultaneous and alternate optimistic gradient descent-ascent (OGDA) [9, 45], the \(\gamma \)-alternate Adam, and the proximal point algorithm (PPA) [11, 24]. The initial point \(z_{0}\) of all these methods is given by the alternating Adam with stepsize 0.0005 and 10000 iterations. We use a grid search for the selection of hyper-parameters in these methods. For simultaneous and alternate GDA and simultaneous and alternate OGDA, the stepsize is \(\alpha = 0.5, 0.05, 0.005, 0.001, 0.0005\). For the \(\gamma \)-alternate Adam, the ratio is \(\gamma = 1, 2, 3, 5, 10\). For PPA, the proximal parameter is \({\overline{L}}= 10, 100, 500, 1000, 5000\) and the stepsize is \(\frac{0.1}{2{\overline{L}}}\). Besides, the stopping criterion of the k-th inner loop is \(\frac{0.01}{k^2}\) or the number of iterations exceeding 100. The comparison results between the QNSTR algorithm and the \(\gamma \)-alternate Adam, and between the QNSTR algorithm and PPA, are given in Figs. 6 and 7, respectively. According to the results, the \(\gamma \)-alternate Adam also shows an outstanding convergence tendency under suitable hyper-parameters, and the QNSTR algorithm can perform even better than the \(\gamma \)-alternate Adam if a suitable search space \(V_{k}\) is selected. The PPA performs poorly in these comparisons.
The comparison results between the QNSTR algorithm and the simultaneous and alternate GDA and OGDA methods are given in Fig. 8. To show these comparison results more clearly, we only give the results of each method with the optimal stepsize \(\alpha \) in the search range.
We can observe from these figures that the QNSTR algorithm outperforms the other methods, which validates that the QNSTR algorithm is more efficient in finding an \(\epsilon \)-first-order stationary point of problem (30).
We also record the final FID score of each algorithm's output. All results are given in Tables 2 and 3. They show that the generator of GANs trained by the QNSTR algorithm can generate high-quality images.
4.2 DRIVE Data
Image segmentation, the process of partitioning a digital image into multiple image segments [38], is an important component in many visual understanding systems. It plays a central role in a broad range of applications [13], including medical image analysis, autonomous vehicles (e.g., navigable surface and pedestrian detection), video surveillance and augmented reality. One well-known paradigm for image segmentation is based on manually designed loss functions. However, such losses usually lead to blurry segmentation boundaries [17].
In 2016, Isola et al. [17] introduced a generative adversarial network framework into their objective function for an image-to-image translation problem, and found that the blurry output of a CNN trained under the \(l_{1}\) norm can be reduced. In the same year, Luc et al. [25] introduced a mixed objective function combining a GAN loss and a cross-entropy loss for semantic image segmentation, and also achieved better performance. A similar idea of mixing a GAN loss with a traditional loss can also be found in [37, 46].
The mixed model has the following form
where X, Y are two bounded boxes and \(\{(\xi ^{i}_{1},\xi ^{i}_{2})\}_{i=1}^{N}\) is the collected data, in which \(\xi ^{i}_{2}\) is the original image and \(\xi ^{i}_{1}\) is the corresponding label. Problem (32) is a special case of problem (3), which can be viewed as a discrete generative adversarial problem (30) with an extra classical supervision term. The classical supervision part, i.e.,
minimizes the difference between the output of G for a given \(\xi _{2}\) and its corresponding label \(\xi _{1}\). The model can be regarded as a combination of a classical supervised learning problem and a generative adversarial problem with a trade-off parameter \(\lambda \in [0,\infty )\). When \(\lambda =0\), problem (32) reduces to the classical supervised learning problem (33). When \(\lambda \rightarrow \infty \), problem (32) tends to a vanilla GAN.
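The sketch below evaluates such a mixed objective, treating (32) as a per-pixel cross-entropy supervision term plus \(\lambda \) times the empirical GAN term of (30). This arrangement, and the choice of cross-entropy as the supervision loss, are our assumptions chosen to be consistent with the \(\lambda =0\) and \(\lambda \rightarrow \infty \) limits described above; G and D are placeholders for the parameterized networks.

```python
import numpy as np

def mixed_objective(x, y, data, G, D, lam, eps=1e-12):
    """Empirical mixed objective: per-sample cross-entropy between G(xi2; x) and the
    label xi1, plus lam times the GAN value function.  Treating (32) as
    L_sup + lam * L_GAN is an assumption made for this illustration."""
    sup, gan = 0.0, 0.0
    for xi1, xi2 in data:                       # xi1: label image, xi2: input image
        pred = G(xi2, x)                        # per-pixel probabilities in (0, 1)
        sup += -np.mean(xi1 * np.log(pred + eps) + (1 - xi1) * np.log(1 - pred + eps))
        gan += np.log(D(xi1, y) + eps) + np.log(1 - D(pred, y) + eps)
    N = len(data)
    return sup / N + lam * gan / N              # minimized in x, maximized in y
```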
The fundoscopic exam is an important procedure that provides information for diagnosing retinal degenerative diseases such as Diabetic Retinopathy, Macular Edema and Cytomegalovirus Retinitis. A highly accurate system to sketch out the blood vessels and find abnormalities on fundoscopic images is necessary. Although supervised deep learning frameworks such as U-net are able to segment macro vessels accurately, they fail to segment microvessels with high certainty. In this part, we train a U-net as the generator in our framework on the DRIVE data. The data set, which includes 20 eye blood vessel images with manual segmentation labels, is downloaded from the open source website (https://drive.grand-challenge.org/). We use 16 images as training data and the other 4 as testing data. In this experiment, the structure of the segmentation model G is the U-net [34], which includes 18 layers with \(n=121435\) parameters, and the structure of the discrimination model D is a deep convolutional neural network which contains 5 convolutional layers and 1 fully connected layer with \(m=142625\) parameters. The feasible sets X and Y are set as \([-5,5]^{n}\) and \([-5,5]^{m}\), respectively. We use the GELU activation function except for the Sigmoid at the output layers of D and G. We compare our results based on problem (32) with some existing models. In our experiment, we use \(\lambda =10\) and \(V^z_k\) with \(L=4\).
We compute traditional metrics such as the F1-score, Sensitivity, Specificity and Accuracy. The forms of these metrics are given as follows:
where \(\varOmega \) is the index set of all pixels in an image, \(\textrm{GT}_{i}\) is the set of ground truth vessel pixel indices for the i-th image, and \(\textrm{SR}\) is the set of pixel indices labelled as vessel in the i-th image's segmentation result. Furthermore, we compute the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) [2] and the Structural Similarity Index Measure (SSIM) [40].
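These metrics follow the standard definitions based on true/false positives and negatives. A sketch computing them from boolean vessel masks (the masks playing the roles of \(\textrm{GT}_{i}\) and \(\textrm{SR}\)):

```python
import numpy as np

def segmentation_metrics(gt, sr):
    """F1, sensitivity, specificity and accuracy from boolean vessel masks.
    gt: ground-truth vessel pixels, sr: pixels labelled as vessel by the model."""
    gt = gt.astype(bool); sr = sr.astype(bool)
    tp = np.sum(gt & sr); tn = np.sum(~gt & ~sr)
    fp = np.sum(~gt & sr); fn = np.sum(gt & ~sr)
    f1 = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / gt.size        # gt.size is the number of pixels |Omega|
    return f1, sensitivity, specificity, accuracy
```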
Table 4 shows that the QNSTR algorithm for solving problem (32) is more promising for blood vessel segmentation. In Fig. 9, we visualize the error between the vessel map generated by solving problem (32) with the QNSTR algorithm and the manual segmentation. In Fig. 10, we compare the segmentation results of problem (32) based on the QNSTR algorithm and the alternating Adam approach, respectively.
5 Conclusion
In this paper, we propose a new QNSTR algorithm for solving the large-scale min-max optimization problem (3) via the nonmonotone VI (4). Based on the structure of the problem, we use a smoothing function \(F_{N,\mu }\) to approximate the nonsmooth function \(F_N\), and consider the smoothing least squares problem (19). We adopt an adaptive quasi-Newton formula from [47] to approximate the Hessian matrix and solve a strongly convex quadratic program with ellipse constraints in a low-dimensional subspace at each step of the QNSTR algorithm. We prove the global convergence of the QNSTR algorithm to a stationary point of the least squares problem, which is an \(\epsilon \)-first-order stationary point of the min-max optimization problem if every element of the generalized Jacobian of \(F_N\) is nonsingular at the point. In our numerical experiments, we test the QNSTR algorithm by using two real data sets: MNIST data and DRIVE data. Preliminary numerical results validate that the QNSTR algorithm outperforms some existing algorithms.
Data Availability
All data sets used in Sect. 4.1 were downloaded from the website https://www.tensorflow.org/datasets/catalog/mnist, and all data sets used in Sect. 4.2 were downloaded from the website https://drive.grand-challenge.org/.
References
Alom, M.Z., Hasan, M., Yakopcic, C., Taha, T.M., Asari, V.K.: Recurrent residual convolutional neural network based on U-net (R2U-Net) for medical image segmentation. arXiv preprint arXiv:1802.06955 (2018)
Bradley, A.P.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 30, 1145–1159 (1997)
Chen, X.: Superlinear convergence of smoothing quasi-Newton methods for nonsmooth equations. J. Comput. Appl. Math. 80, 105–126 (1997)
Chen, X., Qi, L., Sun, D.: Global and superlinear convergence of the smoothing Newton method and its application to general box constrained variational inequalities. Math. Comput. 67, 519–540 (1998)
Clarke, F.H.: Optimization and Nonsmooth Analysis. Wiley, New York (1983)
Conn, A.R., Gould, N.I., Toint, P.L.: Trust Region Methods. SIAM, Philadelphia (2000)
Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., Song, L.: SBEED: convergent reinforcement learning with nonlinear function approximation. In: Proceedings of the International Conference on Machine Learning, pp. 1125–1134. PMLR (2018)
Dai, Y.H., Zhang, L.: Optimality conditions for constrained minimax optimization. CSIAM Trans. Appl. Math. 1, 296–315 (2020)
Daskalakis, C., Ilyas, A., Syrgkanis, V., Zeng, H.: Training GANs with optimism. arXiv preprint arXiv:1711.00141 (2017)
Dennis, J.E., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ (1983)
Diakonikolas, J., Daskalakis, C., Jordan, M.I.: Efficient methods for structured nonconvex-nonconcave min-max optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 2746–2754. PMLR (2021)
Facchinei, F., Pang, J.S.: Finite-dimensional Variational Inequalities and Complementarity Problems. Springer, New York (2003)
Forsyth, D.A., Ponce, J.: Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, Upper Saddle River (2002)
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Adv. Neural Inform. Process. Syst. 27 (2014)
Gratton, S., Lawless, A.S., Nichols, N.K.: Approximate Gauss-Newton methods for nonlinear least squares problems. SIAM J. Optim. 18, 106–132 (2007)
Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
Jiang, J., Chen, X.: Optimality conditions for nonsmooth nonconvex-nonconcave min-max problems and generative adversarial networks. SIAM J. Math. Data Sci. 5, 693–722 (2023)
Jin, C., Netrapalli, P., Jordan, M.: What is local optimality in nonconvex-nonconcave minimax optimization? In: Proceedings of International Conference on Machine Learning, pp. 4880–4889. PMLR (2020)
Jin, Q., Meng, Z., Pham, T.D., Chen, Q., Wei, L., Su, R.: DUNet: a deformable network for retinal vessel segmentation. Knowl.-Based Syst. 178, 149–162 (2019)
Li, L., Verma, M., Nakashima, Y., Nagahara, H., Kawasaki, R.: Iternet: Retinal image segmentation utilizing structural redundancy in vessel networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3656–3665 (2020)
Lin, T., Jin, C., Jordan, M.: On gradient descent ascent for nonconvex-concave minimax problems. In: Proceedings of the International Conference on Machine Learning, pp. 6083–6093. PMLR (2020)
Lin, T., Jin, C., Jordan, M.I.: Near-optimal algorithms for minimax optimization. In: Proceedings of the Conference on Learning Theory, pp. 2738–2779. PMLR (2020)
Liu, M., Rafique, H., Lin, Q., Yang, T.: First-order convergence theory for weakly-convex-weakly-concave min-max problems. J. Mach. Learn. Res. 22, 7651–7684 (2021)
Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408 (2016)
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
Mohajerin Esfahani, P., Kuhn, D.: Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Math. Program. 171, 115–166 (2018)
Monteiro, R.D., Svaiter, B.F.: Complexity of variants of Tseng’s modified FB splitting and Korpelevich’s methods for hemivariational inequalities with applications to saddle-point and convex optimization problems. SIAM J. Optim. 21, 1688–1720 (2011)
Myerson, R.B.: Game Theory: Analysis of Conflict. Harvard University Press, Cambridge (1991)
Nemirovski, A.: Prox-method with rate of convergence O (1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim. 15, 229–251 (2004)
Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109, 319–344 (2007)
Rabbat, M., Nowak, R.: Distributed optimization in sensor networks. In: Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks, pp. 20–27 (2004)
Rafique, H., Liu, M., Lin, Q., Yang, T.: Weakly-convex-concave min-max optimization: provable algorithms and applications in machine learning. Optim. Method Softw. 37, 1087–1121 (2022)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, pp. 234–241. Springer (2015)
Shapiro, A.: Monte Carlo sampling methods. In: Ruszczyński, A., Shapiro, A. (eds.) Handbooks in Operations Research and Management Science, pp. 353–425. Elsevier, Amsterdam (2003)
Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM, Philadelphia (2021)
Son, J., Park, S.J., Jung, K.H.: Towards accurate segmentation of retinal vessels and the optic disc in fundoscopic images with generative adversarial networks. J. Digit. Imaging 32, 499–512 (2019)
Szeliski, R.: Computer Vision: Algorithms and Applications. Springer Nature, Switzerland (2022)
Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. submitted to SIAM J. Optim. 2 (2008)
Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 600–612 (2004)
Xu, H.: Uniform exponential convergence of sample average random functions under general sampling with applications in stochastic programming. J. Math. Anal. Appl. 368, 692–710 (2010)
Yang, J., Kiyavash, N., He, N.: Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Adv. Neural. Inf. Process. Syst. 33, 1153–1165 (2020)
Yuan, Y.: Recent advances in trust region algorithms. Math. Program. 151, 249–281 (2015)
Zhang, C., Ge, D., Jiang, B., Ye, Y.: DRSOM: A dimension reduced second-order method and preliminary analyses. arXiv preprint arXiv:2208.00208 (2022)
Zhang, G., Wang, Y., Lessard, L., Grosse, R.B.: Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 7659–7679. PMLR (2022)
Zhang, Y., Yang, L., Chen, J., Fredericksen, M., Hughes, D.P., Chen, D.Z.: Deep adversarial networks for biomedical image segmentation utilizing unannotated images. In: Proceedings of the Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, pp. 408–416. Springer (2017)
Zhou, W., Chen, X.: Global convergence of a new hybrid Gauss-Newton structured BFGS method for nonlinear least squares problems. SIAM J. Optim. 20, 2422–2441 (2010)
Acknowledgements
We would like to thank two referees for their very helpful comments. We would like to thank Prof. Yinyu Ye for his suggestion to add more search directions in the Dimension Reduced Second-Order Method proposed in [44].
Funding
Open access funding provided by The Hong Kong Polytechnic University. This work is supported by CAS AMSS-PolyU Joint Laboratory of Applied Mathematics, University Research Facility in Big Data Analytics, PolyU and the Hong Kong Research Grant Council PolyU15300021, C5036-21E, N-PolyU507/22.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
Appendix A: Smoothing Approximation of \(F_{N}(z)\)
We consider problem (30) with a two-layer discriminator and a two-layer generator using MNIST handwriting data. All notations are the same as those in Sect. 4.1. Set \(X=[-5,5]\), \(Y=[-5,5]\), \(N_{G}^{1}=N_{D}^{1}=64\) and \(N_{G}^{1}=N_{D}^{1}=128\), \(\mu =10^{-t}, t=0,1,\dots ,6\) and \(N=1000, 2000, 10000\), respectively. Based on the uniform distribution over \([-10,10]^{n+m}\), we generate 1000 points \(z^i\in [-10,10]^{n+m}\), \(i=1, \cdots , 1000\). Denote the “approximation error” by
In Fig. 11, we plot the average of the “approximation error” under different choices of \(\mu \) and N over 20 sets of 1000 points in \([-10,10]^{n+m}\). From the figure, we can observe that for each \(N=1000, 2000, 10000\), the “approximation error” converges to zero as \(\mu \) tends to zero.
Appendix B: Proof of Theorem 2
As we discussed in subsection 3.2, there is a positive number \(M_1\) such that \(\Vert H\Vert \le M_1\) for any \(H\in \partial (\nabla r(z))\), \(z\in S(R_0)\). Moreover there is a positive number \(M_2\) such that \(\left\| \nabla r(z) - \nabla r(z')\right\| \le M_2 \left\| z - z'\right\| , \,\, \forall z, z' \in S(R_0)\).
Since \(z_k\) is updated only when \(\rho _k>\eta \) in Algorithm 1, we have \(z_{k}\in S(R_{0})\). From the continuous differentiability of r over \(S(R_{0})\), there is a positive number \(M_3\) such that \(\Vert J_{k}\Vert \le M_3\) for \(k \in {\mathbb {N}}\). Moreover, from (23), \(\Vert A_{k}\Vert =\max \{\Vert B_{k}\Vert , \Vert C_{k}\Vert \}\le \max \{\gamma , \Vert F(z_0)\Vert \}\). Hence \(\left\| H_k\right\| \le M_{3}+\max \{\gamma , \Vert F(z_0)\Vert \}\). Let \(M\ge \max \{M_2, M_3+\gamma , M_3+\Vert F(z_0)\Vert \}\). Then we have \(\left\| \nabla r(z) - \nabla r(z')\right\| \le M \left\| z - z'\right\| \) for any \(z,z'\in S(R_0)\) and \(\left\| H_k\right\| \le M\) for \(k\in {\mathbb {N}}\).
In the rest of Appendix B, we prove \(\lim _{k\rightarrow \infty } \Vert g_k\Vert =0\).
If \(g_k=0\) for some \(k>0\), then Algorithm 1 terminates and Theorem 2 holds. In the remainder, we only consider the case that \(g_k\ne 0\).
We next consider the following one-dimensional problem:
where \(\alpha ^{s}_{k}\) is an optimal solution of
Let \(\tau _k\) denote an arbitrary optimal solution of problem (34). Then \(\alpha ^{C}_{k}:=\tau _k \alpha ^{s}_{k}\) is a feasible solution of problem (28).
In what follows, we give the closed form of \(\alpha ^{C}_{k}\) step by step. For this purpose, we consider the KKT condition of problem (35) as follows:
where \(\lambda \) is a multiplier. Since \(g_k\ne 0\) and \(V_k\) is of full column rank, we have \(c_{k}\ne 0\). Thus, \(\lambda >0\), and the KKT system gives
Then we obtain \(\frac{1}{\lambda }=\sqrt{\frac{\varDelta _{k}^{2}}{(G^{-1}_{k}c_{k})^{\top }G_{k}(G^{-1}_{k}c_{k})}}\), and the solution of (35) can be written as
Hence, the objective function of (34) has the following form
and the constraint of (34) satisfies
which is equivalent to \(0< \tau \le 1\).
Therefore, problem (34) can be equivalently rewritten as
Since \(H_{k}\) is positive definite (see (23), (26) and (27)), problem (36) has the unique solution
Finally, we obtain
Lemma 4
Let \(\alpha _{k}\) be the unique optimal solution of subproblem (28) in the k-th step. Then
Proof
Since \(\alpha _{k}^{C}\) is a feasible solution of problem (28), we have
In what follows, we verify
If \(\Vert g_{k}\Vert ^{3}/(\varDelta _{k}g^{\top }_{k}H_{k}g_{k})<1\), substituting \(\alpha _{k}^{C}\) (see (37)) into (28), we have
If \(\Vert g_{k}\Vert ^{3}/(\varDelta _{k}g^{\top }_{k}H_{k}g_{k})\ge 1\) (i.e., \(g^{\top }_{k}H_{k}g_{k}\le \frac{\Vert g_{k}\Vert ^{3}}{\varDelta _{k}}\)), we have
Combining (38) and (39), we complete the proof.\(\square \)
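For reference, the case analysis in (38)–(39) is the classical Cauchy-point argument: since the first column of \(V_k\) is \(-g_k\), moving along that column inside the trust region yields a decrease of the standard form below, which we state here as the presumed conclusion of Lemma 4.

$$\begin{aligned} -\Big ( g_{k}^{\top }V_{k}\alpha _{k}+\tfrac{1}{2}\alpha _{k}^{\top }Q_{k}\alpha _{k}\Big ) \ge \frac{1}{2}\Vert g_{k}\Vert \min \Big \{ \varDelta _{k},\ \frac{\Vert g_{k}\Vert }{\Vert H_{k}\Vert }\Big \}. \end{aligned}$$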
Lemma 5
Under assumptions of Theorem 2, for any index k, there exists a \({\bar{k}}>k\) such that \(\Vert g_{{\bar{k}}}\Vert <\Vert g_{k}\Vert /2\).
Proof
We give the proof by contradiction. Suppose that there exists a \({\hat{k}}\) with \(\Vert g_{{\hat{k}}}\Vert =2\epsilon \) for some \(\epsilon >0\), and \(\Vert g_{k}\Vert {\ge }\epsilon \), \(\forall k\ge {\hat{k}}\). Then we know from Lemma 4 that
According to the definition of \(\rho _{k}\) in (30), we have
By Taylor expansion, we have
Then
where the first inequality follows from \(Q_k=V_{k}^{\top }H_{k}V_{k}\) and the mean-value theorem, and the second inequality follows from \(\Vert V_{k}\alpha _{k} \Vert \le \varDelta _{k}\) due to the constraint in problem (28).
Then, by (40), (41) and (42), we get
Denote
For any \(\varDelta _{k}\le {\tilde{\varDelta }}\), we have
which implies \(\rho _{k}\ge \zeta _{1}\), where the first equality follows from the fact that
The above observation, together with the update rules in Algorithm 1, indicates that \(\varDelta _{k+1}\ge \varDelta _{k}\) when \(\varDelta _{k}\le {\tilde{\varDelta }}\) (and thus \(\rho _{k}\ge \zeta _{1}\)). In other words, \(\rho _{k}<\zeta _{1}\) can hold only if \(\varDelta _{k}>{\tilde{\varDelta }}\). In this case, \(\varDelta _{k+1}=\beta _{1}\varDelta _{k}>\beta _{1}{\tilde{\varDelta }}\). To summarize the two cases, we then have
This proves that the sequence \(\{\varDelta _{k}\}_{k\ge {\hat{k}}}\) is bounded from below. Note that there exists an infinite subsequence, denoted by \({\mathcal {K}}\), of \(\{{\hat{k}}, {\hat{k}}+1, \cdots \}\) such that, for any \(k\in {\mathcal {K}}\), one of the following two cases holds.
Case 1: \(\varDelta _{k+1}=\beta _{1}\varDelta _{k}\). It is easy to obtain \(\varDelta _{k}\rightarrow 0\) as \(k\overset{{\mathcal {K}}}{\rightarrow }\infty \) since \(\beta _{1}<1\), which contradicts the fact that \(\varDelta _{k}\) is bounded from below (see (43)).
Case 2: \(\rho _{k}\ge \zeta _{1}\). We have from the definition of \(\rho _k\) (see (30)) and \(\rho _{k}\ge \zeta _{1}\) that
where the second inequality follows from Lemma 4 and \(\Vert g_{k}\Vert \ge \epsilon \) for \(k\in {\mathcal {K}}\).
Therefore, \(\{r(z_k)\}_{k\in {\mathcal {K}}}\) is strictly decreasing. Since \(\{r(z_k)\}_{k\in {\mathcal {K}}}\) is bounded from below (note that \(r(z) \ge 0\) for any z), we know that the sequence \(\{r(z_k)\}_{k\in {\mathcal {K}}}\) is convergent and \(r(z_{k})-r(z_{k+1})\downarrow 0\) as \(k\overset{{\mathcal {K}}}{\rightarrow }\infty \). Thus, \(\varDelta _{k} \rightarrow 0\) as \(k\overset{{\mathcal {K}}}{\rightarrow }\infty \), which also contradicts (43).\(\square \)
Now we are ready to prove \(\lim _{k\rightarrow \infty } \Vert g_k\Vert =0\).
Proof
(of \(\, \lim _{k\rightarrow \infty } \Vert g_k\Vert =0\)) Let
Note that \({\mathcal {B}}(z_{k},R)=\{z: \Vert z-z_{k}\Vert \le R\} \subseteq S(R_{0})\), and thus \(\nabla r(\cdot )\) is Lipschitz continuous on \({\mathcal {B}}(z_{k},R)\) with Lipschitz modulus M. Thus, for any \(z\in {\mathcal {B}}(z_{k},R)\), we have
For any \(z\in {\mathcal {B}}(z_{k},R)\), we have by the triangle inequality that
According to Lemma 5, we know that there exists an index \(l\ge k\) satisfying \(\Vert g_{l+1}\Vert <\epsilon \). Moreover, we assume that \(z_{l+1}\) is the first point that iterates out of the ball \({\mathcal {B}}(z_{k},R)\) after \(z_{k}\) as well as satisfying \(\Vert g_{l+1}\Vert <\epsilon \). Thus, \(\Vert g_{i}\Vert \ge \epsilon \) for \(i=k,k+1,\cdots ,l\). Then we have
where the second equality follows from (30), and the first inequality follows from \(\rho _{i}>\eta \) when \(z_{i}\ne z_{i+1}\). Since \(\left\| g_k\right\| = 2\epsilon \) and \(\left\| g_{l+1}\right\| <\epsilon \), we have \(z_{l+1}\ne z_{k}\), which implies that \(\{k,\cdots ,l\} \cap \{j: z_{j}\ne z_{j+1}\} \ne \emptyset \).
If \(\varDelta _{i}\le \epsilon /M\) for all \(i\in \{k,\cdots ,l\} \cap \{j: z_{j}\ne z_{j+1}\}\), we continue (45) as follows:
where the second inequality follows from \(\left\| z_{i+1} - z_i\right\| \le \varDelta _i\), the third inequality follows from the triangle inequality, the last inequality follows from the fact that \(z_{l+1}\) is the first point that iterates out of the ball \({\mathcal {B}}(z_{k},R)\) after \(z_{k}\).
If \(\varDelta _{i}>\epsilon /M\) for some \(i\in \{k,\cdots ,l\} \cap \{j: z_{j}\ne z_{j+1}\}\), we continue (45) as follows:
where the last inequality follows from \(\{k,\cdots ,l\} \cap \{j: z_{j}\ne z_{j+1}\} \ne \emptyset \). To summarize, we obtain
Since the sequence \(\{r(z_{i})\}_{i=0}^{\infty }\) is a decreasing and bounded sequence from below, there exists \(r^*\ge 0\) such that \( \lim _{i\rightarrow \infty }r(z_{i})=r^{*}.\) Hence
where the second inequality follows from (46), the last equality follows from (44).
Due to the arbitrariness of k, by letting \(k\rightarrow \infty \), we know
which implies \(\lim _{k\rightarrow \infty }\Vert g_{k}\Vert =0\). Then the proof is complete.\(\square \)
Keywords
- Nonmonotone variational inequality
- Min-max optimization
- Quasi-Newton method
- Least squares problem
- Generative adversarial networks