1 Introduction

Min-max optimization problems have wide applications in games [29], distributionally robust optimization [27], robust machine learning [26], generative adversarial networks (GANs) [14], reinforcement learning [7], distributed optimization [32], etc. Mathematically, a convexly constrained min-max optimization problem can be written as

$$\begin{aligned} \min _{x\in X}\max _{y\in Y}~ f(x,y):={\mathbb {E}}_{P}\left[ \ell (x,y,\xi )\right] , \end{aligned}$$
(1)

where \(X\subseteq {\mathbb {R}}^{n}\) and \(Y\subseteq {\mathbb {R}}^{m}\) are nonempty, closed and convex sets, \(\xi \) is an s-dimensional random vector obeying a probability distribution P with support set \(\varXi \), \(\ell :{\mathbb {R}}^{n}\times {\mathbb {R}}^{m}\times {\mathbb {R}}^{s}\rightarrow {\mathbb {R}}\) is nonconvex-nonconcave for a fixed realization of \(\xi \), i.e., \(\ell (x,y,\xi )\) is neither convex in x nor concave in y. Hence the objective function f(x,y) is also nonconvex-nonconcave in general.

In this paper, we are mainly interested in problem (1) arising from GANs [14], which reads

$$\begin{aligned} \begin{aligned} \min _{x\in X}\max _{y\in Y} \Big ( {\mathbb {E}}_{P_1}\big [\log (D(y,\xi _1))\big ] + {\mathbb {E}}_{P_2}\big [ \log (1- D(y,G(x,\xi _2))) \big ] \Big ), \end{aligned} \end{aligned}$$
(2)

where \(\xi _i\) is an \({\mathbb {R}}^{s_i}\)-valued random vector with probability distribution \(P_i\) for \(i=1,2\), \(D:{\mathbb {R}}^m \times {\mathbb {R}}^{s_1} \rightarrow (0,1)\) is a discriminator and \(G:{\mathbb {R}}^n \times {\mathbb {R}}^{s_2} \rightarrow {\mathbb {R}}^{s_1}\) is a generator.

Generally, the generator G and the discriminator D are two feedforward neural networks. For example, G can be a p-layer neural network and D can be a q-layer neural network, that is

$$\begin{aligned} G(x,\xi _2)&= \sigma _G^p (W^{p}_{G}\sigma _G^{p-1}(\cdots \sigma _G^1(W_G^{1}\xi _{2}+b^{1}_{G})+\cdots )+b^{p}_{G}), \\ D(y,\xi _1)&= \sigma _D^q(W_D^q\sigma _D^{q-1}(\cdots \sigma _D^1(W_D^1 \xi _1 + b_D^1)+\cdots )+b_D^q), \end{aligned}$$

where \(W_G^1,\cdots ,W_G^p\), \(b_G^1,\cdots ,b_G^p\) and \(W_D^1,\cdots ,W_D^q\), \(b_D^1,\cdots ,b_D^q\) are the weight matrices and bias vectors of G and D with suitable dimensions, and \(\sigma _G^1,\cdots ,\sigma _G^p\) and \(\sigma _D^1,\cdots ,\sigma _D^q\) are proper activation functions, such as ReLU, GELU, Sigmoid, etc. Denote

$$\begin{aligned} x&:=(\textrm{vec}(W^{1}_{G})^\top ,\cdots ,\textrm{vec}(W^{p}_{G})^\top ,(b^{1}_{G})^\top ,\cdots ,(b^{p}_{G})^\top )^\top ,\\ y&:=(\textrm{vec}(W^{1}_{D})^\top ,\cdots ,\textrm{vec}(W^{q}_{D})^\top ,(b^{1}_{D})^\top ,\cdots ,(b^{q}_{D})^\top )^\top , \end{aligned}$$

where \(\textrm{vec}(\cdot )\) denotes the vectorization operator. In this case, problem (2) reduces to problem (1) if we let \(\xi :=(\xi _1,\xi _2)\in \varXi \) and

$$\begin{aligned} \ell (x,y,\xi ):= \log (D(y,\xi _1)) + \log (1- D(y,G(x,\xi _2))). \end{aligned}$$
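To make this parameterization concrete, the following is a minimal NumPy sketch of a one-hidden-layer generator and discriminator (\(p=q=2\)) together with the sample loss \(\ell \); the toy dimensions, the unpack helper and the tanh hidden activation are illustrative assumptions only (the experiments in Sect. 4 use GELU and much larger networks).

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

# Illustrative toy dimensions (not from the paper): the generator maps a
# 3-dimensional noise vector xi2 to R^4, the discriminator maps R^4 to (0, 1).
s1, s2, hG, hD = 4, 3, 5, 5
shapes_G = [(hG, s2), (s1, hG), (hG,), (s1,)]   # W_G^1, W_G^2, b_G^1, b_G^2
shapes_D = [(hD, s1), (1, hD), (hD,), (1,)]     # W_D^1, W_D^2, b_D^1, b_D^2

def unpack(v, shapes):
    """Split a flat parameter vector (the vec(.) ordering above) into blocks."""
    out, k = [], 0
    for shp in shapes:
        n = int(np.prod(shp))
        out.append(v[k:k + n].reshape(shp))
        k += n
    return out

def G(x, xi2):
    W1, W2, b1, b2 = unpack(x, shapes_G)
    h = np.tanh(W1 @ xi2 + b1)          # smooth hidden activation (stand-in)
    return logistic(W2 @ h + b2)

def D(y, xi1):
    W1, W2, b1, b2 = unpack(y, shapes_D)
    h = np.tanh(W1 @ xi1 + b1)
    return float(logistic(W2 @ h + b2))

def ell(x, y, xi):
    """Sample loss ell(x, y, xi) in (1)/(2), with xi = (xi1, xi2)."""
    xi1, xi2 = xi
    return np.log(D(y, xi1)) + np.log(1.0 - D(y, G(x, xi2)))
```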

Due to the nonconvexity-nonconcavity of the objective function f, problem (1) may not have a saddle point. Hence the concepts of global and local saddle points are unsuitable for characterizing the optima of problem (1). Recently, motivated by practical applications, the so-called global and local minimax points were proposed in [19] to describe the global and local optima of nonconvex-nonconcave min-max optimization problems from the viewpoint of sequential games. Moreover, necessary optimality conditions for a local minimax point are studied in [19] for unconstrained min-max optimization problems. In [8, 18], optimality conditions for a local minimax point are studied for constrained min-max optimization problems.

Numerical methods for min-max optimization problems have been extensively studied. These algorithms can be divided into four classes based on the convexity or concavity of problems: the convex-concave cases (see, e.g., [28, 30, 31, 39]), the nonconvex-concave cases (see, e.g., [22, 23, 33]), the convex-nonconcave cases (see, e.g., [22, 23, 33]) and the nonconvex-nonconcave cases (see, e.g., [11, 42]).

To solve problem (1) numerically, we first apply the sample average approximation (SAA) approach to obtain a discrete form. We collect N independent identically distributed (i.i.d.) samples of \(\xi \) (e.g., generated by the Monte Carlo method), denoted by \(\xi ^1,\cdots ,\xi ^N\), and obtain a discrete counterpart of problem (1) as below:

$$\begin{aligned} \min _{x\in X}\max _{y\in Y} ~{\hat{f}}_{N}(x,y):=\frac{1}{N}\sum _{i=1}^{N}\ell (x,y,\xi ^{i}). \end{aligned}$$
(3)

To ease the discussion, we assume that \(\ell (\cdot ,\cdot ,\xi )\) is twice continuously differentiable with respect to (x,y) for an arbitrary \(\xi \in \varXi \) in what follows.

Let \(z:=(x^\top , y^\top )^\top \in {\mathbb {R}}^{n+m}\), \(Z:=X\times Y\subseteq {\mathbb {R}}^{n+m}\) and

$$\begin{aligned} H_N(z):= \begin{pmatrix} \nabla _x {\hat{f}}_N(x,y)\\ -\nabla _y {\hat{f}}_N(x,y) \end{pmatrix}. \end{aligned}$$

Then the first-order optimality condition of a local minimax point for problem (3) can be presented as the following nonmonotone variational inequality (VI):

$$\begin{aligned} 0\in H_N(z)+{\mathcal {N}}_{Z}(z), \end{aligned}$$
(4)

where \({\mathcal {N}}_{Z}(z)\) is the normal cone to the convex set Z at z, which is defined by

$$\begin{aligned} {\mathcal {N}}_Z(z):= \{v: \left\langle v, u -z\right\rangle \le 0, \forall u\in Z\}. \end{aligned}$$
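As a concrete illustration of the quantities entering (3) and (4), the following minimal sketch assembles \({\hat{f}}_N\) and \(H_N\) numerically for a generic sample loss ell_fn (for instance the sketch of \(\ell \) above); the central finite differences are purely illustrative, since in practice the gradients are obtained by automatic differentiation.

```python
import numpy as np

def f_hat_N(x, y, samples, ell_fn):
    """SAA objective (3): average of ell over the collected samples."""
    return float(np.mean([ell_fn(x, y, xi) for xi in samples]))

def H_N(z, nx, samples, ell_fn, eps=1e-6):
    """The field in (4): stack (grad_x f_N, -grad_y f_N) at z = (x, y),
    where x = z[:nx]; gradients by central differences (illustration only)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        fp = f_hat_N((z + e)[:nx], (z + e)[nx:], samples, ell_fn)
        fm = f_hat_N((z - e)[:nx], (z - e)[nx:], samples, ell_fn)
        g[i] = (fp - fm) / (2.0 * eps)
    g[nx:] *= -1.0                      # negate the y-block
    return g
```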

We call \(z^*\) a first-order stationary point of problem (3) if it satisfies (4). Due to the nonconvexity-nonconcavity, few algorithms can ensure convergence to a global or local optimal point of problem (3). In the study of the convergence of algorithms for the nonconvex-nonconcave min-max problem (3), some strong assumptions are required, such as the Polyak-Łojasiewicz (PL) condition [42] or the existence of solutions of the Minty VI corresponding to problem (4) [11]. In fact, such assumptions are difficult to check in practice. On the other hand, some algorithms need to estimate the Lipschitz constant of \(H_{N}(z)\), but computing this constant may be too costly or even intractable. In this paper, without estimating the Lipschitz constant of \(H_{N}(z)\), assuming the PL condition, or assuming the existence of solutions of the Minty VI, we propose a quasi-Newton subspace trust region (QNSTR, for short) algorithm for solving problem (4).

The VI (4) can be equivalently reformulated as the following system of nonsmooth equations (see [12])

$$\begin{aligned} F_N(z):=z - \textrm{Proj}_Z(z - H_N(z)) = 0, \end{aligned}$$
(5)

where \(\textrm{Proj}_Z(\cdot )\) denotes the projection operator onto Z.

Obviously, \(z^*\) is a first-order stationary point of (3) (i.e., a solution of (4) or (5)) if it is an optimal solution of the following least squares problem:

$$\begin{aligned} \min _{z\in {\mathbb {R}}^{n+m}} r_N(z):=\frac{1}{2}\Vert F_N(z)\Vert ^{2} \end{aligned}$$
(6)

and \(r_N(z^*)=0\), where \(\Vert \cdot \Vert \) denotes the Euclidean norm.

The main contributions of this paper are summarized as follows. (i) We develop the QNSTR algorithm for solving the least squares problem (6) when X and Y are boxes. Based on the structure of the VI (4), we use a smoothing function to approximate the nonsmooth function \(F_N\), adopt an adaptive quasi-Newton formula to approximate the Hessian matrix and solve a quadratic program with ellipse constraints in a subspace at each step of the QNSTR algorithm with a relatively low computational cost. (ii) We prove the global convergence of the QNSTR algorithm to a stationary point of a smoothing least squares problem of (6), which is an \(\epsilon \)-first-order stationary point of the min-max optimization problem (3) if every element of the generalized Jacobian matrix of \(F_N\) is nonsingular at it. (iii) We apply the QNSTR algorithm to GANs in eye image segmentation with real data, which validates the effectiveness and efficiency of our approach for large-scale min-max optimization problems.

This paper is organized as follows. In Sect. 2, we introduce concepts of local minimax points and first-order optimality conditions. Moreover, we investigate the asymptotic convergence between problems (1) and (3) to build the numerical foundation for the subsequent discussion. In Sect. 3, we present the QNSTR algorithm and establish its global convergence. In Sect. 4, we apply the QNSTR algorithm to solve problem (3) with examples from eye image segmentation problems and digital handwriting image generation problems based on two different real data sets. We compare the QNSTR algorithm with some existing methods including alternating Adam. Finally, we give some concluding remarks in Sect. 5.

Notations \({\mathbb {N}}\) denotes the set of natural numbers. \(\Vert \cdot \Vert \) denotes the Euclidean norm of a vector or the norm of a matrix induced by the Euclidean norm. \(\textrm{d}(x, Y):=\inf _{y\in Y} \Vert x - y\Vert \) and \(\textrm{d}(X, Y):= \sup _{x\in X}\inf _{y\in Y} \Vert x-y\Vert \) for \(X, Y \subseteq {\mathbb {R}}^{n}\).

2 First-Order Stationarity and Asymptotic Convergence

In this section, we focus on the asymptotic convergence between problems (3) and (1) in terms of global minimax points and first-order stationary points. To this end, we first give some preliminaries on how to describe the optima of a min-max optimization problem.

Definition 1

(global and local minimax points, [19, Definitions 9 & 14])

  1. (i)

    \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a global minimax point of problem (1), if

    $$\begin{aligned} f({\hat{x}},y) \le f({\hat{x}},{\hat{y}}) \le \max _{y'\in Y}f(x,y'),~\forall (x,y)\in X\times Y. \end{aligned}$$
  2. (ii)

    \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a local minimax point of problem (1), if there exist a \(\delta _0>0\) and a function \(\tau :{\mathbb {R}}_+\rightarrow {\mathbb {R}}_+\) satisfying \(\tau (\delta )\rightarrow 0\) as \(\delta \rightarrow 0\), such that for any \(\delta \in (0,\delta _0]\) and any \((x,y)\in X\times Y\) satisfying \(\left\| x-{\hat{x}}\right\| \le \delta \) and \(\left\| y-{\hat{y}}\right\| \le \delta \), we have

    $$\begin{aligned} f({\hat{x}},y) \le f({\hat{x}},{\hat{y}}) \le \max _{y'\in \{y\in Y: \left\| y-{\hat{y}}\right\| \le \tau (\delta )\}}f(x,y'). \end{aligned}$$

Remark 1

The concept of saddle points has been commonly used to characterize the optima of min-max problems. A point \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a saddle point of problem (1), if

$$\begin{aligned} f({\hat{x}},y) \le f({\hat{x}},{\hat{y}}) \le f(x,{\hat{y}}),~\forall (x,y)\in X\times Y, \end{aligned}$$
(7)

and \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a local saddle point of problem (1) if (7) holds in a neighborhood of \(({\hat{x}},{\hat{y}})\). However, as pointed out in [19], saddle points and local saddle points may not exist in many machine learning applications, especially in the nonconvex-nonconcave case. Also, (local) saddle points are solutions from the viewpoint of a simultaneous game, where the minimization operator and the maximization operator act simultaneously. However, many applications, such as GANs and adversarial training, seek solutions in the sense of a sequential game, where the minimization operator acts first and the maximization operator acts later. Global and local minimax points exist under some mild conditions (see [19, Proposition 11 and Lemma 16]) and describe the optima in the sense of a sequential game.

The following lemma gives the first-order necessary optimality conditions of local minimax points for problem (1).

Lemma 1

( [18, Theorem 3.2 & Corollary 3.1]) If \(({\hat{x}},{\hat{y}})\in X\times Y\) is a local minimax point of problem (1), then we have

$$\begin{aligned} {\left\{ \begin{array}{ll} 0\in \nabla _x f({\hat{x}},{\hat{y}}) + {\mathcal {N}}_{X}({\hat{x}}),\\ 0\in - \nabla _y f({\hat{x}},{\hat{y}}) + {\mathcal {N}}_{Y}({\hat{y}}). \end{array}\right. } \end{aligned}$$
(8)

Definition 2

\(({\hat{x}},{\hat{y}})\in X\times Y\) is called a first-order stationary point of problem (1) if (8) holds. \(({\hat{x}},{\hat{y}})\in X\times Y\) is called a first-order stationary point of problem (3) if (8) holds with f replaced by \({\hat{f}}_N\).

Hereafter, we will focus on finding a first-order stationary point of (3).

For the exponential rate of convergence of the first-order and second-order stationary points of the SAA for a specific GAN, one can refer to [18, Proposition 4.3]. In what follows, we mainly focus on the almost sure convergence analysis between problems (1) and (3) as N tends to infinity. If the problem is well-behaved and global minimax points are achievable, we consider the convergence of global minimax points between problems (1) and (3). Otherwise, only the first-order stationary points (Definition 2) are attainable, and thus we also consider the convergence of first-order stationary points between problems (1) and (3) as N tends to infinity.

Denote the optimal value, the set of global minimax points and the set of first-order stationary points of problem (1) by \(\vartheta _g\), \({\mathcal {S}}_{g}\) and \({\mathcal {S}}_{1\textrm{st}}\), respectively. Let \({\widehat{\vartheta }}_g^N\), \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \(\widehat{{\mathcal {S}}}_{1\textrm{st}}^{N}\) denote the optimal value, the set of global minimax points and the set of first-order stationary points of problem (3), respectively.

Lemma 2

Suppose that: (a) X and Y are compact sets; (b) \(\ell (x,y,\xi )\) is dominated by an integrable function for every \((x,y)\in X\times Y\). Then

$$\begin{aligned} \sup _{(x,y)\in X\times Y} \left| {\hat{f}}_{N}(x,y) -f(x,y)\right| \rightarrow 0 \end{aligned}$$

w.p.1 as \(N\rightarrow \infty \).

If, further, (c) \(\left\| \nabla _x \ell (x,y,\xi )\right\| \) and \(\left\| \nabla _y \ell (x,y,\xi ) \right\| \) are dominated by an integrable function for every \((x,y)\in X\times Y\), then

$$\begin{aligned} \sup _{(x,y)\in X\times Y} \left\| \nabla {\hat{f}}_{N}(x,y) - \nabla f(x,y)\right\| \rightarrow 0 \end{aligned}$$

w.p.1 as \(N\rightarrow \infty \).

Proof

Since the samples are i.i.d. and X and Y are compact, it is known from [36, Theorem 7.53] that the above uniform convergence results hold.\(\square \)

The following proposition gives the nonemptiness of \(\widehat{{\mathcal {S}}}_{g}^{N}\), \({\mathcal {S}}_{g}\), \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) and \({\mathcal {S}}_{\textrm{1st}}\).

Proposition 1

If conditions (a)–(c) in Lemma 2 hold, then \({\mathcal {S}}_{g}\) and \({\mathcal {S}}_{\textrm{1st}}\) are nonempty and \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are nonempty for any \(N\in {\mathbb {N}}\).

Proof

By the continuity of f(x,y) and \({\hat{f}}_{N}(x,y)\) w.r.t. (x,y) and the boundedness of X and Y, we know from [19, Proposition 11] the nonemptiness of \({\mathcal {S}}_{g}\) and \(\widehat{{\mathcal {S}}}_{g}^{N}\). Note that both \({\mathcal {S}}_{\textrm{1st}}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are solution sets of variational inequalities. Then we have from [12, Corollary 2.2.5] that \({\mathcal {S}}_{\textrm{1st}}\) and \(\widehat{{\mathcal {S}}}_{\textrm{1st}}^{N}\) are nonempty.\(\square \)

Based on the uniform laws of large numbers in Lemma 2, we have the following convergence results.

Theorem 1

Let conditions (a)–(c) in Lemma 2 hold. Then

$$\begin{aligned}&\textrm{d}\left( \widehat{{\mathcal {S}}}_{g}^{N}, {\mathcal {S}}_{g} \right) \rightarrow 0, \end{aligned}$$
(9)
$$\begin{aligned}&\textrm{d}\left( \widehat{{\mathcal {S}}}_{1\textrm{st}}^{N}, {\mathcal {S}}_{1\textrm{st}} \right) \rightarrow 0, \end{aligned}$$
(10)

w.p.1 as \(N\rightarrow \infty \).

Proof

(10) follows from [35, Proposition 19] directly. Thus, in what follows, we only consider (9).

From Proposition 1, we know that \(\widehat{{\mathcal {S}}}_{g}^{N}\) and \({\mathcal {S}}_{g}\) are nonempty for any \(N\in {\mathbb {N}}\). Let \(z^{N}=(x^{N}, y^{N}) \in \widehat{{\mathcal {S}}}_{g}^{N}\) and \(z^{N} \rightarrow {\bar{z}}=({\bar{x}}, {\bar{y}})\) w.p.1 as \(N\rightarrow \infty \); if \(\{z^{N}\}\) is not a convergent sequence, we can, due to the boundedness of X and Y, pass to a convergent subsequence. It then suffices to verify that \({\bar{z}}\in {\mathcal {S}}_{g}\) w.p.1. Denote \(\varphi (x):=\max _{y\in Y} f(x,y)\) and \({\hat{\varphi }}_{N}(x):=\max _{y\in Y} {\hat{f}}_{N}(x,y)\). Note that

$$\begin{aligned} \begin{aligned} \max _{x\in X}\left| {\hat{\varphi }}_{N}(x) - \varphi (x)\right|&= \max _{x\in X}\left| \max _{y\in Y} {\hat{f}}_{N}(x,y) - \max _{y\in Y} f(x,y)\right| \\&\le \max _{(x,y)\in X\times Y} \left| {\hat{f}}_{N}(x,y) - f(x,y) \right| \\&\rightarrow 0 \end{aligned} \end{aligned}$$
(11)

w.p.1 as \(N\rightarrow \infty \), where the last convergence assertion follows from Lemma 2.

Next, we show

$$\begin{aligned} {\hbox {Proj}}_{x} \widehat{\mathcal {S}}_{g}^{N}=\mathop {\hbox {arg\,min}}\limits _{x\in X} {\hat{\varphi }}_{N}(x)\,\, \text {and} \,\, {\hbox {Proj}}_{x} {\mathcal {S}}_{g}=\mathop {\hbox {arg\,min}}\limits _{x\in X} \varphi (x), \end{aligned}$$
(12)

where \({\hbox {Proj}}_x\) denotes the projection onto the x’s space. Based on the definition of global minimax points, we have, for any \(({\hat{x}},{\hat{y}})\in {\mathcal {S}}_{g}\), that

$$\begin{aligned} f({\hat{x}},y)\le f({\hat{x}},{\hat{y}}) \le \max _{y'\in Y}f(x,y'),~\forall (x,y)\in X\times Y, \end{aligned}$$

which implies

$$\begin{aligned} \varphi ({\hat{x}}) = \max _{y\in Y}f({\hat{x}},y) \le \max _{y'\in Y}f(x,y')= \varphi (x), ~\forall x \in X. \end{aligned}$$

This means \(\textrm{Proj}_{x}{\mathcal {S}}_{g}\subseteq \mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x)\). On the other hand, for any \({\hat{x}}\in \mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x)\), we let \({\hat{y}} \in \arg \max _{y\in Y}f({\hat{x}},y)\). Then it is not difficult to verify that \(({\hat{x}},{\hat{y}})\) is a global minimax point, i.e., \(\mathop {\hbox {arg\,min}}\limits _{x\in X}\varphi (x) \subseteq \textrm{Proj}_{x}{\mathcal {S}}_{g}\). The equality \({\hbox {Proj}}_x \widehat{{\mathcal {S}}}_{g}^{N}=\mathop {\hbox {arg\,min}}\limits _{x\in X} {\hat{\varphi }}_{N}(x)\) can be verified similarly. Hence (12) holds.

Then (11) and (12) indicate, according to [41, Lemma 4.1], that

$$\begin{aligned} \textrm{d}\left( {\hbox {Proj}}_x \widehat{{\mathcal {S}}}_{g}^{N}, \textrm{Proj}_x {\mathcal {S}}_{g} \right) \rightarrow 0 \end{aligned}$$
(13)

w.p.1 as \(N\rightarrow \infty \). We know from (13) that \({\bar{x}}\in \textrm{Proj}_x {\mathcal {S}}_{g}\).

Moreover, we know that

$$\begin{aligned} \left| {\widehat{\vartheta }}_g^{N} - \vartheta _g\right|&= \left| \min _{x\in X}{\hat{\varphi }}_{N}(x) - \min _{x\in X}\varphi (x) \right| \\ &\le \max _{x\in X}\left| {\hat{\varphi }}_{N}(x) - \varphi (x)\right| \\&\rightarrow 0 \end{aligned}$$

w.p.1 as \(N\rightarrow \infty \), where \(\vartheta _g\) and \({\widehat{\vartheta }}_g^{N}\) are optimal values of problems (1) and (3), respectively. Due to Lemma 2 and the continuity of f, we know that

$$\begin{aligned} \left| {\hat{f}}_{N}(x^{N},y^{N}) - f({\bar{x}},{\bar{y}}) \right|&\le \left| {\hat{f}}_{N}(x^{N},y^{N}) - f(x^{N},y^{N})\right| + \left| f(x^{N},y^{N}) - f({\bar{x}},{\bar{y}}) \right| \\&\rightarrow 0. \end{aligned}$$

Since \({\widehat{\vartheta }}_g^{N} = {\hat{f}}_{N}(x^{N},y^{N})\), we know that \(\vartheta _g = f({\bar{x}},{\bar{y}})\), which, together with \({\bar{x}}\in \textrm{Proj}_x {\mathcal {S}}_{g}\), implies that \(({\bar{x}},{\bar{y}})\in {\mathcal {S}}_{g}\).\(\square \)

Theorem 1 justifies employing problem (3) to approximately solve problem (1). In the sequel, we focus on how to compute an \(\epsilon \)-first-order stationary point of problem (3).

3 The QNSTR Algorithm and its Convergence Analysis

In this section, we propose the QNSTR algorithm to compute an \(\epsilon \)-first-order stationary point of problem (3) with a fixed sample size N. In the remainder of this paper, let \(X=[a,b]\) and \(Y=[c,d]\), where \(a,b\in {\mathbb {R}}^n\), \(c,d\in {\mathbb {R}}^m\) with \(a< b\) and \(c< d\) in the componentwise sense. In this case, the projection in (5) has a closed form and the function \(F_N\) can be written as

$$\begin{aligned} F_N(z)=z - \textrm{mid}(l, u, z - H_N(z)), \end{aligned}$$
(14)

where \(l,u\in {\mathbb {R}}^{n+m}\) with \(l=(a^\top , c^\top )^\top \) and \(u=(b^\top , d^\top )^\top \), “mid” is the middle operator in the componentwise sense, that is

$$\begin{aligned} \textrm{mid}(l,u, z-H_N(z))_i= \left\{ \begin{array}{ll} l_i, & \textrm{if} \quad (z - H_N(z))_i < l_i,\\ u_i, & \textrm{if} \quad (z - H_N(z))_i > u_i,\\ (z - H_N(z))_i, & \textrm{otherwise}, \end{array} \right. \quad i=1,\cdots , n+m. \end{aligned}$$

Since X and Y are boxes, (14) can be divided into two parts separably and rewritten as

$$\begin{aligned} F_{N}(z)=\left( \begin{array}{l} F^{\mathbb {1}}_{N}(z)\\ F^{\mathbb {2}}_{N}(z)\end{array}\right) = \left( \begin{array}{l} x-\textrm{mid}(a,b,x-\nabla _{x}{\hat{f}}_{N}(x,y))\\ y-\textrm{mid}(c,d,y+\nabla _{y}{\hat{f}}_{N}(x,y)) \end{array}\right) . \end{aligned}$$
(15)
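Since X and Y are boxes, \(F_N\) and the merit function \(r_N\) in (6) can be evaluated with the componentwise mid operator alone. The following minimal sketch assumes a callable H_fn that evaluates \(H_N\) (for instance the finite-difference sketch above).

```python
import numpy as np

def mid(l, u, v):
    """Componentwise median of (l, u, v), i.e., projection of v onto the box [l, u]."""
    return np.minimum(np.maximum(v, l), u)

def F_N(z, l, u, H_fn):
    """Natural residual (14)/(15): F_N(z) = z - mid(l, u, z - H_N(z))."""
    return z - mid(l, u, z - H_fn(z))

def r_N(z, l, u, H_fn):
    """Least squares merit function (6)."""
    Fz = F_N(z, l, u, H_fn)
    return 0.5 * float(Fz @ Fz)
```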

3.1 Smoothing Approximation

Let \(q(z)=z-H_N(z).\) The function \(F_N\) is not differentiable at z when \(q_i(z) = l_i\) or \(q_i(z) = u_i\) for some \(1\le i\le n+m\). To overcome the difficulty in computation of the generalized Hessian of the nonsmooth function \(F_N(z)\), we consider its smoothing approximation

$$\begin{aligned} (F_{N,\mu })_i(z) = {\left\{ \begin{array}{ll} \frac{1}{2}((H_N)_i(z) + z_i) + \frac{1}{2\mu } (u_i-q_i(z))^2 + \frac{\mu }{8} - \frac{u_i}{2}, & \text {if}~ \left| u_i -q_i(z)\right| \le \frac{\mu }{2},\\ \frac{1}{2}((H_N)_i(z) + z_i) - \frac{1}{2\mu } (l_i - q_i(z))^2 - \frac{\mu }{8} - \frac{l_i}{2}, & \text {if}~ \left| l_i -q_i(z)\right| \le \frac{\mu }{2},\\ (F_N)_i(z), & \text {otherwise}, \end{array}\right. }\nonumber \\ \end{aligned}$$
(16)

where \(0<\mu \le {\hat{\mu }}:=\min _{1\le i\le n+m} (u_i - l_i)\).
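A direct componentwise implementation of (16) might look as follows; it is only a sketch of the formula above, again written against a hypothetical callable H_fn for \(H_N\).

```python
import numpy as np

def F_N_mu(z, l, u, mu, H_fn):
    """Smoothing approximation (16) of F_N, with q(z) = z - H_N(z)."""
    Hz = H_fn(z)
    q = z - Hz
    out = z - np.minimum(np.maximum(q, l), u)          # the 'otherwise' branch
    upper = 0.5 * (Hz + z) + (u - q) ** 2 / (2.0 * mu) + mu / 8.0 - u / 2.0
    lower = 0.5 * (Hz + z) - (l - q) ** 2 / (2.0 * mu) - mu / 8.0 - l / 2.0
    out = np.where(np.abs(u - q) <= mu / 2.0, upper, out)
    out = np.where(np.abs(l - q) <= mu / 2.0, lower, out)
    return out
```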

From (15), the smoothing function \(F_{N,\mu }(z)\) can also be represented as

$$\begin{aligned} F_{N,\mu }(z)=\left( \begin{array}{l} F^{\mathbb {1}}_{N,\mu }(z)\\ F^{\mathbb {2}}_{N,\mu }(z)\end{array}\right) , \end{aligned}$$
(17)

where \(F^{\mathbb {1}}_{N,\mu }(z)\) and \(F^{\mathbb {2}}_{N,\mu }(z)\) are the smoothing approximations of \(F_{N}^{\mathbb {1}}(z)\) and \(F_{N}^{\mathbb {2}}(z)\), respectively.

We summarize some useful properties of the smoothing function \(F_{N,\mu }\), which can be found in [4] and [3, Section 6].

Lemma 3

Let \(F_{N,\mu }\) be a smoothing function of \(F_N\) defined in (16). Then for any \(\mu \in (0, {\hat{\mu }})\), \(F_{N,\mu }\) is continuously differentiable and has the following properties.

  1. (i)

    There is a \(\kappa >0\) such that for any \(z\in {\mathbb {R}}^{m+n}\) and \(\mu >0\),

    $$\begin{aligned} \Vert F_{N,\mu }(z)-F_N(z)\Vert \le \kappa \mu . \end{aligned}$$
  2. (ii)

    For any \(z\in {\mathbb {R}}^{m+n}\), we have

    $$\begin{aligned} \lim _{\mu \downarrow 0} \textrm{d}(\nabla _{z} F_{N,\mu }(z), \partial _C F_N(z))=0, \end{aligned}$$

    where \(\partial _CF_N(z)=\partial (F_N(z))_1 \times \partial (F_N(z))_2\times \cdots \times \partial (F_N(z))_{n+m},\) and \(\partial (F_N(z))_i\) is the Clarke generalized gradient of \((F_N(\cdot ))_i\) at z for \(i=1,\ldots , n+m\). Moreover, there exists a \({\bar{\mu }}>0\) such that for any \(\mu \in (0, {\bar{\mu }})\), we have \(\nabla F_{N,\mu }(z)\in \partial _C F_N(z).\)

In Fig. 11 in Appendix A, we show the approximation error \(\Vert F_{N,\mu }(z)-F_{N}(z)\Vert \) over \(X\times Y\) as \(\mu \downarrow 0\) with different N.

Definition 3

(\(\epsilon \)-first-order stationary points) For given \(\epsilon >0\), a point z is called an \(\epsilon \)-first-order stationary point of problem (3), if \(\left\| F_N(z)\right\| \le \epsilon \).

From (i) of Lemma 3, if \(z^*\) is an \(\epsilon \)-first-order stationary point of problem (3), i.e., \(\left\| F_N(z^*)\right\| \le \epsilon \), then for any \(\mu \in (0, \frac{\epsilon }{\kappa })\), we have

$$\begin{aligned} \Vert F_{N,\mu }(z^*)\Vert -\Vert F_N(z^*)\Vert \le \Vert F_{N,\mu }(z^*)-F_N(z^*)\Vert \le \kappa \mu \le \epsilon , \end{aligned}$$
(18)

which implies \(\Vert F_{N,\mu }(z^*)\Vert \le \Vert F_N(z^*)\Vert + \epsilon \le 2\epsilon .\) On the other hand, if \(z^*\) satisfies \(\Vert F_{N,\mu }(z^*)\Vert \le \frac{\epsilon }{2}\) for some \(\epsilon >0\) and \(\mu \in (0, \frac{\epsilon }{2\kappa })\), then we have

$$\begin{aligned} \Vert F_N(z^*)\Vert -\Vert F_{N,\mu }(z^*)\Vert \le \Vert F_{N,\mu }(z^*)-F_N(z^*)\Vert \le \kappa \mu \le \frac{\epsilon }{2}, \end{aligned}$$

which implies \(\Vert F_N(z^*)\Vert \le \Vert F_{N,\mu }(z^*)\Vert + \frac{\epsilon }{2}\le \epsilon \), that is, \(z^*\) is an \(\epsilon \)-first-order stationary point of problem (3).

Now we consider the smoothing least squares problem with a fixed small smoothing parameter \(\mu >0\):

$$\begin{aligned} \min _{z\in {\mathbb {R}}^{n+m}} r_{N,\mu }(z):=\frac{1}{2}\Vert F_{N,\mu }(z )\Vert ^{2}. \end{aligned}$$
(19)

Let \(J_{N,\mu }(z)\) be the Jacobian matrix of \(F_{N,\mu }(z)\). The gradient of the function \(r_{N,\mu }\) is

$$\begin{aligned} \nabla r_{N,\mu }(z)=J_{N,\mu }(z)^\top F_{N,\mu }(z). \end{aligned}$$

A vector \(z^*\) is called a first-order stationary point of problem (19) if \(\nabla r_{N,\mu }(z^*)=0\). If \(J_{N,\mu }(z^*)\) is nonsingular, then \( F_{N,\mu }(z^*)=0\). From (i) of Lemma 3, \(\Vert F_N(z^*)\Vert =\Vert F_N(z^*)-F_{N,\mu }(z^*)\Vert \le \kappa \mu \le \epsilon \) when \(\mu \in (0, \epsilon /\kappa ).\) This means that a first-order stationary point \(z^*\) of problem (19) is an \(\epsilon \)-first-order stationary point of problem (3) if \(\mu \in (0, \epsilon /\kappa )\) and \(J_{N,\mu }(z^*)\) is nonsingular. Note that \(\partial _CF_N(z)\) is a compact set for any \(z\in X\times Y\). From (ii) of Lemma 3, if all matrices in \(\partial _CF_N(z^*)\) are nonsingular, then there is \(\mu _0 >0\) such that for any \(\mu \in (0, \mu _0)\), \(J_{N,\mu }(z^*)\) is nonsingular.

If \(J_{N,\mu }(z^{*})\) is singular, the assumptions of the local convergence theorems in [10, 15] for Gauss-Newton methods applied to the least squares problem (19) fail. In the next subsection, we prove the convergence of the QNSTR algorithm for the least squares problem (19) to a stationary point of (19) without assuming the nonsingularity of \(J_{N,\mu }(z^{*})\).

3.2 The QNSTR Algorithm

In this subsection, we present the QNSTR algorithm with a fixed sample size N and a fixed smoothing parameter \(\mu \). For simplicity, in this subsection, we use F(z), J(z) and r(z) to denote \(F_{N,\mu }(z) \), \(J_{N,\mu }(z)\) and \(r_{N,\mu }(z)\), respectively. Moreover, we use \(F^{\mathbb {1}}(z)\), \(F^{\mathbb {2}}(z)\) to represent \(F^{\mathbb {1}}_{N,\mu }(z)\) and \(F^{\mathbb {2}}_{N,\mu }(z)\), respectively.

For an arbitrary point \(z_0\in X\times Y\) and a positive number \(R_0\), we denote the level set \({\bar{S}}:=\{z\in {\mathbb {R}}^{n+m}\, | \, r(z)\le r(z_0)\}\) and define the set \(S(R_{0}):=\{z\in {\mathbb {R}}^{n+m} \, |\, \Vert z-z'\Vert \le R_{0}~\text {for some}~z'\in {\bar{S}}\}\). By the definition of F and the boundedness of X and Y, we have \(r(z)=\frac{1}{2}\Vert F(z)\Vert ^2 \rightarrow \infty \) if \(\Vert z\Vert \rightarrow \infty \). Hence both \({\bar{S}}\) and \(S(R_0)\) are bounded.

Let \(J_{1}(z)=\nabla F^{\mathbb {1}}(z)\) and \(J_{2}(z)=\nabla F^{\mathbb {2}}(z)\). Then from

$$\begin{aligned} r(z):=\frac{1}{2}\Vert F^{\mathbb {1}}(z)\Vert ^{2}+\frac{1}{2}\Vert F^{\mathbb {2}}(z)\Vert ^{2}, \end{aligned}$$

if \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are twice differentiable at z, the Hessian matrix

$$\begin{aligned} \nabla ^{2}r(z)=\frac{1}{2}\nabla ^{2}\Vert F^{\mathbb {1}}(z)\Vert ^{2}+\frac{1}{2}\nabla ^{2}\Vert F^{\mathbb {2}}(z)\Vert ^{2}, \end{aligned}$$

can be written as

$$\begin{aligned} \nabla ^2 \Vert F^{\mathbb {1}}(z)\Vert ^{2}&= J_{1}(z)^\top J_{1}(z) +\sum _{i=1}^{n} (F^{\mathbb {1}})_i(z)\nabla ^2(F^{\mathbb {1}})_i(z), \end{aligned}$$
(20)
$$\begin{aligned} \nabla ^2 \Vert F^{\mathbb {2}}(z)\Vert ^{2}&= J_{2}(z)^\top J_{2}(z) +\sum _{i=1}^{m} (F^{\mathbb {2}})_i(z)\nabla ^2(F^{\mathbb {2}})_i(z). \end{aligned}$$
(21)

If \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are not twice differentiable at z, the generalized Hessian of r at z, denoted by \(\partial (\nabla r(z))\), is the convex hull of all \((m+n)\times (m+n)\) matrices obtained as the limit of a sequence of the form \(\nabla ^{2}r(z^k)\), where \(z^k \rightarrow z\) and \(F^{\mathbb {1}}\) and \(F^{\mathbb {2}}\) are twice differentiable at \(z^k\) [5]. Hence from (20)–(21) and the twice continuous differentiability of \({\hat{f}}_N\), we know that there is a positive number \(M_1\) such that \(\Vert H\Vert \le M_1\) for any \(H\in \partial (\nabla r(z))\), \(z\in S(R_0)\). Moreover, from [5, Proposition 2.6.5], there is a positive number \(M_2\) such that

$$\begin{aligned} \left\| \nabla r(z) - \nabla r(z')\right\| \le M_2 \left\| z - z'\right\| , \,\, \forall z, z' \in S(R_0). \end{aligned}$$
(22)

To give a globally convergent algorithm for problem (19) without using the second derivatives, we keep the terms \(J_{1}(z)^{\top }J_{1}(z)\) and \(J_{2}(z)^{\top }J_{2}(z)\) in (20)–(21), and approximate

$$\begin{aligned} \sum _{i=1}^{n} (F^{\mathbb {1}})_i(z)\nabla ^2 (F^{\mathbb {1}})_i(z) ~~ \text {and} ~~ \sum _{i=1}^{m} (F^{\mathbb {2}})_i(z)\nabla ^2(F^{\mathbb {2}})_i(z). \end{aligned}$$

Specifically, the Hessian matrix at the k-th iteration point \(z_{k}\) is approximated by \(H_k\) with

$$\begin{aligned} H_{k}=J_{1}(z_{k})^{\top }J_{1}(z_{k})+J_{2}(z_{k})^{\top }J_{2}(z_{k})+A_{k}, \end{aligned}$$
(23)

where

$$\begin{aligned} A_{k}=\begin{pmatrix} B_{k} & O \\ O & C_{k} \end{pmatrix}. \end{aligned}$$

Here the matrices \(B_{k}\) and \(C_k\) are computed by the truncated BFGS quasi-Newton formula as follows.

$$ \begin{aligned} & B_{k+1}={\left\{ \begin{array}{ll} {\bar{B}}_{k+1}& \,\ \text {if}\,\, \Vert {\bar{B}}_{k+1}\Vert \le \gamma \,\, \& \,\, \frac{(s^{\mathbb {1}}_k)^\top v^{\mathbb {1}}_k}{(s^{\mathbb {1}}_k)^\top s^{\mathbb {1}}_k }\ge {\bar{\epsilon }}\\ \Vert F^{\mathbb {1}}(z_{k+1}) \Vert I_{n}& \,\,\text {otherwise,} \end{array}\right. } \end{aligned}$$
(24)
$$ \begin{aligned} & C_{k+1}={\left\{ \begin{array}{ll} {\bar{C}}_{k+1}& \,\, \text {if} \,\, \Vert {\bar{C}}_{k+1}\Vert \le \gamma \,\, \& \,\, \,\frac{(s^{\mathbb {2}}_k)^\top v^{\mathbb {2}}_k}{(s^{\mathbb {2}}_k)^\top s^{\mathbb {2}}_k}\ge {\bar{\epsilon }} \\ \Vert F^{\mathbb {2}}(z_{k+1}) \Vert I_m & \,\,\text {otherwise,} \end{array}\right. } \end{aligned}$$
(25)

where

$$\begin{aligned} {\bar{B}}_{k+1}= B_{k}-\frac{B_{k}s^{\mathbb {1}}_{k}(s^{\mathbb {1}}_{k})^{\top }B^{\top }_{k}}{(s^{\mathbb {1}}_{k})^{\top }B_{k}s^{\mathbb {1}}_{k}}+\frac{v^{\mathbb {1}}_{k}(v^{\mathbb {1}}_{k})^{\top }}{(v^{\mathbb {1}}_{k})^{\top }s^{\mathbb {1}}_{k}} \end{aligned}$$
(26)

and

$$\begin{aligned} {\bar{C}}_{k+1}=C_{k}-\frac{C_{k}s^{\mathbb {2}}_{k}(s^{\mathbb {2}}_{k})^{\top }C^{\top }_{k}}{(s^{\mathbb {2}}_{k})^{\top }C_{k}s^{\mathbb {2}}_{k}}+\frac{v^{\mathbb {2}}_{k}(v^{\mathbb {2}}_{k})^{\top }}{(v^{\mathbb {2}}_{k})^{\top }s^{\mathbb {2}}_{k}}. \end{aligned}$$
(27)

Here \({\bar{\epsilon }}\) and \(\gamma \) are given positive parameters, \(s^{\mathbb {1}}_k=x_{k+1}-x_k\), \(s^{\mathbb {2}}_{k}=y_{k+1}-y_{k}\),

$$\begin{aligned} v^{\mathbb {1}}_{k}&=(\nabla _xF^{\mathbb {1}}(z_{k+1})-\nabla _xF^{\mathbb {1}}(z_{k}))^{\top }F^{\mathbb {1}}(z_{k+1})\Vert F^{\mathbb {1}}(z_{k+1})\Vert /\Vert F^{\mathbb {1}}(z_{k})\Vert , \\ v^{\mathbb {2}}_{k}&=(\nabla _yF^{\mathbb {2}}(z_{k+1})-\nabla _yF^{\mathbb {2}}(z_{k}))^{\top }F^{\mathbb {2}}(z_{k+1})\Vert F^{\mathbb {2}}(z_{k+1})\Vert /\Vert F^{\mathbb {2}}(z_{k})\Vert . \end{aligned}$$
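The truncated updates (24)–(27) reduce to one routine applied to each diagonal block; the following sketch assumes the denominators are nonzero and uses the spectral norm for \(\Vert \cdot \Vert \).

```python
import numpy as np

def bfgs_block_update(B, s, v, F_next_norm, gamma=1e3, eps_bar=1e-4):
    """Truncated BFGS update (24)/(26) (resp. (25)/(27)) for one block of A_{k+1}.

    B: current block B_k (or C_k), s: step s_k in that block's variables,
    v: the modified gradient-difference vector defined above, F_next_norm:
    ||F^1(z_{k+1})|| (or ||F^2(z_{k+1})||) used in the fallback.
    Denominators are assumed nonzero in this sketch.
    """
    Bs = B @ s
    B_bar = B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(v, v) / float(v @ s)
    if np.linalg.norm(B_bar, 2) <= gamma and float(s @ v) / float(s @ s) >= eps_bar:
        return B_bar
    return F_next_norm * np.eye(s.size)

# A_{k+1} is then the two-block diagonal matrix in (23):
#   A_next = np.block([[B_next, np.zeros((n, m))],
#                      [np.zeros((m, n)), C_next]])
```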

Notice that the approximation form (23) is proposed in [47]. However, the matrix \(A_k\) in [47] is defined by using F(z) and \(\nabla F(z)\) at \(z_{k+1}, z_k\). In this paper, we use \(F^{\mathbb {1}}(z)\), \(F^{\mathbb {2}}(z)\), \(\nabla F^{\mathbb {1}}(z)\) and \(\nabla F^{\mathbb {2}}(z)\) at \(z_{k+1}, z_k\) to define a two-block diagonal matrix \(A_{k}\) in (23), based on the structure of VI in (17).

In [47], a backtracking line search is used to obtain a stationary point of the least squares problem. In this paper, we use a subspace trust-region method to solve problem (19) with global convergence guarantees. Compared with the quasi-Newton method with backtracking line search in [47], the QNSTR algorithm solves a low-dimensional strongly convex quadratic subproblem at each step, which makes it efficient for large-scale min-max optimization problems with real data. See Sect. 4 for more details.

In what follows, for simplification, we use \(J_{k}\) and \(F_{k}\) to denote \(J(z_{k})\), \(F(z_{k})\), respectively.

Let \(g_{k}=\nabla r(z_{k})\). Choose \(\{d_k^1, \cdots , d_k^{L-1}\}\) such that \(V_{k}:=\left[ \begin{matrix} -g_{k}&d_{k}^{1}&\cdots&d_{k}^{L-1} \end{matrix}\right] \in {\mathbb {R}}^{(n+m)\times L}\) has L linearly independent column vectors. Let

$$\begin{aligned} c_{k}:=V_{k}^{\top }g_{k}, \quad G_{k}:=V_{k}^{\top }V_{k},\quad Q_{k}:=V_{k}^{\top }H_{k}V_{k}. \end{aligned}$$

Then, to obtain the stepsize \(\alpha _{k}\) at the iteration point \(z_k\), we solve the following strongly convex quadratic program in an L-dimensional space:

$$\begin{aligned} \begin{array}{cl} \alpha _{k}=\mathop {\hbox {arg\,min}}\limits \limits _{\alpha \in {\mathbb {R}}^{L}} & m_{k}(\alpha ):=r(z_k)+ c_{k}^{\top }\alpha +\frac{1}{2}\alpha ^{\top }Q_{k}\alpha \\ ~~~~~~~~\mathrm {s.t.} & \Vert V_k\alpha \Vert \le \varDelta _k, \end{array} \end{aligned}$$
(28)

where \(\varDelta _{k}>0\) is the trust-region radius.
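Since \(Q_k\) is positive definite and \(G_k=V_k^{\top }V_k\) is nonsingular when \(V_k\) has linearly independent columns, one way to solve (28) is bisection on the KKT multiplier of the ellipse constraint. The following is a minimal sketch of such a solver and is not claimed to be the exact routine used in Algorithm 1.

```python
import numpy as np

def solve_subproblem(c, Q, G, Delta, tol=1e-10):
    """Solve min c^T a + 0.5 a^T Q a  s.t.  a^T G a <= Delta^2   (problem (28)),
    assuming Q is positive definite and G = V_k^T V_k has full rank.
    Uses the KKT system (Q + lam*G) a = -c and bisection on lam >= 0."""
    def step(lam):
        return np.linalg.solve(Q + lam * G, -c)

    a = step(0.0)
    if float(a @ G @ a) <= Delta ** 2:
        return a                              # unconstrained minimizer is feasible
    lo, hi = 0.0, 1.0
    while float(step(hi) @ G @ step(hi)) > Delta ** 2:
        hi *= 2.0                             # bracket the multiplier
    while hi - lo > tol * (1.0 + hi):
        lam = 0.5 * (lo + hi)
        if float(step(lam) @ G @ step(lam)) > Delta ** 2:
            lo = lam
        else:
            hi = lam
    return step(hi)                           # feasible, nearly active solution
```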

A key question for solving problem (28) is how to compute \(Q_{k}\) efficiently when \(H_{k}\) is of huge dimension. In fact, \(Q_k\) can be calculated efficiently without computing and storing the full matrix \(H_k\). From (23), we can write \(Q_{k}\) as

$$\begin{aligned} Q_{k}=V^{\top }_{k}J_{k}^{\top }J_{k}V_{k} + V^{\top }_{k}A_{k}V_{k}. \end{aligned}$$
(29)

For the term \(V_{k}^{\top }J_{k}^{\top }J_{k}V_{k}\) in (29), we compute \(J_{k}V_{k}\) in a columnwise way: For a sufficiently small \(\epsilon >0\),

$$\begin{aligned} J_{k}g_{k}\approx \frac{F(z_{k}+\epsilon g_{k})-F(z_{k})}{\epsilon },~~ J_{k}d_{k}^{i}\approx \frac{F(z_{k}+\epsilon d^{i}_{k})-F(z_{k})}{\epsilon },~ i=1,\cdots ,L-1. \end{aligned}$$

On the other hand, \(A_{k}V_{k}\) in the term \(V_{k}^{\top }A_{k}V_{k}\) can also be computed columnwise by a series of vector-vector products.
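Putting these pieces together, \(Q_k\) can be assembled from L forward-difference evaluations of F and the columnwise products with \(A_k\); the following sketch assumes a callable F_fn that evaluates the smoothed residual \(F_{N,\mu }\) and, for simplicity, forms \(A_k\) explicitly.

```python
import numpy as np

def build_Qk(z, V, A, F_fn, eps=1e-6):
    """Assemble Q_k = V^T J^T J V + V^T A V as in (29); J V is approximated
    column by column with forward differences of F, as described above."""
    Fz = F_fn(z)
    JV = np.column_stack([(F_fn(z + eps * V[:, j]) - Fz) / eps
                          for j in range(V.shape[1])])
    return JV.T @ JV + V.T @ (A @ V)

# c_k = V.T @ g_k and G_k = V.T @ V complete the data of subproblem (28).
```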

We give the QNSTR algorithm in Algorithm 1.

Algorithm 1

The QNSTR Algorithm

Trust-region algorithms are a class of popular numerical methods for optimization problems [6, 43]. Our QNSTR algorithm uses the special structure of the VI in (15) and (17) to construct the subproblem (28). The global convergence of the QNSTR algorithm is given in the following theorem.

Theorem 2

Suppose that X and Y are nonempty and bounded boxes and \({\hat{f}}_N\) is twice continuously differentiable. Let \(\{z_{k}\}_{k=0}^{\infty }\) be generated by Algorithm 1. Then there exists an \(M>0\) such that \(\left\| \nabla r(z) - \nabla r(z')\right\| \le M \left\| z - z'\right\| \) for any \(z,z'\in S(R_0)\) and \(\left\| H_k\right\| \le M\) for \(k\in {\mathbb {N}}\). Moreover, we have \(\lim _{k\rightarrow \infty }\Vert g_{k}\Vert =0.\)

The proof is given in Appendix B.

To end this section, we give some remarks about Theorem 2.

Remark 2

From Corollary 2.2.5 in [12], the continuity of F and the boundedness of X and Y imply that the solution set \({{{\mathcal {Z}}}}^*\) of \(F(z)=0\) is nonempty and bounded. Hence the set of minimizers of the least squares problem \(\min r(z)\) with the optimal value zero coincides with the solution set of \(F(z)=0\). If all matrices in \(\partial _C F_N(z)\) for \(z\in {{{\mathcal {Z}}}}^*\) are nonsingular, we know from the compactness of \(\partial _C F_N(z)\) and (ii) of Lemma 3 that there is a \(\mu _0 >0\) such that for any \(\mu \in (0, \mu _0)\), J(z) is nonsingular and \(\sup _{\mu \in (0,\mu _0)}\left\| J(z)^{-1}\right\| \le C\) for some \(C>0\). Thus,

$$\begin{aligned} \left\| F(z)\right\| = \left\| (J(z)^\top )^{-1} \nabla r(z)\right\| \le \left\| (J(z)^\top )^{-1}\right\| \left\| \nabla r(z)\right\| \le C\left\| \nabla r(z)\right\| . \end{aligned}$$

We have from (i) of Lemma 3, i.e., \(\left\| F_N(z) - F(z) \right\| \le \kappa \mu \) that

$$\begin{aligned} \left\| F_N(z)\right\| \le C \left\| \nabla r(z)\right\| + \kappa \mu . \end{aligned}$$

Thus, for any \(\epsilon >0\), we can select parameters \(\delta \) and \(\mu \) with \(C \delta + \kappa \mu \le \epsilon \), so that any z with \(\left\| \nabla r(z)\right\| \le \delta \) satisfies \(\left\| F_N(z)\right\| \le \epsilon \).

According to Theorem 2, we can find a point \(z_k\) such that

$$\begin{aligned} \left\| g_k\right\| = \left\| \nabla r(z_k)\right\| \le \delta , \end{aligned}$$

with a given stopping criterion parameter \(\delta \). The numerical experiments in the next section show that the QNSTR algorithm can find an \(\epsilon \)-first-order stationary point of problem (3).

4 Numerical Experiments

In this section, we report some numerical results of the QNSTR algorithm for finding an \(\epsilon \)-first-order stationary point of problem (3). We also compare the QNSTR algorithm with some state-of-the-art algorithms for min-max problems. All of the numerical experiments in this paper are implemented with TensorFlow 1.13.1, Python 3.6.9 and CUDA 10.0 on a server with one Tesla P100-PCIE GPU with 16 GB memory at 1.3285 GHz and a 64-bit operating system in the University Research Facility in Big Data Analytics (UBDA) of The Hong Kong Polytechnic University. (UBDA website: https://www.polyu.edu.hk/ubda/.)

We test our algorithm on two practical problems. One is a GAN-based image generation problem on the MNIST handwriting data. The other is a mixed model for image segmentation on the Digital Retinal Images for Vessel Extraction (DRIVE) data. In the first experiment, we use the notation QNSTR(L) to indicate that the dimension of the subspace spanned by the columns of \(V_k\) is L in the QNSTR algorithm, and test the efficiency of QNSTR(L) under different choices of L and directions \(\{d^{i}_{k}\}_{i=1}^{L-1}\). In the second experiment, we apply the QNSTR algorithm to a medical image segmentation problem.

To ensure that \(H_N\) is continuously differentiable, we choose the Gaussian Error Linear Unit (GELU) [16]

$$\begin{aligned} \sigma (x) = x\int _{-\infty }^{x}\frac{e^\frac{-t^{2}}{2\varSigma ^{2}}}{\sqrt{2\pi }\varSigma }dt \end{aligned}$$

with \(\varSigma =10^{-4}\) as the activation function in each hidden layer in D and G, and Sigmoid

$$\begin{aligned} \sigma (x)=\frac{1}{1+e^{-x}} \end{aligned}$$

as the activation function of the output layers in D and G.
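For reference, both activations have a closed form in terms of the standard normal CDF and the exponential function; the sketch below assumes scipy.special.erf is available.

```python
import numpy as np
from scipy.special import erf

def gelu(x, Sigma=1e-4):
    """GELU with scale Sigma: x * Phi(x / Sigma), where Phi is the standard
    normal CDF; this equals the integral form given above."""
    return x * 0.5 * (1.0 + erf(x / (Sigma * np.sqrt(2.0))))

def sigmoid(x):
    """Sigmoid activation used at the output layers of D and G."""
    return 1.0 / (1.0 + np.exp(-x))
```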

We test different choices of the subspace spanned by the columns of \(V_k\). Specifically, denote

$$\begin{aligned} & V^z_k= [ -g_k, (z_{k}-z_{k-1}), \cdots ,(z_{k-L+2}-z_{k-L+1})],\\ & V_k^F=[ -g_k, F(z_{k}), \cdots , F(z_{k-L+2})],\\ & V_k^g= -[g_k, g_{k-1}, \cdots , g_{k-L+1}] \end{aligned}$$

for \(L\ge 2.\) In all the experiments in the sequel, the initial point \(z_{0}\) is obtained by running 10,000 steps of the alternating Adam with step size 0.0005. The parameters in Algorithm 1 are set as \({\bar{\varDelta }}=100\), \(\varDelta _0=1\), \(\beta _1=0.5\), \(\beta _2=2\), \(\eta =0.01\), \(\zeta _{1}=0.02\), \(\zeta _{2}=0.05\). The parameters \({\bar{\epsilon }}\) and \(\gamma \) in (24) and (25) are chosen as \({\bar{\epsilon }}=10^{-4}\) and \(\gamma =10^3.\)
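The three subspace choices above can be formed from stored histories of the iterates, residuals and gradients; the following sketch is one way to do so (the history layout, most recent first with at least L entries, is an assumption of this illustration).

```python
import numpy as np

def build_Vk(g_hist, z_hist, F_hist, kind="z", L=4):
    """Form the L columns of V_k^z, V_k^F or V_k^g defined above.

    g_hist, z_hist, F_hist hold past gradients g, iterates z and residuals F,
    most recent first. The first column is always -g_k.
    """
    cols = [-g_hist[0]]
    if kind == "z":                  # V_k^z: successive iterate differences
        cols += [z_hist[j] - z_hist[j + 1] for j in range(L - 1)]
    elif kind == "F":                # V_k^F: past residuals
        cols += [F_hist[j] for j in range(L - 1)]
    else:                            # V_k^g: past (negated) gradients
        cols += [-g_hist[j + 1] for j in range(L - 1)]
    return np.column_stack(cols)
```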

4.1 Numerical Performance on MNIST Data

In this subsection, we report some preliminary numerical results of the QNSTR algorithm for solving the following SAA counterpart

$$\begin{aligned} \begin{aligned} \min _{x\in X}\max _{y\in Y} \frac{1}{N}\sum ^{N}_{i=1}\Big (\log (D(y,\xi ^i_1)) + \log (1- D(y,G(x,\xi ^i_2))) \Big ) \end{aligned} \end{aligned}$$
(30)

of problem (2) by using MNIST handwriting data. We consider a two-layer GAN, where

$$\begin{aligned} & \xi ^i_1\in {\mathbb {R}}^{784}, \, \xi ^i_2\in {\mathbb {R}}^{100},\\ & W^1_G\in {\mathbb {R}}^{N_G^1\times 100}, \,W^2_G\in {\mathbb {R}}^{784\times N_G^1},\, W^1_D\in {\mathbb {R}}^{N^1_D\times 784},\,W^2_D\in {\mathbb {R}}^{1\times N^1_D} \end{aligned}$$

with different choices of dimensions \(N^1_G\) and \(N^1_D\) for hidden outputs. Here \(\{\xi ^i_2\}\) are generated from a uniform distribution \({{{\mathcal {U}}}} (-1, 1)^{100}\). All initial weight matrices \(W^{i}_{G}\) and \(W^{i}_{D}\) for \(i=1,2\) are randomly generated from the Gaussian distribution with mean 0 and standard deviation 0.1, and all initial bias vectors \(b^{i}_{G}\) and \(b^{i}_{D}\) for \(i=1,2\) are set to be zero. X and Y are set as \([-1,1]^{n}\) and \([-1,1]^{m}\), respectively.

4.1.1 Numerical Results for SAA Problems

We first study the performance of SAA problems under different sample sizes on a GAN model with a two-layer generator with \(N_{G}^{1}=64\) and a two-layer discriminator with \(N_{D}^{1}=64\). We set \({\hat{N}}=10000\) as the benchmark to approximate the original problem (2) and let \(N=100, 500, 1000, 2000, 5000\). For each N, we solve \(F_N(z)=0\) 50 times with different samples by using the QNSTR algorithm. We stop the iteration when either

$$\begin{aligned} \Vert F_{N}(z_k)\Vert \le 10^{-5} \end{aligned}$$
(31)

holds or the number of iterations exceeds 5000.

In these experiments, we set \(\mu =10^{-8}\). We use \(z^{*}_{N}\) to denote the first point that satisfies (31) in the iteration for \(N=500, 1000, 2000, 5000\), and we measure its optimality by the residual

$$\begin{aligned} \text {res}_{N}:=\Vert z^{*}_{N}-\text {mid}\left( l,u,z^{*}_{N}- H_{{\hat{N}}}(z^{*}_{N})\right) \Vert . \end{aligned}$$

Figure 1 shows the convergence of \(\textrm{res}_{N}\) to zero as N grows. Table 1 presents the mean, the standard deviation (std) and the width of the 95% CI of \(\textrm{res}_{N}\) over the 50 runs. It shows that all values decrease as the sample size N increases. Both Fig. 1 and Table 1 validate the convergence results in Sect. 2.

Fig. 1

The convergence of \(\textrm{res}_{N}\) as N grows (left: The range of \(\textrm{res}_{N}\) with different N, right: The boxplot of \(\textrm{res}_{N}\) with different N)

Table 1 Means, variances and 95% CIs of \(\textrm{res}_{N}\) with different N

4.1.2 Numerical Results for Smoothing Approximation

In this subsection, for a fixed sample size \(N=2000\), we study how the smoothing parameter \(\mu \) affects the residual \(\Vert F_{N,\mu }(z)\Vert \). All numerical results in this part are based on a GAN model which consists of a two-layer generator with \(N_{G}^{1}=64\) and a two-layer discriminator with \(N_{D}^{1}=64\). Specifically, for \(\mu =10^{-t}, t=1, 2, 4, 5, 6, 8\), we generate 50 test problems, respectively. For each \(\mu \), we solve problem \(F_{N,\mu }(z)=0\) by the QNSTR algorithm. We stop the iteration when either the condition \(\Vert F_{N,\mu }(z_k)\Vert \le 10^{-5}\) holds or the number of iterations exceeds 5000.

We use \(z^{*}_{\mu }\) to denote the first point in the iteration that satisfies this stopping criterion, and we measure the residual of \(z_{\mu }^{*}\) by

$$\begin{aligned} \text {res}_{\mu }:=\Vert z^{*}_{\mu }-\text {mid}\left( l,u,z^{*}_{\mu }- H_{N}(z^{*}_{\mu })\right) \Vert . \end{aligned}$$

The numerical results are presented in Fig. 2, which shows that \(\textrm{res}_{\mu }\) decreases as smoothing parameter \(\mu \) decreases. In fact, the residual \(\textrm{res}_{\mu }\) becomes stable when \(\mu \le 10^{-5}\). Note that it is not difficult to obtain \(\kappa =\frac{\sqrt{n+m}}{8}\) in (18). Also, when \(N_{D}^{1}=N_{G}^{1}=64\), we have \(n+m=107729\). If \(\Vert F_{N,\mu }(z)\Vert \le 10^{-5}\) and \(\mu =10^{-8}\), then we have \(\Vert F_{N}(z)\Vert \le \kappa \mu + \epsilon \le \frac{\sqrt{107729}}{8}\times 10^{-8} +10^{-5}\approx 1.041\times 10^{-5}\). This is consistent with the theoretical results in Sect. 3.

Fig. 2

The convergence of \(\textrm{res}_{\mu }\) as \(\mu \) decreases (left: The range of \(\textrm{res}_{\mu }\) with different \(\mu \); right: The boxplot of \(\textrm{res}_{\mu }\) with different \(\mu \))

4.1.3 Comparison Experiments

In this subsection, we report some numerical results comparing the QNSTR algorithm with some commonly-used methods. To this end, we set \(N=2000\) and \(\mu =10^{-8}\). We use \(\Vert F_{N,\mu }(z_{k})\Vert \) to measure the performance of these algorithms and apply the Frechet Inception Distance (FID) score to measure the quality of the images generated by the generator G trained by the different algorithms. We terminate all these algorithms when one of the following three cases holds: \(\Vert F_{N,\mu }(z_{k})\Vert \le 10^{-5}\), \(\Vert \nabla F_{N,\mu }(z_{k})^\top F_{N,\mu }(z_{k})\Vert \le 10^{-8}\), or the number of iterations exceeds 5000.

We present the numerical results in Figs. 3, 4 and 5. It is not difficult to observe from these figures that the QNSTR algorithm performs well under all three subspace choices.

Fig. 3

The residual \(\Vert F_{N,\mu }(z_k)\Vert \) with \(V^z_{k}\) (left: \(N^{1}_{G}=N^{1}_{D}=64\); right: \(N^{1}_{G}=N^{1}_{D}=128\))

Fig. 4

The residual \(\Vert F_{N,\mu }(z_k)\Vert \) with \(V^F_{k}\) (left: \(N^{1}_{G}=N^{1}_{D}=64\); right: \(N^{1}_{G}=N^{1}_{D}=128\))

Fig. 5

The residual \(\Vert F_{N,\mu }(z_k)\Vert \) with \(V^g_{k}\) (left: \(N^{1}_{G}=N^{1}_{D}=64\); right: \(N^{1}_{G}=N^{1}_{D}=128\))

We also compare the QNSTR algorithm with some commonly-used algorithms for training GANs, including simultaneous and alternate gradient descent-ascent (GDA) [9, 45], simultaneous and alternate optimistic gradient descent-ascent (OGDA) [9, 45], the \(\gamma \)-alternate Adam, and the proximal point algorithm (PPA) [11, 24]. The initial point \(z_{0}\) of all these methods is given by the alternating Adam with stepsize 0.0005 and 10,000 iterations. We use grid search to select the hyper-parameters of these methods. For simultaneous and alternate GDA and simultaneous and alternate OGDA, the stepsize is \(\alpha = 0.5, 0.05, 0.005, 0.001, 0.0005\). For the \(\gamma \)-alternate Adam, the ratio is \(\gamma = 1, 2, 3, 5, 10\). For PPA, the proximal parameter is \({\overline{L}}= 10, 100, 500, 1000, 5000\) and the stepsize is \(\frac{0.1}{2{\overline{L}}}\). Besides, the k-th inner loop is stopped at a tolerance of \(\frac{0.01}{k^2}\) or when the number of inner iterations exceeds 100. The comparison results between the QNSTR algorithm and the \(\gamma \)-alternate Adam, and between the QNSTR algorithm and PPA, are given in Figs. 6 and 7, respectively. According to the results, the \(\gamma \)-alternate Adam also shows an outstanding convergence tendency under a suitable hyper-parameter, and the QNSTR algorithm can perform even better than the \(\gamma \)-alternate Adam if a suitable search space \(V_{k}\) is selected. The PPA performs poorly in these comparisons.

Fig. 6

Comparison results between the QNSTR algorithm for \(V^{z}_{k}\), \(V_{k}^{F}\), \(V_{k}^{g}\) and \(\gamma \)-alternate Adam (left: \(N^{1}_{G}=N^{1}_{D}=64\); right \(N^{1}_{G}=N^{1}_{D}=128\))

Fig. 7

Comparison results between the QNSTR algorithm for \(V^{z}_{k}\), \(V_{k}^{F}\), \(V_{k}^{g}\) and PPA (left: \(N^{1}_{G}=N^{1}_{D}=64\); right \(N^{1}_{G}=N^{1}_{D}=128\))

The comparison results between the QNSTR algorithm and simultaneous and alternate GDA, simultaneous and alternate OGDA are given in Fig. 8. To show these comparison results more clearly, we only give the results of each method with the optimal stepsize \(\alpha \) in the search range.

Fig. 8

Comparison results between the QNSTR algorithm for \(V^{z}_{k}\), \(V_{k}^{F}\), \(V_{k}^{g}\) and simultaneous GDA, alternate GDA, simultaneous OGDA, alternate OGDA (left: \(N^{1}_{G}=N^{1}_{D}=64\); right: \(N^{1}_{G}=N^{1}_{D}=128\))

We can observe from these figures that the QNSTR algorithm outperforms the other methods, which validates that the QNSTR algorithm is more efficient in finding an \(\epsilon \)-first-order stationary point of problem (30).

We also record the final FID score of each algorithm's output. All results are given in Tables 2 and 3. They show that the generator of GANs trained by the QNSTR algorithm can generate high-quality images.

Table 2 FID scores of different algorithms with \(N_D^1= N_G^1=64\)
Table 3 FID scores of different algorithms with \(N_D^1= N_G^1=128\)

4.2 DRIVE Data

Image segmentation, the process of partitioning a digital image into multiple image segments, is an important component in many visual understanding systems [38]. It plays a central role in a broad range of applications [13], including medical image analysis, autonomous vehicles (e.g., navigable surface and pedestrian detection), video surveillance and augmented reality. One well-known paradigm for image segmentation is based on manually designed loss functions. However, such losses usually lead to blurry segmentation boundaries [17].

In 2016, Phillip et al. [17] introduced a generative adversarial network framework into the objective function of an image-to-image translation problem and found that the blurry outputs produced by a CNN under the \(l_{1}\) norm can be reduced. In the same year, Pauline et al. [25] introduced a mixed objective function combining a GAN loss and a cross-entropy loss for semantic image segmentation, and also achieved better performance. A similar idea of mixing GAN and traditional losses can also be found in [37, 46].

The mixed model has the following form

$$\begin{aligned} \min _{x\in X}\max _{y\in Y} \bigg \{ {\hat{f}}_{N}(x,y): =&\frac{1}{N}\sum _{i=1}^N \lambda \cdot \psi \big (\xi _1^i, G(x,\xi _2^i)\big ) \\&+ \frac{1}{N}\sum _{i=1}^N \Big ( \log (D(y,\xi _1^i)) + \log (1- D(y,G(x,\xi _2^i)))\Big ) \bigg \}, \end{aligned}$$
(32)

where X and Y are bounded boxes, \(\{(\xi ^{i}_{1},\xi ^{i}_{2})\}_{i=1}^{N}\) is the collected data, \(\xi ^{i}_{2}\) is the original image and \(\xi ^{i}_{1}\) is its corresponding label. Problem (32) is a special case of problem (3), which can be viewed as the discrete generative adversarial problem (30) with an extra classical supervision term. The classical supervision part, i.e.,

$$\begin{aligned} \min _{x\in X}\frac{1}{N}\sum _{i=1}^{N}\Big [ \psi \big (\xi ^{i}_1, G(x,\xi ^{i}_2)\big ) \Big ] \end{aligned}$$
(33)

minimizes the difference between the output of G for a given input \(\xi _{2}\) and its corresponding label \(\xi _{1}\). The model can be regarded as a combination of a classical supervised learning problem and a generative adversarial problem with a trade-off parameter \(\lambda \in [0,\infty )\). When \(\lambda =0\), problem (32) reduces to the vanilla GAN problem (30). When \(\lambda \rightarrow \infty \), problem (32) tends to the classical supervised learning problem (33).
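As an illustration of how (32) combines the two terms, the following is a minimal sketch of the mixed SAA objective; the pixelwise squared error standing in for \(\psi \) and the callables G_fn and D_fn are assumptions of this sketch (the experiment below uses \(\lambda =10\)).

```python
import numpy as np

def f_hat_mix(x, y, samples, G_fn, D_fn, lam=10.0):
    """Mixed SAA objective (32): lam * supervised term + GAN term.
    samples is a list of (xi1, xi2) = (label image, input image) pairs."""
    def psi(label, pred):
        # illustrative choice of psi: pixelwise squared error
        return float(np.mean((label - pred) ** 2))
    sup = np.mean([psi(xi1, G_fn(x, xi2)) for xi1, xi2 in samples])
    gan = np.mean([np.log(D_fn(y, xi1)) + np.log(1.0 - D_fn(y, G_fn(x, xi2)))
                   for xi1, xi2 in samples])
    return lam * sup + gan
```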

The fundoscopic exam is an important procedure that provides information for diagnosing retinal degenerative diseases such as Diabetic Retinopathy, Macular Edema and Cytomegalovirus Retinitis. A highly accurate system to sketch out the blood vessels and find abnormalities on fundoscopic images is therefore necessary. Although supervised deep learning frameworks such as U-net are able to segment macro vessels accurately, they often fail to segment microvessels with high certainty. In this part, we train a U-net as the generator in our framework on the DRIVE data. The data set, downloaded from the open-source website (https://drive.grand-challenge.org/), includes 20 eye blood vessel images with manual segmentation labels. We use 16 images as training data and the other 4 as testing data. In this experiment, the structure of the segmentation model G is the U-net [34], which includes 18 layers with \(n=121435\) parameters, and the structure of the discrimination model D is a deep convolutional neural network which contains 5 convolutional layers and 1 fully connected layer with \(m=142625\) parameters. The feasible sets X and Y are set as \([-5,5]^{n}\) and \([-5,5]^{m}\), respectively. We use the activation function GELU in all layers except the output layers of D and G, where Sigmoid is used. We compare our results based on problem (32) with some existing models. In our experiment, we use \(\lambda =10\) and \(V^z_k\) with \(L=4\).

We compute traditional metrics such as the F1-score, Sensitivity, Specificity and Accuracy. These metrics are defined as follows:

$$\begin{aligned} \textrm{Sensitivity}= & \frac{1}{N}\sum _{i=1}^{N}\frac{|\textrm{GT}_{i}\cap \textrm{SR}_{i}|}{|\textrm{GT}_{i}\cap \textrm{SR}_{i}|+|\textrm{GT}_{i}\cap \textrm{SR}_{i}^{c}|},\\ \textrm{Specificity}= & \frac{1}{N}\sum _{i=1}^{N}\frac{|\textrm{GT}_{i}^{c}\cap \textrm{SR}_{i}^{c}|}{|\textrm{GT}_{i}^{c}\cap \textrm{SR}_{i}^{c}|+|\textrm{GT}_{i}^{c}\cap \textrm{SR}_{i}|},\\ \textrm{Accuracy}= & \frac{1}{N}\sum _{i=1}^{N}\frac{|\textrm{GT}_{i}\cap \textrm{SR}_{i}|+|\textrm{GT}_{i}^{c}\cap \textrm{SR}_{i}^{c}|}{|\varOmega |},\\ \textrm{Precision}= & \frac{1}{N}\sum _{i=1}^{N}\frac{|\textrm{GT}_{i}\cap \textrm{SR}_{i}|}{|\textrm{GT}_{i}\cap \textrm{SR}_{i}|+|\textrm{GT}_{i}^{c}\cap \textrm{SR}_{i}|},\\ \textrm{F1}= & \frac{2\,\textrm{Precision}\times \textrm{Sensitivity}}{\textrm{Precision}+\textrm{Sensitivity}}, \end{aligned}$$

where \(\varOmega \) is the universal set of all pixel indices in an image, \(\textrm{GT}_{i}\) is the ground truth vessel index set for the i-th image, and \(\textrm{SR}_{i}\) is the set of pixel indices labelled as vessel in the i-th image's segmentation result. Furthermore, we compute the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) [2] and the Structural Similarity Index Measure (SSIM) [40].
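For completeness, the averaged metrics above can be computed from boolean ground-truth and segmentation masks as in the following sketch (nonzero denominators are assumed).

```python
import numpy as np

def segmentation_metrics(gt_masks, sr_masks):
    """Average Sensitivity, Specificity, Accuracy, Precision and F1 over N
    boolean masks, following the formulas above."""
    N = len(gt_masks)
    sens = spec = acc = prec = 0.0
    for gt, sr in zip(gt_masks, sr_masks):
        gt, sr = gt.astype(bool), sr.astype(bool)
        tp = np.sum(gt & sr)        # pixels correctly labelled as vessel
        tn = np.sum(~gt & ~sr)      # pixels correctly labelled as background
        fp = np.sum(~gt & sr)       # background labelled as vessel
        fn = np.sum(gt & ~sr)       # vessel labelled as background
        sens += tp / (tp + fn)
        spec += tn / (tn + fp)
        acc += (tp + tn) / gt.size
        prec += tp / (tp + fp)
    sens, spec, acc, prec = sens / N, spec / N, acc / N, prec / N
    f1 = 2.0 * prec * sens / (prec + sens)
    return dict(Sensitivity=sens, Specificity=spec,
                Accuracy=acc, Precision=prec, F1=f1)
```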

Table 4 shows that the QNSTR algorithm for solving problem (32) is more promising for blood vessel segmentation. In Fig. 9, we visualize the error between vessel map generated by problem (32) with the QNSTR algorithm and the manual segmentation. In Fig. 10, we compare the segmentation results of problem (32) based on the QNSTR algorithm and the alternating Adam approach, respectively.

Table 4 The performance of QNSTR algorithm and alternating Adam for model (32), and other methods in [1, 20, 21] for model (33)
Fig. 9

Row 1. fundus image, Row 2. manual segmentation, Row 3. vessel map generated by GANs with QNSTR algorithm, Row 4. yellow(correct); red(wrong); green(missing), Row 5. Error (Color figure online)

Fig. 10

Comparison of Alternating Adam and QNSTR Algorithm. Columns from left to right are: 1. manual segmentation, 2. vessel map generated by GANs with Alternating Adam, 3. yellow(correct); red(wrong); green(missing) of Alternating Adam, 4. vessel map generated by GANs with proposed Algorithm, 5. yellow(correct); red(wrong); green(missing) of QNSTR Algorithm (Color figure online)

5 Conclusion

In this paper, we propose a new QNSTR algorithm for solving the large-scale min-max optimization problem (3) via the nonmonotone VI (4). Based on the structure of the problem, we use a smoothing function \(F_{N,\mu }\) to approximate the nonsmooth function \(F_N\), and consider the smoothing least squares problem (19). We adopt an adaptive quasi-Newton formula in [47] to approximate the Hessian matrix and solve a strongly convex quadratic program with ellipse constraints in a low-dimensional subspace at each step of the QNSTR algorithm. We prove the global convergence of the QNSTR algorithm to a stationary point of the least squares problem, which is an \(\epsilon \)-first-order stationary point of the min-max optimization problem if every element of the generalized Jacobian of \(F_N\) is nonsingular at the point. In our numerical experiments, we test the QNSTR algorithm by using two real data sets: the MNIST data and the DRIVE data. Preliminary numerical results validate that the QNSTR algorithm outperforms some existing algorithms.