Abstract
In this paper, we develop a splitting algorithm incorporating Bregman distances to solve a broad class of linearly constrained composite optimization problems, whose objective function is the separable sum of possibly nonconvex nonsmooth functions and a smooth function, coupled by a difference of functions. This structure encapsulates numerous significant nonconvex and nonsmooth optimization problems in the current literature including the linearly constrained difference-of-convex problems. Relying on the successive linearization and alternating direction method of multipliers (ADMM), the proposed algorithm exhibits the global subsequential convergence to a stationary point of the underlying problem. We also establish the convergence of the full sequence generated by our algorithm under the Kurdyka–Łojasiewicz property and some mild assumptions. The efficiency of the proposed algorithm is tested on a robust principal component analysis problem and a nonconvex optimal power flow problem.
1 Introduction
Throughout this paper, the set of nonnegative integers is denoted by \({\mathbb {N}}\), the set of positive integers by \({\mathbb {N}}^*{:}{=}{\mathbb {N}}{\setminus }\{0\}\), the set of real numbers by \({\mathbb {R}}\), the set of nonnegative real numbers by \({\mathbb {R}}_+\), and the set of positive real numbers by \({\mathbb {R}}_{++}\). We use \({\mathbb {R}}^d\) to denote the d-dimensional Euclidean space with inner product \(\langle \cdot , \cdot \rangle \) and Euclidean norm \(\Vert \cdot \Vert \). For a given matrix M, \({\text {Im}}(M)\) denotes its image.
We consider the linearly constrained composite optimization problem
where \(\textbf{x}=(x_1,\dots ,x_m)\); for each \(i\in \{1,\dots ,m\}\), \(f_i: {\mathbb {R}}^{d_i} \rightarrow (-\infty ,+\infty ]\) is a proper lower semicontinuous function; \(H:{\mathbb {R}}^q \rightarrow {\mathbb {R}}\) is a differentiable (possibly nonconvex) function whose gradient is Lipschitz continuous; \(P:{\mathbb {R}}^{d_1}\times \dots \times {\mathbb {R}}^{d_m} \rightarrow {\mathbb {R}}\) is another differentiable (possibly nonconvex) function whose gradient is Lipschitz continuous; \(G:{\mathbb {R}}^{d_1}\times \dots \times {\mathbb {R}}^{d_m} \rightarrow {\mathbb {R}}\) is a continuous (possibly nonsmooth) weakly convex function, \(A_i \in {\mathbb {R}}^{p\times d_i}\), \(B\in {\mathbb {R}}^{p\times q}\), and \(b \in {\mathbb {R}}^p\). Problem (1) covers two important models in the literature. Firstly, when \(G \equiv 0\), problem (1) reduces to the form
which is studied in [27]. This class of problems, together with its special cases considered in [11, 28], has a broad range of useful applications in image processing, matrix decomposition, statistical learning, and others [27, 29]. In [27], the authors studied the multi-block alternating direction method of multipliers (ADMM) in the nonconvex nonsmooth setting, and introduced new assumptions such as restricted prox-regularity and Lipschitz subminimization path to establish global convergence. A special case of (2) in which \(P\equiv 0\) was considered in [30], where proximal terms and a dual stepsize were incorporated into the ADMM to solve the problem. An inertial proximal splitting method was also developed in [26] to revisit that problem. Other simpler special cases of (2) can be addressed by variants of the three-block ADMM [14, 15, 28] or two-block ADMM [11, 13, 25], and their convergence is well studied in the literature. Secondly, when \(P \equiv 0\), \(m =1\), \(x_1 =x \in {\mathbb {R}}^d\), and \(A_1 =A \in {\mathbb {R}}^{p\times d}\), problem (1) reduces to
This is known as the linearly constrained difference-of-convex problem [23, 24], in which f, H, and G are all assumed to be convex. Problem (3) arises from the use of difference-of-convex regularization terms, which have shown very promising performance [10, 31]. One practical engineering problem that can be solved via problem (3) is the total variation image restoration problem [20, 23, 24], which is one of the most fundamental issues in imaging science. The authors of [23] developed a variant of the ADMM which incorporates Bregman distances in the updating steps for solving problem (3) in the case when f, G, and H are convex functions, with H being differentiable with a Lipschitz continuous gradient. A hybrid Bregman ADMM was then proposed in [24] to revisit problem (3) with the same assumptions on f, G, and H. This version introduces an additional step that evaluates the convex conjugate of G, allowing the solution to be updated via either a subgradient step or a proximal step, with a control parameter that activates or deactivates the additional step. With this strategy, the algorithm includes the one in [23] as a special case. However, to the best of the authors' knowledge, more general forms of problem (3) have not yet been studied in the literature.
Inspired by the success of these previous studies, we aim to study a class of problems that covers both (2) and (3). We develop a splitting algorithm based on the ADMM to solve problem (1), which not only includes problem (2) as a special case, but also extends problem (3) to a multi-block, nonsmooth, and nonconvex setting. Bregman distances are also incorporated into the algorithm, making it more general and allowing flexibility in choosing appropriate proximal terms so that the subproblems can be solved efficiently. The sequence generated by our algorithm is bounded and each of its cluster points is a stationary point of the problem. We also prove the convergence of the full sequence under the Kurdyka–Łojasiewicz property of a suitable function.
Compared to the two closely related works in [23, 24], the convexity of \(f_i\) is not required, H can be nonconvex, and G can also be weakly convex. In other words, the convexity requirement in our work is weaker, allowing a more general class of structured optimization problems. For the convergence of the full sequence, we do not require the restricted prox-regularity assumption on \(f_i\) which was used in [27]. Also, although we need the assumption that B has full column rank, we do not require any assumptions on the column rank of any matrices \(A_i\).
In the following examples, we present two practical problems which motivate our research.
Example 1.1
(Robust principal component analysis (RPCA) [14]) The RPCA problem is formulated as
where M is a given \(m\times d\) observation matrix, \(\Vert \cdot \Vert _{*}\) is the nuclear norm, \(\Vert \cdot \Vert _1\) is the component-wise \(L_1\) norm (defined as the sum of absolute values of all entries) which controls the sparsity of the solution, and \(\Vert \cdot \Vert _F\) is the standard Frobenius norm which controls the noise level; \(\tau \) and \(\gamma \) are positive penalty constants. By introducing a new variable \(T=L+S\), the problem is rewritten as
This problem can be addressed by using the three-block ADMM as shown in [14]. Recently, \(L_1-L_2\) regularization has been shown to have better performance than \(L_1\) alone (see [17] and references therein). That motivates us to investigate the effects of \(\Vert \cdot \Vert _1-\Vert \cdot \Vert \) on this problem, where \(\Vert \cdot \Vert \) is the spectral norm. By using this modified regularization term, the optimization problem fits into the structure of (1). We will later solve this problem using our proposed algorithm in Sect. 4.
Example 1.2
(Direct current optimal power flow (DC-OPF) [2]) Optimal power flow is an important problem in power system engineering [1]. A simplified formulation of the problem is as follows. Let \(x_i \in {\mathbb {R}}^{d_i}\) be the variable associated with the \(i^{th}\) node in the network. The mathematical model is then given by
where \(Q_i\), \(A_i\), \(q_i\), and b are coefficient matrices and vectors. The authors of [2] have successfully used multi-block ADMM to solve this problem with promising results. As modern electrical grids require the integration of renewable energy resources such as photovoltaic (PV) systems, it is necessary to find their optimal locations in the network while satisfying the power flow. This leads to the introduction of binary variables into the OPF formulation. Binary relaxation and reformulation into difference-of-convex program have been successfully used to address this problem [21]. However, the algorithm proposed in [21] has to rely on solvers for the subproblems. Therefore, it is tempting to investigate whether using ADMM-based algorithms can further decompose the problem and reduce the computational complexity. In Sect. 4 we will also give the detailed formulation to show how it fits into the structure of (1), and address it with our algorithm.
The rest of the paper is organized as follows. In Sect. 2, the preliminary materials used in this work are presented. In Sect. 3, we introduce our algorithm, and prove the subsequential and full sequential convergence of the proposed algorithm under suitable assumptions. Section 4 presents the numerical results of the algorithm. Finally, the conclusions are given in Sect. 5.
2 Preliminaries
Let \(f:{\mathbb {R}}^d\rightarrow \left( -\infty ,+\infty \right] \). The domain of f is \({\text {dom}}f {:}{=}\{x\in {\mathbb {R}}^d: f(x) <+\infty \}\) and the epigraph of f is \({\text {epi}}f {:}{=} \{(x,\xi )\in {\mathbb {R}}^d\times {\mathbb {R}}: f(x)\le \xi \}\). The function f is said to be proper if \({\text {dom}}f \ne \varnothing \) and it never takes the value \(-\infty \), lower semicontinuous if its epigraph is closed, and convex if its epigraph is convex. The function f is said to be \(\alpha \)-strongly convex (\(\alpha \in {\mathbb {R}}_+\)) if \(f-\frac{\alpha }{2}\Vert \cdot \Vert ^2\) is convex, and \(\beta \)-weakly convex (\(\beta \in {\mathbb {R}}_+\)) if \(f+\frac{\beta }{2}\Vert \cdot \Vert ^2\) is convex. We say that f is coercive if \(f(x)\rightarrow +\infty \) as \(\Vert x\Vert \rightarrow +\infty \).
Let \(x\in {\mathbb {R}}^d\) with \(|f(x)| <+\infty \). The regular (or Fréchet) subdifferential of f at x is defined by
and the limiting (or Mordukhovich) subdifferential of f at x is defined by
where the notation \(y{\mathop {\rightarrow }\limits ^{f}}x\) means \(y\rightarrow x\) with \(f(y)\rightarrow f(x)\). In the case where \(|f(x) |=+\infty \), both regular subdifferential and limiting subdifferential of f at x are defined to be the empty set. The domain of \(\partial _L f\) is given by \({\text {dom}}\partial _L f {:}{=}\{x\in {\mathbb {R}}^d: \partial _L f(x) \ne \varnothing \}\). We now collect some important properties of the limiting subdifferential.
Lemma 2.1
(Calculus rules) Let \(f, g:{\mathbb {R}}^d \rightarrow (-\infty , +\infty ]\) be proper lower semicontinuous functions, and let \(\overline{x} \in {\mathbb {R}}^d\).
-
(i)
(Sum rule) Suppose that f is finite at \(\overline{x}\) and g is locally Lipschitz continuous around \(\overline{x}\). Then \(\partial _L (f + g)(\overline{x}) \subseteq \partial _L f(\overline{x})+\partial _L g(\overline{x})\). Moreover, if g is strictly differentiable at \(\overline{x}\), then \(\partial _L(f + g)(\overline{x}) = \partial _Lf(\overline{x}) + \nabla g(\overline{x})\).
-
(ii)
(Separable sum rule) If \(f(\textbf{x})=\sum _{i=1}^{m}f_i(x_i)\) with \(\textbf{x}=(x_1,\dots ,x_m)\), then \(\partial _L f(\textbf{x}) = \partial _L f_1(x_1) \times \dots \times \partial _L f_m(x_m)\).
Proof
(i) follows from [18, Proposition 1.107(ii) and Theorem 3.36], while (ii) follows from [22, Proposition 10.5]. \(\square \)
Lemma 2.2
(Upper semicontinuity of subdifferential) Let \(f:{\mathbb {R}}^d \rightarrow [-\infty , +\infty ]\) be Lipschitz continuous around \(x \in {\mathbb {R}}^d\) with \(|f(x)| <+\infty \), and consider sequences \((x_n)_{n\in {\mathbb {N}}}\) and \((x_n^*)_{n\in {\mathbb {N}}}\) in \({\mathbb {R}}^d\) such that \(x_n \rightarrow x\) and, for all \({n\in {\mathbb {N}}}\), \(x_n^* \in \partial _L f(x_n)\). Then \((x_n^*)_{n\in {\mathbb {N}}}\) is bounded and its cluster points are contained in \(\partial _L f(x)\).
Proof
This follows from [21, Lemma 2.2]. \(\square \)
We recall the concept of the well-known Bregman distance in a real Hilbert space \({\mathcal {H}}\), which was introduced in [7]. Given a differentiable convex function \(\phi : {\mathcal {H}} \rightarrow {\mathbb {R}}\), the Bregman distance associated with \(\phi \) is defined as \(D_{\phi }(u,v) {:}{=}\phi (u) -\phi (v) -\langle \nabla \phi (v), u -v \rangle \) for \(u, v \in {\mathcal {H}}\).
When \(\phi (x)=\Vert x\Vert ^2\), then \(D_{\phi }(u,v)=\Vert u-v\Vert ^2\). When \(\phi (x)=x^\top M x\), where M is a positive semidefinite matrix, then \(D_{\phi }(u,v) = \Vert u-v\Vert ^2_M {:}{=}(u-v)^\top M (u-v)\). When \({\mathcal {H}}={\mathbb {R}}^{p\times d}\) and \(\phi (X)=\Vert X\Vert _F^2\), then \(D_{\phi }(U,V) =\Vert U-V\Vert _F^2\) [9]. Some useful properties of the Bregman distance are listed in the following proposition.
Proposition 2.1
Let \(\phi : {\mathcal {H}} \rightarrow {\mathbb {R}}\) be a differentiable convex function and let \(D_{\phi }\) be the Bregman distance associated with \(\phi \). Then the following hold:
-
(i)
For all \(u, v \in {\mathcal {H}}\), \(D_{\phi }(u, v)\ge 0\), and \(D_{\phi }(u, v)=0\) if and only if \(u = v\).
-
(ii)
For each \(v \in {\mathcal {H}}\), \(D_{\phi }(\cdot , v)\) is convex.
-
(iii)
If \(\phi \) is \(\alpha \)-strongly convex, then, for all \(u, v \in {\mathcal {H}}\), \(D_{\phi }(u, v) \ge \frac{\alpha }{2} \Vert u-v\Vert ^2\).
-
(iv)
If \(\nabla \phi \) is \(\ell _{\phi }\)-Lipschitz continuous, then, for all \(u, v \in {\mathcal {H}}\), \(D_{\phi }(u,v)\le \frac{\ell _{\phi }}{2}\Vert u-v\Vert ^2\).
Proof
(i) is given in [7, Section 1]. (ii), (iii), and (iv) can be easily verified by using the definitions. \(\square \)
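As a quick numerical illustration of the quadratic examples given before Proposition 2.1, the following MATLAB snippet (a minimal sketch; the handles and test data are ours) evaluates \(D_{\phi }\) directly from its definition and confirms that it coincides with \(\Vert u-v\Vert ^2_M\) when \(\phi (x)=x^\top M x\).

```matlab
% Bregman distance from the definition D_phi(u,v) = phi(u) - phi(v) - <grad phi(v), u - v>.
% The handles phi, gradphi and the test data below are illustrative only.
phi     = @(x, M) x.' * M * x;            % phi(x) = x'Mx with M symmetric positive semidefinite
gradphi = @(x, M) 2 * M * x;              % gradient of x'Mx for symmetric M
D_phi   = @(u, v, M) phi(u, M) - phi(v, M) - gradphi(v, M).' * (u - v);

M = [2 0; 0 1];  u = [1; 3];  v = [-1; 2];
disp(D_phi(u, v, M) - (u - v).' * M * (u - v));   % prints 0 (up to rounding), i.e. D_phi(u,v) = ||u-v||_M^2
```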
We end this section with the following lemma, which will be instrumental in proving the convergence results in the next section.
Lemma 2.3
Let \(h:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) be a differentiable function with \(\ell \)-Lipschitz continuous gradient. Then the following hold:
-
(i)
For all \(x, y\in {\mathbb {R}}^d\), \(|h(y) -h(x) -\langle \nabla h(x), y -x\rangle |\le \frac{\ell }{2}\Vert y -x\Vert ^2\).
-
(ii)
If h is bounded below and \(\ell >0\), then
$$\begin{aligned} \inf _{x\in {\mathbb {R}}^d} \left( h(x) -\frac{1}{2\ell }\Vert \nabla h(x)\Vert ^2\right) >-\infty . \end{aligned}$$
Proof
(i): This follows from [19, Lemma 1.2.3].
(ii): Let \(x\in {\mathbb {R}}^d\). Following the argument of [19, Equation (1.2.19)], we apply (i) with \(y =x -\frac{1}{\ell }\nabla h(x)\) to obtain that
which together with the boundedness of h from below implies the conclusion. \(\square \)
3 Proposed Algorithm and Convergence Analysis
We first fix some notation used throughout the rest of the paper. Let \(\textbf{A}=[A_1|A_2|\dots |A_m]\) and let \(\textbf{I}_{q\times q}\) be the \(q\times q\) identity matrix. Then \({\textbf{A}}{\textbf{x}}=\sum _{i=1}^{m}A_ix_i\). We also denote by \(\partial _L^{x}\) the limiting subdifferential with respect to the x-variable, and by \(\nabla _i f\) the \(i^{th}\) block of the gradient vector \(\nabla f\). Given a matrix M, \(\lambda _{\min }(M)\) denotes its smallest eigenvalue while \(\lambda _{\max }(M)\) denotes its largest eigenvalue. The following assumptions are used in our convergence analysis.
Assumption 3.1
(Standing assumptions)
-
(i)
\(\nabla H\) is \(\ell _H\)-Lipschitz continuous, \(\nabla P\) is \(\ell _P\)-Lipschitz continuous, and G is \(\beta \)-weakly convex.
-
(ii)
\({\text {Im}}(\textbf{A}) \cup \{b\} \subseteq {\text {Im}}(B)\).
-
(iii)
\(\lambda {:}{=}\lambda _{\min }(B^\top B) >0\) (equivalently, B has full column rank).
We note from [27, Section 4.1] that Assumption 3.1(ii) is crucial for the convergence of ADMM methods in nonconvex settings and generally cannot be completely removed. For this assumption to hold, a sufficient condition is that B has full row rank, in which case \({\text {Im}}(B) ={\mathbb {R}}^p\).
Recall that the augmented Lagrangian of problem (1) is given by
where \(z\in {\mathbb {R}}^p\) and \(\rho \in {\mathbb {R}}_{++}\). Now, we present our splitting algorithm with guaranteed global convergence to a stationary point \((\overline{\textbf{x}},\overline{y},\overline{z})\) of \({\mathcal {L}}_\rho \), i.e., \(0 \in \partial _L {\mathcal {L}}_\rho (\overline{\textbf{x}},\overline{y},\overline{z})\), or equivalently,
Remark 3.1
(Discussion of the algorithm structure) Some comments on Algorithm 3.1 are in order.
-
(i)
For each \(n\in {\mathbb {N}}\) and \(g_n\in \partial _L G(\textbf{x}_n)\), we define the linearized augmented Lagrangian as
$$\begin{aligned} {\mathcal {L}}_{\rho ,n}(\textbf{x},y,z)&=\sum _{i=1}^m f_i(x_i)+H(y)+\langle \nabla P(\textbf{x}_n)-g_{n}, \textbf{x} \rangle + \langle z, {\textbf{A}}{\textbf{x}}+By-b \rangle \\&\quad +\frac{\rho }{2}\Vert {\textbf{A}}{\textbf{x}}+By-b\Vert ^2. \end{aligned}$$Then the updating scheme given in Step 2 of Algorithm 3.1 can be written as
$$\begin{aligned} \begin{array}{ll} x_{i,n+1}\!\!\! & =\mathop {{\text {argmin}}}\limits _{x_i \in {\mathbb {R}}^{d_i}}\left( {\mathcal {L}}_{\rho ,n}(\textbf{u}_{i,n+1}(x_i),y_n,z_n) + \mu _n D_{\phi _i}(x_i,x_{i,n})\right) , \\ y_{n+1} & =\mathop {{\text {argmin}}}\limits _{y\in {\mathbb {R}}^q} \left( {\mathcal {L}}_{\rho ,n}(\textbf{x}_{n+1},y,z_n) +\nu _n D_{\psi }(y,y_{n})\right) ,\\ z_{n+1} & =z_n+\rho ({\textbf{A}}{\textbf{x}}_{n+1}+By_{n+1}-b). \end{array} \end{aligned}$$ -
(ii)
In the case that \(P \equiv 0\), \(m =1\), \(x_1 =x \in {\mathbb {R}}^d\), \(A_1 =A \in {\mathbb {R}}^{p\times d}\), and \(\mu _n =\nu _n =1\) for all \(n\in {\mathbb {N}}\), the updates in Step 2 of Algorithm 3.1 become
$$\begin{aligned} \begin{array}{ll} & x_{n+1} \in \mathop {{\text {argmin}}}\limits _{x \in {\mathbb {R}}^{d}} \left( f(x) -\langle g_n, x \rangle + \langle z_n, Ax\rangle +\frac{\rho }{2}\Vert Ax+By_n-b\Vert ^2 + D_{\phi }(x,x_n)\right) ,\\ & y_{n+1} \in \mathop {{\text {argmin}}}\limits _{y \in {\mathbb {R}}^q} \left( H(y) + \langle z_n, By\rangle +\frac{\rho }{2}\Vert Ax_{n+1} + By-b \Vert ^2 + D_{\psi }(y,y_n) \right) ,\\ & z_{n+1}=z_n+\rho (Ax_{n+1}+By_{n+1}-b). \end{array} \end{aligned}$$In turn, Algorithm 3.1 reduces to the one introduced in [23], in which f, H, G are all convex, and both \(\phi \) and \(\psi \) are also assumed to be strongly convex functions and have Lipschitz continuous gradients.
-
(iii)
It is also worth noting that since the problems considered in [11, 13, 15, 25, 26, 28, 30] are all special cases of (2), they can also be solved by our proposed algorithm.
-
(iv)
The performance of the ADMM heavily relies on whether the \(x_i\)-updates and the \(y\)-update can be solved efficiently. The ideal situation is that those updates can be written in tractable forms, such as quadratic forms or closed-form proximal operators (see Sect. 4 for two examples). In such cases, the ADMM can converge faster than algorithms that rely on external solvers. For further discussion of various closed-form proximal operators, we refer the readers to [21, Remark 3.1]. Incorporating Bregman distances into the ADMM also brings benefits in terms of flexibility and performance, as shown in [13, 23–25], when appropriate Bregman distances are used.
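To make the structure of the iteration concrete, the following MATLAB skeleton sketches one pass of the updates in Remark 3.1(i); the subproblem solvers solveX{i} and solveY, as well as the function name, are hypothetical placeholders (for instance, the closed-form solvers of Sect. 4), not part of Algorithm 3.1 itself.

```matlab
% Schematic sketch of one iteration of the Bregman proximal linearized ADMM
% (cf. Remark 3.1(i)); x is a cell array of blocks x_1,...,x_m, and the
% subproblem solvers are supplied as function handles (hypothetical names).
function [x, y, z] = bpl_admm_step(x, y, z, A, B, b, rho, solveX, solveY, gradP, subgradG)
    m = numel(x);
    g = subgradG(x);                       % blocks of g_n in \partial_L G(x_n)
    p = gradP(x);                          % blocks of \nabla P(x_n)
    for i = 1:m                            % Gauss--Seidel sweep over the x-blocks
        % minimize the linearized augmented Lagrangian plus the Bregman term in x_i
        x{i} = solveX{i}(x, y, z, p{i}, g{i});
    end
    y = solveY(x, z);                      % y-update (augmented Lagrangian plus Bregman term)
    r = B*y - b;                           % residual A*x_{n+1} + B*y_{n+1} - b
    for i = 1:m, r = r + A{i}*x{i}; end
    z = z + rho*r;                         % dual update
end
```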
The following lemmas will be useful for our convergence analysis.
Lemma 3.1
Let L be a symmetric matrix in \({\mathbb {R}}^{d\times d}\) and let M be a matrix in \({\mathbb {R}}^{p\times d}\). Then the following hold:
-
(i)
For all \(x\in {\mathbb {R}}^d\), \(\sqrt{\lambda _{\min }(M^\top M)}\Vert x\Vert \le \Vert Mx\Vert \).
-
(ii)
For all \(x\in {\mathbb {R}}^d\), \(\gamma (x^\top Lx) \le \Vert Lx\Vert ^2\), where \(\gamma =\min \{|\lambda |: \lambda \) is a nonzero eigenvalue of \(L\}\). Consequently, for all \(z\in {\text {Im}}(M)\), \(\sqrt{\lambda _*(M^\top M)}\Vert z\Vert \le \Vert M^\top z\Vert \), where \(\lambda _*(M^\top M)\) is the smallest strictly positive eigenvalue of \(M^\top M\).
Proof
(i): This follows from the fact that \(\Vert Mx\Vert ^2 =x^\top (M^\top M)x\).
(ii): The first assertion follows from [6, Lemma 4.5] and its proof. Now, let \(z\in {\text {Im}}(M)\). Then there exists \(x\in {\mathbb {R}}^d\) such that \(z =Mx\). Since \(M^\top M\) is a symmetric matrix in \({\mathbb {R}}^{d\times d}\) with all eigenvalues being nonnegative, we have from the first assertion that
which completes the proof. \(\square \)
We move on to prove the main results of the paper, starting with the guaranteed subsequential convergence.
Theorem 3.1
(Subsequential convergence) Suppose that Assumption 3.1 holds. Let \((\textbf{x}_n,y_n,z_n)_{n\in {\mathbb {N}}}\) be the sequence generated by Algorithm 3.1 and, for each \(n\in {\mathbb {N}}^*\), define
Then the following hold:
-
(i)
For all \(n\in {\mathbb {N}}^*\),
$$\begin{aligned} L_{n+1} +\delta _{\textbf{x}}\Vert \textbf{x}_{n+1}-\textbf{x}_{n}\Vert ^2 +\delta _y\Vert y_{n+1}-y_n\Vert ^2 \le L_n, \end{aligned}$$(5)where \(\delta _{\textbf{x}} =\frac{\mu \alpha -\ell _P -\beta }{2} > 0\) and \(\delta _y =\frac{\lambda \rho }{2} -\frac{(\ell _H+2\nu \ell _{\psi })^2}{\lambda \rho } -\frac{\ell _H}{2} >0\).
-
(ii)
Suppose that \(z_0\in {\text {Im}}(B)\), that \(\sum _{i=1}^m f_i(x_i)+P(\textbf{x})-G(\textbf{x})\) is coercive, and that H is bounded below. Then the sequence \((\textbf{x}_n,{y}_n,{z}_n)_{n\in {\mathbb {N}}}\) is bounded and \(\textbf{x}_{n+1}-\textbf{x}_n\rightarrow 0\), \(y_{n+1}-y_n\rightarrow 0\), \(z_{n+1}-z_n\rightarrow 0\), and \({\textbf{A}}{\textbf{x}}_n +By_n \rightarrow b\) as \(n\rightarrow +\infty \). Furthermore, for every cluster point \((\overline{\textbf{x}},\overline{y},\overline{z})\) of \((\textbf{x}_n,{y}_n,{z}_n)_{n\in {\mathbb {N}}}\), it holds that
$$\begin{aligned} \textbf{A}\overline{\textbf{x}} +B\overline{y}&=b \text {~~and}~~\\ \lim _{n \rightarrow +\infty } F(\textbf{x}_{n },y_{n})&=\lim _{n \rightarrow +\infty }{\mathcal {L}}_\rho (\textbf{x}_{n },y_{n},z_{n}) ={\mathcal {L}}_\rho (\overline{\textbf{x}},\overline{y},\overline{z})\\&=F(\overline{\textbf{x}},\overline{y}), \end{aligned}$$and that if G is strictly differentiable and, for each \(i \in \{1,\dots ,m\}\), \(\nabla \phi _i\) is \(\ell _{\phi }\)-Lipschitz continuous, then \((\overline{\textbf{x}},\overline{y},\overline{z})\) is a stationary point of \({\mathcal {L}}_\rho \).
Proof
(i): Let \(n\in {\mathbb {N}}\). The update of \(z_{n+1}\) yields
By the optimality condition for the update of \(y_{n+1}\) in Step 2 of Algorithm 3.1,
which combined with (6) yields
From the definition of the augmented Lagrangian, we have that
and that
Writing \({\textbf{A}}{\textbf{x}}_{n+1}+By_{n}-b =({\textbf{A}}{\textbf{x}}_{n+1}+By_{n+1}-b) -(By_{n+1}- By_n)\), we obtain that
and so
In view of (7),
where the last inequality is from the \(\ell _H\)-Lipschitz continuity of \(\nabla H\) and Lemma 2.3(i), and from the monotonicity of \(\nabla \psi \). We therefore obtain that
Next, by the update of \(x_{i,n+1}\) in Step 2 of Algorithm 3.1, for all \(i \in \{1,\dots ,m\}\) and all \(x_i \in {\mathbb {R}}^{d_i}\),
Letting \(x_i =x_{i,n}\), using Proposition 2.1(i) and (iii), and summing up over \(i \in \{1,\dots ,m\}\), we derive that
On the other hand,
where the first inequality is obtained by using Lemma 2.3(i) on the \(\ell _P\)-Lipschitz continuity of \(\nabla P\) and [5, Lemma 4.1] on the \(\beta \)-weak convexity of G, while the last inequality follows from (11). Summing up three relations (8), (9), and (12) yields
Now, since \({\text {Im}}(\textbf{A}) \cup \{b\} \subseteq {\text {Im}}(B)\), it follows from (6) that
which, by Lemma 3.1(ii), yields
Let \(n\in {\mathbb {N}}^*\). We derive from (7) and the Lipschitz continuity of \(\nabla H\) and \(\nabla \psi \) that
and thus,
Applying Cauchy–Schwarz inequality to two vectors \(u =(\sqrt{\ell _H+\nu \ell _{\psi }},\sqrt{\nu \ell _{\psi }})\) and \(v =(\sqrt{\ell _H+\nu \ell _{\psi }}\Vert y_{n+1} -y_n\Vert ,\sqrt{\nu \ell _{\psi }}\Vert y_n -y_{n-1}\Vert )\), we obtain that
By combining this with (13) and using Lemma 3.1(i),
or equivalently,
which proves (5). Here, we note that \(\delta _{\textbf{x}} =\frac{\mu \alpha -\ell _P -\beta }{2} > 0\) since \(\mu > \frac{\ell _P+\beta }{\alpha }\) and that \(\delta _y =\frac{\lambda \rho }{2} -\frac{(\ell _H+2\nu \ell _{\psi })^2}{\lambda \rho } -\frac{\ell _H}{2} >0\) since \(\rho > \frac{\ell _H +\sqrt{\ell _H^2 +8(\ell _H +2\nu \ell _{\psi })^2}}{2\lambda }\).
(ii): Let \(n\in {\mathbb {N}}^*\). As \(z_0 \in {\text {Im}}(B)\), we have from (14) that \(z_n \in {\text {Im}}(B)\). Using Lemma 3.1(ii), (7), and the Lipschitz continuity of \(\nabla \psi \) yields
For \(\omega \in {\mathbb {R}}_{++}\) to be chosen later, invoking Cauchy–Schwarz inequality with two vectors \(u =(\sqrt{\omega },\sqrt{\nu \ell _{\psi }})\) and \(v =(\frac{1}{\sqrt{\omega }}\Vert \nabla H(y_n)\Vert ,\sqrt{\nu \ell _{\psi }}\Vert y_n-y_{n-1}\Vert )\), we have that
Combining this with the fact that
we deduce that
Now, we note that
and so \(\frac{1}{2\ell _H} >\frac{1}{\lambda \rho }\). Choosing \(\omega =1\) if \(\nu \ell _{\psi } =0\), and \(\omega =\nu \ell _{\psi }\) if \(\nu \ell _{\psi } >0\), it holds that
This together with (i) yields
in which the first term on the right hand side is bounded below since \(\sum _{i=1}^m f_i(x_i)+P(\textbf{x})-G(\textbf{x})\) is coercive, the second term is bounded below due to the assumption on H and Lemma 2.3(ii), while the remaining terms are nonnegative due to their positive coefficients. Therefore, all the terms are bounded. Again, as \(\sum _{i=1}^m f_i(x_i)+P(\textbf{x})-G(\textbf{x})\) is coercive, the boundedness of the first term implies that \((\textbf{x}_{n})_{n\in {\mathbb {N}}}\) is bounded. Since the last two terms are bounded, it follows from (17) that \((z_n)_{n\in {\mathbb {N}}}\) is bounded. Since \(z_{n+1}=z_n+\rho ({\textbf{A}}{\textbf{x}}_{n+1}+By_{n+1}-b)\) and B has full column rank, we also obtain that \((y_n)_{n\in {\mathbb {N}}}\) is bounded. Consequently, the sequence \((\textbf{x}_n,y_n,z_n)_{n\in {\mathbb {N}}}\) is bounded.
On the other hand, since all the terms on the right hand side of (18) are bounded below, so is the sequence \((L_n)_{n\in {\mathbb {N}}}\). Combining with the nonincreasing property of \((L_n)_{n\in {\mathbb {N}}}\) due to (i), it follows that \((L_n)_{n\in {\mathbb {N}}}\) is convergent. By rearranging (5) and performing telescoping, we then have that
which implies that \(\sum _{n=1}^{+\infty }\Vert \textbf{x}_{n+1}-\textbf{x}_n\Vert ^2 < +\infty \), \(\sum _{n=1}^{+\infty }\Vert y_{n+1}-y_n\Vert ^2 <+\infty \), and by (16), \(\sum _{n=1}^{+\infty }\Vert z_{n+1}-z_n\Vert ^2 < +\infty \). As a result, \(\textbf{x}_{n+1}-\textbf{x}_{n} \rightarrow 0\), \(y_{n+1}-y_n \rightarrow 0\), \(z_{n+1}-z_n \rightarrow 0\), and \({\textbf{A}}{\textbf{x}}_n +By_n =(z_n -z_{n-1})/\rho +b \rightarrow b\) as \(n \rightarrow +\infty \). It then follows from the definition of \(L_n\) that the sequence \(({\mathcal {L}}_\rho (\textbf{x}_{n},y_{n},z_{n}))_{n\in {\mathbb {N}}}\) is also convergent with \(\lim _{n\rightarrow +\infty } {\mathcal {L}}_\rho (\textbf{x}_{n},y_{n},z_{n}) =\lim _{n\rightarrow +\infty }L_n\).
Let \((\overline{\textbf{x}},\overline{y},\overline{z})\) be a cluster point of the sequence \((\textbf{x}_n,y_n,z_n)_{n\in {\mathbb {N}}}\). Then there exists a subsequence \((\textbf{x}_{k_n},y_{k_n},z_{k_n})_{n\in {\mathbb {N}}}\) that converges to \((\overline{\textbf{x}},\overline{y},\overline{z})\). In view of (10), letting \(n =k_n\) and using Proposition 2.1(iii) and (iv), we obtain that, for all \(i \in \{1,\dots ,m\}\) and all \(x_i\in {\mathbb {R}}^{d_i}\),
Since \(\nabla P\) is continuous, \(\nabla P(\textbf{x}_{k_n}) \rightarrow \nabla P(\overline{\textbf{x}})\) as \(n\rightarrow +\infty \). As \(G +\frac{\beta }{2}\Vert \cdot \Vert ^2\) is a continuous convex function, it follows from [22, Example 9.14] that G is locally Lipschitz continuous. Since \(\textbf{x}_{k_n} \rightarrow \overline{\textbf{x}}\) as \(n\rightarrow +\infty \), using Lemma 2.2 and passing to a subsequence if necessary, we can and do assume that \(g_{k_n}\rightarrow \overline{g} \in \partial _L G (\overline{\textbf{x}})\) as \(n\rightarrow +\infty \). Now, for each \(i \in \{1,\dots ,m\}\), letting \(x_i =\overline{x}_i\) and \(n\rightarrow +\infty \) in (19) with noting that \(x_{i,k_n+1} \rightarrow \overline{x}_i\), \(y_{k_n} \rightarrow \overline{y}\), and \(z_{k_n} \rightarrow \overline{z}\), we have that \(\limsup _{n \rightarrow +\infty } f_i(x_{i,k_n+1}) \le f_i(\overline{x}_i)\), which together with the lower semicontinuity of \(f_i\) yields
As H, P, and G are continuous, it follows that \(\lim _{n \rightarrow +\infty }{\mathcal {L}}_\rho (\textbf{x}_{k_n +1 },y_{k_n+1},z_{k_n +1}) = {\mathcal {L}}_\rho (\overline{\textbf{x}},\overline{y},\overline{z})\). Since the sequence \(({\mathcal {L}}_\rho (\textbf{x}_{n},y_{n},z_{n}))_{n\in {\mathbb {N}}}\) is convergent, we deduce that
We next show that \((\overline{\textbf{x}},\overline{y},\overline{z})\) is a stationary point of \({\mathcal {L}}_\rho \). In (6) and (7) with n replaced by \(k_n-1\), letting \(n\rightarrow +\infty \) and using the continuity of \(\nabla H\) and the Lipschitz continuity of \(\nabla \psi \), we obtain that \(0 =\mathbf {A\overline{x}} + B\overline{y} -b\) and \(0 =\nabla H(\overline{y}) + B^\top \overline{z}\). In turn, letting \(n \rightarrow +\infty \) in (19), we have for all \(i \in \{1,\dots ,m\}\) and all \(x_i\in {\mathbb {R}}^{d_i}\) that
which can be written as
where \(h_i(x_i) =f_i(x_i) +\langle \nabla _i P(\overline{\textbf{x}}) -\overline{g}_i, x_i \rangle + \langle \overline{z}, A_ix_i \rangle +\frac{\rho }{2}\Vert A_1\overline{x}_1 +\dots +A_{i-1}\overline{x}_{i-1} +A_ix_i +A_{i+1}\overline{x}_{i+1}+\dots +A_m \overline{x}_{m} +B\overline{y} -b \Vert ^2 +\frac{\nu \ell _{\phi }}{2}\Vert x_i-\overline{x}_i\Vert ^2\). It follows that, for all \(i \in \{1,\dots ,m\}\),
Since \(\textbf{A}\overline{\textbf{x}}+B\overline{y}-b =0\), we obtain that
which completes the proof due to Lemma 2.1(i) and (ii) and the strict differentiability of G. \(\square \)
Remark 3.2
(Conditions for the boundedness of the generated sequence) In order to establish the boundedness of the sequence \((\textbf{x}_n,y_n,z_n)_{n\in {\mathbb {N}}}\), the authors of [11, 23, 24] require that there exists \(\sigma \in (0, +\infty )\) such that
This makes it challenging to verify the existence of \(\sigma \) in practice. In our analysis in Theorem 3.1(ii), we replace this condition with the requirement that H is bounded below, which is much easier to verify.
We now establish the convergence of the full sequence generated by Algorithm 3.1. Recall that a proper lower semicontinuous function \({\mathcal {F}}:{\mathcal {H}}\rightarrow \left( -\infty , +\infty \right] \) satisfies the Kurdyka–Łojasiewicz (KL) property [12, 16] at \(\overline{x} \in {\text {dom}}\partial _L {\mathcal {F}}\) if there are \(\eta \in (0, +\infty ]\), a neighborhood V of \(\overline{x}\), and a continuous concave function \(\varphi : \left[ 0, \eta \right) \rightarrow {\mathbb {R}}_+\) such that \(\varphi \) is continuously differentiable with \(\varphi ' > 0\) on \((0, \eta )\), \(\varphi (0) = 0\), and, for all \(x \in V\) with \( {\mathcal {F}}(\overline{x})< {\mathcal {F}}(x) < {\mathcal {F}}(\overline{x}) + \eta \),
We say that \({\mathcal {F}}\) is a KL function if it satisfies the KL property at any point in \({\text {dom}}\partial _L {\mathcal {F}}\). If \({\mathcal {F}}\) satisfies the KL property at \(\overline{x} \in {\text {dom}}\partial _L {\mathcal {F}}\), in which the corresponding function \(\varphi \) can be chosen as \(\varphi (t) = c t ^{1 - \lambda }\) for some \(c \in {\mathbb {R}}_{++}\) and \(\lambda \in [0, 1)\), then \({\mathcal {F}}\) is said to satisfy the KL property at \(\overline{x}\) with exponent \(\lambda \). The function \({\mathcal {F}}\) is called a KL function with exponent \(\lambda \) if it is a KL function and has the same exponent \(\lambda \) at any \(x \in {\text {dom}}\partial _L {\mathcal {F}}\). Now we will present the proof of the full sequential convergence in the following theorem.
Theorem 3.2
(Full sequential convergence) Suppose that Assumption 3.1 holds, that \(\sum _{i=1}^m f_i(x_i)+P(\textbf{x})-G(\textbf{x})\) is coercive, that H is bounded below, that G is differentiable with \(\ell _G\)-Lipschitz continuous gradient, and that, for all \(i\in \{1, \dots , m\}\), \(\nabla {\phi _i}\) is \(\ell _{\phi }\)-Lipschitz continuous. Let \((\textbf{x}_{n},y_{n},z_{n})_{n\in {\mathbb {N}}}\) be the sequence generated by Algorithm 3.1 with \(z_0\in {\text {Im}}(B)\) and \(\limsup _{n\rightarrow +\infty } \mu _n =\overline{\mu } <+\infty \). Define
-
(i)
Suppose that \({\mathcal {F}}\) is a KL function. Then the sequence \((\textbf{x}_{n},y_{n},z_{n})_{n\in {\mathbb {N}}}\) converges to a stationary point \((\textbf{x}^{*},y^{*},z^{*})\) of \({\mathcal {L}}_\rho \) and
$$\begin{aligned} \sum _{n=0}^{+\infty } \Vert (\textbf{x}_{n+1},y_{n+1},z_{n+1})-(\textbf{x}_{n},y_{n},z_{n})\Vert < +\infty . \end{aligned}$$ -
(ii)
Suppose that \({\mathcal {F}}\) is a KL function with exponent \(\kappa \in [0,1)\). Then the following hold:
-
(a)
If \(\kappa =0\), then \((\textbf{x}_{n},y_{n},z_{n})_{n\in {\mathbb {N}}}\) converges to \((\textbf{x}^{*},y^{*},z^{*})\) in a finite number of steps.
-
(b)
If \(\kappa \in (0,\frac{1}{2}]\), then there exist \(\overline{\gamma }\in {\mathbb {R}}_{++}\) and \(\zeta \in \left( 0,1\right) \) such that, for all \(n\in {\mathbb {N}}\),
$$\begin{aligned}&\Vert (\textbf{x}_{n},y_{n},z_{n})-(\textbf{x}^{*},y^{*},z^{*})\Vert \le \overline{\gamma }\zeta ^{\frac{n}{2}}, \\&\Vert {\textbf{A}}{\textbf{x}}_n+By_n-b\Vert \le \overline{\gamma } \zeta ^{\frac{n}{2}}, \\&|{\mathcal {L}}_\rho (\textbf{x}_{n},y_{n},z_{n}) -{\mathcal {L}}_\rho (\textbf{x}^{*},y^{*},z^{*}) |\le \overline{\gamma }\zeta ^n, \\ \text {and~}&|F(\textbf{x}_n,y_n)-F(\textbf{x}^{*},y^{*})| \le \overline{\gamma } \zeta ^{\frac{n}{2}}. \end{aligned}$$ -
(c)
If \(\kappa \in (\frac{1}{2},1)\), then there exists \(\overline{\gamma }\in {\mathbb {R}}_{++}\) such that, for all \(n\in {\mathbb {N}}\),
$$\begin{aligned}&\Vert (\textbf{x}_{n},y_{n},z_{n})-(\textbf{x}^{*},y^{*},z^{*})\Vert \le \overline{\gamma } n^{-\frac{1-\kappa }{2\kappa -1}}, \\&\Vert {\textbf{A}}{\textbf{x}}_n+By_n-b\Vert \le \overline{\gamma } n^{-\frac{1-\kappa }{2\kappa -1}}, \\ &|{\mathcal {L}}_\rho (\textbf{x}_{n},y_{n},z_{n}) -{\mathcal {L}}_\rho (\textbf{x}^{*},y^{*},z^{*})|\le \overline{\gamma } n^{-\frac{2-2\kappa }{2\kappa -1}}, \\ \text {and~}&|F(\textbf{x}_n,y_n)-F(\textbf{x}^{*},y^{*})| \le \overline{\gamma } n^{-\frac{1-\kappa }{2\kappa -1}}. \end{aligned}$$
-
(a)
Proof
For each \(n\in {\mathbb {N}}\), set \(w_n=(\textbf{x}_{n+1},y_{n+1},z_{n+1},y_n)\) and \(\varDelta _n =\Vert \textbf{x}_{n+2}-\textbf{x}_{n+1}\Vert +\Vert y_{n+2}-y_{n+1}\Vert \). Let \(n \in {\mathbb {N}}^*\). According to Theorem 3.1, we have that
that \(w_{n+1} - w_n \rightarrow 0\) as \(n\rightarrow +\infty \), that \((w_n)_{n\in {\mathbb {N}}}\) is bounded, that every cluster point \(\overline{w}=(\overline{\textbf{x}},\overline{y},\overline{z},\overline{y})\) of \((w_n)_{n\in {\mathbb {N}}}\) is a stationary point of (1), and also that \({\mathcal {F}}(w_{n}) \rightarrow {\mathcal {F}}(\overline{w}) = {\mathcal {L}}_{\rho }(\overline{\textbf{x}},\overline{y},\overline{z})\) as \(n\rightarrow +\infty \). Notice that
which together with (20) yields
Next, we have that
and, for all \(i \in \{1,\dots ,m\}\),
By invoking the optimality condition for the update of \(x_{i,n+1}\) in Step 2 of Algorithm 3.1,
which combined with the above equality implies that
Therefore,
We now see that
and from the optimality condition for the update of \(y_{n+1}\) in Algorithm 3.1 that
This leads to
and so
Since \(\partial _L ^z {\mathcal {F}}(w_n) ={\textbf{A}}{\textbf{x}}_{n+1} + By_{n+1} -b =\frac{1}{\rho }(z_{n+1} -z_n)\) and \(\partial _L ^t {\mathcal {F}}(w_n) =-2c(y_{n+1}-y_{n})\), it holds that
Combining (21), (22), and (23), noting that \(\limsup _{n\rightarrow +\infty } \mu _n =\overline{\mu } <+\infty \), and using (15) and (16) in the proof of Theorem 3.1, we derive that there exist \(n_0\in {\mathbb {N}}^*\) and \(C_2 \in {\mathbb {R}}_{++}\) such that, for all \(n\ge n_0\),
(i): We see that all the conditions in the abstract convergence framework [5, Theorem 5.1] are satisfied with \(I =\{1,2\}\), \(\lambda _1=\lambda _2=1/2\), \(\alpha _n \equiv C_1\), \(\beta _n \equiv 1/(2C_2)\), and \(\varepsilon _n \equiv 0 \). By [5, Theorem 5.1(i)],
which implies that \(\sum _{n=0}^{+\infty } \Vert \textbf{x}_{n+1}-\textbf{x}_{n}\Vert <+\infty \) and \(\sum _{n=0}^{+\infty } \Vert y_{n+1}-y_{n}\Vert < +\infty \). Together with (16), we derive that \(\sum _{n=0}^{+\infty } \Vert z_{n+1}-z_n\Vert <+\infty \), and hence
which yields the convergence of \((\textbf{x}_{n},y_{n},z_{n})_{n\in {\mathbb {N}}}\) to \((\textbf{x}^{*},y^{*},z^{*})\). In view of Theorem 3.1(ii), \((\textbf{x}^{*},y^{*},z^{*})\) is a stationary point of \({\mathcal {L}}_\rho \).
(ii)(a): This follows from the arguments in the proof of [5, Theorem 5.1] and [4, Theorem 2(i)].
(ii)(b): According to [5, Theorem 5.1(iv)], there exist \(\gamma _0\in {\mathbb {R}}_{++}\) and \(\zeta \in \left( 0,1\right) \) such that, for all \(n\in {\mathbb {N}}^*\),
We derive that, for all \(n\in {\mathbb {N}}^*\), \(\Vert \textbf{x}_n-\textbf{x}^{*}\Vert \le \gamma _0\zeta ^{\frac{n}{2}}\) and \(\Vert y_n-y^{*}\Vert \le \gamma _0\zeta ^{\frac{n}{2}}\), which yields
By passing to the limit in (7), \(B^\top z^{*}=-\nabla H(y^{*})\). Following the same steps as in (15) and (16), we obtain that, for all \(n\in {\mathbb {N}}^*\),
Consequently,
We now deduce from the definition of \({\mathcal {F}}\) that
and from the definition of \({\mathcal {L}}_\rho \) that
where the last inequality follows from the fact that \(\zeta ^n \le \zeta ^{\frac{n}{2}}\) since \(\zeta \in (0,1)\). By letting
and increasing it if necessary, we arrive at the conclusion.
(ii)(c): Following the arguments in [5, Theorem 5.1(iv)] and [4, Theorem 2(iii)], we find \(\gamma _0 \in {\mathbb {R}}_{++}\) such that, for all \(n\in {\mathbb {N}}^*\),
Similar to (ii)(b), for all \(n\ge 2\), since \(n-1\ge \frac{1}{2}n\) and \(\frac{1-\kappa }{2\kappa -1} >0\), we derive that
and that
where \(\gamma _1 {:}{=}\frac{\ell _H\gamma _0 +(1 +2^{\frac{1-\kappa }{2\kappa -1}})\nu \ell _{\psi }\gamma _0}{\sqrt{\lambda }}\), \(\gamma _2 {:}{=}\gamma _0 +c(1 +2^{\frac{1-\kappa }{2\kappa -1}})^2\gamma _0^2\), and the last inequality follows from the fact that \(n^{-\frac{2-2\kappa }{2\kappa -1}}<n^{-\frac{1-\kappa }{2\kappa -1}}\) for \(\kappa \in (\frac{1}{2},1)\). By setting \(\overline{\gamma }\) in the same way as in (ii)(b), we obtain the conclusion. \(\square \)
In Theorem 3.2, to obtain the convergence of the full sequence generated by Algorithm 3.1, we require that \({\mathcal {F}}(\textbf{x},y,z,t) {:}{=}{\mathcal {L}}_\rho (\textbf{x},y,z) +\frac{(\ell _H+2\nu \ell _{\psi })\nu \ell _{\psi }}{\lambda \rho }\Vert y-t\Vert ^2\) is a KL function. It is worthwhile mentioning that if the objective function \(F(\textbf{x},y)\) is a semi-algebraic function, then so is \({\mathcal {F}}(\textbf{x},y,z,t)\), and hence \({\mathcal {F}}(\textbf{x},y,z,t)\) is a KL function with exponent \(\kappa \in [0, 1)\); see, e.g., [4, Example 1].
4 Numerical Results
In this section, we provide the numerical results of our proposed algorithm for two case studies: RPCA with modified regularization, and a DC-OPF problem that considers optimal photovoltaic system placement. All of the experiments are performed in MATLAB R2021b on a 64-bit laptop with an Intel(R) Core(TM) i7-1165G7 CPU (2.80GHz) and 16GB of RAM.
4.1 RPCA with Modified Regularization
We consider the following RPCA model introduced in Example 1.1
We modify the problem (24) as follows
where \(\Vert \cdot \Vert _1 - \Vert \cdot \Vert \) is the modified regularization. In order to use Algorithm 3.1, we let \(f_1(L)= \Vert L\Vert _{*}\), \(f_2(S)= \tau \Vert S\Vert _1\), \(H(T)=\frac{\gamma }{2}\Vert T-M\Vert _F^2\), \(P\equiv 0\), \(G(S)=\tau \Vert S\Vert \), \(D_{\phi _1}(L,L_n)=\frac{\alpha }{2}\Vert L-L_n\Vert _F^2\), and \(D_{\phi _2}(S,S_n)=\frac{\alpha }{2}\Vert S-S_n\Vert _F^2\). The coefficient matrices are defined by \(A_1=A_2=\textbf{I}_{m\times m}\), and \(B=-\textbf{I}_{m\times m}\). Note that \(b=0\) in this case, and for all \(n\in {\mathbb {N}}\) we fix \(\mu _n=1,~\nu _n=0\). The problem (25) can be solved by Algorithm 3.1 with the following steps
where \(g_n=(g_{1,n},g_{2,n}) \in \partial _L \Vert \cdot \Vert ((L_n,S_n))\). In this case, since \(G(S)=\tau \Vert S\Vert \), \(g_{1,n}\) is a zero matrix with size \(m\times d\). This updating scheme can be rewritten as
Both of the L and S subproblems have closed-form solutions, which are
where \({\mathcal {P}}_c(\cdot )\) denotes the soft shrinkage operator imposed on the singular values of the input matrix [8], \({\mathcal {S}}_c(\cdot )\) denotes the soft shrinkage operator imposed on all entries of the input matrix [14], and c is the threshold value. The subgradient \(g_{2,n}\) is given by \(u_1 v_1^\top \), where \(u_1\) and \(v_1\) are the leading left and right singular vectors obtained via the singular value decomposition of \(S_n\). We randomly generate the ground truth matrices for this case study by the procedures described in Appendix A. We set \(\tau =1/\sqrt{\max (m,d)}\), \(\alpha =10^{-2}\), and \(\gamma =1\) for all test cases. To apply Algorithm 3.1, we choose \(\rho =2 + 10^{-10}\) (since \(\lambda =1,~\ell _H = \gamma \)). We compare our proposed algorithm with the three-block ADMM (ADMM-3) used in [14], with their corresponding Lagrangian penalty term \(\rho =2\). The ADMM-3 solves problem (24) while our algorithm solves problem (25). The noise parameter \(\varGamma \) is set to \(10^{-2}\) and \(2\times 10^{-2}\), respectively. The algorithms are terminated when the relative change between two consecutive iterates is less than \(10^{-6}\), specifically
We denote by \((\widehat{L}_n,\widehat{S}_n,\widehat{T}_n)\) the solution found at each iteration, and by \((L_O,S_O,T_O)\) the ground truth matrices that we want to recover by solving the problem (25). The following relative error (RE) with respect to the ground truth is used to measure the quality of the solution
Each algorithm is run 30 times for each test case with a maximum of 4000 iterations. At each run, the initial matrices L and S are created randomly, T is initialized with the value of M, and Z is initialized as a zero matrix. We perform our experiments on both square matrices (\(100\times 100\) and \(1000\times 1000\)) and rectangular matrices (here we arbitrarily choose \(200\times 100\) and \(2000 \times 1000\)). The average CPU time (in seconds), RE, rank, and sparsity of the solutions are presented in Tables 1 and 2, respectively. The tuple (r, s) in the tables shows the rank and sparsity ratio of the randomly generated ground truth. Although the proposed algorithm takes more time to run (mainly due to the singular value decomposition required to compute the subgradient at each iteration), it outperforms ADMM-3 in terms of solution quality for both test cases. Interestingly, the sparsity of the solutions found by our algorithm is also closer to the ground truth values, which indicates the efficiency of the modified regularization. Our algorithm successfully recovers the rank of the ground truth matrices in all test cases for \(\varGamma =10^{-2}\), and it also recovers a better sparsity level in almost all test cases. When the size of the test matrices is larger, the CPU time of our algorithm is still comparable to that of ADMM-3, and it still outperforms ADMM-3 in terms of solution quality. When \(\varGamma =2\times 10^{-2}\) for the larger case, the rank and sparsity are more challenging to recover, yet our proposed algorithm still manages to obtain a better sparsity level in almost all test cases.
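For completeness, the shrinkage operators \({\mathcal {P}}_c\) and \({\mathcal {S}}_c\) and the subgradient \(g_{2,n}\) used in the updates above admit simple implementations; a minimal MATLAB sketch (the function names are ours) based on singular value thresholding [8] and entrywise soft thresholding [14] is given below.

```matlab
function Y = softshrink(X, c)
% Entrywise soft shrinkage S_c(X): proximal operator of c*||.||_1.
    Y = sign(X) .* max(abs(X) - c, 0);
end

function Y = svshrink(X, c)
% Singular value shrinkage P_c(X): soft shrinkage applied to the singular values of X.
    [U, S, V] = svd(X, 'econ');
    Y = U * diag(max(diag(S) - c, 0)) * V';
end

function g = specnorm_subgrad(S)
% A subgradient of the spectral norm ||S|| at S: u_1*v_1' from the leading singular vectors.
    [U, ~, V] = svd(S, 'econ');
    g = U(:, 1) * V(:, 1)';
end
```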
4.2 DC-OPF with Optimal Photovoltaic System Placement
We revisit the DC-OPF problem with optimal PV allocation based on the formulations given in [2, 21]. The details of the variables as well as the parameters are given in Table 4 in the Appendix. The formulation is given as follows
The objective function minimizes the installation cost of PV systems and the operation cost of conventional generators. Constraint (26a) ensures that the demand is met by the power flow from both conventional generators and renewable sources. Constraint (26b) ensures that enough PV systems are installed to achieve a 50 percent renewable penetration rate. Constraint (26c) limits the power flow between any two nodes. The remaining constraints define the bounds of the decision variables. Using the binary relaxation technique from [21], we have that
Taking into account the above equivalence, a plausible alternative optimization model for the OPF problem with PV is as follows
In order to apply our algorithm, we let \(\textbf{x}=(x_1,x_2,\dots ,x_{|N|})\), in which \(x_i=[P_i^{PV},P_i^G,\varTheta _i,u_i]^\top \) denotes the set of variables associated with bus i, for \(i \in \{1,\dots ,|N|\}\). Here we use |N| to denote the cardinality of the set N. Letting \(f_i=Cu_i + a_i(P_{i}^{G})^2+b_iP_{i}^{G}+c_i\) be the cost function associated with bus i and \(G(\textbf{x})=\gamma \sum _{i\in N}(u_i^2 - u_i)\), we can reformulate the problem (OPF-M) as follows
where \(A_i\) are coefficient matrices associated with each \(x_i\), and b is the vector formed by the right-hand side of the problem (OPF-M). We then introduce slack variables, which are penalized to transform all inequalities into equalities as in [2]. Two points worth noting are that our DC-OPF problem is nonconvex due to the presence of binary variables (and later due to the reformulation into difference-of-convex form), while the DC-OPF problem considered in [2] is convex, and that our algorithm is more general and also covers their ADMM version. Let p be the number of inequality constraints, where \(p=9|N|+\sum _{i\in N}|M_i| +1\). By letting \(y=(y_1,y_2,\dots ,y_p)\) be the slack variables, the problem (OPF-M2) can be equivalently reformulated as follows
Here, we observe that
and that \(y\mapsto {\text {dist}}^2(y,{\mathbb {R}}^{p}_{+})\) is a differentiable function whose gradient \(\nabla {\text {dist}}^2(y,{\mathbb {R}}^{p}_{+}) =2(y -{\text {Proj}}_{{\mathbb {R}}_+^p}(y))\) is 2-Lipschitz continuous. So, the final relaxed problem is given as follows
where \(\eta \) is a positive parameter. Now, (OPF-M3) is a special case of problem (1), in which \(f_i(x_i)=Cu_i + a_i(P_{i}^{G})^2+b_iP_{i}^{G}+c_i\), \(H(y)=\frac{\eta }{2}{\text {dist}}^2(y,{\mathbb {R}}^{p}_{+})\), \(G(\textbf{x})=\gamma \sum _{i\in N}(u_i^2 - u_i)\), \(P(\textbf{x})\equiv 0\), \(B=\textbf{I}_{p\times p}\), and \(A_i\) is constructed as described in Appendix C. Each function \(f_i(x_i)\) can be rewritten in quadratic form as \(f_i(x_i)=\frac{1}{2}x_i^\top Q_ix_i +q_i^\top x_i +c_i\), in which
Using Algorithm 3.1 with \(D_{\phi _i}(x_i,x_{i,n})=\frac{\alpha }{2}\Vert x_i-x_{i,n}\Vert ^2\), where \(\alpha =10^{-2}\), and fixing \(\mu _n=1\) and \(\nu _n=0\) for all \(n\in {\mathbb {N}}\), the updates of the variables are given as follows
Note that the closed-form update of y can be derived with the same approach used in [2, Theorem 1]. The proposed algorithm is run for a maximum of 4000 iterations and it is terminated when
The test systems considered here are the 14-bus system used in [21, Case study 4.2], and the 141-bus test case taken from MATPOWER [32]. We set \((\eta , \rho , \gamma )=(900, 1800+10^{-10},80)\) for the 14-bus case, and \((\eta , \rho , \gamma )=(3000, 6000+10^{-10},80)\) for the 141-bus case. We compare the performance of our proposed algorithm with the generalized proximal point algorithm (GPPA) [3], and the proximal subgradient algorithm with extrapolation (PSAe) [21]. We run all algorithms 30 times, initialized at the lower bounds of the variables, for a maximum of 4000 iterations. The results are given in Table 3. The objective function values are calculated using (OPF-1).
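For reference, the closed-form subproblem updates described above (under the instantiation \(P\equiv 0\), \(\mu _n=1\), \(\nu _n=0\), \(B=\textbf{I}_{p\times p}\), and \(D_{\phi _i}=\frac{\alpha }{2}\Vert \cdot \Vert ^2\)) can be sketched in MATLAB as follows; the helper names are ours, and the componentwise \(y\)-update is derived in the spirit of [2, Theorem 1] rather than quoted from it.

```matlab
function xi = xi_update(Qi, qi, Ai, ci, gi, zn, rho, alpha, xi_old)
% x_i-update: minimizer of
%   0.5*x'*Qi*x + qi'*x - gi'*x + zn'*(Ai*x) + rho/2*||Ai*x + ci||^2 + alpha/2*||x - xi_old||^2,
% where ci collects the contributions A_j*x_j (j ~= i), B*y_n and -b, and gi is the i-th
% block of the gradient of G(x) = gamma*sum(u.^2 - u) at the current iterate.
    M   = Qi + rho*(Ai'*Ai) + alpha*eye(size(Qi, 1));
    rhs = -qi + gi - Ai'*zn - rho*(Ai'*ci) + alpha*xi_old;
    xi  = M \ rhs;
end

function y = y_update(c, zn, rho, eta)
% y-update: componentwise minimizer of eta/2*dist^2(y,R_+^p) + zn'*y + rho/2*||y + c||^2,
% where c = A*x_{n+1} - b (so that A*x + B*y - b = y + c when B = I).
    t      = zn + rho*c;
    y      = -(c + zn/rho);            % branch on which the minimizer is nonnegative
    neg    = t > 0;                    % components whose minimizer falls in the negative branch
    y(neg) = -t(neg) / (eta + rho);
end
```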
The best solution found in the 14-bus case using the proposed BPL-ADMM algorithm is approximately the same as the one found in [21]. This is due to the relaxation used during the modelling process. However, it can be seen that the CPU time is shorter, due to the availability of closed-form solutions of the subproblems. This also makes each iteration of the proposed algorithm less computationally expensive than an iteration of the remaining algorithms. Moreover, for the larger test case, the proposed algorithm benefits from the closed-form solutions of the subproblems and converges to a stationary point within an acceptable running time. The best solutions found by the proposed algorithm are shown in Figs. 1 and 2, respectively.
5 Conclusion
In this paper, we have proposed a splitting algorithm with linearization based on the ADMM to solve a class of structured nonconvex and nonsmooth optimization problems, which covers two important classes of problems in the literature. The proposed problem structure and algorithm impose fewer convexity restrictions than some existing algorithms in the literature. The convergence of the whole sequence generated by our algorithm is proved under the Kurdyka–Łojasiewicz property and some mild additional assumptions. The efficiency of the proposed algorithm is illustrated on two important nonconvex optimization problems, and it shows competitive results in comparison with existing algorithms.
Data availability
All data generated or analyzed during this study are included in this article. In particular, the data for Case study 4.1 were generated randomly, and the generation procedure is described in Appendix A. The data for Case study 4.2 are available in Appendix B.
References
Abdi, H., Beigvand, S.D., Scala, M.L.: A review of optimal power flow studies applied to smart grids and microgrids. Renew. Sustain. Energy Rev. 71, 742–766 (2017)
Abraham, M.P., Kulkarni, A.A.: ADMM-based algorithm for solving DC-OPF in a large electricity network considering transmission losses. IET Gener. Transm. Distrib. 12(21), 5811–5823 (2018)
An, N.T., Nam, N.M.: Convergence analysis of a proximal point algorithm for minimizing differences of functions. Optimization 66(1), 129–147 (2016)
Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1–2), 5–16 (2007)
Boţ, R.I., Dao, M.N., Li, G.: Extrapolated proximal subgradient algorithms for nonconvex and nonsmooth fractional programs. Math. Oper. Res. 47(3), 2415–2443 (2022)
Boţ, R.I., Dao, M.N., Li, G.: Inertial proximal block coordinate method for a class of nonsmooth sum-of-ratios optimization problems. SIAM J. Optim. 33(2), 361–393 (2023)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Cai, J.-F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
Dhillon, I.S., Tropp, J.A.: Matrix nearness problems with Bregman divergences. SIAM J. Matrix Anal. Appl. 29(4), 1120–1146 (2008)
Gasso, G., Rakotomamonjy, A., Canu, S.: Recovering sparse signals with a certain family of nonconvex penalties and DC programming. IEEE Trans. Signal Process. 57(12), 4686–4698 (2009)
Guo, K., Han, D.R., Wu, T.T.: Convergence of alternating direction method for minimizing sum of two nonconvex functions with linear constraints. Int. J. Comput. Math. 94(8), 1653–1669 (2016)
Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. de l’institut Fourier 48(3), 769–783 (1998)
Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)
Li, X., Ng, M.K., Yuan, X.: Median filtering-based methods for static background extraction from surveillance video. Numer. Linear Algebra Appl. 22(5), 845–865 (2015)
Lin, T., Ma, S., Zhang, S.: Global convergence of unmodified 3-block ADMM for a class of convex minimization problems. J. Sci. Comput. 76(1), 69–88 (2017)
Łojasiewicz, S.: Une propriété topologique des sous-ensembles analytiques réels. Les Équations aux Dérivées Partielles, pp. 87–89 (1963)
Lou, Y., Yan, M.: Fast L1–L2 minimization via a proximal operator. J. Sci. Comput. 74(2), 767–785 (2017)
Mordukhovich, B.S.: Variational Analysis and Generalized Differentiation I. Grundlehren der mathematischen Wissenschaften, vol. 330. Springer, Berlin (2006)
Nesterov, Y.: Convex Optimization. Springer Optimization and Its Applications, vol. 137. Springer International Publishing, Cham (2018)
Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. Multisc. Model. Simul. 4(2), 460–489 (2005)
Pham, T.N., Dao, M.N., Shah, R., Sultanova, N., Li, G., Islam, S.: A proximal subgradient algorithm with extrapolation for structured nonconvex nonsmooth problems. Numer. Algor. 94(4), 1763–1795 (2023)
Rockafellar, R.T.: Variational Analysis. Grundlehren der mathematischen Wissenschaften, vol. 317. Springer, Berlin (1998)
Sun, T., Yin, P., Cheng, L., Jiang, H.: Alternating direction method of multipliers with difference of convex functions. Adv. Comput. Math. 44(3), 723–744 (2017)
Tu, K., Zhang, H., Gao, H., Feng, J.: A hybrid Bregman alternating direction method of multipliers for the linearly constrained difference-of-convex problems. J. Global Optim. 76(4), 665–693 (2019)
Wang, H., Banerjee, A.: Bregman alternating direction method of multipliers. In: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS'14), vol. 2, pp. 2816–2824. MIT Press, Cambridge, MA (2014)
Wang, X., Shao, H., Liu, P., Yang, W.: An inertial proximal splitting method with applications. Optimization 73(8), 2555–2584 (2023)
Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth optimization. J. Sci. Comput. 78(1), 29–63 (2018)
Yang, L., Pong, T.K., Chen, X.: Alternating direction method of multipliers for a class of nonconvex and nonsmooth problems with applications to background/foreground extraction. SIAM J. Imag. Sci. 10(1), 74–110 (2017)
Yang, Y., Guan, X., Jia, Q.-S., Yu, L., Xu, B., Spanos, C.J.: A survey of ADMM variants for distributed optimization: Problems, algorithms and features (2022). arXiv:2208.03700
Yashtini, M.: Multi-block nonconvex nonsmooth proximal ADMM: convergence and rates under Kurdyka-Łojasiewicz property. J. Optim. Theory Appl. 190(3), 966–998 (2021)
Yin, P., Lou, Y., He, Q., Xin, J.: Minimization of \(\ell _{1-2}\) for compressed sensing. SIAM J. Sci. Comput. 37(1), A536–A563 (2015)
Zimmerman, R.D., Murillo-Sanchez, C.E., Thomas, R.J.: MATPOWER: steady-state operations, planning, and analysis tools for power systems research and education. IEEE Trans. Power Syst. 26(1), 12–19 (2011)
Acknowledgements
The authors thank Rakibuzzaman Shah and Syed Islam from the Centre for New Energy Transition Research, Federation University Australia for providing the research data and the discussion on the power system optimization problem. They are also grateful to the anonymous reviewers for their insightful feedback during the revision of this manuscript.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. TNP was supported by Henry Sutton PhD Scholarship Program from Federation University Australia. MND was partially supported by the Australian Research Council (ARC), project number DP230101749, by the PHC FASIC program, project number 49763ZL, and by a public grant from the Fondation Mathématique Jacques Hadamard (FMJH).
Additional information
Communicated by René Henrion.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Data Generation for RPCA Model
Let \(M=L_O+S_O+N\), where \(L_O\) and \(S_O\) are the ground truth low-rank and sparse matrices, and N is a Gaussian noise matrix. The following MATLAB code generates the matrix M:
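(The listing below is a representative sketch consistent with this description; the target rank r, sparsity ratio s, and noise level \(\varGamma \) correspond to the quantities reported in Sect. 4.1, while the particular random generators and parameter values are our assumptions.)

```matlab
% Representative data generation for the RPCA experiments (a sketch; the exact
% generators may differ). m-by-d observation, target rank r, sparsity ratio s, noise level Gamma.
m = 100;  d = 100;  r = 5;  s = 0.1;  Gamma = 1e-2;

L_O = randn(m, r) * randn(r, d);            % ground-truth low-rank matrix of rank r
S_O = zeros(m, d);                          % ground-truth sparse matrix
idx = randperm(m*d, round(s*m*d));          % random support with sparsity ratio s
S_O(idx) = randn(numel(idx), 1);            % sparse entries drawn from a Gaussian distribution
N   = Gamma * randn(m, d);                  % Gaussian noise scaled by Gamma
M   = L_O + S_O + N;                        % observed matrix M = L_O + S_O + N
```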
Data of Case Study 4.2
The values of the parameters in Table 4 are used for the 14-bus test case. We use the same parameter notations for the 141-bus test case. The parameters of the 141-bus test case (line configurations, load demand, diesel generator capacity) can be found within MATPOWER software. Note that we also use 800 kW PV for the 141-bus case. All calculations are performed on Per Unit (pu) values.
Construction of Matrices \(A_i\) in the DC-OPF Problem
We give an example of how the matrices \(A_i\) can be constructed. For simplicity, we use a 2-bus example; the procedure is the same if there are more buses. Consider the example given in Fig. 3.
The problem (OPF-M3) in this case reads as
The matrices \(A_1\), \(A_2\), and vector b are given by
By this construction, each \(A_i\) has full column rank.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Pham, T.N., Dao, M.N., Eberhard, A. et al. Bregman Proximal Linearized ADMM for Minimizing Separable Sums Coupled by a Difference of Functions. J Optim Theory Appl 203, 1622–1658 (2024). https://doi.org/10.1007/s10957-024-02539-7
Keywords
- Alternating direction method of multipliers (ADMM)
- Bregman distance
- Composite optimization problem
- Difference of functions
- Splitting algorithm
- Linear constraints
- Linearization
- Multi-block structure
- Nonconvex optimization