Abstract
“Classical” First Order (FO) algorithms of convex optimization, such as the Mirror Descent algorithm or Nesterov’s optimal algorithm for smooth convex optimization, are well known to have optimal (theoretical) complexity estimates which do not depend on the problem dimension. However, to attain this optimality, the domain of the problem must admit a “good proximal setup.” The latter essentially means that (1) the problem domain should satisfy certain conditions of “favorable geometry,” and (2) the practical use of these methods is conditioned by our ability to compute, at moderate cost, the proximal transformation at each iteration. More often than not, these two conditions are satisfied in optimization problems arising in computational learning, which explains why proximal-type FO methods have recently become the methods of choice for solving various learning problems. Yet they meet their limits in several important problems, such as multi-task learning with a large number of tasks, where the problem domain does not exhibit favorable geometry, and learning and matrix completion problems with nuclear norm constraints, where the numerical cost of computing the proximal transformation becomes prohibitive in the large-scale case. We propose a novel approach to solving nonsmooth optimization problems arising in learning applications where a Fenchel-type representation of the objective function is available. The approach is based on applying FO algorithms to the dual problem and using the accuracy certificates supplied by the method to recover the primal solution. While suboptimal in terms of accuracy guarantees, the proposed approach does not rely upon a “good proximal setup” for the primal problem; it only requires the problem domain to admit a Linear Optimization oracle, i.e., the ability to efficiently maximize a linear form on the domain of the primal problem.
Notes
This is a natural normalization: indeed, \(\Vert e_j\Vert \ll 1\) means that the \(j\)th coordinate of \(x\) has a large Lipschitz constant w.r.t. \(\Vert \cdot \Vert \), in spite of the fact that this coordinate is “perfectly well behaved” on \(X\): its variation on this set is just 2.
We assume here that \(g_\tau \ne 0\) for all \(\tau \le t\). In the opposite case the situation is trivial: when \(g(y_{\tau _*})=0\) for some \(\tau _*\le t\), setting \(\lambda ^t_\tau =0\) for \(\tau \ne \tau _*\) and \(\lambda ^t_{\tau _*}=1\), we ensure that \(\epsilon (y^t,\lambda ^t)=0\).
We assume that \(f'(y_\tau )\ne 0\), for \(\tau \le t\); otherwise, as we remember, the situation is trivial.
Note that to ensure the nonemptiness of \(Y^s_0\), it suffices to set \(h_{0,j}^s(\cdot )=h^s(\cdot )\), so that \(h_{0,j}^s(y)>\ell _s\) for \(y\in \mathop {\hbox {Argmax}}_Yh^s(\cdot )\); recall that \(f_s=\max _{y\in Y} h^s(y)>0\).
In typical applications, \(\psi \) is just linear, so that computing \(y(x)\) is as easy as computing the value of the prox-mapping associated with \(Y\) and \(\omega (\cdot )\).
If our only goal were to approximate \(f_*\) by a smooth concave function, we could use other options, most notably the famous Moreau–Yosida regularization [14, 25]. The advantage of the smoothing based on Fenchel-type representation (1) of \(f_*\) and proximal setup for \(Y\) is that in many cases it indeed allows for computationally cheap approximations, which is usually not the case for Moreau–Yosida regularization.
The restriction \(R\ge 1\) is quite natural. Indeed, with the optimal choice of \(x\), we want most of the terms \(\left[ 1 -\epsilon _j\left[ \langle x,z_j\rangle +b\right] \right] _+\) to be \(\ll 1\); assuming that the numbers of examples with \(\epsilon _j=-1\) and with \(\epsilon _j=1\) are both of order \(N\), this condition can be met only when \(|\langle x,z_j\rangle |\) is at least of order 1 for most \(j\). The latter, in view of (48), implies that \(\Vert \sigma (x)\Vert _{1}\) should be at least \(O(1)\).
To arrive at this situation, one can augment \(z_j\) with an additional entry equal to 1 and redefine \(x\): the new \(x\) is the old \([x,b]\).
In our straightforward software implementation of the algorithm, handling memory of depth \(m\) requires storing in RAM up to \(m+1\) \(p\times p\) matrices, which makes the implementation too space-consuming when \(p\) and \(m\) are large; this is why in our experiments the larger \(p\), the smaller the allowed value of \(m\). Note that with a more sophisticated implementation, handling memory of depth \(m\) would require storing just \(m+1\) rank-1 \(p\times p\) matrices, dramatically reducing the required space.
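To illustrate the last remark, here is a minimal sketch (in Python, with hypothetical names; not the implementation used in the experiments) of keeping a depth-\(m\) memory of rank-1 \(p\times p\) matrices as vector pairs, so that matrix-vector products are formed without ever materializing a dense \(p\times p\) array:

```python
import numpy as np

# Sketch of the storage remark above: a memory of depth m over p x p rank-1
# matrices g_i = u_i v_i^T kept as vector pairs (u_i, v_i) -- O(m*p) floats
# instead of the O(m*p^2) cost of storing m+1 dense p x p matrices.

class RankOneMemory:
    def __init__(self, depth):
        self.depth = depth
        self.pairs = []          # list of (u, v) vector pairs

    def push(self, u, v):
        """Remember the rank-1 matrix u v^T; drop the oldest beyond `depth`."""
        self.pairs.append((u, v))
        if len(self.pairs) > self.depth:
            self.pairs.pop(0)

    def matvec(self, x):
        """Compute (sum_i u_i v_i^T) x without forming any p x p matrix."""
        return sum(u * (v @ x) for u, v in self.pairs)

p, m = 1000, 5
rng = np.random.default_rng(0)
mem = RankOneMemory(m)
for _ in range(m):
    mem.push(rng.standard_normal(p), rng.standard_normal(p))
x = rng.standard_normal(p)
dense = sum(np.outer(u, v) for u, v in mem.pairs)   # dense check, for testing only
assert np.allclose(mem.matvec(x), dense @ x)
```

Each `matvec` costs \(O(mp)\) operations instead of \(O(p^2)\), which is the source of the space (and, incidentally, time) saving mentioned above.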
References
Amit, Y., Fink, M., Srebro, N., Ullman, S.: Uncovering shared structures in multiclass classification. In: Proceedings of the 24th International Conference on Machine Learning, pp. 17–24. ACM (2007)
Ben-Tal, A., Nemirovski, A.: Non-Euclidean restricted memory level method for large-scale convex optimization. Math. Program. 102(3), 407–456 (2005)
Bregman, L.M.: The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 7(3), 200–217 (1967)
Chen, G., Teboulle, M.: Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM J. Optim. 3(3), 538–543 (1993)
Combettes, P.L., Dũng, D., Vũ, B.C.: Dualization of signal recovery problems. Set-Valued Var. Anal. 18(3–4), 373–404 (2010)
Crammer, K., Singer, Y.: On the algorithmic implementation of multiclass kernel-based vector machines. J. Mach. Learn. Res. 2, 265–292 (2002)
Demyanov, V., Rubinov, A.: Approximate Methods in Optimization Problems. Elsevier, Amsterdam (1970)
Dunn, J.C., Harshbarger, S.: Conditional gradient algorithms with open loop step size rules. J. Math. Anal. Appl. 62(2), 432–444 (1978)
Fan, J., Liao, Y., Mincheva, M.: High dimensional covariance matrix estimation in approximate factor models. Ann. Stat. 39(6), 3320–3356 (2011)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Res. Logist. Q. 3(1–2), 95–110 (1956)
Juditsky, A., Nemirovski, A.: First order methods for nonsmooth large-scale convex minimization: I. General purpose methods; II. Utilizing problem’s structure. In: Sra, S., Nowozin, S., Wright, S.J. (eds.) Optimization for Machine Learning, pp. 121–254. MIT Press, Cambridge, MA (2011)
Lemaréchal, C., Nemirovskii, A., Nesterov, Y.: New variants of bundle methods. Math. Program. 69(1–3), 111–147 (1995)
Moreau, J.-J.: Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math. 255, 2897–2899 (1962)
Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France 93, 273–299 (1965)
Nemirovski, A.: Prox-method with rate of convergence \(O(1/t)\) for variational inequalities with Lipschitz continuous monotone operators and smooth convex–concave saddle point problems. SIAM J. Optim. 15(1), 229–251 (2004)
Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010)
Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, Chichester (1983)
Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \( {O}(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2004)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Nesterov, Y.: Dual extrapolation and its applications to solving variational inequalities and related problems. Math. Program. 109(2–3), 319–344 (2007)
Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009)
Nesterov, Y., Nemirovski, A.: Some first order algorithms for \(\ell _1\)/nuclear norm minimization. Acta Numerica 22, 509–575 (2013)
Pshenichnyi, B.N., Danilin, Y.M.: Numerical Methods in Extremal Problems. Mir Publishers, Moscow (1978)
Yosida, K.: Functional Analysis. Springer, Berlin (1964)
Additional information
Research of the third author was supported by the ONR Grant N000140811104 and the NSF Grants DMS 0914785, CMMI 1232623.
Appendix: Proofs
We need the following technical result originating from [4] (for a proof, see [4] or Section 2.1 and Lemma A.2 of [23]).
Lemma 7.1
Let \(Y\) be a nonempty closed and bounded subset of a Euclidean space \(E_y\), and let \(\Vert \cdot \Vert \), \(\omega (\cdot )\) be the corresponding proximal setup. Let, further, \(U\) be a closed convex subset of \(Y\) intersecting the relative interior of \(Y\), and let \(p\in E_y\).
(i)
The optimization problem
$$\begin{aligned} \min _{y\in U} h_p(y):=\left[ \langle p,y\rangle + \omega (y)\right] \end{aligned}$$
has a unique solution \(y_*\). This solution is fully characterized by the inclusion \(y_*\in U\cap Y^o\), \(Y^o=\{y\in Y:\partial \omega (y)\ne \emptyset \}\), coupled with the relation
$$\begin{aligned} \langle p+\omega '(y_*),u-y_*\rangle \ge 0\;\;\forall u\in U. \end{aligned}$$(63)
(ii)
When \(U\) is cut off \(Y\) by a system of linear inequalities \(e_i(y)\le 0\), \(i=1,\ldots ,m\), there exist Lagrange multipliers \(\lambda _i\ge 0\) such that \(\lambda _ie_i(y_*)=0,\;1\le i\le m\), and
$$\begin{aligned} \forall u\in Y:\;\;\left\langle p+\omega '(y_*)+\sum _{i=1}^m\lambda _ie_i^\prime ,u-y_*\right\rangle \ge 0. \end{aligned}$$(64)
(iii)
In the situation of (ii), assuming \(p=\xi -\omega '(y)\) for some \(y\in Y^o\), we have
$$\begin{aligned} \forall u\in Y: \langle \xi ,y_*-u\rangle -\sum _i\lambda _ie_i(u)\le V_y(u)-V_{y_*}(u)-V_{y}(y_*). \end{aligned}$$(65)
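As a concrete illustration of item (i), under the additional assumption of the entropy proximal setup (a sketch, not part of the lemma): for \(Y\) the probability simplex, \(\omega (y)=\sum _i y_i\ln y_i\) and \(U=Y\), the minimizer of \(h_p\) is available in closed form:

```python
import numpy as np

# Illustration (not part of the lemma): for the entropy proximal setup on the
# simplex Y = {y >= 0, sum(y) = 1} with omega(y) = sum_i y_i*ln(y_i) and U = Y,
# the minimizer of h_p(y) = <p,y> + omega(y) is the softmax y* = exp(-p)/sum(exp(-p)).

def entropy_prox(p):
    """Closed-form minimizer of <p,y> + sum_i y_i*log(y_i) over the simplex."""
    w = np.exp(-(p - p.min()))       # shift for numerical stability
    return w / w.sum()

rng = np.random.default_rng(1)
p = rng.standard_normal(6)
y_star = entropy_prox(p)

# Sanity check against the optimality condition (63): for every vertex u = e_j,
# <p + omega'(y*), u - y*> >= 0, where omega'(y) = 1 + log(y).
g = p + 1.0 + np.log(y_star)
vals = g - g @ y_star                # <g, e_j - y*> for each j
assert vals.min() >= -1e-9
```

Here the vector \(p+\omega '(y_*)\) turns out to be constant on the support of \(y_*\), so (63) holds with equality at every vertex, as the check confirms.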
1.1 Proof of Proposition 3.2
1\(^0\). Observe that when \(t\) is a non-terminal step of the algorithm, the level set \(U_t\) is a closed convex subset of \(Y\) which intersects the relative interior of \(Y\); indeed, by (25) and due to \(\epsilon _t>0\) for a non-terminal \(t\), there exists \(y\in Y\) such that \(h_\tau (y)\ge \epsilon _t>\ell _t=\gamma \epsilon _t\), \(\tau \in I_t\), that is, \(U_t\) is cut off \(Y\) by a system of constraints satisfying the Slater condition. Denoting \(I_t=\{\tau _1,\tau _2,\ldots ,\tau _k\}\) and invoking item (iii) of Lemma 7.1 with \(y=\widehat{y}_t\), \(\xi =0\) and \(e_i(\cdot )=\gamma \epsilon _t-h_{\tau _i}(\cdot )\) (so that \(U_t=\{u\in Y: e_i(u)\le 0, 1\le i\le k\}\)), we get \(y_*=y_{t+1}\) and
$$\begin{aligned} \forall u\in Y:\;\; -\sum _{i=1}^k\lambda _ie_i(u)\le V_{\widehat{y}_t}(u)-V_{y_{t+1}}(u)-V_{\widehat{y}_t}(y_{t+1}) \end{aligned}$$
with some \(\lambda _i\ge 0\). When \(u\in U_t\), we have \(e_i(u)\le 0\), that is,
$$\begin{aligned} \forall u\in U_t:\;\; V_{y_{t+1}}(u)\le V_{\widehat{y}_t}(u)-V_{\widehat{y}_t}(y_{t+1}). \end{aligned}$$(66)
2\(^0\). When \(t\) starts a phase, we have \(\widehat{y}_t=y_\omega =y_1\), and clearly \(1\in I_t\), whence \(h_\tau (\widehat{y}_t)\le 0\) for some \(\tau \in I_t\) (specifically, for \(\tau =1\)). When \(t\) does not start a phase, we have \(\widehat{y}_t=y_t\) and \(t\in I_t\), so that here again \(h_\tau (\widehat{y}_t)\le 0\) for some \(\tau \in I_t\). On the other hand, \(h_\tau (y_{t+1})\ge \gamma \epsilon _t\) for all \(\tau \in I_t\) due to \(y_{t+1}\in U_t\). Thus, when passing from \(\widehat{y}_t\) to \(y_{t+1}\), at least one of \(h_\tau (\cdot )\) grows by at least \(\gamma \epsilon _t\). Taking into account that \(h_\tau (z)=\langle g(y_\tau ),y_\tau -z\rangle \) is Lipschitz continuous with constant \(L[g]\) w.r.t. \(\Vert \cdot \Vert \) (by (14)), we conclude that \(\Vert \widehat{y}_t-y_{t+1}\Vert \ge \gamma \epsilon _t/L[g]\). With this in mind, (66) combines with (8) to imply that
$$\begin{aligned} \forall u\in U_t:\;\; V_{y_{t+1}}(u)\le V_{\widehat{y}_t}(u)-{\gamma ^2\epsilon _t^2\over 2L^2[g]}. \end{aligned}$$(67)
3\(^0\). Let the algorithm perform phase \(s\), let \(t_s\) be the first step of this phase, and let \(r\) be another step of the phase. We claim that all level sets \(U_t\), \(t_s\le t\le r\), have a point in common, specifically, (any) \(u\in \mathop {\hbox {Argmax}}_{y\in Y}\min _{\tau \in I_r} h_\tau (y)\). Indeed, since \(r\) belongs to phase \(s\), we have
$$\begin{aligned} \epsilon _r=\max _{y\in Y}\min _{\tau \in I_r} h_\tau (y)=\min _{\tau \in I_r} h_\tau (u)>\gamma \Delta _s \end{aligned}$$
and \(\Delta _s=\epsilon _{t_s}=\max _{y\in Y}\min _{\tau \in I_{t_s}} h_\tau (y)\) (see (25) and the definition of \(\Delta _s\)). Besides this, \(r\) belongs to phase \(s\), and within a phase, the sets \(I_t\) extend as \(t\) grows, so that \(I_{t_s}\subset I_t\subset I_r\) when \(t_s\le t\le r\), implying that \(\epsilon _{t_s}\ge \epsilon _{t_s+1}\ge \cdots \ge \epsilon _r\). Thus, for \(t\in \{t_s,t_s+1,\ldots ,r\}\) we have
$$\begin{aligned} \min _{\tau \in I_t} h_\tau (u)\ge \min _{\tau \in I_r} h_\tau (u)=\epsilon _r>\gamma \Delta _s\ge \gamma \epsilon _t=\ell _t, \end{aligned}$$
implying that \(u\in U_t\).
With the just defined \(u\), let us look at the quantities \(v_t:=V_{\widehat{y}_t}(u)\), \(t_s\le t\le r\). We have \(v_{t_s}\le {1\over 2}\Omega ^2\) due to \(\widehat{y}_{t_s}=y_\omega \) and (11), and
$$\begin{aligned} v_{t+1}\le v_t-{\gamma ^2\epsilon _t^2\over 2L^2[g]}\le v_t-{\gamma ^4\Delta _s^2\over 2L^2[g]} \end{aligned}$$
when \(t_s\le t<r\) (due to (67) combined with \(\widehat{y}_t=y_t\) when \(t_s<t\le r\), and with \(\epsilon _t>\gamma \Delta _s\)). We conclude that \((r-t_s)\gamma ^4\Delta _s^2\le \Omega ^2L^2[g]\). Thus, the number \(T_s\) of steps of phase \(s\) admits the bound
$$\begin{aligned} T_s\le {\Omega ^2L^2[g]\over \gamma ^4\Delta _s^2}+1\le {2\Omega ^2L^2[g]\over \gamma ^4\Delta _s^2}, \end{aligned}$$
where the concluding inequality follows from \(\Delta _s\le \Delta _1=\max _{y\in Y} \langle g(y_\omega ),y_\omega -y\rangle \le L[g]\Omega \), see (11), combined with \(\gamma \in (0,1)\).
4\(^0\). Assume that MDL does not terminate in the course of the first \(T\ge 1\) steps, and let \(s_*\) be the index of the phase to which step \(T\) belongs. Then \(\Delta _{s_*}>\epsilon \) (otherwise we would have terminated no later than at the first step of phase \(s_*\)); and besides this, by construction, \(\Delta _{s+1}\le \gamma \Delta _s\) whenever phase \(s+1\) takes place, whence \(\Delta _s\ge \gamma ^{-(s_*-s)}\Delta _{s_*}>\gamma ^{-(s_*-s)}\epsilon \) for \(1\le s\le s_*\). Therefore
$$\begin{aligned} T\le \sum _{s=1}^{s_*}T_s\le {2\Omega ^2L^2[g]\over \gamma ^4\epsilon ^2}\sum _{s=1}^{s_*}\gamma ^{2(s_*-s)}\le {2\Omega ^2L^2[g]\over \gamma ^4(1-\gamma ^2)\epsilon ^2}. \end{aligned}$$
\(\square \)
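The counting in 3\(^0\) and 4\(^0\) can be checked numerically; the following sketch (illustrative only, with arbitrarily chosen values of \(\Omega \), \(L[g]\), \(\gamma \), \(\epsilon \)) sums the per-phase bounds \(T_s\le \Omega ^2L^2[g]/(\gamma ^4\Delta _s^2)+1\) along the worst-case sequence \(\Delta _{s+1}=\gamma \Delta _s\) and compares the total with the \(O(1/\epsilon ^2)\) geometric-series bound:

```python
import math

# Numerical sanity check of the counting argument above (an illustration, not
# part of the proof): phase gaps obey Delta_{s+1} <= gamma * Delta_s, the last
# phase s* still has Delta_{s*} > eps, and each phase takes at most
# T_s <= Omega^2 L^2 / (gamma^4 Delta_s^2) + 1 steps.

def total_steps_bound(Omega, L, gamma, eps):
    """Sum the per-phase bounds over the worst case Delta_s = gamma**(s*-s)*eps."""
    total, delta = 0.0, eps            # start from the last phase and go back
    while delta <= Omega * L:          # Delta_s <= Delta_1 <= L*Omega
        total += Omega**2 * L**2 / (gamma**4 * delta**2) + 1
        delta /= gamma                 # Delta_{s-1} >= Delta_s / gamma
    return total

Omega, L, gamma, eps = 1.0, 1.0, 0.5, 1e-3
T = total_steps_bound(Omega, L, gamma, eps)
# The sum is dominated by the last phase: a geometric series giving O(1/eps^2),
# plus one extra step per phase (there are only O(log(1/eps)) phases).
closed_form = Omega**2 * L**2 / (gamma**4 * (1 - gamma**2) * eps**2)
assert T <= closed_form + 2 * math.log(Omega * L / eps) / math.log(1 / gamma) + 2
```

The dominant term of the total is thus proportional to \(1/\epsilon ^2\), with the phase count contributing only a logarithmic additive term.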
1.2 Proof of Proposition 3.3
Observe that the algorithm can terminate only in case A of B.2.2.3, and in this case the output is indeed as claimed in the proposition. Thus, all we need to prove is the upper bound (36) on the number of steps before termination.
1\(^0\). Let us bound from above the number of steps at an arbitrary phase \(s\). Assume that phase \(s\) did not terminate in the course of the first \(T\) steps, so that \(u_1,\ldots ,u_T\) are well defined. We claim that then
$$\begin{aligned} \Vert u_\tau -u_{\tau +1}\Vert \ge {\gamma f_s\over L[g]},\;\;1\le \tau <T. \end{aligned}$$(69)
Indeed, by construction \(h^s_{\tau -1,m+1}(y):=\langle g(u_\tau ),u_\tau -y\rangle \) is \(\ge \ell _s=\gamma f_s\) when \(y=u_{\tau +1}\) (due to \(u_{\tau +1}\in Y_\tau \)). Since \(\Vert g(u)\Vert _*\le L[g]\) for all \(u\in Y\), (69) follows.
Now let us look at what happens with the quantities \(\omega (u_\tau )\) as \(\tau \) grows. By strong convexity of \(\omega \) we have
$$\begin{aligned} \omega (u_{\tau +1})-\omega (u_\tau )\ge \langle \omega '(u_\tau ),u_{\tau +1}-u_\tau \rangle +{1\over 2}\Vert u_{\tau +1}-u_\tau \Vert ^2. \end{aligned}$$
The first term in the right hand side is \(\ge 0\), since \(u_\tau \) is the minimizer of \(\omega (\cdot )\) over \(Y^s_{\tau -1}\), while \(u_{\tau +1}\in Y_\tau \subset Y^s_{\tau -1}\). The second term in the right hand side is \(\ge {\ell _s^2\over 2L^2[g]}\) by (69). Thus, \(\omega (u_{\tau +1})-\omega (u_\tau )\ge {\ell _s^2\over 2L^2[g]}\), whence \( \omega (u_T)-\omega (u_1)\ge (T-1){\ell _s^2\over 2L^2[g]}=(T-1){\gamma ^2f_s^2\over 2L^2[g]}. \) Recalling the definition of \(\Omega \) and that \(u_1=y_\omega \), the left hand side in this inequality is \(\le {1\over 2}\Omega ^2\). It follows that whenever phase \(s\) does not terminate in the course of the first \(T\) steps, one has \( T\le {\Omega ^2L^2[g]\over \gamma ^2f_s^2}+1, \) that is, the total number of steps at phase \(s\), provided this phase exists, is at most \( T_s={\Omega ^2L^2[g]\over \gamma ^2f_s^2}+2. \) Now, we have
$$\begin{aligned} f_1=\max _{y\in Y}\langle g(y_\omega ),y_\omega -y\rangle \le L[g]\max _{y\in Y}\Vert y-y_\omega \Vert \le L[g]\Omega \end{aligned}$$
(recall that \(\Vert g(y)\Vert _*\le L[g]\) and see (11)). Thus
$$\begin{aligned} T_s\le {3\Omega ^2L^2[g]\over \gamma ^2f_s^2} \end{aligned}$$
for all \(s\) such that the \(s\)th phase exists. By construction, we have \(f_s\ge \epsilon \) and \(f_s\le (\gamma +(1-\gamma )\theta ) f_{s-1}\), whence the method eventually terminates (since \(\gamma +(1-\gamma )\theta <1\)). Assuming that the termination happens at phase \({s_*}\), we have \(f_s\ge (\gamma +(1-\gamma )\theta )^{s-{s_*}}f_{{s_*}}\) when \(1\le s\le {s_*}\), so that the total number of steps is bounded by
$$\begin{aligned} \sum _{s=1}^{s_*}T_s\le {3\Omega ^2L^2[g]\over \gamma ^2\epsilon ^2}\sum _{s=1}^{s_*}(\gamma +(1-\gamma )\theta )^{2(s_*-s)}\le {3\Omega ^2L^2[g]\over \gamma ^2\left[ 1-(\gamma +(1-\gamma )\theta )^2\right] \epsilon ^2}, \end{aligned}$$
as claimed. \(\square \)
Cox, B., Juditsky, A. & Nemirovski, A. Dual subgradient algorithms for large-scale nonsmooth learning problems. Math. Program. 148, 143–180 (2014). https://doi.org/10.1007/s10107-013-0725-1