Random Gradient-Free Minimization of Convex Functions

Abstract

In this paper, we prove new complexity bounds for methods of convex optimization based only on computation of the function value. The search directions of our schemes are normally distributed random Gaussian vectors. It appears that such methods usually need at most n times more iterations than the standard gradient methods, where n is the dimension of the space of variables. This conclusion is true for both nonsmooth and smooth problems. For the latter class, we also present an accelerated scheme with the expected rate of convergence \(O\Big ({n^2 \over k^2}\Big )\), where k is the iteration counter. For stochastic optimization, we propose a zero-order scheme and justify its expected rate of convergence \(O\Big ({n \over k^{1/2}}\Big )\). We also give some bounds on the rate of convergence of the random gradient-free methods to stationary points of nonconvex functions, in both the smooth and nonsmooth cases. Our theoretical results are supported by preliminary computational experiments.
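
The scheme sketched in the abstract is simple to state in code. The following Python fragment is only a minimal illustration of a Gaussian random-search step that uses function values alone; the smoothing parameter mu, the step size h, the iteration budget, and the toy quadratic objective are arbitrary placeholder choices, not the theoretically justified parameters derived in the paper.

```python
import numpy as np

def rgf_step(f, x, mu, h, rng):
    """One random gradient-free step: sample a Gaussian direction and move
    along a finite-difference estimate of the gradient (here B = I)."""
    u = rng.standard_normal(x.shape)          # random Gaussian direction
    g = (f(x + mu * u) - f(x)) / mu * u       # uses function values only
    return x - h * g                          # gradient-type update

# Illustrative run on a toy smooth convex function.
rng = np.random.default_rng(0)
f = lambda z: 0.5 * float(np.dot(z, z))
x = np.ones(10)
for _ in range(5000):
    x = rgf_step(f, x, mu=1e-4, h=1e-2, rng=rng)
print(np.linalg.norm(x))                      # close to zero
```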

Notes

  1. In [15], u was uniformly distributed over the unit ball. In our comparison, we use a direct translation of the constructions in [15] into the language of the normal Gaussian distribution.

  2. The presence of this oracle is the main reason why we call our methods gradient-free (not derivative-free!). Indeed, a directional derivative is a much simpler object than the gradient. It can easily be defined for a very large class of functions. At the same time, the definition of the gradient (or subgradient) is much more involved. It is well known that in the nonsmooth case, the collection of partial derivatives is not a subgradient of a convex function. For nonsmooth nonconvex functions, even the possibility of computing a single subgradient needs a serious mathematical justification [17]. On the other hand, if we have access to a program for computing the value of our function, then a program for computing directional derivatives can be obtained by trivial automatic forward differentiation (see the sketch after these notes).

  3. The rest of the proof is very similar to the proof of Lemma 2.2.4 in [16]. We present it here just for the reader's convenience.
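
As a small aside to Note 2, forward differentiation can be illustrated with dual numbers. The Dual class and the dsin helper below are ad hoc constructions introduced only for this example (they are not part of the paper); the point is that any program built from such primitives evaluates the directional derivative \(f'(x;u)\) alongside the function value at essentially no extra cost.

```python
import math

class Dual:
    """Minimal dual number: a value plus its derivative along one fixed direction."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def _wrap(self, o):
        return o if isinstance(o, Dual) else Dual(o)
    def __add__(self, o):
        o = self._wrap(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = self._wrap(o)
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def dsin(x):
    """sin extended to dual numbers (chain rule applied once)."""
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def directional_derivative(f, x, u):
    """f'(x; u): evaluate f on dual numbers seeded with the direction u."""
    return f([Dual(xi, ui) for xi, ui in zip(x, u)]).dot

# Example: the same code that evaluates f also yields f'(x; u).
f = lambda x: x[0] * x[1] + dsin(x[0])
print(directional_derivative(f, [1.0, 2.0], [1.0, 0.0]))   # x_1 + cos(x_0) ≈ 2.5403
```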

References

  1. A. Agarwal, O. Dekel, and L. Xiao, Optimal algorithms for online convex optimization with multi-point bandit feedback, in Proceedings of the 23rd Annual Conference on Learning Theory, 2010, pp. 28-40.

  2. A. Agarwal, D. Foster, D. Hsu, S. Kakade, and A. Rakhlin, Stochastic convex optimization with bandit feedback, SIAM J. on Optimization, 23 (2013), pp. 213-240.

  3. D. Bertsimas and S. Vempala, Solving convex programs by random walks, J. of the ACM, 51 (2004), pp. 540-556.

  4. F. Clarke, Optimization and nonsmooth analysis, Wiley, New York, 1983.

  5. A. Conn, K. Scheinberg, and L. Vicente, Introduction to derivative-free optimization. MPS-SIAM series on optimization, SIAM, Philadelphia, 2009.

  6. C. Dorea, Expected number of steps of a random optimization method, JOTA, 39 (1983), pp. 165-171.

  7. J. Duchi, M.I. Jordan, M.J. Wainwright, and A. Wibisono, Finite sample convergence rate of zero-order stochastic optimization methods, in NIPS, 2012, pp. 1448-1456.

  8. A. D. Flaxman, A. T. Kalai, and B. H. McMahan, Online convex optimization in the bandit setting: gradient descent without a gradient, in Proceedings of the 16th annual ACM-SIAM symposium on Discrete Algorithms, 2005, pp. 385-394.

  9. R. Kleinberg, A. Slivkins, and E. Upfal, Multi-armed bandits in metric spaces, in Proceedings of the 40th annual ACM symposium on Theory of Computing, 2008, pp. 681-690.

  10. J. C. Lagarias, J. A. Reeds, M. H. Wright, and P. E. Wright, Convergence properties of the Nelder-Mead Simplex Algorithm in low dimensions, SIAM J. Optimization, 9 (1998), pp. 112-147.

  11. J. C. Lagarias, B. Poonen, and M. H. Wright, Convergence of the restricted Nelder-Mead algorithm in two dimensions, SIAM J. Optimization, 22 (2012), pp. 501-532.

  12. J. Matyas, Random optimization. Automation and Remote Control, 26 (1965), pp. 246-253.

  13. J. A. Nelder and R. Mead, A simplex method for function minimization, Computer Journal, 7 (1965), pp. 308-313.

  14. A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, Robust stochastic approximation approach to stochastic programming, SIAM J. on Optimization, 19 (2009), pp. 1574-1609.

  15. A. Nemirovsky and D. Yudin, Problem complexity and method efficiency in optimization, John Wiley and Sons, New York, 1983.

  16. Yu. Nesterov, Introductory Lectures on Convex Optimization, Kluwer, Boston, 2004.

  17. Yu. Nesterov, Lexicographic differentiation of nonsmooth functions, Mathematical Programming, 104 (2005), pp. 669-700.

  18. Yu. Nesterov, Random gradient-free minimization of convex functions, CORE Discussion Paper # 2011/1, (2011).

  19. Yu. Nesterov, Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. on Optimization, 22 (2012), pp. 341-362.

  20. B. Polyak, Introduction to Optimization, Optimization Software, Inc., Publications Division, New York, 1987.

  21. V. Protasov, Algorithms for approximate calculation of the minimum of a convex function from its values, Mathematical Notes, 59 (1996), pp. 69-74.

  22. M. Sarma, On the convergence of the Baba and Dorea random optimization methods, JOTA, 66 (1990), pp. 337-343.

Acknowledgments

The authors would like to thank two anonymous referees for their extremely careful and helpful comments. Pavel Dvurechensky proposed a better proof of inequality (37), which we use in this paper. The research activity of the first author was partially supported by the grant “Action de recherche concertée ARC 04/09-315” from the “Direction de la recherche scientifique - Communauté française de Belgique” and by RFBR research project 13-01-12007 ofi_m. The second author was supported by the Laboratory of Structural Methods of Data Analysis in Predictive Modeling, MIPT, through RF government grant ag. 11.G34.31.0073.

Author information

Corresponding author

Correspondence to Yurii Nesterov.

Additional information

Communicated by Michael Overton.

Y. Nesterov: This work was done in affiliation with Higher School of Economics, Moscow.

Appendix: Proofs of Statements of Sect. 2

Proof of Lemma 1

Denote \(\psi (p) = \ln M_p\). This function is convex in p. Let us represent \(p = (1-\alpha )\cdot 0 + \alpha \cdot 2\) (thus, \(\alpha = {p \over 2}\)). For \(p \in [0,2]\), we have \(\alpha \in [0,1]\). Therefore,

$$\begin{aligned} \psi (p) \;\le \; (1-\alpha ) \psi (0) + \alpha \psi (2) \;\mathop {=}\limits ^{(14)}\; {p \over 2} \ln n. \end{aligned}$$

This is the upper bound (16). If \(p \ge 2\), then \(\alpha \ge 1\), and \(\alpha \psi (2)\) becomes a lower bound for \(\psi (p)\). It remains to prove the upper bound in (17).

Let us fix some \(\tau \in (0,1)\). Note that for any \(t \ge 0\) we have

$$\begin{aligned} t^p \mathrm{e}^{-{\tau \over 2} t^2} \;\le \; \left( {p \over \tau e}\right) ^{p/2}. \end{aligned}$$
(80)

Therefore,

$$\begin{aligned} M_p &= {1 \over \kappa }\int \limits _E \Vert u \Vert ^p \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \;=\; {1 \over \kappa } \int \limits _E \Vert u \Vert ^p \mathrm{e}^{-{\tau \over 2} \Vert u \Vert ^2} \mathrm{e}^{-{1-\tau \over 2} \Vert u \Vert ^2} \mathrm{d}u\\ &\mathop {\le }\limits ^{(80)} {1 \over \kappa } \left( {p \over \tau e}\right) ^{p/2} \int \limits _E \mathrm{e}^{-{1-\tau \over 2} \Vert u \Vert ^2} \mathrm{d}u \;=\; \left( {p \over \tau e}\right) ^{p/2} {1 \over (1-\tau )^{n/2}}. \end{aligned}$$

The minimum of the right-hand side in \(\tau \in (0,1)\) is attained at \(\tau = {p \over p+n}\). Thus,

$$\begin{aligned} M_p \;\le \; \left( {p \over e}\right) ^{p/2} \left( 1 + {n \over p}\right) ^{p/2} \left( 1 + {p \over n}\right) ^{n/2} \;\le \; (p+n)^{p/2}. \end{aligned}$$

\(\square \)
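
A quick Monte Carlo experiment makes the bounds (16) and (17) of Lemma 1 concrete. The snippet below assumes \(B = I\), so that \(M_p\) is the \(p\)-th absolute moment of the Euclidean norm of \(u \sim \mathcal{N}(0, I_n)\); the dimension and the sample size are arbitrary, and the comparison is a numerical sanity check rather than part of the proof.

```python
import numpy as np

# Monte Carlo check of M_p <= n^{p/2} for p in [0,2]   (16)
# and M_p <= (p+n)^{p/2} for p >= 2                    (17), taking B = I.
rng = np.random.default_rng(0)
n, samples = 8, 200_000
norms = np.linalg.norm(rng.standard_normal((samples, n)), axis=1)

for p in (1.0, 2.0, 3.0, 4.0, 6.0):
    M_p = np.mean(norms ** p)
    bound = n ** (p / 2) if p <= 2 else (p + n) ** (p / 2)
    print(f"p = {p}:  M_p ≈ {M_p:10.1f}   bound = {bound:10.1f}")
```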

Proof of Theorem 1

Indeed, for any \(x \in E\) we have \(f_{\mu }(x) - f(x) = {1 \over \kappa } \int \limits _E [ f(x+\mu u) - f(x)] \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u\). Therefore, if \(f \in C^{0,0}(E)\), then

$$\begin{aligned} |f_{\mu }(x) - f(x)| &\le {1 \over \kappa } \int \limits _E | f(x+\mu u) - f(x)| \, \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \\ &\le {\mu L_0(f) \over \kappa } \int \limits _E \Vert u \Vert \, \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \;\mathop {\le }\limits ^{(16)}\; \mu L_0(f)\, n^{1/2}. \end{aligned}$$

Further, if f is differentiable at x, then

$$\begin{aligned} f_{\mu }(x) - f(x) \;=\; {1 \over \kappa } \int \limits _E \big [ f(x+\mu u) - f(x) - \mu \langle \nabla f(x), u \rangle \big ] \, \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u. \end{aligned}$$

Therefore, if \(f \in C^{1,1}(E)\), then

$$\begin{aligned} |f_{\mu }(x) - f(x)| \;\mathop {\le }\limits ^{(6)}\; {\mu ^2 L_1(f) \over 2\kappa } \int \limits _E \Vert u \Vert ^2 \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \;\mathop {=}\limits ^{(14)}\; {\mu ^2 L_1(f) \over 2}\, n. \end{aligned}$$

Finally, if f is twice differentiable at x, then

$$\begin{aligned} &{1 \over \kappa } \int \limits _E \Big [ f(x+\mu u) - f(x) - \mu \langle \nabla f(x), u \rangle - {\mu ^2 \over 2} \langle \nabla ^2 f(x) u, u \rangle \Big ] \, \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u\\ &\qquad \mathop {=}\limits ^{(13)}\; f_{\mu }(x) - f(x) - {\mu ^2 \over 2} \langle \nabla ^2 f(x), B^{-1} \rangle . \end{aligned}$$

Therefore, if \(f \in C^{2,2}(E)\), then

$$\begin{aligned} \Big | f_{\mu }(x) - f(x) - {\mu ^2 \over 2} \langle \nabla ^2 f(x), B^{-1} \rangle \Big | \;\mathop {\le }\limits ^{(7)}\; {\mu ^3 L_2(f) \over 6\kappa } \int \limits _E \Vert u \Vert ^3 \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \;\mathop {\le }\limits ^{(17)}\; {\mu ^3 L_2(f) \over 6}\, (n+3)^{3/2}. \end{aligned}$$

\(\square \)
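
The first bound of Theorem 1 is easy to observe numerically. The sketch below takes the nonsmooth function \(f(x) = \Vert x \Vert _2\), for which \(L_0(f) = 1\) when \(B = I\), estimates the Gaussian approximation \(f_{\mu }(x)\) by sampling, and compares the gap with \(\mu L_0(f)\, n^{1/2}\); the dimension, \(\mu \), and the sample size are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo check of |f_mu(x) - f(x)| <= mu * L_0(f) * n^{1/2}
# for f(x) = ||x||_2 (so L_0(f) = 1), taking B = I.
rng = np.random.default_rng(1)
n, mu, samples = 10, 0.05, 200_000
f = lambda z: np.linalg.norm(z, axis=-1)

x = rng.standard_normal(n)
u = rng.standard_normal((samples, n))
f_mu = np.mean(f(x + mu * u))                         # sample estimate of f_mu(x)
gap, bound = abs(f_mu - f(x)), mu * np.sqrt(n)
print(f"|f_mu(x) - f(x)| ≈ {gap:.4f}   <=   {bound:.4f}")
```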

Proof of Lemma 2

Indeed, for all x and y in E, we have

$$\begin{aligned} \Vert \nabla f_{\mu }(x) - \nabla f_{\mu }(y)\Vert _* &\mathop {\le }\limits ^{(21)} {1 \over \kappa \mu } \int \limits _E |f(x + \mu u) - f(y+\mu u)| \, \Vert u \Vert \, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2} \mathrm{d}u\\ &\le {1 \over \kappa \mu }\, L_0(f) \int \limits _E \Vert u \Vert \, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2} \mathrm{d}u \cdot \Vert x - y \Vert . \end{aligned}$$

It remains to apply (16). \(\square \)

Proof of Theorem 2

Let \(\mu >0\). Since \(f_{\mu }\) is convex, for all x and \(y \in E\) we have

Taking the limit as \(\mu \rightarrow 0\), we obtain the statement for \(\mu = 0\). \(\square \)

Proof of Lemma 3

Indeed, for a function \(f \in C^{1,1}(E)\), we have

$$\begin{aligned} \Vert \nabla f_{\mu }(x) - \nabla f(x) \Vert _* &\mathop {=}\limits ^{(25)} \Big \Vert {1 \over \kappa } \int \limits _E \left( {f(x+\mu u ) - f(x) \over \mu } - \langle \nabla f(x), u \rangle \right) Bu \, \mathrm{e}^{-{1 \over 2}\Vert u \Vert ^2} \mathrm{d}u \Big \Vert _* \\ &\le {1 \over \kappa \mu } \int \limits _E |f(x+\mu u ) - f(x) - \mu \langle \nabla f(x), u \rangle | \, \Vert u \Vert \, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u\\ &\mathop {\le }\limits ^{(6)} {\mu L_1(f) \over 2 \kappa } \int \limits _E \Vert u \Vert ^3 \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \;\mathop {\le }\limits ^{(17)}\; {\mu \over 2}\, L_1(f)\, (n+3)^{3/2}. \end{aligned}$$

Let \(f \in C^{2,2}(E)\). Denote \(a_u(\tau ) = f(x+\tau u) - f(x) - \tau \langle \nabla f(x), u \rangle - {\tau ^2 \over 2} \langle \nabla ^2 f(x) u, u \rangle \). Then, \(|a_u(\pm \mu )| \mathop {\le }\limits ^{(7)} {\mu ^3 \over 6} L_2(f) \Vert u \Vert ^3\). Since

we have

$$\begin{aligned} \Vert \nabla f_{\mu }(x) - \nabla f(x) \Vert _* &\le {1 \over 2\kappa \mu } \int \limits _E |f(x+\mu u ) - f(x-\mu u) - 2 \mu \langle \nabla f(x), u \rangle | \, \Vert u \Vert \, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u\\ &= {1 \over 2\kappa \mu } \int \limits _E |a_u(\mu )-a_u(-\mu )| \, \Vert u \Vert \, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \\ &\le {\mu ^2 L_2(f) \over 6 \kappa } \int \limits _E \Vert u \Vert ^4 \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \;\mathop {\le }\limits ^{(17)}\; {\mu ^2 \over 6}\, L_2(f)\, (n+4)^{2}. \end{aligned}$$

\(\square \)
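
The first estimate of Lemma 3 can be checked in the same spirit. The sketch below takes \(f\) to be the log-sum-exp function, which belongs to \(C^{1,1}(E)\) with \(L_1(f) \le 1\) for \(B = I\), estimates \(\nabla f_{\mu }(x)\) through its finite-difference representation as an expectation over Gaussian directions, and compares the error with \({\mu \over 2} L_1(f) (n+3)^{3/2}\). The dimension, \(\mu \), and the sample size are arbitrary, and the reported error also contains some Monte Carlo noise.

```python
import numpy as np

# Monte Carlo check of ||grad f_mu(x) - grad f(x)|| <= (mu/2) L_1(f) (n+3)^{3/2}
# for f = log-sum-exp (so L_1(f) <= 1), taking B = I.
rng = np.random.default_rng(2)
n, mu, samples = 6, 0.1, 400_000

def f(z):
    z = np.atleast_2d(z)
    m = z.max(axis=-1, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(z - m), axis=-1, keepdims=True)), axis=-1)

x = rng.standard_normal(n)
grad_f = np.exp(x - f(x))                             # softmax(x), the exact gradient
u = rng.standard_normal((samples, n))
grad_f_mu = np.mean(((f(x + mu * u) - f(x)) / mu)[:, None] * u, axis=0)

err = np.linalg.norm(grad_f_mu - grad_f)
bound = 0.5 * mu * (n + 3) ** 1.5
print(f"||grad f_mu(x) - grad f(x)|| ≈ {err:.3f}   <=   {bound:.3f}")
```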

Proof of Lemma 4

Indeed,

$$\begin{aligned} \Vert \nabla f(x) \Vert ^2_* &\mathop {=}\limits ^{(13)} \Big \Vert {1 \over \kappa } \int \limits _E \langle \nabla f(x) , u \rangle \, Bu\, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \Big \Vert _*^2\\ &= \Big \Vert {1 \over \kappa \mu } \int \limits _E \big ( [f(x+\mu u) - f(x)] - [f(x+\mu u) - f(x) - \mu \langle \nabla f(x) , u \rangle ]\big ) Bu\, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \Big \Vert _*^2\\ &\mathop {\le }\limits ^{(26)} 2 \Vert \nabla f_{\mu }(x) \Vert _*^2 + {2 \over \mu ^2} \Big \Vert {1 \over \kappa } \int \limits _E [f(x+\mu u) - f(x) - \mu \langle \nabla f(x) , u \rangle ]\, Bu\, \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u \Big \Vert _*^2\\ &\le 2 \Vert \nabla f_{\mu }(x) \Vert _*^2 + {2 \over \mu ^2 \kappa } \int \limits _E [f(x+\mu u) - f(x) - \mu \langle \nabla f(x) , u \rangle ]^2 \Vert u\Vert ^2 \mathrm{e}^{-{1 \over 2} \Vert u \Vert ^2}\mathrm{d}u\\ &\mathop {\le }\limits ^{(6)} 2 \Vert \nabla f_{\mu }(x) \Vert _*^2 + {\mu ^2 \over 2} L_1^2(f)\, M_6. \end{aligned}$$

It remains to use inequality (17). \(\square \)

About this article

Cite this article

Nesterov, Y., Spokoiny, V. Random Gradient-Free Minimization of Convex Functions. Found Comput Math 17, 527–566 (2017). https://doi.org/10.1007/s10208-015-9296-2
