
Zeroth-Order Regularized Optimization (ZORO): Approximately Sparse Gradients and Adaptive Sampling

Published: 28 April 2022

Abstract

We consider the problem of minimizing a high-dimensional objective function, which may include a regularization term, using only (possibly noisy) evaluations of the function. Such optimization is also called derivative-free, zeroth-order, or black-box optimization. We propose a new zeroth-order regularized optimization method, dubbed ZORO. When the underlying gradient is approximately sparse at an iterate, ZORO needs very few objective function evaluations to obtain a new iterate that decreases the objective function. We achieve this with an adaptive, randomized gradient estimator, followed by an inexact proximal-gradient scheme. Under a novel approximately sparse gradient assumption and various convex settings, we show that the (theoretical and empirical) convergence rate of ZORO is only logarithmically dependent on the problem dimension. Numerical experiments show that ZORO outperforms existing methods with similar assumptions on both synthetic and real datasets.
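To make the description above concrete, the following is a minimal Python sketch of one ZORO-style iteration under the assumptions stated in the abstract. It is not the authors' implementation: the gradient decoder below is a simple back-projection followed by hard thresholding, standing in for the paper's compressed-sensing-style sparse estimator; the adaptive choice of the number of samples is omitted; and the function names, step size, sparsity level, sampling radius, and l1 regularizer are all illustrative assumptions.

# Minimal, illustrative sketch of a ZORO-style iteration (not the authors' code).
# It estimates an approximately sparse gradient from a few finite-difference
# queries along random Rademacher directions, then takes a proximal-gradient step.
import numpy as np

def zoro_style_step(f, x, s=10, m=150, delta=1e-4, step=0.1, lam=0.01, rng=None):
    """One sketch iteration: sparse zeroth-order gradient estimate + prox step.

    f: black-box objective (function values only), x: current iterate,
    s: assumed gradient sparsity, m: number of queries (m << len(x)),
    delta: sampling radius, step: step size, lam: illustrative l1 weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    # Random Rademacher query directions; each query gives y_i ~ <z_i, grad f(x)>.
    Z = rng.choice([-1.0, 1.0], size=(m, d))
    fx = f(x)
    y = np.array([(f(x + delta * z) - fx) / delta for z in Z])
    # Crude sparse recovery: unbiased back-projection, then keep the s largest
    # entries. (The paper uses a proper sparse decoder; this is only a stand-in.)
    g_hat = Z.T @ y / m
    keep = np.argsort(np.abs(g_hat))[-s:]
    g = np.zeros(d)
    g[keep] = g_hat[keep]
    # Inexact proximal-gradient step; soft thresholding is the prox of the
    # illustrative regularizer lam * ||x||_1.
    v = x - step * g
    return np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)

# Toy usage: a 500-dimensional objective whose gradient is 10-sparse.
if __name__ == "__main__":
    d, rng = 500, np.random.default_rng(0)
    support = np.arange(10)                       # only 10 coordinates matter
    f = lambda x: 0.5 * np.sum(x[support] ** 2)
    x = 3.0 * np.ones(d)
    for _ in range(50):
        x = zoro_style_step(f, x, rng=rng)
    print("objective after 50 iterations:", f(x))

The point of the sketch is the query count: each iteration uses m + 1 function evaluations with m much smaller than the dimension d, which is what the approximately sparse gradient assumption buys; the adaptive sampling referred to in the title adjusts this query budget across iterations, which the sketch does not attempt.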

Cited By

  • (2023) Fine-tuning language models with just forward passes, Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 53038-53075. DOI: 10.5555/3666122.3668430. Online publication date: 10-Dec-2023.
  • (2023) Stochastic Zeroth-Order Riemannian Derivative Estimation and Optimization, Mathematics of Operations Research, 48(2), pp. 1183-1211. DOI: 10.1287/moor.2022.1302. Online publication date: 1-May-2023.
  • (2022) Zeroth-order hard-thresholding, Proceedings of the 36th International Conference on Neural Information Processing Systems, pp. 22589-22601. DOI: 10.5555/3600270.3601911. Online publication date: 28-Nov-2022.

Published In

SIAM Journal on Optimization, Volume 32, Issue 2
DOI: 10.1137/sjope8.32.2

Publisher

Society for Industrial and Applied Mathematics, United States

Publication History

Published: 28 April 2022

Author Tags

  1. zeroth-order optimization
  2. black-box optimization
  3. derivative-free optimization
  4. regularized optimization
  5. sparse gradients
  6. sparse adversarial attack

MSC Codes

  1. 90C56
  2. 65K05
  3. 68T05
  4. 68Q25

Qualifiers

  • Research-article
