DOI: 10.1145/3055399.3055464

Finding approximate local minima faster than gradient descent

Published: 19 June 2017

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.
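As a point of reference for the abstract's terminology, here is a minimal formal sketch, assuming the standard convention from the cubic-regularization literature (following Nesterov and Polyak) rather than the paper's exact statement: for a function f whose Hessian is ρ-Lipschitz, a point x is an ε-approximate local minimum if

\[ \|\nabla f(x)\| \le \varepsilon \qquad \text{and} \qquad \nabla^2 f(x) \succeq -\sqrt{\rho\,\varepsilon}\, I. \]

In words, the gradient is nearly zero and the Hessian has no strongly negative eigenvalue, which rules out sharp saddle points; plain gradient descent is only guaranteed to reach a point satisfying the first condition (an approximate critical point, possibly a saddle), which is the comparison the abstract draws.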

Supplementary Material

MP4 File (d4_sb_t6.mp4)

Information

Published In

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017
1268 pages
ISBN: 9781450345286
DOI: 10.1145/3055399
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2017

Author Tags

  1. Cubic Regularization
  2. Deep Learning
  3. Non-convex Optimization
  4. Second-Order Optimization

Qualifiers

  • Research-article

Conference

STOC '17
Sponsor:
STOC '17: Symposium on Theory of Computing
June 19 - 23, 2017
Montreal, Canada

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 72
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Optimizing mean field spin glasses with external field. Electronic Journal of Probability, 29, 10.1214/23-EJP1066. Online publication date: 1-Jan-2024
  • (2024) Local Minima in Quantum Systems. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pp. 1323-1330, 10.1145/3618260.3649675. Online publication date: 10-Jun-2024
  • (2024) DIST-CURE: A Robust Distributed Learning Algorithm with Cubic Regularized Newton. 2024 IEEE International Symposium on Information Theory (ISIT), pp. 3642-3647, 10.1109/ISIT57864.2024.10619481. Online publication date: 7-Jul-2024
  • (2024) Compressible Non-Newtonian Fluid Based Road Traffic Flow Equation Solved by Physical-Informed Rational Neural Network. IEEE Access, 12:12992-13009, 10.1109/ACCESS.2024.3356173. Online publication date: 2024
  • (2024) A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization. Computational Optimization and Applications, 89(3):843-894, 10.1007/s10589-024-00603-6. Online publication date: 30-Aug-2024
  • (2024) Universal heavy-ball method for nonconvex optimization under Hölder continuous Hessians. Mathematical Programming, 10.1007/s10107-024-02100-4. Online publication date: 4-Jun-2024
  • (2024) Hessian barrier algorithms for non-convex conic optimization. Mathematical Programming, 10.1007/s10107-024-02062-7. Online publication date: 4-Mar-2024
  • (2024) RPH-PGD: Randomly Projected Hessian for Perturbed Gradient Descent. Advances in Knowledge Discovery and Data Mining, pp. 248-259, 10.1007/978-981-97-2253-2_20. Online publication date: 25-Apr-2024
  • (2023) Sketchy. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 75911-75924, 10.5555/3666122.3669438. Online publication date: 10-Dec-2023
  • (2023) Private (stochastic) non-convex optimization revisited. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 65618-65641, 10.5555/3666122.3668986. Online publication date: 10-Dec-2023
