DOI: 10.1145/3055399.3055464

Finding approximate local minima faster than gradient descent

Published: 19 June 2017

Abstract

We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time which scales linearly in the underlying dimension and the number of training examples. The time complexity of our algorithm to find an approximate local minimum is even faster than that of gradient descent to find a critical point. Our algorithm applies to a general class of optimization problems including training a neural network and other non-convex objectives arising in machine learning.
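As a point of reference for the abstract's terminology, here is a minimal formal sketch, assuming the standard convention from the cubic-regularization literature (following Nesterov and Polyak) rather than the paper's exact statement: for a function f whose Hessian is ρ-Lipschitz, a point x is an ε-approximate local minimum if

\[ \|\nabla f(x)\| \le \varepsilon \qquad \text{and} \qquad \nabla^2 f(x) \succeq -\sqrt{\rho\,\varepsilon}\, I. \]

In words, the gradient is nearly zero and the Hessian has no strongly negative eigenvalue, which rules out sharp saddle points; plain gradient descent is only guaranteed to reach a point satisfying the first condition (an approximate critical point, possibly a saddle), which is the comparison the abstract draws.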

Supplementary Material

MP4 File (d4_sb_t6.mp4)

Information

Published In

STOC 2017: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
June 2017
1268 pages
ISBN: 9781450345286
DOI: 10.1145/3055399
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2017

Author Tags

  1. Cubic Regularization
  2. Deep Learning
  3. Non-convex Optimization
  4. Second-Order Optimization

Qualifiers

  • Research-article

Conference

STOC '17
Sponsor:
STOC '17: Symposium on Theory of Computing
June 19 - 23, 2017
Montreal, Canada

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 72
  • Downloads (Last 6 weeks): 11
Reflects downloads up to 11 Dec 2024

Citations

Cited By

  • (2024) Optimizing mean field spin glasses with external field. Electronic Journal of Probability, 29, 10.1214/23-EJP1066. Online publication date: 1-Jan-2024
  • (2024) Local Minima in Quantum Systems. Proceedings of the 56th Annual ACM Symposium on Theory of Computing, pp. 1323-1330, 10.1145/3618260.3649675. Online publication date: 10-Jun-2024
  • (2024) DIST-CURE: A Robust Distributed Learning Algorithm with Cubic Regularized Newton. 2024 IEEE International Symposium on Information Theory (ISIT), pp. 3642-3647, 10.1109/ISIT57864.2024.10619481. Online publication date: 7-Jul-2024
  • (2024) Compressible Non-Newtonian Fluid Based Road Traffic Flow Equation Solved by Physical-Informed Rational Neural Network. IEEE Access, 12:12992-13009, 10.1109/ACCESS.2024.3356173. Online publication date: 2024
  • (2024) A Newton-CG based barrier-augmented Lagrangian method for general nonconvex conic optimization. Computational Optimization and Applications, 89(3):843-894, 10.1007/s10589-024-00603-6. Online publication date: 30-Aug-2024
  • (2024) Universal heavy-ball method for nonconvex optimization under Hölder continuous Hessians. Mathematical Programming, 10.1007/s10107-024-02100-4. Online publication date: 4-Jun-2024
  • (2024) Hessian barrier algorithms for non-convex conic optimization. Mathematical Programming, 10.1007/s10107-024-02062-7. Online publication date: 4-Mar-2024
  • (2024) RPH-PGD: Randomly Projected Hessian for Perturbed Gradient Descent. Advances in Knowledge Discovery and Data Mining, pp. 248-259, 10.1007/978-981-97-2253-2_20. Online publication date: 25-Apr-2024
  • (2023) Sketchy. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 75911-75924, 10.5555/3666122.3669438. Online publication date: 10-Dec-2023
  • (2023) Private (stochastic) non-convex optimization revisited. Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 65618-65641, 10.5555/3666122.3668986. Online publication date: 10-Dec-2023
