Abstract
Stochastic gradient descent (SGD) for strongly convex functions converges at the rate \(\mathcal {O}(1/k)\). However, achieving good results in practice requires tuning the parameters of the algorithm, for example the learning rate. In this paper we propose a generalization of the Polyak step size, used for subgradient methods, to stochastic gradient descent. We prove non-asymptotic convergence at the rate \(\mathcal {O}(1/k)\), with a rate constant which can be better than the corresponding rate constant for optimally scheduled SGD. We demonstrate that the method is effective in practice, both on convex optimization problems and on training deep neural networks, and compare to the theoretical rate.
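To fix ideas, the classical Polyak step size that the paper generalizes can be sketched as follows. This is a minimal illustration of the deterministic (sub)gradient version only, assuming the optimal value \(f^*\) is known; the quadratic objective and the function names (`polyak_gd`, `grad_f`) are illustrative choices, not the paper's experimental setup or stochastic variant.

```python
def polyak_gd(f, grad_f, f_star, x0, iters=50):
    """Gradient descent with the classical Polyak step size
    h_k = (f(x_k) - f*) / ||grad f(x_k)||^2."""
    x = list(x0)
    for _ in range(iters):
        g = grad_f(x)
        g_norm_sq = sum(gi * gi for gi in g)
        if g_norm_sq == 0.0:  # already at a stationary point
            break
        h = (f(x) - f_star) / g_norm_sq  # Polyak step size
        x = [xi - h * gi for xi, gi in zip(x, g)]
    return x

# Example: f(x) = ||x||^2, minimized at the origin with f* = 0.
f = lambda x: sum(xi * xi for xi in x)
grad_f = lambda x: [2.0 * xi for xi in x]
x_final = polyak_gd(f, grad_f, f_star=0.0, x0=[4.0, -3.0])
```

Note that the step size adapts automatically to the current suboptimality gap \(f(x_k) - f^*\), which is the property the paper carries over to the stochastic setting.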
Availability of Data and Materials
The datasets analysed during the current study are available at https://www.cs.toronto.edu/~kriz/cifar.html.
Code Availability
Available at https://github.com/marianapraz/polyakSGD.
Funding
Adam Oberman was supported by the Air Force Office of Scientific Research under award number FA9550-18-1-0167 and by IVADO. Mariana Prazeres was supported by the FCT scholarship SFRH/BD/145075/2019 (M.P.).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Prazeres, M., Oberman, A.M. Stochastic Gradient Descent with Polyak’s Learning Rate. J Sci Comput 89, 25 (2021). https://doi.org/10.1007/s10915-021-01628-3