Abstract
We propose a novel and efficient momentum-based first-order algorithm for optimizing neural networks that uses an adaptive coefficient for the momentum term. Our algorithm, called Adaptive Momentum Coefficient (AMoC), utilizes the inner product of the gradient and the previous update to the parameters to control the weight placed on the momentum term according to changes of direction along the optimization path. The algorithm is easy to implement, and its computational overhead over standard momentum methods is negligible. Extensive empirical results on both convex and neural network objectives show that AMoC performs well in practice and compares favourably with other first- and second-order optimization algorithms. We also provide a convergence analysis and a convergence rate for AMoC, showing theoretical guarantees similar to those of other efficient first-order methods.
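As a rough, non-authoritative illustration of the idea described in the abstract, the sketch below scales the momentum coefficient by the normalized inner product between the current gradient and the previous update. The specific functional form `mu / (1 + beta * cos_gd)`, the function name `amoc_step`, and all hyperparameter values are illustrative assumptions, not the coefficient defined in the paper.

```python
import numpy as np

def amoc_step(theta, d_prev, grad_fn, lr=0.01, mu=0.9, beta=0.1):
    """One illustrative adaptive-momentum update (hypothetical form, not the paper's exact rule)."""
    g = grad_fn(theta)
    # Normalized inner product in [-1, 1]: measures the change of direction
    # between the current gradient and the previous update.
    denom = np.linalg.norm(g) * np.linalg.norm(d_prev) + 1e-12
    cos_gd = np.dot(g, d_prev) / denom
    # Assumed adaptive coefficient: with this particular form, the momentum
    # weight shrinks as the alignment grows and increases when the gradient
    # opposes the previous update (for illustration only).
    mu_adaptive = mu / (1.0 + beta * cos_gd)
    d = g + mu_adaptive * d_prev          # new update direction
    theta_new = theta - lr * d            # parameter update
    return theta_new, d

# Example usage: minimize the quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
theta, d = np.ones(5), np.zeros(5)
for _ in range(200):
    theta, d = amoc_step(theta, d, grad_fn=lambda x: x)
```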
References
Amari, S.I.: Natural gradient works efficiently in learning. Neural Comput. 10(2), 251–276 (1998)
Aujol, J.F., Rondepierre, A., Dossal, C.: Optimal convergence rates for Nesterov acceleration. arXiv preprint arXiv:1805.05719 (2018)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Defazio, A.: On the curved geometry of accelerated optimization. arXiv preprint arXiv:1812.04634 (2018)
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(Jul), 2121–2159 (2011)
Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European Control Conference (ECC), pp. 310–315. IEEE (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)
Martens, J.: Deep learning via Hessian-free optimization. In: ICML, vol. 27, pp. 735–742 (2010)
Martens, J., Grosse, R.: Optimizing neural networks with Kronecker-Factored approximate curvature. In: International Conference on Machine Learning, pp. 2408–2417 (2015)
Meng, X., Chen, H.: Accelerating Nesterov’s method for strongly convex functions with Lipschitz gradient. arXiv preprint arXiv:1109.6058 (2011)
Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182 (2017)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Heidelberg (2013). https://doi.org/10.1007/978-1-4419-8853-9
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2015)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. In: International Conference on Learning Representations (2018)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)
Sutskever, I., Martens, J., Dahl, G.E., Hinton, G.E.: On the importance of initialization and momentum in deep learning. In: ICML (3), vol. 28, pp. 1139–1147 (2013)
Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)
Acknowledgements
We would like to thank Ruth Urner for valuable discussions and insights. We would also like to thank the anonymous reviewers for their useful suggestions and constructive feedback.
Appendices
A Proofs
Lemma
For any arbitrarily large integer \(T\), there exists \(\beta >0\) such that the first \(T+1\) elements of the sequence \(\{\lambda _k\}\) are positive.
Proof
Suppose that \(\lambda _1,\dots ,\lambda _k\) are positive and \(\lambda _{k+1}\) is negative. We have:
Combining these inequalities for \(j=1,\dots ,k\) yields:
therefore:
The right-hand side of this inequality approaches \(+\infty \) as \(\beta \) approaches 0. Thus, by choosing a sufficiently small \(\beta \), the number of positive terms can be made arbitrarily large.
Theorem
For any differentiable convex function f with L-Lipschitz gradients, the sequence generated by AMoC with sufficiently small \(\beta \), \(\mu \in [0,\frac{1}{1+\beta })\), and \(\epsilon \in (0,\frac{\mu (1+\beta )}{L\lambda _1})\) satisfies the following:
where \(\theta ^*\) is the optimal point and \(\tilde{\theta }_T=(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k}\theta _k)/(\sum _{k=1}^T\frac{\lambda _k}{\gamma _k})\).
Proof
To prove this theorem, we follow an approach similar to that of [6]. By definition, we have:
therefore:
By subtracting \(\theta ^*\) and setting \(\delta _k=\theta _k-\theta ^*\), we get:
According to [15, Theorem 2.1.5], \(\delta _k\cdot g_k\ge f(\theta _k)-f(\theta ^*)+\frac{1}{2L}\Vert g_k\Vert ^2\), which, combined with (14), gives:
Summing up all the inequalities for \(k=1,...,T\) yields:
For \(\beta <1\), the function \(\frac{x}{1-\beta x}\) is convex on \([-1,1]\) (this, together with its monotonicity, is verified after the proof); thus:
The function \(\frac{x}{1-\beta x}\) is also increasing, and \(g_k\cdot d_k\ge f(\theta _k)-f(\theta _{k-1})\) [15, Theorem 2.1.5]; therefore:
Furthermore, one can easily show that the sequence \(\{\lambda _k\}\) is decreasing, and
Therefore:
where the final inequality follows from \(f(\theta _1)-f(\theta ^*)\le \frac{L}{2}\Vert \delta _1\Vert ^2\) [15, Theorem 2.1.5], which concludes the proof.
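For completeness, the two elementary properties of \(h(x)=\frac{x}{1-\beta x}\) used in the proof (monotonicity and convexity on \([-1,1]\) for \(\beta \in (0,1)\)) follow from a direct computation:
\[
h'(x)=\frac{1}{(1-\beta x)^{2}}>0, \qquad h''(x)=\frac{2\beta }{(1-\beta x)^{3}}\ge 0,
\]
since \(1-\beta x\ge 1-\beta >0\) for every \(x\in [-1,1]\); hence \(h\) is increasing and convex on this interval.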
B Inner Product Analysis
In this section we report the value of the inner product (\(\bar{g}_t\cdot \bar{d}_t\)) for AMoC and AMoC-N at each iteration/epoch for each of the experiments in Figs. 5 and 6. The results from the Anisotropic Bowl and the Smooth-BPDN experiments are particularly interesting. In the first case (Fig. 5a), with the inner product oscillating between positive and negative values, we can infer that the algorithm crosses the optimum multiple times (without overshooting) but is able to bounce back and eventually reach the optimum in fewer iterations than the baselines. The second case (Fig. 5c) behaves in a similar way, except that the algorithm appears to move close to the optimum, move away, and bounce back several times (most likely along an oval-shaped trajectory) until it gets close enough to terminate. In the Ridge Regression problem (Fig. 5b), the inner product drops from values between 0.5 and 1 to values between -0.5 and -1, indicating that the algorithm moves towards the optimum steadily, with the updates always keeping a low angle with the gradient. In the neural network experiments (Figs. 6a to 6d), the inner product approaches 0 by the end of training. We link this behaviour to the algorithm moving in a circular fashion around the optimum, with the direction of the negative gradient almost perpendicular (\(\phi =\pi /2\)) to the previous update.
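For readers who wish to reproduce this diagnostic, the sketch below shows one way to compute the reported quantity \(\bar{g}_t\cdot \bar{d}_t\) at each step. The function name, the flattening of parameters into a single vector, and the small stabilizing constant are our own illustrative choices, not part of the paper.

```python
import numpy as np

def normalized_inner_product(g, d_prev, eps=1e-12):
    """Cosine of the angle between the current gradient and the previous update.

    Values close to +1 mean the update stays aligned with the gradient,
    values near 0 mean the two are nearly perpendicular (phi ~ pi/2), and
    negative values indicate a change of direction along the optimization path.
    """
    g = np.asarray(g).ravel()
    d_prev = np.asarray(d_prev).ravel()
    return float(np.dot(g, d_prev) / (np.linalg.norm(g) * np.linalg.norm(d_prev) + eps))

# Example usage inside a training loop (grad and prev_update are flat vectors):
# history.append(normalized_inner_product(grad, prev_update))
```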
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Rashidi, Z., Ahmadi K. A., K., An, A., Wang, X. (2021). Adaptive Momentum Coefficient for Neural Network Optimization. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol. 12458. Springer, Cham. https://doi.org/10.1007/978-3-030-67661-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67660-5
Online ISBN: 978-3-030-67661-2