Abstract
We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning \((L/\mu )^2\) (where \(L\) is a bound on the smoothness and \(\mu \) on the strong convexity) to a linear dependence on \(L/\mu \). Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can also improve convergence in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
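To make the SGD–Kaczmarz connection concrete, the following minimal Python sketch (illustrative only; the function name, defaults, and test problem are ours and not taken from the paper) runs SGD on a least-squares objective with rows sampled proportionally to their squared norms. With the indicated step size, each update coincides with the randomized Kaczmarz projection step.

```python
import numpy as np

def weighted_sgd_least_squares(A, b, n_iter=5000, seed=0):
    """SGD for F(x) = (1/(2n)) * sum_i (<a_i, x> - b_i)^2 with importance sampling:
    row i is drawn with probability ||a_i||^2 / ||A||_F^2, and the gradient of f_i
    is reweighted by 1/(n * p_i) so the estimate stays unbiased.  With the step
    size chosen below, each update equals a randomized Kaczmarz projection step.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()   # importance sampling distribution
    gamma = n / row_norms_sq.sum()              # step size recovering Kaczmarz
    x = np.zeros(d)
    for _ in range(n_iter):
        i = rng.choice(n, p=probs)
        grad = (A[i] @ x - b[i]) * A[i] / (n * probs[i])  # unbiased estimate of grad F
        x -= gamma * grad    # equals x + (b_i - <a_i, x>) a_i / ||a_i||^2
    return x

# Consistent system: the iterates converge to the unique least-squares solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
print(np.linalg.norm(weighted_sgd_least_squares(A, b) - x_true))
```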
Notes
Bach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more complicated, but in terms of \(L_i\) yields (2.3). They also study the use of polynomial decaying step-sizes, but these do not lead to improved runtime if the target accuracy is known ahead of time.
References
Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems (NIPS) (2011)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine Learning, p. 351 (2011)
Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168 (2007)
Boutsidis, C., Mahoney, M., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Symposium on Discrete Algorithms, pp. 968–977 (2009)
Byrne, C.L.: Applied Iterative Methods. A K Peters Ltd., Wellesley, MA. ISBN:978-1-56881-342-4; 1-56881-271-X (2008)
Cenker, C., Feichtinger, H.G., Mayer, M., Steier, H., Strohmer, T.: New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. In: Proceedings of SPIE: Visual Communications and Image Processing, pp. 299–310 (1992)
Censor, Y., Eggermont, P.P.B., Gordon, D.: Strong underrelaxation in Kaczmarz’s method for inconsistent systems. Numer. Math. 41(1), 83–92 (1983)
Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.: Coherent matrix completion. In: Proceedings of the 31st International Conference on Machine Learning, pp. 674–682 (2014)
Eggermont, P.P.B., Herman, G.T., Lent, A.: Iterative algorithms for large partitioned linear systems, with applications to image reconstruction. Linear Algebra Appl. 40, 37–67 (1981). doi:10.1016/0024-3795(81)90139-7. ISSN:0024–3795
Eldar, Y.C., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011). doi:10.1007/s11075-011-9451-z. ISSN:1017–1398
Elfving, T.: Block-iterative methods for consistent and inconsistent linear equations. Numer. Math. 35(1), 1–12 (1980). doi:10.1007/BF01396365. ISSN:0029–599X
Foygel, R., Srebro, N.: Concentration-based guarantees for low-rank matrix reconstruction. In: 24th Annual Conference on Learning Theory (COLT) (2011)
Gittens, A., Mahoney, M.: Revisiting the Nyström method for improved large-scale machine learning. In: International Conference on Machine Learning (ICML) (2013)
Hanke, M., Niethammer, W.: On the acceleration of Kaczmarz’s method for inconsistent linear systems. Linear Algebra Appl. 130, 83–98 (1990)
Herman, G.T.: Fundamentals of computerized tomography: image reconstruction from projections. Springer, Berlin (2009)
Herman, G.T., Meyer, L.B.: Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Med. Imaging 12(3), 600–609 (1993)
Hounsfield, G.N.: Computerized transverse axial scanning (tomography): Part 1. Description of system. Br. J. Radiol. 46(552), 1016–1022 (1973)
Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Polon. Sci. Lett. Ser. A. 35, 335–357 (1937)
Krahmer, F., Ward, R.: Stable and robust sampling strategies for compressive imaging. IEEE Trans. Image Proc. 23(2), 612–622 (2014)
Lai, M., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imaging Sci. 6(2), 1059–1091 (2013)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156 (2013)
Liu, J., Wright, S.J., Sridhar, S.: An asynchronous parallel randomized Kaczmarz algorithm. arXiv:1401.4780 (2014)
Ma, P., Yu, B., Mahoney, M.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning (ICML). arXiv:1306.5362 (2014)
Mahoney, M.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Murata, N.: A Statistical Study of On-Line Learning. Cambridge University Press, Cambridge (1998)
Natterer, F.: The Mathematics of Computerized Tomography, volume 32 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2001). ISBN:0-89871-493-1. doi:10.1137/1.9780898719284. Reprint of the 1986 original
Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT 50(2), 395–403 (2010). doi:10.1007/s10543-010-0265-5. ISSN:0006–3835
Needell, D., Tropp, J.A.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)
Needell, D., Ward, R.: Two-subspace projection method for coherent overdetermined linear systems. J Fourier Anal. Appl. 19(2), 256–269 (2013)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Popa, C.: Extensions of block-projections methods with relaxation parameters to inconsistent and rank-deficient least-squares problems. BIT 38(1), 151–176 (1998). doi:10.1007/BF02510922. ISSN:0006–3835
Popa, C.: Block-projections algorithms with blocks containing mutually orthogonal rows and columns. BIT 39(2), 323–338 (1999). doi:10.1023/A:1022398014630. ISSN:0006–3835
Popa, C.: A fast Kaczmarz–Kovarik algorithm for consistent least-squares problems. Korean J. Comput. Appl. Math. 8(1), 9–26 (2001). ISSN:1229–9502
Popa, C.: A Kaczmarz–Kovarik algorithm for symmetric ill-conditioned matrices. An. Ştiinţ. Univ. Ovidius Constanţa Ser. Mat. 12(2), 135–146 (2004). ISSN:1223–723X
Popa, C., Preclik, T., Köstler, H., Rüde, U.: On Kaczmarz’s projection iteration as a direct solver for linear least squares problems. Linear Algebra Appl. 436(2), 389–404 (2012)
Rauhut, H., Ward, R.: Sparse Legendre expansions via \(\ell _1\)-minimization. J. Approx. Theory 164(5), 517–533 (2012)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2012)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training set size. In: Proceedings of the 25th International Conference on Machine Learning, pp. 928–935 (2008)
Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. arXiv:1212.1824 (2012)
Spielman, D., Srivastava, N.: Graph sparsification by effective resistances. SIAM J. Comput. 40(6), 1913–1926 (2011)
Srebro, N., Sridharan, K., Tewari, A.: Smoothness, low noise and fast rates. In: Advances in Neural Information Processing Systems, pp. 2199–2207 (2010)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009). doi:10.1007/s00041-008-9030-4. ISSN:1069–5869
Tanabe, K.: Projection method for solving a singular system of linear equations and its applications. Numer. Math. 17(3), 203–214 (1971)
Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14, 2729–2769 (2013)
Whitney, T.M., Meany, R.K.: Two algorithms related to the method of steepest descent. SIAM J. Numer. Anal. 4(1), 109–118 (1967)
Zhang, H., Yin, W.: Gradient methods for convex minimization: better rates under weaker conditions. CAM Report 13–17, UCLA (2013)
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling. arXiv:1401.2753 (2014)
Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least-squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)
Acknowledgments
We would like to thank the anonymous reviewers for their useful feedback which significantly improved the manuscript. We would like to thank Chris White for pointing out a simplified proof of Corollary 2.2. DN was partially supported by a Simons Foundation Collaboration grant, NSF CAREER #1348721 and an Alfred P. Sloan Fellowship. NS was partially supported by a Google Research Award. RW was supported in part by ONR Grant N00014-12-1-0743, an AFOSR Young Investigator Program Award, and an NSF CAREER award.
Appendix: Proofs
Our main results utilize an elementary fact about smooth functions with Lipschitz continuous gradient, called the co-coercivity of the gradient. We state the lemma and recall its proof for completeness.
1.1 The co-coercivity Lemma
Lemma 8.1
(Co-coercivity) For a smooth function \(f\) whose gradient has Lipschitz constant \(L\),
$$\Vert \nabla f(\varvec{x}) - \nabla f(\varvec{y})\Vert _2^2 \le L\, \big \langle \varvec{x}- \varvec{y},\, \nabla f(\varvec{x}) - \nabla f(\varvec{y}) \big \rangle .$$
Proof
Since \(\nabla f\) has Lipschitz constant \(L\), if \({\varvec{x}}_{\star }\) is the minimizer of \(f\), then (see e.g. [32], page 26)
$$f(\varvec{x}) - f({\varvec{x}}_{\star }) \ge \frac{1}{2L} \Vert \nabla f(\varvec{x})\Vert _2^2. \qquad (8.1)$$
Now define the convex functions
$$g(\varvec{z}) = f(\varvec{z}) - \langle \nabla f(\varvec{x}), \varvec{z}\rangle \qquad \text {and} \qquad h(\varvec{z}) = f(\varvec{z}) - \langle \nabla f(\varvec{y}), \varvec{z}\rangle ,$$
and observe that both have gradients with Lipschitz constant \(L\) and minimizers \(\varvec{x}\) and \(\varvec{y}\), respectively. Applying (8.1) to these functions therefore gives that
$$g(\varvec{y}) - g(\varvec{x}) \ge \frac{1}{2L} \Vert \nabla g(\varvec{y})\Vert _2^2 \qquad \text {and} \qquad h(\varvec{x}) - h(\varvec{y}) \ge \frac{1}{2L} \Vert \nabla h(\varvec{x})\Vert _2^2.$$
By their definitions, this implies that
$$f(\varvec{y}) - f(\varvec{x}) - \langle \nabla f(\varvec{x}), \varvec{y}- \varvec{x}\rangle \ge \frac{1}{2L} \Vert \nabla f(\varvec{y}) - \nabla f(\varvec{x})\Vert _2^2$$
and
$$f(\varvec{x}) - f(\varvec{y}) - \langle \nabla f(\varvec{y}), \varvec{x}- \varvec{y}\rangle \ge \frac{1}{2L} \Vert \nabla f(\varvec{x}) - \nabla f(\varvec{y})\Vert _2^2 .$$
Adding these two inequalities and canceling terms yields the desired result. \(\square \)
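As a quick sanity check (ours, not part of the paper), the lemma can be verified numerically for quadratics \(f(\varvec{x}) = \tfrac{1}{2}\varvec{x}^T Q \varvec{x}\) with \(Q\) symmetric positive semidefinite, for which \(\nabla f(\varvec{x}) = Q\varvec{x}\) and \(L = \lambda _{\max }(Q)\):

```python
import numpy as np

# Numerical check of Lemma 8.1 on quadratics (illustrative only).  For
# f(x) = 0.5 * x^T Q x with Q symmetric PSD, grad f(x) = Q x and its Lipschitz
# constant is L = lambda_max(Q).  Co-coercivity then says
#   ||grad f(x) - grad f(y)||^2 <= L * <x - y, grad f(x) - grad f(y)>.
rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
Q = M.T @ M                         # symmetric positive semidefinite
L = np.linalg.eigvalsh(Q).max()     # Lipschitz constant of the gradient

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    g_diff = Q @ (x - y)            # grad f(x) - grad f(y)
    lhs = g_diff @ g_diff
    rhs = L * ((x - y) @ g_diff)
    assert lhs <= rhs * (1 + 1e-12) + 1e-12   # holds up to floating-point error
print("co-coercivity verified on 1000 random pairs")
```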
1.2 Proof of Theorem 2.1
With the notation of Theorem 2.1, and where \(i\) is the random index chosen at iteration \(k\), and \(w=w_{\lambda }\), we have
where we have employed Jensen's inequality in the first inequality and the co-coercivity Lemma 8.1 in the final line. We next take an expectation with respect to the choice of \(i\). By assumption, \(i\sim {\mathcal {D}}\) is such that \(F(\varvec{x}) = \mathbb {E} f_i(\varvec{x})\) and \(\sigma ^2 = {\mathbb {E}}\Vert \nabla f_i({\varvec{x}}_{\star })\Vert ^2\). Then \({\mathbb {E}}\nabla f_i(\varvec{x})=\nabla F(\varvec{x})\), and we obtain:
We now utilize the strong convexity of \(F(\varvec{x})\) and obtain that
when \(\gamma \le \frac{1}{\sup L}\). Recursively applying this bound over the first \(k\) iterations yields the desired result. \(\square \)
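The qualitative content of this bound, namely linear convergence down to a residual level governed by the step size \(\gamma \) and the noise \(\sigma ^2\), can be observed in a small simulation. The following sketch (ours; the problem sizes, step sizes, and iteration count are arbitrary choices, not from the paper) runs constant-step SGD with uniform sampling on a noisy least-squares problem and reports the final squared error for two step sizes:

```python
import numpy as np

# Constant-step SGD on a noisy (inconsistent) least-squares problem
# (illustrative only).  The squared error ||x_k - x_*||^2 decays geometrically
# until it saturates at a level that shrinks with gamma, mirroring the two
# terms in the bound of Theorem 2.1.
rng = np.random.default_rng(0)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # noisy right-hand side
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)                   # minimizer of F

L_max = np.max(np.sum(A ** 2, axis=1))   # sup_i L_i for f_i = 0.5 * (<a_i, x> - b_i)^2
for gamma in (0.5 / L_max, 0.05 / L_max):
    x = np.zeros(d)
    for _ in range(20000):
        i = rng.integers(n)
        x -= gamma * (A[i] @ x - b[i]) * A[i]   # SGD step with uniform sampling
    print(f"gamma = {gamma:.1e}:  ||x_k - x_*||^2 = {np.sum((x - x_star) ** 2):.3e}")
```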