Abstract
We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning \((L/\mu )^2\) (where \(L\) is a bound on the smoothness and \(\mu \) on the strong convexity) to a linear dependence on \(L/\mu \). Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence on the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can also improve convergence in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
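To make the SGD–Kaczmarz connection concrete, the following minimal Python sketch (illustrative only; the function name, defaults, and test problem are ours and not taken from the paper) runs SGD on a least-squares objective with rows sampled proportionally to their squared norms. With the indicated step size, each update coincides with the randomized Kaczmarz projection step.

```python
import numpy as np

def weighted_sgd_least_squares(A, b, n_iter=5000, seed=0):
    """SGD for F(x) = (1/(2n)) * sum_i (<a_i, x> - b_i)^2 with importance sampling:
    row i is drawn with probability ||a_i||^2 / ||A||_F^2, and the gradient of f_i
    is reweighted by 1/(n * p_i) so the estimate stays unbiased.  With the step
    size chosen below, each update equals a randomized Kaczmarz projection step.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    row_norms_sq = np.sum(A ** 2, axis=1)
    probs = row_norms_sq / row_norms_sq.sum()   # importance sampling distribution
    gamma = n / row_norms_sq.sum()              # step size recovering Kaczmarz
    x = np.zeros(d)
    for _ in range(n_iter):
        i = rng.choice(n, p=probs)
        grad = (A[i] @ x - b[i]) * A[i] / (n * probs[i])  # unbiased estimate of grad F
        x -= gamma * grad    # equals x + (b_i - <a_i, x>) a_i / ||a_i||^2
    return x

# Consistent system: the iterates converge to the unique least-squares solution.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
x_true = rng.standard_normal(20)
b = A @ x_true
print(np.linalg.norm(weighted_sgd_least_squares(A, b) - x_true))
```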
Notes
Bach and Moulines’s results are somewhat more general. Their Lipschitz requirement is a bit weaker and more complicated, but in terms of \(L_i\) yields (2.3). They also study the use of polynomial decaying step-sizes, but these do not lead to improved runtime if the target accuracy is known ahead of time.
References
Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems (NIPS) (2011)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186. Springer (2010)
Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine Learning, p. 351 (2011)
Bousquet, O., Bottou, L.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168 (2007)
Boutsidis, C., Mahoney, M., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Symposium on Discrete Algorithms, pp. 968–977 (2009)
Byrne, C.L.: Applied Iterative Methods. A K Peters Ltd., Wellesley, MA. ISBN:978-1-56881-342-4; 1-56881-271-X (2008)
Cenker, C., Feichtinger, H.G., Mayer, M., Steier, H., Strohmer, T.: New variants of the POCS method using affine subspaces of finite codimension, with applications to irregular sampling. In: Proceedings of SPIE: Visual Communications and Image Processing, pp. 299–310 (1992)
Censor, Y., Eggermont, P.P.B., Gordon, D.: Strong underrelaxation in Kaczmarz’s method for inconsistent systems. Numer. Math. 41(1), 83–92 (1983)
Chen, Y., Bhojanapalli, S., Sanghavi, S., Ward, R.: Coherent matrix completion. In: Proceedings of the 31st International Conference on Machine Learning, pp. 674–682 (2014)
Eggermont, P.P.B., Herman, G.T., Lent, A.: Iterative algorithms for large partitioned linear systems, with applications to image reconstruction. Linear Algebra Appl. 40, 37–67 (1981). doi:10.1016/0024-3795(81)90139-7. ISSN:0024–3795
Eldar, Y.C., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011). doi:10.1007/s11075-011-9451-z. ISSN:1017–1398
Elfving, T.: Block-iterative methods for consistent and inconsistent linear equations. Numer. Math. 35(1), 1–12 (1980). doi:10.1007/BF01396365. ISSN:0029–599X
Foygel, R., Srebro, N.: Concentration-based guarantees for low-rank matrix reconstruction. In: 24th Annual Conference on Learning Theory (COLT) (2011)
Gittens, A., Mahoney, M.: Revisiting the Nyström method for improved large-scale machine learning. In: International Conference on Machine Learning (ICML) (2013)
Hanke, M., Niethammer, W.: On the acceleration of Kaczmarz’s method for inconsistent linear systems. Linear Algebra Appl. 130, 83–98 (1990)
Herman, G.T.: Fundamentals of computerized tomography: image reconstruction from projections. Springer, Berlin (2009)
Herman, G.T., Meyer, L.B.: Algebraic reconstruction techniques can be made computationally efficient. IEEE Trans. Med. Imaging 12(3), 600–609 (1993)
Hounsfield, G.N.: Computerized transverse axial scanning (tomography): Part 1. Description of system. Br. J. Radiol. 46(552), 1016–1022 (1973)
Kaczmarz, S.: Angenäherte Auflösung von Systemen linearer Gleichungen. Bull. Int. Acad. Polon. Sci. Lett. Ser. A. 35, 335–357 (1937)
Krahmer, F., Ward, R.: Stable and robust sampling strategies for compressive imaging. IEEE Trans. Image Proc. 23(2), 612–622 (2014)
Lai, M., Yin, W.: Augmented \(\ell _1\) and nuclear-norm models with a globally linearly convergent algorithm. SIAM J. Imaging Sci. 6(2), 1059–1091 (2013)
Lee, Y.T., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156 (2013)
Liu, J., Wright, S.J., Sridhar, S.: An asynchronous parallel randomized Kaczmarz algorithm. arXiv:1401.4780 (2014)
Ma, P., Yu, B., Mahoney, M.: A statistical perspective on algorithmic leveraging. In: International Conference on Machine Learning (ICML). arXiv:1306.5362 (2014)
Mahoney, M.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011)
Murata, N.: A Statistical Study of On-Line Learning. Cambridge University Press, Cambridge (1998)
Natterer, F.: The Mathematics of Computerized Tomography, volume 32 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (2001). ISBN:0-89871-493-1. doi:10.1137/1.9780898719284. Reprint of the 1986 original
Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT 50(2), 395–403 (2010). doi:10.1007/s10543-010-0265-5. ISSN:0006–3835
Needell, D., Tropp, J.A.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)
Needell, D., Ward, R.: Two-subspace projection method for coherent overdetermined linear systems. J Fourier Anal. Appl. 19(2), 256–269 (2013)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer, Boston (2004)
Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)
Popa, C.: Extensions of block-projections methods with relaxation parameters to inconsistent and rank-deficient least-squares problems. BIT 38(1), 151–176 (1998). doi:10.1007/BF02510922. ISSN:0006–3835
Popa, C.: Block-projections algorithms with blocks containing mutually orthogonal rows and columns. BIT 39(2), 323–338 (1999). doi:10.1023/A:1022398014630. ISSN:0006–3835
Popa, C.: A fast Kaczmarz–Kovarik algorithm for consistent least-squares problems. Korean J. Comput. Appl. Math. 8(1), 9–26 (2001). ISSN:1229–9502
Popa, C.: A Kaczmarz–Kovarik algorithm for symmetric ill-conditioned matrices. An. Ştiinţ. Univ. Ovidius Constanţa Ser. Mat. 12(2), 135–146 (2004). ISSN:1223–723X
Popa, C., Preclik, T., Köstler, H., Rüde, U.: On Kaczmarz’s projection iteration as a direct solver for linear least squares problems. Linear Algebra Appl. 436(2), 389–404 (2012)
Rauhut, H., Ward, R.: Sparse Legendre expansions via \(\ell _1\)-minimization. J. Approx. Theory 164(5), 517–533 (2012)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2012)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training set size. In: Proceedings of the 25th International Conference on Machine Learning, pp. 928–935 (2008)
Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. arXiv:1212.1824 (2012)
Spielman, D., Srivastava, N.: Graph sparsification by effective resistances. SIAM J. Comput. 40(6), 1913–1926 (2011)
Srebro, N., Sridharan, K., Tewari, A.: Smoothness, low noise and fast rates. In: Advances in Neural Information Processing Systems, pp. 2199–2207 (2010)
Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009). doi:10.1007/s00041-008-9030-4. ISSN:1069–5869
Tanabe, K.: Projection method for solving a singular system of linear equations and its applications. Numer. Math. 17(3), 203–214 (1971)
Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14, 2729–2769 (2013)
Whitney, T.M., Meany, R.K.: Two algorithms related to the method of steepest descent. SIAM J. Numer. Anal. 4(1), 109–118 (1967)
Zhang, H., Yin, W.: Gradient methods for convex minimization: better rates under weaker conditions. CAM Report 13–17, UCLA (2013)
Zhao, P., Zhang, T.: Stochastic optimization with importance sampling. arXiv:1401.2753 (2014)
Zouzias, A., Freris, N.M.: Randomized extended Kaczmarz for solving least-squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)
Acknowledgments
We would like to thank the anonymous reviewers for their useful feedback which significantly improved the manuscript. We would like to thank Chris White for pointing out a simplified proof of Corollary 2.2. DN was partially supported by a Simons Foundation Collaboration grant, NSF CAREER #1348721 and an Alfred P. Sloan Fellowship. NS was partially supported by a Google Research Award. RW was supported in part by ONR Grant N00014-12-1-0743, an AFOSR Young Investigator Program Award, and an NSF CAREER award.
Appendix: Proofs
Our main results utilize an elementary fact about smooth functions with Lipschitz continuous gradient, called the co-coercivity of the gradient. We state the lemma and recall its proof for completeness.
1.1 The co-coercivity Lemma
Lemma 8.1
(Co-coercivity) For a smooth function \(f\) whose gradient has Lipschitz constant \(L\),
$$\Vert \nabla f(\varvec{x}) - \nabla f(\varvec{y})\Vert _2^2 \le L\, \big \langle \varvec{x}- \varvec{y},\, \nabla f(\varvec{x}) - \nabla f(\varvec{y}) \big \rangle .$$
Proof
Since \(\nabla f\) has Lipschitz constant \(L\), if \({\varvec{x}}_{\star }\) is the minimizer of \(f\), then (see e.g. [32], page 26)
$$f(\varvec{x}) - f({\varvec{x}}_{\star }) \ge \frac{1}{2L} \Vert \nabla f(\varvec{x})\Vert _2^2. \qquad (8.1)$$
Now define the convex functions
$$g(\varvec{z}) = f(\varvec{z}) - \langle \nabla f(\varvec{x}), \varvec{z}\rangle \qquad \text {and} \qquad h(\varvec{z}) = f(\varvec{z}) - \langle \nabla f(\varvec{y}), \varvec{z}\rangle ,$$
and observe that both have gradients with Lipschitz constant \(L\) and minimizers \(\varvec{x}\) and \(\varvec{y}\), respectively. Applying (8.1) to these functions therefore gives that
$$g(\varvec{y}) - g(\varvec{x}) \ge \frac{1}{2L} \Vert \nabla g(\varvec{y})\Vert _2^2 \qquad \text {and} \qquad h(\varvec{x}) - h(\varvec{y}) \ge \frac{1}{2L} \Vert \nabla h(\varvec{x})\Vert _2^2.$$
By their definitions, this implies that
$$f(\varvec{y}) - f(\varvec{x}) - \langle \nabla f(\varvec{x}), \varvec{y}- \varvec{x}\rangle \ge \frac{1}{2L} \Vert \nabla f(\varvec{y}) - \nabla f(\varvec{x})\Vert _2^2$$
and
$$f(\varvec{x}) - f(\varvec{y}) - \langle \nabla f(\varvec{y}), \varvec{x}- \varvec{y}\rangle \ge \frac{1}{2L} \Vert \nabla f(\varvec{x}) - \nabla f(\varvec{y})\Vert _2^2 .$$
Adding these two inequalities and canceling terms yields the desired result. \(\square \)
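As a quick sanity check (ours, not part of the paper), the lemma can be verified numerically for quadratics \(f(\varvec{x}) = \tfrac{1}{2}\varvec{x}^T Q \varvec{x}\) with \(Q\) symmetric positive semidefinite, for which \(\nabla f(\varvec{x}) = Q\varvec{x}\) and \(L = \lambda _{\max }(Q)\):

```python
import numpy as np

# Numerical check of Lemma 8.1 on quadratics (illustrative only).  For
# f(x) = 0.5 * x^T Q x with Q symmetric PSD, grad f(x) = Q x and its Lipschitz
# constant is L = lambda_max(Q).  Co-coercivity then says
#   ||grad f(x) - grad f(y)||^2 <= L * <x - y, grad f(x) - grad f(y)>.
rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
Q = M.T @ M                         # symmetric positive semidefinite
L = np.linalg.eigvalsh(Q).max()     # Lipschitz constant of the gradient

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    g_diff = Q @ (x - y)            # grad f(x) - grad f(y)
    lhs = g_diff @ g_diff
    rhs = L * ((x - y) @ g_diff)
    assert lhs <= rhs * (1 + 1e-12) + 1e-12   # holds up to floating-point error
print("co-coercivity verified on 1000 random pairs")
```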
1.2 Proof of Theorem 2.1
With the notation of Theorem 2.1, and where \(i\) is the random index chosen at iteration \(k\), and \(w=w_{\lambda }\), we have
where we have employed Jensen's inequality in the first inequality and the co-coercivity Lemma 8.1 in the final line. We next take an expectation with respect to the choice of \(i\). By assumption, \(i\sim {\mathcal {D}}\) is such that \(F(\varvec{x}) = \mathbb {E} f_i(\varvec{x})\) and \(\sigma ^2 = {\mathbb {E}}\Vert \nabla f_i({\varvec{x}}_{\star })\Vert ^2\). Then \({\mathbb {E}}\nabla f_i(\varvec{x})=\nabla F(\varvec{x})\), and we obtain:
We now utilize the strong convexity of \(F(\varvec{x})\) and obtain that
when \(\gamma \le \frac{1}{\sup L}\). Recursively applying this bound over the first \(k\) iterations yields the desired result. \(\square \)
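The qualitative content of this bound, namely linear convergence down to a residual level governed by the step size \(\gamma \) and the noise \(\sigma ^2\), can be observed in a small simulation. The following sketch (ours; the problem sizes, step sizes, and iteration count are arbitrary choices, not from the paper) runs constant-step SGD with uniform sampling on a noisy least-squares problem and reports the final squared error for two step sizes:

```python
import numpy as np

# Constant-step SGD on a noisy (inconsistent) least-squares problem
# (illustrative only).  The squared error ||x_k - x_*||^2 decays geometrically
# until it saturates at a level that shrinks with gamma, mirroring the two
# terms in the bound of Theorem 2.1.
rng = np.random.default_rng(0)
n, d = 500, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)   # noisy right-hand side
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)                   # minimizer of F

L_max = np.max(np.sum(A ** 2, axis=1))   # sup_i L_i for f_i = 0.5 * (<a_i, x> - b_i)^2
for gamma in (0.5 / L_max, 0.05 / L_max):
    x = np.zeros(d)
    for _ in range(20000):
        i = rng.integers(n)
        x -= gamma * (A[i] @ x - b[i]) * A[i]   # SGD step with uniform sampling
    print(f"gamma = {gamma:.1e}:  ||x_k - x_*||^2 = {np.sum((x - x_star) ** 2):.3e}")
```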