
A Proximal Stochastic Gradient Method with Progressive Variance Reduction

Published: 01 January 2014

Abstract

We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, where they are known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multistage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has a cost similar to that of the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than that of both the proximal full gradient method and the standard proximal stochastic gradient method.
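To make the setting concrete, the objective has the composite form P(x) = F(x) + R(x), where F(x) = (1/n) * sum_i f_i(x) averages n smooth component losses and R is a convex regularizer whose proximal mapping prox_{eta R}(y) = argmin_x { (1/2)||x - y||^2 + eta * R(x) } is cheap to evaluate. The sketch below illustrates a multistage, variance-reduced proximal stochastic gradient method of the kind described above, applied to l1-regularized logistic regression, where the proximal mapping is soft-thresholding. The step size, epoch length, snapshot rule (averaging the inner iterates), and the synthetic data are illustrative assumptions, not values prescribed by the paper.

# Minimal sketch (not the paper's exact pseudocode) of a proximal stochastic
# gradient method with progressive variance reduction, applied to
# l1-regularized logistic regression. Step size, epoch length, and data
# below are illustrative assumptions.
import numpy as np

def soft_threshold(z, tau):
    # Proximal mapping of tau * ||x||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def logistic_grad(w, a, y):
    # Gradient of the single-example loss log(1 + exp(-y * <a, w>)).
    return -y * a / (1.0 + np.exp(y * a.dot(w)))

def prox_svrg(A, y, lam=1e-3, eta=0.1, m=None, stages=15, seed=0):
    n, d = A.shape
    m = m or 2 * n                       # inner-loop length (a common choice)
    rng = np.random.default_rng(seed)
    w_snap = np.zeros(d)                 # snapshot point for each stage
    for _ in range(stages):
        # Full gradient of the smooth part, computed once per stage.
        mu = np.mean([logistic_grad(w_snap, A[i], y[i]) for i in range(n)], axis=0)
        w = w_snap.copy()
        w_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient (unbiased estimate of grad F).
            v = logistic_grad(w, A[i], y[i]) - logistic_grad(w_snap, A[i], y[i]) + mu
            # Proximal step handles the nonsmooth regularizer.
            w = soft_threshold(w - eta * v, eta * lam)
            w_sum += w
        w_snap = w_sum / m               # new snapshot: average of inner iterates
    return w_snap

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 20))
    w_true = np.zeros(20)
    w_true[:3] = [1.0, -2.0, 0.5]
    y = np.sign(A.dot(w_true) + 0.1 * rng.standard_normal(200))
    w_hat = prox_svrg(A, y, lam=0.05)
    print("nonzero coefficients:", int(np.count_nonzero(np.abs(w_hat) > 1e-6)))

Each stage computes one full gradient at the snapshot point and then takes m cheap inner steps; the correction term in v keeps the stochastic gradient unbiased while its variance shrinks as the iterates approach the optimum, which is what permits a constant step size and the geometric convergence rate stated in the abstract.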




Published In

SIAM Journal on Optimization, Volume 24, Issue 4 (2014), 425 pages
ISSN: 1052-6234
DOI: 10.1137/sjope8.24.4

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. stochastic gradient method
  2. proximal mapping
  3. variance reduction

AMS Subject Classifications

  1. 65C60
  2. 65Y20
  3. 90C25

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Jan 2025

Cited By

  • (2025) A Proximal Stochastic Quasi-Newton Algorithm with Dynamical Sampling and Stochastic Line Search. Journal of Scientific Computing, 102:1. DOI: 10.1007/s10915-024-02748-2. Online publication date: 1-Jan-2025.
  • (2024) Quantum algorithms and lower bounds for finite-sum optimization. Proceedings of the 41st International Conference on Machine Learning, pp. 60244-60270. DOI: 10.5555/3692070.3694562. Online publication date: 21-Jul-2024.
  • (2024) Can Gaussian sketching converge faster on a preconditioned landscape? Proceedings of the 41st International Conference on Machine Learning, pp. 52064-52082. DOI: 10.5555/3692070.3694202. Online publication date: 21-Jul-2024.
  • (2024) Adversarially robust hypothesis transfer learning. Proceedings of the 41st International Conference on Machine Learning, pp. 50092-50117. DOI: 10.5555/3692070.3694120. Online publication date: 21-Jul-2024.
  • (2024) Double variance reduction. Proceedings of the 41st International Conference on Machine Learning, pp. 10792-10810. DOI: 10.5555/3692070.3692499. Online publication date: 21-Jul-2024.
  • (2024) Global Optimality Guarantees for Policy Gradient Methods. Operations Research, 72:5, pp. 1906-1927. DOI: 10.1287/opre.2021.0014. Online publication date: 1-Sep-2024.
  • (2024) Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems. Machine Learning, 113:1, pp. 97-120. DOI: 10.1007/s10994-023-06409-7. Online publication date: 1-Jan-2024.
  • (2024) Stochastic Gradient Methods with Preconditioned Updates. Journal of Optimization Theory and Applications, 201:2, pp. 471-489. DOI: 10.1007/s10957-023-02365-3. Online publication date: 1-May-2024.
  • (2024) An Inexact Primal-Dual Smoothing Framework for Large-Scale Non-Bilinear Saddle Point Problems. Journal of Optimization Theory and Applications, 200:1, pp. 34-67. DOI: 10.1007/s10957-023-02351-9. Online publication date: 1-Jan-2024.
  • (2024) Open issues and recent advances in DC programming and DCA. Journal of Global Optimization, 88:3, pp. 533-590. DOI: 10.1007/s10898-023-01272-1. Online publication date: 1-Mar-2024.
