
A Proximal Stochastic Gradient Method with Progressive Variance Reduction

Published: 01 January 2014

Abstract

We consider the problem of minimizing the sum of two convex functions: one is the average of a large number of smooth component functions, and the other is a general convex function that admits a simple proximal mapping. We assume the whole objective function is strongly convex. Such problems often arise in machine learning, where they are known as regularized empirical risk minimization. We propose and analyze a new proximal stochastic gradient method, which uses a multistage scheme to progressively reduce the variance of the stochastic gradient. While each iteration of this algorithm has a cost similar to that of the classical stochastic gradient method (or incremental gradient method), we show that the expected objective value converges to the optimum at a geometric rate. The overall complexity of this method is much lower than that of both the proximal full gradient method and the standard proximal stochastic gradient method.
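To make the setting concrete, the objective has the composite form P(x) = F(x) + R(x), where F(x) = (1/n) * sum_i f_i(x) averages n smooth component losses and R is a convex regularizer whose proximal mapping prox_{eta R}(y) = argmin_x { (1/2)||x - y||^2 + eta * R(x) } is cheap to evaluate. The sketch below illustrates a multistage, variance-reduced proximal stochastic gradient method of the kind described above, applied to l1-regularized logistic regression, where the proximal mapping is soft-thresholding. The step size, epoch length, snapshot rule (averaging the inner iterates), and the synthetic data are illustrative assumptions, not values prescribed by the paper.

# Minimal sketch (not the paper's exact pseudocode) of a proximal stochastic
# gradient method with progressive variance reduction, applied to
# l1-regularized logistic regression. Step size, epoch length, and data
# below are illustrative assumptions.
import numpy as np

def soft_threshold(z, tau):
    # Proximal mapping of tau * ||x||_1 (soft-thresholding).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def logistic_grad(w, a, y):
    # Gradient of the single-example loss log(1 + exp(-y * <a, w>)).
    return -y * a / (1.0 + np.exp(y * a.dot(w)))

def prox_svrg(A, y, lam=1e-3, eta=0.1, m=None, stages=15, seed=0):
    n, d = A.shape
    m = m or 2 * n                       # inner-loop length (a common choice)
    rng = np.random.default_rng(seed)
    w_snap = np.zeros(d)                 # snapshot point for each stage
    for _ in range(stages):
        # Full gradient of the smooth part, computed once per stage.
        mu = np.mean([logistic_grad(w_snap, A[i], y[i]) for i in range(n)], axis=0)
        w = w_snap.copy()
        w_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient (unbiased estimate of grad F).
            v = logistic_grad(w, A[i], y[i]) - logistic_grad(w_snap, A[i], y[i]) + mu
            # Proximal step handles the nonsmooth regularizer.
            w = soft_threshold(w - eta * v, eta * lam)
            w_sum += w
        w_snap = w_sum / m               # new snapshot: average of inner iterates
    return w_snap

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 20))
    w_true = np.zeros(20)
    w_true[:3] = [1.0, -2.0, 0.5]
    y = np.sign(A.dot(w_true) + 0.1 * rng.standard_normal(200))
    w_hat = prox_svrg(A, y, lam=0.05)
    print("nonzero coefficients:", int(np.count_nonzero(np.abs(w_hat) > 1e-6)))

Each stage computes one full gradient at the snapshot point and then takes m cheap inner steps; the correction term in v keeps the stochastic gradient unbiased while its variance shrinks as the iterates approach the optimum, which is what permits a constant step size and the geometric convergence rate stated in the abstract.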




Published In

SIAM Journal on Optimization, Volume 24, Issue 4 (2014), 425 pages
ISSN: 1052-6234
DOI: 10.1137/sjope8.24.4

Publisher

Society for Industrial and Applied Mathematics

United States

Author Tags

  1. stochastic gradient method
  2. proximal mapping
  3. variance reduction

AMS Subject Classifications

  1. 65C60
  2. 65Y20
  3. 90C25

Qualifiers

  • Research-article

Article Metrics

  • Downloads (Last 12 months): 0
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 18 Jan 2025

Cited By

  • (2025) A Proximal Stochastic Quasi-Newton Algorithm with Dynamical Sampling and Stochastic Line Search. Journal of Scientific Computing, 102:1. DOI: 10.1007/s10915-024-02748-2. Online publication date: 1-Jan-2025.
  • (2024) Quantum algorithms and lower bounds for finite-sum optimization. Proceedings of the 41st International Conference on Machine Learning, pp. 60244-60270. DOI: 10.5555/3692070.3694562. Online publication date: 21-Jul-2024.
  • (2024) Can Gaussian sketching converge faster on a preconditioned landscape? Proceedings of the 41st International Conference on Machine Learning, pp. 52064-52082. DOI: 10.5555/3692070.3694202. Online publication date: 21-Jul-2024.
  • (2024) Adversarially robust hypothesis transfer learning. Proceedings of the 41st International Conference on Machine Learning, pp. 50092-50117. DOI: 10.5555/3692070.3694120. Online publication date: 21-Jul-2024.
  • (2024) Double variance reduction. Proceedings of the 41st International Conference on Machine Learning, pp. 10792-10810. DOI: 10.5555/3692070.3692499. Online publication date: 21-Jul-2024.
  • (2024) Global Optimality Guarantees for Policy Gradient Methods. Operations Research, 72:5, pp. 1906-1927. DOI: 10.1287/opre.2021.0014. Online publication date: 1-Sep-2024.
  • (2024) Efficient zeroth-order proximal stochastic method for nonconvex nonsmooth black-box problems. Machine Learning, 113:1, pp. 97-120. DOI: 10.1007/s10994-023-06409-7. Online publication date: 1-Jan-2024.
  • (2024) Stochastic Gradient Methods with Preconditioned Updates. Journal of Optimization Theory and Applications, 201:2, pp. 471-489. DOI: 10.1007/s10957-023-02365-3. Online publication date: 1-May-2024.
  • (2024) An Inexact Primal-Dual Smoothing Framework for Large-Scale Non-Bilinear Saddle Point Problems. Journal of Optimization Theory and Applications, 200:1, pp. 34-67. DOI: 10.1007/s10957-023-02351-9. Online publication date: 1-Jan-2024.
  • (2024) Open issues and recent advances in DC programming and DCA. Journal of Global Optimization, 88:3, pp. 533-590. DOI: 10.1007/s10898-023-01272-1. Online publication date: 1-Mar-2024.
