
Shuffling Momentum Gradient Algorithm for Convex Optimization

Original Article · Vietnam Journal of Mathematics

Abstract

The stochastic gradient descent (SGD) method and its variants have become the methods of choice for solving finite-sum optimization problems arising from machine learning and data science, thanks to their scalability to large-scale applications and big datasets. In recent decades, researchers have made substantial efforts to study the theoretical performance of SGD and its shuffling variants. However, only limited work has investigated its shuffling momentum variants, including shuffling heavy-ball momentum schemes for non-convex problems and Nesterov’s momentum for convex settings. In this work, we extend the analysis of the shuffling momentum gradient method developed in [39] (Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10379–10389, 2021) to both finite-sum convex and strongly convex optimization problems. We provide the first analysis of shuffling momentum-based methods for the strongly convex setting, attaining a convergence rate of \(\mathcal {O}(1/nT^2)\), where n is the number of samples and T is the number of training epochs. Our analysis is state-of-the-art, matching the best rates of existing shuffling stochastic gradient algorithms in the literature.


References

  1. Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. arXiv:2006.06946 (2020)

  2. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)

  3. Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)

  4. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin, Heidelberg (2012)

  5. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Tech. 2, 27 (2011). Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm

  6. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Shawe-Taylor, J. et al. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)

  7. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Ghahramani, Z. et al. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654. Curran Associates, Inc. (2014)

  8. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  9. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021)

  10. Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol. 97, pp. 2624–2633 (2019)

  11. Hu, C., Pan, W., Kwok, J.: Accelerated gradient methods for stochastic optimization and online learning. In: Bengio, Y. et al. (eds.) Advances in Neural Information Processing Systems, vol. 22. Curran Associates, Inc. (2009)

  12. Jin, R., Xing, Y., He, X.: On the convergence of mSGD and AdaGrad for stochastic optimization. arXiv:2201.11204 (2022)

  13. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J. et al. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates, Inc. (2013)

  14. Kingma, D.P., Ba, J.: ADAM: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2014). arXiv:1412.6980 (2014)

  15. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Pereire, F. et al. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 2663–2671. Curran Associates, Inc. (2012)

  16. Li, X., Zhuang, Z., Orabona, F.: A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning, PMLR, vol. 139, pp. 6553–6564 (2021)

  17. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv:1908.03265 (2019)

  18. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983v5 (2017)

  19. Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. In: Larochelle, H. et al. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 17309–17320. Curran Associates, Inc. (2020)

  20. Nagaraj, D., Jain, P., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol. 97, pp. 4703–4711 (2019)

  21. Nedić, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 223–264. Springer, New York (2001)

  22. Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12, 109–138 (2001)

  23. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^2)\). Dokl. Akad. Nauk. SSSR 269, 543–547 (1983)

  24. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers (2004)

  25. Nguyen, L., Nguyen, P.H., Dijk, M., Richtarik, P., Scheinberg, K., Takac, M.: SGD and Hogwild! Convergence without the bounded gradients assumption. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 3747–3755 (2018)

  26. Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22, 1–44 (2021)

  27. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, vol. 70, pp. 2613–2621 (2017)

  28. Nguyen, L.M., Tran, T.H.: On the convergence to a global solution of shuffling-type gradient algorithms. In: Oh, A. et al. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 75545–75566. Curran Associates, Inc. (2023)

  29. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4, 1–17 (1964)

  31. Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 7964–7973 (2020)

  32. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5, 201–226 (2013)

  33. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. arXiv:1904.09237 (2019)

  34. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  35. Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Proceedings of the 33rd Conference on Learning Theory, PMLR, vol. 125, pp. 3250–3284 (2020)

  36. Shamir, O.: Without-replacement sampling for stochastic gradient methods. In: Lee, D. et al. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 46–54. Curran Associates, Inc. (2016)

  37. Sra, S., Nowozin, S., Wright, S.J. (eds.): Optimization for Machine Learning. MIT Press (2012)

  38. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning, PMLR vol. 28, pp. 1139–1147 (2013)

  39. Tran, T.H., Nguyen, L.M., Tran-Dinh, Q.: SMG: a shuffling gradient-based method with momentum. In: Proceedings of the 38th International Conference on Machine Learning, PMLR vol. 139, pp. 10379–10389 (2021)

  40. Tran, T.H., Scheinberg, K., Nguyen, L.M.: Nesterov accelerated shuffling gradient method for convex optimization. In: Proceedings of the 39th International Conference on Machine Learning, PMLR, vol. 162, pp. 21703–21732 (2022)

  41. Yuan, K., Ying, B., Sayed, A.H.: On the influence of momentum acceleration on online learning. J. Mach. Learn. Res. 17(192), 1–66 (2016)

Author information

Corresponding author

Correspondence to Trang H. Tran.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is dedicated to Professor Tamás Terlaky on the occasion of his 70th birthday.

Appendices

Appendix A: Technical Notations and Expressions

First, we introduce some additional notations for our analysis in the sequel. For \(t \ge 1\) and \(i = 0,\dots , n\), we denote

$$\begin{aligned} A_i^{(t)} := \left\| \sum _{j=0}^{i-1} g_{j}^{(t)} \right\| ^2 \quad \text { and } \quad B_i^{(t)} := \left\| \sum _{j = i}^{n-1} g_{j}^{(t)}\right\| ^2. \end{aligned}$$
(A.1)

We adopt the convention \(\sum _{j=0}^{-1} g_{j}^{(t)} = 0\), and \(\sum _{j=n}^{n-1} g_{j}^{(t)} = 0\) in the above definitions. Next, we collect the following expressions from Algorithm 1 to use later in our analysis. From the update rules of Algorithm 1, for \(t > 1\), we have

$$\begin{aligned} m_0^{(t)} = \tilde{m}_{t-1} = v_{n}^{(t-1)} \overset{(2)}{=} v_{n-1}^{(t-1)} + \frac{1}{n} g_{n-1}^{(t-1)} = v_{0}^{(t-1)} + \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}. \end{aligned}$$
(A.2)

If \(t = 1\), then we set \(m_0^{(1)} = \tilde{m}_{0} = {\textbf {0}} \in \mathbb {R}^d\). From the update \(w_{i+1}^{(t)} := w_{i}^{(t)} - \eta _i^{(t)} m_{i+1}^{(t)}\) with \(\eta _i^{(t)} := \frac{\eta _t}{n}\), for \(i = 1, 2,\dots ,n-1\), at Step 7 of Algorithm 1, we can derive

$$\begin{aligned} w_{i}^{(t)} \overset{(2)}{=} w_{i-1}^{(t)} - \frac{\eta _t}{n} m_{i}^{(t)} = w_{0}^{(t)} - \frac{\eta _t}{n} \sum _{j=0}^{i-1} m_{j+1}^{(t)}, \quad t \ge 1. \end{aligned}$$
(A.3)

Note that \(\sum _{j=0}^{-1} m_{j+1}^{(t)} = 0\) and \(\sum _{j = n}^{n-1} m_{j+1}^{(t)} = 0\) by convention, so these equations also hold for \(i=0\) and \(i=n\).

Now, letting \(i=n\) in equation (A.3), for all \(t\ge 1\), we have

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} &= - \frac{\eta _t}{n} \sum _{j=0}^{n-1} m_{j+1}^{(t)} \overset{(2)}{=} - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta m_{0}^{(t)} + (1-\beta ) g_{j}^{(t)}\right) \\ &= - \frac{\eta _t}{n} \left( n \beta m_{0}^{(t)} + (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}\right) . \end{aligned}$$

Since \(m_0^{(t)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}\) for \(t \ge 2\) (due to (A.2)), we have the following update

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} = - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta g_{j}^{(t-1)} + (1-\beta ) g_{j}^{(t)} \right) . \end{aligned}$$
(A.4)

For \(t = 1\), since \(m_{0}^{(t)} =\textbf{0}\), we get

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} =- \frac{\eta _t}{n} (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}. \end{aligned}$$
(A.5)
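To make these recursions concrete, here is a minimal Python sketch of the shuffling momentum gradient updates collected in (A.2)–(A.5); the component-gradient oracle `grad`, the step-size schedule `eta`, and all other names are illustrative placeholders, not the authors' reference implementation.

```python
import numpy as np

def smg(grad, w0, n, T, eta, beta=0.5, seed=0):
    """Sketch of the shuffling momentum gradient (SMG) recursion.

    grad(w, i) -- gradient of the i-th component f(.; i) at w (placeholder).
    eta(t)     -- epoch step size eta_t; each inner step uses eta_t / n.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m_tilde = np.zeros_like(w0)           # \tilde{m}_0 = 0
    for t in range(1, T + 1):
        perm = rng.permutation(n)         # random reshuffling pi^{(t)}
        m0 = m_tilde                      # anchored momentum m_0^{(t)}, cf. (A.2)
        v = np.zeros_like(w0)
        for i in range(n):
            g = grad(w, perm[i])          # g_i^{(t)}
            m = beta * m0 + (1.0 - beta) * g
            v = v + g / n                 # v_n^{(t)} = epoch-average gradient
            w = w - (eta(t) / n) * m      # w_{i+1}^{(t)} with eta_i^{(t)} = eta_t / n
        m_tilde = v                       # \tilde{m}_t = v_n^{(t)}, cf. (A.2)
    return w
```

Setting \(\beta = 0\) recovers plain shuffling SGD, while \(\beta \in (0,1)\) anchors the momentum at the previous epoch average \(\tilde{m}_{t-1}\), which is exactly what produces the epoch updates (A.4) and (A.5).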

Appendix B: Technical Lemmas

This appendix provides necessary technical lemmas used in our analysis.

B.1 Lemma 2: Sampling Without Replacement

We need [19, Lemma 1] for sampling without replacement in our analysis. For completeness, we recall it here.

Lemma 2

(Lemma 1 in [19]) Let \(X_1, \dots , X_n \in \mathbb {R}^d\) be fixed vectors, \(\bar{X} := \frac{1}{n} \sum _{i=1}^n X_i\) be their average, and \(\sigma ^2 := \frac{1}{n} \sum _{i=1}^n \Vert X_i -\bar{X}\Vert ^2\) be the population variance. Fix any \(k \in \{1,\dots , n\}\), and let \(X_{\pi _1}, \dots , X_{\pi _k}\) be sampled uniformly without replacement from \(\{X_1, \dots , X_n\}\) and \(\bar{X}_\pi \) be their average. Then the expectation and variance of the sample average are given, respectively, by

$$\begin{aligned} \mathbb {E}[\bar{X}_\pi ] = \bar{X} \qquad \text {and} \qquad \mathbb {E}\left[ \Vert \bar{X}_\pi - \bar{X}\Vert ^2\right] = \frac{n-k}{k(n-1)} \sigma ^2. \end{aligned}$$

Since \(\pi ^{(t)} = (\pi ^{(t)}(1),\dots ,\pi ^{(t)}(n))\) is uniformly sampled at random without replacement from \(\{1,\dots ,n\}\), by Lemma 2, we have \(\mathbb {E}\left[ \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) \right] = \nabla F(w_*)\) and

$$\begin{aligned} \mathbb {E}\left[ \left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] &= (n-i)^2\, \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] \\ &= (n-i)^2\, \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) - \nabla F(w_*)\right\| ^2 \right] \\ &= \frac{(n-i)^2 i}{(n-i)(n-1)} \cdot \frac{1}{n} \sum _{j=0}^{n-1} \left\| \nabla f ( w_*; \pi ^{(t)} (j + 1)) \right\| ^2 \\ &\overset{(8)}{\le } \frac{(n-i) i}{n-1}\, \sigma ^2. \end{aligned}$$
(B.1)
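The identity in Lemma 2 is easy to sanity-check numerically. The following sketch (assuming numpy; all names are illustrative) estimates \(\mathbb {E}\Vert \bar{X}_\pi - \bar{X}\Vert ^2\) by Monte Carlo and compares it with the closed form \(\frac{n-k}{k(n-1)}\sigma ^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, trials = 10, 4, 3, 100_000
X = rng.normal(size=(n, d))                         # fixed vectors X_1, ..., X_n
X_bar = X.mean(axis=0)
sigma2 = np.mean(np.sum((X - X_bar) ** 2, axis=1))  # population variance

# Monte Carlo estimate of E||X_bar_pi - X_bar||^2 under uniform
# sampling without replacement.
err = np.empty(trials)
for s in range(trials):
    idx = rng.choice(n, size=k, replace=False)
    err[s] = np.sum((X[idx].mean(axis=0) - X_bar) ** 2)

print(err.mean())                        # ~ (n - k) / (k * (n - 1)) * sigma2
print((n - k) / (k * (n - 1)) * sigma2)
```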

B.2 Bound Expected Squared Norm of Sum of Gradients

Lemma 3

Let \(w_*\) be an optimal solution of F, and let \(\{w_i^{(t)}\}\) be generated by Algorithm 1 with \(0\le \beta < 1\) and \(\eta _i^{(t)} := \frac{\eta _t}{n}\) for every \(t \ge 1\). Assume that Assumptions 1 and 2 hold. Then, under a randomized reshuffling strategy, we have the following bounds for \(t \ge 1\):

$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4nL \cdot \mathbb {E}[C_t]\quad \text {and}\quad \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right] \le 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}. $$

Proof

We start with the definition of \(B_i^{(t)}\):

$$\begin{aligned} B_i^{(t)} &= \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2 = \left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) + \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1)) \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } 2\left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) \right\| ^2 + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } 2(n-i) \sum _{j=i}^{n-1} \left\| g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2 + 2\left\| \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2 \\ &\overset{\mathrm{(b)}}{\le } 4L (n-i) \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)}) + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2, \end{aligned}$$

where (a) follows from the Cauchy–Schwarz inequality, and the last line (b) follows from inequality (4). Taking expectation on both sides and using (B.1), we have

$$\begin{aligned} \mathbb {E}\left[ B_i^{(t)}\right] &\le 4L (n-i)\, \mathbb {E}\left[ \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)})\right] + 2\, \mathbb {E}\left[ \left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2\right] \\ &\overset{(6),(\mathrm{B.1})}{\le } 4L (n-i)\, \mathbb {E}[C_t] + \frac{2(n-i) i}{n-1}\, \sigma ^2. \end{aligned}$$

For the special case \(i=0\), we have:

$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4L n \mathbb {E}[C_t]. $$

Summing up the expression from \(i := 0\) to \(i := n-1\), we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right] &\le 4L\cdot \mathbb {E}[C_t] \sum _{i=0}^{n-1} (n-i) + 2\sigma ^2 \sum _{i=0}^{n-1}\frac{(n-i) i}{n-1} \\ &\le 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}, \end{aligned}$$

where we use the facts that \(\sum _{i=0}^{n-1} (n-i) \le n^2\) and \(\sum _{i=0}^{n-1}\frac{(n-i) i}{(n-1)} \le \frac{n(n+1)}{6} \le \frac{n^2}{3}\).\(\square \)
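The elementary summation facts used here, together with \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\) invoked later in Lemmas 4 and 5, can be confirmed with a short script (a sketch, assuming numpy):

```python
import numpy as np

for n in range(2, 200):
    i = np.arange(n)                                   # i = 0, ..., n-1
    assert np.sum(n - i) <= n ** 2                     # sum (n-i) = n(n+1)/2
    assert np.sum((n - i) * i) / (n - 1) <= n * (n + 1) / 6 + 1e-9
    assert n * (n + 1) / 6 <= n ** 2 / 3               # since n + 1 <= 2n
    assert np.sum((n - i) ** 2) <= n ** 3              # used in Lemmas 4 and 5
```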

B.3 Bound Expected Sum of Bregman Divergence (1)

Lemma 4

Under the same setting as Lemma 3, it holds that

$$ \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] \le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2(1-\beta )L\sigma ^2\quad \text { for } t \ge 1. $$

Proof

We start with the case \(t \ge 2\):

$$\begin{aligned} D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)}\right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &= \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta \frac{n-i}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)} + (1-\beta )\sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \left\| \frac{n-i}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}\right\| ^2 + (1-\beta ) \left\| \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \right] \\ &\overset{(\mathrm{A.1})}{=} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] , \end{aligned}$$

where the second line substitutes \(m_0^{(t)}\) via (A.2), and (a) follows from the convexity of \(\Vert \cdot \Vert ^2\) for \(0\le \beta < 1\). Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] &\le \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2}\, \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \\ &\overset{\mathrm{(b)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{n^3}{n^2}\, \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \\ &\overset{\mathrm{(c)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3} \right) \right] \\ &\le \frac{L\eta _t^2}{2} \left[ 4 \beta L \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\ &\overset{(7)}{\le } 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3} \\ &\le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$

where (b) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\), and in (c) we use the results from Lemma 3.

We do the same for the case \(t =1\):

$$\begin{aligned} D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &= \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 = \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta )^2 \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2 \\ &\overset{(\mathrm{A.1})}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta ) B_i^{(t)}\right] , \end{aligned}$$

since \(m_{0}^{(1)} = \textbf{0}\) and \((1-\beta )^2 \le 1-\beta \). Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] &\le \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta )\sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)} \right] \\ &\overset{\mathrm{(d)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}\right) \\ &\le \frac{L\eta _t^2}{2} \left[ 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\ &\overset{(7)}{\le } 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3} \\ &\le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$

where in (d) we use the results from Lemma 3. Hence, the statement of Lemma 4 is true for all \(t \ge 1\).\(\square \)
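Step (a) in the proof above relies only on the convexity of \(\Vert \cdot \Vert ^2\), i.e. \(\Vert \beta u + (1-\beta ) v\Vert ^2 \le \beta \Vert u\Vert ^2 + (1-\beta )\Vert v\Vert ^2\) for \(0 \le \beta \le 1\). A quick randomized check (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    u, v = rng.normal(size=5), rng.normal(size=5)
    beta = rng.uniform()
    lhs = np.sum((beta * u + (1 - beta) * v) ** 2)
    rhs = beta * np.sum(u ** 2) + (1 - beta) * np.sum(v ** 2)
    assert lhs <= rhs + 1e-9   # convexity of the squared norm
```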

B.4 Bound Expected Sum of Bregman Divergence (2)

Lemma 5

Under the same setting as Lemma 3, the following holds for \(t\ge 2\):

$$ \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] \le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2. $$

Proof

We start with the case \(t \ge 3\):

$$\begin{aligned} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$

where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u \Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\). Using the bound

$$ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 \le \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] $$

from the proof of Lemma 4, applied for both \(t\) and \(t-1\), we have

$$\begin{aligned} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) &\le L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-2)} + (1-\beta ) B_i^{(t-1)}\right] . \end{aligned}$$

Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] &\le L \frac{\eta _t^2}{n} \left[ \beta \,\mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta )\, \mathbb {E}\left[ B_0^{(t)}\right] \right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2}\, \mathbb {E}\left[ B_0^{(t-2)}\right] + (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \right] \\ &\overset{\mathrm{(a)}}{\le } L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-2}]+ (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right) \right] \\ &\le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta )\cdot \mathbb {E}[C_{t}]\right] \\ &\quad + 4L^2\eta _{t-1}^2 \left[ \beta \cdot \mathbb {E}[C_{t-2}]+ (1-\beta )\, \mathbb {E}[C_{t-1}]\right] + (1-\beta )L\frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3} \\ &\le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$

where (a) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\) and from the results of Lemma 3. Now we analyze similarly for the case \(t = 2\):

$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$

where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u\Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\). Using the bounds

$$\begin{aligned} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 &\le \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] \qquad \text { for } t =2, \\ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 &\le \frac{\eta _t^2}{n^2} (1-\beta ) B_i^{(t)} \qquad \text { for } t =1, \end{aligned}$$

from the proof of Lemma 4, we have

$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) &\le L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} (1-\beta ) B_i^{(t-1)} . \end{aligned}$$

Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] &\le L \frac{\eta _t^2}{n} \left[ \beta \,\mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta )\, \mathbb {E}\left[ B_0^{(t)}\right] \right] + L\frac{\eta _{t-1}^2}{n^2} (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \\ &\overset{\mathrm{(a)}}{\le } L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right) \\ &\le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \cdot \mathbb {E}[C_{t}]\right] + 4L^2\eta _{t-1}^2 (1-\beta )\, \mathbb {E}[C_{t-1}] + (1-\beta ) L \frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3} \\ &\le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$

where (a) follows from the results of Lemma 3. Hence, the statement of Lemma 5 is true for all \(t \ge 2\).\(\square \)

Appendix C: Proofs of Lemma 1 and Theorem 2

We now provide the full proofs of Lemma 1 and Theorem 2 from the main text.

C.1 Proof of Lemma 1: Key Bounds

Proof

First, let us note that \(\tilde{w}_{t-1} = w_{0}^{(t)}\) for \(t=1, \dots , T\). From the update (A.4), we have the following for \(t\ge 2\):

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &= \Vert w_{0}^{(t)} - w_*\Vert ^2 = \left\| w_{n}^{(t)} + \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) - w_*\right\| ^2 \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \\ &\quad + \left\| \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \right\| ^2 \\ &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ \beta \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t-1)} + (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \right] \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 \\ &\quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1} \beta \left( f_{i}^{(t-1)} (w_{n}^{(t)}) - f_{i}^{(t-1)}(w_*) + D_{i}^{(t-1)}\left( w_*; w_{i}^{(t-1)}\right) - D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( f_{i}^{(t)} (w_{n}^{(t)}) - f_{i}^{(t)}(w_*) + D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) , \end{aligned}$$

where the last line follows from the identity

$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2; w_3) - D_{i}^{(t)}(w_1; w_3). $$

Now, noting that \(\frac{1}{n}\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\beta \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \beta \left( D_{i}^{(t-1)}\left( w_*; w_{i}^{(t-1)}\right) - D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \\ &\overset{(6)}{=} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \beta \left( C_{t-1} - \sum _{i=0}^{n-1} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) , \end{aligned}$$

where in the last equation we use the definition of \(C_t\). Taking expectation, we get

$$\begin{aligned} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} \beta \left( \mathbb {E}[C_{t-1}] - \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right] \right) \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \\ &\overset{\mathrm{(a)}}{\ge } \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2\, \mathbb {E}\left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} F_t - 2\beta \frac{\eta _t}{n}\left[ 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2 \right] \\ &\quad - 2(1-\beta ) \frac{\eta _t}{n}\left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] \\ &\ge \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] + \frac{2\eta _t}{n} F_t \\ &\quad - \frac{2\eta _t}{n} \cdot 4 L^2 \eta _t^2 F_t - \frac{2\eta _t}{n} \cdot 4\beta L^2\eta _{t-1}^2 F_{t-1} \\ &\quad - \frac{4\eta _t}{3n} (1-\beta ) L\sigma ^2 \left( \beta \eta _{t-1}^2 +(1-\beta )\eta _t^2 \right) , \end{aligned}$$

where (a) comes from the definition (7) of \(F_t\) and the results of Lemmas 4 and 5 for \(t \ge 2\). Noting that \(w_{n}^{(t)} = \tilde{w}_{t}\), and that \(\eta _t \le \frac{1}{2L\sqrt{K}}\) implies \(4 L^2 \eta _t^2 \le \frac{1}{K}\), we further have

$$\begin{aligned} 2 \eta _t\, \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$

Now we consider the case \(t=1\). From the update (A.5), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &= \Vert w_{0}^{(t)} - w_*\Vert ^2 = \left\| w_{n}^{(t)} + \frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) g_{i}^{(t)} - w_*\right\| ^2 \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1}(1-\beta )g_{i}^{(t)} + \left\| \frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta )g_{i}^{(t)} \right\| ^2 \\ &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( f_{i}^{(t)} ( w_{n}^{(t)} )- f_{i}^{(t)}( w_*) + D_{i}^{(t)}(w_*; w_{i}^{(t)}) - D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right) , \end{aligned}$$

where the last line again follows from the identity

$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2; w_3) - D_{i}^{(t)}(w_1; w_3). $$

Now, noting that \(\frac{1}{n}\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \\ &\overset{(6)}{=} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t (1-\beta ) \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) . \end{aligned}$$

Taking the full expectation, we get

$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \\ &\overset{\mathrm{(a)}}{\ge } \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] , \end{aligned}$$

where (a) comes from the definition (7) of \(F_t\) and the result of Lemma 4 for \(t \ge 1\). Since \(4L^2\eta _t^2 \le \frac{1}{K}\), this yields

$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2(1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + \frac{2\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ \frac{1}{K} F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2 \right] . \end{aligned}$$

Finally, we have

$$\begin{aligned} 2\eta _t(1-\beta )\, \mathbb {E} \left[ F( \tilde{w}_{t} ) - F(w_*) \right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$

This completes our proof.\(\square \)
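The three-point identity invoked twice in this proof holds for any differentiable \(f\) with Bregman divergence \(D(u; v) := f(u) - f(v) - \nabla f(v)^\top (u - v)\); the sketch below (assuming numpy, with all names illustrative) verifies it on a random quadratic:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)); A = A @ A.T           # random PSD matrix
b = rng.normal(size=d)

f = lambda w: 0.5 * w @ A @ w + b @ w
grad = lambda w: A @ w + b
D = lambda u, v: f(u) - f(v) - grad(v) @ (u - v)   # Bregman divergence

w1, w2, w3 = (rng.normal(size=d) for _ in range(3))
lhs = (w1 - w2) @ grad(w3)
rhs = f(w1) - f(w2) + D(w2, w3) - D(w1, w3)
assert abs(lhs - rhs) < 1e-9                       # three-point identity
```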

C.2 Proof of Theorem 2: Strongly Convex Objectives

Proof

From the results of Lemma 1, for \(\eta _t \le \frac{1}{2L\sqrt{K}}\) we have

$$\begin{aligned} 2 \eta _t\, \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t \ge 2, \\ 2 \eta _t(1-\beta )\, \mathbb {E}\left[ F( \tilde{w}_{t} ) - F(w_*) \right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3,\quad t = 1. \end{aligned}$$

Since F is \(\mu \)-strongly convex, i.e. \(F(w) - F(w_*) \ge \frac{\mu }{2} \Vert w - w_*\Vert ^2\) for all \(w \in \mathbb {R}^d\), we have the following:

$$\begin{aligned} \mu \eta _t\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n}\beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2}{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 ,\quad t \ge 2, \\ \mu \eta _t (1-\beta )\, \mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$

Hence, one has

$$\begin{aligned} (1+\mu \eta _t)\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t \\ &\quad + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \quad t \ge 2, \\ (1+\mu \eta _t (1-\beta ))\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$

Now we adopt the following notation:

$$\begin{aligned} \phi _t := \left\{ \begin{array}{ll} 1+\mu \eta _t &\quad \text {if } t \ge 2,\\ 1+\mu \eta _t (1-\beta ) &\quad \text {if } t = 1, \end{array}\right. \end{aligned}$$

and for every \(t\ge 1\):

$$\begin{aligned} H_t= - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \end{aligned}$$

where we use the convention that \(F_0 = 0\). Hence, we have the following recursion for all \(t\ge 1\):

$$ \phi _t\mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] \le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] + H_t. $$

Unrolling this recursion, we have

$$\begin{aligned} \prod _{t=1}^{T} \phi _t\, \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \prod _{t=1}^{T-1} \phi _t \left( \mathbb {E}\left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + H_{T} \right) \\ &= \prod _{t=1}^{T-1} \phi _t\, \mathbb {E} \left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-1} \phi _t H_{T} \\ &\le \prod _{t=1}^{T-2} \phi _t\, \mathbb {E} \left[ \left\| \tilde{w}_{T-2}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-2} \phi _t H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T} \\ &\le \cdots \le \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + H_{1} + \phi _1 H_2 + \cdots + \prod _{t=1}^{T-2}\phi _t H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T} \\ &= \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t , \end{aligned}$$

where, using the step-size condition \(\eta _j \le \alpha \eta _{j-1}\), each term is bounded by

$$ H_j \le \frac{2}{n} \left( \frac{1}{K} -1\right) \eta _j F_j + \frac{2}{n} \alpha \beta \frac{1}{K} \eta _{j-1}F_{j-1} + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \eta _{j-1}^3 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _j^3 . $$

We bound the last term:

$$\begin{aligned} \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t &\le \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=1}^T \eta _{j-1}F_{j-1}\prod _{t=1}^{j-1} \phi _t \\ &\quad + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \sum _{j=1}^T \eta _{j-1}^3 \prod _{t=1}^{j-1}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \\ &= \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=0}^{T-1} \eta _{j}F_{j}\prod _{t=1}^{j} \phi _t \\ &\quad + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta )\sum _{j=0}^{T-1} \eta _{j}^3 \prod _{t=1}^{j}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \\ &\le \frac{2}{n}\sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t \left( \frac{1}{K} -1 + \alpha \beta \frac{1}{K} \phi _j \right) + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( (1-\beta )^2 + \alpha \beta (1-\beta ) \phi _j \right) \\ &\le \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) , \end{aligned}$$

where we use the fact that \(K = 1 + \alpha \beta (1 +\mu \max _t \eta _t)\). Plugging the last bound into the previous estimate, we get

$$\begin{aligned} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{1}{\prod _{t=1}^{T} \phi _t}\sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t \\ &\le \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\,\frac{\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) }{\prod _{t=1}^{T} \phi _t} \\ &= \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta \phi _j}{\prod _{t=j}^{T} \phi _t}. \end{aligned}$$

We use the fact that \(\frac{1}{\phi _1} = \frac{1}{1 + (1-\beta ) \mu \eta _1} \le \frac{1}{(1-\beta )(1 + \mu \eta _1)} \) and obtain

$$\begin{aligned} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1+(1-\beta )\mu \eta _1)\prod _{t=2}^{T} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta (1 + \mu \max _t \eta _t)}{\prod _{t=j}^{T} (1+\mu \eta _t)} \\ &\le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1-\beta ) \prod _{t=1}^{T} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta (1 + \mu \max _t \eta _t)}{\prod _{t=j}^{T} (1+\mu \eta _t)}. \end{aligned}$$

This completes our proof.\(\square \)
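The bookkeeping behind the unrolled recursion \(\phi _t\, \mathbb {E}[\Vert \tilde{w}_t - w_*\Vert ^2] \le \mathbb {E}[\Vert \tilde{w}_{t-1} - w_*\Vert ^2] + H_t\) can also be checked on synthetic data. The sketch below (assuming numpy, with arbitrary \(\phi _t > 1\) and \(H_t \ge 0\)) builds the tightest sequence satisfying the recursion and confirms that \(\prod _{t=1}^{T}\phi _t\, d_T = d_0 + \sum _{j=1}^{T} H_j \prod _{t=1}^{j-1}\phi _t\) holds with equality in that case:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
phi = 1.0 + rng.uniform(0.0, 0.2, size=T)   # phi_t = 1 + mu * eta_t > 1
H = rng.uniform(0.0, 1.0, size=T)           # per-epoch error terms H_t

d = 1.0                                     # d_0 = ||w~_0 - w_*||^2
for t in range(T):
    d = (d + H[t]) / phi[t]                 # tightest sequence obeying the recursion

lhs = np.prod(phi) * d
rhs = 1.0 + sum(H[j] * np.prod(phi[:j]) for j in range(T))
assert abs(lhs - rhs) < 1e-8                # matches the unrolled bound exactly
```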


About this article

Cite this article

Tran, T.H., Tran-Dinh, Q. & Nguyen, L.M. Shuffling Momentum Gradient Algorithm for Convex Optimization. Vietnam J. Math. (2024). https://doi.org/10.1007/s10013-024-00699-7
