
Shuffling Momentum Gradient Algorithm for Convex Optimization

Original Article · Vietnam Journal of Mathematics

Abstract

The stochastic gradient descent (SGD) method and its variants have become the methods of choice for solving finite-sum optimization problems arising from machine learning and data science, thanks to their scalability to large-scale applications and big datasets. In recent decades, researchers have made substantial efforts to study the theoretical performance of SGD and its shuffling variants. However, only limited work has investigated its shuffling momentum variants, including shuffling heavy-ball momentum schemes for non-convex problems and Nesterov’s momentum for convex settings. In this work, we extend the analysis of the shuffling momentum gradient method developed in [39] (Proceedings of the 38th International Conference on Machine Learning, PMLR 139:10379–10389, 2021) to both finite-sum convex and strongly convex optimization problems. We provide the first analysis of shuffling momentum-based methods for the strongly convex setting, attaining a convergence rate of \(\mathcal {O}(1/nT^2)\), where n is the number of samples and T is the number of training epochs. Our analysis is state-of-the-art, matching the best rates of existing shuffling stochastic gradient algorithms in the literature.


References

  1. Ahn, K., Yun, C., Sra, S.: SGD with shuffling: optimal rates without component convexity and large epoch requirements. arXiv:2006.06946 (2020)

  2. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)

  3. Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)

  4. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Berlin, Heidelberg (2012)

  5. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Tech. 2, 27 (2011). Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm

  6. Cotter, A., Shamir, O., Srebro, N., Sridharan, K.: Better mini-batch algorithms via accelerated gradient methods. In: Shawe-Taylor, J. et al. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)

  7. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Ghahramani, Z. et al. (eds.) Advances in Neural Information Processing Systems, vol. 27, pp. 1646–1654. Curran Associates, Inc. (2014)

  8. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  9. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021)

  10. Haochen, J., Sra, S.: Random shuffling beats SGD after finite epochs. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol. 97, pp. 2624–2633 (2019)

  11. Hu, C., Pan, W., Kwok, J.: Accelerated gradient methods for stochastic optimization and online learning. In: Bengio, Y. et al. (eds.) Advances in Neural Information Processing Systems, vol. 22. Curran Associates, Inc. (2009)

  12. Jin, R., Xing, Y., He, X.: On the convergence of mSGD and AdaGrad for stochastic optimization. arXiv:2201.11204 (2022)

  13. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Burges, C.J. et al. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates, Inc. (2013)

  14. Kingma, D.P., Ba, J.: ADAM: a method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2014). arXiv:1412.6980 (2014)

  15. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Pereire, F. et al. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 2663–2671. Curran Associates, Inc. (2012)

  16. Li, X., Zhuang, Z., Orabona, F.: A second look at exponential and cosine step sizes: simplicity, adaptivity, and performance. In: Proceedings of the 38th International Conference on Machine Learning, PMLR, vol. 139, pp. 6553–6564 (2021)

  17. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv:1908.03265 (2019)

  18. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv:1608.03983v5 (2017)

  19. Mishchenko, K., Khaled, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. In: Larochelle, H. et al. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 17309–17320. Curran Associates, Inc. (2020)

  20. Nagaraj, D., Jain, P., Netrapalli, P.: SGD without replacement: sharper rates for general smooth convex functions. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol. 97, pp. 4703–4711 (2019)

  21. Nedić, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications, pp. 223–264. Springer, New York (2001)

  22. Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12, 109–138 (2001)

  23. Nesterov, Y.: A method for unconstrained convex minimization problem with the rate of convergence \(\cal{O} (1/k^2)\). Dokl. Akad. Nauk. SSSR 269, 543–547 (1983)

  24. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers (2004)

  25. Nguyen, L., Nguyen, P.H., Dijk, M., Richtarik, P., Scheinberg, K., Takac, M.: SGD and Hogwild! Convergence without the bounded gradients assumption. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 3747–3755 (2018)

  26. Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., Van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods. J. Mach. Learn. Res. 22, 1–44 (2021)

  27. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: A novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, PMLR, vol. 70, pp. 2613–2621 (2017)

  28. Nguyen, L.M., Tran, T.H.: On the convergence to a global solution of shuffling-type gradient algorithms. In: Oh, A. et al. (eds.) Advances in Neural Information Processing Systems, vol. 36, pp. 75545–75566. Curran Associates, Inc. (2023)

  29. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4, 1–17 (1964)

  31. Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 7964–7973 (2020)

  32. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5, 201–226 (2013)

  33. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of Adam and beyond. arXiv:1904.09237 (2019)

  34. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  35. Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Proceedings of the 33rd Conference on Learning Theory, PMLR, vol. 125, pp. 3250–3284 (2020)

  36. Shamir, O.: Without-replacement sampling for stochastic gradient methods. In: Lee, D. et al. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 46–54. Curran Associates, Inc. (2016)

  37. Sra, S., Nowozin, S., Wright, S.J. (eds.): Optimization for Machine Learning. MIT Press (2012)

  38. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning, PMLR vol. 28, pp. 1139–1147 (2013)

  39. Tran, T.H., Nguyen, L.M., Tran-Dinh, Q.: SMG: a shuffling gradient-based method with momentum. In: Proceedings of the 38th International Conference on Machine Learning, PMLR vol. 139, pp. 10379–10389 (2021)

  40. Tran, T.H., Scheinberg, K., Nguyen, L.M.: Nesterov accelerated shuffling gradient method for convex optimization. In: Proceedings of the 39th International Conference on Machine Learning, PMLR, vol. 162, pp. 21703–21732 (2022)

  41. Yuan, K., Ying, B., Sayed, A.H.: On the influence of momentum acceleration on online learning. J. Mach. Learn. Res. 17(192), 1–66 (2016)

Author information

Corresponding author

Correspondence to Trang H. Tran.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is dedicated to Professor Tamás Terlaky on the occasion of his 70th birthday.

Appendices

Appendix A: Technical Notations and Expressions

First, we introduce some additional notations for our analysis in the sequel. For \(t \ge 1\) and \(i = 0,\dots , n\), we denote

$$\begin{aligned} A_i^{(t)} := \left\| \sum _{j=0}^{i-1} g_{j}^{(t)} \right\| ^2 \quad \text { and } \quad B_i^{(t)} := \left\| \sum _{j = i}^{n-1} g_{j}^{(t)}\right\| ^2. \end{aligned}$$
(A.1)

We adopt the convention \(\sum _{j=0}^{-1} g_{j}^{(t)} = 0\), and \(\sum _{j=n}^{n-1} g_{j}^{(t)} = 0\) in the above definitions. Next, we collect the following expressions from Algorithm 1 to use later in our analysis. From the update rules of Algorithm 1, for \(t > 1\), we have

$$\begin{aligned} m_0^{(t)} = \tilde{m}_{t-1} = v_{n}^{(t-1)} \overset{(2)}{=} v_{n-1}^{(t-1)} + \frac{1}{n} g_{n-1}^{(t-1)} = v_{0}^{(t-1)} + \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}. \end{aligned}$$
(A.2)

If \(t = 1\), then we set \(m_0^{(1)} = \tilde{m}_{0} = {\textbf {0}} \in \mathbb {R}^d\). From the update \(w_{i+1}^{(t)} := w_{i}^{(t)} - \eta _i^{(t)} m_{i+1}^{(t)}\) with \(\eta _i^{(t)} := \frac{\eta _t}{n}\), for \(i = 1, 2,\dots ,n-1\), at Step 7 of Algorithm 1, we can derive

$$\begin{aligned} w_{i}^{(t)} \overset{(2)}{=} w_{i-1}^{(t)} - \frac{\eta _t}{n} m_{i}^{(t)} = w_{0}^{(t)} - \frac{\eta _t}{n} \sum _{j=0}^{i-1} m_{j+1}^{(t)}, \quad t \ge 1. \end{aligned}$$
(A.3)

Note that \(\sum _{j=0}^{-1} m_{j+1}^{(t)} = 0\) and \(\sum _{j = n}^{n-1} m_{j+1}^{(t)} = 0\) by convention, so these equations also hold for \(i=0\) and \(i=n\).

Now, letting \(i=n\) in equation (A.3), for all \(t\ge 1\), we have

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} &= - \frac{\eta _t}{n} \sum _{j=0}^{n-1} m_{j+1}^{(t)} \overset{(2)}{=} - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta m_{0}^{(t)} + (1-\beta ) g_{j}^{(t)}\right) \\ &= - \frac{\eta _t}{n} \left( n \beta m_{0}^{(t)} + (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}\right) . \end{aligned}$$

Since \(m_0^{(t)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}\) for \(t \ge 2\) (due to (A.2)), we have the following update

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} = - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta g_{j}^{(t-1)} + (1-\beta ) g_{j}^{(t)} \right) . \end{aligned}$$
(A.4)

For \(t = 1\), since \(m_{0}^{(t)} =\textbf{0}\), we get

$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} =- \frac{\eta _t}{n} (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}. \end{aligned}$$
(A.5)
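To make these recursions concrete, here is a minimal Python sketch of the shuffling momentum gradient updates collected in (A.2)–(A.5); the component-gradient oracle `grad`, the step-size schedule `eta`, and all other names are illustrative placeholders, not the authors' reference implementation.

```python
import numpy as np

def smg(grad, w0, n, T, eta, beta=0.5, seed=0):
    """Sketch of the shuffling momentum gradient (SMG) recursion.

    grad(w, i) -- gradient of the i-th component f(.; i) at w (placeholder).
    eta(t)     -- epoch step size eta_t; each inner step uses eta_t / n.
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    m_tilde = np.zeros_like(w0)           # \tilde{m}_0 = 0
    for t in range(1, T + 1):
        perm = rng.permutation(n)         # random reshuffling pi^{(t)}
        m0 = m_tilde                      # anchored momentum m_0^{(t)}, cf. (A.2)
        v = np.zeros_like(w0)
        for i in range(n):
            g = grad(w, perm[i])          # g_i^{(t)}
            m = beta * m0 + (1.0 - beta) * g
            v = v + g / n                 # v_n^{(t)} = epoch-average gradient
            w = w - (eta(t) / n) * m      # w_{i+1}^{(t)} with eta_i^{(t)} = eta_t / n
        m_tilde = v                       # \tilde{m}_t = v_n^{(t)}, cf. (A.2)
    return w
```

Setting \(\beta = 0\) recovers plain shuffling SGD, while \(\beta \in (0,1)\) anchors the momentum at the previous epoch average \(\tilde{m}_{t-1}\), which is exactly what produces the epoch updates (A.4) and (A.5).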

Appendix B: Technical Lemmas

This appendix provides necessary technical lemmas used in our analysis.

B.1 Lemma 2: Sampling Without Replacement

We need [19, Lemma 1] for sampling without replacement in our analysis. For completeness, we recall it here.

Lemma 2

(Lemma 1 in [19]) Let \(X_1, \dots , X_n \in \mathbb {R}^d\) be fixed vectors, \(\bar{X} := \frac{1}{n} \sum _{i=1}^n X_i\) be their average, and \(\sigma ^2 := \frac{1}{n} \sum _{i=1}^n \Vert X_i -\bar{X}\Vert ^2\) be the population variance. Fix any \(k \in \{1,\dots , n\}\), and let \(X_{\pi _1}, \dots , X_{\pi _k}\) be sampled uniformly without replacement from \(\{X_1, \dots , X_n\}\) and \(\bar{X}_\pi \) be their average. Then the expectation and variance of the sample average are given, respectively, by

$$\begin{aligned} \mathbb {E}[\bar{X}_\pi ] = \bar{X} \qquad \text {and} \qquad \mathbb {E}\left[ \Vert \bar{X}_\pi - \bar{X}\Vert ^2\right] = \frac{n-k}{k(n-1)} \sigma ^2. \end{aligned}$$

Since \(\pi ^{(t)} = (\pi ^{(t)}(1),\dots ,\pi ^{(t)}(n))\) is uniformly sampled at random without replacement from \(\{1,\dots ,n\}\), by Lemma 2, we have \(\mathbb {E}\left[ \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) \right] = \nabla F(w_*)\) and

$$\begin{aligned} \mathbb {E}\left[ \left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] &= (n-i)^2\, \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] \\ &= (n-i)^2\, \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) - \nabla F(w_*)\right\| ^2 \right] \\ &= \frac{(n-i)^2 i}{(n-i)(n-1)} \cdot \frac{1}{n} \sum _{j=0}^{n-1} \left\| \nabla f ( w_*; \pi ^{(t)} (j + 1)) \right\| ^2 \\ &\overset{(8)}{\le } \frac{(n-i) i}{n-1}\, \sigma ^2. \end{aligned}$$
(B.1)
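The identity in Lemma 2 is easy to sanity-check numerically. The following sketch (assuming numpy; all names are illustrative) estimates \(\mathbb {E}\Vert \bar{X}_\pi - \bar{X}\Vert ^2\) by Monte Carlo and compares it with the closed form \(\frac{n-k}{k(n-1)}\sigma ^2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d, trials = 10, 4, 3, 100_000
X = rng.normal(size=(n, d))                         # fixed vectors X_1, ..., X_n
X_bar = X.mean(axis=0)
sigma2 = np.mean(np.sum((X - X_bar) ** 2, axis=1))  # population variance

# Monte Carlo estimate of E||X_bar_pi - X_bar||^2 under uniform
# sampling without replacement.
err = np.empty(trials)
for s in range(trials):
    idx = rng.choice(n, size=k, replace=False)
    err[s] = np.sum((X[idx].mean(axis=0) - X_bar) ** 2)

print(err.mean())                        # ~ (n - k) / (k * (n - 1)) * sigma2
print((n - k) / (k * (n - 1)) * sigma2)
```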

B.2 Bound Expected Squared Norm of Sum of Gradients

Lemma 3

Let \(w_*\) be an optimal solution of F, and let \(\{w_i^{(t)}\}\) be generated by Algorithm 1 with \(0\le \beta < 1\) and \(\eta _i^{(t)} := \frac{\eta _t}{n}\) for every \(t \ge 1\). Assume that Assumptions 1 and 2 hold. Then, under a randomized reshuffling strategy, we have the following bounds for \(t \ge 1\):

$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4nL \cdot \mathbb {E}[C_t]\quad \text {and}\quad \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right] \le 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}. $$

Proof

We start with the definition of \(B_i^{(t)}\):

$$\begin{aligned} B_i^{(t)} &= \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2 = \left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) + \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1)) \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } 2\left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) \right\| ^2 + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } 2(n-i) \sum _{j=i}^{n-1} \left\| g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2 + 2\left\| \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2 \\ &\overset{\mathrm{(b)}}{\le } 4L (n-i) \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)}) + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2, \end{aligned}$$

where (a) follows from the Cauchy–Schwarz inequality, and the last line (b) follows from inequality (4). Taking expectation on both sides and using (B.1), we have

$$\begin{aligned} \mathbb {E}\left[ B_i^{(t)}\right] &\le 4L (n-i)\, \mathbb {E}\left[ \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)})\right] + 2\, \mathbb {E}\left[ \left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2\right] \\ &\overset{(6),(\mathrm{B.1})}{\le } 4L (n-i)\, \mathbb {E}[C_t] + \frac{2(n-i) i}{n-1}\, \sigma ^2. \end{aligned}$$

For the special case \(i=0\), we have:

$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4L n \mathbb {E}[C_t]. $$

Summing up the expression from \(i := 0\) to \(i := n-1\), we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right] &\le 4L\cdot \mathbb {E}[C_t] \sum _{i=0}^{n-1} (n-i) + 2\sigma ^2 \sum _{i=0}^{n-1}\frac{(n-i) i}{n-1} \\ &\le 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}, \end{aligned}$$

where we use the facts that \(\sum _{i=0}^{n-1} (n-i) \le n^2\) and \(\sum _{i=0}^{n-1}\frac{(n-i) i}{(n-1)} \le \frac{n(n+1)}{6} \le \frac{n^2}{3}\).\(\square \)
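The elementary summation facts used here, together with \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\) invoked later in Lemmas 4 and 5, can be confirmed with a short script (a sketch, assuming numpy):

```python
import numpy as np

for n in range(2, 200):
    i = np.arange(n)                                   # i = 0, ..., n-1
    assert np.sum(n - i) <= n ** 2                     # sum (n-i) = n(n+1)/2
    assert np.sum((n - i) * i) / (n - 1) <= n * (n + 1) / 6 + 1e-9
    assert n * (n + 1) / 6 <= n ** 2 / 3               # since n + 1 <= 2n
    assert np.sum((n - i) ** 2) <= n ** 3              # used in Lemmas 4 and 5
```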

B.3 Bound Expected Sum of Bregman Divergence (1)

Lemma 4

Under the same setting as Lemma 3, it holds that

$$ \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] \le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2(1-\beta )L\sigma ^2\quad \text { for } t \ge 1. $$

Proof

We start with the case \(t \ge 2\):

$$\begin{aligned} D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)}\right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &= \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta \frac{n-i}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)} + (1-\beta )\sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &\overset{\mathrm{(a)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \left\| \frac{n-i}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}\right\| ^2 + (1-\beta ) \left\| \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \right] \\ &\overset{(\mathrm{A.1})}{=} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] , \end{aligned}$$

where the second line substitutes \(m_0^{(t)}\) via (A.2), and (a) follows from the convexity of \(\Vert \cdot \Vert ^2\) for \(0\le \beta < 1\). Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] &\le \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2}\, \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \\ &\overset{\mathrm{(b)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{n^3}{n^2}\, \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \\ &\overset{\mathrm{(c)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3} \right) \right] \\ &\le \frac{L\eta _t^2}{2} \left[ 4 \beta L \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\ &\overset{(7)}{\le } 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3} \\ &\le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$

where (b) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\), and in (c) we use the results from Lemma 3.

We do the same for the case \(t =1\):

$$\begin{aligned} D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\ &= \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 = \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta )^2 \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2 \\ &\overset{(\mathrm{A.1})}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta ) B_i^{(t)}\right] , \end{aligned}$$

since \(m_{0}^{(1)} = \textbf{0}\) and \((1-\beta )^2 \le 1-\beta \). Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] &\le \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta )\sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)} \right] \\ &\overset{\mathrm{(d)}}{\le } \frac{L}{2} \frac{\eta _t^2}{n^2} (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}\right) \\ &\le \frac{L\eta _t^2}{2} \left[ 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\ &\overset{(7)}{\le } 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3} \\ &\le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$

where in (d) we use the results from Lemma 3. Hence, the statement of Lemma 4 is true for all \(t \ge 1\).\(\square \)
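Step (a) in the proof above relies only on the convexity of \(\Vert \cdot \Vert ^2\), i.e. \(\Vert \beta u + (1-\beta ) v\Vert ^2 \le \beta \Vert u\Vert ^2 + (1-\beta )\Vert v\Vert ^2\) for \(0 \le \beta \le 1\). A quick randomized check (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(10_000):
    u, v = rng.normal(size=5), rng.normal(size=5)
    beta = rng.uniform()
    lhs = np.sum((beta * u + (1 - beta) * v) ** 2)
    rhs = beta * np.sum(u ** 2) + (1 - beta) * np.sum(v ** 2)
    assert lhs <= rhs + 1e-9   # convexity of the squared norm
```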

B.4 Bound Expected Sum of Bregman Divergence (2)

Lemma 5

Under the same setting as Lemma 3, the following holds for \(t\ge 2\):

$$ \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] \le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2. $$

Proof

We start with the case \(t \ge 3\):

$$\begin{aligned} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$

where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u \Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\). Using the bound

$$ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 \le \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] $$

from the proof of Lemma 4, applied for both \(t\) and \(t-1\), we have

$$\begin{aligned} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) &\le L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-2)} + (1-\beta ) B_i^{(t-1)}\right] . \end{aligned}$$

Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] &\le L \frac{\eta _t^2}{n} \left[ \beta \,\mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta )\, \mathbb {E}\left[ B_0^{(t)}\right] \right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2}\, \mathbb {E}\left[ B_0^{(t-2)}\right] + (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \right] \\ &\overset{\mathrm{(a)}}{\le } L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-2}]+ (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right) \right] \\ &\le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta )\cdot \mathbb {E}[C_{t}]\right] \\ &\quad + 4L^2\eta _{t-1}^2 \left[ \beta \cdot \mathbb {E}[C_{t-2}]+ (1-\beta )\, \mathbb {E}[C_{t-1}]\right] + (1-\beta )L\frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3} \\ &\le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$

where (a) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\) and from the results of Lemma 3. Now we analyze similarly for the case \(t = 2\):

$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) &\overset{(5)}{\le } \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$

where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u\Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\). Using the bounds

$$\begin{aligned} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 &\le \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] \qquad \text { for } t =2, \\ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 &\le \frac{\eta _t^2}{n^2} (1-\beta ) B_i^{(t)} \qquad \text { for } t =1, \end{aligned}$$

from the proof of Lemma 4, we have

$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) &\le L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2 \\ &\le L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} (1-\beta ) B_i^{(t-1)} . \end{aligned}$$

Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get

$$\begin{aligned} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] &\le L \frac{\eta _t^2}{n} \left[ \beta \,\mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta )\, \mathbb {E}\left[ B_0^{(t)}\right] \right] + L\frac{\eta _{t-1}^2}{n^2} (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \\ &\overset{\mathrm{(a)}}{\le } L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\ &\quad + L \frac{\eta _{t-1}^2}{n^2} (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right) \\ &\le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \cdot \mathbb {E}[C_{t}]\right] + 4L^2\eta _{t-1}^2 (1-\beta )\, \mathbb {E}[C_{t-1}] + (1-\beta ) L \frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3} \\ &\le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$

where (a) follows from the results of Lemma 3. Hence, the statement of Lemma 5 is true for all \(t \ge 2\).\(\square \)

Appendix C: Proofs of Lemma 1 and Theorem 2

We now provide the full proofs of Lemma 1 and Theorem 2 from the main text.

C.1 Proof of Lemma 1: Key Bounds

Proof

First, let us note that \(\tilde{w}_{t-1} = w_{0}^{(t)}\) for \(t=1, \dots , T\). From the update (A.4), we have the following for \(t\ge 2\):

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &= \Vert w_{0}^{(t)} - w_*\Vert ^2 = \left\| w_{n}^{(t)} + \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) - w_*\right\| ^2 \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \\ &\quad + \left\| \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \right\| ^2 \\ &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ \beta \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t-1)} + (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \right] \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 \\ &\quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1} \beta \left( f_{i}^{(t-1)} (w_{n}^{(t)}) - f_{i}^{(t-1)}(w_*) + D_{i}^{(t-1)}\left( w_*; w_{i}^{(t-1)}\right) - D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( f_{i}^{(t)} (w_{n}^{(t)}) - f_{i}^{(t)}(w_*) + D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) , \end{aligned}$$

where the last line follows from the identity

$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2; w_3) - D_{i}^{(t)}(w_1; w_3). $$

Now, noting that \(\frac{1}{n}\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\beta \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \beta \left( D_{i}^{(t-1)}\left( w_*; w_{i}^{(t-1)}\right) - D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \\ &\overset{(6)}{=} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \beta \left( C_{t-1} - \sum _{i=0}^{n-1} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) , \end{aligned}$$

where in the last equation we use the definition of \(C_t\). Taking expectation, we get

$$\begin{aligned} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} \beta \left( \mathbb {E}[C_{t-1}] - \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right] \right) \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \\ &\overset{\mathrm{(a)}}{\ge } \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2\, \mathbb {E}\left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} F_t - 2\beta \frac{\eta _t}{n}\left[ 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2 \right] \\ &\quad - 2(1-\beta ) \frac{\eta _t}{n}\left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] \\ &\ge \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] + \frac{2\eta _t}{n} F_t \\ &\quad - \frac{2\eta _t}{n} \cdot 4 L^2 \eta _t^2 F_t - \frac{2\eta _t}{n} \cdot 4\beta L^2\eta _{t-1}^2 F_{t-1} \\ &\quad - \frac{4\eta _t}{3n} (1-\beta ) L\sigma ^2 \left( \beta \eta _{t-1}^2 +(1-\beta )\eta _t^2 \right) , \end{aligned}$$

where (a) comes from the definition (7) of \(F_t\) and the results of Lemmas 4 and 5 for \(t \ge 2\). Noting that \(w_{n}^{(t)} = \tilde{w}_{t}\), and that \(\eta _t \le \frac{1}{2L\sqrt{K}}\) implies \(4 L^2 \eta _t^2 \le \frac{1}{K}\), we further have

$$\begin{aligned} 2 \eta _t\, \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$

Now we consider the case \(t=1\). From the update (A.5), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &= \Vert w_{0}^{(t)} - w_*\Vert ^2 = \left\| w_{n}^{(t)} + \frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) g_{i}^{(t)} - w_*\right\| ^2 \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1}(1-\beta )g_{i}^{(t)} + \left\| \frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta )g_{i}^{(t)} \right\| ^2 \\ &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \\ &= \left\| w_{n}^{(t)} - w_*\right\| ^2 \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( f_{i}^{(t)} ( w_{n}^{(t)} )- f_{i}^{(t)}( w_*) + D_{i}^{(t)}(w_*; w_{i}^{(t)}) - D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right) , \end{aligned}$$

where the last line again follows from the identity

$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2; w_3) - D_{i}^{(t)}(w_1; w_3). $$

Now, noting that \(\frac{1}{n}\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have

$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 &\ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \\ &\overset{(6)}{=} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t (1-\beta ) \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) . \end{aligned}$$

Taking the full expectation, we get

$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \\ &\overset{\mathrm{(a)}}{\ge } \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + 2\frac{\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] , \end{aligned}$$

where (a) comes from the definition (7) of \(F_t\) and the result of Lemma 4 for \(t \ge 1\). Since \(4L^2\eta _t^2 \le \frac{1}{K}\), this yields

$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] &\ge \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2(1-\beta )\, \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\ &\quad + \frac{2\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ \frac{1}{K} F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2 \right] . \end{aligned}$$

Finally, we have

$$\begin{aligned} 2\eta _t(1-\beta )\, \mathbb {E} \left[ F( \tilde{w}_{t} ) - F(w_*) \right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$

This completes our proof.\(\square \)
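The three-point identity invoked twice in this proof holds for any differentiable \(f\) with Bregman divergence \(D(u; v) := f(u) - f(v) - \nabla f(v)^\top (u - v)\); the sketch below (assuming numpy, with all names illustrative) verifies it on a random quadratic:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
A = rng.normal(size=(d, d)); A = A @ A.T           # random PSD matrix
b = rng.normal(size=d)

f = lambda w: 0.5 * w @ A @ w + b @ w
grad = lambda w: A @ w + b
D = lambda u, v: f(u) - f(v) - grad(v) @ (u - v)   # Bregman divergence

w1, w2, w3 = (rng.normal(size=d) for _ in range(3))
lhs = (w1 - w2) @ grad(w3)
rhs = f(w1) - f(w2) + D(w2, w3) - D(w1, w3)
assert abs(lhs - rhs) < 1e-9                       # three-point identity
```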

C.2 Proof of Theorem 2: Strongly Convex Objectives

Proof

From the results of Lemma 1, for \(\eta _t \le \frac{1}{2L\sqrt{K}}\) we have

$$\begin{aligned} 2 \eta _t\, \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t \ge 2, \\ 2 \eta _t(1-\beta )\, \mathbb {E}\left[ F( \tilde{w}_{t} ) - F(w_*) \right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3,\quad t = 1. \end{aligned}$$

Since F is \(\mu \)-strongly convex, i.e. \(F(w) - F(w_*) \ge \frac{\mu }{2} \Vert w - w_*\Vert ^2\) for all \(w \in \mathbb {R}^d\), we have the following:

$$\begin{aligned} \mu \eta _t\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n}\beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2}{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 ,\quad t \ge 2, \\ \mu \eta _t (1-\beta )\, \mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\ &\quad + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$

Hence, one has

$$\begin{aligned} (1+\mu \eta _t)\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] &\le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t \\ &\quad + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \quad t \ge 2, \\ (1+\mu \eta _t (1-\beta ))\, \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] &\le \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t \\ &\quad + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$

Now we adopt the following notation:

$$\begin{aligned} \phi _t := \left\{ \begin{array}{ll} 1+\mu \eta _t &\quad \text {if } t \ge 2,\\ 1+\mu \eta _t (1-\beta ) &\quad \text {if } t = 1, \end{array}\right. \end{aligned}$$

and for every \(t\ge 1\):

$$\begin{aligned} H_t= - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \end{aligned}$$

where we use the convention that \(F_0 = 0\). Hence, we have the following recursion for all \(t\ge 1\):

$$ \phi _t\mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] \le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] + H_t. $$

Unrolling this recursion, we have

$$\begin{aligned} \prod _{t=1}^{T} \phi _t\, \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \prod _{t=1}^{T-1} \phi _t \left( \mathbb {E}\left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + H_{T} \right) \\ &= \prod _{t=1}^{T-1} \phi _t\, \mathbb {E} \left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-1} \phi _t H_{T} \\ &\le \prod _{t=1}^{T-2} \phi _t\, \mathbb {E} \left[ \left\| \tilde{w}_{T-2}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-2} \phi _t H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T} \\ &\le \cdots \le \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + H_{1} + \phi _1 H_2 + \cdots + \prod _{t=1}^{T-2}\phi _t H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T} \\ &= \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t , \end{aligned}$$

where, using the step-size condition \(\eta _j \le \alpha \eta _{j-1}\), each term is bounded by

$$ H_j \le \frac{2}{n} \left( \frac{1}{K} -1\right) \eta _j F_j + \frac{2}{n} \alpha \beta \frac{1}{K} \eta _{j-1}F_{j-1} + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \eta _{j-1}^3 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _j^3 . $$

We bound the last term:

$$\begin{aligned} \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t &\le \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=1}^T \eta _{j-1}F_{j-1}\prod _{t=1}^{j-1} \phi _t \\ &\quad + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \sum _{j=1}^T \eta _{j-1}^3 \prod _{t=1}^{j-1}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \\ &= \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=0}^{T-1} \eta _{j}F_{j}\prod _{t=1}^{j} \phi _t \\ &\quad + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta )\sum _{j=0}^{T-1} \eta _{j}^3 \prod _{t=1}^{j}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \\ &\le \frac{2}{n}\sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t \left( \frac{1}{K} -1 + \alpha \beta \frac{1}{K} \phi _j \right) + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( (1-\beta )^2 + \alpha \beta (1-\beta ) \phi _j \right) \\ &\le \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) , \end{aligned}$$

where we use the fact that \(K = 1 + \alpha \beta (1 +\mu \max _t \eta _t)\). Plugging the last bound into the previous estimate, we get

$$\begin{aligned} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{1}{\prod _{t=1}^{T} \phi _t}\sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t \\ &\le \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\,\frac{\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) }{\prod _{t=1}^{T} \phi _t} \\ &= \frac{1}{\prod _{t=1}^{T} \phi _t}\, \mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta \phi _j}{\prod _{t=j}^{T} \phi _t}. \end{aligned}$$

We use the fact that \(\frac{1}{\phi _1} = \frac{1}{1 + (1-\beta ) \mu \eta _1} \le \frac{1}{(1-\beta )(1 + \mu \eta _1)} \) and obtain

$$\begin{aligned} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] &\le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1+(1-\beta )\mu \eta _1)\prod _{t=2}^{T} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta (1 + \mu \max _t \eta _t)}{\prod _{t=j}^{T} (1+\mu \eta _t)} \\ &\le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1-\beta ) \prod _{t=1}^{T} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\,\frac{1-\beta + \alpha \beta (1 + \mu \max _t \eta _t)}{\prod _{t=j}^{T} (1+\mu \eta _t)}. \end{aligned}$$

This completes our proof.\(\square \)
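The bookkeeping behind the unrolled recursion \(\phi _t\, \mathbb {E}[\Vert \tilde{w}_t - w_*\Vert ^2] \le \mathbb {E}[\Vert \tilde{w}_{t-1} - w_*\Vert ^2] + H_t\) can also be checked on synthetic data. The sketch below (assuming numpy, with arbitrary \(\phi _t > 1\) and \(H_t \ge 0\)) builds the tightest sequence satisfying the recursion and confirms that \(\prod _{t=1}^{T}\phi _t\, d_T = d_0 + \sum _{j=1}^{T} H_j \prod _{t=1}^{j-1}\phi _t\) holds with equality in that case:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50
phi = 1.0 + rng.uniform(0.0, 0.2, size=T)   # phi_t = 1 + mu * eta_t > 1
H = rng.uniform(0.0, 1.0, size=T)           # per-epoch error terms H_t

d = 1.0                                     # d_0 = ||w~_0 - w_*||^2
for t in range(T):
    d = (d + H[t]) / phi[t]                 # tightest sequence obeying the recursion

lhs = np.prod(phi) * d
rhs = 1.0 + sum(H[j] * np.prod(phi[:j]) for j in range(T))
assert abs(lhs - rhs) < 1e-8                # matches the unrolled bound exactly
```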


About this article

Cite this article

Tran, T.H., Tran-Dinh, Q. & Nguyen, L.M. Shuffling Momentum Gradient Algorithm for Convex Optimization. Vietnam J. Math. (2024). https://doi.org/10.1007/s10013-024-00699-7
