Appendix A: Technical Notations and Expressions
First, we introduce some additional notations for our analysis in the sequel. For \(t \ge 1\) and \(i = 0,\dots , n\), we denote
$$\begin{aligned} A_i^{(t)} := \left\| \sum _{j=0}^{i-1} g_{j}^{(t)} \right\| ^2 \quad \text { and } \quad B_i^{(t)} := \left\| \sum _{j = i}^{n-1} g_{j}^{(t)}\right\| ^2. \end{aligned}$$
(A.1)
We adopt the convention \(\sum _{j=0}^{-1} g_{j}^{(t)} = 0\), and \(\sum _{j=n}^{n-1} g_{j}^{(t)} = 0\) in the above definitions. Next, we collect the following expressions from Algorithm 1 to use later in our analysis. From the update rules of Algorithm 1, for \(t > 1\), we have
$$\begin{aligned} m_0^{(t)} = \tilde{m}_{t-1} = v_{n}^{(t-1)} \overset{2}{=} v_{n-1}^{(t-1)} + \frac{1}{n} g_{n-1}^{(t-1)} = v_{0}^{(t-1)} + \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}. \end{aligned}$$
(A.2)
If \(t = 1\), then we set \(m_0^{(1)} = \tilde{m}_{0} = {\textbf {0}} \in \mathbb {R}^d\). From the update \(w_{i+1}^{(t)} := w_{i}^{(t)} - \eta _i^{(t)} m_{i+1}^{(t)}\) with \(\eta _i^{(t)} := \frac{\eta _t}{n}\), for \(i = 1, 2,\dots ,n-1\), at Step 7 of Algorithm 1, we can derive
$$\begin{aligned} w_{i}^{(t)} \overset{2}{=} w_{i-1}^{(t)} - \frac{\eta _t}{n} m_{i}^{(t)} = w_{0}^{(t)} - \frac{\eta _t}{n} \sum _{j=0}^{i-1} m_{j+1}^{(t)}, \quad t \ge 1. \end{aligned}$$
(A.3)
Note that \(\sum _{j=0}^{-1} m_{j+1}^{(t)}\) and \(\sum _{j = n}^{n-1} m_{j+1}^{(t-1)} = 0\) by convention, these equations also hold true for \(i=0\) and \(i=n\).
Now, letting \(i=n\) in equation (A.3), for all \(t\ge 1\), we have
$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)}= & {} - \frac{\eta _t}{n} \sum _{j=0}^{n-1} m_{j+1}^{(t)} \overset{2}{=} - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta m_{0}^{(t)} + (1-\beta ) g_{j}^{(t)}\right) \nonumber \\= & {} - \frac{\eta _t}{n} \left( n \beta m_{0}^{(t)} + (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}\right) . \end{aligned}$$
Since \(m_0^{(t)} = \frac{1}{n} \sum _{j=0}^{n-1} g_{j}^{(t-1)}\) for \(t \ge 2\) (due to (A.2)), we have the following update
$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} = - \frac{\eta _t}{n} \sum _{j=0}^{n-1} \left( \beta g_{j}^{(t-1)} + (1-\beta ) g_{j}^{(t)} \right) . \end{aligned}$$
(A.4)
For \(t = 1\), since \(m_{0}^{(t)} =\textbf{0}\), we get
$$\begin{aligned} w_{n}^{(t)} - w_{0}^{(t)} =- \frac{\eta _t}{n} (1-\beta ) \sum _{j=0}^{n-1} g_{j}^{(t)}. \end{aligned}$$
(A.5)
Appendix B: Technical Lemmas
This appendix provides necessary technical lemmas used in our analysis.
1.1 B.1 Lemma 2: Sampling Without Replacement
We need [19, Lemma 1] for sampling without replacement in our analysis. For complete references, we recall it here.
Lemma 2
(Lemma 1 in [19]) Let \(X_1, \dots , X_n \in \mathbb {R}^d\) be fixed vectors, \(\bar{X} := \frac{1}{n} \sum _{i=1}^n X_i\) be their average and \(\sigma ^2 := \frac{1}{n} \sum _{i=1}^n \Vert X_i -\bar{X}\Vert ^2\) be the population variance. Fix any \(k \in \{1,\dots , n\}\), let \(X_{\pi _1}, \dots , X_{\pi _k}\) be sampled uniformly without replacement from \(\{X_1, \dots , X_n\}\) and \(\bar{X}_\pi \) be their average. Then, the sample average and the variance are given, respectively by
$$\begin{aligned} \mathbb {E}[\bar{X}_\pi ] = \bar{X} \qquad \text {and} \qquad \mathbb {E}\left[ \Vert \bar{X}_\pi - \bar{X}\Vert ^2\right] = \frac{n-k}{k(n-1)} \sigma ^2. \end{aligned}$$
Since \(\pi ^{(t)} = (\pi ^{(t)}(1),\dots ,\pi ^{(t)}(n))\) is uniformly sampled at random without replacement from \(\{1,\dots ,n\}\), by Lemma 2, we have \(\mathbb {E}\left[ \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) \right] = \nabla F(w_*)\) and
$$\begin{aligned}{} & {} \mathbb {E}\left[ \left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] \nonumber \\{} & {} = (n-i)^2 \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1))\right\| ^2 \right] \nonumber \\{} & {} = (n-i)^2 \mathbb {E} \left[ \left\| \frac{1}{n-i} \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)} (j + 1)) - \nabla F(w_*)\right\| ^2 \right] \nonumber \\{} & {} = \frac{(n-i)^2 i}{(n-i)(n-1)} \frac{1}{n} \sum _{j=0}^{n-1} \left\| \nabla f ( w_*; \pi ^{(t)} (j + 1)) \right\| ^2\nonumber \\{} & {} \overset{8}{\le } \frac{(n-i) i}{(n-1)} \sigma ^2. \end{aligned}$$
(B.1)
1.2 B.2 Bound Expected Squared Norm of Sum of Gradients
Lemma 3
Let \(w_*\) be an optimal solution of F, and \(\{w_i^{(t)}\}\) be generated by Algorithm 1 with \(0\le \beta < 1\) and \(\eta _i^{(t)} := \frac{\eta _t}{n}\) for every \(t \ge 1\). Assume that Assumptions 1 and 2 hold, with the application of a randomized reshuffling strategy, we have the following bounds for \(t \ge 1\):
$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4nL \cdot \mathbb {E}[C_t]\quad \text {and}\quad \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right] \le 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}. $$
Proof
We start with the definition of \(B_i^{(t)}\):
$$\begin{aligned} B_i^{(t)}= & {} \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2 \nonumber \\= & {} \left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) + \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1)) \right\| ^2 \nonumber \\&\overset{\mathrm{(a)}}{\le }\ {}&2\left\| \sum _{j=i}^{n-1}\left( g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right) \right\| ^2 + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2 \nonumber \\&\overset{\mathrm{(a)}}{\le }\ {}&2(n-i) \sum _{j=i}^{n-1} \left\| g_{j}^{(t)} - \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2 + 2\left\| \sum _{j=i}^{n-1}\nabla f(w_*; \pi ^{(t)}(j+1) ) \right\| ^2\nonumber \\&\overset{\mathrm{(b)}}{\le }\ {}&2 (n-i) \cdot 2L \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)}) + 2\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2, \end{aligned}$$
where (a) is from from the Cauchy–Schwarz inequality, and the last line (b) follows by the inequality (4). Now taking expectation to both sides and using (B.1), we have
$$\begin{aligned} \mathbb {E}\left[ B_i^{(t)}\right]= & {} 4L (n-i) \mathbb {E}\left[ \sum _{j=i}^{n-1} D_{j}^{(t)} (w_*; w_{j}^{(t)})\right] + 2\left[ \mathbb {E}\left\| \sum _{j=i}^{n-1} \nabla f(w_*; \pi ^{(t)}(j+1))\right\| ^2\right] \nonumber \\\overset{6B.1}{\le } & {} 4L (n-i) \mathbb {E}[C_t] + 2\frac{(n-i) i}{(n-1)} \sigma ^2. \end{aligned}$$
For the special case \(i=0\), we have:
$$ \mathbb {E}\left[ B_0^{(t)}\right] \le 4L n \mathbb {E}[C_t]. $$
Summing up the expression from \(i := 0\) to \(i := n-1\), we get
$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)}\right]\le & {} 4L\cdot \mathbb {E}[C_t] \sum _{i=0}^{n-1} (n-i) + 2\sigma ^2 \sum _{i=0}^{n-1}\frac{(n-i) i}{(n-1)}\\\le & {} 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}, \end{aligned}$$
where we use the facts that \(\sum _{i=0}^{n-1} (n-i) \le n^2\) and \(\sum _{i=0}^{n-1}\frac{(n-i) i}{(n-1)} \le \frac{n(n+1)}{6} \le \frac{n^2}{3}\).\(\square \)
1.3 B.3 Bound Expected Sum of Bregman Divergence (1)
Lemma 4
Under the same setting as of Lemma 3, it holds that
$$ \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right] \le 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2(1-\beta )L\sigma ^2\quad \text { for } t \ge 1. $$
Proof
We start with the case \(t \ge 2\):
$$\begin{aligned} D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right)\overset{5}{\le } & {} \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)}\right\| ^2\nonumber \\\le & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \nonumber \\&\overset{}{\le }&\frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta \frac{n-i}{n} \sum _{i=0}^{n-1} g_{i}^{(t-1)} + (1-\beta )\sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \nonumber \\&\overset{\mathrm{(a)}}{\le }\ {}&\frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \left\| \frac{n-i}{n} \sum _{i=0}^{n-1} g_{i}^{(t-1)}\right\| ^2 + (1-\beta ) \left\| \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \right] \\\overset{A.1, A.1}{=} & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] , \end{aligned}$$
where (a) follows from the convexity of \(\Vert \cdot \Vert ^2\) for \(0\le \beta < 1\). Summing up the expression from \(i := 0\) to \(i := n-1\) and talking expectation, we get
$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right]\le & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2} \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \\&\overset{\mathrm{(b)}}{\le }\ {}&\frac{L}{2} \frac{\eta _t^2}{n^2} \left[ \beta \frac{n^3}{n^2} \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \sum _{i=0}^{n-1} \mathbb {E} \left[ B_i^{(t)} \right] \right] \nonumber \\&\overset{\mathrm{(c)}}{\le }\ {}&\frac{L}{2} \frac{\eta _t^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3} \right) \right] \\\le & {} \frac{L\eta _t^2}{2} \left[ 4 \beta L \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\\overset{7}{\le } & {} 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3}\\\le & {} 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$
where (b) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\), and in (c) we use the results from Lemma 3.
We do the same for the case \(t =1\):
$$\begin{aligned} D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)})\overset{5}{\le } & {} \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| \beta (n -i) m_{0}^{(t)} + (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \\\le & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left\| (1-\beta ) \sum _{j=i}^{n-1} g_{j}^{(t)} \right\| ^2 \le \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta )^2 \left\| \sum _{j=i}^{n-1} g_{j}^{(t)}\right\| ^2\right] \\\overset{A.1}{=} & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta ) B_i^{(t)}\right] . \end{aligned}$$
Summing up the expression from \(i := 0\) to \(i := n-1\) and talking expectation, we get
$$\begin{aligned} \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right]\le & {} \frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta )\sum _{i=0}^{n-1} \mathbb {E}\left[ B_i^{(t)} \right] \right] \\&\overset{\mathrm{(d)}}{\le }\ {}&\frac{L}{2} \frac{\eta _t^2}{n^2} \left[ (1-\beta ) \left( 4n^2 L\cdot \mathbb {E}[C_t] + \frac{2n^2\sigma ^2}{3}\right) \right] \\\le & {} \frac{L\eta _t^2}{2} \left[ 4(1-\beta ) L\cdot \mathbb {E}[C_t] + (1-\beta ) \frac{2\sigma ^2}{3} \right] \\\overset{7}{\le } & {} 2 L^2\eta _t^2 F_t + \frac{L\eta _t^2}{2} (1-\beta ) \frac{2\sigma ^2}{3}\\\le & {} 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2, \end{aligned}$$
where in (d) we use the results from Lemma 3. Hence, the statement of Lemma 4 is true for all \(t \ge 1\).\(\square \)
1.4 B.4 Bound Expected Sum of Bregman Divergence (2)
Lemma 5
Under the same setting as of Lemma 3, the following holds for \(t\ge 2\):
$$ \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] \le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2. $$
Proof
We start with the case \(t \ge 3\):
$$\begin{aligned} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)})\overset{5}{\le } & {} \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2\nonumber \\\le & {} L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$
where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u \Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\), using the result
$$ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2 = \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right] $$
from Lemma 4 for t and \(t-1\), we have
$$\begin{aligned}{} & {} D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)})\\{} & {} \le L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2\\{} & {} \le L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-2)} + (1-\beta ) B_i^{(t-1)}\right] . \end{aligned}$$
Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get
$$\begin{aligned}{} & {} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] \\{} & {} \le L \frac{\eta _t^2}{n} \left[ \beta \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \mathbb {E}\left[ B_0^{(t)}\right] \right] \\{} & {} \quad + L \frac{\eta _{t-1}^2}{n^2} \left[ \beta \frac{\sum _{i=0}^{n-1}(n-i)^2}{n^2} \mathbb {E}\left[ B_0^{(t-2)}\right] + (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \right] \\{} & {} \overset{\mathrm{(a)}}{\le }\ L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\{} & {} \quad + L \frac{\eta _{t-1}^2}{n^2} \left[ 4\beta n^2L \cdot \mathbb {E}[C_{t-2}]+ (1-\beta ) \left[ 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right] \right] \\{} & {} \le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta )\cdot \mathbb {E}[C_{t}]\right] \\{} & {} \quad + 4L^2\eta _{t-1}^2 \left[ \beta \cdot \mathbb {E}[C_{t-2}]+ (1-\beta ) \mathbb {E}[C_{t-1}]\right] + (1-\beta )L\frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3}\\{} & {} \le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$
where (a) follows from the fact that \(\sum _{i=0}^{n-1}(n-i)^2 \le n^3\) and from the results of Lemma 3. Now we analyze similarly for the case \(t = 2\):
$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right)\overset{5}{\le } & {} \frac{L}{2} \left\| w_{n}^{(t)} - w_{i}^{(t-1)} \right\| ^2\\\le & {} L\left\| w_{n}^{(t)} - w_{n}^{(t-1)} \right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2, \end{aligned}$$
where we use the inequality \(\Vert u + v\Vert ^2 \le 2 \Vert u\Vert ^2 + 2 \Vert v\Vert ^2\). Note that \(w_{n}^{(t-1)} = w_{0}^{(t)}\), using the result
$$\begin{aligned} \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2&= \frac{\eta _t^2}{n^2} \left[ \beta \frac{(n-i)^2}{n^2} B_0^{(t-1)} + (1-\beta ) B_i^{(t)} \right]{} & {} \text { for } t =2,\\ \left\| w_{n}^{(t)} - w_{i}^{(t)} \right\| ^2&= \frac{\eta _t^2}{n^2} \left[ (1-\beta ) B_i^{(t)} \right]{} & {} \text { for } t =1, \end{aligned}$$
from Lemma 4, we have
$$\begin{aligned} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right)\le & {} L\left\| w_{n}^{(t)} - w_{0}^{(t)}\right\| ^2 + L \left\| w_{n}^{(t-1)} - w_{i}^{(t-1)} \right\| ^2\\\le & {} L \frac{\eta _t^2}{n^2} \left[ \beta B_0^{(t-1)} + (1-\beta ) B_0^{(t)}\right] + L \frac{\eta _{t-1}^2}{n^2} \left[ (1-\beta ) B_i^{(t-1)}\right] . \end{aligned}$$
Summing up the expression from \(i := 0\) to \(i := n-1\) and taking expectation, we get
$$\begin{aligned}{} & {} \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t-1)}(w_{n}^{(t)}; w_{i}^{(t-1)}) \right] \\{} & {} \le L \frac{\eta _t^2}{n} \left[ \beta \mathbb {E}\left[ B_0^{(t-1)}\right] + (1-\beta ) \mathbb {E}\left[ B_0^{(t)}\right] \right] + L\frac{\eta _{t-1}^2}{n^2} \left[ (1-\beta ) \sum _{i=0}^{n-1}\mathbb {E}\left[ B_i^{(t-1)}\right] \right] \\{} & {} \overset{\mathrm{(a)}}{\le }\ L \frac{\eta _t^2}{n} \left[ 4\beta nL \cdot \mathbb {E}[C_{t-1}] + 4(1-\beta ) nL \cdot \mathbb {E}[C_{t}]\right] \\{} & {} \quad + L \frac{\eta _{t-1}^2}{n^2} \left[ (1-\beta ) \left[ 4n^2 L\cdot \mathbb {E}[C_{t-1}] + \frac{2n^2\sigma ^2}{3}\right] \right] \\{} & {} \le 4L^2 \eta _t^2 \left[ \beta \cdot \mathbb {E}[C_{t-1}] + (1-\beta ) \cdot \mathbb {E}[C_{t}]\right] \\{} & {} \quad + 4L^2\eta _{t-1}^2 \left[ (1-\beta ) \mathbb {E}[C_{t-1}]\right] + (1-\beta ) L \frac{\eta _{t-1}^2}{n^2} \frac{2n^2\sigma ^2}{3}\\{} & {} \le 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1}+ \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2, \end{aligned}$$
where (a) follows from the results of Lemma 3. Hence, the statement of Lemma 5 is true for all \(t \ge 2\).\(\square \)
Appendix C: Proofs of Lemma 1 and Theorem 2
We now provide the full proof of Lemma 1 and Theorem 2 in the main text.
1.1 C.1 Proof of Lemma 1: Key Bounds
Proof
First, let us note that \(\tilde{w}_{t-1} = w_{0}^{(t)}\) for \(t=1, \dots , T\). From the update (A.4), we have the following for \(t\ge 2\):
$$\begin{aligned}{} & {} \Vert \tilde{w}_{t-1} - w_*\Vert ^2\\{} & {} = \Vert w_{0}^{(t)} - w_*\Vert ^2\\{} & {} = \left\| w_{n}^{(t)} + \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) - w_*\right\| ^2\\{} & {} = \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \\{} & {} \quad +\left\| \frac{\eta _t}{n}\sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \right\| ^2\\{} & {} \ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1} \left( \beta g_{i}^{(t-1)} + (1-\beta ) g_{i}^{(t)} \right) \\{} & {} = \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ \beta \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t-1)} + (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \right] \\{} & {} =\left\| w_{n}^{(t)} - w_*\right\| ^2\\{} & {} \quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1}\! \beta \left( f_{i}^{(t-1)} (w_{n}^{(t)})\! -\! f_{i}^{(t-1)}(w_*)\! +\! D_{i}^{(t-1)}\!\left( w_*; w_{i}^{(t-1)}\right) \! - D_{i}^{(t-1)}\!\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \\{} & {} \quad + \frac{2\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) \left( f_{i}^{(t)} (w_{n}^{(t)}) - f_{i}^{(t)}(w_*) + D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) , \end{aligned}$$
where the last line follows from the following inequality:
$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2 - w_3) - D_{i}^{(t)}(w_1 - w_3). $$
Now note that \(\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have
$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2\ge & {} \left\| w_{n}^{(t)} - w_*\right\| ^2 \nonumber \\{} & {} + 2\beta \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \nonumber \\{} & {} + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ \beta \left( D_{i}^{(t-1)}\left( w_*; w_{i}^{(t-1)}\right) - D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \right] \\{} & {} + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \nonumber \\{} & {} + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \right] \\\overset{6}{=} & {} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\{} & {} + 2\frac{\eta _t}{n} \left[ \beta \left( C_{t-1} - \sum _{i=0}^{n-1} D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right) \right] \\{} & {} + 2\frac{\eta _t}{n} \left[ (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \right] , \end{aligned}$$
where in the last equation we use the definition of \(C_t\). Taking expectation, we get
$$\begin{aligned}{} & {} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] \\{} & {} \ge \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2 \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} \quad + 2\frac{\eta _t}{n} \left[ \beta \left( \mathbb {E}[C_{t-1}] - \sum _{i=0}^{n-1} \mathbb {E}\left[ D_{i}^{(t-1)}\left( w_{n}^{(t)}; w_{i}^{(t-1)}\right) \right] \right) \right] \\{} & {} \quad + 2\frac{\eta _t}{n} \left[ (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \right] \\{} & {} \overset{\mathrm{(a)}}{\ge }\ \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 \mathbb {E}\left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} \quad + 2\frac{\eta _t}{n} F_t - 2\beta \frac{\eta _t}{n}\left[ 4L^2 \eta _t^2 F_t + 4L^2\eta _{t-1}^2 F_{t-1} + \frac{2}{3} \eta _{t-1}^2(1-\beta ) L\sigma ^2 \right] \\{} & {} \quad - 2(1-\beta ) \frac{\eta _t}{n}\left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] \end{aligned}$$
$$\begin{aligned}{} & {} \ge \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2 \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] + \frac{2\eta _t}{n} F_t\\{} & {} \quad - \frac{2\eta _t}{n} \cdot 4 L^2 \eta _t^2 F_t - \frac{2\eta _t}{n} \cdot 4\beta L^2\eta _{t-1}^2 F_{t-1}\\{} & {} \quad - \frac{4\eta _t}{3n} (1-\beta ) L\sigma ^2 \left( \beta \eta _{t-1}^2 +(1-\beta )\eta _t^2 \right) , \end{aligned}$$
where (a) comes from the definition (7) of \(F_t\) and the results of Lemmas 4 and 5 for \(t \ge 2\). Note that \(w_{n}^{(t)} = \tilde{w}_{t}\), \(\eta _t \le \frac{1}{2L\sqrt{K}}\) and \(4 L^2 \eta _t^2 \le \frac{1}{K}\), we further have
$$\begin{aligned}{} & {} 2 \eta _t \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right] \le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t\\{} & {} \qquad + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$
Now we consider the case \(t=1\). From the update (A.5), we have the following for \(t=1\):
$$\begin{aligned}{} & {} \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \nonumber \\{} & {} = \Vert w_{0}^{(t)} - w_*\Vert ^2\nonumber \\{} & {} = \left\| w_{n}^{(t)} + \frac{\eta _t}{n} \sum _{i=0}^{n-1} (1-\beta ) g_{i}^{(t)} - w_*\right\| ^2\\{} & {} = \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1}(1-\beta )g_{i}^{(t)} + \left\| \frac{\eta _t}{n} \sum _{i=0}^{n-1} \left( (1-\beta )g_{i}^{(t)} \right) \right\| ^2\\{} & {} \ge \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \left( w_{n}^{(t)} - w_*\right) ^\top \sum _{i=0}^{n-1}(1-\beta )g_{i}^{(t)}\\{} & {} = \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ (1-\beta ) \left( w_{n}^{(t)} - w_*\right) ^\top g_{i}^{(t)} \right] \\{} & {} = \left\| w_{n}^{(t)} - w_*\right\| ^2\\{} & {} \quad + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ (1-\beta ) \left( f_{i}^{(t)} ( w_{n}^{(t)} )- f_{i}^{(t)}( w_*) + D_{i}^{(t)}(w_*; w_{i}^{(t)}) - D_{i}^{(t)}(w_{n}^{(t)}; w_{i}^{(t)}) \right) \right] , \end{aligned}$$
where the last line follows from the following inequality:
$$ (w_1 - w_2)^\top \nabla f_{i}^{(t)} (w_3) = f_{i}^{(t)}(w_1) - f_{i}^{(t)}(w_2) + D_{i}^{(t)} (w_2 - w_3) - D_{i}^{(t)}(w_1 - w_3). $$
Now note that \(\sum _{i=0}^{n-1} f_{i}^{(t)} (\cdot ) = F (\cdot )\), we have
$$\begin{aligned} \Vert \tilde{w}_{t-1} - w_*\Vert ^2\ge & {} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2(1-\beta ) \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\{} & {} + 2\frac{\eta _t}{n} \sum _{i=0}^{n-1} \left[ (1-\beta ) \left( D_{i}^{(t)}\left( w_*; w_{i}^{(t)}\right) - D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \right] \\\overset{6}{=} & {} \left\| w_{n}^{(t)} - w_*\right\| ^2 + 2 \eta _t (1-\beta ) \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \\{} & {} + 2\frac{\eta _t}{n} \left[ (1-\beta ) \left( C_t- \sum _{i=0}^{n-1}D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right) \right] . \end{aligned}$$
Taking the full expectation, we get
$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right]\ge & {} \mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta ) \mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} + 2\frac{\eta _t}{n} \left[ (1-\beta ) \left( \mathbb {E}[C_t]- \sum _{i=0}^{n-1}\mathbb {E}\left[ D_{i}^{(t)}\left( w_{n}^{(t)}; w_{i}^{(t)}\right) \right] \right) \right] \\&\overset{\mathrm{(a)}}{\ge }\ {}&\mathbb {E}\left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} + 2\frac{\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2\right] \\\ge & {} \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2 \right] + 2 (1-\beta )\mathbb {E}\left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} + 2\frac{\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ 4 L^2\eta _t^2 F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2 \right] , \end{aligned}$$
where (a) comes from the definition (7) of \(F_t\) and the results of Lemma 4 for \(t \ge 1\).
$$\begin{aligned} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right]\ge & {} \mathbb {E} \left[ \left\| w_{n}^{(t)} - w_*\right\| ^2\right] + 2(1-\beta )\mathbb {E} \left[ \eta _t \left[ F\left( w_{n}^{(t)}\right) - F(w_*)\right] \right] \\{} & {} + \frac{2\eta _t}{n} F_t - 2(1-\beta ) \frac{\eta _t}{n} \left[ K F_t + \frac{2}{3}\eta _t^2 (1-\beta )L\sigma ^2 \right] . \end{aligned}$$
Finally, we have
$$\begin{aligned} 2\eta _t(1-\beta )\mathbb {E} \left[ F( \tilde{w}_{t} ) - F(w_*) \right]\le & {} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t\\{} & {} + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3. \end{aligned}$$
This completes our proof.\(\square \)
1.2 C.2 Proof of Theorem 2: Strongly Convex Objectives
Proof
From the results of Lemma 1, for \(\eta _t \le \frac{1}{2L\sqrt{K}}\) we have
$$\begin{aligned} 2 \eta _t \mathbb {E}\left[ F( \tilde{w}_{t}) - F(w_*)\right]\le & {} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t \\{} & {} + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2\\{} & {} + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t \ge 2,\\ 2 \eta _t(1-\beta )\mathbb {E}\left[ F( \tilde{w}_{t} ) - F(w_*) \right]\le & {} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2 \right] - \frac{2\eta _t}{n} F_t\\{} & {} + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3,\quad t = 1. \end{aligned}$$
Since F is \(\mu \)-strongly convex, i.e. \(F(w) - F(w_*) \ge \frac{\mu }{2} \Vert w - w_*\Vert ^2\), for \(\forall w \in \mathbb {R}^d\), we have the following:
$$\begin{aligned} \mu \eta _t \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right]\le & {} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t\\{} & {} + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n}\beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2}{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2\\{} & {} + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 ,\quad t \ge 2,\\ \mu \eta _t (1-\beta )\mathbb {E} \left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right]\le & {} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \mathbb {E} \left[ \left\| \tilde{w}_{t} - w_*\right\| ^2\right] - \frac{2\eta _t}{n} F_t \\{} & {} + (1-\beta ) \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$
Hence, one has
$$\begin{aligned} (1+\mu \eta _t)\mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right]\le & {} \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t\\{} & {} + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2\\{} & {} + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \quad t \ge 2,\\ (1+\mu \eta _t (1-\beta ))\mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2 \right]\le & {} \mathbb {E}\left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t\\{} & {} + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3, \quad t = 1. \end{aligned}$$
Now we adopt the following conventional notations:
$$\begin{aligned} \phi _t := \left\{ \begin{array}{ll} 1+\mu \eta _t &{}\quad \text {if } t \ge 2,\\ 1+\mu \eta _t (1-\beta ) &{}\quad \text {if } t = 1 \end{array}\right. \end{aligned}$$
and for every \(t\ge 1\):
$$\begin{aligned} H_t= - \frac{2\eta _t}{n} F_t + \frac{2\eta _t}{n} \frac{1}{K} F_t + \frac{2\eta _t}{n} \beta \frac{1}{K} F_{t-1} + \frac{4L\sigma ^2 }{3n} \beta (1-\beta ) \eta _t\eta _{t-1}^2 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _t^3 , \end{aligned}$$
where we use the convention that \(F_0 = 0\). Hence, we have the following recursive equation for all \(t\ge 1\):
$$ \phi _t\mathbb {E}\left[ \left\| \tilde{w}_{t}- w_*\right\| ^2\right] \le \mathbb {E} \left[ \Vert \tilde{w}_{t-1} - w_*\Vert ^2 \right] + H_t. $$
Unrolling this recursive equation, we have
$$\begin{aligned} \prod _{t=1}^{T} \phi _t\mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right]\le & {} \prod _{t=1}^{T-1} \phi _t \left( \mathbb {E}\left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + H_{T} \right) \\\le & {} \prod _{t=1}^{T-1} \phi _t \mathbb {E} \left[ \left\| \tilde{w}_{T-1}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-1} \phi _t H_{T}\\\le & {} \prod _{t=1}^{T-2} \phi _t \mathbb {E} \left[ \left\| \tilde{w}_{T-2}- w_*\right\| ^2 \right] + \prod _{t=1}^{T-2} H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T}\\\le & {} \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + H_{1} + \phi _1 H_2 + \prod _{t=1}^{T-2}\phi _t H_{T-1}+ \prod _{t=1}^{T-1} \phi _t H_{T}\\= & {} \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t , \end{aligned}$$
where
$$ H_j = \frac{2}{n} \left( \frac{1}{K} -1\right) \eta _j F_j + \frac{2}{n} \alpha \beta \frac{1}{K} \eta _{j-1}F_{j-1} + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \eta _{j-1}^3 + \frac{4L\sigma ^2}{3n} (1-\beta )^2\eta _j^3 . $$
We bound the last term:
$$\begin{aligned} \sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t= & {} \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=1}^T \eta _{j-1}F_{j-1}\prod _{t=1}^{j-1} \phi _t\\{} & {} + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta ) \sum _{j=1}^T \eta _{j-1}^3 \prod _{t=1}^{j-1}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \\= & {} \frac{2}{n} \left( \frac{1}{K} -1\right) \sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t + \frac{2}{n} \alpha \beta \frac{1}{K} \sum _{j=0}^{T-1} \eta _{j}F_{j}\prod _{t=1}^{j} \phi _t\\{} & {} + \frac{4L\sigma ^2 }{3n} \alpha \beta (1-\beta )\sum _{j=0}^{T-1} \eta _{j}^3 \prod _{t=1}^{j}\phi _t + \frac{4L\sigma ^2}{3n} (1-\beta )^2 \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t\\\le & {} \frac{2}{n}\sum _{j=1}^T \eta _j F_j\prod _{t=1}^{j-1} \phi _t \left( \frac{1}{K} -1 + \alpha \beta \frac{1}{K} \phi _j \right) \\{} & {} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( (1-\beta )^2 + \alpha \beta (1-\beta ) \phi _j \right) \\\le & {} \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) , \end{aligned}$$
where we use the fact that \(K = 1 + \alpha \beta (1 +\mu \max _t \eta _t)\). Plugging the last bound to the previous estimate we get:
$$\begin{aligned}{} & {} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] \\{} & {} \le \frac{1}{\prod _{t=1}^{T} \phi _t} \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{1}{\prod _{t=1}^{T} \phi _t}\sum _{j=1}^T H_{j}\prod _{t=1}^{j-1} \phi _t\\{} & {} \le \frac{1}{\prod _{t=1}^{T} \phi _t} \mathbb {E} \left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\frac{\prod _{t=1}^{j-1} \phi _t \left( 1-\beta + \alpha \beta \phi _j \right) }{\prod _{t=1}^{T} \phi _t}\\{} & {} \le \frac{1}{\prod _{t=1}^{T} \phi _t} \mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] + \frac{4L\sigma ^2(1-\beta )}{3n} \sum _{j=1}^T \eta _j^3\frac{(1-\beta + \alpha \beta \phi _j)}{\prod _{t=j}^{T} \phi _t}. \end{aligned}$$
We use the fact that \(\frac{1}{\phi _1} = \frac{1}{1 + (1-\beta ) \mu \eta _1} \le \frac{1}{(1-\beta )(1 + \mu \eta _1)} \) and have
$$\begin{aligned}{} & {} \mathbb {E}\left[ \left\| \tilde{w}_{T}- w_*\right\| ^2\right] \\{} & {} \le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1+(1-\beta )\mu \eta _1)\prod _{t=1}^{T-1} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\frac{(1-\beta + \alpha \beta (1 + \mu \max _t \eta _t))}{\prod _{t=j}^{T} (1+\mu \eta _t)}\\{} & {} \le \frac{\mathbb {E}\left[ \left\| \tilde{w}_{0}- w_*\right\| ^2 \right] }{(1-\beta ) \prod _{t=1}^{T} (1+\mu \eta _t)} + \frac{4L\sigma ^2}{3n} \sum _{j=1}^T \eta _j^3\frac{(1-\beta + \alpha \beta (1 + \mu \max _t \eta _t))}{\prod _{t=j}^{T} (1+\mu \eta _t)}. \end{aligned}$$
This completes our proof.\(\square \)