Abstract
The randomized Kaczmarz, Motzkin and Sampling Kaczmarz Motzkin (SKM) algorithms are commonly used iterative techniques for solving systems of linear inequalities (i.e., \(Ax \le b\)). As linear systems represent a modeling paradigm for many optimization problems, these randomized and iterative techniques are gaining popularity among researchers in different domains. In this work, we propose a Generalized Sampling Kaczmarz Motzkin (GSKM) method that unifies these iterative methods in a single framework. In addition to the general framework, we propose a Nesterov-type acceleration scheme for the SKM method, called Probably Accelerated Sampling Kaczmarz Motzkin (PASKM). We prove convergence theorems for both the GSKM and PASKM algorithms from the \(L_2\)-norm perspective with respect to the proposed sampling distribution. Furthermore, we prove sub-linear convergence of the Cesàro average of the iterates for both proposed algorithms. From the convergence theorem of the GSKM algorithm, we recover the convergence results of several well-known algorithms such as the Kaczmarz method, the Motzkin method and the SKM algorithm. We perform thorough numerical experiments using both randomly generated and real-world (support vector machine classification and Netlib LP) test instances to demonstrate the efficiency of the proposed methods. We compare the proposed algorithms with the SKM, interior point and active set methods in terms of computation time and solution quality. In the majority of the problem instances, the proposed generalized and accelerated algorithms significantly outperform the state-of-the-art methods.
Notes
We use the notation \((Ax-b)^+_{\underline{\mathbf {i_j}}}\) throughout the paper to express the underlying expectation, where the indices \(\underline{\mathbf {i_j}}\) represent the sampling process of (5).
For ease of notation, throughout the paper, we will use \({\mathbb {S}}_k\) to denote the sampling distribution corresponding to any random iterate \(x_k \in \mathbb {R}^n\).
Similarly, we will use \(\tau _k \sim {\mathbb {S}}_k\) to denote the sampled set and \(i^* = {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} \{a_i^Tx_k-b_i, 0\} \ = \ {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} (A_{\tau _k}x_k-b_{\tau _k})^+_i\) for any iterate \(x_k \in \mathbb {R}^n\).
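For illustration, here is a minimal NumPy sketch (not the authors' implementation) of how this sampled greedy selection is typically realized in one SKM-type step. The function name skm_step, the uniform sampling of \(\beta \) rows without replacement, the row-normalization assumption \(\Vert a_i\Vert = 1\) and the fixed projection parameter delta are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def skm_step(A, b, x, beta, delta=1.0, rng=None):
    """One SKM-type step for the system A x <= b: sample beta rows
    uniformly at random, pick the most violated sampled constraint i*,
    and take a relaxed projection toward its half-space.
    Rows of A are assumed to be normalized (||a_i|| = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.choice(A.shape[0], size=beta, replace=False)  # sampled set tau_k ~ S_k
    residual = np.maximum(A[tau] @ x - b[tau], 0.0)         # (A_tau x_k - b_tau)^+
    j = int(np.argmax(residual))
    i_star = tau[j]                                         # greedy (Motzkin) index i*
    return x - delta * residual[j] * A[i_star]              # relaxed projection step
```

In this sketch, \(\beta = 1\) corresponds to a randomized Kaczmarz-type selection, while \(\beta = m\) recovers the Motzkin (maximum-violation) rule.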
When \(\delta = 2\), we have \(h(\delta ) = 1-\eta \mu _1 = 1\). In that case, the condition of (29) simplifies to \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\). In other words, for \(\delta =2\) the PASKM algorithm converges if we select the parameters as \(0 \le \gamma < 2\), \(0 \le \alpha \le 1\), \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\) and \( \mu _1 = \lambda _{\min }^+(A^TA)/m\).
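This simplification is a one-line check using the definition \(\eta = 2\delta - \delta ^2\) that appears in the appendix proofs:
\[
\delta = 2 \ \Longrightarrow \ \eta = 2\delta - \delta ^2 = 0 \ \Longrightarrow \ h(\delta ) = 1-\eta \mu _1 = 1 .
\]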
Note that there are sudden jumps in the FSC graphs whenever the FSC crosses approximately 0.6. This phenomenon is caused by the stopping criterion of our experiments: we halt the algorithm whenever the residual error satisfies the threshold \(10^{-5}\), which produces the sudden jumps. One can run the algorithm with a smaller threshold to mitigate this.
The credit card data matrix has 30,000 rows. From our earlier experiments, we observe that for \(1 < \beta \le 100\) the proposed algorithms produce the best performance. For that reason, we plot the credit card graph up to \(\beta = 2000\). The irregularity of the credit card graph occurs when \(\beta > 2000\).
We note the best possible time from our previous experiments.
Note that, since \(\varGamma _1 \varGamma _2 \le 0\) and \(\varGamma _1 \varGamma _3 \le 1 \le \sqrt{(1+\phi )}\), from Theorem 4 we have \({{\,\mathrm{\mathbb {E}}\,}}[d(x_{k+1},P)] \le \sqrt{(1+\phi )} \ \rho _2^k \ d(x_0,P)\) for any \(\xi \in Q_2\).
Acknowledgements
The authors are truly grateful to the anonymous referees and the editors for their valuable comments and suggestions on an earlier version of the paper. The comments helped immensely in the revision process and greatly improved the quality of this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
Proof of Lemma 5
Using the expectation expression given in the first section we have
Here, the notation \((A^TA)_{\underline{\mathbf {i_j}}}\) denotes the matrix \(a_{i_j} a_{i_j}^T\) where the index \(i_j\) belongs to the list (5). Furthermore, the index l corresponds to the \((\beta + j)^{th}\) entry on the list (5). This proves the Lemma. \(\square \)
Proof of Lemma 6
Using the definition of the expectation from (7), we have,
Here, s is the number of zero entries in the residual \((Ax - b)^+\), which also corresponds to the number of satisfied constraints for x. Since \(A {\mathcal {P}} (x) \le b\), we have the following:
Combining the above identities and using the expression for f(x) from (9) we get
which proves the Lemma. \(\square \)
Proof of Lemma 7
Considering the definitions of f(x) and g(y), we have
where \(0 \le p_i \le 1\) denote the corresponding probabilities for all i. Here, in the third inequality we used the relation \(2ab \le a^2 + b^2\). This completes the proof. \(\square \)
Proof of Lemma 8
Since \({\bar{y}} \in P\), from the definition we have,
Here, we used the identity \(xx^+ = |x^+|^2\), which holds componentwise since each entry of \(x^+\) is either zero or equal to the corresponding entry of x. This proves the Lemma. \(\square \)
Proof of Lemma 9
Since \({\mathcal {P}}(x) \in P\), we have
Here, we used the lower bound of the expected value from Lemma 6. \(\square \)
Proof of Theorem 1
Since \(\phi _1, \phi _2 \ge 0\), the largest root \(\phi \) of the equation \(\phi ^2 + \phi _1 \phi - \phi _2 = 0\) can be written as
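By the quadratic formula applied to \(\phi ^2 + \phi _1 \phi - \phi _2 = 0\), this largest root is
\[
\phi \ = \ \frac{-\phi _1 + \sqrt{\phi _1^2+4\phi _2}}{2} \ \ge \ 0 .
\]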
Then, using the given recurrence, we have,
This proves the first part. Also note that since \(\phi _1 + \phi _2 < 1\), we have,
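In particular, \(\phi _1 + \phi _2 < 1\) together with \(\phi _1, \phi _2 \ge 0\) implies \(\phi < 1\), since
\[
\phi< 1 \ \Longleftrightarrow \ \sqrt{\phi _1^2+4\phi _2} < 2+\phi _1 \ \Longleftrightarrow \ \phi _2 < 1+\phi _1,
\]
and \(\phi _2< 1-\phi _1 \le 1+\phi _1\).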
For the second part, notice that from the recurrence inequality, we can deduce the following matrix inequality:
The Jordan decomposition of the matrix in the above expression is given by
Here we substituted \(\phi _2 = \phi (\phi +\phi _1)\) in the Jordan decomposition of Eq. (35). Next, we discuss two possible cases of values of k.
Case 1: k even
Here, we used \(G_0 = G_1\).
Case 2: k odd
Here, we used the inequality \(G_2 \le \phi _1 G_1 + \phi _2 G_0\). Now, combining the relations from Eq. (36) and (37), we can prove the second part of Theorem 1. \(\square \)
Proof of Theorem 2
From the given recurrence relation, we have
Using the definitions of (26), we can write the Jordan decomposition of the above matrix as follows:
Now, substituting the matrix decomposition into Eq. (38) and simplifying, we have
Since \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\), one can easily verify that \(\varGamma _1, \varGamma _3 \ge 0\). Now, it remains to show that \( 0 \le \rho _1 \le \rho _2 < 1\). To show that, first note that
Now we have,
Moreover, since \(2-\varPi _1 - \varPi _4 \ge 0 \), we have
As \(0 \le \rho _1 \le \rho _2 < 1\), considering (40) we can deduce that the sequences \(\{H_k\}\) and \( \{F_k\}\) converge. \(\square \)
Appendix 2
Proof of Theorem 3
From the update formula of Algorithm 2, we have \(z_{k} = x_k - (A_{\tau _k}x_k - b_{\tau _k})^+_{i^*} a_{i^*}\) where,
Similarly, the previous update formula can be written as \(z_{k-1} = x_{k-1} - (A_{\tau _{k-1}}x_{k-1} - b_{\tau _{k-1}})^+_{j^*} a_{j^*}\), where
Note that the notation is consistent with the definition of (6). Since for any \(\xi \in Q_1\), \( (1-\xi ) {\mathcal {P}}(x_{k}) + \xi {\mathcal {P}}(x_{k-1}) \in P\), we have,
We used the fact that the function \(\Vert \cdot \Vert ^2\) is convex and \(0 \le \xi \le 1\). Now, taking expectation on both sides of Eq. (44) and using Lemma 9, we get the following:
where \(h(\delta )\) is defined in Lemma 9. Taking expectation again in Eq. (45) and letting \(G_{k+1} = {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k+1},P)^2\right] \), we get the following:
Since \(\phi _1 , \phi _2 \ge 0\), \(\phi _1+ \phi _2 < 1\) and \(z_0 = z_1\), using the first part of Theorem 1, we have the following:
Moreover, considering (47) with Lemma 6, we get the bound on \({{\,\mathrm{\mathbb {E}}\,}}[f(x_k)]\), which proves the first part of Theorem 3. Furthermore, using the second part of Theorem 1 and Eq. (46), we get the second part of Theorem 3. Now, to prove the third part, first note that \(\frac{1}{k} \sum \limits _{l=1}^{k} {\mathcal {P}}(x_l) \in P\); then, using Lemma 3, we have
Here we use the geometric series bound along with the fact that \(0< \rho < 1\). Furthermore, using a simplified version of (44), we have the following:
for any \(l \ge 1\). Summing up the above identity for \(l =1,2,\ldots ,k\), we have the following:
where \(\eta = 2\delta - \delta ^2 = 1-(1-\delta )^2 \le 1\). Also, from Lemma 6, we get \(\mu _2 = \min \{1, \frac{\beta }{m} \lambda _{\max }\} \le 1\), which implies that \(\eta \mu _2 \le 1\) holds. Moreover, we used the non-negativity of the sequences \(G_k\) and \( f(x_k)\), as well as the upper bound from Lemma 6. Then we get
This proves the third part of Theorem 3. \(\square \)
Proof of Theorem 5
For any natural number \(l \ge 1\), define \(\vartheta _l = \frac{\xi }{1+\xi }[x_{l-1}-x_l-\delta (a_{j^*}^Tx_{l-1}-b_{j^*})^+a_{j^*}]\), \( \ \varDelta _l = x_l + \vartheta _l\) and \(\chi _l = \Vert x_l+\vartheta _l-{\mathcal {P}}(\varDelta _l)\Vert ^2\). Then, using the update formulas (10) and (11), we have
here, the indices \(i^{*}\) and \(j^*\) are defined based on (6) for the iterates \(x_l\) and \(x_{l-1}\), respectively. Using the above relation, we can write
Taking expectation with respect to \({\mathbb {S}}_l\) we have,
Similarly, we can simplify the third term of (50) as
Using the expressions of Eqs. (51) and (52) in (50) and simplifying further, we have
here, the constant \(\varpi \) is defined as
Now, taking expectation again in (53) and using the tower property, we get
where \(q_l = {{\,\mathrm{\mathbb {E}}\,}}[\chi _l] - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(x_{l-1})]\). Summing up (55) for \(l=1,2,\ldots ,k\) we get
Now, using Jensen’s inequality, we have
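Assuming \(\bar{x}_k\) denotes the Cesàro average \(\frac{1}{k}\sum _{l=1}^{k} x_l\) and using the convexity of f, the Jensen step reads
\[
{{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x}_k)\right] \ = \ {{\,\mathrm{\mathbb {E}}\,}}\left[ f\left( \tfrac{1}{k}\textstyle \sum _{l=1}^{k} x_l\right) \right] \ \le \ \frac{1}{k}\sum _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}\left[ f(x_l)\right] .
\]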
Since \(x_0=x_1\), we have \(\vartheta _1 = \frac{-\delta \xi }{1+\xi } (a_{i^*}^Tx_0-b_{i^*})^+ a_{i^*} \). Furthermore,
Now, from our construction we get
Substituting the values of \(\varpi \) and \(q_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x_k})\right] \), we have the following:
\(\square \)
Proof of Theorem 4
Since the term \(\Vert x_{k+1}- {\mathcal {P}}(x_{k+1}) \Vert \) is constant under the expectation with respect to \({\mathbb {S}}_{k+1}\), from the update formula of the GSKM algorithm, we get
We performed the two expectations in order, from the innermost to the outermost. Now, taking expectation in (58) and using the tower property of expectation, we have,
Similarly, using the update formula for \(z_{k+1}\), we have
Taking expectation in (60) and using (59) along with the tower property, we have
Combining both (59) and (61), we can deduce the following matrix inequality:
Now, from the definition, it can be easily checked that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Since \(\xi \in Q_2\), we have
Also, we have
Here, in the last inequality we used the given condition. Considering (64), we can check that \(\varPi _1 + \varPi _4 < 1+|\xi | \sqrt{h(\delta )} = 1+ \min \{1, |\xi | \sqrt{h(\delta )}\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also from (63), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition provided in (25). Let’s define the sequences \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ]\) and \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_k-{\mathcal {P}}(x_k)\Vert ]\). Now, using Theorem 2, we have
where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the parameter choice of (28). Note that from the GSKM algorithm we have \(x_1 = x_0\) and \(z_1 = z_0\). Therefore, we can easily check that \(F_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_1-z_{0}\Vert ] = 0\) and \(H_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_1-{\mathcal {P}}(x_1)\Vert ] = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_0-{\mathcal {P}}(x_0)\Vert ] = \Vert x_0-{\mathcal {P}}(x_0)\Vert = H_0 \). Now, substituting the values of \(H_1\) and \(F_1\) in (65), we have
Also from Theorem 2 we have \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the Theorem. \(\square \)
Proof of Theorem 6
Note that, since \(Ax \le b\) is feasible, from Lemma 12 we know that there is a feasible solution \(x^*\) with \(|x^{*}_j| \le \frac{2^{\sigma }}{2n}\) for \(j = 1, \ldots , n\). Thus, we have
as \(x_0 = 0\). Then if the system \(Ax \le b\) is infeasible, by using Lemma 10, we have
This implies that when GSKM runs on the system \(Ax \le b\), the system is feasible whenever \(\theta (x) < 2^{1-\sigma }\). Furthermore, since every point of the feasible region P is inside the half-space defined by \(\tilde{H}_i = \{x \ | \ a_i^T x \le b_i\}\) for all \(i = 1,2,\ldots ,m\), we have the following:
Then, for \(\xi \in Q_1\) whenever the system \(Ax \le b\) is feasible, we have
Similarly, for \(\xi \in Q_2\), whenever the system \(Ax \le b\) is feasible, we have (see footnote 13)
Take \({\bar{\rho }} = \max \{\rho , \rho _2^2\}\). Now, combining (69) and (70), for any \(\xi \in Q = Q_1 \cup Q_2\), whenever the system \(Ax \le b\) is feasible, we have
Here, we used Theorems 3 and 4 and the identities from Eqs. (67) and (68). Now, for detecting feasibility, we need to have \({{\,\mathrm{\mathbb {E}}\,}}[\theta (x_k)] < 2^{1-\sigma }\). That gives us
Simplifying the above identity further, we get the following lower bound for k:
Moreover, if the system \(Ax \le b\) is feasible, then the probability of not having a certificate of feasibility is bounded as follows:
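Presumably this is the direct application of Markov's inequality with threshold \(2^{1-\sigma }\), i.e.,
\[
{\mathbb {P}}\left( \theta (x_k) \ge 2^{1-\sigma }\right) \ \le \ \frac{{{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] }{2^{1-\sigma }} \ = \ 2^{\sigma -1} \, {{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] ,
\]
combined with the bound on \({{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] \) obtained above.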
Here, we used Markov's inequality \({\mathbb {P}} (x \ge t) \le \frac{{{\,\mathrm{\mathbb {E}}\,}}[x]}{t}\). This completes the proof of Theorem 6. \(\square \)
Appendix 3
Proof of Theorem 7
From the update formula of the PASKM algorithm, we get
Here, we used the condition \(\gamma + 3 \omega -2 \le 0\). Similarly, using the update formula for \(y_{k+1}\), we have
Following Theorem 2, let us define the sequences \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert v_k-{\mathcal {P}}(v_k)\Vert ^2]\) and \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert y_k-{\mathcal {P}}(y_k)\Vert ^2]\). The goal is to prove that \(H_k\) and \(F_k\) satisfy the condition (25). Now, taking expectation in (72) and using the tower property of expectation, we have
Similarly, taking expectation in (73) and using (74) along with the tower property of expectation, we have
Combining both (74) and (75), we can deduce the following matrix inequality:
Here, we use the fact that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Now we will use Theorem 2 to simplify the expression of (76). Before we can use Theorem 2, we need to make sure the sequences \(H_k\) and \(F_k\) satisfy the condition of (25). From the definition, we have
Also, we have
Here, in the last inequality, we used the given condition. Considering (78), we can check that \(\varPi _1 + \varPi _4 < 1+\omega h(\delta ) (1-\alpha ) (1+\gamma ) = 1+ \min \{1, \omega h(\delta ) (1-\alpha ) (1+\gamma )\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also from (77), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition provided in (25). Now, using Theorem 2, we have
where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the given parameters. Note that from the PASKM algorithm we have \(x_0 = v_0 = y_0\). Therefore, we can easily check that \( H_0 = \Vert v_0-{\mathcal {P}}(v_0)\Vert ^2 = \Vert y_0-{\mathcal {P}}(y_0)\Vert ^2 = F_0 \). Now, substituting the values of \(H_0\) and \(F_0\) in (79), we have
Also from Theorem 2 we have \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the first part of the Theorem. Now, considering Lemma 6, we get
Now, substituting the result of (80) in (81), we get the second part of the Theorem. \(\square \)
Proof of Theorem 8
Let us define \({\mathcal {V}} = \omega {\mathcal {P}}(v_{k}) +(1-\omega ) {\mathcal {P}}(y_{k}) \). Since \({\mathcal {V}} \in P\), using the update formula of \(v_{k+1}\) from Eq. (16), we have,
Since \(\Vert \cdot \Vert ^{2}\) is a convex function and \(0< \omega < 1\), we can bound the expected first term as follows:
Taking expectation with respect to the sampling distribution in the second term of Eq. (82) and using Lemma 9 with the choice \(z = x_{k+1}, \ x = y_k\) and \(\eta = 2\delta -\delta ^2\), we get
Now, taking expectation in the third term of (82) we get
Using Lemma 7 and Lemma 8 we can simplify Eq. (85) as follows:
Now, substituting the values of Eqs. (83), (84) and (86) in Eq. (82) we get the following:
With further simplification, the above identity can be written as follows:
Now, let’s choose the parameters as in Eq. (33) along with \(0< \zeta < \frac{4\eta \mu _1}{(1-\mu _1)^2} \). We can easily see that \(\frac{\gamma ^2 }{\eta } = \frac{ \gamma (1-\alpha )}{\alpha }\) and \( \alpha \in (0,1)\). Also note that
which implies \(\omega < 1\). Similarly, whenever \(\mu _1 < 1\) we have
which implies \(\omega > 0\). Also, using the parameter choice of (33), we have
Now, using all of the above relations (Eqs. (33), (90)) in Eq. (87), we get the following:
Finally, taking expectation again with tower rule and substituting \(\frac{\gamma ^2 }{\eta }= \zeta \mu _1 \) we have
This proves the Theorem. Furthermore, for faster convergence, we need to choose the parameters such that \(\omega \) becomes as small as possible. In the proof, we assumed that \(\mu _1 <1\), which is the most probable scenario. Whenever \(\mu _1 = 1\), we must have \(\mu _1 = \mu _2 =1\) and both inequalities in Lemma 6 hold with equality, i.e., \(f(x) = d(x,P)^2\). Therefore, if we choose \(\alpha , \gamma , \omega \) as \(\alpha = \frac{\eta }{\eta + \gamma }, \ \gamma = \sqrt{\zeta \eta }, \ \omega = 1- \frac{2\gamma }{1+\zeta }\), we can check that condition (89) and \(0< \omega < 1\) hold for any \(\zeta > 0\). \(\square \)
Proof of Theorem 9
For any natural number \(l \ge 1\), using the update formula of \(v_{l+1}\), we have
Let \(\varphi = \omega (1-\alpha )\). It can be easily checked that \(0 \le \varphi < 1\). Now, considering Eq. (14), we have
here, the indices \(i^{*}\) and \(j^*\) are defined based on (6) for the iterates \(y_l\) and \(y_{l-1}\), respectively. Furthermore, with the choice of \(x_0 = v_0\), the points \(y_0\) and \(y_1\) generated by the PASKM method (i.e., Algorithm 3 with an arbitrary parameter choice) can be calculated as
since \(y_0 = x_0 = v_0\). Now, let's define \({\bar{\vartheta }}_l = \frac{\varphi }{1-\varphi }[y_{l}-y_{l-1}+\delta (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*}]\), \( {\bar{\varDelta }}_l = y_l + {\bar{\vartheta }}_l\) and \( {\bar{\chi }}_l = \Vert y_l+{\bar{\vartheta }}_l-{\mathcal {P}}({\bar{\varDelta }}_l)\Vert ^2\). Then, using the update formula (93), we have
Using the above relation, we can write
Taking expectation with respect to \({\mathbb {S}}_l\) we have,
Similarly, we can simplify the third term of (95) as
Using the expressions of Eqs. (96) and (97) in (95) and simplifying further, we have
here, the constant \(\varsigma \) is defined as
Now, taking expectation again in (98) and using the tower property, we get
where \({\bar{q}}_l = {{\,\mathrm{\mathbb {E}}\,}}[{\bar{\chi }}_l] + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(y_{l-1})]\). Summing up (100) for \(l=1,2,\ldots ,k\), we get
Now, using Jensen’s inequality, we have
From (94), \(y_1 = y_0 - \delta (1+\varphi ) (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*}\) and \({\bar{\vartheta }}_1 = \frac{-\varphi ^2 \delta }{1-\varphi } (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*} \). Then,
Now, from our construction we get
Substituting the values of \(\varsigma \) and \({\bar{q}}_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{y_k})\right] \), we have the following:
Cite this article
Morshed, M.S., Islam, M.S. & Noor-E-Alam, M. Sampling Kaczmarz-Motzkin method for linear feasibility problems: generalization and acceleration. Math. Program. 194, 719–779 (2022). https://doi.org/10.1007/s10107-021-01649-8
Keywords
- Kaczmarz method
- Randomized projection
- Sampling Kaczmarz Motzkin
- Linear feasibility
- Nesterov’s acceleration
- Iterative methods