Abstract
The randomized Kaczmarz, Motzkin and Sampling Kaczmarz Motzkin (SKM) algorithms are commonly used iterative techniques for solving systems of linear inequalities (i.e., \(Ax \le b\)). As linear systems represent a modeling paradigm for many optimization problems, these randomized and iterative techniques are gaining popularity among researchers in different domains. In this work, we propose a Generalized Sampling Kaczmarz Motzkin (GSKM) method that unifies these iterative methods in a single framework. In addition to the general framework, we propose a Nesterov-type acceleration scheme for the SKM method, called Probably Accelerated Sampling Kaczmarz Motzkin (PASKM). We prove convergence theorems for both the GSKM and PASKM algorithms from the \(L_2\)-norm perspective with respect to the proposed sampling distribution. Furthermore, we prove sub-linear convergence of the Cesàro average of the iterates for both proposed algorithms. From the convergence theorem of the GSKM algorithm, we recover the convergence results of several well-known algorithms such as the Kaczmarz method, the Motzkin method and the SKM algorithm. We perform thorough numerical experiments using both randomly generated and real-world (support vector machine classification and Netlib LP) test instances to demonstrate the efficiency of the proposed methods. We compare the proposed algorithms with the SKM, interior point and active set methods in terms of computation time and solution quality. In the majority of the problem instances, the proposed generalized and accelerated algorithms significantly outperform the state-of-the-art methods.
Notes
We use the notation \((Ax-b)^+_{\underline{\mathbf {i_j}}}\) throughout the paper to express the underlying expectation, where the indices \(\underline{\mathbf {i_j}}\) represent the sampling process of (5).
For ease of notation, throughout the paper, we will use \({\mathbb {S}}_k\) to denote the sampling distribution corresponding to any random iterate \(x_k \in \mathbb {R}^n\).
Similarly, we will use \(\tau _k \sim {\mathbb {S}}_k\) to denote the sampled set and \(i^* = {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} \{a_i^Tx_k-b_i, 0\} \ = \ {{\,\mathrm{arg\,max}\,}}_{i \in \tau _k \sim {\mathbb {S}}_k} (A_{\tau _k}x_k-b_{\tau _k})^+_i\) for any iterate \(x_k \in \mathbb {R}^n\).
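For illustration, here is a minimal NumPy sketch (not the authors' implementation) of how this sampled greedy selection is typically realized in one SKM-type step. The function name skm_step, the uniform sampling of \(\beta \) rows without replacement, the row-normalization assumption \(\Vert a_i\Vert = 1\) and the fixed projection parameter delta are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def skm_step(A, b, x, beta, delta=1.0, rng=None):
    """One SKM-type step for the system A x <= b: sample beta rows
    uniformly at random, pick the most violated sampled constraint i*,
    and take a relaxed projection toward its half-space.
    Rows of A are assumed to be normalized (||a_i|| = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.choice(A.shape[0], size=beta, replace=False)  # sampled set tau_k ~ S_k
    residual = np.maximum(A[tau] @ x - b[tau], 0.0)         # (A_tau x_k - b_tau)^+
    j = int(np.argmax(residual))
    i_star = tau[j]                                         # greedy (Motzkin) index i*
    return x - delta * residual[j] * A[i_star]              # relaxed projection step
```

In this sketch, \(\beta = 1\) corresponds to a randomized Kaczmarz-type selection, while \(\beta = m\) recovers the Motzkin (maximum-violation) rule.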
When \(\delta = 2\), we have \(h(\delta ) = 1-\eta \mu _1 = 1\). In that case, the condition of (29) simplifies to \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\). In other words, for \(\delta =2\) the PASKM algorithm converges if we select the parameters as \(0 \le \gamma < 2\), \(0 \le \alpha \le 1\), \(\omega < \frac{\mu _1(2-\gamma )}{3\mu _1+1}\) and \( \mu _1 = \lambda _{\min }^+(A^TA)/m\).
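This simplification is a one-line check using the definition \(\eta = 2\delta - \delta ^2\) that appears in the appendix proofs:
\[
\delta = 2 \ \Longrightarrow \ \eta = 2\delta - \delta ^2 = 0 \ \Longrightarrow \ h(\delta ) = 1-\eta \mu _1 = 1 .
\]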
Note that there are sudden jumps in the FSC graphs whenever the FSC crosses approximately 0.6. This phenomenon is caused by the stopping criterion of our experiments: we halt the algorithm whenever the residual error satisfies the threshold \(10^{-5}\), which produces the sudden jumps. One can run the algorithm with a smaller threshold to mitigate this.
The credit card data matrix has 30,000 rows. From our earlier experiments, we observe that for \(1 < \beta \le 100\) the proposed algorithms produce the best performance. For that reason, we plot the credit card graph up to \(\beta = 2000\). The irregularity of the credit card graph occurs when \(\beta > 2000\).
We note the best possible time from our previous experiments.
Note that, since \(\varGamma _1 \varGamma _2 \le 0\) and \(\varGamma _1 \varGamma _3 \le 1 \le \sqrt{(1+\phi )}\), from Theorem 4 we have \({{\,\mathrm{\mathbb {E}}\,}}[d(x_{k+1},P)] \le \sqrt{(1+\phi )} \ \rho _2^k \ d(x_0,P)\) for any \(\xi \in Q_2\).
Acknowledgements
The authors are truly grateful to the anonymous referees and the editors for their valuable comments and suggestions on an earlier version of the paper. The comments helped immensely in the revision process and greatly improved the quality of this paper.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix 1
Proof of Lemma 5
Using the expectation expression given in the first section we have
Here, the notation \((A^TA)_{\underline{\mathbf {i_j}}}\) denotes the matrix \(a_{i_j} a_{i_j}^T\) where the index \(i_j\) belongs to the list (5). Furthermore, the index l corresponds to the \((\beta + j)^{th}\) entry on the list (5). This proves the Lemma. \(\square \)
Proof of Lemma 6
Using the definition of the expectation from (7), we have,
Here, s is the number of zero entries in the residual \((Ax - b)^+\), which also corresponds to the number of satisfied constraints for x. Since \(A {\mathcal {P}} (x) \le b\), we have the following:
Combining the above identities and using the expression for f(x) from (9) we get
which proves the Lemma. \(\square \)
Proof of Lemma 7
Considering the definitions of f(x) and g(y), we have
where \(0 \le p_i \le 1\) denote the corresponding probabilities for all i. Here, in the third inequality we used the relation \(2ab \le a^2 + b^2\). This completes the proof. \(\square \)
Proof of Lemma 8
Since \({\bar{y}} \in P\), from the definition we have,
Here, we used the identity \(xx^+ = |x^+|^2\), which holds componentwise since each entry of \(x^+\) is either zero or equal to the corresponding entry of x. This proves the Lemma. \(\square \)
Proof of Lemma 9
Since \({\mathcal {P}}(x) \in P\), we have
Here, we used the lower bound of the expected value from Lemma 6. \(\square \)
Proof of Theorem 1
Since \(\phi _1, \phi _2 \ge 0\), the largest root \(\phi \) of the equation \(\phi ^2 + \phi _1 \phi - \phi _2 = 0\) can be written as
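By the quadratic formula applied to \(\phi ^2 + \phi _1 \phi - \phi _2 = 0\), this largest root is
\[
\phi \ = \ \frac{-\phi _1 + \sqrt{\phi _1^2+4\phi _2}}{2} \ \ge \ 0 .
\]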
Then, using the given recurrence, we have,
This proves the first part. Also note that since \(\phi _1 + \phi _2 < 1\), we have,
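In particular, \(\phi _1 + \phi _2 < 1\) together with \(\phi _1, \phi _2 \ge 0\) implies \(\phi < 1\), since
\[
\phi< 1 \ \Longleftrightarrow \ \sqrt{\phi _1^2+4\phi _2} < 2+\phi _1 \ \Longleftrightarrow \ \phi _2 < 1+\phi _1,
\]
and \(\phi _2< 1-\phi _1 \le 1+\phi _1\).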
For the second part, notice that from the recurrence inequality, we can deduce the following matrix inequality:
The Jordan decomposition of the matrix in the above expression is given by
Here we substituted \(\phi _2 = \phi (\phi +\phi _1)\) in the Jordan decomposition of Eq. (35). Next, we discuss two possible cases of values of k.
Case 1: k even
Here, we used \(G_0 = G_1\).
Case 2: k odd
Here, we used the inequality \(G_2 \le \phi _1 G_1 + \phi _2 G_0\). Now, combining the relations from Eq. (36) and (37), we can prove the second part of Theorem 1. \(\square \)
Proof of Theorem 2
From the given recurrence relation, we have
Using the definitions of (26), we can write the Jordan decomposition of the above matrix as follows:
Now, substituting the matrix decomposition into Eq. (38) and simplifying, we have
Since \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\), one can easily verify that \(\varGamma _1, \varGamma _3 \ge 0\). Now, it remains to show that \( 0 \le \rho _1 \le \rho _2 < 1\). To show that, first note that
Now we have,
Moreover, since \(2-\varPi _1 - \varPi _4 \ge 0 \), we have
As \(0 \le \rho _1 \le \rho _2 < 1\), considering (40) we can deduce that the sequences \(\{H_k\}\) and \( \{F_k\}\) converge. \(\square \)
Appendix 2
Proof of Theorem 3
From the update formula of Algorithm 2, we have \(z_{k} = x_k - (A_{\tau _k}x_k - b_{\tau _k})^+_{i^*} a_{i^*}\) where,
Similarly, the previous update formula can be written as \(z_{k-1} = x_{k-1} - (A_{\tau _{k-1}}x_{k-1} - b_{\tau _{k-1}})^+_{j^*} a_{j^*}\), where
Note that the notation is consistent with the definition of (6). Since for any \(\xi \in Q_1\), \( (1-\xi ) {\mathcal {P}}(x_{k}) + \xi {\mathcal {P}}(x_{k-1}) \in P\), we have,
We used the fact that the function \(\Vert \cdot \Vert ^2\) is convex and \(0 \le \xi \le 1\). Now, taking expectation on both sides of Eq. (44) and using Lemma 9, we get the following:
where \(h(\delta )\) is defined in Lemma 9. Taking expectation again in Eq. (45) and letting \(G_{k+1} = {{\,\mathrm{\mathbb {E}}\,}}\left[ d(x_{k+1},P)^2\right] \), we get the following:
Since \(\phi _1 , \phi _2 \ge 0\), \(\phi _1+ \phi _2 < 1\) and \(z_0 = z_1\), using the first part of Theorem 1, we have the following:
Moreover, considering (47) with Lemma 6, we get the bound on \({{\,\mathrm{\mathbb {E}}\,}}[f(x_k)]\), which proves the first part of Theorem 3. Furthermore, using the second part of Theorem 1 and Eq. (46), we get the second part of Theorem 3. Now, to prove the third part, first note that \(\frac{1}{k} \sum \limits _{l=1}^{k} {\mathcal {P}}(x_l) \in P\); then, using Lemma 3, we have
Here we use the geometric series bound along with the fact that \(0< \rho < 1\). Furthermore, using a simplified version of (44), we have the following:
for any \(l \ge 1\). Summing up the above identity for \(l =1,2,\ldots ,k\), we have the following:
where \(\eta = 2\delta - \delta ^2 = 1-(1-\delta )^2 \le 1\). Also, from Lemma 6, we get \(\mu _2 = \min \{1, \frac{\beta }{m} \lambda _{\max }\} \le 1\), which implies that \(\eta \mu _2 \le 1\) holds. Moreover, we used the non-negativity of the sequences \(G_k\) and \( f(x_k)\), as well as the upper bound from Lemma 6. Then we get
This proves the third part of Theorem 3. \(\square \)
Proof of Theorem 5
For any natural number \(l \ge 1\), define \(\vartheta _l = \frac{\xi }{1+\xi }[x_{l-1}-x_l-\delta (a_{j^*}^Tx_{l-1}-b_{j^*})^+a_{j^*}]\), \( \ \varDelta _l = x_l + \vartheta _l\) and \(\chi _l = \Vert x_l+\vartheta _l-{\mathcal {P}}(\varDelta _l)\Vert ^2\). Then, using the update formulas (10) and (11), we have
here, the indices \(i^{*}\) and \(j^*\) are defined based on (6) for the iterates \(x_l\) and \(x_{l-1}\), respectively. Using the above relation, we can write
Taking expectation with respect to \({\mathbb {S}}_l\) we have,
Similarly, we can simplify the third term of (50) as
Using the expressions of Eqs. (51) and (52) in (50) and simplifying further, we have
here, the constant \(\varpi \) is defined as
Now, taking expectation again in (53) and using the tower property, we get
where \(q_l = {{\,\mathrm{\mathbb {E}}\,}}[\chi _l] - \frac{2\delta \xi (1+\delta )}{(1+\xi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(x_{l-1})]\). Summing up (55) for \(l=1,2,\ldots ,k\) we get
Now, using Jensen’s inequality, we have
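Assuming \(\bar{x}_k\) denotes the Cesàro average \(\frac{1}{k}\sum _{l=1}^{k} x_l\) and using the convexity of f, the Jensen step reads
\[
{{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x}_k)\right] \ = \ {{\,\mathrm{\mathbb {E}}\,}}\left[ f\left( \tfrac{1}{k}\textstyle \sum _{l=1}^{k} x_l\right) \right] \ \le \ \frac{1}{k}\sum _{l=1}^{k} {{\,\mathrm{\mathbb {E}}\,}}\left[ f(x_l)\right] .
\]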
Since \(x_0=x_1\), we have \(\vartheta _1 = \frac{-\delta \xi }{1+\xi } (a_{i^*}^Tx_0-b_{i^*})^+ a_{i^*} \). Furthermore,
Now, from our construction we get
Substituting the values of \(\varpi \) and \(q_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{x_k})\right] \), we have the following:
\(\square \)
Proof of Theorem 4
Since the term \(\Vert x_{k+1}- {\mathcal {P}}(x_{k+1}) \Vert \) is constant under the expectation with respect to \({\mathbb {S}}_{k+1}\), from the update formula of the GSKM algorithm, we get
We performed the two expectations in order, from the innermost to the outermost. Now, taking expectation in (58) and using the tower property of expectation, we have,
Similarly, using the update formula for \(z_{k+1}\), we have
Taking expectation in (60) and using (59) along with the tower property, we have
Combining both (59) and (61), we can deduce the following matrix inequality:
Now, from the definition, it can be easily checked that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Since \(\xi \in Q_2\), we have
Also, we have
Here, in the last inequality we used the given condition. Considering (64), we can check that \(\varPi _1 + \varPi _4 < 1+|\xi | \sqrt{h(\delta )} = 1+ \min \{1, |\xi | \sqrt{h(\delta )}\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also from (63), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition provided in (25). Let’s define the sequences \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_k-z_{k-1}\Vert ]\) and \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_k-{\mathcal {P}}(x_k)\Vert ]\). Now, using Theorem 2, we have
where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the parameter choice of (28). Note that from the GSKM algorithm we have \(x_1 = x_0\) and \(z_1 = z_0\). Therefore, we can easily check that \(F_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert z_1-z_{0}\Vert ] = 0\) and \(H_1 = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_1-{\mathcal {P}}(x_1)\Vert ] = {{\,\mathrm{\mathbb {E}}\,}}[\Vert x_0-{\mathcal {P}}(x_0)\Vert ] = \Vert x_0-{\mathcal {P}}(x_0)\Vert = H_0 \). Now, substituting the values of \(H_1\) and \(F_1\) in (65), we have
Also from Theorem 2 we have \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the Theorem. \(\square \)
Proof of Theorem 6
Note that, since \(Ax \le b\) is feasible, from Lemma 12 we know that there is a feasible solution \(x^*\) with \(|x^{*}_j| \le \frac{2^{\sigma }}{2n}\) for \(j = 1, \ldots , n\). Thus, we have
as \(x_0 = 0\). Then if the system \(Ax \le b\) is infeasible, by using Lemma 10, we have
This implies that when GSKM runs on the system \(Ax \le b\), the system is feasible whenever \(\theta (x) < 2^{1-\sigma }\). Furthermore, since every point of the feasible region P is inside the half-space defined by \(\tilde{H}_i = \{x \ | \ a_i^T x \le b_i\}\) for all \(i = 1,2,\ldots ,m\), we have the following:
Then, for \(\xi \in Q_1\) whenever the system \(Ax \le b\) is feasible, we have
Similarly, for \(\xi \in Q_2\), whenever the system \(Ax \le b\) is feasible, we have (see footnote 13)
Take \({\bar{\rho }} = \max \{\rho , \rho _2^2\}\). Now, combining (69) and (70), for any \(\xi \in Q = Q_1 \cup Q_2\), whenever the system \(Ax \le b\) is feasible, we have
Here, we used Theorems 3 and 4 and the identities from Eqs. (67) and (68). Now, for detecting feasibility, we need to have \({{\,\mathrm{\mathbb {E}}\,}}[\theta (x_k)] < 2^{1-\sigma }\). That gives us
Simplifying the above identity further, we get the following lower bound for k:
Moreover, if the system \(Ax \le b\) is feasible, then the probability of not having a certificate of feasibility is bounded as follows:
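Presumably this is the direct application of Markov's inequality with threshold \(2^{1-\sigma }\), i.e.,
\[
{\mathbb {P}}\left( \theta (x_k) \ge 2^{1-\sigma }\right) \ \le \ \frac{{{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] }{2^{1-\sigma }} \ = \ 2^{\sigma -1} \, {{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] ,
\]
combined with the bound on \({{\,\mathrm{\mathbb {E}}\,}}\left[ \theta (x_k)\right] \) obtained above.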
Here, we used Markov's inequality \({\mathbb {P}} (x \ge t) \le \frac{{{\,\mathrm{\mathbb {E}}\,}}[x]}{t}\). This completes the proof of Theorem 6. \(\square \)
Appendix 3
Proof of Theorem 7
From the update formula of the PASKM algorithm, we get
Here, we used the condition \(\gamma + 3 \omega -2 \le 0\). Similarly, using the update formula for \(y_{k+1}\), we have
Following Theorem 2, let us define the sequences \(H_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert v_k-{\mathcal {P}}(v_k)\Vert ^2]\) and \(F_k = {{\,\mathrm{\mathbb {E}}\,}}[\Vert y_k-{\mathcal {P}}(y_k)\Vert ^2]\). The goal is to prove that \(H_k\) and \(F_k\) satisfy the condition (25). Now, taking expectation in (72) and using the tower property of expectation, we have
Similarly, taking expectation in (73) and using (74) along with the tower property of expectation, we have
Combining both (74) and (75), we can deduce the following matrix inequality:
Here, we use the fact that \(\varPi _1, \varPi _2, \varPi _3, \varPi _4 \ge 0\). Now we will use Theorem 2 to simplify the expression of (76). Before we can use Theorem 2, we need to make sure the sequences \(H_k\) and \(F_k\) satisfy the condition of (25). From the definition, we have
Also, we have
Here, in the last inequality, we used the given condition. Considering (78), we can check that \(\varPi _1 + \varPi _4 < 1+\omega h(\delta ) (1-\alpha ) (1+\gamma ) = 1+ \min \{1, \omega h(\delta ) (1-\alpha ) (1+\gamma )\} = 1+ \min \{1,\varPi _1 \varPi _4-\varPi _2 \varPi _3\}\). Also from (77), we have \(\varPi _2 \varPi _3 - \varPi _1 \varPi _4 \le 0\), which is precisely the condition provided in (25). Now, using Theorem 2, we have
where \(\varGamma _1 , \varGamma _2, \varGamma _3, \rho _1, \rho _2\) can be derived from (26) using the given parameters. Note that from the PASKM algorithm we have \(x_0 = v_0 = y_0\). Therefore, we can easily check that \( H_0 = \Vert v_0-{\mathcal {P}}(v_0)\Vert ^2 = \Vert y_0-{\mathcal {P}}(y_0)\Vert ^2 = F_0 \). Now, substituting the values of \(H_0\) and \(F_0\) in (79), we have
Also from Theorem 2 we have \(\varGamma _1, \varGamma _3 \ge 0\) and \( 0 \le \rho _1 \le \rho _2 < 1\), which proves the first part of the Theorem. Now, considering Lemma 6, we get
Now, substituting the result of (80) in (81), we get the second part of the Theorem. \(\square \)
Proof of Theorem 8
Let us define \({\mathcal {V}} = \omega {\mathcal {P}}(v_{k}) +(1-\omega ) {\mathcal {P}}(y_{k}) \). Since \({\mathcal {V}} \in P\), using the update formula of \(v_{k+1}\) from Eq. (16), we have,
Since \(\Vert \cdot \Vert ^{2}\) is a convex function and \(0< \omega < 1\), we can bound the expected first term as follows:
Taking expectation with respect to the sampling distribution in the second term of Eq. (82) and using Lemma 9 with the choice \(z = x_{k+1}, \ x = y_k\) and \(\eta = 2\delta -\delta ^2\), we get
Now, taking expectation in the third term of (82) we get
Using Lemma 7 and Lemma 8 we can simplify Eq. (85) as follows:
Now, substituting the values of Eqs. (83), (84) and (86) in Eq. (82) we get the following:
With further simplification, the above identity can be written as follows:
Now, let’s choose the parameters as in Eq. (33) along with \(0< \zeta < \frac{4\eta \mu _1}{(1-\mu _1)^2} \). We can easily see that \(\frac{\gamma ^2 }{\eta } = \frac{ \gamma (1-\alpha )}{\alpha }\) and \( \alpha \in (0,1)\). Also note that
which implies \(\omega < 1\). Similarly, whenever \(\mu _1 < 1\) we have
which implies \(\omega > 0\). Also, using the parameter choice of (33), we have
Now, using all of the above relations (Eqs. (33), (90)) in Eq. (87), we get the following:
Finally, taking expectation again with tower rule and substituting \(\frac{\gamma ^2 }{\eta }= \zeta \mu _1 \) we have
This proves the Theorem. Furthermore, for faster convergence, we need to choose the parameters such that \(\omega \) becomes as small as possible. In the proof, we assumed that \(\mu _1 <1\), which is the most probable scenario. Whenever \(\mu _1 = 1\), we must have \(\mu _1 = \mu _2 =1\) and both inequalities in Lemma 6 hold with equality, i.e., \(f(x) = d(x,P)^2\). Therefore, if we choose \(\alpha , \gamma , \omega \) as \(\alpha = \frac{\eta }{\eta + \gamma }, \ \gamma = \sqrt{\zeta \eta }, \ \omega = 1- \frac{2\gamma }{1+\zeta }\), we can check that condition (89) and \(0< \omega < 1\) hold for any \(\zeta > 0\). \(\square \)
Proof of Theorem 9
For any natural number \(l \ge 1\), using the update formula of \(v_{l+1}\), we have
Let \(\varphi = \omega (1-\alpha )\). It can be easily checked that \(0 \le \varphi < 1\). Now, considering Eq. (14), we have
here, the indices \(i^{*}\) and \(j^*\) are defined based on (6) for the iterates \(y_l\) and \(y_{l-1}\), respectively. Furthermore, with the choice of \(x_0 = v_0\), the points \(y_0\) and \(y_1\) generated by the PASKM method (i.e., Algorithm 3 with an arbitrary parameter choice) can be calculated as
since \(y_0 = x_0 = v_0\). Now, let's define \({\bar{\vartheta }}_l = \frac{\varphi }{1-\varphi }[y_{l}-y_{l-1}+\delta (a_{j^*}^Ty_{l-1}-b_{j^*})^+a_{j^*}]\), \( {\bar{\varDelta }}_l = y_l + {\bar{\vartheta }}_l\) and \( {\bar{\chi }}_l = \Vert y_l+{\bar{\vartheta }}_l-{\mathcal {P}}({\bar{\varDelta }}_l)\Vert ^2\). Then, using the update formula (93), we have
Using the above relation, we can write
Taking expectation with respect to \({\mathbb {S}}_l\) we have,
Similarly, we can simplify the third term of (95) as
Using the expressions of Eqs. (96) and (97) in (95) and simplifying further, we have
here, the constant \(\varsigma \) is defined as
Now, taking expectation again in (98) and using the tower property, we get
where \({\bar{q}}_l = {{\,\mathrm{\mathbb {E}}\,}}[{\bar{\chi }}_l] + \frac{2\delta \varphi (1+\delta )}{(1-\varphi )^2} {{\,\mathrm{\mathbb {E}}\,}}[f(y_{l-1})]\). Summing up (100) for \(l=1,2,\ldots ,k\), we get
Now, using Jensen’s inequality, we have
From (94), \(y_1 = y_0 - \delta (1+\varphi ) (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*}\) and \({\bar{\vartheta }}_1 = \frac{-\varphi ^2 \delta }{1-\varphi } (a_{i^*}^Ty_0-b_{i^*})^+ a_{i^*} \). Then,
Now, from our construction we get
Substituting the values of \(\varsigma \) and \({\bar{q}}_1\) in the expression of \({{\,\mathrm{\mathbb {E}}\,}}\left[ f(\bar{y_k})\right] \), we have the following:
Cite this article
Morshed, M.S., Islam, M.S. & Noor-E-Alam, M. Sampling Kaczmarz-Motzkin method for linear feasibility problems: generalization and acceleration. Math. Program. 194, 719–779 (2022). https://doi.org/10.1007/s10107-021-01649-8
Keywords
- Kaczmarz method
- Randomized projection
- Sampling Kaczmarz Motzkin
- Linear feasibility
- Nesterov’s acceleration
- Iterative methods