
Sparse estimation via lower-order penalty optimization methods in high-dimensional linear regression

Journal of Global Optimization

Abstract

The lower-order penalty optimization methods, including the \(\ell _q\) minimization method and the \(\ell _q\) regularization method \((0<q\le 1)\), have been widely applied to find sparse solutions of linear regression problems and have found successful applications across mathematics and the applied sciences. In this paper, we investigate statistical properties of the \(\ell _q\) penalty optimization methods with randomly noisy observations and a deterministic/random design. For this purpose, we introduce a general q-Restricted Eigenvalue Condition (REC) and provide sufficient conditions for it in terms of several widely used regularity conditions, such as the sparse eigenvalue condition, the restricted isometry property, and the mutual incoherence property. By virtue of the q-REC, we establish \(\ell _2\) recovery bounds of order \(O(\epsilon ^2)\) and \(O(\lambda ^{\frac{2}{2-q}}s)\) for the \(\ell _q\) minimization method and the \(\ell _q\) regularization method, respectively, with high probability for either deterministic or random designs. The results in this paper are nonasymptotic and only assume the weak q-REC. Preliminary numerical results verify the established statistical properties and demonstrate the advantages of the \(\ell _q\) penalty optimization methods over existing sparse optimization methods.


Notes

  1. For two functions f and g, we use \(f\asymp g\) to denote that \(f=cg\) for a universal positive constant c.

References

  1. Candès, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)

  2. Pun, C.S., Wong, H.Y.: A linear programming model for selection of sparse high-dimensional multiperiod portfolios. Eur. J. Oper. Res. 273(2), 754–771 (2019)

  3. Qin, J., Hu, Y.H., Xu, F., Yalamanchili, H.K., Wang, J.W.: Inferring gene regulatory networks by integrating ChIP-seq/chip and transcriptome data via LASSO-type regularization methods. Methods 67(3), 294–303 (2014)

  4. Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)

  5. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)

  6. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)

  7. Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007)

  8. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  9. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  10. Daubechies, I., DeVore, R., Fornasier, M., Güntürk, C.S.: Iteratively reweighted least squares minimization for sparse recovery. Commun. Pure Appl. Math. 63(1), 1–38 (2010)

  11. Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)

  12. Donoho, D.L., Huo, X.M.: Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inf. Theory 47(7), 2845–2862 (2001)

  13. Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)

  14. Donoho, D.L.: For most large underdetermined systems of linear equations the minimal \(\ell _1\)-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59(6), 797–829 (2006)

  15. Cai, T.T., Xu, G.W., Zhang, J.: On recovery of sparse signals via \(\ell _1\) minimization. IEEE Trans. Inf. Theory 55(7), 3388–3397 (2009)

  16. Raskutti, G., Wainwright, M.J., Yu, B.: Restricted eigenvalue properties for correlated gaussian designs. J. Mach. Learn. Res. 11(2), 2241–2259 (2010)

  17. van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)

  18. Bunea, F., Tsybakov, A., Wegkamp, M.: Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 169–194 (2007)

  19. Zhang, T.: Some sharp performance bounds for least squares regression with \(\ell _1\) regularization. Ann. Stat. 37, 2109–2144 (2009)

  20. Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14(10), 707–710 (2007)

  21. Fan, J.Q., Li, R.Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  22. Giuzio, M., Ferrari, D., Paterlini, S.: Sparse and robust normal and t-portfolios by penalized lq-likelihood minimization. Eur. J. Oper. Res. 250(1), 251–261 (2016)

  23. Le Thi, H.A., Dinh, T.P., Le, H.M., Vo, X.T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244(1), 26–46 (2015)

  24. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  25. Xu, Z.B., Chang, X.Y., Xu, F.M., Zhang, H.: \(L_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 23(7), 1013–1027 (2012)

  26. Burachik, R.S., Rubinov, A.: Abstract convexity and augmented Lagrangians. SIAM J. Optim. 18(2), 413–436 (2007). https://doi.org/10.1137/050647621

  27. Huang, X., Yang, X.: A unified augmented Lagrangian approach to duality and exact penalization. Math. Oper. Res. 28(3), 533–552 (2003). https://doi.org/10.1287/moor.28.3.533.16395

  28. Luo, Z., Pang, J., Ralph, D.: Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge (1996)

  29. Yang, X., Huang, X.: A nonlinear Lagrangian approach to constrained optimization problems. SIAM J. Optim. 11(4), 1119–1144 (2001)

  30. Dong, Z.L., Yang, X.Q., Dai, Y.H.: A unified recovery bound estimation for noise-aware \(\ell _q\) optimization model in compressed sensing. arXiv preprint arXiv:1609.01531 (2016)

  31. Song, C.B., Xia, S.T.: Sparse signal recovery by \(\ell _q\) minimization under restricted isometry property. IEEE Signal Process. Lett. 21(9), 1154–1158 (2014)

  32. Hu, Y.H., Li, C., Meng, K.W., Qin, J., Yang, X.Q.: Group sparse optimization via \(\ell _{p, q}\) regularization. J. Mach. Learn. Res. 18(30), 1–52 (2017)

  33. Liu, H.C., Yao, T., Li, R.Z., Ye, Y.Y.: Folded concave penalized sparse linear regression: sparsity, statistical performance, and algorithmic theory for local solutions. Math. Progr. 166(1–2), 207–240 (2017)

  34. Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16(1), 559–616 (2015)

  35. Zhang, C.H., Zhang, T.: A general theory of concave regularization for high-dimensional sparse estimation problems. Stat. Sci. 27(4), 576–593 (2012)

  36. Candès, E.J., Romberg, J.K., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)

  37. Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 2452–2482 (2012)

  38. Zhou, S.H.: Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045 (2009)

  39. Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley, New York (1973)

  40. van de Geer, S.A.: High-dimensional generalized linear models and the Lasso. Ann. Stat. 36(2), 614–645 (2008)

  41. Loh, P.L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1637–1664 (2012)

  42. Negahban, S., Ravikumar, P., Wainwright, M.J., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)

  43. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  44. Mazumder, R., Radchenko, P.: The discrete Dantzig selector: estimating sparse linear models via mixed integer linear optimization. IEEE Trans. Inf. Theory 63(5), 3053–3075 (2017)

  45. Mazumder, R., Radchenko, P., Dedieu, A.: Subset selection with shrinkage: sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288 (2017)

  46. Zhang, C.H., Huang, J.: The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)

  47. Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted \(l_1\) minimization. J. Fourier Anal. Appl. 14(5–6), 877–905 (2008)

  48. Chartrand, R., Yin, W.T.: Iteratively reweighted algorithms for compressive sensing. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3869–3872 (2008)

  49. Blumensath, T., Davies, M.E.: Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. 14(5–6), 629–654 (2008)

  50. Herman, J., Kucera, R., Simsa, J.: Equations and Inequalities: Elementary Problems and Theorems in Algebra and Number Theory. Springer, Berlin (2000)

  51. Ross, S.: A First Course in Probability. Pearson, London (2009)

  52. Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over \(\ell _q\)-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)

  53. Ge, R.: A filled function method for finding a global minimizer of a function of several variables. Math. Progr. 46(1–3), 191–204 (1990). https://doi.org/10.1007/BF01585737

Acknowledgements

The authors are grateful to the editor and the anonymous reviewers for their valuable comments and suggestions toward the improvement of this paper. Xin Li’s work was supported in part by the Natural Science Foundation of Shaanxi Province of China (2022JQ-045). Yaohua Hu’s work was supported in part by the National Natural Science Foundation of China (12071306, 32170655, 11871347), Natural Science Foundation of Guangdong Province of China (2019A1515011917, 2020B1515310008), Project of Educational Commission of Guangdong Province of China (2021KTSCX103, 2019KZDZX1007), and Natural Science Foundation of Shenzhen (JCYJ20190808173603590). Chong Li’s work was supported in part by the National Natural Science Foundation of China (11971429) and Zhejiang Provincial Natural Science Foundation of China (LY18A010004). Xiaoqi Yang’s work was supported in part by the Research Grants Council of Hong Kong (PolyU 15212817). Tianzi Jiang’s work was supported in part by the Science and Technology Innovation 2030 - Brain Science and Brain-Inspired Intelligence Project of China (2021ZD0200200).

Author information

Corresponding author

Correspondence to Yaohua Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Preliminary lemmas

We first recall some basic properties of the \(\ell _q\) norm in the following lemmas; in particular, the first inequality in (A.1) is from [32, Eq. (7)], the second inequality in (A.1) is from [50, Eq. (104)], and (A.2) follows from [32, Lemma 2] by taking \(p=1\) and a simple variable substitution.

Lemma A.1

Let \(\alpha ,\beta \in {{\mathbb {R}}}^n\). Then the following relations are true:

$$\begin{aligned} \Vert \beta \Vert _{q_2}\le \Vert \beta \Vert _{q_1} \le n^{\frac{1}{q_1}-\frac{1}{q_2}}\Vert \beta \Vert _{q_2} \quad \text{ for } \text{ any } 0<q_1\le q_2<+\infty , \end{aligned}$$
(A.1)
$$\begin{aligned} \Vert \alpha \Vert _q^q-\Vert \beta \Vert _q^q\le \Vert \alpha +\beta \Vert _q^q\le \Vert \alpha \Vert _q^q+\Vert \beta \Vert _q^q \quad \text{ for } \text{ any } 0<q\le 1. \end{aligned}$$
(A.2)
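
As a quick sanity check of (A.1) (a worked instance added here for illustration), take \(\beta =(1,1)^\top \in {{\mathbb {R}}}^2\), \(q_1=1/2\) and \(q_2=1\): then \(\Vert \beta \Vert _1=2\), \(\Vert \beta \Vert _{1/2}=(1^{1/2}+1^{1/2})^2=4\) and \(n^{\frac{1}{q_1}-\frac{1}{q_2}}\Vert \beta \Vert _1=2^{2-1}\cdot 2=4\), so both inequalities in (A.1) hold, the second with equality.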

Lemma A.2

Let \(p\ge 1\), \(n_1,n_2\in {{\mathbb {N}}}\), \(\alpha \in {{\mathbb {R}}}_+^{n_1}\), \(\beta \in {{\mathbb {R}}}_+^{n_2}\) and \(c>0\) be such that

$$\begin{aligned} \max \limits _{1\le i\le n_1}\alpha _i\le \min \limits _{1\le j\le n_2}\beta _j\quad \text {and}\quad \sum \limits _{i=1}^{n_1}\alpha _i\le c\sum \limits _{j=1}^{n_2}\beta _j. \end{aligned}$$
(A.3)

Then

$$\begin{aligned} \sum \limits _{i=1}^{n_1}\alpha _i^p\le c\sum \limits _{j=1}^{n_2}\beta _j^p. \end{aligned}$$
(A.4)

Proof

Let \(\alpha _{\max }:=\max \limits _{1\le i\le n_1}\alpha _i\) and \(\beta _{\min }:=\min \limits _{1\le j\le n_2}\beta _j\). Then it holds that

$$\begin{aligned} \alpha _{\max }\sum _{i=1}^{n_1}\alpha _i^p\le \alpha _{\max }^p\sum _{i=1}^{n_1}\alpha _i\quad \text{ and }\quad \beta _{\min }^p\sum _{j=1}^{n_2}\beta _j\le \beta _{\min }\sum _{j=1}^{n_2}\beta _j^p. \end{aligned}$$
(A.5)

Without loss of generality, we assume that \(\alpha _{\max }>0\); otherwise, (A.4) holds automatically. Thus, by the first inequality of (A.3) and noting \(p\ge 1\), we have that

$$\begin{aligned} 0<\alpha _{\max }^p\beta _{\min }\le \alpha _{\max }\beta _{\min }^p. \end{aligned}$$
(A.6)

Multiplying the inequalities in (A.5) by \(\beta _{\min }\sum \limits _{j=1}^{n_2}\beta _j\) and \(\alpha _{\max }\sum \limits _{i=1}^{n_1}\alpha _i\) respectively, we obtain that

$$\begin{aligned} \begin{aligned} \alpha _{\max }\beta _{\min }\sum _{i=1}^{n_1}\alpha _i^p\sum _{j=1}^{n_2}\beta _j&\le \alpha _{\max }^p\beta _{\min }\sum _{i=1}^{n_1}\alpha _i\sum _{j=1}^{n_2}\beta _j\\&\le \alpha _{\max }\beta _{\min }^p\sum _{i=1}^{n_1}\alpha _i\sum _{j=1}^{n_2}\beta _j\\&\le \alpha _{\max }\beta _{\min }\sum _{i=1}^{n_1}\alpha _i\sum _{j=1}^{n_2}\beta _j^p, \end{aligned} \end{aligned}$$

where the second inequality follows from (A.6). This, together with the second inequality of (A.3), yields (A.4). The proof is complete. \(\square \)
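
For a concrete instance of Lemma A.2 (added for illustration), take \(\alpha =(1,2)\), \(\beta =(2,3)\), \(c=\frac{3}{5}\) and \(p=2\): then \(\max _i\alpha _i=2\le \min _j\beta _j=2\) and \(\sum _i\alpha _i=3\le c\sum _j\beta _j=3\), while \(\sum _i\alpha _i^2=5\le c\sum _j\beta _j^2=\frac{39}{5}\), in accordance with (A.4).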

The following lemmas are useful for establishing the relationship between the q-REC and other types of regularity conditions; in particular, Lemmas A.3 and A.4 are taken from [11, Lemma 1.1] and [17, Lemma 3.1], respectively.

Lemma A.3

Let \(X\in {{\mathbb {R}}}^{m\times n}\) and \(s, t \in {{\mathbb {N}}}\) be such that \(s+t\le n\). Then

$$\begin{aligned}\theta _{s,t}(X)\le \eta _{s+t}(X)\le \theta _{s,t}(X)+\max \{\eta _s(X),\eta _{t}(X)\}.\end{aligned}$$

Lemma A.4

Let \(\alpha ,\beta \in {{\mathbb {R}}}^n\) and \(0<\tau <1\) be such that \(-\langle \alpha ,\beta \rangle \le \tau \Vert \alpha \Vert _2^2\). Then \((1-\tau )\Vert \alpha \Vert _2\le \Vert \alpha +\beta \Vert _2\).

For the sake of simplicity, we introduce a partition structure and some notation. For a vector \(\delta \in {{\mathbb {R}}}^n\) and an index set \(J\subseteq \{1,2,\dots ,n\}\), we use \(\mathrm{rank}(\delta _i;J^c)\) to denote the rank of the absolute value of \(\delta _i\) among the coordinates of \(\delta \) indexed by \(J^c\) (in decreasing order), and \(J_k(\delta ;t)\) to denote the index set of the k-th batch of the t largest remaining coordinates in absolute value of \(\delta \) in \(J^c\). That is,

$$\begin{aligned} J_k(\delta ;t):= \left\{ i\in J^c:\mathrm{rank}(\delta _i;J^c)\in \{kt+1,\dots ,(k+1)t\}\right\} \quad \text{ for } \text{ each } k\in {{\mathbb {N}}}. \end{aligned}$$
(A.7)

Furthermore, we let \(r:=\lceil \frac{n-s}{t}\rceil \) (where \(\lceil u\rceil \) denotes the smallest integer not less than u), \(J_k:=J_k(\delta ;t)\) (defined by (A.7)) for each \(k\in {{\mathbb {N}}}\) and \(J_*:=J\cup J_0\). With this notation, Lemma A.5 is taken from [32, Lemma 7] in the case where the group structure is degenerate.
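
To illustrate the partition (A.7) (an example added here for concreteness), take \(n=7\), \(J=\{1,2\}\) (so \(s=2\)), \(t=2\) and \(\delta =(9,8,5,1,3,0.5,2)^\top \). Ordering the coordinates of \(\delta \) on \(J^c=\{3,4,5,6,7\}\) by decreasing absolute value gives \(\delta _3,\delta _5,\delta _7,\delta _4,\delta _6\); hence \(r=\lceil \frac{7-2}{2}\rceil =3\), \(J_0=\{3,5\}\), \(J_1=\{4,7\}\), \(J_2=\{6\}\), \(J_k=\emptyset \) for \(k\ge 3\), and \(J_*=J\cup J_0=\{1,2,3,5\}\).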

Lemma A.5

Let \(\delta \in {{\mathbb {R}}}^n\), \(0<q\le 1\) and \(\tau \ge 1\). Then the following inequalities hold

$$\begin{aligned} \Vert \delta _{J_*^c}\Vert _\tau \le \sum _{k=1}^{r}\Vert \delta _{J_k}\Vert _\tau \le t^{\frac{1}{\tau }-\frac{1}{q}}\Vert \delta _{J^c}\Vert _q. \end{aligned}$$

Lemma A.6

Let \(X\in {{\mathbb {R}}}^{m\times n}\), \(0<q\le 1\), \(a>0\), and (st) be a pair of integers satisfying (10). Then the following relations are true:

$$\begin{aligned} \phi _q(s,t,a,X)\ge \frac{1}{\sqrt{m}}\left( \sqrt{\sigma _{\min }(s+t,X)}-a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\sqrt{\sigma _{\max }(t,X)}\right) , \end{aligned}$$
(A.8)
$$\begin{aligned} \phi _q(s,t,a,X)\le \frac{1}{\sqrt{m}}\left( \sqrt{\sigma _{\max }(s+t,X)}+a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\sqrt{\sigma _{\max }(t,X)}\right) . \end{aligned}$$
(A.9)

Proof

Fix \(\delta \in C_q(s,a)\), as defined by (B.1). Then there exists \(J\subseteq \{1,2,\dots ,n\}\) such that

$$\begin{aligned} |J|\le s \quad \text{ and }\quad \Vert \delta _{J^c}\Vert _q^q\le a\Vert \delta _{J}\Vert _q^q. \end{aligned}$$
(A.10)

Write \(r:=\lceil \frac{n-s}{t}\rceil \), \(J_k:=J_k(\delta ;t)\) (defined by (A.7)) for each \(k\in {{\mathbb {N}}}\) and \(J_*:=J\cup J_0\). Then it follows from Lemma A.5 and (A.10) that

$$\begin{aligned} \sum _{k=1}^{r}\Vert \delta _{J_k}\Vert _2\le t^{\frac{1}{2}-\frac{1}{q}}\Vert \delta _{J^c}\Vert _q \le a^{\frac{1}{q}}t^{\frac{1}{2}-\frac{1}{q}}\Vert \delta _{J}\Vert _q\le a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\Vert \delta _{J}\Vert _2 \end{aligned}$$
(A.11)

(due to (A.1)). Noting by (A.7) and (A.10) that \(|J_*|\le s+t\) and \(|J_k|\le t\) for each \(k\in {{\mathbb {N}}}\), one has by (4) that

$$\begin{aligned} \sqrt{\sigma _{\min }(s+t,X)}\Vert \delta _{J_*}\Vert _2 \le \Vert X\delta _{J_*}\Vert _2 \le \sqrt{\sigma _{\max }(s+t,X)}\Vert \delta _{J_*}\Vert _2, \\ \Vert X\delta _{J_k}\Vert _2 \le \sqrt{\sigma _{\max }(t,X)}\Vert \delta _{J_k}\Vert _2\quad \text {for each }\ k\in {{\mathbb {N}}}. \end{aligned}$$

These, together with (A.11), imply that

$$\begin{aligned} \begin{aligned} \Vert X\delta \Vert _2&\ge \Vert X\delta _{J_*}\Vert _2-\sum _{k=1}^{r}\Vert X\delta _{J_k}\Vert _2\\&\ge \left( \sqrt{\sigma _{\min }(s+t,X)}-a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\sqrt{\sigma _{\max }(t,X)}\right) \Vert \delta _{J_*}\Vert _2. \end{aligned} \end{aligned}$$

Since \(\delta \) and J satisfying (A.10) are arbitrary, (A.8) follows from (11) and the fact that \(J_*=J\cup J_0(\delta ;t)\). One can prove (A.9) in a similar way; the details are omitted. \(\square \)

The next two lemmas measure the probabilities of random events related to the linear regression model (1); in particular, Lemma A.7 is taken from [38, Lemma C.1].

Lemma A.7

Let \(0\le \theta <1\) and \(b\ge 0\). Suppose that

$$\begin{aligned} \max \limits _{1\le j\le n}\Vert X_{\cdot j}\Vert _2\le (1+\theta )\sqrt{m}. \end{aligned}$$
(A.12)

Then

$$\begin{aligned} {\mathbb {P}}\left( \frac{\Vert X^{\top }e\Vert _\infty }{m}\ge \sigma (1+\theta )\sqrt{\frac{2(1+b)\log n}{m}}\right) \le \left( n^b\sqrt{\pi \log n}\right) ^{-1}. \end{aligned}$$
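
To get a feel for this bound (a numeric illustration added here), with \(n=1000\) and \(b=1\) one has \(\left( n^b\sqrt{\pi \log n}\right) ^{-1}=\left( 1000\sqrt{\pi \ln 1000}\right) ^{-1}\approx 2.1\times 10^{-4}\), so the exceptional event in Lemma A.7 is already quite rare for moderate problem sizes.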

Lemma A.8

Let \(d\ge 5\). Then

$$\begin{aligned} {\mathbb {P}}\left( \Vert e\Vert _2^2\ge dm\sigma ^2\right) \le \exp \left( -\frac{d-1}{4}m\right) . \end{aligned}$$

Proof

Recall that \(e=(e_1,\dots ,e_m)^\top \sim {\mathscr {N}}(0,\sigma ^2{\mathbb {I}}_m)\). Let \(u_i:=\frac{1}{\sigma }e_i\) for \(i=1,\dots ,m\). Then \(u_1,\dots ,u_m\) are i.i.d. Gaussian variables with \(u_i\sim {\mathscr {N}}(0,1)\) for \(i=1,\dots ,m\). Let \(u:=(u_1,\dots ,u_m)^\top \). Clearly, \(\Vert u\Vert _2^2=\frac{1}{\sigma ^2}\Vert e\Vert _2^2\) is a chi-square random variable with m degrees of freedom (see, e.g., [51, Section 5.6]). Then it follows from standard tail bounds for chi-square random variables (see, e.g., [52, Appendix I]) that

$$\begin{aligned} {\mathbb {P}}\left( \frac{\Vert u\Vert _2^2-m}{m}\ge d-1\right) \le \exp \left( -\frac{d-1}{4}m\right) \end{aligned}$$

(as \(d\ge 5\)). Consequently, we obtain that

$$\begin{aligned} {\mathbb {P}}\left( \Vert e\Vert _2^2\ge dm\sigma ^2\right) ={\mathbb {P}}\left( \Vert u\Vert _2^2\ge dm\right) \le \exp \left( -\frac{d-1}{4}m\right) . \end{aligned}$$

The proof is complete. \(\square \)
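
Lemma A.8 is easy to check numerically. Below is a minimal Monte Carlo sketch (added here; the sample size, seed and choices of \(m\) and \(\sigma \) are arbitrary and not from the paper):

import numpy as np

# Monte Carlo check of Lemma A.8 with d = 5:
# P(||e||_2^2 >= d*m*sigma^2) should be at most exp(-(d-1)/4 * m).
rng = np.random.default_rng(1)
m, sigma, d = 10, 1.0, 5
e = rng.normal(0.0, sigma, size=(200_000, m))  # 200,000 draws of e ~ N(0, sigma^2 I_m)
empirical = np.mean(np.sum(e ** 2, axis=1) >= d * m * sigma ** 2)
bound = np.exp(-(d - 1) / 4 * m)
print(empirical, bound)  # empirical tail probability vs. theoretical bound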

Recall that \(\beta ^*\) satisfies the linear regression model (1). The following lemma is useful in the proof of Theorem 2.

Lemma A.9

Let \({\hat{\beta }}_{q,\lambda }\) be an optimal solution of \((RP _{q,\lambda })\). Then

$$\begin{aligned} \frac{1}{2m}\Vert X\beta ^*-X{\hat{\beta }}_{q,\lambda }\Vert _2^2\le \lambda \Vert \beta ^*\Vert _q^q-\lambda \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q+ \frac{1}{m}\Vert {\hat{\beta }}_{q,\lambda }-\beta ^*\Vert _1\Vert X^{\top }e\Vert _{\infty }. \end{aligned}$$

Proof

Since \({\hat{\beta }}_{q,\lambda }\) is an optimal solution of \((\text {RP}_{q,\lambda })\), it follows that

$$\begin{aligned} \frac{1}{2m}\Vert y-X{\hat{\beta }}_{q,\lambda }\Vert _2^2+\lambda \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q\le \frac{1}{2m}\Vert y-X\beta ^*\Vert _2^2+\lambda \Vert \beta ^*\Vert _q^q. \end{aligned}$$

This, together with (1), yields that

$$\begin{aligned} \begin{aligned} \lambda \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q-\lambda \Vert \beta ^*\Vert _q^q&\le \frac{1}{2m}\Vert y-X\beta ^*\Vert _2^2-\frac{1}{2m}\Vert y-X{\hat{\beta }}_{q,\lambda }\Vert _2^2\\&= \frac{1}{m}\left\langle X({\hat{\beta }}_{q,\lambda }-\beta ^*),e\right\rangle -\frac{1}{2m}\Vert X\beta ^*-X{\hat{\beta }}_{q,\lambda }\Vert _2^2\\&\le \frac{1}{m}\Vert {\hat{\beta }}_{q,\lambda }-\beta ^*\Vert _1\Vert X^{\top }e\Vert _{\infty }-\frac{1}{2m}\Vert X\beta ^*-X{\hat{\beta }}_{q,\lambda }\Vert _2^2. \end{aligned} \end{aligned}$$

The proof is complete. \(\square \)

Assume that \(X_{1\cdot },\dots ,X_{m\cdot }\) are i.i.d. random vectors with each \(X_{i\cdot }\sim {\mathscr {N}}(0,\varSigma )\). The following lemma is taken from [37, Supplementary material, Lemma 6]; it is useful for providing a sufficient condition for the q-REC of the random design X.

Lemma A.10

There exist universal positive constants \((c_1,c_2)\) (independent of \(m,n,\varSigma \)) such that it holds with probability at least \(1-\exp (-c_2m)\) that, for each \(\delta \in {{\mathbb {R}}}^n\)

$$\begin{aligned} \frac{\Vert X\delta \Vert _2^2}{m}\ge \frac{1}{2}\Vert \varSigma ^{\frac{1}{2}}\delta \Vert _2^2-c_1\zeta (\varSigma )\frac{\log n}{m}\Vert \delta \Vert _1^2. \end{aligned}$$
(A.13)

Technical proofs

Proof of Proposition 1

Associated with the q-REC(s, t, a), we define the feasible set

$$\begin{aligned} C_q(s,a):=\{\delta \in {{\mathbb {R}}}^n:\Vert \delta _{J^c}\Vert _q^q\le a\Vert \delta _{J}\Vert _q^q\ \text {for some}\ |J|\le s\}. \end{aligned}$$
(B.1)

By Definition 5, it remains to show that \(C_{q_1}(s,a)\subseteq C_{q_2}(s,a)\). To this end, let \(\delta \in C_{q_1}(s,a)\), and let \(J_0\) denote the index set of the first s largest coordinates in absolute value of \(\delta \). By the assumption that \(\delta \in C_{q_1}(s,a)\) and by the construction of \(J_0\), one has \(\Vert \delta _{J_0^c}\Vert _{q_1}^{q_1}\le a\Vert \delta _{J_0}\Vert _{q_1}^{q_1}\). Then we obtain by Lemma A.2 (with \(q_2/q_1\) in place of p) that \(\Vert \delta _{J_0^c}\Vert _{q_2}^{q_2}\le a\Vert \delta _{J_0}\Vert _{q_2}^{q_2}\); consequently, \(\delta \in C_{q_2}(s,a)\). Hence, it follows that \(C_{q_1}(s,a)\subseteq C_{q_2}(s,a)\), and the proof is complete. \(\square \)

Proof of Proposition 2

It follows directly from Lemma A.6 (cf. (A.8)) that X satisfies the q-REC(s, t, a) provided that condition (a) holds. Fix \(\delta \in C_q(s,a)\), and let J, r, \(J_k\) (for each \(k\in {{\mathbb {N}}}\)) and \(J_*\) be defined, respectively, as in the beginning of the proof of Lemma A.6. Then (A.11) follows directly, and it follows from Lemma A.5 and (17) that

$$\begin{aligned} \Vert \delta _{J_*^c}\Vert _1=\sum _{k=1}^{r}\Vert \delta _{J_k}\Vert _1\le t^{1-\frac{1}{q}}\Vert \delta _{J^c}\Vert _q \le a^{\frac{1}{q}}t^{1-\frac{1}{q}}\Vert \delta _{J}\Vert _q\le a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-1}\Vert \delta _{J}\Vert _1. \end{aligned}$$
(B.2)

Suppose that condition (b) is satisfied. By Definition 2 (cf. (3)), one has that

$$\begin{aligned} |\langle X\delta _{J_*}, X\delta _{J_*^c}\rangle | \le \sum _{k=1}^r|\langle X\delta _{J_*}, X\delta _{J_k}\rangle |\le \theta _{t,s+t}(X) \Vert \delta _{J_*}\Vert _2 \sum _{k=1}^r\Vert \delta _{J_k}\Vert _2. \end{aligned}$$

Then it follows from (A.11) that

$$\begin{aligned} |\langle X\delta _{J_*}, X\delta _{J_*^c}\rangle |\le & {} a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)\Vert \delta _{J_*}\Vert _2\Vert \delta _{J}\Vert _2 \nonumber \\\le & {} \frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\Vert X\delta _{J_*}\Vert _2^2 \end{aligned}$$
(B.3)

(by (2)). Since \(s\le t\) (by (10)), one has by Definition 1(i) that \(\eta _s(X)\le \eta _t(X)\), and then by Lemma A.3 that \(\eta _{s+t}(X)\le \theta _{s,t}(X)+\eta _t(X)\). Then it follows from (b) that

$$\begin{aligned} 0<\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)} \le \frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-(\eta _{t}(X)+\theta _{s,t}(X))}<1. \end{aligned}$$
(B.4)

This, together with (B.3), shows that Lemma A.4 is applicable (with \(X\delta _{J_*}\), \(X\delta _{J_*^c}\), \(\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\) in place of \(\alpha \), \(\beta \), \(\tau \)) to conclude that

$$\begin{aligned} \Vert X\delta \Vert _2^2\ge & {} \left( 1-\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\right) ^2\Vert X\delta _{J_*}\Vert _2^2\\\ge & {} (1-\eta _{s+t}(X))\left( 1-\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\right) ^2\Vert \delta _{J_*}\Vert _2^2 \end{aligned}$$

(due to (2)). Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) and (B.4) that

$$\begin{aligned} \phi _q(s,t,a,X)\ge \frac{1}{\sqrt{m}}\left( \sqrt{1-\eta _{s+t}(X)}\left( 1-\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\right) \right) >0; \end{aligned}$$

consequently, X satisfies the q-REC(s, t, a).

Suppose that (c) is satisfied. Then we have by (B.2) and Definition 2 (cf. (3)) that

$$\begin{aligned} \begin{aligned} \Vert X\delta \Vert _2^2&= \Vert X\delta _{J_*}\Vert _2^2+2\langle X\delta _{J_*}, X\delta _{J_*^c}\rangle +\Vert X\delta _{J_*^c}\Vert _2^2\\&\ge \Vert X\delta _{J_*}\Vert _2^2-2|\langle X\delta _{J_*}, X\delta _{J_*^c}\rangle |\\&\ge \Vert X\delta _{J_*}\Vert _2^2-2\theta _{1,1}(X)\Vert \delta _{J_*}\Vert _1\Vert \delta _{J_*^c}\Vert _1\\&\ge \Vert X\delta _{J_*}\Vert _2^2-2a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-1}\theta _{1,1}(X)\Vert \delta _{J_*}\Vert _1^2. \end{aligned} \end{aligned}$$
(B.5)

Separating the diagonal and off-diagonal terms of the quadratic form \(\delta _{J_*}^\top X^\top X\delta _{J_*}\), one has by (A.1) and (c) that

$$\begin{aligned} \begin{aligned} \Vert X\delta _{J_*}\Vert _2^2&= \sum _{i=1}^{n}(X^\top X)_{i,i}(\delta _{J_*})_i^2+\sum _{j\ne k}^{n}(X^\top X)_{j,k}(\delta _{J_*})_j(\delta _{J_*})_k\\&= \Vert \delta _{J_*}\Vert _2^2+\sum _{j\ne k}^{n}\langle X_{\cdot j}(\delta _{J_*})_j, X_{\cdot k}(\delta _{J_*})_k \rangle \\&\ge \Vert \delta _{J_*}\Vert _2^2-\theta _{1,1}(X)\Vert \delta _{J_*}\Vert _1^2\\&\ge (1-(s+t)\theta _{1,1}(X))\Vert \delta _{J_*}\Vert _2^2. \end{aligned} \end{aligned}$$

Combining this inequality with (B.5), we get that

$$\begin{aligned} \Vert X\delta \Vert _2^2\ge \left( 1-\left( 1+2a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-1}\right) (s+t)\theta _{1,1}(X)\right) \Vert \delta _{J_*}\Vert _2^2. \end{aligned}$$

Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) and (c) that

$$\begin{aligned} \phi _q(s,t,a,X)\ge \frac{1}{\sqrt{m}}\left( 1-\left( 1+2a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-1}\right) (s+t)\theta _{1,1}(X)\right) >0; \end{aligned}$$

consequently, X satisfies the q-REC(s, t, a). The proof is complete. \(\square \)

Proof of Lemma 1

By (14) and (16), Lemma A.8 is applicable (with \(d=5\)) to show that \({\mathbb {P}}({\mathscr {A}}^c)\le \exp (-m)\); that is, (18) is proved. Then it remains to show (20) and (21). For this purpose, we have by (15) that \(\lambda \ge \frac{5}{2}\sigma ^2\), and noting that \(0<q\le 1\),

$$\begin{aligned} \lambda&\ge \frac{a+1}{a-1}\sigma (1+\theta ) 2^{1-q}\left( \frac{5\sigma ^2}{2\lambda }+r^q\right) ^{\frac{1-q}{q}}\sqrt{\frac{2(1+b)\log n}{m}}\\&= \frac{a+1}{a-1}\sigma (1+\theta )(2\rho )^{1-q}\sqrt{\frac{2(1+b)\log n}{m}} \end{aligned}$$

(due to (14)). Then one has by (17) that

$$\begin{aligned} {\mathbb {P}}({\mathscr {B}}^c)\le & {} {\mathbb {P}}\left( \frac{a+1}{(a-1)m}(2\rho )^{1-q}\Vert X^{\top }e\Vert _\infty \ge \frac{a+1}{a-1}\sigma (1+\theta )(2\rho )^{1-q}\sqrt{\frac{2(1+b)\log n}{m}}\right) \\= & {} {\mathbb {P}}\left( \frac{\Vert X^{\top }e\Vert _\infty }{m}\ge \sigma (1+\theta )\sqrt{\frac{2(1+b)\log n}{m}}\right) . \end{aligned}$$

Hence, by assumption (A.12), Lemma A.7 is applicable to ensure (20). Moreover, it follows from elementary probability theory that

$$\begin{aligned} {\mathbb {P}}({\mathscr {A}}\cap {\mathscr {B}})\ge {\mathbb {P}}({\mathscr {A}})-{\mathbb {P}}({\mathscr {B}}^c)\ge 1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}. \end{aligned}$$

The proof is complete. \(\square \)

Proof of Proposition 3

Let \(e\in {\mathscr {A}}\). Recalling that \(\beta ^*\) satisfies the linear regression model (1), one has that \(\Vert y-X\beta ^*\Vert _2=\Vert e\Vert _2\le \epsilon \) (under the event \({\mathscr {A}}\)), and so \(\beta ^*\) is a feasible vector of \((\text {CP}_{q,\epsilon })\). Consequently, by the optimality of \({\bar{\beta }}_{q,\epsilon }\) for \((\text {CP}_{q,\epsilon })\), it follows that \(\Vert {\bar{\beta }}_{q,\epsilon }\Vert _q\le \Vert \beta ^*\Vert _q\). Write \(\delta :={\bar{\beta }}_{q,\epsilon }-\beta ^*\). Then we obtain that

$$\begin{aligned} \Vert \beta ^*\Vert _q^q\ge \Vert \beta ^*+\delta \Vert _q^q= \Vert \beta ^*+\delta _{J}+\delta _{J^c}\Vert _q^q= \Vert \beta ^*+\delta _{J}\Vert _q^q+\Vert \delta _{J^c}\Vert _q^q, \end{aligned}$$
(B.6)

where the last equality holds because \(\beta ^*_{J^c}=0\). On the other hand, one has by (A.2) that \(\Vert \beta ^*+\delta _{J}\Vert _q^q\ge \Vert \beta ^*\Vert _q^q-\Vert \delta _{J}\Vert _q^q\). This, together with (B.6), implies (22). The proof is complete. \(\square \)

Proof of Proposition 4

Let \(e\in {\mathscr {A}}\). Since \({\hat{\beta }}_{q,\lambda }\) is an optimal solution of \((\text {RP}_{q,\lambda })\), one has that

$$\begin{aligned} \frac{1}{2m}\Vert y-X{\hat{\beta }}_{q,\lambda }\Vert _2^2+\lambda \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q\le \frac{1}{2m}\Vert y-X\beta ^*\Vert _2^2+\lambda \Vert \beta ^*\Vert _q^q. \end{aligned}$$

Then, by (1) and (13), it follows that

$$\begin{aligned} \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q\le \frac{1}{2m\lambda }\Vert y-X\beta ^*\Vert _2^2+\Vert \beta ^*\Vert _q^q \le \frac{1}{2m\lambda }\Vert e\Vert _2^2+r^q \le \rho ^q \end{aligned}$$

(due to (14) and (16)). Write \(\delta :={\hat{\beta }}_{q,\lambda }-\beta ^*\). Then, we obtain by (A.1) and (13) that

$$\begin{aligned} \Vert \delta \Vert _1\le \Vert {\hat{\beta }}_{q,\lambda }\Vert _1+\Vert \beta ^*\Vert _1\le \Vert {\hat{\beta }}_{q,\lambda }\Vert _q+\Vert \beta ^*\Vert _q\le \rho +r<2\rho . \end{aligned}$$

Consequently, noting that \(0<q\le 1\), one sees that \(\frac{\Vert \delta \Vert _1}{2\rho }\le \left( \frac{\Vert \delta \Vert _1}{2\rho }\right) ^q\), and then, by (A.1) that

$$\begin{aligned} \Vert \delta \Vert _1\le (2\rho )^{1-q}\Vert \delta \Vert _1^q\le (2\rho )^{1-q}\Vert \delta \Vert _q^q. \end{aligned}$$
(B.7)

This proves (23). It remains to show (24). To this end, noting that \(\beta ^*_{J^c}=0\), we derive by Lemma A.9 that

$$\begin{aligned} \begin{aligned} -\frac{1}{m}\Vert \delta \Vert _1\Vert X^{\top }e\Vert _{\infty }&\le \lambda \Vert \beta ^*\Vert _q^q-\lambda \Vert \beta ^*+\delta \Vert _q^q \\&= \lambda \Vert \beta _{J}^*\Vert _q^q-\lambda \Vert \beta _{J}^*+\delta _{J}\Vert _q^q-\lambda \Vert \delta _{J^c}\Vert _q^q \\&\le \lambda \left( \Vert \delta _{J}\Vert _q^q-\Vert \delta _{J^c}\Vert _q^q\right) \end{aligned} \end{aligned}$$

(by (A.2)). This, together with (B.7), yields that

$$\begin{aligned} \lambda \left( \Vert \delta _{J}\Vert _q^q-\Vert \delta _{J^c}\Vert _q^q\right) \ge -\frac{1}{m}(2\rho )^{1-q}\Vert \delta \Vert _q^q\Vert X^{\top }e\Vert _{\infty }. \end{aligned}$$

Then, under the event \({\mathscr {A}}\cap {\mathscr {B}}\), we obtain by (17) that

$$\begin{aligned} (a+1)\left( \Vert \delta _{J}\Vert _q^q-\Vert \delta _{J^c}\Vert _q^q\right) \ge -(a-1)\Vert \delta \Vert _q^q= -(a-1)(\Vert \delta _{J}\Vert _q^q+\Vert \delta _{J^c}\Vert _q^q), \end{aligned}$$

which yields (24). The proof is complete. \(\square \)

Proof of Theorem 1

Write \(\delta :={\bar{\beta }}_{q,\epsilon }-\beta ^*\), and let \(J_*:=J\cup J_0(\delta ;t)\) (defined by (A.7)). Fix \(e\in {\mathscr {A}}\). Then it follows from Lemma A.5 and Proposition 3 that

$$\begin{aligned} \Vert \delta _{J_*^c}\Vert _2^2\le t^{1-\frac{2}{q}}\Vert \delta _{J^c}\Vert _q^2\le t^{1-\frac{2}{q}}\Vert \delta _{J}\Vert _q^2\le \left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\Vert \delta _J\Vert _2^2\le \left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\Vert \delta _{J_*}\Vert _2^2 \end{aligned}$$

(by (A.1)), and so

$$\begin{aligned} \Vert \delta \Vert _2^2=\Vert \delta _{J_*}\Vert _2^2+\Vert \delta _{J_*^c}\Vert _2^2 \le \left( 1+\left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\right) \Vert \delta _{J_*}\Vert _2^2. \end{aligned}$$
(B.8)

Recalling that \(\beta ^*\) satisfies the linear regression model (1), we have that \(\Vert y-X\beta ^*\Vert _2=\Vert e\Vert _2\le \epsilon \) (by (16)), and then

$$\begin{aligned} \Vert X\delta \Vert _2=\Vert X{\bar{\beta }}_{q,\epsilon }-X\beta ^*\Vert _2\le \Vert X{\bar{\beta }}_{q,\epsilon }-y\Vert _2+\Vert X\beta ^*-y\Vert _2\le 2\epsilon . \end{aligned}$$
(B.9)

On the other hand, Proposition 3 is applicable to conclude that (22) holds, which shows that \(\delta \in C_q(s,1)\subseteq C_q(s,a)\) since \(a\ge 1\) (cf. (B.1)). Consequently, we obtain by the assumption of the q-REC(s, t, a) that

$$\begin{aligned} \Vert \delta _{J_*}\Vert _2\le \frac{\Vert X\delta \Vert _2}{\sqrt{m}\phi _q(s,t,a,X)}. \end{aligned}$$

This, together with (B.8) and (B.9), implies that (25) holds under the event \({\mathscr {A}}\). Noting from Lemma 1 that \({\mathbb {P}}({\mathscr {A}})\ge 1-\exp (-m)\), we obtain the conclusion. The proof is complete. \(\square \)

Proof of Theorem 2

Write \(\delta :={\hat{\beta }}_{q,\lambda }-\beta ^*\) and fix \(e\in {\mathscr {A}}\cap {\mathscr {B}}\). Note by (23) and (17) that

$$\begin{aligned} \frac{1}{m}\Vert \delta \Vert _1\Vert X^{\top }e\Vert _\infty \le \frac{a-1}{a+1}\lambda \Vert \delta \Vert _q^q. \end{aligned}$$

This, together with Lemma A.9, implies that

$$\begin{aligned} \begin{aligned} \frac{1}{2m}\Vert X{\hat{\beta }}_{q,\lambda }-X\beta ^*\Vert _2^2&\le \lambda \Vert \beta ^*\Vert _q^q-\lambda \Vert {\hat{\beta }}_{q,\lambda }\Vert _q^q+\frac{a-1}{a+1}\lambda \Vert \delta \Vert _q^q\\&\le \lambda \Vert \delta _{J}\Vert _q^q-\lambda \Vert ({\hat{\beta }}_{q,\lambda })_{J^c}\Vert _q^q+\frac{a-1}{a+1}\lambda \Vert \delta \Vert _q^q \end{aligned} \end{aligned}$$
(B.10)

(noting that \(\beta ^*_{J^c}=0\) and by (A.2)). Let \(J_*:=J\cup J_0(\delta ;t)\). One has by (24) and (A.1) that

$$\begin{aligned} \lambda \Vert \delta _{J}\Vert _q^q+\frac{a-1}{a+1}\lambda \Vert \delta \Vert _q^q \le a\lambda \Vert \delta _{J}\Vert _q^q \le a\lambda s^{1-\frac{q}{2}}\Vert \delta _{J}\Vert _2^q, \end{aligned}$$

and by the assumption of the q-REC(s, t, a) that

$$\begin{aligned} \Vert \delta _{J}\Vert _2\le \Vert \delta _{J_*}\Vert _2\le \frac{\Vert X\delta \Vert _2}{\sqrt{m}\phi _q(s,t,a,X)}. \end{aligned}$$

These two inequalities, together with (B.10), imply that

$$\begin{aligned} \frac{1}{2m}\Vert X{\hat{\beta }}_{q,\lambda }-X\beta ^*\Vert _2^2+\lambda \Vert ({\hat{\beta }}_{q,\lambda })_{J^c}\Vert _q^q \le \frac{a\lambda s^{1-\frac{q}{2}}}{m^{\frac{q}{2}}\phi _q^q(s,t,a,X)}\Vert X{\hat{\beta }}_{q,\lambda }-X\beta ^*\Vert _2^q. \end{aligned}$$
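
To make the passage from this display to (27) and (28) explicit (a step spelled out here for readability), write \(u:=\Vert X{\hat{\beta }}_{q,\lambda }-X\beta ^*\Vert _2\) and drop the nonnegative term \(\lambda \Vert ({\hat{\beta }}_{q,\lambda })_{J^c}\Vert _q^q\); then

$$\begin{aligned} \frac{1}{2m}u^2\le \frac{a\lambda s^{1-\frac{q}{2}}}{m^{\frac{q}{2}}\phi _q^q(s,t,a,X)}u^q \quad \Longrightarrow \quad u^{2-q}\le \frac{2a\lambda s^{1-\frac{q}{2}}m^{1-\frac{q}{2}}}{\phi _q^q(s,t,a,X)}, \end{aligned}$$

and raising both sides to the power \(\frac{2}{2-q}\) and dividing by m gives \(\frac{u^2}{m}\le \left( \frac{2a\lambda s^{1-\frac{q}{2}}}{\phi _q^q(s,t,a,X)}\right) ^{\frac{2}{2-q}}\).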

This yields that

$$\begin{aligned} \text{(27) } \text{ and } \text{(28) } \text{ hold } \text{ under } \text{ the } \text{ event }\ {\mathscr {A}}\cap {\mathscr {B}}. \end{aligned}$$
(B.11)

Furthermore, it follows from Lemma A.5 that

$$\begin{aligned} \Vert \delta _{J_*^c}\Vert _2^2\le t^{1-\frac{2}{q}}\Vert \delta _{J^c}\Vert _q^2\le a^{\frac{2}{q}}t^{1-\frac{2}{q}}\Vert \delta _{J}\Vert _q^2\le a^{\frac{2}{q}}\left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\Vert \delta _{J}\Vert _2^2. \end{aligned}$$

(by (24) and (A.1)). By the assumption of the q-REC(s, t, a), one has by (27) that

$$\begin{aligned} \Vert \delta _{J_*}\Vert _2^2\le \frac{\Vert X\delta \Vert _2^2}{m\phi _q^2(s,t,a,X)}\le \left( \frac{2a\lambda }{\phi _q^2(s,t,a,X)}\right) ^{\frac{2}{2-q}}s. \end{aligned}$$

Hence we obtain that

$$\begin{aligned} \begin{aligned} \Vert {\hat{\beta }}_{q,\lambda }-\beta ^*\Vert _2^2&= \Vert \delta _{J_*}\Vert _2^2+\Vert \delta _{J_*^c}\Vert _2^2\le \left( 1+a^{\frac{2}{q}}\left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\right) \Vert \delta _{J_*}\Vert _2^2 \\&\le \left( 1+a^{\frac{2}{q}}\left( \frac{s}{t}\right) ^{\frac{2}{q}-1}\right) \left( \frac{2a\lambda }{\phi _q^2(s,t,a,X)}\right) ^{\frac{2}{2-q}}s. \end{aligned} \end{aligned}$$

This shows that

$$\begin{aligned} \text{(29) } \text{ holds } \text{ under } \text{ the } \text{ event }\ {\mathscr {A}}\cap {\mathscr {B}}. \end{aligned}$$
(B.12)

By assumption (A.12), Lemma 1 is applicable to conclude that

$$\begin{aligned} {\mathbb {P}}({\mathscr {A}}\cap {\mathscr {B}}) \ge 1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}. \end{aligned}$$

This, together with (B.11) and (B.12), yields that (27)–(29) hold with probability at least \(1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}\). The proof is complete. \(\square \)

Proof of Lemma 2

(i) We first claim that

$$\begin{aligned} \phi _q(s,t,a,X)>\frac{1}{2}\varPhi _q(s,t,a,\varSigma ), \end{aligned}$$
(B.13)

whenever (A.13) holds for each \(\delta \in {{\mathbb {R}}}^n\). To this end, we suppose that (A.13) is satisfied for each \(\delta \in {{\mathbb {R}}}^n\). Fix \(\delta \in C_q(s,a)\), and let J, r, \(J_k\) (for each \(k\in {{\mathbb {N}}}\)) and \(J_*\) be defined, respectively, as in the beginning of the proof of Lemma A.6. Then (B.2) follows directly, and one has that

$$\begin{aligned} \begin{aligned} \Vert \delta \Vert _1&=\Vert \delta _{J_*}\Vert _1+\Vert \delta _{J_*^c}\Vert _1\\&\le \sqrt{s+t}\Vert \delta _{J_*}\Vert _2+a\sqrt{s}\left( \frac{as}{t}\right) ^{\frac{1}{q}-1}\Vert \delta _{J}\Vert _2\\&\le \left( \sqrt{s+t}+a\sqrt{s}\left( \frac{as}{t}\right) ^{\frac{1}{q}-1}\right) \Vert \delta _{J_*}\Vert _2. \end{aligned} \end{aligned}$$
(B.14)

By the assumption that \(\varSigma \) satisfies (34), it follows that

$$\begin{aligned} \Vert \varSigma ^{\frac{1}{2}}\delta \Vert _2^2\ge \varPhi _q^2(s,t,a,\varSigma )\Vert \delta _{J_*}\Vert _2^2. \end{aligned}$$

Substituting this inequality and (B.14) into (A.13) yields

$$\begin{aligned} \frac{\Vert X\delta \Vert _2^2}{m}\ge \left( \frac{1}{2}\varPhi _q^2(s,t,a,\varSigma )-c_1\zeta (\varSigma )\left( \sqrt{s+t}+a\sqrt{s}\left( \frac{as}{t}\right) ^{\frac{1}{q}-1}\right) ^2\frac{\log n}{m}\right) \Vert \delta _{J_*}\Vert _2^2. \end{aligned}$$

This, together with (37), shows that

$$\begin{aligned} \frac{\Vert X\delta \Vert _2^2}{m}\ge \frac{1}{4}\varPhi _q^2(s,t,a,\varSigma )\Vert \delta _{J_*}\Vert _2^2. \end{aligned}$$

Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) that (B.13) holds, as desired. Then, Lemma A.10 is applicable to conclude (38).

(ii) Noting by the assumption that \(\varSigma _{jj}=1\) for all \(j=1,\dots ,n\), [38, Theorem 1.6] is applicable to show that there exist universal positive constants \((c_1,c_2)\) and \(\tau \ge 1\) such that

$$\begin{aligned} {\mathbb {P}}\left( \cap _{j=1}^n\left\{ (1-\theta )\sqrt{m}\le \Vert X_{\cdot j}\Vert _2\le (1+\theta )\sqrt{m}\right\} \right) \ge 1-2\exp (-c_2\theta ^2m/\tau ^4), \end{aligned}$$

whenever m satisfies (39). Then it immediately follows from (36) that

$$\begin{aligned} \begin{aligned} {\mathbb {P}}({\mathscr {D}})&= {\mathbb {P}}(\cap _{j=1}^n\{\Vert X_{\cdot j}\Vert _2\le (1+\theta )\sqrt{m}\})\\&\ge 1-2\exp (-c_2\theta ^2m/\tau ^4), \end{aligned} \end{aligned}$$

that is, (40) is proved. \(\square \)

Proof of Proposition 5

By (36), one sees by Proposition 4 that (24) holds under the event \({\mathscr {A}}\cap {\mathscr {B}}\cap {\mathscr {D}}\). Then it remains to estimate \({\mathbb {P}}({\mathscr {A}}\cap {\mathscr {B}}\cap {\mathscr {D}})\). By Lemma 2(ii), there exist universal positive constants \((c_1,c_2)\) and \(\tau \ge 1\) such that

$$\begin{aligned} {\mathbb {P}}({\mathscr {D}})\ge 1-2\exp (-c_2\theta ^2m/\tau ^4), \end{aligned}$$

whenever m satisfies (42). From Lemma 1 (cf. (20)), we have also by (36) that

$$\begin{aligned} {\mathbb {P}}({\mathscr {B}}|{\mathscr {D}})\ge 1-(n^b\sqrt{\pi \log n})^{-1}. \end{aligned}$$

Then, it follows that

$$\begin{aligned} \begin{aligned} {\mathbb {P}}({\mathscr {B}}\cap {\mathscr {D}})&= {\mathbb {P}}({\mathscr {B}}|{\mathscr {D}}){\mathbb {P}}({\mathscr {D}})\\&\ge (1-(n^b\sqrt{\pi \log n})^{-1})(1-2\exp (-c_2\theta ^2m/\tau ^4)), \end{aligned} \end{aligned}$$

and then by the elementary probability theory and (18) that,

$$\begin{aligned} \begin{aligned} {\mathbb {P}}({\mathscr {A}}\cap {\mathscr {B}}\cap {\mathscr {D}})&= {\mathbb {P}}({\mathscr {B}}\cap {\mathscr {D}})-{\mathbb {P}}({\mathscr {B}}\cap {\mathscr {D}}\cap {\mathscr {A}}^c)\\&\ge {\mathbb {P}}({\mathscr {B}}\cap {\mathscr {D}})+{\mathbb {P}}({\mathscr {A}})-1\\&\ge \left( 1-\left( n^b\sqrt{\pi \log n}\right) ^{-1}\right) (1-2\exp (-c_2\theta ^2m/\tau ^4))-\exp (-m), \end{aligned} \end{aligned}$$

whenever m satisfies (42). The proof is complete. \(\square \)

Proof of Theorem 3

To simplify the proof, corresponding to inequalities (25) and (43), we define the two events

$$\begin{aligned} {\mathscr {E}}_1:=\{\text{(25) holds}\}\quad \text{and}\quad {\mathscr {E}}_2:=\{\text{(43) holds}\}. \end{aligned}$$

Then, by the definition (35) of \({\mathscr {C}}_1\), we have that \({\mathscr {C}}_1\cap {\mathscr {E}}_1\subseteq {\mathscr {E}}_2\) and thus

$$\begin{aligned} {\mathbb {P}}({\mathscr {E}}_2)\ge {\mathbb {P}}({\mathscr {E}}_1\cap {\mathscr {C}}_1)={\mathbb {P}}({\mathscr {E}}_1|{\mathscr {C}}_1){\mathbb {P}}({\mathscr {C}}_1). \end{aligned}$$
(B.15)

Note by Theorem 1 that

$$\begin{aligned} {\mathbb {P}}({\mathscr {E}}_1|{\mathscr {C}}_1)\ge 1-\exp (-m). \end{aligned}$$
(B.16)

By Lemma 2(i) (with \(a=1\)), there exist universal positive constants \((c_1,c_2)\) such that (37) ensures (38). Then we obtain by (B.15) and (B.16) that

$$\begin{aligned} {\mathbb {P}}({\mathscr {E}}_2)\ge (1-\exp (-m))(1-\exp (-c_2m)), \end{aligned}$$

whenever m satisfies (37). The proof is complete. \(\square \)

Proof of Theorem 4

To simplify the proof, we define six events: for each \(i\in \{1,2,3\}\), let \({\mathscr {F}}_i\) denote the event that the i-th inequality among (27)–(29) holds, and let \({\mathscr {G}}_i\) denote the event that the corresponding inequality of Theorem 4 holds.

Fix \(i\in \{1,2,3\}\). Then, we have by (35) that \({\mathscr {C}}_{a}\cap {\mathscr {F}}_i\subseteq {\mathscr {G}}_{i}\) and thus

$$\begin{aligned} {\mathbb {P}}({\mathscr {G}}_{i})\ge {\mathbb {P}}({\mathscr {C}}_{a}\cap {\mathscr {F}}_i). \end{aligned}$$
(B.17)

By Lemma 2, there exist universal positive constants \((c_1,c_2,c_3,c_4)\) and \(\tau \ge 1\) such that (44) ensures (38) and (40). Then it follows from (38) and (40) that

$$\begin{aligned} {\mathbb {P}}({\mathscr {C}}_{a}\cap {\mathscr {D}})\ge {\mathbb {P}}({\mathscr {C}}_{a})+{\mathbb {P}}({\mathscr {D}})-1\ge 1-\exp (-c_2m)-2\exp (-c_4\theta ^2m/\tau ^4), \end{aligned}$$
(B.18)

whenever m satisfies (44). Recall from Theorem 2 that

$$\begin{aligned} {\mathbb {P}}({\mathscr {F}}_i|{\mathscr {C}}_{a}\cap {\mathscr {D}})\ge 1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}. \end{aligned}$$

This, together with (B.18), implies that

$$\begin{aligned} \begin{aligned} {\mathbb {P}}({\mathscr {C}}_{a}\cap {\mathscr {F}}_i)&\ge {\mathbb {P}}({\mathscr {F}}_i|{\mathscr {C}}_{a}\cap {\mathscr {D}})\,{\mathbb {P}}({\mathscr {C}}_{a}\cap {\mathscr {D}})\\&\ge \left( 1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}\right) (1-\exp (-c_2m)-2\exp (-c_4\theta ^2m/\tau ^4)). \end{aligned} \end{aligned}$$

Then, one has by (B.17) that

$$\begin{aligned} {\mathbb {P}}({\mathscr {G}}_i)\ge \left( 1-\exp (-m)-\left( n^b\sqrt{\pi \log n}\right) ^{-1}\right) (1-\exp (-c_2m)-2\exp (-c_4\theta ^2m/\tau ^4)), \end{aligned}$$

whenever m satisfies (44). The proof is complete. \(\square \)

Example to illustrate the recovery bound

The following example shows the performance of the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method in a case where the 1/2-REC(1, 1, 1) is satisfied but the classical REC(1, 1, 1) is not.

Example C.1

Consider the linear regression model (1), where

$$\begin{aligned} X:=\begin{pmatrix} 2 & 3 & 1 \\ 2 & 1 & 3 \end{pmatrix},\quad \beta ^*:=(1,0,0)^\top ,\quad e\sim {\mathscr {N}}(0,0.01). \end{aligned}$$

It was verified in [32, Example 1] that the matrix X satisfies the 1/2-REC(1, 1, 1) but not the classical REC(1, 1, 1); hence the recovery bound holds for the \(\ell _{1/2}\) regularization method but may fail for the \(\ell _1\) regularization method.

To compare the performance of the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method in this case, for each regularization parameter \(\lambda \) varying from \(10^{-8}\) to 1, we randomly generate the Gaussian noise 500 times and calculate the estimated errors \(\Vert {\hat{\beta }}_{q,\lambda }-\beta ^*\Vert _2^2\) for the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method, respectively. We employ FISTA [9] and the filled function method [53] to find the global solutions of the \(\ell _1\) regularization problem and the \(\ell _{1/2}\) regularization problem, respectively. The results are illustrated in Fig. 5, in which the error bars represent the 95% confidence intervals and the recovery-bound curves stand for the terms on the right-hand sides of (29) (cf. [32, Example 2]) and (30), respectively. It is observed from Fig. 5a that, for the \(\ell _{1/2}\) regularization method, the recovery bound (29) is satisfied with high probability for most values of \(\lambda \) and is tight when \(\lambda \approx \frac{1}{2}\). Figure 5b shows that the recovery bound (30) for the \(\ell _1\) regularization method is violated when \(\lambda \) is small because the classical REC fails. Moreover, the solutions of the \(\ell _1\) regularization problem always distribute their weight equally among the three components, which prevents them from approaching a sparse solution.
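
For reproducibility, the \(\ell _1\) part of this experiment can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes a standard FISTA implementation with soft-thresholding, reads \({\mathscr {N}}(0,0.01)\) as variance 0.01, and uses an illustrative \(\lambda \) and iteration count.

import numpy as np

def fista_lasso(X, y, lam, n_iter=500):
    # FISTA for min_b (1/(2m))||y - X b||_2^2 + lam * ||b||_1
    m, n = X.shape
    L = np.linalg.norm(X, 2) ** 2 / m  # Lipschitz constant of the smooth part
    b, z, t = np.zeros(n), np.zeros(n), 1.0
    for _ in range(n_iter):
        w = z - X.T @ (X @ z - y) / (m * L)  # gradient step
        b_new = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft-thresholding
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = b_new + ((t - 1) / t_new) * (b_new - b)  # momentum step
        b, t = b_new, t_new
    return b

X = np.array([[2.0, 3.0, 1.0], [2.0, 1.0, 3.0]])
beta_star = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(0)
errs = []
for _ in range(500):  # 500 noise replications, as in the text
    e = rng.normal(0.0, 0.1, size=2)  # N(0, 0.01): standard deviation 0.1
    y = X @ beta_star + e
    b_hat = fista_lasso(X, y, lam=0.01)  # lam is illustrative
    errs.append(np.sum((b_hat - beta_star) ** 2))
print(np.mean(errs))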

Fig. 5: Illustration of recovery bounds and estimated errors

The next example illustrates the influence of the parameter a in the REC on the estimated errors obtained by the \(\ell _1\) regularization method.

Example C.2

Consider the linear regression model (1), where

$$\begin{aligned} X:=\begin{pmatrix} 1 & 1 & -a \\ 1 & 2 & -(a+1) \end{pmatrix},\quad \beta ^*:=(0,0,1)^\top ,\quad e\sim {\mathscr {N}}(0,0.01). \end{aligned}$$
(C.1)

By the geometric interpretation of the \(\mathrm{REC}\), one sees that the \(\mathrm{REC}(1,1,a)\) holds if and only if the null space of X does not intersect the feasible set \(C_1(1,a)\) defined by (B.1); one can also refer to [32, Fig. 1]. Hence one can check from the construction of X that it satisfies the \(\mathrm{REC}(1,1,c)\) for each \(c<\min \{a,2\}\) and fails otherwise. That is, if X is given by (C.1) with a larger parameter \(a\in [1,2]\), then X satisfies a stronger \(\mathrm{REC}(1,1,a)\).
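
For concreteness (a computation added here), the null space of the X in (C.1) is the line spanned by \((a-1,1,1)^\top \): indeed, \(Xv=0\) reads \(v_1+v_2-av_3=0\) and \(v_1+2v_2-(a+1)v_3=0\), and subtracting the first equation from the second gives \(v_2=v_3\) and hence \(v_1=(a-1)v_3\). The geometric criterion above thus amounts to checking whether this line intersects \(C_1(1,c)\).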

Fig. 6: Illustration of the influence of the parameter a in the REC on estimated errors

To show the influence of the parameter a on the estimated error, we take \(a:=1,1.5,2,2.5\) as instances to construct the matrix X by (C.1). Then, for each regularization parameter \(\lambda \) varying from \(10^{-8}\) to 1, we randomly generate the Gaussian noise 500 times and calculate the estimated errors \(\Vert {\hat{\beta }}_{1,\lambda }-\beta ^*\Vert _2^2\), using FISTA [9] to find the global solution of the \(\ell _1\) regularization problem. The averaged result is displayed in Fig. 6. One can see that (i) the recovery bound in (29) is satisfied when \(a>1\) but fails when \(a=1\), which is consistent with Theorem 2; and (ii) as the parameter a becomes larger, the estimated error becomes smaller, which is even better than what Theorem 2 guarantees.
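
A sketch of the corresponding experiment loop, reusing the fista_lasso helper from the sketch after Example C.1 (so the same caveats apply; all numerical choices below are illustrative):

for a in (1.0, 1.5, 2.0, 2.5):
    X = np.array([[1.0, 1.0, -a], [1.0, 2.0, -(a + 1.0)]])
    beta_star = np.array([0.0, 0.0, 1.0])
    errs = []
    for _ in range(500):  # 500 noise replications
        e = rng.normal(0.0, 0.1, size=2)  # N(0, 0.01) noise
        y = X @ beta_star + e
        errs.append(np.sum((fista_lasso(X, y, lam=0.01) - beta_star) ** 2))
    print(a, np.mean(errs))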

Cite this article

Li, X., Hu, Y., Li, C. et al. Sparse estimation via lower-order penalty optimization methods in high-dimensional linear regression. J Glob Optim 85, 315–349 (2023). https://doi.org/10.1007/s10898-022-01220-5