Abstract
The lower-order penalty optimization methods, including the \(\ell _q\) minimization method and the \(\ell _q\) regularization method \((0<q\le 1)\), have been widely applied to find sparse solutions of linear regression problems and have found successful applications across mathematics and the applied sciences. In this paper, we investigate statistical properties of the \(\ell _q\) penalty optimization methods with randomly noisy observations and a deterministic/random design. For this purpose, we introduce a general q-Restricted Eigenvalue Condition (REC) and provide sufficient conditions for it in terms of several widely used regularity conditions, such as the sparse eigenvalue condition, the restricted isometry property, and the mutual incoherence property. By virtue of the q-REC, we establish \(\ell _2\) recovery bounds of order \(O(\epsilon ^2)\) and \(O(\lambda ^{\frac{2}{2-q}}s)\) for the \(\ell _q\) minimization method and the \(\ell _q\) regularization method, respectively, with high probability for either deterministic or random designs. The results in this paper are nonasymptotic and assume only the weak q-REC. Preliminary numerical results verify the established statistical properties and demonstrate the advantages of the \(\ell _q\) penalty optimization methods over existing sparse optimization methods.
Notes
For two functions f and g, we use \(f\asymp g\) to denote that \(f=cg\) for a universal positive constant c.
References
Candès, E.J., Romberg, J.K., Tao, T.: Stable signal recovery from incomplete and inaccurate measurements. Commun. Pure Appl. Math. 59(8), 1207–1223 (2006)
Pun, C.S., Wong, H.Y.: A linear programming model for selection of sparse high-dimensional multiperiod portfolios. Eur. J. Oper. Res. 273(2), 754–771 (2019)
Qin, J., Hu, Y.H., Xu, F., Yalamanchili, H.K., Wang, J.W.: Inferring gene regulatory networks by integrating ChIP-seq/chip and transcriptome data via LASSO-type regularization methods. Methods 67(3), 294–303 (2014)
Donoho, D.L., Elad, M., Temlyakov, V.N.: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inf. Theory 52(1), 6–18 (2006)
Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by basis pursuit. SIAM Rev. 43(1), 129–159 (2001)
Figueiredo, M.A.T., Nowak, R.D., Wright, S.J.: Gradient projection for sparse reconstruction: application to compressed sensing and other inverse problems. IEEE J. Sel. Top. Signal Process. 1(4), 586–597 (2007)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Daubechies, I., Devore, R., Fornasier, M.: Iteratively reweighted least squares minimization for sparse recovery. Commun. Pure Appl. Math. 63(1), 1–38 (2010)
Candès, E.J., Tao, T.: Decoding by linear programming. IEEE Trans. Inf. Theory 51(12), 4203–4215 (2005)
Donoho, D.L., Huo, X.M.: Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inf. Theory 47(7), 2845–2862 (2001)
Bickel, P.J., Ritov, Y., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37(4), 1705–1732 (2009)
Donoho, D.L.: For most large underdetermined systems of linear equations the minimal \(\ell _1\)-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59(6), 797–829 (2006)
Cai, T.T., Xu, G.W., Zhang, J.: On recovery of sparse signals via \(\ell _1\) minimization. IEEE Trans. Inf. Theory 55(7), 3388–3397 (2009)
Raskutti, G., Wainwright, M.J., Yu, B.: Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res. 11(2), 2241–2259 (2010)
van de Geer, S.A., Bühlmann, P.: On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360–1392 (2009)
Bunea, F., Tsybakov, A., Wegkamp, M.: Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 169–194 (2007)
Zhang, T.: Some sharp performance bounds for least squares regression with \(\ell _1\) regularization. Ann. Stat. 37, 2109–2144 (2009)
Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. IEEE Signal Process. Lett. 14(10), 707–710 (2007)
Fan, J.Q., Li, R.Z.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Giuzio, M., Ferrari, D., Paterlini, S.: Sparse and robust normal and t-portfolios by penalized lq-likelihood minimization. Eur. J. Oper. Res. 250(1), 251–261 (2016)
Le Thi, H.A., Dinh, T.P., Le, H.M., Vo, X.T.: DC approximation approaches for sparse optimization. Eur. J. Oper. Res. 244(1), 26–46 (2015)
Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Xu, Z.B., Chang, X.Y., Xu, F.M., Zhang, H.: \(\text{ L}_{1/2}\) regularization: a thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 23(7), 1013–1027 (2012)
Burachik, R.S., Rubinov, A.: Abstract convexity and augmented Lagrangians. SIAM J. Optim. 18(2), 413–436 (2007). https://doi.org/10.1137/050647621
Huang, X., Yang, X.: A unified augmented Lagrangian approach to duality and exact penalization. Math. Oper. Res. 28(3), 533–552 (2003). https://doi.org/10.1287/moor.28.3.533.16395
Luo, Z., Pang, J., Ralph, D.: Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge (1996)
Yang, X., Huang, X.: A nonlinear Lagrangian approach to constrained optimization problems. SIAM J. Optim. 11(4), 1119–1144 (2001)
Dong, Z.L., Yang, X.Q., Dai, Y.H.: A unified recovery bound estimation for noise-aware \(\ell _q\) optimization model in compressed sensing. arXiv preprint arXiv:1609.01531 (2016)
Song, C.B., Xia, S.T.: Sparse signal recovery by \(\ell _q\) minimization under restricted isometry property. IEEE Signal Process. Lett. 21(9), 1154–1158 (2014)
Hu, Y.H., Li, C., Meng, K.W., Qin, J., Yang, X.Q.: Group sparse optimization via \(\ell _{p, q}\) regularization. J. Mach. Learn. Res. 18(30), 1–52 (2017)
Liu, H.C., Yao, T., Li, R.Z., Ye, Y.Y.: Folded concave penalized sparse linear regression: sparsity, statistical performance, and algorithmic theory for local solutions. Math. Progr. 166(1–2), 207–240 (2017)
Loh, P.L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16(1), 559–616 (2015)
Zhang, C.H., Zhang, T.: A general theory of concave regularization for high-dimensional sparse estimation problems. Stat. Sci. 27(4), 576–593 (2012)
Candès, E.J., Romberg, J.K., Tao, T.: Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 52(2), 489–509 (2006)
Agarwal, A., Negahban, S., Wainwright, M.J.: Fast global convergence of gradient methods for high-dimensional statistical recovery. Ann. Stat. 40(5), 2452–2482 (2012)
Zhou, S.H.: Restricted eigenvalue conditions on subgaussian random matrices. arXiv preprint arXiv:0912.4045 (2009)
Rao, C.R.: Linear Statistical Inference and Its Applications. Wiley, New York (1973)
van de Geer, S.A.: High-dimensional generalized linear models and the Lasso. Ann. Stat. 36(2), 614–645 (2008)
Loh, P.L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1637–1664 (2012)
Negahban, S., Ravikumar, P., Wainwright, M.J., Yu, B.: A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Stat. Sci. 27(4), 538–557 (2012)
Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)
Mazumder, R., Radchenko, P.: The discrete Dantzig selector: estimating sparse linear models via mixed integer linear optimization. IEEE Trans. Inf. Theory 63(5), 3053–3075 (2017)
Mazumder, R., Radchenko, P., Dedieu, A.: Subset selection with shrinkage: sparse linear modeling when the SNR is low. arXiv preprint arXiv:1708.03288 (2017)
Zhang, C.H., Huang, J.: The sparsity and bias of the Lasso selection in high-dimensional linear regression. Ann. Stat. 36(4), 1567–1594 (2008)
Candès, E.J., Wakin, M.B., Boyd, S.P.: Enhancing sparsity by reweighted \(\ell _1\) minimization. J. Fourier Anal. Appl. 14(5–6), 877–905 (2008)
Chartrand, R., Yin, W.T.: Iteratively reweighted algorithms for compressive sensing. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3869–3872 (2008)
Blumensath, T., Davies, M.E.: Iterative thresholding for sparse approximations. J. Fourier Anal. Appl. 14(5–6), 629–654 (2008)
Herman, J., Kucera, R., Simsa, J.: Equations and Inequalities: Elementary Problems and Theorems in Algebra and Number Theory. Springer, Berlin (2000)
Ross, S.: A First Course in Probability. Pearson, London (2009)
Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over \(\ell _q\)-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)
Ge, R.: A filled function method for finding a global minimizer of a function of several variables. Math. Progr. 46(1–3), 191–204 (1990). https://doi.org/10.1007/BF01585737
Acknowledgements
The authors are grateful to the editor and the anonymous reviewers for their valuable comments and suggestions toward the improvement of this paper. Xin Li’s work was supported in part by the Natural Science Foundation of Shaanxi Province of China (2022JQ-045). Yaohua Hu’s work was supported in part by the National Natural Science Foundation of China (12071306, 32170655, 11871347), Natural Science Foundation of Guangdong Province of China (2019A1515011917, 2020B1515310008), Project of Educational Commission of Guangdong Province of China (2021KTSCX103, 2019KZDZX1007), and Natural Science Foundation of Shenzhen (JCYJ20190808173603590). Chong Li’s work was supported in part by the National Natural Science Foundation of China (11971429) and Zhejiang Provincial Natural Science Foundation of China (LY18A010004). Xiaoqi Yang’s work was supported in part by the Research Grants Council of Hong Kong (PolyU 15212817). Tianzi Jiang’s work was supported in part by the Science and Technology Innovation 2030 - Brain Science and Brain-Inspired Intelligence Project of China (2021ZD0200200).
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Preliminary lemmas
We first recall some basic properties of the \(\ell _q\) norm in the following lemmas; in particular, the first inequality in (A.1) is from [32, Eq. (7)], the second inequality in (A.1) is from [50, Eq. (104)], and (A.2) follows from [32, Lemma 2] by taking \(p=1\) and a simple variable substitution.
Lemma A.1
Let \(\alpha ,\beta \in {{\mathbb {R}}}^n\). Then the following relations are true:
Lemma A.2
Let \(p\ge 1\), \(n_1,n_2\in {{\mathbb {N}}}\), \(\alpha \in {{\mathbb {R}}}_+^{n_1}\), \(\beta \in {{\mathbb {R}}}_+^{n_2}\) and \(c>0\) be such that
Then
Proof
Let \(\alpha _{\max }:=\max \limits _{1\le i\le n_1}\alpha _i\) and \(\beta _{\min }:=\min \limits _{1\le j\le n_2}\beta _j\). Then it holds that
Without loss of generality, we assume that \(\alpha _{\max }>0\); otherwise, (A.4) holds automatically. Thus, by the first inequality of (A.3) and noting \(p\ge 1\), we have that
Multiplying the inequalities in (A.5) by \(\beta _{\min }\sum \limits _{j=1}^{n_2}\beta _j\) and \(\alpha _{\max }\sum \limits _{i=1}^{n_1}\alpha _i\) respectively, we obtain that
where the second inequality follows from (A.6). This, together with the second inequality of (A.3), yields (A.4). The proof is complete. \(\square \)
The following lemmas are useful for establishing the relationship between the q-REC and other types of regularity conditions; in particular, Lemmas A.3 and A.4 are taken from [11, Lemma 1.1] and [17, Lemma 3.1], respectively.
Lemma A.3
Let \(X\in {{\mathbb {R}}}^{m\times n}\) and \(s, t \in {{\mathbb {N}}}\) be such that \(s+t\le n\). Then
Lemma A.4
Let \(\alpha ,\beta \in {{\mathbb {R}}}^n\) and \(0<\tau <1\) be such that \(-\langle \alpha ,\beta \rangle \le \tau \Vert \alpha \Vert _2^2\). Then \((1-\tau )\Vert \alpha \Vert _2\le \Vert \alpha +\beta \Vert _2\).
For the sake of simplicity, a partition structure and some notations are presented. For a vector \(\delta \in {{\mathbb {R}}}^n\) and an index set \(J\subseteq \{1,2,\dots ,n\}\), we use \(\mathrm{rank}(\delta _i;J^c)\) to denote the rank of the absolute value of \(\delta _i\) in \(J^c\) (in decreasing order) and \(J_k(\delta ;t)\) to denote the index set of the k-th batch of the first t largest coordinates in absolute value of \(\delta \) in \(J^c\). That is,
Furthermore, we let \(r:=\lceil \frac{n-s}{t}\rceil \) (where \(\lceil u\rceil \) denotes the smallest integer not less than u), \(J_k:=J_k(\delta ;t)\) (defined by (A.7)) for each \(k\in {{\mathbb {N}}}\), and \(J_*:=J\cup J_0\). With these notations, Lemma A.5 below is taken from [32, Lemma 7] when the group structure degenerates.
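As an illustration of this partition, the following Python sketch (ours, not part of the paper; the helper name `batch_partition` and the toy data are hypothetical) builds the batches \(J_0,J_1,\dots \) of (A.7) for a given vector \(\delta \), index set J, and batch size t.

```python
import numpy as np

def batch_partition(delta, J, t):
    """Return the batches J_0, J_1, ... of (A.7): J_k collects the k-th block of
    (at most) t indices of J^c, ordered by decreasing |delta_i|."""
    n = len(delta)
    Jc = np.array([i for i in range(n) if i not in set(J)], dtype=int)
    order = Jc[np.argsort(-np.abs(delta[Jc]))]   # rank(delta_i; J^c) in decreasing order
    r = int(np.ceil(len(order) / t))             # r = ceil((n - s)/t) batches
    return [order[k * t:(k + 1) * t] for k in range(r)]

# toy usage: J indexes the s largest coordinates of delta (here s = 1), t = 2
delta = np.array([0.1, -3.0, 0.5, 2.0, -0.2, 0.05])
print(batch_partition(delta, J=[1], t=2))        # batches [3, 2], [4, 0], [5]
```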
Lemma A.5
Let \(\delta \in {{\mathbb {R}}}^n\), \(0<q\le 1\) and \(\tau \ge 1\). Then the following inequalities hold
Lemma A.6
Let \(X\in {{\mathbb {R}}}^{m\times n}\), \(0<q\le 1\), \(a>0\), and (s, t) be a pair of integers satisfying (10). Then the following relations are true:
Proof
Fix \(\delta \in C_q(s,a)\), as defined by (B.1). Then there exists \(J\subseteq \{1,2,\dots ,n\}\) such that
Write \(r:=\lceil \frac{n-s}{t}\rceil \), \(J_k:=J_k(\delta ;t)\) (defined by (A.7)) for each \(k\in {{\mathbb {N}}}\) and \(J_*:=J\cup J_0\). Then it follows from Lemma A.5 and (A.10) that
(due to (A.1)). Noting by (A.7) and (A.10) that \(|J_*|\le s+t\) and \(|J_k|\le t\) for each \(k\in {{\mathbb {N}}}\), one has by (4) that
These, together with (A.11), imply that
Since \(\delta \) and J satisfying (A.10) are arbitrary, (A.8) is shown to hold by (11) and the fact that \(J_*=J\cup J_0(\delta ;t)\). One can prove (A.9) in a similar way, and thus the details are omitted. \(\square \)
The next two lemmas are used to estimate the probabilities of random events related to the linear regression model (1); Lemma A.7 is taken from [38, Lemma C.1].
Lemma A.7
Let \(0\le \theta <1\) and \(b\ge 0\). Suppose that
Then
Lemma A.8
Let \(d\ge 5\). Then
Proof
Recall that \(e=(e_1,\dots ,e_m)^\top \sim {\mathscr {N}}(0,\sigma ^2{\mathbb {I}}_m)\). Let \(u_i:=\frac{1}{\sigma }e_i\) for \(i=1,\dots ,m\). Then \(u_1,\dots ,u_m\) are i.i.d. Gaussian variables with \(u_i\sim {\mathscr {N}}(0,1)\) for \(i=1,\dots ,m\). Let \(u:=(u_1,\dots ,u_m)^\top \). Clearly, \(\Vert u\Vert _2^2=\frac{1}{\sigma ^2}\Vert e\Vert _2^2\) is a chi-square random variable with m degrees of freedom (see, e.g., [51, Section 5.6]). Then it follows from standard tail bounds for chi-square random variables (see, e.g., [52, Appendix I]) that
(as \(d\ge 5\)). Consequently, we obtain that
The proof is complete. \(\square \)
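The fact used above, that \(\Vert e\Vert _2^2/\sigma ^2\) follows a chi-square distribution with m degrees of freedom, can also be checked numerically. The Monte Carlo sketch below is our own illustration; the values of m, \(\sigma \), and the tail threshold are arbitrary choices and are not taken from Lemma A.8, whose display is omitted above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, trials = 100, 0.5, 20000

# simulate e ~ N(0, sigma^2 I_m) and form ||e||_2^2 / sigma^2
e = sigma * rng.standard_normal((trials, m))
stat = (e ** 2).sum(axis=1) / sigma ** 2

# a chi-square(m) variable has mean m and variance 2m
print(stat.mean(), stat.var())                  # close to 100 and 200
# empirical tail probability beyond an (arbitrary) threshold m + 4*sqrt(m)
print((stat > m + 4.0 * np.sqrt(m)).mean())     # small, as a chi-square tail bound suggests
```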
Recall that \(\beta ^*\) satisfies the linear regression model (1). The following lemma is beneficial in proving Theorem 2.
Lemma A.9
Let \({\hat{\beta }}_{q,\lambda }\) be an optimal solution of \((\text {RP}_{q,\lambda })\). Then
Proof
Since \({\hat{\beta }}_{q,\lambda }\) is an optimal solution of \((\text {RP}_{q,\lambda })\), it follows that
This, together with (1), yields that
The proof is complete. \(\square \)
Assume that \(X_{1\cdot },\dots ,X_{m\cdot }\) are i.i.d. random vectors with each \(X_{i\cdot }\sim {\mathscr {N}}(0,\varSigma )\). The following lemma, taken from [37, Supplementary material, Lemma 6], is useful for providing a sufficient condition for the q-REC of the random design X.
Lemma A.10
There exist universal positive constants \((c_1,c_2)\) (independent of \(m,n,\varSigma \)) such that it holds with probability at least \(1-\exp (-c_2m)\) that, for each \(\delta \in {{\mathbb {R}}}^n\)
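For simulations in this random-design setting, rows \(X_{1\cdot },\dots ,X_{m\cdot }\) that are i.i.d. \({\mathscr {N}}(0,\varSigma )\) can be generated as sketched below. The covariance used here (an AR(1)-type matrix with unit diagonal) is our own choice for illustration and is not prescribed by Lemma A.10.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, rho = 200, 500, 0.3

# a hypothetical covariance with unit diagonal: Sigma_{jk} = rho^{|j-k|}
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# rows X_{i.} ~ N(0, Sigma), i.i.d.; X has shape (m, n)
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((m, n)) @ L.T
```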
Technical proofs
Proof of Proposition 1
Associated with the q-REC(s, t, a), we define the feasible set
By Definition 5, it suffices to show that \(C_{q_1}(s,a)\subseteq C_{q_2}(s,a)\). To this end, let \(\delta \in C_{q_1}(s,a)\), and let \(J_0\) denote the index set of the first s largest coordinates in absolute value of \(\delta \). By the assumption that \(\delta \in C_{q_1}(s,a)\) and the construction of \(J_0\), one has \(\Vert \delta _{J_0^c}\Vert _{q_1}^{q_1}\le a\Vert \delta _{J_0}\Vert _{q_1}^{q_1}\). Then we obtain by Lemma A.2 (with \(q_2/q_1\) in place of p) that \(\Vert \delta _{J_0^c}\Vert _{q_2}^{q_2}\le a\Vert \delta _{J_0}\Vert _{q_2}^{q_2}\); consequently, \(\delta \in C_{q_2}(s,a)\). Hence \(C_{q_1}(s,a)\subseteq C_{q_2}(s,a)\), and the proof is complete. \(\square \)
Proof of Proposition 2
It directly follows from Lemma A.6 (cf. (A.8)) that X satisfies the q-REC(s, t, a) provided that condition (a) holds. Fix \(\delta \in C_q(s,a)\), and let J, r, \(J_k\) (for each \(k\in {{\mathbb {N}}}\)) and \(J_*\) be defined, respectively, as in the beginning of the proof of Lemma A.6. Then (A.11) follows directly, and Lemma A.5 together with (17) yields that
Suppose that condition (b) is satisfied. By Definition 2 (cf. (3)), one has that
Then it follows from (A.11) that
(by (2)). Since \(s\le t\) (by (10)), one has by Definition 1(i) that \(\eta _s(X)\le \eta _t(X)\), and then by Lemma A.3 that \(\eta _{s+t}(X)\le \theta _{s,t}(X)+\eta _t(X)\). Then it follows from (b) that
This, together with (B.3), shows that Lemma A.4 is applicable (with \(X\delta _{J_*}\), \(X\delta _{J_*^c}\), \(\frac{a^{\frac{1}{q}}\left( \frac{s}{t}\right) ^{\frac{1}{q}-\frac{1}{2}}\theta _{t,s+t}(X)}{1-\eta _{s+t}(X)}\) in place of \(\alpha \), \(\beta \), \(\tau \)) to concluding that
(due to (2)). Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) and (B.4) that
consequently, X satisfies the q-REC(s, t, a).
Suppose that (c) is satisfied. Then we have by (B.2) and Definition 2 (cf. (3)) that
Separating the diagonal and off-diagonal terms of the quadratic form \(\delta _{J_*}^\top X^\top X\delta _{J_*}\), one has by (A.1) and (c) that
Combining this inequality with (B.5), we get that
Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) and (c) that
consequently, X satisfies the q-REC(s, t, a). The proof is complete. \(\square \)
Proof of Lemma 1
By (14) and (16), Lemma A.8 is applicable (with \(d=5\)) to showing that \({\mathbb {P}}({\mathscr {A}}^c)\le \exp (-m)\), that is, (18) is proved. Then it remains to show (20) and (21). For this purpose, we have by (15) that \(\lambda \ge \frac{5}{2}\sigma ^2\), and noting that \(0<q\le 1\),
(due to (14)). Then one has by (17) that
Hence, by assumption (A.12), Lemma A.7 is applicable to ensuring (20). Moreover, it follows from elementary probability theory that
The proof is complete. \(\square \)
Proof of Proposition 3
Let \(e\in {\mathscr {A}}\). Recalling that \(\beta ^*\) satisfies the linear regression model (1), one has that \(\Vert y-X\beta ^*\Vert _2=\Vert e\Vert _2\le \epsilon \) (under the event \({\mathscr {A}}\)), and so \(\beta ^*\) is a feasible vector of \((\text {CP}_{q,\epsilon })\). Consequently, by the optimality of \({\bar{\beta }}_{q,\epsilon }\) for \((\text {CP}_{q,\epsilon })\), it follows that \(\Vert {\bar{\beta }}_{q,\epsilon }\Vert _q\le \Vert \beta ^*\Vert _q\). Write \(\delta :={\bar{\beta }}_{q,\epsilon }-\beta ^*\). Then we obtain that
where the last equality holds because \(\beta ^*_{J^c}=0\). On the other hand, one has by (A.2) that \(\Vert \beta ^*+\delta _{J}\Vert _q^q\ge \Vert \beta ^*\Vert _q^q-\Vert \delta _{J}\Vert _q^q\). This, together with (B.6), implies (22). The proof is complete. \(\square \)
Proof of Proposition 4
Let \(e\in {\mathscr {A}}\). Since \({\hat{\beta }}_{q,\lambda }\) is an optimal solution of \((\text {RP}_{q,\lambda })\), one has that
Then, by (1) and (13), it follows that
(due to (14) and (16)). Write \(\delta :={\hat{\beta }}_{q,\lambda }-\beta ^*\). Then, we obtain by (A.1) and (13) that
Consequently, noting that \(0<q\le 1\), one sees that \(\frac{\Vert \delta \Vert _1}{2\rho }\le \left( \frac{\Vert \delta \Vert _1}{2\rho }\right) ^q\), and then, by (A.1) that
This proves (23). It remains to show (24). To this end, noting that \(\beta ^*_{J^c}=0\), we derive by Lemma A.9 that
(by (A.2)). This, together with (B.7), yields that
Then, under the event \({\mathscr {A}}\cap {\mathscr {B}}\), we obtain by (17) that
which yields (24). The proof is complete. \(\square \)
Proof of Theorem 1
Write \(\delta :={\bar{\beta }}_{q,\epsilon }-\beta ^*\), and let \(J_*:=J\cup J_0(\delta ;t)\) (defined by (A.7)). Fix \(e\in {\mathscr {A}}\). Then it follows from Lemma A.5 and Proposition 3 that
(by (A.1)), and so
Recalling that \(\beta ^*\) satisfies the linear regression model (1), we have that \(\Vert y-X\beta ^*\Vert _2=\Vert e\Vert _2\le \epsilon \) (by (16)), and then
On the other hand, Proposition 3 is applicable to concluding that (22) holds, which shows \(\delta \in C_q(s,1)\subseteq C_q(s,a)\) due to \(a\ge 1\) (cf. (B.1)). Consequently, we obtain by the assumption of the q-REC(s, t, a) that
This, together with (B.8) and (B.9), implies that (25) holds under the event \({\mathscr {A}}\). Noting from Lemma 1 that \({\mathbb {P}}({\mathscr {A}})\ge 1-\exp (-m)\), we obtain the conclusion. The proof is complete. \(\square \)
Proof of Theorem 2
Write \(\delta :={\hat{\beta }}_{q,\lambda }-\beta ^*\) and fix \(e\in {\mathscr {A}}\cap {\mathscr {B}}\). Note by (23) and (17) that
This, together with Lemma A.9, implies that
(noting that \(\beta ^*_{J^c}=0\) and by (A.2)). Let \(J_*:=J\cup J_0(\delta ;t)\). One has by (24) and (A.1) that
and by the assumption of the q-REC(s, t, a) that
These two inequalities, together with (B.10), imply that
This yields that
Furthermore, it follows from Lemma A.5 that
(by (24) and (A.1)). By the assumption of the q-REC(s, t, a), one has by (27) that
Hence we obtain that
This shows that
By assumption (A.12), Lemma 1 is applicable to concluding that
This, together with (B.11) and (B.12), yields that (27)-(29) hold with probability at least \(1-\exp (-m) -\left( n^b\sqrt{\pi \log n}\right) ^{-1}\). The proof is complete. \(\square \)
Proof of Lemma 2
(i) We first claim that
whenever (A.13) holds for each \(\delta \in {{\mathbb {R}}}^n\). To this end, we suppose that (A.13) is satisfied for each \(\delta \in {{\mathbb {R}}}^n\). Fix \(\delta \in C_q(s,a)\), and let J, r, \(J_k\) (for each \(k\in {{\mathbb {N}}}\)) and \(J_*\) be defined, respectively, as in the beginning of the proof of Lemma A.6. Then (B.2) follows directly, and one has that
By the assumption that \(\varSigma \) satisfies (34), it follows that
Substituting this inequality and (B.14) into (A.13) yields
This, together with (37), shows that
Since \(\delta \) and J satisfying (A.10) are arbitrary, we derive by (11) that (B.13) holds, as desired. Then, Lemma A.10 is applicable to concluding (38).
(ii) Noting that \(\varSigma _{jj}=1\) for all \(j=1,\dots ,n\) by assumption, [38, Theorem 1.6] is applicable to showing that there exist universal positive constants \((c_1,c_2)\) and \(\tau \ge 1\) such that
whenever m satisfies (39). Then it immediately follows from (36) that
that is, (40) is proved. \(\square \)
Proof of Proposition 5
By (36), one sees by Proposition 4 that (24) holds under the event \({\mathscr {A}}\cap {\mathscr {B}}\cap {\mathscr {D}}\). Then it remains to estimate \({\mathbb {P}}({\mathscr {A}}\cap {\mathscr {B}}\cap {\mathscr {D}})\). By Lemma 2(ii), there exist universal positive constants \((c_1,c_2)\) and \(\tau \ge 1\) such that
whenever m satisfies (42). From Lemma 1 (cf. (20)), we have also by (36) that
Then, it follows that
and then by elementary probability theory and (18) that
whenever m satisfies (42). The proof is complete. \(\square \)
Proof of Theorem 3
To simplify the proof, corresponding to inequalities (25) and (43), we define the following two events
Then, by the definition of \({\mathscr {C}}_1\) (35), we have that \({\mathscr {C}}_1\cap {\mathscr {E}}_1\subseteq {\mathscr {E}}_2\) and thus
Note by Theorem 1 that
By Lemma 2(i) (with \(a=1\)), there exist universal positive constants \((c_1,c_2)\) such that (37) ensures (38). Then we obtain by (B.15) and (B.16) that
whenever m satisfies (37). The proof is complete. \(\square \)
Proof of Theorem 4
To simplify the proof, we define the following six events
Fix \(i\in \{1,2,3\}\). Then, we have by (35) that \({\mathscr {C}}_{a}\cap {\mathscr {F}}_i\subseteq {\mathscr {G}}_{i}\) and thus
By Lemma 2, there exist universal positive constants \((c_1,c_2,c_3,c_4)\) and \(\tau \ge 1\) such that (44) ensures (38) and (40). Then it follows from (38) and (40) that
whenever m satisfies (44). Recall from Theorem 2 that
This, together with (B.18), implies that
Then, one has by (B.17) that
whenever m satisfies (44). The proof is complete. \(\square \)
Example to illustrate the recovery bound
The following example compares the performance of the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method in a case where the 1/2-REC(1, 1, 1) is satisfied but the classical REC(1, 1, 1) is not.
Example C.1
Consider the linear regression model (1), where
It was validated in [32, Example 1] that the matrix X satisfies the 1/2-REC(1, 1, 1) but not the classical REC(1, 1, 1); hence the recovery bound holds for the \(\ell _{1/2}\) regularization method but may fail for the \(\ell _1\) regularization method.
To compare the performance of the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method in this case, for each regularization parameter \(\lambda \) varying from \(10^{-8}\) to 1, we randomly generate the Gaussian noise 500 times and calculate the estimated errors \(\Vert {\hat{\beta }}_{q,\lambda }-\beta ^*\Vert _2^2\) for the \(\ell _{1/2}\) regularization method and the \(\ell _1\) regularization method, respectively. We employ FISTA [9] and the filled function method [53] to find the global solutions of the \(\ell _1\) regularization problem and the \(\ell _{1/2}\) regularization problem, respectively. The results are illustrated in Fig. 5, in which the error bars represent the 95% confidence intervals and the recovery-bound curves stand for the terms on the right-hand side of (29) (cf. [32, Example 2]) and (30), respectively. It is observed from Fig. 5a that, for the \(\ell _{1/2}\) regularization method, the recovery bound (29) is satisfied with high probability for most values of \(\lambda \) and is tight when \(\lambda \thickapprox \frac{1}{2}\). Figure 5b shows that the recovery bound (30) for the \(\ell _1\) regularization method is not satisfied when \(\lambda \) is small because the classical REC is violated. Moreover, the solutions of the \(\ell _1\) regularization problem always spread their weight equally over the 3 components, and hence fail to approach a sparse solution.
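For reproducibility, the following Python sketch implements the \(\ell _1\) part of this experiment via FISTA with soft thresholding. It is our own minimal implementation, not the authors' code; since the X and \(\beta ^*\) of Example C.1 are not reproduced above, hypothetical stand-ins are used, and the noise level is likewise an assumption.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_l1(X, y, lam, iters=500):
    """FISTA for min_beta 0.5*||y - X beta||_2^2 + lam*||beta||_1."""
    n = X.shape[1]
    L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the smooth part
    beta, z, t = np.zeros(n), np.zeros(n), 1.0
    for _ in range(iters):
        grad = X.T @ (X @ z - y)
        beta_new = soft_threshold(z - grad / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t ** 2)) / 2
        z = beta_new + (t - 1) / t_new * (beta_new - beta)
        beta, t = beta_new, t_new
    return beta

# hypothetical stand-ins for the X and beta* of Example C.1 (their displays are omitted above)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
beta_star = np.array([1.0, 0.0, 0.0])
y = X @ beta_star + 0.1 * rng.standard_normal(3)
print(np.sum((fista_l1(X, y, lam=0.1) - beta_star) ** 2))   # estimated error ||beta_hat - beta*||_2^2
```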
The next example is to illustrate the influence of the parameter a as in the REC on estimated errors obtained by the \(\ell _1\) regularization method.
Example C.2
Consider the linear regression model (1), where
Due to the geometric interpretation of the \(\mathrm{REC}\), one sees that the \(\mathrm{REC}(1,1,a)\) holds if and only if the null space of X does not intersect the feasible set \(C_1(1,a)\) (cf. (B.1)); see also [32, Fig. 1]. Hence one can check from the construction of X that it satisfies the \(\mathrm{REC}(1,1,c)\) for each \(c<\min \{a,2\}\) and fails to satisfy it otherwise. That is, if X is given by (C.1) with a larger parameter \(a\in [1,2]\), then X satisfies a stronger \(\mathrm{REC}(1,1,a)\).
To show the influence of the parameter a on the estimated error, we take \(a:=1,1.5,2,2.5\) as instances to construct the matrix X by (C.1). Then, for each regularization parameter \(\lambda \) varying from \(10^{-8}\) to 1, we randomly generate the Gaussian noise 500 times and calculate the estimated errors \(\Vert {\hat{\beta }}_{1,\lambda }-\beta ^*\Vert _2^2\), using FISTA [9] to find the global solution of the \(\ell _1\) regularization problem. The averaged results are displayed in Fig. 6. One can see that (i) the recovery bound in (29) is satisfied when \(a>1\) but fails when \(a=1\), which is consistent with Theorem 2; and (ii) as the parameter a becomes larger, the estimated error becomes smaller, which is an even better outcome than Theorem 2 predicts.
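A sketch of the experimental protocol shared by Examples C.1 and C.2 (\(\lambda \) swept from \(10^{-8}\) to 1, 500 Gaussian-noise replications, errors averaged) is given below. The grid density, noise level, and the generic `solver` argument (which could be, e.g., the `fista_l1` sketch above) are our assumptions, not specifications from the paper.

```python
import numpy as np

def average_errors(X, beta_star, solver, sigma=0.1, n_rep=500):
    """Average ||beta_hat - beta*||_2^2 over n_rep noise draws, for each lambda
    on a logarithmic grid from 1e-8 to 1 (grid density is our choice)."""
    lambdas = np.logspace(-8, 0, 30)
    rng = np.random.default_rng(0)
    errors = np.zeros((len(lambdas), n_rep))
    for r in range(n_rep):
        y = X @ beta_star + sigma * rng.standard_normal(X.shape[0])
        for i, lam in enumerate(lambdas):
            errors[i, r] = np.sum((solver(X, y, lam) - beta_star) ** 2)
    return lambdas, errors.mean(axis=1)
```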
About this article
Cite this article
Li, X., Hu, Y., Li, C. et al. Sparse estimation via lower-order penalty optimization methods in high-dimensional linear regression. J Glob Optim 85, 315–349 (2023). https://doi.org/10.1007/s10898-022-01220-5