Abstract
Statistical data analysis and machine learning heavily rely on error measures for regression, classification, and forecasting. Bregman divergence (\({\text{BD}}\)) is a widely used family of error measures, but it is not robust to outlying observations or high leverage points in large- and high-dimensional datasets. In this paper, we propose a new family of robust Bregman divergences called “robust-\({\text{BD}}\)” that are less sensitive to data outliers. We explore their suitability for sparse large-dimensional regression models with incompletely specified response variable distributions and propose a new estimate called the “penalized robust-\({\text{BD}}\) estimate” that achieves the same oracle property as ordinary non-robust penalized least-squares and penalized-likelihood estimates. We conduct extensive numerical experiments to evaluate the performance of the proposed penalized robust-\({\text{BD}}\) estimate and compare it with classical approaches, showing that the proposed method improves on existing non-robust alternatives. Finally, we analyze a real dataset to illustrate the practicality of our proposed method. Our findings suggest that the proposed method can be a useful tool for robust statistical data analysis and machine learning in the presence of outliers and large-dimensional data.
1 Introduction
Advancements in high-throughput technologies have made it possible to collect sophisticated, high-dimensional datasets, such as microarray data, genome-wide human SNP data, high-frequency financial data, functional data, and brain imaging data. In contrast to conventional datasets, where the variables are far fewer than the observations, large-dimensional datasets may contain as many variables as observations, or even more. These scenarios correspond to \(p_{_n}\ll n\), \(p_{_n}\approx n\), and \(p_{_n}\gg n\), respectively, with \(p_{_n}\) being the number of variables and n being the number of observations. A fundamental problem common to any large-dimensional dataset is that observations are more prone to outliers in either the covariate space or the response space than those in low-dimensional datasets. In such settings, outliers can lead to erroneous conclusions concerning statistical estimation, but the mechanism of contamination can be quite complex and intractable in general. It can also be difficult or even impossible to spot outliers in large-dimensional or highly structured data. Hence, exploring and developing robust statistical estimation and inference procedures that are resistant to outliers in large-dimensional data becomes increasingly important.
In recent years, much attention has been focused on developing penalized estimates of parameters in regression models with a large number of parameters. Examples include the \(\text{Lasso}\) (Tibshirani, 1996), the \(\text{SCAD}\) (Fan & Peng, 2004), the adaptive \(\text{Lasso}\) (Zou, 2006), the Dantzig selector (Candes & Tao, 2007), and the group \(\text{Lasso}\) (Meier et al., 2008), among others. These penalized least-squares estimates or penalized-likelihood estimates are obtained using a quadratic loss function or require full knowledge of the likelihood function. The penalized “classical-\({\text{BD}}\)” estimation was proposed in Zhang et al. (2010) for \(p_{_n} \ll n\) and \(p_{_n}\approx n\), where the error measure \({\text{BD}}\) includes the quadratic loss, the exponential family of distributions, the (negative) quasi-likelihood, and many others as special cases. However, these non-robust estimates do not handle outlying observations. General tools for investigating the robustness properties of penalized estimates, especially when \(p_{_n}\approx n\) and \(p_{_n}\gg n\), seem to be much less developed.
It is well-known that the influence functions of classical (non-penalized) regression estimates based on the quadratic loss function and likelihood are unbounded. Large deviations of the response from its mean, as measured by the Pearson residuals, or outlying points in the covariate space, can have a significant influence on the estimates. While robust procedures in Bianco et al. (1996), Künsch et al. (1989), Stefanski et al. (1986) control outliers for the generalized linear model (\(\text{GLM}\)), these procedures are limited to finite- and low-dimensional problems. It remains unclear to what extent they are useful in large- and high-dimensional settings. The works (Boente et al., 2006; Cantoni & Ronchetti, 2001) developed robust quasi-likelihood estimates of finite-dimensional parameters. However, the robust quasi-likelihood procedure is not available for other types of error measures, such as the hinge loss for the support vector machine (\({\text{SVM}}\)) (Vapnik, 1996) and the exponential loss for AdaBoost (Freund & Schapire, 1997), which are commonly used in classification procedures and machine learning practice. This is because these error measures do not fall into the (negative) quasi-likelihood category.
This paper aims to investigate the applicability of robust statistical inference for regression estimation and classification procedures in large-dimensional (\(p_{_n}\approx n\)) and high-dimensional (\(p_{_n}\gg n\)) settings, where the distribution of the response variable given covariates may be incompletely specified. The proposed work is not a simple endeavor and does not aim to solve all possible issues stemming from combining robustness with large dimensionality. The paper identifies major challenges and presents new results below.
-
In Sect. 2, we contribute to constructing a new class of robust error measures called “robust-\({\text{BD}}\)”. This is motivated by Bregman divergence (\({\text{BD}}\)), which plays an important role in quantifying error measures for regression estimates and classification procedures. The quadratic loss function and negative quasi-likelihood are two widely used error measures that, along with many others, belong to the family of \({\text{BD}}\). This newly proposed “robust-\({\text{BD}}\)” method broadens the scope of penalized estimation methods, greatly facilitating the investigation of their asymptotic behavior in a systematic way. The new method is applicable to all aforementioned error measures (e.g., the hinge loss and exponential loss, which fail to be (negative) quasi-likelihood but belong to \({\text{BD}}\)). The “robust-\({\text{BD}}\)” benefits from the flexibility and extensibility offered by \({\text{BD}}\). Nonetheless, unlike the “classical-\({\text{BD}}\)”, the “robust-\({\text{BD}}\)” entails a bias-correction procedure that complicates theoretical derivations as well as practical implementations; see concrete examples in Sect. 2.3. Moreover, when \(p_{_n}< n\), justifying the influence function of a “robust-\({\text{BD}}\)” estimate for a \(p_{_n}\)-dimensional parameter calls for re-examination and re-derivation beyond the framework of Hampel et al. (1986), confined to a fixed-dimensional parameter.
-
In Sects. 3 and 4, we study the consistency and oracle property of the proposed “penalized robust-\({\text{BD}}\) estimates” in \(p_{_n}\approx n\) and \(p_{_n}\gg n\) settings, respectively. It is shown that the new estimate, combined with an appropriately weighted \(L_1\) penalty, achieves the same oracle property as the ordinary non-robust penalized least-squares and penalized-likelihood estimates, but is less sensitive to outliers, a very desirable property in many applications. It will be seen that the robust counterpart eliminates the assumption, typically made in the non-robust case, that some higher-order moments of the response variable Y are finite. Nonetheless, dealing with large-dimensionality in the robust case will face many more theoretical and computational challenges than in the non-robust case. For example, the oracle property in the case \(p_{_n}\approx n\) could not be directly extended to the case \(p_{_n}\gg n\) without requiring more technical assumptions and invoking more careful mathematical treatments. For practical implementation, two data-driven selection procedures for penalty weights, \({\text{PMR}}\) and \({\text{MR}}\), will be proposed and justified for both \(p_{_n}\approx n\) and \(p_{_n}\gg n\). Unlike the selection method for penalty weights in Zhang et al. (2010), which deals with \(p_{_n}\approx n\) and requires \({\text{E}}({\varvec{X}})={\textbf{0}}\) for the \(p_{_n}\)-dimensional predictor vector \({\varvec{X}}=(X_1,\ldots ,X_{p_{_n}})^T\), neither the methods nor the theory developed in the current work imposes such a requirement; they are thus more widely applicable. In the context of large-dimensional inference, we demonstrate that the Wald-type test statistic based on the “penalized robust-\({\text{BD}}\) estimates” will be asymptotically distribution-free, whereas the likelihood ratio-type test statistic will fail to be.
-
In Sect. 5, we devise “penalized robust-\({\text{BD}}\) classifiers” based on the proposed “penalized robust-\({\text{BD}}\) estimates” in large- and high-dimensional binary classification. We demonstrate that if a parameter estimate possesses the sparsity property and is consistent at an appropriate rate, then the induced classifier attains classification consistency. Hence, even for data contaminated with outliers, the choice of loss functions for regression estimates invoked in the classifier has an asymptotically relatively negligible impact on classification performance.
There is a diverse and extensive literature on robust procedures for model selection. For instance, Zou and Yuan (2008) proposed the composite quantile regression and the oracle model selection theory for linear models. Theorem 3 of Zhang et al. (2009) shows that the quantile loss function does not belong to the class of \({\text{BD}}\). Therefore, the “robust-\({\text{BD}}\)” and the quantile loss, as two operationally different robust alternatives, work in non-overlapping frameworks with different motivations and demand theoretically distinct manipulations. Dupuis and Victoria-Feser (2011) developed a fast algorithm for a robust forward selection procedure in linear regression, where \(p_{_n} <n\).
The rest of the paper is organized as follows. Section 6 presents simulation comparisons of the penalized “robust-\({\text{BD}}\)” estimates with the classical ones, including classical-\({\text{SVM}}\) and robust-\({\text{SVM}}\) for Bernoulli responses, to assess the performance in statistical model fitting, variable selection, and classification rules. Section 7 analyzes a real dataset. Limitations and open questions are discussed in Sect. 8. Notations, technical and algorithmic details, figures and tables, and additional analysis are collected in Appendix 1 (in the supplementary materials).
2 Proposed robust penalized regression estimation
Let \(\{({\varvec{X}}_{1},Y_{1}),\ldots ,({\varvec{X}}_{n},Y_{n})\}\) be a sample of independent observations from some underlying population, \(({\varvec{X}},Y)\), where \({\varvec{X}}=(X_1,\ldots ,X_{p_{_n}})^T \in \mathbb {R}^{p_{_n}}\) is the input vector and Y is the output variable. We assume the parametric model for the conditional mean function,
together with the conditional variance function
where \(F(\cdot )\) is a known link function, \(F^{-1}\) denotes the inverse function of F, \(\beta _{0;0}\in \mathbb {R}^1\) and \({\varvec{\beta }}_{0}=(\beta _{1;0},\ldots ,\beta _{p_{_n};0})^T \in \mathbb {R}^{p_{_n}}\) are the unknown true intercept and regression parameters, and the functional form of \(V(\cdot )\) is known. It is worth noting that (1)–(2) include the \(\text{GLM}\) as a special case. Moreover, they allow the conditional distribution of Y given \({\varvec{X}}\) to be incompletely (or partially) specified.
Let \(\rho _{{q}}(y,\mu )\) be a robust loss function (to be proposed in Sect. 2.1) which aims to guard against outlying observations in the response space. We define the “penalized robust-\({\text{BD}}\) estimate” \(({\widehat{\beta }}_{0},{\widehat{{\varvec{\beta }}}})\) as the minimizer of the criterion function,
over \(\beta _{0}\in \mathbb {R}^1\) and \({\varvec{\beta }}=(\beta _{1},\ldots ,\beta _{p_{_n}})^T \in \mathbb {R}^{p_{_n}}\), where \(w(\cdot )\ge 0\) is a given bounded weight function that downweights high leverage design points in the \(p_{_n}\)-dimensional covariate space, \(\{w_{n,j}\}\) represent non-negative weights for the penalty terms, and \(\lambda _n\ge 0\) serves as a regularization parameter. In practice, the weight function \(w(\cdot )\) can be chosen in various ways, such as through prior knowledge or data-driven methods. An empirical choice of \(w(\cdot )\) is provided at the beginning of Sect. 6. Data-driven procedures for properly selected penalty weights \(\{w_{n,j}\}\) will be carefully developed in Sects. 3.5 and 4. Optimization solutions of (3) will be discussed in Sect. 6. Hereafter, we write \({\widetilde{{\varvec{X}}}} = (1, {\varvec{X}}^T)^T\) and \({\widetilde{{\varvec{\beta }}}} = (\beta _{0},{\varvec{\beta }}^T)^T\) to simplify notations.
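To make the structure of the criterion in (3) concrete, the following sketch assembles a penalized robust-\({\text{BD}}\) objective for a generic robust loss \(\rho _{{q}}\), link F, covariate weight function \(w(\cdot )\), penalty weights \(\{w_{n,j}\}\), and regularization parameter \(\lambda _n\). Since the display of (3) is not reproduced above, the particular form below (an averaged covariate-weighted loss plus a weighted \(L_1\) penalty that leaves the intercept unpenalized) is our illustrative reading rather than a verbatim implementation, and the function names are hypothetical.

```python
import numpy as np

def penalized_robust_bd_objective(beta_tilde, X, y, rho_q, F_inv, w_x, w_pen, lam):
    """Sketch of a penalized robust-BD criterion in the spirit of (3).

    beta_tilde : array of length p+1, (intercept, beta_1, ..., beta_p)
    rho_q      : robust-BD loss rho_q(y, mu), vectorized over observations
    F_inv      : inverse link, mapping the linear predictor to the mean mu
    w_x        : covariate weights w(X_i), one per observation
    w_pen      : penalty weights w_{n,j}, one per coefficient (intercept excluded)
    lam        : regularization parameter lambda_n
    """
    n = X.shape[0]
    X_tilde = np.column_stack([np.ones(n), X])               # (1, X^T)^T
    mu = F_inv(X_tilde @ beta_tilde)                          # fitted means
    loss = np.mean(rho_q(y, mu) * w_x)                        # covariate-weighted robust-BD term
    penalty = lam * np.sum(w_pen * np.abs(beta_tilde[1:]))    # weighted L1, intercept unpenalized
    return loss + penalty
```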
2.1 Construction of robust loss functions \(\rho _{{q}}(\cdot ,\cdot )\)
We propose a class of robust loss functions \(\rho _{{q}}\), motivated by Bregman divergence (\({\text{BD}}\)), a notion commonly used in machine learning applications. The original form of \({\text{BD}}\), a bivariate function introduced by Brègman (1967), is defined by
where \(q(\cdot )\) is a given concave differentiable function. For an extensive literature on \({\text{BD}}\), see Altun and Smola (2006), Efron (1986), Gneiting (2011), Grünwald and Dawid (2004), Lafferty et al. (1997), Lafferty (1999), Vemuri et al. (2011) and references therein. The \({\text{BD}}\) is suitable for a broad array of error measures \(Q_{{q}}\). For example, \(q(\mu ) = a \mu - \mu ^2\) for any constant a yields the quadratic loss \(Q_{{q}}(y, {\mu }) = (y - {\mu })^2\). For a binary response variable y, \(q(\mu ) = \min \{\mu , (1-\mu )\}\) gives the misclassification loss \(Q_{{q}}(y,\mu )={\text{I}}\{y\ne {\text{I}}(\mu >1/2)\}\), where \({\text{I}}(\cdot )\) denotes the indicator function; \(q(\mu ) = -2\{\mu \log (\mu )+(1-\mu )\log (1-\mu )\}\) gives the Bernoulli deviance-based loss \(Q_{{q}}(y,\mu ) = -2\{y\log (\mu )+(1-y)\log (1-\mu )\}\); \(q(\mu ) = 2 \min \{\mu , (1-\mu )\}\) results in the hinge loss; \(q(\mu ) = 2\{\mu (1-\mu )\}^{1/2}\) yields the exponential loss \(Q_{{q}}(y,\mu ) = \exp [-(y-.5)\log \{\mu /(1-\mu )\}]\). In the cases of \(p_{_n} \ll n\) and \(p_{_n}\approx n\), Zhang et al. (2010) developed the penalized “classical-\({\text{BD}}\)” estimation.
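As a numerical check of the listed examples, the sketch below evaluates a classical \({\text{BD}}\) directly from a generating q-function, using the representation \(Q_{{q}}(y,\mu ) = q(\mu ) - q(y) + (y-\mu )\,q'(\mu )\) for a concave q; this explicit form is our reconstruction (it is consistent with the derivative identity \({\partial } Q_{{q}}(y,\mu )/{\partial \mu } = (y-\mu ) q''(\mu )\) quoted below) and is stated only for illustration.

```python
import numpy as np

def xlogx(t):
    # t*log(t) with the convention 0*log(0) = 0
    t = np.asarray(t, dtype=float)
    safe = np.where(t > 0, t, 1.0)
    return np.where(t > 0, t * np.log(safe), 0.0)

def bregman_divergence(q, dq, y, mu):
    """Classical BD generated by a concave q: Q_q(y, mu) = q(mu) - q(y) + (y - mu) q'(mu)."""
    return q(mu) - q(y) + (y - mu) * dq(mu)

# Quadratic loss: q(mu) = a*mu - mu^2 (a = 0 here) yields (y - mu)^2.
q_quad, dq_quad = (lambda mu: -mu**2), (lambda mu: -2.0 * mu)
print(bregman_divergence(q_quad, dq_quad, 3.0, 1.0))                    # 4.0 = (3 - 1)^2

# Bernoulli deviance: q(mu) = -2{mu log mu + (1-mu) log(1-mu)} yields -2{y log mu + (1-y) log(1-mu)}.
q_dev  = lambda mu: -2.0 * (xlogx(mu) + xlogx(1.0 - mu))
dq_dev = lambda mu: -2.0 * np.log(mu / (1.0 - mu))
print(bregman_divergence(q_dev, dq_dev, 1.0, 0.2), -2 * np.log(0.2))    # both ~ 3.219
```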
Despite a wide range of applications of \(\text{BD}\) in many different domains, its original form, including the quadratic loss used in the ordinary least squares estimates for regression models, yields estimates that are not resistant to outliers. Robust loss functions for boosting were studied in Kanamori et al. (2007). To the best of our knowledge, there is very little work in the literature on systematically developing robust forms of \({\text{BD}}\) and related inference in the presence of outliers. In the present work, we describe the construction of a “robust-\({\text{BD}}\)”. In accordance with the conditional variance function in (2), let \(r(y, \mu )={(y-\mu )}/{\sqrt{V(\mu )}}\) denote the Pearson residual, which reduces to the standardized residual for linear models. Following (4), we get partial derivatives \({\partial } Q_{{q}}(y,\mu )/{\partial \mu } = (y-\mu ) q''(\mu )\), which can be rewritten as
To guard against outliers with large Pearson residuals, we replace \(r(y,\mu )\) by \(\psi (r(y,\mu ))\), where \(\psi (\cdot )\) is chosen to be a bounded, odd function. There is a wide class of functions \(\psi (\cdot )\) satisfying these requirements; feasible choices include the Huber \(\psi\)-function (Huber, 1964) defined by
and the Tukey biweight function \(\psi (r) = r\{1-({r}/{c})^2\}^2\, {\text{I}}( |r|\le c),\) where c is a positive constant. The proposed robust version of \({\text{BD}}\), \(\rho _{{q}}\), is formed by
where the bias-correction term, \(G(\mu )\), serves to entail the “conditional zero-mean property” (see part (b) of Sect. 2.2) of a non-penalized and low-dimensional parameter estimate (i.e. minimizing (3) in the case of \(\lambda _n=0\) and \(p_{_n} < n\)) and satisfies
with
See Sect. 2.3 for explicit expressions of \(G(\mu )\) for Bernoulli responses. We call \(\rho _{{q}}(\cdot ,\cdot )\) defined in (6) the “robust-\({\text{BD}}\)”, and call the resulting parameter estimate which minimizes (3) the “penalized robust-\({\text{BD}}\) estimate”. An illustrative plot of “robust-\({\text{BD}}\)”, as compared with the classical-\({\text{BD}}\), is displayed in Fig. 1.
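A minimal sketch of the robust building blocks follows: the Pearson residual \(r(y,\mu )\), the Huber \(\psi\)-function, and the Tukey biweight. The tuning constants \(c=1.345\) (used later in Sect. 6) and \(c=4.685\) (a conventional default for the Tukey biweight, not specified in this paper) are illustrative; the bias-correction term \(G(\mu )\) in (6)–(7) is omitted here.

```python
import numpy as np

def pearson_residual(y, mu, V):
    # r(y, mu) = (y - mu) / sqrt(V(mu))
    return (y - mu) / np.sqrt(V(mu))

def huber_psi(r, c=1.345):
    # Huber psi-function: identity inside [-c, c], clipped outside (bounded and odd)
    return np.clip(r, -c, c)

def tukey_psi(r, c=4.685):
    # Tukey biweight: psi(r) = r * (1 - (r/c)^2)^2 for |r| <= c, and 0 otherwise
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

# Example with a Bernoulli-type variance function V(mu) = mu(1 - mu):
V = lambda mu: mu * (1.0 - mu)
r = pearson_residual(y=1.0, mu=0.05, V=V)   # a large positive residual
print(r, huber_psi(r), tukey_psi(r))        # psi caps the influence of the outlying residual
```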
As a specific example of the class \(\rho _{{q}}\) of “robust-\({\text{BD}}\)” in (6), the robust (negative) quasi-likelihood in Boente et al. (2006) and Cantoni and Ronchetti (2001) can be recovered by setting the generating q-function of \({\text{BD}}\) to be \(q(\mu ) =\int _{a}^\mu {(s-\mu )}/{V(s)} {\text{d}}s\), where a is a finite constant such that the integral is well-defined. More generally, the availability of the necessary and sufficient conditions as given in Theorem 3 of Zhang et al. (2009) for an error measure to be a \({\text{BD}}\) enables the construction of the corresponding “robust-\({\text{BD}}\)” from expression (6).
2.2 General properties of “robust-\({\text{BD}}\)” \(\rho _{{q}}\)
We make the following comments regarding features of the “robust-\({\text{BD}}\)”. To facilitate the discussion, we first introduce some necessary notation. Assume that the quantities
exist finitely up to any order required. Then we have the following expressions,
where \(\mu =F^{-1}(\theta )\),
\(A_1(\mu )=[\{q^{(3)}(\mu )\sqrt{V(\mu )}+2^{-1}q''(\mu ){V'(\mu )}/{\sqrt{V(\mu )}}\}F'(\mu )\) \(-q''(\mu )\sqrt{V(\mu )}F''(\mu )] / {\{F'(\mu )\}^3}\), and \(A_2(y, \mu )=[{\partial A_0(y,\mu )}/{\partial \mu } + {\partial \{\psi (r(y,\mu ))- g_1 (\mu )\}}/{\partial \mu }\, A_1(\mu )]/F'(\mu )\). Particularly, \(\text{p}_{_1}(y;\theta )\) contains \(\psi (r)\); \(\text{p}_{_2}(y;\theta )\) contains \(\psi (r)\), \(\psi '(r)\), and \(\psi '(r)r\); \(\text{p}_{_3}(y;\theta )\) contains \(\psi (r)\), \(\psi '(r)\), \(\psi '(r)r\), \(\psi ''(r)\), \(\psi ''(r)r\), and \(\psi ''(r)r^2\), where \(r=r(y, \mu )={(y-\mu )}/{\sqrt{V(\mu )}}\) denotes the Pearson residual. Accordingly, \(\{\text{p}_{_j}(y; \theta ): j=1,2,3\}\) depend on y through \(\psi (r)\) and its derivatives coupled with r.
(a) Relation between the “robust-\({\text{BD}}\)” \(\rho _{{q}}\) and “classical-\({\text{BD}}\)” \(Q_{{q}}\). For the particular choice of \(\psi (r)=r\), it is clearly noticed from (7) that \(g_1 (\cdot )=0\) and thus \(G'(\cdot )=0\). In such a case, the proposed robust \(\rho _{{q}}(y, \mu )\) in (6) reduces to the conventional form, \(Q_{{q}}(y, \mu )\), of \({\text{BD}}\); similarly, \(\text{p}_{_j}(y;\theta )\) reduces to \(\text{q}_{_j}(y;\theta )\), where
Accordingly, \(\text{q}_{_j}(y; \theta )\) is linear in y for fixed \(\theta\). As a comparison,
where \(A_{1,q}(\mu )={\{q^{(3)}(\mu ) F^{(1)}(\mu )-q^{(2)}(\mu ) F^{(2)}(\mu )\}}/\{F^{(1)}(\mu )\}^3\), \(A_2(\mu )=\{-2q^{(3)}(\mu ) F^{(1)}(\mu ) + 3q^{(2)}(\mu ) F^{(2)}(\mu )\}/\{F^{(1)}(\mu )\}^4\), and \(A_3(\mu )=[q^{(4)}(\mu ) \{F^{(1)}(\mu )\}^2-3q^{(3)}(\mu ) F^{(1)}(\mu ) F^{(2)}(\mu )-q^{(2)}(\mu ) F^{(1)}(\mu ) F^{(3)}(\mu ) +3q^{(2)}(\mu )\{F^{(2)}(\mu )\}^2]/\{F^{(1)}(\mu )\}^5\). In addition, assuming that
we see that \(Q_{{q}}(Y, F^{-1}({\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}))\) is strictly convex in \({\widetilde{{\varvec{\beta }}}}\).
(b) “Conditional zero-mean property”. For the proposed class of “robust-\({\text{BD}}\)” \(\rho _{{q}}\) induced by the classical-\({\text{BD}}\) \(Q_{{q}}\), it follows from the expression of \(\text{p}_{_1}(y;\theta )\) in (9) that \({\text{E}}\{\text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\} = 0\).
(c) Bounded influence function. When \(\lambda _n=0\) in (3) and \(p_{_n}<n\), the “robust-\({\text{BD}}\) estimate” defined by minimizing (3) is characterized by the score function and influence function below,
where \(M(\varvec{\psi }_{\rho _{{q}}}) =-{\text{E}}\{{\partial }\varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}})/{\partial {\widetilde{{\varvec{\beta }}}}_0}\} =-{\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0)\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}} {\widetilde{{\varvec{X}}}}^T\}\); note that justifying the influence function of a “robust-\({\text{BD}}\)” estimate for a \(p_{_n}\)-dimensional parameter calls for re-examination and re-derivation beyond the framework of Hampel et al. (1986) for a fixed-dimensional parameter. Thus, the boundedness of \(\psi (\cdot )\) and the weight function \(w(\cdot )\) ensure a bounded influence function. In contrast, for non-robust counterparts, with \(\psi (r)=r\) and \(w(\cdot )\equiv 1\), the influence function is unbounded. As to the score function, the “conditional zero-mean property” in part (b) ensures that under the parametric model (1), \({\text{E}}\{\varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}})\}={\textbf{0}}\) holds for the proposed class of “robust-\({\text{BD}}\)” \(\rho _{{q}}\) induced by the classical-\({\text{BD}}\) \(Q_{{q}}\).
(d) Conditions under which \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\ge 0\). This is a very minimal condition relevant to discussing Theorems 1 and 2, assumption (19) and numerical minimization of (3). First, as observed from (9), the sign of \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\) is invariant with the choice of generating q-functions of \({\text{BD}}\). Second, one sufficient condition for \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} \ge 0\) is that the conditional distribution of Y given \({\varvec{X}}\) is symmetric about \(m({\varvec{X}})\). Third, another sufficient condition for \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} \ge 0\) is that \({\text{E}}[ \psi (r(Y, m({\varvec{X}}))) \frac{\partial }{\partial m({\varvec{X}})} \log \{f(Y\mid {\varvec{X}}, m({\varvec{X}}))\} \mid {\varvec{X}}] \ge 0\), which holds for \(\psi (r)r\ge 0\) (applicable to Huber and Tukey \(\psi\)-functions), and the conditional distribution of Y given \({\varvec{X}}\) belongs to the exponential family, where f denotes the conditional density or probability of Y given \({\varvec{X}}\). Fourth, in the particular choice of \(\psi (r)=r\), which is unbounded, a direct computation gives that \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} = -q''(m({\varvec{X}}))/\{F'(m({\varvec{X}}))\}^2 \ge 0\), for any conditional distribution of Y given \({\varvec{X}}\).
2.3 Difference between “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)”
To better distinguish between the “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)”, we derive below their closed-form expressions for the Bernoulli responses, using the canonical link \(\theta =\log \{\mu /(1-\mu )\}\), Huber \(\psi\)-function, and the deviance loss and the exponential loss as the \({\text{BD}}\). In that case, assume \(c\ge 1\) in the Huber \(\psi\)-function (5), and define two constants \(C_1={1}/{(1+c^2)}\) and \(C_2 =1-C_1\). Results in the case of \(0<c<1\) can be similarly obtained.
For the deviance loss employed as the \({\text{BD}}\), the “robust-\({\text{BD}}\)” in (6) takes the form,
where
The two related derivative quantities are
The “classical-\({\text{BD}}\)” is \(Q_{{q}}(y,\mu ) = -2\log (1-\mu ) (1-y)-2\log (\mu ) y,\) and the two related quantities are \(\text{q}_{_1}(y;\theta ) = -2 (y-\mu )\) and \(\text{q}_{_2}(y;\theta ) = 2V(\mu )\).
Analogously, for the exponential loss used as the \({\text{BD}}\), the “robust-\({\text{BD}}\)” counterpart is represented by \(\rho _{{q}}(y,\mu ) = p^*(y,\mu ) - G(\mu ),\) where
The two related derivative quantities are
The “classical-\({\text{BD}}\)” is \(Q_{{q}}(y,\mu ) = \{\mu (1-y) + (1-\mu ) y\}/{\sqrt{V(\mu )}},\) and the two related quantities are \(\text{q}_{_1}(y;\theta ) = -(y-\mu )/\{2\sqrt{V(\mu )}\}\) and \(\text{q}_{_2}(y;\theta ) = \{\mu (1-y)+(1-\mu )y\}/\{4\sqrt{V(\mu )}\}\).
In summary, for both types of “classical-\({\text{BD}}\)”, which are unbounded, the corresponding versions of “robust-\({\text{BD}}\)” are bounded. See the left panels of Fig. 1 with the response \(y=1\), where the “classical-\({\text{BD}}\)” goes to infinity as \(\mu\) approaches zero. (The case of \(y=0\) is similar.) From the middle and right panels, \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) associated with the “robust-\({\text{BD}}\)” are always bounded. In contrast, for the exponential loss, the non-robust counterparts \(\text{q}_{_1}(y;\theta )\) and \(\text{q}_{_2}(y;\theta )\) associated with the “classical-\({\text{BD}}\)” are unbounded. Moreover, we observe from each panel that the “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)” differ at the lower and upper tails of \(\mu\), but coincide at intermediate values of \(\mu\).
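The boundedness contrast described above can be checked numerically: for a Bernoulli response with \(y=1\), the classical deviance loss diverges as \(\mu \rightarrow 0\), whereas the \(\psi\)-transformed Pearson residual entering the robust-\({\text{BD}}\) stays capped. The sketch below illustrates only this contrast and does not reproduce the full robust-\({\text{BD}}\) expressions (which also involve the bias correction \(G(\mu )\)).

```python
import numpy as np

def deviance_loss(y, mu):
    # Classical-BD with the deviance loss: Q_q(y, mu) = -2{y log(mu) + (1-y) log(1-mu)}
    return -2.0 * (y * np.log(mu) + (1 - y) * np.log(1 - mu))

def huber_psi(r, c=1.345):
    return np.clip(r, -c, c)

mus = np.array([1e-6, 1e-3, 1e-1, 0.5])
y = 1.0
r = (y - mus) / np.sqrt(mus * (1.0 - mus))   # Pearson residuals under V(mu) = mu(1 - mu)
print(deviance_loss(y, mus))                  # grows without bound as mu -> 0
print(huber_psi(r))                           # capped at c = 1.345 for all mu
```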
3 Robust estimation with large-dimensions: \(p_{_n}\approx n\)
This section investigates the statistical properties of the “penalized robust-\({\text{BD}}\) estimate” defined by minimizing (3) in sparse large-dimensional parametric models with \(p_{_n} \approx n\). Throughout the paper, it is assumed that some entries in \({\varvec{\beta }}_{0}\) are exactly zero. Without loss of generality, write \({\varvec{X}}= ({\varvec{X}}^{({\text{I}})T}, {\varvec{X}}^{({\text{II}})T})^T\) and \({\varvec{\beta }}_{0} = ({\varvec{\beta }}_{0}^{({\text{I}})T}, {\varvec{\beta }}_{0}^{({\text{II}})T})^T\), where the \({\varvec{\beta }}_{0}^{({\text{I}})}\) part collects all non-zero coefficients, and \({\varvec{\beta }}_{0}^{({\text{II}})} = {\textbf{0}}\). Let \(s_n\) denote the number of non-zero coordinates of \({\varvec{\beta }}_{0}\), and set \({\widetilde{{\varvec{\beta }}}}_{0} = (\beta _{0;0},{\varvec{\beta }}_{0}^T)^T\). Correspondingly, write \({\widetilde{{\varvec{X}}}}^{({\text{I}})} = (1,{\varvec{X}}^{({\text{I}})T})^T\) and \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})} = (\beta _{0},{\varvec{\beta }}^{({\text{I}})T})^T\).
It will be demonstrated that the impact of penalty weights \(\{w_{n,j}\}\) in (3) on the penalized “robust-\({\text{BD}}\)” estimate is primarily captured by two quantities, defined by
3.1 Consistency
Theorem 1 guarantees the existence of a \(\sqrt{n/s_n}\)-consistent local minimizer of (3). In particular, Theorem 1 allows the dimension to diverge with n at the rate \(p_{_n} = o\{n^{(3+\delta )/(4+\delta )}\}\) for any \(\delta >0\), as long as the number of truly non-zero parameters satisfies \(s_n = O\{n^{1/(4+\delta )}\}\).
Theorem 1
(existence and consistency: \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1, \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and there exists a constant \(M\in (0,\infty )\) such that \(\lim _{n\rightarrow \infty } {\text{P}}(w_{\min }^{({\text{II}})} \lambda _n > M)=1\). If \(s_n^4/n \rightarrow 0\) and \(s_n (p_{_n}-s_n) = o(n)\), then there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm.
To clarify the distinction between Theorem 5 of Zhang et al. (2010) and Theorem 1 of this paper, we make the following comparison. (i) Theorem 5 of Zhang et al. (2010) uses the classical-\({\text{BD}}\), corresponding to \(\psi (r)=r\) in (6), and assumes the finiteness of some moments of Y. (ii) Condition \({\text{A}}5\) of our Theorem 1 uses the robust-\({\text{BD}}\) and assumes the boundedness of \(\psi (r)\), thus excluding \(\psi (r)=r\), but avoids the moment assumption. Thus, our Theorem 1 is not applicable to the case of \(\psi (r)=r\) in (6) associated with a classical-\({\text{BD}}\) of Zhang et al. (2010).
3.2 Oracle property
To investigate the asymptotic distribution of the penalized robust-\({\text{BD}}\) estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\), we define three square matrices of size \((s_n+1)\) by \(\textbf{W}_n^{({\text{I}})} = \text{diag}(0, w_{n,1},\ldots ,w_{n,s_n})\) and
Both \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) depend on the choice of \({\text{BD}}\), weight function \(w(\cdot )\), and the \(\psi\)-function \(\psi (\cdot )\).
Building on Theorem 1, Theorem 2 establishes the oracle property of the \(\sqrt{n/s_n}\)-consistent local minimizer. Namely, if the “robust-\({\text{BD}}\)” is used as the loss function for parameter estimation, then the penalized robust-\({\text{BD}}\) estimates of the zero parameters take exactly zero values with probability tending to one, and the penalized robust-\({\text{BD}}\) estimates of the non-zero parameters are asymptotically Gaussian with the same means and variances as if the zero coefficients were known in advance.
Theorem 2
(oracle property: \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1.
- \(\mathrm {(i)}\):
-
If \(s_n^2/n=O(1)\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as \(n\rightarrow \infty\), then any \(\sqrt{n/s_n}\)-consistent local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) of (3) satisfies \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\).
- \(\mathrm {(ii)}\):
-
Moreover, if \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\), \(s_n^5/n \rightarrow 0\) , and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\), then for any fixed integer \({\textsf {k}}\) and any \({\textsf {k}}\times (s_n+1)\) matrix \(A_n\) such that \(A_nA_n^T \rightarrow \mathbb {G}\) for a \({\textsf {k}}\times {\textsf {k}}\) nonnegative-definite symmetric matrix \(\mathbb {G}\), we have \(\sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} [\textbf{H}_n^{({\text{I}})}\{\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\} + \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}] {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \mathbb {G})\), where \(\,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}=(\,\text{sign}(\beta _{0}),\,\text{sign}(\beta _{1}),\ldots ,\,\text{sign}(\beta _{s_n}))^T\).
Remark 1
In Theorem 2, Condition \({\text{B}}5\) extends the positive-definiteness assumption of \({\text{E}}\{ {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}\) in the non-robust and fixed-dimensional case to the robust and large-dimensional case. The assumption \(\min _{1\le j\le s_n} |\beta _{j;0}|/ \sqrt{s_n/n} \rightarrow \infty\) is relevant to the magnitude of coefficients for significant variables which can be selected, and is fulfilled when \(\min _{1\le j\le s_n} |\beta _{j;0}|\ge C n^{-4/5}\) for a constant \(C>0\).
3.3 Comparison with the penalized “classical-\({\text{BD}}\)” estimate
Comparisons are made between the penalized “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)” estimates. (I) The penalized “classical-\({\text{BD}}\)” estimate in Zhang et al. (2010) requires \({\text{E}}(Y^2)<\infty\) for the consistency and requires finiteness of some higher-order moments of Y for the oracle property. These requirements are avoided in the “robust-\({\text{BD}}\)” counterpart. (II) The two types of penalized estimates appear to share similar forms of the asymptotic distribution, except that matrices \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) for the “classical-\({\text{BD}}\)” estimate are given by
respectively. Hence, the differences are captured by the distinction between the robust versions, \(\{\text{p}_{_j}(y; \theta )\}_{j=1}^{2}\) (defined in (8)) and weight function \(w(\cdot )\), used in the penalized “robust-\({\text{BD}}\)” estimate and the non-robust counterparts, \(\{\text{q}_{_j}(y; \theta )\}_{j=1}^{2}\) (defined in (10)) and \(w(\cdot ) \equiv 1\), used in the penalized “classical-\({\text{BD}}\)” estimate.
As observed from (12)–(13), a bounded function \(\text{p}_{_1}(y; \theta )\) is introduced from a bounded function \(\psi (r)\) to control deviations in the Y-space, and leverage points are down-weighted by the weight function \(w({\varvec{X}})\). In contrast, \(\text{q}_{_1}(y; \theta )\) is not guaranteed to be bounded. It is then clear that for penalized “robust-\({\text{BD}}\)” estimates of non-zero parameters, the choice of a bounded score function ensures robustness by putting a bound on the influence function. Such a property is not possessed by the penalized “classical-\({\text{BD}}\)” counterparts.
3.4 Hypothesis testing
We consider the hypothesis testing about \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) formulated as
where \(A_n\) is a given \({\textsf {k}}\times (s_n+1)\) matrix such that \(A_nA_n^T\) equals a \({\textsf {k}}\times {\textsf {k}}\) positive-definite matrix \(\mathbb {G}\). This form of linear hypotheses allows one to simultaneously test whether a subset of variables used are statistically significant by taking some specific form of the matrix \(A_n\); for instance \(A_n=[\textbf{I}_{{\textsf {k}}},{\textbf{0}}_{{\textsf {k}},s_n+1-{\textsf {k}}}]\) yields \(A_nA_n^T =\textbf{I}_{{\textsf {k}}}\) for a \({\textsf {k}}\times {\textsf {k}}\) identity matrix.
In the context of non-robust penalized-likelihood estimation, Fan and Peng (2004) showed that the likelihood ratio-type test statistic asymptotically follows a chi-squared distribution under the null. It is thus natural to explore the extent to which the likelihood ratio-type test can feasibly be extended to the “robust-\({\text{BD}}\)”. Our derivations (with details omitted) indicate that the resulting asymptotic null distribution is generally not chi-squared but a sum of weighted chi-squared variables, with weights involving unknown quantities; hence the test is not distribution-free, and the result holds only under restrictive conditions.
To ameliorate this undesirable property, we propose a robust generalized Wald-type test statistic of the form,
where \({\widehat{\Omega }}_n^{({\text{I}})} = n^{-1} \sum _{i=1}^n \text{p}_{_1}^2(Y_i; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})})\, w^2({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\) and
\({\widehat{\textbf{H}}}_n^{({\text{I}})} = n^{-1} \sum _{i=1}^n \text{p}_{_2}(Y_i; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\). This test is asymptotically distribution-free, as Theorem 3 justifies that under the null, \(W_n\) is asymptotically chi-squared with \({\textsf {k}}\) degrees of freedom.
Theorem 3
(Wald-type test under \(H_0\) based on robust-\({\text{BD}}\): \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{C}}4\), \({\text{A}}5\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1, and \(w_{\max }^{({\text{I}})}=o_{_{{\text{P}}}}\{{1}/(\lambda _n \sqrt{ns_n})\}\). If \(s_n^5/n \rightarrow 0\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\) as \(n\rightarrow \infty\), then \(W_n {\mathop {\longrightarrow }\limits ^{\mathcal L}}\chi ^2_{{\textsf {k}}}\) under the null \(H_0\) in (14).
Since the influence function of \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is bounded, Proposition 2 of Cantoni and Ronchetti (2001) can be modified to show that the asymptotic level of the robust test statistic \(W_n\) under a sequence of \({\epsilon }\)-contaminations is bounded. Similarly, the asymptotic power is stable under contamination. Details are omitted for lack of space.
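For implementation, the plug-in matrices \({\widehat{\Omega }}_n^{({\text{I}})}\) and \({\widehat{\textbf{H}}}_n^{({\text{I}})}\) defined above can be assembled directly from the data. Since the display defining \(W_n\) is not reproduced here, the sketch below uses a standard sandwich-type Wald form for testing \(H_0: A_n {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})} = {\textbf{0}}\); this form is an assumption adopted for illustration, not the paper's exact display.

```python
import numpy as np

def plug_in_matrices(X_tilde_I, y, beta_hat_I, p1, p2, w):
    """Omega_hat and H_hat as defined in Sect. 3.4, with p1, p2 the robust-BD derivative
    quantities and w the covariate weights w(X_i)."""
    theta = X_tilde_I @ beta_hat_I
    n = len(y)
    Omega = (X_tilde_I * (p1(y, theta) ** 2 * w ** 2)[:, None]).T @ X_tilde_I / n
    H = (X_tilde_I * (p2(y, theta) * w)[:, None]).T @ X_tilde_I / n
    return Omega, H

def wald_statistic(A, beta_hat_I, Omega_hat, H_hat, n):
    """Assumed sandwich-type Wald statistic for H0: A beta_tilde_0^(I) = 0:
        W_n = n (A b)^T [A H^{-1} Omega H^{-1} A^T]^{-1} (A b),  b = beta_hat^(I).
    Per Theorem 3, W_n is asymptotically chi-squared with k = A.shape[0] degrees of freedom."""
    H_inv = np.linalg.inv(H_hat)
    cov = A @ H_inv @ Omega_hat @ H_inv @ A.T     # asymptotic covariance of sqrt(n) A b
    Ab = A @ beta_hat_I
    return float(n * Ab @ np.linalg.solve(cov, Ab))
```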
3.5 Proposed selection for penalty weights
In practice, the weights \(\{w_{n,j}\}\) in the penalty part of (3) need to be selected and their validity also needs to be justified. To accommodate the general error measures, we propose two procedures. The first one, called the “penalized marginal regression” (\({\text{PMR}}\)) method, selects the data-dependent penalty weights \({\widehat{w}}_{n,j}\), for each individual \(j = 1,\ldots ,p_{_n}\), according to
where \({\widehat{\beta }}^{{\text{PMR}}}_j\) satisfies that \(({\widehat{\alpha }}^{{\text{PMR}}}_j,{\widehat{\beta }}^{{\text{PMR}}}_j) \in \mathbb {R}^2\) minimize the criterion function
over \((\alpha ,\beta )\), with some sequence \(\kappa _n>0\), and \(X_{i,j}\) denoting the jth variable in the ith sample.
An alternative procedure, called the “marginal regression” (\({\text{MR}}\)) method, selects the penalty weights \({\widehat{w}}_{n,j}\), for each individual \(j = 1,\ldots ,p_{_n}\), by means of
where \({\widehat{\beta }}^{{\text{MR}}}_j\) satisfies that \(({\widehat{\alpha }}^{{\text{MR}}}_j,{\widehat{\beta }}^{{\text{MR}}}_j) \in \mathbb {R}^2\) minimize the criterion function
over \((\alpha ,\beta )\).
Note that (16) and (18) each involve a single predictor together with an intercept term. Thus, fast bivariate optimization of (16) and (18) is feasible even when \(p_{_n} > n\). Compared with the \({\text{PMR}}\) method, the \({\text{MR}}\) method is computationally cheaper.
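Computationally, the \({\text{MR}}\) step amounts to \(p_{_n}\) independent bivariate robust fits, each involving one predictor and an intercept. The sketch below illustrates this structure only; since displays (15)–(18) are not reproduced above, the mapping from the marginal slope to the penalty weight used here (a reciprocal of its magnitude, adaptive-Lasso style) is a hypothetical placeholder rather than the paper's definition.

```python
import numpy as np
from scipy.optimize import minimize

def mr_penalty_weights(X, y, rho_q, F_inv, w_x, eps=1e-8):
    """Illustrative MR-style weight selection: one bivariate robust fit per predictor.
    The weight map w_{n,j} = 1/|beta_hat_j| below is a placeholder, not display (17)."""
    n, p = X.shape
    weights = np.empty(p)
    for j in range(p):
        def marginal_loss(ab, xj=X[:, j]):
            alpha, beta = ab
            mu = F_inv(alpha + beta * xj)
            return np.mean(rho_q(y, mu) * w_x)          # covariate-weighted robust-BD loss
        beta_j = minimize(marginal_loss, x0=np.zeros(2), method="Nelder-Mead").x[1]
        weights[j] = 1.0 / (abs(beta_j) + eps)           # hypothetical adaptive-type weight
    return weights
```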
Theorems 4 and 5 indicate that under the assumptions on the correlation between the predictor variables and the response variable, the penalty weights selected by either the \({\text{PMR}}\) or \({\text{MR}}\) method satisfy the conditions on \(\{w_{n,j}\}\) in Theorem 1.
Theorem 4
(\({\text{PMR}}\) for penalty weights: \(p_{_n}\approx n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\), \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\) and \({\mathcal {B}}_n/\kappa _n = O(1)\) for \(\kappa _n\) in (16). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n = o(\kappa _n)\) and \(\log (p_{_n}) = o(n\kappa _n^2)\). Then there exist local minimizers \(\{{\widehat{\beta }}_{j}^{{\text{PMR}}}\}_{j=1}^{p_{_n}}\) of (16) such that the penalty weights \(\{{\widehat{w}}_{n,j}\}_{j=1}^{p_{_n}}\) defined in (15) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})} \lambda _n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 1, where \({\widehat{w}}_{\max }^{({\text{I}})} = \max _{1\le j\le s_n} {\widehat{w}}_{n,j}\) and \({\widehat{w}}_{\min }^{({\text{II}})} = \min _{s_n+1\le j\le p_{_n}} {\widehat{w}}_{n,j}\).
Theorem 5
(\({\text{MR}}\) for penalty weights: \(p_{_n}\approx n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}2\) and \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\) and \({\mathcal {B}}_n = \lambda _n\). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n \rightarrow 0\), \(\lambda _n n/s_n \rightarrow \infty\) and \(s_n^2 \log (p_{_n}) = o(\lambda _n^2 n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (18) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (17) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})}\lambda _n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 1.
It should be pointed out that the \({\text{PMR}}\) method here differs from the weights selection method in Zhang et al. (2010), which is restricted to \(p_{_n}\approx n\), excludes the intercept term in (16) and requires \({\text{E}}({\varvec{X}})={\textbf{0}}\) to satisfy the conditions on \(\{w_{n,j}\}\) in Theorem 1. In contrast, the \({\text{PMR}}\) method here removes that requirement and will also be applicable to \(p_{_n}\gg n\); see Theorem 7.
4 Robust estimation with high-dimensions: \(p_{_n}\gg n\)
This section explores the behavior of penalized robust-\({\text{BD}}\) estimates in sparse high-dimensional parametric models when \(p_{_n}\) is allowed to grow faster than n. Evidently, the technical conditions (e.g., in Theorem 1) on \(p_{_n}\) in Sect. 3 are violated for \(p_{_n} \gg n\). Thus, directly carrying through the proofs in Sect. 3 to the counterpart of \(p_{_n} \gg n\) is infeasible. To facilitate the study in the case of \(p_{_n}\gg n\), we impose a convexity assumption on the “robust-\({\text{BD}}\)”:
Under this assumption, \(\rho _{{q}}(Y, F^{-1}({\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}))\) is strictly convex in \({\widetilde{{\varvec{\beta }}}}\).
Apparently, assumption (19) is stronger than \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\ge 0\) discussed in Sect. 2.2; relaxing (19) will be desirable but is not pursued in this paper. We consider here particular cases where assumption (19) is practically/approximately achievable and theoretically relevant for high-dimensional settings.
-
Case 1:
For any observation \(({\varvec{X}}, Y)\) such that the conditional distribution of Y given \({\varvec{X}}\) is symmetric about \(m({\varvec{X}})\), the use of a quadratic loss function combined with an identity link, and a constant conditional variance ensures that \(\text{p}_{_2}(y;\theta ) = 2\psi '(r(y, \theta )) \ge 0\), for a monotone non-decreasing \(\psi\)-function. Thus, the Gaussianity assumption on the conditional distribution of Y given \({\varvec{X}}\) is relaxed.
-
Case 2:
Recall that if \(\psi (r)=r\), then \(\text{p}_{_j}(y;\theta )=\text{q}_{_j}(y;\theta )\), and thus condition (19) is equivalent to condition (11). Indeed, condition (11) holds broadly for nearly all commonly used \({\text{BD}}\). Examples include the quadratic loss function, the deviance-based loss and exponential loss functions for the Bernoulli responses, and the (negative) quasi-likelihood for over-dispersed Poisson responses, among many others. The implication is that for high-dimensional data, if we are most concerned with dealing with outliers arising from the explanatory variables, we may employ \(\psi (r)=r\) for Y alone while retaining the weight function \(w(\cdot )\) on the covariates \({\varvec{X}}\). This is particularly relevant to samples with Bernoulli or Binomial responses, where both the parameter space and the response space are bounded, and thus is applicable to a wide class of classification procedures.
Theorem 6 states that the oracle property remains true, under suitable conditions, for the penalized robust-\({\text{BD}}\) estimates in the \(p_{_n}\gg n\) settings. Due to the technical challenge from \(p_{_n}\gg n\), Theorem 6 contains stronger assumptions than those in Theorem 2 (with \(p_{_n}\approx n\)). Lemma 4 in Appendix 1.1 provides the key steps in the proof of Theorem 6.
Theorem 6
(oracle property: \(p_{_n}\gg n\)) Assume (19) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5'\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Suppose \(s_n^4/n \rightarrow 0\), \(\log (p_{_n}-s_n)/n = O(1)\), \(\log (p_{_n}-s_n)/\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\} = o_{_{{\text{P}}}}(1)\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\). Assume \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\).
- \(\mathrm {(i)}\):
-
Then there exists a global minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\).
- \(\mathrm {(ii)}\):
-
Moreover, if \(s_n^5/n \rightarrow 0\), then for any fixed integer \({\textsf {k}}\) and any \({\textsf {k}}\times (s_n+1)\) matrix \(A_n\) such that \(A_nA_n^T \rightarrow \mathbb {G}\) for a \({\textsf {k}}\times {\textsf {k}}\) nonnegative-definite symmetric matrix \(\mathbb {G}\), we have \(\sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} [\textbf{H}_n^{({\text{I}})}(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) + \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}] {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \mathbb {G})\).
The conditions \(s_n^5/n=o(1)\), \(\log (p_{_n}-s_n) = O(n)\) and \(\log (p_{_n}-s_n)/\{(w_{\min }^{({\text{II}})})^2 \lambda _n^2 n\} = o_{_{{\text{P}}}}(1)\) impose the constraints on \(s_n\) and \(p_{_n}\) simultaneously. To illustrate, take \(s_n = n^{a}\), where \(0<a<1/5\). Then we can take \(w_{\min }^{({\text{II}})}={n^b}/{(\lambda _n \sqrt{n})}\) for \(b>a\), namely, \(n\lambda _n^2 \{w_{\min }^{({\text{II}})}\}^2=n^{2b}\). A sufficient condition for the model dimension \(p_{_n}\) is that \(\log (p_{_n}-s_n) = O(n)\) and \(\log (p_{_n}-s_n) = o(n^{2b})\). So \(\log (p_{_n}-s_n)=\min \{o(n^{2b}), O(n)\}\). This indicates that Theorem 6 allows \(p_{_n}=\exp \{o(n^{2b}) \wedge O(n)\}\), which grows nearly exponentially fast with n.
Regarding the selection of penalty weights, the \({\text{PMR}}\) and \({\text{MR}}\) methods proposed in Sect. 3.5 for \(p_{_n}\approx n\) continue to work well when \(p_{_n} \gg n\). Their validity is established in Theorems 7 and 8, which again do not require \({\text{E}}({\varvec{X}})={\textbf{0}}\).
Theorem 7
(\({\text{PMR}}\) for penalty weights: \(p_{_n}\gg n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\), \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\), and \({\mathcal {B}}_n/\kappa _n = O(1)\) for \(\kappa _n\) in (16). Suppose \(\lambda _n \sqrt{n} = O(1)\), \({\lambda _n \sqrt{n}}/{s_n} = o(\kappa _n)\) and \(\log (p_{_n}) = o(n\kappa _n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{PMR}}}_{j}\) of (16) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (15) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})} {\lambda _n \sqrt{n}}/ {s_n} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 6.
Theorem 8
(\({\text{MR}}\) for penalty weights: \(p_{_n}\gg n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}2\) and \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\) and \({\mathcal {B}}_n = {\lambda _n\sqrt{n}} / {s_n}\). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n\sqrt{n}/s_n \rightarrow 0\), \(\lambda _n n/s_n \rightarrow \infty\) and \(s_n^2 \log (p_{_n}) = o(\lambda _n^2 n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (18) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (17) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 6.
5 Robust estimation in classification: \(p_{_n}\approx n\), \(p_{_n}\gg n\)
This section deals with the binary response variable Y which takes values 0 and 1. In this case, the mean regression function \(m({\varvec{x}})\) in (1) becomes the class label probability, \({\text{P}}(Y = 1 \mid {\varvec{X}}= {\varvec{x}})\). From the penalized robust-\({\text{BD}}\) estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) proposed in either Sect. 3 or Sect. 4, we can construct the “penalized robust-\({\text{BD}}\) classifier”, \({\widehat{\phi }}({\varvec{x}}) = {\text{I}}\{{\widehat{m}}({\varvec{x}}) > 1/2\},\) for a future input variable \({\varvec{x}}\), where \({\widehat{m}}({\varvec{x}})= F^{-1}({\widetilde{{\varvec{x}}}}^T\widehat{{\widetilde{{\varvec{\beta }}}}})\).
In the classification literature, the misclassification loss of a classification rule \(\phi\) at a sample point \(({\varvec{x}}, y)\) is \(l(y, \phi ({\varvec{x}})) = {\text{I}}\{y \ne \phi ({\varvec{x}})\}\). The risk of \(\phi\) is the expected misclassification loss, \(R(\phi ) = {\text{E}}\{l(Y, \phi (X))\} = {\text{P}}(\phi ({\varvec{X}}) \ne Y)\). The optimal Bayes rule, which minimizes the risk over \(\phi\), is \(\phi _{\mathrm {\scriptscriptstyle B}}({\varvec{x}})={\text{I}}\{m({\varvec{x}}) > 1/2\}\). For a test sample \(({\varvec{X}}^o, Y^o)\), which is an \(\mathrm {i.i.d.}\) copy of samples in the training set \(\mathcal T_n=\{({\varvec{X}}_{i}, Y_{i}): i=1,\ldots ,n\}\), the optimal Bayes risk is then \(R(\phi _{{\mathrm {\scriptscriptstyle B}}}) = {\text{P}}(\phi _{{\mathrm {\scriptscriptstyle B}}}({\varvec{X}}^o) \ne Y^o)\). Meanwhile, the conditional risk of the classification rule \({\widehat{\phi }}\) is \(R({\widehat{\phi }} \mid \mathcal T_n) = {\text{P}}({\widehat{\phi }}({\varvec{X}}^o) \ne Y^o \mid \mathcal T_n)\).
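A short sketch of the classifier and of an empirical estimate of its conditional risk (the test-set misclassification rate) is given below; \(F^{-1}\) denotes the inverse link, e.g., the logistic function under the logit link. The function names are illustrative.

```python
import numpy as np

def robust_bd_classifier(X_new, beta_tilde_hat, F_inv):
    """phi_hat(x) = I{m_hat(x) > 1/2} with m_hat(x) = F^{-1}(x_tilde^T beta_tilde_hat)."""
    X_tilde = np.column_stack([np.ones(len(X_new)), X_new])
    m_hat = F_inv(X_tilde @ beta_tilde_hat)
    return (m_hat > 0.5).astype(int)

def empirical_risk(X_test, y_test, beta_tilde_hat, F_inv):
    # Empirical analogue of the conditional risk R(phi_hat | T_n): test-set misclassification rate
    return np.mean(robust_bd_classifier(X_test, beta_tilde_hat, F_inv) != y_test)
```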
Theorem 9 demonstrates that \({\widehat{\phi }}\) attains the classification consistency, provided that the estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) possesses the sparsity property and is consistent at an appropriate rate. As observed, the conclusion of Theorem 9 is applicable to penalized robust-\({\text{BD}}\) estimates in both Theorem 2(i) with \(p_{_n}\approx n\) and Theorem 6(i) with \(p_{_n} \gg n\).
Theorem 9
(consistency of the penalized robust-\({\text{BD}}\) classifier) Assume Conditions \({\text{A}}0\), \({\text{A}}1\) and \({\text{A}}4\) in Appendix 1.1. Suppose that the estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) satisfies \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(r_n)\). If \(r_n\sqrt{s_n}=o(1)\), then the classification rule \({\widehat{\phi }}\) constructed from \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) is consistent in the sense that \(\mathrm {(i)}\) \({\text{E}}\{R({\widehat{\phi }} \mid \mathcal T_n)\} - R(\phi _{{\mathrm {\scriptscriptstyle B}}}) \rightarrow 0\) as \(n \rightarrow \infty\), which in turn yields \(\mathrm {(ii)}\) \(R({\widehat{\phi }} \mid \mathcal T_n) {\mathop {\longrightarrow }\limits ^{\text{P}}}R(\phi _{{\mathrm {\scriptscriptstyle B}}})\).
6 Simulation study
We conduct simulation studies to evaluate the performance of the penalized robust-\({\text{BD}}\) estimates in the absence and presence of outliers. The Huber \(\psi\)-function is used with \(c=1.345\). For practical applications, we suggest an empirical choice of the weight function, \(w({\varvec{x}})=1/\{1+\sum _{j=1}^{p_{_n}} (\frac{x_{j}-m_{\centerdot ,j}}{s_{\centerdot ,j}})^2\}^{1/2}\), where \({\varvec{x}}=(x_1,\ldots ,x_{p_{_n}})^T\), and \(m_{\centerdot ,j}\) and \(s_{\centerdot ,j}\) denote the sample median and sample median absolute deviation of \(\{X_{i,j}: i=1,\ldots ,n\}\), respectively, \(j=1,\ldots ,p_{_n}\). This form of \(w({\varvec{x}})\) is a generalization of a weight function used on page 2864 of Boente et al. (2006) for a one-dimensional covariate. The classical non-robust counterparts correspond to \(\psi (r)=r\) and \(w({\varvec{x}})\equiv 1\). In the simulations, both \(p_{_n}=50\) and \(p_{_n}=500\) are considered.
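The empirical weight function above can be coded directly; the sketch below uses the raw columnwise median absolute deviation (no consistency factor), which is our reading of \(s_{\centerdot ,j}\).

```python
import numpy as np

def covariate_weights(X):
    """w(x) = 1 / sqrt(1 + sum_j ((x_j - m_j) / s_j)^2), with m_j and s_j the columnwise
    sample median and median absolute deviation (raw MAD, no consistency factor assumed)."""
    m = np.median(X, axis=0)
    s = np.median(np.abs(X - m), axis=0)
    s = np.where(s > 0, s, 1.0)   # guard against zero MAD (implementation detail, not in the paper)
    z = (X - m) / s
    return 1.0 / np.sqrt(1.0 + np.sum(z ** 2, axis=1))
```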
For illustrative purposes, four types of penalization techniques combined with the loss term in (3) are compared: (I) the \(\text{SCAD}\) penalty, with an accompanying parameter \(a = 3.7\), combined with the local linear approximation; (II) the \(L_1\) penalty; (III) the weighted-\(L_1\) penalty with weights selected by the proposed \({\text{MR}}\) method; (IV) the weighted-\(L_1\) penalty with weights selected by the proposed \({\text{PMR}}\) method. For comparison, the oracle (non-penalized) estimator (abbreviated as “Oracle”) of parameters using the true model containing truly significant variables is included. The tuning constants \(\lambda _n\) in each simulation for methods (I)–(III) are selected separately by minimizing the classical-\({\text{BD}}\) (for non-robust methods) and “robust-\({\text{BD}}\)” (for robust methods) on a test set of size n; \(\lambda _n\) and \(\kappa _n\) for method (IV) are searched over a two-dimensional grid of values. Numerical algorithms for the proposed penalized estimator in (3) are given in Appendix 1.3.
The number of Monte Carlo runs is 500. To measure the performance of a parameter estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) through simulation, we use the average value and standard deviation (\({\text{sd}}\)) of estimation errors, \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})=\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{{\widetilde{{\varvec{\beta }}}}}_0\Vert _2\). Variable selection is assessed by \({\text {C-Z}}\), the average number of regression coefficients which are correctly estimated to be zero when the true coefficients are zero, and \({\text {C-NZ}}\), the average number of coefficients which are correctly estimated to be non-zero when the true coefficients are non-zero. For binary responses, \({\text{MCR}}\) denotes the average of the misclassification rates on a test set of size 5000.
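For reproducibility, the performance measures can be computed as in the sketch below (function names are illustrative); \({\text{EE}}\) is the Euclidean estimation error, while \({\text {C-Z}}\) and \({\text {C-NZ}}\) count correctly identified zero and non-zero coefficients, excluding the intercept.

```python
import numpy as np

def estimation_error(beta_tilde_hat, beta_tilde_0):
    # EE = ||beta_tilde_hat - beta_tilde_0||_2
    return np.linalg.norm(beta_tilde_hat - beta_tilde_0)

def selection_counts(beta_hat, beta_0, tol=1e-8):
    """C-Z: number of truly-zero coefficients estimated as zero;
       C-NZ: number of truly non-zero coefficients estimated as non-zero (intercept excluded)."""
    est_zero, true_zero = np.abs(beta_hat) <= tol, np.abs(beta_0) <= tol
    c_z = int(np.sum(est_zero & true_zero))
    c_nz = int(np.sum(~est_zero & ~true_zero))
    return c_z, c_nz
```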
6.1 Overdispersed Poisson responses
We generate overdispersed Poisson counts \(\{Y_{i}\}_{i=1}^n\) with \(n=100\), satisfying \({\text{var}}(Y_{i}\mid {\varvec{X}}_{i}={\varvec{x}}_{i})=2\, m({\varvec{x}}_{i})\). In the predictor \({\varvec{X}}_{i}=(X_{i,1}, X_{i,2}, \ldots , X_{i,p_{_n}})^T,\) \(X_{i,1}=i/n-0.5\); for \(j=2,\ldots ,p_{_n}\), \(X_{i,j}=\Phi (Z_{i,j})-0.5\), where \(\Phi\) is the standard Gaussian distribution function, and \((Z_{i,2},\ldots ,Z_{i,p_{_n}})^T \sim N({\textbf{0}}, \Sigma _{p_{_n}-1})\), with \(\Sigma _{p_{_n}-1}(j,k)=0.2^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}-1\). The link function is \(\log \{m({\varvec{x}})\}=\beta _{0;0}+ {\varvec{x}}^T {\varvec{\beta }}_{0}\), with \(\beta _{0;0}=2.5\) and \({\varvec{\beta }}_{0}=(2, 2, 0,\ldots ,0)^T\). The (negative) quasi-likelihood with \(V(\mu )=\mu\) is utilized as the \({\text{BD}}\).
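A sketch of one way to generate this design and the overdispersed counts is given below. The paper does not state the exact over-dispersion mechanism, so the Gamma–Poisson (negative binomial) device used here, which yields \({\text{var}}(Y\mid {\varvec{X}}={\varvec{x}})=2\,m({\varvec{x}})\), is only one convenient choice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 100, 50
beta0_0, beta_0 = 2.5, np.r_[2.0, 2.0, np.zeros(p - 2)]

# Design: X_{i,1} = i/n - 0.5; X_{i,j} = Phi(Z_{i,j}) - 0.5 with Cov(Z)(j,k) = 0.2^{|j-k|}
Sigma = 0.2 ** np.abs(np.subtract.outer(np.arange(p - 1), np.arange(p - 1)))
Z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
X = np.column_stack([np.arange(1, n + 1) / n - 0.5, norm.cdf(Z) - 0.5])

# Overdispersed Poisson: mean m(x) = exp(beta0 + x' beta), variance 2 m(x).
# Gamma-Poisson mixture with shape m(x) gives var = m(x) + m(x) = 2 m(x)  (an assumed generator).
m = np.exp(beta0_0 + X @ beta_0)
Y = rng.poisson(rng.gamma(shape=m, scale=1.0))
```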
Study 1 (raw data without outliers). For simulated data without contamination, the results are summarized in Table 1. The loss of efficiency (at the true model) from classical to robust estimates using the (non-penalized) oracle estimation method can be seen from the increase of \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\). For penalized methods, the robust estimates exhibit similar performance to those of the non-robust counterparts, with little loss in estimation efficiency or in the ability to identify relevant and irrelevant variables. Thus, there is no serious adverse effect of applying penalized robust estimation to clean datasets.
Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set, where 8 data points \((X_{i,j},Y_i)\) are subject to contamination: They are replaced by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = Y_i\,{\text{I}}(Y_i > 100) + A\, {\text{I}}(Y_i \le 100)\) with \(A=50\), \(i=1,\ldots ,8\),
with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 2 summarizes the results over 500 sets of contaminated data. A comparison of each penalized quasi-likelihood estimate across Tables 1 and 2 indicates that the presence of contamination substantially increases \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\). Among the 4 penalized estimates, the \(L_1\) penalty tends to have higher false positive rates. As observed from Table 2 with contaminated cases, the non-robust estimates are more sensitive to outliers than the robust counterparts. This lends support to Theorems 2 and 6. To provide a closer view of the estimates, Fig. 2 draws the boxplots of biases \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,5\), corresponding to results in Tables 1 and 2, using the \({\text{PMR}}\) selection method for penalty weights in the weighted-\(L_1\) penalty.
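The response part of this contamination scheme can be expressed as a small helper (illustrative Python); the accompanying covariate contamination, which involves the \(\text{Uniform}(0,1)\) draws \(U_i\), is not reproduced here.

```python
import numpy as np

def contaminate_poisson_responses(Y, n_out=8, A=50, threshold=100):
    """Replace Y_i, i = 1, ..., n_out, by Y_i* = Y_i I(Y_i > threshold) + A I(Y_i <= threshold)."""
    Y_star = np.asarray(Y, dtype=float).copy()
    idx = np.arange(n_out)
    Y_star[idx] = np.where(Y_star[idx] > threshold, Y_star[idx], A)
    return Y_star
```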
6.2 Bernoulli responses
We generate samples \(\{({\varvec{X}}_i, Y_i)\}_{i=1}^n\) with \(n=200\) from the model, \(Y_i\mid {\varvec{X}}_i = {\varvec{x}}_i \sim \text{Bernoulli}\{m({\varvec{x}}_i)\},\) where \({\text{logit}}\{m({\varvec{x}})\} = \beta _{0;0} + {\varvec{x}}^T{\varvec{\beta }}_{0}\) with \(\beta _{0;0} = 2\) and \({\varvec{\beta }}_{0} = (2,2,0, \ldots , 0)^T\). The predictor \({\varvec{X}}_i\sim N({\textbf{0}}, \Sigma _{p_{_n}})\), with \(\Sigma _{p_{_n}}(j,k)=0.1^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}\). Both the deviance and exponential loss functions are employed as the \({\text{BD}}\).
Study 1 (raw data without outliers). For simulated data without contamination, the results are summarized in Table 3. The robust estimates perform as well as the non-robust counterparts with respect to parameter estimation, variable selection and classification accuracy. Indeed, the optimal Bayes rule gives misclassification rates 0.137 for \(p_{_n}=50\) and 0.138 for \(p_{_n}=500\). Thus, the choice of loss function has an asymptotically negligible impact on classification performance. This agrees with the results of Theorem 9 on asymptotic classification consistency.
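For readers who wish to reproduce the Bayes benchmark, the sketch below simulates the Bernoulli design of this subsection (with \(p_{_n}=50\) as an example) and estimates the misclassification rate of the optimal Bayes rule, which equals \({\text{E}}[\min \{m({\varvec{X}}),\,1-m({\varvec{X}})\}]\), by Monte Carlo; it is an illustration, not the original simulation code.

```python
import numpy as np
from scipy.special import expit    # logistic function

rng = np.random.default_rng(0)
n, p = 200, 50
Sigma = 0.1 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
beta0 = np.zeros(p)
beta0[:2] = 2.0

# Training sample: logit{m(x)} = 2 + x^T beta_0, X ~ N(0, Sigma)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
m = expit(2.0 + X @ beta0)
Y = rng.binomial(1, m)

# Monte Carlo estimate of the Bayes misclassification rate E[min{m(X), 1 - m(X)}]
X_big = rng.multivariate_normal(np.zeros(p), Sigma, size=200_000)
m_big = expit(2.0 + X_big @ beta0)
bayes_mcr = np.mean(np.minimum(m_big, 1.0 - m_big))
```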
Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set. The contamination scheme is to replace the original 8 data points \((X_{i,j},Y_i)\) by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = 1- Y_i\), \(i=1,\ldots ,8\),
with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 4 summarizes the results over 500 sets of contaminated data. Regarding the robustness-efficiency tradeoff, conclusions analogous to those in Sect. 6.1 can be reached. Moreover, comparing Tables 3 and 4 reveals that (i) contamination increases the misclassification rates of all 5 methods; (ii) for contaminated cases in Table 4, robust procedures tend to reduce the misclassification rates; (iii) for robust estimation, the deviance loss is computationally more stable and yields lower misclassification rates than the exponential loss (see also Fig. 1); the deviance loss is therefore recommended for practical applications.
For the penalized methods, the tuning parameter \(\lambda _n\) is searched in the interval [0.0034, 0.1975], and \(\kappa _n\) is searched in the interval \([1/2^{8}, 1/2^{5}]\). To compare the classification performance with methods such as the classical-\({\text{SVM}}\) and robust-\({\text{SVM}}\) (using either the linear or Gaussian kernel, combined with auxiliary parameters \(c^*\) and/or s) in Wu and Liu (2007), Table 5 summarizes the results under the same set-ups as in Tables 3 and 4. Compared with classical- and robust-\({\text{SVM}}\)s (in Table 5), the robust-\({\text{BD}}\) method (in Tables 3, 4) clearly lowers \({\text{MCR}}\)s.
6.3 Gaussian responses
To further illustrate the benefits of the robust method relative to the non-robust method, Appendix 1.2 gives additional simulation studies for Gaussian responses.
7 Real data application
To illustrate the application of the penalized robust-\({\text{BD}}\) methods for classifying high-dimensional, low-sample-size data, we consider the Lymphoma data studied in Alizadeh et al. (2000), which identified two molecularly distinct forms of diffuse large B-cell Lymphoma (\(\text{DLBCL}\)). These two forms, called “germinal centre B-like (gc-B)” \(\text{DLBCL}\) and “activated B-like (a-B)” \(\text{DLBCL}\), had gene expression patterns indicative of different stages of B-cell differentiation.
The publicly available dataset contains 4026 genes across 47 samples, of which 24 are “gc-B” and 23 are “a-B”. We use the 10-nearest neighbor method to impute the missing expression data. After imputation, the dataset is standardized to zero mean and unit variance across genes. We intend to predict whether a sample can be categorized as “gc-B” or “a-B”.
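A minimal preprocessing sketch is given below (Python, with scikit-learn's KNNImputer standing in for the 10-nearest-neighbor imputation; the per-gene orientation of the standardization is an assumption of this sketch and should be adapted to the intended convention).

```python
import numpy as np
from sklearn.impute import KNNImputer

def preprocess_expression(expr):
    """expr: samples-by-genes array with NaN entries for missing expression values.
    Imputes by 10 nearest neighbors, then standardizes each gene to zero mean
    and unit variance across samples."""
    imputed = KNNImputer(n_neighbors=10).fit_transform(expr)
    centered = imputed - imputed.mean(axis=0)
    return centered / centered.std(axis=0, ddof=1)
```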
To evaluate the performance of the penalized estimates of parameters in the model, \({\text{logit}}\{{\text{P}}(Y=1 \mid X_1,\ldots ,X_{4026})\}=\beta _0+\sum _{j=1}^{4026} \beta _j X_j\), we randomly split the data into a training set with 31 samples (containing 16 cases of “gc-B” and 15 cases of “a-B”) and a test set with 16 samples (containing 8 cases of “gc-B” and 8 cases of “a-B”). For each training set, \(\lambda _n\) is selected by minimizing a 3-fold cross-validated estimate of the misclassification rate; \(\lambda _n\) and \(\kappa _n\) for the proposed \({\text{PMR}}\) method are searched on a surface of grid points. The test error (\(\text{TE}\)) gives the misclassification rate of the penalized classifier on the test set. Both the Huber \(\psi\)-function (with \(c=1.345\)) and Tukey \(\psi\)-function (with \(c=4.685\)) are utilized in the robust estimates, and the weight function \(w({\varvec{x}})\) is the same as used in Sect. 6. Table 6 tabulates the average of \(\text{TE}\) and the average number of selected genes over 100 random splits. The penalized estimates/classifiers induced by the deviance loss and the exponential loss yield similar performance. The \(L_1\) penalty selects approximately twice as many genes as the other penalty choices. On the basis of \(\text{TE}\) and sparse modeling considered simultaneously, the robust estimate combined with the weighted-\(L_1\) penalty appears to perform the best. The results also reveal that the choice of the \(\psi\)-functions in (6) has a negligible impact on the performance of the robust penalized estimates.
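The 3-fold cross-validated choice of \(\lambda _n\) can be organized as in the sketch below; `fit` and `predict` are hypothetical placeholders for training and applying the penalized robust-\({\text{BD}}\) classifier, so this is an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cv_select_lambda(X_train, y_train, lambda_grid, fit, predict, seed=0):
    """Return the lambda with the smallest 3-fold cross-validated misclassification rate."""
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    err = np.zeros(len(lambda_grid))
    for train_idx, val_idx in cv.split(X_train, y_train):
        for k, lam in enumerate(lambda_grid):
            model = fit(X_train[train_idx], y_train[train_idx], lam)
            y_hat = predict(model, X_train[val_idx])
            err[k] += np.mean(y_hat != y_train[val_idx]) / 3.0
    return lambda_grid[int(np.argmin(err))]
```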
8 Discussion
The conventional penalized least-squares and penalized-likelihood estimates of parameters exhibit the oracle property for sparsity recovery but lack resistance to outlying observations. This paper proposes a class of robust error measures called “robust-\({\text{BD}}\)” and introduces the “penalized robust-\({\text{BD}}\) estimate”. The “robust-\({\text{BD}}\)” induces a bounded influence function that makes the resulting penalized estimation less sensitive to outliers. Since \({\text{BD}}\) is widely used in machine learning practice, the proposed “penalized robust-\({\text{BD}}\) estimate” combined with suitably chosen weights for penalties can be broadly applicable in regression and classification problems of large dimensions (\(p_{_n}\approx n\)) and high dimensions (\(p_{_n} \gg n\)), achieving both the oracle property and the robustness to outliers in either the covariate space or the response space.
There are several limitations to our study and ongoing challenges that should be considered. Firstly, in the current scope of the paper, the technical conditions, particularly \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}5\) and \({\text{B}}3\) in Appendix 1.1, exclude heavy-tailed covariates \(X_j\) and responses Y. Similarly, the proposed method does not handle very high levels of contamination in the data. Moreover, relaxing (19) could be pursued in a separate study. Secondly, Algorithm 1 in Appendix 1.3.2 shows that the computational complexity depends on several key factors, including the sample size (n), the dimensionality (\(p_{_n}\)), the response variable type (Gaussian, Bernoulli, or Poisson), the selected penalty weights (\(\{{\widehat{w}}_{n,j}\}_{j=1}^{p_{_n}}\)), the proportion of contamination in the dataset, the approximation accuracy of the quadratic function to the loss function in (3), and the convergence properties of the coordinate-descent (\({\text{CD}}\)) algorithm (Friedman et al., 2010). In particular, the quadratic approximation may reduce the accuracy of the “penalized robust-\({\text{BD}}\) estimator” and slow down the convergence of the resulting \({\text{CD}}\) algorithm. This is demonstrated in numerical experiments that indicate an increase in CPU runtime as the response type changes from Gaussian to Bernoulli to Poisson, and as non-robust procedures using classical-\({\text{BD}}\) are replaced with robust counterparts using “robust-\({\text{BD}}\)”. Despite these limitations, our work contributes to the “robust-\({\text{BD}}\)” estimation of \(p_{_n}\)-dimensional parameters with some provable results when \(p_{_n} \approx n\) and \(p_{_n} \gg n\), and thus fills a gap in the literature. The goal of our work is to better understand the flexibility, challenges, and limitations of robust-\({\text{BD}}\) estimation as a data-analytic tool for big data analysis.
A number of open questions remain. (a) In the linear regression model, the quantification of the robustness property of estimates in terms of gross error sensitivity, rejection point, or local-shift sensitivity has been studied. However, beyond the linear model, such as the \(\text{GLM}\), relatively little theoretical work has been done on this property, even in the case of fixed dimensions \(p_{_n}=p\). Rigorously exploring the robustness property, including the breakdown point, for the proposed class of “penalized robust-\({\text{BD}}\) estimates” in the current model, assuming (1)–(2), remains a nontrivial task. (b) The existence of consistent local solutions of penalized estimates has also appeared in several other existing works, such as the nonconcave penalized likelihood method in Fan and Peng (2004). Devising efficient numerical procedures for obtaining a local solution that is consistent will be desirable but challenging. See Gong et al. (2013) for recent progress made in addressing this issue. (c) For \(p_{_n}\gg n\), it is of interest to explore some variable screening procedure for dimension reduction before applying the proposed “penalized robust-\({\text{BD}}\)” estimation method. (d) The weight function \(w({\varvec{x}})\) used in the numerical evaluations in Sects. 6 and 7 is a feasible choice for robust-\({\text{BD}}\) estimation in large (\(p_{_n}\approx n\)) and high (\(p_{_n} \gg n\)) dimensions. However, an optimal method for selecting the weight function, based on specific criteria such as the robustness property, would be desirable but has yet to be explored. In low dimensions (\(p_{_n}<n\)), the weight function (e.g., \(w({\varvec{x}}) = 1/\{1+({\varvec{x}}-\widehat{{\varvec{m}}})^T {\widehat{\Sigma }}^{-1} ({\varvec{x}}-\widehat{{\varvec{m}}})\}^{1/2}\)) can rely on robust estimates \(\widehat{{\varvec{m}}}\) and \({\widehat{\Sigma }}\) of the location vector and scatter matrix of \({\varvec{X}}\) (a sketch of this computation is given after this paragraph), and pages 137–138 of Heritier et al. (2009) suggest alternative weight functions for robust estimators of linear model parameters in fixed (\(p_{_n}=p\)) dimensions. (e) Finite-sample rather than asymptotic results may be obtained for certain types of covariates \(X_j\) and responses Y under more stringent assumptions. A complete and thorough study of these theoretical and methodological developments and computational advancements is beyond the scope of the current paper and needs to be examined in future research.
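As an illustration of the weight function in item (d), the sketch below computes \(w({\varvec{x}}) = 1/\{1+({\varvec{x}}-\widehat{{\varvec{m}}})^T {\widehat{\Sigma }}^{-1} ({\varvec{x}}-\widehat{{\varvec{m}}})\}^{1/2}\) using the minimum covariance determinant estimator as one possible robust choice of \((\widehat{{\varvec{m}}}, {\widehat{\Sigma }})\); this choice is an assumption of the sketch and, as noted above, is intended for \(p_{_n}<n\).

```python
import numpy as np
from sklearn.covariance import MinCovDet

def leverage_weights(X, random_state=0):
    """w(x) = 1 / sqrt(1 + (x - m_hat)^T Sigma_hat^{-1} (x - m_hat)),
    with robust location m_hat and scatter Sigma_hat from the MCD estimator."""
    mcd = MinCovDet(random_state=random_state).fit(X)
    diff = X - mcd.location_
    Sigma_inv = np.linalg.inv(mcd.covariance_)
    d2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)   # squared Mahalanobis distances
    return 1.0 / np.sqrt(1.0 + d2)
```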
Data availability
The Lymphoma data studied in Sect. 7 is publicly available from Alizadeh et al. (2000); the Colon cancer dataset in Appendix 1.2.2 is at http://genomics-pubs.princeton.edu/oncology/.
Code availability
MATLAB codes are available at https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM.
References
Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., … Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511.
Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D., & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the USA, 96, 6745–6750.
Altun, Y., & Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In G. Lugosi & H. U. Simon (Eds.), Learning theory: 19th annual conference on learning theory (pp. 139–153). Springer.
Bianco, A. M., & Yohai, V. J. (1996). Robust estimation in the logistic regression model. In Robust statistics, data analysis, and computer intensive methods (Schloss Thurnau, 1994) (Vol. 109, pp. 17–34), Lecture Notes in Statist., Springer.
Boente, G., He, X., & Zhou, J. (2006). Robust estimates in generalized partially linear models. Annals of Statistics, 34, 2856–2878.
Brègman, L. M. (1967). A relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7, 620–631.
Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when \(p\) is much larger than \(n\). Annals of Statistics, 35, 2313–2351.
Cantoni, E., & Ronchetti, E. (2001). Robust inference for generalized linear models. Journal of the American Statistical Association, 96, 1022–1030.
Dupuis, D. J., & Victoria-Feser, M.-P. (2011). Fast robust model selection in large datasets. Journal of the American Statistical Association, 106, 203–212.
Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470.
Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32, 928–961.
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106, 746–762.
Gong, P., Zhang, C., Lu, Z., Huang, J., & Ye, J. (2013). A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. In The 30th international conference on machine learning (ICML 2013).
Grünwald, P. D., & Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics, 32, 1367–1433.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. Wiley.
Heritier, S., Cantoni, E., Copt, S., & Victoria-Feser, M.-P. (2009). Robust methods in biostatistics. Wiley.
Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics, 35, 73–101.
Kanamori, T., Takenouchi, T., Eguchi, S., & Murata, N. (2007). Robust loss functions for boosting. Neural Computation, 19, 2183–2244.
Künsch, H., Stefanski, L., & Carroll, R. (1989). Conditionally unbiased bounded influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84, 460–466.
Lafferty, J. D., Della Pietra, S., & Della Pietra, V. (1997). Statistical learning algorithms based on Bregman distances. In Proceedings of the 5th Canadian workshop on information theory.
Lafferty, J. (1999). Additive models, boosting and inference for generalized divergences. In: Proceedings of the twelfth annual conference on computational learning theory (pp. 125–133). ACM Press.
Meier, L., van de Geer, S., & Bühlmann, P. (2008). The group Lasso for logistic regression. Journal of the Royal Statistical Society Series B, 70, 53–71.
Stefanski, L., Carroll, R., & Ruppert, D. (1986). Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73, 413–424.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
van der Vaart, A. W. (1998). Asymptotic statistics. Cambridge University Press.
van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes: With applications to statistics. Springer.
Vapnik, V. (1996). The nature of statistical learning theory. Springer.
Vemuri, B. C., Liu, M., Amari, S.-I., & Nielsen, F. (2011). Total Bregman divergence and its applications to DTI analysis. IEEE Transactions on Medical Imaging, 30, 475–483.
Wright, S. J. (1997). Primal-dual interior-point methods. SIAM.
Wu, Y. C., & Liu, Y. F. (2007). Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102, 974–983.
Zhang, C. M., Jiang, Y., & Shang, Z. (2009). New aspects of Bregman divergence in regression and classification with parametric and nonparametric estimation. Canadian Journal of Statistics, 37, 119–139.
Zhang, C. M., Jiang, Y., & Chai, Y. (2010). Penalized Bregman divergence for large-dimensional regression and classification. Biometrika, 97, 551–566.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H., & Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. Annals of Statistics, 36, 1108–1126.
Funding
Zhang’s work was partially supported by the U.S. National Science Foundation grants DMS-2013486 and DMS-1712418, and by funding provided by the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. The second author gratefully acknowledges grants from the National Natural Science Foundation of China (NSFC 12131006, 12071038) and a grant from the University Grants Council of Hong Kong.
Author information
Authors and Affiliations
Contributions
CM Zhang contributed to the proof, computation, and writing; LXZ contributed to the discussion, proof, and writing; YBS contributed to the implementation of \({\text{SVM}}\) and robust-\({\text{SVM}}\).
Corresponding author
Ethics declarations
Conflict of interest
Not applicable.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
All authors of the paper consent for publication.
Additional information
Editors: Krzysztof Dembczynski and Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Supplementary Appendix
1. Proofs, figures and tables, algorithm
Notations and symbols.
For a vector \({\varvec{a}}=(a_1,\ldots ,a_d)^T\), \(\Vert {\varvec{a}}\Vert _1={\sum _{j=1}^d |a_j|}\), \(\Vert {\varvec{a}}\Vert _2=(\sum _{j=1}^d a_j^2)^{1/2}\) and \(\Vert {\varvec{a}}\Vert _{\infty } = \max _{1\le j \le d} |a_j |\). Let \(\textbf{I}_{{\textsf {k}}}\) denote a \({\textsf {k}}\times {\textsf {k}}\) identity matrix, and \({\textbf{0}}_{p,q}\) denote a \(p\times q\) matrix of zero entries. For a matrix M, its eigenvalues, minimum eigenvalue, and maximum eigenvalue are denoted by \(\lambda _j(M)\), \(\lambda _{\min }(M)\), and \(\lambda _{\max }(M)\), respectively; \(\text{tr}(M)\) denotes the trace of a square matrix M; let \(\Vert M\Vert =\Vert M\Vert _2=\sup _{\Vert {\varvec{x}}\Vert _2=1}\Vert M{\varvec{x}}\Vert _2=\{\lambda _{\max }(M^TM)\}^{1/2}\) be the matrix \(L_2\) norm, and \(\Vert M\Vert _F = \{\text{tr}(M^T M)\}^{1/2}\) be the Frobenius norm. Throughout the proofs, C is used as a generic finite constant. The sign function \(\,\text{sign}(x)\) equals \(+1\) if \(x>0\), 0 if \(x=0\), and \(-1\) if \(x<0\). For a function g(x), the first-order derivative is \(g'(x)\) or \(g^{(1)}(x)\), the second-order derivative is \(g''(x)\) or \(g^{(2)}(x)\), and the jth-order derivative is \(g^{(j)}(x)\). The chi-squared distribution with k degrees of freedom is denoted by \(\chi _{k}^2\).
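A small numerical check of these norm conventions (plain NumPy, independent of the paper's code) is given below.

```python
import numpy as np

a = np.array([3.0, -4.0, 0.0])
M = np.array([[2.0, 1.0],
              [0.0, 3.0]])

l1, l2, linf = np.abs(a).sum(), np.linalg.norm(a, 2), np.abs(a).max()
spectral = np.linalg.norm(M, 2)        # ||M||_2 = {lambda_max(M^T M)}^{1/2}
frobenius = np.linalg.norm(M, 'fro')   # ||M||_F = {tr(M^T M)}^{1/2}
```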
The conditional expectation and conditional variance of Y given \({\varvec{X}}\) are denoted by \({\text{E}}(Y \mid {\varvec{X}})\) and \({\text{var}}(Y \mid {\varvec{X}})\), respectively. Notation in the asymptotic derivations follows van der Vaart (1998): \({\mathop {\longrightarrow }\limits ^{\text{P}}}\) denotes convergence in probability, \({\mathop {\longrightarrow }\limits ^{\mathcal L}}\) denotes convergence in distribution, \(o_{{\text{P}}}(1)\) is a term which converges to zero in probability, and \(O_{{\text{P}}}(1)\) is a term which is bounded in probability.
1.1 Proofs of Theorems 1–9
We first impose some regularity conditions, which are not the weakest possible but facilitate the technical derivations.
Condition A.
- A0. \(s_n\ge 1\) and \(p_{_n}-s_n \ge 1\). \(\sup _{n\ge 1} \Vert {\varvec{\beta }}_{0}^{({\text{I}})}\Vert _1<\infty\).
- A1. \(\Vert {\varvec{X}}\Vert _{\infty } = \max _{1\le j \le p_{_n}} |X_j |\) is bounded almost surely.
- A2. \({\text{E}}({\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^T)\) exists and is nonsingular in the case of \(p_{_n}+1 \le n\); \({\text{E}}\{{\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}\) exists and is nonsingular in the case of \(p_{_n}+1 > n\).
- A4. There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset.
- A5. \(w(\cdot )\ge 0\) is a bounded function. Assume that \(\psi (r)\) is a bounded, odd function, and twice differentiable, such that \(\psi '(r)\), \(\psi '(r)r\), \(\psi ''(r)\), \(\psi ''(r)r\) and \(\psi ''(r)r^2\) are bounded; \(V(\cdot )>0\), \(V^{(2)}(\cdot )\) is continuous. The matrix \(\textbf{H}_n^{({\text{I}})}\) is positive definite, with eigenvalues uniformly bounded away from zero.
- A5\('\). \(w(\cdot )\ge 0\) is a bounded function.
- A6. \(q^{(4)}(\cdot )\) is continuous, and \(q^{(2)}(\cdot )<0\). \(g_1^{(2)} (\cdot )\) is continuous.
- A7. \(F(\cdot )\) is monotone and a bijection, \(F^{(3)}(\cdot )\) is continuous, and \(F^{(1)}(\cdot )\ne 0\).
Condition B.
- B3. There exists a constant \(C \in (0,\infty )\) such that \(\sup _{n\ge 1}{\text{E}}\{|Y-m({\varvec{X}})|^j \} \le j! C^j\) for all \(j\ge 3\). Also, \(\inf _{n\ge 1,\, 1\le j\le p_{_n}} {\text{E}}\{{\text{var}}(Y\mid {\varvec{X}}) X_j^2 \} > 0\).
- B5. The matrices \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) are positive definite, with eigenvalues uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})}\Vert _2\) is bounded away from infinity.
Condition C.
- C4. There is a large enough open subset of \(\mathbb {R}^{p_{_n}+1}\) which contains the true parameter point \({\widetilde{{\varvec{\beta }}}}_{0}\), such that \(F^{-1}({\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}})\) is bounded almost surely for all \({\widetilde{{\varvec{\beta }}}}\) in the subset. Moreover, the subset contains the origin.
Condition D.
- D5. The eigenvalues of \(\textbf{H}_n^{({\text{I}})}\) are uniformly bounded away from zero. Also, \(\Vert (\textbf{H}_n^{({\text{I}})})^{-1/2} (\Omega _n^{({\text{I}})})^{1/2}\Vert _2\) is bounded away from infinity.
Condition E.
- E1. \(\min _{1\le j\le s_n} |{\text{cov}}(X_{j}, Y)|\succeq {\mathcal {A}}_n\) and \(\max _{s_n+1\le j\le p_{_n}}|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\) for some positive sequences \({\mathcal {A}}_n\) and \({\mathcal {B}}_n\), where the symbol \(s_{n} \succeq t_{n}\), for two nonnegative sequences \(s_{n}\) and \(t_{n}\), means that there exists a constant \(c > 0\) such that \(s_{n} \ge c\, t_{n}\) for all \(n\ge 1\).
- E2. \(\sup _{n \ge 1, 1\le j\le s_n} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} < \infty\); \(\inf _{n \ge 1, s_n+1\le j\le p_{_n}} {\text{E}}\{\text{q}_{_2}(Y; \alpha _0) X_{j}^2\} = \eta > 0\), where \(\alpha _0=F({\text{E}}(Y))\).
Proof of Theorem 1
We first need to show Lemma 1. \(\square\)
Lemma 1
(existence and consistency: \(p_{_n} \ll n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{A}}6\) and \({\text{A}}7\) in Appendix 1.1, the matrix \(\textbf{H}_n = {\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}{\widetilde{{\varvec{X}}}}^{T}\}\) is positive definite with eigenvalues uniformly bounded away from zero, and \(w_{\max }^{({\text{I}})}=O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n} \sqrt{s_n/p_{_n}})\}\). If \(p_{_n}^4/n \rightarrow 0\) as \(n\rightarrow \infty\), then there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{p_{_n}/n})\).
Proof
We follow the idea of the proof of Theorem 1 in Fan and Peng (2004). Let \(r_n = \sqrt{p_{_n}/n}\) and \({\widetilde{{\varvec{u}}}}_n=(u_0,u_1,\ldots ,u_{p_{_n}})^T\in \mathbb {R}^{p_{_n}+1}\). It suffices to show that for any given \({\epsilon }>0\), there exists a sufficiently large constant \(C_{\epsilon }\) such that, for large n we have
This implies that with probability at least \(1-{\epsilon }\), there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in the ball \(\{{\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n: \Vert {\widetilde{{\varvec{u}}}}_n\Vert _2\le C_{\epsilon }\}\) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2=O_{{\text{P}}}(r_n)\). To show (20), consider
where \(\Vert {\widetilde{{\varvec{u}}}}_n\Vert _2=C_{\epsilon }\).
First, we consider \(I_1\). By Taylor’s expansion, \(I_1\) has the decomposition,
where \(I_{1,1} = {r_n } /{n} \sum _{i=1}^n \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n\), \(I_{1,2} = {r_n^2}/{(2n)}\sum _{i=1}^n \text{p}_{_2}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}_{0})\, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^2\), and
\(I_{1,3} = {r_n^3}/{(6n)}\sum _{i=1}^n \text{p}_{_3}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^T {\widetilde{{\varvec{\beta }}}}^*) \, w({\varvec{X}}_{i}) ({\widetilde{{\varvec{X}}}}_{i}^T{\widetilde{{\varvec{u}}}}_n)^3\) for \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}_{0} + r_n{\widetilde{{\varvec{u}}}}_n\). Hence
For the term \(I_{1,2}\) in (22),
where \(I_{1,2,1} = 2^{-1}{r_n^2}{\widetilde{{\varvec{u}}}}_n^T \textbf{H}_n{\widetilde{{\varvec{u}}}}_n.\) Meanwhile, we have
Thus,
For the term \(I_{1,3}\) in (22), we observe that
which follows from Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}4\) and \({\text{A}}5\).
Next, we consider \(I_2\) in (21). Note \(I_2 = \lambda _n \sum _{j=1}^{s_n} w_{n,j} (|\beta _{j;0} + r_n u_j |-|\beta _{j;0}|) +\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j} |u_j |\). Clearly, by the triangle inequality,
in which
where \({\varvec{u}}_n^{({\text{I}})}=(u_1,\ldots ,u_{s_n})^T\). By (23)–(25) and \(p_{_n}^4/n \rightarrow 0\), we can choose some large \(C_{\epsilon }\) such that \(I_{1,1}\), \(I_{1,3}\) and \(I_{2,1}\) are all dominated by the first term of \(I_{1,2}\) in (24), which is positive by the eigenvalue assumption. This implies (20). \(\square\)
We now show Theorem 1. Write \({\widetilde{{\varvec{u}}}}_n= ({\widetilde{{\varvec{u}}}}_n^{({\text{I}})T}, {\varvec{u}}_n^{({\text{II}})T})^T\), where \({\widetilde{{\varvec{u}}}}_n^{({\text{I}})}=(u_0,u_1,\ldots ,u_{s_n})^T\) and \({\varvec{u}}_n^{({\text{II}})}=(u_{s_n+1},\ldots ,u_{p_{_n}})^T\). Following the proof of Lemma 1, it suffices to show (20) for \(r_n=\sqrt{s_n/n}\).
For the term \(I_{1,1}\) in (22),
It follows that
For the term \(I_{1,2}\) in (22), similar to the proof of Lemma 1, \(I_{1,2}=I_{1,2,1}+I_{1,2,2}\). We observe that
Then there exists a constant \(C>0\) such that
For the term \(I_{1,2,2}\),
where
For the term \(I_{1,3}\) in (22), since \(s_n p_{_n}=o(n)\), \(\Vert {\widetilde{{\varvec{\beta }}}}^*\Vert _1\) is bounded and thus
where
For the term \(I_2\) in (21), \(I_2 \ge I_{2,1}^{({\text{I}})}+I_{2,1}^{({\text{II}})},\) where \(I_{2,1}^{({\text{I}})}=-\lambda _n r_n \sum _{j=1}^{s_n} w_{n,j} |u_j |\) and \(I_{2,1}^{({\text{II}})}={\lambda _n r_n \sum _{j=s_n+1}^{p_{_n}} w_{n,j}|u_j |}\). Thus, we have
It can be shown that either \(I_{1,2,1}^{({\text{I}})}\) or \(I_{2,1}^{({\text{II}})}\) dominates all other terms in groups \(\mathcal G_1=\{I_{1,2,2}^{({\text{I}})}, I_{1,3}^{({\text{I}})}\}\), \(\mathcal G_2=\{I_{1,1}^{({\text{II}})}, I_{1,2,2}^{({\text{II}})}, I_{1,3}^{({\text{II}})}, I_{1,2,1}^{({\text{cross}})}, I_{1,2,2}^{({\text{cross}})}\}\) and \(\mathcal G_3=\{I_{1,1}^{({\text{I}})}, I_{2,1}^{({\text{I}})}\}\). Namely, \(I_{1,2,1}^{({\text{I}})}\) dominates \(\mathcal G_1\), and \(I_{2,1}^{({\text{II}})}\) dominates \(\mathcal G_2\). For \(\mathcal G_3\), since \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 \le C_{\epsilon }\), we have that
Hence, if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 \le C_{\epsilon }/2\), then \(\Vert {\widetilde{{\varvec{u}}}}_n^{({\text{I}})}\Vert _2 > C_{\epsilon }/2\), and thus \(\mathcal G_3\) is dominated by \(I_{1,2,1}^{({\text{I}})}\), which is positive; if \(\Vert {\varvec{u}}_n^{({\text{II}})}\Vert _1 > C_{\epsilon }/2\), then \(\mathcal G_3\) is dominated by \(I_{2,1}^{({\text{II}})}\), which is positive. This completes the proof. \(\square\)
Proof of Theorem 2
We first need to show Lemma 2. \(\square\)
Lemma 2
Assume Condition A in Appendix 1.1. If \(s_n^2/n=O(1)\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), then with probability tending to one, for any given \({\widetilde{{\varvec{\beta }}}}=({\widetilde{{\varvec{\beta }}}}^{({\text{I}})T}, {\varvec{\beta }}^{({\text{II}})T})^T\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\) and any constant \(C>0\), it follows that \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\textbf{0}}) = \min _{\Vert {\varvec{\beta }}^{({\text{II}})}\Vert _2 \le C\sqrt{s_n/n}} \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})},{\varvec{\beta }}^{({\text{II}})})\).
Proof
It suffices to prove that with probability tending to one, for any \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}\) satisfying \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), the following inequalities hold for \(s_n+1 \le j \le p_{_n}\),
namely, with probability tending to one,
Proofs for showing both inequalities are similar; we only need to show (26). Note that for \(\beta _{j} \ne 0\),
where \({\widetilde{{\varvec{\beta }}}}^*\) lies between \({\widetilde{{\varvec{\beta }}}}_{0}\) and \({\widetilde{{\varvec{\beta }}}}\). It follows that
The first term \(I_1\) satisfies that
For the term \(I_2\),
Therefore, by (27) and (28), the left side of (26) is
By \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), (26) is proved. \(\square\)
We now show Theorem 2. By Lemma 2, the first part of Theorem 2 holds that \(\widehat{{\widetilde{{\varvec{\beta }}}}}=(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})},{\textbf{0}}^T)^T\). To verify the second part of Theorem 2, notice the estimating equations \(\frac{\partial \ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})}{\partial {\widetilde{{\varvec{\beta }}}}^{({\text{I}})}} |_{{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}=\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}} = {\textbf{0}}\), since \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is a local minimizer of \(\ell _{n}({\widetilde{{\varvec{\beta }}}}^{({\text{I}})}, {\textbf{0}})\). Denote \({\varvec{d}}_n({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\}\) which is equal to \({\varvec{d}}_n\) when \({{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}={\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\). Since \(\min _{1\le j\le s_n}|\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that
Thus with probability tending to one, \({\varvec{d}}_n(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}) = {\varvec{d}}_n({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) = {\varvec{d}}_n\). Taylor’s expansion applied to the loss part on the left side of the estimating equations yields
where both \({\widetilde{{\varvec{\beta }}}}^{*({\text{I}})}\) and \({\widetilde{{\varvec{\beta }}}}^{**({\text{I}})}\) lie between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). Below, we will show
First, to show (30), note that \(K_2 - \textbf{H}_n^{({\text{I}})} = K_2-{\text{E}}(K_2) \equiv L_1\). Similar arguments for the proof of Lemma 1 give \(\Vert L_1\Vert _2 = O_{{\text{P}}}(s_n/\sqrt{n})\).
Second, a similar proof used for \(I_{1,3}\) in (22) completes (31).
Third, by (29)–(31) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), we see that
where \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n)\). Note that by Condition \({\text{B}}5\),
Thus
To complete proving the second part of Theorem 2, we apply the Lindeberg-Feller central limit theorem (van der Vaart, 1998) to \(\sum _{i=1}^n {\varvec{Z}}_{i}\), where \({\varvec{Z}}_{i} = - n^{-1/2} A_n (\Omega _n^{({\text{I}})})^{-1/2} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). It suffices to check two conditions: (I) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \mathbb {G}\); (II) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (I) follows from the fact that \({\text{var}}\{\text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}\} = \Omega _n^{({\text{I}})}\). To verify condition (II), notice that using Conditions \({\text{B}}5\) and \({\text{A}}5\),
Thus, we get \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{n (s_n/n)^{(2+\delta )/2}\} = O\{s_n^{(2+\delta )/2} / n^{\delta /2}\}\), which is o(1). This verifies Condition (II). \(\square\)
Proof of Theorem 3
Before showing Theorem 3, Lemma 3 is needed. \(\square\)
Lemma 3
Assume conditions of Theorem 3. Then
Proof
Following (32) in the proof of Theorem 2, we observe that \(\Vert {\varvec{u}}_n\Vert _2 = O_{{\text{P}}}(s_n^{5/2}/n) = o_{_{{\text{P}}}}(n^{-1/2})\). Furthermore, \(\Vert {\varvec{d}}_n\Vert _2 \le \sqrt{s_n} \lambda _n w_{\max }^{({\text{I}})} = o_{_{{\text{P}}}}(n^{-1/2})\). Condition \({\text{B}}5\) completes the proof for the first part.
To show the second part, denote \(U_n = A_n (\textbf{H}_n^{({\text{I}})})^{-1}\Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}A_n^T\) and \({\widehat{U}}_n = A_n ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}{\widehat{\Omega }}_n^{({\text{I}})} ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}A_n^T\). Notice that the eigenvalues of \((\textbf{H}_n^{({\text{I}})})^{-1} \Omega _n^{({\text{I}})} (\textbf{H}_n^{({\text{I}})})^{-1}\) are uniformly bounded away from zero. So are the eigenvalues of \(U_n\). From the first part, we see that
It follows that
where \({\varvec{Z}}_{i} = - n^{-1/2} U_n^{-1/2} A_n (\textbf{H}_n^{({\text{I}})})^{-1} \text{p}_{_1}(Y_{i}; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}\). To show \(\sum _{i=1}^n {\varvec{Z}}_{i} {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\), similar to the proof for Theorem 2, we check two conditions: (III) \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) \rightarrow \textbf{I}_{{\textsf {k}}}\); (IV) \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = o(1)\) for some \(\delta > 0\). Condition (III) is straightforward since \(\sum _{i=1}^n {\text{cov}}({\varvec{Z}}_{i}) = U_n^{-1/2} U_n U_n^{-1/2} = \textbf{I}_{{\textsf {k}}}\). To check condition (IV), similar arguments used in the proof of Theorem 2 give that \({\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) = O\{(s_n/n)^{{(2+\delta )}/{2}}\}.\) This and the boundedness of the \(\psi\)-function yield \(\sum _{i=1}^n {\text{E}}(\Vert {\varvec{Z}}_{i}\Vert _2^{2+\delta }) \le O\{s_n^{(2+\delta )/2}/n^{\delta /2}\} = o(1)\). Hence
Also, it can be concluded that \(\Vert {\widehat{U}}_n - U_n \Vert _2=o_{_{{\text{P}}}}(1)\) and that the eigenvalues of \({\widehat{U}}_n\) are uniformly bounded away from zero and infinity with probability tending to one. Consequently,
Combining (33), (34) and Slutsky’s theorem completes the proof that \(\sqrt{n} {\widehat{U}}_n^{-1/2}A_n (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \textbf{I}_{{\textsf {k}}})\). \(\square\)
We now show Theorem 3, which follows directly from the null hypothesis \(H_0\) in (14) and the second part of Lemma 3. This completes the proof. \(\square\)
Proof of Theorem 4
The proof of Theorem 4 is similar to that of Theorem 7, except that in Part 2, \({\mathcal {C}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)
Proof of Theorem 5
The proof of Theorem 5 is similar to that of Theorem 8, except that in Part 2, \({\mathcal {B}}_n\) is changed from \(\lambda _n\sqrt{n}/s_n\) to \(\lambda _n\). \(\square\)
Proof of Theorem 6
Assumption (19) implies that \(\ell _{n}({\widetilde{{\varvec{\beta }}}})\) in (3) is convex in \({\widetilde{{\varvec{\beta }}}}\). By Karush-Kuhn-Tucker conditions (Wright 1997, Theorem A.2), a set of sufficient conditions for an estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}} = ({\widehat{\beta }}_{0},{\widehat{\beta }}_{1},\ldots ,{\widehat{\beta }}_{p_{_n}})^T\) being a global minimizer of (3) is that
Before proving Theorem 6, we first show Lemma 4. \(\square\)
Lemma 4
(existence and consistency: \(p_{_n}\gg n\)) Assume (19) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5'\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Suppose \(s_n^4/n \rightarrow 0\), \(\log (p_{_n}-s_n)/n = O(1)\), \(\log (p_{_n}-s_n)/\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\} = o_{_{{\text{P}}}}(1)\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\). Assume \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). Then with probability tending to one, there exists a global minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) of \(\ell _n({\widetilde{{\varvec{\beta }}}})\) in (3) which satisfies that
- \(\mathrm {(i)}\) \(\widehat{{\varvec{\beta }}}^{({\text{II}})} = {\textbf{0}}\);
- \(\mathrm {(ii)}\) \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is the minimizer of the oracle subproblem,
$$\begin{aligned} \ell _{n}^O({\widetilde{{\varvec{\beta }}}}^{({\text{I}})})=\frac{1}{n}\sum _{i=1}^n \rho _{{q}}(Y_{i}, F^{-1}({\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}^{({\text{I}})}))\, w({\varvec{X}}_{i})+\lambda _n\sum _{j=1}^{s_n} w_{n,j}|\beta _{j} |. \end{aligned}$$ (36)
Proof
Let \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} = ({\widehat{b}}_{0},{\widehat{b}}_{1},\ldots ,{\widehat{b}}_{s_n})^T\) be the minimizer of the subproblem (36). By Karush-Kuhn-Tucker necessary conditions (Wright 1997, Theorem A.1), \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\) satisfies that
In the following, we will verify conditions
and
It then follows, from (37), (38) and (35), that \((\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})T}, {\textbf{0}}^T)^T\) is the global minimizer of (3). This will in turn imply Lemma 4.
First, we prove that (37) holds with probability tending to one. Applying Lemma 1 to the subproblem (36), we conclude that \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\). Since \({\min _{1\le j\le s_n}|\beta _{j;0}|/ \sqrt{s_n/n}} \rightarrow \infty\) as \(n \rightarrow \infty\), it is seen that
Hence (37) holds with probability tending to one.
Second, we prove that (38) holds with probability tending to one. It suffices to prove that
By Taylor’s expansion, we have that
with \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})}\). Then (39) holds if we can prove
and
We first prove (40). Set \(\text{p}_{1i} = \text{p}_{_1}(Y_{i};{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\). Since \({\log (p_{_n}-s_n)=O(n)}\) and \({\log (p_{_n}-s_n) = o_{_{{\text{P}}}}\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\}}\), we see that
This implies (40).
Second, we prove (41). Since \(\Vert {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _1 < \infty\) and \(\Vert \widehat{{\widetilde{{\varvec{b}}}}}_n^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), it follows that
and then \(\Vert {\widetilde{{\varvec{\beta }}}}^{({\text{I}})*}\Vert _1 = O_{{\text{P}}}(1)\), thus
Here \({w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty }\) is used. Hence (41) is proved. \(\square\)
The first part of Theorem 6 follows from the first part of Lemma 4. The second part of Theorem 6 follows directly from applying Theorem 2 to the oracle subproblem (36). \(\square\)
Proof of Theorem 7
It is easy to see that \({\widehat{\beta }}_j^{{\text{PMR}}} = \arg \min _\beta \ell ^{{\text{PMR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{PMR}}*}_{n,j}(\beta ) = \ell ^{{\text{PMR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). The rest of the proof contains two parts.
Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})} {\mathcal {A}}_n = O_{{\text{P}}}(1)\). It suffices to show that there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{\delta \rightarrow 0+} \inf _{n\ge 1} {\text{P}}(\min _{1\le j \le s_n} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |> {\mathcal {A}}_n \delta ) = 1.\) It suffices to prove that, for \(1\le j\le s_n\), there exist some \(b_{j}\) with \(|b_{j} |= 2\delta\) such that
and there exists some large enough \(C_n>0\) such that
Equations (42) and (43) imply that with probability tending to one, there must exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \({\mathcal {A}}_n\, \delta< |{\widehat{\beta }}_{j}^{{\text{PMR}}} |< {\mathcal {A}}_n\, C_n\) for \(1\le j\le s_n\).
First, we prove (43). For every \(n \ge 1\), when \(|\beta |\rightarrow \infty\),
Thus (43) holds.
Second, we prove (42). Since \({\mathcal {A}}_n=O(1)\), we see that \(|{\mathcal {A}}_n\, \beta |\le O(1) \delta \rightarrow 0\) as \(\delta \rightarrow 0+\). For \(1\le j\le s_n\), by Taylor’s expansion,
where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\), \(\theta _{ij}^*=\theta _{ij}({\mathcal {A}}_n\, \beta _{j}^*)\), \(\theta _{ij}(\beta )={\widehat{\alpha }}_j(\beta ) +X_{i,j}\beta\) and \(\beta _{j}^*\) is between 0 and \(\beta\). Thus we have that
where \(c_{ij}^{*}=\theta _{ij}({\mathcal {A}}_n\, b_j^*)\), with \(b_{j}^*\) between 0 and \(b_{j}\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain
Choosing \(b_{j} = -2\delta \,\text{sign}\{{\widehat{C}}_0\, {\text{cov}}(X_{j},Y)\}\), which satisfies \(|b_{j} |= 2\delta\), gives
We can see that \(|I_{1,2} |= O_{{\text{P}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta ,\) by Bernstein’s inequality (van der Vaart and Wellner 1996, Lemma 2.2.11). Similarly, \(|I_{1,3} |\le o_{_{{\text{P}}}}({\mathcal {A}}_n \{\log (s_n)/n\}^{1/2})\delta\). For terms \(I_2\) and \(I_3\), we observe that \(|I_2 |\le O_{{\text{P}}}({\mathcal {A}}_n^2)\,\delta ^2\) and \(|I_3 |= O({\mathcal {A}}_n\,\kappa _n)\delta\). The conditions \(\log (p_{_n}) = o(n \kappa _n^2)\) and \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\) imply that \(\{\log (s_n)/n\}^{1/2}/{\mathcal {A}}_n=o(1)\). Together with the condition \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\), we can choose a small enough \(\delta > 0\) such that with probability tending to one, \(I_{1,2}\), \(I_{1,3}\), \(I_2\) and \(I_3\) are dominated by \(I_{1,1}\), which is positive. Thus (42) is proved.
Part 2. For \({\mathcal {C}}_n=\lambda _n\sqrt{n}/s_n\), we will show that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {C}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\). It suffices to prove that for any \({\epsilon }> 0\), there exist local minimizers \({\widehat{\beta }}_{j}^{{\text{PMR}}}\) of \(\ell ^{{\text{PMR}}*}_{n,j}(\beta )\) such that \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{PMR}}} |\le {\mathcal {C}}_n {\epsilon }) = 1.\) Similar to the proof of Lemma 1, we will prove that for any \({\epsilon }> 0\),
Since \({\mathcal {C}}_n\rightarrow 0\) as \(n\rightarrow \infty\), we have that by Taylor’s expansion,
where \(\beta _{j}^*\) is between 0 and \(\beta\). Similar to the proof in Part 1,
Then \(|I_{1,1} |\le o({\mathcal {C}}_n {\mathcal {B}}_n {\epsilon }),\) \(|I_{1,2} |\le O_{{\text{P}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\) and \(|I_{1,3} |\le o_{_{{\text{P}}}}[{\mathcal {C}}_n \{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }\). Hence \(|I_1 |\le O_{{\text{P}}}[{\mathcal {C}}_n\{\log (p_{_n}-s_n+1)/n\}^{1/2}]{\epsilon }+ o({\mathcal {C}}_n {\mathcal {B}}_n){\epsilon }.\) For the term \(I_2\), we have that \(|I_2 |\le O_{{\text{P}}}({\mathcal {C}}_n^2){\epsilon }^2.\) Note \(I_3 = {\mathcal {C}}_n \kappa _n {\epsilon }.\) Since \(\log (p_{_n}) = o(n \kappa _n^2)\), \({\mathcal {B}}_n = O(\kappa _n)\) and \({\mathcal {C}}_n = o(\kappa _n)\), it follows that with probability tending to one, terms \(I_1\) and \(I_2\) are dominated by \(I_3\), which is positive. So (44) is proved. \(\square\)
Proof of Theorem 8
It is easy to see that \({\widehat{\beta }}_j^{{\text{MR}}} = \arg \min _\beta \ell ^{{\text{MR}}*}_{n,j}(\beta )\), where \(\ell ^{{\text{MR}}*}_{n,j}(\beta ) = \ell ^{{\text{MR}}}_{n,j}({\widehat{\alpha }}_j(\beta ),\beta )\), and \({\widehat{\alpha }}_j(\beta )\) satisfies \(n^{-1}\sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta )=0\) for \(j = 1,\ldots ,p_{_n}\). From (11), \({\widehat{\alpha }}_1(0)=\cdots ={\widehat{\alpha }}_{p_{_n}}(0)\). Let \({\widehat{\alpha }}_0 = {\widehat{\alpha }}_1(0)\). Then \({\widehat{\alpha }}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}\alpha _0\), where \(\alpha _0 = F(\mu _{_0})\) with \(\mu _{_0} = {\text{E}}(Y)\). Let \(h_{n,j}(\beta ) =\frac{{\text{d}}}{{\text{d}}\beta } \ell _{n,j}^{{\text{MR}}*}(\beta )= n^{-1} \sum _{i=1}^n \text{q}_{_1}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}\). Then \(h_{n,j}'(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_i; {\widehat{\alpha }}_j(\beta )+X_{i,j}\beta ) \{{\widehat{\alpha }}_j'(\beta )+X_{i,j}\}^2\) and \(h_{n,j}''(\beta ) = n^{-1} \sum _{i=1}^n \text{q}_{3i}(\beta )\). The minimizer \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (17) satisfies the estimating equations, \(h_{n,j}({\widehat{\beta }}^{{\text{MR}}}_{j})= 0\). The rest of the proof consists of two parts.
Part 1. For \({\mathcal {A}}_n=\lambda _n \sqrt{n}\), we will show that \({\widehat{w}}_{\max }^{({\text{I}})}{\mathcal {A}}_n=O_{{\text{P}}}(1)\), which is \({{\mathcal {A}}_n}/{\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} =O_{{\text{P}}}(1)\). That is, \(\lim _{\delta \rightarrow 0+} \sup _{n\ge 1} {\text{P}}(\min _{1\le j\le s_n} |{\widehat{\beta }}_{j}^{{\text{MR}}} |< {\mathcal {A}}_n \delta )=0\). Using the Bonferroni inequality, it suffices to show that
With assumption (11) for the convex \({\text{BD}}\), \(h_{n,j}(\cdot )\) is an increasing function. Thus
Note that \({\mathcal {A}}_n=O(1)\) gives \({\mathcal {A}}_n\,\delta \rightarrow 0\) as \(\delta \rightarrow 0+\). By Taylor’s expansion, for \(1\le j\le s_n\), we have that
with \(\delta _j^* \in (0, \delta )\). Let \({\widehat{C}}_0 = q''({\widehat{\mu }}_{_0})/F'({\widehat{\mu }}_{_0}) \ne 0\), where \({\widehat{\mu }}_{_0}=F^{-1}({\widehat{\alpha }}_0)\). Then \({\widehat{C}}_0 {\mathop {\longrightarrow }\limits ^{\text{P}}}C_0\), where \(C_0=q''(\mu _{_0})/F'(\mu _{_0})\). We obtain
Because \({\mathcal {A}}_n = O(1)\), \(|{\text{cov}}(X_j, Y) |\ge c\, {\mathcal {A}}_n\), \(1\le j\le s_n\), and both
\(\max _{1\le j\le s_n} {\text{E}}[n^{-1} \sum _{i=1}^n \text{q}_{_2}(Y_{i}; {\widehat{\alpha }}_0) \{{\widehat{\alpha }}_j'(0)+X_{i,j}\}^2]\) and \(\max _{1\le j\le s_n} {\text{E}}\{n^{-1} \sum _{i=1}^n |\text{q}_{3i}({\mathcal {A}}_n\,\delta _j^*) |\}\) are bounded, we can choose \(\delta\) small enough such that, uniformly for all \(1 \le j\le s_n\), the term \(I_{1j, 1}={\widehat{C}}_0\, {\text{cov}}(X_j, Y)\) dominates \(I_{2j}\) and \(I_{3j}\). By assuming \({\widehat{C}}_0\, {\text{cov}}(X_j, Y) < 0\) without loss of generality,
for some positive constants C, \(C_1\) and \(C_2\), where the last inequality applies the Bernstein inequality. By (45), for a small enough \(\delta > 0\),
The equality in (46) follows from \({\mathcal {A}}_n = O(1)\), \(\lambda _n n \rightarrow \infty\) and \(\log (s_n) = o(\lambda _n^2 n^2)\), where the latter two are implied by the conditions \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\).
Part 2. For \({\mathcal {B}}_n=\lambda _n \sqrt{n}/s_n\), we will prove that \({\widehat{w}}_{\min }^{({\text{II}})} {\mathcal {B}}_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\), which is \({\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |} / {{\mathcal {B}}_n} = o_{_{{\text{P}}}}(1)\). Namely, for any \({\epsilon }> 0\), \(\lim _{n\rightarrow \infty } {\text{P}}(\max _{s_n+1\le j\le p_{_n}} |{\widehat{\beta }}_{j}^{{\text{MR}}} |\ge {\mathcal {B}}_n {\epsilon }) = 0\). By the Bonferroni inequality, it suffices to show that
Since \(h_{n,j}(\cdot )\) is increasing, we have that for \(j=s_n+1, \ldots , p_{_n}\),
Similar to Part 1, \({\mathcal {B}}_n=o(1)\) gives that for \(j=s_n+1, \ldots , p_{_n}\),
with \({\epsilon }_j^* \in (0, \delta )\). Since \({\mathcal {B}}_n=o(1)\), \(|{\text{cov}}(X_{j}, Y)|= o({\mathcal {B}}_n)\), \(s_n+1\le j\le p_{_n}\), and from Condition \({{\text{E}}2}\), \(|J_{2j} |\ge {\mathcal {B}}_n{\epsilon }\eta\), as \(n\rightarrow \infty\), \(J_{2j}\) dominates \(I_{1j,1}\) and \(J_{3j}\). Applying Bernstein’s inequality, for large n,
for some positive constants C, \(C_1\) and \(C_2\), where \(I_{1j}\), \(I_{1j,2}\) and \(I_{1j,3}\) are as defined in Part 1. Similarly,
The equality in (50) follows from the conditions \({\mathcal {B}}_n = o(1)\), \(\lambda _n n/s_n \rightarrow \infty\) and \(\log (p_{_n}) = o(\lambda _n^2 n^2/s_n^2)\). \(\square\)
Proof of Theorem 9
For part (i), note that for \({\varvec{X}}^o = ({\varvec{X}}^{o({\text{I}})T}, {\varvec{X}}^{o({\text{II}})T})^T\), \({\widetilde{{\varvec{X}}}}^o = (1, {\varvec{X}}^{oT})^T\) and \({\widetilde{{\varvec{X}}}}^{o({\text{I}})} = (1,{\varvec{X}}^{o({\text{I}})T})^T\),
for some \({\widetilde{{\varvec{\beta }}}}^*\) located between \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) and \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\). By Condition \({\text{A}}4\), we conclude that \((F^{-1})'({\widetilde{{\varvec{X}}}}^{o({\text{I}})T} {\widetilde{{\varvec{\beta }}}}^*) = O_{{\text{P}}}(1)\). This along with \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(r_n)\) and \(\Vert {\widetilde{{\varvec{X}}}}^{o({\text{I}})}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n})\) implies that \(|{\widehat{m}}({\varvec{X}}^o)- m({\varvec{X}}^o) |= O_{{\text{P}}}(r_n\sqrt{s_n}) = o_{_{{\text{P}}}}(1).\) The rest of the proof is similar to that of Theorem 9 in Zhang et al. (2010) and is omitted.
For part (ii), using the proof similar to Lemma A1 of Zhang et al. (2010), we obtain that for any \({\text{BD}}\) \(\texttt {Q}\) satisfying (4),
It follows that
Setting \(\texttt {Q}\) to be the misclassification loss implies
which combined with part (i) completes the proof. \(\square\)
1.2 Additional numerical studies
1.2.1 Gaussian responses in Sect. 6.3
Random samples \(\{({\varvec{X}}_{i}, Y_i)\}_{i=1}^n\) of size \(n=200\) are generated from the model,
where \(\beta _{0;0} = 1\), \({\varvec{\beta }}_{0} = (2, 1.5, 0.8, -1.5, 0.4, 0,\ldots ,0)^T\) with \(\sigma ^2 =1\). Here \(\Sigma _{p_{_n}}(j,k)=\rho ^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}\), with \(\rho =0.1\). The quadratic loss is used as the \({\text{BD}}\).
Study 1 (raw data without outliers). For simulated data in the non-contaminated case, the results are summarized in Table 7. The robust estimators perform very similarly to the non-robust counterparts.
Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set, where 7 data points \((X_{i,j},Y_i)\) are contaminated as follows: They are replaced by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = Y_i {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma > 2\} + 15 {\text{I}}\{|Y_i-m({\varvec{X}}_{i}) |/\sigma \le 2\}\), \(i=1,\ldots ,7\),
with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 8 summarizes the results over 500 sets of contaminated data. A comparison of each estimator in Tables 7 and 8 indicates that the presence of contamination substantially increases the estimation errors \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\) and reduces either \({\text {C-Z}}\) or \({\text {C-NZ}}\). On the other hand, it is clearly observed that the non-robust estimates are more sensitive to outliers than the robust counterparts.
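For reference, the Gaussian design and the response part of the contamination scheme can be simulated as below (illustrative Python; the covariate contamination involving the \(U_i\) draws is not reproduced).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 200, 50, 1.0                        # p_n = 50 is illustrative
Sigma = 0.1 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta0 = np.zeros(p)
beta0[:5] = [2.0, 1.5, 0.8, -1.5, 0.4]
m = 1.0 + X @ beta0                               # beta_{0;0} = 1
Y = m + sigma * rng.standard_normal(n)

# Contaminate the first 7 responses:
# Y* = Y I{|Y - m(x)|/sigma > 2} + 15 I{|Y - m(x)|/sigma <= 2}
Y_cont = Y.copy()
idx = np.arange(7)
Y_cont[idx] = np.where(np.abs(Y[idx] - m[idx]) / sigma > 2, Y[idx], 15.0)
```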
To further assess the impact of the sample size n on the parameter estimates, we display boxplots of \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,8\), using the \({\text{PMR}}\) selection method for the weighted-\(L_1\) penalty, in Fig. 3 (for \(n=200\)) and Fig. 4 (for \(n=100\)). The comparison supports the consistency of both the classical and robust estimates of large-dimensional model parameters for clean data as n increases, as well as the stability of the robust estimates under a small amount of contamination.
1.2.2 Real data analysis
We consider the classification of the colon cancer data discussed in Alon et al. (1999) and available at http://genomics-pubs.princeton.edu/oncology/. The data consist of 2000 genes and 62 samples, of which 22 come from normal colon tissues and 40 from tumor tissues. Similar to the analysis in Sect. 7, the data set is randomly split into two parts, with 45 samples used for training and the remaining 17 for testing. Table 9 summarizes the average of the test errors (\(\text{TE}\)) and the average number of selected genes over 100 random splits. We observe that the robust procedures tend to select fewer genes than the non-robust procedures, without much increase in the test errors. This lends further support to the practicality of the proposed penalized robust-\({\text{BD}}\) estimation.
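A minimal sketch of this evaluation scheme (random 45/17 splits repeated 100 times, averaging the test error and the number of selected genes) is given below; `fit_classifier` is a hypothetical stand-in for the penalized robust-\({\text{BD}}\) fitting routine and is not part of the authors' code.

```python
import numpy as np

def evaluate_random_splits(X, y, fit_classifier, n_train=45, n_splits=100, seed=0):
    """Sketch: repeat random train/test splits and average the misclassification
    test error and the number of selected genes. `fit_classifier(X_tr, y_tr)` is a
    placeholder expected to return (predict_fn, selected_gene_mask)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    test_errors, model_sizes = [], []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        train, test = perm[:n_train], perm[n_train:]
        predict, selected = fit_classifier(X[train], y[train])
        test_errors.append(np.mean(predict(X[test]) != y[test]))  # misclassification rate
        model_sizes.append(int(np.sum(selected)))
    return float(np.mean(test_errors)), float(np.mean(model_sizes))
```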
1.3 Numerical procedure for penalized robust-\({\text{BD}}\) estimator in (3)
1.3.1 Optimization algorithm
Numerically, the penalized robust-\({\text{BD}}\) estimators in (3), combined with the penalties used in Sects. 6 and 7, are implemented by extending the coordinate descent (\({\text{CD}}\)) iterative algorithm (Friedman et al., 2010), with the initial value \((b,0,\ldots ,0)^T\), where \(b=\log \{(\overline{Y}_n+0.1)/(1-\overline{Y}_n+0.1)\}\) and \(b=\log (\overline{Y}_n+0.1)\) for Bernoulli and count responses, respectively, using the sample mean \(\overline{Y}_n\) of \(\{Y_i\}_{i=1}^n\). Namely, the loss term
in (3) is locally approximated by a weighted quadratic loss, and the solution of (3) is obtained by the \({\text{CD}}\) method. In particular, the gradient vector and Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) are
The quadratic approximation is supported by the fact that the Hessian matrix of \(L({\widetilde{{\varvec{\beta }}}})\) evaluated at the true parameter vector \({\widetilde{{\varvec{\beta }}}}_0\) is
which, combined with the property \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\}\ge 0\) discussed in part (d) of Sect. 2.2, indicates that with probability tending to one, the matrix \(L''({\widetilde{{\varvec{\beta }}}}_0)\) is positive semidefinite.
Both \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) in \(L'({\widetilde{{\varvec{\beta }}}})\) and \(L''({\widetilde{{\varvec{\beta }}}})\) are calculated using (9), which incorporates the Huber and Tukey \(\psi\)-functions; where the derivative \(\psi '(r)\) is not defined, it can be replaced by a subgradient or a suitable approximation.
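For illustration, the Huber and Tukey (biweight) \(\psi\)-functions and the derivative/subgradient convention just mentioned can be coded as follows. This is a minimal sketch; the tuning constants shown are the conventional 95%-efficiency defaults and are not necessarily the values used in the paper.

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber psi-function: identity on [-c, c], clipped to +/- c outside."""
    return np.clip(r, -c, c)

def huber_psi_prime(r, c=1.345):
    """Derivative of the Huber psi; at |r| = c the subgradient value 1 is used."""
    return (np.abs(r) <= c).astype(float)

def tukey_psi(r, c=4.685):
    """Tukey biweight psi-function: r * (1 - (r/c)^2)^2 on [-c, c], zero outside."""
    u = (r / c) ** 2
    return np.where(np.abs(r) <= c, r * (1.0 - u) ** 2, 0.0)

def tukey_psi_prime(r, c=4.685):
    """Derivative of the Tukey biweight psi: (1 - u) * (1 - 5u) with u = (r/c)^2 on [-c, c]."""
    u = (r / c) ** 2
    return np.where(np.abs(r) <= c, (1.0 - u) * (1.0 - 5.0 * u), 0.0)
```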
1.3.2 Pseudo code, source code, and computational complexity analysis
Algorithm 1 summarizes the complete procedure for numerically computing the “penalized robust-\({\text{BD}}\) estimator” in (3).
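To make the preceding description concrete, the sketch below shows one outer iteration of the local quadratic approximation combined with coordinate descent and soft-thresholding, in the spirit of Friedman et al. (2010). Here `p1` and `p2` are placeholder callables for the quantities \(\text{p}_{_1}\) and \(\text{p}_{_2}\) computed from (9); this is not the authors' MATLAB implementation (see the GitHub repository below for that), and a weighted-\(L_1\) penalty with an unpenalized intercept is assumed.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def cd_outer_step(X1, y, beta, p1, p2, lam, pen_weights, n_sweeps=10, eps=1e-6):
    """One outer step: form a local weighted quadratic surrogate of the robust-BD
    loss at the current beta, then run coordinate-descent sweeps on the surrogate.
    X1 has a leading column of ones; the intercept (j = 0) is left unpenalized."""
    n, d = X1.shape
    eta = X1 @ beta                        # current linear predictors
    g = p1(y, eta)                         # first-order quantity p_1
    h = np.maximum(p2(y, eta), eps)        # second-order quantity p_2, kept positive
    z = eta - g / h                        # working responses
    for _ in range(n_sweeps):
        for j in range(d):
            r_j = z - X1 @ beta + X1[:, j] * beta[j]     # partial residuals excluding j
            num = np.sum(h * X1[:, j] * r_j) / n
            den = np.sum(h * X1[:, j] ** 2) / n
            if j == 0:
                beta[j] = num / den                      # unpenalized intercept update
            else:
                beta[j] = soft_threshold(num, lam * pen_weights[j - 1]) / den
    return beta
```

In this framework, the initialization \((b,0,\ldots ,0)^T\) described above corresponds to starting with `beta = np.concatenate(([b], np.zeros(d - 1)))` and repeating `cd_outer_step` until the coefficient updates stabilize.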
To illustrate the computational complexity, Tables 10 and 11 compare the runtimes of the non-robust and robust procedures. All computations are performed using MATLAB (Version: 9.12.0.1956245 (R2022a) Update 2) on Windows 11, with a 12th Gen Intel(R) Core(TM) i9-12900 processor, 2400 MHz, 16 cores, and 24 logical processors. MATLAB source codes are available on GitHub at https://github.com/ChunmingZhangUW/Robust_penalized_BD_high_dim_GLM. For either clean or contaminated data, the computational cost depends on the type of response variable, the dimensionality, and the procedure: Poisson-type responses are more computationally intensive than Gaussian responses; robust procedures are slower than their non-robust counterparts; and higher-dimensional settings demand more computation than lower-dimensional ones.