1 Introduction

Advances in high-throughput technologies have made it possible to collect sophisticated, high-dimensional datasets, such as microarray data, genome-wide human SNP data, high-frequency financial data, functional data, and brain imaging data. In contrast to conventional datasets, which contain fewer variables than observations, large-dimensional datasets may involve as many variables as observations, or even more. These scenarios correspond to \(p_{_n}\ll n\), \(p_{_n}\approx n\), and \(p_{_n}\gg n\), respectively, where \(p_{_n}\) denotes the number of variables and n the number of observations. A fundamental problem common to any large-dimensional dataset is that observations are more prone to outliers, in either the covariate space or the response space, than those in low-dimensional datasets. In such settings, outliers can lead to erroneous conclusions in statistical estimation, yet the contamination mechanism is in general complex and intractable. It can also be difficult, or even impossible, to spot outliers in large-dimensional or highly structured data. Hence, developing robust statistical estimation and inference procedures that are resistant to outliers in large-dimensional data becomes increasingly important.

In recent years, much attention has been focused on developing penalized estimates of parameters in regression models with a large number of parameters. Examples include the \(\text{Lasso}\) (Tibshirani, 1996), the \(\text{SCAD}\) (Fan & Peng, 2004), the adaptive \(\text{Lasso}\) (Zou, 2006), the Dantzig selector (Candes & Tao, 2007), and the group \(\text{Lasso}\) (Meier et al., 2008), among others. These penalized least-squares estimates or penalized-likelihood estimates are obtained using a quadratic loss function or require full knowledge of the likelihood function. The penalized “classical-\({\text{BD}}\)” estimation was proposed in Zhang et al. (2010) for \(p_{_n} \ll n\) and \(p_{_n}\approx n\), where the error measure \({\text{BD}}\) includes the quadratic loss, the exponential family of distributions, the (negative) quasi-likelihood, and many others as special cases. However, these non-robust estimates do not handle outlying observations. General tools for investigating the robustness properties of penalized estimates, especially when \(p_{_n}\approx n\) and \(p_{_n}\gg n\), seem to be much less developed.

It is well known that the influence functions of classical (non-penalized) regression estimates based on the quadratic loss function and the likelihood are unbounded. Large deviations of the response from its mean, as measured by the Pearson residuals, or outlying points in the covariate space, can exert a significant influence on the estimates. While the robust procedures in Bianco et al. (1996), Künsch et al. (1989) and Stefanski et al. (1986) control outliers for the generalized linear model (\(\text{GLM}\)), these procedures are limited to finite- and low-dimensional problems. It remains unclear to what extent they are useful in large- and high-dimensional settings. Boente et al. (2006) and Cantoni and Ronchetti (2001) developed robust quasi-likelihood estimates of finite-dimensional parameters. However, the robust quasi-likelihood procedure is not available for other types of error measures, such as the hinge loss for the support vector machine (\({\text{SVM}}\)) (Vapnik, 1996) and the exponential loss for AdaBoost (Freund & Schapire, 1997), which are commonly used in classification procedures and machine learning practice, because these error measures do not fall into the (negative) quasi-likelihood category.

This paper aims to investigate robust statistical inference for regression estimation and classification procedures in large-dimensional (\(p_{_n}\approx n\)) and high-dimensional (\(p_{_n}\gg n\)) settings, where the distribution of the response variable given the covariates may be incompletely specified. This is not a simple endeavor, and we do not aim to solve all issues arising from the combination of robustness and large dimensionality. The paper identifies major challenges and presents new results below.

  • In Sect. 2, we construct a new class of robust error measures called “robust-\({\text{BD}}\)”. This is motivated by the Bregman divergence (\({\text{BD}}\)), which plays an important role in quantifying error measures for regression estimates and classification procedures. The quadratic loss function and the negative quasi-likelihood are two widely used error measures that, along with many others, belong to the family of \({\text{BD}}\). The newly proposed “robust-\({\text{BD}}\)” broadens the scope of penalized estimation methods and greatly facilitates a systematic investigation of their asymptotic behavior. The new method is applicable to all aforementioned error measures (e.g., the hinge loss and the exponential loss, which fail to be (negative) quasi-likelihoods but belong to \({\text{BD}}\)). The “robust-\({\text{BD}}\)” benefits from the flexibility and extensibility offered by \({\text{BD}}\). Nonetheless, unlike the “classical-\({\text{BD}}\)”, the “robust-\({\text{BD}}\)” entails a bias-correction procedure that complicates theoretical derivations as well as practical implementations; see concrete examples in Sect. 2.3. Moreover, when \(p_{_n}< n\), justifying the influence function of a “robust-\({\text{BD}}\)” estimate for a \(p_{_n}\)-dimensional parameter calls for re-examination and re-derivation beyond the framework of Hampel et al. (1986), which is confined to a fixed-dimensional parameter.

  • In Sects. 3 and 4, we study the consistency and oracle property of the proposed “penalized robust-\({\text{BD}}\) estimates” in the \(p_{_n}\approx n\) and \(p_{_n}\gg n\) settings, respectively. It is shown that the new estimate, combined with an appropriately weighted \(L_1\) penalty, achieves the same oracle property as the ordinary non-robust penalized least-squares and penalized-likelihood estimates, but is less sensitive to outliers, a very desirable property in many applications. It will be seen that the robust counterpart eliminates the finiteness assumption on some higher-order moments of the response variable Y that is typically imposed in the non-robust case. Nonetheless, dealing with large dimensionality in the robust case poses many more theoretical and computational challenges than in the non-robust case. For example, the oracle property in the case \(p_{_n}\approx n\) cannot be directly extended to the case \(p_{_n}\gg n\) without additional technical assumptions and more careful mathematical treatment. For practical implementation, two data-driven selection procedures for penalty weights, \({\text{PMR}}\) and \({\text{MR}}\), will be proposed and justified for both \(p_{_n}\approx n\) and \(p_{_n}\gg n\). Unlike the penalty-weight selection method in Zhang et al. (2010), which deals with \(p_{_n}\approx n\) and requires \({\text{E}}({\varvec{X}})={\textbf{0}}\) for the \(p_{_n}\)-dimensional predictor vector \({\varvec{X}}=(X_1,\ldots ,X_{p_{_n}})^T\), the methods and theory developed in the current work impose no such requirement and are thus more widely applicable. In the context of large-dimensional inference, we demonstrate that the Wald-type test statistic based on the “penalized robust-\({\text{BD}}\) estimates” is asymptotically distribution-free, whereas the likelihood ratio-type test statistic fails to be so.

  • In Sect. 5, we devise “penalized robust-\({\text{BD}}\) classifiers” based on the proposed “penalized robust-\({\text{BD}}\) estimates” in large- and high-dimensional binary classification. We demonstrate that if a parameter estimate possesses the sparsity property and is consistent at an appropriate rate, then the induced classifier attains classification consistency. Hence, even for data contaminated with outliers, the choice of loss function for the regression estimate invoked in the classifier has an asymptotically negligible impact on classification performance.

There is a diverse and extensive literature on robust procedures for model selection. For instance, Zou and Yuan (2008) proposed composite quantile regression and the corresponding oracle model selection theory for linear models. Theorem 3 of Zhang et al. (2009) shows that the quantile loss function does not belong to the class of \({\text{BD}}\). Therefore, the “robust-\({\text{BD}}\)” and the quantile loss, as two operationally different robust alternatives, work in non-overlapping frameworks with different motivations and demand theoretically distinct treatments. Dupuis and Victoria-Feser (2011) developed a fast algorithm for a robust forward selection procedure in linear regression, where \(p_{_n} <n\).

The rest of the paper is organized as follows. Section 6 presents simulation comparisons of the penalized “robust-\({\text{BD}}\)” estimates with the classical ones, including the classical-\({\text{SVM}}\) and robust-\({\text{SVM}}\) for Bernoulli responses, to assess their performance in statistical model fitting, variable selection, and classification. Section 7 analyzes a real dataset. Limitations and open questions are discussed in Sect. 8. Notations, technical and algorithmic details, figures and tables, and additional analysis are collected in Appendix 1 (in the supplementary materials).

2 Proposed robust penalized regression estimation

Let \(\{({\varvec{X}}_{1},Y_{1}),\ldots ,({\varvec{X}}_{n},Y_{n})\}\) be a sample of independent observations from some underlying population, \(({\varvec{X}},Y)\), where \({\varvec{X}}=(X_1,\ldots ,X_{p_{_n}})^T \in \mathbb {R}^{p_{_n}}\) is the input vector and Y is the output variable. We assume the parametric model for the conditional mean function,

$$\begin{aligned} m({\varvec{x}}) = {\text{E}}(Y\mid {\varvec{X}}= {\varvec{x}}) = F^{-1}(\beta _{0;0}+{\varvec{x}}^T{\varvec{\beta }}_{0}), \end{aligned}$$
(1)

together with the conditional variance function

$$\begin{aligned} {\text{var}}(Y \mid {\varvec{X}}={\varvec{x}}) = V(m({\varvec{x}})), \end{aligned}$$
(2)

where \(F(\cdot )\) is a known link function, \(F^{-1}\) denotes the inverse function of F, \(\beta _{0;0}\in \mathbb {R}^1\) and \({\varvec{\beta }}_{0}=(\beta _{1;0},\ldots ,\beta _{p_{_n};0})^T \in \mathbb {R}^{p_{_n}}\) are the unknown true intercept and regression parameters, and the functional form of \(V(\cdot )\) is known. It is worth noting that (1)–(2) include the \(\text{GLM}\) as a special case. Moreover, they allow the conditional distribution of Y given \({\varvec{X}}\) to be incompletely (or partially) specified.

Let \(\rho _{{q}}(y,\mu )\) be a robust loss function (to be proposed in Sect. 2.1) which aims to guard against outlying observations in the response space. We define the “penalized robust-\({\text{BD}}\) estimate” \(({\widehat{\beta }}_{0},{\widehat{{\varvec{\beta }}}})\) as the minimizer of the criterion function,

$$\begin{aligned} \ell _{n}(\beta _{0},{\varvec{\beta }})=\frac{1}{n}\sum _{i=1}^n \rho _{{q}}(Y_{i}, F^{-1}(\beta _{0}+{\varvec{X}}_{i}^T{\varvec{\beta }}))\, w({\varvec{X}}_{i})+\lambda _n\sum _{j=1}^{p_{_n}} w_{n,j} |\beta _{j} |, \end{aligned}$$
(3)

over \(\beta _{0}\in \mathbb {R}^1\) and \({\varvec{\beta }}=(\beta _{1},\ldots ,\beta _{p_{_n}})^T \in \mathbb {R}^{p_{_n}}\), where \(w(\cdot )\ge 0\) is a given bounded weight function that downweights high leverage design points in the \(p_{_n}\)-dimensional covariate space, \(\{w_{n,j}\}\) represent non-negative weights for the penalty terms, and \(\lambda _n\ge 0\) serves as a regularization parameter. In practice, the weight function \(w(\cdot )\) can be chosen in various ways, such as through prior knowledge or data-driven methods. An empirical choice of \(w(\cdot )\) is provided at the beginning of Sect. 6. Data-driven procedures for properly selected penalty weights \(\{w_{n,j}\}\) will be carefully developed in Sects. 3.5 and 4. Optimization solutions of (3) will be discussed in Sect. 6. Hereafter, we write \({\widetilde{{\varvec{X}}}} = (1, {\varvec{X}}^T)^T\) and \({\widetilde{{\varvec{\beta }}}} = (\beta _{0},{\varvec{\beta }}^T)^T\) to simplify notations.
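
To make the estimation criterion concrete, the following minimal sketch (in Python; the names `penalized_robust_bd_loss`, `rho_q`, `F_inv`, `w_x`, and `w_pen` are illustrative and not from the paper) evaluates the objective (3) at a candidate \((\beta _{0},{\varvec{\beta }})\). Any descent or coordinate-wise scheme can then be applied to this objective; the algorithms actually used are described in Appendix 1.3.

```python
import numpy as np

def penalized_robust_bd_loss(beta0, beta, X, Y, rho_q, F_inv, w_x, w_pen, lam):
    """Evaluate the penalized robust-BD criterion (3).

    rho_q(y, mu): robust-BD loss; F_inv: inverse link F^{-1};
    w_x: n-vector of covariate weights w(X_i);
    w_pen: p-vector of penalty weights w_{n,j}; lam: lambda_n.
    """
    eta = beta0 + X @ beta                        # linear predictors
    mu = F_inv(eta)                               # fitted means
    loss = np.mean(rho_q(Y, mu) * w_x)            # robust-BD term
    penalty = lam * np.sum(w_pen * np.abs(beta))  # weighted L1 penalty
    return loss + penalty
```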

2.1 Construction of robust loss functions \(\rho _{{q}}(\cdot ,\cdot )\)

We propose a class of robust loss functions \(\rho _{{q}}\), motivated by the Bregman divergence (\({\text{BD}}\)), a notion commonly used in machine learning applications. The original form of \({\text{BD}}\), a bivariate function introduced in Brègman (1967), is defined by

$$\begin{aligned} Q_{{q}}(\nu ,\mu ) =-q(\nu )+q(\mu )+(\nu -\mu )q'(\mu ), \end{aligned}$$
(4)

where \(q(\cdot )\) is a given concave differentiable function. For an extensive literature on \({\text{BD}}\), see Altun and Smola (2006), Efron (1986), Gneiting (2011), Grünwald and Dawid (2004), Lafferty et al. (1997), Lafferty (1999), Vemuri et al. (2011) and references therein. The \({\text{BD}}\) is suitable for a broad array of error measures \(Q_{{q}}\). For example, \(q(\mu ) = a \mu - \mu ^2\) for any constant a yields the quadratic loss \(Q_{{q}}(y, {\mu }) = (y - {\mu })^2\). For a binary response variable y, \(q(\mu ) = \min \{\mu , (1-\mu )\}\) gives the misclassification loss \(Q_{{q}}(y,\mu )={\text{I}}\{y\ne {\text{I}}(\mu >1/2)\}\), where \({\text{I}}(\cdot )\) denotes the indicator function; \(q(\mu ) = -2\{\mu \log (\mu )+(1-\mu )\log (1-\mu )\}\) gives the Bernoulli deviance-based loss \(Q_{{q}}(y,\mu ) = -2\{y\log (\mu )+(1-y)\log (1-\mu )\}\); \(q(\mu ) = 2 \min \{\mu , (1-\mu )\}\) results in the hinge loss; \(q(\mu ) = 2\{\mu (1-\mu )\}^{1/2}\) yields the exponential loss \(Q_{{q}}(y,\mu ) = \exp [-(y-.5)\log \{\mu /(1-\mu )\}]\). In the cases of \(p_{_n} \ll n\) and \(p_{_n}\approx n\), Zhang et al. (2010) developed the penalized “classical-\({\text{BD}}\)” estimation.
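
As a quick numerical check of (4) (a sketch, not part of the paper), the following snippet builds \(Q_{{q}}\) from a generating \(q\)-function and verifies the quadratic-loss and Bernoulli-deviance examples above.

```python
import numpy as np

def make_bd(q, dq):
    """Bregman divergence (4): Q_q(nu, mu) = -q(nu) + q(mu) + (nu - mu) q'(mu)."""
    return lambda nu, mu: -q(nu) + q(mu) + (nu - mu) * dq(mu)

# q(mu) = a*mu - mu^2 yields the quadratic loss (y - mu)^2, for any constant a.
a = 3.0
Q_quad = make_bd(lambda m: a * m - m**2, lambda m: a - 2 * m)
print(np.isclose(Q_quad(0.8, 0.3), (0.8 - 0.3)**2))   # True

def xlogx(x):
    """x*log(x) with the convention 0*log(0) = 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x * np.log(np.where(x > 0, x, 1.0)), 0.0)

# q(mu) = -2{mu log(mu) + (1-mu) log(1-mu)} yields the Bernoulli deviance loss.
Q_dev = make_bd(lambda m: -2 * (xlogx(m) + xlogx(1 - m)),
                lambda m: -2 * (np.log(m) - np.log(1 - m)))
y, mu = 1.0, 0.7
print(np.isclose(Q_dev(y, mu), -2 * np.log(mu)))       # True
```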

Despite a wide range of applications of \({\text{BD}}\) in many different domains, its original form, including the quadratic loss used in ordinary least-squares estimates for regression models, yields estimates that are not resistant to outliers. Robust loss functions for boosting were studied in Kanamori et al. (2007). To the best of our knowledge, there is very little work in the literature on systematically developing robust forms of \({\text{BD}}\) and the related inference in the presence of outliers. In the present work, we describe the construction of a “robust-\({\text{BD}}\)”. In accordance with the conditional variance function in (2), let \(r(y, \mu )={(y-\mu )}/{\sqrt{V(\mu )}}\) denote the Pearson residual, which reduces to the standardized residual for linear models. Following (4), the partial derivative \({\partial } Q_{{q}}(y,\mu )/{\partial \mu } = (y-\mu ) q''(\mu )\) can be rewritten as

$$\begin{aligned} \frac{\partial }{\partial \mu } Q_{{q}}(y,\mu ) = r(y,\mu ) \{q''(\mu ) \sqrt{V(\mu )} \}. \end{aligned}$$

To guard against outliers with large Pearson residuals, we replace \(r(y,\mu )\) by \(\psi (r(y,\mu ))\), where \(\psi (\cdot )\) is chosen to be a bounded, odd function. There is a wide class of functions \(\psi (\cdot )\) satisfying these requirements; feasible choices include the Huber \(\psi\)-function (Huber, 1964) defined by

$$\begin{aligned} \psi (r) = r\, {\text{I}}( |r|\le c) + c\, \,\text{sign}(r)\, {\text{I}}( |r|> c), \end{aligned}$$
(5)

and the Tukey biweight function formed by \(\psi (r) = r\{1-({r}/{c})^2\}^2\, {\text{I}}( |r|\le c),\) where c is a positive constant. The proposed robust version of \({\text{BD}}\), \(\rho _{{q}}\), is formed by

$$\begin{aligned} \rho _{{q}}(y, \mu ) = \int _{y}^{\mu } {\psi (r(y,s))} \{q''(s) \sqrt{V(s)} \} {\text{d}}s -G(\mu ), \end{aligned}$$
(6)

where the bias-correction term, \(G(\mu )\), serves to ensure the “conditional zero-mean property” (see part (b) of Sect. 2.2) of a non-penalized and low-dimensional parameter estimate (i.e., the minimizer of (3) with \(\lambda _n=0\) and \(p_{_n} < n\)) and satisfies

$$\begin{aligned} G'(\mu ) = g_1 (\mu ) \{q''(\mu ) \sqrt{V(\mu )}\}, \end{aligned}$$

with

$$\begin{aligned} g_1 (m({\varvec{x}}))={\text{E}}\{\psi (r(Y,m({\varvec{x}}))) \mid {\varvec{X}}={\varvec{x}}\}. \end{aligned}$$
(7)

See Sect. 2.3 for explicit expressions of \(G(\mu )\) for Bernoulli responses. We call \(\rho _{{q}}(\cdot ,\cdot )\) defined in (6) the “robust-\({\text{BD}}\)”, and call the resulting parameter estimate which minimizes (3) the “penalized robust-\({\text{BD}}\) estimate”. An illustrative plot of “robust-\({\text{BD}}\)”, as compared with the classical-\({\text{BD}}\), is displayed in Fig. 1.
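
To make the construction concrete, the following sketch (in Python, with illustrative helper names) codes the Huber and Tukey \(\psi\)-functions, computes \(g_1(\mu)\) in (7) exactly for a Bernoulli response, where the conditional expectation is a two-point sum, and evaluates \(\rho_{{q}}(y,\mu)\) in (6) for the deviance-based \({\text{BD}}\) by numerical quadrature. The bias correction \(G\) is integrated from \(\mu=1/2\), where it vanishes; the Tukey constant 4.685 is a conventional choice not specified in the paper.

```python
import numpy as np
from scipy.integrate import quad

def psi_huber(r, c=1.345):
    """Huber psi-function (5): identity inside [-c, c], clipped outside."""
    return np.clip(r, -c, c)

def psi_tukey(r, c=4.685):
    """Tukey biweight psi-function: r*(1 - (r/c)^2)^2 for |r| <= c, else 0."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r * (1.0 - (r / c) ** 2) ** 2, 0.0)

# Bernoulli response: V(mu) = mu(1 - mu); the deviance BD has q''(mu) = -2/V(mu),
# hence q''(mu) * sqrt(V(mu)) = -2 / sqrt(V(mu)).
V = lambda mu: mu * (1.0 - mu)
qpp_sqrtV = lambda mu: -2.0 / np.sqrt(V(mu))

def g1(mu, psi=psi_huber):
    """g_1(mu) in (7): E{psi(r(Y, mu))} for Y ~ Bernoulli(mu) (two-point sum)."""
    r1 = (1.0 - mu) / np.sqrt(V(mu))   # Pearson residual when Y = 1
    r0 = (0.0 - mu) / np.sqrt(V(mu))   # Pearson residual when Y = 0
    return psi(r1) * mu + psi(r0) * (1.0 - mu)

def rho_q(y, mu, psi=psi_huber):
    """Robust-BD (6) for the Bernoulli deviance loss, by numerical quadrature."""
    first, _ = quad(lambda s: psi((y - s) / np.sqrt(V(s))) * qpp_sqrtV(s), y, mu)
    G, _ = quad(lambda s: g1(s, psi) * qpp_sqrtV(s), 0.5, mu)   # bias correction
    return first - G

# Example: rho_q(1.0, mu) stays bounded as mu -> 0, whereas the classical
# deviance -2*log(mu) diverges.
```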

As a specific example of the class \(\rho _{{q}}\) of “robust-\({\text{BD}}\)” in (6), the robust (negative) quasi-likelihood in Boente et al. (2006) and Cantoni and Ronchetti (2001) can be recovered by setting the generating q-function of \({\text{BD}}\) to be \(q(\mu ) =\int _{a}^\mu {(s-\mu )}/{V(s)} {\text{d}}s\), where a is a finite constant such that the integral is well-defined. More generally, the availability of the necessary and sufficient conditions as given in Theorem 3 of Zhang et al. (2009) for an error measure to be a \({\text{BD}}\) enables the construction of the corresponding “robust-\({\text{BD}}\)” from expression (6).

2.2 General properties of “robust-\({\text{BD}}\)” \(\rho _{{q}}\)

We make the following comments regarding features of the “robust-\({\text{BD}}\)”. To facilitate the discussion, we first introduce some necessary notation. Assume that the quantities

$$\begin{aligned} \text{p}_{_j}(y;\theta )=\frac{\partial ^j}{\partial \theta ^j} \rho _{{q}}(y, F^{-1}(\theta )), \quad j=0,1,\ldots , \end{aligned}$$
(8)

exist finitely up to any order required. Then we have the following expressions,

$$\begin{aligned} \text{p}_{_1}(y;\theta ) & = \{\psi (r(y,\mu )) - g_1 (\mu )\} \{q''(\mu )\sqrt{V(\mu )}\} / F'(\mu ), \nonumber \\ \text{p}_{_2}(y;\theta ) & = A_0(y,\mu ) + \{\psi (r(y,\mu ))- g_1 (\mu )\} A_1(\mu ), \nonumber \\ \text{p}_{_3}(y;\theta ) & = A_2(y,\mu )+\{\psi (r(y,\mu ))- g_1 (\mu )\} A_1'(\mu )/F'(\mu ), \end{aligned}$$
(9)

where \(\mu =F^{-1}(\theta )\),

$$\begin{aligned} A_0(y,\mu ) = -\Big [\psi '(r(y,\mu )) \Big \{1+\frac{y-\mu }{\sqrt{V(\mu )}}\times \frac{V'(\mu )}{2 \sqrt{V(\mu )}}\Big \} + g_1' (\mu )\sqrt{V(\mu )} \Big ] \frac{q''(\mu )}{\{F'(\mu )\}^2}, \end{aligned}$$

\(A_1(\mu )=[\{q^{(3)}(\mu )\sqrt{V(\mu )}+2^{-1}q''(\mu ){V'(\mu )}/{\sqrt{V(\mu )}}\}F'(\mu )\) \(-q''(\mu )\sqrt{V(\mu )}F''(\mu )] / {\{F'(\mu )\}^3}\), and \(A_2(y, \mu )=[{\partial A_0(y,\mu )}/{\partial \mu } + {\partial \{\psi (r(y,\mu ))- g_1 (\mu )\}}/{\partial \mu }\, A_1(\mu )]/F'(\mu )\). Particularly, \(\text{p}_{_1}(y;\theta )\) contains \(\psi (r)\); \(\text{p}_{_2}(y;\theta )\) contains \(\psi (r)\), \(\psi '(r)\), and \(\psi '(r)r\); \(\text{p}_{_3}(y;\theta )\) contains \(\psi (r)\), \(\psi '(r)\), \(\psi '(r)r\), \(\psi ''(r)\), \(\psi ''(r)r\), and \(\psi ''(r)r^2\), where \(r=r(y, \mu )={(y-\mu )}/{\sqrt{V(\mu )}}\) denotes the Pearson residual. Accordingly, \(\{\text{p}_{_j}(y; \theta ): j=1,2,3\}\) depend on y through \(\psi (r)\) and its derivatives coupled with r.
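
For instance, \(\text{p}_{_1}\) can be transcribed directly from its expression in (9); the short sketch below (illustrative names, with the ingredient functions \(\psi\), \(g_1\), \(F^{-1}\), \(F'\), \(q''\), and \(V\) supplied by the user) makes explicit that the dependence on y enters only through the bounded \(\psi(r)\).

```python
import numpy as np

def p1(y, theta, psi, g1, F_inv, dF, qpp, V):
    """p_1(y; theta) = {psi(r) - g_1(mu)} {q''(mu) sqrt(V(mu))} / F'(mu),
    with mu = F^{-1}(theta) and r = (y - mu)/sqrt(V(mu)) the Pearson residual."""
    mu = F_inv(theta)
    r = (y - mu) / np.sqrt(V(mu))
    return (psi(r) - g1(mu)) * qpp(mu) * np.sqrt(V(mu)) / dF(mu)
```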

(a) Relation between the “robust-\({\text{BD}}\)” \(\rho _{{q}}\) and the “classical-\({\text{BD}}\)” \(Q_{{q}}\). For the particular choice of \(\psi (r)=r\), it follows from (7) that \(g_1 (\cdot )=0\) and thus \(G'(\cdot )=0\). In such a case, the proposed robust \(\rho _{{q}}(y, \mu )\) in (6) reduces to the conventional form, \(Q_{{q}}(y, \mu )\), of \({\text{BD}}\); similarly, \(\text{p}_{_j}(y;\theta )\) reduces to \(\text{q}_{_j}(y;\theta )\), where

$$\begin{aligned} \text{q}_{_j}(y;\theta )=\frac{\partial ^j}{\partial \theta ^j} Q_{{q}}(y, F^{-1}(\theta )), \quad j=0,1,\ldots . \end{aligned}$$
(10)

Accordingly, \(\text{q}_{_j}(y; \theta )\) is linear in y for fixed \(\theta\). As a comparison,

$$\begin{aligned} \text{q}_{_1}(y;\theta ) & = (y-\mu ) q^{(2)}(\mu ) / {F^{(1)}(\mu )}, \\ \text{q}_{_2}(y;\theta ) & = -{q^{(2)}(\mu )}/{\{F^{(1)}(\mu )\}^2}+(y-\mu )A_{1,q}(\mu ), \\ \text{q}_{_3}(y;\theta ) &\equiv A_2(\mu )+(y-\mu )A_3(\mu ), \end{aligned}$$

where \(A_{1,q}(\mu )={\{q^{(3)}(\mu ) F^{(1)}(\mu )-q^{(2)}(\mu ) F^{(2)}(\mu )\}}/\{F^{(1)}(\mu )\}^3\), \(A_2(\mu )=\{-2q^{(3)}(\mu ) F^{(1)}(\mu ) + 3q^{(2)}(\mu ) F^{(2)}(\mu )\}/\{F^{(1)}(\mu )\}^4\), and \(A_3(\mu )=[q^{(4)}(\mu ) \{F^{(1)}(\mu )\}^2-3q^{(3)}(\mu ) F^{(1)}(\mu ) F^{(2)}(\mu )-q^{(2)}(\mu ) F^{(1)}(\mu ) F^{(3)}(\mu ) +3q^{(2)}(\mu )\{F^{(2)}(\mu )\}^2]/\{F^{(1)}(\mu )\}^5\). In addition, assuming that

$$\begin{aligned} \text{q}_{_2}(y;\theta ) > 0\text { for all }\theta \in \mathbb {R}\text { and all } y\text { in the range of }Y, \end{aligned}$$
(11)

we see that \(Q_{{q}}(Y, F^{-1}({\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}))\) is strictly convex in \({\widetilde{{\varvec{\beta }}}}\).

(b) “Conditional zero-mean property”. For the proposed class of “robust-\({\text{BD}}\)” \(\rho _{{q}}\) induced by the classical-\({\text{BD}}\) \(Q_{{q}}\), it follows from the expression of \(\text{p}_{_1}(y;\theta )\) in (9) that \({\text{E}}\{\text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0) \mid {\varvec{X}}\} = 0\).

(c) Bounded influence function. When \(\lambda _n=0\) in (3) and \(p_{_n}<n\), the “robust-\({\text{BD}}\) estimate” defined by minimizing (3) is characterized by the score function and influence function below,

$$\begin{aligned} \varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}}) & = \text{p}_{_1}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0)\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}, \end{aligned}$$
(12)
$$\begin{aligned} {\text{IF}}(Y, {\varvec{X}}; \varvec{\psi }_{\rho _{{q}}}) & = \{M(\varvec{\psi }_{\rho _{{q}}})\}^{-1} \varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}}), \end{aligned}$$
(13)

where \(M(\varvec{\psi }_{\rho _{{q}}}) =-{\text{E}}\{{\partial }\varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}})/{\partial {\widetilde{{\varvec{\beta }}}}_0}\} =-{\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T{\widetilde{{\varvec{\beta }}}}_0)\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}} {\widetilde{{\varvec{X}}}}^T\}\); note that justifying the influence function of a “robust-\({\text{BD}}\)” estimate for a \(p_{_n}\)-dimensional parameter calls for re-examination and re-derivation beyond the framework of Hampel et al. (1986) for a fixed-dimensional parameter. Thus, the boundedness of \(\psi (\cdot )\) and the weight function \(w(\cdot )\) ensure a bounded influence function. In contrast, for non-robust counterparts, with \(\psi (r)=r\) and \(w(\cdot )\equiv 1\), the influence function is unbounded. As to the score function, the “conditional zero-mean property” in part (b) ensures that under the parametric model (1), \({\text{E}}\{\varvec{\psi }_{\rho _{{q}}}(Y, {\varvec{X}})\}={\textbf{0}}\) holds for the proposed class of “robust-\({\text{BD}}\)\(\rho _{{q}}\) induced by the classical-\({\text{BD}}\) \(Q_{{q}}\).

(d) Conditions under which \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\ge 0\). This is a very minimal condition relevant to discussing Theorems 1 and 2, assumption (19) and numerical minimization of (3). First, as observed from (9), the sign of \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\) is invariant with the choice of generating q-functions of \({\text{BD}}\). Second, one sufficient condition for \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} \ge 0\) is that the conditional distribution of Y given \({\varvec{X}}\) is symmetric about \(m({\varvec{X}})\). Third, another sufficient condition for \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} \ge 0\) is that \({\text{E}}[ \psi (r(Y, m({\varvec{X}}))) \frac{\partial }{\partial m({\varvec{X}})} \log \{f(Y\mid {\varvec{X}}, m({\varvec{X}}))\} \mid {\varvec{X}}] \ge 0\), which holds for \(\psi (r)r\ge 0\) (applicable to Huber and Tukey \(\psi\)-functions), and the conditional distribution of Y given \({\varvec{X}}\) belongs to the exponential family, where f denotes the conditional density or probability of Y given \({\varvec{X}}\). Fourth, in the particular choice of \(\psi (r)=r\), which is unbounded, a direct computation gives that \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\} = -q''(m({\varvec{X}}))/\{F'(m({\varvec{X}}))\}^2 \ge 0\), for any conditional distribution of Y given \({\varvec{X}}\).

2.3 Difference between “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)”

To better distinguish between the “robust-\({\text{BD}}\)” and the “classical-\({\text{BD}}\)”, we derive below their closed-form expressions for Bernoulli responses, using the canonical link \(\theta =\log \{\mu /(1-\mu )\}\), the Huber \(\psi\)-function, and either the deviance loss or the exponential loss as the \({\text{BD}}\). In that case, assume \(c\ge 1\) in the Huber \(\psi\)-function (5), and define two constants \(C_1={1}/{(1+c^2)}\) and \(C_2 =1-C_1\). Results in the case of \(0<c<1\) can be obtained similarly.

For the deviance loss employed as the \({\text{BD}}\), the “robust-\({\text{BD}}\)” in (6) takes the form,

$$\begin{aligned} \rho _{{q}}(y,\mu ) = p^*(y,\mu ) - G(\mu ), \end{aligned}$$

where

$$\begin{aligned} \begin{array}{ll} p^*(y,\mu ) &= {\left\{ \begin{array}{ll} -2\log ({{{1-\mu }}}) (1-y)-[4c \{\sin ^{-1}(\sqrt{\mu }) - \sin ^{-1}(\sqrt{C_1})\} & \\ +2\log (C_1)] y, & \text{if}\,0\le \mu \le C_1, \\ -2\log (1-\mu ) (1-y)-2\log (\mu ) y, & \text{if}\,C_1< \mu< C_2, \\ -2\log ({{{\mu }}}) y +[4c \{\sin ^{-1}(\sqrt{\mu }) - \sin ^{-1}(\sqrt{C_2})\} & \\ -2\log (C_1)](1-y), & \text{if}\,C_2 \le \mu \le 1, \end{array}\right. } \\ G(\mu ) &= {\left\{ \begin{array}{ll} -2(1-\mu )-2c\{\sin ^{-1}(\sqrt{\mu })-\sin ^{-1}(\sqrt{C_1})-\sqrt{V(\mu )}\}, & \text{if}\,0\le \mu \le C_1, \\ 0, & \text{if}\,C_1< \mu < C_2, \\ -2\mu +2c\{\sin ^{-1}(\sqrt{\mu })-\sin ^{-1}(\sqrt{C_2})+\sqrt{V(\mu )}\}, & \text{if}\,C_2 \le \mu \le 1. \end{array}\right. } \end{array} \end{aligned}$$

The two related derivative quantities are

$$\begin{aligned} \begin{array}{l} \text{p}_{_1}(y;\theta ) = {\left\{ \begin{array}{ll} -2 (y-\mu )\{\mu +c\sqrt{V(\mu )}\}, & \text{if}\,0\le \mu \le C_1, \\ -2 (y-\mu ), & \text{if}\,C_1< \mu< C_2, \\ -2 (y-\mu )\{(1-\mu )+c\sqrt{V(\mu )}\}, & \text{if}\,C_2 \le \mu \le 1, \end{array}\right. } \\ \text{p}_{_2}(y;\theta ) = {\left\{ \begin{array}{ll} 2\sqrt{V(\mu )}\{(2\mu -y) \sqrt{V(\mu )}+{c}/{2}{(\mu -y)(1-2\mu )}+c V(\mu )\}, & \text{if}\,0 \le \mu< C_1, \\ 2V(\mu ), & \text{if}\,C_1< \mu< C_2, \\ 2\sqrt{V(\mu )}\{(1-2\mu +y)\sqrt{V(\mu )}+{c}/{2}{(\mu -y)(1-2\mu )}+c V(\mu )\}, & \text{if}\,C_2 < \mu \le 1. \end{array}\right. } \end{array} \end{aligned}$$

The “classical-\({\text{BD}}\)” is \(Q_{{q}}(y,\mu ) = -2\log (1-\mu ) (1-y)-2\log (\mu ) y,\) and the two related quantities are \(\text{q}_{_1}(y;\theta ) = -2 (y-\mu )\) and \(\text{q}_{_2}(y;\theta ) = 2V(\mu )\).
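
The piecewise formulas above translate directly into code. A sketch (scalar inputs; Huber \(\psi\) with \(c\ge 1\); the function name is illustrative) of the deviance-based robust-\({\text{BD}}\), \(\rho _{{q}}(y,\mu ) = p^*(y,\mu ) - G(\mu )\), is given below.

```python
import numpy as np

def robust_dev_bd(y, mu, c=1.345):
    """Closed-form robust-BD rho_q(y, mu) = p*(y, mu) - G(mu) for the Bernoulli
    deviance loss, canonical link, and Huber psi-function with c >= 1."""
    C1 = 1.0 / (1.0 + c**2)
    C2 = 1.0 - C1
    V = mu * (1.0 - mu)
    asin = np.arcsin
    if mu <= C1:
        p_star = (-2 * np.log(1 - mu) * (1 - y)
                  - (4 * c * (asin(np.sqrt(mu)) - asin(np.sqrt(C1)))
                     + 2 * np.log(C1)) * y)
        G = -2 * (1 - mu) - 2 * c * (asin(np.sqrt(mu)) - asin(np.sqrt(C1))
                                     - np.sqrt(V))
    elif mu < C2:
        p_star = -2 * np.log(1 - mu) * (1 - y) - 2 * np.log(mu) * y
        G = 0.0
    else:
        p_star = (-2 * np.log(mu) * y
                  + (4 * c * (asin(np.sqrt(mu)) - asin(np.sqrt(C2)))
                     - 2 * np.log(C1)) * (1 - y))
        G = -2 * mu + 2 * c * (asin(np.sqrt(mu)) - asin(np.sqrt(C2))
                               + np.sqrt(V))
    return p_star - G
```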

Analogously, for the exponential loss used as the \({\text{BD}}\), the “robust-\({\text{BD}}\)” counterpart is represented by \(\rho _{{q}}(y,\mu ) = p^*(y,\mu ) - G(\mu ),\) where

$$\begin{aligned} \begin{array}{ll} p^*(y,\mu ) &= {\left\{ \begin{array}{ll} \sqrt{{\mu }/{(1-\mu )}} (1-y)-{c}/{2} \{\log ({\mu }/{(1-\mu )}) +2\log (c)-2\}y, & \text{if}\,0\le \mu \le C_1, \\ \{\mu (1-y) + (1-\mu ) y\}/{\sqrt{V(\mu )}}, & \text{if}\,C_1< \mu< C_2, \\ \sqrt{{(1-\mu )}/{\mu }}\, y +{c}/{2} \{\log ({\mu }/{(1-\mu )}) -2\log (c)+2\} (1-y), & \text{if}\,C_2 \le \mu \le 1, \end{array}\right. } \\ G(\mu ) &= {\left\{ \begin{array}{ll} \{\sin ^{-1}(\sqrt{\mu })-\sin ^{-1}(\sqrt{C_1})\}+{c}/{2}\{\log (1-\mu )-\log (C_2)\}, & \text{if}\,0\le \mu \le C_1, \\ 0, & \text{if}\,C_1< \mu < C_2, \\ -\{\sin ^{-1}(\sqrt{\mu })-\sin ^{-1}(\sqrt{C_2})\}+{c}/{2}\{\log (\mu )-\log (C_2)\}, & \text{if}\,C_2 \le \mu \le 1. \end{array}\right. } \end{array} \end{aligned}$$

The two related derivative quantities are

$$\begin{aligned} \begin{array}{l} \text{p}_{_1}(y;\theta ) = {\left\{ \begin{array}{ll} - (y-\mu )\{\sqrt{{\mu }/{(1-\mu )}} + c\}/2, & \text{if}\,0\le \mu \le C_1, \\ - (y-\mu )/\{2\sqrt{V(\mu )}\}, & \text{if}\,C_1< \mu< C_2, \\ - (y-\mu )\{\sqrt{{(1-\mu )}/{\mu }} + c\}/2, & \text{if}\,C_2 \le \mu \le 1, \end{array}\right. } \\ \text{p}_{_2}(y;\theta ) = {\left\{ \begin{array}{ll} \sqrt{{\mu }/{(1-\mu )}} (1-y)/4-{(1-2\mu )\sqrt{V(\mu )}}/{4} +{c}/{2} V(\mu ), & \text{if}\,0 \le \mu< C_1, \\ \{\mu (1-y)+(1-\mu )y\}/\{4\sqrt{V(\mu )}\}, & \text{if}\,C_1< \mu< C_2, \\ \sqrt{{(1-\mu )}/{\mu }}\, y /4+{(1-2\mu )\sqrt{V(\mu )}}/{4} +{c}/{2} V(\mu ), & \text{if}\,C_2 < \mu \le 1. \end{array}\right. } \end{array} \end{aligned}$$

The “classical-\({\text{BD}}\)” is \(Q_{{q}}(y,\mu ) = \{\mu (1-y) + (1-\mu ) y\}/{\sqrt{V(\mu )}},\) and the two related quantities are \(\text{q}_{_1}(y;\theta ) = -(y-\mu )/\{2\sqrt{V(\mu )}\}\) and \(\text{q}_{_2}(y;\theta ) = \{\mu (1-y)+(1-\mu )y\}/\{4\sqrt{V(\mu )}\}\).

In summary, while both types of “classical-\({\text{BD}}\)” are unbounded, the corresponding versions of “robust-\({\text{BD}}\)” are bounded. See the left panels of Fig. 1 with the response \(y=1\), where the “classical-\({\text{BD}}\)” goes to infinity as \(\mu\) approaches zero. (The case of \(y=0\) is similar.) From the middle and right panels, \(\text{p}_{_1}(y;\theta )\) and \(\text{p}_{_2}(y;\theta )\) associated with the “robust-\({\text{BD}}\)” are always bounded, whereas, for the exponential loss, the non-robust counterparts \(\text{q}_{_1}(y;\theta )\) and \(\text{q}_{_2}(y;\theta )\) associated with the “classical-\({\text{BD}}\)” are unbounded. Moreover, we observe from each panel that the “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)” differ at the lower and upper tails of \(\mu\), but coincide at intermediate values of \(\mu\).

3 Robust estimation with large-dimensions: \(p_{_n}\approx n\)

This section investigates the statistical properties of the “penalized robust-\({\text{BD}}\) estimate” defined by minimizing (3) in sparse large-dimensional parametric models with \(p_{_n} \approx n\). Throughout the paper, it is assumed that some entries in \({\varvec{\beta }}_{0}\) are exactly zero. Without loss of generality, write \({\varvec{X}}= ({\varvec{X}}^{({\text{I}})T}, {\varvec{X}}^{({\text{II}})T})^T\) and \({\varvec{\beta }}_{0} = ({\varvec{\beta }}_{0}^{({\text{I}})T}, {\varvec{\beta }}_{0}^{({\text{II}})T})^T\), where the \({\varvec{\beta }}_{0}^{({\text{I}})}\) part collects all non-zero coefficients, and \({\varvec{\beta }}_{0}^{({\text{II}})} = {\textbf{0}}\). Let \(s_n\) denote the number of non-zero coordinates of \({\varvec{\beta }}_{0}\), and set \({\widetilde{{\varvec{\beta }}}}_{0} = (\beta _{0;0},{\varvec{\beta }}_{0}^T)^T\). Correspondingly, write \({\widetilde{{\varvec{X}}}}^{({\text{I}})} = (1,{\varvec{X}}^{({\text{I}})T})^T\) and \({\widetilde{{\varvec{\beta }}}}^{({\text{I}})} = (\beta _{0},{\varvec{\beta }}^{({\text{I}})T})^T\).

It will be demonstrated that the impact of penalty weights \(\{w_{n,j}\}\) in (3) on the penalized “robust-\({\text{BD}}\)” estimate is primarily captured by two quantities, defined by

$$\begin{aligned} w_{\max }^{({\text{I}})} = \max _{1\le j\le s_n} w_{n,j}, \quad w_{\min }^{({\text{II}})} = \min _{s_n+1\le j\le p_{_n}} w_{n,j}. \end{aligned}$$

3.1 Consistency

Theorem 1 guarantees the existence of a \(\sqrt{n/s_n}\)-consistent local minimizer of (3). In particular, Theorem 1 allows the dimension to diverge with n at the rate \(p_{_n} = o\{n^{(3+\delta )/(4+\delta )}\}\) for any \(\delta >0\), as long as the number of truly non-zero parameters satisfies \(s_n = O\{n^{1/(4+\delta )}\}\).

Theorem 1

(existence and consistency: \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1, \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and there exists a constant \(M\in (0,\infty )\) such that \(\lim _{n\rightarrow \infty } {\text{P}}(w_{\min }^{({\text{II}})} \lambda _n > M)=1\). If \(s_n^4/n \rightarrow 0\) and \(s_n (p_{_n}-s_n) = o(n)\), then there exists a local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{\widetilde{{\varvec{\beta }}}}_{0}\Vert _2 = O_{{\text{P}}}(\sqrt{s_n/n})\), where \(\Vert \cdot \Vert _2\) denotes the Euclidean norm.

To clarify the distinction between Theorem 5 of Zhang et al. (2010) and Theorem 1 of this paper, we make the following comparison. (i) Theorem 5 of Zhang et al. (2010) uses the classical-\({\text{BD}}\), corresponding to \(\psi (r)=r\) in (6), and assumes the finiteness of some moments of Y. (ii) Condition \({\text{A}}5\) of our Theorem 1 uses the robust-\({\text{BD}}\) and assumes the boundedness of \(\psi (r)\), thus excluding \(\psi (r)=r\), but avoids the moment assumption. Thus, our Theorem 1 is not applicable to the case of \(\psi (r)=r\) in (6) associated with a classical-\({\text{BD}}\) of Zhang et al. (2010).

3.2 Oracle property

To investigate the asymptotic distribution of the penalized robust-\({\text{BD}}\) estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\), we define three square matrices of size \((s_n+1)\) by \(\textbf{W}_n^{({\text{I}})} = \text{diag}(0, w_{n,1},\ldots ,w_{n,s_n})\) and

$$\begin{aligned} \Omega _n^{({\text{I}})} & = {\text{E}}\{\text{p}_{_1}^2(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w^2({\varvec{X}}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}, \\ \textbf{H}_n^{({\text{I}})} & = {\text{E}}\{\text{p}_{_2} (Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})})\, w({\varvec{X}}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}. \end{aligned}$$

Both \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) depend on the choice of \({\text{BD}}\), weight function \(w(\cdot )\), and the \(\psi\)-function \(\psi (\cdot )\).

Following Theorem 1, Theorem 2 obtains the oracle property of the \(\sqrt{n/s_n}\)-consistent local minimizer. Namely, if the “robust-\({\text{BD}}\)” is used as the loss function for parameter estimation, then the penalized robust-\({\text{BD}}\) estimates of the zero parameters take exactly zero values with probability tending to one, and the penalized robust-\({\text{BD}}\) estimates of the non-zero parameters are asymptotically Gaussian with the same means and variances as if the zero coefficients were known in advance.

Theorem 2

(oracle property: \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1.

\(\mathrm {(i)}\):

If \(s_n^2/n=O(1)\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/\sqrt{s_n p_{_n}} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as \(n\rightarrow \infty\), then any \(\sqrt{n/s_n}\)-consistent local minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) of (3) satisfies \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\).

\(\mathrm {(ii)}\):

Moreover, if \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\), \(s_n^5/n \rightarrow 0\) , and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\), then for any fixed integer \({\textsf {k}}\) and any \({\textsf {k}}\times (s_n+1)\) matrix \(A_n\) such that \(A_nA_n^T \rightarrow \mathbb {G}\) for a \({\textsf {k}}\times {\textsf {k}}\) nonnegative-definite symmetric matrix \(\mathbb {G}\), we have \(\sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} [\textbf{H}_n^{({\text{I}})}\{\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\} + \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}] {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \mathbb {G})\), where \(\,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}=(\,\text{sign}(\beta _{0}),\,\text{sign}(\beta _{1}),\ldots ,\,\text{sign}(\beta _{s_n}))^T\).

Remark 1

In Theorem 2, Condition \({\text{B}}5\) extends the positive-definiteness assumption of \({\text{E}}\{ {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}\) in the non-robust and fixed-dimensional case to the robust and large-dimensional case. The assumption \(\min _{1\le j\le s_n} |\beta _{j;0}|/ \sqrt{s_n/n} \rightarrow \infty\) is relevant to the magnitude of coefficients for significant variables which can be selected, and is fulfilled when \(\min _{1\le j\le s_n} |\beta _{j;0}|\ge C n^{-4/5}\) for a constant \(C>0\).

3.3 Comparison with the penalized “classical-\({\text{BD}}\)” estimate

Comparisons are made between the penalized “robust-\({\text{BD}}\)” and “classical-\({\text{BD}}\)” estimates. (I) The penalized “classical-\({\text{BD}}\)” estimate in Zhang et al. (2010) requires \({\text{E}}(Y^2)<\infty\) for the consistency and requires finiteness of some higher-order moments of Y for the oracle property. These requirements are avoided in the “robust-\({\text{BD}}\)” counterpart. (II) The two types of penalized estimates appear to share similar forms of the asymptotic distribution, except that matrices \(\Omega _n^{({\text{I}})}\) and \(\textbf{H}_n^{({\text{I}})}\) for the “classical-\({\text{BD}}\)” estimate are given by

$$\begin{aligned} {\text{E}}\{\text{q}_{_1}^2(Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}, \qquad {\text{E}}\{\text{q}_{_2} (Y; {\widetilde{{\varvec{X}}}}^{({\text{I}})T} {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) {\widetilde{{\varvec{X}}}}^{({\text{I}})}{\widetilde{{\varvec{X}}}}^{({\text{I}})T}\}, \end{aligned}$$

respectively. Hence, the differences are captured by the distinction between the robust versions, \(\{\text{p}_{_j}(y; \theta )\}_{j=1}^{2}\) (defined in (8)) and weight function \(w(\cdot )\), used in the penalized “robust-\({\text{BD}}\)” estimate and the non-robust counterparts, \(\{\text{q}_{_j}(y; \theta )\}_{j=1}^{2}\) (defined in (10)) and \(w(\cdot ) \equiv 1\), used in the penalized “classical-\({\text{BD}}\)” estimate.

As observed from (12)–(13), a bounded function \(\text{p}_{_1}(y; \theta )\) is induced by a bounded function \(\psi (r)\) to control deviations in the Y-space, and leverage points are down-weighted by the weight function \(w({\varvec{X}})\). In contrast, \(\text{q}_{_1}(y; \theta )\) is not guaranteed to be bounded. It is then clear that for penalized “robust-\({\text{BD}}\)” estimates of non-zero parameters, the choice of a bounded score function ensures robustness by placing a bound on the influence function. This property is not shared by the penalized “classical-\({\text{BD}}\)” counterparts.

3.4 Hypothesis testing

We consider the hypothesis testing about \({\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\) formulated as

$$\begin{aligned} H_0: A_n {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})} = {\textbf{0}} \quad \text{versus}\quad H_1: A_n {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})} \ne {\textbf{0}}, \end{aligned}$$
(14)

where \(A_n\) is a given \({\textsf {k}}\times (s_n+1)\) matrix such that \(A_nA_n^T\) equals a \({\textsf {k}}\times {\textsf {k}}\) positive-definite matrix \(\mathbb {G}\). This form of linear hypotheses allows one to test simultaneously whether a subset of the variables is statistically significant, by taking a specific form of the matrix \(A_n\); for instance, \(A_n=[\textbf{I}_{{\textsf {k}}},{\textbf{0}}_{{\textsf {k}},s_n+1-{\textsf {k}}}]\) yields \(A_nA_n^T =\textbf{I}_{{\textsf {k}}}\), the \({\textsf {k}}\times {\textsf {k}}\) identity matrix.

In the context of non-robust penalized-likelihood estimation, Fan and Peng (2004) showed that the likelihood ratio-type test statistic asymptotically follows a chi-squared distribution under the null. It is thus natural to explore the extent to which the likelihood ratio-type test can feasibly be extended to the “robust-\({\text{BD}}\)”. Our derivations (with details omitted) indicate that the resulting asymptotic null distribution is generally not chi-squared but a sum of weighted chi-squared variables, with weights involving unknown quantities; hence it is not distribution-free, and the result holds only under restrictive conditions.

To ameliorate this undesirable property, we propose a robust generalized Wald-type test statistic of the form,

$$\begin{aligned} W_n = n \{A_n\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\}^T \{A_n ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}{\widehat{\Omega }}_n^{({\text{I}})} ({\widehat{\textbf{H}}}_n^{({\text{I}})})^{-1}A_n^T\}^{-1} \{A_n\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\}, \end{aligned}$$

where \({\widehat{\Omega }}_n^{({\text{I}})} = n^{-1} \sum _{i=1}^n \text{p}_{_1}^2(Y_i; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})})\, w^2({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\) and

\({\widehat{\textbf{H}}}_n^{({\text{I}})} = n^{-1} \sum _{i=1}^n \text{p}_{_2}(Y_i; {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T} \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})})\, w({\varvec{X}}_{i}) {\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})}{\widetilde{{\varvec{X}}}}_{i}^{({\text{I}})T}\). This test is asymptotically distribution-free, as Theorem 3 justifies that under the null, \(W_n\) is asymptotically chi-squared with \({\textsf {k}}\) degrees of freedom.

Theorem 3

(Wald-type test under \(H_0\) based on robust-\({\text{BD}}\): \(p_{_n}\approx n\)) Assume Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{C}}4\), \({\text{A}}5\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1, and \(w_{\max }^{({\text{I}})}=o_{_{{\text{P}}}}\{{1}/(\lambda _n \sqrt{ns_n})\}\). If \(s_n^5/n \rightarrow 0\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\) as \(n\rightarrow \infty\), then \(W_n {\mathop {\longrightarrow }\limits ^{\mathcal L}}\chi ^2_{{\textsf {k}}}\) under the null \(H_0\) in (14).

Since the influence function of \(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})}\) is bounded, Proposition 2 of Cantoni and Ronchetti (2001) can be modified to show that the asymptotic level of the robust test statistic \(W_n\) under a sequence of \({\epsilon }\)-contaminations is bounded. Similarly, the asymptotic power is stable under contamination. Details are omitted for lack of space.
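
For completeness, the following sketch (illustrative names; the ingredients \(\text{p}_{_1}\), \(\text{p}_{_2}\), and the weights \(w({\varvec{X}}_i)\) are supplied by the user) assembles \(W_n\) from the plug-in matrices \({\widehat{\Omega }}_n^{({\text{I}})}\) and \({\widehat{\textbf{H}}}_n^{({\text{I}})}\) and refers it to the \(\chi ^2_{{\textsf {k}}}\) distribution.

```python
import numpy as np
from scipy.stats import chi2

def wald_statistic(A, beta_hat_I, Xt_I, Y, p1, p2, w_x):
    """Robust generalized Wald-type statistic W_n of Sect. 3.4.

    A: k x (s_n+1) hypothesis matrix; beta_hat_I: fitted (s_n+1)-vector;
    Xt_I: n x (s_n+1) matrix with rows (1, X_i^{(I)T}); Y: responses;
    p1(y, theta), p2(y, theta): vectorized robust-BD derivatives;
    w_x: n-vector of covariate weights w(X_i).
    """
    n = Xt_I.shape[0]
    theta = Xt_I @ beta_hat_I
    Omega = (Xt_I * (p1(Y, theta)**2 * w_x**2)[:, None]).T @ Xt_I / n
    H = (Xt_I * (p2(Y, theta) * w_x)[:, None]).T @ Xt_I / n
    H_inv = np.linalg.inv(H)
    middle = A @ H_inv @ Omega @ H_inv @ A.T
    v = A @ beta_hat_I
    W = n * v @ np.linalg.solve(middle, v)
    p_value = chi2.sf(W, df=A.shape[0])   # asymptotic chi^2_k reference
    return W, p_value
```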

3.5 Proposed selection for penalty weights

In practice, the weights \(\{w_{n,j}\}\) in the penalty part of (3) need to be selected and their validity also needs to be justified. To accommodate the general error measures, we propose two procedures. The first one, called the “penalized marginal regression” (\({\text{PMR}}\)) method, selects the data-dependent penalty weights \({\widehat{w}}_{n,j}\), for each individual \(j = 1,\ldots ,p_{_n}\), according to

$$\begin{aligned} {\widehat{w}}_{n,j} = 1/|{\widehat{\beta }}_{j}^{{\text{PMR}}}|, \end{aligned}$$
(15)

where \({\widehat{\beta }}^{{\text{PMR}}}_j\) is such that \(({\widehat{\alpha }}^{{\text{PMR}}}_j,{\widehat{\beta }}^{{\text{PMR}}}_j) \in \mathbb {R}^2\) minimizes the criterion function

$$\begin{aligned} \ell _{n,j}^{{\text{PMR}}}(\alpha ,\beta ) = \frac{1}{n}\sum _{i=1}^n Q_{{q}}(Y_i, F^{-1}(\alpha +X_{i,j}\beta ))+\kappa _n |\beta | \end{aligned}$$
(16)

over \((\alpha ,\beta )\), with some sequence \(\kappa _n>0\), and \(X_{i,j}\) denoting the jth variable in the ith sample.

An alternative procedure, called the “marginal regression” (\({\text{MR}}\)) method, selects the penalty weights \({\widehat{w}}_{n,j}\), for each individual \(j = 1,\ldots ,p_{_n}\), by means of

$$\begin{aligned} {\widehat{w}}_{n,j} = 1/|{\widehat{\beta }}_{j}^{{\text{MR}}}|, \end{aligned}$$
(17)

where \({\widehat{\beta }}^{{\text{MR}}}_j\) is such that \(({\widehat{\alpha }}^{{\text{MR}}}_j,{\widehat{\beta }}^{{\text{MR}}}_j) \in \mathbb {R}^2\) minimizes the criterion function

$$\begin{aligned} \ell _{n,j}^{{\text{MR}}}(\alpha ,\beta ) = \frac{1}{n} \sum _{i=1}^n Q_{{q}}(Y_{i}, F^{-1}(\alpha +X_{i,j} \beta )) \end{aligned}$$
(18)

over \((\alpha ,\beta )\).

Note that (16) and (18) each involve a univariate predictor together with an intercept term. Thus fast bivariate optimization of (16) and (18) is feasible even when \(p_{_n} > n\). Compared with the \({\text{PMR}}\) method, the \({\text{MR}}\) method is computationally cheaper, as its marginal criterion involves no penalty and hence no additional tuning parameter \(\kappa _n\).
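
The \({\text{MR}}\) weights (17)–(18) amount to \(p_{_n}\) separate bivariate fits. A minimal sketch for Bernoulli responses with the deviance loss and canonical link is given below (illustrative names; a general-purpose optimizer stands in for a dedicated algorithm). The \({\text{PMR}}\) weights (15)–(16) would add the penalty \(\kappa _n|\beta |\) to each marginal criterion.

```python
import numpy as np
from scipy.optimize import minimize

def deviance_loss(params, x, y):
    """Classical-BD (Bernoulli deviance) for an intercept-plus-one-covariate fit."""
    alpha, beta = params
    mu = 1.0 / (1.0 + np.exp(-(alpha + x * beta)))   # canonical (logit) link
    eps = 1e-10
    return np.mean(-2 * (y * np.log(mu + eps) + (1 - y) * np.log(1 - mu + eps)))

def mr_weights(X, Y):
    """Marginal regression (MR) penalty weights: w_hat_{n,j} = 1/|beta_hat_j^{MR}|."""
    p = X.shape[1]
    w = np.empty(p)
    for j in range(p):
        fit = minimize(deviance_loss, x0=np.zeros(2), args=(X[:, j], Y),
                       method="BFGS")
        w[j] = 1.0 / max(abs(fit.x[1]), 1e-10)   # guard against a zero estimate
    return w
```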

Theorems 4 and 5 indicate that under the assumptions on the correlation between the predictor variables and the response variable, the penalty weights selected by either the \({\text{PMR}}\) or \({\text{MR}}\) method satisfy the conditions on \(\{w_{n,j}\}\) in Theorem 1.

Theorem 4

(\({\text{PMR}}\) for penalty weights: \(p_{_n}\approx n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\), \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\) and \({\mathcal {B}}_n/\kappa _n = O(1)\) for \(\kappa _n\) in (16). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n = o(\kappa _n)\) and \(\log (p_{_n}) = o(n\kappa _n^2)\). Then there exist local minimizers \(\{{\widehat{\beta }}_{j}^{{\text{PMR}}}\}_{j=1}^{p_{_n}}\) of (16) such that the penalty weights \(\{{\widehat{w}}_{n,j}\}_{j=1}^{p_{_n}}\) defined in (15) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})} \lambda _n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 1, where \({\widehat{w}}_{\max }^{({\text{I}})} = \max _{1\le j\le s_n} {\widehat{w}}_{n,j}\) and \({\widehat{w}}_{\min }^{({\text{II}})} = \min _{s_n+1\le j\le p_{_n}} {\widehat{w}}_{n,j}\).

Theorem 5

(\({\text{MR}}\) for penalty weights: \(p_{_n}\approx n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}2\) and \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\) and \({\mathcal {B}}_n = \lambda _n\). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n \rightarrow 0\), \(\lambda _n n/s_n \rightarrow \infty\) and \(s_n^2 \log (p_{_n}) = o(\lambda _n^2 n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (18) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (17) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})}\lambda _n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 1.

It should be pointed out that the \({\text{PMR}}\) method here differs from the weights selection method in Zhang et al. (2010), which is restricted to \(p_{_n}\approx n\), excludes the intercept term in (16) and requires \({\text{E}}({\varvec{X}})={\textbf{0}}\) to satisfy the conditions on \(\{w_{n,j}\}\) in Theorem 1. In contrast, the \({\text{PMR}}\) method here removes that requirement and will also be applicable to \(p_{_n}\gg n\); see Theorem 7.

4 Robust estimation with high-dimensions: \(p_{_n}\gg n\)

This section explores the behavior of penalized robust-\({\text{BD}}\) estimates in sparse high-dimensional parametric models when \(p_{_n}\) is allowed to grow faster than n. Evidently, the technical conditions (e.g., in Theorem 1) on \(p_{_n}\) in Sect. 3 are violated for \(p_{_n} \gg n\). Thus, directly carrying through the proofs in Sect. 3 to the counterpart of \(p_{_n} \gg n\) is infeasible. To facilitate the study in the case of \(p_{_n}\gg n\), we impose a convexity assumption on the “robust-\({\text{BD}}\)”:

$$\begin{aligned} \text{p}_{_2}(y;\theta ) > 0\text { for all }\theta \in \mathbb {R}\text { and all } y\text { in the range of }Y. \end{aligned}$$
(19)

Under this assumption, \(\rho _{{q}}(Y, F^{-1}({\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}))\) is strictly convex in \({\widetilde{{\varvec{\beta }}}}\).

Apparently, assumption (19) is stronger than \({\text{E}}\{\text{p}_{_2}(Y; {\widetilde{{\varvec{X}}}}^T {\widetilde{{\varvec{\beta }}}}_0)\mid {\varvec{X}}\}\ge 0\) discussed in Sect. 2.2; relaxing (19) will be desirable but is not pursued in this paper. We consider here particular cases where assumption (19) is practically/approximately achievable and theoretically relevant for high-dimensional settings.

  1. Case 1:

    For any observation \(({\varvec{X}}, Y)\) such that the conditional distribution of Y given \({\varvec{X}}\) is symmetric about \(m({\varvec{X}})\), the use of a quadratic loss function combined with an identity link, and a constant conditional variance ensures that \(\text{p}_{_2}(y;\theta ) = 2\psi '(r(y, \theta )) \ge 0\), for a monotone non-decreasing \(\psi\)-function. Thus, the Gaussianity assumption on the conditional distribution of Y given \({\varvec{X}}\) is relaxed.

  2. Case 2:

    Recall that if \(\psi (r)=r\), then \(\text{p}_{_j}(y;\theta )=\text{q}_{_j}(y;\theta )\), and thus condition (19) is equivalent to condition (11). Indeed, condition (11) holds broadly for nearly all commonly used \({\text{BD}}\). Examples include the quadratic loss function, the deviance-based loss and exponential loss functions for the Bernoulli responses, and the (negative) quasi-likelihood for over-dispersed Poisson responses, among many others. The implication is that for high-dimensional data, if we are most concerned with dealing with outliers arising from the explanatory variables, we may employ \(\psi (r)=r\) for Y alone while retaining the weight function \(w(\cdot )\) on the covariates \({\varvec{X}}\). This is particularly relevant to samples with Bernoulli or Binomial responses, where both the parameter space and the response space are bounded, and thus is applicable to a wide class of classification procedures.

Theorem 6 states that the oracle property remains true, under suitable conditions, for the penalized robust-\({\text{BD}}\) estimates in the \(p_{_n}\gg n\) settings. Due to the technical challenge from \(p_{_n}\gg n\), Theorem 6 contains stronger assumptions than those in Theorem 2 (with \(p_{_n}\approx n\)). Lemma 4 in Appendix 1.1 provides key proofs for Theorem 6.

Theorem 6

(oracle property: \(p_{_n}\gg n\)) Assume (19) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}4\), \({\text{A}}5'\), \({\text{B}}5\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Suppose \(s_n^4/n \rightarrow 0\), \(\log (p_{_n}-s_n)/n = O(1)\), \(\log (p_{_n}-s_n)/\{n\lambda _n^2(w_{\min }^{({\text{II}})})^2\} = o_{_{{\text{P}}}}(1)\) and \(\min _{1\le j\le s_n} |\beta _{j;0}|/\sqrt{s_n/n} \rightarrow \infty\). Assume \(w_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \(w_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\).

\(\mathrm {(i)}\):

Then there exists a global minimizer \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) of (3) such that \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\).

\(\mathrm {(ii)}\):

Moreover, if \(s_n^5/n \rightarrow 0\), then for any fixed integer \({\textsf {k}}\) and any \({\textsf {k}}\times (s_n+1)\) matrix \(A_n\) such that \(A_nA_n^T \rightarrow \mathbb {G}\) for a \({\textsf {k}}\times {\textsf {k}}\) nonnegative-definite symmetric matrix \(\mathbb {G}\), we have  \(\sqrt{n} A_n (\Omega _n^{({\text{I}})})^{-1/2} [\textbf{H}_n^{({\text{I}})}(\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}) + \lambda _n \textbf{W}_n^{({\text{I}})} \,\text{sign}\{{\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\}] {\mathop {\longrightarrow }\limits ^{\mathcal L}}N({\textbf{0}}, \mathbb {G})\).

The conditions \(s_n^5/n=o(1)\), \(\log (p_{_n}-s_n) = O(n)\) and \(\log (p_{_n}-s_n)/\{(w_{\min }^{({\text{II}})})^2 \lambda _n^2 n\} = o_{_{{\text{P}}}}(1)\) impose constraints on \(s_n\) and \(p_{_n}\) simultaneously. To illustrate, take \(s_n = n^{a}\), where \(0<a<1/5\). Then we can take \(w_{\min }^{({\text{II}})}={n^b}/{(\lambda _n \sqrt{n})}\) for \(b>a\), namely, \(n\lambda _n^2 \{w_{\min }^{({\text{II}})}\}^2=n^{2b}\). A sufficient condition for the model dimension \(p_{_n}\) is then that \(\log (p_{_n}-s_n) = O(n)\) and \(\log (p_{_n}-s_n) = o(n^{2b})\), i.e., \(\log (p_{_n}-s_n)= o(n^{2b}) \wedge O(n)\). This indicates that Theorem 6 allows \(p_{_n}=\exp \{o(n^{2b}) \wedge O(n)\}\), which grows nearly exponentially fast with n.

Regarding the selection of penalty weights, the \({\text{PMR}}\) and \({\text{MR}}\) methods proposed in Sect. 3.5 with \(p_{_n}\approx n\) continue to work well for selecting weights with \(p_{_n} \gg n\). The validity is presented in Theorems 7 and 8, which again do not require \({\text{E}}({\varvec{X}})={\textbf{0}}\).

Theorem 7

(\({\text{PMR}}\) for penalty weights: \(p_{_n}\gg n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\), \({\mathcal {A}}_n/\kappa _n \rightarrow \infty\), and \({\mathcal {B}}_n/\kappa _n = O(1)\) for \(\kappa _n\) in (16). Suppose \(\lambda _n \sqrt{n} = O(1)\), \({\lambda _n \sqrt{n}}/{s_n} = o(\kappa _n)\) and \(\log (p_{_n}) = o(n\kappa _n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{PMR}}}_{j}\) of (16) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (15) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})} {\lambda _n \sqrt{n}}/ {s_n} {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 6.

Theorem 8

(\({\text{MR}}\) for penalty weights: \(p_{_n}\gg n\)) Assume (11) and Conditions \({\text{A}}0\), \({\text{A}}1\), \({\text{A}}2\), \({\text{B}}3\), \({\text{A}}4\), \({\text{A}}6\), \({\text{A}}7\) in Appendix 1.1. Assume \({\text{E}}2\) and \({\text{E}}1\) in Appendix 1.1, where \({\mathcal {A}}_n = \lambda _n \sqrt{n}\) and \({\mathcal {B}}_n = {\lambda _n\sqrt{n}} / {s_n}\). Suppose \(\lambda _n \sqrt{n} = O(1)\), \(\lambda _n\sqrt{n}/s_n \rightarrow 0\), \(\lambda _n n/s_n \rightarrow \infty\) and \(s_n^2 \log (p_{_n}) = o(\lambda _n^2 n^2)\). Then there exist local minimizers \({\widehat{\beta }}^{{\text{MR}}}_{j}\) of (18) such that the penalty weights \({\widehat{w}}_{n,j}\) defined in (17) satisfy that \({\widehat{w}}_{\max }^{({\text{I}})} = O_{{\text{P}}}\{{1}/(\lambda _n \sqrt{n})\}\) and \({\widehat{w}}_{\min }^{({\text{II}})}\lambda _n \sqrt{n}/s_n {\mathop {\longrightarrow }\limits ^{\text{P}}}\infty\) as needed in Theorem 6.

5 Robust estimation in classification: \(p_{_n}\approx n\), \(p_{_n}\gg n\)

This section deals with the binary response variable Y which takes values 0 and 1. In this case, the mean regression function \(m({\varvec{x}})\) in (1) becomes the class label probability, \({\text{P}}(Y = 1 \mid {\varvec{X}}= {\varvec{x}})\). From the penalized robust-\({\text{BD}}\) estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) proposed in either Sect. 3 or Sect. 4, we can construct the “penalized robust-\({\text{BD}}\) classifier”, \({\widehat{\phi }}({\varvec{x}}) = {\text{I}}\{{\widehat{m}}({\varvec{x}}) > 1/2\},\) for a future input variable \({\varvec{x}}\), where \({\widehat{m}}({\varvec{x}})= F^{-1}({\widetilde{{\varvec{x}}}}^T\widehat{{\widetilde{{\varvec{\beta }}}}})\).
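
In code, the induced classifier is a thin wrapper around the fitted coefficients (a sketch; `F_inv` stands for \(F^{-1}\), and the function name is illustrative).

```python
import numpy as np

def robust_bd_classify(x_new, beta0_hat, beta_hat, F_inv):
    """Penalized robust-BD classifier: predict 1 if m_hat(x) > 1/2, else 0."""
    m_hat = F_inv(beta0_hat + x_new @ beta_hat)
    return (np.asarray(m_hat) > 0.5).astype(int)
```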

In the classification literature, the misclassification loss of a classification rule \(\phi\) at a sample point \(({\varvec{x}}, y)\) is \(l(y, \phi ({\varvec{x}})) = {\text{I}}\{y \ne \phi ({\varvec{x}})\}\). The risk of \(\phi\) is the expected misclassification loss, \(R(\phi ) = {\text{E}}\{l(Y, \phi ({\varvec{X}}))\} = {\text{P}}(\phi ({\varvec{X}}) \ne Y)\). The optimal Bayes rule, which minimizes the risk over \(\phi\), is \(\phi _{\mathrm {\scriptscriptstyle B}}({\varvec{x}})={\text{I}}\{m({\varvec{x}}) > 1/2\}\). For a test sample \(({\varvec{X}}^o, Y^o)\), which is an \(\mathrm {i.i.d.}\) copy of the samples in the training set \(\mathcal T_n=\{({\varvec{X}}_{i}, Y_{i}): i=1,\ldots ,n\}\), the optimal Bayes risk is then \(R(\phi _{{\mathrm {\scriptscriptstyle B}}}) = {\text{P}}(\phi _{{\mathrm {\scriptscriptstyle B}}}({\varvec{X}}^o) \ne Y^o)\). Meanwhile, the conditional risk of the classification rule \({\widehat{\phi }}\) is \(R({\widehat{\phi }} \mid \mathcal T_n) = {\text{P}}({\widehat{\phi }}({\varvec{X}}^o) \ne Y^o \mid \mathcal T_n)\).

Theorem 9 demonstrates that \({\widehat{\phi }}\) attains the classification consistency, provided that the estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) possesses the sparsity property and is consistent at an appropriate rate. As observed, the conclusion of Theorem 9 is applicable to penalized robust-\({\text{BD}}\) estimates in both Theorem 2(i) with \(p_{_n}\approx n\) and Theorem 6(i) with \(p_{_n} \gg n\).

Theorem 9

(consistency of the penalized robust-\({\text{BD}}\) classifier) Assume Conditions \({\text{A}}0\), \({\text{A}}1\) and \({\text{A}}4\) in Appendix 1.1. Suppose that the estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}} = (\widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})T},{\widehat{{\varvec{\beta }}}}^{({\text{II}})T})^T\) satisfies \({\text{P}}({\widehat{{\varvec{\beta }}}}^{({\text{II}})} = {\textbf{0}})\rightarrow 1\) and \(\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}^{({\text{I}})} - {\widetilde{{\varvec{\beta }}}}_{0}^{({\text{I}})}\Vert _2 = O_{{\text{P}}}(r_n)\). If \(r_n\sqrt{s_n}=o(1)\), then the classification rule \({\widehat{\phi }}\) constructed from \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) is consistent in the sense that \(\mathrm {(i)}\) \({\text{E}}\{R({\widehat{\phi }} \mid \mathcal T_n)\} - R(\phi _{{\mathrm {\scriptscriptstyle B}}}) \rightarrow 0\) as \(n \rightarrow \infty\), which in turn yields \(\mathrm {(ii)}\) \(R({\widehat{\phi }} \mid \mathcal T_n) {\mathop {\longrightarrow }\limits ^{\text{P}}}R(\phi _{{\mathrm {\scriptscriptstyle B}}})\).

6 Simulation study

We conduct simulation studies to evaluate the performance of the penalized robust-\({\text{BD}}\) estimates in the absence and presence of outliers. The Huber \(\psi\)-function is used with \(c=1.345\). For practical applications, we suggest an empirical choice of the weight function, \(w({\varvec{x}})=1/\{1+\sum _{j=1}^{p_{_n}} (\frac{x_{j}-m_{\centerdot ,j}}{s_{\centerdot ,j}})^2\}^{1/2}\), where \({\varvec{x}}=(x_1,\ldots ,x_{p_{_n}})^T\), and \(m_{\centerdot ,j}\) and \(s_{\centerdot ,j}\) denote the sample median and sample median absolute deviation of \(\{X_{i,j}: i=1,\ldots ,n\}\), respectively, \(j=1,\ldots ,p_{_n}\). This form of \(w({\varvec{x}})\) generalizes a weight function used on page 2864 of Boente et al. (2006) for a one-dimensional covariate. The classical non-robust counterparts correspond to \(\psi (r)=r\) and \(w({\varvec{x}})\equiv 1\). In the simulations, both \(p_{_n}=50\) and \(p_{_n}=500\) are considered.
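A direct implementation of this weight function is sketched below; the columnwise median absolute deviation is left unscaled, and the unit fallback for constant columns is an implementation assumption not stated in the paper.

```python
import numpy as np

def covariate_weights(X):
    """w(x_i) = 1 / sqrt(1 + sum_j ((x_ij - med_j) / mad_j)^2) for each row
    of the n x p_n design matrix X, with med_j and mad_j the columnwise
    sample median and sample median absolute deviation."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad = np.where(mad > 0, mad, 1.0)   # guard against constant columns
    z = (X - med) / mad
    return 1.0 / np.sqrt(1.0 + np.sum(z ** 2, axis=1))
```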

For illustrative purposes, 4 types of penalization techniques combined with the loss term in (3) are compared: (I) the \(\text{SCAD}\) penalty, with an accompanying parameter \(a = 3.7\), combined with the local linear approximation; (II) the \(L_1\) penalty; (III) the weighted-\(L_1\) penalty with weights selected by the proposed \({\text{MR}}\) method; (IV) the weighted-\(L_1\) penalty with weights selected by the proposed \({\text{PMR}}\) method. For comparison, the oracle (non-penalized) estimator (abbreviated as “Oracle”) of parameters using the true model containing only the truly significant variables is included. The tuning constant \(\lambda _n\) for methods (I)–(III) is selected separately in each simulation by minimizing the classical-\({\text{BD}}\) (for non-robust methods) or the “robust-\({\text{BD}}\)” (for robust methods) on a test set of size n; \(\lambda _n\) and \(\kappa _n\) for method (IV) are searched over a surface of grid points. Numerical algorithms for the proposed penalized estimator in (3) are given in Appendix 1.3.
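The \(\lambda _n\) search can be organized as a one-dimensional grid search. The sketch below is schematic only: `fit_penalized_robust_bd` and `validation_bd` are hypothetical placeholders for the estimation routine of (3) and for the classical- or robust-\({\text{BD}}\) criterion evaluated on the test set; neither is part of the paper's code.

```python
import numpy as np

def select_lambda(lambda_grid, X_train, y_train, X_test, y_test,
                  fit_penalized_robust_bd, validation_bd):
    """Return the lambda_n (and fitted coefficients) minimizing the
    BD-type criterion on a test set of size n."""
    best = (None, np.inf, None)
    for lam in lambda_grid:
        beta_hat = fit_penalized_robust_bd(X_train, y_train, lam)
        loss = validation_bd(X_test, y_test, beta_hat)
        if loss < best[1]:
            best = (lam, loss, beta_hat)
    return best[0], best[2]
```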

The number of Monte Carlo runs is 500. To measure the performance of a parameter estimate \(\widehat{{\widetilde{{\varvec{\beta }}}}}\) through simulation, we use the average value and standard deviation (\({\text{sd}}\)) of estimation errors, \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})=\Vert \widehat{{\widetilde{{\varvec{\beta }}}}}-{{\widetilde{{\varvec{\beta }}}}}_0\Vert _2\). Variable selection is assessed by \({\text {C-Z}}\), the average number of regression coefficients which are correctly estimated to be zero when the true coefficients are zero, and \({\text {C-NZ}}\), the average number of coefficients which are correctly estimated to be non-zero when the true coefficients are non-zero. For binary responses, \({\text{MCR}}\) denotes the average of the misclassification rates on a test set of size 5000.
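For concreteness, the three estimation/selection measures can be computed per Monte Carlo run as follows; the zero-detection tolerance `tol` is an implementation choice, not specified in the paper.

```python
import numpy as np

def summarize_fit(beta_hat, beta_true, tol=1e-8):
    """EE   = ||beta_hat - beta_true||_2,
    C-Z  = # coefficients correctly estimated as zero,
    C-NZ = # coefficients correctly estimated as non-zero."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    ee = float(np.linalg.norm(beta_hat - beta_true))
    true_zero = np.abs(beta_true) < tol
    est_zero = np.abs(beta_hat) < tol
    c_z = int(np.sum(est_zero & true_zero))
    c_nz = int(np.sum(~est_zero & ~true_zero))
    return ee, c_z, c_nz
```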

6.1 Overdispersed Poisson responses

Fig. 1 Plots of \(\rho _{{q}}(y, \mu )\) (left panels), \(\text{p}_{_1}(y;\theta )\) (middle panels) and \(\text{p}_{_2}(y;\theta )\) (right panels) for the Bernoulli response \(y=1\). In each panel, solid line: using “robust-\({\text{BD}}\)” with Huber \(\psi\)-function (5) and \(c=1.345\); dashed line: using “classical-\({\text{BD}}\)”. Top panels: deviance loss used as the \({\text{BD}}\); bottom panels: exponential loss used as the \({\text{BD}}\)

Table 1 (Simulation study: overdispersed Poisson responses, \(n=100\)) Summary results for Study 1 (raw data without outliers)
Table 2 (Simulation study: overdispersed Poisson responses, \(n=100\)) Summary results for Study 2 (contaminated data with outliers)
Fig. 2 (Simulation study: overdispersed Poisson responses, \(n=100\), \(p_{_n}=50\) (left panel) and \(p_{_n}=500\) (right panel)) Boxplots of \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,5\), corresponding to results in Tables 1 and 2, using the \({\text{PMR}}\) selection method for penalty weights in the weighted-\(L_1\) penalty. The first row: raw data and using the non-robust method; the second row: raw data and using the robust method; the third row: contaminated data and using the non-robust method; the fourth row: contaminated data and using the robust method

Table 3 (Simulation study: Bernoulli responses, \(n=200\)) Summary results for Study 1 (raw data without outliers)
Table 4 (Simulation study: Bernoulli responses, \(n=200\))  Summary results for Study 2 (contaminated data with outliers)
Table 5 (Simulation study: Bernoulli responses, \(n=200\)) Compare \({\text{MCR}}\) using classical-\({\text{SVM}}\) and robust-\({\text{SVM}}\) for Study 1 (raw data without outliers) in Table 3 and Study 2 (contaminated data with outliers) in Table 4
Table 6 (Real data) Classification for the Lymphoma data

We generate overdispersed Poisson counts \(\{Y_{i}\}_{i=1}^n\) with \(n=100\), satisfying \({\text{var}}(Y_{i}\mid {\varvec{X}}_{i}={\varvec{x}}_{i})=2\, m({\varvec{x}}_{i})\). In the predictor \({\varvec{X}}_{i}=(X_{i,1}, X_{i,2}, \ldots , X_{i,p_{_n}})^T,\) \(X_{i,1}=i/n-0.5\); for \(j=2,\ldots ,p_{_n}\), \(X_{i,j}=\Phi (Z_{i,j})-0.5\), where \(\Phi\) is the standard Gaussian distribution function, and \((Z_{i,2},\ldots ,Z_{i,p_{_n}})^T \sim N({\textbf{0}}, \Sigma _{p_{_n}-1})\), with \(\Sigma _{p_{_n}-1}(j,k)=0.2^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}-1\). The link function is \(\log \{m({\varvec{x}})\}=\beta _{0;0}+ {\varvec{x}}^T {\varvec{\beta }}_{0}\), with \(\beta _{0;0}=2.5\) and \({\varvec{\beta }}_{0}=(2, 2, 0,\ldots ,0)^T\). The (negative) quasi-likelihood with \(V(\mu )=\mu\) is utilized as the \({\text{BD}}\).
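The design above can be reproduced as follows. The paper specifies the conditional variance \(2\,m({\varvec{x}}_i)\) but not the generating mechanism for the overdispersed counts; the sketch uses a negative binomial with size parameter equal to \(m({\varvec{x}}_i)\), which has mean \(m({\varvec{x}}_i)\) and variance \(2\,m({\varvec{x}}_i)\), and this choice is an assumption.

```python
import numpy as np
from scipy.stats import norm

def simulate_overdispersed_poisson(n=100, p=50, seed=0):
    """One simulated data set: X_{i,1} = i/n - 0.5; X_{i,j} = Phi(Z_{i,j}) - 0.5
    with (Z_{i,2},...,Z_{i,p}) ~ N(0, Sigma), Sigma(j,k) = 0.2^{|j-k|};
    log m(x) = 2.5 + x^T (2, 2, 0, ..., 0)^T; var(Y|X=x) = 2 m(x)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p - 1)
    Sigma = 0.2 ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
    X = np.empty((n, p))
    X[:, 0] = np.arange(1, n + 1) / n - 0.5
    X[:, 1:] = norm.cdf(Z) - 0.5
    beta0, beta = 2.5, np.r_[2.0, 2.0, np.zeros(p - 2)]
    m = np.exp(beta0 + X @ beta)
    Y = rng.negative_binomial(m, 0.5)   # mean m, variance 2m (assumed mechanism)
    return X, Y
```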

Study 1 (raw data without outliers). For simulated data without contamination, the results are summarized in Table 1. The loss of efficiency (at the true model) incurred by moving from classical to robust estimates under the (non-penalized) oracle estimation method can be seen from the increase of \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\). For the penalized methods, the robust estimates perform similarly to their non-robust counterparts, with little loss in estimation efficiency or in the ability to select relevant variables and exclude irrelevant ones. Thus, there is no serious adverse effect of applying penalized robust estimation to clean datasets.

Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set, where 8 data points \((X_{i,j},Y_i)\) are subject to contamination: They are replaced by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = Y_i\,{\text{I}}(Y_i > 100) + A\, {\text{I}}(Y_i \le 100)\) with \(A=50\), \(i=1,\ldots ,8\),

$$\begin{aligned} \begin{matrix} X_{1, 1}^* =0.5\,\text{sign}(U_1-0.5), & X_{2, 2}^* =0.5\,\text{sign}(U_2-0.5), & X_{3, 3}^* =0.5\,\text{sign}(U_3-0.5), \\ X_{4, 5}^* =0.5\,\text{sign}(U_4-0.5), & X_{5, 7}^* =0.5\,\text{sign}(U_5-0.5), & X_{6, 8}^* =0.5\,\text{sign}(U_6-0.5), \\ X_{7, 9}^* =0.5\,\text{sign}(U_7-0.5), & & \end{matrix} \end{aligned}$$

with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 2 summarizes the results over 500 sets of contaminated data. A comparison of each penalized quasi-likelihood estimate across Tables 1 and 2 indicates that the presence of contamination substantially increases \({\text{EE}}(\widehat{{\widetilde{{\varvec{\beta }}}}})\). Among the 4 penalized estimates, the \(L_1\) penalty tends to have higher false positive rates. As observed from Table 2 for the contaminated cases, the non-robust estimates are more sensitive to outliers than their robust counterparts. This lends support to Theorems 2 and 6. To provide a closer view of the estimates, Fig. 2 displays the boxplots of the biases \((\widehat{\beta }_j-{\beta }_{j;0})\), \(j=0,1,\ldots ,5\), corresponding to the results in Tables 1 and 2, using the \({\text{PMR}}\) selection method for penalty weights in the weighted-\(L_1\) penalty.
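A sketch of this contamination scheme applied to a simulated data set is given below; the indices are zero-based, only the seven covariate cells listed above are perturbed, and all eight responses are modified.

```python
import numpy as np

def contaminate_poisson(X, Y, A=50, seed=0):
    """Study 2 contamination: for i = 1,...,8, Y_i* = Y_i I(Y_i > 100) +
    A I(Y_i <= 100); the listed covariate entries are set to +/- 0.5."""
    rng = np.random.default_rng(seed)
    Xc, Yc = X.copy(), Y.copy()
    Yc[:8] = np.where(Y[:8] > 100, Y[:8], A)
    cells = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 6), (5, 7), (6, 8)]  # (i, j), 0-based
    for i, j in cells:
        Xc[i, j] = 0.5 * np.sign(rng.uniform() - 0.5)
    return Xc, Yc
```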

6.2 Bernoulli responses

We generate samples \(\{({\varvec{X}}_i, Y_i)\}_{i=1}^n\) with \(n=200\) from the model, \(Y_i\mid {\varvec{X}}_i = {\varvec{x}}_i \sim \text{Bernoulli}\{m({\varvec{x}}_i)\},\) where \({\text{logit}}\{m({\varvec{x}})\} = \beta _{0;0} + {\varvec{x}}^T{\varvec{\beta }}_{0}\) with \(\beta _{0;0} = 2\) and \({\varvec{\beta }}_{0} = (2,2,0, \ldots , 0)^T\). The predictor \({\varvec{X}}_i\sim N({\textbf{0}}, \Sigma _{p_{_n}})\), with \(\Sigma _{p_{_n}}(j,k)=0.1^{|j-k |}\), \(j,k=1,\ldots ,p_{_n}\). Both the deviance and exponential loss functions are employed as the \({\text{BD}}\).
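This Bernoulli design can be simulated directly; a minimal sketch:

```python
import numpy as np

def simulate_bernoulli(n=200, p=50, seed=0):
    """One simulated data set: X_i ~ N(0, Sigma), Sigma(j,k) = 0.1^{|j-k|};
    logit{m(x)} = 2 + x^T (2, 2, 0, ..., 0)^T."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.1 ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0, beta = 2.0, np.r_[2.0, 2.0, np.zeros(p - 2)]
    m = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))   # logistic inverse link
    Y = rng.binomial(1, m)
    return X, Y
```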

Study 1 (raw data without outliers). For simulated data without contamination, the results are summarized in Table 3. The robust estimates perform as well as their non-robust counterparts with respect to parameter estimation, variable selection and classification accuracy. Indeed, the optimal Bayes rule gives misclassification rates of 0.137 for \(p_{_n}=50\) and 0.138 for \(p_{_n}=500\). Thus, the choice of loss function has an asymptotically negligible impact on classification performance, which agrees with Theorem 9 on the asymptotic classification consistency.

Study 2 (contaminated data with outliers). For each data set generated from the model, we create a contaminated data set. The contamination scheme is to replace the original 8 data points \((X_{i,j},Y_i)\) by \((X_{i,j}^*,Y_i^*)\), where \(Y_i^* = 1- Y_i\), \(i=1,\ldots ,8\),

$$\begin{aligned} \begin{matrix} X^*_{1, 1} = 3\,\text{sign}(U_1-0.5), & X^*_{2, 1} = 3\,\text{sign}(U_2-0.5), & X^*_{3, 1} = 3\,\text{sign}(U_3-0.5), \\ X^*_{4, 3} = 3\,\text{sign}(U_4-0.5), & X^*_{5, 5} = 3\,\text{sign}(U_5-0.5), & X^*_{6, 9} = 3\,\text{sign}(U_6-0.5), \\ X^*_{7, 9} = 3\,\text{sign}(U_7-0.5), & & \end{matrix} \end{aligned}$$

with \(\{U_i\}{\mathop {\sim }\limits ^{\mathrm {i.i.d.}}}\text{Uniform}(0,1)\). Table 4 summarizes the results over 500 sets of contaminated data. Regarding the robustness-efficiency tradeoff, conclusions analogous to those in Sect. 6.1 can be reached. Moreover, comparing Tables 3 and 4 reveals that (i) contamination increases the misclassification rates of all 5 methods; (ii) for contaminated cases in Table 4, robust procedures tend to reduce the misclassification rates; (iii) for robust estimation, the deviance loss is computationally more stable, yielding lower misclassification rates than the exponential loss (see also Fig. 1); thus the deviance loss is recommended for practical applications.

For the penalized methods, the tuning parameter \(\lambda _n\) is searched over the interval [0.0034, 0.1975], and \(\kappa _n\) is searched over the interval \([1/2^{8}, 1/2^{5}]\). To compare the classification performance with methods such as the classical-\({\text{SVM}}\) and robust-\({\text{SVM}}\) (using either the linear or Gaussian kernel, combined with auxiliary parameters \(c^*\) and/or s) in Wu and Liu (2007), Table 5 summarizes the results under the same set-ups as in Tables 3 and 4. Compared with the classical- and robust-\({\text{SVM}}\)s (in Table 5), the robust-\({\text{BD}}\) method (in Tables 3 and 4) yields clearly lower \({\text{MCR}}\)s.

6.3 Gaussian responses

To further illustrate the benefits of the robust method relative to the non-robust method, Appendix 1.2 gives additional simulation studies for Gaussian responses.

7 Real data application

To illustrate the application of the penalized robust-\({\text{BD}}\) methods for classifying high-dimensional, low-sample-size data, we consider the Lymphoma data studied in Alizadeh et al. (2000), which identified two molecularly distinct forms of diffuse large B-cell Lymphoma (\(\text{DLBCL}\)). These two forms, called “germinal centre B-like (gc-B)” \(\text{DLBCL}\) and “activated B-like (a-B)” \(\text{DLBCL}\), had gene expression patterns indicative of different stages of B-cell differentiation.

The publicly available dataset contains 4026 genes across 47 samples, of which 24 are “gc-B” and 23 are “a-B”. We use the 10-nearest-neighbor method to impute the missing expression data. After imputation, the dataset is standardized to zero mean and unit variance across genes. The goal is to predict whether a sample is “gc-B” or “a-B”.

To evaluate the performance of the penalized estimates of parameters in the model, \({\text{logit}}\{{\text{P}}(Y=1 \mid X_1,\ldots ,X_{4026})\}=\beta _0+\sum _{j=1}^{4026} \beta _j X_j\), we randomly split the data into a training set of 31 samples (16 cases of “gc-B” and 15 cases of “a-B”) and a test set of 16 samples (8 cases of “gc-B” and 8 cases of “a-B”). For each training set, \(\lambda _n\) is selected by minimizing a 3-fold cross-validated estimate of the misclassification rate; \(\lambda _n\) and \(\kappa _n\) for the proposed \({\text{PMR}}\) method are searched over a surface of grid points. The test error (\(\text{TE}\)) is the misclassification rate of the penalized classifier on the test set. Both the Huber \(\psi\)-function (with \(c=1.345\)) and the Tukey \(\psi\)-function (with \(c=4.685\)) are utilized in the robust estimates, and the weight function \(w({\varvec{x}})\) is the same as that used in Sect. 6. Table 6 tabulates the average \(\text{TE}\) and the average number of selected genes over 100 random splits. The penalized estimates/classifiers induced by the deviance loss and the exponential loss yield similar performance. The \(L_1\) penalty selects approximately twice as many genes as the other penalty choices. Considering both \(\text{TE}\) and sparsity of the selected model, the robust estimate combined with the weighted-\(L_1\) penalty appears to perform the best. The results also reveal that the choice of \(\psi\)-function in (6) has a negligible impact on the performance of the robust penalized estimates.
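A schematic of the random-splitting and cross-validation scheme is sketched below. It ignores the class-stratified split described above, and `fit` is a hypothetical routine returning a fitted classifier with a `predict` method; neither is part of the paper's code.

```python
import numpy as np

def one_random_split(X, y, lambda_grid, fit, seed=0, n_folds=3):
    """One random split into 31 training and 16 test samples; lambda_n is
    chosen by 3-fold cross-validated misclassification on the training set,
    and the test error TE is computed on the held-out 16 samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    train, test = idx[:31], idx[31:47]
    folds = np.array_split(rng.permutation(train), n_folds)
    cv_err = []
    for lam in lambda_grid:
        errs = []
        for k in range(n_folds):
            hold = folds[k]
            keep = np.setdiff1d(train, hold)
            clf = fit(X[keep], y[keep], lam)
            errs.append(np.mean(clf.predict(X[hold]) != y[hold]))
        cv_err.append(np.mean(errs))
    lam_best = lambda_grid[int(np.argmin(cv_err))]
    clf = fit(X[train], y[train], lam_best)
    return float(np.mean(clf.predict(X[test]) != y[test]))   # test error TE
```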

8 Discussion

The conventional penalized least-squares and penalized-likelihood estimates of parameters exhibit the oracle property for sparsity recovery but lack resistance to outlying observations. This paper proposes a class of robust error measures called “robust-\({\text{BD}}\)” and introduces the “penalized robust-\({\text{BD}}\) estimate”. The “robust-\({\text{BD}}\)” induces a bounded influence function that makes the resulting penalized estimation less sensitive to outliers. Since \({\text{BD}}\) is widely used in machine learning practice, the proposed “penalized robust-\({\text{BD}}\) estimate” combined with suitably chosen weights for penalties can be broadly applicable in regression and classification problems of large dimensions (\(p_{_n}\approx n\)) and high dimensions (\(p_{_n} \gg n\)), achieving both the oracle property and the robustness to outliers in either the covariate space or the response space.

There are several limitations to our study and ongoing challenges that should be considered. Firstly, within the current scope of the paper, the technical conditions, particularly \({\text{A}}1\), \({\text{A}}2\), \({\text{A}}5\) and \({\text{B}}3\) in Appendix 1.1, exclude heavy-tailed covariates \(X_j\) and responses Y. Similarly, the proposed method does not handle very high levels of contamination in the data. Moreover, relaxing (19) could be pursued in a separate study. Secondly, Algorithm 1 in Appendix 1.3.2 shows that the computational complexity depends on several key factors, including the sample size (n), the dimensionality (\(p_{_n}\)), the response type (Gaussian, Bernoulli, or Poisson), the selected penalty weights (\(\{{\widehat{w}}_{n,j}\}_{j=1}^{p_{_n}}\)), the proportion of contamination in the dataset, the accuracy of the quadratic approximation to the loss function in (3), and the convergence property of the coordinate-descent (\({\text{CD}}\)) algorithm (Friedman et al., 2010). In particular, the quadratic approximation may reduce the accuracy of the “penalized robust-\({\text{BD}}\) estimator” and slow down the convergence of the resulting \({\text{CD}}\) algorithm. This is demonstrated in numerical experiments, which indicate an increase in CPU runtime as the response type changes from Gaussian to Bernoulli to Poisson, and as non-robust procedures using the classical-\({\text{BD}}\) are replaced with robust counterparts using the “robust-\({\text{BD}}\)”. Despite these limitations, our work contributes to the “robust-\({\text{BD}}\)” estimation of \(p_{_n}\)-dimensional parameters with some provable results when \(p_{_n} \approx n\) and \(p_{_n} \gg n\), and thus fills a gap in the literature. The goal of our work is to better understand the flexibility, challenges, and limitations of robust-\({\text{BD}}\) estimation as a data-analytic tool for big data analysis.

A number of open questions remain. (a) In the linear regression model, the quantification of the robustness of estimates in terms of gross error sensitivity, rejection point, or local-shift sensitivity has been studied. However, beyond the linear model, such as in the \(\text{GLM}\), relatively little theoretical work has been done on this property, even in the case of fixed dimension \(p_{_n}=p\). Rigorously exploring the robustness property, including the breakdown point, of the proposed class of “penalized robust-\({\text{BD}}\) estimates” in the current model, assuming (1)–(2), remains a nontrivial task. (b) The existence of consistent local solutions of penalized estimates has also appeared in several other existing works, such as the nonconcave penalized likelihood method in Fan and Peng (2004). Devising efficient numerical procedures for obtaining a local solution that is consistent would be desirable but challenging; see Gong et al. (2013) for recent progress on this issue. (c) For \(p_{_n}\gg n\), it is of interest to explore variable screening procedures for dimension reduction before applying the proposed “penalized robust-\({\text{BD}}\)” estimation method. (d) The weight function \(w({\varvec{x}})\) used in the numerical evaluations in Sects. 6 and 7 is a feasible choice for robust-\({\text{BD}}\) estimation in large (\(p_{_n}\approx n\)) and high (\(p_{_n} \gg n\)) dimensions. However, an optimal method for selecting the weight function, according to specific criteria such as the robustness property, has yet to be explored. In low dimensions (\(p_{_n}<n\)), the weight function (e.g., \(w({\varvec{x}}) = 1/\{1+({\varvec{x}}-\widehat{{\varvec{m}}})^T {\widehat{\Sigma }}^{-1} ({\varvec{x}}-\widehat{{\varvec{m}}})\}^{1/2}\)) can rely on robust estimates \(\widehat{{\varvec{m}}}\) and \({\widehat{\Sigma }}\) of the location vector and scatter matrix of \({\varvec{X}}\); pages 137–138 of Heritier et al. (2009) suggest alternative weight functions for robust estimators of linear model parameters in fixed (\(p_{_n}=p\)) dimensions. (e) Finite-sample, rather than asymptotic, results may be obtainable for certain types of covariates \(X_j\) and responses Y under more stringent assumptions. A complete and thorough study of these theoretical and methodological developments and computational advancements is beyond the scope of the current paper and needs to be examined in future research.
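To make point (d) concrete, a low-dimensional (\(p_{_n}<n\)) version of this Mahalanobis-type weight can be sketched as follows, using the minimum covariance determinant estimator as one possible robust choice of \(\widehat{{\varvec{m}}}\) and \({\widehat{\Sigma }}\); the choice of estimator is an assumption, not a recommendation of the paper.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def mahalanobis_weights(X):
    """w(x) = 1 / sqrt(1 + (x - m_hat)^T Sigma_hat^{-1} (x - m_hat)),
    with m_hat, Sigma_hat robust location/scatter estimates of X (n > p)."""
    mcd = MinCovDet().fit(X)
    diff = X - mcd.location_
    inv_cov = np.linalg.inv(mcd.covariance_)
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)   # squared Mahalanobis distances
    return 1.0 / np.sqrt(1.0 + d2)
```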

9 Supplementary information

Appendices 1.1, 1.2 and 1.3 collect proofs of Theorems 1 to 9, additional numerical studies (Figs. 3 and 4 and Tables 7 and 8 in Sect. 6.3; real data analysis), and algorithmic details, respectively.