Regression analysis: likelihood, error and entropy

Bogdan Grechuk¹ &
Michael Zabarankin²

Abstract

In a regression with independent and identically distributed normal residuals, the log-likelihood function yields an empirical form of the $\mathcal{L}^2$-norm, whereas the normal distribution can be obtained as a solution of differential entropy maximization subject to a constraint on the $\mathcal{L}^2$-norm of a random variable. The $\mathcal{L}^1$-norm and the double exponential (Laplace) distribution are related in a similar way. These are examples of an “inter-regenerative” relationship. In fact, $\mathcal{L}^2$-norm and $\mathcal{L}^1$-norm are just particular cases of general error measures introduced by Rockafellar et al. (Finance Stoch 10(1):51–74, 2006) on a space of random variables. General error measures are not necessarily symmetric with respect to ups and downs of a random variable, which is a desired property in finance applications where gains and losses should be treated differently. This work identifies a set of all error measures, denoted by $\mathscr {E}$, and a set of all probability density functions (PDFs) that form “inter-regenerative” relationships (through log-likelihood and entropy maximization). It also shows that M-estimators, which arise in robust regression but, in general, are not error measures, form “inter-regenerative” relationships with all PDFs. In fact, the set of M-estimators, which are error measures, coincides with $\mathscr {E}$. On the other hand, M-estimators are a particular case of L-estimators that also arise in robust regression. A set of L-estimators which are error measures is identified—it contains $\mathscr {E}$ and the so-called trimmed $\mathcal{L}^p$-norms.

Notes

1. The least squares method was used, although without proof, by Legendre in 1805 [28], see [17].
2. The idea to minimize the sum of the absolute deviations of error residuals was first proposed by Boscovich in 1757 [4], see [17].
3. Rockafellar et al. [38, 39] proposed a unifying axiomatic framework for general measures of error, deviation and risk (all of them are positively homogeneous convex functionals defined on a space of r.v.’s), see also [34, 37]. More recently, Grechuk and Zabarankin [15] analyzed the sensitivity of optimal values of positively homogeneous convex functionals in various optimization problems, including linear regression, to noise in the data.
4. We assume that 0 ln 0 = 0.
5. A deviation measure is a functional $\mathcal{D}:\mathcal{L}^r(\Theta )\rightarrow [0,\infty ]$ satisfying axioms E2–E4 and such that $\mathcal{D}(Z) = 0$ for constant Z, and $\mathcal{D}(Z) > 0$ otherwise [38]. A deviation measure is called law-invariant if $\mathcal{D}(X) = \mathcal{D}(Y)$ whenever r.v.’s X and Y have the same distribution [12].

References

1. Alfons, A., Croux, C., Gelper, S.: Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat. 7(1), 226–248 (2013)
2. Bartolucci, F., Scaccia, L.: The use of mixtures for dealing with non-normal regression errors. Comput. Stat. Data Anal. 48(4), 821–834 (2005)
3. Bernholt, T.: Computing the least median of squares estimator in time $O(n^d)$. In: International Conference on Computational Science and Its Applications, pp. 697–706. Springer (2005)
4. Boscovich, R.J.: De litteraria expeditione per pontificiam ditionem, et synopsis amplioris operis, ac habentur plura ejus ex exemplaria etiam sensorum impressa. Bononiensi Scientarum et Artum Instituto Atque Academia Commentarii 4, 353–396 (1757)
5. Box, G.: Non-normality and tests on variances. Biometrika 40, 318–335 (1953)
6. Cover, T., Thomas, J.: Elements of Information Theory. Wiley, New York (2012)
7. Edgeworth, F.: On observations relating to several quantities. Hermathena 6(13), 279–285 (1887)
8. Efron, B.: Regression percentiles using asymmetric squared error loss. Stat. Sin. 1(1), 93–125 (1991)
9. Föllmer, H., Schied, A.: Stochastic Finance, 3rd edn. de Gruyter, Berlin (2011)
10. Gauss, C.F.: Theoria motus corporum coelestium in sectionibus conicis solem ambientium. Sumtibus Frid. Perthes et I.H. Besser (1809)
11. Grechuk, B., Molyboha, A., Zabarankin, M.: Maximum entropy principle with general deviation measures. Math. Oper. Res. 34(2), 445–467 (2009)
12. Grechuk, B., Molyboha, A., Zabarankin, M.: Chebyshev inequalities with law-invariant deviation measures. Probab. Eng. Inf. Sci. 24(1), 145–170 (2010)
13. Grechuk, B., Zabarankin, M.: Schur convex functionals: Fatou property and representation. Math. Finance 22(2), 411–418 (2012)
14. Grechuk, B., Zabarankin, M.: Inverse portfolio problem with mean-deviation model. Eur. J. Oper. Res. 234(2), 481–490 (2014)
15. Grechuk, B., Zabarankin, M.: Sensitivity analysis in applications with deviation, risk, regret, and error measures. SIAM J. Optim. 27(4), 2481–2507 (2017)
16. Gu, Y., Zou, H.: High-dimensional generalizations of asymmetric least squares regression and their applications. Ann. Stat. 44(6), 2661–2694 (2016)
17. Harter, L.: The method of least squares and some alternatives: Part I. In: International Statistical Review/Revue Internationale de Statistique, pp. 147–174 (1974)
18. Hosking, J., Balakrishnan, N.: A uniqueness result for L-estimators, with applications to L-moments. Stat. Methodol. 24, 69–80 (2015)
19. Huber, P.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964)
20. Huber, P.: Robust Statistics. Wiley, New York (1981)
21. Jaynes, E.T.: Information theory and statistical mechanics (notes by the lecturer). Stat. Phys. 3 1, 181 (1963)
22. Jouini, E., Schachermayer, W., Touzi, N.: Law invariant risk measures have the Fatou property. Adv. Math. Econ. 9, 49–71 (2006)
23. Koenker, R., Bassett Jr., G.: Regression quantiles. Econometrica 46(1), 33–50 (1978)
24. Krokhmal, P.: Higher moment coherent risk measures. Quant. Finance 7(4), 373–387 (2007)
25. Kullback, S., Leibler, R.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
26. Laplace, P.S.: Traité de mécanique céleste, vol. 2. J. B. M. Duprat, Paris (1799)
27. Lee, W.M., Hsu, Y.C., Kuan, C.M.: Robust hypothesis tests for M-estimators with possibly non-differentiable estimating functions. Econom. J. 18(1), 95–116 (2015)
28. Legendre, A.M.: Nouvelles méthodes pour la détermination des orbites des comètes. 1. F. Didot, Paris (1805)
29. Lisman, J., Van Zuylen, M.: Note on the generation of most probable frequency distributions. Stat. Neerl. 26(1), 19–23 (1972)
30. Loh, P.L.: Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Ann. Stat. 45(2), 866–896 (2017)
31. Mafusalov, A., Uryasev, S.: CVaR (superquantile) norm: stochastic case. Eur. J. Oper. Res. 249(1), 200–208 (2016)
32. Morales-Jimenez, D., Couillet, R., McKay, M.: Large dimensional analysis of robust M-estimators of covariance with outliers. IEEE Trans. Signal Process. 63(21), 5784–5797 (2015)
33. Mount, D., Netanyahu, N., Piatko, C., Silverman, R., Wu, A.: On the least trimmed squares estimator. Algorithmica 69(1), 148–183 (2014)
34. Rockafellar, R.T., Royset, J.: Measures of residual risk with connections to regression, risk tracking, surrogate models, and ambiguity. SIAM J. Optim. 25(2), 1179–1208 (2015)
35. Rockafellar, R.T., Royset, J.: Random variables, monotone relations, and convex analysis. Math. Program. 148(1–2), 297–331 (2014)
36. Rockafellar, R.T., Uryasev, S.: Conditional value-at-risk for general loss distributions. J. Bank. Finance 26(7), 1443–1471 (2002)
37. Rockafellar, R.T., Uryasev, S.: The fundamental risk quadrangle in risk management, optimization and statistical estimation. Surv. Oper. Res. Manag. Sci. 18(1), 33–53 (2013)
38. Rockafellar, R.T., Uryasev, S., Zabarankin, M.: Generalized deviations in risk analysis. Finance Stoch. 10(1), 51–74 (2006)
39. Rockafellar, R.T., Uryasev, S., Zabarankin, M.: Risk tuning with generalized linear regression. Math. Oper. Res. 33(3), 712–729 (2008)
40. Rousseeuw, P., Leroy, A.: Robust Regression and Outlier Detection, vol. 589. Wiley, New York (2005)
41. Rousseeuw, P., Van Driessen, K.: Computing LTS regression for large data sets. Data Min. Knowl. Disc. 12(1), 29–45 (2006)
42. Rousseeuw, P.J.: Least median of squares regression. J. Am. Stat. Assoc. 79, 871–880 (1984)
43. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948)
44. Xie, S., Zhou, Y., Wan, A.: A varying-coefficient expectile model for estimating value at risk. J. Bus. Econ. Stat. 32(4), 576–592 (2014)
45. Zabarankin, M., Uryasev, S.: Statistical Decision Problems: Selected Concepts and Portfolio Safeguard Case Studies. Springer, Berlin (2014)

Acknowledgements

We are grateful to the referees for the comments and suggestions, which helped to improve the quality of the paper. The first author thanks the University of Leicester for granting him the academic study leave to do this research.

Author information

Authors and Affiliations

Department of Mathematics, University of Leicester, Leicester, LE1 7RH, UK
Bogdan Grechuk
Department of Mathematical Sciences, Stevens Institute of Technology, Hoboken, NJ, 07030, USA
Michael Zabarankin

Corresponding author

Correspondence to Michael Zabarankin.

Appendix A: Proofs of Propositions 1–6

Appendix A.1: Proof of Proposition 1

Since $\mathcal{E}(Z)$ assumes all values in $[0,+\infty )$, the range of h is $[0,+\infty )$, hence it is continuous and $h(0)=0$. This implies that h has a strictly increasing continuous inverse function $h^{-1}:\mathbb {R}^+\rightarrow \mathbb {R}^+$, and

$$\begin{aligned} h^{-1}(\mathcal{E}(Z))=h^{-1}[h(\mathbb {E}[\rho (Z)])]=\mathbb {E}[\rho (Z)]. \end{aligned}$$

For constant $Z=t\geqslant 0$,

$$\begin{aligned} \rho (t)=\mathbb {E}[\rho (t)]=h^{-1}(\mathcal{E}(t))=h^{-1}(|t|\mathcal{E}(1)). \end{aligned}$$

Similarly, $\rho (t)=h^{-1}(|t|\mathcal{E}(-1))$ for $t\leqslant 0$. Consequently, in general,

$$\begin{aligned} \rho (t)=h^{-1}\left( a\,[t]_+ +b\,[t]_-\right) , \end{aligned}$$

where $a=\mathcal{E}(1)>0$ and $b=\mathcal{E}(-1)>0$. Thus,

$$\begin{aligned} \mathcal{E}(Z)=\varphi ^{-1}\left( \mathbb {E}\left[ \varphi \left( \,a\,[Z]_++b\,[Z]_-\,\right) \right] \right) , \end{aligned}$$

(28)

where $\varphi =h^{-1}$.

Since $\Theta =(\Omega , \mathcal{M}, \mathbb {P})$ is non-trivial, there exists an event $A\in \mathcal{M}$ such that $p=\mathbb {P}[A]\in (0,1)$. For any non-negative constants c and d, let Z be an r.v. assuming values $Z(\omega )=c/a\geqslant 0$ and $Z(\omega )=d/a\geqslant 0$ for $\omega \in A$ and $\omega \not \in A$, respectively. Then

$$\begin{aligned} \begin{aligned} \varphi ^{-1}\left[ p \varphi (\lambda c) + (1-p)\varphi (\lambda d)\right]&= \mathcal{E}(\lambda \,Z) = \lambda \,\mathcal{E}(Z) \\&= \lambda \varphi ^{-1}\left[ p \varphi (c) + (1-p)\varphi (d)\right] \end{aligned} \end{aligned}$$

(29)

for any $\lambda \geqslant 0$. Replacing c and d by $\varphi ^{-1}(c)$ and $\varphi ^{-1}(d)$, respectively, and applying $\varphi (\cdot )$ to the left-hand and right-hand parts of (29), we obtain

$$\begin{aligned} p \varphi (\lambda \varphi ^{-1}(c)) + (1-p)\varphi (\lambda \varphi ^{-1}(d)) = \varphi (\lambda \varphi ^{-1}(pc + (1-p)d)). \end{aligned}$$

Consequently, the function $g(x)=\varphi (\lambda \varphi ^{-1}(x))$ satisfies

$$\begin{aligned} pg(c)+(1-p)g(d)=g(pc + (1-p)d) \quad \forall c,d\geqslant 0. \end{aligned}$$

(30)

Let

$$\begin{aligned} \mathcal{A}=\{a\in [0,1] \, : \, a g(c) + (1-a) g(d) = g(a c + (1-a)d) \,\, \forall c,d\geqslant 0 \}. \end{aligned}$$

By definition, $0\in \mathcal{A}$ and $1\in \mathcal{A}$. Also, (30) implies that $p a + (1-p)b \in \mathcal{A}$ whenever $a,b\in \mathcal{A}$, hence $\mathcal{A}$ is a dense subset of [0, 1]. Finally, $\mathcal{A}$ is closed due to continuity of g, so that $\mathcal{A}=[0,1]$, and g is a linear function. Since $g(0)=\varphi (\lambda \varphi ^{-1}(0))=0$, there exists a constant $C(\lambda )$ such that

$$\begin{aligned} \varphi (\lambda \varphi ^{-1}(x))=g(x)=C(\lambda )x \quad \forall x, \lambda \geqslant 0. \end{aligned}$$

(31)

Setting $x=\varphi (y)$ in (31), we obtain

$$\begin{aligned} \varphi (\lambda y)=C(\lambda )\varphi (y) \quad \forall y, \lambda \geqslant 0. \end{aligned}$$

(32)

Then setting $y=1$ in (32), we obtain $\varphi (\lambda )=C(\lambda )\varphi (1)$. Consequently, $C(\lambda )=\varphi (\lambda )/\varphi (1)$, and (32) takes the form $\varphi (\lambda y)=\varphi (\lambda )\varphi (y)/\varphi (1)\quad \forall y, \lambda \geqslant 0$. For the function

$$\begin{aligned} g(x)=\log \frac{\varphi (e^x)}{\varphi (1)}, \end{aligned}$$

this implies that

$$\begin{aligned} g(x+y)= \log \frac{\varphi (e^{x+y})}{\varphi (1)} = \log \frac{\varphi (e^{x})\varphi (e^{y})}{\varphi (1)^2}=g(x)+g(y). \end{aligned}$$

Since g is additive, continuous, and $g(0)=0$, it is linear, i.e., $g(x)=px$ for some constant p. Consequently, $e^{px}=e^{g(x)}=\varphi (e^x)/\varphi (1)$. Finally, with $e^x=y$, we obtain $\varphi (y)=\varphi (1)y^p$, and (28) simplifies to

$$\begin{aligned} \mathcal{E}(Z)=\left( \mathbb {E}\left[ \left( a\,[Z]_+ +b\,[Z]_-\right) ^p\right] \right) ^{1/p}. \end{aligned}$$

The condition $p\geqslant 1$ follows from sub-additivity of $\mathcal{E}$.
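As a numerical illustration of the argument above (a sketch, not part of the original proof), the following Python snippet evaluates $\varphi ^{-1}(\mathbb {E}[\varphi (a\,[Z]_+ +b\,[Z]_-)])$ from (28) for a two-point r.v. and compares $\mathcal{E}(\lambda Z)$ with $\lambda \mathcal{E}(Z)$. The function names, the sample values, and the choice $\varphi (y)=e^y-1$ as a non-power alternative are assumptions made only for this example.

```python
import math

def asym_abs(t, a, b):
    """a*[t]_+ + b*[t]_- with [t]_+ = max(t, 0) and [t]_- = max(-t, 0)."""
    return a * max(t, 0.0) + b * max(-t, 0.0)

def quasi_mean_error(values, probs, phi, phi_inv, a=1.0, b=1.0):
    """phi_inv(E[phi(a*[Z]_+ + b*[Z]_-)]) for a discrete r.v. Z, as in (28)."""
    mean = sum(p * phi(asym_abs(z, a, b)) for z, p in zip(values, probs))
    return phi_inv(mean)

# Hypothetical two-point r.v. and scaling factor.
values, probs = [1.0, 3.0], [0.4, 0.6]
lam = 2.5
scaled = [lam * z for z in values]

# Power function phi(y) = y**p: positive homogeneity should hold.
p = 3.0
power = lambda y: y ** p
power_inv = lambda x: x ** (1.0 / p)
e_z = quasi_mean_error(values, probs, power, power_inv, a=2.0, b=0.5)
e_lam_z = quasi_mean_error(scaled, probs, power, power_inv, a=2.0, b=0.5)
homogeneity_gap_power = abs(e_lam_z - lam * e_z)

# Non-power alternative phi(y) = exp(y) - 1: positive homogeneity breaks.
g_z = quasi_mean_error(values, probs, math.expm1, math.log1p)
g_lam_z = quasi_mean_error(scaled, probs, math.expm1, math.log1p)
homogeneity_gap_exp = abs(g_lam_z - lam * g_z)

print(homogeneity_gap_power, homogeneity_gap_exp)
```

The first gap vanishes up to rounding, while the second does not, which is consistent with the conclusion that only power functions $\varphi $ are compatible with positive homogeneity.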

Appendix A.2: Proof of Proposition 2

Proposition 4.7 (b) in [11] implies that if $Z^*\in \mathcal{C}^1(\Theta )$ has a log-concave PDF, then it is a solution to

$$\begin{aligned} \max _{Z\in \mathcal{C}^1(\Theta )} S(Z)\quad \text {subject to}\quad \mathbb {E}[Z]=\mu , \quad \mathcal{D}(Z)\leqslant 1, \end{aligned}$$

(33)

for $\mu =\mathbb {E}[Z^{*}]$ and some law-invariant deviation measure $\mathcal{D}$ (see footnote 5 in the Notes). Hence $Z^*$ is a solution to (13) with $\mathcal{X}=\{Z\in \mathcal{C}^1(\Theta )\,|\,\mathbb {E}[Z]=\mu ,\,\mathcal{D}(Z)\leqslant 1\}$.

Conversely, let $Z^*\in \mathcal{C}^1(\Theta )$ be a solution to (13) for some convex closed law-invariant set $\mathcal{X}$. Then it is a solution to (33) for the deviation measure

$$\begin{aligned} \mathcal{D}(Z)=\sup \limits _{\alpha \in [0,1]}\frac{\mathrm{CVaR}_\alpha ^\Delta (Z)}{\mathrm{CVaR}_\alpha ^\Delta (Z^*)} \quad \hbox { for all}\ Z\in \mathcal{L}^1(\Theta ), \end{aligned}$$

(34)

where

$$\begin{aligned} \mathrm{CVaR}_\alpha ^\Delta (Z)\equiv \mathbb {E}[Z]-\frac{1}{\alpha }\int \nolimits _{0}^{\alpha }q_Z(s)\,ds, \quad \alpha \in (0,1), \end{aligned}$$

$\mathrm{CVaR}_{0}^\Delta (Z)=\mathbb {E}[Z]-\inf Z$ and $\mathrm{CVaR}_{1}^\Delta (Z)=\sup Z - \mathbb {E}[Z]$, see [14]. Indeed, if an r.v. Z satisfies the constraints in (33) with $\mathcal{D}$ given by (34), then $ \mathbb {E}[Z]=\mu =\mathbb {E}[Z^*]$, and $\mathrm{CVaR}_\alpha ^\Delta (Z)\leqslant \mathrm{CVaR}_\alpha ^\Delta (Z^*)$ for all $\alpha \in [0,1]$, so that Z dominates $Z^*$ with respect to concave ordering, see Proposition 1 in [14]. Since $Z^*$ has a PDF, the underlying probability space $\Theta $ is, by definition, atomless, and part “(a) to (d)” of Corollary 2.61 in [9] along with Lemma 4.2 in [22] implies that $Z \in \mathcal{X}$. Since $Z^*\in \mathcal{C}^1(\Theta )$ is a solution to (13), this yields $S(Z^*)\geqslant S(Z)$, and consequently, $Z^*$ is a solution to (33). Thus, $Z^*$ has a log-concave PDF by Proposition 4.11 in [11].
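The CVaR deviation defined above can be evaluated exactly for an empirical distribution. The sketch below assumes equally likely sample points, so that $q_Z$ is a step function, and approximates the supremum in (34) by a maximum over a finite grid of $\alpha \in (0,1)$ (the endpoints $\alpha =0$ and $\alpha =1$ are left out). The function names and the sample are hypothetical.

```python
def cvar_deviation(sample, alpha):
    """Empirical E[Z] - (1/alpha) * int_0^alpha q_Z(s) ds for 0 < alpha < 1.

    Sample points are treated as equally likely, so q_Z equals the
    i-th smallest value on the interval ((i-1)/n, i/n).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    z = sorted(sample)
    n = len(z)
    mean = sum(z) / n
    tail_integral = 0.0
    for i, value in enumerate(z):
        lo, hi = i / n, (i + 1) / n
        # Length of the part of (lo, hi) that lies inside (0, alpha).
        overlap = max(0.0, min(hi, alpha) - lo)
        tail_integral += value * overlap
    return mean - tail_integral / alpha

def sup_ratio_deviation(sample, reference, alphas):
    """Grid approximation of sup_alpha CVaR_dev(sample) / CVaR_dev(reference)."""
    return max(cvar_deviation(sample, al) / cvar_deviation(reference, al)
               for al in alphas)

z_star = [1.0, 2.0, 3.0, 4.0]
grid = [k / 20 for k in range(1, 20)]
half_level = cvar_deviation(z_star, 0.5)
ratio = sup_ratio_deviation([2.0 * z for z in z_star], z_star, grid)
print(half_level, ratio)
```

Since the CVaR deviation is positively homogeneous, doubling the sample doubles every ratio, so the grid maximum equals 2 up to rounding and the doubled sample violates the constraint $\mathcal{D}(Z)\leqslant 1$.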

Appendix A.3: Proof of Proposition 3

If $Z^*\in \mathcal{C}^1(\Theta )$ has a log-concave PDF, then it is a solution to (33) for some law-invariant deviation measure $\mathcal{D}$. On the other hand, Proposition 5.1 in [45] shows that problem (33) is equivalent to (14) with an error measure $\mathcal{E}$ such that $\mathcal{D}(Z)=\inf _{C\in \mathbb {R}} \mathcal{E}(Z-C)$, i.e., $\mathcal{D}$ is the deviation measure projected from $\mathcal{E}$. In general, for a given deviation measure $\mathcal{D}$, such an error measure is non-unique and can be determined by

$$\begin{aligned} \mathcal{E}(Z)=\frac{1}{1+\mu }\left( \mathcal{D}(Z)+|\mathbb {E}[Z]|\right) , \end{aligned}$$

(35)

which is called the inverse projection of $\mathcal{D}$, see [39]. Thus, $Z^*$ is a solution to (14) with $\mathcal{E}$ given by (35).

Conversely, let $Z^*\in \mathcal{C}^1(\Theta )$ be a solution to (14) for some law-invariant error measure $\mathcal{E}$. Then positive homogeneity of $\mathcal{E}$ and the relation $S(kZ)=S(Z)+\ln k$, $k>0$, imply that $Z^*$ is also a solution to

$$\begin{aligned} \max _{Z\in \mathcal{L}^r(\Theta )} S(Z)\quad \text {subject to}\quad \mathcal{E}(Z)\leqslant 1. \end{aligned}$$

Since $\{Z\,|\, \mathcal{E}(Z)\leqslant 1\}$ is a convex closed law-invariant set, $Z^*$ has a log-concave PDF by Proposition 2.
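The scaling relation $S(kZ)=S(Z)+\ln k$ used in this proof can be checked directly, assuming that $S$ denotes the differential entropy $S(Z)=-\int f_Z(t)\ln f_Z(t)\,dt$ of an r.v. $Z$ with PDF $f_Z$ (the definition is not restated in this appendix). For $k>0$, the PDF of $kZ$ is $f_{kZ}(t)=f_Z(t/k)/k$, and the substitution $s=t/k$ gives

$$\begin{aligned} S(kZ)&=-\int _{-\infty }^\infty \frac{1}{k}f_Z\left( \frac{t}{k}\right) \ln \left[ \frac{1}{k}f_Z\left( \frac{t}{k}\right) \right] dt \\&=-\int _{-\infty }^\infty f_Z(s)\left[ \ln f_Z(s)-\ln k\right] ds \\&=S(Z)+\ln k. \end{aligned}$$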

Appendix A.4: Proof of Proposition 4

If $\mathcal{E}$ and f satisfy the conditions of Proposition 4, then $\mathcal{E}$ and $\rho (t) = -\log (f(t))$ satisfy the conditions of Proposition 1. Consequently, $\rho $ has the form in (12), which implies that $f(t)=e^{-\rho (t)}$ has the form of (2b).

Appendix A.5: Proof of Proposition 5

Since h is strictly increasing, problem (8) with $\mathcal{E}^*$ is equivalent to minimizing $\mathbb {E}[\rho ^*(Z)]$ or to maximizing $\mathbb {E}[\ln (f^*(Z))]$. For an r.v. Z such that $\mathbb {P}[Z=z_i]=1/n$, $i=1,\dots ,n$, it reduces to (6).

With $c=h\left( - \int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt\right) $, the constraint $\mathcal{E}^*(Z)= c$ in (19) simplifies to

$$\begin{aligned} \int _{-\infty }^\infty f(t)\ln f^*(t)\,dt = \int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt, \end{aligned}$$

which holds for $f=f^*$ and, for any $f \ne f^*$ satisfying it, implies that

$$\begin{aligned} -\int _{-\infty }^\infty f(t)\ln f(t)\,dt&\leqslant -\int _{-\infty }^\infty f(t)\ln f^*(t)\,dt \\&= -\int _{-\infty }^\infty f^*(t)\ln f^*(t)\,dt, \end{aligned}$$

where the inequality follows from the non-negativity of relative entropy (the Kullback-Leibler divergence between f and $f^*$), defined as $D_{KL}(f||f^*)=\int _{-\infty }^\infty f(t)\ln \frac{f(t)}{f^*(t)}\,dt \geqslant 0$, see [25].
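The entropy comparison above can be illustrated numerically. The sketch below takes $f^*$ to be the unit-scale Laplace PDF and $f$ to be a zero-mean normal PDF whose standard deviation $\sqrt{\pi /2}$ is chosen so that $\int f\ln f^* = \int f^*\ln f^*$ (both equal $-1-\ln 2$, since $\mathbb {E}|Z|=1$ in either case). This pair of densities, the integration range, and the helper names are assumptions for illustration only.

```python
import math

def integrate(func, lo, hi, steps=40_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (k + 0.5) * h) for k in range(steps))

def xlogx(v):
    """v * ln(v) with the convention 0 * ln(0) = 0."""
    return v * math.log(v) if v > 0.0 else 0.0

def laplace(t):
    """Unit-scale Laplace PDF, playing the role of f*."""
    return 0.5 * math.exp(-abs(t))

SIGMA = math.sqrt(math.pi / 2.0)  # gives E|Z| = 1 under the normal PDF

def normal(t):
    """Zero-mean normal PDF with standard deviation SIGMA, playing the role of f."""
    return math.exp(-t * t / (2.0 * SIGMA ** 2)) / (SIGMA * math.sqrt(2.0 * math.pi))

LO, HI = -40.0, 40.0  # tails beyond this range are negligible for both PDFs
cross_star = integrate(lambda t: laplace(t) * math.log(laplace(t)), LO, HI)
cross_f = integrate(lambda t: normal(t) * math.log(laplace(t)), LO, HI)
entropy_star = -integrate(lambda t: xlogx(laplace(t)), LO, HI)
entropy_f = -integrate(lambda t: xlogx(normal(t)), LO, HI)

print(cross_star, cross_f, entropy_star, entropy_f)
```

Under these assumptions the two cross terms agree, so $f$ satisfies the constraint, and the entropy of $f$ comes out strictly smaller than the entropy of $f^*$, as the inequality predicts.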

Appendix A.6: Proof of Proposition 6

We first prove the “if” part in (a) and (b). If $\mathcal{E}$ is a particular case of (2a), it is an error measure that can be represented in the form of (11), which is (21) with $M$ being the Lebesgue measure on (0, 1), and the “if” part in (a) follows. If $\mathcal{E}$ is a particular case of (25), then it can be represented in the form of (23) with $M(c,d)=\int _c^d w(\alpha ) \, d\alpha $ for $0\leqslant c<d\leqslant 1$, $\rho (t)=t_{a,b}^p$, and $h(x)=x^{1/p}$. For $Z\ne 0$, $q_{Z_{a,b}}^p(\alpha )$ is a non-negative non-decreasing function with $\int _0^1 q_{Z_{a,b}}^p(\alpha ) \,d\alpha > 0$, so that $L=\lim \limits _{\alpha \rightarrow 1} q_{Z_{a,b}}^p(\alpha ) > 0$, and we claim that

$$\begin{aligned} I=\int _0^1 w(\alpha )\,q_{Z_{a,b}}^p(\alpha )\,d\alpha > 0. \end{aligned}$$

(36)

Indeed, if $w(\alpha )$ is a delta function at 1, (36) reduces to $I=L>0$. Otherwise $\lim \limits _{\alpha \rightarrow 1} w(\alpha ) > 0$, hence $w(\alpha ^*)>0$ and $q_{Z_{a,b}}^p(\alpha ^*)>0$ for some $\alpha ^*<1$. Since both $w$ and $q_{Z_{a,b}}^p$ are non-negative and non-decreasing, $I \geqslant \int _{\alpha ^*}^1 w(\alpha ^*)q_{Z_{a,b}}^p(\alpha ^*)\,d\alpha = (1-\alpha ^*)w(\alpha ^*)q_{Z_{a,b}}^p(\alpha ^*) > 0$.

Inequality $I>0$ implies that $\mathcal{E}(Z)$ is well-defined and satisfies E1. Property E2 is obvious, whereas E4 is proved for $w(\alpha )=1$ in [38, Proposition 6], and the general case holds by a similar argument. Next, we claim that

$$\begin{aligned} \mathcal{E}(X+Y) \leqslant \left( \int _0^1 w(\alpha )\,(q_{X_{a,b}}+q_{Y_{a,b}})^p(\alpha )\,d\alpha \right) ^{1/p} \leqslant \mathcal{E}(X) + \mathcal{E}(Y) \end{aligned}$$

(37)

holds for all $X,Y \in \mathcal{L}^r(\Theta )$. Indeed, the second inequality in (37) is a triangle inequality for the $\mathcal{L}^p[0,1]$-norm, and the first one states that

$$\begin{aligned} \int _0^1 w(\alpha )\,f(\alpha )\,d\alpha \leqslant \int _0^1 w(\alpha )\,g(\alpha )\,d\alpha \end{aligned}$$

(38)

for $f(\alpha )=q_{(X+Y)_{a,b}}^p(\alpha )$ and $g(\alpha )=(q_{X_{a,b}}(\alpha )+q_{Y_{a,b}}(\alpha ))^p$.

If $f, g \in \mathcal{L}^r[0,1]$ are such that (38) holds for any non-negative non-decreasing $w\in \mathcal{L}^1[0,1]$, we write $g \succcurlyeq f$. The relation $\succcurlyeq $ has the following properties:

(i) it is transitive;
(ii) it is monotone, in the sense that $f_1(\alpha ) \geqslant f_2(\alpha )$ for all $\alpha \in [0,1]$ implies that $f_1 \succcurlyeq f_2$;
(iii) $q_{X}(\alpha ) + q_{Y}(\alpha ) \succcurlyeq q_{X+Y}(\alpha )$ for any r.v.’s $X,Y \in \mathcal{L}^r(\Theta )$ due to sub-additivity of the functional $\mathcal{F}(Z) = \int _0^1 w(\alpha ) \, q_Z(\alpha ) \, d\alpha $, see [13, Proposition 4.3];
(iv) $f_1 \succcurlyeq f_2$ is equivalent to $\int _c^1 f_1(\alpha )\,d\alpha \geqslant \int _c^1 \,f_2(\alpha )\,d\alpha $ for all $c\in (0,1)$, which, in turn, is equivalent to $\int _0^1 u(f_1(\alpha ))\,d\alpha \geqslant \int _0^1 u(f_2(\alpha ))\,d\alpha $ for all convex increasing u, see [35, Theorem 8]; and
(v) $f_1 \succcurlyeq f_2$ implies that $u(f_1) \succcurlyeq u(f_2)$ for any convex increasing function u, which follows from (iv) and the fact that the superposition of two convex increasing functions is convex increasing.

Properties (i)–(iii) imply that

$$\begin{aligned} q_{X_{a,b}}+q_{Y_{a,b}} \succcurlyeq q_{X_{a,b}+Y_{a,b}} \succcurlyeq q_{(X+Y)_{a,b}}, \end{aligned}$$

and since the function $\xi (z)=z^p$ is convex increasing for $z\geqslant 0$, (38) follows from (v). This finishes the proof of “if” part in (b).
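The role of the non-decreasing weight in the sub-additivity argument above can be seen in a small experiment. The sketch assumes that the functional has the form $\left( \int _0^1 w(\alpha )\,q_{Z_{a,b}}^p(\alpha )\,d\alpha \right) ^{1/p}$ with $Z_{a,b}=a\,[Z]_+ +b\,[Z]_-$, evaluated on $n$ equally likely outcomes, so that the integral becomes a sum of sorted values weighted by the bin integrals of $w$. All names, weights, and samples below are hypothetical.

```python
import random

def asym_abs(t, a, b):
    """a*[t]_+ + b*[t]_- with [t]_+ = max(t, 0) and [t]_- = max(-t, 0)."""
    return a * max(t, 0.0) + b * max(-t, 0.0)

def weighted_quantile_error(sample, bin_weights, a=1.0, b=1.0, p=2.0):
    """(sum_i w_i * v_(i)**p)**(1/p), where v_(1) <= ... <= v_(n) are the sorted
    values of a*[z]_+ + b*[z]_- and w_i is the integral of w over ((i-1)/n, i/n)."""
    v = sorted(asym_abs(z, a, b) for z in sample)
    return sum(w * x ** p for w, x in zip(bin_weights, v)) ** (1.0 / p)

def subadditivity_gap(x, y, bin_weights, **kw):
    """E(X+Y) - E(X) - E(Y); a positive value means sub-additivity is violated."""
    s = [xi + yi for xi, yi in zip(x, y)]
    return (weighted_quantile_error(s, bin_weights, **kw)
            - weighted_quantile_error(x, bin_weights, **kw)
            - weighted_quantile_error(y, bin_weights, **kw))

# Random paired samples with non-decreasing bin weights: no violation expected.
rng = random.Random(0)
increasing = [0.0, 0.1, 0.2, 0.3, 0.4]
worst_gap = float("-inf")
for _ in range(2000):
    x = [rng.gauss(0.0, 1.0) for _ in increasing]
    y = [rng.gauss(0.0, 1.0) for _ in increasing]
    worst_gap = max(worst_gap, subadditivity_gap(x, y, increasing, a=2.0, b=1.0, p=3.0))

# Deterministic counterexample with decreasing bin weights.
decreasing = [0.5, 0.5, 0.0, 0.0]
bad_gap = subadditivity_gap([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], decreasing)

print(worst_gap, bad_gap)
```

With the decreasing weights, each of the two indicator-like samples has zero error while their sum has error 1, so sub-additivity fails. This is the kind of behaviour the monotonicity requirement on $w$ excludes.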

Now we prove the “only if” part. Let $\mathcal{E}$ be an error measure that can be represented in the form of either (21) or (23). Since $\mathcal{E}(Z)$ assumes all values in $[0,+\infty )$, $h$ is a strictly increasing continuous function with $h(0)=0$ and has a strictly increasing continuous inverse function $h^{-1}:\mathbb {R}^+\rightarrow \mathbb {R}^+$. Applying $h^{-1}$ to both sides of either (21) or (23) and setting $Z=t$, we obtain

$$\begin{aligned} h^{-1}(\mathcal{E}(t)) = \int _0^1 \rho (t) M(d\alpha ) = \rho (t) M(0,1), \quad t \in {{\mathbb {R}}}. \end{aligned}$$

Consequently, $M(0,1)\ne 0$ and $\rho (t) = \frac{1}{M(0,1)}h^{-1}(\mathcal{E}(t))$. If M and $\rho $ are replaced by $-M$ and $-\rho $, respectively, then $\mathcal{E}$ in (21) remains unchanged. Hence, without loss of generality, we may assume that $M(0,1)>0$. Positive homogeneity of $\mathcal{E}$ implies that

$$\begin{aligned} \rho (t)=\frac{1}{M(0,1)}\varphi \left( t_{a,b}\right) , \end{aligned}$$

where $\varphi =h^{-1}$, $t_{a,b}$ is given by (3), $a=\mathcal{E}(1)>0$ and $b=\mathcal{E}(-1)>0$. In particular, both (21) and (23) imply that

$$\begin{aligned} \mathcal{E}(Z)=\varphi ^{-1}\left( \frac{1}{M(0,1)}\int _0^1 q_{\varphi \left( aZ\right) }(\alpha )\,M(d\alpha )\right) , \quad Z\geqslant 0, \end{aligned}$$

(39)

where we used $q_{\varphi \left( aZ\right) }(\alpha )=\varphi (q_{aZ}(\alpha ))$.

If $M(0,\alpha )=0$ for all $\alpha <1$, (21) reduces to $\mathcal{E}(Z)= a\,[\sup \, Z]_+ +b\,[\sup \, Z]_-$, which is not an error measure (property E1 fails), whereas (23) simplifies to $\mathcal{E}(Z)=\sup (Z_{a,b})$, which is a particular case of (25) with w being the Dirac delta function at 1. Otherwise there exists $\alpha \in (0,1)$ such that $q=M(0,\alpha )/M(0,1)>0$. Since $\Theta $ is atomless, there exists an event $A\in \mathcal{M}$ with $\mathbb {P}[A]=\alpha $. Let $0 \leqslant c \leqslant d$, and let Z be an r.v. such that $Z(\omega )=c/a$ for $\omega \in A$ and $Z(\omega )=d/a$ for $\omega \not \in A$. Then (39) implies that

$$\begin{aligned} \begin{aligned} \varphi ^{-1}\left[ q \varphi (\lambda c) + (1-q)\varphi (\lambda d)\right]&= \mathcal{E}(\lambda \,Z)= \lambda \,\mathcal{E}(Z) \\&= \lambda \varphi ^{-1}\left[ q \varphi (c) + (1-q)\varphi (d)\right] \end{aligned} \end{aligned}$$

(40)

for any $\lambda \geqslant 0$. Expression (40) coincides with (29), with $q$ in place of the probability $p$, and the proof of Proposition 1 implies that $\varphi $ has the form $\varphi (y)=\varphi (1)y^p$, $p>0$. Consequently,

$$\begin{aligned} h(z)=\left( \frac{z}{\varphi (1)} \right) ^{1/p} = h(1) z^{1/p}, \end{aligned}$$

(41)

and

$$\begin{aligned} \rho (t)=\frac{\varphi (1)}{M(0,1)}t_{a,b}^p. \end{aligned}$$

(42)

In particular, (39) simplifies to

$$\begin{aligned} \mathcal{E}(Z)=\left( \frac{a^p}{M(0,1)}\int _0^1 q_Z(\alpha )^p\,M(d\alpha )\right) ^{1/p}, \quad Z\geqslant 0. \end{aligned}$$

(43)

Let $0=\alpha _0\leqslant \alpha _1<\alpha _2<\alpha _3\leqslant \alpha _4=1$ be such that $\alpha _2-\alpha _1=\alpha _3-\alpha _2$, and let

$$\begin{aligned} M_i=\frac{1}{M(0,1)}\int _{\alpha _{i-1}}^{\alpha _i} M(d\alpha ),\qquad i=1,2,3,4. \end{aligned}$$

Since $\Theta $ is atomless, there exist events $A,B \in \mathcal{M}$ such that $\mathbb {P}[A]=\mathbb {P}[B]=\alpha _2$ and $\mathbb {P}[A \cap B]=\alpha _1$. Subadditivity of $\mathcal{E}$ implies that

$$\begin{aligned} \left[ \mathcal{E}\left( 1+\epsilon I_{\Omega \setminus A}\right) + \mathcal{E}\left( 1+\epsilon I_{\Omega \setminus B}\right) \right] ^p \geqslant \mathcal{E}\left( 2+\epsilon I_{\Omega \setminus A} + \epsilon I_{\Omega \setminus B}\right) ^p \quad \forall \epsilon >0, \end{aligned}$$

where I is an indicator function. With (43), this yields

$$\begin{aligned} 2^p\left( M_1+M_2+(1+\epsilon )^p(M_3+M_4) \right) \geqslant 2^p M_1 + (2+\epsilon )^p(M_2+M_3) + (2+2\epsilon )^p M_4, \end{aligned}$$

which simplifies to

$$\begin{aligned}{}[(2+2\epsilon )^p - (2+\epsilon )^p] M_3\geqslant [(2+\epsilon )^p-2^p] M_2. \end{aligned}$$

(44)

Dividing both sides of (44) by $\epsilon >0$ and taking the limit $\epsilon \rightarrow 0^+$, we obtain $p2^{p-1}M_3\geqslant p2^{p-1}M_2$, or $M_3\geqslant M_2$. This implies that the measure $M(d\alpha )$ has a non-decreasing density $w$ on [0, 1], which may include Dirac delta functions at the ends of the interval.

By selecting $\alpha _1=\alpha _2-\delta $ and $\alpha _3=\alpha _2+\delta $ and by taking $\delta \rightarrow 0^+$, we can make $M_3$ arbitrarily close to $M_2$. Consequently, (44) may hold only if $(2+2\epsilon )^p - (2+\epsilon )^p\geqslant (2+\epsilon )^p-2^p$. With $\epsilon =1$, this inequality reduces to $4^p - 2\cdot 3^p + 2^p\geqslant 0$ and implies that $p\geqslant 1$. If $\mathcal{E}$ can be represented in the form of (23), inequality $p \ge 1$ along with (41) and (42) yields (25). Moreover, $\int _0^1 w(\alpha )d\alpha =M[0,1]>0$. To prove (b), it is left to verify that w is non-negative.
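The $\epsilon =1$ inequality $4^p - 2\cdot 3^p + 2^p\geqslant 0$ used above is a midpoint convexity condition for $x\mapsto x^p$ at the points 2, 3 and 4, which for $p>0$ holds exactly when $p\geqslant 1$. A quick numerical check, with a hypothetical helper name and an arbitrary set of test exponents, is sketched below.

```python
def second_difference(p):
    """4**p - 2 * 3**p + 2**p, the epsilon = 1 case of the inequality."""
    return 4.0 ** p - 2.0 * 3.0 ** p + 2.0 ** p

# Sign of the expression for exponents on both sides of p = 1.
for p in (0.25, 0.5, 0.9, 1.0, 1.5, 2.0, 3.0):
    print(p, second_difference(p))
```

The expression is negative for the exponents below 1, zero at $p=1$, and positive above it.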

Let $a\geqslant b$ in (25)—the case $a \leqslant b$ is treated similarly. Since $\Theta $ is atomless, for every $\alpha \in (0,1/2]$, there exist events $A,B \in \mathcal{M}$ such that $\mathbb {P}[A]=\mathbb {P}[B]=\alpha $ and $\mathbb {P}[A \cap B]=0$. Subadditivity of $\mathcal{E}$ implies that

$$\begin{aligned} \mathcal{E}\left( 1-2 I_A\right) + \mathcal{E}\left( 1-2 I_B\right) \geqslant \mathcal{E}\left( 2-2 I_{A\cup B}\right) . \end{aligned}$$

With (25), this yields

$$\begin{aligned} 2 \left( b^p M(0,\alpha ) + a^p M(\alpha , 1)\right) ^{1/p} \geqslant \left( (2a)^p M(2\alpha ,1)\right) ^{1/p}, \end{aligned}$$

which simplifies to

$$\begin{aligned} a^p M(\alpha , 2\alpha ) \geqslant - b^p M(0,\alpha ) \quad \forall \alpha \in (0,1/2]. \end{aligned}$$

(45)

Let $\alpha ^*=\sup \{\alpha : w(\alpha )<0\}$. Since $w(\alpha )$ is non-decreasing, (45) fails for $\alpha =\alpha ^*/2$, and consequently, $\alpha ^*=0$. Then $\lim \limits _{\alpha \rightarrow 0} M(\alpha , 2\alpha ) \leqslant \lim \limits _{\alpha \rightarrow 0} \alpha w(2\alpha ) = 0$, so that $\lim \limits _{\alpha \rightarrow 0} M(0, \alpha ) \geqslant 0$ by (45), which implies that w has no negative delta function at 0 as well. This finishes the proof of (b).

Finally, suppose that $\mathcal{E}$ has the form of (21). Then an analogue of (43) for negative r.v.’s is given by

$$\begin{aligned} \mathcal{E}(Z)=\left( \frac{b^p}{M(0,1)}\int _0^1 |q_Z(\alpha )|^p\,M(d\alpha )\right) ^{1/p}, \quad Z\leqslant 0. \end{aligned}$$

(46)

Since $q_{-Z}(\alpha )=-q_{Z}(1-\alpha )$ for almost all $\alpha \in (0,1)$, (46) can be written as

$$\begin{aligned} \mathcal{E}(Z')=\left( \frac{b^p}{M(0,1)}\int _0^1 |q_{Z'}(\alpha )|^p\,M'(d\alpha )\right) ^{1/p}, \quad Z'\geqslant 0, \end{aligned}$$

where $Z'=-Z$ and $M'$ is a measure such that $M'(c,d)=M(1-d,1-c)$ for any interval $(c,d)$. The last expression has the same form as (43), and the same argument implies that $M'(d\alpha )$ has a non-decreasing density $w'$ on (0, 1). Since $w'(\alpha )=w(1-\alpha )$, $\alpha \in (0,1)$, both $w$ and $w'$ may be non-decreasing only if $w$ is constant, which along with (41) and (42) yields (2a) and proves (a).

About this article

Cite this article

Grechuk, B., Zabarankin, M. Regression analysis: likelihood, error and entropy. Math. Program. 174, 145–166 (2019). https://doi.org/10.1007/s10107-018-1256-6

Received: 23 February 2017
Accepted: 02 March 2018
Published: 23 March 2018
Issue Date: 01 March 2019
DOI: https://doi.org/10.1007/s10107-018-1256-6
