Article

Robust Relative Error Estimation

1 Institute of Mathematics for Industry, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
2 RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
3 Faculty of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka 819-0395, Japan
* Author to whom correspondence should be addressed.
Entropy 2018, 20(9), 632; https://doi.org/10.3390/e20090632
Submission received: 11 July 2018 / Revised: 15 August 2018 / Accepted: 20 August 2018 / Published: 24 August 2018

Abstract

Relative error estimation has recently been used in regression analysis. A crucial issue of the existing relative error estimation procedures is that they are sensitive to outliers. To address this issue, we employ the γ-likelihood function, which is constructed through the γ-cross entropy while keeping the original statistical model in use. The estimating equation has a redescending property, a desirable property in robust statistics, for a broad class of noise distributions. To find a minimizer of the negative γ-likelihood function, a majorize-minimization (MM) algorithm is constructed. The proposed algorithm is guaranteed to decrease the negative γ-likelihood function at each iteration. We also derive the asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, so that we can readily construct approximate confidence sets. A Monte Carlo simulation is conducted to investigate the effectiveness of the proposed procedure. A real data analysis illustrates the usefulness of the proposed procedure.

1. Introduction

In regression analysis, many analysts use the (penalized) least squares estimation, which aims at minimizing the mean squared prediction error [1]. On the other hand, the relative (percentage) error is often more useful and/or adequate than the mean squared error. For example, in econometrics, the comparison of prediction performance between different stock prices with different units should be made by relative error; we refer to [2,3] among others. Additionally, the prediction error of photovoltaic power production or electricity consumption is evaluated by not only mean squared error but also relative error (see, e.g., [4]). We refer to [5] regarding the usefulness and importance of the relative error.
In relative error estimation, we minimize a loss function based on the relative error. An advantage of using such a loss function is that it is scale free or unit free. Recently, several researchers have proposed various loss functions based on the relative error [2,3,6,7,8,9]. Some of these procedures have been extended to the nonparametric model [10] and the random effect model [11]. Relative error estimation with $L_1$ regularization, including the least absolute shrinkage and selection operator (lasso; [12]) and the group lasso [13], has also been proposed by several authors [14,15,16] to allow for the analysis of high-dimensional data.
In practice, a response variable $y\,(>0)$ can turn out to be extremely large or close to zero. For example, the electricity consumption of a company may be low during holidays and high on exceptionally hot days. These responses may often be considered to be outliers, to which the relative error estimator is sensitive because the loss function diverges as $y \to \infty$ or $y \to 0$. Therefore, a relative error estimation that is robust against outliers must be considered. Recently, Chen et al. [8] discussed the robustness of various relative error estimation procedures by investigating the corresponding distributions, and concluded that the distribution corresponding to the least product relative error estimation (LPRE) proposed by [8] has heavier tails than the others, implying that the LPRE might be more robust in practical applications. However, our numerical experiments show that the LPRE is not as robust as expected, so that the robustification of the LPRE is yet to be investigated from both theoretical and practical viewpoints.
To achieve a relative error estimation that is robust against outliers, this paper employs the γ -likelihood function for regression analysis by Kawashima and Fujisawa [17], which is constructed by the γ -cross entropy [18]. The estimating equation is shown to have a redescending property, a desirable property in robust statistics literature [19]. To find a minimizer of the negative γ -likelihood function, we construct a majorize-minimization (MM) algorithm. The loss function of our algorithm at each iteration is shown to be convex, although the original negative γ -likelihood function is nonconvex. Our algorithm is guaranteed to decrease the objective function at each iteration. Moreover, we derive the asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, which enables us to straightforwardly create approximate confidence sets. Monte Carlo simulation is conducted to investigate the performance of our proposed procedure. An analysis of electricity consumption data is presented to illustrate the usefulness of our procedure. Supplemental material includes our R package rree (robust relative error estimation), which implements our algorithm, along with a sample program of the rree function.
The remainder of this paper is organized as follows: Section 2 reviews several relative error estimation procedures. In Section 3, we propose a relative error estimation that is robust against outliers via the γ-likelihood function. Section 4 presents theoretical properties: the redescending property of our method and the asymptotic distribution of the estimator, the proof of the latter being deferred to Appendix A. In Section 5, the MM algorithm is constructed to find the minimizer of the negative γ-likelihood function. Section 6 investigates the effectiveness of our proposed procedure via Monte Carlo simulations. Section 7 presents the analysis of electricity consumption data. Finally, concluding remarks are given in Section 8.

2. Relative Error Estimation

Suppose that $x_i = (x_{i1}, \ldots, x_{ip})^T$ $(i = 1, \ldots, n)$ are predictors and $y = (y_1, \ldots, y_n)^T$ is a vector of positive responses. Consider the multiplicative regression model
\[ y_i = \exp(x_i^T\beta)\,\varepsilon_i = \exp\Bigl(\sum_{j=1}^{p} x_{ij}\beta_j\Bigr)\varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1) \]
where $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p$-dimensional coefficient vector and the $\varepsilon_i$ are positive random variables. The predictors $x_i \in \mathbb{R}^p$ may be random and serially dependent, and we often set $x_{i1} \equiv 1$, that is, incorporate the intercept in the exponent. The parameter space $B \subset \mathbb{R}^p$ of $\beta$ is a bounded convex domain such that $\beta_0 \in B$. We implicitly assume that the model is correctly specified, so that there exists a true parameter $\beta_0 = (\beta_{1,0}, \ldots, \beta_{p,0}) \in B$. We want to estimate $\beta_0$ from a sample $\{(x_i, y_i),\ i = 1, \ldots, n\}$.
We first remark that the condition $x_{i1} \equiv 1$ ensures that the model (1) is scale-free with respect to the variables $\varepsilon_i$, a nature essentially different from that of the linear regression model $y_i = x_i^T\beta + \varepsilon_i$. Specifically, multiplying $\varepsilon_i$ by a positive constant $\sigma$ results in a translation of the intercept in the exponent:
\[ y_i = \exp(x_i^T\beta)\,\sigma\varepsilon_i = \exp(\log\sigma + x_i^T\beta)\,\varepsilon_i, \]
so that the change from $\varepsilon_i$ to $\sigma\varepsilon_i$ is equivalent to that from $\beta_1$ to $\beta_1 + \log\sigma$. See Remark 1 on the distribution of $\varepsilon_1$.
To provide a simple expression of the loss functions based on the relative error, we write
\[ t_i = t_i(\beta) = \exp(x_i^T\beta), \qquad i = 1, \ldots, n. \]
Chen et al. [6,8] pointed out that a loss criterion for relative error may depend on $|(y_i - t_i)/y_i|$ and/or $|(y_i - t_i)/t_i|$. These authors also proposed general relative error (GRE) criteria, defined as
\[ G(\beta) = \sum_{i=1}^{n} g\!\left( \left|\frac{y_i - t_i}{y_i}\right|, \left|\frac{y_i - t_i}{t_i}\right| \right), \qquad (2) \]
where $g : [0,\infty) \times [0,\infty) \to [0,\infty)$. Most of the loss functions based on the relative error are included in the GRE class. Park and Stefanski [2] considered the loss function $g(a,b) = a^2$. It may depend strongly on small $y_i$ because it involves $1/y_i^2$ terms, and the estimator can then be numerically unstable; consistency and asymptotic normality may not be established under general regularity conditions [8]. The loss functions based on $g(a,b) = \max\{a,b\}$ [3] and $g(a,b) = a + b$ (least absolute relative error estimation, [6]) can have desirable asymptotic properties [3,6]. However, the minimization of the loss function can be challenging, in particular for high-dimensional data, when the function is nonsmooth or nonconvex.
In practice, the following two criteria would be useful:
  • Least product relative error estimation (LPRE). Chen et al. [8] proposed the LPRE, given by $g(a,b) = ab$. The LPRE tries to minimize the product $|1 - t_i/y_i| \times |1 - y_i/t_i|$, not necessarily both terms at once.
  • Least squared-sum relative error estimation (LSRE). Chen et al. [8] considered the LSRE, given by $g(a,b) = a^2 + b^2$. The LSRE aims to minimize both $|1 - t_i/y_i|$ and $|1 - y_i/t_i|$ through the sum of squares $(1 - t_i/y_i)^2 + (1 - y_i/t_i)^2$.
The loss functions of LPRE and LSRE are smooth and convex, and also possess desirable asymptotic properties [8]. The above-described GRE criteria and their properties are summarized in Table 1. In particular, the “convexity” in the case of $g(a,b) = a + b$ holds when $\varepsilon_i > 0$, $\varepsilon_i \neq 1$, and $\sum_{i=1}^{n} x_i x_i^T$ is positive definite, since the Hessian matrix of the corresponding $G(\beta)$ is $\sum_{i=1}^{n} |\varepsilon_i - \varepsilon_i^{-1}|\, x_i x_i^T$ a.s.
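As an illustration of the GRE criterion in Equation (2), the following R sketch evaluates the LPRE and LSRE losses for a given coefficient vector; the function name and arguments are ours and are not part of the paper's supplementary rree package.

## Illustrative sketch of Equation (2): GRE losses for the LPRE (g(a,b) = ab)
## and the LSRE (g(a,b) = a^2 + b^2). X is an n x p design matrix, y a vector
## of positive responses, beta a coefficient vector (all names hypothetical).
gre_loss <- function(beta, X, y, type = c("LPRE", "LSRE")) {
  type <- match.arg(type)
  t_i <- exp(drop(X %*% beta))   # t_i(beta) = exp(x_i' beta)
  a <- abs((y - t_i) / y)        # |1 - t_i / y_i|
  b <- abs((y - t_i) / t_i)      # |1 - y_i / t_i|
  if (type == "LPRE") sum(a * b) else sum(a^2 + b^2)
}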
Although not essential, we assume that the variables $\varepsilon_i$ in Equation (1) are i.i.d. with common density function $h$. As in Chen et al. [8], we consider the following class of $h$ associated with $g$:
\[ h(\varepsilon) := \frac{C(g)}{\varepsilon} \exp\{-\rho(\varepsilon)\}\, I_+(\varepsilon), \qquad (3) \]
where
\[ \rho(\varepsilon) = \rho(\varepsilon; g) := g\!\left( \left|1 - \frac{1}{\varepsilon}\right|,\ |1 - \varepsilon| \right), \]
$C(g)$ is a normalizing constant ($\int h(\varepsilon)\,d\varepsilon = 1$), and $I_+$ denotes the indicator function of the set $(0,\infty)$. Furthermore, we assume the symmetry property $g(a,b) = g(b,a)$ for $a, b \ge 0$, from which it follows that $\varepsilon_1 \overset{d}{=} \varepsilon_1^{-1}$. The latter property is necessary for the score function associated with the gradient of a GRE loss function to be a martingale with respect to a suitable filtration, which often entails estimation efficiency. Indeed, asymmetry of $g(a,b)$ (i.e., $g(a,b) \neq g(b,a)$) may produce a substantial bias in the estimation [3]. The entire set of our regularity conditions will be given in Section 4.1. The conditions therein concerning $g$ are easily verified for both LPRE and LSRE.
In this paper, we implicitly suppose that $i = 1, \ldots, n$ denote "time" indices. As usual, in order to deal with cases of non-random and random predictors in a unified manner, we employ the partial-likelihood framework. Specifically, in the expression of the joint density (with the obvious notation for the densities)
\[ f(x_1, \ldots, x_n, y_1, \ldots, y_n) = \Bigl\{ f(x_1) \prod_{i=2}^{n} f(x_i \mid x_1, \ldots, x_{i-1}, y_1, \ldots, y_{i-1}) \Bigr\} \Bigl\{ f(y_1 \mid x_1) \prod_{i=2}^{n} f(y_i \mid x_1, \ldots, x_i, y_1, \ldots, y_{i-1}) \Bigr\}, \]
we ignore the first product $\{\cdots\}$ and only look at the second one $\{\cdots\}$, which is defined as the partial likelihood. We further assume that the $i$th-stage noise $\varepsilon_i$ is independent of $(x_1, \ldots, x_i, y_1, \ldots, y_{i-1})$, so that, in view of Equation (1), we have
\[ f(y_i \mid x_1, \ldots, x_i, y_1, \ldots, y_{i-1}) = f(y_i \mid x_i), \qquad i = 1, \ldots, n. \]
The density function of the response $y$ given $x_i$ is
\[ f(y \mid x_i; \beta) = \exp(-x_i^T\beta)\, h\bigl( y \exp(-x_i^T\beta) \bigr) \qquad (4) \]
\[ = \frac{1}{t_i}\, h\!\left( \frac{y}{t_i} \right). \qquad (5) \]
From Equation (3), we see that the maximum likelihood estimator (MLE) based on the error distribution in Equation (5) is obtained by the minimization of Equation (2). For example, the density functions of LPRE and LSRE are
\[ \mathrm{LPRE}: \ f(y \mid x_i) = \frac{1}{2K_0(2)}\, y^{-1} \exp\!\left( -\frac{y}{t_i} - \frac{t_i}{y} \right), \qquad \mathrm{LSRE}: \ f(y \mid x_i) = C_{\mathrm{LSRE}}\, y^{-1} \exp\!\left\{ -\left(1 - \frac{t_i}{y}\right)^{2} - \left(1 - \frac{y}{t_i}\right)^{2} \right\}, \qquad (y > 0), \qquad (6) \]
where $K_\nu(z)$ denotes the modified Bessel function of the third kind with index $\nu \in \mathbb{R}$:
\[ K_\nu(z) = \frac{z^{\nu}}{2^{\nu+1}} \int_0^\infty t^{-\nu - 1} \exp\!\left( -t - \frac{z^2}{4t} \right) dt, \]
and $C_{\mathrm{LSRE}}$ is a normalizing constant. The constants are numerically computed as $K_0(2) \approx 0.1139$ and $C_{\mathrm{LSRE}} \approx 0.911411$. The LPRE density in Equation (6) is a special case of the generalized inverse Gaussian distribution (see, e.g., [20]).
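These constants can be checked numerically in base R (a quick sketch; the helper name lsre_kernel is ours):

## Numerical check of the constants in Equation (6), using base R only.
besselK(2, nu = 0)                         # K_0(2), approx 0.1139 (LPRE constant)
## C_LSRE is the reciprocal of the LSRE normalizing integral (t cancels; take t = 1):
lsre_kernel <- function(y, t = 1) y^(-1) * exp(-(1 - t / y)^2 - (1 - y / t)^2)
1 / integrate(lsre_kernel, lower = 0, upper = Inf)$value   # approx 0.9114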
Remark 1.
We assume that the noise density $h$ is fully specified in the sense that, given $g$, the density $h$ does not involve any unknown quantity. However, this is not essential. For example, for the LPRE defined by Equation (6), we could naturally incorporate one more parameter $\sigma > 0$ into $h$, the resulting form of $h(\varepsilon)$ being
\[ \varepsilon \mapsto \frac{1}{2K_0(\sigma)}\, \varepsilon^{-1} \exp\!\left\{ -\frac{\sigma}{2}\left( \varepsilon + \frac{1}{\varepsilon} \right) \right\} I_+(\varepsilon). \]
Then, we can verify that the distributional equivalence $\varepsilon_1 \overset{d}{=} \varepsilon_1^{-1}$ holds whatever the value of $\sigma$ is. In particular, the estimation of the parameter $\sigma$ does make statistical sense and, indeed, it is possible to deduce the asymptotic normality of the joint maximum-(partial-)likelihood estimator of $(\beta, \sigma)$. In this paper, we do not pay attention to such a possible additional parameter, but instead regard it (whenever it exists) as a nuisance parameter, as with the noise variance in the least-squares estimation of a linear regression model.

3. Robust Estimation via γ -Likelihood

In practice, outliers are often observed. For example, electricity consumption data can contain outliers on extremely hot days. The estimation methods via GRE criteria, including LPRE and LSRE, are not robust against outliers, because the corresponding density functions are not generally heavy-tailed. Therefore, a relative error estimation method that is robust against outliers is needed. To achieve this, we consider minimizing the negative γ-(partial-)likelihood function based on the γ-cross entropy [17].
We now define the negative γ -(partial-)likelihood function by
\[ \ell_{\gamma,n}(\beta) = -\frac{1}{\gamma} \log\left\{ \frac{1}{n} \sum_{i=1}^{n} f(y_i \mid x_i; \beta)^{\gamma} \right\} + \frac{1}{1+\gamma} \log\left\{ \frac{1}{n} \sum_{i=1}^{n} \int_0^\infty f(y \mid x_i; \beta)^{1+\gamma}\, dy \right\}, \qquad (7) \]
where $\gamma > 0$ is a parameter that controls the degree of robustness; $\gamma \to 0$ corresponds to the negative log-likelihood function, and robustness is enhanced as $\gamma$ increases. On the other hand, too large a $\gamma$ can decrease the efficiency of the estimator [18]. In practice, the value of $\gamma$ may be selected by cross-validation based on the $\gamma$-cross entropy (see, e.g., [18,21]). We refer to Kawashima and Fujisawa [22] for more recent observations on the comparison between the $\gamma$-divergences of Fujisawa and Eguchi [18] and Kawashima and Fujisawa [17].
There are several likelihood functions that yield robust estimation. Examples include the $L_q$-likelihood [23] and the likelihood based on the density power divergence [24], referred to as the $\beta$-likelihood. The $\gamma$-likelihood, the $L_q$-likelihood, and the $\beta$-likelihood are closely related. The negative $\beta$-likelihood function $\ell_{\alpha,n}(\beta)$ and the negative $L_q$-likelihood function $\ell_{q,n}(\beta)$ are, respectively, expressed as
\[ \ell_{\alpha,n}(\beta) = -\frac{1}{\alpha} \left\{ \frac{1}{n} \sum_{i=1}^{n} f(y_i \mid x_i; \beta)^{\alpha} \right\} + \frac{1}{1+\alpha} \left\{ \frac{1}{n} \sum_{i=1}^{n} \int_0^\infty f(y \mid x_i; \beta)^{1+\alpha}\, dy \right\}, \qquad (8) \]
\[ \ell_{q,n}(\beta) = -\sum_{i=1}^{n} \frac{ f(y_i \mid x_i; \beta)^{1-q} - 1 }{ 1 - q }. \qquad (9) \]
The difference between the $\gamma$-likelihood and the $\beta$-likelihood is just the presence of the logarithm in $\ell_{\gamma,n}(\beta)$. Furthermore, substituting $q = 1 - \alpha$ into Equation (9) gives
\[ \ell_{q,n}(\beta) = -\frac{1}{\alpha} \sum_{i=1}^{n} f(y_i \mid x_i; \beta)^{\alpha} + \mathrm{const}. \]
Therefore, the minimization of the negative $L_q$-likelihood function is equivalent to the minimization of the negative $\beta$-likelihood function without the second term on the right-hand side of Equation (8). Note that the $\gamma$-likelihood has the redescending property, a desirable property in the robust statistics literature, as shown in Section 4.2. Moreover, it is known that the $\gamma$-likelihood is essentially the unique divergence that is robust against heavy contamination (see [18] for details). On the other hand, we have not shown whether the $L_q$-likelihood and/or the $\beta$-likelihood have the redescending property.
The integral $\int_0^\infty f(y \mid x_i; \beta)^{1+\gamma}\, dy$ in the second term on the right-hand side of Equation (7) is
\[ \int_0^\infty f(y \mid x_i; \beta)^{1+\gamma}\, dy = \frac{1}{t_i^{1+\gamma}} \int_0^\infty h\!\left( \frac{y}{t_i} \right)^{1+\gamma} dy =: t_i^{-\gamma}\, C(\gamma, h), \]
where
\[ C(\gamma, h) := \int_0^\infty h(v)^{1+\gamma}\, dv \qquad (10) \]
is a constant term, which is assumed to be finite. Then, Equation (7) is expressed as
\[ \ell_{\gamma,n}(\beta) = \underbrace{ -\frac{1}{\gamma} \log \sum_{i=1}^{n} f(y_i \mid x_i; \beta)^{\gamma} }_{=:\, \ell_1(\beta)} + \underbrace{ \frac{1}{1+\gamma} \log \sum_{i=1}^{n} t_i^{-\gamma} }_{=:\, \ell_2(\beta)} + C_0(\gamma, h), \qquad (11) \]
where $C_0(\gamma, h)$ is a constant term free of $\beta$. We define the maximum $\gamma$-likelihood estimator to be any element such that
\[ \hat{\beta}_\gamma \in \operatorname*{argmin}_{\beta} \ell_{\gamma,n}(\beta). \]
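For concreteness, the following R sketch evaluates the objective in Equations (7) and (11) for the LPRE density (6), dropping the constant term $C_0(\gamma, h)$; the function and argument names are our own illustration, not the interface of the supplementary rree package.

## Sketch of the negative gamma-likelihood (11) for the LPRE model, up to C_0.
dlpre <- function(y, t) (1 / (2 * besselK(2, 0))) * y^(-1) * exp(-y / t - t / y)

neg_gamma_lik <- function(beta, X, y, gamma) {
  t_i <- exp(drop(X %*% beta))
  f_i <- dlpre(y, t_i)
  ell1 <- -(1 / gamma) * log(sum(f_i^gamma))          # first term of (11)
  ell2 <- (1 / (1 + gamma)) * log(sum(t_i^(-gamma)))  # second term of (11)
  ell1 + ell2
}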

4. Theoretical Properties

4.1. Technical Assumptions

Let $\overset{p}{\to}$ denote convergence in probability.
Assumption A1 (Stability of the predictor).
There exist a probability measure $\pi(dx)$ on the state space $\mathcal{X}$ of the predictors and positive constants $\delta, \delta' > 0$ such that
\[ \frac{1}{n} \sum_{i=1}^{n} |x_i|^3 \exp\bigl( \delta |x_i|^{1+\delta'} \bigr) = O_p(1), \]
and that
\[ \frac{1}{n} \sum_{i=1}^{n} \eta(x_i) \overset{p}{\to} \int_{\mathcal{X}} \eta(x)\, \pi(dx), \qquad n \to \infty, \]
where the limit is finite, for any measurable $\eta$ satisfying
\[ \sup_{x \in \mathbb{R}^p} \frac{ |\eta(x)| }{ (1 + |x|^3) \exp\bigl( \delta |x|^{1+\delta'} \bigr) } < \infty. \]
Assumption A2 (Noise structure).
The a.s. positive i.i.d. random variables $\varepsilon_1, \varepsilon_2, \ldots$ have a common positive density $h$ of the form (3):
\[ h(\varepsilon) = \frac{C(g)}{\varepsilon} \exp\{-\rho(\varepsilon)\}\, I_+(\varepsilon), \]
for which the following conditions hold.
  • The function $g : [0,\infty) \times [0,\infty) \to [0,\infty)$ is three times continuously differentiable on $(0,\infty)$ and satisfies
    \[ g(a,b) = g(b,a), \qquad a, b \ge 0. \]
  • There exist constants $\kappa_0, \kappa > 0$ and $c > 1$ such that
    \[ \frac{1}{c}\bigl( \varepsilon^{-\kappa_0} \vee \varepsilon^{\kappa} \bigr) \le \rho(\varepsilon) \le c\bigl( \varepsilon^{-\kappa_0} \vee \varepsilon^{\kappa} \bigr) \]
    for every $\varepsilon > 0$.
  • There exist constants $c_0, c' \ge 0$ such that
    \[ \sup_{\varepsilon > 0}\, \bigl( \varepsilon^{-c_0} \vee \varepsilon^{c'} \bigr)^{-1} \max_{k=1,2,3} \bigl| \partial_\varepsilon^k \rho(\varepsilon) \bigr| < \infty. \]
Here and in the sequel, for a variable $a$, we denote by $\partial_a^k$ the $k$th-order partial differentiation with respect to $a$.
Assumption 1 is necessary to identify the large-sample stochastic limits of several key quantities in the proofs: without it, we would not be able to deduce an explicit asymptotic normality result. Assumption 2 holds in many cases, including the LPRE and the LSRE (i.e., $g(a,b) = ab$ and $g(a,b) = a^2 + b^2$), while excluding $g(a,b) = a^2$ and $g(a,b) = b^2$. The smoothness condition on $h$ on $(0,\infty)$ is not essential and could be weakened in light of M-estimation theory ([25], Chapter 5). Under these assumptions, we can deduce the following statements.
  • $h$ is three times continuously differentiable on $(0,\infty)$, and for each $\alpha > 0$,
    \[ \int_0^\infty h^\alpha(\varepsilon)\, d\varepsilon < \infty \quad \text{and} \quad \max_{k=0,1,2,3}\ \sup_{\varepsilon > 0} \bigl| \partial_\varepsilon^k h(\varepsilon) \bigr|^{\alpha} < \infty. \]
  • For each $\gamma > 0$ and $\alpha > 0$ (recall that the value of $\gamma > 0$ is given),
    \[ \lim_{\varepsilon \to 0} h(\varepsilon)^{\gamma} \bigl| u_h(\varepsilon) \bigr|^{\alpha} = \lim_{\varepsilon \to \infty} h(\varepsilon)^{\gamma} \bigl| u_h(\varepsilon) \bigr|^{\alpha} = 0, \]
    where
    \[ u_h(z) := 1 + z\, \partial_z \log h(z) = 1 + z\, \frac{h'(z)}{h(z)}. \]
The verifications are straightforward and hence omitted.
Finally, we impose the following assumption:
Assumption A3 (Identifiability).
We have $\beta = \beta_0$ if
\[ \rho\bigl( e^{-x^T\beta} y \bigr) = \rho\bigl( e^{-x^T\beta_0} y \bigr) \quad \pi(dx)\,\lambda_+(dy)\text{-a.e. } (x, y), \]
where $\lambda_+$ denotes the Lebesgue measure on $(0,\infty)$.

4.2. Redescending Property

The estimating equation based on the negative $\gamma$-likelihood function is of the form
\[ \sum_{i=1}^{n} \psi(y_i \mid x_i; \beta) = 0. \]
In our model, we regard not only very large $y_i$'s but also very small $y_i$'s as outliers: the estimating equation is said to have the redescending property if
\[ \lim_{y \to \infty} \psi(y \mid x; \beta_0) = \lim_{y \to +0} \psi(y \mid x; \beta_0) = 0 \]
for each $x$. The redescending property is a desirable property in the robust statistics literature [19]. Here, we show that the proposed procedure has the redescending property.
The estimating equation based on the negative $\gamma$-likelihood function is
\[ -\frac{ \sum_{i=1}^{n} f(y_i \mid x_i; \beta)^{\gamma}\, s(y_i \mid x_i; \beta) }{ \sum_{j=1}^{n} f(y_j \mid x_j; \beta)^{\gamma} } + \partial_\beta \ell_2(\beta) = 0, \]
where
\[ s(y \mid x; \beta) = \frac{ \partial \log f(y \mid x; \beta) }{ \partial \beta }. \]
We have the expression
\[ \psi(y \mid x; \beta) = f(y \mid x; \beta)^{\gamma} \bigl\{ s(y \mid x; \beta) - \partial_\beta \ell_2(\beta) \bigr\}. \]
Note that $\partial_\beta \ell_2(\beta)$ is free of $y$. For each $(x, \beta)$, direct computations give the estimate
\[ \bigl| \psi(y \mid x; \beta) \bigr| \le C(x; \beta)\, h\bigl( \exp(-x^T\beta)\, y \bigr)^{\gamma}\, \bigl| u_h\bigl( \exp(-x^T\beta)\, y \bigr) \bigr| \]
for some constant $C(x; \beta)$ free of $y$. Hence, Equation (12) combined with the inequality (13) leads to the redescending property.

4.3. Asymptotic Distribution

Recall Equation (10) for the definition of $C(\gamma, h)$, and let
\[ C_1(\gamma, h) := \int_0^\infty \varepsilon\, h(\varepsilon)^{\gamma}\, h'(\varepsilon)\, d\varepsilon, \qquad C_2(\gamma, h) := \int_0^\infty u_h(\varepsilon)^2\, h(\varepsilon)^{2\gamma+1}\, d\varepsilon, \qquad \Pi_k(\gamma) := \int x^{\otimes k} \exp(-\gamma x^T\beta_0)\, \pi(dx), \quad k = 0, 1, 2, \]
where $x^{\otimes 0} := 1 \in \mathbb{R}$, $x^{\otimes 1} := x \in \mathbb{R}^p$, and $x^{\otimes 2} := x x^T \in \mathbb{R}^p \otimes \mathbb{R}^p$; Assumptions 1 and 2 ensure that all these quantities are finite for each $\gamma > 0$. Moreover, let
\[ \begin{aligned} H_\gamma(\beta_0) &:= \iint f(y \mid x; \beta_0)^{\gamma+1}\, dy\, \pi(dx) = C(\gamma, h)\, \Pi_0(\gamma), \\ H'_\gamma(\beta_0) &:= \iint f(y \mid x; \beta_0)^{\gamma+1}\, s(y \mid x; \beta_0)\, dy\, \pi(dx) = -\bigl\{ C(\gamma, h) + C_1(\gamma, h) \bigr\}\, \Pi_1(\gamma), \\ \Delta_\gamma(\beta_0) &:= C(\gamma, h)^2\, C_2(\gamma, h)\, \Pi_0(\gamma)^2\, \Pi_2(2\gamma) + \bigl\{ C(\gamma, h) + C_1(\gamma, h) \bigr\}^2 C(2\gamma, h)\, \Pi_0(2\gamma)\, \Pi_1(\gamma)^{\otimes 2} \\ &\qquad - 2\, C(\gamma, h) \bigl\{ C(\gamma, h) + C_1(\gamma, h) \bigr\} \bigl\{ C(2\gamma, h) + C_1(2\gamma, h) \bigr\}\, \Pi_0(\gamma)\, \Pi_1(2\gamma)\, \Pi_1(\gamma)^T, \end{aligned} \qquad (14) \]
\[ J_\gamma(\beta_0) := C(\gamma, h)\, C_2(\gamma/2, h)\, \Pi_0(\gamma)\, \Pi_2(\gamma) - \bigl\{ C(\gamma, h) + C_1(\gamma, h) \bigr\}^2\, \Pi_1(\gamma)^{\otimes 2}. \qquad (15) \]
We are assuming that the density $h$ and the tuning parameter $\gamma$ are given a priori; hence, we can (numerically) compute the constants $C(\gamma, h)$, $C_1(\gamma, h)$, and $C_2(\gamma, h)$. In the following, we often omit “$(\beta_0)$” from the notation.
Let $\overset{\mathcal{L}}{\to}$ denote convergence in distribution.
Theorem 1.
Under Assumptions 1–3, we have
\[ \sqrt{n}\, \bigl( \hat{\beta}_\gamma - \beta_0 \bigr) \overset{\mathcal{L}}{\to} N_p\bigl( 0,\; J_\gamma^{-1} \Delta_\gamma J_\gamma^{-1} \bigr). \qquad (16) \]
The asymptotic covariance matrix can be consistently estimated through expressions (14) and (15) with the quantities $\Pi_k(\gamma)$ therein replaced by the empirical estimates:
\[ \hat{\Pi}_{k,n}(\gamma) := \frac{1}{n} \sum_{i=1}^{n} x_i^{\otimes k} \exp\bigl( -\gamma\, x_i^T \hat{\beta}_\gamma \bigr) \overset{p}{\to} \Pi_k(\gamma), \qquad k = 0, 1, 2. \qquad (17) \]
The proof of Theorem 1 will be given in Appendix A. Note that, for $\gamma \to 0$, we have $C(\gamma,h) \to 1$, $C_1(\gamma,h) \to -1$, and $C_2(\gamma,h) \to \int_0^\infty u_h(\varepsilon)^2 h(\varepsilon)\, d\varepsilon$, which in particular entails $H_\gamma \to 1$ and $H'_\gamma \to 0$. Then, both $\Delta_\gamma$ and $J_\gamma$ tend to the Fisher information matrix
\[ I_0 := \iint s(y \mid x; \beta_0)^{\otimes 2} f(y \mid x; \beta_0)\, \pi(dx)\, dy = \left( \int_0^\infty u_h(\varepsilon)^2 h(\varepsilon)\, d\varepsilon \right) \int x^{\otimes 2}\, \pi(dx) \]
as $\gamma \to 0$, so that the asymptotic distribution $N_p(0, J_\gamma^{-1} \Delta_\gamma J_\gamma^{-1})$ becomes $N_p(0, I_0^{-1})$, the usual one of the MLE.
We also note that, without giving details, we could deduce a density power divergence (also known as the $\beta$-divergence [26]) counterpart to Theorem 1 in a similar manner but with slightly less computational cost; in that case, we consider the objective function $\ell_{\alpha,n}(\beta)$ defined by Equation (8) instead of the $\gamma$-(partial-)likelihood (7). See Basu et al. [24] and Jones et al. [21] for details of the density power divergence.
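As a small illustration of Equation (17), the following R sketch computes the empirical estimates $\hat{\Pi}_{k,n}(\gamma)$ (our own helper with hypothetical argument names); combined with the constants $C(\gamma,h)$, $C_1(\gamma,h)$, and $C_2(\gamma,h)$, these yield plug-in estimates of $\Delta_\gamma$ and $J_\gamma$ in Equations (14) and (15).

## Empirical Pi-hat_{k,n}(gamma) of Equation (17); X is n x p, beta_hat the
## maximum gamma-likelihood estimate (illustrative names only).
Pi_hat <- function(X, beta_hat, gamma) {
  w <- exp(-gamma * drop(X %*% beta_hat))
  list(Pi0 = mean(w),                        # k = 0
       Pi1 = colMeans(X * w),                # k = 1
       Pi2 = crossprod(X, X * w) / nrow(X))  # k = 2
}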

5. Algorithm

Even if the GRE criterion in Equation (2) is a convex function, the negative γ -likelihood function is nonconvex. Therefore, it is difficult to find a global minimum. Here, we derive the MM (majorize-minimization) algorithm to obtain a local minimum. The MM algorithm monotonically decreases the objective function at each iteration. We refer to Hunter and Lange [27] for a concise account of the MM algorithm.
Let $\beta^{(t)}$ be the value of the parameter at the $t$th iteration. The negative $\gamma$-likelihood function in Equation (11) consists of two nonconvex functions, $\ell_1(\beta)$ and $\ell_2(\beta)$. Majorization functions of $\ell_j(\beta)$, say $\tilde{\ell}_j(\beta \mid \beta^{(t)})$ ($j = 1, 2$), are constructed so that the optimization $\min_\beta \tilde{\ell}_j(\beta \mid \beta^{(t)})$ is much easier than $\min_\beta \ell_j(\beta)$. The majorization functions must satisfy the following conditions:
\[ \tilde{\ell}_j(\beta \mid \beta^{(t)}) \ge \ell_j(\beta), \qquad (18) \]
\[ \tilde{\ell}_j(\beta^{(t)} \mid \beta^{(t)}) = \ell_j(\beta^{(t)}). \qquad (19) \]
Here, we construct majorization functions $\tilde{\ell}_j(\beta \mid \beta^{(t)})$ for $j = 1, 2$.

5.1. Majorization Function for 1 ( β )

Let
\[ w_i^{(t)} = \frac{ f(y_i \mid x_i; \beta^{(t)})^{\gamma} }{ \sum_{j=1}^{n} f(y_j \mid x_j; \beta^{(t)})^{\gamma} }, \qquad (20) \]
\[ r_i^{(t)} = \Bigl\{ \sum_{j=1}^{n} f(y_j \mid x_j; \beta^{(t)})^{\gamma} \Bigr\} \frac{ f(y_i \mid x_i; \beta)^{\gamma} }{ f(y_i \mid x_i; \beta^{(t)})^{\gamma} }. \qquad (21) \]
Obviously, $\sum_{i=1}^{n} w_i^{(t)} = 1$ and $w_i^{(t)} r_i^{(t)} = f(y_i \mid x_i; \beta)^{\gamma}$. Applying Jensen's inequality to the convex function $y = -\log x$, we obtain the inequality
\[ -\log \sum_{i=1}^{n} w_i^{(t)} r_i^{(t)} \le -\sum_{i=1}^{n} w_i^{(t)} \log r_i^{(t)}. \qquad (22) \]
Substituting Equations (20) and (21) into Equation (22) gives
\[ \ell_1(\beta) \le -\sum_{i=1}^{n} w_i^{(t)} \log f(y_i \mid x_i; \beta) + C, \]
where $C = \frac{1}{\gamma} \sum_i w_i^{(t)} \log w_i^{(t)}$. Defining
\[ \tilde{\ell}_1(\beta \mid \beta^{(t)}) = -\sum_{i=1}^{n} w_i^{(t)} \log f(y_i \mid x_i; \beta) + C, \qquad (23) \]
we observe that Equation (23) satisfies Equations (18) and (19). It can be shown that $\tilde{\ell}_1(\beta \mid \beta^{(t)})$ is convex if the original relative error loss function is convex. In particular, the majorization functions $\tilde{\ell}_1(\beta \mid \beta^{(t)})$ based on LPRE and LSRE are both convex.

5.2. Majorization Function for 2 ( β )

Let $\theta_i = -\gamma x_i^T \beta$. We view $\ell_2(\beta)$ as a function of $\theta = (\theta_1, \ldots, \theta_n)^T$. Let
\[ s(\theta) := \log \sum_{i=1}^{n} t_i^{-\gamma} = \log \sum_{i=1}^{n} \exp(\theta_i). \qquad (24) \]
Taking derivatives of $s(\theta)$ with respect to $\theta$, we have
\[ \frac{\partial s(\theta)}{\partial \theta_i} = \pi_i, \qquad \frac{\partial^2 s(\theta)}{\partial \theta_j\, \partial \theta_i} = \pi_i \delta_{ij} - \pi_i \pi_j, \]
where $\pi_i = \exp(\theta_i) / \{ \sum_{k=1}^{n} \exp(\theta_k) \}$ and $\delta_{ij}$ is the Kronecker delta. Note that $\sum_{i=1}^{n} \pi_i = 1$ for any $\theta$.
The Taylor expansion of $s(\theta)$ at $\theta = \theta^{(t)}$ is expressed as
\[ s(\theta) = s(\theta^{(t)}) + \pi^{(t)T} (\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T \frac{\partial^2 s(\theta^*)}{\partial \theta\, \partial \theta^T} (\theta - \theta^{(t)}), \qquad (25) \]
where $\pi^{(t)} = (\pi_1^{(t)}, \ldots, \pi_n^{(t)})^T$ and $\theta^*$ is an $n$-dimensional vector located between $\theta$ and $\theta^{(t)}$. We define an $n \times n$ matrix $B$ as follows:
\[ B := \frac{1}{2} \left( I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \right). \]
It follows from [28] that, in the matrix sense,
\[ \frac{\partial^2 s(\theta)}{\partial \theta\, \partial \theta^T} \preceq B \qquad (26) \]
for any $\theta$. Combining Equations (25) and (26), we have
\[ s(\theta) \le s(\theta^{(t)}) + \pi^{(t)T}(\theta - \theta^{(t)}) + \frac{1}{2} (\theta - \theta^{(t)})^T B\, (\theta - \theta^{(t)}). \qquad (27) \]
Substituting Equation (24) into Equation (27) gives
\[ \log \sum_{i=1}^{n} \exp(-\gamma x_i^T \beta) \le \log \sum_{i=1}^{n} \exp(-\gamma x_i^T \beta^{(t)}) - \gamma\, \pi^{(t)T} X (\beta - \beta^{(t)}) + \frac{\gamma^2}{2} (\beta - \beta^{(t)})^T X^T B X (\beta - \beta^{(t)}), \]
where $X = (x_1, \ldots, x_n)^T$. A majorization function of $\ell_2(\beta)$ is then given by
\[ \tilde{\ell}_2(\beta \mid \beta^{(t)}) = \frac{\gamma^2}{2(1+\gamma)} \beta^T X^T B X \beta - \frac{\gamma}{1+\gamma} \beta^T \bigl( X^T \pi^{(t)} + \gamma X^T B X \beta^{(t)} \bigr) + C, \qquad (28) \]
where $C$ is a constant term free of $\beta$. We observe that $\tilde{\ell}_2(\beta \mid \beta^{(t)})$ in Equation (28) satisfies Equations (18) and (19). It follows that $\tilde{\ell}_2(\beta \mid \beta^{(t)})$ is convex because $X^T B X$ is positive semi-definite.

5.3. MM Algorithm for Robust Relative Error Estimation

In Section 5.1 and Section 5.2, we constructed majorization functions for both $\ell_1(\beta)$ and $\ell_2(\beta)$. The MM algorithm based on these majorization functions is detailed in Algorithm 1. The majorization function $\tilde{\ell}_1(\beta \mid \beta^{(t)}) + \tilde{\ell}_2(\beta \mid \beta^{(t)})$ is convex if the original relative error loss function is convex. In particular, the majorization functions of LPRE and LSRE are both convex.
Algorithm 1 Robust relative error estimation via the MM algorithm.
1: $t \leftarrow 0$
2: Set an initial value of the parameter vector $\beta^{(0)}$.
3: while $\beta^{(t)}$ has not converged do
4:  Update the weights $w_i^{(t)}$ by Equation (20).
5:  Update $\beta$ by
\[ \beta^{(t+1)} \leftarrow \operatorname*{argmin}_{\beta} \bigl\{ \tilde{\ell}_1(\beta \mid \beta^{(t)}) + \tilde{\ell}_2(\beta \mid \beta^{(t)}) \bigr\}, \]
 where $\tilde{\ell}_1(\beta \mid \beta^{(t)})$ and $\tilde{\ell}_2(\beta \mid \beta^{(t)})$ are given by Equations (23) and (28), respectively.
6: $t \leftarrow t + 1$
7: end while
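The following R sketch mirrors Algorithm 1 for the LPRE, minimizing the convex surrogate $\tilde{\ell}_1 + \tilde{\ell}_2$ at each iteration with a general-purpose optimizer; it is a simplified illustration under our own naming, not the supplementary rree implementation, and it reuses the hypothetical dlpre() helper from the sketch in Section 3.

## Sketch of Algorithm 1 for the LPRE loss.
mm_lpre <- function(X, y, gamma, beta0, maxit = 100, tol = 1e-8) {
  n <- nrow(X)
  beta <- beta0
  B <- 0.5 * (diag(n) - matrix(1 / n, n, n))        # Bohning bound, Equation (26)
  XtBX <- crossprod(X, B %*% X)
  for (it in seq_len(maxit)) {
    t_i  <- exp(drop(X %*% beta))
    fg   <- dlpre(y, t_i)^gamma
    w    <- fg / sum(fg)                            # weights, Equation (20)
    pi_t <- t_i^(-gamma) / sum(t_i^(-gamma))        # pi_i evaluated at beta^(t)
    surrogate <- function(b) {                      # l1-tilde + l2-tilde (up to constants)
      t_b <- exp(drop(X %*% b))
      l1  <- sum(w * (y / t_b + t_b / y))           # -sum_i w_i log f_i + const, Equation (23)
      l2  <- gamma^2 / (2 * (1 + gamma)) * drop(t(b) %*% XtBX %*% b) -
        gamma / (1 + gamma) * drop(t(b) %*% (crossprod(X, pi_t) + gamma * XtBX %*% beta))
      l1 + l2
    }
    beta_new <- optim(beta, surrogate, method = "BFGS")$par
    converged <- sum(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}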
Remark 2.
Instead of the MM algorithm, one can directly use a quasi-Newton method, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, to minimize the negative γ-likelihood function. In our experience, the BFGS algorithm is faster than the MM algorithm but more sensitive to the initial value. The strengths of the BFGS and MM algorithms can be combined in the following hybrid algorithm:
  • We first run the MM algorithm for a small number of iterations.
  • Then, the BFGS algorithm is run, using the estimate obtained by the MM algorithm as its initial value.
The stability of the MM algorithm is investigated in the real data analysis in Section 7.
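A minimal R sketch of this hybrid scheme, reusing the hypothetical mm_lpre() and neg_gamma_lik() helpers defined in the sketches above; X and y denote the design matrix and response, and the value gamma = 0.5 is arbitrary.

## Hybrid scheme of Remark 2: a few MM steps, then BFGS on the full objective.
beta_init <- mm_lpre(X, y, gamma = 0.5, beta0 = rep(0, ncol(X)), maxit = 10)
fit <- optim(beta_init, neg_gamma_lik, X = X, y = y, gamma = 0.5, method = "BFGS")
fit$par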
Remark 3.
To deal with high-dimensional data, we often use $L_1$ regularization, such as the lasso [12], the elastic net [29], and the smoothly clipped absolute deviation (SCAD) [30]. In robust relative error estimation, the lasso-type loss function is expressed as
\[ \ell_{\gamma,n}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (29) \]
where $\lambda > 0$ is a regularization parameter. However, the loss function in Equation (29) is nonconvex and nondifferentiable. Instead of directly minimizing the nonconvex loss function in Equation (29), we may use the MM algorithm; the following convex loss function is minimized at each iteration:
\[ \tilde{\ell}_1(\beta \mid \beta^{(t)}) + \tilde{\ell}_2(\beta \mid \beta^{(t)}) + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (30) \]
The minimization of Equation (30) can be carried out by the alternating direction method of multipliers [14] or by the coordinate descent algorithm with a quadratic approximation of $\tilde{\ell}_1(\beta \mid \beta^{(t)}) + \tilde{\ell}_2(\beta \mid \beta^{(t)})$ [31].

6. Monte Carlo Simulation

6.1. Setting

We consider the following two simulation models:
\[ \text{Model 1}: \ \beta_0 = (1, 1, 1)^T, \qquad \text{Model 2}: \ \beta_0 = (\underbrace{0.5, \ldots, 0.5}_{6}, 0, \ldots, 0)^T \in \mathbb{R}^{45}. \]
The number of observations is set to $n = 200$. For each model, we generate $T = 10{,}000$ datasets of predictors $x_i$ $(i = 1, \ldots, n)$ according to $N(0, (1-\rho) I + \rho \mathbf{1}\mathbf{1}^T)$. Here, we consider the cases $\rho = 0.0$ and $\rho = 0.6$. The responses $y_i$ are generated from the mixture distribution
\[ (1 - \delta)\, f(y \mid x_i; \beta_0) + \delta\, q(y), \qquad i = 1, \ldots, n, \]
where $f(y \mid x; \beta_0)$ is the density function corresponding to the LPRE defined in Equation (6), $q(y)$ is the density function of the outlier distribution, and $\delta$ ($0 \le \delta < 1$) is the outlier ratio. The outlier ratio is set to $\delta = 0, 0.05, 0.1$, and $0.2$ in this simulation. We assume that $q(y)$ is a log-normal density, $q(y) = \{1/(\sqrt{2\pi}\, \sigma y)\} \exp\{ -(\log y - \mu)^2 / (2\sigma^2) \}$, with $(\mu, \sigma) = (\pm 5, 1)$. When $\mu = 5$, the outliers take extremely large values. On the other hand, when $\mu = -5$, the outlier values are nearly zero.
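The data-generating mechanism can be sketched in R as follows (Model 1 with ρ = 0 and μ = 5). The LPRE noise density h is a generalized inverse Gaussian density; here we assume the GIGrvg package's rgig(), whose density is proportional to x^(lambda−1) exp{−(chi/x + psi·x)/2}, so that h corresponds to (lambda, chi, psi) = (0, 2, 2). All of this is our own illustration of the setting described above.

## Sketch of the simulation design of Section 6.1 (Model 1, rho = 0, mu = 5).
set.seed(1)
n <- 200; beta0 <- c(1, 1, 1); delta <- 0.1; mu <- 5
X   <- cbind(rnorm(n), rnorm(n), rnorm(n))            # predictors, rho = 0 case
eps <- GIGrvg::rgig(n, lambda = 0, chi = 2, psi = 2)  # LPRE noise (assumed GIGrvg parameterization)
y   <- exp(drop(X %*% beta0)) * eps                   # model (1) with LPRE noise
out <- runif(n) < delta                               # outlier indicators
y[out] <- rlnorm(sum(out), meanlog = mu, sdlog = 1)   # log-normal outliers q(y)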

6.2. Investigation of Relative Prediction Error and Mean Squared Error of the Estimator

To investigate the performance of our proposed procedure, we use the relative prediction error (RPE) and the mean squared error (MSE) for the $t$th dataset, defined as
\[ \mathrm{RPE}(t) = \sum_{i=1}^{n} \frac{ \bigl[ y_i^{\mathrm{new}}(t) - \exp\{ x_i(t)^T \hat{\beta}(t) \} \bigr]^2 }{ y_i^{\mathrm{new}}(t)\, \exp\{ x_i(t)^T \hat{\beta}(t) \} }, \qquad (31) \]
\[ \mathrm{MSE}(t) = \bigl\| \hat{\beta}(t) - \beta_0 \bigr\|^2, \qquad (32) \]
respectively, where $\hat{\beta}(t)$ is the estimator obtained from the dataset $\{ (x_i(t), y_i(t));\ i = 1, \ldots, n \}$, and $y_i^{\mathrm{new}}(t)$ is an observation from $y_i^{\mathrm{new}}(t) \mid x_i(t)$. Here, $y_i^{\mathrm{new}}(t) \mid x_i(t)$ follows the distribution $f(y \mid x_i(t); \beta_0)$ and is independent of $y_i(t) \mid x_i(t)$. Figure 1 shows the median and error bar of $\{\mathrm{RPE}(1), \ldots, \mathrm{RPE}(T)\}$ and $\{\mathrm{MSE}(1), \ldots, \mathrm{MSE}(T)\}$. The error bars are delineated by the 25th and 75th percentiles.
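The two performance measures are straightforward to compute; a small R sketch (our own helper names) evaluating Equations (31) and (32) for one simulated dataset:

## Performance measures (31) and (32); beta_hat is an estimate, y_new a vector
## of test responses drawn from f(y | x_i; beta0) as described above.
rpe <- function(beta_hat, X, y_new) {
  pred <- exp(drop(X %*% beta_hat))
  sum((y_new - pred)^2 / (y_new * pred))
}
mse <- function(beta_hat, beta0) sum((beta_hat - beta0)^2)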
We observe the following tendencies from the results in Figure 1:
  • As the outlier ratio increases, the performance becomes worse in all cases. Interestingly, the length of the error bar of RPE increases as the outlier ratio increases.
  • The proposed method becomes more robust against outliers as the value of $\gamma$ increases. We observe that too large a $\gamma$, such as $\gamma = 10$, leads to extremely poor RPE and MSE because most observations are regarded as outliers. Therefore, a moderately large $\gamma$, such as the $\gamma = 0.5$ used here, generally results in better estimation accuracy than the MLE.
  • The cases of $\rho = 0.6$, where the predictors are correlated, are worse than those of $\rho = 0$. In particular, when $\gamma = 0$, the RPE for $\rho = 0.6$ becomes large for large outlier ratios. However, increasing $\gamma$ led to uniformly better estimation performance.
  • The results for different simulation models with the same value of $\gamma$ are generally different, which implies that the appropriate value of $\gamma$ may change according to the data-generating mechanism.

6.3. Investigation of Asymptotic Distribution

The asymptotic distribution is derived under the assumption that the true distribution of $y \mid x_i$ follows $f(y \mid x_i; \beta_0)$, that is, $\delta = 0$. However, we expect that, when $\gamma$ is sufficiently large and $\delta$ is moderate, the asymptotic distribution may still provide a good approximation, a point underlined by Fujisawa and Eguchi ([18], Theorem 5.1) in the case of i.i.d. data. We investigate whether the asymptotic distribution given by Equation (16) works appropriately when outliers exist.
The asymptotic covariance matrix in Equation (16) depends on $C(\gamma, h)$, $C_1(\gamma, h)$, and $C_2(\gamma, h)$. For the LPRE, simple calculations provide
\[ \begin{aligned} C(\gamma, h) &= \int_0^\infty h(x)^{1+\gamma}\, dx = \frac{ K_\gamma(2+2\gamma) }{ 2^{\gamma} K_0(2)^{1+\gamma} }, \\ C_1(\gamma, h) &= \int_0^\infty x\, h(x)^{\gamma} h'(x)\, dx = -\frac{ K_\gamma(2+2\gamma) }{ (1+\gamma)\, 2^{\gamma} K_0(2)^{1+\gamma} }, \\ C_2(\gamma, h) &= \int_0^\infty u_h(x)^2\, h(x)^{2\gamma+1}\, dx = \frac{ 2^{1-2\gamma} }{ (2\gamma+1)^2 K_0(2)^{2\gamma+1} } \Bigl\{ \gamma(2\gamma+1) K_{2\gamma-2}(4\gamma+2) + (1 + \gamma + 2\gamma^2) K_{2\gamma-1}(4\gamma+2) \Bigr\}. \end{aligned} \]
The modified Bessel function of the third kind, $K_\cdot(\cdot)$, can be numerically computed, and we thereby obtain the values of $C(\gamma, h)$, $C_1(\gamma, h)$, and $C_2(\gamma, h)$.
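For instance, these constants can be evaluated in R with the base besselK() function (a sketch; the simplification $C_1 = -C/(1+\gamma)$ follows from integration by parts, and abs() is harmless because $K_{-\nu} = K_\nu$):

## Constants C(gamma,h), C1(gamma,h), C2(gamma,h) for the LPRE via besselK().
lpre_constants <- function(gamma) {
  K0 <- besselK(2, 0)
  C  <- besselK(2 + 2 * gamma, gamma) / (2^gamma * K0^(1 + gamma))
  C1 <- -C / (1 + gamma)
  C2 <- 2^(1 - 2 * gamma) / ((2 * gamma + 1)^2 * K0^(2 * gamma + 1)) *
    (gamma * (2 * gamma + 1) * besselK(4 * gamma + 2, abs(2 * gamma - 2)) +
     (1 + gamma + 2 * gamma^2) * besselK(4 * gamma + 2, abs(2 * gamma - 1)))
  c(C = C, C1 = C1, C2 = C2)
}
lpre_constants(0.5)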
Let $z = (z_1, \ldots, z_p)^T$ be
\[ z := \sqrt{n}\, \Bigl\{ \operatorname{diag}\bigl( J_\gamma^{-1} \Delta_\gamma J_\gamma^{-1} \bigr) \Bigr\}^{-1/2} \bigl( \hat{\beta}_\gamma - \beta_0 \bigr). \]
Equation (16) implies that
\[ z_j \overset{\mathcal{L}}{\to} N(0, 1), \qquad j = 1, \ldots, p. \]
We expect that the histogram of $z_j$ obtained by the simulation approximates the density function of the standard normal distribution when there are no (or few) outliers. When there is a significant number of outliers, the asymptotic distribution of $z_j$ may not be $N(0,1)$ but is expected to be close to $N(0,1)$ for large $\gamma$. Figure 2 shows the histograms of $T = 10{,}000$ samples of $z_2$ along with the density function of the standard normal distribution for $\mu = 5$ in Model 1.
When there are no outliers, the distribution of $z_2$ is close to the standard normal distribution whatever value of $\gamma$ is selected. When the outlier ratio is large, the histogram of $z_2$ is far from the density function of $N(0,1)$ for small $\gamma$. However, when the value of $\gamma$ is large, the histogram of $z_2$ is close to the density function of $N(0,1)$, which implies that the asymptotic distribution in Equation (16) appropriately approximates the distribution of the estimator even when outliers exist. The asymptotic distributions of the other $z_j$'s show a similar tendency to that of $z_2$.

7. Real Data Analysis

We apply the proposed method to electricity consumption data from the UCI (University of California, Irvine) Machine Learning Repository [32]. The dataset consists of electricity consumption records for 370 households from January 2011 to December 2014. The electricity consumption is in kWh at 15-minute intervals. We consider the problem of predicting the electricity consumption for the next day by using past electricity consumption. Prediction of day-ahead electricity consumption is needed when electricity is traded on markets, such as the European Power Exchange (EPEX) day-ahead market (https://www.epexspot.com/en/market-data/dayaheadauction) and the Japan Power Exchange (JEPX) day-ahead market (http://www.jepx.org/english/index.html). In the JEPX market, when the predicted electricity consumption $\hat{y}_t$ is smaller than the actual electricity consumption $y_t$, the amount $y_t - \hat{y}_t$ is charged at the “imbalance price”, which is usually higher than the ordinary price. For details, please refer to Sioshansi and Pfaffenberger [33].
To investigate the effectiveness of the proposed procedure, we choose one household that includes small positive values of electricity consumption. The consumption data for 25 December 2014 were deleted because the corresponding data values are zero. We predict the electricity consumption from January 2012 to December 2014 (the data in 2011 are used only for estimating the parameter). The actual electricity consumption data from January 2012 to December 2014 are depicted in Figure 3.
From Figure 3, we observe that several data values are close to zero. In particular, from October to December 2014, there are several downward spikes that attain nearly zero values. In this case, the estimation accuracy of ordinary GRE criteria is poor, as shown by the numerical simulation in the previous section.
We assume the multiplicative regression model in Equation (1) to predict electricity consumption. Let $y_t$ denote the electricity consumption at time $t$ ($t = 1, \ldots, T$). The number of observations is $T = (365 \times 3 + 366 - 1) \times 96 = 140{,}160$. Here, 96 is the number of measurements in one day because electricity demand is recorded at 15-minute intervals. We define $x_t$ as $x_t = (y_{t-d}, \ldots, y_{t-dq})^T$, where $d = 96$. In our model, the electricity consumption at time $t$ is explained by the electricity consumption of the past $q$ days at the same time of day. We set $q = 5$ for the data analysis and use the past $n = 100$ days of observations to estimate the model.
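A small R sketch of this lagged design construction (the vector name load and the helper are hypothetical):

## Build x_t = (y_{t-d}, ..., y_{t-qd}) from a vector of 15-minute readings.
make_lagged <- function(load, q = 5, d = 96) {
  T_len <- length(load)
  idx <- (q * d + 1):T_len                          # time points with full history
  X <- sapply(1:q, function(k) load[idx - k * d])   # columns y_{t-d}, ..., y_{t-qd}
  list(y = load[idx], X = X)
}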
The model parameters are estimated by the robust LPRE. The values of $\gamma$ are taken on a regular grid from 0 to 0.1 with increments of 0.01. To minimize the negative $\gamma$-likelihood function, we apply our proposed MM algorithm. As the electricity consumption pattern on weekdays is known to be completely different from that on weekends, we make predictions for weekdays and weekends separately. The resulting relative prediction errors are depicted in Figure 4.
The relative prediction error is large when $\gamma = 0$ (i.e., ordinary LPRE estimation). The minimum value of the relative prediction error is 0.049, attained at $\gamma = 0.03$. When $\gamma$ is set too large, efficiency decreases and the relative prediction error may increase.
Figure 5 shows the prediction values when $\gamma = 0$. We observe several extremely large prediction values (e.g., 8 July 2013 and 6 November 2014), caused by model parameters that are heavily affected by the nearly zero values of electricity consumption.
Figure 6 shows the prediction values when $\gamma = 0.03$. Extremely large prediction values are no longer observed, and the predictions are similar to the actual electricity demand in Figure 3. Therefore, our proposed procedure is robust against outliers.
Additionally, we apply the Yule–Walker method, one of the most popular estimation procedures for the autoregressive (AR) model. Note that the Yule–Walker method does not regard small positive values of $y_t$ as outliers, so that we do not need a robust AR model for this dataset. The relative prediction error of the Yule–Walker method is 0.123, which is larger than that of our proposed method (0.049).
Furthermore, to investigate the stability of the MM algorithm described in Section 5, we also apply the BFGS method to obtain the minimizer of the negative $\gamma$-likelihood function. The optim function in R is used to implement the BFGS method. With the BFGS method, the relative prediction errors diverge when $\gamma \ge 0.03$. Consequently, the MM algorithm is more stable than the BFGS algorithm for this dataset.

8. Discussion

We proposed a relative error estimation procedure that is robust against outliers. The proposed procedure is based on the γ-likelihood function, which is constructed by the γ-cross entropy [18]. We showed that the proposed method has the redescending property, a desirable property in the robust statistics literature. We derived the asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, which allows the construction of approximate confidence sets. Besides the theoretical results, we constructed an efficient algorithm that minimizes a convex loss function at each iteration. The proposed algorithm monotonically decreases the objective function at each iteration.
Our simulation results showed that the proposed method performed better than the ordinary relative error estimation procedures in terms of prediction accuracy. Furthermore, the asymptotic distribution of the estimator yielded a good approximation, with an appropriate value of γ , even when outliers existed. The proposed method was applied to electricity consumption data, which included small positive values. Although the ordinary LPRE was sensitive to small positive values, our method was able to appropriately eliminate the negative effect of these values.
In practice, variable selection is one of the most important topics in regression analysis. The ordinary AIC (Akaike information criterion, Akaike [34]) cannot be directly applied to our proposed method because the AIC aims at minimizing the Kullback–Leibler divergence, whereas our method aims at minimizing the γ -divergence. As a future research topic, it would be interesting to derive the model selection criterion for evaluating a model estimated by the γ -likelihood method.
High-dimensional data analysis is also an important topic in statistics. In particular, the sparse estimation, such as the lasso [12], is a standard tool to deal with high-dimensional data. As shown in Remark 3, our method may be extended to L 1 regularization. An important point in the regularization procedure is the selection of a regularization parameter. Hao et al. [14] suggested using the BIC (Bayesian information criterion)-type criterion of Wang et al. [35,36] for the ordinary LPRE estimator. It would also be interesting to consider the problem of regularization parameter selection in high-dimensional robust relative error estimation.
In regression analysis, we may formulate two types of γ-likelihood functions: Fujisawa and Eguchi's formulation [18] and Kawashima and Fujisawa's formulation [17]. Kawashima and Fujisawa [22] reported that a difference in performance arises when the outlier ratio depends on the explanatory variables. In the multiplicative regression model in Equation (1), the responses $y_i$ depend more strongly on the explanatory variables $x_i$ than in the ordinary linear regression model, because $y_i$ is an exponential function of the $x_{ij}$. As a result, the comparison of the above two formulations of the γ-likelihood function would be important from both theoretical and practical points of view.

Supplementary Materials

The following are available online at https://www.mdpi.com/1099-4300/20/9/632/s1.

Author Contributions

K.H. proposed the algorithm, made the R package rree, conducted the simulation study, and analyzed the real data; H.M. derived the asymptotics; K.H. and H.M. wrote the paper.

Funding

This research was funded by the Japan Society for the Promotion of Science KAKENHI 15K15949, and the Center of Innovation Program (COI) from Japan Science and Technology Agency (JST), Japan (K.H.), and JST CREST Grant Number JPMJCR14D7 (H.M.)

Acknowledgments

The authors would like to thank anonymous reviewers for the constructive and helpful comments that improved the quality of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

All of the asymptotics will be taken under $n \to \infty$. We write $a_n \lesssim b_n$ if there exists a universal constant $c > 0$ such that $a_n \le c\, b_n$ for every $n$ large enough. For random functions $X_n$ and $X_0$ on $\bar{B}$, we write $X_n(\beta) \overset{p}{\to} X_0(\beta)$ if $\sup_{\beta \in \bar{B}} | X_n(\beta) - X_0(\beta) | \overset{p}{\to} 0$; below, we simply write $\sup_\beta$ for $\sup_{\beta \in \bar{B}}$.
First, we state a preliminary lemma, which will be repeatedly used in the sequel.
Lemma A1.
Let η ( x ; β ) and ζ ( x , y ; β ) be vector-valued measurable functions satisfying that
\[ \sup_\beta \max_{k \in \{0,1\}} \bigl| \partial_\beta^k \eta(x; \beta) \bigr| \le \bar{\eta}(x), \qquad \sup_\beta \max_{k \in \{0,1\}} \bigl| \partial_\beta^k \zeta(x, y; \beta) \bigr| \le \bar{\zeta}(x, y), \]
for some $\bar{\eta}$ and $\bar{\zeta}$ such that
\[ \bar{\eta} + \int_0^\infty \bar{\zeta}(\cdot, y)\, dy \in \bigcap_{q > 0} L_q(\pi). \]
Then,
\[ \frac{1}{n} \sum_{i=1}^{n} \eta(x_i; \beta) \overset{p}{\to} \int_{\mathcal{X}} \eta(x; \beta)\, \pi(dx), \qquad \text{(A1)} \]
\[ \frac{1}{n} \sum_{i=1}^{n} \zeta(x_i, y_i; \beta) \overset{p}{\to} \int_0^\infty\!\!\int_{\mathcal{X}} \zeta(x, y; \beta)\, f(y \mid x; \beta_0)\, \pi(dx)\, dy. \qquad \text{(A2)} \]
Proof. 
Equation (A1) is a special case of Equation (A2), hence we only show the latter. Observe that
\[ \begin{aligned} &\sup_\beta \left| \frac{1}{n} \sum_{i=1}^{n} \zeta(x_i, y_i; \beta) - \iint \zeta(x, y; \beta)\, f(y \mid x; \beta_0)\, \pi(dx)\, dy \right| \\ &\quad \le \frac{1}{\sqrt{n}}\, \sup_\beta \left| \sum_{i=1}^{n} \frac{1}{\sqrt{n}} \left\{ \zeta(x_i, y_i; \beta) - \int \zeta(x_i, y; \beta)\, f(y \mid x_i; \beta_0)\, dy \right\} \right| \\ &\qquad + \sup_\beta \left| \frac{1}{n} \sum_{i=1}^{n} \int \zeta(x_i, y; \beta)\, f(y \mid x_i; \beta_0)\, dy - \iint \zeta(x, y; \beta)\, f(y \mid x; \beta_0)\, \pi(dx)\, dy \right| \\ &\quad =: \frac{1}{\sqrt{n}}\, \sup_\beta \bigl| M_n(\beta) \bigr| + \sup_\beta \bigl| C_n(\beta) \bigr|. \end{aligned} \]
For the first term, let us recall the Sobolev inequality ([37], Section 10.2):
\[ E\Bigl[ \sup_\beta | M_n(\beta) |^q \Bigr] \lesssim \sup_\beta E\bigl[ | M_n(\beta) |^q \bigr] + \sup_\beta E\bigl[ | \partial_\beta M_n(\beta) |^q \bigr] \qquad \text{(A3)} \]
for $q > p$. The summands of $M_n(\beta)$ trivially form a martingale difference array with respect to the filtration $\mathcal{F}_i := \sigma(x_j;\ j \le i)$, $i \in \mathbb{N}$: since we are assuming that the conditional distribution of $y_i$ given $\{ (x_i, x_{i-1}, x_{i-2}, \ldots), (y_{i-1}, y_{i-2}, \ldots) \}$ equals that given $x_i$ (Section 2 and Section 3), each summand of $M_n(\beta)$ equals $\frac{1}{\sqrt{n}} \bigl[ \zeta(x_i, y_i; \beta) - E\{ \zeta(x_i, y_i; \beta) \mid \mathcal{F}_{i-1} \} \bigr]$. Hence, by means of Burkholder's inequality for martingales, we obtain, for $q > p \vee 2$,
\[ \sup_\beta E\bigl[ | M_n(\beta) |^q \bigr] \lesssim \sup_\beta \frac{1}{n} \sum_{i=1}^{n} E\left[ \left| \zeta(x_i, y_i; \beta) - \int \zeta(x_i, y; \beta)\, f(y \mid x_i; \beta_0)\, dy \right|^q \right] < \infty. \]
We can take the same route for the summands of $\partial_\beta M_n(\beta)$ to conclude that $\sup_\beta E[ | \partial_\beta M_n(\beta) |^q ] < \infty$. These estimates combined with Equation (A3) then lead to the conclusion that
\[ \sup_\beta \bigl| M_n(\beta) \bigr| = O_p(1). \]
As for the other term, we have $C_n(\beta) \overset{p}{\to} 0$ for each $\beta$, and also
\[ \sup_n E\Bigl[ \sup_\beta \bigl| \partial_\beta C_n(\beta) \bigr| \Bigr] < \infty. \]
The latter implies the tightness of the family $\{ C_n(\beta) \}_n$ of continuous random functions on the compact set $\bar{B}$, thereby entailing that $C_n(\beta) \overset{p}{\to} 0$ uniformly in $\beta$. The proof is complete. □

Appendix A.1. Consistency

Let $f_i(\beta) := f(y_i \mid x_i; \beta)$ for brevity, and let
\[ A_{\gamma,n}(\beta) := \frac{1}{n} \sum_{i=1}^{n} f_i(\beta)^{\gamma}, \qquad \bar{A}_{\gamma,n}(\beta) := \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta)^{\gamma+1}\, dy = C(\gamma, h)\, \frac{1}{n} \sum_{i=1}^{n} \exp(-\gamma x_i^T \beta). \]
By means of Lemma A1, we have
\[ A_{\gamma,n}(\beta) \overset{p}{\to} A_\gamma(\beta) := \iint f(y \mid x; \beta)^{\gamma}\, f(y \mid x; \beta_0)\, \pi(dx)\, dy, \qquad \bar{A}_{\gamma,n}(\beta) \overset{p}{\to} \bar{A}_\gamma(\beta) := \iint f(y \mid x; \beta)^{\gamma+1}\, \pi(dx)\, dy = C(\gamma, h) \int \exp(-\gamma x^T \beta)\, \pi(dx). \]
Since $\inf_\beta \{ A_\gamma(\beta) \wedge \bar{A}_\gamma(\beta) \} > 0$, we see that taking the logarithm preserves the uniformity of the convergence in probability: for the $\gamma$-likelihood function (7), it holds that
\[ \ell_{\gamma,n}(\beta) \overset{p}{\to} \ell_{\gamma,0}(\beta) := -\frac{1}{\gamma} \log A_\gamma(\beta) + \frac{1}{1+\gamma} \log \bar{A}_\gamma(\beta). \qquad \text{(A4)} \]
The limit equals the $\gamma$-cross entropy from $g(\cdot \mid \cdot) = f(\cdot \mid \cdot; \beta_0)$ to $f(\cdot \mid \cdot; \beta)$. We have $\ell_{\gamma,0}(\beta) \ge \ell_{\gamma,0}(\beta_0)$, the equality holding if and only if $f(\cdot \mid \cdot; \beta_0) = f(\cdot \mid \cdot; \beta)$ (see [17], Theorem 1). By Equation (4), the latter condition is equivalent to $\rho( e^{-x^T\beta} y ) = \rho( e^{-x^T\beta_0} y )$, which in turn yields $\beta = \beta_0$ by Assumption 3. This, combined with Equation (A4) and the argmin theorem (cf. [25], Chapter 5), concludes the consistency $\hat{\beta}_\gamma \overset{p}{\to} \beta_0$. Note that we would not need Assumption 3 if $\ell_{\gamma,n}$ were a.s. convex, which generally may not be the case for $\gamma > 0$.

Appendix A.2. Asymptotic Normality

First, we note that Assumption 2 ensures that, for every $\alpha > 0$, there corresponds a function $\bar{F}_\alpha \in L_1\bigl( f(y \mid x; \beta_0)\, \pi(dx)\, dy \bigr)$ such that
\[ \max_{k=0,1,2,3}\ \sup_\beta \bigl| \partial_\beta^k f(y \mid x; \beta) \bigr|^{\alpha} \le \bar{F}_\alpha(x, y). \]
This estimate enables us to interchange the order of $\partial_\beta$ and the $dy$-Lebesgue integration, which will be done repeatedly below without mention.
Let $s_i(\beta) = s(y_i \mid x_i; \beta)$, and
\[ S_{\gamma,n}(\beta) := \frac{1}{n} \sum_{i=1}^{n} f_i(\beta)^{\gamma} s_i(\beta), \qquad \bar{S}_{\gamma,n}(\beta) := \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta)^{\gamma+1}\, s(y \mid x_i; \beta)\, dy. \]
Then, the $\gamma$-likelihood equation $\partial_\beta \ell_{\gamma,n}(\beta) = 0$ is equivalent to
\[ \Psi_{\gamma,n}(\beta) := \bar{A}_{\gamma,n}(\beta)\, S_{\gamma,n}(\beta) - A_{\gamma,n}(\beta)\, \bar{S}_{\gamma,n}(\beta) = 0. \]
By the consistency of $\hat{\beta}_\gamma$, we have $P( \hat{\beta}_\gamma \in B ) \to 1$; hence $P\{ \Psi_{\gamma,n}(\hat{\beta}_\gamma) = 0 \} \to 1$ as well, for $B$ is open. Therefore, virtually defining $\hat{\beta}_\gamma$ to be $\beta_0 \in B$ if $\Psi_{\gamma,n}(\beta) = 0$ has no root, we may and do proceed as if $\Psi_{\gamma,n}(\hat{\beta}_\gamma) = 0$ a.s. In view of the Taylor expansion
\[ -\left\{ \int_0^1 \partial_\beta \Psi_{\gamma,n}\bigl( \beta_0 + s( \hat{\beta}_\gamma - \beta_0 ) \bigr)\, ds \right\} \sqrt{n}\, \bigl( \hat{\beta}_\gamma - \beta_0 \bigr) = \sqrt{n}\, \Psi_{\gamma,n}(\beta_0), \]
to conclude Equation (16), it suffices to show that (recall the definitions (14) and (15))
\[ \sqrt{n}\, \Psi_{\gamma,n}(\beta_0) \overset{\mathcal{L}}{\to} N_p( 0, \Delta_\gamma ), \qquad \text{(A5)} \]
\[ -\partial_\beta \Psi_{\gamma,n}(\hat{\beta}_n) \overset{p}{\to} J_\gamma \quad \text{for every } \hat{\beta}_n \overset{p}{\to} \beta_0. \qquad \text{(A6)} \]
First, we prove Equation (A5). By direct computations and Lemma A1, we see that
\[ \sqrt{n}\, \Psi_{\gamma,n} = \bar{A}_{\gamma,n}\, \sqrt{n}\bigl( S_{\gamma,n} - \bar{S}_{\gamma,n} \bigr) - \sqrt{n}\bigl( A_{\gamma,n} - \bar{A}_{\gamma,n} \bigr) \bar{S}_{\gamma,n} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ H_\gamma \Bigl\{ f_i^{\gamma} s_i - \int f(y \mid x_i; \beta_0)^{\gamma+1} s(y \mid x_i; \beta_0)\, dy \Bigr\} - \Bigl\{ f_i^{\gamma} - \int f(y \mid x_i; \beta_0)^{\gamma+1}\, dy \Bigr\} H'_\gamma \right] + o_p(1) =: \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \chi_{\gamma,i} + o_p(1). \]
The sequence $(\chi_{\gamma,i})_{i \le n}$ is an $(\mathcal{F}_i)$-martingale-difference array. It is easy to verify the Lyapunov condition: for some $\alpha > 0$,
\[ \sup_n \sup_{i \le n} E\bigl[ | \chi_{\gamma,i} |^{2+\alpha} \bigr] < \infty. \]
Hence, the martingale central limit theorem concludes Equation (A5) if we show the following convergence of the quadratic characteristic:
\[ \frac{1}{n} \sum_{i=1}^{n} E\bigl[ \chi_{\gamma,i}^{\otimes 2} \mid \mathcal{F}_{i-1} \bigr] \overset{p}{\to} \Delta_\gamma. \]
This follows upon observing that
\[ \begin{aligned} \frac{1}{n} \sum_{i=1}^{n} E\bigl[ \chi_{\gamma,i}^{\otimes 2} \mid \mathcal{F}_{i-1} \bigr] &= (H_\gamma)^2\, \frac{1}{n} \sum_{i=1}^{n} \operatorname{var}\bigl( f_i^{\gamma} s_i \mid \mathcal{F}_{i-1} \bigr) + \frac{1}{n} \sum_{i=1}^{n} \operatorname{var}\bigl( f_i^{\gamma} \mid \mathcal{F}_{i-1} \bigr)\, (H'_\gamma)^{\otimes 2} - 2 H_\gamma\, \frac{1}{n} \sum_{i=1}^{n} \operatorname{cov}\bigl( f_i^{\gamma} s_i,\ f_i^{\gamma} \mid \mathcal{F}_{i-1} \bigr)\, (H'_\gamma)^{T} \\ &= (H_\gamma)^2 \left\{ \iint f(y \mid x; \beta_0)^{2\gamma+1}\, s(y \mid x; \beta_0)^{\otimes 2}\, dy\, \pi(dx) - (H'_\gamma)^{\otimes 2} \right\} + \left\{ \iint f(y \mid x; \beta_0)^{2\gamma+1}\, dy\, \pi(dx) - (H_\gamma)^2 \right\} (H'_\gamma)^{\otimes 2} \\ &\qquad - 2 H_\gamma \left\{ \iint f(y \mid x; \beta_0)^{2\gamma+1}\, s(y \mid x; \beta_0)\, dy\, \pi(dx) - H_\gamma H'_\gamma \right\} (H'_\gamma)^{T} + o_p(1) \\ &= (H_\gamma)^2 \iint f(y \mid x; \beta_0)^{2\gamma+1}\, s(y \mid x; \beta_0)^{\otimes 2}\, dy\, \pi(dx) + (H'_\gamma)^{\otimes 2} \iint f(y \mid x; \beta_0)^{2\gamma+1}\, dy\, \pi(dx) \\ &\qquad - 2 H_\gamma \left\{ \iint f(y \mid x; \beta_0)^{2\gamma+1}\, s(y \mid x; \beta_0)\, dy\, \pi(dx) \right\} (H'_\gamma)^{T} + o_p(1) = \Delta_\gamma + o_p(1), \end{aligned} \]
where the expression (4) is invoked for the last equality.
Next, we show Equation (A6). Under the present regularity conditions, we can deduce that
\[ \sup_\beta \bigl| \partial_\beta^2 \Psi_{\gamma,n}(\beta) \bigr| = O_p(1). \]
It therefore suffices to verify that $-\partial_\beta \Psi_{\gamma,n}(\beta_0) \overset{p}{\to} J_\gamma(\beta_0) = J_\gamma$. This follows from a direct computation of $-\partial_\beta \Psi_{\gamma,n}(\beta_0)$, combined with applications of Lemma A1 (note that $\bar{A}_{\gamma,n}$ and $A_{\gamma,n}$ have the same limit in probability):
\[ \begin{aligned} -\partial_\beta \Psi_{\gamma,n}(\beta_0) &= A_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta_0)^{\gamma+1}\, s(y \mid x_i; \beta_0)^{\otimes 2}\, dy - \bar{S}_{\gamma,n} S_{\gamma,n}^T + \gamma \bigl( S_{\gamma,n} \bar{S}_{\gamma,n}^T - \bar{S}_{\gamma,n} S_{\gamma,n}^T \bigr) \\ &\qquad + \gamma \left\{ A_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta_0)^{\gamma+1}\, s(y \mid x_i; \beta_0)^{\otimes 2}\, dy - \bar{A}_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} f_i^{\gamma} s_i^{\otimes 2} \right\} \\ &\qquad + \left\{ A_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta_0)^{\gamma+1}\, \partial_\beta s(y \mid x_i; \beta_0)\, dy - \bar{A}_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} f_i^{\gamma}\, \partial_\beta s_i \right\} \\ &= A_{\gamma,n}\, \frac{1}{n} \sum_{i=1}^{n} \int f(y \mid x_i; \beta_0)^{\gamma+1}\, s(y \mid x_i; \beta_0)^{\otimes 2}\, dy - \bar{S}_{\gamma,n} S_{\gamma,n}^T + o_p(1) \\ &= H_\gamma \iint f(y \mid x; \beta_0)^{\gamma+1}\, s(y \mid x; \beta_0)^{\otimes 2}\, dy\, \pi(dx) - (H'_\gamma)^{\otimes 2} + o_p(1) = J_\gamma + o_p(1). \end{aligned} \]

Appendix A.3. Consistent Estimator of the Asymptotic Covariance Matrix

Thanks to the stability assumptions on the sequence $x_1, x_2, \ldots$, we have
\[ \frac{1}{n} \sum_{i=1}^{n} x_i^{\otimes k} \exp(-\gamma x_i^T \beta_0) \overset{p}{\to} \Pi_k(\gamma), \qquad k = 0, 1, 2. \]
Moreover, for the constants $\delta, \delta' > 0$ given in Assumption 1, we have
\[ \left| \frac{1}{n} \sum_{i=1}^{n} x_i^{\otimes k} \exp(-\gamma x_i^T \beta_0) - \frac{1}{n} \sum_{i=1}^{n} x_i^{\otimes k} \exp(-\gamma x_i^T \hat{\beta}_\gamma) \right| \lesssim \left\{ \frac{1}{n} \sum_{i=1}^{n} |x_i|^{k+1} \exp\bigl( \delta |x_i|^{1+\delta'} \bigr) \right\} \bigl| \hat{\beta}_\gamma - \beta_0 \bigr| = O_p(1)\, \bigl| \hat{\beta}_\gamma - \beta_0 \bigr| \overset{p}{\to} 0. \]
These observations are enough to conclude Equation (17).

References

  1. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer Series in Statistics; Springer: New York, NY, USA, 2009. [Google Scholar]
  2. Park, H.; Stefanski, L.A. Relative-error prediction. Stat. Probab. Lett. 1998, 40, 227–236. [Google Scholar] [CrossRef]
  3. Ye, J. Price Models and the Value Relevance of Accounting Information. SSRN Electronic Journal 2007. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1003067 (accessed on 20 August 2018).
  4. Van der Meer, D.W.; Widén, J.; Munkhammar, J. Review on probabilistic forecasting of photovoltaic power production and electricity consumption. Renew. Sust. Energ. Rev. 2018, 81, 1484–1512. [Google Scholar] [CrossRef]
  5. Mount, J. Relative error distributions, without the heavy tail theatrics. 2016. Available online: http://www.win-vector.com/blog/2016/09/relative-error-distributions-without-the-heavy-tail-theatrics/ (accessed on 20 August 2018).
  6. Chen, K.; Guo, S.; Lin, Y.; Ying, Z. Least Absolute Relative Error Estimation. J. Am. Stat. Assoc. 2010, 105, 1104–1112. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Li, Z.; Lin, Y.; Zhou, G.; Zhou, W. Empirical likelihood for least absolute relative error regression. TEST 2013, 23, 86–99. [Google Scholar] [CrossRef]
  8. Chen, K.; Lin, Y.; Wang, Z.; Ying, Z. Least product relative error estimation. J. Multivariate Anal. 2016, 144, 91–98. [Google Scholar] [CrossRef] [Green Version]
  9. Ding, H.; Wang, Z.; Wu, Y. A relative error-based estimation with an increasing number of parameters. Commun. Stat. Theory Methods 2017, 47, 196–209. [Google Scholar] [CrossRef]
  10. Demongeot, J.; Hamie, A.; Laksaci, A.; Rachdi, M. Relative-error prediction in nonparametric functional statistics: Theory and practice. J. Multivariate Anal. 2016, 146, 261–268. [Google Scholar] [CrossRef]
  11. Wang, Z.; Chen, Z.; Chen, Z. H-relative error estimation for multiplicative regression model with random effect. Comput. Stat. 2018, 33, 623–638. [Google Scholar] [CrossRef]
  12. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Methodol. 1996, 58, 267–288. [Google Scholar]
  13. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Series B Stat. Methodol. 2006, 68, 49–67. [Google Scholar] [CrossRef] [Green Version]
  14. Hao, M.; Lin, Y.; Zhao, X. A relative error-based approach for variable selection. Comput. Stat. Data Anal. 2016, 103, 250–262. [Google Scholar] [CrossRef]
  15. Liu, X.; Lin, Y.; Wang, Z. Group variable selection for relative error regression. J. Stat. Plan. Inference 2016, 175, 40–50. [Google Scholar] [CrossRef]
  16. Xia, X.; Liu, Z.; Yang, H. Regularized estimation for the least absolute relative error models with a diverging number of covariates. Comput. Stat. Data Anal. 2016, 96, 104–119. [Google Scholar] [CrossRef]
  17. Kawashima, T.; Fujisawa, H. Robust and Sparse Regression via γ-Divergence. Entropy 2017, 19, 608. [Google Scholar] [CrossRef]
  18. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  19. Maronna, R.; Martin, D.; Yohai, V. Robust Statistics; John Wiley & Sons: Chichester, UK, 2006. [Google Scholar]
  20. Koudou, A.E.; Ley, C. Characterizations of GIG laws: A survey. Probab. Surv. 2014, 11, 161–176. [Google Scholar] [CrossRef]
  21. Jones, M.C.; Hjort, N.L.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 2001, 88, 865–873. [Google Scholar] [CrossRef]
  22. Kawashima, T.; Fujisawa, H. On Difference between Two Types of γ-divergence for Regression. 2018. Available online: https://arxiv.org/abs/1805.06144 (accessed on 20 August 2018).
  23. Ferrari, D.; Yang, Y. Maximum Lq-likelihood estimation. Ann. Stat. 2010, 38, 753–783. [Google Scholar] [CrossRef]
  24. Basu, A.; Harris, I.R.; Hjort, N.L.; Jones, M.C. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559. [Google Scholar] [CrossRef]
  25. Van der Vaart, A.W. Asymptotic Statistics; Vol. 3, Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar]
  26. Eguchi, S.; Kano, Y. Robustifing maximum likelihood estimation by psi-divergence. ISM Research Memorandam. 2001. Available online: https://www.researchgate.net/profile/Shinto Eguchi/publication/228561230Robustifing maximum likelihood estimation by psi-divergence/links545d65910cf2c1a63bfa63e6.pdf (accessed on 20 August 2018).
  27. Hunter, D.R.; Lange, K. A tutorial on MM algorithms. Am. Stat. 2004, 58, 30–37. [Google Scholar] [CrossRef]
  28. Böhning, D. Multinomial logistic regression algorithm. Ann. Inst. Stat. Math. 1992, 44, 197–200. [Google Scholar] [CrossRef]
  29. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B Stat. Methodol. 2005, 67, 301–320. [Google Scholar] [CrossRef]
  30. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  31. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1. [Google Scholar] [CrossRef] [PubMed]
  32. Dheeru, D.; Karra Taniskidou, E. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (accessed on 20 August 2018).
  33. Sioshansi, F.P.; Pfaffenberger, W. Electricity Market Reform: An International Perspective; Elsevier: Oxford, UK, 2006. [Google Scholar]
  34. Akaike, H. A new look at the statistical model identification. IEEE Trans. Automat. Contr. 1974, 19, 716–723. [Google Scholar] [CrossRef]
  35. Wang, H.; Li, R.; Tsai, C.L. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika 2007, 94, 553–568. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  36. Wang, H.; Li, B.; Leng, C. Shrinkage tuning parameter selection with a diverging number of parameters. J. R. Stat. Soc. Series B Stat. Methodol. 2009, 71, 671–683. [Google Scholar] [CrossRef] [Green Version]
  37. Friedman, A. Stochastic Differential Equations and Applications; Dover Publications: New York, NY, USA, 2006. [Google Scholar]
Figure 1. Median and error bar of the relative prediction error (RPE) in Equation (31) and the mean squared error (MSE) of $\beta$ in Equation (32) when the parameters of the log-normal distribution (the distribution of outliers) are $(\mu, \sigma) = (\pm 5, 1)$. The error bars are delineated by the 25th and 75th percentiles.
Figure 2. Histograms of $T = 10{,}000$ samples of $z_2$ along with the density function of the standard normal distribution for $\mu = 5$ in Model 1.
Figure 3. Electricity consumption from January 2012 to December 2014 for one of the 370 households.
Figure 4. Relative prediction error for various values of $\gamma$ for the household electricity consumption data.
Figure 5. Prediction values based on the least product relative error (LPRE) loss for the household electricity consumption data.
Figure 6. Prediction values based on the proposed method with $\gamma = 0.03$ for the household electricity consumption data.
Table 1. Several examples of general relative error (GRE) criteria and their properties. “Likelihood” in the second column means the existence of a likelihood function that corresponds to the loss function. The properties of “Convexity” and “Smoothness” in the last two columns respectively indicate those with respect to β of the corresponding loss function.
g(a, b) | Likelihood | Convexity | Smoothness
a^2
a + b
max{a, b}
ab
a^2 + b^2
