1. Introduction
In regression analysis, many analysts use the (penalized) least squares estimation, which aims at minimizing the mean squared prediction error [1]. On the other hand, the relative (percentage) error is often more useful and/or adequate than the mean squared error. For example, in econometrics, the comparison of prediction performance between different stock prices with different units should be made by relative error; we refer to [2,3] among others. Additionally, the prediction error of photovoltaic power production or electricity consumption is evaluated by not only the mean squared error but also the relative error (see, e.g., [4]). We refer to [5] regarding the usefulness and importance of the relative error.
In relative error estimation, we minimize a loss function based on the relative error. An advantage of using such a loss function is that it is scale free or unit free. Recently, several researchers have proposed various loss functions based on the relative error [2,3,6,7,8,9]. Some of these procedures have been extended to the nonparametric model [10] and the random effect model [11]. Relative error estimation with regularization, including the least absolute shrinkage and selection operator (lasso; [12]) and the group lasso [13], has also been proposed by several authors [14,15,16] to allow for the analysis of high-dimensional data.
In practice, a response variable can turn out to be extremely large or close to zero. For example, the electricity consumption of a company may be low during holidays and high on exceptionally hot days. These responses may often be considered to be outliers, to which the relative error estimator is sensitive because the loss function diverges as the response approaches zero or infinity. Therefore, a relative error estimation that is robust against outliers must be considered. Recently, Chen et al. [8] discussed the robustness of various relative error estimation procedures by investigating the corresponding distributions, and concluded that the distribution of the least product relative error estimation (LPRE) proposed therein has heavier tails than the others, implying that the LPRE might be more robust in practical applications. However, our numerical experiments show that the LPRE is not as robust as expected, so that the robustification of the LPRE is yet to be investigated from both theoretical and practical viewpoints.
To achieve a relative error estimation that is robust against outliers, this paper employs the γ-likelihood function for regression analysis by Kawashima and Fujisawa [17], which is constructed from the γ-cross entropy [18]. The estimating equation is shown to have a redescending property, a desirable property in the robust statistics literature [19]. To find a minimizer of the negative γ-likelihood function, we construct a majorize-minimization (MM) algorithm. The loss function of our algorithm at each iteration is shown to be convex, although the original negative γ-likelihood function is nonconvex. Our algorithm is guaranteed to decrease the objective function at each iteration. Moreover, we derive the asymptotic normality of the corresponding estimator together with a simple consistent estimator of the asymptotic covariance matrix, which enables us to straightforwardly create approximate confidence sets. A Monte Carlo simulation is conducted to investigate the performance of our proposed procedure. An analysis of electricity consumption data is presented to illustrate the usefulness of our procedure. The Supplemental Material includes our R package rree (robust relative error estimation), which implements our algorithm, along with a sample program for the rree function.
The remainder of this paper is organized as follows: Section 2 reviews several relative error estimation procedures. In Section 3, we propose a relative error estimation that is robust against outliers via the γ-likelihood function. Section 4 presents theoretical properties: the redescending property of our method and the asymptotic distribution of the estimator, the proof of the latter being deferred to Appendix A. In Section 5, the MM algorithm is constructed to find the minimizer of the negative γ-likelihood function. Section 6 investigates the effectiveness of our proposed procedure via Monte Carlo simulations. Section 7 presents the analysis of electricity consumption data. Finally, concluding remarks are given in Section 8.
2. Relative Error Estimation
Suppose that $x_1, \dots, x_n \in \mathbb{R}^p$ are predictors and $y = (y_1, \dots, y_n)^\top$ is a vector of positive responses. Consider the multiplicative regression model
$$ y_i = \exp(x_i^\top \beta)\,\varepsilon_i, \qquad i = 1, \dots, n, \qquad\qquad (1) $$
where $\beta$ is a $p$-dimensional coefficient vector, and $\varepsilon_1, \dots, \varepsilon_n$ are positive random variables. Predictors $x_i$ may be random and serially dependent, while we often set the first component of each $x_i$ equal to one, that is, incorporate the intercept in the exponent. The parameter space of $\beta$ is a bounded convex domain. We implicitly assume that the model is correctly specified, so that there exists a true parameter $\beta_0$. We want to estimate $\beta_0$ from a sample $(x_i, y_i)$, $i = 1, \dots, n$.
We first remark that the inclusion of the intercept in the exponent ensures that the model (1) is scale free regarding the response variables $y_i$, which is an essentially different nature from the linear regression model $y_i = x_i^\top\beta + \varepsilon_i$. Specifically, multiplying a positive constant $c$ to $y_i$ results in the translation of the intercept in the exponent:
$$ c\,y_i = \exp\bigl(x_i^\top\beta + \log c\bigr)\,\varepsilon_i, $$
so that the change from $y_i$ to $c\,y_i$ is equivalent to that from the intercept $\beta_1$ to $\beta_1 + \log c$. See Remark 1 on the distribution of $\varepsilon_i$.
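The following small R sketch illustrates this numerically, assuming the multiplicative form $y_i = \exp(x_i^\top\beta)\varepsilon_i$ of Equation (1) with the intercept as the first coefficient: the relative errors are unchanged when the response is rescaled and the intercept is shifted accordingly.

# Scale-free property of model (1): y_i = exp(x_i' beta) * eps_i
set.seed(1)
n <- 200
x <- cbind(1, rnorm(n))                  # first column: intercept
beta <- c(0.5, 1.0)
eps <- exp(rnorm(n, sd = 0.3))           # positive multiplicative noise
y <- exp(drop(x %*% beta)) * eps

c0 <- 10                                 # rescale the response by a positive constant
beta_shift <- beta + c(log(c0), 0)       # shift only the intercept by log(c0)

rel_err <- function(y, eta) abs(y - exp(eta)) / y
max(abs(rel_err(y, drop(x %*% beta)) - rel_err(c0 * y, drop(x %*% beta_shift))))
# 0 (up to rounding): the relative errors are unchanged after rescaling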
To provide a simple expression of the loss functions based on the relative error, we write $\hat{y}_i = \hat{y}_i(\beta) = \exp(x_i^\top\beta)$ for the fitted response. Chen et al. [6,8] pointed out that the loss criterion for relative error may depend on $|y_i - \hat{y}_i|/y_i$ and/or $|y_i - \hat{y}_i|/\hat{y}_i$. These authors also proposed general relative error (GRE) criteria, defined as a sum over observations of a criterion function $g$ of these two relative errors, where $g$ satisfies suitable regularity conditions. Most of the loss functions based on the relative error are included in the GRE class. Park and Stefanski [2] considered a loss function based on the squared relative error $(|y_i - \hat{y}_i|/y_i)^2$. It may depend heavily on small responses because it involves reciprocals of the responses, and then the estimator can be numerically unstable. Consistency and asymptotic normality may not be established under general regularity conditions [8]. The loss functions of [3] and of the least absolute relative error estimation (LARE) [6] can have desirable asymptotic properties [3,6]. However, the minimization of the loss function can be challenging, in particular for high-dimensional data, when the function is nonsmooth or nonconvex.
In practice, the following two criteria would be useful:
Least product relative error estimation (LPRE). Chen et al. [8] proposed the LPRE given by $\sum_{i=1}^n \bigl( y_i e^{-x_i^\top\beta} + y_i^{-1} e^{x_i^\top\beta} - 2 \bigr)$. The LPRE tries to minimize the product $(|y_i - \hat{y}_i|/y_i) \times (|y_i - \hat{y}_i|/\hat{y}_i)$, not necessarily both terms at once.
Least squared-sum relative error estimation (LSRE). Chen et al. [8] considered the LSRE given by $\sum_{i=1}^n \bigl\{ (|y_i - \hat{y}_i|/y_i)^2 + (|y_i - \hat{y}_i|/\hat{y}_i)^2 \bigr\}$. The LSRE aims to minimize both $|y_i - \hat{y}_i|/y_i$ and $|y_i - \hat{y}_i|/\hat{y}_i$ through the sum of squares.
The loss functions of LPRE and LSRE are smooth and convex, and also possess desirable asymptotic properties [8]. The above-described GRE criteria and their properties are summarized in Table 1. In particular, the “convexity” indicated there holds when the corresponding conditions on $g$ are satisfied and $\sum_{i=1}^n x_i x_i^\top$ is positive definite, since the Hessian matrix of the corresponding criterion is then positive definite a.s.
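For concreteness, the two criteria can be written in R as follows; this is a minimal sketch assuming the forms described above (the product form for LPRE and the sum of the two squared relative errors for LSRE), with X a design matrix whose rows are $x_i^\top$. The function names are ours, for illustration only.

# LPRE criterion: sum_i { y_i * exp(-x_i'b) + exp(x_i'b) / y_i - 2 }
lpre_loss <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(y * exp(-eta) + exp(eta) / y - 2)
}

# LSRE criterion: sum_i { (|y_i - yhat_i| / y_i)^2 + (|y_i - yhat_i| / yhat_i)^2 }
lsre_loss <- function(beta, X, y) {
  yhat <- exp(drop(X %*% beta))
  sum(((y - yhat) / y)^2 + ((y - yhat) / yhat)^2)
}

# Both are smooth in beta, so a generic optimizer applies, e.g.:
# optim(rep(0, ncol(X)), lpre_loss, X = X, y = y, method = "BFGS")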
Although not essential, we assume that the variables
in Equation (
1) are i.i.d. with common density function
h. As in Chen et al. [
8], we consider the following class of
h associated with
g:
where
and
is a normalizing constant (
) and
denotes the indicator function of set
. Furthermore, we assume the symmetry property
,
, from which it follows that
. The latter property is necessary for a score function to be associated with the gradient of a GRE loss function, hence being a martingale with respect to a suitable filtration, which often entails estimation efficiency. Indeed, the asymmetry of
(i.e.,
) may produce a substantial bias in the estimation [
3]. The entire set of our regularity conditions will be shown in
Section 4.3. The conditions therein concerning
g are easily verified for both LPRE and LSRE.
In this paper, we implicitly suppose that
denote “time” indices. As usual, in order to deal with cases of non-random and random predictors in a unified manner, we employ the partial-likelihood framework. Specifically, in the expression of the joint density (with the obvious notation for the densities)
we ignore the first product
and only look at the second one
, which is defined as the partial likelihood. We further assume that the
ith-stage noise
is independent of
, so that, in view of Equation (
1), we have
The density function of response
y given
is
From Equation (
3), we see that the maximum likelihood estimator (MLE) based on the error distribution in Equation (5) is obtained by the minimization of Equation (
2). For example, the density functions of LPRE and LSRE are
where $K_\nu$ denotes the modified Bessel function of the third kind with index $\nu$,
$$ K_\nu(z) = \frac{1}{2}\int_0^\infty t^{\nu-1} \exp\Bigl\{-\frac{z}{2}\Bigl(t + \frac{1}{t}\Bigr)\Bigr\}\,dt, $$
and the remaining factor is a normalizing constant; the normalizing constants of the LPRE and LSRE densities can be computed numerically. Density (6) is a special case of the generalized inverse Gaussian distribution (see, e.g., [20]).
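The normalizing constants can be obtained numerically; the following R check uses the identity $\int_0^\infty t^{\nu-1} e^{-(t+1/t)}\,dt = 2K_\nu(2)$ and the base R function besselK. How exactly these constants enter the LPRE and LSRE densities in Equation (6) depends on the precise form of $g$, so the check below is only a computational illustration.

# Check of the identity  integral_0^Inf t^(nu-1) exp(-(t + 1/t)) dt = 2 * besselK(2, nu)
for (nu in 0:1) {
  num <- integrate(function(t) t^(nu - 1) * exp(-(t + 1/t)), 0, Inf)$value
  cat(sprintf("nu = %d: numerical = %.6f, 2*besselK(2, nu) = %.6f\n",
              nu, num, 2 * besselK(2, nu)))
}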
Remark 1. We assume that the noise density h is fully specified in the sense that, given g, the density h does not involve any unknown quantity. However, this is not essential. For example, for the LPRE density defined by Equation (6), we could naturally incorporate one more parameter σ into h. Then, we can verify that the distributional equivalence holds whatever the value of σ is. In particular, the estimation of the parameter σ does make statistical sense and, indeed, it is possible to deduce the asymptotic normality of the joint maximum-(partial-)likelihood estimator of (β, σ). In this paper, we do not pay attention to such a possible additional parameter, but instead regard it (whenever it exists) as a nuisance parameter, as with the noise variance in the least-squares estimation of a linear regression model.
3. Robust Estimation via γ-Likelihood
In practice, outliers can often be observed. For example, electricity consumption data can contain outliers on extremely hot days. The estimation methods via GRE criteria, including LPRE and LSRE, are not robust against outliers, because the corresponding density functions are not generally heavy-tailed. Therefore, a relative error estimation method that is robust against outliers is needed. To achieve this, we consider minimizing the negative γ-(partial-)likelihood function based on the γ-cross entropy [17].
We now define the negative γ-(partial-)likelihood function by
$$ \ell_\gamma(\beta) = -\frac{1}{\gamma}\log\Bigl\{\frac{1}{n}\sum_{i=1}^n f(y_i \mid x_i; \beta)^\gamma\Bigr\} + \frac{1}{1+\gamma}\log\Bigl\{\frac{1}{n}\sum_{i=1}^n \int f(y \mid x_i; \beta)^{1+\gamma}\,dy\Bigr\}, \qquad\qquad (7) $$
where $\gamma > 0$ is a parameter that controls the degree of robustness; the limit $\gamma \to 0$ corresponds to the negative log-likelihood function, and robustness is enhanced as γ increases. On the other hand, a too large γ can decrease the efficiency of the estimator [18]. In practice, the value of γ may be selected by a cross-validation based on the γ-cross entropy (see, e.g., [18,21]). We refer to Kawashima and Fujisawa [22] for more recent observations on the comparison of the γ-divergences of Fujisawa and Eguchi [18] and Kawashima and Fujisawa [17].
There are several likelihood functions which yield robust estimation. Examples include the likelihood of [23] and the likelihood based on the density power divergence [24]. It is shown that these likelihoods and the γ-likelihood are closely related. The two negative likelihood functions are, respectively, expressed as Equations (8) and (9); the difference between these likelihood functions lies essentially in whether the logarithm is taken. Furthermore, a suitable substitution into Equation (9) gives Equation (10); therefore, the minimization of that negative likelihood function is equivalent to the minimization of the negative likelihood function in Equation (8) without the second term on its right-hand side. Note that the γ-likelihood has the redescending property, a desirable property in the robust statistics literature, as shown in Section 4.2. Moreover, it is known that the γ-likelihood is the essentially unique divergence that is robust against heavy contamination (see [18] for details). On the other hand, we have not shown whether the likelihoods of [23,24] have the redescending property or not.
The integration $\int f(y \mid x_i;\beta)^{1+\gamma}\,dy$ in the second term on the right-hand side of Equation (7) is
$$ \int f(y \mid x_i; \beta)^{1+\gamma}\,dy = C_\gamma \exp(-\gamma x_i^\top \beta), $$
where $C_\gamma = \int h(u)^{1+\gamma}\,du$ is a constant term, which is assumed to be finite. Then, Equation (7) is expressed as
$$ \ell_\gamma(\beta) = -\frac{1}{\gamma}\log\Bigl\{\frac{1}{n}\sum_{i=1}^n f(y_i \mid x_i; \beta)^\gamma\Bigr\} + \frac{1}{1+\gamma}\log\Bigl\{\frac{1}{n}\sum_{i=1}^n \exp(-\gamma x_i^\top\beta)\Bigr\} + c_\gamma, \qquad\qquad (11) $$
where $c_\gamma$ is a constant term free from $\beta$. We define the maximum γ-likelihood estimator to be any element $\hat\beta_\gamma$ such that
$$ \hat\beta_\gamma \in \operatorname*{argmin}_{\beta}\ \ell_\gamma(\beta). $$
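As a concrete illustration, the R sketch below evaluates this objective, up to additive constants, for the LPRE error density. The density form f(y | x; β) = exp{−(y e^{−x^⊤β} + y^{−1} e^{x^⊤β})}/(2K_0(2) y) and the reduction of the second term to a log-mean of exp(−γ x_i^⊤β) are our reading of Equations (6) and (11), and the function name is illustrative rather than part of the rree package.

# Negative gamma-likelihood (up to additive constants) for the LPRE density; gamma > 0.
neg_gamma_lik <- function(beta, X, y, gamma) {
  eta  <- drop(X %*% beta)
  logf <- -(y * exp(-eta) + exp(eta) / y) - log(y) - log(2 * besselK(2, 0))
  -log(mean(exp(gamma * logf))) / gamma + log(mean(exp(-gamma * eta))) / (1 + gamma)
}

# Example: simulated data with a few nearly zero responses acting as outliers.
set.seed(1)
n <- 300
X <- cbind(1, rnorm(n))
y <- exp(drop(X %*% c(0.5, 1))) * exp(rnorm(n, sd = 0.3))
y[1:5] <- 1e-3
fit <- optim(c(0, 0), neg_gamma_lik, X = X, y = y, gamma = 0.1, method = "BFGS")
fit$par    # estimate of beta that downweights the outlying responses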
5. Algorithm
Even if the GRE criterion in Equation (2) is a convex function, the negative γ-likelihood function is nonconvex. Therefore, it is difficult to find a global minimum. Here, we derive an MM (majorize-minimization) algorithm to obtain a local minimum. The MM algorithm monotonically decreases the objective function at each iteration. We refer to Hunter and Lange [27] for a concise account of the MM algorithm.
Let $\beta^{(t)}$ be the value of the parameter at the $t$th iteration. The negative γ-likelihood function in Equation (11) consists of two nonconvex functions, say $h_1(\beta)$ and $h_2(\beta)$. The majorization functions of $h_k(\beta)$, say $\tilde h_k(\beta \mid \beta^{(t)})$ ($k = 1, 2$), are constructed so that the optimization of $\tilde h_k$ is much easier than that of $h_k$. The majorization functions must satisfy the following inequalities (Equations (18) and (19)):
$$ \tilde h_k(\beta \mid \beta^{(t)}) \ge h_k(\beta) \quad \text{for all } \beta, \qquad \tilde h_k(\beta^{(t)} \mid \beta^{(t)}) = h_k(\beta^{(t)}), \qquad k = 1, 2. $$
Here, we construct the majorization functions for $h_1$ and $h_2$ in turn.
5.1. Majorization Function for $h_1$
Obviously,
and
. Applying Jensen’s inequality to
, we obtain inequality
Substituting Equation (
20) and Equation (21) into Equation (
22) gives
where
. Denoting
we observe that Equation (
23) satisfies Equation (
18) and Equation (19). It is shown that
is a convex function if the original relative error loss function is convex. Particularly, the majorization functions
based on LPRE and LSRE are both convex.
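To see the majorization at work numerically, the following R sketch checks the two defining properties (domination everywhere and tangency at the current iterate) for the first term of the objective, using the weighted surrogate with weights proportional to $f(y_i \mid x_i; \beta^{(t)})^\gamma$; this is our reconstruction of Equations (20) and (23) for the LPRE density, so the exact expressions in the paper may differ.

set.seed(2)
n <- 100; gamma <- 0.3
X <- cbind(1, rnorm(n))
y <- exp(drop(X %*% c(0.2, 0.8))) * exp(rnorm(n, sd = 0.3))

# log LPRE density up to beta-free constants
logf <- function(beta) { eta <- drop(X %*% beta); -(y * exp(-eta) + exp(eta) / y) - log(y) }
h1   <- function(beta) -log(mean(exp(gamma * logf(beta)))) / gamma   # first term of the objective

beta_t <- c(0.1, 0.5)                                 # current iterate
w <- exp(gamma * logf(beta_t)); w <- w / sum(w)       # Jensen weights at beta_t
const <- h1(beta_t) + sum(w * logf(beta_t))           # constant making the surrogate tangent
surr  <- function(beta) -sum(w * logf(beta)) + const  # weighted-loss surrogate

surr(beta_t) - h1(beta_t)              # ~ 0: touches the objective at beta_t (cf. Equation (19))
beta_other <- c(0.4, 0.9)
surr(beta_other) >= h1(beta_other)     # TRUE: dominates the objective elsewhere (cf. Equation (18))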
5.2. Majorization Function for $h_2$
Let
. We view
as a function of
. Let
By taking the derivative of
with respect to
, we have
where
. Note that
for any
.
The Taylor expansion of
at
is expressed as
where
and
is an
n-dimensional vector located between
and
. We define an
matrix
as follows:
It follows from [
28] that, in the matrix sense,
for any
. Combining Equation (
25) and Equation (
26), we have
Substituting Equation (
24) into Equation (
27) gives
where
. The majorization function of
is then constructed by
where
C is a constant term free from
. We observe that
the function in Equation (28) satisfies Equations (18) and (19). It is shown that
is a convex function because
is positive semi-definite.
5.3. MM Algorithm for Robust Relative Error Estimation
In
Section 5.1 and
Section 5.2, we have constructed the majorization functions for both
and
. The MM algorithm based on these majorization functions is detailed in Algorithm 1. The majorization function
is convex if the original relative error loss function is convex. Particularly, the majorization functions of LPRE and LSRE are both convex.
Algorithm 1 Algorithm of robust relative error estimation.
1: Set $t = 0$.
2: Set an initial value of the parameter vector $\beta^{(0)}$.
3: while not converged do
4: Update the weights by Equation (20).
5: Update $\beta^{(t+1)}$ as the minimizer of $\tilde h_1(\beta \mid \beta^{(t)}) + \tilde h_2(\beta \mid \beta^{(t)})$, where $\tilde h_1$ and $\tilde h_2$ are given by Equations (23) and (28), respectively.
6: Set $t \leftarrow t + 1$.
7: end while
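A compact R sketch of the whole iteration is given below. It is a simplified variant rather than a literal transcription of Algorithm 1: the first term is majorized by the Jensen-type weighted LPRE loss of Section 5.1 (our reading of Equations (20) and (23)), while the second term is carried over unchanged into the subproblem; since the subproblem objective still dominates the negative γ-likelihood and is tangent to it at the current iterate, the monotone descent property is preserved. Function names are illustrative and not those of the rree package.

# Illustrative MM-type iteration for the robust LPRE.
mm_robust_lpre <- function(X, y, gamma, beta0 = rep(0, ncol(X)),
                           max_iter = 200, tol = 1e-8) {
  logf <- function(b) {                       # log LPRE density, beta-free constants dropped
    eta <- drop(X %*% b)
    -(y * exp(-eta) + exp(eta) / y)
  }
  obj <- function(b) {                        # negative gamma-likelihood up to constants
    -log(mean(exp(gamma * logf(b)))) / gamma +
      log(mean(exp(-gamma * drop(X %*% b)))) / (1 + gamma)
  }
  beta <- beta0
  for (t in seq_len(max_iter)) {
    lf <- logf(beta)
    w  <- exp(gamma * (lf - max(lf))); w <- w / sum(w)   # weights at the current iterate
    subproblem <- function(b) {               # weighted convex LPRE loss + second term
      eta <- drop(X %*% b)
      sum(w * (y * exp(-eta) + exp(eta) / y)) +
        log(mean(exp(-gamma * eta))) / (1 + gamma)
    }
    beta_new <- optim(beta, subproblem, method = "BFGS")$par
    if (obj(beta) - obj(beta_new) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}

In the actual implementation, the inner optim call would instead minimize the surrogate built from Equations (23) and (28).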
Remark 2. Instead of the MM algorithm, one can directly use a quasi-Newton method, such as the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, to minimize the negative γ-likelihood function. In our experience, the BFGS algorithm is faster than the MM algorithm but is more sensitive to the initial value. The strengths of the BFGS and MM algorithms can be combined by the following hybrid algorithm:
We first conduct the MM algorithm with a small number of iterations.
Then, the BFGS algorithm is conducted, using the estimate obtained by the MM algorithm as its initial value (see the sketch below).
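A sketch of this hybrid strategy, reusing the illustrative mm_robust_lpre and neg_gamma_lik functions sketched earlier (not the rree implementation), is as follows.

# Hybrid strategy of Remark 2: a few MM iterations for stability, then BFGS for speed.
hybrid_fit <- function(X, y, gamma, n_mm = 5) {
  beta_mm <- mm_robust_lpre(X, y, gamma, max_iter = n_mm)   # stable warm start
  optim(beta_mm, neg_gamma_lik, X = X, y = y, gamma = gamma,
        method = "BFGS")$par                                # quasi-Newton refinement
}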
The stability of the MM algorithm is investigated through the real data analysis in Section 7.

Remark 3. To deal with high-dimensional data, we often use regularization, such as the lasso [12], the elastic net [29], and the Smoothly Clipped Absolute Deviation (SCAD) [30]. In robust relative error estimation, the loss function based on the lasso is expressed as in Equation (29), where the penalty term is weighted by a regularization parameter. However, the loss function in Equation (29) is nonconvex and nondifferentiable. Instead of directly minimizing it, we may use the MM algorithm; the convex loss function in Equation (30) is then minimized at each iteration. The minimization of Equation (30) can be carried out by the alternating direction method of multipliers algorithm [14] or by a coordinate descent algorithm with the quadratic approximation of [31].

7. Real Data Analysis
We apply the proposed method to electricity consumption data from the UCI (University of California, Irvine) Machine Learning Repository [32]. The dataset consists of the electricity consumption of 370 households from January 2011 to December 2014. The electricity consumption is recorded in kWh at 15-minute intervals. We consider the problem of predicting the next day's electricity consumption by using past electricity consumption. Day-ahead prediction of electricity consumption is needed when trading electricity on markets, such as the European Power Exchange (EPEX) day-ahead market (https://www.epexspot.com/en/market-data/dayaheadauction) and the Japan Electric Power Exchange (JEPX) day-ahead market (http://www.jepx.org/english/index.html). In the JEPX market, when the predicted electricity consumption is smaller than the actual electricity consumption, the price for the shortfall becomes the “imbalance price”, which is usually higher than the ordinary price. For details, please refer to Sioshansi and Pfaffenberger [33].
To investigate the effectiveness of the proposed procedure, we choose one household whose records include small positive values of electricity consumption. The consumption data for 25 December 2014 were deleted because the corresponding values are zero. We predict the electricity consumption from January 2012 to December 2014 (the data in 2011 are used only for estimating the parameters). The actual electricity consumption from January 2012 to December 2014 is depicted in Figure 3.
From Figure 3, we observe that several data values are close to zero. In particular, from October to December 2014, there are several spikes that attain nearly zero values. In this case, the estimation accuracy is poor with ordinary GRE criteria, as shown in the numerical simulations of the previous section.
We assume the multiplicative regression model in Equation (1) to predict electricity consumption. Let $y_t$ denote the electricity consumption at time $t$. The number of observations is 146,160. Since electricity demand is recorded at 15-minute intervals, there are 96 measurements in one day. We define the predictor vector $x_t$ from the consumption values at the same time of day on each of the past $q$ days, so that, in our model, the electricity consumption at time $t$ is explained by the electricity consumption of the past $q$ days for the same period. A fixed value of $q$ is used for the data analysis, and the model is estimated from a fixed-length window of past observations.
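As an illustration of this setup, the following R sketch builds, for each 15-minute time index t, a predictor vector from the consumption at the same slot on each of the previous q days, here on the log scale and with an intercept; the exact definition of x_t used in the paper may differ, so this construction is only a plausible reading.

# y: consumption at 15-minute resolution, 96 observations per day.
# Build x_t = (1, log y_{t-96}, log y_{t-2*96}, ..., log y_{t-q*96}).
make_lagged_design <- function(y, q = 7, per_day = 96) {
  n   <- length(y)
  idx <- (q * per_day + 1):n                      # time points with a full history
  X   <- sapply(1:q, function(k) log(y[idx - k * per_day]))
  list(X = cbind(1, X), y = y[idx])
}

# Example with a synthetic daily-periodic series:
set.seed(3)
y_sim <- exp(1 + 0.5 * sin(2 * pi * (1:(96 * 30)) / 96) + rnorm(96 * 30, sd = 0.2))
dat   <- make_lagged_design(y_sim, q = 7)
dim(dat$X)    # (30 - 7) * 96 rows and q + 1 columns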
The model parameters are estimated by the robust LPRE. The values of γ are set to a regular sequence from 0 to 0.1 with increments of 0.01. To minimize the negative γ-likelihood function, we apply our proposed MM algorithm. As the electricity consumption pattern on weekdays is known to be completely different from that on weekends, we make predictions for weekdays and weekends separately. The resulting relative prediction errors are depicted in Figure 4.
The relative prediction error is large when γ = 0 (i.e., ordinary LPRE estimation). The minimum value of the relative prediction error is 0.049, attained at an intermediate value of γ. When too large a value of γ is set, efficiency decreases and the relative prediction error might increase.
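For completeness, the evaluation step can be sketched as follows, taking the relative prediction error to be the average of |y_t − ŷ_t|/y_t over the evaluation period, which is one natural reading of the criterion reported in Figure 4; fit_and_predict is a hypothetical placeholder for the fitting step.

# Relative prediction error, read here as the average of |y - yhat| / y.
rel_pred_error <- function(y, yhat) mean(abs(y - yhat) / y)

# Skeleton of the gamma grid search:
# gammas <- seq(0, 0.1, by = 0.01)
# errors <- sapply(gammas, function(g) rel_pred_error(y_eval, fit_and_predict(g)))
# gammas[which.min(errors)]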
Figure 5 shows the predicted values when γ = 0. We observe several extremely large predicted values (e.g., 8 July 2013 and 6 November 2014), because the model parameters are heavily affected by the nearly zero values of electricity consumption. Figure 6 shows the predicted values when γ is set to the value selected above. Extremely large predicted values are no longer observed, and the predictions are similar to the actual electricity demand in Figure 3. Therefore, our proposed procedure is robust against outliers.
Additionally, we apply the Yule–Walker method, one of the most popular estimation procedures for the autoregressive (AR) model. Note that the Yule–Walker method does not regard a small positive value of the response as an outlier, so that we do not need to fit a robust AR model to this dataset. The relative prediction error of the Yule–Walker method is 0.123, which is larger than that of our proposed method (0.049).
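For reference, the Yule–Walker benchmark can be reproduced along the following lines with base R; whether the AR model is fitted to the raw or log-transformed consumption, and with which order, is not stated here, so those choices are illustrative (the series below is synthetic).

# Yule-Walker AR benchmark with base R.
set.seed(4)
y_train <- exp(1 + 0.5 * sin(2 * pi * (1:(96 * 60)) / 96) + rnorm(96 * 60, sd = 0.2))
fit_ar  <- ar(log(y_train), aic = TRUE, order.max = 96 * 2, method = "yule-walker")
pred    <- predict(fit_ar, n.ahead = 96)      # one-day-ahead (96-step) forecast, log scale
yhat    <- exp(pred$pred)                     # back to the original scale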
Furthermore, to investigate the stability of the MM algorithm described in Section 5, we also apply the BFGS method to obtain the minimizer of the negative γ-likelihood function. The optim function in R is used to implement the BFGS method. With the BFGS method, the relative prediction errors diverge for some values of γ. Consequently, the MM algorithm is more stable than the BFGS algorithm for this dataset.
8. Discussion
We proposed a relative error estimation procedure that is robust against outliers. The proposed procedure is based on the γ-likelihood function, which is constructed by the γ-cross entropy [18]. We showed that the proposed method has the redescending property, a desirable property in the robust statistics literature. The asymptotic normality of the corresponding estimator, together with a simple consistent estimator of the asymptotic covariance matrix, was derived, which allows the construction of approximate confidence sets. Besides the theoretical results, we constructed an efficient algorithm, in which a convex loss function is minimized at each iteration; the algorithm monotonically decreases the objective function.
Our simulation results showed that the proposed method performed better than the ordinary relative error estimation procedures in terms of prediction accuracy. Furthermore, the asymptotic distribution of the estimator yielded a good approximation, with an appropriate value of γ, even when outliers existed. The proposed method was applied to electricity consumption data, which included small positive values. Although the ordinary LPRE was sensitive to these small positive values, our method was able to appropriately eliminate their negative effect.
In practice, variable selection is one of the most important topics in regression analysis. The ordinary AIC (Akaike information criterion, Akaike [34]) cannot be directly applied to our proposed method because the AIC aims at minimizing the Kullback–Leibler divergence, whereas our method aims at minimizing the γ-divergence. As a future research topic, it would be interesting to derive a model selection criterion for evaluating a model estimated by the γ-likelihood method.
High-dimensional data analysis is also an important topic in statistics. In particular, sparse estimation, such as the lasso [12], is a standard tool for dealing with high-dimensional data. As shown in Remark 3, our method may be extended to sparse regularization. An important point in the regularization procedure is the selection of the regularization parameter. Hao et al. [14] suggested using the BIC (Bayesian information criterion)-type criterion of Wang et al. [35,36] for the ordinary LPRE estimator. It would also be interesting to consider the problem of regularization parameter selection in high-dimensional robust relative error estimation.
In regression analysis, we may formulate two types of γ-likelihood functions: Fujisawa and Eguchi’s formulation [18] and Kawashima and Fujisawa’s formulation [17]. Kawashima and Fujisawa [22] reported that a difference in performance occurs when the outlier ratio depends on the explanatory variables. In the multiplicative regression model in Equation (1), the responses depend on the explanatory variables more strongly than in the ordinary linear regression model, because the regression function is an exponential function of the linear predictor $x_i^\top\beta$. As a result, the comparison of the above two formulations of the γ-likelihood functions would be important from both theoretical and practical points of view.