1. Introduction
A regression model approximates the functional relationship between a dependent variable y and independent variables x. Parameters of a regression model are estimated using a set of observations of x and y. The model with the estimated parameters can be used to predict the value of the dependent variable for a specific combination of the independent variables. In practice, statistical regression models are most often used for this purpose, but their use is limited by the assumption that any deviation of a prediction from the corresponding observation is due to random error.
In many practical applications, the deviations result from imprecise observations, indefiniteness of the system structure and parameters [1,2], or the vagueness of human perception of the model (in contrast with statistical regression, where the errors are associated with observations) [3]. There are also cases where the observations are inherently fuzzy, e.g., if the observations are described by linguistic terms [4,5,6]. In such cases, the deviations are due to fuzziness rather than randomness, and fuzzy regression should be used. Fuzzy regression can also be used when statistical distributional assumptions cannot be justified, or if the representation of the regression model is poor [3]. Fuzzy regression is also an efficient alternative to statistical regression whenever a dataset is insufficient to support statistical regression analysis [7].
Fuzzy regression approximates the relationship between the dependent variable y and the independent variables x using a fuzzy regression model. If the underlying regression relationship is known, an appropriate parametric fuzzy regression model (e.g., linear [8], polynomial [9], or logistic [10]) can be used to approximate the relationship. If the relationship is unknown, a nonparametric fuzzy regression model can be used instead, such as a model based on k-nearest neighbour smoothing [11], kernel smoothing [11], local linear smoothing [12], or adaptive neuro-fuzzy inference systems [13].
In fuzzy regression analysis, the relationship is typically approximated as a linear dependence by a model with all fuzzy parameters [2,14,15,16,17,18,19,20,21,22,23,24], but other fuzzy linear models are used as well [8]. For example, the model with all fuzzy parameters can be extended with a fuzzy error term [22], or some of the parameters can be real numbers [8]. In models with all fuzzy parameters, the spreads of model predictions depend on the absolute values of x [8,23]. If all parameters except the y-intercept or the fuzzy error term are real numbers, the model prediction spreads are constant over the whole range of x values [8].
For estimating the unknown parameters of a fuzzy model, possibilistic and statistics-based approaches are frequently used. Possibilistic fuzzy estimators minimize the total spread of the fuzzy model predictions subject to constraints that arise from the observations [1,3,15,16,19,21,25,26]. Statistics-based solutions adopt concepts used in statistics, such as the ordinary least squares method [17,18,20,27,28,29,30,31,32,33], the least absolute deviations method [22,23,24,34,35,36,37], and adaptive smoothing methods [11,12]. Some fuzzy estimators combine the possibilistic and the statistics-based approaches [38].
Several limitations have been observed for some fuzzy estimators. As pointed out by many authors [22,24,39,40,41], numerous fuzzy estimators are sensitive to outliers. This criticism resulted in a number of fuzzy estimators that are, to varying degrees, resilient to outliers [3,11,12,24,26,41,42,43,44]. A serious issue with some estimators is that they do not guarantee non-negativity of spreads [35,41,45,46]. Another disadvantage is the computational complexity of the estimators. Depending on the fuzzy estimator used, matrix operations [18], linear programming [1,3,15,16,19,21,24,44], quadratic programming [26,27,32,38], or a general constrained optimization problem [25,29,30,31,33] must be solved to obtain estimates of the unknown parameters. Some methods employ customized iterative optimization algorithms [20,41].
Herein, we propose a new fuzzy estimator intended for a simple fuzzy regression model with a real-valued independent variable and a fuzzy dependent variable. We based the estimator on the Boscovich regression line [47,48]; hence, we named it the Boscovich fuzzy regression line. The original Boscovich regression line was a pioneering regression method based on a minimization criterion, i.e., it is a predecessor of today’s statistical regression methods. The method was characterized by extremely low computing demands, and the presented Boscovich fuzzy regression line inherits this property. Moreover, the fuzzy linear model used guarantees non-negativity of the model parameter spreads by its nature, and the spreads of model predictions are not influenced by the values of the independent variable. Despite its simplicity, the presented fuzzy estimator provides regression models of a quality comparable to other fuzzy regression methods.
2. Materials and Methods
The Boscovich regression method [48] was designed for the simple theoretical linear regression model

$$y = a_0 + a_1 x, \tag{1}$$

where $a_0$ and $a_1$ are unknown parameters of the model. The estimation of $a_0$ and $a_1$ is based on n imprecise observations $(x_i, y_i)$, where $x_i, y_i \in \mathbb{R}$, and $i \in \{1, \dots, n\}$. The estimator introduced by Boscovich was based on two constraints [49]:
- (1) the sum of the positive and negative residuals (in the direction of the y axis) shall be equal,
- (2) the sum of the absolute values of all the residuals shall be as small as possible.

The constraint (1) implies that the regression line should pass through the centroid $(\bar{x}, \bar{y})$ formed by the observations, i.e.,

$$\bar{y} = \hat{a}_0 + \hat{a}_1 \bar{x}, \tag{2}$$

where $\hat{a}_0, \hat{a}_1$ are the best estimates of the unknown regression coefficients $a_0, a_1$ with respect to the constraints (1) and (2). As follows from the constraint (1), the coordinates of the centroid can be expressed as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i. \tag{3}$$

Thus, the best estimate of the unknown regression coefficient $a_0$ can be expressed on the basis of (2) as

$$\hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x}. \tag{4}$$

Naturally, a single point, $(\bar{x}, \bar{y})$, is insufficient to determine a line. Therefore, at least one other point is needed. Boscovich suggested using one of the observations as the second point of the regression line. Since n observations are available, n regression lines can be constructed in this way. The k-th regression line is determined by

$$y_k = \bar{y} + \hat{a}_1^{(k)} (x_k - \bar{x}), \tag{5}$$

where $\hat{a}_1^{(k)}$ is the k-th estimate of $a_1$ based on the k-th observation $(x_k, y_k)$, and $k \in \{1, \dots, n\}$. On the basis of (5), the k-th estimate of $a_1$ can be expressed as

$$\hat{a}_1^{(k)} = \frac{y_k - \bar{y}}{x_k - \bar{x}}. \tag{6}$$

Altogether, n estimates of $a_1$ are obtained in this way. The best one can be selected using an evaluation function J based on the constraint (2), i.e., for the k-th regression line, it holds that

$$J_k = \sum_{i=1}^{n} \left| y_i - \hat{y}_i^{(k)} \right|, \tag{7}$$

where $\hat{y}_i^{(k)} = \bar{y} + \hat{a}_1^{(k)} (x_i - \bar{x})$ is the i-th prediction of the dependent variable y using the k-th regression model (5).

Boscovich expressed the premise that one of the n regression lines is the best approximation of the model (1). Considering constraint (2), the best estimate of the unknown regression coefficient $a_1$ is given by

$$\hat{a}_1 = \hat{a}_1^{(k^{*})}, \qquad k^{*} = \arg\min_{k \in \{1, \dots, n\}} J_k. \tag{8}$$
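As a small worked illustration of the procedure described in this section, consider four hypothetical observations (the data are chosen here for illustration only and are not taken from any of the cited examples). Here $(\bar{x}, \bar{y})$ denotes the centroid, $\hat{a}_1^{(k)}$ the slope of the line through the centroid and the k-th observation, and $J_k$ the sum of absolute residuals of that line:

```latex
\begin{align*}
&(x_i, y_i) \in \{(0, 1),\ (1, 2),\ (3, 6),\ (4, 7)\}, \qquad \bar{x} = 2, \quad \bar{y} = 4, \\[4pt]
&\hat{a}_1^{(1)} = \tfrac{1 - 4}{0 - 2} = 1.5, \quad
 \hat{a}_1^{(2)} = \tfrac{2 - 4}{1 - 2} = 2, \quad
 \hat{a}_1^{(3)} = \tfrac{6 - 4}{3 - 2} = 2, \quad
 \hat{a}_1^{(4)} = \tfrac{7 - 4}{4 - 2} = 1.5, \\[4pt]
&J_1 = J_4 = |1 - 1| + |2 - 2.5| + |6 - 5.5| + |7 - 7| = 1, \\
&J_2 = J_3 = |1 - 0| + |2 - 2| + |6 - 6| + |7 - 8| = 2, \\[4pt]
&\hat{a}_1 = 1.5, \qquad \hat{a}_0 = \bar{y} - \hat{a}_1 \bar{x} = 4 - 1.5 \cdot 2 = 1 .
\end{align*}
```

The selected line $y = 1 + 1.5x$ leaves the residuals $0, -0.5, 0.5, 0$, whose positive and negative parts balance as required by constraint (1).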
4. Boscovich Fuzzy Regression Line
The presented fuzzy regression method is intended for datasets where observations of the independent variable x are real numbers and observations of the dependent variable y are triangular fuzzy numbers $\tilde{Y}$, i.e., a set of n observations is given as

$$O = \left\{ (x_i, \tilde{Y}_i) \right\}_{i=1}^{n}, \tag{22}$$

where $x_i$ is the i-th real-valued observation of the independent variable x, $\tilde{Y}_i$ is the i-th fuzzy observation of the dependent variable y, $\tilde{Y}_i = (y_i, y_i^{L}, y_i^{R})$, $y_i$ is the mean, $y_i^{L}$ is the left, and $y_i^{R}$ is the right spread of $\tilde{Y}_i$.

For the approximation of a linear dependence between x and $\tilde{Y}$, we use a simple fuzzy linear model

$$\tilde{Y} = \tilde{A}_0 + a_1 x, \tag{23}$$

where $\tilde{A}_0$ and $a_1$ are unknown model parameters, $x \in \mathbb{R}$, $a_1 \in \mathbb{R}$, $\tilde{A}_0 = (a_0, a_0^{L}, a_0^{R})$, $a_0$ is the mean, $a_0^{L}$ is the left, and $a_0^{R}$ is the right spread of $\tilde{A}_0$. As the y-intercept $\tilde{A}_0$ is the only fuzzy parameter of the model (23), the fuzziness of model predictions is independent of the model input x [8].

Following the Boscovich idea (see Section 2, constraint (1)), the best estimate of the fuzzy regression line (23) shall pass through a centroid formed by the observations O, i.e.,

$$\bar{Y} = \hat{A}_0 + \hat{a}_1 \bar{x}, \tag{24}$$

where $\hat{A}_0, \hat{a}_1$ are the best estimates of the unknown regression coefficients $\tilde{A}_0$ and $a_1$ according to the constraints (1) and (2); $\bar{Y}$ is the y coordinate of the centroid, and $\bar{x}$ is its x coordinate. The coordinates of the centroid formed by the observations (22) are given as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} \tilde{Y}_i, \tag{25}$$

therefore $\bar{Y} = (\bar{y}, \bar{y}^{L}, \bar{y}^{R}) = \left( \frac{1}{n} \sum_{i=1}^{n} y_i,\ \frac{1}{n} \sum_{i=1}^{n} y_i^{L},\ \frac{1}{n} \sum_{i=1}^{n} y_i^{R} \right)$.

Let us express the estimate of the unknown fuzzy regression coefficient $\tilde{A}_0$, using the Formula (24), as

$$\hat{A}_0 = \bar{Y} - \hat{a}_1 \bar{x}. \tag{26}$$

As in the case of the Boscovich regression line, n fuzzy regression lines can be constructed using the observations (22). The k-th fuzzy regression line, based on the k-th observation $(x_k, \tilde{Y}_k)$, is given as

$$\tilde{Y}_k = \bar{Y} + \hat{a}_1^{(k)} (x_k - \bar{x}), \tag{27}$$

where $\hat{a}_1^{(k)}$ is the slope of the k-th fuzzy regression line (i.e., the k-th estimate of $a_1$), and $k \in \{1, \dots, n\}$.

The model (27) uses a device known from the ordinary least squares method. Specifically, we relate the explanatory variable to the centre of gravity using the relation $x_k - \bar{x}$. The y coordinate of the centroid $\bar{Y}$ incorporates the fuzziness of the underlying relationship (25), which is reflected in the intercept of the regression line $\hat{A}_0$ (26). The intercept is the only fuzzy coefficient of the linear function in our model. A line constructed in this way fulfils the first Boscovich assumption (Section 2, constraint (1)) and passes through the centroid, which is a necessary condition for an unbiased estimate of the regression line. Considering this, we can construct an estimate of the slope of the k-th regression line on the basis of the mean values of the fuzzy numbers $\tilde{Y}_k$ and $\bar{Y}$:

$$y_k = \bar{y} + \hat{a}_1^{(k)} (x_k - \bar{x}), \tag{28}$$

where $y_k$ is the mean of the k-th observation of $\tilde{Y}$, and $\bar{y}$ is the mean of the y coordinate of the centroid $\bar{Y}$.

The k-th estimate of the slope $a_1$ is then given as

$$\hat{a}_1^{(k)} = \frac{y_k - \bar{y}}{x_k - \bar{x}}. \tag{29}$$

For the selection of the best estimate of the slope $a_1$, an appropriate evaluation function has to be formulated (see Section 2, constraint (2)). As follows from the relaxed Equation (28), the criterion is given as

$$J_k = \sum_{i=1}^{n} \left| y_i - \bar{y} - \hat{a}_1^{(k)} (x_i - \bar{x}) \right|, \tag{30}$$

and the best estimate of $a_1$ is given as

$$\hat{a}_1 = \hat{a}_1^{(k^{*})}, \qquad k^{*} = \arg\min_{k \in \{1, \dots, n\}} J_k. \tag{31}$$

The proposed fuzzy estimator can be written in pseudocode as Algorithm 1.
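Algorithm 1 Boscovich fuzzy regression line
Require: The set of n observations $O$
Ensure: The best estimates of $\tilde{A}_0$ and $a_1$
- 1: function BFRL($O$)
- 2: $\bar{x} \leftarrow \frac{1}{n} \sum_{i=1}^{n} x_i$
- 3: $\bar{Y} \leftarrow \frac{1}{n} \sum_{i=1}^{n} \tilde{Y}_i$
- 4: $\hat{a}_1^{(k)} \leftarrow (y_k - \bar{y}) / (x_k - \bar{x})$ for $k = 1, \dots, n$
- 5: $J_k \leftarrow \sum_{i=1}^{n} \left| y_i - \bar{y} - \hat{a}_1^{(k)} (x_i - \bar{x}) \right|$ for $k = 1, \dots, n$
- 6: $k^{*} \leftarrow \arg\min_{k} J_k$
- 7: $\hat{a}_1 \leftarrow \hat{a}_1^{(k^{*})}$ ▹ Best estimate of $a_1$
- 8: $\hat{A}_0 \leftarrow \bar{Y} - \hat{a}_1 \bar{x}$ ▹ Best estimate of $\tilde{A}_0$
- 9: return $\hat{A}_0$, $\hat{a}_1$
- 10: end function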
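The estimator described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the names `bfrl` and `predict` are hypothetical, a triangular fuzzy number is represented as a tuple `(mean, left_spread, right_spread)`, observations whose x equals the centroid's x coordinate are skipped because the slope through the centroid is undefined for them, and ties in the criterion are resolved in favour of the first candidate.

```python
from math import inf
from statistics import fmean


def bfrl(observations):
    """Sketch of the Boscovich fuzzy regression line.

    observations: list of (x, (y_mean, y_left, y_right)) pairs.
    Returns (A0, a1), where A0 = (mean, left, right) is the fuzzy
    intercept and a1 is the real-valued slope.
    """
    if len(observations) < 2:
        raise ValueError("at least two observations are required")

    xs = [x for x, _ in observations]
    means = [y[0] for _, y in observations]

    # Centroid of the observations; spreads are averaged component-wise.
    x_bar = fmean(xs)
    centroid = (
        fmean(means),
        fmean(y[1] for _, y in observations),
        fmean(y[2] for _, y in observations),
    )
    y_bar = centroid[0]

    best_slope, best_cost = None, inf
    for xk, yk in zip(xs, means):
        if xk == x_bar:
            continue  # assumption: no line through two points with equal x
        slope = (yk - y_bar) / (xk - x_bar)
        # Sum of absolute residuals of the means for this candidate line.
        # Each candidate is evaluated against all n observations.
        cost = sum(abs(y - y_bar - slope * (x - x_bar))
                   for x, y in zip(xs, means))
        if cost < best_cost:
            best_slope, best_cost = slope, cost

    if best_slope is None:
        raise ValueError("all observations share the same x value")

    # Only the mean of the intercept is shifted; the spreads are those
    # of the centroid, so they are non-negative by construction.
    a0 = (y_bar - best_slope * x_bar, centroid[1], centroid[2])
    return a0, best_slope


def predict(a0, a1, x):
    """Fuzzy prediction of the model; the spreads do not depend on x."""
    return (a0[0] + a1 * x, a0[1], a0[2])
```

For instance, `bfrl([(0, (0, 1, 2)), (1, (1, 1, 2)), (2, (2, 1, 2)), (5, (9, 1, 2))])` selects the slope 2 and the fuzzy intercept `(-1.0, 1.0, 2.0)` on this made-up dataset.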
5. Numerical Examples
We compared the presented Boscovich fuzzy regression line (BFRL) with several possibilistic and statistics-based fuzzy regression methods. As representatives of the possibilistic family, we selected a possibilistic linear regression (PLR) analysis [
16], a PLR combined with an omission approach (OPRL) [
44], and a multi-objective fuzzy linear regression (MOFLR) [
26]. The OPRL and MOFLR were designed to be less sensitive to outliers. We implemented the OPRL for detection of a single outlier. Each of these three methods requires one parameter to be set by a decision maker. The decision maker must set a threshold value
h in the cases of PLR and OPRL, where
. The threshold value indicates a degree of fitness of an estimated fuzzy regression model [
16]. MOFLR requires a weighting coefficient
, where
. The weighting coefficient determines a trade-off between outlier penalization (
) and data fitting (
) [
26]. From the statistics-based methods, we considered a fuzzy least squares (FLS) [
18], a fuzzy least absolute linear regression (FLAR) [
24], and a robust fuzzy regression (RFR) analysis [
41]. Note that the RFR requires at least six observations to estimate parameters of a simple fuzzy linear model and it employs a customized iterative optimization method.
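BFRL was designed as a parameter estimator of the fuzzy linear model (23). PLR, OPRL, MOFLR, FLS and FLAR expect a fuzzy linear model with all fuzzy parameters. For one real-valued independent variable x, the model is given as

$$\tilde{Y} = \tilde{A}_0 + \tilde{A}_1 x, \tag{32}$$

where $\tilde{A}_0$ and $\tilde{A}_1$ are unknown model parameters, $\tilde{A}_j = (a_j, a_j^{L}, a_j^{R})$, $j \in \{0, 1\}$, $a_j$ is the mean, $a_j^{L}$ is the left, and $a_j^{R}$ is the right spread of $\tilde{A}_j$.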
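The practical difference between the two model forms can be illustrated with a short Python sketch. It assumes the usual arithmetic for triangular fuzzy numbers written as `(mean, left_spread, right_spread)`: multiplication by a non-negative real number scales both spreads, and multiplication by a negative real number scales them by the absolute value and swaps left and right. The function names and the parameter values are hypothetical and serve only as an illustration.

```python
def scale(a, x):
    """Multiply the triangular fuzzy number a = (mean, left, right) by a real x."""
    mean, left, right = a
    if x >= 0:
        return (mean * x, left * x, right * x)
    # A negative factor mirrors the number, so the spreads swap sides.
    return (mean * x, right * -x, left * -x)


def add(a, b):
    """Add two triangular fuzzy numbers component-wise."""
    return tuple(p + q for p, q in zip(a, b))


def predict_all_fuzzy(a0, a1, x):
    """Model with fuzzy intercept a0 and fuzzy slope a1."""
    return add(a0, scale(a1, x))


def predict_fuzzy_intercept(a0, a1, x):
    """Model with fuzzy intercept a0 and real-valued slope a1."""
    return (a0[0] + a1 * x, a0[1], a0[2])


if __name__ == "__main__":
    A0 = (1.0, 0.5, 0.5)     # illustrative fuzzy intercept
    A1 = (2.0, 0.25, 0.75)   # illustrative fuzzy slope
    for x in (-2.0, 0.0, 2.0, 10.0):
        print(x, predict_all_fuzzy(A0, A1, x), predict_fuzzy_intercept(A0, 2.0, x))
```

With a fuzzy slope, the prediction spreads grow with the absolute value of x, whereas with a real-valued slope they stay equal to the spreads of the intercept for every x.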
RFR uses a different approach to model prediction calculation. The
i-th prediction of
y,
, is given as
where
and
h are unknown model parameters, and
.
We evaluated the performance of the aforementioned methods on various datasets. The possibilistic approaches (PLR, OPRL, MOFLR) were designed for symmetric fuzzy numbers. To allow comparison of all the methods, we included datasets with symmetric fuzzy observations in the evaluation. We used data from example 1 published in [16] and from example 2 published in [18]. We labelled the datasets SFN-1 and SFN-2, respectively. The presented BFRL, as well as the statistics-based methods (FLS, FLAR, RFR), can also process non-symmetric fuzzy numbers. To fully examine the performance of these methods, we included three datasets with non-symmetric fuzzy observations of the dependent variable in the evaluation. Specifically, we used data from example 1 published in [45], from example 2 published in [50], and from example 3 published in [51]. We labelled the datasets NFN-1, NFN-2, and NFN-3, respectively. The datasets SFN-1, SFN-2, NFN-1, NFN-2, and NFN-3 consist of 5, 8, 16, 8 and 8 observations, respectively.
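We evaluated the performance of the methods using a total error E, which we defined as a sum of absolute errors, i.e.,

$$E = \sum_{i=1}^{n} D_i, \tag{34}$$

where D is a difference between the membership functions of the observed and estimated fuzzy responses. For the i-th observation, the difference is given as

$$D_i = \int_{S_{\tilde{Y}_i} \cup S_{\hat{Y}_i}} \left| \mu_{\tilde{Y}_i}(y) - \mu_{\hat{Y}_i}(y) \right| \, \mathrm{d}y, \tag{35}$$

where $\mu_{\tilde{Y}_i}$ and $\mu_{\hat{Y}_i}$ are the membership functions of the i-th observed response $\tilde{Y}_i$ and the i-th estimated response $\hat{Y}_i$, respectively; and $S_{\tilde{Y}_i}$ and $S_{\hat{Y}_i}$ represent the supports of $\tilde{Y}_i$ and $\hat{Y}_i$, respectively [52]. Total errors (34) of the examined fuzzy regression methods are summarized in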
Table 1. The errors are organized with respect to fuzzy regression methods (columns) and datasets (rows). For PLR and OPRL, we used threshold values
and
. For MOFLR, we considered
. Note that RFR cannot be applied to SFN-1 since the dataset consists of only 5 observations. Since PLR, OPRL and MOFLR return biased results on datasets with non-symmetric fuzzy observations, their results were not included in
Table 1.
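The total error can be approximated numerically, as in the following Python sketch. It assumes that the difference D for one observation is the integral of the absolute difference between the two triangular membership functions over the union of their supports, and it evaluates that integral with a simple midpoint rule. The function names, the tuple representation `(mean, left_spread, right_spread)` and the number of integration steps are choices made for this illustration only.

```python
def membership(fuzzy, y):
    """Membership degree of y in the triangular fuzzy number (mean, left, right)."""
    mean, left, right = fuzzy
    if y == mean:
        return 1.0
    if mean - left < y < mean:
        return (y - (mean - left)) / left
    if mean < y < mean + right:
        return ((mean + right) - y) / right
    return 0.0


def difference(observed, estimated, steps=20000):
    """Approximate the integral of |mu_observed - mu_estimated|.

    The integration interval is the convex hull of both supports; outside
    the supports both memberships are zero, so the hull gives the same
    value as the union of the supports.
    """
    lower = min(observed[0] - observed[1], estimated[0] - estimated[1])
    upper = max(observed[0] + observed[2], estimated[0] + estimated[2])
    if upper == lower:
        return 0.0
    width = (upper - lower) / steps
    total = 0.0
    for j in range(steps):
        y = lower + (j + 0.5) * width  # midpoint of the j-th cell
        total += abs(membership(observed, y) - membership(estimated, y))
    return total * width


def total_error(observations, estimates):
    """Sum of the differences over all pairs of observed and estimated responses."""
    return sum(difference(o, e) for o, e in zip(observations, estimates))
```

Two identical responses give a difference of zero, while two responses with disjoint supports give the sum of the areas under both membership functions.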
We also investigated the sensitivity of the methods to outliers. We considered three types of outliers: (a) outliers with respect to centres of the fuzzy dependent variable
(o1), (b) outliers with respect to spreads of
(o2), and (c) outliers with respect to both the centres and the spreads (o3). To examine the sensitivity, we created three new datasets from each of the datasets stated above, each containing one outlier. The outliers in the datasets are specified in
Table 2 by their serial numbers
i, means
, left spreads
and right spreads
. Their original values are written in normal text while the changed ones are in bold.
Total errors of acquired fuzzy regression models on the modified datasets with symmetric and non-symmetric fuzzy observations are summarized in
Table 3 and
Table 4, respectively. The results are organized with respect to fuzzy regression methods (columns) and datasets (rows). Settings of adjustable parameters are given under the method abbreviations. Results marked with an asterisk indicate that the outlier was correctly recognized by OPRL (
Table 3).
The parameter estimates obtained by PLR and OPRL on datasets with symmetric fuzzy numbers are summarized in
Table 5. Estimates provided by MOFLR on the same datasets are given in
Table 6. Settings of adjustable parameters
h and
are stated under method abbreviations. The estimates of model parameters generated by FLS, FLAR, RFR and BFRL on datasets with symmetric and non-symmetric fuzzy observations are summarized in
Table 7 and
Table 8, respectively.
We implemented all the fuzzy regression methods in MATLAB R2016a. We used the default settings of the optimization functions employed in the calculations, which means that interior point methods were used to solve both the linear and the quadratic programming problems.
6. Discussion
We demonstrated in the numerical examples that the proposed Boscovich fuzzy regression line (BFRL) is capable of approximating a fuzzy linear relationship between the dependent
y and one independent variable
x, where the independent variable is a real number and the dependent variable is a triangular fuzzy number
. We compared BFRL with several other fuzzy linear regression methods. Most of the reference methods (PLR, OPRL, MOFLR, FLS and FLAR) approximate the relationship using the fuzzy linear model (
32), while RFR uses the model (
33). Prediction spreads of both models are dependent on
x. BFRL estimates parameters of the fuzzy linear model (
23). Prediction spreads of this model are independent of
x. This makes BFRL suitable for applications where the fuzziness of model predictions is independent of model inputs. An example of such an application is the approximation of the dependence of the water level in an open channel on the opening of a floodgate, where the level is measured using a perpendicularly positioned scale.
We studied the performance of the fuzzy regression methods on twenty datasets, fifteen of which were affected by outliers. We measured prediction errors of the acquired fuzzy regression models using the total error (34). We found that, in most cases, BFRL-based models show lower total errors than PLR, OPRL, MOFLR and RFR models. Total errors of BFRL and FLS models are comparable, whereas FLAR models always show lower errors (
Table 1,
Table 3 and
Table 4). Sensitivity of BFRL to all types of outliers is comparable with other statistics-based methods (FLS, FLAR, RFR) but considerably lower compared to PLR and MOFLR (
Table 3 and
Table 4). If an outlier is correctly recognized by OPRL, the total error of the model produced by OPRL on the outlier-affected dataset is comparable with the total error of a BFRL-based model (
Table 3).
Unlike some reference methods, BFRL proved to be generally applicable and reliable. BFRL provides sensible parameter estimates for datasets with symmetric as well as non-symmetric triangular fuzzy numbers (unlike PLR, OPRL and MOFLR). BFRL can operate on datasets with only two observations, whereas RFR, for example, requires at least six. In contrast to OPRL (SFN-2-o2 for
,
Table 5), BFRL always provided estimates of parameters (
Table 5 and
Table 6). In contrast to RFR, BFRL always returned the same estimates on the same dataset. The iterative optimization in RFR generally leads to varying parameter estimates and varying total errors. BFRL also guaranteed non-negativity of parameter spreads (
Table 7 and
Table 8). This basic requirement is not guaranteed by MOFLR and FLS.
Fuzzy regression methods mostly have high computational complexity and are not easy to implement, which makes them difficult to use. For example, PLR, OPRL, and FLAR result in a linear programming problem, and MOFLR results in a quadratic one. An analytical solution must be derived in the case of FLS, and a customized iterative optimization method must be implemented in the case of RFR. Moreover, time-complexity of these methods is exponential. A pleasant feature of BFRL (Algorithm 1) is its straightforward implementation and linear time-complexity.