1 Introduction

Mining time series data is one of the most challenging problems in data mining [52]. Time series forecasting is of key importance in many application domains where time series data are highly imbalanced. This occurs when certain ranges of values are over-represented in comparison with others, and the user is particularly interested in the predictive performance on values that are the least represented. Such examples may be found in financial data analysis, intrusion detection in network forensics, oil spill detection and prognosis of machine failures. In these scenarios of imbalanced data sets, standard learning algorithms bias the models towards the more frequent situations, away from the user preference biases, proving to be an ineffective approach and a major source of performance degradation [10].

A common solution for the general problem of mining imbalanced data sets is to resort to resampling strategies. These strategies change the distribution of learning data in order to balance the number of rare and normal cases, attempting to reduce the skewness of the data. Resampling strategies commonly achieve their goal by under- or oversampling the data. In the former, some of the cases considered normal (i.e. the majority of cases) are removed from the learning data; in the latter, cases considered rare (i.e. the minority) are generated and added to the data. For example, in fraud detection problems, fraud cases are infrequent, and detecting them is the prime objective. Also, in intrusion detection problems, most of the behaviour in networks is normal, and cases of intrusion, which one aims to detect, are scarce. Predicting rare occurrences has proven to be a difficult task, but due to its importance in so many domains, it is a fundamental problem within predictive analytics [16].

Resampling strategies are a popular method for dealing with imbalanced domains: they are simple, intuitive and efficient. Moreover, they allow the use of any out-of-the-box learner, enabling a diversity of choices at the learning step. An alternative could be to develop special-purpose learning methods, or to act at the post-processing level. Generally, special-purpose learning methods have the advantage of improving performance for their specific problem. However, they require a thorough knowledge of the learning algorithm being modified, and their application to other problems typically fails. Post-processing methods, in turn, have not been much explored and usually involve the output of conditional probabilities.

Most existing work using resampling strategies for predictive tasks with an imbalanced target variable distribution involves classification problems ([6, 26, 38, 48]). Recently, efforts have been made to adapt existing strategies to numeric targets, i.e. regression problems ([45, 46]). To the best of our knowledge, no previous work addresses this question using resampling strategies in the context of time series forecasting. Although time series forecasting involves numeric predictions, there is a crucial difference compared to regression tasks: the time dependency among the observed values. The main motivation of the current work is our claim that this order dependency should be taken into account when changing the distribution of the training set, i.e. when applying resampling. Our work is driven by the hypothesis that by biasing the sampling procedure with information on this order dependency, we are able to improve predictive performance.

In this paper, we study the use of resampling strategies in imbalanced time series. Our endeavour is based on three strategies: (i) the first is based on undersampling (random undersampling [24]); (ii) the second is based on oversampling (random oversampling [19]); and (iii) the third combines undersampling and oversampling (random undersampling with Synthetic Minority Over-sampling TEchnique [9]). These strategies were initially proposed for classification problems and were then extended for regression tasks [4, 45, 46]. We will refer to the extension of the SMOTE resampling strategy as SmoteR.

Time series often exhibit systematic changes in the distribution of observed values. These non-stationarities are often known as concept drift [51]. This concept describes the changes in the conditional distribution of the target variable in relation to the input features (i.e. predictors), while the distribution of the latter stays unchanged. This raises the question of how to devise learning approaches capable of coping with this issue. We introduce the concept of temporal bias in resampling strategies associated with forecasting tasks using imbalanced time series. Our motivation is the idea that in an imbalanced time series where concept drift occurs, it is possible to improve forecasting accuracy by introducing a temporal bias in the case selection process of resampling strategies. This bias favours cases that are within the temporal vicinity of apparent regime changes. In this paper, we propose two variants of each of the resampling strategies used in our work (undersampling, oversampling and SmoteR): (1) with temporal bias, and (2) with temporal and relevance bias.

An extensive experimental evaluation of our proposals was carried out, comprising 24 time series data sets from 6 different sources. The objective is to verify whether resampling strategies are capable of improving the predictive accuracy in comparison with standard forecasting tools, including those designed specifically for time series (e.g. ARIMA models [8]).

The contributions of this paper are:

  • The extension of resampling strategies for time series forecasting tasks;

  • The proposal of novel resampling strategies that introduce the concept of temporal and relevance bias;

  • An extensive evaluation including standard regression tools, time series-specific models and the use of resampling strategies.

The remainder of this paper is structured as follows. In Sect. 2 the problem tackled in our work is introduced and the hypotheses in which our proposals are based are presented. Resampling strategies are described in Sect. 3 along with the adaptation of previous proposals and new proposals. The data used to evaluate the proposals are introduced in Sect. 4, as well as the regression tools used and the evaluation methods. The evaluation process is described and results presented in Sect. 5, followed by a discussion in Sect. 6. Finally, previous work is discussed in Sect. 7 and conclusions are presented in Sect. 8.

2 Problem definition

The main objective of our proposals is to provide solutions that significantly improve the predictive accuracy on relevant (rare) cases in forecasting tasks involving imbalanced time series.

The task of time series forecasting assumes the availability of a time-ordered set of observations of a given continuous variable \(y_1, y_2, \ldots , y_t \in Y\), where \(y_t\) is the value measured at time t. The objective of this predictive task is to forecast future values of variable Y. The overall assumption is that an unknown function relates the past and future values of Y, i.e. \(Y_{t+h} = f( \left\langle Y_{t-k}, \ldots , Y_{t-1}, Y_{t} \right\rangle )\). The goal of the learning process is to provide an approximation of this unknown function. This is carried out using a data set with historic examples of the function mapping (i.e. training set).

Time series forecasting models usually assume the existence of a degree of correlation between successive values of the series. A form of modelling this correlation consists of using the previous values of the series as predictors of the future value(s), in a procedure known as time delay embedding [39]. This process allows the use of standard regression tools on time series forecasting tasks. However, specific time series modelling tools also exist, such as the ARIMA models [8].
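As an illustration, the time delay embedding step can be sketched as follows (the `embed` helper and its toy input are our own, not part of the paper):

```python
import numpy as np

def embed(series, k):
    """Time delay embedding: each row of X holds the k most recent
    values as predictors, and y holds the next value as the target."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[t - k:t] for t in range(k, len(series))])
    y = series[np.arange(k, len(series))]
    return X, y

# A toy series 0..9 embedded with k=3 yields 7 (predictors, target) cases
X, y = embed(range(10), k=3)
```

After this transformation, any standard regression tool can be trained on `(X, y)`.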

In this work, we focus on imbalanced time series, where certain ranges of values of the target variable Y are more important to the end-user, but severely under-represented in the training data. As training data, we assume a set of cases built using a time delay embedding strategy, i.e. where the target variable is the value of Y in the next time step (\(y_{t+1}\)) and the predictors are the k recent values of the time series, i.e. \(y_t, y_{t-1}, \ldots , y_{t-k}\).

To formalise our prediction task, namely in terms of criteria for evaluating the results of modelling approaches, we need to specify what we mean by “more important” values of the target variable. We resort to the work of Ribeiro [36] that proposes the use of a relevance function to map the domain of continuous variables into a [0, 1] scale of relevance, i.e. \(\phi (Y): \mathcal {Y} \rightarrow [0,1]\). Normally, this function is given by the users, attributing levels of importance to ranges of the target variable specific to their interest, taking into consideration the domain of the data. In our work, due to the lack of expert knowledge concerning the domains, we employ an automatic approach to define the relevance function using box plot statistics, detailed in Ribeiro [36], which automatically assigns more relevance/importance to the rare extreme low and high values of the target variable. This automatic approach uses the piecewise cubic Hermite interpolating polynomial [12] (pchip) algorithm to interpolate a set of points describing the distribution of the target variable. These points are given by box plot statistics. The outlier values according to box plot statistics (either extreme high or low) are given a maximum relevance of 1 and the median value of the distribution is given a relevance of 0. The relevance of the remaining values is then interpolated using the pchip algorithm.
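A minimal sketch of this automatic relevance function, assuming SciPy's `PchipInterpolator` and the standard 1.5 × IQR box plot fences (this is a simplification of Ribeiro's formulation [36], not its exact definition):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def auto_relevance(y):
    """Sketch of the box-plot-based relevance function: the median is
    mapped to relevance 0, the box plot fences to relevance 1, and the
    values in between are interpolated with pchip (monotone cubic)."""
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # box plot fences
    interp = PchipInterpolator([lo, med, hi], [1.0, 0.0, 1.0])
    # clip so extrapolated values beyond the fences stay in [0, 1]
    return lambda v: float(np.clip(interp(v), 0.0, 1.0))

phi = auto_relevance(np.arange(101))   # q1=25, med=50, q3=75
```

By construction, `phi` is close to 0 around the median and rises towards 1 near the extreme low and high values.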

Based on the concept of relevance, Ribeiro [36] has also proposed an evaluation framework that allows us to assert the quality of numeric predictions considering the user bias. We use this evaluation framework to ascertain the predictive accuracy when using imbalanced time series data, by combining standard learning algorithms and resampling strategies.

The hypotheses tested in our experimental evaluation are:

Hypothesis 1

The use of resampling strategies significantly improves the predictive accuracy of forecasting models on imbalanced time series in comparison with the standard use of out-of-the-box regression tools.

Hypothesis 2

The use of bias in case selection of resampling strategies significantly improves the predictive accuracy of forecasting models on imbalanced time series in comparison with non-biased strategies.

Hypothesis 3

The use of resampling strategies significantly improves the predictive accuracy of forecasting models on imbalanced time series in comparison with the use of time series-specific models.

From a practical point of view, only time series forecasting tasks with rare important cases may benefit from the proposed approach. Our target applications are forecasting tasks where the user has a preference bias towards the rare values, which also motivates the use of specific performance assessment measures that are able to capture what is important to the user. Also, the hypotheses tested are only meaningful in the context of time series with imbalanced distributions where the user is more interested in obtaining more accurate predictions on the least represented cases. This means that our proposed approach is not suitable for forecasting tasks whose goal is accurate predictions across the entire domain, irrespective of where the errors occur.

3 Resampling strategies

Resampling strategies are pre-processing approaches that change the original data distribution in order to meet some user-given criteria. Among the advantages of pre-processing strategies is the ability to use any standard learning tool. However, matching a change in the data distribution with the user preferences is not a trivial task. The proposed resampling strategies aim at pre-processing the data to obtain increased predictive performance on cases that are scarce and simultaneously important to the user. As mentioned before, this importance is described by a relevance function \(\phi (Y)\). Being domain-dependent information, it is the user's responsibility to specify the relevance function. Nonetheless, when lacking expert knowledge, it is possible to automatically generate the relevance function. Being a continuous function on the scale [0, 1], we require the user to specify a relevance threshold, \(t_R\), that establishes the minimum relevance score for a certain value of the target variable to be considered relevant. This threshold is only required because the proposed resampling strategies need to be able to decide which values are the most relevant when the distribution changes.

Figure 2 shows an example of an automatically generated relevance function, with a 0.9 relevance threshold, defined for the temperature time series (Fig. 1) obtained from the Bike Sharing data source [14] using observations between 22 March and 1 May 2011. In this example, we assign more importance to the highest and lowest values of Y.

Our resampling strategies proposals for imbalanced time series data are based on the concept of relevance bins. These are successive observations of the time series where the observed value is either relevant or irrelevant, for the user. Algorithm 1 describes how these bins are created from the original time series. The algorithm uses time stamp information and the relevance of the values from the original time series, to cluster the observations into bins that have the following properties:

  1. Each bin contains observations whose target variable value has a relevance score that is either all above or all below the relevance threshold \(t_R\); and

  2. Observations in a given bin are always consecutive cases in terms of the time stamp.
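The two properties above can be sketched as follows (a simplified stand-in for Algorithm 1, using a toy relevance function of our own):

```python
from itertools import groupby

def create_bins(values, phi, t_r=0.9):
    """Sketch of Algorithm 1: group consecutive observations into
    maximal runs whose relevance is uniformly above (relevant bin)
    or not above (normal bin) the threshold t_r."""
    flagged = [(i, phi(v) > t_r) for i, v in enumerate(values)]
    return [
        {"relevant": flag, "indices": [i for i, _ in group]}
        for flag, group in groupby(flagged, key=lambda p: p[1])
    ]

phi = lambda v: 1.0 if v >= 5 else 0.0          # toy relevance function
bins = create_bins([1, 2, 6, 7, 1], phi, t_r=0.9)
# three bins: a normal run, a relevant run, and a trailing normal run
```

Each bin then becomes the unit within which the resampling strategies operate.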

Fig. 1

Sample of temperature time series from the bike sharing data source [14]

Fig. 2

Relevance function \(\phi (Y)\) with a relevance threshold of 0.9 (dashed line) for the time series shown in Fig. 1


Figure 3 shows the bins obtained in the temperature time series displayed in Fig. 1. The six dashed rectangles represent the bins containing consecutive observations with relevant values of the target variable, while the non-dashed regions correspond to consecutive observations with common values of lower relevance to the user, based on the automatically generated relevance function (Fig. 2). This means that, for the example under consideration, we have 13 bins: 6 bins with relevant values, and 7 bins with common values (non-dashed areas).

Fig. 3

Bins generated for time series of Fig. 1 with relevance function (\(\phi ()\)) provided in Fig. 2 using a relevance threshold of 0.9 (dashed ranges represent bins with important cases)

Our first proposals are an adaptation to the time series context of the random undersampling, random oversampling and SmoteR strategies proposed by Torgo et al. [46] and Branco et al. [4] for tackling imbalanced regression tasks. The main change applied in these algorithms is the way the sampling is carried out. Instead of pure random selection as in the original algorithms, here we carry out sampling within each individual bin.

The random undersampling (U_B) strategy is described in Algorithm 2. This approach has the default behaviour of balancing the number of normal and rare values by randomly removing examples from the bins with normal cases, i.e. bins with low relevance examples. In this case, the number of examples removed is automatically calculated to ensure that: (1) each undersampled bin gets the same number of normal cases; and (2) the total number of normal and rare cases are balanced. The algorithm also allows the specification of a particular undersampling percentage through the parameter u. When the user sets this percentage, the number of cases removed is calculated for each bin with normal values. The percentage \(u < 1\) defines the number of examples that are maintained in each bin.
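The default balancing behaviour can be sketched as follows (a simplified illustration of U_B; the helper name, the list-of-lists bin representation and the integer rounding are our assumptions, not the paper's exact Algorithm 2):

```python
import random

def undersample_bins(normal_bins, n_rare, u=None, seed=42):
    """Sketch of U_B: by default keep n_rare / (number of normal bins)
    cases in each normal bin, so the kept normal cases balance the
    rare ones; with 0 < u < 1, keep a fraction u of each bin instead."""
    rng = random.Random(seed)
    kept = []
    for b in normal_bins:
        n_keep = int(u * len(b)) if u is not None \
            else n_rare // len(normal_bins)
        kept.append(rng.sample(b, min(n_keep, len(b))))
    return kept

# two normal bins and 6 rare cases: keep 3 cases in each normal bin
kept = undersample_bins([list(range(10)), list(range(8))], n_rare=6)
```

With the default setting, the kept normal cases total 6, matching the number of rare cases.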


Our second proposal is the random oversampling (O_B) approach that is described in Algorithm 3. In this strategy, the default behaviour is to balance the number of normal and rare cases with the introduction of replicas of the most relevant and rare cases in the bins containing examples with high relevance. The number of copies included is automatically determined to ensure: (1) balance between rare and normal cases and (2) the same frequency in the oversampled bins. An optional parameter o allows the user to select a specific percentage of oversampling to apply in each bin with relevant values.


The third strategy (SM_B) is an adaptation of the SmoteR algorithm to the time series context. The SmoteR algorithm combines random undersampling with oversampling through the generation of synthetic cases. The default behaviour of this strategy is to automatically balance the number of examples in the bins. The random undersampling part is carried out through the process described in Algorithm 2. The oversampling strategy generates new synthetic cases by interpolating a seed example with one of its k-nearest neighbours from the respective bin of rare examples. The main difference between SM_B and the original SmoteR algorithm is on the process used to select the cases for both under- and oversampling. SM_B works with time series data, and thus it must take the time ordering of the cases into account, which we have done by defining the relevance bins that are formed by subsets of cases that are adjacent in terms of time.
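The SmoteR-style generation step can be sketched as follows (a simplified stand-in for Algorithm 4; the case layout, parameter values and Euclidean distance are our assumptions):

```python
import numpy as np

def smoter_synth(bin_cases, k=2, n_new=4, seed=0):
    """Sketch of SmoteR generation inside one relevant bin: interpolate
    a random seed case with one of its k nearest neighbours (rows are
    cases; the last column is the target value)."""
    rng = np.random.default_rng(seed)
    cases = np.asarray(bin_cases, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(cases))
        dists = np.linalg.norm(cases - cases[i], axis=1)
        j = rng.choice(np.argsort(dists)[1:k + 1])   # skip the seed itself
        new.append(cases[i] + rng.random() * (cases[j] - cases[i]))
    return np.array(new)

# four toy cases with two predictors and a target in the last column
bin_cases = [[0, 0, 10], [1, 0, 11], [0, 1, 12], [1, 1, 13]]
synth = smoter_synth(bin_cases)
```

Because each synthetic case lies on the segment between two existing cases of the bin, its values stay within the bin's ranges.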

Algorithm 4 shows the process for generating synthetic examples, and Algorithm 5 describes the SM_B algorithm. This algorithm by default balances the cases in the bins. Alternatively, the user may set the percentages of under/oversampling to be applied in the bins using parameters u and o. These are optional parameters that allow the user to completely control the percentages applied.


3.1 Resampling with temporal bias

Concept drift is one of the main challenges in time series forecasting. This is particularly true for our target applications where the preference bias of the user concerns rare values of the series. In effect, this rarity makes it even more important to understand and anticipate when these shifts of regime occur.

A first step in the identification of these different regimes according to user preferences is implemented by the previously described creation of relevance bins using Algorithm 1 (c.f. Fig. 3). Still, within each bin the cases are not equally relevant. We claim that the most recent cases within each bin may potentially contain important information for understanding these changes in regime. In this context, we propose three new algorithms (Undersampling, Oversampling and SmoteR with Temporal Bias) that favour the selection of training cases that are in the vicinity of transitions between bins. This resembles the adaptive learning notion of gradual forgetting, where the older cases have a higher likelihood of being excluded from the learning data. However, whereas gradual forgetting is applied to the full extent of the data, in our proposal the temporal bias is applied within each bin of normal cases.

The Undersampling with Temporal Bias (U_T) proposal is based on Algorithm 2. The main difference is the process of selecting examples to undersample within each bin of normal cases. Instead of randomly selecting cases, we use a biased undersampling procedure. In U_T, for each bin where undersampling is applied, the older the example is, the lower the probability of being selected for the new training set. This provides a modified distribution which is balanced in terms of normal and rare cases with a probabilistic preference towards the most recent cases, i.e. those in the vicinity of bin transitions. The integration of the temporal bias is performed as follows:

  • order the cases in each bin B of normal cases by increasing time in a new bin OrdB;

  • assign the preference of \(i \times \frac{1}{|OrdB|}\) for selecting \(ex_i\) in OrdB, where \(i \in ( 1,\ldots ,|OrdB| )\);

  • select a sample from OrdB based on the former preferences.

This corresponds to substituting line 17 in Algorithm 2 by the lines 11, 12 and 13 previously presented in Algorithm 6.
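The biased selection described in the three steps above can be sketched as follows (a minimal illustration; the helper name and the use of NumPy weighted sampling without replacement are our choices):

```python
import numpy as np

def temporal_biased_sample(ord_bin, n, seed=1):
    """Sketch of the temporal bias: ord_bin is time-ordered (oldest
    first) and the i-th case (1-based) gets selection weight i/|OrdB|,
    so sampling without replacement favours the most recent cases."""
    m = len(ord_bin)
    w = np.arange(1, m + 1) / m            # preference i / |OrdB|
    rng = np.random.default_rng(seed)
    idx = rng.choice(m, size=n, replace=False, p=w / w.sum())
    return [ord_bin[i] for i in sorted(idx)]

sample = temporal_biased_sample(list(range(20)), n=5)
```

The most recent case is m times more likely to be drawn than the oldest, while every case retains a non-zero probability.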


Our second proposed strategy, oversampling with temporal bias (O_T), is based on Algorithm 3. This strategy performs oversampling giving a higher preference to the most recent examples. This way, the strategy incorporates a bias towards the newer cases in the replicas selected for inclusion. The integration of the temporal bias is achieved as follows:

  • order the cases in each bin B of rare cases by increasing time in a new bin OrdB;

  • assign the preference of \(i \times \frac{1}{|OrdB|}\) for selecting \(ex_i\) in OrdB, where \(i \in ( 1,\ldots ,|OrdB| )\);

  • select a sample from OrdB based on the former preferences.

This corresponds to replacing line 17 in Algorithm 3 by the lines 11, 12 and 13 presented in Algorithm 7.

Our third proposed strategy is SmoteR with Temporal Bias (SM_T). This approach combines undersampling with temporal bias in the bins containing normal cases, with an oversampling mechanism that also integrates a temporal component. The undersampling with temporal bias strategy is the same as described in Algorithm 6. Regarding the oversampling strategy, we included in the SmoteR generation of synthetic examples a preference for the most recent examples. This means that when generating a new synthetic case, after evaluating the k-nearest neighbours of the seed example, the neighbour selected for the interpolation process is the most recent case. This introduces, in the generation of synthetic cases, a time bias towards the most recent examples, instead of selecting cases at random. Algorithm 8 shows the lines that were changed in Algorithm 5. To include the temporal bias, we have replaced line 19 in Algorithm 5, referring to the undersampling step, by lines 12, 13 and 14 in Algorithm 8. Also, concerning the oversampling step, we replaced line 28 in Algorithm 5 by line 28 in Algorithm 8.

Regarding the function for generating synthetic examples, Algorithm 9 describes what was necessary to change in Algorithm 4 for including the temporal bias. In this case, only line 13 of Algorithm 4 was changed, in order to consider the time factor, so that the nearest neighbour is not randomly selected.


3.2 Resampling with temporal and relevance bias

This section describes our final proposals of resampling strategies for imbalanced time series forecasting. The idea of the three algorithms described in this section is to also include the relevance scores in the sampling bias. The motivation is that while we assume that the most recent cases within each bin are important as they precede regime changes, we consider that older cases that are highly relevant should not be completely disregarded given the user preferences. To combine the temporal and relevance bias, we propose three new algorithms: undersampling (Algorithm 10), oversampling (Algorithm 11) and SmoteR with temporal and relevance bias (Algorithm 12).

The integration of temporal and relevance bias in undersampling (U_TPhi) is performed as follows:

  • order examples in each bin B of normal cases by increasing time in a new bin OrdB;

  • for each example \(ex_i\) in OrdB use \(\frac{i}{|OrdB|} \times \phi (ex_i[y])\) as the preference of selecting example \(ex_i\);

  • sample a number of examples from OrdB assuming the previously determined preferences.

This process corresponds to replacing the line 17 in Algorithm 2 by the lines 11, 12 and 13 in Algorithm 10.
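The combined weighting in the steps above can be sketched as follows (our illustration; it differs from the purely temporal version only in the extra \(\phi\) factor, and the helper name is an assumption):

```python
import numpy as np

def temporal_relevance_sample(ord_bin, rel, n, seed=7):
    """Sketch of the combined bias (U_TPhi): the i-th (1-based)
    time-ordered case gets weight (i / |OrdB|) * phi(y_i), so an old
    but highly relevant case is not automatically discarded."""
    m = len(ord_bin)
    w = (np.arange(1, m + 1) / m) * np.asarray(rel, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(m, size=n, replace=False, p=w / w.sum())
    return [ord_bin[i] for i in sorted(idx)]

# ten time-ordered cases; the last one is also the most relevant
sample = temporal_relevance_sample(list(range(10)),
                                   rel=[0.5] * 9 + [0.9], n=4)
```

A case that is both recent and relevant gets the largest weight; an old case with high relevance still competes with recent low-relevance ones.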


In order to incorporate a temporal and relevance bias in the oversampling algorithm (O_TPhi), the following steps were necessary:

  • order examples in each bin B of rare cases by increasing time in a new bin OrdB;

  • for each example \(ex_i\) in OrdB use \(\frac{i}{|OrdB|} \times \phi (ex_i[y])\) as the preference of selecting example \(ex_i\);

  • sample a number of examples from OrdB assuming the above preferences.

This corresponds to replacing line 17 in Algorithm 3 by lines 11, 12 and 13 in Algorithm 11. These changes allow us to bias the oversampling procedure towards recent cases of high relevance.


The same integration of time and relevance bias is also done in the SmoteR algorithm. In this case, we altered both the undersampling and oversampling steps of the SmoteR algorithm. Algorithm 12 shows what was changed in Algorithm 5 to accomplish this. Lines 19 and 28 of Algorithm 5 were replaced by lines 12, 13 and 14, and by line 15 in Algorithm 12, respectively. These changes correspond to biasing the undersampling process to consider the time and relevance of the examples in each bin, as previously described: the most recent examples with higher relevance are preferred over others for staying in the changed data set. Regarding the oversampling strategy, the generation of synthetic examples also follows this tendency, i.e. the new examples are built by prioritising the selection of highly relevant and recent examples. Algorithm 13 shows the changes made in Algorithm 4 (line 13 in Algorithm 4 was replaced by lines 13, 14 and 15). The bias towards more recent and highly relevant examples is achieved in the selection of a nearest neighbour for the interpolation, as follows:

  • calculate the relevance of the k-nearest neighbours;

  • calculate the time positions of the k-nearest neighbours in ascending order, normalized to [0, 1];

  • select the nearest neighbour with the highest value of the product of relevance by time position.

These changes bias the undersampling and the generation of new cases of SmoteR algorithm towards the most recent and relevant cases.
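The neighbour selection in the three steps above can be sketched as follows (our illustration; the function name and the rank-based normalization are assumptions):

```python
import numpy as np

def pick_neighbour(neigh_relevance, neigh_timestamps):
    """Sketch of the SM_TPhi neighbour choice: rank the k nearest
    neighbours by time, normalize the ranks to (0, 1], and pick the
    neighbour maximizing relevance * time position."""
    k = len(neigh_relevance)
    ranks = np.argsort(np.argsort(neigh_timestamps)) + 1   # 1 = oldest
    time_pos = ranks / k                                   # in (0, 1]
    return int(np.argmax(np.asarray(neigh_relevance) * time_pos))

# neighbour 0 is the most recent; despite slightly lower relevance,
# its product 0.9 * 1.0 beats 0.95 * 1/3 and 0.99 * 2/3
best = pick_neighbour([0.9, 0.95, 0.99], [30, 10, 20])
```

The product deterministically trades off recency against relevance, so this step involves no random choice.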


In summary, for each of the three resampling strategies considered (random undersampling, random oversampling and SmoteR), we have proposed three new variants that try to incorporate some form of sampling bias that we hypothesize as being advantageous in terms of forecasting accuracy on imbalanced time series tasks where the user favours the performance on rare values of the series. The first variants (U_B, O_B and SM_B) carry out sampling within relevance bins that are obtained with the goal of including successive cases with similar relevance according to the user preference. The second variants (U_T, O_T and SM_T) add to the first variant a preference towards the most recent cases within each bin as these are the cases that precede regime transitions. Finally, the third variants (U_TPhi, O_TPhi and SM_TPhi) add a third preference to the sampling procedures, to also include the relevance scores of the cases and avoid discarding cases that may not be the most recent, but are the most relevant for the user.

Table 1 Description of the data sets used

4 Materials and methods

4.1 Data

The experiments described in this paper use data from 6 different sources, totalling 24 time series from diverse real-world domains. For the purposes of evaluation, we assumed that each time series is independent from the others of the same source (i.e. we did not use the temperature time series data in the Bike Sharing source to predict the count of bike rentals). All proposed resampling strategies, in combination with each of the regression tools, are tested on these 24 time series, which are detailed in Table 1. All of the time series were pre-processed to overcome some well-known issues with this type of data, such as non-available (NA) observations. To resolve issues of this type, we resorted to the imputation of values using the R function knnImputation of the package DMwR [42]. For each of these time series, we applied the previously described approach of time delay coordinate embedding. It requires an essential parameter: how many values to include as recent values, i.e. the size of the embed, k. This is not a trivial task, as it requires trying different values of the embed size in order to decide on an acceptable value. In our experiments, we have used \(k=10\). Experiments with a few other values have not shown significant differences in results. The application of this embedding approach produces the data sets used as learning data.

For each of these data sets, we need to decide which are the relevant ranges of the time series values. To this purpose, we use a relevance function. As previously mentioned, due to the lack of expert knowledge concerning the used domains, we resort to an automatic approach to define the relevance function, detailed in Ribeiro [36]. This approach uses box plot statistics to derive a relevance function that assigns higher relevance scores to values that are unusually high or low, i.e. extreme and rare values. We use this process to obtain the relevance functions for all our time series. An example of the application of this approach, where only high extreme values exist (from a data set on water consumption in the area of Rotunda AEP in the city of Porto), is depicted in Fig. 4, while Fig. 2 shows a case with both types of extremes. Having defined the relevance functions, we still need to set a threshold on the relevance scores above which a value is considered important, i.e. the relevance threshold \(t_R\). The definition of this parameter is domain dependent. Still, we have used a relevance threshold \(t_R\) of 0.9, which generally leads to a small percentage of the values being considered important. In Table 1 we added an indication concerning the proportion of rare cases (both very high and low values) for each used data set.

Fig. 4

Relevance function \(\phi ()\) with high extreme values and box plot of Y distribution

4.2 Regression algorithms

To test our hypotheses, we selected a diverse set of standard regression tools. Our goal is to verify that our conclusions are not biased by a particular tool.

Table 2 shows the regression methods used in our experiments. To ensure that our work is easily replicable, we used the implementations of these tools available in the free and open-source R environment. Concerning the parameter settings for each of these regression methods, we carried out a preliminary test to search for the optimal parameterization (i.e. the setting that obtains the best possible results within a certain set of values of the parameters). The search for optimal parameters was carried out for each combination of regression method and data set, and the results are detailed in “Annex 1”. In addition to these standard regression tools, we also include two time series-specific forecasting approaches: (i) the ARIMA model [8] and (ii) a bagging approach proposed by Oliveira and Torgo [34]. Regarding the first, ARIMA models require a significant tuning effort in terms of parameters. To tackle this issue, we used the auto.arima function available in the R package forecast [17], which implements an automatic search method for the optimal parameter settings. The second is a bagging approach for time series forecasting tasks using bagged regression trees. The authors discuss the difficulties in optimizing the size of the embed (w.r.t. time delay embedding [39]) and propose the use of ensembles with models using different values for the embed size. The authors report best results using ensembles where a third of the models use the maximum embed \(k_{max}\), another third uses an embed of \(k_{max}/2\) and the last third uses \(k_{max}/4\). Additionally, all models within the ensemble use the mean and variance of the respective embed as extra features. This approach will be henceforth referred to as BDES.

Table 2 Regression algorithms and respective R packages

4.3 Evaluation metrics

When the interest of the user is predictive performance on a small proportion of cases (i.e. rare cases), the use of standard performance metrics leads to biased conclusions [36]. In effect, standard metrics focus on the “average” behaviour of the prediction models, whereas for the tasks addressed in this paper the user's goal concerns a small proportion of cases. Although most of the previous studies on this type of issue focus on classification tasks, Torgo and Ribeiro [36, 44] have shown that the same problems arise in numeric prediction tasks when using standard metrics, such as mean squared error.

In this context, we base our evaluation on the utility-based regression framework proposed by Torgo and Ribeiro [36, 44], which also assumes the existence of a relevance function \(\phi \), such as the one previously described. Using this approach and a user-provided relevance threshold, the authors defined a series of metrics that focus the evaluation of models on the cases the user is interested in. In our experiments, we used the value 0.9 as relevance threshold.

In our evaluation process, we mainly rely on the utility-based regression metric F1-Score, denoted as \({F1}_{\phi }\). It integrates the precision and recall measures proposed in the framework of Ribeiro [36] and extended by Branco et al. [3]. In this context, precision, recall and F1-Score are defined as:

$$\begin{aligned} prec_{\phi }= & {} \frac{\sum _{\phi (\hat{y}_i)>t_R}(1+u(\hat{y}_i, y_i))}{\sum _{\phi (\hat{y}_i)>t_R}(1+\phi (\hat{y}_i))} \end{aligned}$$
(1)
$$\begin{aligned} rec_{\phi }= & {} \frac{\sum _{\phi (y_i)>t_R}(1+u(\hat{y}_i, y_i))}{\sum _{\phi (y_i)>t_R}(1+\phi (y_i))} \end{aligned}$$
(2)
$$\begin{aligned} F1_{\phi }= & {} 2 \times \frac{ prec_{\phi } \times rec_{\phi } }{ prec_{\phi } + rec_{\phi } } \end{aligned}$$
(3)

where \(\phi (y_i)\) is the relevance associated with the true value \(y_i\), \(\phi (\hat{y}_i)\) is the relevance of the predicted value \(\hat{y}_i\), \(t_R\) is the user-defined threshold signalling the cases that are relevant for the user, and \(u(\hat{y}_i, y_i)\) is the utility of making the prediction \(\hat{y}_i\) for the true value \(y_i\), normalized to \([-1,1]\).
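Given arrays of relevance and utility values, Eqs. (1)–(3) can be computed directly. The following sketch is in Python for illustration (the function and argument names are ours):

```python
def f1_phi(y_true_rel, y_pred_rel, utility, t_r=0.9):
    """Utility-based precision, recall and F1 (Eqs. 1-3).
    y_true_rel[i] = phi(y_i),       relevance of the true value
    y_pred_rel[i] = phi(yhat_i),    relevance of the predicted value
    utility[i]    = u(yhat_i, y_i), normalized to [-1, 1]."""
    # precision sums over cases whose *predicted* value is relevant
    prec_num = sum(1 + u for pr, u in zip(y_pred_rel, utility) if pr > t_r)
    prec_den = sum(1 + pr for pr in y_pred_rel if pr > t_r)
    # recall sums over cases whose *true* value is relevant
    rec_num = sum(1 + u for tr, u in zip(y_true_rel, utility) if tr > t_r)
    rec_den = sum(1 + tr for tr in y_true_rel if tr > t_r)
    prec = prec_num / prec_den
    rec = rec_num / rec_den
    return 2 * prec * rec / (prec + rec)
```

Note that when every relevant case is predicted perfectly, utility equals relevance on those cases and the score reaches 1; inaccurate predictions on relevant cases can carry negative utility and drive the score down.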

Utility is commonly defined as a function combining positive benefits and negative benefits (costs). In this paper, we use the utility surfaces approach of Ribeiro [36]. Differently from classification tasks, utility is here interpreted as a continuous version of the benefit matrix proposed by Elkan [13]. Coarsely, utility U is defined as the difference between benefits B and costs C, \(U=B-C\). To calculate utility, two factors are taken into consideration: (i) whether the true and predicted values, given their respective relevance, belong to similar relevance bins (e.g. both values are high extremes and highly relevant); and (ii) whether the prediction is reasonably accurate, given a factor of maximum admissible loss defined by the author. Figures 5 and 6 illustrate the utility surfaces given by the approach of Ribeiro [36] for the relevance functions presented in Figs. 2 and 4, where the former has both high and low extreme values, and the latter only has high extreme values.

In Fig. 5, we observe that, for accurate predictions (on the diagonal), the utility values range between 0 and 1. The higher utility values are given to both extremes (low and high) of the target variable. Outside the diagonal, there is an error that must also be taken into account. Predictions reasonably close to the true values have a positive utility. However, as the distance between the predicted and true values grows, the utility becomes negative, tending to \(-1\). Figure 6 shows a similar setting with only one type of extreme: high extreme values.

Fig. 5
figure 5

Utility surface for the relevance function depicted in Fig. 2

Fig. 6
figure 6

Utility surface for the relevance function depicted in Fig. 4

5 Experimental evaluation

This section presents the results of our experimental evaluation, organized in three sets of experiments concerning forecasting tasks with imbalanced time series data sets. Each of these experiments was designed to test one of the hypotheses set forth in Sect. 2. In the first set, we evaluate the predictive accuracy of standard regression tools in combination with the proposed resampling strategies. In the second set, the evaluation focuses on whether the biased resampling strategies outperform the non-biased strategies. Finally, in the third set, we evaluate the hypothesis that standard regression tools combined with resampling strategies achieve better predictive performance than time series-specific forecasting approaches such as the ARIMA and BDES models. These models and all of the proposed resampling strategies combined with each of the standard regression tools were tested on 24 real-world time series data sets, obtained from six different data sources described in Table 1. In every application of the proposed resampling strategies, an inference method is applied to set the parameters concerning the amount of undersampling and oversampling. The objective of this method is to balance the number of normal and relevant cases so that the training data contains an equal number of both.
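The inference of the resampling amounts amounts to a simple ratio computation. The sketch below is an illustrative Python rendering under our own naming (the actual implementation is part of the authors' R code):

```python
def balance_percentages(relevance, t_r=0.9):
    """Infer under/oversampling percentages so the training data ends up
    with an equal number of normal and relevant cases.
    relevance[i] is phi(y_i) for training case i."""
    n_rel = sum(1 for r in relevance if r > t_r)
    n_norm = len(relevance) - n_rel
    # undersampling: keep only n_rel of the n_norm normal cases
    under_pct = n_rel / n_norm
    # oversampling: replicate relevant cases until they match the normal ones
    over_pct = n_norm / n_rel
    return under_pct, over_pct
```

For instance, with 2 relevant cases among 10, undersampling keeps 25% of the 8 normal cases, while oversampling grows the relevant cases by a factor of 4.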

The evaluation process is based on the evaluation metric \(F1_{\phi }\), as described by the referred utility-based regression framework (see Sect. 4.3). Concerning the testing of our hypotheses, we resort to paired comparisons using Wilcoxon signed rank tests to infer the statistical significance (with p value \(<0.05\)) of the paired differences in the outcomes of the approaches.
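For one data set, the paired comparison over the estimation repetitions can be sketched as below (Python with SciPy for illustration; the paper's analysis is done in R, and the win/loss labelling scheme here is our own):

```python
from scipy.stats import wilcoxon

def paired_outcome(scores_a, scores_b, alpha=0.05):
    """Compare two approaches over paired repetitions on one data set.
    Returns 'win', 'significant win', 'loss' or 'significant loss'
    for approach A against approach B (two-sided Wilcoxon signed rank)."""
    stat, p = wilcoxon(scores_a, scores_b)
    mean_diff = sum(a - b for a, b in zip(scores_a, scores_b)) / len(scores_a)
    label = "win" if mean_diff > 0 else "loss"
    return f"significant {label}" if p < alpha else label
```

Tallying these labels over all data sets yields entries in the "Wins (Significant Wins) / Losses (Significant Losses)" format used by Tables 3-5.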

Fig. 7
figure 7

Evaluation of regression algorithms and resampling strategies, with the mean utility-based regression metric \(F1_{\phi }\)

Concerning evaluation algorithms, caution is required when deciding how to obtain reliable estimates of the evaluation metrics. Since time series data are temporally ordered, we must ensure that the original order of the cases is maintained, so as to guarantee that prediction models are trained with past data and tested with future data, thus avoiding over-fitting and over-estimated scores. As such, we rely on Monte Carlo estimates as the experimental methodology for our evaluation. This methodology selects a set of random points in the data. For each of these points, a past window is selected as training data (Tr) and a subsequent window as test data (Ts). This methodology guarantees that each method used in our forecasting task is evaluated on the same training and test sets, thus ensuring a fair pairwise comparison of the estimates obtained. In our evaluation, 50 repetitions of the Monte Carlo estimation process are carried out for each data set, with 50% of the cases used as training set and the subsequent 25% used as test set. Exceptionally, due to their size, for data sets DS21 and DS22 we used 10% of the cases as training set and the following 5% as test set, and for data sets DS23 and DS24 we used 20% of the cases as training set and the following 10% as test set. This process is carried out using the infrastructure provided by the R package performanceEstimation [43].
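The Monte Carlo splitting scheme can be sketched as follows (illustrative Python; the experiments themselves rely on the R package performanceEstimation):

```python
import random

def monte_carlo_splits(n, train_frac=0.5, test_frac=0.25, reps=50, seed=42):
    """Pick random time points; train on the window before each point and
    test on the window after it, preserving temporal order (past -> future)."""
    rng = random.Random(seed)
    train_size = int(n * train_frac)
    test_size = int(n * test_frac)
    splits = []
    for _ in range(reps):
        # the window start must leave room for a full train window plus
        # a full test window within the series
        start = rng.randint(0, n - train_size - test_size)
        tr = range(start, start + train_size)
        ts = range(start + train_size, start + train_size + test_size)
        splits.append((tr, ts))
    return splits
```

Because the same seed produces the same 50 splits, every method is trained and tested on identical windows, which is what makes the pairwise comparisons fair.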

In order to clarify the nomenclature associated with the standard regression tools used in this evaluation process, the experiments include results given by multiple linear regression (LM), support vector machine (SVM), multivariate adaptive regression splines (MARS), random forest (RF) and regression trees (RPART) models. As for the resampling strategies, we use random undersampling (U_B), random oversampling (O_B), SmoteR (SM_B), undersampling (U_T), oversampling (O_T) and SmoteR (SM_T) with temporal bias, and undersampling (U_TPhi), oversampling (O_TPhi) and SmoteR (SM_TPhi) with temporal and relevance bias. The overall results given by the \(F1_{\phi }\) evaluation metric proposed by Ribeiro [36], obtained with Monte Carlo estimates, concerning all 24 time series data sets are presented in Fig. 7.

Table 3 Paired comparisons results of each Regression Algorithm Baseline with the application of Resampling Strategies, in the format Number of Wins (Statistically Significant Wins) / Number of Losses (Statistically Significant Losses)
Table 4 Paired comparisons results of each Regression algorithm with Baseline Resampling Strategies and the application of Biased Resampling Strategies, in the format Number of Wins (Statistically Significant Wins) / Number of Losses (Statistically Significant Losses)

From the obtained results, we observe that the application of resampling strategies shows great potential for boosting the performance of forecasting tasks using imbalanced time series data. This is observed within each of the standard regression tools used (vertical analysis), but also across the data sets used (horizontal analysis), where it is clear that the approaches employing resampling strategies obtain the best results overall, according to the averaged \(F1_{\phi }\) evaluation metric. We should note that the results obtained by the baseline SVM models with the optimal parameter search method employed are very competitive and provide a better result than the resampled approaches on several occasions. We should also note that although an optimal parameter search method was employed for the baseline regression algorithms, and those parameters were reused in the resampled alternatives, a similar search was not carried out for the under and oversampling percentages. This is intended, as our objective is to assess the impact of these resampling strategies in a default setting, i.e. balancing the number of normal and rare cases.

5.1 Hypothesis 1

The first hypothesis brought forth in our work proposes that the use of resampling strategies significantly improves the predictive accuracy of imbalanced time series forecasting tasks in comparison with the use of standard regression tools alone. Although the results presented in Fig. 7 point to the empirical confirmation of this hypothesis, the statistical significance of the difference between using and not using resampling strategies combined with standard regression tools remains unclear.

Table 3 presents the paired comparisons of the application of random undersampling (U_B), random oversampling (O_B) and SmoteR (SM_B), and the standard regression tools with the application of the optimal parameter search method and without any applied resampling strategy. The information in the columns represents the number of wins and losses for each approach against the baseline. In this case, the baseline represents the optimized models from the regression tools, without the application of resampling strategies.

We can observe that the use of resampling strategies adds a significant boost in terms of forecasting relevant cases in imbalanced time series data, when compared to its non-use, for all standard regression tools employed in the experiment, except for the SVM models. For these, although not by a considerable margin, the baseline models collected more significant wins. Nonetheless, these experiments still provide sufficient overall empirical evidence to confirm our first hypothesis.

Given the results on the \(F1_{\phi }\) measure, a natural question arises: are these results a reflection of a good performance in only one of the two metrics on which \(F1_{\phi }\) depends? To assess this, we observed the results of both \(rec_{\phi }\) and \(prec_{\phi }\) for all alternative approaches tested. These figures are available at http://tinyurl.com/z4xlup5. Generally, the results obtained with resampling strategies for the \(prec_{\phi }\) measure present higher gains than those obtained for \(rec_{\phi }\). Still, we do not observe a performance decrease on the \(rec_{\phi }\) metric in the time series data used. This means that the higher \(F1_{\phi }\) results are obtained mostly due to higher \(prec_{\phi }\) values.

5.2 Hypothesis 2

The second hypothesis states that the use of a temporal and/or relevance bias in resampling strategies significantly improves the predictive accuracy of time series forecasting tasks in comparison with the baseline versions of each respective strategy. To empirically test this hypothesis, Table 4 presents the paired comparisons of the application of the resampling strategies U_T, U_TPhi, O_T, O_TPhi, SM_T and SM_TPhi against the respective resampling strategies U_B, O_B and SM_B, for each standard regression tool. For this set of experiments, the baseline is defined as the application of random undersampling, random oversampling and SmoteR in their initial adaptation to imbalanced time series.

Results show an overall advantage of using temporal and/or relevance bias in the case selection process for random undersampling and random oversampling. In the case of SmoteR, results show that the use of temporal and/or relevance bias did not improve results, given the experimental design used. In the case of random undersampling, the use of temporal bias alone does not provide any clear advantage over the baseline version of the strategy; however, when applying both temporal and relevance bias, results show significant improvements. As to random oversampling, both proposals (temporal bias, and temporal and relevance bias) show that in many cases a significant advantage can be obtained, but there is no clear winner between the two. As such, the application of temporal or temporal and relevance bias provides empirical evidence confirming our second hypothesis in the case of under and oversampling.

5.3 Hypothesis 3

The third hypothesis proposed in our work is that the use of resampling strategies significantly improves the predictive accuracy of time series forecasting tasks in comparison with the use of ARIMA and BDES models, which are approaches designed specifically for time series forecasting. In this context, we want to check if our proposals based on resampling are able to significantly improve on the predictive performance of these models. Recall that in this evaluation we employed a version of ARIMA that automatically searches for the optimal number of past values to build the embed, while the standard regression tools are used with an optimal parameter setting for their baseline regression algorithm and enhanced through the proposed resampling strategies. The results of the paired comparisons of all the approaches employing resampling strategies against the ARIMA and BDES models (considered the baseline) are presented in Table 5.

Table 5 Paired comparisons results of ARIMA and BDES models and the application of Resampling Strategies in each Regression algorithm, in the format Number of Wins (Statistically Significant Wins) / Number of Losses (Statistically Significant Losses)

Results show that, independently of the regression tool used, the application of resampling strategies provides a highly significant improvement over the results obtained by the ARIMA and BDES models. This confirms our third and final hypothesis.

6 Discussion

Although the results presented in the experimental evaluation prove to some extent the hypotheses set forth in our work, they may not provide the strongest possible evidence, given the experimental settings. The main reason for this is the optimal parameter search method applied to the regression algorithms.

This method derives multiple models using diverse parameter settings in order to find the best option for each pair of regression algorithm and data set. These optimal parameter settings are also used in the models where resampling strategies are applied. This option was intended to ensure that any observed differences are caused only by the use of the resampling strategies. Nonetheless, there is no evidence or intuition suggesting that the best parameter settings for the baseline regression algorithms should also be the best settings for the models where resampling strategies are applied.

This raises the problem of uncovering the real potential of resampling strategies when they are optimized by a similar parameter search method, i.e. by also testing the parameters of the strategies themselves (the percentage of cases to remove and/or add). However, this may come at a great computational cost. For example, when using the search method described in “Annex 1” with an additional five possible values for the undersampling percentage and four values for the oversampling percentage, the number of models produced to decide the optimal parameter settings could amount to about 600 for a single regression algorithm–data set pair.
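To make the cost concrete, here is a back-of-envelope count. The baseline grid below is hypothetical (the actual grids are in “Annex 1”); it only illustrates how adding resampling percentages multiplies the search space:

```python
from itertools import product

# hypothetical 30-setting baseline grid for one learner (for illustration only;
# the real per-algorithm grids are listed in Annex 1)
baseline_grid = {
    "cost": [1, 10, 100, 150, 300],
    "gamma": [0.1, 0.01, 0.001],
    "epsilon": [0.1, 0.5],
}
under_pcts = [0.1, 0.2, 0.3, 0.4, 0.5]  # five undersampling percentages
over_pcts = [2, 3, 4, 5]                # four oversampling percentages

n_baseline = len(list(product(*baseline_grid.values())))  # 5 * 3 * 2 = 30
# jointly tuning learner and resampling parameters multiplies the grids
n_total = n_baseline * len(under_pcts) * len(over_pcts)   # 30 * 5 * 4 = 600
```

With 24 data sets and several learners, a joint search at this scale quickly becomes prohibitive, which is why the main experiments keep the default balancing setting.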

Despite these issues, it is important to assess the performance of models when applying the proposed resampling strategies with optimized parameters. Therefore, we proceeded with a smaller experimental setting, where all components of each approach are optimized. This small subset includes data sets 4, 10 and 12 and the regression algorithm SVM. This decision is based on the analysis of the previous results, where SVM models on several occasions provided better evaluation results than the models where resampling strategies were applied. As such, we focus on this regression model and on three data sets where the baseline regression algorithm models provided better results than any resampled alternative. The parameter optimization efforts and results are described in “Annex 2”, and the results of repeating the experimental evaluation described in the previous section, considering only the SVM models and the three mentioned data sets, are presented in Table 6.

Table 6 Evaluation of SVM models and resampling strategies, with parameter optimization for three datasets, using the mean utility-based regression metric \(F1_{\phi }\)

Results show that, by optimizing the parameters of both the regression algorithms and the resampling strategies, the latter significantly improve over the baseline models of the former. Additionally, this further shows the potential positive impact of using the temporal or temporal and relevance bias.

The relations between data characteristics and the performance of methods for addressing imbalanced domains have been explored in other studies [30]. To assess if some time series characteristics are related with our results, we observed the \(F1_\phi \), \(rec_\phi \) and \(prec_\phi \) metrics on the data sets sorted according to the following criteria:

  • by ascending order of imbalance (i.e. increasing percentage of rare cases);

  • by increasing number of total values in the data series; and

  • by increasing total number of rare cases in the time series.

Fig. 8
figure 8

Evaluation of regression algorithms and resampling strategies, with the mean utility-based regression metric \(F1_{\phi }\) with data sets sorted by increasing number of rare cases

Figure 8 shows the results of \(F1_{\phi }\) on the data sets sorted by ascending number of rare cases. The remaining results are available at http://tinyurl.com/z4xlup5. We observe that the characteristic with the most impact on our results is the total number of rare cases. In fact, time series with a low percentage of rare cases but a large number of values are not as problematic as time series with fewer values and a higher percentage of rare cases. This is related to the small sample problem and is in accordance with other works (e.g. [20, 21]) where it is observed that when the data set is large enough the learners can more easily detect rare cases.

Notwithstanding the predictive evaluation results presented, the impact of our proposed resampling strategies in terms of computational requirements has not been addressed so far. Considering that changing the data set may have a computational impact on building the models and forecasting future values, this issue should be studied and discussed. As such, in Fig. 9 we present a comparative evaluation of the average computational time necessary to build models using each of the regression algorithms with the application of resampling strategies, for all data sets, in the same experimental setting defined for the evaluation described in Sect. 5. The results report the proportion of computational time required to train each model using resampling strategies in comparison with the non-resampled versions (i.e. baseline regression algorithms). The environment for these tests was an 8-core AMD Opteron 6300 processor with 2.5 GHz and 32 GBytes of main memory, running Ubuntu 14.04 with kernel 3.16.0-30-generic.

Fig. 9
figure 9

Evaluation of computational time required to build models where resampling strategies are applied in comparison with the computational time of baseline regression algorithms

Analysing the results of the computational time comparison, we are able to reach strong conclusions. First, the resampling strategies have different impacts on computational time: (i) undersampling considerably reduces the computational time required to train the models; (ii) oversampling requires a much longer computational time; and (iii) the SmoteR resampling strategy shows a computational time similar to that of the baseline regression algorithms. Results also show that these conclusions hold across all of the regression algorithms used in the evaluation. Secondly, results show that the use of temporal or temporal and relevance bias does not carry a significant advantage or disadvantage in computational time compared with the baseline versions of the resampling strategies.

7 Related work

Typically, the problem of imbalanced domains is tackled by pre-processing methods, special-purpose learning methods or post-processing methods [5]. In the specific context of forecasting tasks with imbalanced time series data, we did not find any previous work that proposes the use of resampling strategies. However, we found different approaches related to the scope of our endeavour, in the problems of rare event forecasting and anomaly detection, which we describe below. Most of this work is focused on specific problems for which special-purpose learners are developed. These proposals tend to be very effective in the context for which they were developed. However, their performance is severely affected when their use is extrapolated to other problems. This means that they cannot be used as general methods for imbalanced time series, as opposed to resampling strategies.

A genetic-based machine learning system, timeweaver, was proposed by Weiss and Hirsh [50] to address rare event prediction problems with categorical features, by identifying predictive temporal and sequential patterns. The genetic algorithm is responsible for updating a set of prediction patterns, where each individual should perform well at classifying a subset of the target events and where, collectively, the individuals should cover most of those events.

Vilalta and Ma [47] proposed an algorithm to address the prediction of rare events in imbalanced time series. The authors resolve the class imbalance by transforming the event prediction problem into a search for all frequent event sets (patterns) preceding target events, focusing solely on the minority class. These patterns are then combined into a rule-based model for prediction. Both the work of Weiss and Hirsh [50] and that of Vilalta and Ma [47] assume that events are characterized by categorical features and display uneven inter-arrival times. However, this is not assumed in classical time series analysis.

A new algorithm, ContrastMiner, was proposed by Wei et al. [49] for the detection of sophisticated online banking fraud. This algorithm distinguishes between fraudulent and legitimate behaviours through contrast patterns. Then, pattern selection and risk scoring are performed by combining the predictions of different models.

Temporal sequence associations are used by Chen et al. [11] for predicting rare events. The authors propose a heuristic for searching interesting patterns associated with rare events in large temporal event sequences, combining association and sequential pattern discovery with an epidemiology-based measure of risk to assess the relevance of the discovered patterns.

Another interesting direction was pursued by Cao et al. [7] with the development of new algorithms for discovering rare impact-targeted activities.

In anomaly detection [15] problems, applications in several domains have been proposed using diverse techniques. In the medical and public health domain, Lin et al. [28] use nearest neighbour-based techniques to detect these rare cases. The same techniques are used by Basu and Meckesheimer [2], and parametric statistical modelling is used by Keogh et al. [22], in the domain of mechanical unit fault detection. Finally, Scott [37] and Ihler et al. [18] propose Poisson-based analysis techniques for intrusion detection in telephone networks and for Web click data, respectively.

Our proposal of temporal and temporal and relevance bias for imbalanced time series forecasting tasks is somewhat related to the seminal work of Japkowicz [19] on classification tasks. The author proposes the concept of focused resampling, for both under and oversampling. The former reduces the number of cases further away from the boundary between the positive class (i.e. rare cases) and the negative class; the latter increases the number of cases closest to this boundary. Several other proposals of informed resampling have been presented since then (e.g. [25, 29]).

8 Conclusions

In this work, we study the application of resampling strategies with imbalanced time series data. Our overall objective is to enhance the predictive accuracy on rare and relevant cases as this is the goal in several application domains. This fact increases the interest in finding ways to significantly improve the predictive accuracy of prediction models in these tasks.

In this context, we have proposed the extension of existing resampling methods to time series forecasting tasks. Resampling methods can be used to change the distribution of the available learning sets with the goal of biasing learning algorithms to the cases that are more relevant to the users. Our proposals build upon prior work on resampling methods for numeric prediction tasks. Besides the extension of existing resampling strategies, we propose new resampling strategies with the goal of adapting them to the specific characteristics of time series data. Specifically, we have proposed sampling strategies that introduce a temporal bias that we claim to be useful when facing non-stationary time series that are frequently subjected to concept drift. We also propose a relevance bias that makes more relevant cases have a higher preference of being selected for the final training sets.

An extensive set of experiments was carried out to ascertain the advantages of applying resampling strategies to such problems. Results from the experimental evaluation show a significant improvement in the predictive accuracy of the models on rare and relevant cases of imbalanced time series data. This is confirmed by all tested evaluation metrics. Results show that: (1) the application of resampling strategies in combination with standard regression tools can significantly improve the ability to predict rare and relevant cases in comparison with not applying these strategies; (2) the use of a temporal and/or relevance bias can improve the results in relation to the non-biased resampling approaches; and (3) the combination of resampling approaches with standard regression tools provides a significant advantage in comparison with models (ARIMA and BDES) specifically developed for time series forecasting. Additionally, by studying the computational time associated with learning prediction models with and without resampling strategies, we observe that undersampling allows a significant reduction of the required computation time, that oversampling greatly increases it and that SmoteR presents a computational time similar to that of the baseline regression tools.

Concerning future work, we plan to further evaluate these proposals with respect to additional parameter values, such as the relevance threshold or the number k of nearest neighbours in SmoteR, and to study ways of automatically adapting these parameters to the distribution. We also plan to generalize the concept of bias in resampling strategies, studying its use not only in time series problems but also in classification and regression tasks over various types of dependency-oriented data, such as discrete sequences, spatial and spatiotemporal data.

For the sake of reproducible science, all code and data necessary to replicate the results shown in this paper are available in the Web page http://tinyurl.com/zr9s6tz. All code is written in the free and open source R software environment.