1 Introduction

The field of machine learning has developed rapidly and has proved useful in modelling complex real-life applications. In many application domains, such as social networks, financial industries, and engineering monitoring systems, data are generated in continuous flows in the form of data streams. This data format requires models to work in an online mode, i.e. analysing the data in real time and evolving accordingly. Examples of data streams include network event logs, telephone call records, credit card transactional flows, sensing and surveillance video streams, financial applications, monitoring patient health, and many others (Wang et al. 2003; Fan 2004).

Traditional supervised learning approaches usually assume that the data probability distribution does not change between the training data and the application data. This assumption implies that the data used to train the predictive models reflect the probability distribution of the problem. However, this assumption is often violated in real-world applications (Gállego et al. 2017; Ren et al. 2018). For many reasons, the data distribution in real-world applications is often not stable and tends to change with time (Tsymbal 2004; Zliobaite et al. 2016). This is due to the evolving nature of the processes, which causes a phenomenon frequently referred to in the literature as concept drift. The presence of concept drift is likely to cause a decrease in the accuracy of the models as time passes, since the training data used to build the models may carry out-of-date concepts. Besides the evolving nature of data, other properties that make the prediction task in data streams challenging include infinite length, high dimensionality, ordered arrival, non-repetitiveness, high speed, and time variation (Masud et al. 2008; Farid et al. 2013). These characteristics demand algorithms that can process the data under time constraints with an adequate level of accuracy and can adapt rapidly to changes in the data distribution.

A significant research effort has been made in recent years towards data stream mining, although most of the attention has been directed to supervised classification problems. Krawczyk et al. (2017) recognised an evident lack of research dedicated to online learning algorithms for regression, as also stated in Ikonomovska et al. (2015). A promising research direction in the context of modelling data streams is ensemble methods (Krawczyk et al. 2017). Single models usually require complex operations to modify the internal structure of the model and may perform poorly in the presence of concept drift (Masud et al. 2011). Ensemble approaches have proven effective in overcoming common limitations of single models, such as accuracy and stability (Yin et al. 2015). Additionally, they are able to maintain information about different concepts, and new models can easily be trained to cope with new concepts that may appear. Hence, they can efficiently deal with evolving data streams and achieve superior accuracy compared to single models.

A crucial aspect of ensemble design is the choice of base models. Artificial neural networks (ANNs) are successful models widely used in the field of machine learning for tasks such as classification and regression. Despite their success, problems such as slow convergence and local minima have led to the emergence of research towards neural networks with random weights (NNRW) (Cao et al. 2018). NNRWs were introduced by Pao and Takefuji (1992), who proposed the random vector functional link (RVFL). The main idea of such models is to randomly initialise the weights between the input and hidden layers, which are kept fixed during the optimisation process, and to optimise only the weights between the hidden and output layers. This process results in a much lower training complexity compared to traditional ANN training algorithms.

In this article, the authors propose a new bagging ensemble method based on NNRW. The proposed approach, bagging NNRW (B-NNRW), is developed to deal with online regression problems and takes advantage of the efficiency of NNRWs and of bootstrapping mechanisms to build the ensemble. The approach enables the use of different updating strategies to accommodate possible concept drifts. Three main updating mechanisms are evaluated: weighting (B-NNRW), pruning (BP-NNRW), and replacement (BR-NNRW). The updating process is performed at fixed intervals in a batch mode, i.e. data samples are stored before the updating is applied. This approach avoids assumptions about the presence and type of concept drift and also improves the accuracy in the case of insufficient training data. The proposed approach relies on a primary buffer of training data to build the initial ensemble. Although some online ensemble approaches do not rely on data buffering, these methods require that a considerable number of training samples be presented to the model before it reaches an acceptable level of accuracy (Oza and Russell 2001; Ikonomovska et al. 2015). The authors have evaluated the proposed approach by comparing its performance with a recent ensemble algorithm, O-DNNE (online-decorrelated neural network ensemble), developed by Ding et al. (2017), and have demonstrated a clear enhancement over the existing approach.

As a further contribution reported in this article, the use of factorial experiments is examined and proven effective for adjusting the algorithm's hyperparameters. Current hyperparameter optimisation approaches do not consider the importance of each hyperparameter or the interactions between them. The use of factorial experiments offers a systematic way to identify the hyperparameters that have a greater effect on the algorithm's variability. Furthermore, it is possible to identify significant interactions between hyperparameters, which helps to understand how the adjustment of one hyperparameter affects another and to achieve better hyperparameter tuning.

The remainder of this article is structured as follows: in Sect. 2, the authors report the state of the art in the domain of data stream prediction, and Sect. 3 describes the methodology for hyperparameter tuning and the new ensemble approach to data stream regression. The experiments and results are outlined and discussed in Sect. 4, and Sect. 5 concludes the article and suggests further research directions.

2 Literature review

The evolving nature of data presents an important challenge in the field of machine learning and arises from several factors, such as changes in consumer preferences, economic dynamics, or environmental conditions. Besides concept drift (data mining and predictive analytics), this phenomenon is also referred to in the literature as covariate shift or dataset shift (pattern recognition) and non-stationarity (signal processing) (Zliobaite et al. 2016).

Tsai et al. (2009) defined three main categories of algorithms for concept drift: window-based approaches, weight-based approaches, and ensemble classifiers. Elwell and Polikar (2011) further classified algorithms for concept drift as:

  1. Online versus batch algorithms: online algorithms learn one instance at a time, while batch algorithms learn chunks of instances;

  2. Single model versus ensemble-based approaches; and

  3. Active versus passive approaches: active approaches rely on drift detector mechanisms, while passive approaches assume constant drift and update the model continuously.

Ensemble approaches have been successfully applied in both classification and regression problems. The classical ensemble approaches include boosting (Schapire 1990), stacking (Wolpert 1992), bagging (Breiman 1996), and random forests (Breiman 2001), and many variants can be found in the literature for solving a wide variety of tasks. Ensemble learning represents an important research direction in solving concept-drifting data streams (Yin et al. 2015). Some advantages of ensemble approaches, compared to single models, include their suitability for dynamic updates and integration with drift detection mechanisms (Gomes et al. 2017). Moreover, they are easy to scale and parallelise, underperforming parts can be pruned to adapt to changes, and they usually generate more accurate concept descriptions than single models (Bifet et al. 2009). Ensembles can be divided into two categories: fixed ensembles, where base predictors are trained in advance and are updated online; and growing ensembles, where component learners are added and/or removed and voting weights are updated according to the incoming data.

2.1 Ensemble approaches for concept-drifting data streams

Wang et al. (2003) introduced a weighted ensemble classifier to address data stream mining and concept drift. They emphasise the advantage of their approach compared to single classifiers in terms of accuracy, efficiency, and ease of use. The classifiers are trained sequentially from chunks of data. The criterion to discard old models is not based on their time of arrival but on the class distributions that better represent the current concept. In the approach developed by Fan (2004), new models are built based on the last chunk of data and on a combination of new data and old data. The old data are composed of a selection of examples from past chunks. Fan (2004) also highlights the problem of data insufficiency, where the use of additional data from previous chunks improves the model accuracy when concept drift is not present.

An approach developed by Gao et al. (2008) trains a new classifier on each new chunk of data. Besides keeping the model up to date with the latest concept, a sampling mechanism allows the model to deal with unbalanced datasets. Another method that trains a new model for every new chunk of data to cope with data evolution is presented by Masud et al. (2011). The classification is performed using k-NN as base models, and the method is designed to be effective in problems with a limited amount of labelled data. Furthermore, this approach also incorporates a novel class detection mechanism based on clustering. In both algorithms, the new model is incorporated into the ensemble based on its accuracy in modelling the current concept.

Two variants of bagging were introduced by Bifet et al. (2009), ADWIN bagging and adaptive-size Hoeffding tree (ASHT) bagging. While both algorithms deal with classification tasks, the former adapts to concept drift using a drift detector, and the latter takes advantage of the incremental property of Hoeffding trees to restart the trees according to their size and keep the ensemble updated. Elwell and Polikar (2011) developed an incremental learning algorithm to solve classification problems in non-stationary environments. The algorithm trains a new classifier for each new chunk of data and uses a dynamically weighted majority voting scheme in order to cope with concept drift. An adaptive ensemble that is not only able to deal with concept drift but is also capable of detecting new classes is presented by Farid et al. (2013). The authors train three decision trees in a boosting manner, i.e. creating subsets of the training data based on instance weighting. A new decision tree is trained for each new data chunk, and this new tree can replace one of the existing trees based on an accuracy criterion. The novel class detection is performed by a clustering mechanism in the tree leaves.

An ensemble of ensembles is proposed by Yin et al. (2015). They argue that, while in traditional batch growing ensemble methods all previous ensembles are discarded, their approach takes advantage of them for the final decision. Since the previous ensembles are composed of the same classifiers minus the most recently trained ones, the combination of ensembles is performed through the weights of the previous ensembles. Ren et al. (2018) aggregated the operators of online ensembles and chunk-based ensembles to develop an ensemble classifier that is able to manage different types of drift and a limited amount of labelled data. Iwashita et al. (2019) tackled classification in drifting data streams using ensembles of optimum-path forests with different approaches for training and updating the OPFs, i.e. full memory, no memory, and a window of fixed size. The base models are combined using three voting mechanisms: Combined, Weighted, and Major.

In the context of data stream regression learning, only a few research papers have been published in the literature (Ding et al. 2017). Despite the success of batch growing ensembles in data stream classification, regression ensemble algorithms generally use iterative strategies. The Additive Expert Ensemble (AddExp) was developed to deal with online classification tasks with concept drift (Kolter and Maloof 2005). However, the authors argue that this approach can be further extended to deal with regression problems as well. AddExp relies on incremental algorithms, i.e. algorithms that adapt to every new instance. In the case of regression tasks, an online version of least squares regression is adopted as the base learner. In order to control the size of the ensemble, two pruning strategies were evaluated, oldest first (the oldest model is excluded) and weakest first (the weakest model is excluded). The latter proved to be the better pruning choice. This approach works under the assumption that there is no change in the output distribution, since it is designed to make predictions in the interval [0, 1], and this assumption would easily be violated in practical applications. AddExp also relies on a threshold parameter that determines when new experts should be added to the ensemble, which may be especially difficult to adjust in noisy datasets.

Kadlec and Gabrys (2011) developed an algorithmic soft sensor, i.e. a simulation of the sensor's output, based on the iterative recursive partial least squares (RPLS) model, called ILLSA (incremental local learning soft sensing algorithm). The ensemble is built using partitions of historical data. In order to cope with concept drift, the ensemble is updated at two levels. At the local level, the RPLS models are updated using the new data, and at the global level, the models' weights are updated according to their performance. Another incremental online ensemble algorithm for regression based on partial least squares, the OWE (online-weighted ensemble) algorithm, was proposed by Soares and Araújo (2015a). It updates the ensemble weights at the arrival of each new data sample based on the error on a sliding window of data. The training of new models considers the error of the ensemble in each sample of the current data window using a boosting strategy. It also retains information about old data windows in the hope that this information could be useful in the case of recurrent concept drift.

Soares and Araújo (2015b) also developed another sliding window-based ensemble, the dynamic and online ensemble regression (DOER). DOER uses OS-ELM (Liang et al. 2006), which is a type of NNRW, as base models. The updating approach is based on an overlapping sliding window, and at each new data sample, all the base models are re-trained and the weights of each model are updated. The approach also considers a mechanism that replaces underperforming models when the accuracy of the ensemble decreases.

Two algorithms based on online Hoeffding-based regression trees (Ikonomovska et al. 2010), namely OBag (online bagging of Hoeffding-based trees for regression) and ORF (online random forest for any-time regression), are presented by Ikonomovska et al. (2015). The models are constructed using an online bagging meta-algorithm and learn in an incremental fashion. The adaptation to concept drift is performed by replacing low-accuracy models when a significant increase in error is detected.

The main problem with iterative approaches is the fact that, in general, all new samples are presented to the base models, which could result in a higher correlation between the base models and consequently lower diversity of the ensemble. The diversity among the models is responsible for uncorrelated predictions that lead to improved accuracy. Several authors have highlighted the importance of ensemble diversity (Tumer and Ghosh 1996; Liu and Yao 1999; Brown et al. 2005; Rokach 2010; Alhamdoosh and Wang 2014; Ding et al. 2017).

More recently, regression on sequential data streams was addressed by Ding et al. (2017), who proposed O-DNNE. Their algorithm is an online version of DNNE (Alhamdoosh and Wang 2014), which is based on a decorrelation strategy (Bruce 1996) and negative correlation learning (Liu and Yao 1999). DNNE is an ensemble of NNRWs that trains all base models simultaneously and considers the correlation among them in the optimisation process. This method means that fewer models are required to build the ensemble, since redundant models are avoided; however, the training and updating process may become computationally cumbersome, especially when a large number of models and/or a large number of hidden nodes are required, as shown in Sect. 3.3. Additionally, base models with convergence problems due to the choice of the random weights are kept in the ensemble, since no pruning mechanism is provided.

A summary of the ensemble approaches for data stream classification and regression in the presence of concept drift is presented in Table 1, in chronological order.

Table 1 Ensemble approaches developed to deal with data streams in the presence of concept drift

2.2 Base models

The challenges posed by data streams require base models that are not only accurate but also computationally efficient. ANNs have been successfully applied to classification and regression tasks in many fields. However, some issues may make ANN models difficult to implement. These include high computational cost, a high number of hyperparameters, and convergence issues. The ANN training process is usually based on the optimisation of a nonlinear least squares problem, where the derivatives of the loss function are back-propagated to each layer to control the weights' adjustment. This may cause slow convergence and/or convergence to local minima (Zhang and Suganthan 2016). Cao et al. (2018) further point out model selection uncertainty as an additional drawback of ANNs.

Back in the early 1990s, Schmidt et al. (1992) evaluated the use of random weights in the hidden layer of a single-hidden-layer ANN to analyse the behaviour of ANNs in terms of learning. In the same year, Pao and Takefuji (1992) proposed a similar approach called the random vector functional link (RVFL). Both approaches share the same principle, i.e. to randomly generate the weights between the input and hidden layers, which are kept fixed during the training process. Only the weights between the hidden layer and the output layer are optimised, which transforms the optimisation into a linear least squares minimisation that can be solved in a single step using pseudo-inverse algorithms or ridge regression. Consequently, the attention of the research community towards NNRWs has been increasing due to their efficiency compared to traditional ANNs.

The NNRW capabilities make it a good choice for dealing with high-dimensional datasets and online applications where computational efficiency is an essential requirement. The usually lower accuracy compared to traditional ANNs can be compensated for by the use of ensembles. Several approaches can be used to introduce diversity in the ensemble and increase its accuracy, such as varying the initial random weights, varying the topology of the ANNs, varying the training algorithm, and varying the training datasets (Masoudnia and Ebrahimpour 2014).

Following a comprehensive review of the existing literature and methods, a number of problems with the current approaches are identified. These include:

  (a) Need for the development of faster algorithms that can be effectively updated under restricted time constraints;

  (b) Lack of a systematic approach for hyperparameter adjustment;

  (c) Need for accuracy improvement, which is always a desirable property, especially in response to concept drift.

The drawbacks of current approaches in dealing with online data stream regression with concept drift motivate the development of an ensemble algorithm using NNRW as base models for data stream regression to improve computational efficiency and accuracy.

3 Methodology

In this section, the methodology adopted to develop the proposed approach is described. The novelty of this approach is to combine bootstrap sampling with random feature selection to train a highly diversified pool of NNRWs. In the proposed approach, the least accurate base models are replaced or temporarily deactivated at updating points, while the most accurate models have their decision power increased through a weighting mechanism. Firstly, the use of design of experiments as an alternative for hyperparameter tuning is outlined. Then, a description of the NNRW algorithm and the B-NNRW approach is presented. Finally, some limitations of the existing method, i.e. O-DNNE, are discussed. The performance of the proposed B-NNRW algorithm will be evaluated in Sect. 4 using eight datasets (four synthetic and four benchmark datasets).

3.1 Hyperparameter optimisation

Usually, machine learning algorithms have several hyperparameters, and their adjustment is an important aspect to be taken into consideration. Besides manual hyperparameter adjustment, another popular approach is grid search, which performs an exhaustive search through all combinations of predefined levels of hyperparameters to find the best combination. Other methods for hyperparameter optimisation include random search, Bayesian optimisation, and evolutionary algorithms (Hutter et al. 2015; Francescomarino et al. 2018). In all of these approaches, the importance of each hyperparameter and the information regarding the interactions among hyperparameters are neglected.

In this research, the use of design of experiments (DOE) (Montgomery 2012) for hyperparameter adjustment, specifically the full factorial design, is proposed by the authors. Like grid search, the factorial design relies on the computation of all combinations of predefined levels of hyperparameters. Unlike grid search, however, it offers a systematic way of analysing not only the sensitivity of each hyperparameter but also the interactions among them.

Sensitivity refers to the amount of change in the algorithm's response due to a change in the hyperparameter level. This can help to identify the most critical hyperparameters and direct the effort to their optimisation, especially for algorithms with a high number of hyperparameters. When a large number of hyperparameters is involved, the tuning can be done in two steps. Firstly, experiments with fewer levels are carried out to identify the importance of each hyperparameter. Secondly, a new experiment is executed for fine-tuning, keeping hyperparameters with low importance at fixed levels and therefore reducing the search space.

The interaction among hyperparameters may also play a critical role in hyperparameter optimisation. Using DOE, it is possible to identify significant hyperparameter interactions, i.e. different response patterns of one hyperparameter for different levels of a second hyperparameter. As a hypothetical example, one could observe that for the level 1 of hyperparameter A, the accuracy of the algorithm increases when the level of hyperparameter B changes from 1 to 2, while the accuracy decreases, for the same change in B, when A is at level 2.

This article will adopt the full factorial DOE to adjust the hyperparameters of both B-NNRW and O-DNNE. The authors highlight that this approach is only used to tune the hyperparameters of the initial model fitting, which are kept fixed throughout the evaluation of the simulated stream.
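As a minimal illustration of this tuning strategy (not the authors' implementation), the sketch below builds a small full factorial design over two hypothetical factors, the number of models M and the random weights range W, runs replicated trials of a stand-in scoring routine, and ranks the main effects and the M:W interaction with a two-way ANOVA. The factor levels, the train_and_score stub, and the use of statsmodels are assumptions for illustration only.

```python
# Sketch of full factorial hyperparameter screening with a two-way ANOVA.
# train_and_score and the factor levels are illustrative stand-ins, not the
# article's experimental set-up (cf. Table 3 for the actual levels).
import itertools
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

def train_and_score(M, W):
    # Placeholder for fitting the ensemble with M models and weight range W
    # and returning its test MSE.
    return 1.0 / M + abs(W - 0.5) + rng.normal(scale=0.05)

levels_M = [10, 25, 50]        # hypothetical levels of the number of models
levels_W = [0.1, 0.5, 1.0]     # hypothetical levels of the weight range
replicates = 10                # runs per treatment, as in Sect. 4.2

rows = [{"M": M, "W": W, "mse": train_and_score(M, W)}
        for M, W in itertools.product(levels_M, levels_W)
        for _ in range(replicates)]
df = pd.DataFrame(rows)

# Two-way ANOVA with interaction: the F statistics rank the main effects and
# the M:W interaction by their contribution to the response variability.
fit = ols("mse ~ C(M) * C(W)", data=df).fit()
print(anova_lm(fit, typ=2))
```

Sorting the resulting ANOVA table by the F statistic gives the kind of ranking of variability sources reported later in Tables 4 and 5.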

3.2 Neural networks with random weights

The use of NN with randomised weights appeared simultaneously in the works of Schmidt et al. (1992) and Pao and Takefuji (1992). While the former authors were interested in evaluating the effect of the parameters in the hidden layer, the latter proposed the RVFL network. Both approaches shared a similar architecture, which is a fully connected feed-forward neural network, as shown in Fig. 1, except for the fact that Schmidt’s approach does not consider the use of direct links between the input layer and the output layer.

Fig. 1 Single-hidden-layer feed-forward neural network

In both cases, the weights between the input layer and the hidden layer are chosen randomly and are kept fixed during the training process. Only the weights between the hidden and output layers are optimised. The main advantage of this procedure is that it converts the non-convex optimisation of parameters into a convex optimisation problem, where the global minimum can be quickly approximated using a pseudo-inverse technique, such as the Moore–Penrose pseudo-inverse, or ridge regression.

In this work, the authors considered an NNRW with no direct link using ridge regression as the learning algorithm. The learning function can be mathematically expressed by Eq. (1):

$$ \varvec{T} = g\left( {\varvec{X} \cdot \varvec{W}_{H} + \varvec{B}} \right) \cdot \varvec{W}_{O} $$
(1)

where T is the target vector, X is the set of input training data, \( \varvec{W}_{H} \) is the set of weights from the input layer to the hidden layer, B is the bias vector, and \( \varvec{W}_{O} \) is the vector of weights from the hidden layer to the output layer. The function g(•) is the activation function, in this article the sigmoid function, given by Eq. (2):

$$ g\left( x \right) = \frac{1}{{1 + e^{ - x} }} $$
(2)

Since the NNRW learning process does not rely on derivatives, as is the case of back-propagation learning algorithms, almost any nonzero activation function can be successfully applied (Huang et al. 2004).

Due to the fact that WH and B are randomly chosen and kept fixed during the training process, the training function becomes a linear system and can be summarised as Eq. (3):

$$ \varvec{T} = \varvec{H} \cdot \varvec{W}_{O} $$
(3)

where \( \varvec{H} \) is the output from the hidden layer and is computed as Eq. (4).

$$ \varvec{H} = g\left( {\varvec{X} \cdot \varvec{W}_{H} + \varvec{B}} \right) $$
(4)

The optimised set of weights Wo is the one that satisfies Eq. (5):

$$ { \hbox{min} } \left\| {\varvec{H} \cdot \varvec{W}_{O} - \varvec{T}} \right\| $$
(5)

The optimisation algorithm applied by Schmidt et al. (1992), referred to as Fisher solution, can be written as Eq. (6):

$$ \varvec{W}_{O}^{*} = \left( {\varvec{H}^{T} \cdot \varvec{H}} \right)^{ - 1} \cdot \varvec{H}^{T} \cdot \varvec{T} $$
(6)

which is equivalent to the least squares (LS) estimator. The computation of \( \left( {\varvec{H}^{T} \cdot \varvec{H}} \right)^{ - 1} \) may lead to instability if the matrix resulting from \( \varvec{H}^{T} \cdot \varvec{H} \) is singular or nearly singular. This issue can be addressed using ridge regression, introduced by Hoerl and Kennard (1970), which consists of adding a small positive quantity to the diagonal of \( \varvec{H}^{T} \cdot \varvec{H} \), as given by Eq. (7).

$$ \varvec{W}_{O}^{*} = \left( {\varvec{H}^{T} \cdot \varvec{H} + \lambda \cdot \varvec{I}} \right)^{ - 1} \cdot \varvec{H}^{T} \cdot \varvec{T} $$
(7)

where \( \varvec{I} \) is the identity matrix and \( \lambda \) is a small constant, also known as the penalty term, since it penalises large weights in the optimisation process.
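As a concrete, hedged illustration of Eqs. (1)–(7), the sketch below trains a single-hidden-layer NNRW with no direct links and a ridge-regression output layer. The function names, the uniform weight range, and the toy data are illustrative assumptions, not the article's implementation.

```python
# Minimal sketch of an NNRW (no direct links) trained with ridge regression.
import numpy as np

def sigmoid(z):
    # Activation function g(.) of Eq. (2)
    return 1.0 / (1.0 + np.exp(-z))

def train_nnrw(X, T, n_hidden, lam=1e-3, weight_range=1.0, rng=None):
    rng = np.random.default_rng(rng)
    # Random input-to-hidden weights and biases, fixed after initialisation
    W_H = rng.uniform(-weight_range, weight_range, size=(X.shape[1], n_hidden))
    B = rng.uniform(-weight_range, weight_range, size=n_hidden)
    H = sigmoid(X @ W_H + B)                              # Eq. (4)
    # Ridge (regularised least squares) output weights, Eq. (7)
    W_O = np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)
    return W_H, B, W_O

def predict_nnrw(X, W_H, B, W_O):
    return sigmoid(X @ W_H + B) @ W_O                     # Eqs. (3)-(4)

# Toy usage: fit a noisy sine and report the training MSE
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
T = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
W_H, B, W_O = train_nnrw(X, T, n_hidden=30, lam=1e-2, rng=0)
print(np.mean((predict_nnrw(X, W_H, B, W_O) - T) ** 2))
```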

3.3 Proposed bagging of NNRW approach

In this section, a new bagging ensemble of NNRWs (B-NNRW) to tackle online regression problems is presented. The main advantage compared with O-DNNE is the potential ability to cope with problems with high rates of data arrival and also with high-dimensional data that usually require more complex models. Two interrelated factors contribute to this advantage, and they are mainly related to the matrix inversion needed for the model optimisation process. Primarily, the computational complexity of B-NNRW is \( O(M \cdot N^{3}) \), where M is the number of base models and N the number of hidden nodes, while the computational complexity of O-DNNE is \( O((M \cdot N)^{3}) \). In addition, since each model is optimised separately in B-NNRW, it can easily be parallelised.

Both algorithms update without making any assumptions about the type or rate of drift. The diversity of B-NNRW is mainly introduced through bootstrapping: new training sets are generated by sampling the data, as in the original bagging algorithm, and the features of the training set are also sampled. The number of features used to build each model is given as a percentage of the total number of features, and several values are evaluated in this article. Three updating approaches are adopted and evaluated: weighting, pruning, and replacement.

  (a) Weighting Each model in the ensemble is assigned a weight according to its accuracy in the most recent data chunk. Given the most recent chunk of data C, and an ensemble of M elements (m = 1, 2, …, M), the weight of each model is computed according to Eq. (8):

$$ w_{m} = \frac{1}{{mse_{m} }} $$
(8)

where msem is the mean square error of the mth model, computed on C. Therefore, the output yE of the ensemble for a data sample xn is calculated according to Eq. (9):

$$ y_{E} \left( {\varvec{x}_{n} } \right) = \frac{{\mathop \sum \nolimits_{m = 1}^{M} w_{m} *\hat{y}_{m} \left( {\varvec{x}_{n} } \right)}}{{\mathop \sum \nolimits_{m = 1}^{M} w_{m} }} $$
(9)

where \( \hat{y}_{m} \left( {\varvec{x}_{n} } \right) \) is the individual prediction of the mth model on the data sample xn.

  (b) Pruning Pruning consists of temporarily deactivating the less accurate models of the ensemble. For each data chunk C, only the Q models with the lowest error (Q < M) are eligible to participate in the final ensemble decision. Given a pruning rate \( p \in \left( {0,1} \right] \), the number Q of models that participate in the prediction is given by Eq. (10):

$$ Q = p*M $$
(10)

The idea behind keeping the deactivated models is that they may carry useful information learnt from past examples and can be helpful in future predictions, such as in the case of recurrent data concepts. The algorithm keeps track of the deactivated models' accuracy, and once one of them is among the Q models with the lowest error, it is reactivated and used in the prediction of the subsequent chunk of data. The B-NNRW with the pruning updating mechanism is referred to as BP-NNRW.

  (c) Replacement Replacement consists of training new models, one at a time, using labelled data from the last chunk of data. The data chunk is randomly divided into training (85%) and validation (15%) sets. If the accuracy of the newly trained model is better than that of the worst existing model, then the worst model is replaced by the new model. This process is repeated until the desired number of new models has been trained. Given a replacement rate \( r \in \left[ {0,1} \right] \), the number of new models Mnew is computed as Eq. (11):

$$ M_{\text{new}} = {\text{round}}\left( {r*M} \right) $$
(11)

The replacement not only keeps the ensemble up to date with the most recent data concepts but also works as a natural selection mechanism, since low-performing models are constantly excluded. The B-NNRW with the replacement updating mechanism is referred to as BR-NNRW.

Figure 2 illustrates the B-NNRW procedure. It is assumed that an initial amount of data is available to build the initial model. The updating approaches above are applied at each predetermined updating point, which is defined by a number of data samples; a minimal code sketch of these updating strategies is given after Fig. 2.

Fig. 2 B-NNRW ensemble procedure
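The following is a hedged sketch of the three updating strategies. It reuses train_nnrw and predict_nnrw from the sketch in Sect. 3.2, represents each base model as a (W_H, B, W_O, feature indices) tuple, and applies weighting (Eqs. 8–9), pruning (Eq. 10), and replacement (Eq. 11) on the latest data chunk; the default rates, the hidden-layer size, and all names are illustrative assumptions rather than the article's implementation.

```python
# Sketch of the B-NNRW updating step on the most recent chunk (X_chunk, y_chunk).
import numpy as np

def chunk_mse(model, X, y):
    W_H, B, W_O, feat = model
    return np.mean((predict_nnrw(X[:, feat], W_H, B, W_O) - y) ** 2)

def ensemble_predict(models, weights, active, X):
    # Weighted average of the active models' predictions (Eq. 9)
    preds = np.array([predict_nnrw(X[:, models[i][3]], *models[i][:3]) for i in active])
    w = weights[active]
    return w @ preds / w.sum()

def update_ensemble(models, X_chunk, y_chunk, p=0.7, r=0.2, n_hidden=30, rng=None):
    rng = np.random.default_rng(rng)
    errors = np.array([chunk_mse(m, X_chunk, y_chunk) for m in models])

    # (c) Replacement (BR-NNRW): train round(r*M) candidates on an 85/15 split of
    # the chunk and swap each in if it beats the current worst model (Eq. 11)
    n_tr = int(0.85 * len(X_chunk))
    for _ in range(int(round(r * len(models)))):
        perm = rng.permutation(len(X_chunk))
        tr, va = perm[:n_tr], perm[n_tr:]
        feat = rng.choice(X_chunk.shape[1], size=len(models[0][3]), replace=False)
        cand = (*train_nnrw(X_chunk[tr][:, feat], y_chunk[tr], n_hidden, rng=rng), feat)
        worst = int(np.argmax(errors))
        if chunk_mse(cand, X_chunk[va], y_chunk[va]) < chunk_mse(models[worst], X_chunk[va], y_chunk[va]):
            models[worst] = cand
            errors[worst] = chunk_mse(cand, X_chunk, y_chunk)

    # (a) Weighting: inverse MSE on the latest chunk (Eq. 8)
    weights = 1.0 / errors
    # (b) Pruning (BP-NNRW): only the Q = p*M lowest-error models stay active (Eq. 10)
    active = np.argsort(errors)[: int(round(p * len(models)))]
    return models, weights, active
```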

3.4 Online DNNE

The O-DNNE (Ding et al. 2017) is an approach derived from the decorrelated neural network ensemble (DNNE) to deal with online regression problems. DNNE builds an ensemble of single-hidden-layer NNRWs, as described in Sect. 3.2, and incorporates the concept of negative correlation learning (NCL) in the training process to create a well-diversified set of models. The main idea behind NCL is to train the models simultaneously in a way that their individual errors are decorrelated (Rosen 1996), since no gains can be obtained from a combination of outputs if they are positively correlated. Given a dataset of size N consisting of pairs (xn, yn), and denoting by fi(xn) the output of the ith model of the ensemble of size M for sample xn, the error function for the ith model can be written as Eq. (12).

$$ E_{i} = \mathop \sum \limits_{n = 1}^{N} \frac{1}{2}\left( {f_{i} \left( {\varvec{x}_{n} } \right) - y_{n} } \right)^{2} $$
(12)

Rosen (1996) proposed a modification in the error function (Eq. 12) to include a decorrelation penalty term pi, resulting in the error model given by Eq. (13):

$$ e_{i} = \mathop \sum \limits_{n = 1}^{N} \left[ {\frac{1}{2}\left( {f_{i} \left( {\varvec{x}_{n} } \right) - y_{n} } \right)^{2} - \lambda p_{i} \left( {\varvec{x}_{n} } \right)} \right] $$
(13)

where \( \lambda \in \left[ {0, 1} \right] \) is a regularisation factor. Alhamdoosh and Wang (2014) adopted the penalty term formulated in Eq. (14):

$$ p_{i} \left( {\varvec{x}_{n} } \right) = \left( {f_{i} \left( {\varvec{x}_{n} } \right) - \bar{f}(\varvec{x}_{n} )} \right)\mathop \sum \limits_{j \ne i} \left( {f_{j} \left( {\varvec{x}_{n} } \right) - \bar{f}(\varvec{x}_{n} )} \right) $$
(14)

where \( \bar{f}(\varvec{x}_{n} ) \) is the average output, which is used instead of the target value yn to mutually reduce the correlation among ensemble individuals. The final DNNE consists of a set of weights \( \varvec{B}_{\text{ens}} = \left[ {\beta_{11} , \ldots ,\beta_{1L} , \ldots ,\beta_{M1} , \ldots ,\beta_{ML} } \right]_{ML \times 1}^{T} \), where \( \beta_{ij} \) is the output weight of the jth hidden node of the ith model and can be obtained by solving the linear system given by Eq. (15):

$$ \hat{\varvec{B}}_{\text{ens}} = \varvec{H}_{\text{corr}}^{ + } \varvec{T}_{h} $$
(15)

\( \varvec{H}_{\text{corr}}^{ + } \) is the generalised pseudo-inverse of the matrix \( \varvec{H}_{\text{corr}} \). The hidden–target matrix \( \varvec{T}_{h} = \left[ {\varphi \left( {1,1} \right), \ldots ,\varphi \left( {1,L} \right), \ldots ,\varphi \left( {M,1} \right), \ldots ,\varphi \left( {M,L} \right)} \right]_{ML \times 1}^{T} \), where \( \varphi \left( {i,j} \right) \) models the correlation between the jth hidden neuron of the ith base model and the target function \( G\left( \cdot \right) \) (Eq. 4) and is computed as Eq. (16):

$$ \varphi \left( {i,j} \right) = \mathop \sum \limits_{n = 1}^{N} g_{ij} \left( {\varvec{x}_{n} } \right)y_{n} $$
(16)

Finally, Hcorr is an ML × ML matrix, where each element is given by the following condition:

$$ \varvec{H}_{\text{corr}} \left( {p,q} \right) = \left\{ {\begin{array}{*{20}c} {C_{1} \varphi \left( {m,n,k,l} \right)\quad {\text{if}}\; m = k;} \\ {C_{2} \varphi \left( {m,n,k,l} \right)\quad {\text{otherwise}}.} \\ \end{array} } \right. $$

where p, q = 1,…, M × L; \( m = \left\lceil {p/L} \right\rceil \); \( n = ((p - 1)\bmod \;L) + 1 \); \( k = \left\lceil {q/L} \right\rceil \); \( l = ((q - 1)\bmod \;L) + 1 \). The correlation \( \varphi \left( {i,j,k,l} \right) \) between the jth hidden neuron of the ith base model and the lth hidden neuron of the kth base model, and the constants \( C_{1} \) and \( C_{2} \), are formulated as Eqs. (17), (18), and (19), respectively.

$$ \varphi \left( {i,j,k,l} \right) = \mathop \sum \limits_{n = 1}^{N} g_{ij} \left( {\varvec{x}_{n} } \right)g_{kl} \left( {\varvec{x}_{n} } \right) $$
(17)
$$ C_{1} = 1 - 2\lambda \frac{{\left( {M - 1} \right)^{2} }}{{M^{2} }} $$
(18)
$$ C_{2} = 2\lambda \frac{M - 1}{{M^{2} }} $$
(19)

In the online version of DNNE (Ding et al. 2017), both Hcorr and Th are updated with new data simply by computing Hcorr and Th on the new data (\( \varvec{H}_{\text{corr}}^{\text{new}} \) and \( \varvec{T}_{h}^{\text{new}} \), respectively) and adding them to the existing Hcorr and Th (\( \varvec{H}_{\text{corr}}^{\text{old}} \) and \( \varvec{T}_{h}^{\text{old}} \), respectively), as shown in Eqs. (20) and (21):

$$ \varvec{H}_{\text{corr}} = \varvec{H}_{\text{corr}}^{\text{old}} + \varvec{H}_{\text{corr}}^{\text{new}} $$
(20)
$$ \varvec{T}_{h} = \varvec{T}_{h}^{\text{old}} + \varvec{T}_{h}^{\text{new}} $$
(21)

For further details, the reader could refer to Ding et al. (2017). Once Hcorr and Th are updated, the Bens is recomputed according to Eq. (15).

The O-DNNE can effectively process a single new data sample due to the fact that the processing cost of computing Eq. (15) is not affected by the number of samples to be evaluated. However, the computation of \( \varvec{H}_{\text{corr}}^{ + } \) is very sensitive to the number of NNRW hidden nodes as well as to the number of models. Considering an ensemble of m models with n hidden nodes each, adding one node to each base model grows the Hcorr matrix to \( m^{2} \left( {n^{2} + 2n + 1} \right) \) elements; likewise, adding one model to the ensemble grows Hcorr to \( n^{2} \left( {m^{2} + 2m + 1} \right) \) elements.
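As a hedged numerical illustration of this sensitivity (the figures below are not taken from the article): for an ensemble of \( m = 20 \) models with \( n = 50 \) hidden nodes each,

$$ (m \cdot n)^{2} = (20 \times 50)^{2} = 1{,}000{,}000 \quad \to \quad m^{2} (n^{2} + 2n + 1) = (20 \times 51)^{2} = 1{,}040{,}400 $$

so a single extra hidden node adds roughly 40,000 entries to \( \varvec{H}_{\text{corr}} \), whereas each B-NNRW base model only ever inverts its own \( n \times n \) (here 50 × 50) matrix.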

4 Experiments

In this section, the description of the benchmark datasets used in this article is presented, followed by the hyperparameter analysis and tuning using DOE. Finally, the B-NNRW algorithm and its variants (BP-NNRW and BR-NNRW) are analysed and compared to the O-DNNE.

4.1 Datasets

The experiments are carried out using eight datasets, four synthetic datasets (Ding et al. 2017) and four benchmark datasets. The synthetic datasets are used to evaluate how the algorithms respond to drift, which is simulated by expanding the variable domain. For each synthetic dataset, 5000 samples are generated. The variable domain of each attribute is divided into ten parts, with seven parts used to create the first 2000 samples. A new part is added at every 1000 samples, expanding the domain and including new data never presented to the algorithms, as illustrated in Fig. 3.

Fig. 3 Synthetic dataset generation schematics showing the drifting points at 2000, 3000, and 4000 samples

The benchmark datasets were chosen from public repositories based on the number of features, the number of samples to simulate the stream, and the diversity of application domains. A summary of the main features of each dataset, i.e. the number of predictive attributes and the number of data samples, is presented in Table 2.

Table 2 Dataset features

At the data pre-processing stage, standardisation feature scaling (zero mean and unit variance) was applied to the attributes of all datasets to avoid the effects of large differences in scale. It also reduces the impact of outliers compared to min–max normalisation scaling. The data transformation is given by Eq. (22):

$$ \varvec{X}_{\text{new}} = \frac{{\varvec{X} - \mu \left( \varvec{X} \right)}}{{\sigma \left( \varvec{X} \right)}} $$
(22)

Additionally, random Gaussian noise \( e \sim N(0, 0.02) \) is added to the input variables of the synthetic datasets. In Sects. 4.2 and 4.3, the authors analyse the hyperparameter optimisation using DOE.
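A minimal sketch of this pre-processing step is shown below; the array names are illustrative, and interpreting 0.02 as the standard deviation of the noise is an assumption.

```python
# Sketch of the pre-processing: z-score standardisation (Eq. 22) plus additive
# Gaussian noise on the synthetic inputs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(5000, 4))                 # stand-in for a synthetic dataset

X_std = (X - X.mean(axis=0)) / X.std(axis=0)    # zero mean, unit variance
X_noisy = X_std + rng.normal(0.0, 0.02, size=X_std.shape)  # assumes 0.02 = std dev
```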

4.2 Hyperparameter adjustment

In this section, the authors detail the experimental protocol to tune the models’ hyperparameters using full factorial DOE. The full factorial relies on the two-way ANOVA (Montgomery 2012) and works under the following assumptions: populations are normally distributed, populations have equal variances, and samples are randomly and independently drawn.

For the purpose of hyperparameter tuning, a real application is simulated where only a portion of the data is available. The first 1000 data samples of each dataset were used, randomly divided into 70% for training and 30% reserved for testing. The hyperparameters to be optimised for both B-NNRW and DNNE algorithms are described as follows:

  • Number of models (M) Number of base models that compose the ensemble;

  • Number of nodes (N) Number of hidden nodes for each base model;

  • Regularisation factor (R) In the case of B-NNRW, the regularisation factor is responsible for penalising large weights in the optimisation process. For DNNE, it acts to control the decorrelation term in the optimisation function;

  • Random weights range (W) This hyperparameter determines the interval in which the initial random weights are uniformly distributed. Although the authors (Alhamdoosh and Wang 2014; Ding et al. 2017) suggest that the weights of DNNE be set in the interval [− 1, 1], this hyperparameter plays a vital role in the accuracy of the models. The effect of initial weights in RVFL was investigated by Zhang and Suganthan (2016);

  • Number of attributes (A) This hyperparameter is exclusive to B-NNRW. It determines the fraction of the total inputs that are randomly selected to train each NNRW base model.

Five levels of each hyperparameter are investigated, and each treatment (a combination of hyperparameters) was run ten times. The hyperparameter levels are summarised in Table 3.

Table 3 Set of evaluated hyperparameters for DNNE and B-NNRW

It is important to note that for B-NNRW, the actual number of hidden nodes is a function of the number of selected inputs. For a dataset with ten inputs and A = 0.8, the number of hidden nodes is equal to the resulting number of inputs (10 × 0.8 = 8) multiplied by N. The analysis of the hyperparameter tuning and the resulting hyperparameter set for each algorithm are presented in the next section.

4.3 Hyperparameter analysis

The results of the experiments show that not only does each hyperparameter have a different level of importance in the hyperparameter optimisation, but the levels of importance also change for different problems. Tables 4 and 5 show the three most important sources of variability, given by the F0 statistic. The small p values (≪ 0.01) indicate that the results are statistically significant. The analysis of the importance of each factor can help to prioritise the optimisation of the hyperparameters that have more influence on the model's performance.

Table 4 DNNE significance (p value), F0 statistic, the percentage of explained variance, and cumulative percentage of explained variance
Table 5 B-NNRW significance (p value), F0 statistic, the percentage of explained variance, and cumulative percentage of explained variance

W is the most important hyperparameter for tuning DNNE, except in the Mex dataset, which showed no statistically significant differences among treatments. In the case of the B-NNRW algorithm, the critical factor depends on the problem, although the number of features was found to be the most important hyperparameter for the synthetic datasets. The results also showed that some interactions between factors were relevant sources of variability. An illustrative example is the interaction between N and W of B-NNRW in the Quality dataset, which is statistically significant and responsible for 10.8% of the total variability. Figure 4 shows the differing effect of the number of nodes according to the level of weight initialisation. This type of analysis is not possible when traditional hyperparameter tuning approaches are applied.

Fig. 4 Effects of the number of nodes (N) given different initialisation weights for the Quality dataset (B-NNRW)

Following the analysis of the DOE results, supported by the results of paired t tests and Wilcoxon tests, the hyperparameters for each problem were optimised. The best set of hyperparameters for each problem is summarised in Table 6.

Table 6 Best set of hyperparameters for each problem: B-NNRW and DNNE algorithms

The optimised models are evaluated in a simulated data stream, as described in Sects. 4.4 and 4.5.

4.4 Data stream evaluation set-up

The second set of experiments was aimed at evaluating the performance of the algorithms in the data stream environment. The optimised models, obtained from the model fitting process, are run in a simulated data stream that consists of the remaining data from the hyperparameter tuning process. The effectiveness of BP-NNRW (B-NNRW with pruning) and BR-NNRW (B-NNRW with replacement) is evaluated for different ranges of pruning (p) and replacement (r), respectively. The range of p evaluated in this experiment is p = [0.1, 0.3, 0.5, 0.7, 0.9]. The r range investigated is defined as r = [0.1, 0.2, 0.3, 0.4, 0.5].

The algorithms are updated at fixed intervals. Once the algorithms start making predictions, the data samples are stored until the updating point is reached. It is assumed that the true labels of the stored data are available at this point; therefore, they can be used to compute the evaluation metrics and update the models. In the case of the synthetic datasets, where a drift occurs every 1000 samples, the updating interval is set at every 250 samples. For the benchmark datasets, since no drift is reported, several updating points are evaluated (every 250, 500, 1000, and 1500 samples). The updating process of DNNE follows the O-DNNE procedure described in Ding et al. (2017). Each experiment is run 20 times with different random seeds, and the algorithms are compared in terms of MSE over the simulated data stream. Furthermore, all results are subjected to a t test at the 95% confidence level to check statistical significance. The algorithms were coded by the authors using Matlab® software version 2017b.
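A hedged sketch of this protocol is given below. It builds the initial ensemble from the primary buffer (bootstrap samples plus random feature subsets, as in Fig. 2) and then runs the simulated stream, buffering samples and calling the update_ensemble routine from the sketch in Sect. 3.3 at each updating point; the chunk size, ensemble size, and all names are illustrative assumptions.

```python
# Sketch of the initial ensemble construction and the simulated-stream protocol.
import numpy as np

def build_ensemble(X0, y0, M=50, frac_features=0.8, n_hidden=30, rng=None):
    # Initial ensemble from the primary buffer: bootstrap samples + random features
    rng = np.random.default_rng(rng)
    n, d = X0.shape
    k = max(1, int(round(frac_features * d)))
    models = []
    for _ in range(M):
        rows = rng.integers(0, n, size=n)                 # bootstrap sample
        feat = rng.choice(d, size=k, replace=False)       # random feature subset
        models.append((*train_nnrw(X0[rows][:, feat], y0[rows], n_hidden, rng=rng), feat))
    return models

def run_stream(models, X_stream, y_stream, chunk_size=250, rng=None):
    weights = np.ones(len(models))
    active = np.arange(len(models))
    sq_err, buf_X, buf_y = [], [], []
    for x, y in zip(X_stream, y_stream):
        y_hat = ensemble_predict(models, weights, active, x[None, :])[0]
        sq_err.append((y_hat - y) ** 2)
        buf_X.append(x)
        buf_y.append(y)
        if len(buf_X) == chunk_size:                      # updating point reached
            models, weights, active = update_ensemble(
                models, np.array(buf_X), np.array(buf_y), rng=rng)
            buf_X, buf_y = [], []
    return np.mean(sq_err)                                # MSE over the stream
```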

4.5 Results and discussion

The results of O-DNNE and B-NNRW strategies are compared in this section using the optimised hyperparameter sets for synthetic and benchmark datasets.

  (a) Synthetic datasets

First, the response of the algorithms in the synthetic datasets is discussed. The main results are summarised in Table 7.

Table 7 General results of B-NNRW and its variants and O-DNNE for the synthetic datasets

In general, the replacement strategy (BR-NNRW) resulted in better accuracy than the pruning (BP-NNRW) and weight updating (B-NNRW) strategies. An analysis based on statistical tests, specifically the t test with 95% confidence, showed that different accuracies are achieved according to the level of replacement. For the Mex dataset, replacement rates of 0.2 and 0.3 were statistically equal and resulted in the best accuracy. For Fried1 and Multi, all levels are statistically different, and the best result is achieved with a 0.5 replacement rate. For the Fried3 dataset, the accuracy increases up to a rate of 0.3, beyond which the results are statistically equal.

When compared to O-DNNE, BR-NNRW achieved statistically better accuracy in the Mex, Fried1, and Multi datasets, while O-DNNE performed better in the Fried3 dataset. However, before the first point of drift (the first 1000 test samples), O-DNNE showed better accuracy, suggesting a better learning capability. After the first point of drift, at test sample 1001, BR-NNRW resulted in better accuracy, showing a better ability to cope with concept drift. The results are summarised in Table 8, where the MSE of O-DNNE and BR-NNRW (with the best replacement rate, according to Table 7) on the test data is presented. The test data are separated by the samples in each domain. Table 8 also shows the average updating time, in seconds, where the computational advantage of BR-NNRW in keeping up to date with the data can be observed.

Table 8 Accuracy (MSE) of BR-NNRW and O-DNNE by the samples on each domain in the test data

The smoothed error curves shown in Fig. 5 illustrate how the BR-NNRW and O-DNNE algorithms perform on each dataset.

Fig. 5 Smoothed error curves of BR-NNRW and O-DNNE for the synthetic datasets

For the synthetic datasets with known drifts, BR-NNRW has been shown to improve the accuracy compared to B-NNRW and BP-NNRW. The best results were achieved with different replacement rates, highlighting the importance of keeping old knowledge (low replacement rate) or discarding old knowledge (high replacement rate) according to the problem at hand. O-DNNE showed considerable learning capability, achieving better accuracy in the first part of the test data, which is within the same domain as the training data. BR-NNRW, on the other hand, showed an advantage in coping with concept drift, not only in terms of accuracy but especially in terms of computational efficiency.

In the next section, the algorithms are tested on benchmark datasets from four varied application domains with larger sample sizes and more attributes.

  (b) Benchmark datasets

Next, the performance of O-DNNE and B-NNRW, along with its variations (BP-NNRW and BR-NNRW), is analysed at different updating intervals on the benchmark datasets. The main results are summarised in Tables 9, 10, 11, and 12.

Table 9 Results of O-DNNE and B-NNRW variations for Housing problem at various model updating intervals
Table 10 Results of O-DNNE and B-NNRW variations for the Quality problem at various model updating intervals
Table 11 Results of O-DNNE and B-NNRW variations for the Maintenance problem at various model updating intervals
Table 12 Results of O-DNNE and B-NNRW variations for Energy problem at various model updating intervals

The first result that should be highlighted is the effect of updating interval. In general, for each problem, all algorithms showed a similar pattern in terms of the best updating interval. This trend is illustrated in Fig. 6, which shows the average accuracy of all algorithms for each dataset. In this figure, the MSE was normalised between 0 and 1 for each algorithm and then averaged.

Fig. 6 Datasets average normalised MSE for each chunk size

The results show that the choice of updating frequency is related to the problem being studied, which in turn may be related not only to the level of drift present in the data but also to data sufficiency. The best updating interval for the Maintenance and Quality datasets is likely to be near 1000 observations. Better accuracy is achieved in the Energy dataset when it is updated more frequently, i.e. with smaller data chunks, while the opposite behaviour is observed in the Housing dataset, which may indicate that no dataset shift is occurring in this dataset.

From Tables 9, 10, 11, and 12, it is possible to observe the advantage of the replacement strategy compared to the pruning strategy. In Fig. 7, the results of BP-NNRW and BR-NNRW, averaged over the chunk size, are summarised to illustrate the effects of the different levels of pruning and replacement.

Fig. 7 Average results of pruning (BP-NNRW) and replacement (BR-NNRW) strategies

The pruning (BP-NNRW) approach can prevent low-performing members of the ensemble from voting in the final decision. If properly adjusted, pruning can increase the accuracy of the model significantly compared to weight updating only, as given in Tables 9, 10, 11, and 12. The replacement (BR-NNRW) not only avoids the use of low-performing members, by excluding them, but also includes new members trained on the most recent data. Although the flat lines shown in Fig. 7b, c, d for the levels of replacement suggest no difference in response for different levels, especially for the Energy and Maintenance datasets, the results of the t test at the 95% confidence level showed that there are statistically significant differences between the replacement levels. The difference between the various levels of replacement is shown in Fig. 8, considering the average results for the best updating frequency of each dataset.

Fig. 8 Average MSE for B-NNRW for different replacement rates

For the Energy and Housing datasets, the best accuracy was achieved with a lower rate of replacement (0.1). It should be pointed out that in both cases, the difference between the 0.1 and 0.2 replacement rates was statistically significant, according to the t test. In the Quality and Maintenance datasets, B-NNRW required higher rates of replacement to achieve its best performance. The best replacement rate for Quality was 0.5, while for Maintenance, the best accuracy was achieved by replacing 40% of the models. The adjustment of the rate of replacement is an additional challenge for the use of BR-NNRW; however, the gain in accuracy greatly outweighs the effort.

A comparison between the most accurate updating strategy (BR-NNRW) and a recent online data stream regression algorithm from the literature (O-DNNE) is summarised in Table 13. BR-NNRW is set with the best replacement rate for each dataset. Table 13 shows the average MSE and average updating time (in seconds), i.e. the computational processing time spent to execute the updating process of each algorithm, considering the best updating interval for each dataset (Fig. 6).

Table 13 O-DNNE versus B-NNRW

Both BR-NNRW and O-DNNE showed statistically similar results, according to t tests at the 95% confidence level, in the Maintenance dataset. Furthermore, in this dataset, O-DNNE showed an advantage in processing time due to the reduced number of models resulting from the hyperparameter tuning; however, as the number of models and nodes increases, the rapid growth of the Hcorr matrix (Sect. 3.4) makes the model updating computationally expensive. In the other benchmark datasets, the results showed the advantage of BR-NNRW compared to O-DNNE in terms of both accuracy and updating time. The replacement mechanism was able to effectively update the ensemble and keep or improve the accuracy through the simulated stream of data. Although no concept drift was reported for the benchmark datasets studied in this paper, the advantage of updating methods compared to static algorithms is clear. This is shown in Fig. 9, where a comparison between the smoothed error of B-NNRW with no updating (static) and BR-NNRW (dynamic) is presented.

Fig. 9 Comparison between static B-NNRW and dynamic BR-NNRW

From Fig. 9, it is possible to observe that, except in the Housing dataset, the use of updating strategies improved the results compared with their static versions, i.e. with no update applied. In the case of the Housing dataset, it can be concluded that the data used for training may not have been sufficient to train the models, or that no concept drift was observed during the evaluation period. The behaviour of the error in the Quality dataset may indicate a recurring drift, due to the decrease in accuracy of the static model through the stream of data. The Maintenance dataset showed stronger evidence of concept drift, not only due to the difference in accuracy between the static and dynamic models but also due to the high level of replacement required. Both static and dynamic models start at a similar level of accuracy. However, as time evolves, the static models lose accuracy, while the updating process keeps the error of the dynamic models at low levels. The Energy dataset benefited from the updating strategies, especially at the beginning of the evaluation period, where the MSE shows a decrease before it stabilises, which could indicate that the data used for training may not have been sufficient to represent all the concepts present in the data; this idea is reinforced by the low rate of replacement required for this dataset.

5 Conclusions

In this article, the authors have reviewed the available solutions to online data stream regression problems and identified the need for a faster and more accurate approach. The development of a new B-NNRW method is presented, which is designed to adapt to the evolving nature of data streams by updating the ensemble at specified intervals to maintain the accuracy of the predictions. This was made possible by combining bootstrap sampling with random feature selection to train a highly diversified pool of NNRWs. Synthetic datasets simulating concept drift were used to validate the ability of the proposed algorithm to deal with changing data concepts.

Additionally, datasets from various industries were selected to evaluate the potential enhancement to the prediction model in different dataset types. A series of experiments were carried out on datasets from Housing, Maintenance, Energy, and Quality Control applications, and the results were compared with the existing O-DNNE method. The results with the proposed updating mechanism showed an average of 8% improvement in the accuracy of predictions across all four types of datasets, with up to 47 times shorter computational time. Furthermore, the use of DOE proved a promising technique to optimise the hyperparameters systematically and can be applied to any ML algorithm.

As part of future research, the next step will be the development of strategies to automatically define the rate of replacement in the proposed approach. The results of the experiments indicated that such an automated updating mechanism should also be linked to the type of dataset. Such a development could potentially improve the suitability of the proposed method for industrial applications.