1 Introduction

Software systems are now essential to everyday life. These systems have grown larger and more complex to meet increasing demands from many different sectors. Hence, ensuring the quality of software systems has become a critical task. Improving software quality depends heavily on reducing the number of software defects. Currently, one of the most active research areas in the Software Engineering domain is Software Defect Prediction (SDP) (Li et al. 2018; Catal and Diri 2009; Hall et al. 2011). The main objective of SDP is to predict which parts of the software are likely to contain defects so that resources such as time and budget can be more effectively allocated to support software quality assurance activities.

Early work on SDP typically focused on predicting defects in files or modules. In recent years, another branch of SDP has emerged that focuses on predicting defect-inducing software changes. This is known as Just-in-Time Software Defect Prediction (JIT-SDP). JIT-SDP predicts whether a change in the code will induce defects or not as soon as it is committed to a software repository. The main advantages of JIT-SDP over file-level prediction are (Kamei et al. 2012): (i) predictions are performed at fine granularity, helping to reduce the effort required to fix the defects; (ii) the defect fixing task can be assigned to the right developer as changes can be easily mapped to the person who committed them; and (iii) defect prediction is made immediately after committing the change so that the code is still fresh in the developer’s mind, facilitating code inspection.

Most of the existing JIT-SDP studies implicitly assume an offline scenario, where a pre-existing training set is available beforehand and no additional training examples are ever received (Kamei et al. 2012, 2016). However, in practice, JIT-SDP operates in an online scenario, where software changes become labelled as either clean or defect-inducing and become available as training data over time. McIntosh and Kamei (2017) showed that there can be fluctuations in the importance of the characteristics of defect-inducing software changes over time, which may be a result of concept drift during the software development process. Concept drift can be described as a change in the data generating process, affecting the underlying probability distribution of the data. It may negatively impact the predictive performance of models that are predominantly built on old data. To deal with this issue, it is important for models to be able to learn and adapt to new data over time.

Both online and offline learning models can be used to learn additional data received over time in online scenarios. Online learning models are models that consider training examples one at a time, with the model parameters being updated after the presentation of each training example (Bishop 2006). Therefore, they naturally fit online scenarios, as they can be updated with each new training example generated by the online scenario separately, without requiring access to past data. Offline learning models are models that process the entire training set in one go (Bishop 2006). Even though they require the whole training set to be available before learning commences and cannot process each training example separately upon arrival, they can also be adapted for use in online scenarios. This can be done by retraining the model on new data together with (a sufficient amount of) past data. Both offline and online learning models need to be combined with special strategies to deal with concept drift to be able to address changes in the underlying data distribution. However, as online learning models do not require retraining on past data, their training process is usually faster. This advantage can also become a weakness depending on the problem being learned. In particular, as online learning models do not conduct multiple learning passes through the data, they may exhibit poorer predictive power, for instance as a result of catastrophic forgetting (McCloskey and Cohen 1989).

Therefore, this paper aims at analyzing whether offline learning models can improve predictive performance in online JIT-SDP scenarios compared to online learning models, and whether this would be at the cost of higher computational requirements. This investigation will be carried out both on Within-Project (WP) and Cross-Project (CP) online JIT-SDP scenarios. In particular, these two scenarios may lead to different conclusions in terms of which type of learning models perform better, due to the different amounts of training data used. The following research questions (RQs) are addressed:

RQ1:

Can offline learning help to improve predictive performance compared to online learning in online WP JIT-SDP scenarios? Which base learners usually perform best?

RQ2:

How beneficial is CP data to improve predictive performance of offline models compared to online models in online CP JIT-SDP scenarios?

RQ3:

How high is the computational cost of offline learning in online JIT-SDP scenarios compared to that of online learning models?

To answer the above research questions, we propose a new approach called Batch Oversampling Rate Boosting (BORB) that is able to use different offline base learners and can operate in the online JIT-SDP scenario by learning from new training data. BORB is an offline version of the online JIT-SDP approach Oversampling Rate Boosting (ORB) (Cabral et al. 2019). It translates the online resampling concepts of ORB, which have previously been shown to be useful for online JIT-SDP scenarios (Cabral et al. 2019), into an offline resampling approach. Therefore, in this paper, offline learning implies using BORB, whereas online learning implies using ORB, unless stated otherwise.

The BORB and ORB approaches are compared using 5 and 4 different base learners, respectively. The use of different base learners enables a more complete investigation of offline and online learning, as different base learners may lead to different conclusions. Our experiments based on ten open source projects show that offline learning (BORB) helped to improve predictive performance compared to online learning (ORB) when using most base learners with WP data. Even though CP data helped to improve BORB’s predictive performance further, it was more helpful for improving ORB’s predictive performance. The training process of online learning through ORB was less computationally expensive than that of offline learning through BORB. However, the differences in predictive performance and computational cost between the top ORB and BORB approaches were not very large.

The contributions of this work are as follows:

  • We provide the first comparison between offline base learners and online base learners in a realistic online JIT-SDP scenario, revealing that offline learning can slightly improve predictive performance compared to online learning. Therefore, if researchers or practitioners have predictive performance as a priority when choosing a JIT-SDP model, we recommend that they consider offline JIT-SDP as a possible choice.

  • We show how to adapt offline base learners so that they can use adaptive resampling rates to deal with class imbalance in online scenarios for JIT-SDP.

  • We show that CP data can improve predictive performance. This happens both when using offline BORB and online ORB, even though CP data was particularly beneficial for online ORB. Therefore, we recommend that researchers and practitioners consider adopting CP learning, especially if they are using online ORB.

  • We show that online learning incurred a lower computational cost than offline learning. Therefore, we recommend that researchers and practitioners consider online learning if computational cost is a concern for them. This may be a concern when there is a need to compare several JIT-SDP models to decide which one to adopt. However, it may not be a concern when adopting a single online or offline JIT-SDP model over time, as the cost of these approaches becomes negligible in this context.

  • While one may intuitively assume that it is better to use online learning models for online JIT-SDP scenarios, we show that both online and offline learning models can bring benefits in such realistic scenarios and are worth further exploring.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 presents background knowledge. Section 4 introduces the proposed approach. Section 5 presents the details of the investigated datasets. Section 6 explains the experimental setup for answering the RQs. Section 7 explains the results of the experiments. Section 8 presents threats to validity. Section 9 presents the conclusions and future work.

2 Related Work

This section discusses online and offline models for JIT-SDP using both WP and CP data. As our previous work (Tabassum et al. 2020, 2022) also involves online CP JIT-SDP, there are some commonalities between the related work listed here and that of those studies.

2.1 Offline WP JIT-SDP

Kim et al. (2008) conducted one of the first studies on JIT-SDP. They proposed an approach to classify software changes as defect-inducing or not, based on features extracted from the change metadata such as author name, commit hour, code entropy, lines of comments, cyclomatic complexity, etc. Their approach achieved an accuracy rate of 78% on average in a study involving 12 open source software projects. Śliwerski et al. (2005) investigated the connection between defects in a defect-tracking system and a version control system in order to identify ‘fix-inducing changes’ (changes able to identify previous defect-inducing changes). They investigated which properties of these changes are correlated with inducing fixes. They showed, for example, that a large change is more likely to induce a fix. Eyolfson et al. (2011) showed that the time of day and the day of the week at which a commit is submitted, as well as the daily commit frequency of the developer, may influence the “bugginess” of a commit. Kamei et al. (2012) performed a large-scale investigation of JIT-SDP by building logistic regression models using 6 open source and 5 commercial projects. They used 14 different features extracted from code changes to predict defect-inducing changes and achieved an average accuracy of 68% and an average recall of 64%.

Other studies focused on the machine learning approach being used to create JIT-SDP models. Chen et al. (2018) considered JIT-SDP as a multi-objective problem by maximizing the number of identified defect-inducing changes and minimizing the effort to fix them. They used logistic regression models to conduct the prediction using 6 open source datasets considering the 14 metrics described in Kamei et al. (2012). Their approach managed to identify 63.8% of the defect-inducing changes on average when using only 20% of the software quality team effort. Yang et al. (2017) proposed a two-layer ensemble learning approach called TLEL. In the inner layer, they used Decision Trees and Bagging models to create a Random Forest model. In the outer layer, they grouped many different Random Forest models using stacking (Aggarwal CC et al. 2015). They also investigated base learners other than Decision Trees, including Naive Bayes, Support Vector Machines, Linear Discriminant Analysis and Nearest Neighbor Classifiers. They showed that combining Decision Trees into Random Forests performed better than using other base learners in 5 out of 6 open source projects. Their approach detected 70% of defect-inducing changes by reviewing 20% of the code. TLEL also achieved a higher F1-score than three baseline approaches: Deeper, DNC and MKEL. Yang et al. (2015) proposed a deep learning method called ‘Deeper’ for JIT-SDP. They compared their approach with the approach proposed by Kamei et al. (2012) and showed that their approach was able to discover 32.22% more defects, based on a study with 6 open source datasets. Li et al. (2020) investigated the impact of different combinations of base learners such as Support Vector Machines (SVMs), Logistic Regression (LR), Random Forest (RF), Multi-layer Perceptron (MLP), Naive Bayes (NB) and Decision Trees (DTs). They showed that the diversity of base learners plays an important role in achieving promising performance.

Some studies have also suggested effort-aware prediction of defect-inducing software changes, leading to approaches such as EALR (Kamei et al. 2012), CBS (Huang et al. 2017) and CBS+ (Huang et al. 2019). However, the effort-aware components of these approaches require a whole set of software changes to be available for sorting in order of inspection priority. As being able to make predictions “just-in-time”, at commit time, is one of the key advantages of JIT-SDP (Kamei et al. 2012), waiting for such a whole set of changes to be produced for sorting is unsuitable in JIT-SDP.

All of the studies discussed above considered JIT-SDP in an offline scenario (i.e., all the learning algorithms used in these studies are offline and were not retrained with new data over time). These studies did not take into account the fact that the label of a training example may not be available immediately after the software change is submitted, i.e., they overlook a problem known as verification latency, which is the delay in obtaining the class (or label) of a software change. The chronology of the data was also disregarded. As a result, in these works, future examples may have been used to train models for predicting past data. Hence, these offline WP JIT-SDP studies are not applicable in a realistic scenario.

2.2 Online WP JIT-SDP

Tan et al. (2015) investigated JIT-SDP in a scenario where new batches of training examples arrive over time and can be used for updating the classifiers. To the best of our knowledge, even though previous work considered verification latency in defect models that are updated online (Kim et al. 2007), Tan et al. were the first to consider this issue in JIT-SDP. Regarding the classifiers, they used 7 updatable algorithms based on Naive Bayes (Bayes, LWL), instance-based learning (IBK, KStar), boosting (LogitBoost), nearest-neighbors (NNge), and Support Vector Machines (SPegasos) to learn over time. In addition, they used resampling techniques to tackle the inherent class imbalance problem. Based on a study with one proprietary and six open source projects, the authors claim that both resampling techniques and updatable classification improve precision by 12.2-89.5%. They also note that overlooking the data chronology and the verification latency problem leads to a false impression of higher predictive performance. Therefore, it is important to take the data chronology and verification latency problem into account in order to reproduce more realistic scenarios. However, their approach assumes that there is no concept drift, i.e., that the defect generating process does not vary over time. Their approach also assumes a fixed gap of time between the training and test examples, during which no training examples can be produced. In practice, some software changes may be found to be defect-inducing during that gap, but their use for training will be delayed by their approach. Moreover, they did not compare online versus offline learning models in their work.

McIntosh and Kamei (2018) performed a longitudinal case study of 37,524 changes from the rapidly evolving QT and OPENSTACK systems and found that fluctuations in the importance of the features of fix-inducing changes can impact the performance of JIT-SDP models. They showed that JIT-SDP models typically lose predictive power after one year, possibly as a result of concept drift. Hence, they suggest continuously updating the JIT-SDP model with recent data.

Cabral et al. (2019) proposed a method called Oversampling Rate Boosting (ORB) to tackle a type of concept drift called class imbalance evolution, where the proportion of examples of the defect-inducing and clean classes changes over time. Their work investigates an online JIT-SDP scenario taking verification latency into account. They considered a waiting time (w days) after the commit time to safely label the change as clean. If a defect is found within w days, the change is labeled as defect-inducing and used for training. If no defect associated with a change has been found within w days of its commit time, this gives confidence that the change is clean, and it is therefore labeled as clean and used for training. If a change that has already been labeled as clean is found to be defect-inducing after w days, the training example corresponding to that change will be updated with the correct label and be presented again for learning.

ORB adjusts its resampling rate to tackle class imbalance evolution based on the moving average over the predictions provided by the JIT-SDP model. In Cabral et al. (2019) this mechanism was shown to improve predictive performance over JIT-SDP approaches that assume a fixed level of class imbalance. ORB improved \({|R_0 - R_1|}\) by up to 45.38% and 63.59% compared to the state-of-the-art class imbalance evolution algorithms Undersampling Online Bagging (UOB) and Improved Oversampling Online Bagging (OOB) (Wang et al. 2015), respectively.

2.3 Offline CP JIT-SDP

JIT-SDP classifiers require a sufficient amount of training data to provide useful predictions. Such data is not available at the beginning of a software project, as data arrives sequentially over time. Cross-Project (CP) JIT-SDP can overcome this issue by using data from past projects to build the classifier. Several studies investigated CP JIT-SDP. Kamei et al. (2016) conducted one of the first studies. They carried out an empirical evaluation of CP JIT-SDP performance by using data from 11 open source projects. They investigated five CP JIT-SDP approaches based on project similarity, three variations of data merging approaches, and ensemble approaches where each model was trained on data from a different project. All approaches employed random forests as base learners. They found that simple merging of all CP data into a single training set and ensemble approaches obtained similar predictive performance to that of WP models. Unlike in SDP at the component level, other more complex approaches, including similarity-based approaches, did not offer any additional advantage compared to these. Another study (Catolino et al. 2019) investigated CP JIT-SDP for mobile platforms using 14 apps extracted from the CommitGuru platform (Rosen et al. 2015). They compared the CP performance of four different well-known classifiers and four ensemble techniques. Naive Bayes performed best compared to other classifiers and some ensemble techniques. They did not check how CP compared against WP results.

Chen et al. (2018) considered JIT-SDP as a multi-objective problem to maximize the number of identified defect-inducing changes while minimizing the effort required to fix the defects. They proposed a multi-objective optimization-based supervised method called MULTI to build logistic regression JIT-SDP models. They used six open source projects. MULTI was evaluated on three different performance evaluation scenarios (cross-validation, cross-project-validation, and timewise-cross-validation) against 43 state-of-the-art supervised and unsupervised methods. They found that it can perform significantly better than WP methods in terms of Accuracy and \(P_{opt}\) metrics. Zhu et al. (2020) proposed a JIT-SDP approach called DAECNN-JDP based on denoising autoencoder and convolutional neural networks. WP and CP defect prediction experiments were performed on six large open source projects and DAECNN-JDP was compared with 11 baseline models, including eight machine learning models, EALR, Deeper and CNN-JDP. The results show that DAECNN-JDP achieved better predictive performance than the baseline models for both CP and WP JIT-SDP. However, the predictive performances of CP and WP approaches were not compared against each other.

The studies above considered offline scenarios where the model is never updated with new data and, hence, cannot deal with concept drift. They did not take into account the chronology and verification latency of the data either. It is unknown whether their conclusions would hold in realistic online JIT-SDP scenarios.

2.4 Online CP JIT-SDP

Tabassum et al. (2020, 2022) first investigated CP learning for online JIT-SDP based on OOB (Wang et al. 2013) and ORB (Cabral et al. 2019). They proposed three online CP approaches using Hoeffding trees as base learners: AIO (which builds a single model by training with all WP and CP data together), Filtering (which filters out CP instances dissimilar to the target project) and Ensemble (which builds an ensemble of models, where each model is trained on data from a different project). Their study based on 10 open source and 9 proprietary datasets showed that their online CP approaches (AIO and Filtering) achieved up to 53.89%, 37.35% and 29.03% improvements in terms of G-Mean compared to a WP online approach. They also showed that enabling the CP approaches to be updated with additional training data received over time in an online CP scenario leads to better predictive performance than adopting an offline CP scenario, where only CP data available before the target project commences is used for training.

3 Background

This section explains the online JIT-SDP scenario adopted in this work and some background required to understand it. It also explains the ORB approach upon which BORB is based.

3.1 Definitions

Definition (Data Stream): A data stream is a potentially infinite sequence of training examples \(\mathcal {S} = \{(\textbf{x}_{i},y_{i})\}_{i=1}^{\infty }\), where i is a natural sequential number (time step) indicating the order in which the training examples were labeled, \(\textbf{x}_i\) are the input features describing example i, and \(y_i\) is the label of example i. In JIT-SDP, the input features describe the software change, as will be explained in Section 5. The label is defect-inducing or clean.

Definition (Online Scenario With Verification Latency): An online scenario is a scenario where training examples are produced over time, forming a data stream. JIT-SDP operates in an online scenario where the labels of the software changes arrive with a delay, which is referred to as verification latency (Ditzler et al. 2015). Specifically for JIT-SDP, labelled examples can be produced following the procedure defined by Cabral et al. (2019). When a change is committed to the repository, the developers hope that it is clean, but it may, instead, induce a defect. To label this change as clean, we need to wait for a period of time (waiting period w) to be confident that the change is really clean. If no defect is reported to be associated with this change within w days, the change is labeled as clean at the end of the waiting period, producing a training example. If, on the other hand, a defect is found to be linked to this change during these w days, the change is immediately labeled as defect-inducing, without having to wait until the end of the waiting period. It may also happen that a change that was initially labeled as clean is found to be defect-inducing after the w days. When this happens, this change is relabeled as defect-inducing, producing a new labeled training example of the defect-inducing class. This procedure respects chronology, capturing a realistic scenario that reflects the labelling procedure that would be observed in practice. The waiting period w can be considered as a pre-defined parameter.
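The labelling procedure above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation used by Cabral et al. (2019): the `Change` record, the `training_examples` function and the use of days as plain numbers are hypothetical names and simplifications introduced here, and a defect linked exactly on day w is assumed to fall within the waiting period.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

CLEAN, DEFECT = 0, 1  # class encoding: 0 = clean, 1 = defect-inducing


@dataclass
class Change:
    """Hypothetical record of a software change (times measured in days)."""
    commit_day: float
    defect_found_day: Optional[float] = None  # day a defect was linked to it, if ever


def training_examples(change: Change, w: float, now: float) -> List[Tuple[float, int]]:
    """Return the (day, label) training examples produced by `change` up to day `now`."""
    end_of_wait = change.commit_day + w
    found = change.defect_found_day
    events: List[Tuple[float, int]] = []
    if found is not None and found <= end_of_wait:
        # Defect linked during the waiting period: labelled defect-inducing
        # immediately, without waiting until the end of the period.
        if found <= now:
            events.append((found, DEFECT))
        return events
    # No defect within w days: labelled clean at the end of the waiting period.
    if end_of_wait <= now:
        events.append((end_of_wait, CLEAN))
    # Defect linked after the waiting period: a new defect-inducing example is produced.
    if found is not None and found <= now:
        events.append((found, DEFECT))
    return events


if __name__ == "__main__":
    late_defect = Change(commit_day=0, defect_found_day=120)
    print(training_examples(late_defect, w=90, now=100))  # clean example only so far
    print(training_examples(late_defect, w=90, now=130))  # clean, then defect-inducing
```

In this sketch, a change whose defect is discovered after the waiting period yields two training examples over time, which is the situation that produces label noise for a learner that cannot revisit past examples.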

Definition (Online Learning): Given a data stream formed by training examples ordered by the time they were produced \(\mathcal {S} = \{(\textbf{x}_{i},y_{i})\}_{i=1}^{\infty }\), an online learning model is a model that is immediately updated whenever a new training example \((\textbf{x}_{i},y_{i}) \in \mathcal {S}\) becomes available. Strict online learning models must be able to process (learn) each training example once and only once. So, the classifier is always updated with new examples, without requiring any retraining on past examples. This is useful to speed up learning for cases where storing and reprocessing past training examples may be computationally infeasible, e.g., for very large data streams, or data streams where the frequency of incoming data is very large. However, some (non-strict) online learning algorithms may access a memory containing past training examples to support the learning process. The classifier may also have strategies to speed up adaptation to changes (a.k.a., concept drifts) that may affect the underlying data generating process.

Definition (Offline Learning): Consider a finite set \({\tau } = \{(\textbf{x}_{i},y_{i})\}_{i=1}^{n}\) containing n examples that are available for training at a given time. This set can be referred to as a batch. Being a set and not a stream, the time order of these examples is ignored. An offline learning model is trained on \({\tau }\) such that the training and testing phases cannot intersect in time, i.e., the classifier is only available to use when the training procedure has ended.

3.2 Discussion on Adopting Offline Learning in Online Scenarios

Even though the time order of the examples within \({\tau }\) is ignored by the offline learning procedure, it is still possible to create a sequence of training sets \({\tau }_{j}\), \(j\ge 0\), where each training set \({\tau }_{j}\) is updated with one or more new training examples that have become available up to the current time step. We refer to the number of time steps that we wait before creating a new training set as the retraining period (rp). Every rp time steps, \(\tau _j\) is created with all \(\tau _{j-1}\) training examples plus the rp new training examples. The batch \(\tau _j\) is then used to retrain the offline learning model from scratch. The larger the rp, the longer we will have to wait before the predictive model can be retrained. If a concept drift happens during this period, the outdated model is unable to react to this drift until the new \({\tau }_{j}\) is created, potentially hindering the predictive performance. On the other hand, the smaller the rp, the higher the computational cost of the approach, as the model is retrained from scratch more often.
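The role of the retraining period can be made concrete with a short Python sketch. It is a simplified illustration rather than the procedure evaluated in this paper: `retrain_periodically` and the `fit` callback are hypothetical names, and `fit` stands in for any offline learner that builds a model from a whole batch.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Example = Tuple[list, int]  # (input features, label)


def retrain_periodically(
    stream: Iterable[Example],
    rp: int,
    fit: Callable[[List[Example]], object],
) -> Iterator[Tuple[int, object]]:
    """Yield (time step, model) each time the model is rebuilt from scratch."""
    batch: List[Example] = []  # all training examples received so far
    for t, example in enumerate(stream, start=1):
        batch.append(example)
        if t % rp == 0:
            # tau_j = tau_{j-1} plus the rp new examples; retrain on the whole batch.
            yield t, fit(list(batch))


if __name__ == "__main__":
    stream = [([i], i % 2) for i in range(7)]
    # Using len as a stand-in "learner" exposes the size of each training batch.
    print(list(retrain_periodically(stream, rp=3, fit=len)))  # [(3, 3), (6, 6)]
    # Total number of examples processed across retrainings grows as rp shrinks.
    print(sum(n for _, n in retrain_periodically(stream, rp=3, fit=len)))  # 9
    print(sum(n for _, n in retrain_periodically(stream, rp=1, fit=len)))  # 28
```

With seven examples, rp = 3 triggers two retrainings that process 9 examples in total, whereas rp = 1 triggers seven retrainings that process 28, which illustrates the trade-off between reaction time and computational cost.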

Despite the training process of the offline learning model ignoring the time order of examples within a given training set, this process would still ensure that only training data that is really available at a given point in time would be used for training, i.e., the JIT-SDP online scenario described in this section would still be respected.

As the data stream generated by the software changes submitted to a software repository is not a high frequency stream, it may be computationally acceptable to store past changes and rebuild classifiers from scratch when new training sets become available. Moreover, managing the whole historical data stream enables us to access all the benefits of offline learning over online learning. In particular, by ignoring the time order of examples, offline learning models frequently process the training set several times, which can help to produce stronger (more accurate) classifiers. In the face of verification latency, revisiting all software changes labeled so far allows us to delete any training examples whose label was incorrectly assigned as clean for software changes that have now been found to be defect-inducing. This prevents the classifier from learning noisy information, unlike online learning models such as ORB (Cabral et al. 2019), where the mislabeled training example is permanently incorporated into the classifier. Nevertheless, a potential disadvantage of using offline learning for online scenarios is that this could make it more difficult to deal with certain types of change in the defect generating process, as each given training set may contain a mix of examples produced by different defect generating processes.
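The removal of outdated clean labels when a batch is rebuilt can be sketched as follows. This is a minimal illustration under the assumption that every labelled example carries an identifier of the change it came from; `build_batch` and the `change_id` field are hypothetical names introduced here.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

LabelledEvent = Tuple[Hashable, list, int]  # (change_id, features, label)


def build_batch(events: Iterable[LabelledEvent]) -> List[Tuple[list, int]]:
    """Keep only the most recent label of each change when forming a training batch."""
    latest: Dict[Hashable, Tuple[list, int]] = {}
    for change_id, features, label in events:
        # A later label for the same change replaces the earlier (noisy) one.
        latest[change_id] = (features, label)
    return list(latest.values())


if __name__ == "__main__":
    events = [
        ("c1", [1.0], 0),  # c1 labelled clean at the end of its waiting period
        ("c2", [2.0], 0),
        ("c1", [1.0], 1),  # c1 later found to be defect-inducing
    ]
    print(build_batch(events))  # [([1.0], 1), ([2.0], 0)]
```

An online learner that has already been updated with the first `c1` example cannot undo that update, whereas the rebuilt batch contains only the corrected label.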

3.3 The ORB Approach

Cabral et al. (2019) tackled the problem of class imbalance evolution over time when dealing with online JIT-SDP. They showed that this evolution negatively impacts predictive performance by making the classifier become highly skewed towards one of the classes during different periods of the project. They also considered the verification latency problem for receiving the class labels. Their proposed Oversampling Rate Boosting (ORB) approach was able to improve predictive performance in comparison to algorithms that assume a fixed imbalance ratio over time and to the existing class imbalance evolution algorithms Undersampling Online Bagging (UOB) and Improved Oversampling Online Bagging (OOB) (Wang et al. 2015).

Algorithm 1 (figure a): Oversampling Rate Boosting (ORB) (Cabral et al. 2019)

$$\begin{aligned} \rho _{c}^{(t)} = \theta '\rho _{c}^{(t-1)} + (1 - \theta ') (y^{(t)} == c). \end{aligned}$$
(1)

Algorithm 1 shows the pseudocode of the ORB approach. It is important to note that the ORB is built upon the OOB (Wang et al. 2015) approach. Thus, in Algorithm 1, the numbered black lines correspond to the original OOB while the blue lines correspond to the ORB. ORB calculates the moving average of the predictions \(ma^{(t)}\) using a time window of size \(w_s\). JIT-SDP is a binary problem where 0 represents the clean class and 1 represents the defect-inducing class. Calculating the moving average allows us to detect a bias in the predictions towards any particular class. Depending on this bias, the resampling rate of one of the classes is boosted (increased).
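As a rough illustration of the moving average of predictions, the following Python sketch keeps the last \(w_s\) predictions in a sliding window. It is an assumption-laden simplification: the class name `PredictionMovingAverage` is hypothetical, and averaging over a partially filled window at the start of the stream is a choice made here, not a detail taken from Cabral et al. (2019).

```python
from collections import deque


class PredictionMovingAverage:
    """Moving average of binary predictions (0 = clean, 1 = defect-inducing)."""

    def __init__(self, window_size: int) -> None:
        # deque with maxlen discards the oldest prediction automatically.
        self.window = deque(maxlen=window_size)

    def update(self, prediction: int) -> float:
        """Record a new prediction and return the current moving average."""
        self.window.append(prediction)
        return sum(self.window) / len(self.window)


if __name__ == "__main__":
    ma = PredictionMovingAverage(window_size=4)
    for prediction in [1, 1, 0, 1]:
        value = ma.update(prediction)
    print(value)         # 0.75: most recent predictions are defect-inducing
    print(ma.update(0))  # 0.5: the oldest prediction (1) has left the window
```

A value close to 1 indicates that the classifier is predicting mostly the defect-inducing class, and a value close to 0 indicates that it is predicting mostly the clean class.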

As JIT-SDP is a class imbalanced problem, an effective classifier would provide class imbalanced predictions. So, ORB is set to make the rate of defect-inducing predictions as close as possible to a parameter th that represents the desired imbalance ratio of the predictions. For example, if \(ma^{(t)}\) is close to 1, the classifier is producing many false alarms. The resampling rate for class 0 (clean class) will then be increased in order to reduce the classifier’s skew, making it closer to the desired skew th. The adjustment in the resampling rate is made through boosting factors computed according to Equations 2 and 3. The final oversampling rate is then the product of \(OBF_0\) or \(OBF_1\) and the resampling rate k necessary to balance the classes, as computed by OOB. These boosting factors are responsible for adding an extra emphasis to one of the classes in order to yield balanced predictions.

$$\begin{aligned} OBF^{(t)}_0(P_0) = {\left\{ \begin{array}{ll} \left( \frac{m^{ma^{(t)}}-m^{th}}{(m-m^{th})}*l_0\right) +1, \text {if } ma^{(t)} > th\\ 1, \text { otherwise}\\ \end{array}\right. } \end{aligned}$$
(2)
$$\begin{aligned} OBF^{(t)}_1(P_1) = {\left\{ \begin{array}{ll} \left( \frac{m^{(th-ma^{(t)})}-1}{(m^{th}-1)}*l_1\right) +1, \text {if } ma^{(t)} \le th\\ 1, \text { otherwise}\\ \end{array}\right. } \end{aligned}$$
(3)

In Equations 2 and 3, \(P_0\) and \(P_1\) are sets of hyperparameters containing the following parameters: m determines the growth of the exponential function; th is the threshold that indicates the desired class imbalance in the predictions; \(ma^{(t)}\) is the moving average of the predictions at time t; and \(l_0\) and \(l_1\) control the maximum boosting factor values.

In short, if \(ma^{(t)} \le th\), this suggests that less than th% of the commits are being classified as defect-inducing. Hence, the resampling rate of the defect-inducing class should be boosted. If \(ma^{(t)} > th\), then more than th% of the commits are classified as defect-inducing. Hence, the resampling rate of the clean class should be boosted. For further details regarding the ORB, please refer to Cabral et al. (2019).
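Equations 2 and 3 translate directly into code. The sketch below is a plain Python transcription of the two boosting factors; the function names `obf0` and `obf1` and the hyperparameter values used in the example are illustrative assumptions rather than settings prescribed here.

```python
def obf0(ma: float, th: float, m: float, l0: float) -> float:
    """Boosting factor for the clean class (Equation 2)."""
    if ma > th:
        return ((m ** ma - m ** th) / (m - m ** th)) * l0 + 1
    return 1.0


def obf1(ma: float, th: float, m: float, l1: float) -> float:
    """Boosting factor for the defect-inducing class (Equation 3)."""
    if ma <= th:
        return ((m ** (th - ma) - 1) / (m ** th - 1)) * l1 + 1
    return 1.0


if __name__ == "__main__":
    # Illustrative (assumed) hyperparameter values.
    m, th, l0, l1 = 1.5, 0.4, 10.0, 12.0
    print(obf0(1.0, th, m, l0))  # 11.0: all recent predictions are defect-inducing
    print(obf1(0.0, th, m, l1))  # 13.0: all recent predictions are clean
    print(obf0(th, th, m, l0), obf1(th, th, m, l1))  # 1.0 1.0: no boosting at ma = th
```

The extreme cases show the role of \(l_0\) and \(l_1\): the clean-class factor reaches its maximum \(l_0 + 1\) when \(ma^{(t)} = 1\), the defect-inducing factor reaches its maximum \(l_1 + 1\) when \(ma^{(t)} = 0\), and both factors equal 1 when \(ma^{(t)} = th\). The final oversampling rate is then obtained by multiplying the relevant factor by the rate k computed by OOB.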

4 Proposed Approach

Algorithm 2 BORB’s testing and training procedure

To investigate the influence of offline learning in online JIT-SDP scenarios we propose a novel approach called Batch Oversampling Rate Boosting (BORB), which consists of an adaptation of Oversampling Rate Boosting (ORB) (Cabral et al. 2019). Our RQs require us to isolate the effects of offline vs online learning as much as possible, so that we can analyze the potential benefit of offline learning without being affected by other mechanisms that one may design to further improve predictive performance of the state-of-the-art. Therefore, such adaptation was designed to be as similar as possible to the ORB approach, but using core offline learning mechanisms instead of online ones.

BORB is an offline learning algorithm which periodically rebuilds its JIT-SDP models incorporating newly labeled training examples. This is achieved by updating the training set with the most recently labeled examples. The updated training set, containing all training examples received so far, is then used as a batch for retraining the JIT-SDP model from scratch based on an offline learning algorithm. Different from the offline learning approaches presented in Section 2.1, such training process ensures that only training examples whose labels are already available with respect to the dataset chronology can be used for training (i.e., it takes the verification latency problem into account and follows the online JIT-SDP scenario explained in Section 3).

Overall, BORB and ORB share the following similarities:

  • Both methods use resampling to deal with class imbalance based on the same oversampling rate boosting function (Equations 2 and 3).

  • They are both capable of detecting when the classifier is performing badly based on the rule involving ma and th explained in Section 3.3 and to react to it by adjusting the resampling rate based on the above mentioned oversampling rate boosting function.

  • They are both able to take into account verification latency through the waiting time strategy from Cabral et al. (2019).

  • They both ensure that the online scenario explained in Section 3.1 is respected. In particular, both of them ensure that only training examples that are already available at a given point in time can be used for training at this point in time.

Their key differences are related to replacing the core online mechanisms of ORB by core offline ones:

  • Being an offline learning approach, BORB stores past data and can learn from it multiple times, while ORB sees each training example only once.

  • Being an offline learning approach, BORB collects the training examples into a training set. As such, the order of examples within this set is not respected when training on them, even though the online scenario explained in Section 3.1 is respected.

  • When collecting new training examples over time, BORB is able to note if these examples correspond to previously seen training examples whose label has changed due to a late detection of a defect associated with them. Therefore, BORB can replace the old mislabelled clean examples with the new corresponding defect-inducing ones. ORB is able to learn the new defect-inducing example, but is unable to remove the old example, which has already been learned.

  • BORB is periodically retrained to enable offline learning models to be used, whereas ORB learns each training example separately.

Algorithm 2 presents BORB’s pseudocode. For each new incoming software change (\(x^{(t)}\)) received at timestep t, the base learner clf, if already trained, provides a class prediction (line 10). Note that clf is not useful until \({\tau }\) contains at least one labeled software change from each class (clean and defect-inducing). Before that, all provided predictions are assigned to the clean class.

Different from online learning models, BORB stores all historical software changes in \(\mathcal {X}\). As new class labels arrive (following the procedure described in Section 3 and using waiting period w), they are immediately used to create training examples corresponding to their respective software changes in \(\mathcal {X}\) (line 15). These training examples are added to the training set \(\tau \). If a given new defect-inducing class label corresponds to a software change that was previously labeled as clean, the previous training example in \(\tau \) is replaced by the new one with the defect-inducing label. The base learner is periodically reset whenever the modulo operation between the timestep (t) and the parameter rp is zero (line 16), and the training set \(\tau \) is used to retrain it (lines 16 to 27).
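The bookkeeping described above can be sketched as follows. This is an illustration under stated assumptions rather than the code of Algorithm 2: the names `TrainingSet`, `add_label`, `has_both_classes` and `should_retrain` are hypothetical, and labels are assumed to be encoded as 0 (clean) and 1 (defect-inducing).

```python
class TrainingSet:
    """Hypothetical container playing the role of the training set tau."""

    def __init__(self):
        # Maps a software change identifier to (features, label).
        self.examples = {}

    def add_label(self, change_id, features, label):
        """Store a newly arrived label for a software change.

        A defect-inducing label (1) that arrives for a change previously
        stored as clean (0) replaces the old training example.
        """
        previous = self.examples.get(change_id)
        if previous is None or (previous[1] == 0 and label == 1):
            self.examples[change_id] = (features, label)

    def has_both_classes(self):
        """The base learner is only useful once both classes are present."""
        return {label for _, label in self.examples.values()} == {0, 1}


def should_retrain(t, rp):
    """Retraining is triggered when the timestep is a multiple of rp."""
    return t % rp == 0
```

In a loop over incoming changes, one would call `add_label` whenever a label becomes available and retrain the base learner from scratch only when `should_retrain(t, rp)` and `has_both_classes()` both hold.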

BORB tackles the class imbalance problem at two different moments: in the test phase by picking a presumably adequate classifier prediction threshold (lines 5 to 9) and in the training phase by means of an oversampling mechanism (lines 16 to 27). For the testing phase, BORB considers that classifiers usually make predictions based on a prediction threshold \(\theta \). The value of \(\theta \) is typically set to 0.5. If the score given by the classifier is smaller than 0.5, class 0 is predicted. Otherwise, class 1 is predicted. However, this threshold can be adjusted to help deal with class imbalance. BORB does that in lines 5 to 9. In particular, \(\textbf{c}\) (line 7) stores the prediction scores over the last \(w_s\) software changes. The decision threshold \(\theta \) to be adopted by the base learner is a quantile in \(\textbf{c}\) corresponding to a hyperparameter th. This hyperparameter represents the target proportion of predictions that should be defect-inducing. E.g., if \(th = 0.5\), \(\theta = \) median(\(\textbf{c}\)). As JIT-SDP is a class imbalanced problem, the target proportion should normally be less than 0.5.
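A minimal sketch of such a quantile-based threshold is given below. It assumes that the quantile corresponding to th is the \((1 - th)\) quantile of the recent scores, so that roughly a proportion th of those scores lies at or above \(\theta \); this reading is consistent with the median example for \(th = 0.5\), but the exact quantile definition used in Algorithm 2 is not reproduced here. The function names and the linear interpolation rule are likewise assumptions.

```python
def decision_threshold(scores, th):
    """Return the (1 - th) quantile of the recent prediction scores.

    scores: prediction scores over the last w_s software changes
    th: target proportion of defect-inducing predictions, in [0, 1]
    Uses linear interpolation between the two closest ranks (an assumption).
    """
    if not scores:
        raise ValueError("at least one prediction score is required")
    ordered = sorted(scores)
    position = (1.0 - th) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def predict(score, theta):
    """Class 1 (defect-inducing) if the score reaches the threshold, else class 0."""
    return 1 if score >= theta else 0


recent_scores = [0.1, 0.2, 0.3, 0.4, 0.5]
theta = decision_threshold(recent_scores, th=0.2)
predictions = [predict(s, theta) for s in recent_scores]
```

With a target proportion below 0.5, the threshold moves towards the upper end of the recent scores, so fewer changes are flagged as defect-inducing.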

For the training phase, similar to ORB (Cabral et al. 2019), BORB deals with class imbalance based on oversampling as follows. The oversampling rate is used to decide whether and by how much to oversample examples of a given class for training the base learner. The oversampling rate is adjusted based on the predictions given to the most recent test software changes. This enables adjustments on the base learners without having to wait for the labels of these software changes. In particular, a proportion ma of predictions given to the defect-inducing class over the most recent \(w_s\) software changes is determined (line 23). This proportion is compared to the same hyperparameter th used in the test phase. If ma indicates that BORB is predicting the defect-inducing class more/less often than the target proportion th, we need to oversample the clean/defect-inducing class, so that BORB focuses more on learning how to identify examples of this class. The idea is that if S (i.e., a sample of the training set) is skewed towards the defect-inducing class, the classifier should also incorporate this skewness in its predictions.

The iterations from lines 20 to 26 are responsible for updating the base learner based on the oversampling rate. The function skewedSample (line 21) retrieves the sample S containing n training examples from \({\tau }\), based on the oversampling rate, which is determined according to the factors \(obf_0\) and \(obf_1\), as detailed in Algorithm 3. Therefore, many training iterations will be performed on different training sets S in order to make the base learner converge to a skew respecting th. The idea is that if ma (line 23) gets more distant from th, the obfs are adjusted so that S contains the necessary class imbalance to make the base learner accumulate new biased information such that at the last training iterations ma approaches th.

As an illustrative example of the impact of the skewedSample function, consider using a Multilayer Perceptron (MLP) as a base learner and \(th = 0.4\) (i.e., class proportions (0.6:0.4)). If in the first epoch of the MLP the base learner average prediction for the last \(w_s\) test examples is 0.7, \(th=0.4\) and \(ma = 0.7\) will be used to compute \(obf_0\) and \(obf_1\). The factors \(obf_0\) and \(obf_1\) will result in a new sample S containing training examples with the class proportions (\(\frac{obf_0}{obf_{0} + obf_{1}}\): \(1 - \frac{obf_0}{obf_{0} + obf_{1}}\)), which will then be used for the second training epoch. Since in the first training epoch \(ma > th\), in the second training epoch the proportion of examples from class 0 will be larger than the proportion of examples of class 1 (i.e., the oversampling rate for class 0 will be boosted). Eventually, repeating this process for many epochs will bring the base learner’s average predictions close to the target th.
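The class proportions in this example can be illustrated with the sketch below. It is not the skewedSample function of Algorithm 3: the name `skewed_sample`, the representation of examples as `(features, label)` pairs, and sampling with replacement within each class are all assumptions made for illustration.

```python
import random


def skewed_sample(training_set, n, obf0, obf1, rng):
    """Draw n examples so that class 0 has share obf0 / (obf0 + obf1).

    training_set: list of (features, label) pairs with labels 0 or 1
    rng: a random.Random instance, passed in for reproducibility
    Sampling is done with replacement within each class (an assumption).
    """
    clean = [ex for ex in training_set if ex[1] == 0]
    defective = [ex for ex in training_set if ex[1] == 1]
    if not clean or not defective:
        raise ValueError("both classes must be present in the training set")

    n_clean = round(n * obf0 / (obf0 + obf1))
    sample = [rng.choice(clean) for _ in range(n_clean)]
    sample += [rng.choice(defective) for _ in range(n - n_clean)]
    rng.shuffle(sample)
    return sample


# Imbalanced toy training set: 90 clean and 10 defect-inducing examples.
toy_set = [((i,), 0) for i in range(90)] + [((i,), 1) for i in range(10)]
S = skewed_sample(toy_set, n=100, obf0=3.0, obf1=1.0, rng=random.Random(0))
```

With \(obf_0 = obf_1 = 1\) the sample is balanced, and any boosting factor above 1 shifts the sample towards the corresponding class.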

As in ORB, Equations 2 and 3 (Cabral et al. 2019) compute \(obf_0\) and \(obf_1\) (lines 24 and 25 of Alg. 1), respectively. Figure 1 presents the behaviour of these equations. Figure 1 a) shows the \(obf_0\) and \(obf_1\) curves generated by the parameters \(th = 0.4\), \(l_0 = 5\), \(l_1 = 12\), \(m = 1000\) and \(ma \in [0, 1]\), while Fig. 1 b) uses \(l_0 = 9\) and \(m = 10\). Due to the different values of parameter m, when \(ma \approx th\), \(obf_0\) and \(obf_1\) are less impacted in Fig. 1 a) than in Fig. 1 b). As in JIT-SDP the class 0 (clean class) is usually the majority class, it is advisable to use an obf upper limit for the clean class (\(l_0\)) lower than the one for the defect-inducing class (\(l_1\)).

Algorithm 3 skewedSample function

Fig. 1 Oversampling rate boosting function (Cabral et al. 2019) for two different sets of parameters. The x-axis is the average prediction over the last \(w_s\) test examples, while the y-axis depicts the resulting oversampling boosting factor (obf) according to Equations 2 and 3

We investigated BORB for both WP and CP learning, respecting the online scenario introduced in Section 3. For WP learning, the JIT-SDP model is trained with data from the target project only. For CP learning, the JIT-SDP model is trained both with data from the target project (WP data) and from all other available projects (CP data). Hence, for CP learning, BORB model is trained with both CP and WP data together similar to the All-in-One approach from Tabassum et al. (2020, 2022). This means that any benefits of CP data mentioned in this study refer to the benefits obtained from combining both CP and WP data. This is reasonable in online scenarios, as both CP and WP data become available over time during the course of a project (Tabassum et al. 2020).

Table 1 An overview of the projects (adapted from Tabassum et al. (2020))

5 Datasets

We have used ten datasets extracted from open source GitHub repositories, which were made available by Cabral et al. (2019) at https://zenodo.org/record/2594681. Table 1 shows details about these datasets. All datasets were extracted based on CommitGuru (Rosen et al. 2015). The change metrics include 14 metrics (input features) that can be divided into five groups: i) diffusion of the change, including input features NS (number of modified subsystems), ND (number of modified directories), NF (number of modified files), Entropy (distribution of modified code across each file), ii) size of the change, including input features LA (lines of code added), LD (lines of code deleted), LT (lines of code in a file before the change), iii) purpose of the change, including input features FIX (whether or not the change is a defect fix), iv) history of the change, including input features NDEV (number of developers that changed the modified files), AGE (average time interval between the last and the current change), NUC (number of unique changes to the modified files) and v) experience of the developer that made the change, including input features EXP (developer experience), REXP (recent developer experience), SEXP (developer experience on a subsystem). These software change metrics have been shown to be adequate for JIT-SDP in previous work (Kamei et al. 2012) and have been adopted in previous online JIT-SDP work (Cabral et al. 2019; Tabassum et al. 2020, 2022).

6 Experimental Setup

This section explains the experimental setup for answering the RQs introduced in Section 1. To perform the analysis for RQ1, we compare the predictive performances of BORB-WP and ORB-WP (Cabral et al. 2019) approaches; for RQ2, we compare the predictive performances of BORB-WP, BORB-CP and ORB-CP (Tabassum et al. 2020, 2022) approaches; and for RQ3, runtimes for BORB-WP, BORB-CP and ORB-CP are compared. Our analyses are based on various online and offline base learners, as listed in Section 6.1. For RQ1 and RQ2, we compared all approaches against a dummy classifier that predicts defect-inducing or clean uniformly at random. This is because being able to outperform a dummy classifier in terms of overall predictive performance means the JIT-SDP model was able to learn relevant JIT-SDP knowledge.

Given a certain project P, we are interested in using JIT-SDP to predict the software changes of P as defect-inducing or clean. Such predictions should respect chronological order according to the scenario explained in Section 3. Chronology is determined based on author timestamp, as recommended in Flint (2021).

When creating a predictive model for a given project P, WP approaches make use of only WP data from P for training. CP approaches make use of data from all projects for training, including P. The training procedure of all approaches at a given timestamp t ensures that only training examples that have already been labeled by timestamp t based on their chronology and waiting period are used for training, as explained in Section 3.1 (Cabral et al. 2019). A waiting period of 90 days is used for open source data, as in previous studies (Cabral et al. 2019; Tabassum et al. 2020).

All approaches have been executed 30 times on each data set. A replication package can be found in the JIT-SDP-NN repository, https://github.com/dinaldoap/jit-sdp-nn. The datasets generated during and/or analysed during the current study are available in the JIT-SDP-DATA repository, https://github.com/dinaldoap/jit-sdp-data.

6.1 Base Learners

This section lists all the base learners that are investigated with the BORB and ORB approaches in this study. Altogether, our base learners were selected so as to: (1) cover a variety of different types of learning approaches for both offline and online learning (function-based, probabilistic and tree-based), as we wish to check what kind of online/offline model is most beneficial for JIT-SDP, (2) make the evaluation fair in the sense that we will select both online and offline approaches that are expected to achieve good results (in particular including Logistic Regression and Iterative Random Forest for fairness towards offline learning (Kamei et al. 2012; Chen et al. 2018; Li et al. 2020) and Oza Bagging of Hoeffding Trees and Naive Bayes for fairness towards online learning (Tabassum et al. 2020; Cabral et al. 2019; Turhan et al. 2009)) and (3) include the use of base learners that are the same as much as possible between online and offline learning (Logistic Regression and Multilayer Perceptron).

Overall, the following base learners were adopted by the ORB and BORB approaches in our experiments:

  • ORB: Logistic Regression, Multilayer Perceptron, Naive Bayes and Oza Bagging of Hoeffding Trees.

  • BORB: Logistic Regression, Multilayer Perceptron, Naive Bayes, Iterative Random Forest and Iterative Hoeffding Forest.

6.1.1 Offline Base Learners

  • Logistic Regression (LR): Logistic regression is a well known offline linear classifier (Kleinbaum et al. 2002). Its training requires iterating through all the training data multiple times (epochs), and it has been successfully used for offline JIT-SDP in previous work (Kamei et al. 2012; Chen et al. 2018). LR approaches can be affected by multi-collinearity. To cope with that, the LR approach used in our experiments is regularized with elastic net, and the overall effect of elastic net is grouping correlated coefficients and selecting the groups that are relevant for the model.

  • Multilayer Perceptron (MLP): MLP is an Artificial Neural Network that consists of three layers of interconnected nodes (Gardner and Dorling 1998). Training also requires iterating through all the training data multiple times (epochs) based on the backpropagation algorithm. It has been included here for being a universal approximator, able to model any function. MLP approaches can also be affected by multi-collinearity. To deal with that, the MLP adopted in our experiments is regularized with dropout. This avoids collinearity by disabling some input features on each step of the backpropagation algorithm.

  • Iterative Random Forest (IRF): IRF consists of an ensemble of CART decision trees (Breiman et al. 1984). It is similar to a Random Forest (Breiman 2001), but the decision trees are trained with different subsets of the training data to encourage more diversity. It was included here as previous work has shown that diversity is important in ensembles for JIT-SDP (Li et al. 2020).

  • Iterative Hoeffding Forest (IHF): IHF is the same approach as IRF, but using online Hoeffding trees as the base learners instead of CARTs. As the iterative ensemble approach itself is an offline learning approach, we classify this approach as an offline learning approach. We have adopted it with BORB in this study so that we can evaluate the benefits of the approach BORB itself compared with ORB, without being affected by the benefits of the offline decision tree over the online one. This evaluation can be conducted by comparing BORB-IHF against ORB-OHT.

6.1.2 Online Base Learners

  • Logistic Regression (LR): despite Logistic regression being an offline algorithm, it is possible to set the number of epochs to one so that the algorithm becomes online. The downside of using logistic regression as an online learning algorithm is that the resulting model is likely to become weaker, i.e., to have poorer predictive performance.

  • Multilayer Perceptron (MLP): similar to LR, it is also possible to set the number of epochs for training MLPs to one, so that MLPs become online learning models. The downside is similar to that of LR, i.e., its predictive performance may considerably reduce when a single epoch is used.

  • Naïve Bayes (NB): NB is a well known Bayesian classifier that can be trained through one pass over the training data. This approach is inherently online, as the equations used to update the model parameters can process each training example separately. It is included here because it has been successfully used in component-based software defect prediction (Turhan et al. 2009).

  • Oza Bagging of Hoeffding Trees (OHT): Oza Bagging is an online version of the Bagging ensemble learning algorithm. It requires a single pass through the training data to learn it. It is typically run with Hoeffding Trees, which are online decision trees suitable for large complex datasets (Domingos and Hulten 2000). Different from LR and MLP, it is not possible to make offline decision trees into online approaches by changing any of their hyperparameter values. Hoeffding Trees are a specific type of decision tree that can learn through a single pass through the training data. Due to their theoretical foundations on the Hoeffding bound, Hoeffding Trees are able to produce online models with strong performance guarantees, which is why this approach is adopted in this and in previous JIT-SDP work (Tabassum et al. 2020; Cabral et al. 2019; Tabassum et al. 2022).

6.2 Performance Metrics

The metric adopted for measuring predictive performance is the Geometric Mean (G-Mean) of Recall0 and Recall1, where Recall0 is the recall on the clean class and Recall1 is the recall on the defect-inducing class. Different from biased metrics such as Matthews Correlation Coefficient, F1-Score, Accuracy, Precision and G-Mean of Precision and Recall, the G-Mean of \(\textit{Recall}_0\) and \(\textit{Recall}_1\) adopted in our work is a metric that is not biased by class imbalance (Zhu 2020), being suitable for class imbalanced problems such as JIT-SDP. For simplicity, we will refer to this metric simply as G-Mean from here onward. We have also chosen G-Mean instead of AUC because AUC incorporates several threshold values that are not meaningful in practice and makes comparison between approaches difficult, and is hence discouraged in the context of software defect prediction (Song et al. 2018).

While computing the metrics in a prequential way, a fading factor is used to track changes in predictive performance over time as recommended for problems that may suffer concept drift (Gama 2013). As mentioned in our previous study (Tabassum et al. 2022), if the current example belongs to class i, \(Recall_i^{(t)} = \theta Recall_i^{(t-1)} + (1 - \theta ) \mathbbm {1}_{\hat{y} = i}\), where i is zero or one, t is the current time step, \(\theta \) is a fading factor set to 0.99 as in Cabral et al. (2019), \(\hat{y}\) is the predicted class, and \(\mathbbm {1}_{\hat{y} = i}\) is the indicator function, which evaluates to one if \(\hat{y} = i\) and to zero otherwise. If the current example does not belong to class i, \(Recall_i^{(t)} = Recall_i^{(t-1)}\). Also, \(\textit{G-Mean}^{(t)} = \sqrt{Recall_0^{(t)} \times Recall_1^{(t)}}\). It is worth noting that \(Recall_0 = 1 - FalseAlarmRate\), i.e., false alarms are taken into account through Recall0 and G-Mean.
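The prequential update above can be sketched as follows. The class name `FadingGMean` is hypothetical, and the initial recall value is an assumption, since the text does not state how the recalls are initialised.

```python
import math


class FadingGMean:
    """Prequential G-Mean of Recall0 and Recall1 with a fading factor."""

    def __init__(self, fading=0.99, initial=0.0):
        self.fading = fading
        # recall[i] is the faded recall on class i (initial value is an assumption).
        self.recall = [initial, initial]

    def update(self, y_true, y_pred):
        """Update only the recall of the true class of the current example."""
        hit = 1.0 if y_pred == y_true else 0.0
        self.recall[y_true] = (
            self.fading * self.recall[y_true] + (1.0 - self.fading) * hit
        )
        return self.g_mean()

    def g_mean(self):
        return math.sqrt(self.recall[0] * self.recall[1])


metric = FadingGMean(fading=0.99)
for y_true, y_pred in [(0, 0), (1, 1), (0, 1), (1, 1)]:
    current = metric.update(y_true, y_pred)
```

Because each recall is updated only when an example of its own class arrives, a long run of clean examples leaves Recall1 unchanged, which is what keeps the metric insensitive to class imbalance.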

The performance metric used to measure the computational cost is the amount of time in seconds used to train and test the JIT-SDP models.

Fig. 2 Overview of Experiments. (1) Training and testing of a given learning model \(M_i\) on each Project \(P_j\)’s data stream, where \(1\le j \le N\) and \(N=10\) is the number of projects. (2) Group of observations corresponding to learning approach \(M_i\) created for the Scott-Knott.BA12 test. (3) A12 effect sizes computed against a dummy approach

6.3 Statistical Tests and Effect Size

The Scott-Knott procedure (Mittas and Angelis 2012) is used to compare the performance obtained by all BORB and ORB approaches across datasets, ranking the models and separating them into subgroups. The use of statistical tests across datasets has been recommended by Demšar to reduce problems with multiple comparisons (Demšar 2006). Each group of observations compared through the test corresponds to one learning approach run across all projects (data streams), as illustrated in points (1) and (2) of Fig. 2. Therefore, given that we use 19 approaches (including the dummy approach) and 10 projects in our experiments, there are 19 groups with 10 observations each in the test. As recommended by Menzies et al. (2017), this test uses non-parametric bootstrap sampling. This makes it a non-parametric test, which is adequate for comparison across datasets (Demšar 2006). To rule out insignificant differences in performance, the test uses the A12 effect size (Vargha and Delaney 2000). Approaches are placed in separate groups by the Scott-Knott test only if the A12 effect size is significant (Menzies et al. 2017). We will refer to Scott-Knott based on bootstrap sampling and A12 as Scott-Knott.BA12 in this paper. Scott-Knott.BA12 is used to compare both predictive performance in terms of G-Mean and computational cost in seconds, but for the computational cost we remove the dummy approach, leading to 18 groups. This is because a comparison of computational cost against the dummy approach would be meaningless, as this approach does not spend any time on learning (there is no learning) and provides extremely fast predictions (it simply predicts randomly, rather than making predictions based on a predictive model). Smaller Scott-Knott.BA12 rankings are better rankings.

Table 2 Average G-Mean for BORB-WP
Table 3 Average G-Mean for ORB-WP
Table 4 Average G-Mean for BORB-CP
Table 5 Average G-Mean for ORB-CP

We also report the A12 effect sizes against the dummy approach for each learning approach on each dataset individually to support the analysis of predictive performance, as illustrated in Point (3) of Fig. 2. Symbols [*], [s], [m] and [b] represent insignificant (A12 < 0.56), small (A12 \(\ge \) 0.56), medium (A12 \(\ge \) 0.64) and large (A12 \(\ge \) 0.71) A12 effect sizes, respectively. Presence/absence of the sign “-” in the effect size means that the corresponding approach was worse/better than the dummy approach.
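For reference, the Vargha and Delaney A12 statistic and the symbol thresholds listed above can be sketched as below. The function names are hypothetical, and mirroring values below 0.5 (so that the magnitude is judged on max(A12, 1 - A12) and a "-" sign is added) is an assumption about how the negative symbols are derived.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value from xs exceeds one from ys,
    counting ties as one half."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


def effect_symbol(a):
    """Map an A12 value to [*], [s], [m] or [b], with '-' when a < 0.5."""
    magnitude = max(a, 1.0 - a)  # mirrored magnitude (an assumption)
    if magnitude >= 0.71:
        label = "b"
    elif magnitude >= 0.64:
        label = "m"
    elif magnitude >= 0.56:
        label = "s"
    else:
        return "[*]"
    sign = "-" if a < 0.5 else ""
    return f"[{sign}{label}]"


# Toy G-Mean values for an approach and for a reference approach (made up).
approach = [0.62, 0.65, 0.61, 0.66]
reference = [0.50, 0.49, 0.51, 0.50]
symbol = effect_symbol(a12(approach, reference))
```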

6.4 Hyperparameter Tuning

Random search is used for hyperparameter tuning as suggested in Bergstra and Bengio (2012) and Mantovani et al. (2015), which show that random search performs similarly to or better than grid search for hyperparameter optimisation. For each hyperparameter of each configuration of a classifier, a random value is drawn according to the specified probability distribution (either uniform or log-uniform, depending on the hyperparameter being tuned). ORB and BORB (meta-models) are associated with OHT, IHF, LR, MLP, NB and IRF (base learners). So, the overall configuration of the classifier includes the BORB, ORB and the base learner’s hyperparameters. The first 3000 training examples from each data stream are used for hyperparameter tuning for both WP and CP approaches. For BORB and ORB approaches with each dataset, base learner and hyperparameter configuration, 3 executions have been performed for tuning purposes. For each dataset and classifier, 128 configurations were evaluated. More details of the investigated hyperparameter values are given in Table 1 in the supplementary material. It is worth noting that the application of log as a preprocessing step is considered as a hyperparameter choice when using MLP and LR as base models, as they can be affected by skewed distributions.
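A minimal sketch of this kind of random search is shown below. It is not the tuning code of the replication package: the function names, the search-space format, the hyperparameter names in the example, and the choice of averaging the score over the executions and keeping the best average are all assumptions made for illustration.

```python
import math
import random


def sample_configuration(space, rng):
    """Draw one value per hyperparameter from its specified distribution.

    space maps a name to (kind, low, high), with kind in
    {"uniform", "log-uniform"}.
    """
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "uniform":
            config[name] = rng.uniform(low, high)
        elif kind == "log-uniform":
            # Uniform in log space, then mapped back.
            config[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


def random_search(space, evaluate, n_configs=128, n_runs=3, seed=0):
    """Return the configuration with the best average score over n_runs."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_configs):
        config = sample_configuration(space, rng)
        score = sum(evaluate(config, run) for run in range(n_runs)) / n_runs
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical search space and a stand-in objective (higher is better).
space = {"th": ("uniform", 0.1, 0.5), "learning_rate": ("log-uniform", 1e-4, 1e-1)}
best_config, best_score = random_search(
    space, evaluate=lambda config, run: -abs(config["th"] - 0.3)
)
```

In practice, `evaluate` would train and test a given approach on the tuning portion of a data stream and return its G-Mean for one execution.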

Table 6 G-Mean ranking of approaches based on the Scott-Knott.BA12 test
Table 7 Average run time in seconds for BORB-WP
Table 8 Average run time in seconds for ORB-WP
Table 9 Average run time in seconds for BORB-CP
Table 10 Average run time in seconds for ORB-CP
Table 11 Runtime ranking of approaches based on the Scott-Knott test

7 Experimental Results

Tables 2, 3, 4 and 5 present the average G-Mean with standard deviation and A12 effect size against the dummy classifier for the BORB and ORB approaches with different base learners, using WP and CP data. Table 6 presents the corresponding Scott-Knott.BA12 ranking of BORB and ORB approaches with different base learners. Tables 7, 8, 9 and 10 present the average runtime. Table 11 presents the corresponding Scott-Knott.BA12 ranking. Section 7.1 focuses on the comparison between WP approaches to answer RQ1; Section 7.2 focuses on the comparison between CP and WP approaches to answer RQ2; and Section 7.3 focuses on the computational cost analysis of each approach to answer RQ3.

7.1 RQ1: Can offline learning help to improve predictive performance in online WP JIT-SDP scenarios? Which base learners usually perform best?

To answer RQ1, we compared the predictive performances of offline (BORB) and online (ORB) approaches with different base learners for WP data. As existing online WP JIT-SDP studies have never explored any other base learners except OHT, it is interesting to know whether using different base learners would improve the predictive performance not only of offline WP approaches, but also of online WP approaches.

From Table 6, we can see that offline WP approaches in general outperformed online WP approaches, being better ranked. In particular, BORB-MLP-WP, BORB-LR-WP and BORB-IRF-WP achieved similar predictive performance to each other, and better than the other BORB-WP and ORB-WP approaches. However, interestingly, when using the exact same base learner, NB for ORB and BORB, BORB-NB-WP did not outperform ORB-NB-WP. This suggests that BORB’s adaptive resampling mechanism is not necessarily better than ORB’s adaptive resampling mechanism, and that BORB’s ability to enable offline base learners to be adopted is the likely reason for the generally better results obtained by the offline WP approaches.

When comparing BORB-MLP-WP against ORB-MLP-WP and BORB-LR-WP against ORB-LR-WP, it thus appears that the single epoch used by the online base learners MLP and LR is the likely reason for the poorer predictive performance results obtained by ORB, rather than the different resampling mechanisms used by ORB and BORB. This single epoch resulted in predictive performance similar to or even worse than that of the dummy classifier, which is a very poor result.

When comparing BORB-IHF-WP and BORB-IRF-WP against ORB-OHT-WP, the offline IHF and IRF are also the likely reason for the better predictive performance achieved by BORB. Nevertheless, the rankings of these tree-based ORB and BORB approaches are not far from each other: BORB-IHF-WP and BORB-IRF-WP were ranked second and ORB-OHT-WP was ranked third. This confirms our hypothesis mentioned in Section 1 that OHT may be better suited for achieving good predictive performance in online JIT-SDP learning than MLP and LR.

Fig. 3 G-Mean for all datasets through time for best ranked BORB and ORB approaches with WP data

When comparing the offline approach BORB-MLP-WP (ranked 2nd when considering all approaches investigated in this paper) against the online approach ORB-OHT-WP (ranked 3rd), we can see from Tables 2 and 3 that the absolute improvements in predictive performance obtained by BORB-MLP-WP varied from 2.76% (for Nova) to 23.19% (for Spring-Integration), i.e., from small to moderate improvements, with a median of 6.36%. Both these approaches performed better than the dummy approach with large effect size [b], except for Spring-Integration, where ORB-OHT-WP performed worse than the dummy with large effect size [-b].

From Fig. 3, it is visible that the best 3 BORB-WP approaches (MLP, LR and IRF as shown in blue, orange and green, respectively) performed on average very similarly to each other, and comparatively better than the best ORB-WP approach (OHT as shown in black). Previous work (Tabassum et al. 2020) showed that WP approaches can suffer from low performance in the very beginning of a project, when there is a lack of sufficient data. This initial period of low performance can be seen in most datasets for ORB-WP. BORB-WP approaches also suffered in this initial period, but were sometimes able to improve the G-Mean during the initial period (e.g., for Npm and Neutron).

Apart from the initial period, some other large performance drops can be observed in Camel, Npm and Spring-Integration (Fig. 3c, h and i) for ORB-WP. For these datasets, all 3 BORB-WP approaches managed to maintain stable performance during the drop periods. Hence, BORB-WP with MLP, LR and IRF are the best options compared to ORB-WP when considering the predictive performance.

It is worth noting that, if JIT-SDP often had concept drifts affecting the relationship between input features and the label (clean or defect-inducing), retraining with all historical data as done by BORB-MLP-WP, BORB-LR-WP and BORB-IRF-WP would likely be detrimental to the predictive performance. This is because different portions of the training set would correspond to different relationships. The models would thus try to learn a mix of relationships, being unable to learn any of the individual relationships well enough. However, we have found in previous work (Tabassum et al. 2022; Cabral and Minku 2022) that, despite sometimes happening, changes in such relationship are much less common than changes in the values of the input features. Therefore, retraining with historical data multiple times may offer some benefits, as shown in this section.


7.2 RQ2: How beneficial is CP data to improve predictive performance of offline models compared to online models in online CP JIT-SDP scenarios?

Previous studies (Tabassum et al. 2020, 2022) investigated the use of CP data for online JIT-SDP and found that, with the use of CP data, online learners are exposed to more data and are able to improve the predictive performance of the JIT-SDP model compared to using only WP data. These studies also showed that CP data was helpful to maintain stable predictive performance during the periods when the WP models typically suffered sudden performance drops (Tabassum et al. 2022). From Section 7.1, we found that BORB’s offline learners outperformed ORB’s online learners with WP data. In particular, some offline models iterate over the training data several times, simulating the existence of a larger data set. It is unknown whether incorporating CP data with BORB would further improve predictive performance for offline models. Moreover, no other base learners were explored for online CP JIT-SDP except OHT. It is not known whether CP data would still be useful for JIT-SDP using other online base learners. Hence, it is important to investigate the use of CP data not only for offline models, but also for online models other than OHT.

Fig. 4 G-Mean for all datasets through time for best ranked BORB and ORB approaches with WP and CP data

To answer RQ2, we compare four groups of approaches: ORB-WP, ORB-CP, BORB-WP and BORB-CP. According to the Scott-Knott.BA12 test shown in Table 6, BORB-CP with MLP, LR and IRF are the best ranked approaches and outperformed all BORB-WP, ORB-CP and ORB-WP approaches. Even though BORB-IHF-CP is also a BORB-CP approach, it did not rank best (instead it ranked second). Although IHF is classified as an offline approach, it uses online Hoeffding Trees as base learners (Section 6.1). It is possible that using online Hoeffding Trees resulted in a weaker model for BORB.

When using the exact same base learner, NB, BORB-CP performed worse than ORB-CP. This corroborates the results presented in Section 7.1, suggesting that BORB’s offline resampling mechanism is not necessarily better than that of ORB, and that its ability to enable the use of offline base learners is the likely reason for the typically better predictive performance achieved by BORB-CP approaches.

We first compare BORB-CP against BORB-WP approaches. We can see from Table 6 that BORB-MLP-CP (ranked first) outperformed BORB-MLP-WP (ranked second). Similarly, BORB-LR-CP outperformed BORB-LR-WP, BORB-IRF-CP outperformed BORB-IRF-WP and BORB-IHF-CP outperformed BORB-IHF-WP. Only when using NB as the base learner did BORB-CP not outperform BORB-WP, but both of these approaches use online base models and performed worse than the dummy classifier, meaning that the comparison between the two of them is not necessarily meaningful in this specific context. Therefore, these results show that CP data is helpful to improve predictive performance when using offline learning for JIT-SDP. However, the magnitude of the improvements in predictive performance was not very large. For instance, the BORB-MLP-CP approach (ranked 1st) had absolute improvements in G-Mean from 0.45% (for Camel) to 8.11% (for Spring-Integration) with a median of 2.32% against BORB-MLP-WP (ranked 2nd), and led to slightly worse G-Mean for some projects, as can be computed based on Tables 4 and 2.

We also compare ORB-CP against ORB-WP approaches. We can see from Table 6 that ORB-OHT-CP outperformed ORB-OHT-WP, ORB-NB-CP outperformed ORB-NB-WP, ORB-LR-CP outperformed ORB-LR-WP and ORB-MLP-CP outperformed ORB-MLP-WP. Therefore, these results show that CP data is helpful to improve predictive performance when using online learning for JIT-SDP, for all base learners investigated. When comparing the best ORB-CP and ORB-WP approaches against each other (ORB-OHT-CP and ORB-OHT-WP), we can see that the absolute improvements in G-Mean varied from 0.06% (for JGroups) to 29.28% (for Spring-Integration) with a median of 6.59%. Therefore, CP data was more helpful for improving predictive performance in the context of online learning than offline learning.

This larger increase in the competitiveness of the ORB approach when using CP data is also reflected in the magnitude of the differences in performance between the best BORB-CP approach (e.g., BORB-MLP-CP) and the best ORB-CP approach (ORB-OHT-CP). Even though BORB-MLP-CP was ranked better than ORB-OHT-CP, the absolute improvements in G-Mean varied from 0.86% (for Neutron) to 3.68% (for JGroups), being always small. Moreover, for some projects, BORB-MLP-CP obtained slightly worse G-Mean than ORB-OHT-CP. Therefore, even though BORB-CP can achieve a better rank in terms of predictive performance than ORB-CP, the magnitudes of the differences in predictive performance are not large.

It is also worth noting that all best ranked BORB-CP approaches (BORB-MLP-CP, BORB-LR-CP and BORB-IRF-CP) outperformed the dummy classifier with large [b] effect size, for all projects. The best ranked ORB-CP approach (ORB-OHT-CP) also outperformed the dummy classifier with large [b] effect size for all projects. Therefore, the weakness of ORB-OHT-WP, which had performed worse than the dummy classifier for Spring-Integration, is overcome when using CP data.

From Fig. 4, it is also visible that the performance of the best BORB-CP and ORB-CP approaches was very similar. Both BORB-CP and ORB-CP were able to achieve better predictive performance during the initial portion of the data streams than BORB-WP and ORB-WP. For Spring-Integration, BORB-CP managed to provide stable performance by eliminating the drops suffered by BORB-WP (Fig. 4i). These results suggest that CP data can be useful for both BORB and ORB during the initial period of the project, and can help to reduce drops in predictive performance over time for ORB.


7.3 RQ3: How high is the computational cost of offline learning in online scenarios compared to that of online learning models?

An ideal JIT-SDP approach should not be too computationally expensive, as overly expensive approaches are not suitable for use in practice. Hence, when comparing offline and online JIT-SDP approaches, it is important to consider computational cost (run time) along with predictive performance. Typically, offline learners require multiple iterations over the training data, leading to higher computational cost compared to online learners. Offline CP approaches could be even more computationally expensive than offline WP approaches, as they require retraining with a larger amount of data from several projects. An analysis of computational cost is required to understand how high these computational costs may be and whether they are feasible for adoption in practice.

Figure 5a shows computational costs of BORB and ORB approaches for all datasets. We can see that offline (BORB) approaches have higher computational cost than their online (ORB) counterparts. This is also reflected by the Scott-Knott.BA12 tests shown in Table 11. For instance, ORB-NB-WP is better ranked than BORB-NB-WP, ORB-LR-WP is better ranked than BORB-LR-WP, ORB-MLP-WP is better ranked than BORB-MLP-WP. In particular, the better ranking obtained by ORB-NB-WP compared to BORB-NB-WP shows us that, even when using the exact same online base learner NB, ORB is still faster than BORB. The same is valid when comparing ORB-NB-CP against BORB-NB-CP. This means that the offline resampling and retraining process required by BORB is slower than ORB’s procedures.

Fig. 5 Computational Cost Analysis for BORB and ORB

Moreover, the offline base learners MLP, LR, IRF and IHF adopted by BORB are also themselves generally slower than their online base learner counterparts. This is revealed by the magnitude of the differences in computational cost between BORB and ORB when using these base models, which is usually larger than the differences between BORB-NB and ORB-NB, as we can see from Fig. 5. For instance, the magnitude of the difference in the computational cost between BORB-MLP-WP and ORB-MLP-WP is much larger than that between BORB-NB-WP and ORB-NB-WP. This is expected, as offline learning models frequently have to iterate through the dataset (or portions of it) several times to build the predictive model, whereas the online base learners require only one pass through the training examples.
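The gap between one-pass and multi-pass training can be made concrete by counting how many times training examples are visited. The sketch below is only an illustration under stated assumptions: the retraining schedule (`retrain_every`), the number of passes (`epochs`) and the stream length are hypothetical values, not BORB's actual configuration, and the offline learner is assumed to retrain on all examples seen so far.

```python
def online_example_visits(n_examples: int) -> int:
    """An online learner processes each training example exactly once."""
    return n_examples


def offline_example_visits(n_examples: int, retrain_every: int, epochs: int) -> int:
    """Visits for an offline learner retrained on all data seen so far.

    Hypothetical schedule: after every `retrain_every` new examples, the
    model is rebuilt with `epochs` passes over the whole history.
    """
    total = 0
    for seen in range(retrain_every, n_examples + 1, retrain_every):
        total += epochs * seen  # `epochs` full passes over `seen` examples
    return total


# Hypothetical stream: 1000 examples, retraining every 100, 10 passes each time.
online_visits = online_example_visits(1000)
offline_visits = offline_example_visits(1000, retrain_every=100, epochs=10)
print(online_visits, offline_visits)
```

Under these illustrative settings the offline learner visits examples 55 times more often than the online one, which is the kind of gap that shows up as a larger run-time difference in Fig. 5.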

We can also observe that all CP approaches have higher computational cost than their WP counterparts (note the different scale of the x-axis in Fig. 5a and b). This is also confirmed by the Scott-Knott.BA12 results shown in Table 11. For instance, ORB-OHT-CP has higher computational cost than ORB-OHT-WP, and BORB-IRF-CP has higher computational cost than BORB-IRF-WP. This is also expected, as CP training sets are much larger than WP ones.

Overall, this shows us that offline learning is in general slower than online learning, and that CP learning is in general slower than WP learning. In practice, one would be interested in adopting an approach that has low computational cost but high predictive performance. Therefore, we compared the computational cost of the offline and online models that obtained the best predictive performance.

Table 12 Computational Cost in Seconds for ORB-OHT-CP and BORB-MLP-CP

The best offline approaches in terms of predictive performance are BORB-MLP-CP, BORB-LR-CP and BORB-IRF-CP (Table 6). As they are all ranked the same in terms of predictive performance, we pick the one with the lowest computational cost for this comparison. As both BORB-LR-CP and BORB-IRF-CP have the same rank in terms of computational cost, which is better than that of BORB-MLP-CP (Table 11), we randomly pick BORB-LR-CP for this analysis. The best online approach in terms of predictive performance was ORB-OHT-CP (Table 6).

ORB-OHT-CP was ranked 9th in terms of computational cost, whereas BORB-LR-CP was ranked 10th. The differences in computational cost varied from 179.4 seconds (\(\approx 3\) minutes for Spring-Integration) to 2998.83 seconds (\(\approx 50\) minutes for Nova) in total, as we can see from Tables 10 and 9. In other words, ORB-OHT-CP was from 1.2 to 3.8 times faster. Such differences in computational cost may be particularly relevant when conducting experiments to choose an approach for adoption. To give an example, for the most time-consuming dataset, Nova, ORB-OHT-CP required 2986.95 seconds (\(\approx 50\) min) for a single run. As such experiments require multiple runs to lead to more reliable conclusions, one may opt for performing 30 runs, which would take \(\approx 25\) hours just for this dataset. When using BORB-LR-CP, this amount of time was approximately double. If a company is performing experiments with their projects to double check which approach would be better in their context, they would need to perform similar experiments with several of their past projects. If they can narrow down the set of approaches investigated to include only the less computationally expensive ones, this could lead to significant savings in computational costs. Moreover, in the future, if one proposes an approach to automatically tune hyperparameters over time, such an approach may also rely on running multiple models concurrently, again resulting in a non-negligible computational cost.
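The experiment-time estimate for Nova can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; it assumes that the reported difference for Nova is BORB-LR-CP's time minus ORB-OHT-CP's time, so that BORB-LR-CP's single-run time can be derived by adding the two. The variable names are illustrative.

```python
SECONDS_PER_HOUR = 3600.0

orb_single_run_s = 2986.95   # ORB-OHT-CP on Nova, single run (quoted in the text)
cost_difference_s = 2998.83  # cost difference for Nova (quoted in the text)
runs = 30                    # number of repeated runs

# Assumption: the difference is BORB-LR-CP minus ORB-OHT-CP.
borb_single_run_s = orb_single_run_s + cost_difference_s

orb_total_h = runs * orb_single_run_s / SECONDS_PER_HOUR
borb_total_h = runs * borb_single_run_s / SECONDS_PER_HOUR
slowdown = borb_single_run_s / orb_single_run_s

print(f"ORB-OHT-CP, {runs} runs: {orb_total_h:.1f} h")
print(f"BORB-LR-CP, {runs} runs: {borb_total_h:.1f} h")
print(f"BORB-LR-CP / ORB-OHT-CP: {slowdown:.2f}x")
```

Under this assumption the 30 runs take roughly 25 hours with ORB-OHT-CP and roughly twice that with BORB-LR-CP, consistent with the statement above.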

That said, even though these are considerable computational costs when conducting experimental studies to run, tune and compare multiple approaches, when a given approach is chosen for adoption and we consider the duration of the projects in practice (e.g., 10.7 years for Spring-Integration and 6.99 years for Nova), this translates into a difference of only \(\approx 0.05\) to \(\approx 1.18\) seconds per day. Therefore, both BORB-LR-CP and ORB-OHT-CP are feasible for adoption in practice in terms of their computational cost per day, as illustrated in Table 12.
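The per-day figures follow from dividing each total cost difference by the project duration in days. A minimal sketch of that conversion, assuming 365 days per year (the text does not state the convention used) and a hypothetical helper name `seconds_per_day`:

```python
DAYS_PER_YEAR = 365.0  # assumption: the day-count convention is not stated in the text


def seconds_per_day(total_seconds: float, project_years: float) -> float:
    """Spread a total run-time difference evenly over the project's lifetime."""
    return total_seconds / (project_years * DAYS_PER_YEAR)


# Total cost differences and project durations quoted in the text.
spring_integration = seconds_per_day(179.4, 10.7)
nova = seconds_per_day(2998.83, 6.99)

print(f"Spring-Integration: {spring_integration:.2f} s/day")
print(f"Nova: {nova:.2f} s/day")
```

With these inputs the helper gives about 0.05 seconds per day for Spring-Integration and about 1.18 seconds per day for Nova, matching the range reported above.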

Even though ORB-OHT-CP is ranked worse in terms of predictive performance than BORB-LR-CP, the magnitudes of the differences in predictive performance are not high. In particular, BORB-LR-CP’s predictive performance was better with absolute differences in G-Mean varying from 0.13% (for Broadleaf) to 2.93% (for Tomcat) with a median of just 1.96%. For one dataset, ORB-OHT-CP had higher G-Mean than BORB-LR-CP (Fabric8).

Similar results would have been achieved if we had compared BORB-MLP-CP against ORB-OHT-CP, but the magnitude of the differences in predictive performance would have been slightly larger (BORB-MLP-CP obtained absolute improvements of up to 3.68% over ORB-OHT-CP as shown in Section 7.2), and so would the differences in computational cost (ORB-OHT-CP runs up to 5.97 times faster than BORB-MLP-CP). BORB-MLP-CP's computational cost would also be feasible for adoption in practice, being up to 3.83 seconds per day.


8 Threats to Validity

Internal validity: poor hyperparameter choices can affect the predictive performance of machine learning models. To mitigate this threat, random search was performed on a set of possible values for the hyperparameters of each approach and base learner based on the first 3000 training examples of the data streams. It is worth noting that this leads to an overlap between the examples involved in tuning and the examples used for testing in the beginning of the data streams. Therefore, for all approaches, the predictive performance in the beginning of the data streams is likely an overestimation of the predictive performance that these approaches can achieve in practice. Verification latency was also taken into account for all approaches respecting the chronology of the software changes. Besides, each of the approaches with each dataset was executed 30 times to mitigate the threats to internal validity.

Construct validity: the evaluation metrics used in this study were G-Mean, Recall0 and Recall1. These are widely used metrics appropriate for class imbalance learning (Wang et al. 2018) and were computed prequentially considering a fading factor recommended in Gama (2013), that allows the model to give more importance to the most recent data.
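To illustrate how a fading factor makes prequential metrics emphasise recent data, the sketch below keeps faded per-class counts and derives Recall0, Recall1 and G-Mean from them. It is only a sketch under stated assumptions: the class name `FadedGMean`, the default fading factor of 0.99 and the choice to decay each class's counts only when an example of that class arrives are illustrative and may differ from the exact formulation used in this study.

```python
import math


class FadedGMean:
    """Prequential Recall0, Recall1 and G-Mean with a fading factor (sketch)."""

    def __init__(self, fading_factor: float = 0.99):  # 0.99 is a hypothetical value
        self.alpha = fading_factor
        self.correct = [0.0, 0.0]  # faded count of correct predictions per class
        self.seen = [0.0, 0.0]     # faded count of examples per class

    def update(self, y_true: int, y_pred: int) -> None:
        """Test-then-train step: record one prediction for a labelled example."""
        k = y_true
        hit = 1.0 if y_pred == y_true else 0.0
        # Older outcomes of class k are down-weighted by alpha at each update.
        self.correct[k] = self.alpha * self.correct[k] + hit
        self.seen[k] = self.alpha * self.seen[k] + 1.0

    def recall(self, k: int) -> float:
        return self.correct[k] / self.seen[k] if self.seen[k] > 0 else 0.0

    def gmean(self) -> float:
        return math.sqrt(self.recall(0) * self.recall(1))


# Toy stream of (true label, predicted label); 1 = defect-inducing, 0 = clean.
stream = [(0, 0), (0, 1), (1, 0), (1, 1)]
plain, faded = FadedGMean(fading_factor=1.0), FadedGMean(fading_factor=0.5)
for y_true, y_pred in stream:
    plain.update(y_true, y_pred)
    faded.update(y_true, y_pred)
print(plain.recall(1), faded.recall(1))
```

With a fading factor of 1.0 the metric reduces to ordinary recall (0.5 for class 1 here), whereas a smaller factor weights the recent correct prediction more heavily and yields a higher value.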

Statistical conclusion validity: the Scott-Knott test with non-parametric bootstrap sampling and the A12 effect size were used to address conclusion validity. This avoids concluding that approaches differ when the difference in predictive performance is very small, as indicated by an insignificant effect size.
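For readers unfamiliar with the A12 effect size, the sketch below computes the Vargha-Delaney A12 statistic for two samples of results: the probability that a value drawn from the first sample exceeds one drawn from the second, counting ties as half. The magnitude thresholds (0.56, 0.64, 0.71) are the commonly cited ones and are an assumption here, as are the function names and the toy G-Mean values; the Scott-Knott clustering and bootstrap steps are not shown.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: P(x > y) + 0.5 * P(x == y) over all pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))


def magnitude(a: float) -> str:
    """Label the effect size using commonly cited thresholds (an assumption)."""
    folded = max(a, 1.0 - a)  # treat A12 = 0.3 like A12 = 0.7
    if folded >= 0.71:
        return "large"
    if folded >= 0.64:
        return "medium"
    if folded >= 0.56:
        return "small"
    return "negligible"


# Toy G-Mean values from repeated runs of two hypothetical approaches.
approach_a = [0.66, 0.68, 0.70]
approach_b = [0.60, 0.62, 0.66]
effect = a12(approach_a, approach_b)
print(round(effect, 3), magnitude(effect))
```

An A12 of 0.5 means neither approach tends to produce larger values; values far from 0.5 indicate that one approach is better in most pairwise comparisons.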

External validity: this study used 10 open source projects from GitHub repositories. These projects are based on different programming languages and have different characteristics (e.g. number of commits per day, starting date, number of modified files). The results obtained by this study may not generalise for projects with different characteristics to those used in our study. The conclusions about offline learning drawn by this study are based on the proposed approach BORB. Other offline learning approaches may lead to different conclusions. Similarly, other offline base learners than the ones used in our study may also lead to different conclusions.

9 Conclusion

This study investigated whether offline JIT-SDP can offer any benefits in terms of predictive performance when applied to online JIT-SDP scenarios compared to online JIT-SDP approaches, and whether such benefits may come at the cost of higher computational requirements. For that, we proposed a new offline approach called BORB that can apply adaptive resampling to deal with class imbalance in JIT-SDP when applied to online JIT-SDP scenarios. The predictive performance and computational cost of this approach were compared against those of the existing online approach ORB when using various base models on 10 open source projects.

Overall, our experiments suggest that, if one is focused on achieving the best possible predictive performance, it is worth considering offline learning through BORB using CP data as a possible choice, as it obtained slightly better predictive performance than ORB approaches while having an acceptable computational cost. If one is interested in saving computational cost, we recommend considering ORB using CP data with OHT as a possible choice, as it obtained lower computational cost with just slightly worse predictive performance. Such savings in computational cost (and thus also in energy) may be relevant if multiple such approaches need to be run concurrently, e.g., when performing experimental studies to run, tune and choose among multiple models.

Future work can consider how to further improve predictive performance in JIT-SDP. In particular, the finding that offline models can be successfully applied to online JIT-SDP scenarios through BORB opens up the possibility of investigating other offline base learners such as deep learning approaches that may lead to even better predictive performance. However, even though the computational cost of the offline learning approaches adopted in the current study was feasible for adoption in practice, other more complex offline base learners such as deep learning may require a much higher computational cost, such that future work could also consider how to improve the computational cost of BORB. Future work could also evaluate BORB and ORB with additional projects. Finally, hyperparameter tuning in online JIT-SDP scenarios is still an open issue. Novel approaches for automatically tuning such hyperparameters are desirable and may benefit from faster JIT-SDP models such as ORB to be computationally feasible.