Based on our empirical results, this section presents insights that can benefit future research on DBRD.
6.1 Age/State Bias in an Alternative ITS
As mentioned in Section 4.1 (Experimental Setup), to investigate the age bias (old vs. recent data) and the state bias (initial vs. latest state), we ran experiments on the BRs from Bugzilla. However, it remains unknown whether age bias and state bias exist in other ITSs. Here, we investigate both biases on data from an ITS other than Bugzilla.
Age bias in an ITS other than Bugzilla: In RQ1, the experimental results demonstrate a significant performance difference for research tools between the old and recent data from Bugzilla. Besides Bugzilla, we also conducted experiments on the old data from Jira. In RQ1, for Jira, we selected Hadoop and Spark. However, since the first issue in the Spark project was created on December 18, 2013, there are not enough old BRs to investigate the age bias. The same limitation applies to all projects selected in our work that use the GitHub ITS: the first issue in the VSCode project was created on October 14, 2015, and the first issue in the Kibana project was created on February 6, 2013. Therefore, we are only able to conduct experiments on the old data of Hadoop, which uses Jira as its ITS. Following the same procedure as in RQ1, we evaluate the three tools REP, Siamese Pair, and SABD on the Hadoop old dataset (which contains BRs submitted between 2012 and 2014). Table 11 shows the statistical test results comparing the performance of these tools on Hadoop's old and recent data. According to the p-values, age bias is significant in most cases in Hadoop.
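As an illustration, here is a minimal sketch of such a significance test, assuming a Wilcoxon rank-sum test over Recall Rate@10 scores collected from repeated runs; the scores below are hypothetical, and the exact test and metric in our setup may differ:

```python
# Sketch: test whether a tool's performance differs significantly
# between old and recent data (illustrative numbers only).
from scipy.stats import ranksums

# RR@10 of one tool over five runs on old vs. recent Hadoop data.
rr_old = [0.41, 0.43, 0.40, 0.42, 0.44]
rr_recent = [0.52, 0.55, 0.53, 0.51, 0.54]

stat, p_value = ranksums(rr_old, rr_recent)
print(f"p-value = {p_value:.4f}")  # p < 0.05 => age bias is significant
```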
State bias in an ITS other than Bugzilla: In RQ1, the experimental results show that there is no significant difference between tools evaluated on the initial states and on the latest states of BRs. Besides Bugzilla, we leverage the datasets shared by Montgomery et al. [37] and recover the states of issues in Jira (i.e., Hadoop and Spark) to the end of the submission day. We then perform the same experiments as in RQ1 for state bias and report the results in Table 11. As Table 11 shows, the state bias is also insignificant in most cases in Hadoop and Spark, which use the Jira ITS. Note that since the issue change history in GitHub can be deleted, the saved history may be incomplete. Thus, we did not investigate the state bias in GitHub.
6.3 Failure Analysis
To understand why DBRD approaches fail to detect some duplicate BRs, we investigated the three best performers, i.e., REP, SABD, and Siamese Pair. We selected the largest project from each of the ITSs, i.e., Mozilla, Hadoop, and VSCode. We conducted the following steps to understand the causes of the DBRD failures:
(1) Firstly, we collected the BRs whose duplicates were not successfully detected (within the top-10 positions) in any of the five runs by each approach on each dataset (a minimal sketch of this filtering step follows this list).
(2) Secondly, we sampled 50 of these BRs, on which all three approaches failed across the three datasets.
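Here is a minimal sketch of the filtering in step (1), assuming `run_hits` holds, per run, the set of query BR ids whose duplicate was retrieved within the top-10 positions (a hypothetical data layout):

```python
# Sketch: keep only the BRs that failed in every one of the five runs.
def consistently_failed(run_hits, all_queries):
    """BRs whose duplicate was never retrieved in the top 10 in any run."""
    failed_per_run = [all_queries - hits for hits in run_hits]
    return set.intersection(*failed_per_run)

# e.g., run_hits = [hits_run1, ..., hits_run5], each a set of BR ids.
```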
We identified three causes of failed duplicate detection and describe them as follows.
(1) Limited or incomplete description. When the description is short, it does not provide enough context to understand the issue. Issue reporters may attach screenshots or other supporting materials to the issue and consequently neglect to write a detailed description. We also found that some descriptions contain many URLs with only limited textual information. One such example is Bug 1668483 from the Mozilla project. The description of this bug is full of long explicit URLs, which makes it hard for models to understand the real content of the issue. Furthermore, we found that issue reporters may break the BR description into multiple parts: they can spread a BR description over several comments, which were not considered by our work. One such example is Bug 1641043 from the Mozilla dataset. The issue reporter actually described the issue in two consecutive parts, but only the first block is called the "description", while the second block is considered a "comment". A complete version of the description may help DBRD approaches detect duplicates.
(2) Inability of the current approaches to understand the different ingredients in BR descriptions. Since the current DBRD approaches treat the textual information as unstructured, they cannot extract useful information from the description. The useful information contained in the description may describe the failure, steps to reproduce, system information, and so on, and is usually arranged in a structured way. Aside from the natural language description, there can also be (1) code snippets, (2) logs, and (3) backtraces inside the description. A more reasonable approach may extract the different types of information separately (a minimal parsing sketch is shown after this item). One such example is the duplicate BR pair from the VSCode dataset, i.e., issue #105446 and issue #110999. Both issues contain system information, steps to reproduce, and screenshots. However, since the information in the descriptions was not considered separately, it may be challenging for models to understand the failures. Other than textual information, a BR can also contain images or screen recordings in the description. However, the approaches evaluated in our work only consider the textual information in BR descriptions. Sometimes, the screenshots show similar information. For instance, consider the duplicate issues #108908 and #107104 in the VSCode repository. Issue #108908 describes the bug as "overlap" and issue #107104 describes the bug as "loads them twice", which do not look like duplicates based on the text alone. However, based on the screenshots in both issues, we can see that these two issues refer to the same bug. An approach that can handle screenshots and screen recordings would be helpful in these cases.
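As a rough illustration of such separate handling, the following sketch splits Markdown-style ingredients of a GitHub issue body with simple regular expressions; the patterns are illustrative only, and real BRs would need more robust parsing:

```python
# Sketch: separate code blocks, embedded images, and URLs from the
# plain text of an issue body before feeding it to a DBRD model.
import re

CODE_FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)  # fenced code blocks
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")          # Markdown images
URL = re.compile(r"https?://\S+")                    # bare URLs

def split_ingredients(body):
    """Return the ingredients of an issue body as separate fields."""
    code_blocks = CODE_FENCE.findall(body)
    text = CODE_FENCE.sub(" ", body)
    images = IMAGE.findall(text)
    text = IMAGE.sub(" ", text)
    urls = URL.findall(text)
    text = URL.sub(" ", text)
    return {"text": text.strip(), "code": code_blocks,
            "images": images, "urls": urls}
```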
(3) Different failures with the same underlying fault. As indicated by Runeson et al. [46], there are two types of duplicates: (1) those that describe the same failure and (2) those that describe two different failures with the same underlying fault. We also encountered difficult cases in which both BRs described the issue correctly, but they described two different failures sharing the same underlying fault. Since the current approaches are based on the similarity of BRs, it is challenging for them to detect the second type of duplicate. One such example is the duplicate BR pair from the Hadoop dataset, i.e., HBASE-24609 and HBASE-24608. The two BRs describe two different objects, i.e., MetaTableAccessor and CatalogAccessor. Even for developers with some experience on Hadoop projects, it may be difficult to recognize that they are duplicates.
6.4 Lessons Learned
Age bias and ITS bias should be considered for DBRD, and even for other tasks that involve BRs. We show that two kinds of bias (age and ITS) affect the performance of DBRD techniques. These biases must be considered when designing and evaluating future DBRD techniques. Furthermore, we believe that any task involving BRs, e.g., bug localization [29, 35, 39], bug severity prediction [10, 12, 34], bug triage [49, 59], and so on, should also give due consideration to these biases. When evaluating an approach, it would be better to consider the diversity of the ITSs.
Using FTS and REP as baselines for evaluating DBRD approaches. We observe that FTS, although simple, outperforms many other DBRD approaches on most projects (all except Mozilla). REP, although proposed a decade ago, is the overall best performer. Thus, we suggest that future research include these simpler techniques as baselines. Future state-of-the-art approaches need to demonstrate superior performance over these simpler techniques.
Choose your weapon: projects with a medium to low volume of historical BRs may not benefit from deep learning-based tools. The two best-performing tools are REP and SABD. SABD is deep learning-based, while REP is not. Comparing the performance of both tools in Figure 5, we can see that their performance is similar for the projects with the largest number of BRs (Mozilla and VSCode). However, there is a clear performance gap for the other projects (although they still contain thousands of BRs as training data). This suggests that the applicability of deep learning-based solutions may be limited to very large ITSs with tens of thousands of BRs submitted over a period of a few years (considering the age bias and the data drift phenomenon [47]). For most ITSs, non-deep learning-based approaches may perform better. Note that, in our experiments, we did not use all the historical data for training, since our findings in RQ1 show that there is a significant difference when applying a DBRD approach to old data and recent data. Besides, the old and recent data carry different characteristics, e.g., the number of BRs, so the predictions of models trained on past data may become less accurate on recent data [62]. In addition, training with more data takes longer and is more computationally expensive.
As shown in Table 9, the number of BRs in different projects varies considerably (i.e., the numbers of training and validation pairs differ). The size of the training data might therefore be a confounding factor. To understand whether our findings still hold with the same number of training and validation pairs, we investigated the impact of the data size. We adopted the pair generation strategy used by SABD [43]: the positive pairs are all combinations of the BRs that belong to the same bucket, while the negative pairs are randomly generated by pairing a BR from one bucket with a BR from another bucket. Since the number of positive pairs is fixed, we generated the same number of negative pairs. For each bias and each dataset, we sampled the same number of training and validation pairs. For instance, when working on age bias in the Eclipse dataset, since the old dataset has 8,668 BR pairs while the recent dataset contains 3,342 BR pairs, we sampled 3,342 BR pairs from the old dataset (i.e., downsampling to the minority side). The numbers of training and validation pairs are reported in Table 12. We then conducted the same experiment as RQ1 with the sampled pairs, as presented in Table 12. The p-values regarding each bias lead to the same conclusions in all cases; the detailed results can be found in our replication package. We find that the main message of this article is still valid even when we work with the same number of training and validation pairs. Please note that the experimental data for state bias do not differ in size (i.e., the numbers of BRs for the before and after states are the same), so we excluded state bias from this additional analysis. A sketch of the pair generation and downsampling procedure follows.
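The sketch below assumes each bucket is a list of duplicate BR ids; the bucket layout and the fixed seed are illustrative:

```python
# Sketch: SABD-style pair generation plus downsampling to the minority side.
import itertools
import random

def generate_pairs(buckets, seed=42):
    """Positives: all within-bucket combinations; negatives: equal count, cross-bucket."""
    rng = random.Random(seed)
    positives = [p for bucket in buckets
                 for p in itertools.combinations(bucket, 2)]
    negatives = set()
    while len(negatives) < len(positives):
        i, j = rng.sample(range(len(buckets)), 2)
        negatives.add((rng.choice(buckets[i]), rng.choice(buckets[j])))
    return positives, sorted(negatives)

def downsample(pairs, n, seed=42):
    """E.g., sample 3,342 pairs from the old Eclipse dataset to match the recent one."""
    return random.Random(seed).sample(pairs, n)
```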
Future research approaches should compare with industry tools. Researchers have largely ignored the comparison of DBRD techniques with industry tools. We conducted experiments on both FTS and VSCodeBot. Our experiments showed that FTS and VSCodeBot can outperform many research tools. While we have highlighted the need for evaluation against industry tools in the context of DBRD, we believe our suggestion also applies to other software engineering tasks. Researchers should investigate whether alternative tools are used in practice to solve the same or similar pain points and compare the performance of research tools with those "de facto" tools.
Efficiency matters for the pre-submission DBRD scenario. In the post-submission scenario, the DBRD technique has the liberty of time to predict duplicates, but this is not the case in the pre-submission scenario. The DBRD response time varies depending on the number of BRs in the ITS. The JIT duplicate recommendation used by Bugzilla, i.e., FTS, works faster than most research tools, as it only queries the summary field of existing BRs. In usability engineering, a response time of over 1 second is considered to interrupt a user's flow of thought [38]. Given that users can perceive a delay difference of 100 milliseconds [15, 38], some DBRD approaches, which take over 10 seconds to predict potential duplicates, do not seem to meet this requirement. We report the seconds per prediction spent by each approach (a minimal measurement sketch is shown below). The experiments were run on a machine with an Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (252 GB memory) and 4 GeForce RTX 2080Ti GPUs (11 GB each); only one GPU was utilized when running a single deep learning model. From Table 13, we can see that the time each approach needs to make a prediction differs across projects. Generally speaking, among the research tools, REP and Siamese Pair are faster than the remaining approaches, and in the largest project, Mozilla, the run-time difference is more pronounced: REP is \(3.56\times\) faster than Siamese Pair, \(50.52\times\) faster than SABD, and \(66.56\times\) faster than DC-CNN. For the tools in practice, since VSCodeBot is not open-sourced, we cannot measure its run time in the pre-submission scenario. For FTS, we can see that it is faster than most research tools.
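A minimal sketch of how seconds per prediction can be measured, assuming a hypothetical `model.rank(query_br, candidates)` interface:

```python
# Sketch: average wall-clock time to rank candidates for one incoming BR.
import time

def seconds_per_prediction(model, query_brs, candidates):
    start = time.perf_counter()
    for br in query_brs:
        model.rank(br, candidates)  # hypothetical interface
    return (time.perf_counter() - start) / len(query_brs)
```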
We suggest two potential topics for future DBRD research: (1) investigating the acceptable delay for the pre-submission DBRD scenario and (2) optimizing DBRD response time. To reduce the prediction time, future research can also consider reducing the search space. For example, instead of including all the BRs submitted within the one-year time window as candidates, future approaches can first apply a time-efficient technique, such as BM25, to filter out the BRs with a low chance of being duplicates; after that, expensive deep learning-based models can be used on the remaining candidates (see the sketch below).
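A minimal sketch of this two-stage idea, using the `rank_bm25` package for the first stage; `deep_model.score` is a hypothetical interface for the second stage:

```python
# Sketch: BM25 prunes the one-year candidate window, then an expensive
# model reranks only the survivors.
from rank_bm25 import BM25Okapi

def two_stage_rank(query_br, candidate_brs, deep_model, keep=100):
    # Stage 1: BM25 over summaries selects the most lexically similar BRs.
    corpus = [br["summary"].lower().split() for br in candidate_brs]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(query_br["summary"].lower().split())
    shortlist = sorted(range(len(candidate_brs)),
                       key=lambda i: scores[i], reverse=True)[:keep]
    # Stage 2: the deep learning model reranks only the shortlist.
    return sorted(shortlist,
                  key=lambda i: deep_model.score(query_br, candidate_brs[i]),
                  reverse=True)
```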
Comments should be considered. Based on our failure analysis, we found that comments in BRs may be helpful, especially when the issue reporter splits the description of a BR into several parts. In the post-submission scenario, leveraging comments can provide additional information for DBRD tools to represent a BR; a minimal sketch follows.
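The sketch below folds reporter comments back into the description; the field names of the BR record are illustrative assumptions:

```python
# Sketch: build a fuller textual representation of a BR by appending
# follow-up comments written by the original reporter.
def full_description(br):
    parts = [br["description"]]
    parts += [c["text"] for c in br.get("comments", [])
              if c["author"] == br["reporter"]]
    return "\n".join(parts)
```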
Different ingredients in the description should be handled separately. Although Bugzilla and Jira have dedicated fields for categorical information, we found that the description itself can be arranged in a structured way: it can contain steps to reproduce, expected behavior, observed behavior, and so on. Issue reporters also often include images or videos inside the description. An approach that can understand the different kinds of content in the description would be beneficial. This is particularly true for GitHub, where information such as system information, extensions used, and steps to reproduce is usually included in the description, so an approach that can extract all the useful information from the description would be more effective.
Other resources in the project can be considered to further improve DBRD accuracy. The current DBRD approaches are designed to detect BRs with similar contents. If future approaches aim to tackle duplicates that report different failures, we suggest they consider other resources in the project, such as the code base, to understand the relationship between different failures with the same root cause.