CN110188047B - Double-channel convolutional neural network-based repeated defect report detection method - Google Patents
Double-channel convolutional neural network-based repeated defect report detection method
- Publication number
- CN110188047B (application CN201910474540.6A)
- Authority
- CN
- China
- Prior art keywords
- defect
- defect report
- report
- channel
- predicted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to a double-channel convolutional neural network-based repeated defect report detection method, which comprises three steps: data preparation, CNN model establishment, and prediction of the defect report to be predicted. In data preparation, the fields useful for duplicate report detection are extracted from the defect reports; for each report, the structured information and unstructured information are put together into a single text document, and after preprocessing, each report represented as text is converted into a single-channel matrix. The single-channel matrices are combined into double-channel matrices, of which one part is used as a training set and the rest as a verification set. A CNN model is established and trained with the training set as input. In the prediction stage, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report, where the similarity is a probability representing how likely the pair is a duplicate. The method achieves high prediction accuracy.
Description
Technical Field
The invention relates to the technical field of software testing, in particular to a double-channel convolutional neural network-based repeated defect report detection method.
Background
Modern software projects use defect tracking systems such as Bugzilla to store and manage defect reports. Software developers, software testers and end users submit defect reports to describe the software problems they encounter. Defect reports help guide software maintenance and repair work. As software systems evolve, hundreds of defect reports are submitted every day. When more than one person submits a defect report describing the same bug, duplicate defect reports are generated. Since defect reports are written in natural language, the same bug is likely to be described in different forms.
Manually detecting duplicate defect reports is difficult because of the large number of defect reports. Furthermore, because defect reports are written in natural language, providing a standard template is not practical. Automatic detection of duplicate defect reports is therefore a meaningful task that can avoid repairing the same bug multiple times. Many automatic duplicate defect report detection techniques have been proposed in recent years to address this problem. These methods can be roughly divided into two directions: information retrieval and machine learning.
Information retrieval methods generally compute the textual similarity of two defect reports, i.e., they focus on computing similarity from the textual descriptions.
For example, Hiew built a model using the VSM (Vector Space Model), which represents a report as a vector with a TF-IDF (Term Frequency-Inverse Document Frequency) term weighting scheme. Building on the VSM, Runeson et al. were the first to use natural language processing techniques to detect duplicate defect reports. Wang et al. argued that natural language information alone cannot solve this problem well, so they also used execution information as a feature for duplicate report detection; however, this approach has significant limitations because only a small fraction of reports contain execution information. Sun et al. proposed REP, which uses not only the summary and description but also structured information such as product, component and version; to obtain higher text similarity they extended BM25F, an effective similarity measure from the information retrieval field. In addition to textual similarity and structured similarity, Alipour et al. also considered the effect of contextual information on duplicate report detection; they applied LDA to these features and achieved better results. Information retrieval-based methods perform well in both accuracy and time efficiency, but when the same problem is described with different terms, the results are unsatisfactory.
Machine learning methods extract latent features of reports through learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classical machine learning method. Jalbert et al. built a classification system with which duplicate reports can be filtered; they also argued that previous methods did not take full advantage of the various features in a defect report, and therefore used surface features, text semantics and graph clustering in their model. Building on the work of Jalbert et al., Tian et al. considered some new features and established a linear model; from the perspective of features and imbalanced data, they improved the accuracy of duplicate report detection. Sun et al. used SVM to develop a discriminative model that, for the first time, classified defect reports into duplicate and non-duplicate categories. Learning to rank (L2R) is another very useful machine learning method; based on it, Zhou et al. considered textual and statistical features and applied a stochastic gradient descent algorithm. This method performs better than conventional information retrieval methods such as VSM and BM25F. With the application of word embedding technology in the field of natural language processing, more and more researchers use it to detect duplicate reports. Budhiraja et al. used word embedding techniques to convert defect reports into vectors and then calculated their similarity. Experimental results show that this method has the potential to improve the accuracy of duplicate report detection.
Disclosure of Invention
The technical problem to be solved by the present invention is the problem of automatic detection of duplicate reports, which can be further decomposed into determining the relationship between two defect reports, i.e. whether a defect report pair consisting of two reports is duplicate or non-duplicate.
In order to achieve this purpose, the invention adopts the following technical scheme: a repeated defect report detection method based on a dual-channel convolutional neural network, comprising the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
S103: after preprocessing, combining all the words of the defect reports into a corpus, applying the existing Word2vec to the corpus with the CBOW model selected to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. the two-dimensional single-channel matrix of the defect report;
according to the known information given by the software defect tracking system when the defect reports of the software are extracted (the pairing information is contained in the data set and was produced by the creators of the data set), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports; the dual-channel matrix is then labelled as duplicate or non-duplicate;
dividing all the labelled dual-channel matrices into a training set and a verification set;
S200, establishing a CNN model
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f_1 denotes a nonlinear activation function; given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
the output of the first convolutional layer is then reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes a nonlinear activation function; after this convolution, three feature maps are obtained, where O_2 can be calculated from l (l = O_1) and the different kernel lengths d according to equation (2);
S203: performing max pooling on all feature maps;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i denotes the i-th element of T and b denotes a bias;
S205: traversing all defect report pairs in the training set and repeating S202-S204;
S206: performing back propagation to update the hidden parameters of the model according to the loss function given in equation (5), where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each training epoch, verifying the model with the verification set; when the loss on the verification set has not decreased for 5 epochs, stopping the update of the model parameters; otherwise, returning to S201 and continuing to train the CNN model;
S300: prediction of the defect report to be predicted
the defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
As an improvement, in S101, the structured information is product and component, and the unstructured information is summary and description.
As an improvement, at all layers except the last fully connected layer, ReLU is used as the activation function to extract more nonlinear features.
Compared with the prior art, the invention has at least the following advantages:
the invention provides a novel method DC-CNN for repeated defect report detection. It combines two defect reports represented by a single channel matrix into a defect report pair represented by a two channel matrix. This two-channel matrix is then input into the CNN model to extract the implicit features. The method provided by the invention is verified on Open Office, eclipse, net Beans and a Combined data set Combined thereof, and is compared with the most advanced repeated report detection method based on deep learning at present, the method provided by the invention is effective, and more importantly, the performance is better.
Drawings
Figure 1 is the general framework of the process of the invention.
Fig. 2 is a general process flow for establishing a CNN model.
FIG. 3 (a) is a ROC curve for DC-CNN and SC-CNN on Open office dataset, FIG. 3 (b) is a ROC curve for DC-CNN and SC-CNN on Eclipse dataset, FIG. 3 (c) is a ROC curve for DC-CNN and SC-CNN on Net Beans dataset, and FIG. 3 (d) is a ROC curve for DC-CNN and SC-CNN on Combined dataset.
FIG. 4 is an illustration of the effect of word vector dimensions.
FIG. 5 is an illustration of the impact of unstructured information.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 shows the overall framework of the inventive method DC-CNN, which comprises three stages: data preparation, CNN model establishment, and prediction of the defect report to be predicted. During the data preparation stage, the fields useful for duplicate report detection, including component, product, summary, and description, are extracted from each defect report. For each report, the structured information and unstructured information are put together in a text document. After preprocessing, the text of all defect reports is collected to form a corpus. Word2vec is used to extract the semantic regularities of the corpus, and each report represented by text is converted into a single-channel matrix. To determine the relationship between two reports, the single-channel matrices representing the defect reports are combined into a dual-channel matrix representing the defect report pair. One part of the pairs is then used as a training set and the rest as a verification set. In the training stage, the CNN model is trained using the training set as input. In the prediction stage, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report, where the similarity is a probability representing how likely the pair is a duplicate.
A repetitive defect report detection method based on a dual-channel convolutional neural network comprises the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
Structured information is typically an optional attribute, while unstructured information is typically a textual description of the bug.
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
The invention uses the StandardAnalyzer of Lucene to perform these preprocessing steps. When removing stop words, a standard English stop-word list is used. In addition, some words appear even in two unrelated defect reports; these are typically technical terms such as java, com and org, and because they occur so frequently they are also added to the stop-word list. After this processing, some meaningless numbers remain in the text, and they are removed as well.
S103: after preprocessing, the words of all defect reports are combined into a corpus; the existing Word2vec with the CBOW model selected is applied to the corpus to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. its two-dimensional single-channel matrix;
according to the known information given by the software defect tracking system when the defect reports are extracted (the pairing information is contained in the data set and was produced by the creators of the data set), a defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports; the dual-channel matrix is then labelled as duplicate or non-duplicate;
the use of a two-pass representation of the defect-reporting pair has the following benefits compared to a single pass. First, two reports may be processed simultaneously by the CNN. Thus the training speed is increased. Second, it has been demonstrated that training CNNs using dual channel data can achieve higher accuracy. For a two-pass CNN, it can capture the correlation between two defect reports by a convolution operation.
Dividing all the labeled dual-channel matrixes into a training set and a verification set; in specific implementation, 80% of the two-channel matrixes labeled with the labels are divided into a training set, and the remaining 20% of the two-channel matrixes labeled with the labels are a verification set.
S200, establishing a CNN model
In order to extract features from the defect report pairs, the invention sets convolution kernels of three different sizes at each convolutional layer. The first convolutional layer therefore has three branches, and for each of these three branches there are again three new branches at the second convolutional layer. Because the three branches are highly similar in structure, fig. 2 shows only one branch of the first convolutional layer in the overall structure of the CNN. Table 3 shows the specific parameter settings of the CNN model of the invention.
TABLE 3
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; because each row of the input matrix represents a word, the convolution kernel width equals the word vector dimension m; after the first convolution, the two channels of the dual-channel matrix are merged into one, so that the two defect reports can be treated as a whole when extracting features, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f denotes the nonlinear activation function (ReLU is used in the invention); given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
To further extract the related features of the two reports, the output of the first convolutional layer is reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes the nonlinear activation function (ReLU is used in the invention); after this convolution, three feature maps are obtained, where O_2 can be calculated from l (l = O_1) and the different kernel lengths d according to equation (2).
S203: performing max pooling on all feature maps, so that each feature map is down-sampled;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sim_predict is obtained using sigmoid as the activation function;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i represents the ith element of T and b represents an offset.
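Since the parameter table (Table 3) and the formula images are not reproduced in this text, the following Keras sketch only mirrors the structure described in S201–S204: a first convolution whose kernels span the full word-vector width and merge the two channels, a reshape, three kernel lengths per branch in the second convolution, max pooling, concatenation, a 300-unit fully connected layer and a sigmoid output. The kernel lengths (3, 4, 5) and the filter count are assumptions; the 300-unit layer and the sigmoid output follow the text.

```python
from tensorflow.keras import layers, Model

def build_dc_cnn(n_words=200, dim=20, kernel_lengths=(3, 4, 5), n_filters=64):
    # Input: one defect report pair as a two-channel matrix (n_words x dim x 2).
    inp = layers.Input(shape=(n_words, dim, 2))
    branches = []
    for d in kernel_lengths:                        # three kernel sizes -> three branches
        # First convolution: the kernel spans the full word-vector width (k_w = m),
        # so the two channels (the two reports) are merged into one feature map.
        x = layers.Conv2D(n_filters, (d, dim), activation="relu")(inp)
        # Reshape so a second convolution can mix the pair's features again.
        x = layers.Reshape((n_words - d + 1, n_filters, 1))(x)
        for d2 in kernel_lengths:                   # three new branches per branch
            y = layers.Conv2D(n_filters, (d2, n_filters), activation="relu")(x)
            y = layers.GlobalMaxPooling2D()(y)      # max pooling of each feature map
            branches.append(y)
    z = layers.Concatenate()(branches)              # concatenate all pooled features
    z = layers.Dense(300, activation="relu")(z)     # first fully connected layer (T)
    out = layers.Dense(1, activation="sigmoid")(z)  # sim_predict in [0, 1]
    return Model(inp, out)

model = build_dc_cnn()
model.summary()
```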
S205: all pairs of defect reports in the training set are traversed and S202-S204 are repeated.
S206: the hidden parameters of the model are updated by back propagation according to a loss function, which is as shown in formula (5):
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.
S207: after each epoch training is finished, verifying the model by using a verification set; when the loss of the verification set is not reduced within 5 epochs any more, stopping updating the model parameters; otherwise, returning to S201, and continuing to train the CNN model.
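A hedged sketch of the training procedure in S205–S207 follows, reusing the model from the sketch above. The loss of equation (5) is not reproduced in this text, so binary cross-entropy is assumed here as a common choice for a sigmoid output; the Adam optimizer, batch size and random placeholder data are illustrative, while the early stopping after 5 epochs without a decrease in validation loss follows S207.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Illustrative stand-ins for the real sets of labelled two-channel matrices
# (shape: n_pairs x n_words x dim x 2) and duplicate(1)/non-duplicate(0) labels.
X_train = np.random.rand(256, 200, 20, 2).astype("float32")
y_train = np.random.randint(0, 2, size=256)
X_val = np.random.rand(64, 200, 20, 2).astype("float32")
y_val = np.random.randint(0, 2, size=64)

model.compile(optimizer="adam",                   # optimizer choice is an assumption
              loss="binary_crossentropy",         # assumed form of the loss in (5)
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss",    # stop when validation loss
                           patience=5,            # has not decreased for 5 epochs
                           restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=64,              # illustrative values
          callbacks=[early_stop])
```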
S300: prediction of the defect report to be predicted
The defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
For example, if a piece of software currently has N defect reports, each processed defect report corresponds to a two-dimensional single-channel matrix. The two-dimensional single-channel matrix of the defect report to be predicted is combined with each of the N existing single-channel matrices to obtain N dual-channel matrices to be predicted, which are then input into the CNN model one by one to obtain N probabilities. When a probability is greater than the preset threshold, the defect report to be predicted and the existing defect report corresponding to that probability are considered duplicates.
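A sketch of this prediction step under the same assumptions, reusing report_matrix() and the trained model from the earlier sketches: the report to be predicted is paired with each of the N existing reports, each two-channel matrix is scored by the model, and pairs whose probability exceeds the threshold (0.5 here, as an assumption) are flagged as duplicates.

```python
import numpy as np

def find_duplicates(new_tokens, existing_token_lists, w2v, model,
                    n_words=200, threshold=0.5):
    """Return the indices of existing reports predicted to duplicate the new one."""
    new_mat = report_matrix(new_tokens, w2v, n_words)
    pairs = np.stack([np.stack([new_mat, report_matrix(toks, w2v, n_words)], axis=-1)
                      for toks in existing_token_lists])
    probs = model.predict(pairs).ravel()        # N probabilities (sim_predict)
    return [i for i, p in enumerate(probs) if p > threshold]

# Usage:
# dup_ids = find_duplicates(preprocess(new_report_text),
#                           [preprocess(t) for t in existing_report_texts],
#                           w2v, model)
```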
Test verification:
1. data set
For comparison, the present invention used the same dataset as that collected and processed by Lazar et al. It contains three large open source projects: open Office, eclipse and Net Beans. Open Office is Office software similar to Microsoft Office. Eclipse and Net Beans are open source integrated development environments. To perform the experiment with more training samples, a larger data set was obtained by combining the three data sets and named "Combined". These data sets also provide defect reporting pairings, some of which are shown in table 4.
Table 4: defect report pair
By analyzing all the pairs in each dataset, some problems were found. First, some pairs are repeated; for example, in Open Office, (200622, 197347, duplicate) appears 5 times. Second, some pairs represent the same relationship, such as (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. These defect report pairs are therefore removed. Table 5 shows the number of all pairs in the resulting dataset.
Table 5: complete data set
Dataset | Duplicate | Non-duplicate
---|---|---
OpenOffice | 57340 | 41751
Eclipse | 86385 | 160917
Net Beans | 95066 | 89988
Combined | 238791 | 292476
Each data set was divided into a training set and a test set, with the training set accounting for 80% (of which 10% was used as the validation set) and the test set accounting for 20%. In addition, so that the training and test sets reflect the original data distribution, the ratio of duplicate to non-duplicate report pairs in the training and test sets is kept the same as in the original dataset when the data is split. Both the training set and the test set are selected randomly. Table 6 shows the detailed distribution of defect report pairs in the training set and the test set.
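A minimal sketch of the described split (80% training, of which 10% is held out for validation, and 20% test), using scikit-learn's stratified splitting so that the duplicate/non-duplicate ratio matches the original dataset; the variable names and random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# X: array of two-channel matrices, y: 1 = duplicate, 0 = non-duplicate.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)       # 80/20, ratio preserved
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.10,              # 10% of the training part
    stratify=y_train_full, random_state=42)                  # held out for validation
```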
Table 6: training set and test set
Evaluation criteria
In the model proposed by the invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For further classification, a threshold is set. Once sim_predict has been obtained in the third stage, label_predict (the predicted label of a defect report pair) can be calculated as
label_predict = 1 if sim_predict > threshold, and label_predict = 0 otherwise.
According to label_predict and label_real, report pairs can be divided into four categories:
1)TP:label real =1,label predict =1
2)TN:label real =0,label predict =0
3)FP:label real =0,label predict =1
4)FN:label real =1,label predict =0
where 1 indicates that the report pair is duplicated and 0 indicates that the report pair is non-duplicated. TP represents the number of reporting pairs that are correctly predicted to be duplicated, TN represents the number of reporting pairs that are correctly predicted to be non-duplicated, FP represents the number of reporting pairs that are incorrectly predicted to be duplicated, and FN represents the number of reporting pairs that are incorrectly predicted to be non-duplicated. These four indices are the basis of the calculation of the following evaluation criteria.
Accuracy:
Accuracy represents the ratio of correctly predicted defect report pairs to all report pairs, i.e. the ability of the model to classify all defect report pairs correctly:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Since the sigmoid function is used to perform the regression, the threshold is set to 0.5 when calculating Accuracy, Recall and Precision.
Recall:
Recall represents the ratio of correctly predicted duplicate defect report pairs to all actually duplicate defect report pairs:
Recall = TP / (TP + FN)
Precision:
Precision represents the ratio of defect report pairs correctly predicted as duplicates to all report pairs predicted as duplicates:
Precision = TP / (TP + FP)
F1-Score:
F1-Score is the harmonic mean of Recall and Precision:
F1 = 2 × Precision × Recall / (Precision + Recall)
ROC curve:
In fact, conventional evaluation criteria such as Accuracy do not evaluate the performance of the classifier well, because the defect reports in the dataset are not evenly distributed over the classes. Therefore, the invention also employs the ROC curve to evaluate the performance of the classifier. Different TPR and FPR values are obtained for different thresholds, and the ROC curve is then drawn from them; TPR and FPR are calculated as
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
The ROC curve is obtained by taking the FPR values as the horizontal axis and the TPR values as the vertical axis; the closer the curve is to the upper left corner of the coordinate axes, the better the performance of the classifier.
Results of the experiment
The technical effect of the method of the present invention is demonstrated by answering several questions as follows.
Problem 1: Is the DC-CNN of the invention effective compared with the state-of-the-art deep learning-based duplicate defect report detection methods?
The research objective of the present invention is to propose a more efficient method based on deep learning. Thus, the method of the present invention was compared to the method of Deshmukh et al on the same data set.
Table 7: experimental results of the method of the invention and the method of Deshmukh et al
Results: Table 7 shows the experimental results of the method of the invention and the method of Deshmukh et al. Their two models share the same core method, a Siamese (twin) neural network, with which they build two similar models: a retrieval model and a classification model. For the classification model, the highest accuracy appears on the Open Office dataset, reaching 0.8275, while it is only 0.7268 on Eclipse. Their retrieval model performs better than the classification model; its best-performing dataset is still Open Office, with an accuracy of up to 0.9455, and Eclipse is slightly lower at 0.906. Compared with the classification model built with the Siamese neural network, the improvement of DC-CNN on Open Office, Eclipse, Net Beans and Combined is 11.54%, 24.17%, 17.89% and 13.33%, respectively. Compared with the retrieval model built with the Siamese neural network, DC-CNN improves by 6.25%, 4.07% and 3.84% on Eclipse, Net Beans and Combined, respectively; on Open Office, the accuracy of DC-CNN is lower by 0.03%.
Impact: according to Table 7, the performance of DC-CNN is higher on three datasets (Eclipse, Net Beans, Combined) than both the classification model and the retrieval model constructed by Deshmukh et al. with a Siamese neural network. On Open Office, the performance of DC-CNN is higher than their classification model and very close to their retrieval model. Overall, DC-CNN achieves very good performance and surpasses the state-of-the-art deep learning-based duplicate report detection methods.
Problem 2: Is DC-CNN effective compared with SC-CNN?
To show that the dual-channel matrix representation of the defect report pair proposed by the invention is effective, the single-channel matrix representation of the defect report is used as a comparison baseline. The structure of the CNN is kept unchanged, including the number of convolution kernels, the kernel sizes and the number of convolution layers; the CNN is used to extract the features of the two reports in a defect report pair separately, and the similarity of the two reports is then computed. This method is called Single-Channel Convolutional Neural Network (SC-CNN).
Table 8: DC-CNN and SC-CNN test results
Results: the performance of both methods was evaluated in terms of Accuracy, Recall, Precision and F1-Score, and the results are shown in Table 8, with the best results in bold. DC-CNN exceeds SC-CNN on all metrics and all datasets. Compared with SC-CNN, on Open Office, Eclipse, Net Beans and Combined, the Accuracy of DC-CNN improves by 2.78%, 2.61%, 1.36% and 2.33%, the Recall by 2.73%, 0.51%, 1.49% and 3.17%, the Precision by 2.08%, 6.53%, 1.20% and 2.08%, and the F1-Score by 2.40%, 3.53%, 1.35% and 2.62%, respectively. FIG. 3(a)-FIG. 3(d) show the ROC curves of the two methods. The curve of DC-CNN lies above that of SC-CNN on all datasets, indicating that DC-CNN has better classification performance even when the sample distribution is unbalanced.
Impact: all experimental results show that the CNN model using two channels is more effective than the single-channel one. For SC-CNN, each report is converted into a matrix and input into the CNN to extract features, the result being a feature vector; whether two reports are duplicates is then judged by computing the similarity of the two feature vectors. For DC-CNN, the two reports are combined into a dual-channel matrix, input into the CNN and convolved together; this extracts deep-level relations between the two reports and fully exploits the ability of CNNs to capture local features. Because the CNN model in DC-CNN focuses on extracting the correlation between two reports, it performs better at detecting duplicate reports.
Problem 3: how do the experimental results change when changing the word vector dimension?
The invention provides a novel defect report pair representation method, namely a dual-channel matrix. Therefore, the influence of the parameters related to the test results is also explored. For a two-channel matrix, the most likely parameter to change is the dimension of the word vector, since the number of words is fixed and the position of the two reports (which report is on the first channel and which report is on the second channel) is indistinguishable for CNN. To answer the question of how the experimental results change when changing the word vector dimension, the word vector dimension was gradually changed from 10 to 100 and the change of the experimental results on the Open Office dataset was observed.
As a result: as can be seen from fig. 4, as the word vector dimension is gradually increased, accuracy first increases and then shows a downward trend. When the word vector dimension is 20, the accuracy reaches the maximum, 94.29%.
Impact: as the word vector dimension increases from 10 to 20, Accuracy increases; as the dimension continues to increase, Accuracy decreases. The reason may be that a certain word vector dimension is already sufficient to characterize a word, and further increasing the dimension makes the representation of the word worse rather than better. Although Accuracy reaches its maximum when the word vector dimension equals 20, it is not much higher than for the other settings. On the one hand, increasing the word vector dimension brings a larger data storage cost; on the other hand, both word embedding and CNN model training become more complex. Therefore, in the method of the invention, 20 is the most appropriate word vector dimension.
Problem 4: Is the method proposed by the invention effective when no structured information is used?
Structured information such as product, component and version provides very useful clues for determining whether two reports are duplicates, and many methods use structured information as separate features to improve the accuracy of duplicate defect report detection. The unstructured information is typically a natural language description of the bug. For duplicate report detection, the CNN is mainly used to process unstructured text, and it performs well on long text. Unlike other methods, the invention places both the structured and the unstructured information as text in a single text document and then uses the CNN to extract their features. To answer Problem 4, the structured information is removed from the input and a comparative experiment is set up without changing any other conditions.
Results: as can be seen from FIG. 5, the experimental results on all datasets decrease after removing the structured information; the decrease is 1.74%, 3.79%, 3.38% and 2.56% on Open Office, Eclipse, Net Beans and Combined, respectively.
Impact: the experimental results show that inputting structured information and unstructured information together into the CNN is effective. Note that after the structured information is removed, accuracy drops, but the drop is not fatal; the reason is that the structured information occupies only a small part of the whole text, and the main part of what the CNN processes is still the unstructured information.
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.
Claims (3)
1. A repeated defect report detection method based on a dual-channel convolutional neural network, characterized by comprising the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
S103: after preprocessing, combining the words of all defect reports into a corpus, applying the existing Word2vec to the corpus with the CBOW model selected to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. the two-dimensional single-channel matrix of the defect report;
according to the known information given by the software defect tracking system when the defect reports of the software are extracted, representing a defect report pair consisting of two defect reports by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports, and then labelling the dual-channel matrix as duplicate or non-duplicate;
dividing all the labelled dual-channel matrices into a training set and a verification set;
S200, establishing a CNN model
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f_1 denotes a nonlinear activation function; given the input length l, l = n_w, padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
the output of the first convolutional layer is then reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes a nonlinear activation function; after this convolution, three feature maps are obtained, where O_2 can be calculated from l and the different convolution kernel lengths d according to equation (2), where l = O_1;
S203: performing max pooling on all feature maps;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i denotes the i-th element of T and b denotes a bias;
S205: traversing all defect report pairs in the training set and repeating S202-S204;
S206: performing back propagation to update the hidden parameters of the model according to the loss function given in equation (5), where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each training epoch, verifying the model with the verification set; when the loss on the verification set has not decreased for 5 epochs, stopping the update of the model parameters; otherwise, returning to S201 and continuing to train the CNN model;
S300: prediction of the defect report to be predicted
the defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
2. The repetitive defect report detection method based on a two-channel convolutional neural network as claimed in claim 1, characterized in that: in the S101, the structured information is product and component, and the unstructured information is summary and description.
3. The repetitive defect report detection method based on a two-channel convolutional neural network as claimed in claim 1, characterized in that: at all layers except the last fully connected layer, ReLU is used as an activation function to extract more nonlinear features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910474540.6A CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910474540.6A CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188047A CN110188047A (en) | 2019-08-30 |
CN110188047B true CN110188047B (en) | 2023-04-18 |
Family
ID=67719718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910474540.6A Active CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188047B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111177010B (en) * | 2019-12-31 | 2023-12-15 | 杭州电子科技大学 | Software defect severity identification method |
CN111737107B (en) * | 2020-05-15 | 2021-10-26 | 南京航空航天大学 | Repeated defect report detection method based on heterogeneous information network |
CN112328469B (en) * | 2020-10-22 | 2022-03-18 | 南京航空航天大学 | Function level defect positioning method based on embedding technology |
CN112631898A (en) * | 2020-12-09 | 2021-04-09 | 南京理工大学 | Software defect prediction method based on CNN-SVM |
CN113379685A (en) * | 2021-05-26 | 2021-09-10 | 广东炬森智能装备有限公司 | PCB defect detection method and device based on dual-channel feature comparison model |
CN113362305A (en) * | 2021-06-03 | 2021-09-07 | 河南中烟工业有限责任公司 | Smoke box strip missing mixed brand detection system and method based on artificial intelligence |
CN113486176B (en) * | 2021-07-08 | 2022-11-04 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113379746B (en) * | 2021-08-16 | 2021-11-02 | 深圳荣耀智能机器有限公司 | Image detection method, device, system, computing equipment and readable storage medium |
CN113791897B (en) * | 2021-08-23 | 2022-09-06 | 湖北省农村信用社联合社网络信息中心 | Method and system for displaying server baseline detection report of rural telecommunication system |
US20230367967A1 (en) * | 2022-05-16 | 2023-11-16 | Jpmorgan Chase Bank, N.A. | System and method for interpreting stuctured and unstructured content to facilitate tailored transactions |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970666A (en) * | 2014-05-29 | 2014-08-06 | 重庆大学 | Method for detecting repeated software defect reports |
CN106250311A (en) * | 2016-07-27 | 2016-12-21 | 成都启力慧源科技有限公司 | Repeated defects based on LDA model report detection method |
CN108491835A (en) * | 2018-06-12 | 2018-09-04 | 常州大学 | Binary channels convolutional neural networks towards human facial expression recognition |
CN108563556A (en) * | 2018-01-10 | 2018-09-21 | 江苏工程职业技术学院 | Software defect prediction optimization method based on differential evolution algorithm |
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN109376092A (en) * | 2018-11-26 | 2019-02-22 | 扬州大学 | A kind of software defect reason automatic analysis method of facing defects patch code |
CN109491914A (en) * | 2018-11-09 | 2019-03-19 | 大连海事大学 | Defect report prediction technique is influenced based on uneven learning strategy height |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9219767B2 (en) * | 2006-06-22 | 2015-12-22 | Linkedin Corporation | Recording and indicating preferences |
US20170212829A1 (en) * | 2016-01-21 | 2017-07-27 | American Software Safety Reliability Company | Deep Learning Source Code Analyzer and Repairer |
-
2019
- 2019-06-20 CN CN201910474540.6A patent/CN110188047B/en active Active
Non-Patent Citations (2)
Title |
---|
Reliability evaluation of embedded software in command automation systems; Gong Yan et al.; The 13th Annual Conference of the Reliability Branch of the Chinese Institute of Electronics; 2006-12-31; pp. 376-381 *
Application of improved word vector features and CNN in sentence classification; Miao Haoran et al.; The 14th National Conference on Man-Machine Speech Communication; 2017-12-31; pp. 1-6 *
Also Published As
Publication number | Publication date |
---|---|
CN110188047A (en) | 2019-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188047B (en) | Double-channel convolutional neural network-based repeated defect report detection method | |
CN112214610B (en) | Entity relationship joint extraction method based on span and knowledge enhancement | |
CN110110062B (en) | Machine intelligent question and answer method and device and electronic equipment | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
JP2021504789A (en) | ESG-based corporate evaluation execution device and its operation method | |
CN109710744B (en) | Data matching method, device, equipment and storage medium | |
CN112711953A (en) | Text multi-label classification method and system based on attention mechanism and GCN | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN110097096B (en) | Text classification method based on TF-IDF matrix and capsule network | |
CN111177010B (en) | Software defect severity identification method | |
CN116992007B (en) | Limiting question-answering system based on question intention understanding | |
CN112115716A (en) | Service discovery method, system and equipment based on multi-dimensional word vector context matching | |
CN110992988B (en) | Speech emotion recognition method and device based on domain confrontation | |
CN109800309A (en) | Classroom Discourse genre classification methods and device | |
CN110347833B (en) | Classification method for multi-round conversations | |
CN104657574A (en) | Building method and device for medical diagnosis models | |
CN112036705A (en) | Quality inspection result data acquisition method, device and equipment | |
CN118113849A (en) | Information consultation service system and method based on big data | |
CN112489689B (en) | Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure | |
CN105894032A (en) | Method of extracting effective features based on sample properties | |
CN116050419B (en) | Unsupervised identification method and system oriented to scientific literature knowledge entity | |
CN112488188A (en) | Feature selection method based on deep reinforcement learning | |
CN116450848B (en) | Method, device and medium for evaluating computing thinking level based on event map | |
CN116522912A (en) | Training method, device, medium and equipment for package design language model | |
KR102418239B1 (en) | Patent analysis apparatus for finding technology sustainability |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |