CN110188047B - Double-channel convolutional neural network-based repeated defect report detection method - Google Patents
Double-channel convolutional neural network-based repeated defect report detection method
- Publication number
- CN110188047B (application CN201910474540.6A)
- Authority
- CN
- China
- Prior art keywords
- defect
- defect report
- report
- channel
- predicted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention relates to a double-channel convolutional neural network-based repeated defect report detection method, which comprises three steps: data preparation, CNN model establishment, and prediction of the defect report to be predicted. In data preparation, the fields useful for duplicate report detection are extracted from the defect reports; for each report, the structured information and unstructured information are put together into a single text document, and after preprocessing, each report represented as text is converted into a single-channel matrix. The single-channel matrices are combined into double-channel matrices, of which one part is used as a training set and the rest as a verification set. A CNN model is established and trained with the training set as input. In the prediction stage, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report, where the similarity is a probability representing how likely the pair is a duplicate. The method achieves high prediction accuracy.
Description
Technical Field
The invention relates to the technical field of software testing, in particular to a double-channel convolutional neural network-based repeated defect report detection method.
Background
Modern software projects use defect tracking systems such as Bugzilla to store and manage defect reports. Software developers, software testers and end users submit defect reports to describe the software problems they encounter. Defect reports help guide software maintenance and repair work. As software systems evolve, hundreds of defect reports are submitted every day. When more than one person submits a defect report describing the same bug, duplicate defect reports are generated. Since defect reports are written in natural language, the same bug is likely to be described in different forms.
Manually detecting duplicate defect reports is difficult because of the large number of defect reports. Furthermore, because defect reports are written in natural language, providing a standard template is not practical. Automatic detection of duplicate defect reports is therefore a meaningful task that can avoid repairing the same bug multiple times. Many automatic duplicate defect report detection techniques have been proposed in recent years to address this problem. These methods can be roughly divided into two directions: information retrieval and machine learning.
Information retrieval methods generally compute the textual similarity of two defect reports, i.e., they focus on computing similarity from the textual descriptions.
For example, Hiew built a model using the VSM (Vector Space Model), which represents a report as a vector with a TF-IDF (Term Frequency-Inverse Document Frequency) term weighting scheme. Building on the VSM, Runeson et al. were the first to use natural language processing techniques to detect duplicate defect reports. Wang et al. argued that natural language information alone cannot solve this problem well, so they also used execution information as a feature for duplicate report detection; however, this approach has significant limitations because only a small fraction of reports contain execution information. Sun et al. proposed REP, which uses not only the summary and description but also structured information such as product, component and version; to obtain higher text similarity they extended BM25F, an effective similarity measure from the information retrieval field. In addition to textual similarity and structured similarity, Alipour et al. also considered the effect of contextual information on duplicate report detection; they applied LDA to these features and achieved better results. Information retrieval-based methods perform well in both accuracy and time efficiency, but when the same problem is described with different terms, the results are unsatisfactory.
Machine learning methods extract latent features of reports through learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classical machine learning method. Jalbert et al. built a classification system with which duplicate reports can be filtered; they also argued that previous methods did not take full advantage of the various features in a defect report, and therefore used surface features, text semantics and graph clustering in their model. Building on the work of Jalbert et al., Tian et al. considered some new features and established a linear model; from the perspective of features and imbalanced data, they improved the accuracy of duplicate report detection. Sun et al. used SVM to develop a discriminative model that, for the first time, classified defect reports into duplicate and non-duplicate categories. Learning to rank (L2R) is another very useful machine learning method; based on it, Zhou et al. considered textual and statistical features and applied a stochastic gradient descent algorithm. This method performs better than conventional information retrieval methods such as VSM and BM25F. With the application of word embedding technology in the field of natural language processing, more and more researchers use it to detect duplicate reports. Budhiraja et al. used word embedding techniques to convert defect reports into vectors and then calculated their similarity. Experimental results show that this method has the potential to improve the accuracy of duplicate report detection.
Disclosure of Invention
The technical problem to be solved by the present invention is the problem of automatic detection of duplicate reports, which can be further decomposed into determining the relationship between two defect reports, i.e. whether a defect report pair consisting of two reports is duplicate or non-duplicate.
In order to achieve this purpose, the invention adopts the following technical scheme: a repeated defect report detection method based on a dual-channel convolutional neural network, comprising the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
S103: after preprocessing, combining all the words of the defect reports into a corpus, applying the existing Word2vec to the corpus with the CBOW model selected to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. the two-dimensional single-channel matrix of the defect report;
according to the known information given by the software defect tracking system when the defect reports of the software are extracted (the pairing information is contained in the data set and was produced by the creators of the data set), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports; the dual-channel matrix is then labelled as duplicate or non-duplicate;
dividing all the labelled dual-channel matrices into a training set and a verification set;
S200, establishing a CNN model
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f_1 denotes a nonlinear activation function; given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
the output of the first convolutional layer is then reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes a nonlinear activation function; after this convolution, three feature maps are obtained, where O_2 can be calculated from l (l = O_1) and the different kernel lengths d according to equation (2);
S203: performing max pooling on all feature maps;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i denotes the i-th element of T and b denotes a bias;
S205: traversing all defect report pairs in the training set and repeating S202-S204;
S206: performing back propagation to update the hidden parameters of the model according to the loss function given in equation (5), where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each training epoch, verifying the model with the verification set; when the loss on the verification set has not decreased for 5 epochs, stopping the update of the model parameters; otherwise, returning to S201 and continuing to train the CNN model;
S300: prediction of the defect report to be predicted
the defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
As an improvement, in S101, the structured information is product and component, and the unstructured information is summary and description.
As an improvement, at all layers except the last fully connected layer, ReLU is used as the activation function to extract more nonlinear features.
Compared with the prior art, the invention has at least the following advantages:
the invention provides a novel method DC-CNN for repeated defect report detection. It combines two defect reports represented by a single channel matrix into a defect report pair represented by a two channel matrix. This two-channel matrix is then input into the CNN model to extract the implicit features. The method provided by the invention is verified on Open Office, eclipse, net Beans and a Combined data set Combined thereof, and is compared with the most advanced repeated report detection method based on deep learning at present, the method provided by the invention is effective, and more importantly, the performance is better.
Drawings
Figure 1 is the general framework of the process of the invention.
Fig. 2 is a general process flow for establishing a CNN model.
FIG. 3 (a) is a ROC curve for DC-CNN and SC-CNN on Open office dataset, FIG. 3 (b) is a ROC curve for DC-CNN and SC-CNN on Eclipse dataset, FIG. 3 (c) is a ROC curve for DC-CNN and SC-CNN on Net Beans dataset, and FIG. 3 (d) is a ROC curve for DC-CNN and SC-CNN on Combined dataset.
FIG. 4 is an illustration of the effect of word vector dimensions.
FIG. 5 is an illustration of the impact of unstructured information.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 shows the overall framework of the inventive method DC-CNN, which comprises three stages: data preparation, CNN model establishment, and prediction of the defect report to be predicted. During the data preparation stage, the fields useful for duplicate report detection, including component, product, summary, and description, are extracted from each defect report. For each report, the structured information and unstructured information are put together in a text document. After preprocessing, the text of all defect reports is collected to form a corpus. Word2vec is used to extract the semantic regularities of the corpus, and each report represented by text is converted into a single-channel matrix. To determine the relationship between two reports, the single-channel matrices representing the defect reports are combined into a dual-channel matrix representing the defect report pair. One part of the pairs is then used as a training set and the rest as a verification set. In the training stage, the CNN model is trained using the training set as input. In the prediction stage, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report, where the similarity is a probability representing how likely the pair is a duplicate.
A repetitive defect report detection method based on a dual-channel convolutional neural network comprises the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
Structured information is typically an optional attribute, while unstructured information is typically a textual description of the bug.
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
The invention uses the StandardAnalyzer of Lucene to perform these preprocessing steps. When removing stop words, a standard English stop-word list is used. In addition, some words appear even in two unrelated defect reports; these are typically technical terms such as java, com and org, and because they occur so frequently they are also added to the stop-word list. After this processing, some meaningless numbers remain in the text, and they are removed as well.
S103: after preprocessing, the words of all defect reports are combined into a corpus; the existing Word2vec with the CBOW model selected is applied to the corpus to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. its two-dimensional single-channel matrix;
according to the known information given by the software defect tracking system when the defect reports are extracted (the pairing information is contained in the data set and was produced by the creators of the data set), a defect report pair formed by two defect reports is represented by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports; the dual-channel matrix is then labelled as duplicate or non-duplicate;
the use of a two-pass representation of the defect-reporting pair has the following benefits compared to a single pass. First, two reports may be processed simultaneously by the CNN. Thus the training speed is increased. Second, it has been demonstrated that training CNNs using dual channel data can achieve higher accuracy. For a two-pass CNN, it can capture the correlation between two defect reports by a convolution operation.
Dividing all the labeled dual-channel matrixes into a training set and a verification set; in specific implementation, 80% of the two-channel matrixes labeled with the labels are divided into a training set, and the remaining 20% of the two-channel matrixes labeled with the labels are a verification set.
S200, establishing a CNN model
In order to extract features from the defect report pairs, the invention sets convolution kernels of three different sizes at each convolutional layer. The first convolutional layer therefore has three branches, and for each of these three branches there are again three new branches at the second convolutional layer. Because the three branches are highly similar in structure, fig. 2 shows only one branch of the first convolutional layer in the overall structure of the CNN. Table 3 shows the specific parameter settings of the CNN model of the invention.
TABLE 3
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; because each row of the input matrix represents a word, the convolution kernel width equals the word vector dimension m; after the first convolution, the two channels of the dual-channel matrix are merged into one, so that the two defect reports can be treated as a whole when extracting features, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f denotes the nonlinear activation function (ReLU is used in the invention); given the input length l (l = n_w), padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
To further extract the related features of the two reports, the output of the first convolutional layer is reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes the nonlinear activation function (ReLU is used in the invention); after this convolution, three feature maps are obtained, where O_2 can be calculated from l (l = O_1) and the different kernel lengths d according to equation (2).
S203: performing max pooling on all feature maps, so that each feature map is down-sampled;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sim_predict is obtained using sigmoid as the activation function;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i represents the ith element of T and b represents an offset.
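Since the parameter table (Table 3) and the formula images are not reproduced in this text, the following Keras sketch only mirrors the structure described in S201–S204: a first convolution whose kernels span the full word-vector width and merge the two channels, a reshape, three kernel lengths per branch in the second convolution, max pooling, concatenation, a 300-unit fully connected layer and a sigmoid output. The kernel lengths (3, 4, 5) and the filter count are assumptions; the 300-unit layer and the sigmoid output follow the text.

```python
from tensorflow.keras import layers, Model

def build_dc_cnn(n_words=200, dim=20, kernel_lengths=(3, 4, 5), n_filters=64):
    # Input: one defect report pair as a two-channel matrix (n_words x dim x 2).
    inp = layers.Input(shape=(n_words, dim, 2))
    branches = []
    for d in kernel_lengths:                        # three kernel sizes -> three branches
        # First convolution: the kernel spans the full word-vector width (k_w = m),
        # so the two channels (the two reports) are merged into one feature map.
        x = layers.Conv2D(n_filters, (d, dim), activation="relu")(inp)
        # Reshape so a second convolution can mix the pair's features again.
        x = layers.Reshape((n_words - d + 1, n_filters, 1))(x)
        for d2 in kernel_lengths:                   # three new branches per branch
            y = layers.Conv2D(n_filters, (d2, n_filters), activation="relu")(x)
            y = layers.GlobalMaxPooling2D()(y)      # max pooling of each feature map
            branches.append(y)
    z = layers.Concatenate()(branches)              # concatenate all pooled features
    z = layers.Dense(300, activation="relu")(z)     # first fully connected layer (T)
    out = layers.Dense(1, activation="sigmoid")(z)  # sim_predict in [0, 1]
    return Model(inp, out)

model = build_dc_cnn()
model.summary()
```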
S205: all pairs of defect reports in the training set are traversed and S202-S204 are repeated.
S206: the hidden parameters of the model are updated by back propagation according to a loss function, which is as shown in formula (5):
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.
S207: after each epoch training is finished, verifying the model by using a verification set; when the loss of the verification set is not reduced within 5 epochs any more, stopping updating the model parameters; otherwise, returning to S201, and continuing to train the CNN model.
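A hedged sketch of the training procedure in S205–S207 follows, reusing the model from the sketch above. The loss of equation (5) is not reproduced in this text, so binary cross-entropy is assumed here as a common choice for a sigmoid output; the Adam optimizer, batch size and random placeholder data are illustrative, while the early stopping after 5 epochs without a decrease in validation loss follows S207.

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping

# Illustrative stand-ins for the real sets of labelled two-channel matrices
# (shape: n_pairs x n_words x dim x 2) and duplicate(1)/non-duplicate(0) labels.
X_train = np.random.rand(256, 200, 20, 2).astype("float32")
y_train = np.random.randint(0, 2, size=256)
X_val = np.random.rand(64, 200, 20, 2).astype("float32")
y_val = np.random.randint(0, 2, size=64)

model.compile(optimizer="adam",                   # optimizer choice is an assumption
              loss="binary_crossentropy",         # assumed form of the loss in (5)
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss",    # stop when validation loss
                           patience=5,            # has not decreased for 5 epochs
                           restore_best_weights=True)

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=100, batch_size=64,              # illustrative values
          callbacks=[early_stop])
```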
S300: prediction of the defect report to be predicted
The defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
For example, if a piece of software currently has N defect reports, each processed defect report corresponds to a two-dimensional single-channel matrix. The two-dimensional single-channel matrix of the defect report to be predicted is combined with each of the N existing single-channel matrices to obtain N dual-channel matrices to be predicted, which are then input into the CNN model one by one to obtain N probabilities. When a probability is greater than the preset threshold, the defect report to be predicted and the existing defect report corresponding to that probability are considered duplicates.
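A sketch of this prediction step under the same assumptions, reusing report_matrix() and the trained model from the earlier sketches: the report to be predicted is paired with each of the N existing reports, each two-channel matrix is scored by the model, and pairs whose probability exceeds the threshold (0.5 here, as an assumption) are flagged as duplicates.

```python
import numpy as np

def find_duplicates(new_tokens, existing_token_lists, w2v, model,
                    n_words=200, threshold=0.5):
    """Return the indices of existing reports predicted to duplicate the new one."""
    new_mat = report_matrix(new_tokens, w2v, n_words)
    pairs = np.stack([np.stack([new_mat, report_matrix(toks, w2v, n_words)], axis=-1)
                      for toks in existing_token_lists])
    probs = model.predict(pairs).ravel()        # N probabilities (sim_predict)
    return [i for i, p in enumerate(probs) if p > threshold]

# Usage:
# dup_ids = find_duplicates(preprocess(new_report_text),
#                           [preprocess(t) for t in existing_report_texts],
#                           w2v, model)
```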
Test verification:
1. data set
For comparison, the present invention used the same dataset as that collected and processed by Lazar et al. It contains three large open source projects: open Office, eclipse and Net Beans. Open Office is Office software similar to Microsoft Office. Eclipse and Net Beans are open source integrated development environments. To perform the experiment with more training samples, a larger data set was obtained by combining the three data sets and named "Combined". These data sets also provide defect reporting pairings, some of which are shown in table 4.
Table 4: defect report pair
By analyzing all the pairs in each dataset, some problems were found. First, some pairs are repeated; for example, in Open Office, (200622, 197347, duplicate) appears 5 times. Second, some pairs represent the same relationship, such as (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. These defect report pairs are therefore removed. Table 5 shows the number of all pairs in the resulting dataset.
Table 5: complete data set
Dataset | Duplicate | Non-duplicate
---|---|---
OpenOffice | 57340 | 41751
Eclipse | 86385 | 160917
Net Beans | 95066 | 89988
Combined | 238791 | 292476
Each data set was divided into a training set and a test set, with the training set accounting for 80% (of which 10% was used as the validation set) and the test set accounting for 20%. In addition, so that the training and test sets reflect the original data distribution, the ratio of duplicate to non-duplicate report pairs in the training and test sets is kept the same as in the original dataset when the data is split. Both the training set and the test set are selected randomly. Table 6 shows the detailed distribution of defect report pairs in the training set and the test set.
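A minimal sketch of the described split (80% training, of which 10% is held out for validation, and 20% test), using scikit-learn's stratified splitting so that the duplicate/non-duplicate ratio matches the original dataset; the variable names and random seed are illustrative.

```python
from sklearn.model_selection import train_test_split

# X: array of two-channel matrices, y: 1 = duplicate, 0 = non-duplicate.
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)       # 80/20, ratio preserved
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.10,              # 10% of the training part
    stratify=y_train_full, random_state=42)                  # held out for validation
```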
Table 6: training set and test set
Evaluation criteria
In the model proposed by the invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For further classification, a threshold is set. Once sim_predict has been obtained in the third stage, label_predict (the predicted label of a defect report pair) can be calculated as
label_predict = 1 if sim_predict > threshold, and label_predict = 0 otherwise.
According to label_predict and label_real, report pairs can be divided into four categories:
1)TP:label real =1,label predict =1
2)TN:label real =0,label predict =0
3)FP:label real =0,label predict =1
4)FN:label real =1,label predict =0
where 1 indicates that the report pair is duplicated and 0 indicates that the report pair is non-duplicated. TP represents the number of reporting pairs that are correctly predicted to be duplicated, TN represents the number of reporting pairs that are correctly predicted to be non-duplicated, FP represents the number of reporting pairs that are incorrectly predicted to be duplicated, and FN represents the number of reporting pairs that are incorrectly predicted to be non-duplicated. These four indices are the basis of the calculation of the following evaluation criteria.
Accuracy:
Accuracy represents the ratio of correctly predicted defect report pairs to all report pairs, i.e. the ability of the model to classify all defect report pairs correctly:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Since the sigmoid function is used to perform the regression, the threshold is set to 0.5 when calculating Accuracy, Recall and Precision.
Recall:
Recall represents the ratio of correctly predicted duplicate defect report pairs to all actually duplicate defect report pairs:
Recall = TP / (TP + FN)
Precision:
Precision represents the ratio of defect report pairs correctly predicted as duplicates to all report pairs predicted as duplicates:
Precision = TP / (TP + FP)
F1-Score:
F1-Score is the harmonic mean of Recall and Precision:
F1 = 2 × Precision × Recall / (Precision + Recall)
ROC curve:
In fact, conventional evaluation criteria such as Accuracy do not evaluate the performance of the classifier well, because the defect reports in the dataset are not evenly distributed over the classes. Therefore, the invention also employs the ROC curve to evaluate the performance of the classifier. Different TPR and FPR values are obtained for different thresholds, and the ROC curve is then drawn from them; TPR and FPR are calculated as
TPR = TP / (TP + FN),  FPR = FP / (FP + TN)
The ROC curve is obtained by taking the FPR values as the horizontal axis and the TPR values as the vertical axis; the closer the curve is to the upper left corner of the coordinate axes, the better the performance of the classifier.
Results of the experiment
The technical effect of the method of the present invention is demonstrated by answering several questions as follows.
Problem 1: Is the DC-CNN of the invention effective compared with the state-of-the-art deep learning-based duplicate defect report detection methods?
The research objective of the present invention is to propose a more efficient method based on deep learning. Thus, the method of the present invention was compared to the method of Deshmukh et al on the same data set.
Table 7: experimental results of the method of the invention and the method of Deshmukh et al
Results: Table 7 shows the experimental results of the method of the invention and the method of Deshmukh et al. Their two models share the same core method, a Siamese (twin) neural network, with which they build two similar models: a retrieval model and a classification model. For the classification model, the highest accuracy appears on the Open Office dataset, reaching 0.8275, while it is only 0.7268 on Eclipse. Their retrieval model performs better than the classification model; its best-performing dataset is still Open Office, with an accuracy of up to 0.9455, and Eclipse is slightly lower at 0.906. Compared with the classification model built with the Siamese neural network, the improvement of DC-CNN on Open Office, Eclipse, Net Beans and Combined is 11.54%, 24.17%, 17.89% and 13.33%, respectively. Compared with the retrieval model built with the Siamese neural network, DC-CNN improves by 6.25%, 4.07% and 3.84% on Eclipse, Net Beans and Combined, respectively; on Open Office, the accuracy of DC-CNN is lower by 0.03%.
Impact: according to Table 7, the performance of DC-CNN is higher on three datasets (Eclipse, Net Beans, Combined) than both the classification model and the retrieval model constructed by Deshmukh et al. with a Siamese neural network. On Open Office, the performance of DC-CNN is higher than their classification model and very close to their retrieval model. Overall, DC-CNN achieves very good performance and surpasses the state-of-the-art deep learning-based duplicate report detection methods.
Problem 2: Is DC-CNN effective compared with SC-CNN?
To show that the dual-channel matrix representation of the defect report pair proposed by the invention is effective, the single-channel matrix representation of the defect report is used as a comparison baseline. The structure of the CNN is kept unchanged, including the number of convolution kernels, the kernel sizes and the number of convolution layers; the CNN is used to extract the features of the two reports in a defect report pair separately, and the similarity of the two reports is then computed. This method is called Single-Channel Convolutional Neural Network (SC-CNN).
Table 8: DC-CNN and SC-CNN test results
Results: the performance of both methods was evaluated in terms of Accuracy, Recall, Precision and F1-Score, and the results are shown in Table 8, with the best results in bold. DC-CNN exceeds SC-CNN on all metrics and all datasets. Compared with SC-CNN, on Open Office, Eclipse, Net Beans and Combined, the Accuracy of DC-CNN improves by 2.78%, 2.61%, 1.36% and 2.33%, the Recall by 2.73%, 0.51%, 1.49% and 3.17%, the Precision by 2.08%, 6.53%, 1.20% and 2.08%, and the F1-Score by 2.40%, 3.53%, 1.35% and 2.62%, respectively. FIG. 3(a)-FIG. 3(d) show the ROC curves of the two methods. The curve of DC-CNN lies above that of SC-CNN on all datasets, indicating that DC-CNN has better classification performance even when the sample distribution is unbalanced.
Impact: all experimental results show that the CNN model using two channels is more effective than the single-channel one. For SC-CNN, each report is converted into a matrix and input into the CNN to extract features, the result being a feature vector; whether two reports are duplicates is then judged by computing the similarity of the two feature vectors. For DC-CNN, the two reports are combined into a dual-channel matrix, input into the CNN and convolved together; this extracts deep-level relations between the two reports and fully exploits the ability of CNNs to capture local features. Because the CNN model in DC-CNN focuses on extracting the correlation between two reports, it performs better at detecting duplicate reports.
Problem 3: how do the experimental results change when changing the word vector dimension?
The invention provides a novel defect report pair representation method, namely a dual-channel matrix. Therefore, the influence of the parameters related to the test results is also explored. For a two-channel matrix, the most likely parameter to change is the dimension of the word vector, since the number of words is fixed and the position of the two reports (which report is on the first channel and which report is on the second channel) is indistinguishable for CNN. To answer the question of how the experimental results change when changing the word vector dimension, the word vector dimension was gradually changed from 10 to 100 and the change of the experimental results on the Open Office dataset was observed.
As a result: as can be seen from fig. 4, as the word vector dimension is gradually increased, accuracy first increases and then shows a downward trend. When the word vector dimension is 20, the accuracy reaches the maximum, 94.29%.
Impact: as the word vector dimension increases from 10 to 20, Accuracy increases; as the dimension continues to increase, Accuracy decreases. The reason may be that a certain word vector dimension is already sufficient to characterize a word, and further increasing the dimension makes the representation of the word worse rather than better. Although Accuracy reaches its maximum when the word vector dimension equals 20, it is not much higher than for the other settings. On the one hand, increasing the word vector dimension brings a larger data storage cost; on the other hand, both word embedding and CNN model training become more complex. Therefore, in the method of the invention, 20 is the most appropriate word vector dimension.
Problem 4: Is the method proposed by the invention effective when no structured information is used?
Structured information such as product, component and version provides very useful clues for determining whether two reports are duplicates, and many methods use structured information as separate features to improve the accuracy of duplicate defect report detection. The unstructured information is typically a natural language description of the bug. For duplicate report detection, the CNN is mainly used to process unstructured text, and it performs well on long text. Unlike other methods, the invention places both the structured and the unstructured information as text in a single text document and then uses the CNN to extract their features. To answer Problem 4, the structured information is removed from the input and a comparative experiment is set up without changing any other conditions.
Results: as can be seen from FIG. 5, the experimental results on all datasets decrease after removing the structured information; the decrease is 1.74%, 3.79%, 3.38% and 2.56% on Open Office, Eclipse, Net Beans and Combined, respectively.
Impact: the experimental results show that inputting structured information and unstructured information together into the CNN is effective. Note that after the structured information is removed, accuracy drops, but the drop is not fatal; the reason is that the structured information occupies only a small part of the whole text, and the main part of what the CNN processes is still the unstructured information.
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.
Claims (3)
1. A repeated defect report detection method based on a dual-channel convolutional neural network, characterized by comprising the following steps:
S100 data preparation
S101: extracting the defect reports of the software, where every defect report consists of structured information and unstructured information; for each defect report, all the structured information and unstructured information are put into a single text document;
S102: for each defect report, carrying out preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
S103: after preprocessing, combining the words of all defect reports into a corpus, applying the existing Word2vec to the corpus with the CBOW model selected to obtain a vector representation of each word, thereby obtaining a two-dimensional matrix representation of each defect report, i.e. the two-dimensional single-channel matrix of the defect report;
according to the known information given by the software defect tracking system when the defect reports of the software are extracted, representing a defect report pair consisting of two defect reports by a two-dimensional dual-channel matrix, which is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports, and then labelling the dual-channel matrix as duplicate or non-duplicate;
dividing all the labelled dual-channel matrices into a training set and a verification set;
S200, establishing a CNN model
S201: inputting all the dual-channel matrices of the training set and the verification set into the CNN model;
S202: in the first convolutional layer, convolution kernels of size d × k_w are set, where d is the length of a convolution kernel and k_w is its width; after the first convolution, the two channels of the dual-channel matrix are merged into one, and the first-layer convolution is given by equation (1), where C_1 denotes the output of the first convolutional layer, I_1^i denotes channel i of the first-layer input I_1, j_1 denotes the j_1-th row of the input, b_1 denotes a bias, and f_1 denotes a nonlinear activation function; given the input length l, l = n_w, padding P = 0 and stride S = 1, the output length O_1 can be calculated as
O_1 = (l − d + 2P)/S + 1 = l − d + 1  (2)
the output of the first convolutional layer is then reshaped and convolved again; in the second convolutional layer, convolution kernels of three sizes are set, and the second-layer convolution is given by equation (3), where C_2 denotes the output of the second convolutional layer, j_2 denotes the j_2-th row of the second-layer input I_2, b_2 denotes a bias, and f_2 denotes a nonlinear activation function; after this convolution, three feature maps are obtained, where O_2 can be calculated from l and the different convolution kernel lengths d according to equation (2), where l = O_1;
S203: performing max pooling on all feature maps;
S204: reshaping and concatenating all feature maps into a single vector, which is input to the fully connected layers;
after two fully connected layers, a single probability sim_predict is obtained, which represents the similarity of the two reports being predicted;
at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, …, x_300} of the first fully connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )  (4)
where i denotes the i-th element of T and b denotes a bias;
S205: traversing all defect report pairs in the training set and repeating S202-S204;
S206: performing back propagation to update the hidden parameters of the model according to the loss function given in equation (5), where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each training epoch, verifying the model with the verification set; when the loss on the verification set has not decreased for 5 epochs, stopping the update of the model parameters; otherwise, returning to S201 and continuing to train the CNN model;
S300: prediction of the defect report to be predicted
the defect report to be predicted is first preprocessed with the method of S102 and then converted into a two-dimensional single-channel matrix with the method of S103;
the two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with the two-dimensional single-channel matrices of the N existing defect reports of the software to obtain N dual-channel matrices to be predicted, which form a prediction set; each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
2. The repetitive defect report detection method based on a two-channel convolutional neural network as claimed in claim 1, characterized in that: in the S101, the structured information is product and component, and the unstructured information is summary and description.
3. The repetitive defect report detection method based on a two-channel convolutional neural network as claimed in claim 1, characterized in that: at all layers except the last fully connected layer, ReLU is used as an activation function to extract more nonlinear features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910474540.6A CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910474540.6A CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110188047A CN110188047A (en) | 2019-08-30 |
CN110188047B true CN110188047B (en) | 2023-04-18 |
Family
ID=67719718
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910474540.6A Active CN110188047B (en) | 2019-06-20 | 2019-06-20 | Double-channel convolutional neural network-based repeated defect report detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110188047B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111177010B (en) * | 2019-12-31 | 2023-12-15 | 杭州电子科技大学 | Software defect severity identification method |
CN111737107B (en) * | 2020-05-15 | 2021-10-26 | 南京航空航天大学 | Repeated defect report detection method based on heterogeneous information network |
CN112328469B (en) * | 2020-10-22 | 2022-03-18 | 南京航空航天大学 | Function level defect positioning method based on embedding technology |
CN112631898A (en) * | 2020-12-09 | 2021-04-09 | 南京理工大学 | Software defect prediction method based on CNN-SVM |
CN113379685A (en) * | 2021-05-26 | 2021-09-10 | 广东炬森智能装备有限公司 | PCB defect detection method and device based on dual-channel feature comparison model |
CN113362305A (en) * | 2021-06-03 | 2021-09-07 | 河南中烟工业有限责任公司 | Smoke box strip missing mixed brand detection system and method based on artificial intelligence |
CN113486176B (en) * | 2021-07-08 | 2022-11-04 | 桂林电子科技大学 | News classification method based on secondary feature amplification |
CN113379746B (en) * | 2021-08-16 | 2021-11-02 | 深圳荣耀智能机器有限公司 | Image detection method, device, system, computing equipment and readable storage medium |
CN113791897B (en) * | 2021-08-23 | 2022-09-06 | 湖北省农村信用社联合社网络信息中心 | Method and system for displaying server baseline detection report of rural telecommunication system |
US20230367967A1 (en) * | 2022-05-16 | 2023-11-16 | Jpmorgan Chase Bank, N.A. | System and method for interpreting stuctured and unstructured content to facilitate tailored transactions |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970666A (en) * | 2014-05-29 | 2014-08-06 | 重庆大学 | Method for detecting repeated software defect reports |
CN106250311A (en) * | 2016-07-27 | 2016-12-21 | 成都启力慧源科技有限公司 | Repeated defects based on LDA model report detection method |
CN108491835A (en) * | 2018-06-12 | 2018-09-04 | 常州大学 | Binary channels convolutional neural networks towards human facial expression recognition |
CN108563556A (en) * | 2018-01-10 | 2018-09-21 | 江苏工程职业技术学院 | Software defect prediction optimization method based on differential evolution algorithm |
CN108804558A (en) * | 2018-05-22 | 2018-11-13 | 北京航空航天大学 | A kind of defect report automatic classification method based on semantic model |
CN109376092A (en) * | 2018-11-26 | 2019-02-22 | 扬州大学 | A kind of software defect reason automatic analysis method of facing defects patch code |
CN109491914A (en) * | 2018-11-09 | 2019-03-19 | 大连海事大学 | Defect report prediction technique is influenced based on uneven learning strategy height |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9219767B2 (en) * | 2006-06-22 | 2015-12-22 | Linkedin Corporation | Recording and indicating preferences |
US20170212829A1 (en) * | 2016-01-21 | 2017-07-27 | American Software Safety Reliability Company | Deep Learning Source Code Analyzer and Repairer |
-
2019
- 2019-06-20 CN CN201910474540.6A patent/CN110188047B/en active Active
Non-Patent Citations (2)
Title |
---|
Reliability evaluation of embedded software in command automation systems; Gong Yan et al.; The 13th Annual Conference of the Reliability Branch of the Chinese Institute of Electronics; 2006-12-31; pp. 376-381 *
Application of improved word vector features and CNN in sentence classification; Miao Haoran et al.; The 14th National Conference on Man-Machine Speech Communication; 2017-12-31; pp. 1-6 *
Also Published As
Publication number | Publication date |
---|---|
CN110188047A (en) | 2019-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110188047B (en) | Double-channel convolutional neural network-based repeated defect report detection method | |
CN112214610B (en) | Entity relationship joint extraction method based on span and knowledge enhancement | |
CN110110062B (en) | Machine intelligent question and answer method and device and electronic equipment | |
CN109189767B (en) | Data processing method and device, electronic equipment and storage medium | |
JP2021504789A (en) | ESG-based corporate evaluation execution device and its operation method | |
CN109710744B (en) | Data matching method, device, equipment and storage medium | |
CN112711953A (en) | Text multi-label classification method and system based on attention mechanism and GCN | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN110097096B (en) | Text classification method based on TF-IDF matrix and capsule network | |
CN111177010B (en) | Software defect severity identification method | |
CN116992007B (en) | Limiting question-answering system based on question intention understanding | |
CN112115716A (en) | Service discovery method, system and equipment based on multi-dimensional word vector context matching | |
CN110992988B (en) | Speech emotion recognition method and device based on domain confrontation | |
CN109800309A (en) | Classroom Discourse genre classification methods and device | |
CN110347833B (en) | Classification method for multi-round conversations | |
CN104657574A (en) | Building method and device for medical diagnosis models | |
CN112036705A (en) | Quality inspection result data acquisition method, device and equipment | |
CN118113849A (en) | Information consultation service system and method based on big data | |
CN112489689B (en) | Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure | |
CN105894032A (en) | Method of extracting effective features based on sample properties | |
CN116050419B (en) | Unsupervised identification method and system oriented to scientific literature knowledge entity | |
CN112488188A (en) | Feature selection method based on deep reinforcement learning | |
CN116450848B (en) | Method, device and medium for evaluating computing thinking level based on event map | |
CN116522912A (en) | Training method, device, medium and equipment for package design language model | |
KR102418239B1 (en) | Patent analysis apparatus for finding technology sustainability |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |