
CN110188047B - Double-channel convolutional neural network-based repeated defect report detection method - Google Patents


Info

Publication number
CN110188047B
CN110188047B (application CN201910474540.6A)
Authority
CN
China
Prior art keywords
defect
defect report
report
channel
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910474540.6A
Other languages
Chinese (zh)
Other versions
CN110188047A (en)
Inventor
徐玲
何健军
帅鉴航
杨梦宁
张小洪
洪明坚
葛永新
杨丹
王洪星
黄晟
陈飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201910474540.6A priority Critical patent/CN110188047B/en
Publication of CN110188047A publication Critical patent/CN110188047A/en
Application granted granted Critical
Publication of CN110188047B publication Critical patent/CN110188047B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Preventing errors by testing or debugging software
    • G06F11/3668: Software testing
    • G06F11/3672: Test management
    • G06F11/3692: Test management for test results analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing


Abstract

The invention relates to a double-channel convolutional neural network-based repeated defect report detection method comprising three steps: data preparation, CNN model establishment, and prediction for defect reports to be predicted. In data preparation, the fields useful for duplicate detection are extracted from the defect reports; for each report, the structured and unstructured information are put together into a single text file, and after preprocessing each report, represented as text, is converted into a single-channel matrix. Pairs of single-channel matrices are combined into double-channel matrices; one part is then used as a training set and the rest as a verification set. A CNN model is established and trained with the training set as input. In the prediction stage, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report; the similarity is the probability representing how likely the pair is a duplicate. The method achieves higher prediction accuracy.

Description

Double-channel convolutional neural network-based repeated defect report detection method
Technical Field
The invention relates to the technical field of software testing, in particular to a double-channel convolutional neural network-based repeated defect report detection method.
Background
Modern software projects use defect tracking systems such as Bugzilla [17] to store and manage defect reports. Software developers, software testers and end users submit defect reports to describe the software problems they encounter. Defect reports help guide software maintenance and repair work. As software systems evolve, hundreds of defect reports are submitted each day. When more than one person submits a defect report describing the same bug, duplicate defect reports are generated. Because defect reports are written in natural language, the same bug may well be described in different forms.
Manual detection of duplicate defect reports is a difficult task because of the large number of defect reports. Furthermore, because the defect reports are described in natural language, it is not practical to provide a standard template. Therefore, automatic detection of duplicate defect reports is a meaningful task that can avoid repairing the same bug multiple times. Many repeat defect report automatic detection techniques have been proposed to address this problem in recent years. These methods can be roughly divided into two directions, information retrieval and machine learning.
An information retrieval method, which generally calculates the similarity of two defect reports on a text, i.e. focuses on calculating the similarity from a text description.
For example, Hiew builds a model using the VSM (Vector Space Model), which represents a report as a vector with a TF-IDF (Term Frequency-Inverse Document Frequency) term weighting scheme. Based on the VSM, Runeson et al. were the first to use natural language processing techniques to detect duplicate defect reports. Wang et al. considered that natural language information alone does not solve this problem well, so they also used execution information as a feature for duplicate report detection. However, this approach has significant limitations, because only a small fraction of reports contain execution information. Sun et al. proposed REP, which uses not only the summary and description but also structured information such as product, component and version; to obtain higher text similarity, they extended BM25F, an effective similarity calculation method in the information retrieval field. In addition to textual and structured similarity, Alipour et al. also considered the effect of contextual information on duplicate report detection; they applied LDA to these features with better results. Information retrieval-based methods perform well in both accuracy and time efficiency, but when a problem is described in different terms the results are unsatisfactory.
Machine learning methods extract latent features of reports through learning algorithms, but traditional machine learning methods cannot learn deep features of the input well. SVM is a classical machine learning method. Jalbert et al. established a classification system with which duplicate reports could be filtered. They argued that previous methods did not take full advantage of the various features in a defect report, and therefore used surface features, text semantics and graph clustering in their model. Building on the work of Jalbert et al., Tian et al. considered some new features and established a linear model; from the perspectives of features and imbalanced data, they improved the accuracy of duplicate report detection. Sun et al., using SVM, developed a discriminative model that, for the first time, classified defect report pairs into duplicate and non-duplicate categories. Learning to rank (L2R) is another very useful machine learning method; based on it, Zhou et al. considered textual and statistical features and applied a stochastic gradient descent algorithm to them. This method performs better than conventional information retrieval methods such as VSM and BM25F. With the application of word embedding technology in the field of natural language processing, more and more researchers are using it to detect duplicate reports. Budhiraja et al. used word embedding techniques to convert defect reports into vectors and then calculate their similarity. Experimental results show that this approach has the potential to improve the accuracy of duplicate report detection.
Disclosure of Invention
The technical problem to be solved by the present invention is the automatic detection of duplicate reports, which can be further decomposed into determining the relationship between two defect reports, i.e. whether a defect report pair consisting of two reports is duplicate or non-duplicate.
To achieve this purpose, the invention adopts the following technical scheme: a repeated defect report detection method based on a dual-channel convolutional neural network, comprising the following steps:
S100: data preparation
S101: extract the defect reports of the software; each defect report consists of structured information and unstructured information, and for each defect report all the structured and unstructured information is put into a single text file;
S102: for each defect report, carry out preprocessing steps including tokenization, stemming, stop-word removal and case normalization;
S103: after preprocessing, combine the words of all defect reports into a corpus; apply the existing Word2vec with the CBOW model on the corpus to obtain the vector representation of each word, and thereby the two-dimensional matrix representation of each defect report, i.e. its two-dimensional single-channel matrix;
according to the known pairing information given by the software defect tracking system when the defect reports are extracted (the pairing information is in the data set and was produced by the data set's creators), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, formed by combining the two-dimensional single-channel matrices of the two reports; each dual-channel matrix is then labeled as duplicate or non-duplicate;
divide all the labeled dual-channel matrices into a training set and a verification set;
s200, establishing a CNN model
S201: inputting all the double-channel matrixes in the training set and the verification set into a CNN model together;
S202: in the first convolutional layer, a set of convolution kernels K_1 of size d x k_w is used, where d is the length of the convolution kernel and k_w is its width. After the first convolution, the two channels of the two-channel matrix are merged into one. The first-layer convolution formula is:

C_1^j = f_1( sum_{i=1}^{2} K_1 * I_1^{i,j} + b_1 )    (1)

where C_1 represents the output of the first convolutional layer, I_1^{i,j} denotes the window of the i-th channel of the input I_1 starting at its j-th row, b_1 denotes an offset, and f_1 is a nonlinear activation function. Given the input length l (l = n_w), padding value P = 0 and stride S = 1, the output length O_1 can be calculated as:

O_1 = (l - d + 2P) / S + 1    (2)

The output of the first convolutional layer is then reshaped and convolved again. In the second convolutional layer, convolution kernels K_2 of three sizes are set. The second-layer convolution formula is:

C_2^j = f_2( K_2 * I_2^j + b_2 )    (3)

where C_2 represents the output of the second convolutional layer, I_2^j denotes the window of the second-layer input I_2 starting at its j-th row, b_2 denotes an offset, and f_2 is a nonlinear activation function. This convolution yields three sets of feature maps, whose length O_2 can be calculated from l (l = O_1) and the different convolution kernel lengths d according to equation (2);
S203: perform max pooling on all feature maps;
S204: reshape and concatenate all feature maps into one vector, which is input to the fully-connected layers;
after two fully-connected layers, a single probability sim_predict is obtained, representing the similarity of the two reports being predicted;
at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, ..., x_300} of the first fully-connected layer and the weight vector W = {w_1, w_2, ..., w_300}, sim_predict can be calculated as:

sim_predict = sigmoid( sum_{i=1}^{300} w_i * x_i + b )    (4)

where i indexes the i-th element of T and b represents an offset;
S205: traverse all defect report pairs in the training set, repeating S202-S204;
S206: perform back propagation to update the hidden parameters of the model according to the loss function, equation (5):

loss = -(1/n) * sum_{i=1}^{n} [ label_real^i * log(sim_predict^i) + (1 - label_real^i) * log(1 - sim_predict^i) ]    (5)

where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each epoch of training, verify the model with the verification set; when the verification loss no longer decreases within 5 epochs, stop updating the model parameters; otherwise return to S201 and continue training the CNN model;
S300: prediction for the defect report to be predicted
first, the defect report to be predicted is preprocessed by the method of S102 and then converted into its two-dimensional single-channel matrix by the method of S103;
this single-channel matrix is combined pairwise with the two-dimensional single-channel matrices of the software's N existing defect reports, yielding N dual-channel matrices to be predicted; these form a prediction set, and each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when any of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
As an improvement, in S101, the structured information is product and component, and the unstructured information is summary and description.
As an improvement, ReLU is used as the activation function in all layers except the last fully-connected layer, to extract more nonlinear features.
Compared with the prior art, the invention has at least the following advantages:
the invention provides a novel method DC-CNN for repeated defect report detection. It combines two defect reports represented by a single channel matrix into a defect report pair represented by a two channel matrix. This two-channel matrix is then input into the CNN model to extract the implicit features. The method provided by the invention is verified on Open Office, eclipse, net Beans and a Combined data set Combined thereof, and is compared with the most advanced repeated report detection method based on deep learning at present, the method provided by the invention is effective, and more importantly, the performance is better.
Drawings
Figure 1 is the general framework of the process of the invention.
Fig. 2 is a general process flow for establishing a CNN model.
FIG. 3 (a) is the ROC curve of DC-CNN and SC-CNN on the Open Office data set, FIG. 3 (b) is the ROC curve on the Eclipse data set, FIG. 3 (c) is the ROC curve on the Net Beans data set, and FIG. 3 (d) is the ROC curve on the Combined data set.
FIG. 4 is an illustration of the effect of word vector dimensions.
FIG. 5 is an illustration of the impact of unstructured information.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Fig. 1 shows the overall framework of the inventive method DC-CNN, which comprises three phases: data preparation, CNN model establishment, and prediction for defect reports to be predicted. During the data preparation phase, the fields useful for duplicate detection, including component, product, summary and description, are extracted from the defect reports. For each report, the structured information and unstructured information are put together in a text document. After preprocessing, the text of all defect reports is collected to form a corpus. Word2vec is used to extract the semantic regularities of the corpus. Each report, represented by text, is converted into a single-channel matrix. To determine the relationship between two reports, the single-channel matrices representing the defect reports are combined into two-channel matrices representing defect report pairs. One part is then used as a training set and the rest as a verification set. In the training phase, the CNN model is trained using the training set as input. In the prediction phase, the trained model is loaded and predicts the similarity of a defect report pair consisting of an unknown defect report and a known defect report; the similarity is the probability representing how likely the pair is a duplicate.
A repetitive defect report detection method based on a dual-channel convolutional neural network comprises the following steps:
s100 data preparation
S101: extract the defect reports of the software; each defect report consists of structured information and unstructured information, and for each defect report all the structured and unstructured information is put into a single text file;
structured information is typically an optional attribute, while unstructured information is typically a textual description of a bug.
S102: for each defect report, carry out preprocessing steps including tokenization, stemming, stop-word removal and case normalization;
the present invention uses the standardalanyzer of Lucene to accomplish the above pretreatment step. When stop words are removed, a standard English stop word list is used. Furthermore, there are some words that are the same even in two unrelated defect reports. These words are typically professional vocabularies such as java, com, org, etc. They are also added to the stop word list due to their frequent occurrence. Through the above processing, some meaningless numbers are left in the text, and they are also removed.
S103: after preprocessing, combine the words of all defect reports into a corpus; apply the existing Word2vec with the CBOW model on the corpus to obtain the vector representation of each word, and thereby the two-dimensional matrix representation of each defect report, i.e. its two-dimensional single-channel matrix;
according to the known pairing information given by the software defect tracking system when the defect reports are extracted (the pairing information is in the data set and was produced by the data set's creators), a defect report pair consisting of two defect reports is represented by a two-dimensional dual-channel matrix, formed by combining the two-dimensional single-channel matrices of the two reports; each dual-channel matrix is then labeled as duplicate or non-duplicate;
the use of a two-pass representation of the defect-reporting pair has the following benefits compared to a single pass. First, two reports may be processed simultaneously by the CNN. Thus the training speed is increased. Second, it has been demonstrated that training CNNs using dual channel data can achieve higher accuracy. For a two-pass CNN, it can capture the correlation between two defect reports by a convolution operation.
Divide all the labeled dual-channel matrices into a training set and a verification set; in a specific implementation, 80% of the labeled dual-channel matrices are used as the training set and the remaining 20% as the verification set.
S200: establishing the CNN model
In order to extract features from defect report pairs, the present invention sets convolution kernels of three different sizes at each convolutional layer. Thus, the first convolutional layer has three branches, and each of these three branches again splits into three new branches at the second convolutional layer. Because the branches are highly similar in structure, Fig. 2 shows only one branch of the first convolutional layer in the overall working structure of the CNN. Table 3 shows the specific parameter settings of the CNN model of the present invention.
TABLE 3
S201: input all the dual-channel matrices in the training set and the verification set into the CNN model;
S202: in the first convolutional layer, a set of convolution kernels K_1 of size d x k_w is used, where d is the length of the convolution kernel and k_w is its width. Because each row of the input matrix represents a word, the convolution kernel width is equal to the word vector dimension m. After the first convolution, the two channels of the two-channel matrix are merged into one, so that the two defect reports can be treated as a whole when extracting features. The first-layer convolution formula is:

C_1^j = f_1( sum_{i=1}^{2} K_1 * I_1^{i,j} + b_1 )    (1)

where C_1 represents the output of the first convolutional layer, I_1^{i,j} denotes the window of the i-th channel of the input I_1 starting at its j-th row, b_1 denotes an offset, and f_1 is the nonlinear activation function (ReLU is used in the present invention). Given the input length l (l = n_w), padding value P = 0 and stride S = 1, the output length O_1 can be calculated as:

O_1 = (l - d + 2P) / S + 1    (2)

To further extract the correlated features of the two reports, the output of the first convolutional layer is reshaped and convolved again. In the second convolutional layer, convolution kernels K_2 of three sizes are set. The second-layer convolution formula is:

C_2^j = f_2( K_2 * I_2^j + b_2 )    (3)

where C_2 represents the output of the second convolutional layer, I_2^j denotes the window of the second-layer input I_2 starting at its j-th row, b_2 denotes an offset, and f_2 is the nonlinear activation function (ReLU). After this convolution, three sets of feature maps are obtained, whose length O_2 can be calculated from l (l = O_1) and the different convolution kernel lengths d according to equation (2).
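Equation (2) can be checked with a small helper; the concrete lengths below (l = 100, d = 3) are illustrative only, since the actual values of n_w and d come from the parameter table, which is rendered as an image in the original document.

```python
def conv_output_length(l, d, padding=0, stride=1):
    # Equation (2): O = (l - d + 2P) / S + 1
    return (l - d + 2 * padding) // stride + 1

# First layer: l = n_w, P = 0, S = 1, so O1 = n_w - d + 1.
print(conv_output_length(100, 3))  # 98
# Second layer reuses the same rule with l = O1.
print(conv_output_length(98, 3))   # 96
```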
S203: perform max pooling on all feature maps, down-sampling each feature map.
S204: reshape and concatenate all feature maps into one vector, which is input to the fully-connected layers. After two fully-connected layers, a single probability sim_predict is obtained, representing the similarity of the two reports being predicted. At the last layer, sigmoid is used as the activation function to obtain sim_predict. Given the output T = {x_1, x_2, ..., x_300} of the first fully-connected layer and the weight vector W = {w_1, w_2, ..., w_300}, sim_predict can be calculated as:

sim_predict = sigmoid( sum_{i=1}^{300} w_i * x_i + b )    (4)

where i indexes the i-th element of T and b represents an offset.
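Equation (4) amounts to a single sigmoid unit over the fully-connected output. A toy 3-dimensional sketch (the model itself uses the 300-dimensional output of the first fully-connected layer as T):

```python
import math

def similarity(T, W, b):
    # Equation (4): sim_predict = sigmoid(sum_i w_i * x_i + b)
    z = sum(w * x for w, x in zip(W, T)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy 3-dimensional example with made-up weights:
print(round(similarity([1.0, 2.0, 0.5], [0.2, -0.1, 0.4], 0.1), 4))  # 0.5744
```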
S205: all pairs of defect reports in the training set are traversed and S202-S204 are repeated.
S206: back propagation is performed to update the hidden parameters of the model according to the loss function, equation (5):

loss = -(1/n) * sum_{i=1}^{n} [ label_real^i * log(sim_predict^i) + (1 - label_real^i) * log(1 - sim_predict^i) ]    (5)

where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs.
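Written out in code, equation (5) is the mean binary cross-entropy over the n pairs; this sketch assumes that standard form, which is consistent with the sigmoid output and the label_real/sim_predict terms described around it:

```python
import math

def bce_loss(labels, probs):
    # Mean binary cross-entropy over n defect report pairs (equation (5)).
    n = len(labels)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / n

# Two toy pairs: one duplicate predicted 0.9, one non-duplicate predicted 0.2.
print(round(bce_loss([1, 0], [0.9, 0.2]), 4))  # 0.1643
```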
S207: after each epoch of training, the model is verified with the verification set; when the verification loss no longer decreases within 5 epochs, updating of the model parameters stops; otherwise the process returns to S201 and training of the CNN model continues.
S300: prediction for the defect report to be predicted
First, the defect report to be predicted is preprocessed by the method of S102 and then converted into its two-dimensional single-channel matrix by the method of S103;
this single-channel matrix is combined pairwise with the two-dimensional single-channel matrices of the software's N existing defect reports, yielding N dual-channel matrices to be predicted; these form a prediction set, and each dual-channel matrix in the prediction set is input into the CNN model to obtain a probability;
when any of the N probabilities is greater than the threshold, the existing defect report corresponding to that probability and the defect report to be predicted are considered duplicates.
For example, if a certain piece of software currently has N defect reports, each processed defect report corresponds to a two-dimensional single-channel matrix. The two-dimensional single-channel matrix of the defect report to be predicted is combined pairwise with each of those N matrices to obtain N two-channel matrices to be predicted, which are then input into the CNN model one by one to obtain N probabilities. When a probability exceeds the preset threshold, the defect report to be predicted and the corresponding existing defect report are considered duplicates.
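The thresholding step above can be sketched as a simple filter over the N model outputs; the probabilities below are made up, and the 0.5 threshold matches the value used later for Accuracy, Recall and Precision:

```python
def find_duplicates(model_probs, threshold=0.5):
    """Given the N probabilities produced by scoring the report to be
    predicted against each of the N existing reports, return the indices
    of the existing reports judged to be duplicates."""
    return [i for i, p in enumerate(model_probs) if p > threshold]

print(find_duplicates([0.1, 0.8, 0.45, 0.95]))  # [1, 3]
```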
Test verification:
1. Data set
For comparison, the present invention used the same dataset as that collected and processed by Lazar et al. It contains three large open source projects: open Office, eclipse and Net Beans. Open Office is Office software similar to Microsoft Office. Eclipse and Net Beans are open source integrated development environments. To perform the experiment with more training samples, a larger data set was obtained by combining the three data sets and named "Combined". These data sets also provide defect reporting pairings, some of which are shown in table 4.
Table 4: defect report pair
Some of these problems were found by analyzing all pairings in each dataset. First, some pairings are repeated. For example, in Open Office, (200622, 197347, duplicate) appears 5 times. Second, some pairs represent the same relationship, such as (159435, 164827, duplicate) and (164827, 159435, duplicate) in Eclipse. Therefore, the present invention will remove these pairs of defect reports. Table 5 shows the number of all pairings in the resulting dataset.
Table 5: complete data set
Dataset        Duplicate    Non-duplicate
Open Office      57340          41751
Eclipse          86385         160917
Net Beans        95066          89988
Combined        238791         292476
Each data set was divided into a training set and a test set, with the training set accounting for 80% (of which 10% serves as the validation set) and the test set accounting for 20%. In addition, so that the training and test sets mimic the original data distribution, the ratio of duplicate to non-duplicate report pairs in the training and test sets is kept the same as in the original data set when splitting. Both sets were randomly selected. Table 6 shows the detailed distribution of defect report pairs in the training set and the test set.
Table 6: training set and test set
2. Evaluation criteria
In the model proposed by the present invention, the output represents the similarity of the two reports in a defect report pair, so its value lies between 0 and 1. For further classification, a threshold is set. Once sim_predict has been obtained as in S204, label_predict (the predicted label of a defect report pair) can be calculated according to the following equation:

label_predict = 1 if sim_predict > threshold, otherwise 0
According to label_predict and label_real, report pairs can be divided into four categories:
1) TP: label_real = 1, label_predict = 1
2) TN: label_real = 0, label_predict = 0
3) FP: label_real = 0, label_predict = 1
4) FN: label_real = 1, label_predict = 0
where 1 indicates that the report pair is duplicate and 0 that it is non-duplicate. TP is the number of report pairs correctly predicted as duplicate, TN the number correctly predicted as non-duplicate, FP the number incorrectly predicted as duplicate, and FN the number incorrectly predicted as non-duplicate. These four counts are the basis for calculating the following evaluation criteria.
Accuracy:
Accuracy represents the ratio of correctly predicted defect report pairs to all report pairs, reflecting how well the model classifies all defect report pairs. Since the sigmoid function is used, the threshold is set to 0.5 when calculating Accuracy, Recall and Precision.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall:
Recall represents the ratio of correctly predicted duplicate defect report pairs to all actually duplicate defect report pairs.
Recall = TP / (TP + FN)
Precision:
Precision represents the ratio of the defect report pairs that are correctly predicted to be duplicates to all report pairs that are predicted to be duplicates.
Precision = TP / (TP + FP)
F1-Score:
F1-Score is the harmonic mean of Recall and Precision.
F1-Score = 2 × Precision × Recall / (Precision + Recall)
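The four criteria above follow directly from the TP/TN/FP/FN counts; a minimal sketch (the counts in the example call are made up for illustration):

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, Recall, Precision and F1-Score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)          # share of actual duplicates that were found
    precision = tp / (tp + fp)       # share of predicted duplicates that are real
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, recall, precision, f1

acc, rec, prec, f1 = evaluate(tp=8, tn=6, fp=2, fn=4)
print(round(acc, 3), round(rec, 3), round(prec, 3), round(f1, 3))
```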
ROC curve:
In fact, conventional evaluation criteria such as Accuracy do not evaluate the performance of the classifier well, because the defect reports in the dataset are not evenly distributed over the classes. Therefore, the present invention employs the ROC curve to further evaluate the performance of the classifier. Different TPR and FPR values can be obtained with different threshold values, and the ROC curve can then be drawn from the TPR and FPR. TPR and FPR may be calculated according to the following equations:
TPR = TP / (TP + FN)
FPR = FP / (FP + TN)
The ROC curve can be obtained by taking all FPR values as the horizontal axis and all TPR values as the vertical axis. The closer the curve is to the upper left corner of the plot, the better the performance of the classifier.
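A sketch of how the ROC points are produced: each threshold yields one (FPR, TPR) pair, and sweeping the threshold traces the curve. The scores and thresholds below are illustrative:

```python
def roc_points(sims, labels_real, thresholds):
    """One (FPR, TPR) point per threshold; plotting all points gives the ROC curve."""
    points = []
    for t in thresholds:
        tp = fn = fp = tn = 0
        for sim, real in zip(sims, labels_real):
            pred = 1 if sim > t else 0
            if real == 1:
                tp += pred
                fn += 1 - pred
            else:
                fp += pred
                tn += 1 - pred
        points.append((fp / (fp + tn), tp / (tp + fn)))  # (FPR, TPR)
    return points

# A classifier that separates the two duplicates (0.9, 0.8) from the
# two non-duplicates (0.3, 0.1) perfectly at threshold 0.5:
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0], [0.85, 0.5, 0.05]))
# [(0.0, 0.5), (0.0, 1.0), (1.0, 1.0)]
```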
Results of the experiment
The technical effect of the method of the present invention is demonstrated by answering the following questions.
Problem 1: is the DC-CNN of the present invention valid compared to the most advanced repetitive defect report detection method based on deep learning?
The research objective of the present invention is to propose a more effective deep-learning-based method. Thus, the method of the present invention was compared with the method of Deshmukh et al. on the same datasets.
Table 7: experimental results of the method of the invention and the method of Deshmukh et al
(Table 7 is presented as an image in the original document.)
As a result: table 7 shows the experimental results of the method of the invention and the method of Deshmukh et al. The twin neural networks, which are the same core method, are used, and they establish two similar models, a search model and a classification model. For the classification model, the highest accuracy appears on the Open Office dataset, reaching 0.8275, while only 0.7268 appears in Eclipse. Their search model performed better than the classification model. For the retrieval model, the best performing data set is still Open Office, which has a correctness rate of up to 0.9455. Similarly, eclipse was slightly inferior and accuracy was 0.906. It can be found that compared with the classification model established by the twin neural network, the improvement of DC-CNN on Open Office, eclipse, net Beans, combined is 11.54%,24.17%,17.89% and 13.33%, respectively. Compared with a retrieval model established by a twin neural network, the DC-CNN is improved by 6.25%,4.07% and 3.84% in Eclipse, net Beans and Combined respectively. On Open Office, the accuracy of DC-CNN is lower than 0.03%.
Influence: according to Table 7, the performance of DC-CNN was higher on 3 datasets (Eclipse, net Beans, combined) than the classification model and search model constructed by Deshmukh et al with a twin neural network. On Open Office, the performance of DC-CNN is higher than that of the classification model constructed by Deshmukh et al by twin neural networks and has a very similar performance with their retrieval model. Overall, DC-CNN achieves a very good performance and surpasses the state-of-the-art deep learning based repeat report detection methods.
Problem 2: is DC-CNN valid compared to SC-CNN?
In order to prove that the dual-channel matrix representation of defect report pairs proposed by the present invention is effective, the single-channel matrix representation of defect reports is also used as a comparison baseline. The structure of the CNN is kept unchanged (the number of convolution kernels, the size of the convolution kernels, the number of convolutional layers, etc.); the CNN is used to extract the features of the two reports in a defect report pair separately, and the similarity of the two reports is then calculated. This method is called Single-Channel Convolutional Neural Network (SC-CNN).
Table 8: DC-CNN and SC-CNN test results
(Table 8 is presented as an image in the original document.)
As a result: the performance of both methods was evaluated on the Accuracy, recall, precision, F1-Score, etc., and the results are shown in table 8, with the best results being bolded. It can be observed that DC-CNN exceeds SC-CNN at all indices of all datasets. Compared with SC-CNN, on Open Office, eclipse, net Beans and Combined, accuracy of DC-CNN is respectively improved by 2.78%,2.61%,1.36% and 2.33%, call of DC-CNN is respectively improved by 2.73%,0.51%,1.49% and 3.17%, precision of DC-CNN is respectively improved by 2.08%,6.53%,1.20% and 2.08%, F1-Score of DC-CNN is respectively improved by 2.40%,3.53%,1.35% and 2.62%. FIG. 3 (a) FIG. 3 (d) shows ROC curves for both methods. It can be observed that the curve for DC-CNN is above SC-CNN over all datasets, indicating that DC-CNN has better classification performance even when the sample distribution is unbalanced.
Influence: all experimental results show that the CNN model using two channels is more efficient than a single channel. For SC-CNN. Each report is converted into a matrix and then input into CNN to extract features, the results being represented as feature vectors. And then judging whether the two reports are repeated or not by calculating the similarity of the two feature vectors. For DC-CNN, two reports are combined into a dual-channel matrix and then input into CNN, and the two reports are convolved together, and the method can extract deep-level relation between the two reports and fully utilize the capability of CNN for capturing local features. Because the CNN model in DC-CNN focuses on extracting the correlation between two reports, there is better performance in detecting duplicate reports.
Problem 3: how do the experimental results change when changing the word vector dimension?
The invention provides a novel representation of defect report pairs, namely the dual-channel matrix. Therefore, the influence of its parameters on the test results is also explored. For the dual-channel matrix, the parameter most amenable to change is the dimension of the word vector, since the number of words is fixed and the position of the two reports (which report is on the first channel and which is on the second) is indistinguishable to the CNN. To answer the question of how the experimental results change when changing the word vector dimension, the dimension was gradually varied from 10 to 100 and the change of the experimental results on the Open Office dataset was observed.
As a result: As can be seen from FIG. 4, as the word vector dimension gradually increases, Accuracy first rises and then shows a downward trend. When the word vector dimension is 20, Accuracy reaches its maximum, 94.29%.
Influence: as the word vector dimension increases from 10 to 20, accuracy increases. As we continue to increase the word vector dimension, accuracy decreases. The reason may be when one word vector dimension is already sufficient to characterize one word. Continuing to increase the dimension instead makes it less well representative of the word. Although accurve reaches a maximum when the word vector dimension equals 20, it is not much higher than it would otherwise be. On one hand, the increase of the word vector dimension brings about a larger data storage problem; on the other hand, word embedding and CNN model training both increase in complexity. Thus, in the method of the present invention, 20 is the most appropriate word vector dimension.
Problem 4: is the method proposed by the present invention valid when no structured information is used?
Structured information such as product, component, and version provides very useful information when determining whether two reports are duplicates. Many methods use structured information as a separate feature to improve the accuracy of repeated defect report detection. Unstructured information is typically the natural language description of the bug. For duplicate report detection, CNN is mainly used to process unstructured text, and it performs well on long text. Unlike other methods, the present invention places both structured and unstructured information as text data in one text document; CNN is then used to extract their features. To answer Problem 4, the structured information is removed from the input and a comparative experiment is set up without changing the other conditions.
As a result: As can be seen from FIG. 5, the experimental results on all datasets decrease after removing the structured information, by 1.74%, 3.79%, 3.38% and 2.56% on Open Office, Eclipse, NetBeans and Combined, respectively.
Influence: experimental results show that it is effective to input structured information and unstructured information together into CNN. Note that after the structured information is removed, although accuracy drops, this drop is not fatal. The reason is that the structured information only occupies a small part of the whole text. Part of the CNN main processing remains unstructured information.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (3)

1. A repeated defect report detection method based on a dual-channel convolutional neural network, characterized in that the method comprises the following steps:
S100: data preparation
S101: extracting defect reports of the software, wherein each defect report consists of structured information and unstructured information; for each defect report, putting all the structured information and unstructured information into a single text file;
S102: for each defect report, performing preprocessing steps including word segmentation, stemming, stop-word removal and case conversion;
S103: after preprocessing, combining the words in all defect reports into a corpus, obtaining the vector representation of each word by applying the existing Word2vec (with the CBOW model selected) to the corpus, and thereby obtaining a two-dimensional matrix representation of each defect report, i.e., the two-dimensional single-channel matrix of the defect report;
according to the known information given by the software defect tracking system when the defect reports of the software are extracted, representing a defect report pair consisting of two defect reports by a two-dimensional dual-channel matrix, wherein the two-dimensional dual-channel matrix is formed by combining the two-dimensional single-channel matrices corresponding to the two defect reports; then labeling the dual-channel matrix as duplicate or non-duplicate;
dividing all the labeled dual-channel matrices into a training set and a verification set;
S200: establishing a CNN model
S201: inputting all the dual-channel matrices in the training set and the verification set into the CNN model;
S202: in the first convolutional layer, setting a number of convolution kernels K_1 ∈ R^(d×k_w), where d is the length of the convolution kernel and k_w is the width of the convolution kernel; after the first convolution, the two channels of the dual-channel matrix are merged into one, and the first-layer convolution formula is:
C_1^(j_1) = f_1( Σ_{i=1}^{2} I_1^(i,j_1) ⊛ K_1 + b_1 )    (1)
where C_1 represents the output of the first convolutional layer, I_1^(i,j_1) represents the window starting at the j_1-th line of the i-th channel of the input I_1 of the first convolutional layer, b_1 denotes an offset, and f_1 represents a nonlinear activation function; given the length l of the input, with l = n_w, the padding value P = 0 and the stride S = 1, the length O_1 of the output can be calculated as:
O_1 = (l − d + 2P) / S + 1 = l − d + 1    (2)
The output shape of the first convolutional layer is (number of convolution kernels) × O_1 × 1; this output is reshaped to 1 × O_1 × (number of convolution kernels), and C_1 is then convolved again. In the second convolutional layer, three further convolution kernels K_2 are set, each spanning the full width of the reshaped input and each with a different kernel length d; the formula of the second-layer convolution is:
C_2^(j_2) = f_2( I_2^(j_2) ⊛ K_2 + b_2 )    (3)
where C_2 represents the output of the second convolutional layer, I_2^(j_2) represents the window starting at the j_2-th line of the input I_2 of the second convolutional layer, b_2 denotes an offset, and f_2 represents a nonlinear activation function; after this convolution, three feature maps are obtained, whose lengths O_2 can be calculated according to equation (2) from l and the different convolution kernel lengths d, where l = O_1;
S203: performing maximum pooling on all feature maps;
S204: reshaping and stitching all the feature maps to obtain a one-dimensional vector, which is then input to the fully-connected layers;
after two fully-connected layers, an independent probability sim_predict is obtained, representing the predicted similarity of the two reports; at the last layer, sigmoid is used as the activation function to obtain sim_predict;
given the output T = {x_1, x_2, …, x_300} of the first fully-connected layer and the weight vector W = {w_1, w_2, …, w_300}, sim_predict can be calculated as:
sim_predict = sigmoid( Σ_{i=1}^{300} w_i · x_i + b )    (4)
where i represents the i-th element of T and b represents an offset;
S205: traversing all defect report pairs in the training set, repeating S202-S204;
S206: updating the hidden parameters of the model by back propagation according to a loss function, the loss function being as shown in formula (5):
loss = −(1/n) · Σ_{i=1}^{n} [ label_real^(i) · log(sim_predict^(i)) + (1 − label_real^(i)) · log(1 − sim_predict^(i)) ]    (5)
where label_real denotes the preset label of a defect report pair, i denotes the i-th defect report pair, and n denotes the total number of defect report pairs;
S207: after each epoch of training, verifying the model with the verification set; when the loss on the verification set no longer decreases within 5 epochs, stopping updating the model parameters; otherwise, returning to S201 and continuing to train the CNN model;
S300: predicting a defect report to be predicted
firstly, preprocessing the defect report to be predicted by the method in S102, and then converting it into a two-dimensional single-channel matrix by the method in S103;
combining the two-dimensional single-channel matrix of the defect report to be predicted with the two-dimensional single-channel matrices of the N existing defect reports of the software in pairs to obtain N dual-channel matrices to be predicted, forming a prediction set from the N dual-channel matrices to be predicted, and inputting each dual-channel matrix to be predicted in the prediction set into the CNN model to obtain a probability;
when one of the N probabilities is greater than the threshold value, the existing defect report corresponding to that probability and the defect report to be predicted are considered to be duplicates.
2. The repeated defect report detection method based on a dual-channel convolutional neural network as claimed in claim 1, characterized in that: in S101, the structured information is product and component, and the unstructured information is summary and description.
3. The repeated defect report detection method based on a dual-channel convolutional neural network as claimed in claim 1, characterized in that: at all fully-connected layers except the last, ReLU is used as the activation function to extract more nonlinear features.
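Steps S103 and S202 of claim 1, together with equations (1) and (2), can be illustrated with a small NumPy sketch. The toy vocabulary, the single convolution kernel (the claims use several), the ReLU activation and all sizes are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def report_matrix(words, embeddings, n_w, d):
    """S103: look up each word's vector; pad/truncate to n_w rows -> (n_w, d) matrix."""
    mat = np.zeros((n_w, d))
    for row, word in enumerate(words[:n_w]):
        mat[row] = embeddings[word]
    return mat

def first_conv_layer(pair, kernel, bias=0.0):
    """S202 / eq. (1): slide a (d_len, k_w) kernel down both channels and sum
    them, merging the two channels into one output; eq. (2) gives the output
    length O1 = l - d_len + 1 for padding P = 0 and stride S = 1."""
    channels, l, k_w = pair.shape
    d_len = kernel.shape[0]
    o1 = l - d_len + 1                      # eq. (2) with P = 0, S = 1
    out = np.empty(o1)
    for j in range(o1):
        window = pair[:, j:j + d_len, :]    # same window on both channels
        out[j] = max(0.0, np.sum(window * kernel) + bias)  # f1 = ReLU (assumed)
    return out

# Toy embeddings (d = 4) for a hypothetical vocabulary.
rng = np.random.default_rng(0)
vocab = ["crash", "open", "file", "menu"]
emb = {w: rng.random(4) for w in vocab}

a = report_matrix(["crash", "open", "file"], emb, n_w=6, d=4)
b = report_matrix(["open", "menu"], emb, n_w=6, d=4)
pair = np.stack([a, b])                     # dual-channel matrix, shape (2, 6, 4)

c1 = first_conv_layer(pair, kernel=rng.random((3, 4)))
print(c1.shape)                             # (4,) since O1 = 6 - 3 + 1 = 4
```

The channel sum inside `first_conv_layer` is what the claim describes as the two channels being "merged into one" after the first convolution.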
CN201910474540.6A 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method Active CN110188047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910474540.6A CN110188047B (en) 2019-06-20 2019-06-20 Double-channel convolutional neural network-based repeated defect report detection method

Publications (2)

Publication Number Publication Date
CN110188047A CN110188047A (en) 2019-08-30
CN110188047B true CN110188047B (en) 2023-04-18

Family

ID=67719718

Country Status (1)

Country Link
CN (1) CN110188047B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177010B (en) * 2019-12-31 2023-12-15 杭州电子科技大学 Software defect severity identification method
CN111737107B (en) * 2020-05-15 2021-10-26 南京航空航天大学 Repeated defect report detection method based on heterogeneous information network
CN112328469B (en) * 2020-10-22 2022-03-18 南京航空航天大学 Function level defect positioning method based on embedding technology
CN112631898A (en) * 2020-12-09 2021-04-09 南京理工大学 Software defect prediction method based on CNN-SVM
CN113379685A (en) * 2021-05-26 2021-09-10 广东炬森智能装备有限公司 PCB defect detection method and device based on dual-channel feature comparison model
CN113362305A (en) * 2021-06-03 2021-09-07 河南中烟工业有限责任公司 Smoke box strip missing mixed brand detection system and method based on artificial intelligence
CN113486176B (en) * 2021-07-08 2022-11-04 桂林电子科技大学 News classification method based on secondary feature amplification
CN113379746B (en) * 2021-08-16 2021-11-02 深圳荣耀智能机器有限公司 Image detection method, device, system, computing equipment and readable storage medium
CN113791897B (en) * 2021-08-23 2022-09-06 湖北省农村信用社联合社网络信息中心 Method and system for displaying server baseline detection report of rural telecommunication system
US20230367967A1 (en) * 2022-05-16 2023-11-16 Jpmorgan Chase Bank, N.A. System and method for interpreting stuctured and unstructured content to facilitate tailored transactions

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970666A (en) * 2014-05-29 2014-08-06 重庆大学 Method for detecting repeated software defect reports
CN106250311A (en) * 2016-07-27 2016-12-21 成都启力慧源科技有限公司 Repeated defects based on LDA model report detection method
CN108491835A (en) * 2018-06-12 2018-09-04 常州大学 Binary channels convolutional neural networks towards human facial expression recognition
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN108804558A (en) * 2018-05-22 2018-11-13 北京航空航天大学 A kind of defect report automatic classification method based on semantic model
CN109376092A (en) * 2018-11-26 2019-02-22 扬州大学 A kind of software defect reason automatic analysis method of facing defects patch code
CN109491914A (en) * 2018-11-09 2019-03-19 大连海事大学 Defect report prediction technique is influenced based on uneven learning strategy height

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9219767B2 (en) * 2006-06-22 2015-12-22 Linkedin Corporation Recording and indicating preferences
US20170212829A1 (en) * 2016-01-21 2017-07-27 American Software Safety Reliability Company Deep Learning Source Code Analyzer and Repairer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Reliability Evaluation of Embedded Software in Command Automation Systems; Gong Yan et al.; Proceedings of the 13th Annual Academic Conference of the Reliability Branch of the Chinese Institute of Electronics; Dec. 31, 2006; pp. 376-381 *
Application of Improved Word Vector Features and CNN in Sentence Classification; Miao Haoran et al.; Proceedings of the 14th National Conference on Man-Machine Speech Communication; Dec. 31, 2017; pp. 1-6 *

Also Published As

Publication number Publication date
CN110188047A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188047B (en) Double-channel convolutional neural network-based repeated defect report detection method
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN110110062B (en) Machine intelligent question and answer method and device and electronic equipment
CN109189767B (en) Data processing method and device, electronic equipment and storage medium
JP2021504789A (en) ESG-based corporate evaluation execution device and its operation method
CN109710744B (en) Data matching method, device, equipment and storage medium
CN112711953A (en) Text multi-label classification method and system based on attention mechanism and GCN
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN110097096B (en) Text classification method based on TF-IDF matrix and capsule network
CN111177010B (en) Software defect severity identification method
CN116992007B (en) Limiting question-answering system based on question intention understanding
CN112115716A (en) Service discovery method, system and equipment based on multi-dimensional word vector context matching
CN110992988B (en) Speech emotion recognition method and device based on domain confrontation
CN109800309A (en) Classroom Discourse genre classification methods and device
CN110347833B (en) Classification method for multi-round conversations
CN104657574A (en) Building method and device for medical diagnosis models
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN118113849A (en) Information consultation service system and method based on big data
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN105894032A (en) Method of extracting effective features based on sample properties
CN116050419B (en) Unsupervised identification method and system oriented to scientific literature knowledge entity
CN112488188A (en) Feature selection method based on deep reinforcement learning
CN116450848B (en) Method, device and medium for evaluating computing thinking level based on event map
CN116522912A (en) Training method, device, medium and equipment for package design language model
KR102418239B1 (en) Patent analysis apparatus for finding technology sustainability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant