CN113837266A - Software defect prediction method based on feature extraction and Stacking ensemble learning

Info

Publication number: CN113837266A
Application number: CN202111106611.0A
Authority: CN
Inventors: 崔梦天; 吴克奇; 李卫榜; 王琳; 姜玥; 罗洪
Original assignee: Southwest Minzu University
Current assignee: Southwest Minzu University
Priority date: 2021-09-22
Filing date: 2021-09-22
Publication date: 2021-12-24
Anticipated expiration: 2041-09-22
Also published as: CN113837266B

Abstract

The invention discloses a software defect prediction method based on feature extraction and Stacking ensemble learning, comprising the following steps: (1) performing feature extraction on the original data set using kernel principal component analysis to obtain a dimension-reduced defect data set DS'; (2) using the collaborative filtering algorithm proposed by the invention to recommend an applicable sampling method for new software defect data, and applying the recommended sampling algorithm to the defect data set DS' to handle class imbalance, obtaining a balanced defect data set DS''; (3) clustering the defect data set DS'' with the K-Means algorithm and removing outliers that deviate from the mainstream categories to obtain a defect data set DS'''; (4) constructing a software defect prediction model based on Stacking ensemble learning by selecting suitable classifiers for the first-layer base learners and the second-layer meta-learner; (5) comparing the ensemble model with the base models and the mainstream ensemble models on the processed defect data set DS''' to verify the performance of the proposed ensemble prediction model. The results show that the proposed KSSDP ensemble prediction model outperforms the base models and the mainstream ensemble models.

Description

Software defect prediction method based on feature extraction and Stacking ensemble learning

Technical Field

The invention relates to the field of software defects, in particular to a software defect prediction method based on feature extraction and Stacking ensemble learning.

Background

Open source software is one of the main trends in the future development of the software industry, and ensuring its quality has long been a central concern. Because open source software is open and shared within a community, its source code often contains many bugs, which greatly increases the cost of handling defects and hinders the adoption of open source software. It is therefore of practical importance to identify and control the factors that introduce defects early in software development, to take effective defect prevention measures, to reduce the defect introduction rate, and to ensure software quality. Current mainstream defect prediction techniques use classical and improved machine learning classification algorithms to find defective modules, and have the following main limitations: (1) Most defect data sets are high-dimensional and contain redundant features. Existing models reduce the dimensionality with feature selection, which discards many of the original data features and harms subsequent defect prediction, for example through reduced accuracy and a low F-Measure. (2) An applicable sampling method for a software defect data set is currently chosen mostly by hand, based on expert experience and the average performance of the sampling methods, so the selection is inefficient and depends too heavily on expert experience. (3) Software defects are currently predicted mostly with a single prediction model. Because the characteristics of defect data are complex and variable, a single model has inherent limitations and may predict poorly when the data characteristics are complex.

Disclosure of Invention

Technical problem to be solved

In order to overcome the defects of the existing defect prediction method, the invention provides a software defect prediction method based on feature extraction and Stacking ensemble learning, so that the problems in the prior art are solved.

Technical scheme

A software defect prediction method based on feature extraction and Stacking ensemble learning is characterized by comprising the following steps:

Step 1: extracting features from the original data set: features of the original defect data set DS are extracted by Kernel Principal Component Analysis (KPCA) to reduce its feature dimension, reducing DS to 10 dimensions to obtain a dimension-reduced defect data set DS';

Step 2: recommending a sampling method with the collaborative filtering sampling recommendation method for software defect data proposed by the invention. First, the sampling methods are ranked: the user selects a classification algorithm according to the characteristics of the defect data, the historical defect data are sampled with the mainstream sampling methods, and the mainstream sampling methods are ranked on the historical defect data by the accuracy that the selected classification algorithm achieves, giving a performance ranking of the mainstream sampling methods. Next, data similarity is mined: when the new defect data and the historical defect data belong to the same project, the Jaccard similarity coefficient between them is computed and taken as their similarity score; when they belong to different projects, feature extraction and normalization are applied to both, the Euclidean distance between them is computed, and the reciprocal of that distance is taken as their similarity score. Finally, user-based recommendation is performed: the sampling-method ranking and the data similarity are combined, a collaborative filtering algorithm recommends a sampling method suited to the new software defect data, and the recommended sampling algorithm is applied to the defect data set DS' to handle class imbalance, giving a balanced defect data set DS'';

Step 3: detecting and removing outliers in the defect data set DS'': the defect data set DS'' is clustered with the K-Means algorithm, and outliers deviating from the mainstream categories are removed to obtain a defect data set DS''';

Step 4: constructing a software defect prediction model based on Stacking ensemble learning: suitable classifiers are selected for the first-layer base learners and the second-layer meta-learner to build a well-performing software defect prediction model (KSSDP);

Step 5: verifying the performance of the KSSDP ensemble prediction model by comparing it with the base models and the mainstream ensemble models on the processed defect data set DS'''.

Advantageous effects

The invention provides a software defect prediction method (KSSDP) based on feature extraction and Stacking ensemble learning. Kernel principal component analysis is used to extract features from the defect data set so as to reduce the correlation among data features, and a collaborative filtering sampling recommendation method for software defect data is used to address the class imbalance of the defect data set. The recommendation method first computes, under the classification algorithm selected by the user, the prediction accuracy obtained on the training set after processing by each mainstream sampling method, and ranks the sampling methods by that accuracy. It then computes the similarity between the new defect data set and each historical defect data set, using either the Jaccard similarity coefficient or the reciprocal of the Euclidean distance between them. Finally, it derives a recommendation score from the ranking score and the similarity value and recommends an applicable sampling method to the user according to that score. The defect data set is then clustered with the K-Means algorithm, according to the numbers of positive and negative samples in the balanced data set, to find and remove outliers. A software defect prediction model is constructed with Stacking ensemble learning, and simulation experiments on several NASA defect data sets show that the model outperforms the base models and the mainstream ensemble models. Consequently, recommending a sampling method for a new data set requires no manual intervention, so an applicable sampling method is selected automatically for the new defect data set; in addition, the proposed method performs well on false alarm rate and F-Measure and generalizes better than the base models and the mainstream ensemble models.

Drawings

FIG. 1 is a flow diagram of a KSSDP integrated prediction model

FIG. 2 is a flowchart of a collaborative filtering sampling recommendation method for software defect data

FIG. 3 is a diagram of a recommended network architecture containing 3 sets of historical data and 4 sampling methods

FIGS. 4 and 5 compare the false alarm rate (Pf) and F-Measure of the KSSDP model with those of the base models

FIGS. 6 and 7 compare the false alarm rate (Pf) and F-Measure of the KSSDP model with those of the optimal mainstream ensemble model

Detailed Description

The invention will now be further described with reference to the following examples and drawings:

The invention provides a software defect prediction method (KSSDP) based on feature extraction and Stacking ensemble learning. The flowchart of the KSSDP ensemble prediction model is shown in FIG. 1, and the technical scheme adopted to solve the technical problem comprises the following:

1. Feature extraction on the raw data set

A nonlinear kernel mapping is used to map the original data points from the low-dimensional feature space into a high-dimensional feature space, from which representative features are extracted to characterize the complex structure of the defect data. The core principle is as follows:

Let x be mapped to u by the corresponding function ρ, defined as follows:

u＝ρ(x) (1)

The kernel function maps the data into a corresponding N-dimensional feature space, and the data in the mapped feature space satisfy the following condition:
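The condition itself is not reproduced in this text. As a minimal sketch of the feature extraction step, the following assumes scikit-learn's `KernelPCA` with an RBF kernel. The source specifies only a nonlinear kernel and a 10-dimensional target, so the kernel type, the synthetic data and the helper name `reduce_defect_features` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA


def reduce_defect_features(X, n_components=10, kernel="rbf"):
    """Project the metric matrix X (samples x features) onto n_components
    kernel principal components. The RBF kernel is an assumption; the
    source only says that a nonlinear kernel is used."""
    kpca = KernelPCA(n_components=n_components, kernel=kernel)
    return kpca.fit_transform(X)


# Synthetic stand-in for an original defect data set DS:
# 200 modules described by 21 software metrics (made-up values).
rng = np.random.default_rng(0)
DS = rng.normal(size=(200, 21))
DS_reduced = reduce_defect_features(DS)
print(DS_reduced.shape)
```

`KernelPCA` centers the kernel matrix internally, so the input does not have to be centered in the mapped feature space beforehand.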

2. Collaborative filtering sampling recommendation method for software defect data

The flowchart of the collaborative filtering sampling recommendation method for software defect data is shown in FIG. 2. The method trains on the historical data sets using ten-fold cross-validation. Each data set $OD_i$ in the historical data set collection $OD = \{OD_1, OD_2, \ldots, OD_m\}$ is divided into ten parts; in turn, one part serves as the test set test and the remaining nine parts as the training set train. Any sampling method $T_j$ from the set of mainstream sampling methods $T = \{T_1, T_2, \ldots, T_n\}$ is applied to the training set train to handle class imbalance, yielding a balanced training set BTrain. The user selects a suitable classification algorithm CA from the classification algorithm library $CA = \{CA_1, CA_2, \ldots, CA_p\}$, and CA is trained on the balanced training set BTrain to obtain a predictor P. The predictor P is evaluated on the test set test to obtain the corresponding performance metric value accuracy, from which the sampling-method ranking score RankScore[i][j] is computed for each historical data set and each sampling method. The invention computes the sampling-method ranking score RankScore[i][j] using the following formula:

RankScore[i][j]＝RankScore[i][j]+accuracy (3)

After ten iterations, the cumulative sum of the performance metric value accuracy, with each part in turn serving as the test set, is obtained; the cumulative sum for data set $OD_i$ under sampling method $T_j$ is stored in RankScore[i][j]. The invention then averages this cumulative sum using the following formula:

RankScore[i][j]＝RankScore[i][j]/10 (4)

Finally, the averaged value RankScore[i][j] serves as the basis for ranking the sampling methods.
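Formulas (3) and (4) amount to accumulating the ten per-fold accuracy values and dividing by 10. A minimal sketch, assuming the per-fold accuracies have already been measured; the function name `rank_scores`, the nested-list layout and the numbers are illustrative, not taken from the source:

```python
FOLDS = 10  # ten-fold cross-validation, as described in the text


def rank_scores(fold_accuracy):
    """fold_accuracy[i][j] holds the ten accuracy values obtained on
    historical data set OD_i after balancing with sampling method T_j.
    Returns RankScore[i][j], the mean accuracy over the ten folds."""
    m, n = len(fold_accuracy), len(fold_accuracy[0])
    score = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            if len(fold_accuracy[i][j]) != FOLDS:
                raise ValueError("expected one accuracy value per fold")
            for accuracy in fold_accuracy[i][j]:
                score[i][j] += accuracy   # formula (3): accumulate
            score[i][j] /= FOLDS          # formula (4): average
    return score


# One historical data set, two sampling methods (made-up accuracies).
fold_accuracy = [[[0.8] * 10, [0.6] * 5 + [0.9] * 5]]
scores = rank_scores(fold_accuracy)
print(scores)
```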

The invention computes the similarity between the new defect data set ND and the historical defect data sets $OD = \{OD_1, OD_2, \ldots, OD_m\}$. When the new defect data and the historical defect data belong to the same project, the sizes of the intersection and the union of the data sets' features are computed, and their quotient is taken as the similarity score SimiScore between the new defect data set and the historical defect data set. For each historical defect data set $OD_i$, the invention computes the similarity score SimiScore using the following formula:
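The formula is not reproduced in this text. A minimal sketch, assuming the standard Jaccard coefficient over the two data sets' feature names, which matches the "intersection count divided by union count" description; the function name and the metric names in the example are hypothetical:

```python
def jaccard_similarity(new_features, historical_features):
    """SimiScore for same-project data: |intersection| / |union| of the
    feature sets of the new and the historical defect data set."""
    new_set, old_set = set(new_features), set(historical_features)
    union = new_set | old_set
    if not union:
        return 0.0  # two empty feature sets: treat as no similarity
    return len(new_set & old_set) / len(union)


# Hypothetical metric names, for illustration only.
nd_features = {"loc", "v(g)", "ev(g)", "branch_count"}
od_features = {"loc", "v(g)", "n", "branch_count", "d"}
simi = jaccard_similarity(nd_features, od_features)
print(simi)
```

Here three features are shared out of six distinct ones, so the score is 3/6.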

When the new defect data and the historical defect data belong to different projects, kernel principal component analysis is used to extract features from the new defect data set ND and the historical defect data set OD, reducing both to 10 dimensions. The invention normalizes the features $x_k$ ($k = 1, 2, \ldots, 10$) of ND and OD using the following formula:

Denote the normalized features of the new defect data set ND as $y_k$ and those of the historical defect data set $OD_i$ as $z_k$. The invention computes the Euclidean distance between the new defect data set ND and the historical defect data set $OD_i$ using the following formula:

To keep the similarity within the range 0 to 1, for each historical defect data set $OD_i$ the invention computes the similarity score SimiScore using the following formula:
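The three formulas for the cross-project case are not reproduced in this text. The sketch below is one plausible reading under stated assumptions: min-max scaling for the normalization, the usual Euclidean distance over the 10 features, and `1 / (1 + d)` as a bounded form of the reciprocal distance. The source states only that the reciprocal of the distance is used and that the score lies between 0 and 1. How each data set is summarized into a single 10-dimensional vector is also not stated, so the vectors are taken as given.

```python
import math


def min_max_normalize(values):
    """Scale a feature vector to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def euclidean_distance(y, z):
    """Square root of the sum of squared per-feature differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, z)))


def distance_similarity(y, z):
    """Assumed bounded reciprocal: 1 at distance 0, tending to 0 as
    the distance grows."""
    return 1.0 / (1.0 + euclidean_distance(y, z))


# Made-up 10-dimensional summaries of ND and one OD_i after KPCA.
nd_raw = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
od_raw = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0]
y = min_max_normalize(nd_raw)
z = min_max_normalize(od_raw)
print(distance_similarity(y, z))
```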

The invention multiplies each sampling method's ranking score by the corresponding similarity score, takes the product as the recommendation score RecScore, and uses Top-N ranking to recommend applicable sampling methods for the new data set. For any sampling method $T_j$ in the sampling method set $T = \{T_1, T_2, \ldots, T_n\}$, based on the $m$ historical defect data sets in $OD = \{OD_1, OD_2, \ldots, OD_m\}$, the invention computes the sampling-method recommendation score RecScore[j] using the following formula:

Once the recommendation score RecScore has been computed for each sampling method, the methods are sorted by RecScore to obtain their Top-N ranking, so that applicable sampling methods are recommended automatically for new software defect data. FIG. 3 shows a schematic of a recommendation network built from three historical defect data sets and four sampling methods. The network combines the sampling-method ranking with the data similarity information into three layers: the connection weights between the first and second layers are the similarity scores between data sets, and the connection weights between the second and third layers are the ranking scores.
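The RecScore formula is not reproduced in this text. A minimal sketch, assuming the products are summed over the m historical data sets, which is how the weights of the three-layer network described above would combine; the function name `recommend` and all numbers are illustrative:

```python
def recommend(simi_score, rank_score, top_n):
    """simi_score[i]: similarity between the new data set and OD_i.
    rank_score[i][j]: ranking score of sampling method T_j on OD_i.
    Returns (RecScore per sampling method, indices of the Top-N methods)."""
    m, n = len(rank_score), len(rank_score[0])
    rec_score = [
        sum(simi_score[i] * rank_score[i][j] for i in range(m))  # assumed sum
        for j in range(n)
    ]
    order = sorted(range(n), key=lambda j: rec_score[j], reverse=True)
    return rec_score, order[:top_n]


# Three historical data sets and four sampling methods, as in FIG. 3.
simi_score = [0.5, 0.2, 0.8]
rank_score = [
    [0.80, 0.85, 0.70, 0.75],
    [0.90, 0.60, 0.65, 0.70],
    [0.70, 0.80, 0.90, 0.60],
]
rec_score, top = recommend(simi_score, rank_score, top_n=2)
print(rec_score, top)
```

With these made-up values the third sampling method (index 2) scores highest, because it ranks best on the historical data set most similar to the new one.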

3. Detecting outliers of a defect data set

Based on the principle of minimizing the clustering criterion function, the data are iteratively divided into different classes so that the resulting classes are as compact and as independent as possible, and outliers deviating from the mainstream categories are removed. The core principle is as follows:

For $i = 1, 2, \ldots, m$, compute the distance $d_{ij} = \lVert x_i - \mu_j \rVert_2$ between sample $x_i$ and each centroid vector $\mu_j$ ($j = 1, 2, \ldots, k$), assign $x_i$ to the class $\lambda_i$ corresponding to the smallest $d_{ij}$, and then update class $\lambda_i$ to include $x_i$.
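The assignment step above, followed by the usual centroid update, can be sketched in plain Python. The source does not state how a sample is judged to deviate from the mainstream categories, so the rule in `remove_outliers` (drop samples farther from their centroid than `factor` times the mean member distance) is an assumption, as are the function names and the toy data.

```python
import math


def assign_clusters(samples, centroids):
    """Label each sample x_i with the index of its nearest centroid mu_j."""
    labels = []
    for x in samples:
        distances = [math.dist(x, mu) for mu in centroids]
        labels.append(distances.index(min(distances)))
    return labels


def update_centroids(samples, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if empty)."""
    updated = []
    for j, mu in enumerate(centroids):
        members = [x for x, lab in zip(samples, labels) if lab == j]
        if members:
            mu = tuple(sum(c) / len(members) for c in zip(*members))
        updated.append(mu)
    return updated


def kmeans(samples, centroids, max_iter=100):
    centroids = [tuple(mu) for mu in centroids]
    labels = assign_clusters(samples, centroids)
    for _ in range(max_iter):
        updated = update_centroids(samples, labels, centroids)
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
        labels = assign_clusters(samples, centroids)
    return labels, centroids


def remove_outliers(samples, labels, centroids, factor=2.0):
    """Assumed rule: keep samples within factor x mean member distance."""
    dist = [math.dist(x, centroids[lab]) for x, lab in zip(samples, labels)]
    limits = {}
    for j in set(labels):
        member = [d for d, lab in zip(dist, labels) if lab == j]
        limits[j] = factor * sum(member) / len(member)
    return [x for x, d, lab in zip(samples, dist, labels) if d <= limits[lab]]


samples = [(0, 0), (0, 1), (1, 0), (1, 1),
           (10, 10), (10, 11), (11, 10), (11, 11), (5, 30)]
labels, centroids = kmeans(samples, centroids=[(0, 0), (10, 10)])
kept = remove_outliers(samples, labels, centroids)
print(kept)
```

In this toy data the point (5, 30) joins the second cluster but lies far from its centroid, so it is the only sample dropped.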

4. Constructing the software defect prediction model based on Stacking ensemble learning

In the Stacking ensemble learning model, the first-layer base learners need to satisfy the following characteristics: each performs sufficiently well, the correlation among them is as small as possible, and the performance gap between them is not too large.

Based on these characteristics, the KNN model, the random forest model and the Gaussian naive Bayes model are selected as the first-layer base learners. The KNN model is widely used and has a mature theory and an efficient training scheme; the random forest model is an ensemble of decision trees built under the Bagging framework and works well in practice; the Gaussian naive Bayes model can be trained on a small number of samples, handles separable binary data well, and trains quickly. Because a Stacking ensemble learning model may overfit, the second-layer meta-learner should be a simpler model in order to reduce overfitting, so the logistic regression model is selected as the second-layer meta-learner.
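A two-layer model with these learners can be sketched with scikit-learn's `StackingClassifier`. The source names the four model types but gives no hyperparameters, so the settings below (`n_estimators=50`, `cv=5`, `max_iter=1000`), the helper name `build_stacking_model` and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier


def build_stacking_model(random_state=0):
    """First layer: KNN, random forest, Gaussian naive Bayes.
    Second layer: logistic regression as the meta-learner."""
    base_learners = [
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50,
                                      random_state=random_state)),
        ("gnb", GaussianNB()),
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,  # base-learner predictions for the meta-learner come from CV
    )


# Synthetic stand-in for a processed 10-dimensional defect data set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = build_stacking_model().fit(X_train, y_train)
predictions = model.predict(X_test)
print(model.score(X_test, y_test))
```

Training the meta-learner on cross-validated base-learner predictions, rather than on predictions for data the base learners have already seen, is the library's way of limiting the overfitting risk mentioned above.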

5. Performance verification of KSSDP integrated prediction model

The performance of the KSSDP ensemble model, the base models and the mainstream ensemble models is analyzed by comparing common metrics such as false alarm rate and F-Measure. As FIG. 4 shows, KSSDP has a relatively high false alarm rate on the JM1 data set: the random forest, Gaussian naive Bayes and logistic regression models all have lower false alarm rates than KSSDP there, with the Gaussian naive Bayes model as much as 18.8% lower. On the PC4 data set, the false alarm rate of the Gaussian naive Bayes model is also lower than that of the KSSDP model, by 7.1%. On the remaining 6 data sets, however, the KSSDP model performs well, reaching or approaching the lowest false alarm rate, so overall it controls the false alarm rate well.

As FIG. 5 shows, the KSSDP model proposed by the invention performs well on all 8 data sets. F-Measure is a composite metric that objectively reflects the quality of a model. The KSSDP model obtains the highest F-Measure on all 8 data sets, which indicates that it outperforms each single base classifier (the KNN, random forest, Gaussian naive Bayes and logistic regression models) and further shows that the proposed KSSDP model is feasible and effective.

The invention selects the best mainstream ensemble model for comparison: if the KSSDP ensemble prediction model outperforms the best mainstream ensemble model, it outperforms all mainstream ensemble models. On both F-Measure and Pf, the best mainstream ensemble model is the ExtraTrees model. As FIGS. 6 and 7 show, the proposed method maintains a high F-Measure and a low false alarm rate on the 8 data sets. The ExtraTrees model has a higher F-Measure than the KSSDP model on the PC1 data set, but the KSSDP model has the higher F-Measure on the remaining 7 data sets. For false alarm rate, the ExtraTrees model is lower than the KSSDP model on the JM1 and PC1 data sets, but it fluctuates widely and is less stable: its average false alarm rate is 11.36%, against 9.7% for the KSSDP model. In summary, the proposed method performs well, with better overall performance on the 8 data sets than the base models and the mainstream ensemble models.
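The two metrics compared in this section are not defined in the text. A minimal sketch, assuming their usual definitions in defect prediction (Pf = FP / (FP + TN); F-Measure as the harmonic mean of precision and recall, with label 1 marking a defective module); the function names and labels are illustrative:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) with 1 = defective, 0 = non-defective."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def false_alarm_rate(y_true, y_pred):
    """Pf: share of non-defective modules wrongly flagged as defective."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall on the defective class."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
pf = false_alarm_rate(y_true, y_pred)
f1 = f_measure(y_true, y_pred)
print(pf, f1)
```

In this example one of five non-defective modules is flagged, and precision and recall are both 2/3.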

Claims

1. A software defect prediction method based on feature extraction and Stacking ensemble learning is characterized by comprising the following steps:

2. The method of claim 1, wherein a nonlinear mapping kernel function is used to map the original data points in the low-dimensional feature space to the high-dimensional feature space, so as to extract representative features and characterize complex defect data structures.

3. The collaborative filtering sampling recommendation method for software defect data as claimed in claim 1, wherein a ten-fold cross-validation method is adopted to train on a historical defect data set: the historical defect data set OD is divided into ten parts, one part is taken as the test set test and the remaining nine parts as the training set train; the training set train is processed for class imbalance using a mainstream sampling method T to obtain a balanced training set BTrain; a user selects a suitable classification algorithm CA from a classification algorithm library; the classification algorithm CA is trained on the balanced training set BTrain to obtain a predictor P; the test set test is evaluated using the predictor P to obtain the corresponding performance metric value accuracy; and finally the average RankScore of the sum of the performance metric values is taken as the basis for ranking the sampling methods;

when the new defect data and the historical defect data belong to the same project, the similarity between the new defect data set ND and the historical defect data set OD is calculated by computing the sizes of the intersection and the union of the data sets' features and taking their quotient as the similarity score SimiScore of the new defect data set and the historical defect data set; when the new defect data and the historical defect data belong to different projects, kernel principal component analysis is used to extract features from the new defect data set ND and the historical defect data set OD, reducing both to 10 dimensions, ND and OD are normalized, the arithmetic square root of the sum of the squared differences between the features of ND and OD in each dimension is calculated, and the reciprocal of that square root is taken as the similarity score SimiScore of the new defect data set and the historical defect data set;

and the ranking score RankScore of each sampling method is multiplied by the corresponding similarity score SimiScore, the product is taken as the recommendation score RecScore, and an applicable sampling method is recommended for the new data set using Top-N ranking according to the value of the recommendation score RecScore.

4. The method for eliminating abnormal values according to claim 1, wherein, based on the principle of minimizing the clustering criterion function, the data are iteratively divided into different classes so that the resulting classes are as compact and as independent as possible, and abnormal values deviating from the mainstream categories are eliminated.

5. The method for constructing the defect prediction model based on Stacking ensemble learning as claimed in claim 1, wherein in the Stacking ensemble learning model the first-layer base learners need to satisfy the following characteristics: each performs sufficiently well, the correlation among them is as small as possible, and the performance gap between them is not too large; based on these characteristics, a KNN model, a random forest model and a Gaussian naive Bayes model are selected as the first-layer base learners; and because a Stacking ensemble learning model may overfit, the second-layer meta-learner should be a simpler model in order to reduce overfitting, so the logistic regression model is selected as the second-layer meta-learner.

6. The performance verification method of claim 1, wherein the performance of the KSSDP ensemble model, the base models and the mainstream ensemble models is analyzed by comparing common metrics such as false alarm rate and F-Measure.