
CN113837266A - Software defect prediction method based on feature extraction and Stacking ensemble learning - Google Patents

Publication number
CN113837266A
Authority
CN
China
Prior art keywords
defect data
data set
model
historical
defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111106611.0A
Other languages
Chinese (zh)
Other versions
CN113837266B (en)
Inventor
崔梦天
吴克奇
李卫榜
王琳
姜玥
罗洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Minzu University
Original Assignee
Southwest Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Minzu University filed Critical Southwest Minzu University
Priority to CN202111106611.0A priority Critical patent/CN113837266B/en
Publication of CN113837266A publication Critical patent/CN113837266A/en
Application granted granted Critical
Publication of CN113837266B publication Critical patent/CN113837266B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a software defect prediction method based on feature extraction and Stacking ensemble learning, which comprises the following steps: (1) performing feature extraction on the original data set with kernel principal component analysis to obtain a dimension-reduced defect data set DS'; (2) using the collaborative filtering algorithm provided by the invention to recommend a suitable sampling method for new software defect data, and applying the recommended sampling method to the defect data set DS' to handle class imbalance, obtaining a balanced defect data set DS''; (3) clustering the defect data set DS'' with the K-Means algorithm and removing outliers that deviate from the mainstream classes to obtain a defect data set DS'''; (4) constructing a software defect prediction model based on Stacking ensemble learning, selecting suitable classifiers for the first-layer base learners and the second-layer meta-learner so as to obtain a well-performing software defect prediction model; (5) comparing the ensemble model with the base models and the mainstream ensemble models on the processed defect data set DS''' to verify the performance of the ensemble prediction model provided by the invention. The results show that the KSSDP ensemble prediction model provided by the invention performs better than the base models and the mainstream ensemble models.

Description

Software defect prediction method based on feature extraction and Stacking ensemble learning
Technical Field
The invention relates to the field of software defects, in particular to a software defect prediction method based on feature extraction and Stacking ensemble learning.
Background
Open source software is one of the main trends in the future development of the software industry, and how to ensure its quality has always been a concern and a crucial issue in the industry. Because open source software is open and shared within a community, its source code often contains many bugs, which greatly increases the cost of handling defects and hinders the application and popularization of open source software. It is therefore of practical importance to identify and control the factors that introduce defects early in software development, to take effective defect prevention measures, to reduce the defect introduction rate and to ensure software quality. Current mainstream defect prediction techniques find defective modules with classical machine learning classification algorithms and improved variants of them, and they mainly have the following limitations: (1) Most defect data sets are high-dimensional and contain redundant features; existing models reduce the dimensionality with feature selection, which discards many of the original data features and harms subsequent defect prediction, for example through reduced accuracy and low F-Measure values. (2) A sampling method for a software defect data set is at present mostly selected manually, according to expert experience and the average performance of the sampling methods, so the selection is inefficient and depends too heavily on expert experience. (3) Software defects are at present mostly predicted with a single prediction model. Because the characteristics of defect data are complex and changeable, a single prediction model has certain limitations, and its prediction may be poor when the characteristics of the defect data are complex.
Disclosure of Invention
Technical problem to be solved
In order to overcome the defects of the existing defect prediction method, the invention provides a software defect prediction method based on feature extraction and Stacking ensemble learning, so that the problems in the prior art are solved.
Technical scheme
A software defect prediction method based on feature extraction and Stacking ensemble learning is characterized by comprising the following steps:
Step 1: extract features from the original data set. Kernel Principal Component Analysis (KPCA) is applied to the original defect data set DS to reduce its feature dimension to 10, yielding the dimension-reduced defect data set DS';
Step 2: recommend a sampling method with the collaborative filtering sampling recommendation method for software defect data provided by the invention. First, the sampling methods are ranked: the user selects a classification algorithm according to the characteristics of the defect data, the historical defect data are sampled with each mainstream sampling method, and the mainstream sampling methods are ranked on the historical defect data by the metric accuracy obtained with the selected classification algorithm. Next, data similarity is mined: when the new defect data and the historical defect data belong to the same project, the Jaccard similarity coefficient between them is taken as their similarity score; when they belong to different projects, features are extracted from both and normalized, the Euclidean distance between the new and the historical defect data is calculated, and the reciprocal of the Euclidean distance is taken as their similarity score. Finally, user-based recommendation is performed: the sampling-method ranking and the data similarity are combined, a collaborative filtering algorithm recommends the sampling method suited to the new software defect data, and the recommended sampling method is applied to the defect data set DS' to handle class imbalance, yielding the balanced defect data set DS'';
Step 3: detect and remove outliers in the defect data set DS''. The defect data set DS'' is clustered with the K-Means algorithm, and outliers that deviate from the mainstream classes are removed to obtain the defect data set DS''';
Step 4: construct a software defect prediction model based on Stacking ensemble learning, selecting suitable classifiers for the first-layer base learners and the second-layer meta-learner so as to obtain a well-performing software defect prediction model (KSSDP);
Step 5: verify the performance of the KSSDP ensemble prediction model by comparing it with the base models and the mainstream ensemble models on the processed defect data set DS'''.
Advantageous effects
The invention provides a software defect prediction method (KSSDP) based on feature extraction and Stacking ensemble learning. Kernel principal component analysis extracts features from the defect data set to reduce the correlation among data features, and a collaborative filtering sampling recommendation method for software defect data addresses the class imbalance of the defect data set. The recommendation method first computes, under the classification algorithm selected by the user, the prediction accuracy on the training set processed by each mainstream sampling method, and ranks the sampling methods by that accuracy. It then computes the similarity between the new defect data set and each historical defect data set, using either the Jaccard similarity coefficient or the reciprocal of the Euclidean distance between them. Finally, it derives a recommendation score from the ranking score and the similarity value and recommends a suitable sampling method to the user according to that score. The K-Means algorithm then clusters the balanced defect data set, according to its numbers of positive and negative samples, to find and remove outliers. A software defect prediction model is constructed with Stacking ensemble learning, and simulation experiments on several NASA defect data sets show that the model performs better than the base models and the mainstream ensemble models. As a result, no manual intervention is needed when a sampling method is recommended for a new data set, so a suitable sampling method is selected automatically for the new defect data set. In addition, the software defect prediction method based on feature extraction and Stacking ensemble learning provided by the invention performs well on the false alarm rate and F-Measure indexes and generalizes better than the base models and the mainstream ensemble models.
Drawings
FIG. 1 is a flow chart of the KSSDP ensemble prediction model.
FIG. 2 is a flow chart of the collaborative filtering sampling recommendation method for software defect data.
FIG. 3 is a diagram of a recommendation network structure containing 3 historical data sets and 4 sampling methods.
FIG. 4 and FIG. 5 are graphs comparing the false alarm rate (Pf) and F-Measure of the KSSDP model with those of the base models.
FIG. 6 and FIG. 7 are graphs comparing the false alarm rate (Pf) and F-Measure of the KSSDP model with those of the best mainstream ensemble model.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
the invention provides a software defect prediction method (KSSDP) based on feature extraction and Stacking ensemble learning, wherein a flow chart of a KSSDP ensemble prediction model is shown in figure 1, and the technical scheme adopted for solving the technical problem comprises the following contents:
1. Feature extraction on the original data set
A nonlinear kernel function maps the original data points from the low-dimensional feature space to a high-dimensional feature space, from which representative features are extracted to characterize the complex structure of the defect data. The core principle is as follows:
Let x be mapped to u by a mapping function ρ, defined as follows:
u=ρ(x) (1)
The kernel function maps the data into the corresponding N-dimensional feature space, and the mapped data in that feature space satisfy the following condition:
[Equation (2): original formula image not available]
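The dimension reduction in this step can be sketched with scikit-learn's `KernelPCA`. This is only an illustration under stated assumptions: the RBF kernel, the synthetic data and the variable names are not taken from the patent, which specifies only a nonlinear kernel and a target of 10 dimensions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Synthetic stand-in for an original defect data set DS:
# 60 software modules described by 21 numeric metrics (hypothetical sizes).
rng = np.random.default_rng(0)
DS = rng.normal(size=(60, 21))

# Assumption: an RBF kernel. The text only requires a nonlinear kernel.
kpca = KernelPCA(n_components=10, kernel="rbf")

# DS_reduced plays the role of the dimension-reduced data set DS'.
DS_reduced = kpca.fit_transform(DS)
print(DS_reduced.shape)
```

Other kernels (`"poly"`, `"sigmoid"`) can be swapped in through the same `kernel` parameter if they characterize a given data set better.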
2. Collaborative filtering sampling recommendation method for software defect data
The flow chart of the collaborative filtering sampling recommendation method for software defect data is shown in FIG. 2. The method trains on the historical data sets with ten-fold cross-validation. Let the historical data sets be OD = {OD_1, OD_2, …, OD_m}. Each data set OD_i is divided into ten parts; in turn, one part is taken as the test set test and the remaining nine parts as the training set train. Any sampling method T_j from the set of mainstream sampling methods T = {T_1, T_2, …, T_n} is applied to the training set train to handle class imbalance, yielding a balanced training set BTrain. The user selects a suitable classification algorithm from the classification algorithm library CA = {CA_1, CA_2, …, CA_p}, and the selected algorithm learns on the balanced training set BTrain to obtain a predictor P. The predictor P is evaluated on the test set test to obtain the corresponding performance metric value accuracy, and the ranking score RankScore[i][j] of each sampling method on each historical data set is calculated. The invention accumulates the ranking score RankScore[i][j] with the following formula:
RankScore[i][j]=RankScore[i][j]+accuracy (3)
After ten iterations, the cumulative sum of the performance metric value accuracy over the ten choices of test part is obtained; for data set OD_i and sampling method T_j, this cumulative sum is stored in RankScore[i][j]. The invention then averages the cumulative sum with the following formula:
RankScore[i][j]=RankScore[i][j]/10 (4)
Finally, the average accuracy RankScore[i][j] is used as the basis for ranking the sampling methods.
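The bookkeeping of formulas (3) and (4) can be sketched as follows. The `evaluate` callback is hypothetical: in a real pipeline it would balance the training part of data set i with sampling method j, train the chosen classifier and return the accuracy on the held-out fold. Here a made-up function stands in for it so that the accumulation logic can be run on its own.

```python
from typing import Callable, List


def rank_scores(
    num_datasets: int,
    num_samplers: int,
    evaluate: Callable[[int, int, int], float],
    folds: int = 10,
) -> List[List[float]]:
    """Return RankScore[i][j]: mean accuracy of sampler j on data set i."""
    scores = [[0.0] * num_samplers for _ in range(num_datasets)]
    for i in range(num_datasets):
        for j in range(num_samplers):
            for fold in range(folds):
                scores[i][j] += evaluate(i, j, fold)  # formula (3)
            scores[i][j] /= folds                     # formula (4)
    return scores


# Hypothetical callback with invented accuracies, used only for illustration.
def fake_evaluate(i: int, j: int, fold: int) -> float:
    return 0.5 + 0.1 * j + 0.01 * i


scores = rank_scores(num_datasets=2, num_samplers=3, evaluate=fake_evaluate)
# Index of the best-ranked sampling method for each historical data set.
best = [max(range(3), key=row.__getitem__) for row in scores]
print(best)
```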
The invention calculates the similarity between a new defect data set ND and the historical defect data sets OD = {OD_1, OD_2, …, OD_m}. When the new defect data and the historical defect data belong to the same project, the number of features in the intersection and the number of features in the union of the two data sets' feature sets are counted, and the quotient of the two counts is taken as the similarity score SimiScore of the new and the historical defect data set. For each historical defect data set OD_i, the invention calculates the similarity score SimiScore with the following formula, in which the intersection and union are taken over the features of the data sets:
$$\mathrm{SimiScore}[i] = \frac{|ND \cap OD_i|}{|ND \cup OD_i|} \tag{5}$$
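For the same-project case, the Jaccard similarity over feature sets can be sketched in a few lines. The feature names below are illustrative placeholders, not taken from the patent.

```python
def jaccard_similarity(features_a, features_b):
    """Size of the intersection divided by size of the union of two feature sets."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # two empty feature sets: treat as no similarity
    return len(a & b) / len(union)


# Hypothetical feature names for a new and a historical defect data set.
new_features = {"loc", "v(g)", "ev(g)", "branchCount"}
old_features = {"loc", "v(g)", "iv(g)", "branchCount", "lOComment"}

score = jaccard_similarity(new_features, old_features)
print(score)  # 3 shared features out of 6 distinct ones
```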
When the new defect data and the historical defect data belong to different projects, kernel principal component analysis extracts features from the new defect data set ND and the historical defect data sets OD, reducing both to 10 dimensions. The invention normalizes each feature x_k (k = 1, 2, …, 10) of ND and OD with the following formula:
[Equation (6): original formula image not available]
Let y_k denote the normalized features of the new defect data set ND and z_k the normalized features of the historical defect data set OD_i. The invention calculates the Euclidean distance between ND and OD_i with the following formula:
$$d(ND, OD_i) = \sqrt{\sum_{k=1}^{10} (y_k - z_k)^2} \tag{7}$$
To keep the similarity within the range 0 to 1, the invention calculates the similarity score SimiScore for each historical defect data set OD_i with the following formula:
[Equation (8): original formula image not available]
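A minimal sketch of the cross-project similarity follows. The text describes the score as based on the reciprocal of the Euclidean distance and bounded between 0 and 1, but the exact formula is not reproduced here, so the form 1 / (1 + d) used below is an assumption chosen to satisfy both properties. How a data set is summarized into a 10-element feature vector is also not specified, so the vectors are simply taken as inputs.

```python
import math


def euclidean_distance(y, z):
    """Euclidean distance between two equally long feature vectors."""
    if len(y) != len(z):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, z)))


def similarity_from_distance(d):
    # Assumption: 1 / (1 + d) keeps the score in (0, 1] and avoids
    # division by zero when the two vectors coincide.
    return 1.0 / (1.0 + d)


# Hypothetical normalized 10-dimensional features y_k (new) and z_k (historical).
y = [0.0] * 10
z = [0.0] * 8 + [0.3, 0.4]

d = euclidean_distance(y, z)
score = similarity_from_distance(d)
print(round(d, 3), round(score, 3))
```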
The invention multiplies each sampling method's ranking score by the corresponding similarity score, takes the product as the recommendation score RecScore, and uses Top-N ranking to recommend suitable sampling methods for the new data set. For any sampling method T_j in the sampling method set T = {T_1, T_2, …, T_n}, based on the m historical defect data sets OD = {OD_1, OD_2, …, OD_m}, the invention calculates the recommendation score RecScore[j] with the following formula:
[Equation (9): original formula image not available]
After the recommendation score RecScore has been calculated for each sampling method, the methods are sorted by RecScore to obtain their Top-N ranking, so that a suitable sampling method is recommended automatically for new software defect data. FIG. 3 shows a schematic recommendation network built from three historical defect data sets and four sampling methods: the sampling-method ranking and the data similarity are combined into a three-layer recommendation network in which the connection weights between the first and second layers are the similarity scores between data sets, and the connection weights between the second and third layers are the ranking scores.
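The recommendation step can be sketched for a network of three historical data sets and four sampling methods. Summing the products over the historical data sets is an assumption here, chosen because it matches the weighted three-layer network described in the text; the method names T1 to T4 and all scores are invented for illustration.

```python
def recommend(rank_score, simi_score, method_names, top_n=2):
    """Return the top_n (name, RecScore) pairs, best first.

    Assumption: RecScore[j] is the sum over historical data sets i of
    SimiScore[i] * RankScore[i][j].
    """
    rec = []
    for j, name in enumerate(method_names):
        total = sum(simi_score[i] * rank_score[i][j] for i in range(len(simi_score)))
        rec.append((name, total))
    rec.sort(key=lambda pair: pair[1], reverse=True)
    return rec[:top_n]


# Invented scores: rows are historical data sets, columns are sampling methods.
rank_score = [
    [0.80, 0.70, 0.90, 0.60],
    [0.60, 0.90, 0.70, 0.80],
    [0.70, 0.80, 0.60, 0.90],
]
simi_score = [0.5, 0.2, 0.8]  # similarity of the new data set to each historical one
names = ["T1", "T2", "T3", "T4"]

top = recommend(rank_score, simi_score, names, top_n=2)
print([name for name, _ in top])
```

With these numbers the third historical data set is the most similar to the new one, so the methods that ranked well on it dominate the recommendation.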
3. Detecting outliers of a defect data set
Based on the principle of minimizing the clustering criterion function, the data are iteratively divided into different classes so that the resulting classes are as compact and as independent as possible, and outliers that deviate from the mainstream classes are removed. The core principle is as follows:
For i = 1, 2, …, m, calculate the distance $d_{ij} = \lVert x_i - \mu_j \rVert_2$ between sample x_i and each centroid vector μ_j (j = 1, 2, …, k), assign x_i to the class λ_i corresponding to the smallest d_ij, and then update:
[Equation: original formula image not available]
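The clustering and outlier removal can be sketched in plain Python. The text does not state how "deviating from the mainstream classes" is decided, so the rule used below (drop a point whose distance to its centroid exceeds a multiple of the mean distance within its cluster) is an assumption, as are the `factor` parameter and the toy points.

```python
import math


def nearest_centroid(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    j = min(range(len(centroids)), key=distances.__getitem__)
    return j, distances[j]


def k_means(points, centroids, iterations=20):
    """Plain K-Means iterations starting from the given centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            j, _dist = nearest_centroid(p, centroids)
            clusters[j].append(p)
        # Recompute each centroid as the mean of its members;
        # keep the old centroid if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


def remove_outliers(points, centroids, factor=2.0):
    """Assumed rule: keep points within factor x mean distance of their cluster."""
    assigned = [nearest_centroid(p, centroids) for p in points]
    kept = []
    for p, (j, d) in zip(points, assigned):
        same = [dist for (cj, dist) in assigned if cj == j]
        if d <= factor * sum(same) / len(same):
            kept.append(p)
    return kept


points = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids = k_means(points, centroids=[(0, 0), (10, 10)])
cleaned = remove_outliers(points, centroids)
print(len(points), len(cleaned))
```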
4. Constructing the software defect prediction model based on Stacking ensemble learning
In the Stacking ensemble learning model, the base learners of the first layer need to satisfy the following characteristics: each should perform sufficiently well, the correlation among them should be as small as possible, and the gaps in their performance should not be too large.
According to the characteristics, the KNN model, the random forest model and the Gaussian naive Bayes model are selected as the first-layer base learner. The KNN model is widely applied, and has the characteristics of mature theory, high efficiency of training mode and the like; the random forest model is formed by integrating decision trees as basic models under a Bagging integration framework, and has a good effect in practical application; the Gaussian naive Bayes model can be trained only by a small amount of samples, is good at processing separable binary data, and has the characteristics of high training speed and the like. Since overfitting may occur in the Stacking ensemble learning model, in order to reduce the overfitting, the meta learner at the second layer in the Stacking model should use a simpler model for learning, so the logistic regression model is selected as the meta learner at the second layer.
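The two-layer structure described above can be sketched with scikit-learn's `StackingClassifier`. The base learners and meta-learner follow the text; the synthetic data, the hyperparameters and the five-fold internal cross-validation are assumptions made only so that the example runs, and this is not the patent's tuned KSSDP model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 10-dimensional data standing in for a processed defect data set.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = StackingClassifier(
    # First layer: KNN, random forest and Gaussian naive Bayes base learners.
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gnb", GaussianNB()),
    ],
    # Second layer: a simple logistic regression meta-learner.
    final_estimator=LogisticRegression(),
    cv=5,  # assumption: out-of-fold predictions from 5-fold CV feed the meta-learner
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions.shape)
```

Training the meta-learner on out-of-fold predictions rather than on predictions for data the base learners have already seen is what limits the overfitting risk mentioned in the text.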
5. Performance verification of the KSSDP ensemble prediction model
The performance of the KSSDP ensemble model, the base models and the mainstream ensemble models is analyzed by comparing common indexes such as the false alarm rate and F-Measure. As can be seen from FIG. 4, the false alarm rate of KSSDP on the data set JM1 is comparatively high: the random forest, Gaussian naive Bayes and logistic regression models all have lower false alarm rates than the KSSDP model, with the Gaussian naive Bayes model as much as 18.8% lower. On the data set PC4, the false alarm rate of the Gaussian naive Bayes model is also lower than that of the KSSDP model, by 7.1%. On the remaining 6 data sets, however, the KSSDP model performs well and reaches or approaches the lowest false alarm rate, so overall its control of the false alarm rate is still satisfactory.
As can be seen from FIG. 5, the KSSDP model proposed by the invention performs well on all 8 data sets. F-Measure is a comprehensive index that objectively reflects the quality of a model, and the KSSDP model obtains the highest F-Measure value on all 8 data sets. Its performance is therefore superior to that of each single base classifier (the KNN, random forest, Gaussian naive Bayes and logistic regression models), which further shows that the proposed KSSDP model is feasible and effective.
The invention selects the best mainstream ensemble model for comparison: if the KSSDP ensemble prediction model outperforms the best mainstream ensemble model, it outperforms all mainstream ensemble models. On both F-Measure and Pf, the best mainstream ensemble model is the ExtraTrees model. As can be seen from FIG. 6 and FIG. 7, the proposed method maintains a high F-Measure value and a low false alarm rate on the 8 data sets. The ExtraTrees model has a higher F-Measure value than the KSSDP model on the data set PC1, but the KSSDP model has the higher F-Measure value on the remaining 7 data sets. For the false alarm rate, the ExtraTrees model is lower than the KSSDP model on the data sets JM1 and PC1, but it fluctuates considerably and is not stable enough: its average false alarm rate is 11.36%, against 9.7% for the KSSDP model. In conclusion, the proposed method performs well, since its overall performance on the 8 data sets is better than that of the base models and the mainstream ensemble models.
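The two indexes used in this comparison can be computed from a confusion matrix as sketched below. The definitions are the commonly used ones (Pf = FP / (FP + TN), F-Measure as the harmonic mean of precision and recall) and are assumed here, since the text does not spell them out; the labels are invented, with 1 marking a defective module.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means defective."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def false_alarm_rate(y_true, y_pred):
    """Pf: share of non-defective modules wrongly flagged as defective."""
    _tp, fp, tn, _fn = confusion_counts(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall."""
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented labels: 4 defective and 6 clean modules.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pf = false_alarm_rate(y_true, y_pred)
f1 = f_measure(y_true, y_pred)
print(round(pf, 4), round(f1, 4))
```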

Claims (6)

1. A software defect prediction method based on feature extraction and Stacking ensemble learning is characterized by comprising the following steps:
step 1: extracting features from the original data set, wherein Kernel Principal Component Analysis (KPCA) is applied to the original defect data set DS to reduce its feature dimension to 10, obtaining a dimension-reduced defect data set DS';
step 2: recommending a sampling method with the collaborative filtering sampling recommendation method for software defect data provided by the invention, wherein the sampling methods are first ranked: a user selects a classification algorithm according to the characteristics of the defect data, the historical defect data are sampled with each mainstream sampling method, and the mainstream sampling methods are ranked on the historical defect data by the metric accuracy obtained with the selected classification algorithm; data similarity is then mined: when the new defect data and the historical defect data belong to the same project, the Jaccard similarity coefficient between them is taken as their similarity score, and when they belong to different projects, features are extracted from both and normalized, the Euclidean distance between the new and the historical defect data is calculated, and the reciprocal of the Euclidean distance is taken as their similarity score; finally, user-based recommendation is performed: the sampling-method ranking and the data similarity are combined, a collaborative filtering algorithm recommends the sampling method suited to the new software defect data, and the recommended sampling method is applied to the defect data set DS' to handle class imbalance, obtaining a balanced defect data set DS'';
step 3: detecting and removing outliers in the defect data set DS'', wherein the defect data set DS'' is clustered with the K-Means algorithm and outliers that deviate from the mainstream classes are removed to obtain a defect data set DS''';
step 4: constructing a software defect prediction model based on Stacking ensemble learning, wherein suitable classifiers are selected for the first-layer base learners and the second-layer meta-learner so as to obtain a well-performing software defect prediction model (KSSDP);
step 5: verifying the performance of the KSSDP ensemble prediction model by comparing it with the base models and the mainstream ensemble models on the processed defect data set DS'''.
2. The method of claim 1, wherein a nonlinear mapping kernel function is used to map the original data points in the low-dimensional feature space to the high-dimensional feature space, so as to extract representative features and characterize complex defect data structures.
3. The collaborative filtering sampling recommendation method for software defect data as claimed in claim 1, wherein ten-fold cross-validation is used to train on a historical defect data set: the historical defect data set OD is divided into ten parts, one part is taken as the test set test and the remaining nine parts as the training set train; the training set train is processed for class imbalance with a mainstream sampling method T to obtain a balanced training set BTrain; a user selects a suitable classification algorithm CA from the classification algorithm library, and the classification algorithm CA learns on the balanced training set BTrain to obtain a predictor P; the test set test is evaluated with the predictor P to obtain the corresponding performance metric value accuracy; and finally the average RankScore of the accumulated performance metric values is taken as the basis for ranking the sampling methods;
the similarity between a new defect data set ND and a historical defect data set OD is calculated as follows: when the new defect data and the historical defect data belong to the same project, the number of features in the intersection and the number of features in the union of the data sets' feature sets are counted, and the quotient of the two counts is taken as the similarity score SimiScore of the new and the historical defect data set; when the new defect data and the historical defect data belong to different projects, kernel principal component analysis extracts features from the new defect data set ND and the historical defect data set OD, reducing both to 10 dimensions, ND and OD are normalized, the arithmetic square root of the sum of the squared differences between the features of ND and OD in each dimension is calculated, and the reciprocal of that square root is taken as the similarity score SimiScore of the new and the historical defect data set;
and the ranking score RankScore of each sampling method is multiplied by the corresponding similarity score SimiScore, the product is taken as a recommendation score RecScore, and a suitable sampling method is recommended for the new data set by Top-N ranking according to the value of the recommendation score RecScore.
4. The method for eliminating abnormal values according to claim 1, wherein, based on the principle of minimizing the clustering criterion function, the data are iteratively divided into different categories so that the generated categories are as compact and as independent as possible, and abnormal values deviating from the mainstream categories are thereby eliminated.
5. The method for constructing the defect prediction model based on Stacking ensemble learning as claimed in claim 1, wherein, in the Stacking ensemble learning model, the base learners of the first layer need to satisfy the following characteristics: sufficiently strong performance, correlation with one another that is as small as possible, and performance differences that are not too large; according to these characteristics, a KNN model, a random forest model and a Gaussian naive Bayes model are selected as the first-layer base learners; since overfitting may occur in the Stacking ensemble learning model, the second-layer meta learner in the Stacking model should use a simpler model in order to reduce overfitting, so a logistic regression model is selected as the second-layer meta learner.
6. The performance verification method of claim 1, wherein the performance of the KSSDP ensemble model of the present invention is compared with that of the base models and of mainstream ensemble models using the common indicators of false alarm rate and F-Measure.
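The scoring arithmetic recited in claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the sampling-method names (SMOTE, RUS) and the project names are hypothetical. Three assumptions go beyond the claim text: the cross-project distance uses squared differences (Euclidean distance), each data set is already summarized as one reduced and normalized feature vector, and the products for a given sampling method are summed across several historical data sets.

```python
from math import sqrt


def jaccard_similarity(new_features, old_features):
    """Same-project SimiScore: intersection count divided by union count."""
    new_set, old_set = set(new_features), set(old_features)
    union = new_set | old_set
    if not union:
        return 0.0
    return len(new_set & old_set) / len(union)


def distance_similarity(new_vec, old_vec):
    """Cross-project SimiScore: reciprocal of the distance between two vectors.

    Assumption: both vectors were already reduced (e.g. by kernel PCA) and
    normalized, and the distance is Euclidean (squared differences).
    """
    if len(new_vec) != len(old_vec):
        raise ValueError("vectors must have the same dimensionality")
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(new_vec, old_vec)))
    if dist == 0:
        raise ValueError("reciprocal is undefined for identical vectors")
    return 1.0 / dist


def recommend(rank_scores, simi_scores, top_n=3):
    """Return the top-N sampling methods ordered by RecScore.

    rank_scores: {historical data set: {sampling method: RankScore}}
    simi_scores: {historical data set: SimiScore to the new data set}
    RecScore = RankScore * SimiScore; summing over historical data sets
    is an assumption made for this sketch.
    """
    rec = {}
    for dataset, methods in rank_scores.items():
        for method, rank in methods.items():
            rec[method] = rec.get(method, 0.0) + rank * simi_scores[dataset]
    return sorted(rec.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Hypothetical RankScores from two historical projects.
    ranks = {
        "projA": {"SMOTE": 0.8, "RUS": 0.6},
        "projB": {"SMOTE": 0.5, "RUS": 0.9},
    }
    simis = {"projA": 0.5, "projB": 0.2}
    print(recommend(ranks, simis, top_n=2))
```

Weighting each historical RankScore by SimiScore lets results from data sets that resemble the new one dominate the recommendation, which is the collaborative-filtering idea the claim describes.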
CN202111106611.0A 2021-09-22 2021-09-22 Software defect prediction method based on feature extraction and Stacking ensemble learning Active CN113837266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111106611.0A CN113837266B (en) 2021-09-22 2021-09-22 Software defect prediction method based on feature extraction and Stacking ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111106611.0A CN113837266B (en) 2021-09-22 2021-09-22 Software defect prediction method based on feature extraction and Stacking ensemble learning

Publications (2)

Publication Number Publication Date
CN113837266A (en) 2021-12-24
CN113837266B (en) 2022-05-20

Family

ID=78960344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111106611.0A Active CN113837266B (en) 2021-09-22 2021-09-22 Software defect prediction method based on feature extraction and Stacking ensemble learning

Country Status (1)

Country Link
CN (1) CN113837266B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114706780A (en) * 2022-04-13 2022-07-05 北京理工大学 Software defect prediction method based on Stacking ensemble learning
CN118052813A (en) * 2024-04-12 2024-05-17 深圳特朗达照明股份有限公司 Intelligent detection device and method for LED lamp

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110659207A (en) * 2019-09-02 2020-01-07 北京航空航天大学 Heterogeneous cross-project software defect prediction method based on nuclear spectrum mapping migration integration
CN110674865A (en) * 2019-09-20 2020-01-10 燕山大学 Rule learning classifier integration method oriented to software defect class distribution unbalance
US20210109931A1 (en) * 2019-10-10 2021-04-15 Sap Se Data security through query refinement

Also Published As

Publication number Publication date
CN113837266B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN111897963B (en) Commodity classification method based on text information and machine learning
Lin et al. Parameter tuning, feature selection and weight assignment of features for case-based reasoning by artificial immune system
Utari et al. Implementation of data mining for drop-out prediction using random forest method
CN111275113A (en) Skew time series abnormity detection method based on cost sensitive hybrid network
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN108681742B (en) Analysis method for analyzing sensitivity of driver driving behavior to vehicle energy consumption
CN113704389A (en) Data evaluation method and device, computer equipment and storage medium
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN111343147A (en) Network attack detection device and method based on deep learning
CN107016416B (en) Data classification prediction method based on neighborhood rough set and PCA fusion
CN115688024A (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
CN116257759A (en) Structured data intelligent classification grading system of deep neural network model
CN114519508A (en) Credit risk assessment method based on time sequence deep learning and legal document information
Saha et al. The corporeality of infotainment on fans feedback towards sports comment employing convolutional long-short term neural network
CN111708865A (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
CN118468061A (en) Automatic algorithm matching and parameter optimizing method and system
Krishnamoorthy et al. Comparative study of machine learning algorithms for product recommendation based on user experience
CN114358813B (en) Improved advertisement putting method and system based on field matrix factorization machine
CN115496151A (en) Equipment production state classification method and device, computer equipment and storage medium
CN115544361A (en) Frame for predicting change of attention point of window similarity analysis and analysis method thereof
JP7226783B2 (en) Information processing system, information processing method and program
CN110609961A (en) Collaborative filtering recommendation method based on word embedding
Shanthini et al. Advanced Data Mining Enabled Robust Sentiment Analysis on E-Commerce Product Reviews and Recommendation Model
CN113435655B (en) Sector dynamic management decision method, server and system
CN118133051B (en) Construction method and device of element evaluation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant