CN113127342A - Defect prediction method and device based on power grid information system feature selection - Google Patents
- Publication number
- CN113127342A CN202110339177.4A
- Authority
- CN
- China
- Prior art keywords
- software
- data set
- training set
- module
- version data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a defect prediction method and device based on power grid information system feature selection. The method comprises: acquiring a historical version data set and a to-be-tested version data set of the software and standardizing them; calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set; performing class imbalance processing on the training set; performing feature selection on the class-balanced training set and the normalized to-be-tested version data set; and, based on the feature-selected training set, using a defect prediction model built with a classification algorithm to predict the defect status of each module of the to-be-tested software version, thereby obtaining a defect prediction result for each module. The invention takes into account the feature differences and data distribution differences between software versions, and improves the efficiency and accuracy of software defect prediction.
Description
Technical Field
The invention relates to the field of software testing, in particular to a method and a device for predicting testing defects of a power grid information system.
Background
During software development, operation and maintenance, software changes because of changing requirements, performance improvements, defect repairs, code refactoring and the like. As a result the software grows in scale, its functions become more complex, and the relations between functional modules become more intricate, so defects in the software are inevitable. Software testing safeguards software quality by executing programs to discover as many defects as possible. It is the most time-consuming and resource-intensive part of a software project, and all programs have to be tested with limited testing resources. With the continuous development of distributed power sources, incremental power distribution networks and the like, the frequency, complexity and timeliness requirements of update iterations of power information systems (power grid information systems) keep rising, which places higher demands on software test defect prediction. In defect prediction for projects that are updated iteratively by version, the source data set and the target data set contain irrelevant features, and their data distributions may differ. Existing defect detection methods do not consider these feature differences and data distribution differences and allocate test resources inadequately, so the performance and efficiency of defect prediction are low.
Disclosure of Invention
Purpose of the invention: in view of the shortcomings of the prior art, the invention provides a defect prediction method based on power grid information system feature selection, which allocates test resources more effectively and improves the efficiency and quality of software testing.
Another object of the present invention is to provide a defect prediction apparatus based on grid information system feature selection.
Technical solution: according to a first aspect of the invention, a test defect prediction method based on power grid information system feature selection is provided, comprising the following steps:
(1) acquiring a historical version data set and a to-be-tested version data set of the software and standardizing them;
(2) calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set;
(3) performing class imbalance processing on the training set to obtain a class-balanced training set;
(4) performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
(5) based on the feature-selected training set and test set, using a defect prediction model built with a classification algorithm to predict the defect status of each module of the to-be-tested software version, thereby obtaining a defect prediction result for each module.
According to a second aspect of the invention, a defect prediction device based on power grid information system feature selection is provided, comprising:
a data acquisition module, used for acquiring a historical version data set and a to-be-tested version data set of the software and standardizing them;
a training set construction module, used for calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting the k nearest instances from the historical version data set according to the similarity, and constructing a training set;
a training set processing module, used for performing class imbalance processing on the training set to obtain a class-balanced training set;
a feature selection module, used for performing feature selection on the class-balanced training set and the standardized to-be-tested version data set;
and a prediction module, used for predicting the defect status of each module of the to-be-tested software version with a defect prediction model built with a classification algorithm, based on the feature-selected training set and test set, to obtain a defect prediction result for each module.
Beneficial effects: the defect prediction method and device based on power grid information system feature selection provided by the invention improve data quality by preprocessing the data sets; make the data distribution of the historical version data set consistent with that of the current to-be-tested version data set through instance selection; select the features strongly related to defects and remove irrelevant features through feature selection, improving the performance and efficiency of defect prediction; and build a defect prediction model on the historical version data set with an improved classification algorithm to predict the defect proneness of each module of the current to-be-tested version, achieving accurate and effective prediction. The corresponding parameters of the prediction model are recorded and updated at the same time, to serve as supporting data for test defect prediction of the power grid information system. The invention helps software testers predict, before testing, which software modules are likely to be defective, so that test resources are allocated more effectively and the efficiency and quality of software testing improve.
Drawings
FIG. 1 is a general schematic diagram of a grid information system feature selection-based defect prediction method of the present invention;
FIG. 2 is a flow chart of the defect prediction method based on power grid information system feature selection of the present invention.
Detailed Description
As power grid information systems continue to improve, the various information systems of the power grid accumulate more and more historical versions. For a continuously developed software project with historical versions, the defect data of the historical versions can be mined before software testing, drawing on test experience, and a defect prediction model can be built with data mining and machine learning algorithms to predict the defect status of each module of a subsequent version. Software defects are not randomly distributed; their distribution follows discoverable regularities. By mining historical defect data and analyzing how defects are distributed, defect-prone software modules can be predicted accurately, and most test resources can be allocated to those modules rather than spent on modules that are not defect-prone. Test resources are thus allocated effectively and software testing efficiency improves markedly, while software testing quality is maintained.
The embodiment of the invention provides a defect prediction method based on power grid information system feature selection. It targets the software code of continuously developed, application-type projects in a power grid information system, and takes into account the influence of the data distribution of the data sets on defect prediction before software testing. In summary, the method recommends similar historical version data so that the data distribution of the historical version data set is consistent with that of the to-be-tested version data set, and searches for the distribution regularities of software defects in the historical versions. The method further selects the features strongly related to defects with a feature selection algorithm, builds a defect prediction model with an improved AdaBoost classification algorithm for training and analysis, and records and updates the corresponding parameters of the prediction model to serve as supporting data for test defect prediction of the power grid information system.
A detailed description of the steps of carrying out the method of the present invention is given below with reference to the accompanying drawings. It should be noted that the steps described below are only for the purpose of illustrating the present invention and are not limiting to the present invention.
As shown in FIG. 1 and FIG. 2, in step 1, a historical version data set and a to-be-tested version data set are constructed.
In an embodiment, the historical version data set may be constructed from the number of instances (defective and non-defective) and the number of features in the historical version tests of the information systems used by the power grid. Features are the software metrics of the information system software, which comprise code metrics and process metrics. For example, for the information system projects of a power grid information department, all developed in Java, data mining is performed on the code repositories, version control systems and the like of projects that target different applications and have multiple consecutive versions. The classes in the modules of each historical version are recorded, defect-related software metrics such as cyclomatic complexity and the number of changed code lines are designed, the historical version modules are measured, and each module is labeled as defective or non-defective. A software metric is an index or parameter that describes a characteristic of the software product, and can also be understood as a software feature. At present, software metrics are mainly divided into code metrics and process metrics. Code metrics, chiefly cyclomatic complexity, describe the complexity of the code structure. Process metrics are based on code changes, developer information and the development process, and mainly include the number of changes, the number of developers and the number of changed code lines. Whether a historical version module instance is defective may be determined empirically, for example from historical test records, for the metric values of each software module instance.
The constructed historical version data set is represented as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$ with $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ is the number of instances, $f_{i,j}$ is the value of the $j$-th software metric of instance $a_i$, and $d$ is the number of software metrics.
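As an illustration of this representation only, the data set can be held as a list of labeled instances. The following is a minimal sketch; the class name `ModuleInstance`, the module names and the three metric columns are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModuleInstance:
    """One software module instance (a_i, b_i)."""
    name: str             # hypothetical identifier of the class or module
    metrics: List[float]  # a_i = (f_i1, ..., f_id), one value per software metric
    defective: bool       # b_i: True = defective, False = non-defective


# Hypothetical metric columns: cyclomatic complexity, number of changes, changed lines.
history = [
    ModuleInstance("BillingService", [12.0, 3.0, 140.0], True),
    ModuleInstance("MeterReader", [4.0, 1.0, 20.0], False),
    ModuleInstance("LoadForecast", [7.0, 2.0, 65.0], False),
]

n = len(history)             # number of instances
d = len(history[0].metrics)  # number of software metrics
```

The to-be-tested version data set would use the same metric columns, with the `defective` label unknown until prediction.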
For the to-be-tested version data set, the feature indexes and parameters of the to-be-tested version, that is, the values of the software metrics of each software module instance, are acquired using the same code metrics.
The historical version data set and the to-be-tested version data set obtained in this way are collectively called the original data set.
In step 2, the data in the constructed historical version data set and to-be-tested version data set are preprocessed.
The data recorded in the original data set are preprocessed: data consistency is checked, the data are standardized, obviously distorted data are removed, and the result is sorted and stored. Different software metrics have different value ranges. During standardization, missing values are filled using a random forest, in view of the differing influence of the feature values on defects, and the value range of each software metric is normalized to [0, 1] with the Max-min method, which eliminates the influence of the differing value ranges on the defect prediction result. The normalization formula is:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ is the value of the $j$-th software metric of the $i$-th software module after normalization, $q_{i,j}$ is the value of the $j$-th software metric of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ is the minimum value of the $j$-th software metric over all software modules, and $\mathrm{Max}(q_j)$ is the maximum value. In the description of the present invention, the terms software module, software module instance and instance are used interchangeably.
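The Max-min normalization can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name `min_max_normalize` is hypothetical, and mapping a constant column to 0.0 is an assumption, since the text does not say how a metric with Max equal to Min is handled.

```python
def min_max_normalize(rows):
    """Scale every metric column of `rows` to [0, 1] with the Max-min method."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]    # Min(q_j) for each metric j
    highs = [max(col) for col in columns]   # Max(q_j) for each metric j
    normalized = []
    for row in rows:
        normalized.append([
            # Assumption: a constant column (Max == Min) is mapped to 0.0.
            (q - lo) / (hi - lo) if hi > lo else 0.0
            for q, lo, hi in zip(row, lows, highs)
        ])
    return normalized


raw = [[2.0, 100.0], [4.0, 300.0], [6.0, 200.0]]
scaled = min_max_normalize(raw)
```

Here the minimum and maximum are taken over the rows passed in; whether the historical and to-be-tested data sets are scaled jointly or separately is left open by the description.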
In step 3, a training set and a test set are constructed from the preprocessed data.
Instance recommendation is optional and can be chosen according to actual needs. If no change of developers, development environment or the like occurs during project development, that is, if the data distributions of the historical version data set and the current to-be-tested version data set are consistent, instance recommendation may be skipped; otherwise it is performed. Because a software project is developed continuously, the developers, development environment and so on change between versions, which changes the data distribution of the data set. Instance recommendation selects the relevant data and improves prediction performance.
Specifically, the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set is calculated with the Euclidean distance, and for each instance in the current to-be-tested version data set the k nearest neighbors with the smallest Euclidean distance are selected from the historical version data set. Instances that occur more than once among all the k-nearest-neighbor sets are taken only once, which yields a new data set. After repeated tests of the influence of the value of k on the algorithm, k is set to 8. The formula for the Euclidean distance is prior art and is not described here.
The historical version data set obtained after this processing is used as the training set, and the to-be-tested version data set is used as the test set.
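The instance recommendation step can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `select_training_instances` and the toy data are hypothetical, and the example uses k = 2 instead of the k = 8 reported above so that the effect is visible on five instances.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two metric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_training_instances(history, target, k=8):
    """For each target instance, pick its k nearest historical instances.

    history: list of (metrics, label) pairs from the historical versions.
    target:  list of metric vectors from the to-be-tested version.
    An instance chosen by several target instances is kept only once.
    """
    chosen = set()
    for t in target:
        ranked = sorted(range(len(history)),
                        key=lambda i: euclidean(history[i][0], t))
        chosen.update(ranked[:k])
    return [history[i] for i in sorted(chosen)]


history = [([0.0, 0.0], 0), ([0.1, 0.1], 1), ([0.9, 0.9], 0),
           ([1.0, 1.0], 1), ([0.5, 0.5], 0)]
target = [[0.05, 0.05], [0.95, 0.95]]
training = select_training_instances(history, target, k=2)
```

In this toy case the historical instance at (0.5, 0.5) is near neither target instance and is left out of the training set.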
In step 4, class imbalance processing is performed on the training set.
In most cases the number of non-defective module instances is far greater than the number of defective module instances, so the data in the training set suffer from class imbalance. When classifying an imbalanced data set, correctly classifying the minority-class samples is often more important than correctly classifying the majority-class samples. The invention applies the SMOTE method to the training set to balance the number of defective module instances (minority-class samples) against the number of non-defective module instances (majority-class samples) and obtain a class-balanced training set.
SMOTE sampling processes the minority class by generating new minority-class data so as to balance the data set. The algorithm improves on random oversampling: for a minority-class sample $x$, its k nearest neighbors among the minority-class samples are obtained (the k here is not necessarily equal to the k used for neighbor selection in step 3), and the sampling rate N is set according to the imbalance ratio of the data. Let $x_n$ be a minority-class sample among the k nearest neighbors of $x$; a new sample is generated according to the following formula:
$x_{new} = x + rand(0,1) \cdot |x - x_n|$
The complete steps are as follows:
Step 1. For a randomly chosen minority-class instance p, compute its Euclidean distance to all other minority-class instances and obtain its k nearest neighbors.
Step 2. Randomly draw R neighbors (R ≤ k) from the k nearest neighbors, with replacement.
Step 3. Each of the R drawn instances defines a straight line through the instance p; a point is taken at random on that line to produce a new sample. Repeating this for every drawn instance produces R new instances in total.
Step 4. Add the new samples to the sample set.
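The four steps above can be sketched as follows. This is a minimal sketch rather than the patented implementation: the function name `smote_samples` and the toy data are hypothetical, and the interpolation uses the signed difference between the neighbor and p, as in the commonly published SMOTE formulation, so that the new point lies on the segment between the two instances as Step 3 describes. That reading of the printed formula is an assumption.

```python
import math
import random


def smote_samples(p, minority, k, r, rng):
    """Generate r synthetic minority-class samples around the instance p."""
    # Step 1: k nearest minority-class neighbours of p by Euclidean distance.
    others = [m for m in minority if m is not p]
    neighbours = sorted(others, key=lambda m: math.dist(p, m))[:k]
    # Step 2: draw r neighbours with replacement.
    picks = [rng.choice(neighbours) for _ in range(r)]
    # Step 3: take a random point on the segment between p and each pick.
    new_samples = []
    for x_n in picks:
        gap = rng.random()  # rand(0, 1)
        new_samples.append([a + gap * (b - a) for a, b in zip(p, x_n)])
    return new_samples


rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
new = smote_samples(minority[0], minority, k=2, r=3, rng=rng)
# Step 4: add the new samples to the sample set.
balanced_minority = minority + new
```

With k = 2 the neighbors of the origin are (1, 0) and (0, 1), so every synthetic sample lies on one of the two axes between the origin and a neighbor.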
Simple random oversampling increases the number of samples by randomly copying minority-class samples, so the samples it produces are generated blindly and are of limited use. The SMOTE algorithm instead synthesizes new minority-class samples by linear interpolation according to specific rules. It therefore increases the number of minority-class samples while avoiding the shrinking of the decision region that plain duplication causes, which prevents overfitting to a certain degree and improves classifier performance.
In step 5, feature selection is performed on the class-balanced training set and the normalized data set of the current to-be-tested version.
A feature ranking method is applied to the training set to select the features strongly related to defects and remove irrelevant features. Feature selection is optional. If it is performed, a ranked feature list is obtained with the ReliefF algorithm (RF). According to the configured number of features to select, that number of top-ranked features is taken from the ranked list, and the selected features are displayed by index and name. Finally, the selected features are kept in the class-balanced training set and in the normalized current to-be-tested version data set and the remaining features are removed, giving the feature-selected training set and the feature-selected test set.
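To make the ranking step concrete, the sketch below scores features with a simplified two-class ReliefF and keeps the top-ranked ones. It is an illustration under stated assumptions: it visits every instance instead of sampling, assumes metrics already scaled to [0, 1], and the function names, metric names and toy data are hypothetical.

```python
import math


def relieff_scores(rows, labels, k=3):
    """Simplified two-class ReliefF: reward features that differ on nearest
    misses (other class) and penalize features that differ on nearest hits."""
    d = len(rows[0])
    scores = [0.0] * d
    for i, (x, y) in enumerate(zip(rows, labels)):
        others = [j for j in range(len(rows)) if j != i]
        by_distance = lambda j: math.dist(x, rows[j])
        hits = sorted((j for j in others if labels[j] == y), key=by_distance)[:k]
        misses = sorted((j for j in others if labels[j] != y), key=by_distance)[:k]
        for f in range(d):
            if hits:
                scores[f] -= sum(abs(x[f] - rows[j][f]) for j in hits) / len(hits)
            if misses:
                scores[f] += sum(abs(x[f] - rows[j][f]) for j in misses) / len(misses)
    return [s / len(rows) for s in scores]


def select_top_features(scores, names, m):
    """Return (index, name) of the m highest-scoring features."""
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return [(f, names[f]) for f in ranked[:m]]


rows = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.6]]
labels = [0, 0, 0, 1, 1, 1]
names = ["cyclomatic_complexity", "developer_count"]
scores = relieff_scores(rows, labels, k=2)
selected = select_top_features(scores, names, m=1)
```

In this toy data the first metric separates the two classes and the second does not, so only the first is selected. The same column indices would then be kept in both the training set and the test set.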
In step 6, a classification algorithm is used to build a defect prediction model on the historical version data set obtained after instance recommendation and feature selection. The feature-selected test set is input to the model, the defect status of each module of the current to-be-tested version is predicted, and the defect prediction result for each module is returned.
To further improve the precision and recall of minority-class identification, the classification algorithm is improved. Most conventional classification algorithms assume that all misclassification costs are equal and take higher overall classification accuracy as the final goal. On an imbalanced data set they therefore tend to assign minority-class samples to the majority class, which raises overall accuracy at the expense of the minority class. Yet when classifying an imbalanced data set, correctly classifying the minority-class samples is often more important than correctly classifying the majority-class samples. Cost-sensitive learning builds on this observation and assigns a higher misclassification cost to misclassified minority-class samples. In this embodiment, the processed data set is classified by a defect prediction model built with an improved AdaBoost classification algorithm, in order to improve the classifier's performance on minority-class samples.
Unlike the traditional AdaBoost algorithm, the invention changes the weight update of AdaBoost by introducing a cost matrix into the weight update formula, so that misclassified minority-class samples obtain a higher weight and correctly classified samples have their weight reduced. Specifically, the sample weight update formula of AdaBoost is modified to include a cost-sensitive function $\beta(i)$ determined by the cost matrix [the original and modified update formulas are not reproduced in the source text]. The power grid data sets processed by the invention have no well-determined cost matrix, so $\beta(i)$ amounts to directly giving minority-class samples a coefficient $K$ ($K > 1$): when the weak classifier classifies a sample correctly, $\beta(i) = 1$ and the sample weight decreases normally; when a minority-class sample is classified into the majority class, $\beta(i) = K$ and the sample weight increases at a faster rate; when a majority-class sample is classified into the minority class, $\beta(i) = 1$ and the sample weight increases normally. $\beta(i)$ is referred to here as the cost-sensitive compensation parameter of the $i$-th instance. In this way the weight of misclassified minority-class samples grows faster, and the recognition rate for minority-class samples improves more quickly.
The specific flow of the improved AdaBoost algorithm is as follows:
Input: the training set after feature processing; the number of iterations T; a base learning algorithm.
Output: the combined classifier.
Step 1. Initialize the weight of each sample in the training set to $D_1(i) = 1/n$.
Step 2. For $t = 1, \ldots, T$: train the weak classifier $h_t(x)$ of the $t$-th iteration and calculate its error rate $\varepsilon_t$, also known as the error [formula not reproduced in the source text].
Step 3. Evaluate the error: if $\varepsilon_t > 0.5$ or $\varepsilon_t = 0$, the classifier is unqualified and the iteration terminates; otherwise, update the sample weights according to the weight update formula [formula not reproduced in the source text],
in which $\alpha_t$ is the weight of the weak classifier and the remaining factor is a normalization constant.
The weight update formula raises the misclassification cost of minority-class samples, so that misclassified minority-class samples receive more sample weight. In this way the prediction accuracy for minority-class samples improves more within the same number of iterations.
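The flow can be illustrated with a small cost-sensitive AdaBoost. This is a sketch under explicit assumptions, not the patented formula: the exact weight update is not legible in the text, so the sketch places $\beta(i)$ inside the exponent of the standard update, $D_{t+1}(i) \propto D_t(i)\exp(-\alpha_t y_i h_t(x_i)\beta(i))$, uses the standard $\alpha_t = \tfrac{1}{2}\ln((1-\varepsilon_t)/\varepsilon_t)$, takes decision stumps as the base learner, and encodes the minority (defective) class as +1. All function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] > thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] > thr else -pol


def cost_sensitive_adaboost(X, y, T=10, K=2.0):
    """y uses +1 for the minority (defective) class and -1 for the majority."""
    n = len(X)
    w = [1.0 / n] * n                         # Step 1: D_1(i) = 1/n
    ensemble = []
    for _ in range(T):                        # Step 2: train h_t, get eps_t
        stump = train_stump(X, y, w)
        eps = stump[0]
        if eps > 0.5 or eps == 0:             # Step 3: unqualified, stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        updated = []
        for wi, row, yi in zip(w, X, y):
            h = stump_predict(stump, row)
            # beta(i) = K only when a minority sample is sent to the majority.
            beta = K if (yi == 1 and h == -1) else 1.0
            # Assumption: beta(i) scales the exponent of the standard update.
            updated.append(wi * math.exp(-alpha * yi * h * beta))
        z = sum(updated)                      # normalization constant
        w = [v / z for v in updated]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y = [-1, -1, 1, -1, -1, -1, 1, 1]
single = cost_sensitive_adaboost(X, y, T=1, K=2.0)
model = cost_sensitive_adaboost(X, y, T=5, K=2.0)
```

With K = 1 the update reduces to ordinary AdaBoost, which makes the effect of the compensation parameter easy to compare on the same data.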
On the one hand, the method gives software testers data support for predicting, before testing, which software modules are likely to be defective, which guides the effective allocation of test resources and improves testing efficiency. On the other hand, it helps analyze the causes of software defects, improve the software development process and raise the development quality of subsequent versions.
After instance selection and class imbalance processing of the historical version data set, key features are chosen through feature selection, giving the feature-selected training set and test set. To further improve defect prediction for the minority class, the AdaBoost prediction model is optimized, and prediction models built with a decision tree, the AdaBoost algorithm and the improved AdaBoost algorithm are compared experimentally. Taking into account both false positives and missed detections of minority-class instances, the F1 score of the improved classification method provided by the invention is about 5% higher than that of the decision tree and of AdaBoost. That is, the method improves the precision of minority-class prediction while maintaining the recognition rate, and so improves overall prediction performance.
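For reference, the F1 score used in this comparison is the harmonic mean of precision (which falls with false positives) and recall (which falls with missed detections). The following minimal sketch computes it for the minority class; the function name and the label vectors are hypothetical and are not experimental data from the patent.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive (minority, defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false alarms
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # missed defects
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
```

In this toy case there are two true positives, one false alarm and one missed defect, so precision and recall are both 2/3 and so is the F1 score.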
In view of the characteristics of power information systems, and building on pre-test software defect detection technology, the invention mines the defect data of historical software versions according to test experience, using the continuously developed, multi-version software code of typical power information system applications. It builds a defect prediction model with data mining and machine learning algorithms to predict the defect status of each module of a subsequent version. The method can be used in software defect detection and software defect prediction solutions.
In another embodiment, a defect prediction apparatus based on grid information system feature selection is provided, including:
the data acquisition module is used for acquiring a historical version data set and a version data set of the software to be tested and carrying out standardized processing;
the training set construction module is used for calculating the similarity between each example in the to-be-detected version data set and each example in the historical version data set, selecting k nearest examples from the historical version data set according to the similarity, and constructing a training set;
the training set processing module is used for carrying out class unbalance processing on the training set to obtain a class balanced training set;
the characteristic selection module is used for selecting characteristics of the training set with class balance and the to-be-detected version data set subjected to standardization processing;
and the prediction module is used for predicting the defect condition of each module of the version software to be tested by utilizing a defect prediction model constructed by a classification algorithm based on the training set and the test set selected by the characteristics to obtain the defect prediction result of each module of the version software to be tested.
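The instance selection performed by the training set construction module can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `euclidean` and `select_training_instances` are hypothetical, Euclidean distance stands in for the similarity measure, and the neighbors selected for different to-be-tested instances are merged without duplicates.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_training_instances(history, target, k):
    """Build a training set from the historical version data set.

    history: list of (features, label) pairs from the historical versions.
    target:  list of feature vectors from the version to be tested.
    For every target instance the k nearest historical instances are kept;
    merging them without duplicates is an assumption of this sketch.
    """
    selected = set()
    for t in target:
        ranked = sorted(range(len(history)),
                        key=lambda i: euclidean(history[i][0], t))
        selected.update(ranked[:k])
    return [history[i] for i in sorted(selected)]
```

Selecting only historical instances close to the version under test is intended to make the training distribution resemble the data that will be predicted.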
Wherein, the data acquisition module includes:
a historical version data set acquisition unit, used for recording the classes in the software modules of the historical version and, according to the defect-related software metric elements, measuring whether each software module of the historical version is defective, to obtain a historical version data set expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents an instance of a software module, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric element of instance $a_i$, and $d$ denotes the number of software metric elements;
a to-be-tested version data set acquisition unit, used for acquiring the classes in the software modules of the version to be tested and, according to the defect-related software metric elements, obtaining the value of each software metric element for each software module instance;
a normalization processing unit, used for filling missing values using a random forest and normalizing the value range of each software metric element by the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value of the $j$-th software metric element over all software modules.
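The min-max normalization step can be sketched in a few lines. The function name `min_max_normalize` is hypothetical, the random forest filling of missing values is not shown, and mapping a constant column to 0 is an assumption of this sketch, since that case is not addressed in the text.

```python
def min_max_normalize(rows):
    """Scale every column (software metric element) of `rows` into [0, 1].

    rows: list of equal-length lists of numbers, one list per software module.
    """
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        normalized.append([
            # Assumption: a column with Max == Min carries no information
            # and is mapped to 0.0 to avoid division by zero.
            (q - lo) / (hi - lo) if hi > lo else 0.0
            for q, lo, hi in zip(row, mins, maxs)
        ])
    return normalized
```

In practice the minimum and maximum would typically be computed once and reused, so that the historical and to-be-tested data sets are scaled consistently.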
As a preferred embodiment, the training set processing module performs class imbalance processing on the training set by using a SMOTE sampling algorithm, and the training set processing module specifically includes:
a minority class neighbor determining unit, used for, for an instance p of the minority class, measuring the Euclidean distance from p to all the other minority class instances so as to obtain the k nearest neighbor instances of p;
a neighbor extraction unit, used for randomly drawing R (R ≤ k) neighbors with replacement;
a new sample generation unit, used for, for each of the R instances drawn by the neighbor extraction unit, forming a line between that instance and the instance p and randomly taking a point on the line as a new sample, so that R new samples are generated in total; and
and the training set updating unit is used for adding the new samples generated by the new sample generating unit into the training set to obtain a class balance training set.
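The SMOTE-style oversampling performed by these units can be sketched as follows. This is a simplified illustration, assuming that every minority instance is used as the instance p in turn and that the new point is drawn between p and its neighbor. The function name `smote` and the `seed` parameter are hypothetical.

```python
import math
import random

def smote(minority, k, r, seed=0):
    """Generate r synthetic samples per minority instance (r <= k).

    minority: list of feature vectors of the minority (defective) class.
    """
    rng = random.Random(seed)
    synthetic = []
    for idx, p in enumerate(minority):
        others = [q for j, q in enumerate(minority) if j != idx]
        # k nearest minority neighbours of p by Euclidean distance
        neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
        if not neighbours:
            continue
        for _ in range(r):
            q = rng.choice(neighbours)  # drawn with replacement
            gap = rng.random()          # random position between p and q
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic
```

The returned samples would be labeled as minority-class instances and appended to the training set until the classes are balanced.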
As a preferred embodiment, the feature selection module obtains a ranked feature list through the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains those features in the class-balanced training set and in the normalized to-be-tested version data set, and removes the remaining features, to obtain the feature-selected training set and the feature-selected test set.
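A simplified sketch of Relief-style feature ranking followed by top-ranked feature selection is given below. It is not a full ReliefF implementation: it assumes two classes, features already scaled to [0, 1], Manhattan distance for the neighbor search, and every instance visited once. The names `relieff_scores` and `select_top_features` are hypothetical.

```python
def relieff_scores(X, y, k=1):
    """Score features: reward separation from misses, penalize distance to hits."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        def nearest(same_class):
            cand = [j for j in range(n)
                    if j != i and (y[j] == y[i]) == same_class]
            cand.sort(key=lambda j: sum(abs(a - b) for a, b in zip(X[i], X[j])))
            return cand[:k]
        hits, misses = nearest(True), nearest(False)
        for f in range(d):
            if hits:
                w[f] -= sum(abs(X[i][f] - X[j][f]) for j in hits) / (len(hits) * n)
            if misses:
                w[f] += sum(abs(X[i][f] - X[j][f]) for j in misses) / (len(misses) * n)
    return w

def select_top_features(X_train, y_train, X_test, m, k=1):
    """Keep the m best-scoring features in both the training and test sets."""
    scores = relieff_scores(X_train, y_train, k)
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    keep = sorted(ranked[:m])

    def project(rows):
        return [[row[f] for f in keep] for row in rows]

    return keep, project(X_train), project(X_test)
```

The same feature indices are applied to the training set and the test set so that both keep identical columns.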
As a preferred embodiment, the prediction module comprises a defect prediction model construction unit and a defect prediction unit, the defect prediction model construction unit adopts an improved Adaboost algorithm to construct a defect prediction model and train the model, the defect prediction unit predicts the defect condition of each module of the version software to be tested by using the trained defect prediction model based on the test set data after feature selection, and obtains the defect prediction result of each module of the version software to be tested;
wherein, the defect prediction model construction unit comprises:
an initialization unit, used for initializing the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
an iterative execution unit, used for, for $t = 1, \ldots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
$T$ is the number of iterations, and $y_i$ is the class of the $i$-th instance in the training set;
wherein when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
where $\alpha_t$ is the weight of the weak classifier, the updated weights are scaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter.
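The general shape of a cost-sensitive Adaboost loop can be sketched as follows. The exact error-rate, weight-update and β(i) formulas of the invention are not reproduced in this text, so everything specific here is an assumption: labels are +1 (defective) and -1, weak classifiers are decision stumps, β(i) is a hypothetical constant `minority_cost` applied to misclassified defective instances, and the conventional stopping rule (weighted error of at least 0.5) is used in place of the preset threshold.

```python
import math

def train_stump(X, y, w):
    """Best one-feature threshold classifier under sample weights w."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[f] > thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best

def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] > thr else -sign

def cost_sensitive_adaboost(X, y, rounds=10, minority_cost=2.0):
    n = len(X)
    w = [1.0 / n] * n                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        eps = stump[0]                     # weighted error rate of this round
        if eps >= 0.5:                     # assumed stopping rule
            break
        eps = max(eps, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        new_w = []
        for wi, row, yi in zip(w, X, y):
            pred = stump_predict(stump, row)
            # Hypothetical beta(i): extra weight for missed defective instances.
            beta = minority_cost if (yi == 1 and pred != yi) else 1.0
            new_w.append(wi * beta * math.exp(-alpha * yi * pred))
        z = sum(new_w)                     # normalization constant
        w = [wi / z for wi in new_w]
    return ensemble

def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1
```

Raising the weight of missed defective instances faster than that of other errors pushes later weak classifiers to focus on the minority class.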
As a preferred embodiment, the defect prediction apparatus further includes an optimization module, used for optimizing and updating the prediction model by taking the feature-selected to-be-tested version data set as a test set.
It should be understood that the defect prediction apparatus based on grid information system feature selection in the embodiment of the present invention may implement all the technical solutions of the above method embodiments. The functions of each functional module may be implemented according to the methods in the above method embodiments, and for specific implementation processes and calculation formulas not described in detail in the apparatus embodiment, reference may be made to the relevant descriptions in the above embodiments.
Based on the same technical concept as the method embodiment, according to another embodiment of the present invention, there is provided a computer apparatus including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the processors implement the steps in the method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. A defect prediction method based on grid information system feature selection is characterized by comprising the following steps:
(1) acquiring a historical version data set and a version data set of software to be tested and carrying out standardized processing;
(2) calculating the similarity between each instance in the to-be-detected version data set and each instance in the historical version data set, selecting k instances nearest to each instance in the to-be-detected version data set from the historical version data set according to the similarity, and constructing a training set;
(3) carrying out class unbalance processing on the training set to obtain a class balanced training set;
(4) carrying out feature selection on the training set with class balance and the to-be-detected version data set subjected to normalization processing;
(5) and predicting the defect condition of each module of the version software to be tested by utilizing a defect prediction model constructed by a classification algorithm based on the training set and the test set selected by the characteristics to obtain the defect prediction result of each module of the version software to be tested.
2. The grid information system feature selection-based defect prediction method according to claim 1, wherein in the step (1), acquiring the historical version data set of the software to be tested comprises: recording the classes in the software modules of the historical version and, according to the defect-related software metric elements, measuring whether each software module of the historical version is defective, to obtain a historical version data set expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents an instance of a software module, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric element of instance $a_i$, and $d$ denotes the number of software metric elements;
acquiring the to-be-tested version data set comprises: acquiring the classes in the software modules of the version to be tested and, according to the defect-related software metric elements, obtaining the value of each software metric element for each software module instance.
3. The grid information system feature selection-based defect prediction method according to claim 2, wherein the normalization process comprises: filling missing values using a random forest and normalizing the value range of each software metric element by the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value of the $j$-th software metric element over all software modules.
4. The grid information system feature selection-based defect prediction method according to claim 1, wherein in the step (2), the euclidean distance is used to calculate the similarity between each instance in the to-be-measured version data set and each instance in the historical version data set.
5. The grid information system feature selection-based defect prediction method according to claim 1, wherein the class imbalance processing is performed on the training set by using a SMOTE sampling algorithm in the step (3), and the method comprises the following steps:
(3-1) for a randomly chosen minority class instance p, measuring the Euclidean distance from p to all the other minority class instances to obtain its k nearest neighbor instances;
(3-2) randomly drawing R (R ≤ k) neighbors with replacement;
(3-3) for each of the R instances, forming a line between that instance and the instance p and randomly taking a point on the line as a new sample, so that R new samples are generated in total;
(3-4) adding the newly generated samples to the training set to obtain a class-balanced training set.
6. The grid information system feature selection-based defect prediction method according to claim 1, wherein the defect prediction model in the step (5) is constructed by adopting an improved Adaboost algorithm, and the method comprises the following steps:
(5-1) initializing the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
(5-2) for $t = 1, \ldots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
$T$ is the number of iterations, and $y_i$ is the class of the $i$-th instance in the training set;
(5-3) when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, updating the sample weights according to the following formula:
where $\alpha_t$ is the weight of the weak classifier, the updated weights are scaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter.
7. a defect prediction device based on grid information system feature selection is characterized by comprising:
the data acquisition module is used for acquiring a historical version data set and a version data set of the software to be tested and carrying out standardized processing;
the training set construction module is used for calculating the similarity between each example in the to-be-detected version data set and each example in the historical version data set, selecting k nearest examples from the historical version data set according to the similarity, and constructing a training set;
the training set processing module is used for carrying out class unbalance processing on the training set to obtain a class balanced training set;
the characteristic selection module is used for selecting characteristics of the training set with class balance and the to-be-detected version data set subjected to standardization processing;
and the prediction module is used for predicting the defect condition of each module of the version software to be tested by utilizing a defect prediction model constructed by a classification algorithm based on the training set and the test set selected by the characteristics to obtain the defect prediction result of each module of the version software to be tested.
8. The grid information system feature selection-based defect prediction device of claim 7, wherein the data acquisition module comprises:
a historical version data set acquisition unit, used for recording the classes in the software modules of the historical version and, according to the defect-related software metric elements, measuring whether each software module of the historical version is defective, to obtain a historical version data set expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents an instance of a software module, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric element of instance $a_i$, and $d$ denotes the number of software metric elements;
a to-be-tested version data set acquisition unit, used for acquiring the classes in the software modules of the version to be tested and, according to the defect-related software metric elements, obtaining the value of each software metric element for each software module instance;
a normalization processing unit, used for filling missing values using a random forest and normalizing the value range of each software metric element by the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value of the $j$-th software metric element over all software modules.
9. The grid information system feature selection-based defect prediction device according to claim 7, wherein the training set processing module performs class imbalance processing on a training set by using a SMOTE sampling algorithm, and the training set processing module specifically includes:
a minority class neighbor determining unit, used for, for an instance p of the minority class, measuring the Euclidean distance from p to all the other minority class instances so as to obtain the k nearest neighbor instances of p;
a neighbor extraction unit, used for randomly drawing R (R ≤ k) neighbors with replacement;
a new sample generation unit, used for, for each of the R instances drawn by the neighbor extraction unit, forming a line between that instance and the instance p and randomly taking a point on the line as a new sample, so that R new samples are generated in total; and
and the training set updating unit is used for adding the new samples generated by the new sample generating unit into the training set to obtain a class balance training set.
10. The grid information system feature selection-based defect prediction device according to claim 7, wherein the feature selection module obtains a ranked feature list through the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains those features in the class-balanced training set and in the normalized to-be-tested version data set, and removes the remaining features, to obtain a feature-selected training set and a feature-selected test set;
and wherein
the prediction module comprises a defect prediction model construction unit and a defect prediction unit, the defect prediction model construction unit adopts an improved Adaboost algorithm to construct a defect prediction model and train the model, the defect prediction unit predicts the defect condition of each module of the version software to be tested by using the trained defect prediction model based on the test set data after feature selection to obtain the defect prediction result of each module of the version software to be tested,
the defect prediction model construction unit includes:
an initialization unit, used for initializing the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
an iterative execution unit, used for, for $t = 1, \ldots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
$T$ is the number of iterations, and $y_i$ is the class of the $i$-th instance in the training set;
wherein when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
where $\alpha_t$ is the weight of the weak classifier, the updated weights are scaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110339177.4A CN113127342B (en) | 2021-03-30 | 2021-03-30 | Defect prediction method and device based on power grid information system feature selection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113127342A (en) | 2021-07-16 |
CN113127342B (en) | 2023-06-09 |
Family
ID=76774868
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110339177.4A Active CN113127342B (en) | 2021-03-30 | 2021-03-30 | Defect prediction method and device based on power grid information system feature selection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113127342B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105677564A (en) * | 2016-01-04 | 2016-06-15 | 中国石油大学(华东) | Adaboost software defect unbalanced data classification method based on improvement |
WO2017131263A1 (en) * | 2016-01-29 | 2017-08-03 | 한국과학기술원 | Hybrid instance selection method using nearest neighboring point for cross project defect prediction |
US20180210944A1 (en) * | 2017-01-26 | 2018-07-26 | Agt International Gmbh | Data fusion and classification with imbalanced datasets |
CN108563556A (en) * | 2018-01-10 | 2018-09-21 | 江苏工程职业技术学院 | Software defect prediction optimization method based on differential evolution algorithm |
CN109977028A (en) * | 2019-04-08 | 2019-07-05 | 燕山大学 | A kind of Software Defects Predict Methods based on genetic algorithm and random forest |
AU2020100709A4 (en) * | 2020-05-05 | 2020-06-11 | Bao, Yuhang Mr | A method of prediction model based on random forest algorithm |
WO2020199345A1 (en) * | 2019-04-02 | 2020-10-08 | 广东石油化工学院 | Semi-supervised and heterogeneous software defect prediction algorithm employing github |
CN112465040A (en) * | 2020-12-01 | 2021-03-09 | 杭州电子科技大学 | Software defect prediction method based on class imbalance learning algorithm |
Non-Patent Citations (1)
Title |
---|
Cheng Ming; Wu Guoqing; Yuan Mengting: "Software defect prediction based on transfer learning", Acta Electronica Sinica, no. 01 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114816979A (en) * | 2021-11-22 | 2022-07-29 | 江苏科技大学 | A Software Defect Prediction Method Based on Cluster Analysis and Decision Tree Algorithm |
CN114816979B (en) * | 2021-11-22 | 2024-08-20 | 江苏科技大学 | Software defect prediction method based on cluster analysis and decision tree algorithm |
CN114356641A (en) * | 2022-03-04 | 2022-04-15 | 中南大学 | Incremental software defect prediction method, system, equipment and storage medium |
CN114356641B (en) * | 2022-03-04 | 2022-05-27 | 中南大学 | An incremental software defect prediction method, system, device and storage medium |
CN114911800A (en) * | 2022-05-16 | 2022-08-16 | 国网青海省电力公司信息通信公司 | Fault prediction method, device and electronic device for power system |
CN115033493A (en) * | 2022-07-06 | 2022-09-09 | 陕西师范大学 | Workload sensing instant software defect prediction method based on linear programming |
Also Published As
Publication number | Publication date |
---|---|
CN113127342B (en) | 2023-06-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||