CN113127342A

CN113127342A - Defect prediction method and device based on power grid information system feature selection

Info

Publication number: CN113127342A
Application number: CN202110339177.4A
Authority: CN
Inventors: 沈伍强; 龙震岳; 张小陆; 曾纪钧; 梁哲恒
Original assignee: Guangdong Power Grid Co Ltd
Current assignee: Guangdong Power Grid Co Ltd
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-07-16
Anticipated expiration: 2041-03-30
Also published as: CN113127342B

Abstract

The invention provides a defect prediction method and device based on power grid information system feature selection. The method comprises the following steps: acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them; calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set; performing class imbalance processing on the training set; performing feature selection on the class-balanced training set and the normalized to-be-tested version data set; and, based on the feature-selected training set, predicting the defect condition of each module of the software version to be tested with a defect prediction model constructed by a classification algorithm, to obtain a defect prediction result for each module. The invention takes into account the feature differences and data distribution differences between different versions of the software and improves the efficiency and precision of software defect prediction.

Description

Defect prediction method and device based on power grid information system feature selection

Technical Field

The invention relates to the field of software testing, in particular to a method and a device for predicting testing defects of a power grid information system.

Background

During software development, operation, and maintenance, software changes continually because of changing requirements, performance improvements, defect repairs, code refactoring, and the like. As a result the software grows larger, its functions become more complex, the relationships between functional modules become more intricate, and defects become inevitable. Software testing safeguards software quality by executing programs to discover as many defects as possible. It is the most time- and resource-consuming part of a software project, and all programs must be tested with limited testing resources. With the continuous development of distributed power sources, incremental power distribution networks, and the like, the update iterations of the power information system (power grid information system) face ever higher demands on frequency, complexity, and timeliness, which in turn places higher demands on software test defect prediction. In defect prediction for projects with iterative version updates, the source data set and the target data set contain irrelevant features, and their data distributions may differ. Existing defect detection methods consider neither the feature differences nor the data distribution differences, and test resources are allocated inadequately, so the performance and efficiency of defect prediction are low.

Disclosure of Invention

Purpose of the invention: in view of the shortcomings of the prior art, the invention provides a defect prediction method based on power grid information system feature selection, which can allocate test resources more effectively and improve the efficiency and quality of software testing.

Another object of the present invention is to provide a defect prediction apparatus based on grid information system feature selection.

Technical solution: according to a first aspect of the invention, a test defect prediction method based on power grid information system feature selection is provided, which comprises the following steps:

(1) acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them;

(2) calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set;

(3) performing class imbalance processing on the training set to obtain a class-balanced training set;

(4) performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;

(5) based on the feature-selected training set and test set, predicting the defect condition of each module of the software version to be tested with a defect prediction model constructed by a classification algorithm, to obtain the defect prediction result of each module.

According to a second aspect of the present invention, there is provided a defect prediction apparatus based on power grid information system feature selection, including:

a data acquisition module, configured to acquire a historical version data set and a to-be-tested version data set of the software and normalize them;

a training set construction module, configured to calculate the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, select the k nearest instances from the historical version data set according to the similarity, and construct a training set;

a training set processing module, configured to perform class imbalance processing on the training set to obtain a class-balanced training set;

a feature selection module, configured to perform feature selection on the class-balanced training set and the normalized to-be-tested version data set;

and a prediction module, configured to predict, based on the feature-selected training set and test set, the defect condition of each module of the software version to be tested with a defect prediction model constructed by a classification algorithm, to obtain the defect prediction result of each module.

Advantageous effects: the defect prediction method and device based on power grid information system feature selection provided by the invention improve data quality by preprocessing the data sets; make the data distribution of the historical version data set consistent with that of the current to-be-tested version data set through instance selection; select features strongly related to defects and remove irrelevant features through feature selection, improving the performance and efficiency of defect prediction; and construct a defect prediction model on the historical version data set with an improved classification algorithm to predict the defect tendency of each module of the current version to be tested, achieving accurate and effective prediction. The corresponding parameters of the prediction model are also recorded and updated to serve as support data for test defect prediction of the power grid information system. The invention helps software testers predict potentially defective software modules before software testing, so that test resources are allocated more effectively and the efficiency and quality of software testing are further improved.

Drawings

FIG. 1 is a general schematic diagram of a grid information system feature selection-based defect prediction method of the present invention;

FIG. 2 is a flow chart of the defect prediction method based on power grid information system feature selection of the present invention.

Detailed Description

As the power grid information system keeps improving, the different information systems of the power grid accumulate more and more historical versions. For a continuously developed software project with historical versions, the defect data of the historical versions can be mined before software testing according to accumulated testing experience, and a defect prediction model can be constructed with data mining and machine learning algorithms to effectively predict the defect condition of each module of a subsequent version. Software defects are not randomly distributed; their distribution follows discoverable patterns. By mining historical software defect data and analyzing the defect distribution pattern, defect-prone software modules can be accurately predicted, and most test resources can be allocated to those modules rather than being spent on modules that are not defect-prone. Test resources can thus be allocated effectively and software testing efficiency improved significantly while software testing quality is preserved.

The embodiment of the invention provides a defect prediction method based on power grid information system feature selection. The method targets the software code of continuously developed, application-type projects in the power grid information system and takes into account, before software testing, the influence of the data distribution of the data sets on defect prediction. In summary, the method recommends similar historical version data so that the data distribution of the historical version data set is consistent with that of the to-be-tested version data set, and searches for the distribution pattern of software defects in the historical versions. The method further selects features strongly related to defects with a feature selection algorithm, constructs a defect prediction model with an improved AdaBoost classification algorithm for training and analysis, and records and updates the corresponding parameters of the prediction model to serve as support data for test defect prediction of the power grid information system.

A detailed description of the steps of carrying out the method of the present invention is given below with reference to the accompanying drawings. It should be noted that the steps described below are only for the purpose of illustrating the present invention and are not limiting to the present invention.

As shown in FIG. 1 and FIG. 2, in step 1, a historical version data set and a to-be-tested version data set are constructed.

In an embodiment, the historical version data set may be constructed based on the number of instances (number of defective and number of non-defective instances) and the number of features of the historical version tests of the information system used by the power grid. Features refer to the software metrics of the information system software, which comprise code metrics and process metrics. For example, the information testing projects running in a power grid information department are all developed in Java. Data mining is performed on the code repositories, version control systems, and the like of projects that serve different applications and have several consecutive versions; the classes in the historical version modules of each project are recorded; defect-related software metrics, such as cyclomatic complexity and number of changed code lines, are designed and measured; and each historical version module is labeled as defective or non-defective. Software metrics are indexes and parameters that describe the characteristics of a software product and can also be understood as software features. At present, software metrics are mainly divided into code metrics and process metrics. Code metrics mainly refer to cyclomatic complexity and describe the complexity of the code structure. Process metrics are mainly based on code changes, developer information, and the development process, and chiefly include the number of changes, the number of developers, and the number of changed code lines. Whether a historical version module instance is defective may be determined empirically, for example from historical test records, for the metric values of each software module instance.

The constructed historical version data set is represented as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric of instance $a_i$, and $d$ denotes the number of software metrics.
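As a minimal sketch of how the data set DATA described above could be held in memory, each instance a_i can be stored with its metric vector and class label b_i. The class name `ModuleInstance`, the helper `build_dataset`, the module names, and the three example metrics (cyclomatic complexity, changed lines, number of developers) are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ModuleInstance:
    """One software module instance a_i with its label b_i (hypothetical layout)."""
    name: str                    # e.g. a Java class name (illustrative)
    metrics: Tuple[float, ...]   # (f_i1, ..., f_id)
    defective: bool              # b_i: True = defective, False = non-defective


def build_dataset(rows) -> List[ModuleInstance]:
    """Build DATA from (name, metrics, defective) rows and check that d is constant."""
    data = [ModuleInstance(n, tuple(float(v) for v in m), bool(b)) for n, m, b in rows]
    if len({len(inst.metrics) for inst in data}) > 1:
        raise ValueError("all instances must have the same number d of metrics")
    return data


# Illustrative rows: (cyclomatic complexity, changed lines, number of developers)
history = build_dataset([
    ("OrderService", (12, 340, 5), True),
    ("UserDao", (3, 20, 1), False),
])
```

The to-be-tested version data set could use the same layout with the label left unknown until prediction.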

For the to-be-tested version data set, the feature indexes and parameters of the version to be tested, that is, the values of the software metrics in each software module instance, are acquired based on the same software metrics.

The historical version data set and the to-be-tested version data set obtained in this way are collectively referred to as the original data set.

In step 2, the data in the constructed historical version data set and to-be-tested version data set are preprocessed.

Preprocessing the data recorded in the original data set comprises: checking data consistency, normalizing the data, removing obviously distorted data, and sorting and storing the result. Because feature values influence defects to different degrees, missing values are filled using a random forest. Because different software metrics have different value ranges, the value range of each software metric is normalized to [0, 1] with the Max-min method, which eliminates the influence of differing value ranges on the defect prediction result. The normalization formula is:

$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$

where $p_{i,j}$ denotes the value of the $j$-th software metric of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric among all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value. In the description of the present invention, the terms software module, software module instance, and instance are used interchangeably.
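The Max-min normalization described above can be sketched as follows. This is a minimal illustration in plain Python; the function name `min_max_normalize` is hypothetical, and mapping a constant column (where the maximum equals the minimum) to 0.0 is an assumption made here to avoid division by zero, since the text does not address that case.

```python
def min_max_normalize(rows):
    """Scale every metric column of `rows` to [0, 1] with the Max-min formula."""
    if not rows:
        return []
    d = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(d)]
    hi = [max(r[j] for r in rows) for j in range(d)]
    out = []
    for r in rows:
        out.append([
            # Assumption: a constant column (Max == Min) is mapped to 0.0.
            (r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
            for j in range(d)
        ])
    return out


# Two metrics with very different value ranges
raw = [[2.0, 100.0], [4.0, 300.0], [10.0, 200.0]]
scaled = min_max_normalize(raw)
```

After scaling, both columns lie in [0, 1], so a metric with a large raw range no longer dominates distance computations in later steps.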

In step 3, a training set and a test set are constructed from the preprocessed data.

Instance recommendation is optional and can be chosen according to actual needs. If the developers, the development environment, and the like did not change during project development, that is, if the data distributions of the historical version data set and the current to-be-tested version data set are consistent, instance recommendation may be skipped; otherwise it is performed. Because a software project is developed continuously, the developers, development environment, and the like change between versions, and the data distribution of the data set changes with them. Instance recommendation selects the relevant historical data and improves prediction performance.

Specifically, the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set is calculated with the Euclidean distance, and for each instance in the current to-be-tested version data set the k neighbor instances with the smallest Euclidean distance are selected from the historical version data set. An instance that appears among the k neighbors of several target instances is taken only once, which yields a new data set. After repeatedly testing the influence of k on the algorithm, k is set to 8. The Euclidean distance formula is prior art and is not described here.

The historical version data set obtained by this processing is used as the training set, and the to-be-tested version data set is used as the test set.
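The instance recommendation of step 3 can be sketched as a nearest-neighbor filter over the historical instances. This is a minimal sketch under stated assumptions: the function names and the toy data are hypothetical, metrics are assumed to be normalized already, and ties in distance are broken by historical index order.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two metric vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_training_instances(history_X, history_y, target_X, k=8):
    """Keep the union of the k nearest historical instances of every target instance."""
    selected = set()
    for t in target_X:
        # Rank historical indices by distance to t and keep the k closest.
        ranked = sorted(range(len(history_X)), key=lambda i: euclidean(history_X[i], t))
        selected.update(ranked[:k])
    idx = sorted(selected)  # a neighbor shared by several targets is kept only once
    return [history_X[i] for i in idx], [history_y[i] for i in idx]


history_X = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0], [0.5, 0.5]]
history_y = [0, 0, 1, 1, 0]
target_X = [[0.0, 0.1], [0.05, 0.0]]
train_X, train_y = select_training_instances(history_X, history_y, target_X, k=2)
```

With k = 2 both target instances share the same two neighbors, so the resulting training set contains two instances rather than four.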

In step 4, class imbalance processing is performed on the training set.

In most cases the number of non-defective module instances is far greater than the number of defective module instances, so the training set has a class imbalance problem. In imbalanced data set classification, correctly classifying minority class samples is often more important than correctly classifying majority class samples. The invention applies the SMOTE method to the training set to balance the number of defective module instances (minority class samples) against the number of non-defective module instances (majority class samples), obtaining a class-balanced training set.

SMOTE sampling processes the minority class and generates new minority class data to balance the data set. The algorithm improves on random oversampling: for a minority class sample $x$, its k nearest minority class neighbors are obtained (this k is not necessarily equal to the k used for neighbor selection in step 3), and the sampling rate N is set according to the imbalance ratio of the data. Let $x_n$ be a minority class sample among the k neighbors of $x$; a new sample is generated according to the following formula:

$x_{new} = x + \mathrm{rand}(0,1) \times |x - x_n|$

The complete steps are as follows:

Step 1. For a randomly chosen minority class instance p, measure its distance to all other minority class instances using the Euclidean distance, and obtain its k neighbor instances.

Step 2. Randomly draw R (R ≤ k) of the k neighbors with replacement.

Step 3. Each of the R instances forms a straight line with instance p; randomly take a point on that line to generate a new sample. Repeating this for every drawn instance generates R new instances in total.

Step 4. Add these new samples to the sample set.
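The four SMOTE steps above can be sketched in plain Python as follows. The function names `synthesize` and `balance`, the fixed seed, and the stopping rule (generate until the minority count equals the majority count) are assumptions for illustration. The sketch interpolates with the signed difference, x + r * (x_n - x), so that each new point lies on the segment between p and its neighbor as Step 3 describes.

```python
import math
import random


def synthesize(p, minority, k, r_count, rng):
    """Steps 1-3: find k minority neighbors of p, draw r_count of them with
    replacement, and interpolate one new sample toward each drawn neighbor."""
    others = [x for x in minority if x is not p]
    neighbors = sorted(others, key=lambda x: math.dist(p, x))[:k]
    new_samples = []
    for _ in range(r_count):
        x_n = rng.choice(neighbors)   # drawing with replacement
        gap = rng.random()            # rand(0, 1)
        new_samples.append([a + gap * (b - a) for a, b in zip(p, x_n)])
    return new_samples


def balance(minority, majority_count, k=5, seed=0):
    """Step 4: add synthetic samples until the minority class matches the majority."""
    if len(minority) < 2:
        raise ValueError("need at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    out = list(minority)
    while len(out) < majority_count:
        p = rng.choice(minority)
        need = majority_count - len(out)
        out.extend(synthesize(p, minority, k, min(k, need), rng))
    return out


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = balance(minority, majority_count=10)
```

Because every synthetic point is an interpolation between two existing minority instances, it stays inside the region already occupied by the minority class.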

The new samples produced by simple random oversampling suffer from blindness and limitation, because that method increases the number of samples by randomly copying minority class samples. The SMOTE algorithm instead uses linear interpolation to synthesize new minority class samples according to specific rules. It can therefore increase the number of minority class samples while avoiding the shrinking decision region that plain duplication causes, which reduces overfitting to some extent and improves classifier performance.

In step 5, feature selection is performed on the class-balanced training set and the normalized data set of the current version to be tested.

A feature ranking method ranks the features of the training set; features strongly related to defects are selected and irrelevant features are removed. Feature selection is optional. If it is performed, a ranked feature list is obtained with the ReliefF algorithm (RF). According to the configured number of features to select, the specified number of top-ranked features is taken from the ranked list, and the selected features are displayed by serial number and name. Finally, these features are kept in the class-balanced training set and the normalized current to-be-tested version data set and the remaining features are removed, yielding the feature-selected training set and the feature-selected test set.
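The ranking-and-projection procedure above can be sketched with a simplified two-class ReliefF. This is a sketch under stated assumptions, not the exact variant used in the described method: features are assumed to be normalized to [0, 1], the per-feature difference is the absolute difference, and the function names, feature names, and toy data are hypothetical.

```python
import math


def relieff_scores(X, y, k=3):
    """Simplified two-class ReliefF: a feature scores high when it differs on
    the k nearest misses and agrees on the k nearest hits of each instance."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        hits = sorted((j for j in range(n) if j != i and y[j] == y[i]),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        misses = sorted((j for j in range(n) if y[j] != y[i]),
                        key=lambda j: math.dist(X[i], X[j]))[:k]
        for f in range(d):
            if hits:
                w[f] -= sum(abs(X[i][f] - X[j][f]) for j in hits) / (len(hits) * n)
            if misses:
                w[f] += sum(abs(X[i][f] - X[j][f]) for j in misses) / (len(misses) * n)
    return w


def select_top_features(X_train, y_train, X_test, names, top_n, k=3):
    """Rank features, keep the top_n, and project both data sets onto them."""
    scores = relieff_scores(X_train, y_train, k)
    order = sorted(range(len(names)), key=lambda f: scores[f], reverse=True)[:top_n]
    keep = sorted(order)

    def project(rows):
        return [[r[f] for f in keep] for r in rows]

    ranked = [(f, names[f]) for f in order]   # serial number and name
    return ranked, project(X_train), project(X_test)


X_train = [[0.1, 0.5], [0.2, 0.1], [0.15, 0.9], [0.9, 0.4], [0.8, 0.95], [0.85, 0.2]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[0.3, 0.7]]
names = ["cyclomatic_complexity", "changed_lines"]
ranked, train_sel, test_sel = select_top_features(X_train, y_train, X_test, names, top_n=1, k=2)
```

In this toy data the first feature separates the two classes and the second is noise, so the first feature is ranked on top and is the only one kept in both the training set and the test set.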

In step 6, a classification algorithm constructs a defect prediction model on the historical version data set after instance recommendation and feature selection; the feature-selected test set is input, the defect condition of each module of the current version to be tested is predicted, and the defect prediction result for each module is returned.

To further improve the accuracy and recall of minority class identification, the classification algorithm is improved. Most conventional classification algorithms assume equal misclassification costs and treat overall classification accuracy as the final goal. When they process an imbalanced data set, minority class samples therefore tend to be assigned to the majority class, because doing so still raises overall accuracy. In imbalanced data set classification, however, correctly classifying minority class samples is often more important than correctly classifying majority class samples. Cost-sensitive learning builds on this observation and assigns a higher misclassification cost to misclassified minority class samples. In this embodiment, a defect prediction model constructed with an improved AdaBoost classification algorithm classifies the processed data set, improving the classifier's performance on minority class samples.

Unlike the traditional AdaBoost algorithm, the invention changes the weight update of AdaBoost by introducing a cost matrix into the weight update formula, so that misclassified minority class samples obtain higher weights while correctly classified samples have their weights reduced. Specifically, the sample weight update formula in AdaBoost is modified to include a cost-sensitive term β(i).

β(i) is determined by the cost matrix, that is, the cost-sensitive function. The power grid data set processed in the invention has no well-defined cost matrix, so β(i) is equivalent to directly giving minority class samples a coefficient K (K > 1). When the weak classifier classifies a sample correctly, β(i) = 1 and the sample weight decreases normally; when a minority class sample is classified into the majority class, β(i) = K and the sample weight increases at a faster rate; when a majority class sample is classified into the minority class, β(i) = 1 and the sample weight increases normally. β(i) is referred to here as the cost-sensitive compensation parameter of the i-th instance. In this way the weights of misclassified minority class samples grow faster, and the recognition rate for minority class samples improves more quickly.

The specific flow of the improved AdaBoost algorithm is as follows:

Input: the feature-selected training set; the number of iterations T; a base learning algorithm.

Output: the combined classifier.

Step 1. Initialize the sample weights in the training set to $D_1(i) = 1/n$.

Step 2. For t = 1, ..., T, train the t-th weak classifier $h_t(x)$ on the training set and calculate the error rate $\varepsilon_t$ of the t-th iteration classifier, also known as the error.

Step 3. Evaluate the error: if $\varepsilon_t > 0.5$ or $\varepsilon_t = 0$, the classifier is unqualified and the iteration terminates; otherwise, update the sample weights with the modified weight update formula, which uses the weak classifier weight $\alpha_t$, a normalization constant, and the cost-sensitive compensation parameter β(i).

Step 4. Output the combined classifier.

Increasing the misclassification cost of minority class samples in the weight update formula gives misclassified minority class samples more sample weight. In this way the prediction accuracy for minority class samples improves more within the same number of iterations.
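The flow above can be sketched with decision stumps as the base learner. Several details are assumptions, because the exact formulas are not reproduced in this text: labels are taken as +1 (defective, minority) and -1 (non-defective), the weak classifier weight is the standard AdaBoost choice α_t = ½ ln((1 − ε_t)/ε_t), and β(i) is placed inside the exponent, D_{t+1}(i) ∝ D_t(i) · exp(−α_t · y_i · h_t(x_i) · β(i)), which is one plausible reading consistent with the described behavior of β(i). All function names are hypothetical.

```python
import math


def stump_predict(stump, row):
    """Decision stump: predict `pol` when row[f] > thr, otherwise -pol."""
    f, thr, pol = stump
    return pol if row[f] > thr else -pol


def train_stump(X, y, D):
    """Return (weighted error, stump) of the best stump under weights D."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                stump = (f, thr, pol)
                err = sum(D[i] for i, row in enumerate(X)
                          if stump_predict(stump, row) != y[i])
                if best is None or err < best[0]:
                    best = (err, stump)
    return best


def update_weights(D, y, preds, alpha, K):
    """Assumed cost-sensitive update: beta = K only for misclassified minority (+1)."""
    new_D = []
    for w, yi, hi in zip(D, y, preds):
        beta = K if (yi == 1 and hi != yi) else 1.0
        new_D.append(w * math.exp(-alpha * yi * hi * beta))
    z = sum(new_D)                      # normalization constant
    return [w / z for w in new_D]


def fit(X, y, T=10, K=2.0):
    D = [1.0 / len(X)] * len(X)         # Step 1: uniform weights
    ensemble = []
    for _ in range(T):                  # Step 2: train weak classifiers
        err, stump = train_stump(X, y, D)
        if err > 0.5 or err == 0:       # Step 3: stopping rule as stated in the text
            break
        alpha = 0.5 * math.log((1 - err) / err)
        preds = [stump_predict(stump, row) for row in X]
        D = update_weights(D, y, preds, alpha, K)
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, row):
    """Step 4: sign of the alpha-weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score > 0 else -1


X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [-1, -1, 1, 1, -1, -1]              # +1 = defective (minority)
model = fit(X, y, T=3, K=2.0)
```

Note that with the stopping rule as stated, a first weak classifier with zero error ends the loop before anything is added to the ensemble; a practical implementation would probably keep such a classifier, but the sketch follows the text.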

With this method, on the one hand, software testers are helped to predict potentially defective software modules before software testing and are given corresponding data support, which guides the effective allocation of test resources and improves testing efficiency; on the other hand, the causes of software defects can be analyzed, the software development process improved, and the development quality of subsequent versions raised.

After instance selection and class imbalance processing of the historical version data set, key features are selected through feature selection to obtain the feature-selected test set and training set. To further improve defect prediction for the minority class, the AdaBoost prediction model is optimized, and prediction models built with a decision tree, the AdaBoost algorithm, and the improved AdaBoost algorithm are compared experimentally. Taking into account both the false positives and the missed detections of the prediction model on minority class instances, the F1 score of the improved classification model proposed by the invention is about 5% higher than that of each of DecisionTree and AdaBoost. That is, the method improves the accuracy of minority class identification while preserving the recognition rate, and effectively improves prediction performance.
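For reference, the F1 score used in the comparison above combines precision (which penalizes false positives) and recall (which penalizes missed detections) for the minority class. The following minimal sketch computes it from label lists; the function name and the example labels are illustrative and are not experimental data from the invention.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only: 1 = defective (minority), 0 = non-defective
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_score(y_true, y_pred)
```

Here two of three defective modules are found and one non-defective module is flagged by mistake, so precision and recall are both 2/3.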

In view of the characteristics of the power information system, and building on software defect detection technology applied before software testing, the invention mines the defect data of historical software versions according to accumulated testing experience, using the continuously developed, multi-version software code found in typical applications of the power information system. It constructs a defect prediction model with data mining and machine learning algorithms to effectively predict the defect condition of each module of a subsequent version. The method can be used in software defect detection and software defect prediction solutions.

In another embodiment, a defect prediction apparatus based on power grid information system feature selection is provided, in which the data acquisition module includes:

a historical version data set acquisition unit, configured to record classes in the software modules of the historical version and to measure, according to the defect-related software metrics, whether each historical version software module is defective, obtaining a historical version data set expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric of instance $a_i$, and $d$ denotes the number of software metrics;

a to-be-tested version data set acquisition unit, configured to acquire classes in the software modules of the version to be tested and to obtain the values of the software metrics in each software module instance according to the defect-related software metrics;

a normalization processing unit, configured to fill missing values using a random forest and to normalize the value range of each software metric with the Max-min method, according to the formula:

$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$

where $p_{i,j}$ denotes the value of the $j$-th software metric of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric among all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value of the $j$-th software metric among all software modules.

As a preferred embodiment, the training set processing module performs class imbalance processing on the training set using the SMOTE sampling algorithm, and specifically includes:

a minority class neighbor determining unit, configured to measure, for a randomly chosen minority class instance p, the distance from p to all other minority class instances using the Euclidean distance, so as to obtain the k neighbor instances of p;

a neighbor extraction unit, configured to randomly draw R (R ≤ k) of the k neighbors with replacement;

a new sample generation unit, configured to form, for each of the R instances drawn by the neighbor extraction unit, a straight line with instance p and randomly take a point on that line to generate a new sample, generating R new samples in total; and

a training set updating unit, configured to add the new samples generated by the new sample generation unit to the training set to obtain a class-balanced training set.

As a preferred embodiment, the feature selection module obtains the feature sorted list through a ReliefF algorithm, selects a specified number of features ranked at the top from the feature sorted list, selects the features from the class-balanced training set and the normalized current version data set to be tested, and removes the remaining features to obtain the training set after the feature selection and the test set after the feature selection.

As a preferred embodiment, the prediction module comprises a defect prediction model construction unit and a defect prediction unit, the defect prediction model construction unit adopts an improved Adaboost algorithm to construct a defect prediction model and train the model, the defect prediction unit predicts the defect condition of each module of the version software to be tested by using the trained defect prediction model based on the test set data after feature selection, and obtains the defect prediction result of each module of the version software to be tested;

wherein, the defect prediction model construction unit comprises:

an initialization unit for initializing the sample weight in the training set to D₁(i) 1/n, n is the number of instances;

an iterative execution unit, for i ═ 1.... T, iteratively executing training of the T-th weak classifier h on the training set_t(x) And calculating the error of the t-th iteration classifierRate epsilon_t：

T is the number of iterations, y_iIs the category of the ith example in the training set;

wherein when epsilon_tWhen the value is smaller than the preset threshold value, the classifier is unqualified, and the iteration is terminated; otherwise, updating the weight of the sample according to the following formula:

where $\alpha_t$ is the weight of the weak classifier, the weights are rescaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter;

an output unit for outputting the combined classifier:
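The formulas of the improved AdaBoost procedure are not reproduced in this text, so the following is only a minimal sketch of how the initialization, iterative execution and output units could fit together. Everything formula-specific is an assumption: decision stumps as weak classifiers, labels encoded as +1 (defective) and -1 (non-defective), the conventional classifier weight $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, the conventional stopping check on the weighted error, and a caller-supplied `beta` function that multiplies into the weight update as the cost-sensitive compensation. All function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Return (weighted error, feature, threshold, sign) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[f] >= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, sign)
    return best


def stump_predict(stump, row):
    _, f, thr, sign = stump
    return sign if row[f] >= thr else -sign


def cost_sensitive_adaboost(X, y, T, beta, max_error=0.5):
    n = len(X)
    w = [1.0 / n] * n                      # initialization: D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        eps = stump[0]                     # weighted error of this round
        # Assumption: conventional check (reject once the error reaches
        # max_error); the source applies its own preset-threshold test.
        if eps >= max_error:
            break
        eps = max(eps, 1e-10)              # avoid log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        preds = [stump_predict(stump, row) for row in X]
        # Assumed update: standard exponential factor times beta(y_i, h_t(x_i))
        w = [wi * beta(yi, p) * math.exp(-alpha * yi * p)
             for wi, yi, p in zip(w, y, preds)]
        z = sum(w)                         # normalization constant
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, row):
    """Combined classifier: sign of the alpha-weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

A `beta` such as `lambda y_i, pred: 2.0 if (y_i == 1 and pred == -1) else 1.0` would make a missed defective module weigh more in the next round than a false alarm; the factor 2.0 is purely illustrative.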

As a preferred embodiment, the defect prediction apparatus further includes an optimization module, configured to optimize and update the prediction model by using the feature-selected data set of the version to be tested as a test set.

It should be understood that the defect prediction apparatus based on power grid information system feature selection in the embodiment of the present invention may implement all technical solutions in the above method embodiments. The functions of each functional module may be implemented according to the method in the above method embodiments, and for specific implementation processes and calculation formulas that are not described in detail in the apparatus embodiment, reference may be made to the relevant descriptions in the above embodiments.

Based on the same technical concept as the method embodiment, according to another embodiment of the present invention, there is provided a computer apparatus including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps in the method embodiments.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. A defect prediction method based on power grid information system feature selection, characterized by comprising the following steps:

2. The defect prediction method based on power grid information system feature selection according to claim 1, characterized in that, in step (1), acquiring the historical version data set of the software to be tested comprises: taking the classes in the historical version of the software as software modules, measuring each module according to the defect-related software metric elements, and marking whether each historical-version software module is defective, thereby obtaining the historical version data set, expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \dots, (a_i, b_i), \dots, (a_n, b_n)\}$ with $a_i = (f_{i,1}, f_{i,2}, \dots, f_{i,j}, \dots, f_{i,d})$, where $a_i$ denotes an instance of a software module, $b_i$ denotes the class of that instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the $j$-th software metric element of instance $a_i$, and $d$ denotes the number of software metric elements;

acquiring the data set of the version to be tested comprises: obtaining the classes in the software modules of the version to be tested, and obtaining the value of each software metric element for each software module instance according to the defect-related software metric elements.

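3. The grid information system feature selection-based defect prediction method according to claim 2, wherein the normalization process comprises: filling missing values using a random forest, and normalizing the value range of each software metric element using the Max-min method, with the formula:

$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$

wherein $p_{i,j}$ and $q_{i,j}$ denote the value of the $j$-th software metric element of the $i$-th software module after and before normalization respectively, and $\mathrm{Min}(q_j)$ and $\mathrm{Max}(q_j)$ denote the minimum and maximum values of the $j$-th software metric element among all software modules.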
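As an illustration of the Max-min normalization named in this claim, the sketch below rescales each software metric element (one column per metric, one row per software module) to the range [0, 1]. The function name `min_max_normalize` is hypothetical, the handling of a constant column is an assumption, and the random-forest filling of missing values is not covered.

```python
def min_max_normalize(rows):
    """Scale every column (software metric element) of `rows` into [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        new_row = []
        for value, low, high in zip(row, lows, highs):
            span = high - low
            # Assumption: a constant metric maps to 0.0 instead of dividing by zero.
            new_row.append((value - low) / span if span else 0.0)
        normalized.append(new_row)
    return normalized
```

For example, `min_max_normalize([[1, 10], [3, 10], [5, 10]])` maps the first metric to 0.0, 0.5 and 1.0 and the constant second metric to 0.0 throughout.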

4. The grid information system feature selection-based defect prediction method according to claim 1, wherein in step (2), the Euclidean distance is used to calculate the similarity between each instance in the data set of the version to be tested and each instance in the historical version data set.
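A minimal sketch of this similarity-based selection follows: for every instance of the version to be tested, the k historical instances with the smallest Euclidean distance are collected into the training set. Removing duplicates when several test instances share a neighbor is an assumption, as is the name `select_training_set`.

```python
import math


def select_training_set(history, target, k):
    """history: list of (features, label); target: list of feature vectors.

    Returns the historical instances that are among the k nearest
    neighbors of at least one target instance, in their original order.
    """
    chosen = set()
    for t in target:
        ranked = sorted(range(len(history)),
                        key=lambda i: math.dist(history[i][0], t))
        chosen.update(ranked[:k])          # indices of the k nearest instances
    return [history[i] for i in sorted(chosen)]
```

A smaller Euclidean distance is treated here as a higher similarity, so sorting in ascending order of distance yields the most similar instances first.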

5. The grid information system feature selection-based defect prediction method according to claim 1, wherein in step (3), class imbalance processing is performed on the training set using the SMOTE sampling algorithm, comprising the following steps:

(3-1) for a randomly selected minority-class instance p, computing the Euclidean distance from p to all other minority-class instances to obtain its k nearest neighbors;

(3-2) randomly drawing R neighbors, R being less than or equal to k, from the k nearest neighbors with replacement;

(3-3) for each of the R instances, forming a straight line between that instance and the instance p and randomly taking one point on the line as a new sample, thereby generating R new samples in total;

(3-4) adding the newly generated samples to the training set to obtain a class-balanced training set.
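Steps (3-1) to (3-4) can be sketched as follows. The sketch makes several assumptions that the claim does not state: the new point is taken on the segment between p and the drawn neighbor (as in the usual SMOTE formulation), the procedure is repeated with fresh random instances p until both classes have the same size, and the minority class carries the label "defective". The names `smote_once` and `balance` are hypothetical.

```python
import math
import random


def smote_once(minority, k, r, rng):
    """Steps (3-1) to (3-3): r synthetic samples around one random minority instance."""
    if not 1 <= r <= k:
        raise ValueError("r must satisfy 1 <= r <= k")
    p = rng.choice(minority)                                  # (3-1) random instance p
    others = [x for x in minority if x is not p]
    neighbours = sorted(others, key=lambda x: math.dist(p, x))[:k]
    picks = [rng.choice(neighbours) for _ in range(r)]        # (3-2) R draws with replacement
    samples = []
    for q in picks:                                           # (3-3) random point between p and q
        gap = rng.random()
        samples.append([pi + gap * (qi - pi) for pi, qi in zip(p, q)])
    return samples


def balance(training_set, k, r, minority_label="defective", seed=0):
    """Step (3-4): add synthetic samples until both classes have the same size."""
    rng = random.Random(seed)
    minority = [x for x, label in training_set if label == minority_label]
    missing = (len(training_set) - len(minority)) - len(minority)
    balanced = list(training_set)
    while missing > 0:
        for sample in smote_once(minority, k, r, rng)[:missing]:
            balanced.append((sample, minority_label))
            missing -= 1
    return balanced
```

The sketch needs at least two minority instances, since an instance cannot serve as its own neighbor.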

6. The grid information system feature selection-based defect prediction method according to claim 1, wherein the defect prediction model in the step (5) is constructed by adopting an improved Adaboost algorithm, and the method comprises the following steps:

(5-1) initializing the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;

(5-2) for $t = 1, \dots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\epsilon_t$ of the classifier in the $t$-th iteration:

(5-3) when $\epsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, updating the sample weights according to the following formula:

where $\alpha_t$ is the weight of the weak classifier, the weights are rescaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter;

(5-4) outputting a combined classifier:

7. A defect prediction device based on power grid information system feature selection, characterized by comprising:

8. The grid information system feature selection-based defect prediction device of claim 7, wherein the data acquisition module comprises:

wherein $p_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ denotes the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the $j$-th software metric element among all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value of the $j$-th software metric element among all software modules.

9. The grid information system feature selection-based defect prediction device according to claim 7, wherein the training set processing module performs class imbalance processing on a training set by using a SMOTE sampling algorithm, and the training set processing module specifically includes:

10. The grid information system feature selection-based defect prediction device according to claim 7, wherein the feature selection module obtains a ranked feature list through the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains those features in both the class-balanced training set and the normalized data set of the version to be tested, and removes the remaining features, thereby obtaining a feature-selected training set and a feature-selected test set;

and wherein

the prediction module comprises a defect prediction model construction unit and a defect prediction unit; the defect prediction model construction unit constructs and trains a defect prediction model using an improved AdaBoost algorithm, and the defect prediction unit uses the trained model to predict, based on the feature-selected test set, the defect condition of each module of the software version to be tested, thereby obtaining the defect prediction result for each module,

the defect prediction model construction unit includes:

an iterative execution unit for, for $t = 1, \dots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\epsilon_t$ of the classifier in the $t$-th iteration:

where $\alpha_t$ is the weight of the weak classifier, the weights are rescaled by a normalization constant, and $\beta(i)$ is a cost-sensitive compensation parameter;

an output unit for outputting the combined classifier: