
CN113127342A - Defect prediction method and device based on power grid information system feature selection - Google Patents

Defect prediction method and device based on power grid information system feature selection

Info

Publication number
CN113127342A
Authority
CN
China
Prior art keywords
software
data set
training set
module
version data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110339177.4A
Other languages
Chinese (zh)
Other versions
CN113127342B (en)
Inventor
沈伍强
龙震岳
张小陆
曾纪钧
梁哲恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd
Priority to CN202110339177.4A
Publication of CN113127342A
Application granted
Publication of CN113127342B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/36: Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668: Testing of software
    • G06F11/3672: Test management
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a defect prediction method and device based on power grid information system feature selection. The method comprises the following steps: acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them; calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set; performing class imbalance processing on the training set; performing feature selection on the class-balanced training set and the normalized to-be-tested version data set; and, based on the feature-selected training set, predicting the defect condition of each module of the to-be-tested version of the software with a defect prediction model constructed by a classification algorithm, to obtain a defect prediction result for each module. The invention takes into account the feature differences and data distribution differences between different versions of the software, and improves the efficiency and precision of software defect prediction.

Description

Defect prediction method and device based on power grid information system feature selection
Technical Field
The invention relates to the field of software testing, in particular to a method and a device for predicting testing defects of a power grid information system.
Background
During software development, operation and maintenance, software changes because of changing requirements, performance improvements, defect repairs, code refactoring and the like. As a result, the software grows in scale, its functions become more complex, the relationships between functional modules become more intricate, and defects in the software are inevitable. Software testing safeguards software quality by executing programs to discover as many defects as possible. It is the most time- and resource-consuming part of a software project, and all programs have to be tested with limited testing resources. With the continuous development of distributed power sources, incremental power distribution networks and the like, the power information system (power grid information system) is updated ever more frequently, with growing complexity and tighter timeliness requirements, which places higher demands on software test defect prediction. In defect prediction for projects that are updated iteratively by version, the source data set and the target data set contain irrelevant features, and their data distributions may differ. Existing defect detection methods do not consider these feature differences and data distribution differences, and test resources are allocated inadequately, so the performance and efficiency of defect prediction are low.
Disclosure of Invention
Purpose of the invention: in view of the shortcomings of the prior art, the invention provides a defect prediction method based on power grid information system feature selection, which allocates test resources more effectively and improves the efficiency and quality of software testing.
Another object of the present invention is to provide a defect prediction device based on power grid information system feature selection.
Technical solution: according to a first aspect of the invention, a test defect prediction method based on power grid information system feature selection is provided, which comprises the following steps:
(1) acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them;
(2) calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set;
(3) performing class imbalance processing on the training set to obtain a class-balanced training set;
(4) performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
(5) based on the feature-selected training set and test set, predicting the defect condition of each module of the to-be-tested version of the software with a defect prediction model constructed by a classification algorithm, to obtain the defect prediction result of each module of the to-be-tested version of the software.
According to a second aspect of the present invention, there is provided a defect prediction device based on power grid information system feature selection, including:
a data acquisition module, used for acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them;
a training set construction module, used for calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting the k nearest instances from the historical version data set according to the similarity, and constructing a training set;
a training set processing module, used for performing class imbalance processing on the training set to obtain a class-balanced training set;
a feature selection module, used for performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
and a prediction module, used for predicting, based on the feature-selected training set and test set, the defect condition of each module of the to-be-tested version of the software with a defect prediction model constructed by a classification algorithm, to obtain the defect prediction result of each module of the to-be-tested version of the software.
Beneficial effects: the defect prediction method and device based on power grid information system feature selection provided by the invention improve data quality by preprocessing the data set; make the data distribution of the historical version data set consistent with that of the current to-be-tested version data set through instance selection; select the features strongly related to defects and remove irrelevant features through feature selection, improving the performance and efficiency of defect prediction; and construct a defect prediction model on the historical version data set with an improved classification algorithm to predict the defect proneness of each module of the current to-be-tested version, achieving accurate and effective prediction while recording and updating the corresponding parameters of the prediction model as supporting data for test defect prediction of the power grid information system. The invention helps software testers predict potentially defective software modules before testing, so that test resources are allocated more effectively and the efficiency and quality of software testing are further improved.
Drawings
FIG. 1 is a general schematic diagram of a grid information system feature selection-based defect prediction method of the present invention;
FIG. 2 is a flow chart of a method for fault prediction based on grid information system feature selection in accordance with the present invention.
Detailed Description
As power grid information systems keep improving, the various information systems of the power grid accumulate more and more historical versions. For an ongoing software project with historical versions, the defect data of the historical versions can be mined before software testing, drawing on testing experience, and a defect prediction model can be constructed with data mining and machine learning algorithms to effectively predict the defect condition of each module of a subsequent version. Software defects are not randomly distributed; their distribution follows discoverable patterns. By mining historical software defect data and analyzing the defect distribution pattern, software modules that are prone to defects can be predicted accurately, and most test resources can be allocated to those modules rather than spent on modules without defect proneness. Thus, while software testing quality is maintained, test resources are allocated effectively and testing efficiency improves markedly.
The embodiment of the invention provides a defect prediction method based on power grid information system feature selection. It targets the software code of continuously developed, application-type projects in the power grid information system and takes into account, before software testing, the influence of the data distribution of the data set on defect prediction. In summary, the method comprises: recommending similar historical version data so that the data distribution of the historical version data set is consistent with that of the to-be-tested version data set, and finding the distribution pattern of software defects in the historical versions. The method further comprises: selecting the features strongly related to defects with a feature selection algorithm, constructing a defect prediction model with an improved AdaBoost classification algorithm for training and analysis, and recording and updating the corresponding parameters of the prediction model as supporting data for test defect prediction of the power grid information system.
A detailed description of the steps of carrying out the method of the present invention is given below with reference to the accompanying drawings. It should be noted that the steps described below are only for the purpose of illustrating the present invention and are not limiting to the present invention.
As shown in FIG. 1 and FIG. 2, in step 1, a historical version data set and a to-be-tested version data set are constructed.
In an embodiment, the historical version data set may be constructed from the number of instances (defective and non-defective) and the number of features in the historical version tests of the information systems used by the power grid. Features are the software metric elements of the information system software, i.e., the indexes and parameters that describe the characteristics of the software product; they can also be understood as software features. Software metric elements are currently divided mainly into code metric elements and process metric elements. Code metric elements mainly refer to cyclomatic complexity and describe the complexity of the software code structure. Process metric elements are based on code changes, developer information and the development process, and mainly include the number of changes, the number of developers, the number of changed code lines and the like. For example, all information testing projects run by a power grid information department are developed in Java. Data mining is performed on the code repositories, version control systems and the like of projects that serve different applications and have multiple consecutive versions: the classes in the historical version modules of each project are recorded, defect-related software metric elements such as cyclomatic complexity and the number of changed code lines are designed and measured, and each historical version module is labeled as defective or non-defective. Whether a historical version module instance is defective may be determined empirically, for example from historical test records, given the metric values of each software module instance.
The constructed historical version data set is represented as: $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ denotes the number of instances, $f_{i,j}$ denotes the value of the j-th software metric element of instance $a_i$, and $d$ denotes the number of software metric elements.
For the to-be-tested version data set, the feature indexes and parameters of the to-be-tested version, i.e., the values of the software metric elements of each software module instance, are obtained based on the same code metric elements.
The historical version data set and the to-be-tested version data set obtained in this way are collectively called the original data set.
In step 2, the data in the constructed historical version data set and to-be-tested version data set are preprocessed.
The preprocessing of the data recorded in the original data set comprises: checking data consistency, normalizing the data, removing obviously distorted data, and sorting and storing the result. In the normalization, different software metric elements have different value ranges. Because feature values influence defects to different degrees, missing values are filled with a random forest, and the value range of each software metric element is normalized to [0, 1] with the Max-min method, so as to eliminate the influence of the differing value ranges on the defect prediction result. The formula of the normalization is:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ denotes the value of the j-th software metric element of the i-th software module after normalization, $q_{i,j}$ denotes the value of the j-th software metric element of the i-th software module before normalization, $\mathrm{Min}(q_j)$ denotes the minimum value of the j-th software metric element among all software modules, and $\mathrm{Max}(q_j)$ denotes the maximum value. In the description of the present invention, the terms software module, software module instance and instance may be used interchangeably.
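The Max-min normalization described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the patented implementation: the function name `min_max_normalize`, the sample metric values and the choice to map a constant column to 0.0 are all assumptions made for the example.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every metric column to [0, 1] with p = (q - Min) / (Max - Min)."""
    if not rows:
        return []
    n_cols = len(rows[0])
    mins = [min(r[j] for r in rows) for j in range(n_cols)]
    maxs = [max(r[j] for r in rows) for j in range(n_cols)]
    out = []
    for r in rows:
        scaled = []
        for j, q in enumerate(r):
            span = maxs[j] - mins[j]
            # Assumption: a constant column (Max == Min) is mapped to 0.0
            # to avoid division by zero; the source does not cover this case.
            scaled.append((q - mins[j]) / span if span else 0.0)
        out.append(scaled)
    return out


# Hypothetical metric values per module: [cyclomatic complexity, changed lines]
modules = [[2.0, 10.0], [6.0, 50.0], [10.0, 30.0]]
normalized = min_max_normalize(modules)
print(normalized)
```

Each column is scaled independently, so a metric with a large raw range (such as changed lines) no longer dominates a metric with a small one.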
In step 3, a training set and a test set are constructed from the preprocessed data.
Instance recommendation is optional and can be chosen according to actual needs. If developers, the development environment and the like do not change during project development, i.e., the data distributions of the historical version data set and the current to-be-tested version data set are consistent, instance recommendation may be skipped; otherwise it is performed. In the continuous development of a software project, developers, development environments and the like change between versions, which changes the data distribution of the data set. Instance recommendation selects effective feature data and improves prediction performance.
Specifically, the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set is calculated with the Euclidean distance, and for each instance in the current to-be-tested version data set the k neighboring instances with the smallest Euclidean distance are selected from the historical version data set. Instances that appear repeatedly among all the k-neighbor sets are kept only once, yielding a new data set. After repeated tests of the influence of k on the algorithm, k is set to 8. The formula for the Euclidean distance is prior art and is not described here.
The historical version data set obtained after this processing is used as the training set, and the to-be-tested version data set is used as the test set.
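A minimal sketch of the instance selection in step 3 follows, assuming normalized metric vectors. The names `select_training_set`, `history` and `target` and the toy data are hypothetical; only the Euclidean distance, the k-nearest selection, the removal of duplicates and the default k = 8 come from the description.

```python
import math
from typing import List, Tuple

Instance = Tuple[List[float], str]  # (normalized metric values, class label)


def euclidean(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_training_set(history: List[Instance],
                        target: List[List[float]],
                        k: int = 8) -> List[Instance]:
    """For every to-be-tested instance, keep its k nearest historical
    instances; an instance chosen several times is kept only once."""
    chosen = set()
    for t in target:
        ranked = sorted(range(len(history)),
                        key=lambda i: euclidean(history[i][0], t))
        chosen.update(ranked[:k])
    return [history[i] for i in sorted(chosen)]


# Hypothetical data: four historical modules, two modules under test
history = [([0.0, 0.0], "non-defective"), ([0.1, 0.1], "defective"),
           ([0.9, 0.9], "non-defective"), ([1.0, 1.0], "defective")]
target = [[0.05, 0.05], [0.2, 0.0]]
training = select_training_set(history, target, k=2)
print(training)
```

Collecting indices in a set is what implements "repeated instances are taken only once"; the resulting training set can therefore be smaller than k times the number of to-be-tested instances.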
In step 4, class imbalance processing is performed on the training set.
In most cases, the number of non-defective module instances is much greater than the number of defective module instances, so the data in the training set suffer from class imbalance. In imbalanced data set classification, correctly classifying minority-class samples is often more important than correctly classifying majority-class samples. The invention applies the SMOTE method to the training set to balance the number of defective module instances (minority-class samples) against the number of non-defective module instances (majority-class samples), obtaining a class-balanced training set.
SMOTE sampling processes the minority class by generating synthetic minority-class data so as to balance the data set. The algorithm improves on random oversampling: for a minority-class sample $x$, its k nearest minority-class neighbors are obtained (the k here need not equal the k used for neighbor selection in step 3), the sampling rate N is set according to the imbalance ratio of the data, and, letting $x_n$ be a minority-class sample among the k neighbors of $x$, a new sample is generated according to the following formula:
$$x_{new} = x + \mathrm{rand}(0,1) \cdot |x - x_n|$$
The complete steps are as follows:
Step 1. For a randomly chosen minority-class instance p, measure its Euclidean distance to all other minority-class instances and obtain its k nearest neighbor instances.
Step 2. Randomly draw R (R ≤ k) neighbors with replacement.
Step 3. Each of the R instances forms a straight line with instance p; randomly take a point on that line to generate a new sample. Repeating this for all R instances generates R new instances in total.
Step 4. Add these new samples to the sample set.
New samples synthesized by simple random oversampling suffer from blindness and limitation, because that method merely copies minority-class samples at random to increase their number. The SMOTE algorithm instead uses linear interpolation and synthesizes new minority-class samples according to specific rules. It therefore increases the number of minority-class samples while avoiding the shrinking of the decision region that plain duplication causes, which mitigates overfitting to some degree and improves classifier performance.
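The four SMOTE steps can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patent's code: the function name `smote` and its parameters are hypothetical, the loop visits every minority instance instead of one randomly chosen instance, and the interpolation uses the signed difference `q - p` of standard SMOTE so that each new point lies on the segment between p and its neighbor (the formula as printed above uses an absolute value).

```python
import math
import random
from typing import List


def smote(minority: List[List[float]], k: int, r: int,
          rng: random.Random) -> List[List[float]]:
    """Generate r synthetic samples for every minority-class instance."""
    synthetic = []
    for idx, p in enumerate(minority):
        # Step 1: k nearest minority-class neighbors of p (Euclidean distance)
        others = [q for j, q in enumerate(minority) if j != idx]
        others.sort(key=lambda q: math.dist(p, q))
        neighbors = others[:k]
        if not neighbors:
            continue
        # Step 2: draw r neighbors at random, with replacement
        for q in rng.choices(neighbors, k=r):
            # Step 3: pick a random point on the segment from p to q
            gap = rng.random()
            synthetic.append([pi + gap * (qi - pi) for pi, qi in zip(p, q)])
    # Step 4 is left to the caller: append the result to the training set
    return synthetic


# Hypothetical normalized metric vectors of defective (minority) modules
defective = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_samples = smote(defective, k=2, r=2, rng=random.Random(42))
print(len(new_samples))
```

Passing an explicit `random.Random` instance keeps the oversampling reproducible, which is useful when comparing classifiers on the same balanced training set.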
In step 5, feature selection is performed according to the class-balanced training set and the data set of the current version to be tested after normalization processing.
A feature ranking method is applied to the training set to select the features strongly related to defects and remove irrelevant ones. Feature selection is optional. If it is performed, a ranked feature list is obtained with the ReliefF algorithm (RF). According to the preset number of features to select, the specified number of top-ranked features is taken from the ranked list, and the selected features are displayed by serial number and name. Finally, these features are retained in both the class-balanced training set and the normalized current to-be-tested version data set, and the remaining features are removed, yielding the feature-selected training set and the feature-selected test set.
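To make the ranking-then-selection idea concrete, the sketch below uses the basic Relief weighting (one nearest hit and one nearest miss per instance, two classes, features already scaled to [0, 1]). It is a deliberately simplified stand-in, not the full ReliefF algorithm named in the text, which averages over several neighbors and handles multiple classes and missing values. All function names, feature names and data are hypothetical.

```python
from typing import List, Sequence, Tuple


def relief_scores(X: List[Sequence[float]], y: List[int]) -> List[float]:
    """Basic Relief: reward features that differ on the nearest instance of
    the other class (miss) and agree on the nearest same-class one (hit)."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i, xi in enumerate(X):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        dist = lambda j: sum(abs(a - b) for a, b in zip(xi, X[j]))
        h, m = min(hits, key=dist), min(misses, key=dist)
        for f in range(d):
            w[f] += (abs(xi[f] - X[m][f]) - abs(xi[f] - X[h][f])) / n
    return w


def top_features(scores: List[float], names: List[str],
                 m: int) -> List[Tuple[int, str]]:
    """Return (serial number, name) of the m highest-ranked features."""
    order = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    return [(f, names[f]) for f in order[:m]]


# Hypothetical normalized data: feature 0 separates the classes, feature 1 does not
X = [[0.1, 0.5], [0.2, 0.4], [0.9, 0.5], [0.8, 0.4]]
y = [0, 0, 1, 1]  # 1 = defective, 0 = non-defective
scores = relief_scores(X, y)
selected = top_features(scores, ["cyclomatic_complexity", "changed_lines"], m=1)
print(selected)
```

The same list of selected serial numbers would then be used to keep columns in both the class-balanced training set and the normalized to-be-tested data set, so that the two sets share one feature space.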
In step 6, a classification algorithm is used to construct a defect prediction model on the historical version data set after instance recommendation and feature selection; the feature-selected test set is then input, the defect condition of each module of the current to-be-tested version is predicted, and the defect prediction result of each module is returned.
To further improve the precision and recall of minority-class recognition, the classification algorithm is improved. Most traditional classification algorithms assume that all misclassification costs are equal and take overall classification accuracy as the final goal. When they handle an imbalanced data set, minority-class samples therefore tend to be classified into the majority class, because doing so raises the overall accuracy of the classifier. In imbalanced data set classification, however, correctly classifying minority-class samples is often more important than correctly classifying majority-class samples. Cost-sensitive learning is based on this idea and assigns a higher misclassification cost to misclassified minority-class samples. In this embodiment, the processed data set is classified by a defect prediction model constructed with an improved Adaboost classification algorithm, so as to improve the classifier's performance on minority-class samples.
Unlike the traditional Adaboost algorithm, the invention changes the weight update of Adaboost by introducing a cost matrix into the weight update formula. With the new update, misclassified minority-class samples obtain a higher weight, while correctly classified samples have their weight reduced. Specifically, the sample weight update formula in Adaboost is modified: the original formula
Figure BDA0002998620810000061
is updated to
Figure BDA0002998620810000062
As for determining β(i), that is, the cost matrix or cost-sensitive function: the power grid data set processed in the invention has no well-determined cost matrix. In the invention, β(i) therefore amounts to directly assigning minority-class samples a coefficient K (K > 1). When the weak classifier classifies the sample correctly, β(i) = 1 and the sample weight decreases normally; when a minority-class sample is classified into the majority class, β(i) = K and the sample weight increases at a faster rate; when a majority-class sample is classified into the minority class, β(i) = 1 and the sample weight increases normally. β(i) is referred to here as the cost-sensitive compensation parameter of the i-th instance. In this way, the weight of misclassified minority-class samples grows faster, and the recognition rate of minority-class samples improves more quickly.
The specific flow of the improved Adaboost algorithm is as follows:
Input: the training set after feature processing; the number of iterations T; a base learning algorithm.
Output: the combined classifier.
Step 1. Initialize the sample weights in the training set to $D_1(i) = 1/n$.
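Step 2. For t = 1, …, T: train a weak classifier $h_t(x)$ with the base learning algorithm under the current sample weights, and calculate the error rate $\varepsilon_t$ (also called the error) of the classifier in the t-th iteration: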
Figure BDA0002998620810000071
Step 3. Evaluate the error: if $\varepsilon_t > 0.5$ or $\varepsilon_t = 0$, the classifier is unqualified and the iteration terminates; otherwise, update the sample weights according to the following formula:
Figure BDA0002998620810000072
where $\alpha_t$ is the weight of the weak classifier,
Figure BDA0002998620810000073
and the normalization constant is
Figure BDA0002998620810000074
Step 4. Output the combined classifier:
Figure BDA0002998620810000075
The cost of misclassifying minority-class samples is increased in the weight update formula, so that misclassified minority-class samples receive more sample weight. In this way, the prediction accuracy for minority-class samples improves more within the same number of iterations.
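The flow above can be illustrated with a small cost-sensitive AdaBoost over decision stumps. Because the formula images are not reproduced in this text, the exact form of the modified update is an assumption: the sketch places β(i) inside the exponent, D_{t+1}(i) ∝ D_t(i)·exp(−α_t·β(i)·y_i·h_t(x_i)), with β(i) = K only when a minority-class sample (label +1) is predicted as majority (−1). The weighted error and α_t = ½·ln((1 − ε_t)/ε_t) follow standard AdaBoost. The decision stump as base learner and all names and data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(s: Stump, x: Sequence[float]) -> int:
    f, thr, pol = s
    return pol if x[f] > thr else -pol


def fit_stump(X, y, D) -> Tuple[Stump, float]:
    """Base learner: the stump with the lowest weighted error under D."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                s = (f, thr, pol)
                err = sum(d for x, yi, d in zip(X, y, D)
                          if stump_predict(s, x) != yi)
                if err < best_err:
                    best, best_err = s, err
    return best, best_err


def cost_sensitive_adaboost(X, y, T: int, K: float = 2.0):
    """y uses +1 for the minority (defective) class and -1 for the majority."""
    n = len(X)
    D = [1.0 / n] * n                      # Step 1: uniform initial weights
    ensemble: List[Tuple[float, Stump]] = []
    for _ in range(T):                     # Step 2: one weak classifier per round
        stump, eps = fit_stump(X, y, D)
        if eps > 0.5 or eps == 0:          # Step 3: unqualified classifier, stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        new_D = []
        for x, yi, d in zip(X, y, D):
            h = stump_predict(stump, x)
            # Assumed update: beta = K only for minority samples classified as majority
            beta = K if (yi == 1 and h == -1) else 1.0
            new_D.append(d * math.exp(-alpha * beta * yi * h))
        Z = sum(new_D)                     # normalization constant
        D = [d / Z for d in new_D]
        ensemble.append((alpha, stump))
    return ensemble


def predict(ensemble, x: Sequence[float]) -> int:
    # Step 4: sign of the alpha-weighted vote of all weak classifiers
    score = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical one-feature data that no single stump separates perfectly
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [-1, -1, 1, -1, 1, 1]
model = cost_sensitive_adaboost(X, y, T=5, K=2.0)
print([predict(model, x) for x in X])
```

With K = 1 the update reduces to ordinary AdaBoost, so the effect of the compensation parameter can be checked by training the same data with both settings and comparing minority-class recall.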
With this method, on the one hand, data support is provided to help software testers predict potentially defective software modules before testing, which guides the effective allocation of test resources and improves testing efficiency; on the other hand, the causes of software defects can be analyzed, which improves the software development process and the development quality of subsequent versions.
After instance selection and class imbalance processing of the historical version data set, key features are chosen by feature selection to obtain the feature-selected test set and training set. To further improve defect prediction for the minority class, the Adaboost prediction model is optimized, and prediction models built with a decision tree, the Adaboost algorithm and the improved Adaboost algorithm are compared experimentally. Taking into account both the false positives and the missed detections of the prediction model on minority-class instances, the F1 score of the improved classification model proposed by the invention is about 5% higher than that of DecisionTree and of Adaboost; that is, the method improves the accuracy of minority-class instance prediction while maintaining the recognition rate, and effectively improves prediction performance.
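For reference, the F1 score used in this comparison combines precision (penalizing false positives) and recall (penalizing missed detections) of the minority class. The sketch below computes it from predicted and true labels; the function name and the labels are hypothetical examples and do not reproduce the patent's experimental data.

```python
from typing import List


def f1_score(y_true: List[str], y_pred: List[str],
             positive: str = "defective") -> float:
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight modules: "d" = defective, "n" = non-defective
y_true = ["d", "d", "d", "n", "n", "n", "n", "n"]
y_pred = ["d", "d", "n", "d", "n", "n", "n", "n"]
f1 = f1_score(y_true, y_pred, positive="d")
print(round(f1, 3))
```

Here two defective modules are found, one is missed and one non-defective module is flagged, so precision and recall are both 2/3 and so is the F1 score.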
In view of the characteristics of the power information system, the invention builds on pre-test software defect detection techniques and on the continuously developed, multi-version software code found in typical power information system applications. It mines the defect data of the historical software versions, drawing on testing experience, and constructs a defect prediction model with data mining and machine learning algorithms to effectively predict the defect condition of each module of a subsequent version. The method can be used in software defect detection and software defect prediction solutions.
In another embodiment, a defect prediction device based on power grid information system feature selection is provided, including:
a data acquisition module, used for acquiring a historical version data set and a to-be-tested version data set of the software and normalizing them;
a training set construction module, used for calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting the k nearest instances from the historical version data set according to the similarity, and constructing a training set;
a training set processing module, used for performing class imbalance processing on the training set to obtain a class-balanced training set;
a feature selection module, used for performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
and a prediction module, used for predicting, based on the feature-selected training set and test set, the defect condition of each module of the to-be-tested version of the software with a defect prediction model constructed by a classification algorithm, to obtain the defect prediction result of each module of the to-be-tested version of the software.
The data acquisition module includes:
a historical version data set acquisition unit, configured to record the classes in the software modules of the historical versions, measure them according to the defect-related software metric elements, and mark whether each software module of the historical versions is defective, to obtain a historical version data set expressed as $\mathrm{DATA} = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ is the number of instances, $f_{i,j}$ is the value of the $j$-th software metric element of instance $a_i$, and $d$ is the number of software metric elements;
a to-be-tested version data set acquisition unit, configured to collect the classes in the software modules of the to-be-tested version and obtain the value of each software metric element of each software module instance according to the defect-related software metric elements;
a normalization processing unit, configured to fill missing values using a random forest and to normalize the value range of each software metric element using the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ is the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ is the maximum value of the $j$-th software metric element over all software modules.
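As an illustration of the min-max step, the sketch below rescales each software metric element (each column) to the range [0, 1] using the standard min-max form p = (q - Min) / (Max - Min). The random forest filling of missing values is not shown. The function name `min_max_normalize` and the handling of a constant column (mapped to 0.0) are assumptions made for the example.

```python
def min_max_normalize(rows):
    """Scale every column of `rows` to [0, 1] with min-max normalization.

    rows: list of equal-length lists, one row per software module instance.
    """
    if not rows:
        return []
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    normalized = []
    for row in rows:
        new_row = []
        for q, lo, hi in zip(row, mins, maxs):
            # A constant column has zero range; mapping it to 0.0 is an assumption.
            new_row.append((q - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(new_row)
    return normalized
```

In practice the minimum and maximum would be computed once and reused, so that the historical and to-be-tested data sets are scaled consistently; the text does not state which data set the extrema are taken from.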
As a preferred embodiment, the training set processing module performs class imbalance processing on the training set using the SMOTE sampling algorithm, and specifically includes:
a minority class neighbor determining unit, configured to measure, using the Euclidean distance, the distance from a minority class instance p to all other minority class instances, to obtain the k nearest neighbors of the instance p;
a neighbor extraction unit, configured to randomly draw R (R ≤ k) of the k neighbors with replacement;
a new sample generation unit, configured to, for each of the R instances randomly drawn by the neighbor extraction unit, form a straight line between that instance and the instance p and randomly take a point on the line as a new sample, generating R new samples in total; and
a training set updating unit, configured to add the new samples generated by the new sample generation unit to the training set to obtain a class-balanced training set.
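The four units above can be sketched for a single minority class instance p as follows. This is an illustrative sketch rather than the implementation of the embodiment: the function name `smote_for_instance`, the explicit random generator argument `rng`, and the restriction of the random point to the segment between p and the drawn neighbor are assumptions.

```python
import math
import random


def smote_for_instance(p, minority, k, r, rng):
    """Generate r synthetic samples around the minority class instance p.

    minority: list of minority class feature vectors (p is one of them).
    k: number of nearest neighbors to consider; r: samples to generate (r <= k).
    rng: a random.Random instance, passed in so results are reproducible.
    """
    def distance(m):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, m)))

    others = [m for m in minority if m is not p]
    neighbors = sorted(others, key=distance)[:k]
    if not neighbors:
        raise ValueError("p has no minority class neighbors")
    new_samples = []
    for _ in range(r):
        nb = rng.choice(neighbors)   # draw a neighbor with replacement
        gap = rng.random()           # random position between p and nb
        new_samples.append([pi + gap * (ni - pi) for pi, ni in zip(p, nb)])
    return new_samples
```

Calling the function for each minority class instance and appending the returned samples to the training set corresponds to the training set updating unit; how many instances are processed until the classes count as balanced is not stated in the text.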
As a preferred embodiment, the feature selection module obtains a ranked feature list using the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains these features in the class-balanced training set and in the normalized to-be-tested version data set, and removes the remaining features, to obtain the feature-selected training set and the feature-selected test set.
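The ranking and selection can be sketched as below. The scoring function is a simplified, two-class ReliefF-style weighting written for illustration: it assumes features already scaled to [0, 1], uses every instance once, and averages over up to k nearest hits and misses. The names `relieff_scores` and `select_top_features` are hypothetical, and the sketch is not claimed to match the exact ReliefF variant used in the embodiment.

```python
def relieff_scores(X, y, k):
    """Simplified two-class ReliefF-style feature weights (higher is better)."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for i in range(n):
        def dist(j):
            # Manhattan distance, i.e. the sum of per-feature differences.
            return sum(abs(a - b) for a, b in zip(X[i], X[j]))

        hits = sorted((j for j in range(n) if j != i and y[j] == y[i]), key=dist)[:k]
        misses = sorted((j for j in range(n) if y[j] != y[i]), key=dist)[:k]
        for f in range(d):
            if hits:
                # Differing from same-class neighbors lowers the weight.
                weights[f] -= sum(abs(X[i][f] - X[j][f]) for j in hits) / (len(hits) * n)
            if misses:
                # Differing from other-class neighbors raises the weight.
                weights[f] += sum(abs(X[i][f] - X[j][f]) for j in misses) / (len(misses) * n)
    return weights


def select_top_features(train_X, test_X, scores, m):
    """Keep the m highest-scoring feature columns in both data sets."""
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    keep = sorted(ranked[:m])

    def project(rows):
        return [[row[f] for f in keep] for row in rows]

    return project(train_X), project(test_X), keep
```

The same column indices are applied to the training set and to the to-be-tested data set, so that the model is trained and evaluated on identical features.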
As a preferred embodiment, the prediction module comprises a defect prediction model construction unit and a defect prediction unit; the defect prediction model construction unit constructs and trains a defect prediction model using an improved AdaBoost algorithm, and the defect prediction unit uses the trained defect prediction model to predict, based on the feature-selected test set, the defect condition of each module of the to-be-tested version of the software, obtaining a defect prediction result for each module of the to-be-tested version of the software;
wherein the defect prediction model construction unit comprises:
an initialization unit, configured to initialize the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
an iterative execution unit, configured to, for $t = 1, \ldots, T$, iteratively train the $t$-th weak classifier $h_t(x)$ on the training set and calculate the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
Figure BDA0002998620810000091
where $T$ is the number of iterations and $y_i$ is the class of the $i$-th instance in the training set;
wherein, when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
Figure BDA0002998620810000092
where $\alpha_t$ is the weight of the weak classifier,
Figure BDA0002998620810000101
is the normalization constant, and $\beta(i)$ is the cost-sensitive compensation parameter,
Figure BDA0002998620810000102
an output unit for outputting the combined classifier:
Figure BDA0002998620810000103
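For orientation, the overall structure of such a boosting loop can be sketched as follows. The sketch implements textbook discrete AdaBoost with decision stumps, not the improved algorithm of the embodiment, whose formulas are given only in the figures. The following are assumptions made for the example: labels encoded as +1 (defective) and -1 (non-defective), the weak classifier weight 0.5 ln((1 - ε) / ε), the stopping test (stop when the weak classifier is no better than chance, which may differ from the preset-threshold test described above), and the way the hypothetical per-sample factor `beta` multiplies the weight update.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] >= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] >= thr else -pol


def adaboost(X, y, T, beta=None):
    """Train up to T weighted stumps; beta is a hypothetical per-sample cost factor."""
    n = len(X)
    beta = beta or [1.0] * n
    w = [1.0 / n] * n                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        stump = train_stump(X, y, w)
        eps = stump[0]
        if eps >= 0.5:                     # textbook stopping test (assumption)
            break
        eps = max(eps, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples, scaled by beta (assumption).
        w = [wi * bi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, bi, yi, row in zip(w, beta, y, X)]
        z = sum(w)                         # normalization constant
        w = [wi / z for wi in w]
    return ensemble


def predict(ensemble, row):
    """Combined classifier: sign of the weighted vote of the weak classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

Passing larger `beta` values for defective instances would make their misclassification more costly in later rounds, which is one plausible reading of a cost-sensitive compensation parameter; the actual definition is the one given in the corresponding figure.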
As a preferred embodiment, the defect prediction apparatus further includes an optimization module, configured to optimize and update the prediction model using the feature-selected to-be-tested version data set as a test set.
It should be understood that the defect prediction apparatus based on power grid information system feature selection in the embodiment of the present invention can implement all the technical solutions of the above method embodiments. The functions of each functional module can be implemented according to the methods in the above method embodiments, and for specific implementation processes and calculation formulas not described in detail in the apparatus embodiment, reference may be made to the relevant descriptions in the above embodiments.
Based on the same technical concept as the method embodiment, according to another embodiment of the present invention, there is provided a computer apparatus including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, which when executed by the processors implement the steps in the method embodiments.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A defect prediction method based on power grid information system feature selection, characterized by comprising the following steps:
(1) acquiring a historical version data set and a to-be-tested version data set of the software under test and normalizing them;
(2) calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and constructing a training set;
(3) performing class imbalance processing on the training set to obtain a class-balanced training set;
(4) performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
(5) predicting, based on the feature-selected training set and test set, the defect condition of each module of the to-be-tested version of the software by using a defect prediction model constructed with a classification algorithm, to obtain a defect prediction result for each module of the to-be-tested version of the software.
2. The defect prediction method based on power grid information system feature selection according to claim 1, characterized in that, in step (1), acquiring the historical version data set of the software under test comprises: recording the classes in the software modules of the historical versions, measuring them according to the defect-related software metric elements, and marking whether each software module of the historical versions is defective, to obtain a historical version data set expressed as $\mathrm{DATA} = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ is the number of instances, $f_{i,j}$ is the value of the $j$-th software metric element of instance $a_i$, and $d$ is the number of software metric elements;
acquiring the to-be-tested version data set comprises: collecting the classes in the software modules of the to-be-tested version, and obtaining the value of each software metric element of each software module instance according to the defect-related software metric elements.
3. The defect prediction method based on power grid information system feature selection according to claim 2, wherein the normalization processing comprises: filling missing values using a random forest and normalizing the value range of each software metric element using the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ is the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ is the maximum value of the $j$-th software metric element over all software modules.
4. The defect prediction method based on power grid information system feature selection according to claim 1, wherein in step (2) the Euclidean distance is used to calculate the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set.
5. The defect prediction method based on power grid information system feature selection according to claim 1, wherein in step (3) the class imbalance processing is performed on the training set using the SMOTE sampling algorithm, comprising the following steps:
(3-1) for a randomly chosen minority class instance p, measuring, using the Euclidean distance, the distance from the instance p to all other minority class instances, to obtain its k nearest neighbors;
(3-2) randomly drawing R (R ≤ k) of the k neighbors with replacement;
(3-3) for each of the R instances, forming a straight line between that instance and the instance p and randomly taking a point on the line as a new sample, generating R new samples in total;
(3-4) adding the newly generated samples to the training set to obtain a class-balanced training set.
6. The defect prediction method based on power grid information system feature selection according to claim 1, wherein the defect prediction model in step (5) is constructed using an improved AdaBoost algorithm, comprising the following steps:
(5-1) initializing the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
(5-2) for $t = 1, \ldots, T$, iteratively training the $t$-th weak classifier $h_t(x)$ on the training set and calculating the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
Figure FDA0002998620800000021
where $T$ is the number of iterations and $y_i$ is the class of the $i$-th instance in the training set;
(5-3) when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
Figure FDA0002998620800000022
where $\alpha_t$ is the weight of the weak classifier,
Figure FDA0002998620800000023
is the normalization constant, and $\beta(i)$ is the cost-sensitive compensation parameter,
Figure FDA0002998620800000024
(5-4) outputting a combined classifier:
Figure FDA0002998620800000025
7. A defect prediction device based on power grid information system feature selection, characterized by comprising:
a data acquisition module, configured to acquire a historical version data set and a to-be-tested version data set of the software under test and to normalize them;
a training set construction module, configured to calculate the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, select from the historical version data set the k instances nearest to each instance in the to-be-tested version data set according to the similarity, and construct a training set;
a training set processing module, configured to perform class imbalance processing on the training set to obtain a class-balanced training set;
a feature selection module, configured to perform feature selection on the class-balanced training set and the normalized to-be-tested version data set;
and a prediction module, configured to predict, based on the feature-selected training set and test set, the defect condition of each module of the to-be-tested version of the software by using a defect prediction model constructed with a classification algorithm, to obtain a defect prediction result for each module of the to-be-tested version of the software.
8. The defect prediction device based on power grid information system feature selection according to claim 7, wherein the data acquisition module comprises:
a historical version data set acquisition unit, configured to record the classes in the software modules of the historical versions, measure them according to the defect-related software metric elements, and mark whether each software module of the historical versions is defective, to obtain a historical version data set expressed as $\mathrm{DATA} = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$, $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ represents a software module instance, $b_i$ represents the class of the instance, $b_i \in Y$, $Y = \{\text{defective}, \text{non-defective}\}$, $n$ is the number of instances, $f_{i,j}$ is the value of the $j$-th software metric element of instance $a_i$, and $d$ is the number of software metric elements;
a to-be-tested version data set acquisition unit, configured to collect the classes in the software modules of the to-be-tested version and obtain the value of each software metric element of each software module instance according to the defect-related software metric elements;
a normalization processing unit, configured to fill missing values using a random forest and to normalize the value range of each software metric element using the min-max method, according to the formula:
$$p_{i,j} = \frac{q_{i,j} - \mathrm{Min}(q_j)}{\mathrm{Max}(q_j) - \mathrm{Min}(q_j)}$$
where $p_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module after normalization, $q_{i,j}$ is the value of the $j$-th software metric element of the $i$-th software module before normalization, $\mathrm{Min}(q_j)$ is the minimum value of the $j$-th software metric element over all software modules, and $\mathrm{Max}(q_j)$ is the maximum value of the $j$-th software metric element over all software modules.
9. The defect prediction device based on power grid information system feature selection according to claim 7, wherein the training set processing module performs class imbalance processing on the training set using the SMOTE sampling algorithm, and specifically includes:
a minority class neighbor determining unit, configured to measure, using the Euclidean distance, the distance from a minority class instance p to all other minority class instances, to obtain the k nearest neighbors of the instance p;
a neighbor extraction unit, configured to randomly draw R (R ≤ k) of the k neighbors with replacement;
a new sample generation unit, configured to, for each of the R instances randomly drawn by the neighbor extraction unit, form a straight line between that instance and the instance p and randomly take a point on the line as a new sample, generating R new samples in total; and
a training set updating unit, configured to add the new samples generated by the new sample generation unit to the training set to obtain a class-balanced training set.
10. The defect prediction device based on power grid information system feature selection according to claim 7, wherein the feature selection module obtains a ranked feature list using the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains these features in the class-balanced training set and in the normalized to-be-tested version data set, and removes the remaining features, to obtain the feature-selected training set and the feature-selected test set;
and wherein
the prediction module comprises a defect prediction model construction unit and a defect prediction unit; the defect prediction model construction unit constructs and trains a defect prediction model using an improved AdaBoost algorithm, and the defect prediction unit uses the trained defect prediction model to predict, based on the feature-selected test set, the defect condition of each module of the to-be-tested version of the software, obtaining a defect prediction result for each module of the to-be-tested version of the software;
the defect prediction model construction unit includes:
an initialization unit, configured to initialize the sample weights in the training set to $D_1(i) = 1/n$, where $n$ is the number of instances;
an iterative execution unit, configured to, for $t = 1, \ldots, T$, iteratively train the $t$-th weak classifier $h_t(x)$ on the training set and calculate the error rate $\varepsilon_t$ of the classifier of the $t$-th iteration:
Figure FDA0002998620800000041
where $T$ is the number of iterations and $y_i$ is the class of the $i$-th instance in the training set;
wherein, when $\varepsilon_t$ is smaller than the preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
Figure FDA0002998620800000042
where $\alpha_t$ is the weight of the weak classifier,
Figure FDA0002998620800000043
is the normalization constant, and $\beta(i)$ is the cost-sensitive compensation parameter,
Figure FDA0002998620800000051
an output unit for outputting the combined classifier:
Figure FDA0002998620800000052
CN202110339177.4A 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection Active CN113127342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110339177.4A CN113127342B (en) 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection

Publications (2)

Publication Number Publication Date
CN113127342A (en) 2021-07-16
CN113127342B (en) 2023-06-09

Family

ID=76774868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110339177.4A Active CN113127342B (en) 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection

Country Status (1)

Country Link
CN (1) CN113127342B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
WO2017131263A1 (en) * 2016-01-29 2017-08-03 한국과학기술원 Hybrid instance selection method using nearest neighboring point for cross project defect prediction
US20180210944A1 (en) * 2017-01-26 2018-07-26 Agt International Gmbh Data fusion and classification with imbalanced datasets
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
WO2020199345A1 (en) * 2019-04-02 2020-10-08 广东石油化工学院 Semi-supervised and heterogeneous software defect prediction algorithm employing github
CN112465040A (en) * 2020-12-01 2021-03-09 杭州电子科技大学 Software defect prediction method based on class imbalance learning algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Cheng Ming; Wu Guoqing; Yuan Mengting: "Software defect prediction based on transfer learning", Acta Electronica Sinica, no. 01 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114816979A (en) * 2021-11-22 2022-07-29 江苏科技大学 A Software Defect Prediction Method Based on Cluster Analysis and Decision Tree Algorithm
CN114816979B (en) * 2021-11-22 2024-08-20 江苏科技大学 Software defect prediction method based on cluster analysis and decision tree algorithm
CN114356641A (en) * 2022-03-04 2022-04-15 中南大学 Incremental software defect prediction method, system, equipment and storage medium
CN114356641B (en) * 2022-03-04 2022-05-27 中南大学 An incremental software defect prediction method, system, device and storage medium
CN114911800A (en) * 2022-05-16 2022-08-16 国网青海省电力公司信息通信公司 Fault prediction method, device and electronic device for power system
CN115033493A (en) * 2022-07-06 2022-09-09 陕西师范大学 Workload sensing instant software defect prediction method based on linear programming

Also Published As

Publication number Publication date
CN113127342B (en) 2023-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant