Fuzzy compact scattering clustering method based on cluster feature weighting
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a fuzzy compact scattering clustering method based on cluster feature weighting.
Background
In natural science and social science, a large number of classification problems exist, a clustering method is a statistical analysis method for researching (sample or index) classification problems, and is also an important algorithm for data mining, and the application field is very wide. A fuzzy C-means (FCM) clustering algorithm is a common unsupervised pattern recognition method, and is continuously improved by many people, and the fuzzy C-means (FCM) clustering algorithm considers the influence of each characteristic parameter of a sample on a clustering center and improves the conditions of noise, abnormal data influence and the like. However, these FCM-based clustering algorithms essentially only consider intra-class compactness (intra-class divergence) of samples, but ignore inter-class scatter (inter-class divergence) of samples, and cannot well deal with the problem of data clustering with sample maldistribution. The FCS (fuzzy Compactness and separation) algorithm proposed by Kuo-Lung Wu et al considers the intra-class Compactness and the inter-class dispersion and is compatible with the hard division and the fuzzy division of the sample, which is more practical; a classification method of a maximum divergence difference judgment criterion is proposed by Song Fenxi and other people in China, and the criterion comprehensively considers the inter-class divergence and the intra-class divergence to obtain an optimal projection vector so as to classify samples; the Satsuga et al introduces the ambiguity into the maximum divergence difference discrimination criterion and provides an FMSDC (fuzzy approximation scanner difference criterion) algorithm, and dimension reduction is carried out while fuzzy clustering is carried out; the algorithm of the Satsu et al is a correct version of the algorithm of the Satsu et al, membership and sample mean values are initialized by using an FCM algorithm, dimension reduction is carried out by using an FMSDC algorithm, dimension reduction data are clustered by using an FCS algorithm, and the clustering essence adopts the FCS algorithm.
In the process of classifying data by using the above algorithm, we find that, some actual data are in some hard division areas, the membership degrees of the data do not need to be fuzzified, and how to effectively divide data with unbalanced sample distribution cannot be solved by the FCM algorithm and the related extended FCM algorithm. Although the FCS algorithm considers the problem of hard partitioning of samples, the FCS algorithm does not consider the sample situation at the boundary of hard partitioning, which results in the problem that the algorithm fails when boundary data is encountered when actual data is classified.
Disclosure of Invention
The invention discloses a fuzzy compact scattering clustering method based on cluster feature weighting, which aims at solving the problems that the actual situation of sample hard division is not considered in the clustering process of the conventional WFCM algorithm, the unbalanced data division of sample distribution cannot be well managed, the situation of hard division boundary points is not considered in the FCS algorithm, and the influence of sample feature parameters on various clusters is ignored.
In order to achieve the purpose, the invention provides the following technical scheme:
a cluster feature weighted fuzzy compact scattering clustering method comprises the following steps:
step one, setting membership index m and characteristic weighting index α∈ [ -10, -1 [ ]]∪(1,10]β∈ {0.005,0.05,0.5,1}, an initial iteration number p of 0 and an iteration error > 0, randomly generating an initial clustering center ai,(s is the number of characteristic parameters);
step two, calculating the coefficient η according to the following formulai:
Wherein,is the sample mean;
step three: updating the sample membership mu according toij:
Note the book
When the sample point xjWhen there is a fall on a hard division boundary, at this time Δij0, for all Δ s, while ensuring that the distance scale of each sample point with respect to the ith class is unchangedijP (. DELTA.) utilization of not less than 0ij) And (3) adjusting:
after adjustment, the new μ is calculated using the following equationij:
Because there is a sample point xjFalls within the class i hard partition region, so there will be μij< 0, therefore for μijAnd (3) carrying out hard division adjustment:
step four: the feature weight ω is calculated according to the following equationik:
Note the book
If Δik< 0 because of ωik∈[0,1]Therefore, it is necessary to adjust ΔikProjecting to an interval larger than 0 and ensuring the kth characteristic parameter and the ith class of each sampleThe distance scale of the hard segment is unchanged, and then Δ is adjusted using the following equationk:
After adjustment, a new omega is calculated by utilizing a characteristic weight formulaik;
Step five: calculating the clustering center a according to the following formulaik:
Step six: let iteration number p be p +1 untilOtherwise, turning to the step two;
step seven: the mu obtained by the p iterationijOutput according toI.e. the jth sample belongs to the ith class.
Further, the sample membership μijAnd a feature weight ωikCalculated by the following steps:
establishing an objective function:
the cluster feature weighted FCS clustering problem is expressed as follows:
obtained by using a Lagrange multiplier method:
in the above formula, λi、λjIs a lagrange multiplier;
respectively corresponding to mu according to the above formulaij、ωik、λi、λjCalculating the offset and making the offset result zero to obtain muij、ωik。
The invention also provides an industrial data classification method based on the fuzzy compact scattering clustering method based on cluster feature weighting, which comprises the following steps: after the data acquired by the sensor is obtained, the acquired data is classified by the CWFCS method (steps one to seven) provided by the invention, and then the current state of the industrial equipment or the process is judged according to the classification result.
Furthermore, the sensor collects the state data of the aircraft engine, and the health state of the aircraft engine is judged.
Has the advantages that:
the method follows the actual situation of sample hard division, fully considers the influence of sample characteristic parameters on the sample division, makes the sample class compact and disperse as much as possible, solves the problem of sample membership degree at the hard division boundary, and realizes more effective division of noise data and abnormal data under the condition of unbalanced sample distribution. The clustering performance is good, the convergence rate is high, and the iteration efficiency is high. Experiments prove that the algorithm has good clustering performance, high convergence speed and high iteration efficiency. Compared with the prior art, the method has high clustering accuracy and obviously reduced time consumption, and is suitable for being applied to occasions with unbalanced sample distribution and high real-time requirement in industrial control.
Drawings
FIG. 1 is a schematic flow chart of the steps of a fuzzy compact scattering clustering method based on cluster feature weighting;
FIG. 2 is a schematic diagram of data distribution of an Iris data set, clustering effects and clustering centers of a CWBCs algorithm, an FCS algorithm and a WFCM algorithm;
fig. 3 is a schematic diagram of the clustering result, the hard partitioning result and the convergence of the CWFCS algorithm when β is 1;
fig. 4 is a schematic diagram of the clustering result, the hard partitioning result and the convergence of the CWFCS algorithm when β is 0.5;
fig. 5 is a schematic diagram of the clustering result, the hard partitioning result and the convergence of the CWFCS algorithm when β is 0.05;
fig. 6 is a schematic diagram of the clustering result, the hard partitioning result and the convergence of the CWFCS algorithm when β is 0.005;
FIG. 7 is a schematic diagram showing the influence of different values of parameters α, β, and m on the clustering result.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
The invention provides an improved fuzzy compact scattering clustering method based on the thought, wherein the unsupervised clustering of data in real life exists in the hard partition of a sample to a clustering center, the sample on the hard partition boundary has the maximum membership degree to the class compared with the sample outside the hard partition region, but the sample in the relatively hard partition region is relatively fuzzy, and each characteristic parameter of the sample has different influence on various clustering results.
Firstly, defining the intra-class divergence and the inter-class divergence of the weighting of the cluster characteristics as follows:
a characteristic weighting coefficient α ∈ [ -10,0) — (1,10 ];
establishing an objective function:
the cluster feature weighted FCS clustering problem is expressed as follows:
obtained by using a Lagrange multiplier method:
in the above formula, λi、λjIs a lagrange multiplier;
respectively corresponding to mu according to the above formulaij、ωik、λi、λjAnd solving the partial derivatives and making the partial derivatives result be zero, and solving:
the fuzzy compact scattering clustering method based on cluster feature weighting, as shown in fig. 1, includes the following steps:
step one, setting membership index m and characteristic weighting index α∈ [ -10, -1 [ ]]∪(1,10]β∈ {0.005,0.05,0.5,1}, an initial iteration number p of 0 and an iteration error > 0, randomly generating an initial clustering center ai,(s is the number of characteristic parameters);
step two, calculating the coefficient η according to the following formulai:
Wherein,is the sample mean;
step three: updating the sample membership mu according to the formula (3)ij:
Note the book
Consider a sample point xjIn the case of hard-divided boundary, if μ is calculated directly by equation (3)ijThe algorithm is invalid when the result is positive infinity; the sample point falling on the i-th class hard division boundary has ambiguity per se, if the sample point is subjected to hard division, the sample point is inconsistent with the actual situation, but is not subjected to hard division with other sample points falling on the hard division boundaryComparison of sample points outside the subarea, xjThe higher fuzzy membership degree exists for the ith class, and all delta are subjected to the condition of ensuring that the distance scale of each sample point relative to the ith class is unchangedijUtilization adjustment function P (Delta) of more than or equal to 0ij) And (3) adjusting:
after adjustment, the new μ is calculated using the following equationij:
Because there is a sample point xjFalls within the class i hard partition region, so there will be μij< 0, therefore for μijAnd (3) carrying out hard division adjustment:
step four: the feature weight ω is calculated according to the following equationik:
Note the book
When deltaikWhen 0, the k-th characteristic parameter has the same influence on the i-th class, so ω isik=0。
If the sample distribution is very uneven, there is Δik< 0 because of ωik∈[0,1]Therefore, it is necessary to adjust ΔikProject to an interval greater than 0 and ensure eachThe k characteristic parameter of the sample is invariant to the distance scale of the i hard-divided region, and then Δ is adjusted using the following equationik:
After adjustment, a new omega is calculated by utilizing a characteristic weight formulaik;
Step five: calculating the clustering center a according to the following formulaik:
Step six: let iteration number p be p +1 untilOtherwise, turning to the step two;
step seven: the mu obtained by the p iterationijOutput according toI.e. the jth sample belongs to the ith class.
Through the steps, the actual situation of sample hard division is followed, the influence of sample characteristic parameters on various divisions is fully considered, the sample classes are compact and dispersed as much as possible, the problem of sample membership degree at the hard division boundary is solved, and more effective division is realized on noise data and abnormal data under the condition of unbalanced sample distribution.
The first embodiment is as follows:
to better illustrate the performance of the present invention, we adopted the method of the present invention to target one of the real data sets of UCI retrieval of machine learning databases: iris data set was subjected to classification experiments, fuzzyThe indexes m are respectively set to be (1.5,2,2.5,3,3.5), and the iteration error precision is 10-6The Iris data set retains all data of a first class and a second class and randomly selects 10 samples from a third class, and the 110 samples are divided into 3 classes, wherein the 2 nd class and the 3 rd class are crossed, the clustering result obtained by the algorithm (CWFSC algorithm for short) is shown in figures 2-6, the algorithm has a basic clustering function, the clustering result is approximately the same as the original data distribution shown in figure 2(a), figures 3-6 show that the distance between the three classes of clustering centers changes along with β, when β is reduced from 1 to 0.05, the system fuzzy degree is increased, the three classes of clustering centers are gradually closed, the number of the third class samples is far less than that of the first class and the second class and is overlapped with the first class and the second class, in order to make the cluster internal compactness as large as possible, the sample internal distribution is also as large as possible, the parameters of CWFSC algorithm are set to be (0.005, 0.5, 1), the weighted sample distribution speed is higher, the weighted sample distribution speed is set as the hard, the weighted sample distribution efficiency of the algorithm is set to be higher, the weighted sample distribution efficiency of the algorithm is set as high as the weighted sample distribution of the weighted sample distribution algorithm is set as 0.05, the weighted sample distribution of the first class 2-3 cluster algorithm is set as the weighted distribution of the cluster algorithm is higher, the cluster algorithm is set as the hard distribution of the cluster algorithm is set as the cluster distribution of the cluster algorithm is set as the cluster characteristics of the cluster internal distribution of the cluster characteristics of the cluster distribution of the cluster 2-3 cluster algorithm is higher, the cluster algorithm is set as the weighted cluster-3 cluster-6, the weighted cluster-3 cluster algorithm is set as the weighted cluster algorithm is.
Fig. 7 shows the impact of different values of the parameters α, β, m on the clustering. The smaller beta is, the larger the error fraction is; no matter what value beta takes, the algorithm is sensitive to the values of alpha and m when the average error fraction is minimum and beta is less than 0.5 for the same value of beta, alpha is 2 and m belongs to {1.5,2 }. Fig. 7(a) β is 1, and α > 3, when m is an integer (2, 3), the larger α is, the smaller the error fraction is, otherwise, the smaller α is; when α < 0, the error fraction becomes smaller as α becomes smaller, and m has little influence. FIGS. 7(b) - (d) show that when β < 1, the algorithm has substantially the same tendency to be influenced by α and m, and the larger m is for a certain α, the larger the error rate is; for a certain m (without considering the optimal case of α ═ 2), the larger α is the error fraction smaller if α > 0, and the smaller α is the error fraction smaller if α < 0.
Example two:
to verify the superiority of the present invention, we tested the Iris dataset using FCS, WFCM and CFCS provided by the present invention.
In the experiment, fuzzy indexes m in the experiment are respectively set as (1.5,2,2.5,3 and 3.5), and the iteration error precision is 10-6The parameters β in the CWFSC algorithm are respectively set as (0.005, 0.05,0.5, 1), the experiment is repeated 100 times, and the optimal result and the average result are obtained, the optimal performance of the algorithm is measured by three indexes of Accuracy (Accuracy), iteration frequency (Iter) and execution Time (Time), the overall performance of the algorithm is measured by average Accuracy (avg _ Accuracy, number of correctly divided samples/total number of samples), average iteration frequency (avg _ Iter) and average execution Time (avg _ Time), and the best and average result in the clustering results of the three algorithms is shown in Table 1:
Algorithm |
Accury |
IterNO |
Time |
avg_Accury |
avg_Iterno |
avg_Time |
FCS |
0.754545 |
28 |
0.028236 |
0.689091 |
35 |
0.193956 |
WFCM |
0.854545 |
30 |
0.103216 |
0.852424 |
29 |
0.090867 |
CWFCS |
0.981818 |
48 |
0.055334 |
0.966364 |
55 |
0.063656 |
TABLE 1
As can be seen from Table 1, for the Iris dataset, both the highest accuracy and the average accuracy of the CFCS algorithm are higher than those of the other two algorithms; the execution time of the CWFS is the shortest, the average execution time of the CWFS is shortened by about 67 percent compared with an FCS algorithm, the average accuracy is improved by 40 percent compared with the FCS algorithm, the time of the CWFS is shortened by 21 percent compared with a WFCM algorithm, and the average accuracy is improved by 23 percent.
The experimental results are obtained based on the noiseless Iris data set, and we can also use three methods of FCS, WFCM and CWFCS provided by the present invention to perform experiments on the noisy Iris data set, and the experimental parameters and environment are the same as those described above for the noiseless Iris data set. The best and average of the clustering results for the three algorithms are shown in table 2:
Algorithm |
Accury |
IterNO |
Time |
avg_Accury |
avg_Iterno |
avg_Time |
FCS |
0.754545 |
40 |
0.386212 |
0.720606 |
62 |
0.468495 |
WFCM |
0.845455 |
26 |
0.109535 |
0.845455 |
29 |
0.101066 |
CWFCS |
0.972727 |
29 |
0.031420 |
0.887879 |
43 |
0.049336 |
TABLE 2
As can be seen from table 2, the highest and average accuracy of the CWFCS algorithm is also significantly higher for the noisy Iris dataset than the other two algorithms.
Example three:
we then performed experiments on the clean Cancer data set by using FCS, WFCM and CWFCS provided by the present invention, wherein the clean Cancer data set has 30 attributes, and in order to indicate that the samples are unevenly distributed, 10 samples are randomly selected in the first category and 367 samples in the second category, and the results are shown in table 2. Table 3 shows that the CWFCS algorithm performs most stably, the iteration number is slightly higher than the WFCM algorithm, the execution time is within 0.1 second, and the clustering accuracy is higher than the other two algorithms.
Algorithm |
Accury |
IterNO |
Time |
avg_Accury |
avg_Iterno |
avg_Time |
FCS |
0.737401 |
45 |
0.827577 |
0.737401 |
43 |
0.533281 |
WFCM |
0.819629 |
11 |
0.026210 |
0.767109 |
11 |
0.030475 |
CWFCS |
0.965517 |
13 |
0.074786 |
0.960212 |
12 |
0.075808 |
TABLE 3
Example four:
the aeroengine gas circuit simulation data set (noise adding) is tested by using FCS, WFCM and CWFCS provided by the invention respectively, and the result is shown in Table 4. The GasPath data set is aeroengine gas path data and comprises three characteristic parameters of DEGT, DNH and DFF, wherein 200 healthy data samples are obtained, and 5 fault data samples are randomly selected.
Algorithm |
Accury |
IterNO |
Time |
avg_Accury |
avg_Iterno |
avg_Time |
FCS |
0.614634 |
24 |
0.290102 |
0.614634 |
24 |
0.181671 |
WFCM |
0.6 |
19 |
0.046147 |
0.6 |
21 |
0.052607 |
CWFCS |
0.917073 |
15 |
0.023733 |
0.86878 |
23 |
0.033184 |
TABLE 4
As shown in table 4, for the GasPath data set, the robustness is very good for the data polluted by noise in engineering application, and the data can be divided more accurately, and for such data, the accuracy of the algorithm for clustering by using the intra-class compactness and the inter-class dispersion of the samples is higher than that of the WFCM algorithm only considering the intra-class compactness.
Example five:
the invention also provides a specific application method in the industrial control, which comprises the following steps:
firstly, state monitoring must be performed on important specific parameters in industrial control (various sensors are usually required to be arranged to obtain comprehensive data), after the data acquired by the sensors are acquired, the acquired data are classified by the CWFCS method (steps one to seven) provided by the invention, and then the current state of the industrial equipment or process is judged according to the classification result. For example, the state of the aircraft engine is monitored by a sensor, and whether the aircraft engine is in an unhealthy state is judged by classifying the collected data (through the CWFS method provided by the invention, steps one to seven).
The technical means disclosed in the invention scheme are not limited to the technical means disclosed in the above embodiments, but also include the technical scheme formed by any combination of the above technical features. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications are also considered to be within the scope of the present invention.