Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. It provides a method that effectively improves the quality of the clustering integration result, offers the user a data mining strategy with higher robustness and extensibility, and improves algorithm efficiency so as to meet user requirements. The technical scheme of the invention is as follows:
a weighted selection integration three-way clustering method based on the Spark platform and adopting two evaluations comprises the following steps:
Step 1, carrying out partition management on a large data set and generating the corresponding resilient distributed dataset (RDD);
Step 2, clustering the data of each partition with a Spark-based K-Means clustering algorithm to generate a number of different cluster members;
Step 3, constructing a new evaluation function and a weighted selection strategy for the cluster members through two evaluations, selecting among the cluster members, and deleting the clustering results with poor clustering effect to form new cluster members;
Step 4, integrating the cluster members, constructing a weighted voting matrix, and dividing the clusters according to the three-way decision rules to obtain the final three-way clustering result.
Further, the Step 1 of performing partition management on the large-scale data specifically comprises: first, the large-scale data set is stored in the distributed file system HDFS; the environment variable SparkContext is then initialized, the data set is converted into the form of a resilient distributed dataset (RDD) through the function textFile(), the number of partitions numPartitions is created, and the function getPartition(key: Any) computes the partition for an input key and returns the key's partition ID.
Further, the step2 of clustering the data of each partition by using a Spark-based K-Means clustering algorithm to generate a plurality of different cluster members specifically comprises the steps of:
the method comprises: setting the number k of class clusters and different iteration counts, mapping them to each partition and running the algorithm, outputting key-value pairs (id, x) in which the id marks the cluster number of data object x, and then merging the partition results into clustering results to obtain m different cluster members.
Further, the K-Means clustering algorithm specifically includes:
Step 1, partitioning the input data set to obtain the partitioned RDD1;
Step 2, in the first k partitions, randomly selecting one point per partition as an initial cluster center and storing these points in RDD2;
Step 3, calculating the distance between each data object and the cluster centers according to the Euclidean distance formula and storing the distances in RDD3 in the form of key-value pairs (xi, e);
Step 4, carrying out a Reduce operation on RDD3, dividing the data objects into class clusters, and storing them in RDD4 in the form of key-value pairs (id, xi), where the id records the cluster number of each data object;
Step 5, calculating the mean of each class cluster to obtain the new cluster center points, stored in RDD5; RDD1 to RDD5 denote the first to fifth resilient distributed datasets, respectively;
Step 6, iterating Step 3 to Step 5 until the maximum number of iterations is reached, and outputting the clustering result.
Further, the weighted selection process of the cluster members in Step 3 specifically comprises the following steps:
(1) constructing an OVERLAP matrix: with the first cluster member taken as the reference partition, each of the remaining m-1 cluster members forms a k x k OVERLAP matrix with it. The matrix records the number of identical objects covered by each pair of class clusters in the two partitions, the columns indicating the cluster numbers of the member whose labels are to be aligned and the rows indicating the cluster numbers of the reference partition. For each row of the matrix, the class-cluster label covering the largest number of identical objects is selected, and that label is rewritten to the corresponding cluster-number label of the reference partition;
(2) as a first evaluation, the CH value of each cluster member is calculated using the following formula:

$$\mathrm{CH}\left(R^{(i)}\right)=\frac{\sum_{k=1}^{K}\left|C_{k}\right|\,d^{2}\!\left(\bar{c}_{k},\bar{x}\right)/\left(K-1\right)}{\sum_{k=1}^{K}\sum_{x_{n}\in C_{k}}d^{2}\!\left(x_{n},\bar{c}_{k}\right)/\left(N-K\right)} \qquad (1)$$

wherein the data set U = {x_1, x_2, ..., x_n, ..., x_N}, $\bar{x}$ represents the center point of the data set, N is the total number of data objects, and k indexes the current class cluster. The set of m cluster members is represented by R = {R^(1), R^(2), ..., R^(i), ..., R^(m)}, where R^(i) denotes the i-th cluster member; each cluster member contains K class clusters, denoted R^(i) = {C_1, C_2, ..., C_k, ..., C_K}, where $\bar{c}_k$ represents the center of class cluster C_k and d represents the distance between the objects concerned. The CH index is the ratio of the separation degree between clusters to the closeness degree within clusters: the separation degree is obtained by calculating the distances from the cluster centers to the center of the data set, and the closeness degree by calculating the distances from the data objects of each cluster to that cluster's center. The cluster member with the largest CH value is taken as the reference partition R^(*);
(3) calculating the accuracy N^(a) and the difference N^(d) of each cluster member, and constructing the evaluation function E(R^(i)) from the obtained accuracy and difference; the normalized weight formula is

$$w^{(i)}=\frac{E\!\left(R^{(i)}\right)}{Z},$$

wherein Z is used to normalize the weights so that the weights of the cluster members satisfy

$$\sum_{i=1}^{m} w^{(i)}=1;$$
(4) setting a threshold λ; clustering results whose weights are less than the threshold do not participate in the integration, so that the new cluster members R^* are selected.
Further, for the obtained new cluster members R^*, an N x k weighted voting matrix is constructed according to the majority voting rule; the matrix records, for each data object, the sum of the weights of the cluster members that divide the object into each class cluster. Thresholds (α, β), whose values range over 0 ≤ β ≤ α ≤ 1, are then set according to the three-way decision rules, and each data object is divided in turn into the core domain or the edge domain of a class cluster, giving the final three-way clustering result.
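Writing v(x, C_k) for the sum of the weights with which data object x is voted into class cluster C_k, the division described above may be stated formally as follows; the explicit upper bound in the edge-domain condition is an inference from applying the two rules in order:

$$x \in \operatorname{Co}(C_k) \ \text{if} \ v(x, C_k) \ge \alpha; \qquad x \in \operatorname{Fr}(C_k) \ \text{if} \ \beta \le v(x, C_k) < \alpha;$$

$$\text{otherwise,} \ x \in \operatorname{Fr}(C_j) \ \text{for every class cluster} \ C_j \ \text{with} \ v(x, C_j) > 0.$$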
The invention has the following advantages and beneficial effects:
the invention provides a Spark platform based weighting selection integration three-branch clustering method adopting two-time evaluation, which can process large-scale uncertain data to obtain three-branch clustering results, can visually depict the cluster division of the uncertain data, and better accords with actual conditions and a plurality of practical application scenes. The quality of the clustering integration result is effectively improved by selecting integration, a data mining strategy with higher robustness and stronger expansibility can be provided for a user, and the algorithm efficiency is improved so as to meet the user requirements.
The main innovation points of the invention comprise:
1. the distributed clustering algorithm framework based on Spark can process a large-scale data set;
2. a new evaluation function is constructed through two evaluations, and a new weighted selection integration strategy is provided;
3. for uncertain data objects, the three-way decision rules are used to express the clustering result in three regions, depicting the class-cluster division of uncertain data more intuitively and accurately.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
FIG. 1 is a block diagram of the weighted selection integration three-way clustering process based on the Spark platform with two evaluations. In the custom partitioning stage, the input data set is partitioned. The Spark-based K-Means clustering algorithm then generates the initial cluster members by setting the number of initial clusters and different iteration counts. The initial cluster members are label-aligned, and new cluster members are selected through two evaluations: the first evaluation finds the reference partition and serves as input to the second evaluation, which obtains normalized weights by calculating accuracy and difference. Finally, a weighted voting matrix is constructed from the new weighted cluster members, and the clustering result expressed in three-way form is obtained according to the three-way decision rules.
(1) Custom partitioning phase
FIG. 2 is a block diagram of the Spark-based data partitioning flow. The input original large-scale data set is converted into an RDD of the Spark platform, completing the initialization. The environment variable SparkContext is first initialized, and the data set is then converted into RDD form by the function textFile(), creating a distributed dataset that can afterwards be processed in parallel. The function numPartitions: Int returns the number of partitions to be created; getPartition(key: Any) computes a partition value from the key and returns the key's partition ID, which ranges from 0 to numPartitions-1, the return value being guaranteed non-negative. The customized partitioner is finally applied by the function partitionBy().
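A minimal Scala sketch of this stage is given below. The HDFS path, the keying of records by line index, and the hash-based getPartition() body are illustrative assumptions; numPartitions, getPartition(key: Any) and partitionBy() follow the standard Spark Partitioner contract described above.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical custom partitioner implementing the numPartitions/getPartition
// contract; the hashing scheme is an illustrative choice.
class CustomPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, "Number of partitions must be non-negative")

  override def numPartitions: Int = partitions

  // Maps a key to a partition ID in [0, numPartitions - 1];
  // the result is forced non-negative as required by the contract.
  override def getPartition(key: Any): Int = {
    val raw = key.hashCode() % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}

object PartitionStage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomPartitionDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)                 // initialize SparkContext

    // Read the data set from HDFS and convert it into an RDD (path is illustrative)
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Key each record, then repartition with the custom partitioner via partitionBy()
    val keyed = lines.zipWithIndex().map { case (line, idx) => (idx, line) }
    val partitioned = keyed.partitionBy(new CustomPartitioner(3))

    println(s"partitions = ${partitioned.getNumPartitions}")
    sc.stop()
  }
}
```

Hash partitioning is only one possible choice here; any deterministic mapping of keys to [0, numPartitions-1] satisfies the contract.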
FIG. 3 is a flowchart of job execution on the Spark platform. First, the client submits the generated job information to the ResourceManager; a NodeManager starts the SparkAppMaster; the SparkAppMaster initializes the job and applies to the ResourceManager for resources; the NodeManagers then start the corresponding SparkExecutors to execute the tasks; finally, the client obtains the job running state from the SparkAppMaster.
(2) Cluster member generation phase
FIG. 4 is a RDD conversion diagram of the K-Means clustering algorithm based on Spark. The method aims to generate a plurality of initial clustering members in parallel by setting different iteration times, and comprises the following specific steps:
Step 1, partitioning the input data set according to the above partitioning method to obtain the partitioned RDD1;
Step 2, in the first k partitions, randomly selecting one point per partition as an initial cluster center and storing these points in RDD2;
Step 3, calculating the distance between each data object and the cluster centers according to the Euclidean distance formula and storing the distances in RDD3 in the form of key-value pairs (xi, e);
Step 4, carrying out a Reduce operation on RDD3, dividing the data objects into class clusters, and storing them in RDD4 in the form of key-value pairs (id, xi), where the id records the cluster number of each data object;
Step 5, calculating the mean of each class cluster to obtain the new cluster center points, stored in RDD5;
Step 6, iterating Step 3 to Step 5 until the maximum number of iterations is reached, and outputting the clustering result.
In the calculation process, different iteration counts are initialized so that a number of different clustering results are obtained in parallel as the initial cluster members, as sketched below.
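The following self-contained Scala sketch illustrates the loop of Steps 1 to 6 under stated assumptions: the initial centers are drawn with takeSample() rather than one per partition, and the intermediate RDD1 to RDD5 are collapsed into a single map/reduceByKey pipeline, so it is an illustration of the iteration scheme rather than the exact implementation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object KMeansMembers {
  type Point = Array[Double]

  // Euclidean distance between two points
  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the nearest center for a point
  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => euclidean(p, centers(i)))

  // Run K-Means for a fixed number of iterations; returns (clusterId, point) pairs
  def run(data: RDD[Point], k: Int, iterations: Int): RDD[(Int, Point)] = {
    var centers = data.takeSample(withReplacement = false, k, seed = iterations)
    for (_ <- 1 to iterations) {
      // Steps 3-4: assign each object to its nearest center (map + reduce)
      val assigned = data.map(p => (closest(p, centers), (p, 1)))
      // Step 5: recompute each center as the mean of its class cluster
      val newCenters = assigned
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2) }
        .mapValues { case (sum, count) => sum.map(_ / count) }
        .collectAsMap()
      centers = centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
    }
    data.map(p => (closest(p, centers), p))
  }

  // Different iteration counts yield m different cluster members
  def members(sc: SparkContext, data: RDD[Point], k: Int): Seq[RDD[(Int, Point)]] =
    Seq(2, 3, 4, 5, 6).map(t => run(data, k, t))
}
```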
(3) Cluster member selection phase
FIG. 5 is a block diagram of the cluster member selection process. For the obtained initial cluster members R = {R^(1), R^(2), ..., R^(i), ..., R^(m)}, the class-cluster labels of the first cluster member R^(1) are taken as the standard, and the class-cluster labels of the remaining m-1 cluster members are aligned to them. A k x k OVERLAP matrix is obtained, which records the number of identical objects covered by each pair of class clusters in the two partitions; the columns give the cluster numbers of the member whose labels are to be aligned, and the rows give the cluster numbers of the reference partition. For each row of the matrix, the class-cluster label covering the largest number of identical objects is selected and rewritten to the corresponding cluster-number label of the reference partition.
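A sketch of the label alignment, assuming each partition is given as an array mapping object index to cluster label; the greedy row-wise choice follows the OVERLAP description above, and repeated column choices (possible in the greedy scheme) simply overwrite earlier ones.

```scala
object LabelAlign {
  // labels: object index -> cluster id, for the reference partition and one member
  def align(reference: Array[Int], member: Array[Int], k: Int): Array[Int] = {
    // Build the k x k OVERLAP matrix of shared-object counts:
    // rows = reference clusters, columns = member clusters
    val overlap = Array.ofDim[Int](k, k)
    for (i <- reference.indices)
      overlap(reference(i))(member(i)) += 1

    // For each reference cluster (row), pick the member label covering
    // the most shared objects, then rewrite the member's labels
    val mapping = Array.fill(k)(-1)
    for (row <- 0 until k) {
      val bestCol = (0 until k).maxBy(col => overlap(row)(col))
      mapping(bestCol) = row
    }
    member.map(l => if (mapping(l) >= 0) mapping(l) else l)
  }
}
```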
The CH value of each cluster member is calculated through the first evaluation function, and the member with the largest value is selected as the reference partition for the calculation of the second evaluation function. The second evaluation function is constructed mainly by calculating the accuracy and the difference of each cluster member. Using the second evaluation function, the weight of each cluster member, w = {w^(1), w^(2), ..., w^(i), ..., w^(m)}, is calculated according to formulas (5) and (6), and the new cluster members R^* are obtained through the threshold λ.
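The sketch below implements the first evaluation as the Calinski-Harabasz ratio of equation (1) together with the selection threshold λ = 1/m used in the embodiment; the accuracy/difference weighting of formulas (2) to (6) is not reproduced here, so the weights enter as precomputed inputs.

```scala
object MemberSelection {
  type Point = Array[Double]

  // Squared Euclidean distance, as d^2 in equation (1)
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  // Calinski-Harabasz index: between-cluster separation over
  // within-cluster closeness, following equation (1)
  def ch(clusters: Seq[Seq[Point]]): Double = {
    val n = clusters.map(_.size).sum
    val k = clusters.size
    val global = mean(clusters.flatten)
    val between = clusters.map(c => c.size * dist2(mean(c), global)).sum / (k - 1)
    val within = clusters.map { c =>
      val m = mean(c); c.map(dist2(_, m)).sum
    }.sum / (n - k)
    between / within
  }

  // Keep only the members whose weight reaches the threshold lambda = 1/m
  def select[M](members: Seq[M], weights: Seq[Double]): Seq[(M, Double)] = {
    val lambda = 1.0 / members.size
    members.zip(weights).filter { case (_, w) => w >= lambda }
  }
}
```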
(4) Three-way voting stage
According to the new cluster members R^* and the weight of each cluster member, an N x k voting matrix is constructed, which records, for each data object, the sum of the weights with which the object is divided into each class cluster C_i. Thresholds are then set according to the three-way decision rules: if the number of votes a data object obtains in some class cluster is greater than or equal to α, the object is divided into the core domain Co(C_k) of that class cluster; if the number of votes is greater than or equal to β, the object is divided into the edge domain Fr(C_k); if neither condition is met, the class clusters in which the object's number of votes is greater than 0 are found, and the object is divided into the edge domains of those class clusters. A sketch of this stage follows.
The following examples further illustrate the practice of the present invention. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments.
Assume that there are 10 objects in a data set of dimension 2, specifically {(1,3), (2,2), (9,2), (7,1), (5,4), (4,5), (4,4), (1,5), (9,4), (2,3)}; the initial cluster number k is set to 3, and the iteration counts are 2, 3, 4, 5 and 6, respectively.
First, the data set is read and converted into RDD form with the number of partitions set to 3. Assume the partitioning result is {((1,3), (2,2), (9,2)), ((7,1), (5,4), (4,5)), ((4,4), (1,5), (9,4), (2,3))}; one data object is selected as the initial cluster center in each partition, namely (2,2), (4,5) and (9,4), respectively. The distances of the data objects to the cluster centers, calculated by the Euclidean distance, are:
| Object | d to (2,2) | d to (4,5) | d to (9,4) |
|---|---|---|---|
| (1,3) | 1.414214 | 3.605551 | 8.062258 |
| (2,2) | 0 | 3.605551 | 7.280110 |
| (9,2) | 7 | 5.830952 | 2 |
| (7,1) | 5.099020 | 5 | 3.605551 |
| (5,4) | 3.605551 | 1.414214 | 4 |
| (4,5) | 3.605551 | 0 | 5.099020 |
| (4,4) | 2.828427 | 1 | 5 |
| (1,5) | 3.162278 | 3 | 8.062258 |
| (9,4) | 7.280110 | 5.099020 | 0 |
| (2,3) | 1 | 2.828427 | 7.071068 |
Therefore, the clustering result obtained in the first iteration is {((1,3),(2,2),(2,3)), ((5,4),(4,5),(4,4),(1,5)), ((9,2),(7,1),(9,4))}; each class cluster is averaged and the cluster centers are updated, giving the centers (1.6667,2.6667), (3.5,4.5) and (8.3333,2.3333) for the second iteration. The distances are again calculated as:

| Object | d to (1.6667,2.6667) | d to (3.5,4.5) | d to (8.3333,2.3333) |
|---|---|---|---|
| (1,3) | 0.745371 | 2.915476 | 7.363544 |
| (2,2) | 0.745371 | 2.915476 | 6.342064 |
| (9,2) | 7.363544 | 6.041523 | 0.745371 |
| (7,1) | 5.587663 | 4.949747 | 1.885571 |
| (5,4) | 3.590067 | 1.581139 | 3.726765 |
| (4,5) | 3.299785 | 0.707107 | 5.088102 |
| (4,4) | 2.687374 | 0.707107 | 4.642777 |
| (1,5) | 2.426680 | 2.549510 | 7.803113 |
| (9,4) | 7.453521 | 5.522681 | 1.795098 |
| (2,3) | 0.471357 | 2.121320 | 6.368295 |
the second iteration therefore results in a clustering result of { ((1,3), (2,2), (1,5), (2,3)), ((5,4), (4,5), (4,4)), ((9,2), (7,1), (9,4)) }. Suppose that 5 clustering results are obtained by a Spark-based K-Means clustering algorithm, R is respectively(1)={((1,3),(2,2),(2,3)),((5,4),(4,5),(4,4),(1,5)),((9,2),(7,1),(9,4))}、R(2)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4)),((9,2),(7,1),(9,4))}、R(3)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4),(9,4)),((9,2),(7,1))}、R(4)={((1,3),(2,2),(2,3),(7,1)),((5,4),(4,5),(4,4),(1,5)),((9,2),(9,4))}、R(5)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4)),((9,2),(7,1),(9,4))}。
Then the CH value of each of the above cluster members is calculated according to equation (1), giving 10.88, 16.95, 5.58, 3.60 and 8.98, respectively. R^(2) has the largest CH value, indicating the best clustering effect, so R^(2) is selected as the reference partition.
Using the first evaluation result, the second, new evaluation function is constructed according to formulas (2) to (5) to measure the clustering quality of the cluster members, and the evaluation values are then converted into weights. The cluster members with weights greater than or equal to 1/m = 1/5 = 0.2 take part in the three-way voting, which yields the final three-way clustering result. The weights of R^(1) to R^(5) are calculated as 0.201332, 0.217771, 0.192346, 0.173162 and 0.217771, respectively, so the new cluster members are obtained by selecting the clustering results with weights of 0.2 or more, namely R^(1), R^(2) and R^(5).
The weighted voting matrix is then obtained from these members and their weights. Calculating the thresholds (α, β) according to the three-way decision rules, the data objects (1,3), (2,2), (1,5) and (2,3) are divided into the core domain of class cluster 1, the data objects (5,4), (4,5) and (4,4) into the core domain of class cluster 2, and the data objects (9,2), (7,1) and (9,4) into the core domain of class cluster 3; in addition, the data object (1,5) is divided into the edge domain of class cluster 2. The final three-way clustering result is shown in FIG. 6.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.