Disclosure of Invention
The present invention is directed to solving the above problems of the prior art. It provides a method that effectively improves the quality of the clustering integration result, offers the user a data mining strategy with higher robustness and extensibility, and improves algorithm efficiency so as to meet user requirements. The technical scheme of the invention is as follows:
a weighted selection integration three-way clustering method based on the Spark platform and adopting two evaluations comprises the following steps:
Step 1, carrying out partition management on a large data set and generating the corresponding resilient distributed dataset (RDD);
Step 2, clustering the data of each partition with a Spark-based K-Means clustering algorithm to generate a number of different cluster members;
Step 3, constructing a new evaluation function and a weighted selection strategy for the cluster members through two evaluations, selecting among the cluster members, and deleting the clustering results with poor clustering effect to form new cluster members;
Step 4, integrating the cluster members, constructing a weighted voting matrix, and dividing the clusters according to the three-way decision rules to obtain the final three-way clustering result.
Further, the Step 1 of performing partition management on the large-scale data specifically comprises: first, the large-scale data set is stored in the distributed file system HDFS; the environment variable SparkContext is then initialized, the data set is converted into the form of a resilient distributed dataset (RDD) through the function textFile(), the number of partitions numPartitions is created, and the function getPartition(key: Any) computes the partition for an input key and returns the key's partition ID.
Further, the step2 of clustering the data of each partition by using a Spark-based K-Means clustering algorithm to generate a plurality of different cluster members specifically comprises the steps of:
the method comprises: setting the number k of class clusters and different iteration counts, mapping them to each partition and running the algorithm, outputting key-value pairs (id, x) in which the id marks the cluster number of data object x, and then merging the partition results into clustering results to obtain m different cluster members.
Further, the K-Means clustering algorithm specifically includes:
Step 1, partitioning the input data set to obtain the partitioned RDD1;
Step 2, in the first k partitions, randomly selecting one point per partition as an initial cluster center and storing these points in RDD2;
Step 3, calculating the distance between each data object and the cluster centers according to the Euclidean distance formula and storing the distances in RDD3 in the form of key-value pairs (xi, e);
Step 4, carrying out a Reduce operation on RDD3, dividing the data objects into class clusters, and storing them in RDD4 in the form of key-value pairs (id, xi), where the id records the cluster number of each data object;
Step 5, calculating the mean of each class cluster to obtain the new cluster center points, stored in RDD5; RDD1 to RDD5 denote the first to fifth resilient distributed datasets, respectively;
Step 6, iterating Step 3 to Step 5 until the maximum number of iterations is reached, and outputting the clustering result.
Further, the weighted selection process of the cluster members in Step 3 specifically comprises the following steps:
(1) constructing an OVERLAP matrix: with the first cluster member taken as the reference partition, each of the remaining m-1 cluster members forms a k x k OVERLAP matrix with it. The matrix records the number of identical objects covered by each pair of class clusters in the two partitions, the columns indicating the cluster numbers of the member whose labels are to be aligned and the rows indicating the cluster numbers of the reference partition. For each row of the matrix, the class-cluster label covering the largest number of identical objects is selected, and that label is rewritten to the corresponding cluster-number label of the reference partition;
(2) as a first evaluation, the CH value of each cluster member is calculated using the following formula:

$$\mathrm{CH}\left(R^{(i)}\right)=\frac{\sum_{k=1}^{K}\left|C_{k}\right|\,d^{2}\!\left(\bar{c}_{k},\bar{x}\right)/\left(K-1\right)}{\sum_{k=1}^{K}\sum_{x_{n}\in C_{k}}d^{2}\!\left(x_{n},\bar{c}_{k}\right)/\left(N-K\right)} \qquad (1)$$

wherein the data set U = {x_1, x_2, ..., x_n, ..., x_N}, $\bar{x}$ represents the center point of the data set, N is the total number of data objects, and k indexes the current class cluster. The set of m cluster members is represented by R = {R^(1), R^(2), ..., R^(i), ..., R^(m)}, where R^(i) denotes the i-th cluster member; each cluster member contains K class clusters, denoted R^(i) = {C_1, C_2, ..., C_k, ..., C_K}, where $\bar{c}_k$ represents the center of class cluster C_k and d represents the distance between the objects concerned. The CH index is the ratio of the separation degree between clusters to the closeness degree within clusters: the separation degree is obtained by calculating the distances from the cluster centers to the center of the data set, and the closeness degree by calculating the distances from the data objects of each cluster to that cluster's center. The cluster member with the largest CH value is taken as the reference partition R^(*);
(3) calculating the accuracy N^(a) and the difference N^(d) of each cluster member, and constructing the evaluation function E(R^(i)) from the obtained accuracy and difference; the normalized weight formula is

$$w^{(i)}=\frac{E\!\left(R^{(i)}\right)}{Z},$$

wherein Z is used to normalize the weights so that the weights of the cluster members satisfy

$$\sum_{i=1}^{m} w^{(i)}=1;$$
(4) setting a threshold λ; clustering results whose weights are less than the threshold do not participate in the integration, so that the new cluster members R^* are selected.
Further, for the obtained new cluster members R^*, an N x k weighted voting matrix is constructed according to the majority voting rule; the matrix records, for each data object, the sum of the weights of the cluster members that divide the object into each class cluster. Thresholds (α, β), whose values range over 0 ≤ β ≤ α ≤ 1, are then set according to the three-way decision rules, and each data object is divided in turn into the core domain or the edge domain of a class cluster, giving the final three-way clustering result.
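Writing v(x, C_k) for the sum of the weights with which data object x is voted into class cluster C_k, the division described above may be stated formally as follows; the explicit upper bound in the edge-domain condition is an inference from applying the two rules in order:

$$x \in \operatorname{Co}(C_k) \ \text{if} \ v(x, C_k) \ge \alpha; \qquad x \in \operatorname{Fr}(C_k) \ \text{if} \ \beta \le v(x, C_k) < \alpha;$$

$$\text{otherwise,} \ x \in \operatorname{Fr}(C_j) \ \text{for every class cluster} \ C_j \ \text{with} \ v(x, C_j) > 0.$$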
The invention has the following advantages and beneficial effects:
the invention provides a Spark platform based weighting selection integration three-branch clustering method adopting two-time evaluation, which can process large-scale uncertain data to obtain three-branch clustering results, can visually depict the cluster division of the uncertain data, and better accords with actual conditions and a plurality of practical application scenes. The quality of the clustering integration result is effectively improved by selecting integration, a data mining strategy with higher robustness and stronger expansibility can be provided for a user, and the algorithm efficiency is improved so as to meet the user requirements.
The main innovation points of the invention comprise:
1. the distributed clustering algorithm framework based on Spark can process a large-scale data set;
2. a new evaluation function is constructed through two evaluations, and a new weighted selection integration strategy is provided;
3. for uncertain data objects, the three-way decision rules are used to express the clustering result in three regions, depicting the class-cluster division of uncertain data more intuitively and accurately.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail and clearly with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention.
The technical scheme for solving the technical problems is as follows:
FIG. 1 is a block diagram of the weighted selection integration three-way clustering process based on the Spark platform with two evaluations. In the custom partitioning stage, the input data set is partitioned. The Spark-based K-Means clustering algorithm then generates the initial cluster members by setting the number of initial clusters and different iteration counts. The initial cluster members are label-aligned, and new cluster members are selected through two evaluations: the first evaluation finds the reference partition and serves as input to the second evaluation, which obtains normalized weights by calculating accuracy and difference. Finally, a weighted voting matrix is constructed from the new weighted cluster members, and the clustering result expressed in three-way form is obtained according to the three-way decision rules.
(1) Custom partitioning phase
FIG. 2 is a block diagram of the Spark-based data partitioning flow. The input original large-scale data set is converted into an RDD of the Spark platform, completing the initialization. The environment variable SparkContext is first initialized, and the data set is then converted into RDD form by the function textFile(), creating a distributed dataset that can afterwards be processed in parallel. The function numPartitions: Int returns the number of partitions to be created; getPartition(key: Any) computes a partition value from the key and returns the key's partition ID, which ranges from 0 to numPartitions-1, the return value being guaranteed non-negative. The customized partitioner is finally applied by the function partitionBy().
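A minimal Scala sketch of this stage is given below. The HDFS path, the keying of records by line index, and the hash-based getPartition() body are illustrative assumptions; numPartitions, getPartition(key: Any) and partitionBy() follow the standard Spark Partitioner contract described above.

```scala
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

// Hypothetical custom partitioner implementing the numPartitions/getPartition
// contract; the hashing scheme is an illustrative choice.
class CustomPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, "Number of partitions must be non-negative")

  override def numPartitions: Int = partitions

  // Maps a key to a partition ID in [0, numPartitions - 1];
  // the result is forced non-negative as required by the contract.
  override def getPartition(key: Any): Int = {
    val raw = key.hashCode() % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }
}

object PartitionStage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomPartitionDemo").setMaster("local[*]")
    val sc = new SparkContext(conf)                 // initialize SparkContext

    // Read the data set from HDFS and convert it into an RDD (path is illustrative)
    val lines = sc.textFile("hdfs:///data/input.txt")

    // Key each record, then repartition with the custom partitioner via partitionBy()
    val keyed = lines.zipWithIndex().map { case (line, idx) => (idx, line) }
    val partitioned = keyed.partitionBy(new CustomPartitioner(3))

    println(s"partitions = ${partitioned.getNumPartitions}")
    sc.stop()
  }
}
```

Hash partitioning is only one possible choice here; any deterministic mapping of keys to [0, numPartitions-1] satisfies the contract.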
FIG. 3 is a flowchart of job execution on the Spark platform. First, the client submits the generated job information to the ResourceManager; a NodeManager starts the SparkAppMaster; the SparkAppMaster initializes the job and applies to the ResourceManager for resources; the NodeManagers then start the corresponding SparkExecutors to execute the tasks; finally, the client obtains the job running state from the SparkAppMaster.
(2) Cluster member generation phase
FIG. 4 is a RDD conversion diagram of the K-Means clustering algorithm based on Spark. The method aims to generate a plurality of initial clustering members in parallel by setting different iteration times, and comprises the following specific steps:
Step 1, partitioning the input data set according to the above partitioning method to obtain the partitioned RDD1;
Step 2, in the first k partitions, randomly selecting one point per partition as an initial cluster center and storing these points in RDD2;
Step 3, calculating the distance between each data object and the cluster centers according to the Euclidean distance formula and storing the distances in RDD3 in the form of key-value pairs (xi, e);
Step 4, carrying out a Reduce operation on RDD3, dividing the data objects into class clusters, and storing them in RDD4 in the form of key-value pairs (id, xi), where the id records the cluster number of each data object;
Step 5, calculating the mean of each class cluster to obtain the new cluster center points, stored in RDD5;
Step 6, iterating Step 3 to Step 5 until the maximum number of iterations is reached, and outputting the clustering result.
In the calculation process, different iteration counts are initialized so that a number of different clustering results are obtained in parallel as the initial cluster members, as sketched below.
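The following self-contained Scala sketch illustrates the loop of Steps 1 to 6 under stated assumptions: the initial centers are drawn with takeSample() rather than one per partition, and the intermediate RDD1 to RDD5 are collapsed into a single map/reduceByKey pipeline, so it is an illustration of the iteration scheme rather than the exact implementation.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object KMeansMembers {
  type Point = Array[Double]

  // Euclidean distance between two points
  def euclidean(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Index of the nearest center for a point
  def closest(p: Point, centers: Array[Point]): Int =
    centers.indices.minBy(i => euclidean(p, centers(i)))

  // Run K-Means for a fixed number of iterations; returns (clusterId, point) pairs
  def run(data: RDD[Point], k: Int, iterations: Int): RDD[(Int, Point)] = {
    var centers = data.takeSample(withReplacement = false, k, seed = iterations)
    for (_ <- 1 to iterations) {
      // Steps 3-4: assign each object to its nearest center (map + reduce)
      val assigned = data.map(p => (closest(p, centers), (p, 1)))
      // Step 5: recompute each center as the mean of its class cluster
      val newCenters = assigned
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2) }
        .mapValues { case (sum, count) => sum.map(_ / count) }
        .collectAsMap()
      centers = centers.indices.map(i => newCenters.getOrElse(i, centers(i))).toArray
    }
    data.map(p => (closest(p, centers), p))
  }

  // Different iteration counts yield m different cluster members
  def members(sc: SparkContext, data: RDD[Point], k: Int): Seq[RDD[(Int, Point)]] =
    Seq(2, 3, 4, 5, 6).map(t => run(data, k, t))
}
```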
(3) Cluster member selection phase
FIG. 5 is a block diagram of the cluster member selection process. For the obtained initial cluster members R = {R^(1), R^(2), ..., R^(i), ..., R^(m)}, the class-cluster labels of the first cluster member R^(1) are taken as the standard, and the class-cluster labels of the remaining m-1 cluster members are aligned to them. A k x k OVERLAP matrix is obtained, which records the number of identical objects covered by each pair of class clusters in the two partitions; the columns give the cluster numbers of the member whose labels are to be aligned, and the rows give the cluster numbers of the reference partition. For each row of the matrix, the class-cluster label covering the largest number of identical objects is selected and rewritten to the corresponding cluster-number label of the reference partition.
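A sketch of the label alignment, assuming each partition is given as an array mapping object index to cluster label; the greedy row-wise choice follows the OVERLAP description above, and repeated column choices (possible in the greedy scheme) simply overwrite earlier ones.

```scala
object LabelAlign {
  // labels: object index -> cluster id, for the reference partition and one member
  def align(reference: Array[Int], member: Array[Int], k: Int): Array[Int] = {
    // Build the k x k OVERLAP matrix of shared-object counts:
    // rows = reference clusters, columns = member clusters
    val overlap = Array.ofDim[Int](k, k)
    for (i <- reference.indices)
      overlap(reference(i))(member(i)) += 1

    // For each reference cluster (row), pick the member label covering
    // the most shared objects, then rewrite the member's labels
    val mapping = Array.fill(k)(-1)
    for (row <- 0 until k) {
      val bestCol = (0 until k).maxBy(col => overlap(row)(col))
      mapping(bestCol) = row
    }
    member.map(l => if (mapping(l) >= 0) mapping(l) else l)
  }
}
```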
The CH value of each cluster member is calculated through the first evaluation function, and the member with the largest value is selected as the reference partition for the calculation of the second evaluation function. The second evaluation function is constructed mainly by calculating the accuracy and the difference of each cluster member. Using the second evaluation function, the weight of each cluster member, w = {w^(1), w^(2), ..., w^(i), ..., w^(m)}, is calculated according to formulas (5) and (6), and the new cluster members R^* are obtained through the threshold λ.
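The sketch below implements the first evaluation as the Calinski-Harabasz ratio of equation (1) together with the selection threshold λ = 1/m used in the embodiment; the accuracy/difference weighting of formulas (2) to (6) is not reproduced here, so the weights enter as precomputed inputs.

```scala
object MemberSelection {
  type Point = Array[Double]

  // Squared Euclidean distance, as d^2 in equation (1)
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  // Calinski-Harabasz index: between-cluster separation over
  // within-cluster closeness, following equation (1)
  def ch(clusters: Seq[Seq[Point]]): Double = {
    val n = clusters.map(_.size).sum
    val k = clusters.size
    val global = mean(clusters.flatten)
    val between = clusters.map(c => c.size * dist2(mean(c), global)).sum / (k - 1)
    val within = clusters.map { c =>
      val m = mean(c); c.map(dist2(_, m)).sum
    }.sum / (n - k)
    between / within
  }

  // Keep only the members whose weight reaches the threshold lambda = 1/m
  def select[M](members: Seq[M], weights: Seq[Double]): Seq[(M, Double)] = {
    val lambda = 1.0 / members.size
    members.zip(weights).filter { case (_, w) => w >= lambda }
  }
}
```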
(4) Three-way voting stage
According to the new cluster members R^* and the weight of each cluster member, an N x k voting matrix is constructed, which records, for each data object, the sum of the weights with which the object is divided into each class cluster C_i. Thresholds are then set according to the three-way decision rules: if the number of votes a data object obtains in some class cluster is greater than or equal to α, the object is divided into the core domain Co(C_k) of that class cluster; if the number of votes is greater than or equal to β, the object is divided into the edge domain Fr(C_k); if neither condition is met, the class clusters in which the object's number of votes is greater than 0 are found, and the object is divided into the edge domains of those class clusters. A sketch of this stage follows.
The following examples further illustrate the practice of the present invention. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments.
Assume that there are 10 objects in a data set of dimension 2, specifically {(1,3), (2,2), (9,2), (7,1), (5,4), (4,5), (4,4), (1,5), (9,4), (2,3)}; the initial cluster number k is set to 3, and the iteration counts are 2, 3, 4, 5 and 6, respectively.
First, the data set is read and converted into RDD form with the number of partitions set to 3. Assume the partitioning result is {((1,3), (2,2), (9,2)), ((7,1), (5,4), (4,5)), ((4,4), (1,5), (9,4), (2,3))}; one data object is selected as the initial cluster center in each partition, namely (2,2), (4,5) and (9,4), respectively. The distances of the data objects to the cluster centers, calculated by the Euclidean distance, are:
| Object | d to (2,2) | d to (4,5) | d to (9,4) |
|---|---|---|---|
| (1,3) | 1.414214 | 3.605551 | 8.062258 |
| (2,2) | 0 | 3.605551 | 7.280110 |
| (9,2) | 7 | 5.830952 | 2 |
| (7,1) | 5.099020 | 5 | 3.605551 |
| (5,4) | 3.605551 | 1.414214 | 4 |
| (4,5) | 3.605551 | 0 | 5.099020 |
| (4,4) | 2.828427 | 1 | 5 |
| (1,5) | 3.162278 | 3 | 8.062258 |
| (9,4) | 7.280110 | 5.099020 | 0 |
| (2,3) | 1 | 2.828427 | 7.071068 |
Therefore, the clustering result obtained in the first iteration is {((1,3),(2,2),(2,3)), ((5,4),(4,5),(4,4),(1,5)), ((9,2),(7,1),(9,4))}; each class cluster is averaged and the cluster centers are updated, giving the centers (1.6667,2.6667), (3.5,4.5) and (8.3333,2.3333) for the second iteration. The distances are again calculated as:

| Object | d to (1.6667,2.6667) | d to (3.5,4.5) | d to (8.3333,2.3333) |
|---|---|---|---|
| (1,3) | 0.745371 | 2.915476 | 7.363544 |
| (2,2) | 0.745371 | 2.915476 | 6.342064 |
| (9,2) | 7.363544 | 6.041523 | 0.745371 |
| (7,1) | 5.587663 | 4.949747 | 1.885571 |
| (5,4) | 3.590067 | 1.581139 | 3.726765 |
| (4,5) | 3.299785 | 0.707107 | 5.088102 |
| (4,4) | 2.687374 | 0.707107 | 4.642777 |
| (1,5) | 2.426680 | 2.549510 | 7.803113 |
| (9,4) | 7.453521 | 5.522681 | 1.795098 |
| (2,3) | 0.471357 | 2.121320 | 6.368295 |
the second iteration therefore results in a clustering result of { ((1,3), (2,2), (1,5), (2,3)), ((5,4), (4,5), (4,4)), ((9,2), (7,1), (9,4)) }. Suppose that 5 clustering results are obtained by a Spark-based K-Means clustering algorithm, R is respectively(1)={((1,3),(2,2),(2,3)),((5,4),(4,5),(4,4),(1,5)),((9,2),(7,1),(9,4))}、R(2)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4)),((9,2),(7,1),(9,4))}、R(3)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4),(9,4)),((9,2),(7,1))}、R(4)={((1,3),(2,2),(2,3),(7,1)),((5,4),(4,5),(4,4),(1,5)),((9,2),(9,4))}、R(5)={((1,3),(2,2),(1,5),(2,3)),((5,4),(4,5),(4,4)),((9,2),(7,1),(9,4))}。
Then the CH value of each of the above cluster members is calculated according to equation (1), giving 10.88, 16.95, 5.58, 3.60 and 8.98, respectively. R^(2) has the largest CH value, indicating the best clustering effect, so R^(2) is selected as the reference partition.
Using the first evaluation result, the second, new evaluation function is constructed according to formulas (2) to (5) to measure the clustering quality of the cluster members, and the evaluation values are then converted into weights. The cluster members with weights greater than or equal to 1/m = 1/5 = 0.2 take part in the three-way voting, which yields the final three-way clustering result. The weights of R^(1) to R^(5) are calculated as 0.201332, 0.217771, 0.192346, 0.173162 and 0.217771, respectively, so the new cluster members are obtained by selecting the clustering results with weights of 0.2 or more, namely R^(1), R^(2) and R^(5).
The weighted voting matrix is then obtained from these members and their weights. Calculating the thresholds (α, β) according to the three-way decision rules, the data objects (1,3), (2,2), (1,5) and (2,3) are divided into the core domain of class cluster 1, the data objects (5,4), (4,5) and (4,4) into the core domain of class cluster 2, and the data objects (9,2), (7,1) and (9,4) into the core domain of class cluster 3; in addition, the data object (1,5) is divided into the edge domain of class cluster 2. The final three-way clustering result is shown in FIG. 6.
The above examples are to be construed as merely illustrative and not limitative of the remainder of the disclosure. After reading the description of the invention, the skilled person can make various changes or modifications to the invention, and these equivalent changes and modifications also fall into the scope of the invention defined by the claims.