
CN106909942B - Subspace clustering method and device for high-dimensionality big data - Google Patents


Info

Publication number
CN106909942B
CN106909942B (granted publication of application CN201710112771.3A)
Authority
CN
China
Prior art keywords
subspace
dimensional
dense
preset
dimensional dense
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710112771.3A
Other languages
Chinese (zh)
Other versions
CN106909942A (en)
Inventor
高志鹏
范译丹
牛琨
赵旸
邓楠洁
杨杨
邱雪松
李文璟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201710112771.3A priority Critical patent/CN106909942B/en
Publication of CN106909942A publication Critical patent/CN106909942A/en
Application granted granted Critical
Publication of CN106909942B publication Critical patent/CN106909942B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a subspace clustering method and device for high-dimensional big data, wherein the method comprises the following steps: establishing a first Map task for each row of the acquired high-dimensional big data, and splitting the data in each first Map task by dimension to obtain the characteristic value of each dimension in each first Map task; in the first Reduce node, obtaining a 1-dimensional dense subspace for each dimension according to the data area of all the characteristic values of that dimension, the preset window number, the preset window merging threshold and the preset window density threshold; determining a (k+1)-dimensional candidate subspace according to every two k-dimensional dense subspaces; establishing a second Map task for each k-dimensional dense subspace and obtaining all sample points distributed in each k-dimensional dense subspace; and, in the second Reduce node, obtaining the clustered (k+1)-dimensional dense subspace. This scheme improves the operating efficiency of clustering high-dimensional big data.

Description

Subspace clustering method and device for high-dimensionality big data
Technical Field
The invention relates to the technical field of data processing, in particular to a subspace clustering method and device for high-dimensionality big data.
Background
Clustering is the process of dividing a collection of physical or abstract objects into classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to one another and distinct from the objects in other clusters. Cluster analysis is a statistical method for studying classification problems. With the arrival of the big data era, data sets have grown larger in scale and higher in dimensionality; because of irrelevant features and the curse of dimensionality, traditional clustering algorithms are no longer suitable for high-dimensional data.
To solve this problem, the prior art introduced the concept of subspace clustering. A subspace clustering algorithm aims to find clusters hidden in subsets of the original feature space, thereby avoiding the curse of dimensionality and solving the problem of clustering high-dimensional data. Subspace clustering algorithms generally assume that the whole data set runs on a single machine; when high-dimensional big data is encountered, cluster analysis on a single machine runs into bottlenecks in memory capacity and core processing speed.
Among existing subspace clustering algorithms, the most common is the Mafia subspace clustering algorithm based on maximal frequent itemset mining. Mafia first divides each dimension into uniform intervals; within the uniformly divided grid, it merges adjacent segments of similar data-distribution density on each dimension to produce a non-uniformly divided grid, and identifies the dense subspaces in each grid. It then generates the candidate clustering-region set of k-dimensional dense subspaces from the (k-1)-dimensional dense subspaces, searches each selected candidate clustering-region set for adjacent dense subspaces using depth-first search, and merges dense subspaces by greedy growth: starting from an arbitrary dense subspace, it greedily generates a maximal region in each dimension until the union of all regions covers the whole cluster. The Mafia algorithm reduces the number of units into which each dimension is segmented and the size of the candidate clustering-region set, and eliminates the pruning step of subspace detection; moreover, it can cluster in parallel and then serially process the clustered data. However, high-dimensional big data at the TB or PB scale and above has an enormous volume, and its row count may reach tens of thousands; the data obtained by the Mafia algorithm after interval division is still very large, so its operating efficiency is low when processing such data.
Disclosure of Invention
The embodiment of the invention aims to provide a subspace clustering method and a subspace clustering device for high-dimensional big data, so as to improve the operation efficiency of high-dimensional big data clustering. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a subspace clustering method for high-dimensional big data, where the method includes:
acquiring input high-dimensional big data, establishing a first Map task for each line of data under a MapReduce framework, and segmenting the data in each first Map task according to the dimension to obtain a characteristic value of each dimension in each first Map task;
sending the characteristic value of each dimension in each first Map task to a first Reduce node, so that a 1-dimensional dense subspace of each dimension is obtained in each first Reduce node according to the data area of all the characteristic values of that dimension, the preset window number, the preset window merging threshold and the preset window density threshold;
determining a (k+1)-dimensional candidate subspace according to every two k-dimensional dense subspaces, wherein k is greater than or equal to 1, and k+1 is less than or equal to the total dimensionality of the high-dimensionality big data;
establishing a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace;
in each second Map task, when the (k+1)-dimensional candidate subspace comprises all dimensions in a k-dimensional dense subspace, determining that the (k+1)-dimensional candidate subspace covers that k-dimensional dense subspace;
and sending the sample point sets in all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to second Reduce nodes, so that in each second Reduce node the intersection of those sample point sets and the preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
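Taken together, the two MapReduce rounds above can be sketched as a single-process simulation. This is a minimal illustration only: the equal-width windowing (with no adjacent-window merging), the threshold values, and all function names are simplifications assumed here, not taken from the patent.

```python
from collections import defaultdict
from itertools import combinations

def dense_1d_subspaces(data, n_windows=4, density_threshold=2):
    """Round 1 sketch: per dimension, bin values into equal windows and
    keep windows holding more than density_threshold sample points."""
    dense = {}  # (dimension, window_index) -> set of row ids
    for d in range(len(data[0])):
        col = [row[d] for row in data]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_windows or 1.0
        bins = defaultdict(set)
        for i, v in enumerate(col):
            bins[min(int((v - lo) / width), n_windows - 1)].add(i)
        for w, pts in bins.items():
            if len(pts) > density_threshold:
                dense[(d, w)] = pts
    return dense

def dense_2d_subspaces(dense_1d, cluster_density_threshold=2):
    """Round 2 sketch: pair every two 1-D dense subspaces from different
    dimensions, intersect their sample-point sets, keep the dense ones."""
    dense = {}
    for a, b in combinations(dense_1d, 2):
        if a[0] == b[0]:            # same dimension: not a 2-D candidate
            continue
        pts = dense_1d[a] & dense_1d[b]
        if len(pts) > cluster_density_threshold:
            dense[(a, b)] = pts
    return dense

# Five points clustered near (0.15, 5.0), one far-away outlier.
data = [[0.1, 5.0], [0.2, 5.1], [0.15, 5.05], [0.12, 4.95], [0.18, 5.02],
        [9.0, 90.0]]
d1 = dense_1d_subspaces(data)
d2 = dense_2d_subspaces(d1)
assert any(len(pts) == 5 for pts in d2.values())   # the cluster is found
```

A real deployment would run the two rounds as MapReduce jobs, with the per-dimension binning in the first Reduce and the intersection test in the second; the sketch only shows the data flow between the stages.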
Optionally, obtaining the 1-dimensional dense subspace of each dimension according to the data area, the preset window number, the preset window merging threshold and the preset window density threshold of all the feature values of each dimension includes:
determining the maximum value and the minimum value of the characteristic values in each dimension according to all the characteristic values of each dimension, and determining the data area of all the characteristic values of each dimension to be the data area which is larger than or equal to the minimum value and smaller than or equal to the maximum value;
obtaining the preset window number and equally dividing the data area into N_p initial windows, and assigning all the characteristic values to the initial windows, wherein N_p is the preset window number;
acquiring a preset window merging threshold, and counting the number of characteristic values in each initial window;
when the difference value of the number of the characteristic values in two adjacent initial windows is smaller than the preset window merging threshold value, merging the two adjacent initial windows to obtain a merged window;
and acquiring a preset window density threshold, and determining that the window with the number of the characteristic values larger than the preset window density threshold in the merged window is a 1-dimensional dense subspace.
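The optional steps above (equal division, merging of adjacent windows with similar counts, then a density cut) can be sketched as follows. Windows are represented as (count, lower bound, upper bound) triples; the representation and all names are illustrative assumptions, not the patent's data structures.

```python
def dense_windows(values, n_windows, merge_threshold, density_threshold):
    """Return the merged windows whose feature-value count exceeds the
    preset window density threshold, i.e. the 1-D dense subspaces."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_windows or 1.0
    counts = [0] * n_windows
    for v in values:                     # assign each value to a window
        counts[min(int((v - lo) / width), n_windows - 1)] += 1
    windows = [(c, lo + i * width, lo + (i + 1) * width)
               for i, c in enumerate(counts)]
    # Merge adjacent windows whose counts differ by less than the merge
    # threshold, repeating until the windows no longer change.
    changed = True
    while changed:
        changed = False
        merged = [windows[0]]
        for c, a, b in windows[1:]:
            pc, pa, _ = merged[-1]
            if abs(c - pc) < merge_threshold:
                merged[-1] = (c + pc, pa, b)
                changed = True
            else:
                merged.append((c, a, b))
        windows = merged
    # (If everything merged into a single window, the patent re-splits
    # it into N_p equal windows again; that case is not shown here.)
    return [w for w in windows if w[0] > density_threshold]

vals = [1, 1.1, 1.2, 1.3, 5.0, 9.0]
dense = dense_windows(vals, n_windows=4, merge_threshold=1,
                      density_threshold=3)
assert len(dense) == 1 and dense[0][0] == 4   # one dense window, 4 points
```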
Optionally, after the two adjacent initial windows are merged into a merged window when the difference between the numbers of the feature values in them is smaller than the preset window merging threshold, the method further includes:
when the total number of merged windows is 1, equally dividing the merged window into N_p windows according to the preset window number.
Optionally, after obtaining a preset window density threshold and determining that a window in the merged window, in which the number of the feature values is greater than the preset window density threshold, is a 1-dimensional dense subspace, the method further includes:
storing the 1-dimensional dense subspace and all sample points in the 1-dimensional dense subspace in a key-value pair.
Optionally, the determining a (k+1)-dimensional candidate subspace according to every two k-dimensional dense subspaces includes:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, combining every two 1-dimensional dense subspaces to obtain a 2-dimensional candidate subspace;
when the k-dimensional dense subspace is an N-dimensional dense subspace, if every two k-dimensional dense subspaces include the same (k-1)-dimensional dense subspace, fusing and deduplicating the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, wherein N is greater than 1 and smaller than the total dimensionality of the high-dimensionality big data.
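The join described above is an apriori-style candidate-generation step. A minimal sketch follows, with each k-dimensional dense subspace represented as a frozenset of (dimension, window) pairs; this representation is an assumption made for illustration, not the patent's encoding.

```python
from itertools import combinations

def candidate_subspaces(dense_k):
    """Fuse every two k-dimensional dense subspaces that share the same
    (k-1)-dimensional part into one deduplicated (k+1)-dimensional
    candidate subspace."""
    k = len(next(iter(dense_k)))
    candidates = set()
    for a, b in combinations(dense_k, 2):
        # For k = 1 the shared part is empty, so any pair from distinct
        # dimensions combines; for k > 1 the two subspaces must agree on
        # k-1 of their k (dimension, window) elements.
        if len(a & b) == k - 1 and len(a | b) == k + 1:
            fused = a | b
            if len({d for d, _ in fused}) == k + 1:  # one window per dim
                candidates.add(frozenset(fused))
    return candidates

one_dim = [frozenset({(0, 0)}), frozenset({(1, 2)}), frozenset({(2, 1)})]
two_dim = candidate_subspaces(one_dim)
assert len(two_dim) == 3   # every pair of distinct dimensions combines
```

Note that two windows of the *same* dimension never combine: the final check rejects any fusion that would use two windows of one dimension.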
Optionally, after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace when the (k+1)-dimensional candidate subspace includes all dimensions in the k-dimensional dense subspace, the method further includes:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by it to each child node for execution.
Optionally, obtaining the intersection of the sample point sets in all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, includes:
counting the number of sample points in the intersection of the sample point sets in all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
and acquiring the preset cluster density threshold, and determining the (k+1)-dimensional candidate subspace to be a clustered (k+1)-dimensional dense subspace when the number of sample points in the intersection is greater than the preset cluster density threshold.
Optionally, after determining the (k+1)-dimensional candidate subspace to be a clustered (k+1)-dimensional dense subspace, the method further includes:
storing the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace in a key-value pair.
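The intersection-and-threshold test of the second Reduce step can be sketched in a few lines. The set representation, the function name, and the threshold value are illustrative assumptions.

```python
def reduce_candidate(candidate, sample_sets, cluster_density_threshold):
    """Second Reduce sketch: intersect the sample-point sets of all
    k-dimensional dense subspaces covered by the candidate; the
    candidate becomes a clustered (k+1)-dimensional dense subspace if
    the intersection is dense. Returns a (candidate, points) key-value
    pair, or None if the candidate is not dense."""
    points = set.intersection(*sample_sets)
    if len(points) > cluster_density_threshold:
        return candidate, points
    return None

# Sample points (row numbers) of two covered k-dimensional subspaces.
covered = [{1, 2, 3, 4}, {2, 3, 4, 5}]
result = reduce_candidate("S_2", covered, cluster_density_threshold=2)
assert result == ("S_2", {2, 3, 4})   # three shared points: dense
```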
In a second aspect, an embodiment of the present invention further provides a subspace clustering apparatus for high-dimensional big data, where the apparatus includes:
the first segmentation module is used for acquiring the input high-dimensional big data, establishing a first Map task for each line of data under the MapReduce framework, and splitting the data in each first Map task by dimension to obtain the characteristic value of each dimension in each first Map task;
the first determining module is used for sending the characteristic value of each dimension in each first Map task to the first Reduce nodes, so that a 1-dimensional dense subspace of each dimension is obtained in each first Reduce node according to the data area of all the characteristic values of that dimension, the preset window number, the preset window merging threshold and the preset window density threshold;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace according to every two k-dimensional dense subspaces, where k is greater than or equal to 1, and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
the second segmentation module is used for establishing a second Map task for each k-dimensional dense subspace and obtaining all sample points distributed in each k-dimensional dense subspace;
a third determining module, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace includes all dimensions in the k-dimensional dense subspace;
and the fourth determining module is used for sending the sample point sets in all the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the second Reduce nodes, so that in each second Reduce node the intersection of those sample point sets and the preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
Optionally, the apparatus further comprises:
and the distribution module is used for distributing the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by it to each child node for execution.
In the subspace clustering method and device for high-dimensional big data provided by the embodiment of the invention, a first Map task is first established for each line of the input high-dimensional big data, the data is split by dimension in each Map task, and the split characteristic values are aggregated in the first Reduce node to obtain the 1-dimensional dense subspaces. Then a higher-dimensional candidate subspace is determined from the lower-dimensional dense subspaces, a second Map task is established for each lower-dimensional dense subspace to obtain all the sample points distributed in each k-dimensional dense subspace, and whether the higher-dimensional candidate subspace covers a lower-dimensional dense subspace is determined in the second Map task. Finally, the clustered higher-dimensional dense subspace is determined in the second Reduce node from the aggregated intersection of the sample sets in the lower-dimensional dense subspaces covered by the higher-dimensional candidate subspace. This embodiment is an improvement of the Mafia subspace clustering algorithm that achieves distributed parallelization through the MapReduce architecture: MapReduce takes responsibility for the complex problems of big-data processing such as distributed storage, work scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is realized by exploiting MapReduce's data partitioning and task parallelism and designing suitable Map and Reduce functions.
This embodiment avoids unnecessary reduction and input/output operations; the whole computing task is distributed evenly across the nodes without mutual interference, only a small amount of specific data needs to be shared, and all data is stored and accessed through an efficient key-value structure. By balancing disk-access cost against network-access cost with minimal input/output and node communication, the scheme can process high-dimensional big data, significantly accelerates execution, scales well, and thus improves the operating efficiency of high-dimensional big data clustering on the MapReduce architecture.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
FIG. 2 is another schematic flow chart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a subspace clustering apparatus for high-dimensional big data according to an embodiment of the present invention;
fig. 4 is another schematic structural diagram of a subspace clustering apparatus for high-dimensional big data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to improve the operation efficiency of high-dimension big data clustering, the embodiment of the invention provides a high-dimension big data-oriented subspace clustering method and device.
First, a subspace clustering method for high-dimensional big data provided by the embodiment of the present invention is introduced below.
It should be noted that the execution body of the subspace clustering method for high-dimensional big data according to the embodiment of the present invention may be a plurality of processors that execute a clustering algorithm, such as a DSP (Digital Signal Processor), an ARM (Advanced RISC Machine) processor or an FPGA (Field Programmable Gate Array), which is not limited herein. The subspace clustering method for high-dimensional big data provided by the embodiment of the invention can be implemented by software, hardware circuits and/or logic circuits arranged in a processor.
As shown in fig. 1, a subspace clustering method for high-dimensional big data provided in the embodiment of the present invention may include the following steps:
s101, acquiring input high-dimensional big data, establishing a first Map task for each line of data under a MapReduce framework, and segmenting the data in each first Map task according to the dimension to obtain a characteristic value of each dimension in each first Map task.
It should be noted that a conventional subspace clustering algorithm generally assumes that the entire data set runs on a single computer. When high-dimensional big data is encountered, cluster analysis on a single computer runs into bottlenecks in memory capacity and core processing speed, and choosing a parallel processing mode can improve the efficiency and scalability of the conventional algorithm. Parallel processing is an effective way to improve computation speed: through the cooperative work of multiple processing nodes, the data set is decomposed into several sub data sets, each operated on by an independent node. Based on this idea, this embodiment first numbers each line of the input high-dimensional big data by row and splits the instances into features by dimension. That is, before the traditional Mafia subspace clustering algorithm is executed, the high-dimensional big data is divided by row, and only then are the dimensions divided as in the traditional algorithm. When the input data is very large, the data volume after the traditional Mafia algorithm divides the dimensions may still be large, for example tens of thousands of rows; dividing the data by row before dividing the dimensions therefore greatly reduces the data finally obtained and relieves the pressure on memory and core processing.
It should be emphasized that a first Map task under the MapReduce architecture is established for each row of the acquired high-dimensional big data, and each first Map task outputs the dimension number, the instance row number and the instance's value in that dimension. For example, if the data of the ith row is <d1, d2, …, dD>, it is output as <1, i: d1>, <2, i: d2>, …, <D, i: dD>: the value of the ith row in dimension 1 is d1, in dimension 2 is d2, and in dimension D is dD. The MapReduce architecture processes large-scale data in a massively parallel way; adopting the divide-and-conquer idea, it divides a task into several subtasks, schedules them among the idle nodes of a cluster for high-speed processing, and then merges the results the subtasks output. MapReduce is a programming model for parallel operation on large-scale data sets (greater than 1 TB); its Map and Reduce operations, and their main ideas, are borrowed from functional and vector programming languages. It greatly facilitates running programs on a distributed system without any experience of distributed parallel programming. Current software implementations specify a Map function that maps a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce function that ensures all mapped key-value pairs sharing the same key are grouped together.
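The emit format of the first Map task can be sketched directly from the example in the text (the function name and the string encoding of "row: value" are assumptions for illustration):

```python
def first_map(i, row):
    """First Map task sketch: for row i with data <d1, d2, ..., dD>,
    emit one <dimension, 'row:value'> pair per dimension, matching the
    example <1, i: d1>, <2, i: d2>, ..., <D, i: dD> in the text."""
    return [(d + 1, f"{i}:{value}") for d, value in enumerate(row)]

pairs = first_map(7, [0.5, 3.2, 8.0])
assert pairs == [(1, "7:0.5"), (2, "7:3.2"), (3, "7:8.0")]
```

Because the key is the dimension number, the MapReduce shuffle automatically groups all values of one dimension at one first Reduce node, which is exactly what step S102 relies on.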
And S102, sending the characteristic value of each dimension in each first Map task to the first Reduce nodes, so that a 1-dimensional dense subspace of each dimension is obtained in each first Reduce node according to the data area of all the characteristic values of that dimension, the preset window number, the preset window merging threshold and the preset window density threshold.
It should be noted that the feature values of each dimension obtained in the first Map tasks are sent to the first Reduce node, where all the feature values of each dimension are integrated. For each dimension, the data area of all the characteristic values can be preset, or can be the range from the minimum to the maximum of all the characteristic values; the preset window number, the preset window merging threshold and the preset window density threshold are all set in advance by a user or a person skilled in the art according to experience. The preset window number is the number of regions into which all feature values of each dimension are divided, the preset window merging threshold is the condition under which divided regions are merged with each other, and the preset window density threshold is the condition under which a divided region can be determined to be a 1-dimensional dense subspace. It should be emphasized that, after obtaining the partition result, the same processor node stores the partition result and then performs the operation of this step.
Optionally, the step of obtaining the 1-dimensional dense subspace of each dimension according to the data area, the preset window number, the preset window merging threshold and the preset window density threshold of all the feature values of each dimension may include:
firstly, according to all the characteristic values of each dimension, determining the maximum value and the minimum value of the characteristic values in each dimension, and determining the data area of all the characteristic values of each dimension as the data area which is greater than or equal to the minimum value and less than or equal to the maximum value.
It should be noted that, in order to ensure the accuracy of data region division, the data area of all the feature values of each dimension may be determined as the range from the minimum to the maximum of all those feature values. For example, if the minimum value of the first-dimension data in the input high-dimensional big data is -152 and the maximum value is 512, the data area of the first-dimension data is [-152, 512].
Secondly, obtaining the preset window number, equally dividing the data area into N_p initial windows, and assigning all the characteristic values to the initial windows.
Here N_p is the preset window number. It should be noted that the preset window number is usually set in advance by a user or a person skilled in the art according to an empirical value, and is less than or equal to the total number of rows of the high-dimensional big data. For example, if the input high-dimensional big data has 12000 lines in total and the preset window number is set to 12000, the 12000 values of each dimension can be allocated among the preset windows.
And thirdly, acquiring a preset window merging threshold, and counting the number of characteristic values in each initial window.
It should be noted that, in order to reduce the number of data partitions in each dimension, the evenly divided data regions are merged according to a certain merging condition; the preset window merging threshold serves as that condition. Because the condition depends on the number of feature values in each initial window, this number must be counted. After the statistics, an initial grid is formed, which may be stored in the format <window number, number of instances, minimum value, maximum value>.
And then, when the difference value of the number of the characteristic values in the two adjacent initial windows is smaller than a preset window merging threshold, merging the two adjacent initial windows to obtain a merged window.
It should be noted that adjacent initial windows are merged, on the condition that the difference between the numbers of characteristic values in them is smaller than the preset window merging threshold, until the merged windows no longer change. This yields unevenly divided data regions, imposes no data-pruning requirement, and improves the flexibility of the algorithm. The result may be several non-uniform windows, or everything may merge into a single window. Merging into one window is not helpful for subsequent cluster analysis, so, in general, if merging yields a single window, that window needs to be divided again.
Specifically, when the total number of merged windows is 1, the merged window is equally divided into N_p windows according to the preset window number.
It should be noted that dividing the merged window may be performed according to the number of preset windows previously set, or may be performed according to the number of preset windows newly input. In this embodiment, in order to reduce the complexity of data repeated input, the number of the preset windows is the number of the preset windows input previously.
And finally, acquiring a preset window density threshold value, and determining that the window with the number of the characteristic values larger than the preset window density threshold value in the combined window is a 1-dimensional dense subspace.
It should be noted that, if the number of the feature values in the merged window is too small, the window is considered not to be a dense subspace, that is, the window is not taken as a subspace of a cluster; in general, whether the number of feature values in the merged window is too small may be measured by a window density threshold preset by a user or a person skilled in the art, and if the number of feature values in the merged window is smaller than the preset window density threshold, the number of feature values in the merged window is considered to be too small. All dense subspaces satisfying the condition can be obtained through the step, and the dense subspaces are defined as 1-dimensional dense subspaces in each dimension.
Specifically, to facilitate the subsequent clustering operations on dense subspaces of different dimensionalities under each dimension, after the 1-dimensional dense subspace under each dimension is determined, the 1-dimensional dense subspace and all sample points in it may be stored in key-value pair form.
Wherein the sample points are all the feature values of each row. It should be noted that the key-value pair form is similar to an index: for example, the 1-dimensional dense subspace of the i-th dimension is represented in the key-value form <S_1, i>, and the row numbers of the feature values in that subspace are represented in the form <S_1, i, p_1, …, p_n>, where p_1 is the row number of the 1st feature value and p_n is the row number of the n-th feature value. In this form, the information of every feature value in each dimension's dense subspace is clear at a glance, which facilitates its use in the subsequent steps.
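As an illustration of the key-value layout just described (the key names and Python types here are assumptions for illustration, not the patent's actual storage schema), the 1-dimensional dense subspaces might be kept as:

```python
# Hypothetical key-value store: each key identifies the 1-dimensional dense
# subspace of one dimension, and the value lists the row numbers p_1 … p_n
# of the feature values that fall inside it.
dense_1d = {
    ("S_1", 0): [3, 7, 8, 12],  # dense subspace of dimension 0
    ("S_1", 2): [1, 4, 5, 9],   # dense subspace of dimension 2
}

def rows_in_dense_subspace(store, dim):
    # Row numbers stored for a dimension's dense subspace; empty when
    # that dimension produced no dense window.
    return store.get(("S_1", dim), [])
```

A lookup for a dimension with no dense window simply returns an empty list, so later steps can skip it without special-casing.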
It should be emphasized that the maximum complexity of the operations in steps S101 and S102 above is O(N_p log N_p); that is, the complexity of these steps is related to the preset number of windows N_p. The Reduce function of each sub-processor processes one dimension, so the computation time increases linearly with the number of dimensions.
S103, determining k + 1-dimensional candidate subspace according to every two k-dimensional dense subspaces.
Wherein k is greater than or equal to 1, and k+1 is less than or equal to the total dimensionality of the high-dimensional big data. It should be noted that, after the clustered dense subspaces of one dimensionality are obtained, the clustered dense subspaces of the next dimensionality need to be calculated; in general, a high-dimensional dense subspace cannot be obtained directly from a low-dimensional dense subspace. Since a 1-dimensional dense subspace contains relatively few feature values, obtaining the 2-dimensional candidate subspaces is comparatively simple, whereas obtaining higher-dimensional candidate subspaces from lower-dimensional dense subspaces is more complex and follows a different procedure.
Optionally, the step of determining a k + 1-dimensional candidate subspace according to every two k-dimensional dense subspaces may include:
firstly, when each k-dimensional dense subspace is a 1-dimensional dense subspace, combining every two 1-dimensional dense subspaces to obtain a 2-dimensional candidate subspace.
It should be noted that, for the 1-dimensional dense subspace, two-by-two merging of the 1-dimensional dense subspaces of different dimensions is directly performed to generate the 2-dimensional candidate subspace.
Secondly, when the k-dimensional dense subspace is an N-dimensional dense subspace, if every two k-dimensional dense subspaces contain the same k-1-dimensional dense subspace, the k + 1-dimensional candidate subspaces are obtained by fusing and de-duplicating the two k-dimensional dense subspaces.
Wherein N is greater than 1 and less than the total dimensionality of the high-dimensional big data. It should be noted that, for the k-dimensional dense subspaces, every pair of dense subspaces is checked for whether they contain the same k-1-dimensional dense subspace; if so, the k+1-dimensional candidate subspace formed by fusing the two is output, and the resulting k+1-dimensional candidate subspaces are finally de-duplicated. For example, the two 2-dimensional dense subspaces <1,2> and <1,3> share the 1-dimensional dense subspace <1>; the two 2-dimensional dense subspaces <1,2> and <2,3> share the 1-dimensional dense subspace <2>; and the two 2-dimensional dense subspaces <1,3> and <2,3> share the 1-dimensional dense subspace <3>. All three pairs fuse into the same k+1-dimensional candidate subspace <1,2,3>, and after de-duplication the three are integrated into a single k+1-dimensional candidate subspace for output. The time complexity of this process is O(N^2), i.e., a fusion is attempted for every pair of k-dimensional dense subspaces, where N is the total number of k-dimensional dense subspaces. Each dense subspace can be represented as its dense window numbers arranged in ascending or descending order by dimension, which ensures that the order of feature values in the obtained k+1-dimensional candidate subspace remains fixed and avoids disordered arrangement, while the de-duplication operation fully guarantees the uniqueness of each candidate subspace.
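The pairwise fusion and de-duplication can be sketched as follows. Note that this sketch detects the shared k-1-dimensional subspace purely through dimension numbers (the union of two k-dimensional subspaces having exactly k+1 distinct dimensions); checking that the shared subspace is itself dense is omitted, and all names are illustrative.

```python
from itertools import combinations

def candidate_subspaces(dense_k):
    """Fuse k-dimensional dense subspaces (sorted tuples of dimension
    numbers) into de-duplicated (k+1)-dimensional candidate subspaces."""
    candidates = set()
    for a, b in combinations(dense_k, 2):
        # Two k-dimensional subspaces share a (k-1)-dimensional subspace
        # exactly when their union covers k+1 distinct dimensions.
        fused = tuple(sorted(set(a) | set(b)))
        if len(fused) == len(a) + 1:
            candidates.add(fused)  # the set removes duplicates
    return sorted(candidates)
```

For the example in the text, the three 2-dimensional dense subspaces (1,2), (1,3) and (2,3) all fuse into the single 3-dimensional candidate (1,2,3).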
And S104, establishing a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace.
It should be noted that a second Map task is established for each k-dimensional dense subspace, and in the second Map task, each k-dimensional dense subspace is analyzed and counted to obtain all sample points distributed in each k-dimensional dense subspace, that is, all data of each line in each k-dimensional dense subspace.
S105, in each second Map task, when the k + 1-dimensional candidate subspace comprises all dimensions in the k-dimensional dense subspace, determining that the k + 1-dimensional candidate subspace covers the k-dimensional dense subspace.
It should be noted that, in each second Map task, each k-dimensional dense subspace finds the k + 1-dimensional candidate subspaces that cover it by traversing all k + 1-dimensional candidate subspaces, and sends all of its sample points to the Reduce node clustering that k + 1-dimensional candidate subspace, so that the sample points contained in all k-dimensional dense subspaces covered by the same k + 1-dimensional candidate subspace are sent to the same Reduce node. The matching may be performed through the inclusion relationship between the dimensions of the k + 1-dimensional candidate subspace and those of the k-dimensional dense subspace. For example, the m-th 3-dimensional candidate subspace <S_3{d1, d2, d3}, m> covers the n1-th 2-dimensional dense subspace <S_2{d1, d2}, n1>, the n2-th 2-dimensional dense subspace <S_2{d2, d3}, n2> and the n3-th 2-dimensional dense subspace <S_2{d1, d3}, n3>, where d1, d2 and d3 are dimension numbers. The 2-dimensional dense subspaces S_2{d1, d2}, S_2{d2, d3} and S_2{d1, d3} cover the sample point sets DB1, DB2 and DB3 respectively, and the second Map tasks send DB1, DB2 and DB3 to the second Reduce node corresponding to the candidate subspace <S_3{d1, d2, d3}, m>.
S106, sending the sample point sets in all the k-dimensional dense subspaces covered by the k + 1-dimensional candidate subspace to second Reduce nodes, so that in each second Reduce node, the intersection and the preset cluster density threshold of the sample point sets in all the k-dimensional dense subspaces covered by the k + 1-dimensional candidate subspace are obtained, and the k + 1-dimensional dense subspace after clustering is obtained according to the k + 1-dimensional candidate subspace, the intersection and the preset cluster density threshold.
It should be noted that the sample points of all k + 1-dimensional candidate subspaces obtained in the second Map tasks are sent to the second Reduce nodes, and the sample point sets of all k-dimensional dense subspaces covered by each candidate subspace are intersected to obtain D_k+1, from which the clustered k + 1-dimensional dense subspace is determined.
Optionally, the step of obtaining the intersection and the preset cluster density threshold of the sample point sets in all k-dimensional dense sub-spaces covered by the k + 1-dimensional candidate sub-space, and obtaining the clustered k + 1-dimensional dense sub-space according to the k + 1-dimensional candidate sub-space, the intersection and the preset cluster density threshold may include:
first, the number of sample points in the intersection of the sample point sets in all k-dimensional dense subspaces covered by the k + 1-dimensional candidate subspace is counted.
It should be noted that a k + 1-dimensional candidate subspace may cover a plurality of k-dimensional dense subspaces. The sample point sets of all these k-dimensional dense subspaces are intersected with one another; the resulting intersection contains a number of sample points, and this number is counted to determine whether the k + 1-dimensional candidate subspace is a k + 1-dimensional dense subspace.
Secondly, a preset cluster density threshold value is obtained, and when the number of sample points in the intersection is larger than the preset cluster density threshold value, the k + 1-dimensional candidate subspace is determined to be the clustered k + 1-dimensional dense subspace.
It should be noted that the preset cluster density threshold is the decision threshold for a dense subspace and is usually set in advance, from experience, by the user or by a person skilled in the art. If the number of sample points in the intersection is greater than the preset cluster density threshold, the sample point density of the k + 1-dimensional candidate subspace is considered sufficiently high, and the k + 1-dimensional candidate subspace may be determined to be the clustered k + 1-dimensional dense subspace.
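The intersection test performed at the Reduce node can be sketched as below; the function name and the return convention are assumptions for illustration, not the patent's interface.

```python
def clustered_dense_subspace(candidate, covered_sample_sets, density_threshold):
    """Decide whether a (k+1)-dimensional candidate subspace is dense.

    covered_sample_sets holds the sample-point sets of every k-dimensional
    dense subspace covered by the candidate; the candidate is kept only if
    their intersection holds more points than the preset cluster density
    threshold.
    """
    common = set.intersection(*covered_sample_sets)
    if len(common) > density_threshold:
        return candidate, common
    return None
```

A candidate whose covered subspaces share too few sample points is simply dropped, which is what terminates the iteration once no denser subspaces remain.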
It is emphasized that if a k + 1-dimensional dense subspace exists and k +1 is smaller than the total dimensionality of the high-dimensionality big data, S103 to S106 are repeatedly executed, and a k + 2-dimensional candidate subspace and a dense subspace are iteratively calculated; otherwise, the process is terminated.
Specifically, in order to facilitate the determination of the next-dimensional candidate subspace, after the clustered k + 1-dimensional dense subspace is determined, the k + 1-dimensional dense subspace and all sample points in the k + 1-dimensional dense subspace may be stored in a key value pair form.
It should be noted that the key-value pair form is similar to an index: for example, the m-th k + 1-dimensional dense subspace is represented in the key-value form <S_k+1, m>, and all sample points contained in the m-th k + 1-dimensional dense subspace are represented in the form <<S_k+1, m>, <…, dt, …>>, where dt is a specific sample point represented by the row number at which it is located in the acquired original high-dimensional big data.
By applying the embodiment, firstly, a first Map task is established for each line of data in the input high-dimensional big data, the data is segmented according to the dimension in each Map task, and the segmented feature values are collected in the first Reduce nodes to obtain the 1-dimensional dense subspaces; then, high-dimensional candidate subspaces are determined from the low-dimensional dense subspaces, a second Map task is established for each low-dimensional dense subspace to obtain all sample points distributed in each k-dimensional dense subspace, and whether a high-dimensional candidate subspace covers a low-dimensional dense subspace is determined in the second Map task; finally, the clustered high-dimensional dense subspace is determined in the second Reduce node according to the intersection of the sample sets in the low-dimensional dense subspaces covered by the high-dimensional candidate subspace. The embodiment relies on the distributed MapReduce architecture to improve the Mafia subspace clustering algorithm for distributed parallelization: the MapReduce architecture is responsible for the complex problems of big-data processing such as distributed storage, work scheduling, load balancing, fault-tolerant processing and network communication, and distributed parallel execution of the algorithm is achieved by using the data partitioning and task parallelization advantages of MapReduce and designing reasonable Map and Reduce functions.
According to the embodiment, unnecessary reduction operation and input/output operation are avoided, the whole computing task is distributed in each node in a balanced manner and is not interfered with each other, only a small amount of specific data needs to be shared, all data are stored and accessed through an effective key value structure, the balance of disk access cost and network access cost is realized by using the minimum input/output cost and node communication, high-dimensional big data can be processed, the execution speed is remarkably accelerated, good expandability is achieved, and the operating efficiency of high-dimensional big data clustering can be improved based on a MapReduce architecture.
As shown in fig. 2, after step S105 of the embodiment shown in fig. 1, the method for high-dimensional big data-oriented subspace clustering according to the embodiment of the present invention may further include:
and S107, distributing the k + 1-dimensional candidate subspace and the k-dimensional dense subspace covered by the k + 1-dimensional candidate subspace to each child node for execution.
It should be noted that each k + 1-dimensional candidate subspace and each k-dimensional dense subspace may be respectively distributed to each child node for execution through a distributed cache mechanism, and each child node may execute a clustered subspace or may execute multiple clustered subspaces, which is not limited herein. Of course, in order to better improve the efficiency of the algorithm operation, a sub-space of a cluster may be executed by one sub-node, so that the number of the required sub-nodes may be large.
It should be emphasized that S101 to S106 in this embodiment are identical to those in the embodiment shown in fig. 1, and are not described herein again.
By applying the embodiment, firstly, a first Map task is established for each line of data in the input high-dimensional big data, the data is segmented according to the dimension in each Map task, and the segmented feature values are collected in the first Reduce nodes to obtain the 1-dimensional dense subspaces; then, high-dimensional candidate subspaces are determined from the low-dimensional dense subspaces, a second Map task is established for each low-dimensional dense subspace to obtain all sample points distributed in each k-dimensional dense subspace, and whether a high-dimensional candidate subspace covers a low-dimensional dense subspace is determined in the second Map task; finally, the clustered high-dimensional dense subspace is determined in the second Reduce node according to the intersection of the sample sets in the low-dimensional dense subspaces covered by the high-dimensional candidate subspace. The embodiment relies on the distributed MapReduce architecture to improve the Mafia subspace clustering algorithm for distributed parallelization: the MapReduce architecture is responsible for the complex problems of big-data processing such as distributed storage, work scheduling, load balancing, fault-tolerant processing and network communication, and distributed parallel execution of the algorithm is achieved by using the data partitioning and task parallelization advantages of MapReduce and designing reasonable Map and Reduce functions.
The embodiment avoids unnecessary reduction operation and input/output operation, ensures that the whole calculation task is distributed in each node in a balanced manner and is not interfered with each other, only a small amount of specific data needs to be shared, all the data are stored and accessed through an effective key value structure, and the balance of disk access cost and network access cost is realized by using the minimum input/output cost and node communication; and the dense subspace and the candidate subspace obtained by clustering are cached and distributed to each child node for execution, so that the parallel execution of the clustered data can be ensured, and the operation efficiency of the algorithm is improved.
Corresponding to the foregoing method embodiment, an embodiment of the present invention provides a subspace clustering apparatus for high-dimensional big data, and as shown in fig. 3, the apparatus may include:
the first segmentation module 310 is configured to obtain input high-dimensional big data, establish a first Map task for each line of data under a MapReduce architecture, and segment the data in each first Map task according to a dimension to obtain a feature value of each dimension in each first Map task;
the first determining module 320 is configured to send the feature value of each dimension in each first Map task to the first Reduce nodes, so that in each first Reduce node, a 1-dimensional dense subspace of each dimension is obtained and obtained according to the data regions of all feature values of each dimension, the number of preset windows, the preset window merging threshold and the preset window density threshold;
a second determining module 330, configured to determine a k + 1-dimensional candidate subspace according to every two k-dimensional dense subspaces, where k is greater than or equal to 1, and k +1 is less than or equal to the total dimension of the high-dimensional big data;
the second segmentation module 340 is configured to establish a second Map task for each k-dimensional dense subspace, and obtain all sample points distributed in each k-dimensional dense subspace;
a third determining module 350, configured to determine, in each second Map task, that when the k + 1-dimensional candidate subspace includes all dimensions in the k-dimensional dense subspace, the k + 1-dimensional candidate subspace covers the k-dimensional dense subspace;
the fourth determining module 360 is configured to send the sample point sets in all k-dimensional dense subspaces covered by the k + 1-dimensional candidate subspace to the second Reduce node, so that in each second Reduce node, the intersection and the preset cluster density threshold of the sample point sets in all k-dimensional dense subspaces covered by the k + 1-dimensional candidate subspace are obtained, and the clustered k + 1-dimensional dense subspace is obtained according to the k + 1-dimensional candidate subspace, the intersection and the preset cluster density threshold.
By applying the embodiment, firstly, a first Map task is established for each line of data in the input high-dimensional big data, the data is segmented according to the dimension in each Map task, and the segmented feature values are collected in the first Reduce nodes to obtain the 1-dimensional dense subspaces; then, high-dimensional candidate subspaces are determined from the low-dimensional dense subspaces, a second Map task is established for each low-dimensional dense subspace to obtain all sample points distributed in each k-dimensional dense subspace, and whether a high-dimensional candidate subspace covers a low-dimensional dense subspace is determined in the second Map task; finally, the clustered high-dimensional dense subspace is determined in the second Reduce node according to the intersection of the sample sets in the low-dimensional dense subspaces covered by the high-dimensional candidate subspace. The embodiment relies on the distributed MapReduce architecture to improve the Mafia subspace clustering algorithm for distributed parallelization: the MapReduce architecture is responsible for the complex problems of big-data processing such as distributed storage, work scheduling, load balancing, fault-tolerant processing and network communication, and distributed parallel execution of the algorithm is achieved by using the data partitioning and task parallelization advantages of MapReduce and designing reasonable Map and Reduce functions.
According to the embodiment, unnecessary reduction operation and input/output operation are avoided, the whole computing task is distributed in each node in a balanced manner and is not interfered with each other, only a small amount of specific data needs to be shared, all data are stored and accessed through an effective key value structure, the balance of disk access cost and network access cost is realized by using the minimum input/output cost and node communication, high-dimensional big data can be processed, the execution speed is remarkably accelerated, good expandability is achieved, and the operating efficiency of high-dimensional big data clustering can be improved based on a MapReduce architecture.
Optionally, the first determining module 320 may be specifically configured to:
determining the maximum value and the minimum value of the characteristic values in each dimension according to all the characteristic values of each dimension, and determining the data area of all the characteristic values of each dimension to be the data area which is larger than or equal to the minimum value and smaller than or equal to the maximum value;
obtaining and equally dividing the data area into N_p initial windows and assigning all the characteristic values to the initial windows, wherein N_p is the preset number of windows;
acquiring a preset window merging threshold, and counting the number of characteristic values in each initial window;
when the difference value of the number of the characteristic values in the two adjacent initial windows is smaller than the preset window merging threshold value, merging the two adjacent initial windows to obtain a merged window;
and acquiring a preset window density threshold value, and determining that the window with the number of the characteristic values larger than the preset window density threshold value in the merged window is a 1-dimensional dense subspace.
Optionally, the first determining module 320 may be further configured to:
when the total number of the merged windows is 1, equally dividing the merged window into N_p windows according to the preset number of windows.
Optionally, the first determining module 320 may be further configured to:
storing the 1-dimensional dense subspace and all sample points in the 1-dimensional dense subspace in a key-value pair.
Optionally, the second determining module 330 may be specifically configured to:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, combining each two 1-dimensional dense subspaces to obtain a 2-dimensional candidate subspace;
when the k-dimensional dense subspace is an N-dimensional dense subspace, if every two k-dimensional dense subspaces include the same k-1-dimensional dense subspace, performing fusion and deduplication on the two k-dimensional dense subspaces to obtain a k + 1-dimensional candidate subspace, wherein N is greater than 1 and smaller than the total dimensionality of the high-dimensionality big data.
Optionally, the fourth determining module 360 may be specifically configured to:
counting the number of sample points in the intersection of the sample point sets in all the k-dimensional dense sub-spaces covered by the k + 1-dimensional candidate sub-space;
and acquiring a preset cluster density threshold, and determining the k + 1-dimensional candidate subspace as a clustered k + 1-dimensional dense subspace when the number of the sample points in the intersection is greater than the preset cluster density threshold.
Optionally, the fourth determining module 360 may be further configured to:
storing all sample points in the k + 1-dimensional dense subspace and the k + 1-dimensional dense subspace in a key-value pair.
Furthermore, on the basis of the first segmentation module 310, the first determination module 320, the second determination module 330, the second segmentation module 340, the third determination module 350, and the fourth determination module 360, as shown in fig. 4, the subspace clustering apparatus for high-dimensional big data according to the embodiment of the present invention may further include:
a distributing module 410, configured to distribute the k + 1-dimensional candidate subspace and the k-dimensional dense subspace covered by the k + 1-dimensional candidate subspace to each child node for execution.
By applying the embodiment, firstly, a first Map task is established for each line of data in the input high-dimensional big data, the data is segmented according to the dimension in each Map task, and the segmented feature values are collected in the first Reduce nodes to obtain the 1-dimensional dense subspaces; then, high-dimensional candidate subspaces are determined from the low-dimensional dense subspaces, a second Map task is established for each low-dimensional dense subspace to obtain all sample points distributed in each k-dimensional dense subspace, and whether a high-dimensional candidate subspace covers a low-dimensional dense subspace is determined in the second Map task; finally, the clustered high-dimensional dense subspace is determined in the second Reduce node according to the intersection of the sample sets in the low-dimensional dense subspaces covered by the high-dimensional candidate subspace. The embodiment relies on the distributed MapReduce architecture to improve the Mafia subspace clustering algorithm for distributed parallelization: the MapReduce architecture is responsible for the complex problems of big-data processing such as distributed storage, work scheduling, load balancing, fault-tolerant processing and network communication, and distributed parallel execution of the algorithm is achieved by using the data partitioning and task parallelization advantages of MapReduce and designing reasonable Map and Reduce functions.
According to the embodiment, unnecessary reduction operation and input/output operation are avoided, the whole computing task is distributed in each node in a balanced manner and is not interfered with each other, only a small amount of specific data needs to be shared, all data are stored and accessed through an effective key value structure, the balance of disk access cost and network access cost is realized by using the minimum input/output cost and node communication, high-dimensional big data can be processed, the execution speed is remarkably accelerated, good expandability is achieved, and the operating efficiency of high-dimensional big data clustering can be improved based on a MapReduce framework; and the dense subspace and the candidate subspace obtained by clustering are cached and distributed to each child node for execution, so that the parallel execution of the clustered data can be ensured, and the operation efficiency of the algorithm is improved.
It should be noted that, the subspace clustering apparatus oriented to high-dimensional big data according to the embodiments of the present invention is an apparatus applying the subspace clustering method oriented to high-dimensional big data, and all embodiments of the subspace clustering method oriented to high-dimensional big data are applicable to the apparatus and can achieve the same or similar beneficial effects.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A subspace clustering method for high-dimensional big data is applied to a processor, and the method comprises the following steps:
acquiring input high-dimensional big data, establishing a first Map task for each line of data under a MapReduce framework, and segmenting the data in each first Map task according to the dimension to obtain a characteristic value of each dimension in each first Map task;
sending the characteristic value of each dimension in each first Map task to a first Reduce node, so that a 1-dimensional dense subspace of each dimension is obtained in each first Reduce node according to the data area, the number of preset windows, the combination threshold of the preset windows and the density threshold of the preset windows of all the characteristic values of each dimension;
determining a k + 1-dimensional candidate subspace according to every two k-dimensional dense subspaces, wherein k is greater than or equal to 1, and k +1 is less than or equal to the total dimensionality of the high-dimensionality big data;
establishing a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace;
in each second Map task, when the k + 1-dimensional candidate subspace comprises all dimensions in the k-dimensional dense subspace, determining that the k + 1-dimensional candidate subspace covers the k-dimensional dense subspace;
and sending the sample point sets in all k-dimensional dense sub-spaces covered by the k + 1-dimensional candidate sub-space to second Reduce nodes, so that in each second Reduce node, the intersection and the preset cluster density threshold of the sample point sets in all k-dimensional dense sub-spaces covered by the k + 1-dimensional candidate sub-space are obtained, and the k + 1-dimensional dense sub-space after clustering is obtained according to the k + 1-dimensional candidate sub-space, the intersection and the preset cluster density threshold.
2. The high-dimensional big data-oriented subspace clustering method according to claim 1, wherein the obtaining the 1-dimensional dense subspace of each dimension according to the data regions of all the feature values of each dimension, the preset window number, the preset window merging threshold and the preset window density threshold comprises:
determining the maximum value and the minimum value of the characteristic values in each dimension according to all the characteristic values of each dimension, and determining the data area of all the characteristic values of each dimension to be the data area which is larger than or equal to the minimum value and smaller than or equal to the maximum value;
obtaining and equally dividing the data area into N according to the number of preset windows p Initial windows and assigning all the characteristic values to the initial windows, wherein N p The number of the preset windows is set;
acquiring a preset window merging threshold, and counting the number of characteristic values in each initial window;
when the difference value of the number of the characteristic values in two adjacent initial windows is smaller than the preset window merging threshold value, merging the two adjacent initial windows to obtain a merged window;
and acquiring a preset window density threshold, and determining that the window with the number of the characteristic values larger than the preset window density threshold in the merged window is a 1-dimensional dense subspace.
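A single-dimension version of the window procedure in claim 2 might look like the sketch below; `n_windows`, `merge_threshold` and `density_threshold` stand in for the preset window number, window merging threshold and window density threshold, and the interval bookkeeping is an illustrative choice.

```python
def dense_windows_1d(values, n_windows, merge_threshold, density_threshold):
    lo, hi = min(values), max(values)        # data region [min, max]
    width = (hi - lo) / n_windows or 1.0     # guard against a zero-width region
    # Equally divide the region into n_windows initial windows and count
    # the feature values assigned to each of them.
    counts = [0] * n_windows
    for v in values:
        counts[min(int((v - lo) / width), n_windows - 1)] += 1
    # Merge adjacent initial windows whose counts differ by less than the
    # merging threshold; a merged window spans initial windows [start, end).
    merged = [[0, 1, counts[0]]]
    last = counts[0]
    for i in range(1, n_windows):
        if abs(counts[i] - last) < merge_threshold:
            merged[-1][1] = i + 1
            merged[-1][2] += counts[i]
        else:
            merged.append([i, i + 1, counts[i]])
        last = counts[i]
    # A merged window whose count exceeds the density threshold is a
    # 1-dimensional dense subspace: (lower bound, upper bound, count).
    return [(lo + s * width, lo + e * width, c)
            for s, e, c in merged if c > density_threshold]

print(dense_windows_1d([1, 1, 2, 2, 3, 8, 9, 9],
                       n_windows=4, merge_threshold=2, density_threshold=2))
# → [(1.0, 3.0, 4), (7.0, 9.0, 3)]
```

The merging step compares the counts of the two adjacent initial windows, as the claim states, not the accumulated count of the merged window.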
3. The high-dimensional big data-oriented subspace clustering method according to claim 2, wherein after merging the two adjacent initial windows to obtain a merged window when the difference between the numbers of feature values in two adjacent initial windows is smaller than the preset window merging threshold, the method further comprises:
when the total number of merged windows is 1, equally re-dividing the merged window into N_p windows according to the preset window number.
4. The high-dimensional big data-oriented subspace clustering method according to claim 2, wherein after acquiring the preset window density threshold and determining each merged window whose number of feature values is greater than the preset window density threshold to be a 1-dimensional dense subspace, the method further comprises:
storing the 1-dimensional dense subspace and all sample points in the 1-dimensional dense subspace as a key-value pair.
5. The high-dimensional big data-oriented subspace clustering method according to claim 1, wherein determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces comprises:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, combining every two 1-dimensional dense subspaces to obtain a 2-dimensional candidate subspace;
when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces include the same (k-1)-dimensional dense subspace, fusing and de-duplicating the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, wherein N is greater than 1 and less than the total dimensionality of the high-dimensional big data.
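The Apriori-style join in claim 5 (two k-dimensional dense subspaces qualify when they share the same (k-1)-dimensional subspace, i.e. differ in exactly one dimension) could be sketched as follows; representing a subspace as a frozenset of dimension indices is an illustrative assumption.

```python
def join_candidates(dense_k):
    """dense_k: k-dimensional dense subspaces, each a frozenset of dimension indices."""
    k = len(next(iter(dense_k)))
    candidates = set()
    for a in dense_k:
        for b in dense_k:
            # Fuse two subspaces sharing a (k-1)-dimensional subspace;
            # collecting the unions in a set performs the de-duplication.
            if len(a & b) == k - 1:
                candidates.add(a | b)
    return candidates

dense_2 = [frozenset({0, 1}), frozenset({0, 2}), frozenset({1, 2}), frozenset({0, 3})]
cand_3 = join_candidates(dense_2)
# {0,1,2} arises from three different pairs but is stored once;
# {1,2} and {0,3} share no dimension, so they yield no candidate.
```

For k = 1 the shared subspace is empty, so the same join also produces the pairwise 2-dimensional candidates of the claim's first clause.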
6. The high-dimensional big data-oriented subspace clustering method according to claim 1, wherein after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace when the (k+1)-dimensional candidate subspace includes all dimensions of the k-dimensional dense subspace, the method further comprises:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
7. The high-dimensional big data-oriented subspace clustering method according to claim 1, wherein obtaining the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, comprises:
counting the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
and acquiring the preset cluster density threshold, and determining the (k+1)-dimensional candidate subspace to be a clustered (k+1)-dimensional dense subspace when the number of sample points in the intersection is greater than the preset cluster density threshold.
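The density check of claim 7 reduces to counting the intersection of the covered sample-point sets and comparing it with the preset cluster density threshold; a minimal sketch, with illustrative function and parameter names:

```python
def clustered_dense_subspace(candidate, covered_point_sets, cluster_density_threshold):
    """covered_point_sets: sample-point sets of all k-dimensional dense
    subspaces covered by the (k+1)-dimensional candidate subspace."""
    # Sample points common to every covered k-dimensional dense subspace.
    intersection = set.intersection(*covered_point_sets)
    # The candidate becomes a clustered (k+1)-dimensional dense subspace
    # only when the intersection exceeds the cluster density threshold.
    if len(intersection) > cluster_density_threshold:
        return candidate, intersection
    return None

# A candidate over dimensions d0 and d1, covering two 1-dim dense subspaces:
print(clustered_dense_subspace(("d0", "d1"), [{1, 2, 3}, {2, 3, 4}], 1))
# → (('d0', 'd1'), {2, 3})
```

In the claimed method this check runs on the second Reduce nodes, one candidate per node.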
8. The high-dimensional big data-oriented subspace clustering method according to claim 7, wherein after acquiring the preset cluster density threshold and determining the (k+1)-dimensional candidate subspace to be a clustered (k+1)-dimensional dense subspace when the number of sample points in the intersection is greater than the preset cluster density threshold, the method further comprises:
storing the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace as a key-value pair.
9. A subspace clustering device for high-dimensional big data, characterized in that the device comprises:
a first segmentation module, configured to acquire input high-dimensional big data, establish a first Map task for each row of data under a MapReduce framework, and segment the data in each first Map task by dimension to obtain a feature value of each dimension in each first Map task;
a first determining module, configured to send the feature value of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node a 1-dimensional dense subspace of each dimension is obtained according to the data region of all feature values of the dimension, a preset window number, a preset window merging threshold and a preset window density threshold;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, wherein k is greater than or equal to 1 and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
a second segmentation module, configured to establish a second Map task for each k-dimensional dense subspace and obtain all sample points distributed in the k-dimensional dense subspace;
a third determining module, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace includes all dimensions of the k-dimensional dense subspace;
and a fourth determining module, configured to send the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
10. The subspace clustering device for high-dimensional big data according to claim 9, wherein the device further comprises:
a distribution module, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
CN201710112771.3A 2017-02-28 2017-02-28 Subspace clustering method and device for high-dimensionality big data Active CN106909942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710112771.3A CN106909942B (en) 2017-02-28 2017-02-28 Subspace clustering method and device for high-dimensionality big data

Publications (2)

Publication Number Publication Date
CN106909942A (en) 2017-06-30
CN106909942B (en) 2022-09-13

Family

ID=59209172


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341522A (en) * 2017-07-11 2017-11-10 重庆大学 A kind of text based on density semanteme subspace and method of the image without tag recognition
CN108090182B (en) * 2017-12-15 2018-10-30 清华大学 A kind of distributed index method and system of extensive high dimensional data
CN109344194B (en) * 2018-09-20 2021-09-28 北京工商大学 Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system
CN110874615B (en) * 2019-11-14 2023-09-26 深圳前海微众银行股份有限公司 Feature clustering processing method, cluster server and readable storage medium
CN112926658B (en) * 2021-02-26 2023-03-21 西安交通大学 Image clustering method and device based on two-dimensional data embedding and adjacent topological graph
CN115357609B (en) * 2022-10-24 2023-01-13 深圳比特微电子科技有限公司 Method, device, equipment and medium for processing data of Internet of things

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN102419774A (en) * 2011-12-15 2012-04-18 上海大学 SNP data-oriented clustering method
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce
CN105279524A (en) * 2015-11-04 2016-01-27 盐城工学院 High-dimensional data clustering method based on unweighted hypergraph segmentation
CN105471670A (en) * 2014-09-11 2016-04-06 中兴通讯股份有限公司 Flow data classification method and device
CN105930916A (en) * 2016-04-07 2016-09-07 大连理工大学 Parallel modular neural network-based byproduct gas real-time prediction method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9367601B2 (en) * 2012-03-26 2016-06-14 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop


Similar Documents

Publication Publication Date Title
CN106909942B (en) Subspace clustering method and device for high-dimensionality big data
Gautam et al. A survey on job scheduling algorithms in big data processing
Behzad et al. Optimizing i/o performance of hpc applications with autotuning
Ekanayake et al. Twister: a runtime for iterative mapreduce
Ferranti et al. A distributed approach to multi-objective evolutionary generation of fuzzy rule-based classifiers from big data
Yang et al. Intermediate data caching optimization for multi-stage and parallel big data frameworks
Ashabi et al. The systematic review of K-means clustering algorithm
Ramírez-Gallego et al. A distributed evolutionary multivariate discretizer for big data processing on apache spark
Guo et al. Machine learning predictions for underestimation of job runtime on HPC system
Arnaiz-González et al. MR-DIS: democratic instance selection for big data by MapReduce
Kiourtis et al. An autoscaling platform supporting graph data modelling big data analytics
Escobar et al. Parallel high-dimensional multi-objective feature selection for EEG classification with dynamic workload balancing on CPU–GPU architectures
Xu Research and implementation of improved random forest algorithm based on Spark
Zhang et al. Enabling in-situ data analysis for large protein-folding trajectory datasets
Sehrish et al. Exploring the performance of spark for a scientific use case
Dong et al. High-Performance and Flexible Parallel Algorithms for Semisort and Related Problems
Packiaraj et al. Hypar-fca: a distributed framework based on hybrid partitioning for fca
Hassan et al. Big data clustering techniques: Recent advances and survey
Moertini et al. Big Data Reduction Technique using Parallel Hierarchical Agglomerative Clustering.
Li et al. A parallel and balanced SVM algorithm on spark for data-intensive computing
Saravanan et al. Big data in massive parallel processing: A multi-core processors perspective
Ouyang et al. Mitigate data skew caused stragglers through ImKP partition in MapReduce
Huang et al. A database-based distributed computation architecture with Accumulo and D4M: An application of eigensolver for large sparse matrix
Ibrahim Hayatu et al. Big Data Clustering Techniques: Recent Advances and Survey
Khan et al. Computational performance analysis of cluster-based technologies for big data analytics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant