CN106909942A - Subspace clustering method and device for high-dimensional big data - Google Patents
Subspace clustering method and device for high-dimensional big data
- Publication number
- CN106909942A (Application No. CN201710112771.3A)
- Authority
- CN
- China
- Prior art keywords
- subspace
- dimension
- dense
- window
- big data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention provides a subspace clustering method and device for high-dimensional big data. The method includes: creating a first Map task for each row of the obtained high-dimensional big data, and splitting the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task; in a first Reduce node, obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold; determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces; creating a second Map task for each k-dimensional dense subspace and obtaining all sample points distributed in each k-dimensional dense subspace; and, in a second Reduce node, obtaining the clustered (k+1)-dimensional dense subspaces. This scheme can improve the efficiency of clustering high-dimensional big data.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a subspace clustering method and device for high-dimensional big data.
Background technology
Clustering is the process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis is a statistical method for studying classification problems. With the arrival of the big data era, data sets keep growing in scale and dimensionality; because of irrelevant features and the curse of dimensionality, traditional clustering algorithms are no longer applicable to high-dimensional data.
To address the inapplicability of traditional clustering algorithms to high-dimensional data, the prior art proposes the concept of subspace clustering. Subspace clustering algorithms aim to find clusters hidden in subsets of the original feature space, thereby avoiding the curse of dimensionality and solving the clustering problem for high-dimensional data. According to the search strategy used, subspace clustering algorithms are generally divided into bottom-up and top-down approaches. Subspace clustering algorithms usually assume that the whole data set is processed on a single machine; when high-dimensional big data is encountered, clustering on a single machine runs into bottlenecks of memory capacity and core processing speed.
Among existing subspace clustering algorithms, the most common is the maximal-frequent-itemset MAFIA subspace clustering algorithm. The MAFIA algorithm first divides each dimension into uniform intervals, merges adjacent intervals whose data-distribution densities are similar in each dimension to produce a non-uniformly divided grid, and identifies the dense subspaces in the grid cells. It then generates the candidate cluster regions of the k-dimensional dense subspaces from the (k-1)-dimensional dense subspaces, finds adjacent dense subspaces within each selected candidate cluster region using depth-first search, and merges these dense subspaces by greedy growth: starting from an arbitrary dense subspace, a maximal region is greedily produced in each dimension until the union of all regions covers the whole cluster. The MAFIA algorithm reduces the number of elements in the candidate cluster regions and the amount of data split in each dimension, and eliminates the pruning step of subspace detection; moreover, the MAFIA algorithm can perform the clustering in parallel and then process the clustered data serially. However, for high-dimensional big data whose size reaches the terabyte or petabyte level and above, the data volume is enormous and the number of rows may reach tens of thousands; the data that the MAFIA algorithm obtains after interval division is still very large, so the MAFIA algorithm runs inefficiently when processing such huge data.
The content of the invention
The purpose of the embodiments of the present invention is to provide a subspace clustering method and device for high-dimensional big data, so as to improve the efficiency of clustering high-dimensional big data. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a subspace clustering method for high-dimensional big data. The method includes:
obtaining input high-dimensional big data, creating a first Map task for each row of data under a MapReduce framework, and splitting the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
sending the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
creating a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace;
in each second Map task, when a (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional dense subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace;
sending the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
Optionally, obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold includes:
determining, according to all feature values of each dimension, the maximum and minimum feature values of the dimension, and determining the data range of all feature values of the dimension as the region greater than or equal to the minimum and less than or equal to the maximum;
obtaining the preset number of windows, dividing the data range into Np initial windows, and allocating all feature values to the initial windows, where Np is the preset number of windows;
obtaining the preset window-merging threshold, and counting the number of feature values in each initial window;
when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, merging the two adjacent initial windows to obtain merged windows;
obtaining the preset window-density threshold, and determining that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces.
Optionally, after merging the two adjacent initial windows to obtain the merged windows when the difference between the numbers of feature values in the two adjacent initial windows is less than the preset window-merging threshold, the method further includes:
when the total number of merged windows is 1, dividing the merged window into Np windows according to the preset number of windows.
Optionally, after obtaining the preset window-density threshold and determining that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces, the method further includes:
storing the 1-dimensional dense subspaces and all sample points in the 1-dimensional dense subspaces in key-value pair form.
Optionally, determining the (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces includes:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, merging every two 1-dimensional dense subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, merging and deduplicating the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, where N is greater than 1 and less than the total number of dimensions of the high-dimensional big data.
Optionally, after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace, the method further includes:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
Optionally, obtaining the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, includes:
counting the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
obtaining the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace.
Optionally, after obtaining the preset cluster density threshold and determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace when the number of sample points in the intersection exceeds the preset cluster density threshold, the method further includes:
storing the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace in key-value pair form.
In a second aspect, an embodiment of the present invention further provides a subspace clustering device for high-dimensional big data. The device includes:
a first splitting module, configured to obtain input high-dimensional big data, create a first Map task for each row of data under a MapReduce framework, and split the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
a first determining module, configured to send the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
a second splitting module, configured to create a second Map task for each k-dimensional dense subspace and obtain all sample points distributed in each k-dimensional dense subspace;
a third determining module, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace;
a fourth determining module, configured to send the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
Optionally, the device further includes:
a distribution module, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
In the subspace clustering method and device for high-dimensional big data provided by the embodiments of the present invention, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
Brief description of the drawings
In order to explain the technical schemes of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
Fig. 2 is another schematic flowchart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a subspace clustering device for high-dimensional big data according to an embodiment of the present invention;
Fig. 4 is another schematic structural diagram of a subspace clustering device for high-dimensional big data according to an embodiment of the present invention.
Specific embodiment
The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the invention.
In order to improve the efficiency of clustering high-dimensional big data, the embodiments of the present invention provide a subspace clustering method and device for high-dimensional big data.
A subspace clustering method for high-dimensional big data provided by an embodiment of the present invention is introduced first.
It should be noted that the executor of the subspace clustering method for high-dimensional big data provided by the embodiments of the present invention can be multiple processors that execute the clustering algorithm, such as a DSP (Digital Signal Processor), an ARM (Advanced RISC Machines) reduced-instruction-set microprocessor or an FPGA (Field-Programmable Gate Array), which is not limited here. The method can be implemented as software, a hardware circuit and/or a logic circuit arranged in the processor.
As shown in Fig. 1, a subspace clustering method for high-dimensional big data provided by an embodiment of the present invention may include the following steps:
S101: obtaining input high-dimensional big data, creating a first Map task for each row of data under a MapReduce framework, splitting the data by dimension in each first Map task, and obtaining the feature value of each dimension in each first Map task.
It should be noted that traditional subspace clustering algorithms usually assume that the whole data set is processed on a single computer. When high-dimensional big data is encountered, clustering on a single computer runs into bottlenecks of memory capacity and core processing speed; the efficiency and scalability of existing subspace clustering algorithms can be improved by parallel processing. An efficient way to increase processing speed is row-wise processing: the data set is decomposed into several sub-data-sets handled by multiple cooperating processing nodes, and each sub-data-set is processed by an independent node. Based on this idea of parallel processing, this embodiment first numbers each row (instance) of the input high-dimensional big data by row number and then splits the features of each instance by dimension. In other words, before the traditional MAFIA subspace clustering algorithm is executed, the high-dimensional big data is first divided by row and then divided by dimension as in the traditional MAFIA algorithm. When the input data is extremely large, the data volume that the traditional MAFIA algorithm handles after dividing by dimension may still be huge, for example tens of thousands of rows; dividing by row before dividing by dimension therefore greatly reduces the data each task finally handles, relieving the pressure on memory and on the processing cores.
It should be emphasized that a first Map task is created under the MapReduce framework for each row of the obtained high-dimensional big data. In the first Map task, the output consists of the dimension number, the row number of the instance and the value of the instance in that dimension. For example, if the data of the i-th row is <d1,d2,…,dD>, the output is <1,i:d1>, <2,i:d2> … <D,i:dD>, meaning that the value of the i-th row in the 1st dimension is d1, the value of the i-th row in the 2nd dimension is d2, and the value of the i-th row in the D-th dimension is dD. The MapReduce framework is used to process large-scale data in a massively parallel fashion: following the idea of divide and conquer, a task is divided into multiple subtasks, these subtasks are scheduled on the idle nodes of the cluster for fast processing, and the results output by the child nodes are then merged. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). MapReduce consists of Map (mapping) and Reduce (reduction); its main ideas are borrowed from functional programming languages, together with features borrowed from vector programming languages. It allows programmers to run their programs on a distributed system without knowing distributed parallel programming. Current software implementations specify a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped key-value pairs sharing the same key are grouped together.
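For illustration only, the following minimal sketch shows one way this first Map step could look; it is framework-independent plain Python rather than the patent's actual implementation, and the function and variable names are hypothetical. It emits one (dimension, (row, value)) pair per feature, mirroring the <dimension, row:value> output described above.

```python
from typing import Iterator, List, Tuple

def first_map(row_index: int, row: List[float]) -> Iterator[Tuple[int, Tuple[int, float]]]:
    """Emit one (dimension, (row, value)) pair per feature of the row,
    mirroring the first Map output <1,i:d1>, <2,i:d2>, ..., <D,i:dD>."""
    for dim, value in enumerate(row, start=1):
        yield dim, (row_index, value)

# Example: row i = 3 with feature values <0.5, 2.1, -7.0>
print(list(first_map(3, [0.5, 2.1, -7.0])))
# [(1, (3, 0.5)), (2, (3, 2.1)), (3, (3, -7.0))]
```

In an actual MapReduce deployment the pairs would be emitted to the framework, which shuffles all values of the same dimension to the same first Reduce node.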
S102: sending the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold.
It should be noted that the feature values of each dimension obtained in the first Map tasks are sent to the first Reduce nodes, where all feature values of each dimension are aggregated. For each dimension, the data range of all feature values may either be set in advance or be the range from the computed minimum to the computed maximum of all feature values. The preset number of windows, the preset window-merging threshold and the preset window-density threshold are all preset empirically by the user or by those skilled in the art: the preset number of windows is the number of regions into which all feature values of each dimension are divided, the preset window-merging threshold is the condition under which the divided regions are merged with one another, and the preset window-density threshold is the condition for determining that a divided region is a 1-dimensional dense subspace. It should be emphasized that the same processor node saves the division result after obtaining it and then performs this step.
Optionally, the step of obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold may include:
First, according to all feature values of each dimension, the maximum and minimum feature values of the dimension are determined, and the data range of all feature values of the dimension is determined as the region greater than or equal to the minimum and less than or equal to the maximum.
It should be noted that, to ensure the accuracy of the range division, the data range of all feature values of a dimension can be defined as the range from the minimum to the maximum of all feature values of that dimension. For example, if the minimum of the first dimension of the input high-dimensional big data is -152 and the maximum is 512, the data range of the first dimension is [-152, 512].
Second, the preset number of windows is obtained, the data range is divided into Np initial windows, and all feature values are allocated to the initial windows.
Here Np is the preset number of windows. It should be noted that the preset number of windows is usually preset empirically by the user or by those skilled in the art, and is less than or equal to the total number of rows of the high-dimensional big data. For example, if the input high-dimensional big data has 12000 rows in total and the preset number of windows is 12000, the 12000 values of each dimension can be allocated to the preset windows.
Third, the preset window-merging threshold is obtained, and the number of feature values in each initial window is counted.
It should be noted that, in order to reduce the amount of data split in each dimension, the uniformly divided regions need to be merged according to a merging condition. The preset window-merging threshold serves as this merging condition, and the condition depends on the number of feature values in each initial window, which therefore needs to be counted; the initial grid formed after counting can be stored in the form <window number, instance count, minimum, maximum>.
Then, when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, the two adjacent initial windows are merged to obtain merged windows.
It should be noted that adjacent initial windows are merged whenever the difference between their numbers of feature values is less than the preset window-merging threshold, until the merged windows no longer change; this yields non-uniformly divided data regions without requiring data pruning, which improves the scalability of the algorithm. The windows finally obtained may be several non-uniform windows, or everything may have been merged into a single window. A single merged window is not useful for subsequent cluster analysis; therefore, if merging yields exactly one window, that window generally needs to be divided again.
Specifically, when the total number of merged windows is 1, the merged window is divided into Np windows according to the preset number of windows.
It should be noted that the single merged window can be re-divided either according to the previously set preset number of windows or according to a newly input preset number of windows. In this embodiment, to reduce the complexity of repeated data input, the preset number of windows that was previously input is used.
Finally, the preset window-density threshold is obtained, and the merged windows in which the number of feature values exceeds the preset window-density threshold are determined to be 1-dimensional dense subspaces.
It should be noted that if the number of feature values in a merged window is too small, the window is not considered a dense subspace, i.e. the window is not used as a subspace for clustering. Normally, whether the number of feature values in a merged window is too small is judged against a window-density threshold preset by the user or by those skilled in the art: if the number of feature values in a merged window is less than the preset window-density threshold, the number of feature values in that window is considered too small. This step yields all dense subspaces that satisfy the condition, and these dense subspaces are determined to be the 1-dimensional dense subspaces of each dimension.
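Purely as an illustration, the sketch below (plain Python, hypothetical names, assuming Np > 1) follows the steps just described for one dimension in a first Reduce node: uniform division into Np initial windows, merging of adjacent windows whose counts differ by less than the merging threshold, re-division when everything collapses into one window, and selection of the windows whose counts exceed the density threshold. It is a simplified sketch of the described procedure, not the patent's implementation.

```python
from typing import Dict, List, Tuple

def first_reduce(values: List[Tuple[int, float]], n_windows: int,
                 merge_threshold: int, density_threshold: int) -> List[Dict]:
    """Return the 1-dimensional dense windows of one dimension.

    `values` holds the (row_index, feature_value) pairs collected from the
    first Map tasks for a single dimension."""
    lo = min(v for _, v in values)
    hi = max(v for _, v in values)
    width = (hi - lo) / n_windows or 1.0

    # 1. Allocate every feature value to one of the Np uniform initial windows.
    windows = [{"rows": [], "min": lo + i * width, "max": lo + (i + 1) * width}
               for i in range(n_windows)]
    for row, v in values:
        idx = min(int((v - lo) / width), n_windows - 1)
        windows[idx]["rows"].append(row)

    # 2. Merge adjacent windows whose counts differ by less than the
    #    merging threshold, repeating until the partition stops changing.
    changed = True
    while changed:
        changed = False
        merged = [windows[0]]
        for w in windows[1:]:
            if abs(len(merged[-1]["rows"]) - len(w["rows"])) < merge_threshold:
                merged[-1] = {"rows": merged[-1]["rows"] + w["rows"],
                              "min": merged[-1]["min"], "max": w["max"]}
                changed = True
            else:
                merged.append(w)
        windows = merged

    # 3. If everything collapsed into a single window, fall back to the
    #    uniform Np-window division (re-run step 1 without merging).
    if len(windows) == 1:
        return first_reduce(values, n_windows, merge_threshold=0,
                            density_threshold=density_threshold)

    # 4. Windows whose feature-value count exceeds the window-density
    #    threshold are the 1-dimensional dense subspaces of this dimension.
    return [w for w in windows if len(w["rows"]) > density_threshold]
```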
Specifically, to facilitate the subsequent clustering of dense subspaces of different dimensionality under each dimension, after the 1-dimensional dense subspaces of each dimension are determined, the 1-dimensional dense subspaces and all sample points in them can be stored in key-value pair form.
Here a sample point is the set of all feature values of one row. It should be noted that the key-value pair form is similar to an index: for example, the key-value pair <S_1,i> denotes a 1-dimensional dense subspace of the i-th dimension, and <S_1,i,p_1,…,p_n> denotes the row numbers of the feature values contained in a 1-dimensional dense subspace of the i-th dimension, where p_1 is the row number of the 1st feature value and p_n is the row number of the n-th feature value. In this key-value pair form, the information about every feature value in the dense subspaces of each dimension is explicit, which is convenient for the subsequent steps.
It should be emphasized that the worst-case complexity of steps S101 and S102 is O(Np log Np), i.e. the complexity of these steps depends on the preset number of windows; the Reduce function of each sub-processor handles one dimension, and the computation time grows linearly with the number of dimensions.
S103: determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces.
Here k is greater than or equal to 1, and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data. It should be noted that after the one-dimensional dense subspaces have been clustered, dense subspaces of higher dimensionality must be computed. In general, a higher-dimensional dense subspace cannot be obtained directly from the lower-dimensional dense subspaces; instead, higher-dimensional candidate subspaces are obtained from the dense subspaces already found, and the dense subspaces are then determined from the candidate subspaces. Since the 1-dimensional dense subspaces contain few feature values, obtaining the 2-dimensional candidate subspaces is relatively simple, whereas obtaining higher-dimensional candidate subspaces from lower-dimensional dense subspaces is more involved; the two processes therefore differ.
Optionally, the step of determining the (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces may include:
First, when each k-dimensional dense subspace is a 1-dimensional dense subspace, every two 1-dimensional dense subspaces are merged to obtain 2-dimensional candidate subspaces.
It should be noted that, for 1-dimensional dense subspaces, the 1-dimensional dense subspaces of different dimensions are merged pairwise directly to generate the 2-dimensional candidate subspaces.
Second, when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, the two k-dimensional dense subspaces are merged and deduplicated to obtain a (k+1)-dimensional candidate subspace.
Here N is greater than 1 and less than the total number of dimensions of the high-dimensional big data. It should be noted that, for the k-dimensional dense subspaces, every pair of dense subspaces is checked for a common (k-1)-dimensional dense subspace; if they share one, the (k+1)-dimensional candidate subspace formed by fusing the two dense subspaces is output, and the resulting (k+1)-dimensional candidate subspaces are finally deduplicated. For example, the two 2-dimensional dense subspaces <1,2> and <1,3> share the 1-dimensional dense subspace <1>; the two 2-dimensional dense subspaces <1,2> and <2,3> share the 1-dimensional dense subspace <2>; and the two 2-dimensional dense subspaces <1,3> and <2,3> share the 1-dimensional dense subspace <3>. All three pairs merge into the (k+1)-dimensional candidate subspace <1,2,3>, so after deduplication they are output as a single (k+1)-dimensional candidate subspace. The time complexity of this process is O(N^2), because every two (k-1)-dimensional dense subspaces attempt a fusion into a k-dimensional dense subspace, where N is the total number of (k-1)-dimensional dense subspaces. Each dense subspace can be represented as its dense window numbers arranged in ascending (or descending) order of dimension, which fixes the order of the feature values in the resulting (k+1)-dimensional candidate subspace and avoids out-of-order arrangements; the deduplication then fully guarantees the uniqueness of the candidate subspaces. A sketch of this candidate generation is given below.
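For illustration only, the following sketch shows one way this fusion-and-deduplication step could be written. It is a simplification: each dense subspace is represented only by the set of its dimension numbers (the associated dense windows are omitted), and the names are hypothetical rather than taken from the patent.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

def candidate_subspaces(dense_k: List[FrozenSet[int]]) -> List[FrozenSet[int]]:
    """Fuse every two k-dimensional dense subspaces that share a
    (k-1)-dimensional subspace into a (k+1)-dimensional candidate;
    collecting the unions in a set deduplicates the candidates."""
    candidates: Set[FrozenSet[int]] = set()
    for a, b in combinations(dense_k, 2):
        union = a | b
        if len(union) == len(a) + 1:  # a and b differ in exactly one dimension
            candidates.add(union)
    return sorted(candidates, key=sorted)

# Example from the text: the 2-dimensional dense subspaces <1,2>, <1,3>, <2,3>
print(candidate_subspaces([frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]))
# -> [frozenset({1, 2, 3})]: one 3-dimensional candidate after deduplication
```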
S104: creating a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace.
It should be noted that a second Map task is created for each k-dimensional dense subspace; in the second Map task, each k-dimensional dense subspace is analyzed and counted, and all sample points distributed in the k-dimensional dense subspace, i.e. all data of each row lying in the k-dimensional dense subspace, are obtained.
S105: in each second Map task, when a (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional dense subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace.
It should be noted that, in each second Map task, each k-dimensional dense subspace traverses all (k+1)-dimensional candidate subspaces, finds the (k+1)-dimensional candidate subspaces to which it belongs, and sends all sample points of the k-dimensional dense subspace to the Reduce node that clusters that (k+1)-dimensional candidate subspace, so that the sample points contained in all k-dimensional dense subspaces covered by the same (k+1)-dimensional candidate subspace are sent to the same Reduce node. The matching can be done through the inclusion relation between the dimensions of the (k+1)-dimensional candidate subspace and the dimensions of the k-dimensional dense subspace. For example, suppose the m-th 3-dimensional candidate subspace <S_3{d1,d2,d3},m> covers the n1-th 2-dimensional dense subspace <S_2{d1,d2},n1>, the n2-th 2-dimensional dense subspace <S_2{d2,d3},n2> and the n3-th 2-dimensional dense subspace <S_2{d1,d3},n3>, where d1, d2 and d3 are dimension numbers, and the sample point sets covered by the 2-dimensional dense subspaces S_2{d1,d2}, S_2{d2,d3} and S_2{d1,d3} are DB1, DB2 and DB3 respectively. Then the Map tasks of the 2-dimensional dense subspaces S_2{d1,d2}, S_2{d2,d3} and S_2{d1,d3} output <S_3{d1,d2,d3},DB1>, <S_3{d1,d2,d3},DB2> and <S_3{d1,d2,d3},DB3> respectively.
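As an illustration only, the sketch below shows one possible form of this second Map step, under the same simplifying assumptions as the earlier sketches (subspaces represented by their sets of dimension names, sample points by row numbers, hypothetical function names):

```python
from typing import FrozenSet, Iterator, List, Set, Tuple

def second_map(dense_subspace: FrozenSet[str], samples: Set[int],
               candidates: List[FrozenSet[str]]
               ) -> Iterator[Tuple[FrozenSet[str], Set[int]]]:
    """For every (k+1)-dimensional candidate whose dimensions include all
    dimensions of this k-dimensional dense subspace (i.e. that covers it),
    emit <candidate, sample point set>, so that all sample sets covered by
    the same candidate reach the same second Reduce node."""
    for candidate in candidates:
        if dense_subspace <= candidate:  # dimension-inclusion test
            yield candidate, samples

# Example matching the text: S_2{d1,d2} with sample set DB1 = {5, 9, 12}
cands = [frozenset({"d1", "d2", "d3"})]
print(list(second_map(frozenset({"d1", "d2"}), {5, 9, 12}, cands)))
# e.g. [(frozenset({'d1', 'd2', 'd3'}), {5, 9, 12})]  (set print order may vary)
```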
S106: sending the sample point sets of all k-dimensional dense subspaces covered by a (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
It should be noted that the sample points of all (k+1)-dimensional candidate subspaces obtained in the second Map tasks are sent to the second Reduce nodes; the intersection D_k+1 of the sample point sets of all k-dimensional dense subspaces covered by a candidate subspace is taken, from which the clustered (k+1)-dimensional dense subspace is determined.
Optionally, the step of obtaining the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, may include:
First, the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace is counted.
It should be noted that a (k+1)-dimensional candidate subspace may cover multiple k-dimensional dense subspaces, and the k-dimensional dense subspaces intersect one another; the intersection obtained contains a number of sample points, and counting the sample points in the intersection is used to decide whether the (k+1)-dimensional candidate subspace is a (k+1)-dimensional dense subspace.
Second, the preset cluster density threshold is obtained, and when the number of sample points in the intersection exceeds the preset cluster density threshold, the (k+1)-dimensional candidate subspace is determined to be a clustered (k+1)-dimensional dense subspace.
It should be noted that the preset cluster density threshold is the decision threshold for dense subspaces and is usually preset empirically by the user or by those skilled in the art. If the number of sample points in the intersection exceeds the preset cluster density threshold, the sample-point density of the (k+1)-dimensional candidate subspace is considered high, and the (k+1)-dimensional candidate subspace can be determined to be a clustered (k+1)-dimensional dense subspace.
It should be emphasized that if (k+1)-dimensional dense subspaces exist and k+1 is less than the total number of dimensions of the high-dimensional big data, S103 to S106 are repeated to iteratively compute the (k+2)-dimensional candidate subspaces and dense subspaces; otherwise the procedure ends.
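For illustration only, a minimal sketch of this second Reduce step is given below, using the same hypothetical representation as the earlier sketches (subspaces as sets of dimension names, sample points as row numbers):

```python
from typing import FrozenSet, List, Optional, Set, Tuple

def second_reduce(candidate: FrozenSet[str], covered_sample_sets: List[Set[int]],
                  cluster_density_threshold: int
                  ) -> Optional[Tuple[FrozenSet[str], Set[int]]]:
    """Intersect the sample point sets of all covered k-dimensional dense
    subspaces; if the intersection holds more points than the preset cluster
    density threshold, the candidate is a clustered (k+1)-dimensional dense
    subspace and is returned together with its sample points."""
    intersection = set.intersection(*covered_sample_sets)
    if len(intersection) > cluster_density_threshold:
        return candidate, intersection
    return None

# Example: three covered sample sets and a cluster density threshold of 2
print(second_reduce(frozenset({"d1", "d2", "d3"}),
                    [{1, 2, 3, 5}, {2, 3, 5, 8}, {2, 3, 5, 9}], 2))
# -> (frozenset({'d1', 'd2', 'd3'}), {2, 3, 5}): a clustered 3-dimensional dense subspace
```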
Specifically, to facilitate determining the next-dimensional candidate subspaces, after the clustered (k+1)-dimensional dense subspaces are determined, the (k+1)-dimensional dense subspaces and all sample points in them can be stored in key-value pair form.
It should be noted that the key-value pair form is similar to an index: for example, <S_k+1,m> denotes the m-th (k+1)-dimensional dense subspace, and <S_k+1,m,<…,dt,…>> denotes all sample points contained in the m-th (k+1)-dimensional dense subspace, where dt is a specific sample point identified by its row number in the original high-dimensional big data.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
As shown in Fig. 2 a kind of Subspace clustering method towards high-dimensional big data for being provided of the embodiment of the present invention,
The step of the embodiment shown in Fig. 1 after S105, the method can also include:
S107, the intensive subspace of k dimensions that k+1 ties up candidate subspace and k+1 dimension candidates subspace covering is distributed to respectively
Individual child node is performed.
It should be noted that each (k+1)-dimensional candidate subspace and each k-dimensional dense subspace can be distributed to the child nodes for execution through a distributed cache mechanism; each child node may execute the clustering of one subspace or of several subspaces, which is not limited here. Of course, to better improve the running efficiency of the algorithm, one child node may execute the clustering of one subspace, in which case more child nodes may be needed.
It should be emphasized that S101 to S106 in this embodiment are the same as in the embodiment shown in Fig. 1 and are not repeated here.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework. In addition, the clustered dense subspaces and the candidate subspaces are cached and distributed to the child nodes for execution, which ensures that the clustering is executed in parallel and further improves the running efficiency of the algorithm.
Corresponding to the above method embodiments, an embodiment of the present invention provides a subspace clustering device for high-dimensional big data. As shown in Fig. 3, the device may include:
a first splitting module 310, configured to obtain input high-dimensional big data, create a first Map task for each row of data under a MapReduce framework, and split the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
a first determining module 320, configured to send the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
a second determining module 330, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
a second splitting module 340, configured to create a second Map task for each k-dimensional dense subspace and obtain all sample points distributed in each k-dimensional dense subspace;
a third determining module 350, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace;
a fourth determining module 360, configured to send the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
Optionally, the first determining module 320 may specifically be configured to:
determine, according to all feature values of each dimension, the maximum and minimum feature values of the dimension, and determine the data range of all feature values of the dimension as the region greater than or equal to the minimum and less than or equal to the maximum;
obtain the preset number of windows, divide the data range into Np initial windows, and allocate all feature values to the initial windows, where Np is the preset number of windows;
obtain the preset window-merging threshold, and count the number of feature values in each initial window;
when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, merge the two adjacent initial windows to obtain merged windows;
obtain the preset window-density threshold, and determine that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces.
Optionally, the first determining module 320 may further be configured to:
when the total number of merged windows is 1, divide the merged window into Np windows according to the preset number of windows.
Optionally, the first determining module 320 may further be configured to:
store the 1-dimensional dense subspaces and all sample points in the 1-dimensional dense subspaces in key-value pair form.
Optionally, the second determining module 330 may specifically be configured to:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, merge every two 1-dimensional dense subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, merge and deduplicate the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, where N is greater than 1 and less than the total number of dimensions of the high-dimensional big data.
Optionally, the fourth determining module 360 may specifically be configured to:
count the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
obtain the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determine that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace.
Optionally, the fourth determining module 360 may further be configured to:
store the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace in key-value pair form.
Further, on the basis of the first splitting module 310, the first determining module 320, the second determining module 330, the second splitting module 340, the third determining module 350 and the fourth determining module 360, as shown in Fig. 4, the subspace clustering device for high-dimensional big data provided by the embodiment of the present invention may further include:
a distribution module 410, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
With the present embodiment, a first Map task is first set up for each data row of the input high-dimensional big data, the data in each first Map task are split by dimension, and the split characteristic values are gathered in a first Reduce node to obtain the 1-dimensional intensive subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional intensive subspaces, a second Map task is set up for each lower-dimensional intensive subspace to obtain all sample points distributed in each k-dimensional intensive subspace, and each second Map task determines whether a higher-dimensional candidate subspace covers the lower-dimensional intensive subspace. Finally, the second Reduce nodes gather the results and, according to the intersection of the sample point sets of the lower-dimensional intensive subspaces covered by each higher-dimensional candidate subspace, determine the clustered higher-dimensional intensive subspaces. The present embodiment relies on the distributed MapReduce framework to implement an improved distributed parallel Mafia subspace clustering algorithm: the MapReduce framework takes care of the difficulties of distributed storage, scheduling, load balancing, fault-tolerant processing and network communication in big data processing, while the data partitioning and task parallelism of MapReduce are exploited to design reasonable Map and Reduce functions that execute the algorithm in a distributed, parallel manner. The present embodiment avoids unnecessary reduction operations and input/output operations, ensures that the whole computing task is evenly distributed over the nodes without mutual interference, and requires only a small, specific amount of data to be shared; all data are stored and accessed through an efficient key-value structure, so that the balance between disk access cost and network access cost is reliably achieved at minimum input/output and node communication cost. Moreover, the present embodiment can process high-dimensional big data, markedly accelerates execution and has good scalability, so that the operational efficiency of clustering high-dimensional big data can be improved on the basis of the MapReduce framework. In addition, the intensive subspaces and candidate subspaces obtained by clustering are cached and distributed to each child node for execution, which guarantees parallel execution on the clustered data and improves the operational efficiency of the algorithm.
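To make the shape of the first MapReduce phase concrete, the following is a minimal single-machine sketch, assuming each data row is a list of numeric characteristic values; in a real deployment map_row and reduce_dimension would be expressed as Map and Reduce functions of the framework, and the per-dimension value lists would then be passed to the adaptive-window step sketched earlier:

```python
from collections import defaultdict

def map_row(row):
    """First Map task: split one data row by dimension and emit
    (dimension index, characteristic value) pairs."""
    for dim, value in enumerate(row):
        yield dim, value

def reduce_dimension(pairs):
    """First Reduce: gather all characteristic values per dimension; the
    adaptive-window step would then be applied to each value list to obtain
    the 1-dimensional intensive subspaces of that dimension."""
    by_dim = defaultdict(list)
    for dim, value in pairs:
        by_dim[dim].append(value)
    return dict(by_dim)

# Illustrative use on a tiny in-memory dataset:
rows = [[0.1, 5.0], [0.2, 5.1], [0.15, 9.7]]
pairs = [kv for row in rows for kv in map_row(row)]
values_by_dim = reduce_dimension(pairs)   # {0: [0.1, 0.2, 0.15], 1: [5.0, 5.1, 9.7]}
```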
It should be noted that the subspace clustering device for high-dimensional big data of the embodiment of the present invention is a device that applies the above subspace clustering method for high-dimensional big data; therefore all embodiments of the above subspace clustering method for high-dimensional big data apply to the device and can achieve the same or similar beneficial effects.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "including", "comprising" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
Each embodiment in this specification is described in a related manner; identical or similar parts among the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant parts, reference may be made to the description of the method embodiments.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A subspace clustering method for high-dimensional big data, characterised in that the method comprises:
obtaining input high-dimensional big data, setting up a first Map task for each data row under the MapReduce framework, splitting the data in each first Map task by dimension, and obtaining the characteristic value of each dimension in each first Map task;
sending the characteristic value of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the data range of all characteristic values of each dimension, a preset window number, a preset window merging threshold and a preset window density threshold are obtained and used to obtain the 1-dimensional intensive subspace of each dimension;
determining a (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces, wherein k is greater than or equal to 1 and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
setting up a second Map task for each k-dimensional intensive subspace, and obtaining all sample points distributed in each k-dimensional intensive subspace;
in each second Map task, when the (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional intensive subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional intensive subspace;
sending the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional intensive subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
2. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that obtaining the data range of all characteristic values of each dimension, the preset window number, the preset window merging threshold and the preset window density threshold, and obtaining therefrom the 1-dimensional intensive subspace of each dimension, comprises:
determining, according to all characteristic values of each dimension, the maximum and the minimum of the characteristic values in each dimension, and determining that the data range of all characteristic values of each dimension is the data region greater than or equal to the minimum and less than or equal to the maximum;
obtaining the preset window number, dividing the data range into Np home windows according to the preset window number, and distributing all characteristic values to the home windows, wherein Np is the preset window number;
obtaining the preset window merging threshold, and counting the number of characteristic values in each home window;
when the difference between the numbers of characteristic values in two adjacent home windows is less than the preset window merging threshold, merging the two adjacent home windows to obtain a merged window;
obtaining the preset window density threshold, and determining that a merged window in which the number of characteristic values exceeds the preset window density threshold is a 1-dimensional intensive subspace.
3. The subspace clustering method for high-dimensional big data according to claim 2, characterised in that, after merging the two adjacent home windows to obtain the merged window when the difference between the numbers of characteristic values in the two adjacent home windows is less than the preset window merging threshold, the method further comprises:
when the total number of merged windows is 1, dividing the merged window evenly into Np windows according to the preset window number.
4. The subspace clustering method for high-dimensional big data according to claim 2, characterised in that, after obtaining the preset window density threshold and determining that a merged window in which the number of characteristic values exceeds the preset window density threshold is a 1-dimensional intensive subspace, the method further comprises:
storing, in key-value pair form, the 1-dimensional intensive subspaces and all sample points in the 1-dimensional intensive subspaces.
5. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that determining the (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces comprises:
when each k-dimensional intensive subspace is a 1-dimensional intensive subspace, merging every two 1-dimensional intensive subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional intensive subspaces are N-dimensional intensive subspaces, if two k-dimensional intensive subspaces contain the same (k-1)-dimensional intensive subspace, merging the two k-dimensional intensive subspaces and removing duplicates to obtain (k+1)-dimensional candidate subspaces, wherein N is greater than 1 and less than the total dimensionality of the high-dimensional big data.
6. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that, after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional intensive subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional intensive subspace, the method further comprises:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
7. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that obtaining the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional intensive subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, comprises:
counting the number of sample points in the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace;
obtaining the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional intensive subspace.
8. The subspace clustering method for high-dimensional big data according to claim 7, characterised in that, after obtaining the preset cluster density threshold and determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional intensive subspace when the number of sample points in the intersection exceeds the preset cluster density threshold, the method further comprises:
storing, in key-value pair form, the (k+1)-dimensional intensive subspaces and all sample points in the (k+1)-dimensional intensive subspaces.
9. A subspace clustering device for high-dimensional big data, characterised in that the device comprises:
a first segmentation module, configured to obtain input high-dimensional big data, set up a first Map task for each data row under the MapReduce framework, split the data in each first Map task by dimension, and obtain the characteristic value of each dimension in each first Map task;
a first determining module, configured to send the characteristic value of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the data range of all characteristic values of each dimension, a preset window number, a preset window merging threshold and a preset window density threshold are obtained and used to obtain the 1-dimensional intensive subspace of each dimension;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces, wherein k is greater than or equal to 1 and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
a second segmentation module, configured to set up a second Map task for each k-dimensional intensive subspace and obtain all sample points distributed in each k-dimensional intensive subspace;
a third determining module, configured to determine, in each second Map task, that the (k+1)-dimensional candidate subspace covers a k-dimensional intensive subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional intensive subspace;
a fourth determining module, configured to send the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional intensive subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
10. The subspace clustering device for high-dimensional big data according to claim 9, characterised in that the device further comprises:
a distribution module, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710112771.3A CN106909942B (en) | 2017-02-28 | 2017-02-28 | Subspace clustering method and device for high-dimensionality big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909942A true CN106909942A (en) | 2017-06-30 |
CN106909942B CN106909942B (en) | 2022-09-13 |
Family
ID=59209172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710112771.3A Active CN106909942B (en) | 2017-02-28 | 2017-02-28 | Subspace clustering method and device for high-dimensionality big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909942B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
CN102419774A (en) * | 2011-12-15 | 2012-04-18 | 上海大学 | SNP data-oriented clustering method |
US20130254196A1 (en) * | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for hadoop |
CN104156463A (en) * | 2014-08-21 | 2014-11-19 | 南京信息工程大学 | Big-data clustering ensemble method based on MapReduce |
CN105471670A (en) * | 2014-09-11 | 2016-04-06 | 中兴通讯股份有限公司 | Flow data classification method and device |
CN105279524A (en) * | 2015-11-04 | 2016-01-27 | 盐城工学院 | High-dimensional data clustering method based on unweighted hypergraph segmentation |
CN105930916A (en) * | 2016-04-07 | 2016-09-07 | 大连理工大学 | Parallel modular neural network-based byproduct gas real-time prediction method |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341522A (en) * | 2017-07-11 | 2017-11-10 | 重庆大学 | A kind of text based on density semanteme subspace and method of the image without tag recognition |
CN108090182A (en) * | 2017-12-15 | 2018-05-29 | 清华大学 | A kind of distributed index method and system of extensive high dimensional data |
CN108090182B (en) * | 2017-12-15 | 2018-10-30 | 清华大学 | A kind of distributed index method and system of extensive high dimensional data |
CN109344194A (en) * | 2018-09-20 | 2019-02-15 | 北京工商大学 | Pesticide residue high dimensional data visual analysis method and system based on subspace clustering |
CN109344194B (en) * | 2018-09-20 | 2021-09-28 | 北京工商大学 | Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system |
CN110874615A (en) * | 2019-11-14 | 2020-03-10 | 深圳前海微众银行股份有限公司 | Feature clustering processing method, cluster server and readable storage medium |
CN110874615B (en) * | 2019-11-14 | 2023-09-26 | 深圳前海微众银行股份有限公司 | Feature clustering processing method, cluster server and readable storage medium |
CN112926658A (en) * | 2021-02-26 | 2021-06-08 | 西安交通大学 | Image clustering method and device based on two-dimensional data embedding and adjacent topological graph |
CN112926658B (en) * | 2021-02-26 | 2023-03-21 | 西安交通大学 | Image clustering method and device based on two-dimensional data embedding and adjacent topological graph |
CN115357609A (en) * | 2022-10-24 | 2022-11-18 | 深圳比特微电子科技有限公司 | Method, device, equipment and medium for processing data of Internet of things |
Also Published As
Publication number | Publication date |
---|---|
CN106909942B (en) | 2022-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909942A (en) | A kind of Subspace clustering method and device towards high-dimensional big data | |
Deng et al. | Differential evolution algorithm with wavelet basis function and optimal mutation strategy for complex optimization problem | |
CN103345514B (en) | Streaming data processing method under big data environment | |
Chen et al. | An ordered clustering algorithm based on K-means and the PROMETHEE method | |
CN103235825B (en) | A kind of magnanimity face recognition search engine design method based on Hadoop cloud computing framework | |
CN110008259A (en) | The method and terminal device of visualized data analysis | |
CN103631922B (en) | Extensive Web information extracting method and system based on Hadoop clusters | |
CN101799748B (en) | Method for determining data sample class and system thereof | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
Ferranti et al. | A distributed approach to multi-objective evolutionary generation of fuzzy rule-based classifiers from big data | |
CN105912666A (en) | Method for high-performance storage and inquiry of hybrid structure data aiming at cloud platform | |
CN102385719A (en) | Regression prediction method and device | |
CN102737126A (en) | Classification rule mining method under cloud computing environment | |
CN102054002A (en) | Method and device for generating decision tree in data mining system | |
CN103324765A (en) | Multi-core synchronization data query optimization method based on column storage | |
CN103440539B (en) | A kind of user power utilization data processing method | |
CN110471946A (en) | A kind of LOF outlier detection method and system based on grid beta pruning | |
Xu et al. | Distributed maximal clique computation and management | |
Lee et al. | A comparison of network clustering algorithms in keyword network analysis: A case study with geography conference presentations | |
Zhang et al. | Fast fine-grained air quality index level prediction using random forest algorithm on cluster computing of spark | |
CN106126341A (en) | It is applied to many Computational frames processing system and the association rule mining method of big data | |
CN113052225A (en) | Alarm convergence method and device based on clustering algorithm and time sequence association rule | |
CN105930531A (en) | Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models | |
Vrbić | Data mining and cloud computing | |
CN107656995A (en) | Towards the data management system of big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||