CN106909942A - Subspace clustering method and device for high-dimensional big data - Google Patents
Subspace clustering method and device for high-dimensional big data
- Publication number
- CN106909942A (Application No. CN201710112771.3A)
- Authority
- CN
- China
- Prior art keywords
- subspace
- dimension
- dense
- window
- big data
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the present invention provides a subspace clustering method and device for high-dimensional big data. The method includes: creating a first Map task for each row of the obtained high-dimensional big data, and splitting the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task; in a first Reduce node, obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold; determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces; creating a second Map task for each k-dimensional dense subspace and obtaining all sample points distributed in each k-dimensional dense subspace; and, in a second Reduce node, obtaining the clustered (k+1)-dimensional dense subspaces. This scheme can improve the efficiency of clustering high-dimensional big data.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a subspace clustering method and device for high-dimensional big data.
Background technology
Clustering is the process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects. A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis is a statistical method for studying classification problems. With the arrival of the big data era, data sets keep growing in scale and dimensionality; because of irrelevant features and the curse of dimensionality, traditional clustering algorithms are no longer applicable to high-dimensional data.
To address the inapplicability of traditional clustering algorithms to high-dimensional data, the prior art proposes the concept of subspace clustering. Subspace clustering algorithms aim to find clusters hidden in subsets of the original feature space, thereby avoiding the curse of dimensionality and solving the clustering problem for high-dimensional data. According to the search strategy used, subspace clustering algorithms are generally divided into bottom-up and top-down approaches. Subspace clustering algorithms usually assume that the whole data set is processed on a single machine; when high-dimensional big data is encountered, clustering on a single machine runs into bottlenecks of memory capacity and core processing speed.
Among existing subspace clustering algorithms, the most common is the maximal-frequent-itemset MAFIA subspace clustering algorithm. The MAFIA algorithm first divides each dimension into uniform intervals, merges adjacent intervals whose data-distribution densities are similar in each dimension to produce a non-uniformly divided grid, and identifies the dense subspaces in the grid cells. It then generates the candidate cluster regions of the k-dimensional dense subspaces from the (k-1)-dimensional dense subspaces, finds adjacent dense subspaces within each selected candidate cluster region using depth-first search, and merges these dense subspaces by greedy growth: starting from an arbitrary dense subspace, a maximal region is greedily produced in each dimension until the union of all regions covers the whole cluster. The MAFIA algorithm reduces the number of elements in the candidate cluster regions and the amount of data split in each dimension, and eliminates the pruning step of subspace detection; moreover, the MAFIA algorithm can perform the clustering in parallel and then process the clustered data serially. However, for high-dimensional big data whose size reaches the terabyte or petabyte level and above, the data volume is enormous and the number of rows may reach tens of thousands; the data that the MAFIA algorithm obtains after interval division is still very large, so the MAFIA algorithm runs inefficiently when processing such huge data.
The content of the invention
The purpose of the embodiments of the present invention is to provide a subspace clustering method and device for high-dimensional big data, so as to improve the efficiency of clustering high-dimensional big data. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a subspace clustering method for high-dimensional big data. The method includes:
obtaining input high-dimensional big data, creating a first Map task for each row of data under a MapReduce framework, and splitting the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
sending the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
creating a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace;
in each second Map task, when a (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional dense subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace;
sending the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
Optionally, obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold includes:
determining, according to all feature values of each dimension, the maximum and minimum feature values of the dimension, and determining the data range of all feature values of the dimension as the region greater than or equal to the minimum and less than or equal to the maximum;
obtaining the preset number of windows, dividing the data range into Np initial windows, and allocating all feature values to the initial windows, where Np is the preset number of windows;
obtaining the preset window-merging threshold, and counting the number of feature values in each initial window;
when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, merging the two adjacent initial windows to obtain merged windows;
obtaining the preset window-density threshold, and determining that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces.
Optionally, after merging the two adjacent initial windows to obtain the merged windows when the difference between the numbers of feature values in the two adjacent initial windows is less than the preset window-merging threshold, the method further includes:
when the total number of merged windows is 1, dividing the merged window into Np windows according to the preset number of windows.
Optionally, after obtaining the preset window-density threshold and determining that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces, the method further includes:
storing the 1-dimensional dense subspaces and all sample points in the 1-dimensional dense subspaces in key-value pair form.
Optionally, determining the (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces includes:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, merging every two 1-dimensional dense subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, merging and deduplicating the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, where N is greater than 1 and less than the total number of dimensions of the high-dimensional big data.
Optionally, after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace, the method further includes:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
Optionally, obtaining the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, includes:
counting the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
obtaining the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace.
Optionally, after obtaining the preset cluster density threshold and determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace when the number of sample points in the intersection exceeds the preset cluster density threshold, the method further includes:
storing the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace in key-value pair form.
In a second aspect, an embodiment of the present invention further provides a subspace clustering device for high-dimensional big data. The device includes:
a first splitting module, configured to obtain input high-dimensional big data, create a first Map task for each row of data under a MapReduce framework, and split the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
a first determining module, configured to send the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
a second splitting module, configured to create a second Map task for each k-dimensional dense subspace and obtain all sample points distributed in each k-dimensional dense subspace;
a third determining module, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace;
a fourth determining module, configured to send the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
Optionally, the device further includes:
a distribution module, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
In the subspace clustering method and device for high-dimensional big data provided by the embodiments of the present invention, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
Brief description of the drawings
In order to explain the technical schemes of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
Fig. 2 is another schematic flowchart of a subspace clustering method for high-dimensional big data according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a subspace clustering device for high-dimensional big data according to an embodiment of the present invention;
Fig. 4 is another schematic structural diagram of a subspace clustering device for high-dimensional big data according to an embodiment of the present invention.
Specific embodiment
The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the invention.
In order to improve the efficiency of clustering high-dimensional big data, the embodiments of the present invention provide a subspace clustering method and device for high-dimensional big data.
A subspace clustering method for high-dimensional big data provided by an embodiment of the present invention is introduced first.
It should be noted that the executor of the subspace clustering method for high-dimensional big data provided by the embodiments of the present invention can be multiple processors that execute the clustering algorithm, such as a DSP (Digital Signal Processor), an ARM (Advanced RISC Machines) reduced-instruction-set microprocessor or an FPGA (Field-Programmable Gate Array), which is not limited here. The method can be implemented as software, a hardware circuit and/or a logic circuit arranged in the processor.
As shown in Fig. 1, a subspace clustering method for high-dimensional big data provided by an embodiment of the present invention may include the following steps:
S101: obtaining input high-dimensional big data, creating a first Map task for each row of data under a MapReduce framework, splitting the data by dimension in each first Map task, and obtaining the feature value of each dimension in each first Map task.
It should be noted that traditional subspace clustering algorithms usually assume that the whole data set is processed on a single computer. When high-dimensional big data is encountered, clustering on a single computer runs into bottlenecks of memory capacity and core processing speed; the efficiency and scalability of existing subspace clustering algorithms can be improved by parallel processing. An efficient way to increase processing speed is row-wise processing: the data set is decomposed into several sub-data-sets handled by multiple cooperating processing nodes, and each sub-data-set is processed by an independent node. Based on this idea of parallel processing, this embodiment first numbers each row (instance) of the input high-dimensional big data by row number and then splits the features of each instance by dimension. In other words, before the traditional MAFIA subspace clustering algorithm is executed, the high-dimensional big data is first divided by row and then divided by dimension as in the traditional MAFIA algorithm. When the input data is extremely large, the data volume that the traditional MAFIA algorithm handles after dividing by dimension may still be huge, for example tens of thousands of rows; dividing by row before dividing by dimension therefore greatly reduces the data each task finally handles, relieving the pressure on memory and on the processing cores.
It should be emphasized that a first Map task is created under the MapReduce framework for each row of the obtained high-dimensional big data. In the first Map task, the output consists of the dimension number, the row number of the instance and the value of the instance in that dimension. For example, if the data of the i-th row is <d1,d2,…,dD>, the output is <1,i:d1>, <2,i:d2> … <D,i:dD>, meaning that the value of the i-th row in the 1st dimension is d1, the value of the i-th row in the 2nd dimension is d2, and the value of the i-th row in the D-th dimension is dD. The MapReduce framework is used to process large-scale data in a massively parallel fashion: following the idea of divide and conquer, a task is divided into multiple subtasks, these subtasks are scheduled on the idle nodes of the cluster for fast processing, and the results output by the child nodes are then merged. MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). MapReduce consists of Map (mapping) and Reduce (reduction); its main ideas are borrowed from functional programming languages, together with features borrowed from vector programming languages. It allows programmers to run their programs on a distributed system without knowing distributed parallel programming. Current software implementations specify a Map function that maps a set of key-value pairs to a new set of key-value pairs, and a concurrent Reduce function that ensures that all mapped key-value pairs sharing the same key are grouped together.
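For illustration only, the following minimal sketch shows one way this first Map step could look; it is framework-independent plain Python rather than the patent's actual implementation, and the function and variable names are hypothetical. It emits one (dimension, (row, value)) pair per feature, mirroring the <dimension, row:value> output described above.

```python
from typing import Iterator, List, Tuple

def first_map(row_index: int, row: List[float]) -> Iterator[Tuple[int, Tuple[int, float]]]:
    """Emit one (dimension, (row, value)) pair per feature of the row,
    mirroring the first Map output <1,i:d1>, <2,i:d2>, ..., <D,i:dD>."""
    for dim, value in enumerate(row, start=1):
        yield dim, (row_index, value)

# Example: row i = 3 with feature values <0.5, 2.1, -7.0>
print(list(first_map(3, [0.5, 2.1, -7.0])))
# [(1, (3, 0.5)), (2, (3, 2.1)), (3, (3, -7.0))]
```

In an actual MapReduce deployment the pairs would be emitted to the framework, which shuffles all values of the same dimension to the same first Reduce node.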
S102: sending the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold.
It should be noted that the feature values of each dimension obtained in the first Map tasks are sent to the first Reduce nodes, where all feature values of each dimension are aggregated. For each dimension, the data range of all feature values may either be set in advance or be the range from the computed minimum to the computed maximum of all feature values. The preset number of windows, the preset window-merging threshold and the preset window-density threshold are all preset empirically by the user or by those skilled in the art: the preset number of windows is the number of regions into which all feature values of each dimension are divided, the preset window-merging threshold is the condition under which the divided regions are merged with one another, and the preset window-density threshold is the condition for determining that a divided region is a 1-dimensional dense subspace. It should be emphasized that the same processor node saves the division result after obtaining it and then performs this step.
Optionally, the step of obtaining the 1-dimensional dense subspace of each dimension according to the data range of all feature values of the dimension, the preset number of windows, the preset window-merging threshold and the preset window-density threshold may include:
First, according to all feature values of each dimension, the maximum and minimum feature values of the dimension are determined, and the data range of all feature values of the dimension is determined as the region greater than or equal to the minimum and less than or equal to the maximum.
It should be noted that, to ensure the accuracy of the range division, the data range of all feature values of a dimension can be defined as the range from the minimum to the maximum of all feature values of that dimension. For example, if the minimum of the first dimension of the input high-dimensional big data is -152 and the maximum is 512, the data range of the first dimension is [-152, 512].
Second, the preset number of windows is obtained, the data range is divided into Np initial windows, and all feature values are allocated to the initial windows.
Here Np is the preset number of windows. It should be noted that the preset number of windows is usually preset empirically by the user or by those skilled in the art, and is less than or equal to the total number of rows of the high-dimensional big data. For example, if the input high-dimensional big data has 12000 rows in total and the preset number of windows is 12000, the 12000 values of each dimension can be allocated to the preset windows.
Third, the preset window-merging threshold is obtained, and the number of feature values in each initial window is counted.
It should be noted that, in order to reduce the amount of data split in each dimension, the uniformly divided regions need to be merged according to a merging condition. The preset window-merging threshold serves as this merging condition, and the condition depends on the number of feature values in each initial window, which therefore needs to be counted; the initial grid formed after counting can be stored in the form <window number, instance count, minimum, maximum>.
Then, when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, the two adjacent initial windows are merged to obtain merged windows.
It should be noted that adjacent initial windows are merged whenever the difference between their numbers of feature values is less than the preset window-merging threshold, until the merged windows no longer change; this yields non-uniformly divided data regions without requiring data pruning, which improves the scalability of the algorithm. The windows finally obtained may be several non-uniform windows, or everything may have been merged into a single window. A single merged window is not useful for subsequent cluster analysis; therefore, if merging yields exactly one window, that window generally needs to be divided again.
Specifically, when the total number of merged windows is 1, the merged window is divided into Np windows according to the preset number of windows.
It should be noted that the single merged window can be re-divided either according to the previously set preset number of windows or according to a newly input preset number of windows. In this embodiment, to reduce the complexity of repeated data input, the preset number of windows that was previously input is used.
Finally, the preset window-density threshold is obtained, and the merged windows in which the number of feature values exceeds the preset window-density threshold are determined to be 1-dimensional dense subspaces.
It should be noted that if the number of feature values in a merged window is too small, the window is not considered a dense subspace, i.e. the window is not used as a subspace for clustering. Normally, whether the number of feature values in a merged window is too small is judged against a window-density threshold preset by the user or by those skilled in the art: if the number of feature values in a merged window is less than the preset window-density threshold, the number of feature values in that window is considered too small. This step yields all dense subspaces that satisfy the condition, and these dense subspaces are determined to be the 1-dimensional dense subspaces of each dimension.
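Purely as an illustration, the sketch below (plain Python, hypothetical names, assuming Np > 1) follows the steps just described for one dimension in a first Reduce node: uniform division into Np initial windows, merging of adjacent windows whose counts differ by less than the merging threshold, re-division when everything collapses into one window, and selection of the windows whose counts exceed the density threshold. It is a simplified sketch of the described procedure, not the patent's implementation.

```python
from typing import Dict, List, Tuple

def first_reduce(values: List[Tuple[int, float]], n_windows: int,
                 merge_threshold: int, density_threshold: int) -> List[Dict]:
    """Return the 1-dimensional dense windows of one dimension.

    `values` holds the (row_index, feature_value) pairs collected from the
    first Map tasks for a single dimension."""
    lo = min(v for _, v in values)
    hi = max(v for _, v in values)
    width = (hi - lo) / n_windows or 1.0

    # 1. Allocate every feature value to one of the Np uniform initial windows.
    windows = [{"rows": [], "min": lo + i * width, "max": lo + (i + 1) * width}
               for i in range(n_windows)]
    for row, v in values:
        idx = min(int((v - lo) / width), n_windows - 1)
        windows[idx]["rows"].append(row)

    # 2. Merge adjacent windows whose counts differ by less than the
    #    merging threshold, repeating until the partition stops changing.
    changed = True
    while changed:
        changed = False
        merged = [windows[0]]
        for w in windows[1:]:
            if abs(len(merged[-1]["rows"]) - len(w["rows"])) < merge_threshold:
                merged[-1] = {"rows": merged[-1]["rows"] + w["rows"],
                              "min": merged[-1]["min"], "max": w["max"]}
                changed = True
            else:
                merged.append(w)
        windows = merged

    # 3. If everything collapsed into a single window, fall back to the
    #    uniform Np-window division (re-run step 1 without merging).
    if len(windows) == 1:
        return first_reduce(values, n_windows, merge_threshold=0,
                            density_threshold=density_threshold)

    # 4. Windows whose feature-value count exceeds the window-density
    #    threshold are the 1-dimensional dense subspaces of this dimension.
    return [w for w in windows if len(w["rows"]) > density_threshold]
```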
Specifically, to facilitate the subsequent clustering of dense subspaces of different dimensionality under each dimension, after the 1-dimensional dense subspaces of each dimension are determined, the 1-dimensional dense subspaces and all sample points in them can be stored in key-value pair form.
Here a sample point is the set of all feature values of one row. It should be noted that the key-value pair form is similar to an index: for example, the key-value pair <S_1,i> denotes a 1-dimensional dense subspace of the i-th dimension, and <S_1,i,p_1,…,p_n> denotes the row numbers of the feature values contained in a 1-dimensional dense subspace of the i-th dimension, where p_1 is the row number of the 1st feature value and p_n is the row number of the n-th feature value. In this key-value pair form, the information about every feature value in the dense subspaces of each dimension is explicit, which is convenient for the subsequent steps.
It should be emphasized that the worst-case complexity of steps S101 and S102 is O(Np log Np), i.e. the complexity of these steps depends on the preset number of windows; the Reduce function of each sub-processor handles one dimension, and the computation time grows linearly with the number of dimensions.
S103: determining a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces.
Here k is greater than or equal to 1, and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data. It should be noted that after the one-dimensional dense subspaces have been clustered, dense subspaces of higher dimensionality must be computed. In general, a higher-dimensional dense subspace cannot be obtained directly from the lower-dimensional dense subspaces; instead, higher-dimensional candidate subspaces are obtained from the dense subspaces already found, and the dense subspaces are then determined from the candidate subspaces. Since the 1-dimensional dense subspaces contain few feature values, obtaining the 2-dimensional candidate subspaces is relatively simple, whereas obtaining higher-dimensional candidate subspaces from lower-dimensional dense subspaces is more involved; the two processes therefore differ.
Optionally, the step of determining the (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces may include:
First, when each k-dimensional dense subspace is a 1-dimensional dense subspace, every two 1-dimensional dense subspaces are merged to obtain 2-dimensional candidate subspaces.
It should be noted that, for 1-dimensional dense subspaces, the 1-dimensional dense subspaces of different dimensions are merged pairwise directly to generate the 2-dimensional candidate subspaces.
Second, when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, the two k-dimensional dense subspaces are merged and deduplicated to obtain a (k+1)-dimensional candidate subspace.
Here N is greater than 1 and less than the total number of dimensions of the high-dimensional big data. It should be noted that, for the k-dimensional dense subspaces, every pair of dense subspaces is checked for a common (k-1)-dimensional dense subspace; if they share one, the (k+1)-dimensional candidate subspace formed by fusing the two dense subspaces is output, and the resulting (k+1)-dimensional candidate subspaces are finally deduplicated. For example, the two 2-dimensional dense subspaces <1,2> and <1,3> share the 1-dimensional dense subspace <1>; the two 2-dimensional dense subspaces <1,2> and <2,3> share the 1-dimensional dense subspace <2>; and the two 2-dimensional dense subspaces <1,3> and <2,3> share the 1-dimensional dense subspace <3>. All three pairs merge into the (k+1)-dimensional candidate subspace <1,2,3>, so after deduplication they are output as a single (k+1)-dimensional candidate subspace. The time complexity of this process is O(N^2), because every two (k-1)-dimensional dense subspaces attempt a fusion into a k-dimensional dense subspace, where N is the total number of (k-1)-dimensional dense subspaces. Each dense subspace can be represented as its dense window numbers arranged in ascending (or descending) order of dimension, which fixes the order of the feature values in the resulting (k+1)-dimensional candidate subspace and avoids out-of-order arrangements; the deduplication then fully guarantees the uniqueness of the candidate subspaces. A sketch of this candidate generation is given below.
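For illustration only, the following sketch shows one way this fusion-and-deduplication step could be written. It is a simplification: each dense subspace is represented only by the set of its dimension numbers (the associated dense windows are omitted), and the names are hypothetical rather than taken from the patent.

```python
from itertools import combinations
from typing import FrozenSet, List, Set

def candidate_subspaces(dense_k: List[FrozenSet[int]]) -> List[FrozenSet[int]]:
    """Fuse every two k-dimensional dense subspaces that share a
    (k-1)-dimensional subspace into a (k+1)-dimensional candidate;
    collecting the unions in a set deduplicates the candidates."""
    candidates: Set[FrozenSet[int]] = set()
    for a, b in combinations(dense_k, 2):
        union = a | b
        if len(union) == len(a) + 1:  # a and b differ in exactly one dimension
            candidates.add(union)
    return sorted(candidates, key=sorted)

# Example from the text: the 2-dimensional dense subspaces <1,2>, <1,3>, <2,3>
print(candidate_subspaces([frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]))
# -> [frozenset({1, 2, 3})]: one 3-dimensional candidate after deduplication
```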
S104: creating a second Map task for each k-dimensional dense subspace, and obtaining all sample points distributed in each k-dimensional dense subspace.
It should be noted that a second Map task is created for each k-dimensional dense subspace; in the second Map task, each k-dimensional dense subspace is analyzed and counted, and all sample points distributed in the k-dimensional dense subspace, i.e. all data of each row lying in the k-dimensional dense subspace, are obtained.
S105: in each second Map task, when a (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional dense subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional dense subspace.
It should be noted that, in each second Map task, each k-dimensional dense subspace traverses all (k+1)-dimensional candidate subspaces, finds the (k+1)-dimensional candidate subspaces to which it belongs, and sends all sample points of the k-dimensional dense subspace to the Reduce node that clusters that (k+1)-dimensional candidate subspace, so that the sample points contained in all k-dimensional dense subspaces covered by the same (k+1)-dimensional candidate subspace are sent to the same Reduce node. The matching can be done through the inclusion relation between the dimensions of the (k+1)-dimensional candidate subspace and the dimensions of the k-dimensional dense subspace. For example, suppose the m-th 3-dimensional candidate subspace <S_3{d1,d2,d3},m> covers the n1-th 2-dimensional dense subspace <S_2{d1,d2},n1>, the n2-th 2-dimensional dense subspace <S_2{d2,d3},n2> and the n3-th 2-dimensional dense subspace <S_2{d1,d3},n3>, where d1, d2 and d3 are dimension numbers, and the sample point sets covered by the 2-dimensional dense subspaces S_2{d1,d2}, S_2{d2,d3} and S_2{d1,d3} are DB1, DB2 and DB3 respectively. Then the Map tasks of the 2-dimensional dense subspaces S_2{d1,d2}, S_2{d2,d3} and S_2{d1,d3} output <S_3{d1,d2,d3},DB1>, <S_3{d1,d2,d3},DB2> and <S_3{d1,d2,d3},DB3> respectively.
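As an illustration only, the sketch below shows one possible form of this second Map step, under the same simplifying assumptions as the earlier sketches (subspaces represented by their sets of dimension names, sample points by row numbers, hypothetical function names):

```python
from typing import FrozenSet, Iterator, List, Set, Tuple

def second_map(dense_subspace: FrozenSet[str], samples: Set[int],
               candidates: List[FrozenSet[str]]
               ) -> Iterator[Tuple[FrozenSet[str], Set[int]]]:
    """For every (k+1)-dimensional candidate whose dimensions include all
    dimensions of this k-dimensional dense subspace (i.e. that covers it),
    emit <candidate, sample point set>, so that all sample sets covered by
    the same candidate reach the same second Reduce node."""
    for candidate in candidates:
        if dense_subspace <= candidate:  # dimension-inclusion test
            yield candidate, samples

# Example matching the text: S_2{d1,d2} with sample set DB1 = {5, 9, 12}
cands = [frozenset({"d1", "d2", "d3"})]
print(list(second_map(frozenset({"d1", "d2"}), {5, 9, 12}, cands)))
# e.g. [(frozenset({'d1', 'd2', 'd3'}), {5, 9, 12})]  (set print order may vary)
```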
S106: sending the sample point sets of all k-dimensional dense subspaces covered by a (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
It should be noted that the sample points of all (k+1)-dimensional candidate subspaces obtained in the second Map tasks are sent to the second Reduce nodes; the intersection D_k+1 of the sample point sets of all k-dimensional dense subspaces covered by a candidate subspace is taken, from which the clustered (k+1)-dimensional dense subspace is determined.
Optionally, the step of obtaining the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional dense subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, may include:
First, the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace is counted.
It should be noted that a (k+1)-dimensional candidate subspace may cover multiple k-dimensional dense subspaces, and the k-dimensional dense subspaces intersect one another; the intersection obtained contains a number of sample points, and counting the sample points in the intersection is used to decide whether the (k+1)-dimensional candidate subspace is a (k+1)-dimensional dense subspace.
Second, the preset cluster density threshold is obtained, and when the number of sample points in the intersection exceeds the preset cluster density threshold, the (k+1)-dimensional candidate subspace is determined to be a clustered (k+1)-dimensional dense subspace.
It should be noted that the preset cluster density threshold is the decision threshold for dense subspaces and is usually preset empirically by the user or by those skilled in the art. If the number of sample points in the intersection exceeds the preset cluster density threshold, the sample-point density of the (k+1)-dimensional candidate subspace is considered high, and the (k+1)-dimensional candidate subspace can be determined to be a clustered (k+1)-dimensional dense subspace.
It should be emphasized that if (k+1)-dimensional dense subspaces exist and k+1 is less than the total number of dimensions of the high-dimensional big data, S103 to S106 are repeated to iteratively compute the (k+2)-dimensional candidate subspaces and dense subspaces; otherwise the procedure ends.
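For illustration only, a minimal sketch of this second Reduce step is given below, using the same hypothetical representation as the earlier sketches (subspaces as sets of dimension names, sample points as row numbers):

```python
from typing import FrozenSet, List, Optional, Set, Tuple

def second_reduce(candidate: FrozenSet[str], covered_sample_sets: List[Set[int]],
                  cluster_density_threshold: int
                  ) -> Optional[Tuple[FrozenSet[str], Set[int]]]:
    """Intersect the sample point sets of all covered k-dimensional dense
    subspaces; if the intersection holds more points than the preset cluster
    density threshold, the candidate is a clustered (k+1)-dimensional dense
    subspace and is returned together with its sample points."""
    intersection = set.intersection(*covered_sample_sets)
    if len(intersection) > cluster_density_threshold:
        return candidate, intersection
    return None

# Example: three covered sample sets and a cluster density threshold of 2
print(second_reduce(frozenset({"d1", "d2", "d3"}),
                    [{1, 2, 3, 5}, {2, 3, 5, 8}, {2, 3, 5, 9}], 2))
# -> (frozenset({'d1', 'd2', 'd3'}), {2, 3, 5}): a clustered 3-dimensional dense subspace
```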
Specifically, to facilitate determining the next-dimensional candidate subspaces, after the clustered (k+1)-dimensional dense subspaces are determined, the (k+1)-dimensional dense subspaces and all sample points in them can be stored in key-value pair form.
It should be noted that the key-value pair form is similar to an index: for example, <S_k+1,m> denotes the m-th (k+1)-dimensional dense subspace, and <S_k+1,m,<…,dt,…>> denotes all sample points contained in the m-th (k+1)-dimensional dense subspace, where dt is a specific sample point identified by its row number in the original high-dimensional big data.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
As shown in Fig. 2 a kind of Subspace clustering method towards high-dimensional big data for being provided of the embodiment of the present invention,
The step of the embodiment shown in Fig. 1 after S105, the method can also include:
S107, the intensive subspace of k dimensions that k+1 ties up candidate subspace and k+1 dimension candidates subspace covering is distributed to respectively
Individual child node is performed.
It should be noted that each (k+1)-dimensional candidate subspace and each k-dimensional dense subspace can be distributed to the child nodes for execution through a distributed cache mechanism; each child node may execute the clustering of one subspace or of several subspaces, which is not limited here. Of course, to better improve the running efficiency of the algorithm, one child node may execute the clustering of one subspace, in which case more child nodes may be needed.
It should be emphasized that S101 to S106 in this embodiment are the same as in the embodiment shown in Fig. 1 and are not repeated here.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework. In addition, the clustered dense subspaces and the candidate subspaces are cached and distributed to the child nodes for execution, which ensures that the clustering is executed in parallel and further improves the running efficiency of the algorithm.
Corresponding to the above method embodiments, an embodiment of the present invention provides a subspace clustering device for high-dimensional big data. As shown in Fig. 3, the device may include:
a first splitting module 310, configured to obtain input high-dimensional big data, create a first Map task for each row of data under a MapReduce framework, and split the data by dimension in each first Map task to obtain the feature value of each dimension in each first Map task;
a first determining module 320, configured to send the feature values of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the 1-dimensional dense subspace of each dimension is obtained according to the data range of all feature values of the dimension, a preset number of windows, a preset window-merging threshold and a preset window-density threshold;
a second determining module 330, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional dense subspaces, where k is greater than or equal to 1 and k+1 is less than or equal to the total number of dimensions of the high-dimensional big data;
a second splitting module 340, configured to create a second Map task for each k-dimensional dense subspace and obtain all sample points distributed in each k-dimensional dense subspace;
a third determining module 350, configured to determine, in each second Map task, that a (k+1)-dimensional candidate subspace covers a k-dimensional dense subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional dense subspace;
a fourth determining module 360, configured to send the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional dense subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
With this embodiment, first, a first Map task is created for each row of the input high-dimensional big data, the data is split by dimension in each first Map task, and the split feature values are aggregated in the first Reduce nodes to obtain the 1-dimensional dense subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional dense subspaces, a second Map task is created for each lower-dimensional dense subspace, all sample points distributed in each k-dimensional dense subspace are obtained, and the second Map tasks determine whether a higher-dimensional candidate subspace covers a lower-dimensional dense subspace. Finally, the clustered higher-dimensional dense subspaces are determined in the second Reduce nodes from the intersection of the sample point sets of the lower-dimensional dense subspaces covered by each higher-dimensional candidate subspace. This embodiment relies on the distributed MapReduce framework to implement an improved, distributed and parallel MAFIA subspace clustering algorithm: the MapReduce framework takes care of the complex issues of big-data processing such as distributed storage, scheduling, load balancing, fault tolerance and network communication, and the distributed parallel execution of the algorithm is achieved by designing reasonable Map and Reduce functions that exploit MapReduce's data partitioning and task parallelism. This embodiment avoids unnecessary reduce operations and input/output operations and ensures that the whole computing task is balanced across the nodes without mutual interference; only a small amount of specific data needs to be shared, all data is stored and accessed through an effective key-value structure, and a balance between disk access cost and network access cost is achieved with minimal input/output cost and node communication. Moreover, this embodiment can process high-dimensional big data, greatly speeds up execution and has good scalability, so the efficiency of clustering high-dimensional big data can be improved based on the MapReduce framework.
Optionally, the first determining module 320 may specifically be configured to:
determine, according to all feature values of each dimension, the maximum and minimum feature values of the dimension, and determine the data range of all feature values of the dimension as the region greater than or equal to the minimum and less than or equal to the maximum;
obtain the preset number of windows, divide the data range into Np initial windows, and allocate all feature values to the initial windows, where Np is the preset number of windows;
obtain the preset window-merging threshold, and count the number of feature values in each initial window;
when the difference between the numbers of feature values in two adjacent initial windows is less than the preset window-merging threshold, merge the two adjacent initial windows to obtain merged windows;
obtain the preset window-density threshold, and determine that the merged windows in which the number of feature values exceeds the preset window-density threshold are 1-dimensional dense subspaces.
Optionally, the first determining module 320 may further be configured to:
when the total number of merged windows is 1, divide the merged window into Np windows according to the preset number of windows.
Optionally, the first determining module 320 may further be configured to:
store the 1-dimensional dense subspaces and all sample points in the 1-dimensional dense subspaces in key-value pair form.
Optionally, the second determining module 330 may specifically be configured to:
when each k-dimensional dense subspace is a 1-dimensional dense subspace, merge every two 1-dimensional dense subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional dense subspaces are N-dimensional dense subspaces, if two k-dimensional dense subspaces contain the same (k-1)-dimensional dense subspace, merge and deduplicate the two k-dimensional dense subspaces to obtain a (k+1)-dimensional candidate subspace, where N is greater than 1 and less than the total number of dimensions of the high-dimensional big data.
Optionally, the fourth determining module 360 may specifically be configured to:
count the number of sample points in the intersection of the sample point sets of all k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace;
obtain the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determine that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional dense subspace.
Optionally, the fourth determining module 360 may further be configured to:
store the (k+1)-dimensional dense subspace and all sample points in the (k+1)-dimensional dense subspace in key-value pair form.
Further, on the basis of the first splitting module 310, the first determining module 320, the second determining module 330, the second splitting module 340, the third determining module 350 and the fourth determining module 360, as shown in Fig. 4, the subspace clustering device for high-dimensional big data provided by the embodiment of the present invention may further include:
a distribution module 410, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional dense subspaces covered by the (k+1)-dimensional candidate subspace to the respective child nodes for execution.
With the present embodiment, a first Map task is first set up for each data row of the input high-dimensional big data, the data in each first Map task are split by dimension, and the split characteristic values are gathered in a first Reduce node to obtain the 1-dimensional intensive subspaces. Then higher-dimensional candidate subspaces are determined from the lower-dimensional intensive subspaces, a second Map task is set up for each lower-dimensional intensive subspace to obtain all sample points distributed in each k-dimensional intensive subspace, and each second Map task determines whether a higher-dimensional candidate subspace covers the lower-dimensional intensive subspace. Finally, the second Reduce nodes gather the results and, according to the intersection of the sample point sets of the lower-dimensional intensive subspaces covered by each higher-dimensional candidate subspace, determine the clustered higher-dimensional intensive subspaces. The present embodiment relies on the distributed MapReduce framework to implement an improved distributed parallel Mafia subspace clustering algorithm: the MapReduce framework takes care of the difficulties of distributed storage, scheduling, load balancing, fault-tolerant processing and network communication in big data processing, while the data partitioning and task parallelism of MapReduce are exploited to design reasonable Map and Reduce functions that execute the algorithm in a distributed, parallel manner. The present embodiment avoids unnecessary reduction operations and input/output operations, ensures that the whole computing task is evenly distributed over the nodes without mutual interference, and requires only a small, specific amount of data to be shared; all data are stored and accessed through an efficient key-value structure, so that the balance between disk access cost and network access cost is reliably achieved at minimum input/output and node communication cost. Moreover, the present embodiment can process high-dimensional big data, markedly accelerates execution and has good scalability, so that the operational efficiency of clustering high-dimensional big data can be improved on the basis of the MapReduce framework. In addition, the intensive subspaces and candidate subspaces obtained by clustering are cached and distributed to each child node for execution, which guarantees parallel execution on the clustered data and improves the operational efficiency of the algorithm.
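To make the shape of the first MapReduce phase concrete, the following is a minimal single-machine sketch, assuming each data row is a list of numeric characteristic values; in a real deployment map_row and reduce_dimension would be expressed as Map and Reduce functions of the framework, and the per-dimension value lists would then be passed to the adaptive-window step sketched earlier:

```python
from collections import defaultdict

def map_row(row):
    """First Map task: split one data row by dimension and emit
    (dimension index, characteristic value) pairs."""
    for dim, value in enumerate(row):
        yield dim, value

def reduce_dimension(pairs):
    """First Reduce: gather all characteristic values per dimension; the
    adaptive-window step would then be applied to each value list to obtain
    the 1-dimensional intensive subspaces of that dimension."""
    by_dim = defaultdict(list)
    for dim, value in pairs:
        by_dim[dim].append(value)
    return dict(by_dim)

# Illustrative use on a tiny in-memory dataset:
rows = [[0.1, 5.0], [0.2, 5.1], [0.15, 9.7]]
pairs = [kv for row in rows for kv in map_row(row)]
values_by_dim = reduce_dimension(pairs)   # {0: [0.1, 0.2, 0.15], 1: [5.0, 5.1, 9.7]}
```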
It should be noted that the subspace clustering device for high-dimensional big data of the embodiment of the present invention is a device that applies the above subspace clustering method for high-dimensional big data; therefore all embodiments of the above subspace clustering method for high-dimensional big data apply to the device and can achieve the same or similar beneficial effects.
It should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "including", "comprising" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or further includes elements inherent to such a process, method, article or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes that element.
Each embodiment in this specification is described in a related manner; identical or similar parts among the embodiments may refer to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant parts, reference may be made to the description of the method embodiments.
The foregoing is merely preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A subspace clustering method for high-dimensional big data, characterised in that the method comprises:
obtaining input high-dimensional big data, setting up a first Map task for each data row under the MapReduce framework, splitting the data in each first Map task by dimension, and obtaining the characteristic value of each dimension in each first Map task;
sending the characteristic value of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the data range of all characteristic values of each dimension, a preset window number, a preset window merging threshold and a preset window density threshold are obtained and used to obtain the 1-dimensional intensive subspace of each dimension;
determining a (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces, wherein k is greater than or equal to 1 and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
setting up a second Map task for each k-dimensional intensive subspace, and obtaining all sample points distributed in each k-dimensional intensive subspace;
in each second Map task, when the (k+1)-dimensional candidate subspace contains all dimensions of a k-dimensional intensive subspace, determining that the (k+1)-dimensional candidate subspace covers the k-dimensional intensive subspace;
sending the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional intensive subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
2. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that obtaining the data range of all characteristic values of each dimension, the preset window number, the preset window merging threshold and the preset window density threshold, and obtaining therefrom the 1-dimensional intensive subspace of each dimension, comprises:
determining, according to all characteristic values of each dimension, the maximum and the minimum of the characteristic values in each dimension, and determining that the data range of all characteristic values of each dimension is the data region greater than or equal to the minimum and less than or equal to the maximum;
obtaining the preset window number, dividing the data range into Np home windows according to the preset window number, and distributing all characteristic values to the home windows, wherein Np is the preset window number;
obtaining the preset window merging threshold, and counting the number of characteristic values in each home window;
when the difference between the numbers of characteristic values in two adjacent home windows is less than the preset window merging threshold, merging the two adjacent home windows to obtain a merged window;
obtaining the preset window density threshold, and determining that a merged window in which the number of characteristic values exceeds the preset window density threshold is a 1-dimensional intensive subspace.
3. The subspace clustering method for high-dimensional big data according to claim 2, characterised in that, after merging the two adjacent home windows to obtain the merged window when the difference between the numbers of characteristic values in the two adjacent home windows is less than the preset window merging threshold, the method further comprises:
when the total number of merged windows is 1, dividing the merged window evenly into Np windows according to the preset window number.
4. The subspace clustering method for high-dimensional big data according to claim 2, characterised in that, after obtaining the preset window density threshold and determining that a merged window in which the number of characteristic values exceeds the preset window density threshold is a 1-dimensional intensive subspace, the method further comprises:
storing, in key-value pair form, the 1-dimensional intensive subspaces and all sample points in the 1-dimensional intensive subspaces.
5. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that determining the (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces comprises:
when each k-dimensional intensive subspace is a 1-dimensional intensive subspace, merging every two 1-dimensional intensive subspaces to obtain 2-dimensional candidate subspaces;
when the k-dimensional intensive subspaces are N-dimensional intensive subspaces, if two k-dimensional intensive subspaces contain the same (k-1)-dimensional intensive subspace, merging the two k-dimensional intensive subspaces and removing duplicates to obtain (k+1)-dimensional candidate subspaces, wherein N is greater than 1 and less than the total dimensionality of the high-dimensional big data.
6. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that, after determining that the (k+1)-dimensional candidate subspace covers the k-dimensional intensive subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional intensive subspace, the method further comprises:
distributing the (k+1)-dimensional candidate subspace and the k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
7. The subspace clustering method for high-dimensional big data according to claim 1, characterised in that obtaining the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and the preset cluster density threshold, and obtaining the clustered (k+1)-dimensional intensive subspace according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold, comprises:
counting the number of sample points in the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace;
obtaining the preset cluster density threshold, and when the number of sample points in the intersection exceeds the preset cluster density threshold, determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional intensive subspace.
8. The subspace clustering method for high-dimensional big data according to claim 7, characterised in that, after obtaining the preset cluster density threshold and determining that the (k+1)-dimensional candidate subspace is a clustered (k+1)-dimensional intensive subspace when the number of sample points in the intersection exceeds the preset cluster density threshold, the method further comprises:
storing, in key-value pair form, the (k+1)-dimensional intensive subspaces and all sample points in the (k+1)-dimensional intensive subspaces.
9. A subspace clustering device for high-dimensional big data, characterised in that the device comprises:
a first segmentation module, configured to obtain input high-dimensional big data, set up a first Map task for each data row under the MapReduce framework, split the data in each first Map task by dimension, and obtain the characteristic value of each dimension in each first Map task;
a first determining module, configured to send the characteristic value of each dimension in each first Map task to a first Reduce node, so that in each first Reduce node the data range of all characteristic values of each dimension, a preset window number, a preset window merging threshold and a preset window density threshold are obtained and used to obtain the 1-dimensional intensive subspace of each dimension;
a second determining module, configured to determine a (k+1)-dimensional candidate subspace from every two k-dimensional intensive subspaces, wherein k is greater than or equal to 1 and k+1 is less than or equal to the total dimensionality of the high-dimensional big data;
a second segmentation module, configured to set up a second Map task for each k-dimensional intensive subspace and obtain all sample points distributed in each k-dimensional intensive subspace;
a third determining module, configured to determine, in each second Map task, that the (k+1)-dimensional candidate subspace covers a k-dimensional intensive subspace when the (k+1)-dimensional candidate subspace contains all dimensions of the k-dimensional intensive subspace;
a fourth determining module, configured to send the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to a second Reduce node, so that in each second Reduce node the intersection of the sample point sets of all k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace and a preset cluster density threshold are obtained, and the clustered (k+1)-dimensional intensive subspace is obtained according to the (k+1)-dimensional candidate subspace, the intersection and the preset cluster density threshold.
10. The subspace clustering device for high-dimensional big data according to claim 9, characterised in that the device further comprises:
a distribution module, configured to distribute the (k+1)-dimensional candidate subspace and the k-dimensional intensive subspaces covered by the (k+1)-dimensional candidate subspace to each child node for execution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710112771.3A CN106909942B (en) | 2017-02-28 | 2017-02-28 | Subspace clustering method and device for high-dimensionality big data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106909942A true CN106909942A (en) | 2017-06-30 |
CN106909942B CN106909942B (en) | 2022-09-13 |
Family
ID=59209172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710112771.3A Active CN106909942B (en) | 2017-02-28 | 2017-02-28 | Subspace clustering method and device for high-dimensionality big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106909942B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
CN102419774A (en) * | 2011-12-15 | 2012-04-18 | 上海大学 | SNP data-oriented clustering method |
US20130254196A1 (en) * | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for hadoop |
CN104156463A (en) * | 2014-08-21 | 2014-11-19 | 南京信息工程大学 | Big-data clustering ensemble method based on MapReduce |
CN105471670A (en) * | 2014-09-11 | 2016-04-06 | 中兴通讯股份有限公司 | Flow data classification method and device |
CN105279524A (en) * | 2015-11-04 | 2016-01-27 | 盐城工学院 | High-dimensional data clustering method based on unweighted hypergraph segmentation |
CN105930916A (en) * | 2016-04-07 | 2016-09-07 | 大连理工大学 | Parallel modular neural network-based byproduct gas real-time prediction method |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107341522A (en) * | 2017-07-11 | 2017-11-10 | 重庆大学 | A kind of text based on density semanteme subspace and method of the image without tag recognition |
CN108090182A (en) * | 2017-12-15 | 2018-05-29 | 清华大学 | A kind of distributed index method and system of extensive high dimensional data |
CN108090182B (en) * | 2017-12-15 | 2018-10-30 | 清华大学 | A kind of distributed index method and system of extensive high dimensional data |
CN109344194A (en) * | 2018-09-20 | 2019-02-15 | 北京工商大学 | Pesticide residue high dimensional data visual analysis method and system based on subspace clustering |
CN109344194B (en) * | 2018-09-20 | 2021-09-28 | 北京工商大学 | Subspace clustering-based pesticide residue high-dimensional data visual analysis method and system |
CN110874615A (en) * | 2019-11-14 | 2020-03-10 | 深圳前海微众银行股份有限公司 | Feature clustering processing method, cluster server and readable storage medium |
CN110874615B (en) * | 2019-11-14 | 2023-09-26 | 深圳前海微众银行股份有限公司 | Feature clustering processing method, cluster server and readable storage medium |
CN112926658A (en) * | 2021-02-26 | 2021-06-08 | 西安交通大学 | Image clustering method and device based on two-dimensional data embedding and adjacent topological graph |
CN112926658B (en) * | 2021-02-26 | 2023-03-21 | 西安交通大学 | Image clustering method and device based on two-dimensional data embedding and adjacent topological graph |
CN115357609A (en) * | 2022-10-24 | 2022-11-18 | 深圳比特微电子科技有限公司 | Method, device, equipment and medium for processing data of Internet of things |
Also Published As
Publication number | Publication date |
---|---|
CN106909942B (en) | 2022-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106909942A (en) | A kind of Subspace clustering method and device towards high-dimensional big data | |
Deng et al. | Differential evolution algorithm with wavelet basis function and optimal mutation strategy for complex optimization problem | |
CN103345514B (en) | Streaming data processing method under big data environment | |
Chen et al. | An ordered clustering algorithm based on K-means and the PROMETHEE method | |
CN103235825B (en) | A kind of magnanimity face recognition search engine design method based on Hadoop cloud computing framework | |
CN110008259A (en) | The method and terminal device of visualized data analysis | |
CN103631922B (en) | Extensive Web information extracting method and system based on Hadoop clusters | |
CN101799748B (en) | Method for determining data sample class and system thereof | |
CN107193967A (en) | A kind of multi-source heterogeneous industry field big data handles full link solution | |
Ferranti et al. | A distributed approach to multi-objective evolutionary generation of fuzzy rule-based classifiers from big data | |
CN105912666A (en) | Method for high-performance storage and inquiry of hybrid structure data aiming at cloud platform | |
CN102385719A (en) | Regression prediction method and device | |
CN102737126A (en) | Classification rule mining method under cloud computing environment | |
CN102054002A (en) | Method and device for generating decision tree in data mining system | |
CN103324765A (en) | Multi-core synchronization data query optimization method based on column storage | |
CN103440539B (en) | A kind of user power utilization data processing method | |
CN110471946A (en) | A kind of LOF outlier detection method and system based on grid beta pruning | |
Xu et al. | Distributed maximal clique computation and management | |
Lee et al. | A comparison of network clustering algorithms in keyword network analysis: A case study with geography conference presentations | |
Zhang et al. | Fast fine-grained air quality index level prediction using random forest algorithm on cluster computing of spark | |
CN106126341A (en) | It is applied to many Computational frames processing system and the association rule mining method of big data | |
CN113052225A (en) | Alarm convergence method and device based on clustering algorithm and time sequence association rule | |
CN105930531A (en) | Method for optimizing cloud dimensions of agricultural domain ontological knowledge on basis of hybrid models | |
Vrbić | Data mining and cloud computing | |
CN107656995A (en) | Towards the data management system of big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||