Abstract
The density peaks clustering algorithm (DPC) relies on the local density and relative distance of data points to find cluster centers. However, these attributes are computed simply from Euclidean distance, so DPC performs unsatisfactorily when the dataset's density is uneven or its dimensionality is high. In addition, the parameter \( d_{\text{c}} \) considers only the global distribution of the dataset, and a small change in \( d_{\text{c}} \) has a great influence on clustering small-scale datasets. To address these drawbacks, this paper proposes a mass-based density peaks clustering algorithm (MDPC). MDPC introduces a mass-based similarity measure to compute a new similarity matrix. It then obtains the K-nearest-neighbour information of each point from this matrix and redefines the local density accordingly. Experimental results show that MDPC is superior to DPC, performs well on datasets with uneven density and higher dimensions, and avoids the influence of \( d_{\text{c}} \) on small-scale datasets.
1 Introduction
Clustering is known as unsupervised learning because it depends neither on predefined classes nor on labelled data samples, and it is an effective technique for data mining [1]. So far, clustering has been applied to pattern recognition, image processing, genetic research, and many other fields [2].
The main idea of clustering is to group data objects into multiple clusters according to a similarity measure, so that objects in the same cluster are as similar as possible while objects in different clusters are as dissimilar as possible [3]. Different clustering targets call for different algorithms; current clustering algorithms are mainly divided into partition-based, density-based, grid-based, hierarchical, and model-based clustering [4]. These algorithms suit different types of datasets and have different advantages and disadvantages.
In 2014, Rodriguez et al. proposed the clustering by fast search and find of density peaks algorithm (DPC) [5]. DPC uses the local density and relative distance properties of the data to determine cluster centers quickly, handles datasets of arbitrary shape, and allocates sample points effectively [6]. However, it has the following limitations: (1) the similarity between data samples is computed simply from Euclidean distance, so DPC cannot obtain satisfactory clustering results when the data distribution is uneven or the dimensionality is high [7]; (2) on small-scale datasets, a small change in the parameter \( d_{\text{c}} \) noticeably affects the clustering results [8]. Many scholars have since optimized DPC. Du et al. [9] and Xie et al. [10] both introduced the K-nearest-neighbour algorithm and considered the local distribution of the dataset to redefine the local density, thereby unifying local metrics and reducing the impact of \( d_{\text{c}} \) on the clustering results. However, the \( k \) nearest neighbours are still found using Euclidean distance, which remains unsatisfactory.
To alleviate these limitations of DPC, this paper proposes a mass-based density peaks clustering algorithm (MDPC). The main innovations of MDPC are: (1) it considers the environment around the data and uses a mass-based measure in place of Euclidean distance to measure the similarity between samples, improving clustering accuracy on data with higher dimensions or uneven distribution; (2) it redefines the local density using the improved K-nearest-neighbour information of the samples, making MDPC independent of \( d_{\text{c}} \).
The remainder of this paper is organized as follows. Section 2 reviews the basic principles of the density peaks clustering algorithm and the mass-based similarity measure. Section 3 proposes the mass-based density peaks clustering algorithm and analyses its performance theoretically. Section 4 presents experiments comparing this algorithm with other clustering algorithms on different datasets. Finally, the work in this paper is summarized and directions for future research are given.
2 Related Works
2.1 Density Peaks Clustering Algorithm
The density peaks clustering algorithm (DPC) was proposed by Rodriguez and Laio to find cluster centers quickly. DPC is based on an important assumption: the local density of a cluster center is greater than that of its surrounding neighbours, and the distance between a cluster center and any point with higher local density is relatively large [11].
DPC algorithm first calculates the local density and relative distance attributes of each data point. The local density is defined as:

\( \rho_{i} = \sum\nolimits_{j \ne i} {\chi \left( {d_{ij} - d_{\text{c}} } \right)} \)  (1)

where \( \chi \left( x \right) = 1 \) if \( x < 0 \) and \( \chi \left( x \right) = 0 \) otherwise.
Here \( d_{ij} \) is the distance between data points \( x_{i} \) and \( x_{j} \), and \( d_{\text{c}} \) is the only input parameter, the cut-off distance; it is chosen so that the average number of neighbours is around 1% to 2% of the total number of points in the dataset. The relative distance \( \delta_{i} \) of data point \( x_{i} \) is the minimum distance from that point to any point with higher local density:

\( \delta_{i} = \min_{{j:\rho_{j} > \rho_{i} }} d_{ij} \)  (2)
For the densest point, we can get:

\( \delta_{i} = \max_{j} d_{ij} \)  (3)
DPC selects data points with large \( \rho_{i} \) and \( \delta_{i} \) as cluster centers. Once the cluster centers are determined, the remaining points must be allocated to the corresponding clusters. DPC first assigns each remaining point to the cluster of its nearest neighbour with equal or higher local density. Then a boundary threshold is defined for each cluster to remove noise points.
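To make Eqs. (1)–(3) concrete, the following minimal Python sketch (our illustration, not the authors' code; the cut-off kernel of Eq. (1) and the array shapes are assumptions) computes both attributes from a data matrix:

```python
import numpy as np

def dpc_rho_delta(X, dc):
    """Local density (Eq. 1) and relative distance (Eqs. 2-3) of DPC.

    X is an (n, d) data matrix and dc the cut-off distance.
    """
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Eq. (1): number of neighbours closer than dc (self excluded).
    rho = (d < dc).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = d[i, higher].min()   # Eq. (2)
        else:
            delta[i] = d[i].max()           # Eq. (3): the densest point
    return rho, delta
```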
DPC is simple and effective; it can deal with noisy outliers and find clusters of arbitrary shape [12]. However, its disadvantages are obvious. First, the calculation of local density and relative distance is based on the similarity between data points, and this similarity depends simply on Euclidean distance, so DPC cannot obtain satisfactory clustering results on complex data, especially when the data distribution is uneven or the dimensionality is high [13]. Second, the calculation of local density depends on the choice of the cut-off distance \( d_{\text{c}} \), which considers only the global distribution of the data and ignores local information; consequently, changes in \( d_{\text{c}} \) strongly influence the clustering results, especially on small-scale datasets [14].
2.2 Mass-Based Similarity Measure
Since the 1970s, psychologists have argued that the similarity between two instances cannot be characterized simply by geometric models; the judgement of similarity is influenced by the context and the neighbours of the instances [15]. Based on this observation, a more appropriate similarity measure, here called the mass-based similarity measure, can be defined [16].
The basic idea of the mass-based similarity measure is that two instances in a dense region are less similar than two instances the same geometric distance apart in a sparse region [17]. Geometric-model-based similarity depends only on geometric position; in contrast, the mass-based measure depends mainly on the data distribution, namely the probability mass of the smallest region covering the two instances [18].
Assume that \( D \) is a data sample from the probability density function \( F \), and \( H \in \mathcal{H}(D) \) is a hierarchical partitioning that divides the space into non-overlapping, non-empty regions. \( R(x,y|H;D) \) denotes the smallest region with respect to \( H \) and \( D \) covering both \( x \) and \( y \); it plays the role that the shortest distance between \( x \) and \( y \) plays in geometric models.
The mass-based similarity measure has two parameters, \( t \) and \( \psi \), which denote the number of "iTrees" and the size of the data subset used to build each "iTree"; the height of each "iTree" is at most \( h = \left\lceil {\log_{2} \psi } \right\rceil \). First, an "iForest" consisting of \( t \) "iTrees" is built as the partitioning structure. Each "iTree" is built independently from a subset \( \mathcal{D} \subset D \) with \( |\mathcal{D}| = \psi \); an axis-parallel split is applied at each internal node to divide the samples at that node into two non-empty subsets, until every point is isolated or the maximum height \( h \) is reached. After the "iForest" is built, all instances in \( D \) are passed through it and the mass of each node is recorded. The second step is to evaluate the mass: for test points \( x \) and \( y \), each "iTree" contributes the mass of its lowest node containing both \( x \) and \( y \), i.e. \( \sum\nolimits_{i} {|R(x,y|H_{i} ;D)|} \). Finally, \( m_{e} (x,y) \) is the mean of these masses:

\( m_{e} (x,y) = \frac{1}{t}\sum\nolimits_{i = 1}^{t} {\frac{{|R(x,y|H_{i} ;D)|}}{|D|}} \)  (4)
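The following simplified Python sketch illustrates this procedure under stated assumptions: uniformly random axis-parallel splits as in isolation forests, and a naive \( O(n^{2} t) \) pairwise evaluation. All function names are ours, not from the paper.

```python
import numpy as np

def build_itree(X, idx, depth, max_depth, rng):
    """One randomized axis-parallel tree over sample indices idx."""
    node = {"mass": 0}
    if depth < max_depth and len(idx) > 1:
        dim = rng.integers(X.shape[1])            # random split attribute
        lo, hi = X[idx, dim].min(), X[idx, dim].max()
        if lo < hi:
            node["dim"], node["cut"] = dim, rng.uniform(lo, hi)
            left = idx[X[idx, dim] < node["cut"]]
            right = idx[X[idx, dim] >= node["cut"]]
            node["left"] = build_itree(X, left, depth + 1, max_depth, rng)
            node["right"] = build_itree(X, right, depth + 1, max_depth, rng)
    return node

def record_mass(node, X):
    """Pass the full dataset through a tree, counting points per node."""
    node["mass"] = len(X)
    if "dim" in node:
        mask = X[:, node["dim"]] < node["cut"]
        record_mass(node["left"], X[mask])
        record_mass(node["right"], X[~mask])

def lowest_common_mass(node, x, y):
    """Mass |R(x,y|H;D)| of the deepest node containing both x and y."""
    while "dim" in node:
        sx, sy = x[node["dim"]] < node["cut"], y[node["dim"]] < node["cut"]
        if sx != sy:
            break                                  # x and y separate here
        node = node["left"] if sx else node["right"]
    return node["mass"]

def mass_dissimilarity(X, t=100, psi=256, seed=0):
    """Eq. (4): mean relative mass over t trees, for all point pairs."""
    rng = np.random.default_rng(seed)
    n = len(X)
    max_depth = int(np.ceil(np.log2(min(psi, n))))
    trees = []
    for _ in range(t):
        sub = rng.choice(n, size=min(psi, n), replace=False)
        tree = build_itree(X, sub, 0, max_depth, rng)
        record_mass(tree, X)                       # masses over the full D
        trees.append(tree)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            m[i, j] = m[j, i] = np.mean(
                [lowest_common_mass(tr, X[i], X[j]) for tr in trees]) / n
    return m
```

Note that \( m_{e} \) behaves as a data-dependent dissimilarity: pairs in dense regions are covered by heavier regions and thus receive larger values.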
3 Mass-Based Density Peaks Clustering Algorithm
This paper proposes a mass-based density peaks clustering algorithm (MDPC). MDPC retains the central idea of DPC and quickly finds cluster centers whose local density and relative distance are both large, but the similarity calculation and local density measurement are improved.
First, consider the similarity measure between samples: the mass-based similarity measure is introduced in place of Euclidean distance, and a new similarity matrix is derived from Eq. (4).
Then, based on the new similarity matrix, the \( k \) nearest neighbours of each sample are found, and a new local density is defined as:

\( \rho_{i} = \exp \left( { - \frac{1}{k}\sum\nolimits_{{x_{j} \in KNN(i)}} {m_{e} \left( {x_{i} ,x_{j} } \right)^{2} } } \right) \)  (5)
Here \( KNN(i) \) is the set of \( k \) nearest neighbours of point \( x_{i} \) under the new similarity matrix. At the same time, the relative distance of a data sample no longer depends on a geometric distance metric but uses the similarity calculated by Eq. (4):

\( \delta_{i} = \min_{{j:\rho_{j} > \rho_{i} }} m_{e} \left( {x_{i} ,x_{j} } \right) \)  (6)
The specific steps of the MDPC algorithm are described in Algorithm 1.
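Algorithm 1 itself is not reproduced here; as a substitute, the sketch below strings the pieces together as we read this section, reusing `mass_dissimilarity` from Sect. 2.2. Selecting the points with the largest \( \rho_{i} \delta_{i} \) as centers and the one-step nearest-higher-density allocation are carried over from DPC; the function name and defaults are ours.

```python
import numpy as np

def mdpc(X, n_clusters, k=5, t=100, psi=256):
    """Sketch of the MDPC pipeline (assumed reading of Algorithm 1)."""
    n = len(X)
    me = mass_dissimilarity(X, t=t, psi=psi)       # Eq. (4)
    # Eq. (5): local density from the k nearest neighbours under m_e.
    knn = np.argsort(me, axis=1)[:, 1:k + 1]       # column 0 is the point itself
    rho = np.exp(-(np.take_along_axis(me, knn, axis=1) ** 2).mean(axis=1))
    # Eq. (6): dissimilarity to the nearest point of higher density.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = me[i, higher].min() if higher.size else me[i].max()
    # Centers: the n_clusters points with the largest gamma = rho * delta.
    centers = np.argsort(rho * delta)[-n_clusters:]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # One-step allocation as in DPC: each remaining point joins the cluster
    # of its most similar neighbour with higher density. Assumes the densest
    # point is among the centers, which holds when its gamma is largest.
    order = np.argsort(-rho)                       # decreasing density
    for pos, i in enumerate(order):
        if labels[i] == -1 and pos > 0:
            higher = order[:pos]
            labels[i] = labels[higher[np.argmin(me[i, higher])]]
    return labels
```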
MDPC retains the main idea of DPC and quickly finds density peaks as cluster centers. However, its local density and relative distance are measured with the mass-based similarity measure instead of simple Euclidean distance, which makes MDPC more effective on high-dimensional datasets and datasets with uneven density. In addition, the mass-based similarity between samples is used to define a new local density based on improved K-nearest-neighbour information. Compared with DPC, MDPC avoids excessive dependence on \( d_{\text{c}} \), and its local density metric suits datasets of any size.
4 Experiments
4.1 Experimental Preparation
In order to evaluate the performance of the MDPC algorithm, experiments were conducted on synthetic and real-world datasets. The clustering accuracy \( Acc \) was used to measure the clustering results; the higher the value of \( Acc \), the better the clustering performance. Let \( y_{i} \) and \( z_{i} \) be the intrinsic class label and the clustering result label of the \( i \)-th sample, respectively, and let \( map\left( \cdot \right) \) be the optimal mapping from cluster labels to class labels, computed with the Hungarian algorithm. \( Acc \) is calculated as follows:

\( Acc = \frac{{\sum\nolimits_{i = 1}^{n} {\delta \left( {y_{i} ,map\left( {z_{i} } \right)} \right)} }}{n} \)  (7)

where \( \delta \left( {a,b} \right) = 1 \) if \( a = b \) and 0 otherwise, and \( n \) is the total number of samples.
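A standard implementation of this metric (not specific to this paper) uses scipy's Hungarian solver to find the optimal label mapping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Acc of Eq. (7): optimal label mapping via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: w[i, j] counts samples in cluster i with class j.
    w = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-w)   # maximize matched samples
    return w[row, col].sum() / len(y_true)
```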
In addition to DPC, we compared MDPC with the optimized algorithm DPC-KNN. The parameter \( d_{\text{c}} \) in DPC was taken from the interval [0.2%, 6%], and \( k \) in DPC-KNN from 5 to 7. In MDPC, \( t \) and \( \psi \) take the default values 100 and 256, and \( k \) is likewise taken from 5 to 7.
4.2 Results and Evaluation
Synthetic Datasets
This section tests MDPC on the synthetic dataset D, a typical dataset containing three clusters with uneven density; D has 97 samples and two attributes.
The results on this two-dimensional dataset are shown visually, with one colour per cluster. MDPC and DPC were each run on dataset D; the results are shown in Fig. 1.
From Fig. 1, it can be seen that MDPC handles datasets with uneven density very well. On dataset D, MDPC clusters the data into the correct 3 classes, while DPC does not, because DPC simply uses the geometric distance between points to measure similarity and to compute the local density and relative distance. Therefore, on datasets with uneven distribution, DPC does not recognize all clusters well.
On datasets with a relatively uniform distribution, both MDPC and DPC can obtain satisfactory results with suitable parameters. However, DPC must choose an appropriate \( d_{\text{c}} \), and changes in \( d_{\text{c}} \) strongly influence its clustering results. MDPC no longer needs \( d_{\text{c}} \). Although it still has one parameter, \( k \), MDPC selects cluster centers by the same criterion as DPC, and the local density of a cluster center is necessarily high, so small changes in \( k \) have little effect on the clustering results.
In summary, MDPC introduces the mass-based similarity measure and considers the distribution environment of the data, so it handles uneven density more effectively than DPC. In addition, by building on the K-nearest-neighbours algorithm, MDPC removes the influence of \( d_{\text{c}} \) on the clustering results; it adds an optional parameter \( k \), but changes in \( k \) have little effect on the results.
Real-World Datasets
This section evaluates MDPC on 4 real-world datasets, whose characteristics are shown in Table 1. Because changes in \( d_{\text{c}} \) have a greater impact on DPC's results on small-scale datasets, and DPC's results on higher-dimensional datasets are unsatisfactory, this experiment selected classic small-scale datasets, including higher-dimensional ones.
In this experiment, MDPC, DPC, and DPC-KNN were each run on the above 4 datasets. The clustering results are shown in Table 2 together with the corresponding optimal parameters. The best result for each dataset is shown in bold; for MDPC, the average over 20 runs is reported.
It can be seen from Table 2 that the overall clustering performance of MDPC is better than that of DPC and DPC-KNN. DPC is unsatisfactory on higher-dimensional datasets and requires an appropriate choice of \( d_{\text{c}} \). Although DPC-KNN avoids choosing \( d_{\text{c}} \), its results are not ideal compared with MDPC. MDPC replaces geometric distance with mass-based similarity and considers the environment of the data distribution, so it works well on datasets with higher dimensionality. In addition, MDPC measures local density via the \( k \) nearest neighbours, which avoids the selection of \( d_{\text{c}} \); the added parameter \( k \) has little effect on the clustering results because cluster centers lie in densely populated areas. Thus, MDPC is superior to DPC and DPC-KNN.
5 Conclusions
This paper proposes an optimized density peaks clustering algorithm based on a novel mass-based similarity measure. The mass-based measure is first used to calculate the similarity between data samples, and the resulting similarities are used to obtain the K-nearest-neighbour information of each sample. A new local density is then defined from this K-nearest-neighbour information, which avoids the influence of parameter selection and improves DPC on higher-dimensional and unevenly dense datasets. In addition, MDPC maintains the main steps of DPC for selecting cluster centers, so the algorithm is robust to the choice of the added parameter. MDPC outperforms the traditional DPC algorithm and the optimized DPC-KNN algorithm.
In future work, how to allocate the non-center points of MDPC with a strategy other than the one-step allocation, and how to treat noise points effectively, require further exploration.
References
Morris, K., Mcnicholas, P.: Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures. Comput. Stat. Data Anal. 97, 133–150 (2016)
Ivannikova, E., Park, H., Hämäläinen, T., et al.: Revealing community structures by ensemble clustering using group diffusion. Inf. Fusion 42, 24–36 (2018)
Slimen, Y., Allio, S., Jacques, J.: Model-based co-clustering for functional data. Neurocomputing 291, 97–108 (2018)
Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Rodríguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014)
Xu, X., Ding, S., Du, M., et al.: DPCG: an efficient density peaks clustering algorithm based on grid. Int. J. Mach. Learn. Cybern. 9, 743–754 (2018)
Ding, S., Du, M., Sun, T., et al.: An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl. Based Syst. 133, 294–313 (2017)
Liu, R., Wang, H., Yu, X.: Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 450, 200–226 (2018)
Du, M., Ding, S., Jia, H.: Study on density peaks clustering based on K-nearest neighbors and principal component analysis. Knowl. Based Syst. 99, 135–145 (2016)
Xie, J., Gao, H., Xie, W., et al.: Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Shi, Y., Chen, Z., Qi, Z., et al.: A novel clustering-based image segmentation via density peaks algorithm with mid-level feature. Neural Comput. Appl. 28, 29–39 (2017)
Bai, L., Cheng, X., Liang, J., et al.: Fast density clustering strategies based on the k-means algorithm. Pattern Recogn. 71, 375–386 (2017)
Wang, M., Min, F., Zhang, Z., et al.: Active learning through density clustering. Expert Syst. Appl. 85, 305–317 (2017)
Zhou, L., Pei, C.: Delta-distance based clustering with a divide-and-conquer strategy: 3DC clustering. Pattern Recogn. Lett. 73, 52–59 (2016)
Krumhansl, C.: Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol. Rev. 85, 445–463 (1978)
Ting, K.M., Zhu, Y., Carman, M., et al.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), San Francisco, CA, USA, pp. 1205–1214 (2016)
Aryal, S., Ting, K.M., Haffari, G., et al.: Mp-dissimilarity: a data dependent dissimilarity measure. In: 2014 IEEE International Conference on Data Mining (ICDM), Shenzhen, China, pp. 707–712 (2014)
Chen, B., Ting, K., Washio, T., et al.: Half-space mass: a maximally robust and efficient data depth method. Mach. Learn. 100, 677–699 (2015)
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant no. 61672522 and no. 61379101.