Abstract
The density peaks clustering algorithm (DPC) relies on the local density and relative distance of data points to find cluster centers. However, these attributes are computed simply from Euclidean distance, so DPC performs unsatisfactorily when the dataset's density is uneven or its dimensionality is high. In addition, the parameter \( d_{\text{c}} \) considers only the global distribution of the dataset, and a small change in \( d_{\text{c}} \) has a great influence on clustering small-scale datasets. To address these drawbacks, this paper proposes a mass-based density peaks clustering algorithm (MDPC). MDPC introduces a mass-based similarity measure to compute a new similarity matrix. It then obtains the K-nearest-neighbour information of each point from this matrix and redefines the local density accordingly. Experimental results show that MDPC is superior to DPC, performs well on datasets with uneven density and higher dimensions, and avoids the influence of \( d_{\text{c}} \) on small-scale datasets.
1 Introduction
Clustering is known as unsupervised learning because it depends neither on predefined classes nor on labelled data samples, and it is an effective technique for data mining [1]. So far, clustering has been applied to pattern recognition, image processing, genetic research, and many other fields [2].
The main idea of clustering is to group data objects into multiple clusters according to a similarity measure, so that objects in the same cluster are as similar as possible while objects in different clusters are as dissimilar as possible [3]. Different clustering targets call for different algorithms; current clustering algorithms are mainly divided into partition-based, density-based, grid-based, hierarchical, and model-based clustering [4]. These algorithms suit different types of datasets and have different advantages and disadvantages.
In 2014, Rodriguez et al. proposed the clustering by fast search and find of density peaks algorithm (DPC) [5]. DPC uses the local density and relative distance properties of the data to determine cluster centers quickly, handles datasets of arbitrary shape, and allocates sample points effectively [6]. However, it has the following limitations: (1) the similarity between data samples is computed simply from Euclidean distance, so DPC cannot obtain satisfactory clustering results when the data distribution is uneven or the dimensionality is high [7]; (2) on small-scale datasets, a small change in the parameter \( d_{\text{c}} \) noticeably affects the clustering results [8]. Many scholars have since optimized DPC. Du et al. [9] and Xie et al. [10] both introduced the K-nearest-neighbour algorithm and considered the local distribution of the dataset to redefine the local density, thereby unifying local metrics and reducing the impact of \( d_{\text{c}} \) on the clustering results. However, the \( k \) nearest neighbours are still found using Euclidean distance, which remains unsatisfactory.
To alleviate these limitations of DPC, this paper proposes a mass-based density peaks clustering algorithm (MDPC). The main innovations of MDPC are: (1) it considers the environment around the data and uses a mass-based measure in place of Euclidean distance to measure the similarity between samples, improving clustering accuracy on data with higher dimensions or uneven distribution; (2) it redefines the local density using the improved K-nearest-neighbour information of the samples, making MDPC independent of \( d_{\text{c}} \).
The remainder of this paper is organized as follows. Section 2 reviews the basic principles of the density peaks clustering algorithm and the mass-based similarity measure. Section 3 proposes the mass-based density peaks clustering algorithm and analyses its performance theoretically. Section 4 presents experiments comparing this algorithm with other clustering algorithms on different datasets. Finally, the work in this paper is summarized and directions for future research are given.
2 Related Works
2.1 Density Peaks Clustering Algorithm
The density peaks clustering algorithm (DPC) was proposed by Rodriguez and Laio to find cluster centers quickly. DPC is based on an important assumption: the local density of a cluster center is greater than that of its surrounding neighbours, and the distance between a cluster center and any point with higher local density is relatively large [11].
DPC algorithm first calculates the local density and relative distance attributes of each data point. The local density is defined as:

\( \rho_{i} = \sum\nolimits_{j \ne i} {\chi \left( {d_{ij} - d_{\text{c}} } \right)} \)  (1)

where \( \chi \left( x \right) = 1 \) if \( x < 0 \) and \( \chi \left( x \right) = 0 \) otherwise.
Here \( d_{ij} \) is the distance between data points \( x_{i} \) and \( x_{j} \), and \( d_{\text{c}} \) is the only input parameter, the cut-off distance; it is chosen so that the average number of neighbours is around 1% to 2% of the total number of points in the dataset. The relative distance \( \delta_{i} \) of data point \( x_{i} \) is the minimum distance from that point to any point with higher local density:

\( \delta_{i} = \min_{{j:\rho_{j} > \rho_{i} }} d_{ij} \)  (2)
For the densest point, we can get:

\( \delta_{i} = \max_{j} d_{ij} \)  (3)
DPC selects data points with large \( \rho_{i} \) and \( \delta_{i} \) as cluster centers. Once the cluster centers are determined, the remaining points must be allocated to the corresponding clusters. DPC first assigns each remaining point to the cluster of its nearest neighbour with equal or higher local density. Then a boundary threshold is defined for each cluster to remove noise points.
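To make Eqs. (1)–(3) concrete, the following minimal Python sketch (our illustration, not the authors' code; the cut-off kernel of Eq. (1) and the array shapes are assumptions) computes both attributes from a data matrix:

```python
import numpy as np

def dpc_rho_delta(X, dc):
    """Local density (Eq. 1) and relative distance (Eqs. 2-3) of DPC.

    X is an (n, d) data matrix and dc the cut-off distance.
    """
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Eq. (1): number of neighbours closer than dc (self excluded).
    rho = (d < dc).sum(axis=1) - 1
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size:
            delta[i] = d[i, higher].min()   # Eq. (2)
        else:
            delta[i] = d[i].max()           # Eq. (3): the densest point
    return rho, delta
```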
DPC is simple and effective; it can deal with noisy outliers and find clusters of arbitrary shape [12]. However, its disadvantages are obvious. First, the calculation of local density and relative distance is based on the similarity between data points, and this similarity depends simply on Euclidean distance, so DPC cannot obtain satisfactory clustering results on complex data, especially when the data distribution is uneven or the dimensionality is high [13]. Second, the calculation of local density depends on the choice of the cut-off distance \( d_{\text{c}} \), which considers only the global distribution of the data and ignores local information; consequently, changes in \( d_{\text{c}} \) strongly influence the clustering results, especially on small-scale datasets [14].
2.2 Mass-Based Similarity Measure
Since the 1970s, psychologists have argued that the similarity between two instances cannot be characterized simply by geometric models; the judgement of similarity is influenced by the context and the neighbours of the instances [15]. Based on this observation, a more appropriate similarity measure, here called the mass-based similarity measure, can be defined [16].
The basic idea of the mass-based similarity measure is that two instances in a dense region are less similar than two instances the same geometric distance apart in a sparse region [17]. Geometric-model-based similarity depends only on geometric position; in contrast, the mass-based measure depends mainly on the data distribution, namely the probability mass of the smallest region covering the two instances [18].
Assume that \( D \) is a data sample from the probability density function \( F \), and \( H \in \mathcal{H}(D) \) is a hierarchical partitioning that divides the space into non-overlapping, non-empty regions. \( R(x,y|H;D) \) denotes the smallest region with respect to \( H \) and \( D \) covering both \( x \) and \( y \); it plays the role that the shortest distance between \( x \) and \( y \) plays in geometric models.
The mass-based similarity measure has two parameters, \( t \) and \( \psi \), which denote the number of "iTrees" and the size of the data subset used to build each "iTree"; the height of each "iTree" is at most \( h = \left\lceil {\log_{2} \psi } \right\rceil \). First, an "iForest" consisting of \( t \) "iTrees" is built as the partitioning structure. Each "iTree" is built independently from a subset \( \mathcal{D} \subset D \) with \( |\mathcal{D}| = \psi \); an axis-parallel split is applied at each internal node to divide the samples at that node into two non-empty subsets, until every point is isolated or the maximum height \( h \) is reached. After the "iForest" is built, all instances in \( D \) are passed through it and the mass of each node is recorded. The second step is to evaluate the mass: for test points \( x \) and \( y \), each "iTree" contributes the mass of its lowest node containing both \( x \) and \( y \), i.e. \( \sum\nolimits_{i} {|R(x,y|H_{i} ;D)|} \). Finally, \( m_{e} (x,y) \) is the mean of these masses:

\( m_{e} (x,y) = \frac{1}{t}\sum\nolimits_{i = 1}^{t} {\frac{{|R(x,y|H_{i} ;D)|}}{|D|}} \)  (4)
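The following simplified Python sketch illustrates this procedure under stated assumptions: uniformly random axis-parallel splits as in isolation forests, and a naive \( O(n^{2} t) \) pairwise evaluation. All function names are ours, not from the paper.

```python
import numpy as np

def build_itree(X, idx, depth, max_depth, rng):
    """One randomized axis-parallel tree over sample indices idx."""
    node = {"mass": 0}
    if depth < max_depth and len(idx) > 1:
        dim = rng.integers(X.shape[1])            # random split attribute
        lo, hi = X[idx, dim].min(), X[idx, dim].max()
        if lo < hi:
            node["dim"], node["cut"] = dim, rng.uniform(lo, hi)
            left = idx[X[idx, dim] < node["cut"]]
            right = idx[X[idx, dim] >= node["cut"]]
            node["left"] = build_itree(X, left, depth + 1, max_depth, rng)
            node["right"] = build_itree(X, right, depth + 1, max_depth, rng)
    return node

def record_mass(node, X):
    """Pass the full dataset through a tree, counting points per node."""
    node["mass"] = len(X)
    if "dim" in node:
        mask = X[:, node["dim"]] < node["cut"]
        record_mass(node["left"], X[mask])
        record_mass(node["right"], X[~mask])

def lowest_common_mass(node, x, y):
    """Mass |R(x,y|H;D)| of the deepest node containing both x and y."""
    while "dim" in node:
        sx, sy = x[node["dim"]] < node["cut"], y[node["dim"]] < node["cut"]
        if sx != sy:
            break                                  # x and y separate here
        node = node["left"] if sx else node["right"]
    return node["mass"]

def mass_dissimilarity(X, t=100, psi=256, seed=0):
    """Eq. (4): mean relative mass over t trees, for all point pairs."""
    rng = np.random.default_rng(seed)
    n = len(X)
    max_depth = int(np.ceil(np.log2(min(psi, n))))
    trees = []
    for _ in range(t):
        sub = rng.choice(n, size=min(psi, n), replace=False)
        tree = build_itree(X, sub, 0, max_depth, rng)
        record_mass(tree, X)                       # masses over the full D
        trees.append(tree)
    m = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            m[i, j] = m[j, i] = np.mean(
                [lowest_common_mass(tr, X[i], X[j]) for tr in trees]) / n
    return m
```

Note that \( m_{e} \) behaves as a data-dependent dissimilarity: pairs in dense regions are covered by heavier regions and thus receive larger values.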
3 Mass-Based Density Peaks Clustering Algorithm
This paper proposes a mass-based density peaks clustering algorithm (MDPC). MDPC retains the central idea of DPC and quickly finds cluster centers whose local density and relative distance are both large, but the similarity calculation and local density measurement are improved.
First, consider the similarity measure between samples: the mass-based similarity measure is introduced in place of Euclidean distance, and a new similarity matrix is derived from Eq. (4).
Then, based on the new similarity matrix, the \( k \) nearest neighbours of each sample are found, and a new local density is defined as:

\( \rho_{i} = \exp \left( { - \frac{1}{k}\sum\nolimits_{{x_{j} \in KNN(i)}} {m_{e} \left( {x_{i} ,x_{j} } \right)^{2} } } \right) \)  (5)
Here \( KNN(i) \) is the set of \( k \) nearest neighbours of point \( x_{i} \) under the new similarity matrix. At the same time, the relative distance of a data sample no longer depends on a geometric distance metric but uses the similarity calculated by Eq. (4):

\( \delta_{i} = \min_{{j:\rho_{j} > \rho_{i} }} m_{e} \left( {x_{i} ,x_{j} } \right) \)  (6)
The specific steps of the MDPC algorithm are described in Algorithm 1.
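Algorithm 1 itself is not reproduced here; as a substitute, the sketch below strings the pieces together as we read this section, reusing `mass_dissimilarity` from Sect. 2.2. Selecting the points with the largest \( \rho_{i} \delta_{i} \) as centers and the one-step nearest-higher-density allocation are carried over from DPC; the function name and defaults are ours.

```python
import numpy as np

def mdpc(X, n_clusters, k=5, t=100, psi=256):
    """Sketch of the MDPC pipeline (assumed reading of Algorithm 1)."""
    n = len(X)
    me = mass_dissimilarity(X, t=t, psi=psi)       # Eq. (4)
    # Eq. (5): local density from the k nearest neighbours under m_e.
    knn = np.argsort(me, axis=1)[:, 1:k + 1]       # column 0 is the point itself
    rho = np.exp(-(np.take_along_axis(me, knn, axis=1) ** 2).mean(axis=1))
    # Eq. (6): dissimilarity to the nearest point of higher density.
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        delta[i] = me[i, higher].min() if higher.size else me[i].max()
    # Centers: the n_clusters points with the largest gamma = rho * delta.
    centers = np.argsort(rho * delta)[-n_clusters:]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # One-step allocation as in DPC: each remaining point joins the cluster
    # of its most similar neighbour with higher density. Assumes the densest
    # point is among the centers, which holds when its gamma is largest.
    order = np.argsort(-rho)                       # decreasing density
    for pos, i in enumerate(order):
        if labels[i] == -1 and pos > 0:
            higher = order[:pos]
            labels[i] = labels[higher[np.argmin(me[i, higher])]]
    return labels
```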
MDPC retains the main idea of DPC and quickly finds density peaks as cluster centers. However, its local density and relative distance are measured with the mass-based similarity measure instead of simple Euclidean distance, which makes MDPC more effective on high-dimensional datasets and datasets with uneven density. In addition, the mass-based similarity between samples is used to define a new local density based on improved K-nearest-neighbour information. Compared with DPC, MDPC avoids excessive dependence on \( d_{\text{c}} \), and its local density metric suits datasets of any size.
4 Experiments
4.1 Experimental Preparation
In order to evaluate the performance of the MDPC algorithm, experiments were conducted on synthetic and real-world datasets. The clustering accuracy \( Acc \) was used to measure the clustering results; the higher the value of \( Acc \), the better the clustering performance. Let \( y_{i} \) and \( z_{i} \) be the intrinsic class label and the clustering result label of the \( i \)-th sample, respectively, and let \( map\left( \cdot \right) \) be the optimal mapping from cluster labels to class labels, computed with the Hungarian algorithm. \( Acc \) is calculated as follows:

\( Acc = \frac{{\sum\nolimits_{i = 1}^{n} {\delta \left( {y_{i} ,map\left( {z_{i} } \right)} \right)} }}{n} \)  (7)

where \( \delta \left( {a,b} \right) = 1 \) if \( a = b \) and 0 otherwise, and \( n \) is the total number of samples.
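A standard implementation of this metric (not specific to this paper) uses scipy's Hungarian solver to find the optimal label mapping:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Acc of Eq. (7): optimal label mapping via the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency table: w[i, j] counts samples in cluster i with class j.
    w = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            w[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-w)   # maximize matched samples
    return w[row, col].sum() / len(y_true)
```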
In addition to DPC, we compared MDPC with the optimized algorithm DPC-KNN. The parameter \( d_{\text{c}} \) in DPC was taken from the interval [0.2%, 6%], and \( k \) in DPC-KNN from 5 to 7. In MDPC, \( t \) and \( \psi \) take the default values 100 and 256, and \( k \) is likewise taken from 5 to 7.
4.2 Results and Evaluation
Synthetic Datasets
This section tests MDPC on the synthetic dataset D, a typical dataset containing three clusters with uneven density; D has 97 samples and two attributes.
The results on this two-dimensional dataset are shown visually, with one colour per cluster. MDPC and DPC were each run on dataset D; the results are shown in Fig. 1.
From Fig. 1, it can be seen that MDPC handles datasets with uneven density very well. On dataset D, MDPC clusters the data into the correct 3 classes, while DPC does not, because DPC simply uses the geometric distance between points to measure similarity and to compute the local density and relative distance. Therefore, on datasets with uneven distribution, DPC does not recognize all clusters well.
On datasets with a relatively uniform distribution, both MDPC and DPC can obtain satisfactory results with suitable parameters. However, DPC must choose an appropriate \( d_{\text{c}} \), and changes in \( d_{\text{c}} \) strongly influence its clustering results. MDPC no longer needs \( d_{\text{c}} \). Although it still has one parameter, \( k \), MDPC selects cluster centers by the same criterion as DPC, and the local density of a cluster center is necessarily high, so small changes in \( k \) have little effect on the clustering results.
In summary, MDPC introduces the mass-based similarity measure and considers the distribution environment of the data, so it handles uneven density more effectively than DPC. In addition, by building on the K-nearest-neighbours algorithm, MDPC removes the influence of \( d_{\text{c}} \) on the clustering results; it adds an optional parameter \( k \), but changes in \( k \) have little effect on the results.
Real-World Datasets
This section evaluates MDPC on 4 real-world datasets, whose characteristics are shown in Table 1. Because changes in \( d_{\text{c}} \) have a greater impact on DPC's results on small-scale datasets, and DPC's results on higher-dimensional datasets are unsatisfactory, this experiment selected classic small-scale datasets, including higher-dimensional ones.
In this experiment, MDPC, DPC, and DPC-KNN were each run on the above 4 datasets. The clustering results are shown in Table 2 together with the corresponding optimal parameters. The best result for each dataset is shown in bold; for MDPC, the average over 20 runs is reported.
It can be seen from Table 2 that the overall clustering performance of MDPC is better than that of DPC and DPC-KNN. DPC is unsatisfactory on higher-dimensional datasets and requires an appropriate choice of \( d_{\text{c}} \). Although DPC-KNN avoids choosing \( d_{\text{c}} \), its results are not ideal compared with MDPC. MDPC replaces geometric distance with mass-based similarity and considers the environment of the data distribution, so it works well on datasets with higher dimensionality. In addition, MDPC measures local density via the \( k \) nearest neighbours, which avoids the selection of \( d_{\text{c}} \); the added parameter \( k \) has little effect on the clustering results because cluster centers lie in densely populated areas. Thus, MDPC is superior to DPC and DPC-KNN.
5 Conclusions
This paper proposes an optimized density peaks clustering algorithm based on a novel mass-based similarity measure. The mass-based measure is first used to calculate the similarity between data samples, and the resulting similarities are used to obtain the K-nearest-neighbour information of each sample. A new local density is then defined from this K-nearest-neighbour information, which avoids the influence of parameter selection and improves DPC on higher-dimensional and unevenly dense datasets. In addition, MDPC maintains the main steps of DPC for selecting cluster centers, so the algorithm is robust to the choice of the added parameter. MDPC outperforms the traditional DPC algorithm and the optimized DPC-KNN algorithm.
In future work, how to allocate the non-center points of MDPC with a strategy other than the one-step allocation, and how to treat noise points effectively, require further exploration.
References
Morris, K., Mcnicholas, P.: Clustering, classification, discriminant analysis, and dimension reduction via generalized hyperbolic mixtures. Comput. Stat. Data Anal. 97, 133–150 (2016)
Ivannikova, E., Park, H., Hämäläinen, T., et al.: Revealing community structures by ensemble clustering using group diffusion. Inf. Fusion 42, 24–36 (2018)
Slimen, Y., Allio, S., Jacques, J.: Model-based co-clustering for functional data. Neurocomputing 291, 97–108 (2018)
Fraley, C., Raftery, A.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97, 611–631 (2002)
Rodríguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344, 1492–1496 (2014)
Xu, X., Ding, S., Du, M., et al.: DPCG: an efficient density peaks clustering algorithm based on grid. Int. J. Mach. Learn. Cybern. 9, 743–754 (2018)
Ding, S., Du, M., Sun, T., et al.: An entropy-based density peaks clustering algorithm for mixed type data employing fuzzy neighborhood. Knowl. Based Syst. 133, 294–313 (2017)
Liu, R., Wang, H., Yu, X.: Shared-nearest-neighbor-based clustering by fast search and find of density peaks. Inf. Sci. 450, 200–226 (2018)
Du, M., Ding, S., Jia, H.: Study on density peaks clustering based on K-nearest neighbors and principal component analysis. Knowl. Based Syst. 99, 135–145 (2016)
Xie, J., Gao, H., Xie, W., et al.: Robust clustering by detecting density peaks and assigning points based on fuzzy weighted K-nearest neighbors. Inf. Sci. 354, 19–40 (2016)
Shi, Y., Chen, Z., Qi, Z., et al.: A novel clustering-based image segmentation via density peaks algorithm with mid-level feature. Neural Comput. Appl. 28, 29–39 (2017)
Bai, L., Cheng, X., Liang, J., et al.: Fast density clustering strategies based on the k-means algorithm. Pattern Recogn. 71, 375–386 (2017)
Wang, M., Min, F., Zhang, Z., et al.: Active learning through density clustering. Expert Syst. Appl. 85, 305–317 (2017)
Zhou, L., Pei, C.: Delta-distance based clustering with a divide-and-conquer strategy: 3DC clustering. Pattern Recogn. Lett. 73, 52–59 (2016)
Krumhansl, C.: Concerning the applicability of geometric models to similarity data: the interrelationship between similarity and spatial density. Psychol. Rev. 85, 445–463 (1978)
Ting, K.M., Zhu, Y., Carman, M., et al.: Overcoming key weaknesses of distance-based neighbourhood methods using a data dependent dissimilarity measure. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), San Francisco, CA, USA, pp. 1205–1214 (2016)
Aryal, S., Ting, K.M., Haffari, G., et al.: Mp-dissimilarity: a data dependent dissimilarity measure. In: 2014 IEEE International Conference on Data Mining (ICDM), Shenzhen, China, pp. 707–712 (2014)
Chen, B., Ting, K., Washio, T., et al.: Half-space mass: a maximally robust and efficient data depth method. Mach. Learn. 100, 677–699 (2015)
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant no. 61672522 and no. 61379101.