
UP-DPC: Ultra-scalable parallel density peak clustering

Published: 17 April 2024

Abstract

Density Peak Clustering (DPC) is a highly effective density-based clustering algorithm, but its scalability is limited by the expensive Density Peak Estimation (DPE) step. To address this challenge, we propose UP-DPC: Ultra-Scalable Parallel Density Peak Clustering, a novel framework that employs approximate DPE and performs DPC on LDP-wise graphs. This approach enables UP-DPC to handle datasets of arbitrary scale without relying on spatial indexing for acceleration. Furthermore, we introduce a five-layer computational architecture and leverage parallel computation techniques to further enhance the speed and efficiency of UP-DPC.
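For context, the baseline DPC procedure that UP-DPC accelerates assigns each point a local density ρ and a distance δ to its nearest higher-density neighbor, selects points with large γ = ρ·δ as cluster centers, and propagates labels down the density ordering. The following is a minimal, naive O(n²) NumPy sketch of that baseline (Rodriguez & Laio style), not the paper's UP-DPC algorithm; the Gaussian-kernel density, the cutoff parameter `dc`, and peak selection by γ are common conventions assumed here.

```python
import numpy as np

def dpc(X, dc=1.0, n_clusters=2):
    """Naive baseline Density Peak Clustering (O(n^2) sketch)."""
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density: Gaussian kernel with cutoff dc (minus self-contribution).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)                 # indices by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    # The globally densest point gets the maximum distance by convention,
    # so it is effectively always selected as a peak.
    delta[order[0]] = d[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]                # all points denser than i
        j = higher[np.argmin(d[i, higher])]  # nearest higher-density point
        delta[i] = d[i, j]
        nearest_higher[i] = j
    # Density peaks: the n_clusters points with the largest gamma = rho * delta.
    peaks = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    # Assign remaining points, in decreasing density order, to the cluster
    # of their nearest higher-density neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels
```

UP-DPC replaces the quadratic DPE step above with an approximate, parallel variant on LDP-wise graphs, which is what removes the scalability bottleneck.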
To evaluate the scalability and effectiveness of UP-DPC, we conduct extensive experiments on 14 datasets, including large- and web-scale datasets, and compare UP-DPC with 21 algorithms. Notably, on the MNIST8M dataset of 8,000k data objects, UP-DPC achieves an NMI (Normalized Mutual Information) value of 0.6464 in just 35.41 seconds, while the state-of-the-art GPU-based method achieves an NMI of only 0.045 in 56.96 seconds. These results demonstrate the superior scalability and effectiveness of UP-DPC on large- and web-scale datasets. The proposed framework offers significant improvements over existing methods and shows promise as a solution for density-based clustering tasks.
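The NMI values quoted above compare a predicted clustering against ground-truth labels. As a small illustration, here is a self-contained NMI computation using geometric-mean normalization (the abstract does not specify which normalization the paper uses, so that choice is an assumption of this sketch):

```python
import numpy as np

def nmi(a, b):
    """Normalized Mutual Information between two labelings (geometric-mean form)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Re-index labels to 0..k-1 and build the contingency table.
    ca = np.unique(a, return_inverse=True)[1]
    cb = np.unique(b, return_inverse=True)[1]
    C = np.zeros((ca.max() + 1, cb.max() + 1))
    for i, j in zip(ca, cb):
        C[i, j] += 1
    P = C / n                      # joint distribution
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = (P[nz] * np.log(P[nz] / (pa[:, None] * pb[None, :])[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()  # entropy of labeling a
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()  # entropy of labeling b
    return mi / np.sqrt(ha * hb)
```

NMI is invariant to label permutation, so a clustering that matches the ground truth up to renamed cluster IDs scores 1.0, while an uninformative clustering scores near 0.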




Published In

Information Sciences: an International Journal, Volume 660, Issue C, March 2024, 599 pages

Publisher

Elsevier Science Inc.

United States


Author Tags

  1. Clustering
  2. Density peak estimation
  3. Large-scale
  4. Scalability
  5. Parallel computation

Qualifiers

  • Research-article
