
UP-DPC: Ultra-scalable parallel density peak clustering

Published: 17 April 2024

Abstract

Density Peak Clustering (DPC) is a highly effective density-based clustering algorithm, but its scalability is limited by the expensive Density Peak Estimation (DPE) step. To address this challenge, we propose UP-DPC: Ultra-Scalable Parallel Density Peak Clustering, a novel framework that employs approximate DPE and performs DPC on LDP-wise graphs. This approach enables UP-DPC to handle datasets of arbitrary scale without relying on spatial indexing for acceleration. Furthermore, we introduce a five-layer computational architecture and leverage parallel computation techniques to further enhance the speed and efficiency of UP-DPC.
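For context, the baseline DPC procedure that UP-DPC accelerates assigns each point a local density ρ and a distance δ to its nearest higher-density neighbor, selects points with large γ = ρ·δ as cluster centers, and propagates labels down the density ordering. The following is a minimal, naive O(n²) NumPy sketch of that baseline (Rodriguez & Laio style), not the paper's UP-DPC algorithm; the Gaussian-kernel density, the cutoff parameter `dc`, and peak selection by γ are common conventions assumed here.

```python
import numpy as np

def dpc(X, dc=1.0, n_clusters=2):
    """Naive baseline Density Peak Clustering (O(n^2) sketch)."""
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density: Gaussian kernel with cutoff dc (minus self-contribution).
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)                 # indices by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    # The globally densest point gets the maximum distance by convention,
    # so it is effectively always selected as a peak.
    delta[order[0]] = d[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]                # all points denser than i
        j = higher[np.argmin(d[i, higher])]  # nearest higher-density point
        delta[i] = d[i, j]
        nearest_higher[i] = j
    # Density peaks: the n_clusters points with the largest gamma = rho * delta.
    peaks = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[peaks] = np.arange(n_clusters)
    # Assign remaining points, in decreasing density order, to the cluster
    # of their nearest higher-density neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels
```

UP-DPC replaces the quadratic DPE step above with an approximate, parallel variant on LDP-wise graphs, which is what removes the scalability bottleneck.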
To evaluate the scalability and effectiveness of UP-DPC, we conduct extensive experiments on 14 datasets, including large- and web-scale datasets, and compare UP-DPC with 21 algorithms. Notably, on the MNIST8M dataset of 8,000k data objects, UP-DPC achieves an NMI (Normalized Mutual Information) value of 0.6464 in just 35.41 seconds, while the state-of-the-art GPU-based method achieves an NMI of only 0.045 in 56.96 seconds. These results demonstrate the superior scalability and effectiveness of UP-DPC on large- and web-scale datasets. The proposed framework offers significant improvements over existing methods and shows promise as a solution for density-based clustering tasks.
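The NMI values quoted above compare a predicted clustering against ground-truth labels. As a small illustration, here is a self-contained NMI computation using geometric-mean normalization (the abstract does not specify which normalization the paper uses, so that choice is an assumption of this sketch):

```python
import numpy as np

def nmi(a, b):
    """Normalized Mutual Information between two labelings (geometric-mean form)."""
    a, b = np.asarray(a), np.asarray(b)
    n = len(a)
    # Re-index labels to 0..k-1 and build the contingency table.
    ca = np.unique(a, return_inverse=True)[1]
    cb = np.unique(b, return_inverse=True)[1]
    C = np.zeros((ca.max() + 1, cb.max() + 1))
    for i, j in zip(ca, cb):
        C[i, j] += 1
    P = C / n                      # joint distribution
    pa, pb = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = (P[nz] * np.log(P[nz] / (pa[:, None] * pb[None, :])[nz])).sum()
    ha = -(pa[pa > 0] * np.log(pa[pa > 0])).sum()  # entropy of labeling a
    hb = -(pb[pb > 0] * np.log(pb[pb > 0])).sum()  # entropy of labeling b
    return mi / np.sqrt(ha * hb)
```

NMI is invariant to label permutation, so a clustering that matches the ground truth up to renamed cluster IDs scores 1.0, while an uninformative clustering scores near 0.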




Published In

Information Sciences: an International Journal, Volume 660, Issue C, March 2024, 599 pages

Publisher

Elsevier Science Inc.

United States


Author Tags

  1. Clustering
  2. Density peak estimation
  3. Large-scale
  4. Scalability
  5. Parallel computation

Qualifiers

  • Research-article
