[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/2388996.2389081acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

A new scalable parallel DBSCAN algorithm using the disjoint-set data structure

Published: 10 November 2012 Publication History

Abstract

DBSCAN is a well-known density based clustering algorithm capable of discovering arbitrary shaped clusters and eliminating noise data. However, parallelization of Dbscan is challenging as it exhibits an inherent sequential data access order. Moreover, existing parallel implementations adopt a master-slave strategy which can easily cause an unbalanced workload and hence result in low parallel efficiency.
We present a new parallel Dbscan algorithm (Pdsdbscan) using graph algorithmic concepts. More specifically, we employ the disjoint-set data structure to break the access sequentiality of Dbscan. In addition, we use a tree-based bottom-up approach to construct the clusters. This yields a better-balanced workload distribution. We implement the algorithm both for shared and for distributed memory.
Using data sets containing up to several hundred million high-dimensional points, we show that Pdsdbscan significantly outperforms the master-slave approach, achieving speedups up to 25.97 using 40 cores on shared memory architecture, and speedups up to 5,765 using 8,192 cores on distributed memory architecture.

References

[1]
U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI magazine, vol. 17, no. 3, p. 37, 1996.
[2]
J. MacQueen et al., "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1. USA, 1967, pp. 281--297.
[3]
H. Park and C. Jun, "A simple and fast algorithm for K-medoids clustering," Expert Systems with Applications, vol. 36, no. 2, pp. 3336--3341, 2009.
[4]
T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in ACM SIGMOD Record, vol. 25(2). ACM, 1996, pp. 103--114.
[5]
M. Ester, H. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data mining, vol. 1996. AAAI Press, 1996, pp. 226--231.
[6]
W. Wang, J. Yang, and R. Muntz, "STING: A statistical information grid approach to spatial data mining," in Proceedings of the International Conference on Very Large Data Bases. IEEE, 1997, pp. 186--195.
[7]
G. Sheikholeslami, S. Chatterjee, and A. Zhang, "WaveCluster: a wavelet-based clustering approach for spatial data in very large databases," The VLDB Journal, vol. 8, no. 3, pp. 289--304, 2000.
[8]
A. Mukhopadhyay and U. Maulik, "Unsupervised satellite image segmentation by combining SA based fuzzy clustering with support vector machine," in Proceedings of 7th ICAPR'09. IEEE, 2009, pp. 381--384.
[9]
D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data," Data & Knowledge Engineering, vol. 60, no. 1, pp. 208--221, 2007.
[10]
M. Surdeanu, J. Turmo, and A. Ageno, "A hybrid unsupervised approach for document clustering," in Proceedings of the 11th ACM SIGKDD. ACM, 2005, pp. 685--690.
[11]
S. Madeira and A. Oliveira, "Biclustering algorithms for biological data analysis: a survey," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 1, no. 1, pp. 24--45, 2004.
[12]
J. Han, M. Kamber, and J. Pei, Data mining: concepts and techniques. Morgan Kaufmann, 2011.
[13]
H. Kargupta and J. Han, Next generation of data mining. Chapman & Hall/CRC, 2009, vol. 7.
[14]
S. Brecheisen, H. Kriegel, and M. Pfeifle, "Parallel density-based clustering of complex objects," Advances in Knowledge Discovery and Data Mining, pp. 179--188, 2006.
[15]
D. Arlia and M. Coppola, "Experiments in parallel clustering with DBSCAN," in Euro-Par 2001 Parallel Processing. Springer, LNCS, 2001, pp. 326--331.
[16]
M. Chen, X. Gao, and H. Li, "Parallel DBSCAN with priority r-tree," in Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on. IEEE, 2010, pp. 508--511.
[17]
M. Coppola and M. Vanneschi, "High-performance data mining with skeleton-based structured parallel programming," Parallel Computing, vol. 28, no. 5, pp. 793--813, 2002.
[18]
Y. Fu, W. Zhao, and H. Ma, "Research on parallel DBSCAN algorithm design based on mapreduce," Advanced Materials Research, vol. 301, pp. 1133--1138, 2011.
[19]
X. Xu, J. Jäger, and H. Kriegel, "A fast parallel clustering algorithm for large spatial databases," High Performance Data Mining, pp. 263--290, 2002.
[20]
A. Zhou, S. Zhou, J. Cao, Y. Fan, and Y. Hu, "Approaches for scaling DBSCAN algorithm to large spatial databases," Journal of computer science and technology, vol. 15, no. 6, pp. 509--526, 2000.
[21]
T. Cormen, Introduction to algorithms. The MIT press, 2001.
[22]
M. Patwary, J. Blair, and F. Manne, "Experiments on union-find algorithms for the disjoint-set data structure," in Proceedings of the 9th International Symposium on Experimental Algorithms (SEA 2010). Springer, LNCS 6049, 2010, pp. 411--423.
[23]
M. M. A. Patwary, P. Refsnes, and F. Manne, "Multi-core spanning forest algorithms using the disjoint-set data structure," in Proceedings of the 26th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2012). IEEE, 2012, to appear.
[24]
M. B. Kennel, "KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space," 2004, institute for Nonlinear Science, University of California.
[25]
N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The r*-tree: an efficient and robust access method for points and rectangles," Proceedings of the 1990 ACM SIGMOD international conference on Management of data, vol. 19, no. 2, pp. 322--331, 1990.
[26]
J. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, no. 9, pp. 509--517, 1975.
[27]
R. Tarjan, "A class of algorithms which require nonlinear time to maintain disjoint sets," Journal of computer and system sciences, vol. 18, no. 2, pp. 110--127, 1979.
[28]
B. Galler and M. Fisher, "An improved equivalence algorithm," Communications of the ACM, vol. 7, pp. 301--303, 1964.
[29]
M. Patwary, A. Gebremedhin, and A. Pothen, "New multithreaded ordering and coloring algorithms for multicore architectures," in Proceedings of 17th International European Conference on Parallel and Distributed Computing (Euro-Par 2011). Springer, LNCS 6853, 2011, pp. 250--262.
[30]
F. Manne and M. Patwary, "A scalable parallel union-find algorithm for distributed memory computers," in Parallel Processing and Applied Mathematics. Springer, LNCS, 2010, pp. 186--195.
[31]
"CLUTO - clustering high-dimensional datasets," 2006, http://glaros.dtc.umn.edu/gkhome/cluto/cluto/.
[32]
"Parallel K-means data clustering," 2005, http://users.eecs.northwestern.edu/wkliao/Kmeans/.
[33]
R. Agrawal and R. Srikant, "Quest synthetic data generator," IBM Almaden Research Center, 1994.
[34]
J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi, "NU-MineBench 3.0," Technical Report CUCIS-2005-08-01, Northwestern University, Tech. Rep., 2010.
[35]
G. Lemson and the Virgo Consortium, "Halo and galaxy formation histories from the millennium simulation: Public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony," Arxiv preprint astro-ph/0608019, 2006.
[36]
V. Springel, S. White, A. Jenkins, C. Frenk, N. Yoshida, L. Gao, J. Navarro, R. Thacker, D. Croton, J. Helly et al., "Simulations of the formation, evolution and clustering of galaxies and quasars," Nature, vol. 435, no. 7042, pp. 629--636, 2005.
[37]
S. Bertone, G. De Lucia, and P. Thomas, "The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model," Monthly Notices of the Royal Astronomical Society, vol. 379, no. 3, pp. 1143--1154, 2007.
[38]
G. De Lucia and J. Blaizot, "The hierarchical formation of the brightest cluster galaxies," Monthly Notices of the Royal Astronomical Society, vol. 375, no. 1, pp. 2--14, 2007.
[39]
R. Bower, A. Benson, R. Malbon, J. Helly, C. Frenk, C. Baugh, S. Cole, and C. Lacey, "Breaking the hierarchy of galaxy formation," Monthly Notices of the Royal Astronomical Society, vol. 370, no. 2, pp. 645--655, 2006.
[40]
Y. Liu, W.-k. Liao, and A. Choudhary, "Design and evaluation of a parallel HOP clustering algorithm for cosmological simulation," in Proceedings of the 17th International Symposium on Parallel and Distributed Processing (IPDPS 2003). Washington, DC, USA: IEEE, 2003, p. 82.1.

Cited By

View all
  • (2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
  • (2022)GraphZeppelin: Storage-Friendly Sketching for Connected Components on Dynamic Graph StreamsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526146(325-339)Online publication date: 10-Jun-2022
  • (2021)ConnectItProceedings of the VLDB Endowment10.14778/3436905.343692314:4(653-667)Online publication date: 22-Feb-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SC '12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2012
1161 pages
ISBN:9781467308045

Sponsors

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 10 November 2012

Check for updates

Author Tags

  1. density based clustering
  2. disjoint-set data structure
  3. union-find algorithm

Qualifiers

  • Research-article

Conference

SC '12
Sponsor:

Acceptance Rates

SC '12 Paper Acceptance Rate 100 of 461 submissions, 22%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Upcoming Conference

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)0
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Survey and Experimental Review on Data Distribution Strategies for Parallel Spatial Clustering AlgorithmsJournal of Computer Science and Technology10.1007/s11390-024-2700-039:3(610-636)Online publication date: 1-May-2024
  • (2022)GraphZeppelin: Storage-Friendly Sketching for Connected Components on Dynamic Graph StreamsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526146(325-339)Online publication date: 10-Jun-2022
  • (2021)ConnectItProceedings of the VLDB Endowment10.14778/3436905.343692314:4(653-667)Online publication date: 22-Feb-2021
  • (2020)Parallel and Scalable Precise ClusteringProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414646(217-228)Online publication date: 30-Sep-2020
  • (2020)Anomaly detection for NILM task with Apache FlinkProceedings of the 14th ACM International Conference on Distributed and Event-based Systems10.1145/3401025.3401758(199-203)Online publication date: 13-Jul-2020
  • (2019)k/2-hopProceedings of the VLDB Endowment10.14778/3329772.332977312:9(948-960)Online publication date: 1-May-2019
  • (2019)Hybrid CPU/GPU clustering in shared memory on the billion point scaleProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3330349(35-45)Online publication date: 26-Jun-2019
  • (2018)G-OPTICSInternational Journal of Web and Grid Services10.1504/IJWGS.2018.09258314:3(273-287)Online publication date: 1-Jan-2018
  • (2018)KeyBin2Proceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225149(1-10)Online publication date: 13-Aug-2018
  • (2018)Parallelizing Pruning-based Graph Structural ClusteringProceedings of the 47th International Conference on Parallel Processing10.1145/3225058.3225063(1-10)Online publication date: 13-Aug-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media