More Web Proxy on the site http://driver.im/

research-article

KeyBin2: Distributed Clustering for Scalable and In-Situ Analysis

Authors:

Michela Taufer,

Trilce EstradaAuthors Info & Claims

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Article No.: 34, Pages 1 - 10

https://doi.org/10.1145/3225058.3225149

Published: 13 August 2018 Publication History

Abstract

We present KeyBin2, a key-based clustering method that is able to learn from distributed data in parallel. KeyBin2 uses random projections and discrete optimizations to efficiently clustering very high dimensional data. Because it is based on keys computed independently per dimension and per data point, KeyBin2 scales linearly. We perform accuracy and scalability tests to evaluate our algorithm's performance using synthetic and real datasets. The experiments show that KeyBin2 outperforms other parallel clustering methods for problems with increased complexity. Finally, we present an application of KeyBin2 for in-situ clustering of protein folding trajectories.

References

[1]

C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. 1999. Fast Algorithms for Projected Clustering. SIGMOD Rec. 28, 2 (1999), 61--72.

Digital Library

[2]

Charu C Aggarwal and Philip S Yu. 2001. Outlier detection for high dimensional data. In ACM Sigmod Record, Vol. 30. ACM, 37--46.

Digital Library

[3]

Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record 27, 2 (1998), 94--105.

Digital Library

[4]

David Baker. 1998. Metastable states and folding free energy barriers. Nature Structural and Molecular Biology 5 (1998).

[5]

M. Bendechache, N. A. Le-Khac, and M. T. Kechadi. 2016. Hierarchical Aggregation Approach for Distributed Clustering of Spatial Datasets. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). 1098--1103.

[6]

J. C. Bennett, H. Abbasi, P. Bremer, R. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci, P. Pebay, D. Thompson, H. Yu, F. Zhang, and J. Chen. 2012. Combining In-situ and In-transit Processing to Enable Extreme-scale Scientific Analysis. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 49:1--49:9.

Digital Library

[7]

C. Best and H. C. Hege. 2002. Visualizing and identifying conformational ensembles in molecular dynamics trajectories. Computing in Science Engineering 4, 5 (2002).

Digital Library

[8]

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When is Nearest Neighbor meaningful?. In International conference on database theory. Springer, 217--235.

Digital Library

[9]

Ella Bingham, Ella Bingham, Heikki Mannila, and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. International Conference on Knowledge Discovery and Data Mining (KDD) (2001). 245--250.

Digital Library

[10]

Tadeusz Caliński and Jerzy Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3, 1 (1974), 1--27.

[11]

Canonizer. 2012. Implementation of MAFIA Subspace Clustering on NVidia GPUs. (2012). https://github.com/canonizer/gpumafia open source code.

[12]

CERN 2018. CERN - Large Hadron Collider. (2018). http://home.web.cern.ch/

[13]

X Chen, J Benson, and T Estrada. 2017. keybin: Key-Based Binning for Distributed Clustering. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 572--581.

[14]

Xinyu Chen and Trilce Estrada. 2017. Index Clustering: A Map-reduce Clustering Approach using Numba. In Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA,. INSTICC, SciTePress. 233--240.

Digital Library

[15]

Bi-Ru Dai and I-Chang Lin. 2012. Efficient map/reduce-based dbscan algorithm with optimized data partition. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 59--66.

Digital Library

[16]

Lisandro Dalcín, Rodrigo Paz, and Mario Storti. 2005. MPI for Python. J. Parallel and Distrib. Comput. 65, 9 (2005), 1108--1115.

Digital Library

[17]

Sanjoy Dasgupta. 1999. Learning mixtures of Gaussians. In Foundations of computer science, 1999. 40th annual symposium on. IEEE, 634--644.

Digital Library

[18]

Sanjoy Dasgupta and Anupam Gupta. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical Report (1999), 6--99.

[19]

M. Dreher and B. Raffin. 2014. A Flexible Framework for Asynchronous in Situ and in Transit Analytics for Scientific Simulations. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 277--286.

[20]

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and Others. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96. 226--231.

Digital Library

[21]

Trilce Estrada and Michela Taufer. 2012. On the Effectiveness of Application-aware Self-management for Scientific Discovery in Volunteer Computing Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 80:1--80:11.

Digital Library

[22]

Xiaoli Z Fern and Carla E Brodley. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 186--193.

Digital Library

[23]

A. Gionis, P. Indyk, and R. Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., 518--529.

Digital Library

[24]

S Goil, H Nagesh, and A Choudhary. 1999. MAFIA: Efficient and scalable subspace clustering for very large data sets. ... Discovery and Data Mining 5 (1999), 443--452.

[25]

Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: highly parallel DBSCAN. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2.

Digital Library

[26]

Poonam Goyal, Sonal Kumari, Shubham Singh, Vivek Kishore, Sundar S Balasubramaniam, and Navneet Goyal. 2016. A Parallel Framework for Grid-Based Bottom-Up Subspace Clustering. In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on. IEEE, 331--340.

[27]

Alexander Hinneburg and Da Keim. 1998. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of 4th International Conference in Knowledge Discovery and Data Mining (KDD 98) (1998), 58--65.

Digital Library

[28]

L. Hong, Z. Gao, X. Pan, and K. Yang. 2011. Segmentation of high resolution remote sensing image based on hierarchically multiscale object-oriented Markov random fields model. In IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM). 343--347.

[29]

Xiaojuan Hu, Lei Liu, Ningjia Qiu, Di Yang, and Meng Li. 2017. A MapReduce-based improvement algorithm for DBSCAN. Journal of Algorithms & Computational Technology (2017), 1748301817735665.

[30]

P. Indyk and R. Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, 604--613.

Digital Library

[31]

J.A. Insley, L. Grinberg, and M.E. Papka. 2011. Visualizing multiscale, multiphysics simulation data: Brain blood flow. In IEEE Symposium on Large Data Analysis and Visualization (LDAV). 3--7.

[32]

Anil K Jain and East Lansing. 2009. Data Clustering: 50 Years Beyond K-Means 1 Anil K. Jain Michigan State University. (2009).

[33]

T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. 2017. In situ data analytics and indexing of protein trajectories. J Comput Chem. 38, 16 (2017).

[34]

H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. 2001. Distributed Clustering Using Collective Principal Component Analysis. Knowledge and Information Systems 3, 4 (2001), 422--448.

Digital Library

[35]

H. Kawashima, R. R. Sato, and H. Kitagawa. 2008. Models and Issues on Probabilistic Data Streams with Bayesian Networks. In Proc. of the International Symposium on Applications and the Internet (SAINT).

Digital Library

[36]

Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM, 7.

Digital Library

[37]

Z. Li and M. Parashar. 2007. Grid-based asynchronous Replica Exchange. In 8th IEEE/ACM International Conference on Grid Computing. 201--208.

Digital Library

[38]

Liao. 2018. Parallel K-Means Data Clustering. (2018). http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html open source code.

[39]

Hubert W Lilliefors. 1967. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American statistical Association 62, 318 (1967), 399--402.

[40]

Y. Liu, L. C. Jiao, F. Shang, F. Yin, and F. Liu. 2013. An Efficient Matrix Bifactorization Alternative Optimization Method for Low-rank Matrix Recovery and Completion. Neural Netw. 48 (2013).

Digital Library

[41]

Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129--137.

Digital Library

[42]

LSST 2018. Large Synoptic Survey Telescope. (2018). http://www.lsst.org/lsst/

[43]

INB. Molecular modeling and Bioinformatics Group. 2018. Molecular Dynamics Extended Library. (2018). http://mmb.pcb.ub.es/MoDEL/index.jsf

[44]

NASA 2015. NASA-Center for Climate Simulation. (2015). http://www.nasa.gov/topics/earth/features/climate-sim-center.html

[45]

John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable parallel programming with CUDA. In ACM SIGGRAPH 2008 classes. ACM, 16.

Digital Library

[46]

J.M. Paluska, H. Pham, U. Saif, G. Chau, C. Terman, and S. Ward. 2008. Structured Decomposition of Adaptive Applications. In Proceedings of the 6th Annual IEEE International Conference on Pervasive Computing and Communications. 1--10.

Digital Library

[47]

Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary. 2012. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 62.

Digital Library

[48]

Md Mostofa Ali Patwary, Suren Byna, Nadathur Rajagopalan Satish, Narayanan Sundaram, Zarija Lukić, Vadim Roytershteyn, Michael J Anderson, Yushu Yao, Pradeep Dubey, and others. 2015. BD-CATS: big data clustering at trillion particle scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 6.

Digital Library

[49]

Alejandro Pelaez, Andres Quiroz, James C Browne, Edward Chuah, and Manish Parashar. 2014. Online failure prediction for hpc resources using decentralized clustering. In High Performance Computing (HiPC), 2014 21st International Conference on. IEEE, 1--9.

[50]

Dan Pelleg, Andrew W Moore, and Others. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Icml, Vol. 1. 727--734.

Digital Library

[51]

J. Phillips. 2011. Validating clustering of molecular dynamics simulations using polymer models. BMC Bioinformatics 12, 1 (2011).

[52]

A. Quiroz, M. Parashar, N. Gnanasambandam, and N. Sharma. 2012. Design and Evaluation of Decentralized Online Clustering. ACM Trans. Auton. Adapt. Syst. 7, 3 (2012), 34:1--34:31.

Digital Library

[53]

G.N. Ramachandran, C. Ramakrishnan, and V. Sasisekharan. 1963. Multipolar representation of protein structure. Journal of Molecular Biology 7, 95 (1963).

[54]

Conrad Sanderson and Ryan Curtin. 2017. An open source C++ implementation of multi-threaded Gaussian mixture models, k-means and expectation maximisation. arXiv preprint arXiv:1707.09094 (2017).

[55]

SDSS 2014. SDSS - Sloan Digital Sky Survey. (2014). https://www.sdss3.org/

[56]

Bernard W Silverman. 1981. Using kernel density estimates to investigate multi-modality. Journal of the Royal Statistical Society. Series B (Methodological) (1981), 97--99.

[57]

D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers. 2012. Reducing Data Movement Costs Using Energy-Efficient, Active Computation on SSD. In 2012 Workshop on Power-Aware Computing and Systems. USENIX.

Digital Library

[58]

T. Tu. 2008. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In CM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SCâĂ&Zacute;08).

Digital Library

[59]

Dongkuan Xu and Yingjie Tian. 2015. A Comprehensive Survey of Clustering Algorithms. Annals of Data Science 2, 2 (2015), 165--193.

[60]

B. Zhang, T. Estrada, P. Cicotti, and M. Taufer. 2014. Enabling in-situ data analysis for large protein folding trajectory datasets. In IEEE International Parallel and Distributed Processing Symposium.

Digital Library

[61]

F. Zheng, H. Yu, C. Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, and S. Klasky. 2013. GoldRush: Resource Efficient in Situ Scientific Data Analytics Using Fine-grained Interference Aware Execution. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 78:1--78:12.

Digital Library

Index Terms

KeyBin2: Distributed Clustering for Scalable and In-Situ Analysis
1. Theory of computation

Recommendations

Efficient clustering on Riemannian manifolds

Reformulating computer vision problems over Riemannian manifolds has demonstrated superior performance in various computer vision applications. This is because visual data often forms a special structure lying on a lower dimensional space embedded in a ...
Clustering Faster and Better with Projected Data
ICISDM '22: Proceedings of the 6th International Conference on Information System and Data Mining

The K-means clustering algorithm can take a lot of time to converge, especially for large datasets in high dimension and a large number of clusters. By applying several enhancements it is possible to improve the performance without significantly ...
Privacy-Preserving Hierarchical-k-means Clustering on Horizontally Partitioned Data

Privacy preserving mining of distributed data is an important direction for data mining, and privacy preserving clustering is one of the main researches. Privacy-preserving data mining techniques enable knowledge discovery without requiring disclosure ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

August 2018

945 pages

ISBN:9781450365109

DOI:10.1145/3225058

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

University of Oregon: University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICPP 2018

ICPP 2018: 47th International Conference on Parallel Processing

August 13 - 16, 2018

OR, Eugene, USA

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

0
Total Citations
69
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 24 Dec 2024

Other Metrics

View Author Metrics

Citations

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents