[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3225058.3225149acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

KeyBin2: Distributed Clustering for Scalable and In-Situ Analysis

Published: 13 August 2018 Publication History

Abstract

We present KeyBin2, a key-based clustering method that is able to learn from distributed data in parallel. KeyBin2 uses random projections and discrete optimizations to efficiently clustering very high dimensional data. Because it is based on keys computed independently per dimension and per data point, KeyBin2 scales linearly. We perform accuracy and scalability tests to evaluate our algorithm's performance using synthetic and real datasets. The experiments show that KeyBin2 outperforms other parallel clustering methods for problems with increased complexity. Finally, we present an application of KeyBin2 for in-situ clustering of protein folding trajectories.

References

[1]
C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park. 1999. Fast Algorithms for Projected Clustering. SIGMOD Rec. 28, 2 (1999), 61--72.
[2]
Charu C Aggarwal and Philip S Yu. 2001. Outlier detection for high dimensional data. In ACM Sigmod Record, Vol. 30. ACM, 37--46.
[3]
Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan. 1998. Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record 27, 2 (1998), 94--105.
[4]
David Baker. 1998. Metastable states and folding free energy barriers. Nature Structural and Molecular Biology 5 (1998).
[5]
M. Bendechache, N. A. Le-Khac, and M. T. Kechadi. 2016. Hierarchical Aggregation Approach for Distributed Clustering of Spatial Datasets. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). 1098--1103.
[6]
J. C. Bennett, H. Abbasi, P. Bremer, R. Grout, A. Gyulassy, T. Jin, S. Klasky, H. Kolla, M. Parashar, V. Pascucci, P. Pebay, D. Thompson, H. Yu, F. Zhang, and J. Chen. 2012. Combining In-situ and In-transit Processing to Enable Extreme-scale Scientific Analysis. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 49:1--49:9.
[7]
C. Best and H. C. Hege. 2002. Visualizing and identifying conformational ensembles in molecular dynamics trajectories. Computing in Science Engineering 4, 5 (2002).
[8]
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When is Nearest Neighbor meaningful?. In International conference on database theory. Springer, 217--235.
[9]
Ella Bingham, Ella Bingham, Heikki Mannila, and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. International Conference on Knowledge Discovery and Data Mining (KDD) (2001). 245--250.
[10]
Tadeusz Caliński and Jerzy Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics-theory and Methods 3, 1 (1974), 1--27.
[11]
Canonizer. 2012. Implementation of MAFIA Subspace Clustering on NVidia GPUs. (2012). https://github.com/canonizer/gpumafia open source code.
[12]
CERN 2018. CERN - Large Hadron Collider. (2018). http://home.web.cern.ch/
[13]
X Chen, J Benson, and T Estrada. 2017. keybin: Key-Based Binning for Distributed Clustering. In 2017 IEEE International Conference on Cluster Computing (CLUSTER). 572--581.
[14]
Xinyu Chen and Trilce Estrada. 2017. Index Clustering: A Map-reduce Clustering Approach using Numba. In Proceedings of the 6th International Conference on Data Science, Technology and Applications - Volume 1: DATA,. INSTICC, SciTePress. 233--240.
[15]
Bi-Ru Dai and I-Chang Lin. 2012. Efficient map/reduce-based dbscan algorithm with optimized data partition. In Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 59--66.
[16]
Lisandro Dalcín, Rodrigo Paz, and Mario Storti. 2005. MPI for Python. J. Parallel and Distrib. Comput. 65, 9 (2005), 1108--1115.
[17]
Sanjoy Dasgupta. 1999. Learning mixtures of Gaussians. In Foundations of computer science, 1999. 40th annual symposium on. IEEE, 634--644.
[18]
Sanjoy Dasgupta and Anupam Gupta. 1999. An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Technical Report (1999), 6--99.
[19]
M. Dreher and B. Raffin. 2014. A Flexible Framework for Asynchronous in Situ and in Transit Analytics for Scientific Simulations. In IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing. 277--286.
[20]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, and Others. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, Vol. 96. 226--231.
[21]
Trilce Estrada and Michela Taufer. 2012. On the Effectiveness of Application-aware Self-management for Scientific Discovery in Volunteer Computing Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 80:1--80:11.
[22]
Xiaoli Z Fern and Carla E Brodley. 2003. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 186--193.
[23]
A. Gionis, P. Indyk, and R. Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of the 25th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc., 518--529.
[24]
S Goil, H Nagesh, and A Choudhary. 1999. MAFIA: Efficient and scalable subspace clustering for very large data sets. ... Discovery and Data Mining 5 (1999), 443--452.
[25]
Markus Götz, Christian Bodenstein, and Morris Riedel. 2015. HPDBSCAN: highly parallel DBSCAN. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments. ACM, 2.
[26]
Poonam Goyal, Sonal Kumari, Shubham Singh, Vivek Kishore, Sundar S Balasubramaniam, and Navneet Goyal. 2016. A Parallel Framework for Grid-Based Bottom-Up Subspace Clustering. In Data Science and Advanced Analytics (DSAA), 2016 IEEE International Conference on. IEEE, 331--340.
[27]
Alexander Hinneburg and Da Keim. 1998. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. In Proceedings of 4th International Conference in Knowledge Discovery and Data Mining (KDD 98) (1998), 58--65.
[28]
L. Hong, Z. Gao, X. Pan, and K. Yang. 2011. Segmentation of high resolution remote sensing image based on hierarchically multiscale object-oriented Markov random fields model. In IEEE International Conference on Spatial Data Mining and Geographical Knowledge Services (ICSDM). 343--347.
[29]
Xiaojuan Hu, Lei Liu, Ningjia Qiu, Di Yang, and Meng Li. 2017. A MapReduce-based improvement algorithm for DBSCAN. Journal of Algorithms & Computational Technology (2017), 1748301817735665.
[30]
P. Indyk and R. Motwani. 1998. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing. ACM, 604--613.
[31]
J.A. Insley, L. Grinberg, and M.E. Papka. 2011. Visualizing multiscale, multiphysics simulation data: Brain blood flow. In IEEE Symposium on Large Data Analysis and Visualization (LDAV). 3--7.
[32]
Anil K Jain and East Lansing. 2009. Data Clustering: 50 Years Beyond K-Means 1 Anil K. Jain Michigan State University. (2009).
[33]
T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. 2017. In situ data analytics and indexing of protein trajectories. J Comput Chem. 38, 16 (2017).
[34]
H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. 2001. Distributed Clustering Using Collective Principal Component Analysis. Knowledge and Information Systems 3, 4 (2001), 422--448.
[35]
H. Kawashima, R. R. Sato, and H. Kitagawa. 2008. Models and Issues on Probabilistic Data Streams with Bayesian Networks. In Proc. of the International Symposium on Applications and the Internet (SAINT).
[36]
Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. ACM, 7.
[37]
Z. Li and M. Parashar. 2007. Grid-based asynchronous Replica Exchange. In 8th IEEE/ACM International Conference on Grid Computing. 201--208.
[38]
Liao. 2018. Parallel K-Means Data Clustering. (2018). http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html open source code.
[39]
Hubert W Lilliefors. 1967. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American statistical Association 62, 318 (1967), 399--402.
[40]
Y. Liu, L. C. Jiao, F. Shang, F. Yin, and F. Liu. 2013. An Efficient Matrix Bifactorization Alternative Optimization Method for Low-rank Matrix Recovery and Completion. Neural Netw. 48 (2013).
[41]
Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129--137.
[42]
LSST 2018. Large Synoptic Survey Telescope. (2018). http://www.lsst.org/lsst/
[43]
INB. Molecular modeling and Bioinformatics Group. 2018. Molecular Dynamics Extended Library. (2018). http://mmb.pcb.ub.es/MoDEL/index.jsf
[44]
NASA 2015. NASA-Center for Climate Simulation. (2015). http://www.nasa.gov/topics/earth/features/climate-sim-center.html
[45]
John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. 2008. Scalable parallel programming with CUDA. In ACM SIGGRAPH 2008 classes. ACM, 16.
[46]
J.M. Paluska, H. Pham, U. Saif, G. Chau, C. Terman, and S. Ward. 2008. Structured Decomposition of Adaptive Applications. In Proceedings of the 6th Annual IEEE International Conference on Pervasive Computing and Communications. 1--10.
[47]
Mostofa Ali Patwary, Diana Palsetia, Ankit Agrawal, Wei-keng Liao, Fredrik Manne, and Alok Choudhary. 2012. A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society Press, 62.
[48]
Md Mostofa Ali Patwary, Suren Byna, Nadathur Rajagopalan Satish, Narayanan Sundaram, Zarija Lukić, Vadim Roytershteyn, Michael J Anderson, Yushu Yao, Pradeep Dubey, and others. 2015. BD-CATS: big data clustering at trillion particle scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 6.
[49]
Alejandro Pelaez, Andres Quiroz, James C Browne, Edward Chuah, and Manish Parashar. 2014. Online failure prediction for hpc resources using decentralized clustering. In High Performance Computing (HiPC), 2014 21st International Conference on. IEEE, 1--9.
[50]
Dan Pelleg, Andrew W Moore, and Others. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Icml, Vol. 1. 727--734.
[51]
J. Phillips. 2011. Validating clustering of molecular dynamics simulations using polymer models. BMC Bioinformatics 12, 1 (2011).
[52]
A. Quiroz, M. Parashar, N. Gnanasambandam, and N. Sharma. 2012. Design and Evaluation of Decentralized Online Clustering. ACM Trans. Auton. Adapt. Syst. 7, 3 (2012), 34:1--34:31.
[53]
G.N. Ramachandran, C. Ramakrishnan, and V. Sasisekharan. 1963. Multipolar representation of protein structure. Journal of Molecular Biology 7, 95 (1963).
[54]
Conrad Sanderson and Ryan Curtin. 2017. An open source C++ implementation of multi-threaded Gaussian mixture models, k-means and expectation maximisation. arXiv preprint arXiv:1707.09094 (2017).
[55]
SDSS 2014. SDSS - Sloan Digital Sky Survey. (2014). https://www.sdss3.org/
[56]
Bernard W Silverman. 1981. Using kernel density estimates to investigate multi-modality. Journal of the Royal Statistical Society. Series B (Methodological) (1981), 97--99.
[57]
D. Tiwari, S. S. Vazhkudai, Y. Kim, X. Ma, S. Boboila, and P. J. Desnoyers. 2012. Reducing Data Movement Costs Using Energy-Efficient, Active Computation on SSD. In 2012 Workshop on Power-Aware Computing and Systems. USENIX.
[58]
T. Tu. 2008. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In CM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SCâĂŹ08).
[59]
Dongkuan Xu and Yingjie Tian. 2015. A Comprehensive Survey of Clustering Algorithms. Annals of Data Science 2, 2 (2015), 165--193.
[60]
B. Zhang, T. Estrada, P. Cicotti, and M. Taufer. 2014. Enabling in-situ data analysis for large protein folding trajectory datasets. In IEEE International Parallel and Distributed Processing Symposium.
[61]
F. Zheng, H. Yu, C. Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, and S. Klasky. 2013. GoldRush: Resource Efficient in Situ Scientific Data Analytics Using Fine-grained Interference Aware Execution. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 78:1--78:12.

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • University of Oregon: University of Oregon

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 August 2018

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Clustering
  2. Map-Reduce
  3. Privacy Preserving
  4. Random Projection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2018

Acceptance Rates

ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • 0
    Total Citations
  • 69
    Total Downloads
  • Downloads (Last 12 months)1
  • Downloads (Last 6 weeks)0
Reflects downloads up to 24 Dec 2024

Other Metrics

Citations

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media