Abstract
Data mining techniques such as clustering are usually applied to centralized data sets. Increasingly, however, data is generated and stored at local sites, and transmitting an entire local data set to a central server is often unacceptable for reasons of performance, privacy, security, and bandwidth. In this paper, we propose a distributed clustering model based on ensemble learning that analyzes and mines distributed data sources to find global clustering patterns. A typical distributed clustering scenario is a two-stage process: clustering is performed first at the local sites and then at the global site. The local clustering results transmitted to the server site form an ensemble, and the combining schemes of ensemble learning use this ensemble to generate the global clustering results. In the model, generating global patterns from the ensemble is formulated mathematically as a combinatorial optimization problem. As an implementation of the model, we present a novel distributed clustering algorithm called DK-means. Experimental results show that DK-means achieves results similar to those of K-means applied to the whole centralized data set at once, that it remains scalable as the data distribution varies across local sites, and that the model is valid.
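The two-stage scenario described above can be illustrated with a minimal sketch. It is not DK-means, whose details the abstract does not give: the combining step here is simply a weighted k-means over the transmitted local centroids, which is an assumption made for illustration, as are the function names (`local_stage`, `global_stage`, `weighted_kmeans`), the farthest-first seeding, and the toy data.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding (an illustrative choice): start at points[0],
    then repeatedly add the point farthest from the seeds chosen so far."""
    chosen = [list(points[0])]
    while len(chosen) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in chosen))
        chosen.append(list(nxt))
    return chosen

def weighted_kmeans(points, weights, k, iters=100):
    """Lloyd-style k-means on weighted points; returns (centroids, masses)."""
    centroids = farthest_first(points, k)
    dim = len(points[0])
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the weighted mean of its members.
        updated = []
        for j in range(k):
            mass = sum(w for w, l in zip(weights, labels) if l == j)
            if mass == 0:
                updated.append(centroids[j])  # keep an empty cluster in place
                continue
            updated.append([
                sum(w * p[d] for p, w, l in zip(points, weights, labels) if l == j) / mass
                for d in range(dim)
            ])
        if updated == centroids:
            break
        centroids = updated
    masses = [sum(w for w, l in zip(weights, labels) if l == j) for j in range(k)]
    return centroids, masses

def local_stage(site_data, k):
    """Stage 1, run at a local site: only (centroid, cluster size) pairs
    leave the site, never the raw records."""
    centroids, masses = weighted_kmeans(site_data, [1] * len(site_data), k)
    return [(c, m) for c, m in zip(centroids, masses) if m > 0]

def global_stage(ensemble, k):
    """Stage 2, run at the server: combine the ensemble of local summaries
    by clustering the local centroids, weighted by their cluster sizes."""
    points = [c for site in ensemble for c, _ in site]
    weights = [m for site in ensemble for _, m in site]
    return weighted_kmeans(points, weights, k)

# Toy data (hypothetical): three sites, two well-separated groups of points.
sites = [
    [(0, 0), (0, 1), (10, 10), (10, 11)],
    [(1, 0), (1, 1), (11, 10), (11, 11)],
    [(0.5, 0.5), (10.5, 10.5)],
]
ensemble = [local_stage(site, 2) for site in sites]
global_centroids, global_masses = global_stage(ensemble, 2)
print(global_centroids, global_masses)
```

In this sketch the server sees six weighted centroids instead of ten raw points. Because every local cluster in the toy data contains points from only one group, the global centroids coincide with the means that centralized clustering of all ten points would give for the two groups. With mixed local clusters the two results would generally differ.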
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ji, G., Ling, X. (2007). Ensemble Learning Based Distributed Clustering. In: Washio, T., et al. Emerging Technologies in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77018-3_32
DOI: https://doi.org/10.1007/978-3-540-77018-3_32
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77016-9
Online ISBN: 978-3-540-77018-3
eBook Packages: Computer Science (R0)