Abstract
Clustering is a common problem in the analysis of large data sets. Streaming algorithms, which make a single pass over the data set using small working memory and produce a clustering comparable in cost to the optimal offline solution, are especially useful. We develop the first streaming algorithms achieving a constant-factor approximation to the cluster radius for two variations of the k-center clustering problem. We give a streaming (4 + ε)-approximation algorithm using O(ε − 1 kz) memory for the problem with outliers, in which the clustering is allowed to drop up to z of the input points; previous work used a random sampling approach which yields only a bicriteria approximation. We also give a streaming (6 + ε)-approximation algorithm using O(ε − 1 ln (ε − 1) k + k 2) memory for a variation motivated by anonymity considerations in which each cluster must contain at least a certain number of input points.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Agarwal, P., Har-Peled, S., Yu, H.: Robust shape fitting via peeling and grating coresets. In: Proc. of ACM Symp. on Discrete Algorithms (SODA), pp. 182–191 (2006)
Aggarwal, G., Feder, T., Kenthapadi, K., Khuller, S., Panigrahy, R., Thomas, D., Zhu, A.: Achieving anonymity via clustering. In: Proc. of ACM Principles of Database Systems (PODS), pp. 153–162 (2006)
Aldenderfer, M.S., Blashfield, R.K.: Cluster Analysis. Sage, Beverly Hills (1984)
Bădoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 250–257 (2002)
Charikar, M., Chekuri, C., Feder, T., Motwani, R.: Incremental clustering and dynamic infomation retrieval. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 626–635 (1997)
Charikar, M., O’Callaghan, L., Panigrahy, R.: Better streaming algorithms for clustering problems. In: Proc. of ACM Symp. on Theory of Computing (STOC), pp. 30–39 (2003)
Charikar, M., Khuller, S., Mount, D., Narasimhan, G.: Algorithms for facility location problems with outliers. In: Proc. of ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 642–651 (2001)
Chen, K.: A constant factor approximation algorithm for k-median clustering with outliers. In: Proc. of ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 826–835 (2008)
Everitt, B.: Cluster Analysis. Heinemann Educational, London (1974)
Gonzalez, T.: Clustering to minimize the maximum inter-cluster distance. Theoretical Computer Science 38, 293–306 (1985)
Guha, S.: The k-center karma of a data stream (unpublished manuscript) (2007)
Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams. In: Proc. of IEEE Foundations of Computer Science (FOCS), pp. 359–366 (2000)
Hartigan, J.A.: Clustering Algorithms. Wiley, Chichester (1975)
Hochbaum, D., Shmoys, D.B.: A best possible approximation algorithm for the k-center problem. Math. of Operations Research 10, 180–184 (1985)
Jain, A.K., Dubes, R.C.: Algorithms for clustering data. Prentice Hall, NJ (1988)
Rasmussen, E.: Clustering algorithms. In: Frakes, W., Baeza-Yates, R. (eds.) Information Retrieval: Data Structures and Algorithms. Prentice Hall, Englewood Cliffs (1992)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Matthew McCutchen, R., Khuller, S. (2008). Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In: Goel, A., Jansen, K., Rolim, J.D.P., Rubinfeld, R. (eds) Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2008 2008. Lecture Notes in Computer Science, vol 5171. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85363-3_14
Download citation
DOI: https://doi.org/10.1007/978-3-540-85363-3_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85362-6
Online ISBN: 978-3-540-85363-3
eBook Packages: Computer ScienceComputer Science (R0)