Abstract
Given a set of points \(P \subset\mathbb{R}^{d}\), the k-means clustering problem is to find a set of k centers \(C = \{ c_{1},\ldots,c_{k}\}\), \(c_{i} \in\mathbb{R}^{d}\), such that the objective function \(\sum_{x \in P} e(x, C)^{2}\), where \(e(x, C)\) denotes the Euclidean distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions that have been studied in the context of clustering.
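For concreteness, here is a minimal sketch of this objective function in Python with NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def kmeans_cost(P, C):
    """k-means objective: sum over x in P of e(x, C)^2, where e(x, C) is the
    Euclidean distance from x to its nearest center in C."""
    # P: (n, d) array of points, C: (k, d) array of centers
    dists = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)  # (n, k) pairwise distances
    return float((dists.min(axis=1) ** 2).sum())
```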
\(D^{2}\)-sampling (Arthur and Vassilvitskii, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points \(P \subset\mathbb{R}^{d}\), the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of the distance of this point to the nearest previously sampled point.
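The following short Python sketch (illustrative only; names are ours, not from the paper) implements this sampling procedure:

```python
import numpy as np

def d2_sample(P, k, rng=None):
    """D^2-sampling sketch: the first point is uniform over P; each subsequent
    point is chosen with probability proportional to its squared distance to
    the nearest previously sampled point."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    idx = [rng.integers(n)]                                # first point: uniform
    sq_dist = np.sum((P - P[idx[0]]) ** 2, axis=1)         # squared distance to nearest sample
    for _ in range(k - 1):
        probs = sq_dist / sq_dist.sum()                    # the D^2 distribution
        idx.append(rng.choice(n, p=probs))
        sq_dist = np.minimum(sq_dist, np.sum((P - P[idx[-1]]) ** 2, axis=1))
    return P[idx]
```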
\(D^{2}\)-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) show that k points chosen as centers from P using \(D^{2}\)-sampling give an \(O(\log k)\) approximation in expectation. Ailon et al. (NIPS, pp. 10–18, 2009) and Aggarwal et al. (Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28, Springer, Berlin, 2009) extended the results of Arthur and Vassilvitskii (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) to show that \(O(k)\) points chosen as centers using \(D^{2}\)-sampling give an \(O(1)\) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of \(D^{2}\)-sampling by giving a simple randomized \((1+\epsilon)\)-approximation algorithm that uses \(D^{2}\)-sampling at its core.
Notes
The \(\tilde{O}\) notation hides an \(O(\log k/\epsilon)\) factor, which simplifies the expression.
It can be used in conjunction with Chen [14] to obtain a superior running time but at the cost of the simplicity of our approach.
It turns out that even minor perturbations from the uniform distribution can be catastrophic, and indeed in this paper we had to work around this issue.
References
Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. Ph.D. Thesis, University of Paderborn (2010)
Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09, pp. 1088–1097. SIAM, Philadelphia (2009)
Ackermann, M.R., Blömer, J.: Bregman clustering for separable instances. In: Proceedings of the 12th Scandinavian Conference on Algorithm Theory, SWAT’10, pp. 212–223. Springer, Berlin (2010)
Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms 6, 59 (2010)
Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Lecture Notes in Computer Science, vol. 5687, pp. 15–28. Springer, Berlin (2009)
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: NIPS, pp. 10–18 (2009)
Arthur, D., Manthey, B., Röglin, H.: Smoothed analysis of the k-means method. J. ACM 19, 1 (2011)
Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, SCG’06, pp. 144–153. ACM, New York (2006)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035. SIAM, Philadelphia (2007)
Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS’10, pp. 309–318. IEEE Comput. Soc., Los Alamitos (2010)
Bādoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC’02, pp. 250–257. ACM, New York (2002)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)
Chen, K.: On k-median clustering in high dimensions. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’06, pp. 1177–1185. ACM, New York (2006)
Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California, San Diego (2008)
Fernandez de la Vega, W., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC’03, pp. 50–58. ACM, New York (2003)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, A.R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 6 (1990)
Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W.: Efficient and effective querying by image content. J. Intell. Inf. Syst. 3(3–4), 231–262 (1994)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, STOC’11, pp. 569–578. ACM, New York (2011)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, SCG’07, pp. 11–18. ACM, New York (2007)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’13. SIAM, Philadelphia (2013)
Feldman, D., Schulman, L.J.: Data reduction for weighted and outlier-resistant clustering. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’12, pp. 1343–1354. SIAM, Philadelphia (2012)
Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC’05, pp. 209–217. ACM, New York (2005)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. In: Proceedings of the Twenty-First Annual Symposium on Computational Geometry, SCG’05, pp. 126–134. ACM, New York (2005)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC’04, pp. 291–300. ACM, New York (2004)
Har-Peled, S., Sadri, B.: How fast is the k-means method? In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’05, pp. 877–885. SIAM, Philadelphia (2005)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG’94, pp. 332–339. ACM, New York (1994)
Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 5(2), 1–32 (2010)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Matoušek, J.: On approximate geometric k-clustering. Discrete Comput. Geom. 24(1), 61–84 (2000)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 28, 1 (2013)
Swain, M., Ballard, D.: Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991)
Vattani, A.: k-means requires exponentially many iterations even in the plane. In: Proceedings of the 25th Annual Symposium on Computational Geometry, SCG’09, pp. 324–332. ACM, New York (2009)
Appendix: A Sampling Lemma
Here we argue that the approximation analysis of Algorithm 1 in Sect. 4 carries over to Algorithm 2. The following sampling lemma will be useful for this.
Lemma 14
(Sampling Lemma)
Let \(M \leq N\) be positive integers and \(0 < \alpha\leq 1\). Let P be a finite set of elements. Consider multisets of size N consisting of elements of P, and let \(\mathcal{S}\) denote a set of such multisets. Let \(\mathcal{D}\) be an arbitrary distribution on the elements of \(\mathcal{S}\); sampling an element from this distribution is denoted by \(S \leftarrow\mathcal{D}\). Fix an ordering of the elements of P that defines an ordering of the elements in the multisets of \(\mathcal{S}\), and fix an ordering of the subsets of size M of any set of size N, so that for all \(i \in\{1, \ldots, \binom{N}{M}\}\) the ith subset of a multiset \(S \in\mathcal{S}\) is well defined. Let \(f: \mathcal{S} \times\{1, \ldots, \binom{N}{M}\} \rightarrow\{0, 1\}\) be any function denoting some relation between S and its ith subset, and let T be any subset of \(\{1, \ldots, \binom{N}{M}\}\). Consider the following two randomized procedures:

Procedure A: sample \(S \leftarrow\mathcal{D}\) once and output 1 if there exists an index \(i \in T\) with \(f(S, i) = 1\); otherwise output 0.

Procedure B: for each index \(i \in T\), sample \(S \leftarrow\mathcal{D}\) independently; output 1 if \(f(S, i) = 1\) holds for the multiset sampled in the iteration corresponding to some \(i \in T\); otherwise output 0.
If \(\mathbf{Pr}[A \mbox{ outputs } 1] \geq\alpha\), then \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\).
Proof
Let \(n = |T|\), and let the probability of sampling S from \(\mathcal{S}\) be denoted by \(\mathcal{D}(S)\). Consider a complete bipartite graph in which the nodes on the left are the members of \(\mathcal{S}\) and the nodes on the right are the elements of T. We label an edge (S,i) red if f(S,i)=1 and black otherwise. Edge (S,i) is given a weight of \(\frac{\mathcal{D}(S)}{n}\), so that the total weight of all edges in this complete bipartite graph is 1. We first observe that the total weight of the red edges is at least \(\alpha/n\). This is indeed the case since \(\sum_{S \in\mathcal{S}, \exists i\ f(S, i)=1} \mathcal{D}(S) = \mathbf{Pr}[A \mbox{ outputs } 1]\geq \alpha\), and every such S contributes at least one red edge of weight \(\mathcal{D}(S)/n\). Now let \(w_{i}\) denote the total weight of the red edges incident on node i on the right, so that \(\sum_{i \in T} w_{i} \geq\alpha/n\). The probability that iteration i of B does not output 1 is \((1 - n\cdot w_{i})\). So, the probability that B outputs 0 is at most
\[ \prod_{i \in T} (1 - n\cdot w_{i}) \leq e^{-n\sum_{i \in T} w_{i}} \leq e^{-\alpha}. \]
So, \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\). □
The following argument shows that the analysis of Algorithm 1 extends to Algorithm 2. In each step of Algorithm 1, we sample S with \(D^{2}\)-sampling and then iterate over all subsets of S of size M. Suppose the probability that there is an index i such that the ith subset of S is good (in the sense that its centroid is close to an optimal center) is at least 3/4; we indeed show this in Sect. 4. On the other hand, in Algorithm 2, we iterate over all indices i, and in the iteration corresponding to index i we sample a multiset S with \(D^{2}\)-sampling. We succeed in finding a good center if there is an index i such that the ith subset is good for the S sampled in iteration i. We note that if the success probability of the first randomized procedure is large, then the success probability of the second procedure is also large. To see this, consider the Sampling Lemma given above: these two randomized procedures correspond to procedures A and B in Lemma 14. Using this lemma, we get that the success probability in any iteration of step 1(b) of Algorithm 2 is at least \((1 - e^{-3/4}) \geq 1/2\). So, the probability that Algorithm 2 succeeds in finding a set of k good centers is at least \(1/2^{k}\). Since we repeat \(2^{k}\) times, we get that Algorithm 2 succeeds with high probability.
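As an informal illustration (not part of the paper's analysis; the toy distribution and relation f below are arbitrary choices of ours), the following Python sketch contrasts procedures A and B of Lemma 14 and empirically checks the bound \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation: S is drawn from an arbitrary distribution D over a small
# universe, T indexes "subsets", and f(S, i) = 1 encodes "the i-th subset of S is good".
universe = np.arange(10)
D = rng.dirichlet(np.ones(10))              # arbitrary distribution over S
T = range(5)
f = lambda S, i: int((S + i) % 7 == 0)      # arbitrary 0/1 relation

def procedure_A():
    S = rng.choice(universe, p=D)           # sample S once ...
    return int(any(f(S, i) for i in T))     # ... and try every index i in T

def procedure_B():
    for i in T:                             # for every index i in T ...
        S = rng.choice(universe, p=D)       # ... draw a fresh sample
        if f(S, i):
            return 1
    return 0

trials = 100_000
alpha = np.mean([procedure_A() for _ in range(trials)])
p_B = np.mean([procedure_B() for _ in range(trials)])
print(f"Pr[A=1] ~ {alpha:.3f}, Pr[B=1] ~ {p_B:.3f}, 1 - e^(-alpha) = {1 - np.exp(-alpha):.3f}")
```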