Abstract
Given a set of points \(P \subset\mathbb{R}^{d}\), the k-means clustering problem is to find a set of k centers \(C = \{ c_{1},\ldots,c_{k}\}\), \(c_{i} \in\mathbb{R}^{d}\), such that the objective function \(\sum_{x \in P} e(x, C)^{2}\), where \(e(x, C)\) denotes the Euclidean distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions that have been studied in the context of clustering.
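For concreteness, here is a minimal sketch of this objective function in Python with NumPy (the function and variable names are ours, not from the paper):

```python
import numpy as np

def kmeans_cost(P, C):
    """k-means objective: sum over x in P of e(x, C)^2, where e(x, C) is the
    Euclidean distance from x to its nearest center in C."""
    # P: (n, d) array of points, C: (k, d) array of centers
    dists = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=2)  # (n, k) pairwise distances
    return float((dists.min(axis=1) ** 2).sum())
```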
\(D^{2}\)-sampling (Arthur and Vassilvitskii, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points \(P \subset\mathbb{R}^{d}\), the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of the distance of this point to the nearest previously sampled point.
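The following short Python sketch (illustrative only; names are ours, not from the paper) implements this sampling procedure:

```python
import numpy as np

def d2_sample(P, k, rng=None):
    """D^2-sampling sketch: the first point is uniform over P; each subsequent
    point is chosen with probability proportional to its squared distance to
    the nearest previously sampled point."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    idx = [rng.integers(n)]                                # first point: uniform
    sq_dist = np.sum((P - P[idx[0]]) ** 2, axis=1)         # squared distance to nearest sample
    for _ in range(k - 1):
        probs = sq_dist / sq_dist.sum()                    # the D^2 distribution
        idx.append(rng.choice(n, p=probs))
        sq_dist = np.minimum(sq_dist, np.sum((P - P[idx[-1]]) ** 2, axis=1))
    return P[idx]
```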
\(D^{2}\)-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) show that k points chosen as centers from P using \(D^{2}\)-sampling give an \(O(\log k)\) approximation in expectation. Ailon et al. (NIPS, pp. 10–18, 2009) and Aggarwal et al. (Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28, Springer, Berlin, 2009) extended the results of Arthur and Vassilvitskii (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) to show that \(O(k)\) points chosen as centers using \(D^{2}\)-sampling give an \(O(1)\) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of \(D^{2}\)-sampling by giving a simple randomized \((1+\epsilon)\)-approximation algorithm that uses \(D^{2}\)-sampling at its core.
Notes
The \(\tilde{O}\) notation hides an \(O(\log k/\epsilon)\) factor, which simplifies the expression.
It can be used in conjunction with Chen [14] to obtain a superior running time but at the cost of the simplicity of our approach.
It turns out that even minor perturbations from the uniform distribution can be catastrophic, and indeed in this paper we had to work around this issue.
References
Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. Ph.D. Thesis, University of Paderborn (2010)
Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09, pp. 1088–1097. SIAM, Philadelphia (2009)
Ackermann, M.R., Blömer, J.: Bregman clustering for separable instances. In: Proceedings of the 12th Scandinavian Conference on Algorithm Theory, SWAT’10, pp. 212–223. Springer, Berlin (2010)
Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms 6, 59 (2010)
Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Lecture Notes in Computer Science, vol. 5687, pp. 15–28. Springer, Berlin (2009)
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: NIPS, pp. 10–18 (2009)
Arthur, D., Manthey, B., Röglin, H.: Smoothed analysis of the k-means method. J. ACM 19, 1 (2011)
Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, SCG’06, pp. 144–153. ACM, New York (2006)
Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035. SIAM, Philadelphia (2007)
Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS’10, pp. 309–318. IEEE Comput. Soc., Los Alamitos (2010)
Bādoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thiry-Fourth Annual ACM Symposium on Theory of Computing, STOC’02, pp. 250–257. ACM, New York (2002)
Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)
Chen, K.: On k-median clustering in high dimensions. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’06, pp. 1177–1185. ACM, New York (2006)
Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California, San Diego (2008)
Fernandez de la Vega, W., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC’03, pp. 50–58. ACM, New York (2003)
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, A.R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 6 (1990)
Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W.: Efficient and effective querying by image content. J. Intell. Inf. Syst. 3(3–4), 231–262 (1994)
Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, STOC’11, pp. 569–578. ACM, New York (2011)
Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, SCG’07, pp. 11–18. ACM, New York (2007)
Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’13. SIAM, Philadelphia (2013)
Feldman, D., Schulman, L.J.: Data reduction for weighted and outlier-resistant clustering. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’12, pp. 1343–1354. SIAM, Philadelphia (2012)
Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC’05, pp. 209–217. ACM, New York (2005)
Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. In: Proceedings of the Twenty-First Annual Symposium on Computational Geometry, SCG’05, pp. 126–134. ACM, New York (2005)
Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC’04, pp. 291–300. ACM, New York (2004)
Har-Peled, S., Sadri, B.: How fast is the k-means method? In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’05, pp. 877–885. SIAM, Philadelphia (2005)
Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG’94, pp. 332–339. ACM, New York (1994)
Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 5(2), 1–32 (2010)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Matoušek, J.: On approximate geometric k-clustering. Discrete Comput. Geom. 24(1), 61–84 (2000)
Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 28, 1 (2013)
Swain, M., Ballard, D.: Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991)
Vattani, A.: k-means requires exponentially many iterations even in the plane. In: Proceedings of the 25th Annual Symposium on Computational Geometry, SCG’09, pp. 324–332. ACM, New York (2009)
Appendix: A Sampling Lemma
Here we argue that the approximation analysis of Algorithm 1 in Sect. 4 carries over to Algorithm 2. The following sampling lemma will be useful for this.
Lemma 14
(Sampling Lemma)
Let \(M \leq N\) be positive integers and \(0 < \alpha\leq 1\). Let P be a finite set of elements. Consider multisets of size N consisting of elements of P, and let \(\mathcal{S}\) denote a set of such multisets. Let \(\mathcal{D}\) be an arbitrary distribution on the elements of \(\mathcal{S}\); sampling an element from this distribution is denoted by \(S \leftarrow\mathcal{D}\). Fix an ordering of the elements of P that defines an ordering of the elements in the multisets of \(\mathcal{S}\), and fix an ordering of the subsets of size M of any set of size N, so that for all \(i \in\{1, \ldots, \binom{N}{M}\}\) the ith subset of a multiset \(S \in\mathcal{S}\) is well defined. Let \(f: \mathcal{S} \times\{1, \ldots, \binom{N}{M}\} \rightarrow\{0, 1\}\) be any function denoting some relation between S and its ith subset, and let T be any subset of \(\{1, \ldots, \binom{N}{M}\}\). Consider the following two randomized procedures:

Procedure A: sample \(S \leftarrow\mathcal{D}\) once and output 1 if there exists an index \(i \in T\) with \(f(S, i) = 1\); otherwise output 0.

Procedure B: for each index \(i \in T\), sample \(S \leftarrow\mathcal{D}\) independently; output 1 if \(f(S, i) = 1\) holds for the multiset sampled in the iteration corresponding to some \(i \in T\); otherwise output 0.
If \(\mathbf{Pr}[A \mbox{ outputs } 1] \geq\alpha\), then \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\).
Proof
Let \(n = |T|\), and let the probability of sampling S from \(\mathcal{S}\) be denoted by \(\mathcal{D}(S)\). Consider a complete bipartite graph in which the nodes on the left are the members of \(\mathcal{S}\) and the nodes on the right are the elements of T. We label an edge (S,i) red if f(S,i)=1 and black otherwise. Edge (S,i) is given a weight of \(\frac{\mathcal{D}(S)}{n}\), so that the total weight of all edges in this complete bipartite graph is 1. We first observe that the total weight of the red edges is at least \(\alpha/n\). This is indeed the case since \(\sum_{S \in\mathcal{S}, \exists i\ f(S, i)=1} \mathcal{D}(S) = \mathbf{Pr}[A \mbox{ outputs } 1]\geq \alpha\), and every such S contributes at least one red edge of weight \(\mathcal{D}(S)/n\). Now let \(w_{i}\) denote the total weight of the red edges incident on node i on the right, so that \(\sum_{i \in T} w_{i} \geq\alpha/n\). The probability that iteration i of B does not output 1 is \((1 - n\cdot w_{i})\). So, the probability that B outputs 0 is at most
\[ \prod_{i \in T} (1 - n\cdot w_{i}) \leq e^{-n\sum_{i \in T} w_{i}} \leq e^{-\alpha}. \]
So, \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\). □
The following argument shows that the analysis of Algorithm 1 extends to Algorithm 2. In each step of Algorithm 1, we sample S with \(D^{2}\)-sampling and then iterate over all subsets of S of size M. Suppose the probability that there is an index i such that the ith subset of S is good (in the sense that its centroid is close to an optimal center) is at least 3/4; we indeed show this in Sect. 4. On the other hand, in Algorithm 2, we iterate over all indices i, and in the iteration corresponding to index i we sample a multiset S with \(D^{2}\)-sampling. We succeed in finding a good center if there is an index i such that the ith subset is good for the S sampled in iteration i. We note that if the success probability of the first randomized procedure is large, then the success probability of the second procedure is also large. To see this, consider the Sampling Lemma given above: these two randomized procedures correspond to procedures A and B in Lemma 14. Using this lemma, we get that the success probability in any iteration of step 1(b) of Algorithm 2 is at least \((1 - e^{-3/4}) \geq 1/2\). So, the probability that Algorithm 2 succeeds in finding a set of k good centers is at least \(1/2^{k}\). Since we repeat \(2^{k}\) times, we get that Algorithm 2 succeeds with high probability.
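As an informal illustration (not part of the paper's analysis; the toy distribution and relation f below are arbitrary choices of ours), the following Python sketch contrasts procedures A and B of Lemma 14 and empirically checks the bound \(\mathbf{Pr}[B \mbox{ outputs } 1] \geq 1 - e^{-\alpha}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instantiation: S is drawn from an arbitrary distribution D over a small
# universe, T indexes "subsets", and f(S, i) = 1 encodes "the i-th subset of S is good".
universe = np.arange(10)
D = rng.dirichlet(np.ones(10))              # arbitrary distribution over S
T = range(5)
f = lambda S, i: int((S + i) % 7 == 0)      # arbitrary 0/1 relation

def procedure_A():
    S = rng.choice(universe, p=D)           # sample S once ...
    return int(any(f(S, i) for i in T))     # ... and try every index i in T

def procedure_B():
    for i in T:                             # for every index i in T ...
        S = rng.choice(universe, p=D)       # ... draw a fresh sample
        if f(S, i):
            return 1
    return 0

trials = 100_000
alpha = np.mean([procedure_A() for _ in range(trials)])
p_B = np.mean([procedure_B() for _ in range(trials)])
print(f"Pr[A=1] ~ {alpha:.3f}, Pr[B=1] ~ {p_B:.3f}, 1 - e^(-alpha) = {1 - np.exp(-alpha):.3f}")
```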