Rapid communication

On the k-means/median cost function

Published: 01 August 2022

Highlights

We try to understand how the optimal k-means cost behaves as a function of k (the number of centers/clusters).
We show that D²-sampling is a useful method for designing pseudo-approximation algorithms and movement-based coresets for k-means.
Our techniques generalize to the metric k-median problem in metric spaces with bounded doubling dimension.
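The first highlight above — that the optimal cost Δ(X, k) is non-increasing in k — can be checked on a toy instance. The following is a minimal 1-D sketch (function names are ours, not from the paper); restricting candidate centers to the data points of X themselves gives at most a factor-2 overestimate of the true Euclidean optimum, which is enough to observe the monotone behavior:

```python
import itertools

def cost(C, X):
    """k-means cost Phi(C, X): sum over x in X of squared distance to its nearest center."""
    return sum(min((x - c) ** 2 for c in C) for x in X)

def best_cost(X, k):
    """Brute-force Delta(X, k), with centers restricted to points of X
    (at most a factor-2 overestimate of the optimum over all of R)."""
    return min(cost(C, X) for C in itertools.combinations(X, k))

X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
costs = [best_cost(X, k) for k in range(1, len(X) + 1)]
# Delta(X, k) is non-increasing in k, and hits 0 when every point is a center.
assert all(a >= b for a, b in zip(costs, costs[1:]))
assert costs[-1] == 0.0
```

The quantity L(X, k, ε) studied in the paper asks how far along this decreasing sequence one must go before the cost drops below an ε fraction of Δ(X, k).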

Abstract

In this work, we study the k-means cost function. Given a dataset X ⊆ R^d and an integer k, the goal of the Euclidean k-means problem is to find a set of k centers C ⊆ R^d such that Φ(C, X) ≡ ∑_{x ∈ X} min_{c ∈ C} ‖x − c‖² is minimized. Let Δ(X, k) ≡ min_{C ⊆ R^d, |C| = k} Φ(C, X) denote the cost of the optimal k-means solution. For any dataset X, Δ(X, k) decreases as k increases. In this work, we try to understand this behavior more precisely. For any dataset X ⊆ R^d, integer k ≥ 1, and precision parameter ε > 0, let L(X, k, ε) denote the smallest integer such that Δ(X, L(X, k, ε)) ≤ ε · Δ(X, k). We show upper and lower bounds on this quantity. Our techniques generalize to the metric k-median problem in metric spaces with bounded doubling dimension. Finally, we observe that for any dataset X, we can compute a set S of size O(L(X, k, ε/c)) using D²-sampling such that Φ(S, X) ≤ ε · Δ(X, k) for some fixed constant c. Applications include new pseudo-approximation guarantees for k-means++ and bounds for movement-based coreset constructions.
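The D²-sampling procedure the abstract refers to can be sketched as follows: grow the center set S one point at a time, drawing each new center with probability proportional to its current squared distance to S (the k-means++ seeding rule), and stop once Φ(S, X) falls below the target cost. This is a minimal 1-D sketch with names of our choosing; the target would be ε · Δ(X, k) in the paper's setting:

```python
import random

def d2_sample(X, target_cost, seed=0):
    """Grow a center set S by D^2-sampling until Phi(S, X) <= target_cost.

    Each new center is drawn with probability proportional to its squared
    distance to the current set S (the k-means++ seeding rule).
    """
    rng = random.Random(seed)
    S = [rng.choice(X)]  # first center: uniform over X
    d2 = [min((x - s) ** 2 for s in S) for x in X]  # D^2(x) for each x
    while sum(d2) > target_cost:
        c = rng.choices(X, weights=d2)[0]  # Pr[x] proportional to D^2(x)
        S.append(c)
        d2 = [min(d, (x - c) ** 2) for d, x in zip(d2, X)]
    return S

X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
# e.g. target_cost = eps * Delta(X, 1) with eps = 0.5; Delta(X, 1) = 250 here.
S = d2_sample(X, target_cost=125.0)
```

The paper's observation is that the number of centers this process needs before the cost drops below ε · Δ(X, k) is O(L(X, k, ε/c)) for a fixed constant c.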



      Published In

      Information Processing Letters, Volume 177, Issue C, August 2022, 98 pages

      Publisher

      Elsevier North-Holland, Inc., United States


      Author Tags

      1. k-means clustering
      2. k-median clustering
      3. D²-sampling
      4. Coresets
      5. Randomized algorithms

