DOI: 10.5555/3540261.3540870

Label consistency in overfitted generalized k-means

Published: 10 June 2024

Abstract

We provide theoretical guarantees for label consistency in generalized k-means problems, with an emphasis on the overfitted case, where the number of clusters used by the algorithm exceeds the number of true clusters. We give conditions under which the estimated labels are close to a refinement of the true cluster labels, covering both exact and approximate recovery. Our results hold for any constant-factor approximation to the k-means problem. They are also model-free, relying only on bounds on the maximum or average distance of the data points to the true cluster centers; these centers themselves are loosely defined and can be taken to be any set of points for which the aforementioned distances can be controlled. We demonstrate the usefulness of the results with applications to some manifold clustering problems.
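To make the overfitted regime concrete, here is a minimal sketch (our illustration, not the paper's experiments or code), assuming numpy and scikit-learn are available. Two concentric circles serve as a toy manifold clustering problem on which plain 2-means fails; running k-means with k = 10 > 2 centers and counting the points that fall outside the majority true cluster of their estimated cluster gives a simple proxy for how far the estimated labels are from a refinement of the true labels. scikit-learn's KMeans with k-means++ seeding stands in here for the approximate k-means solver assumed by the theory.

```python
# Minimal sketch of the overfitted regime (not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Toy manifold data: two concentric circles (true clusters 0 and 1).
X, y_true = make_circles(n_samples=600, factor=0.3, noise=0.04, random_state=0)

# Overfitted k-means: k = 10 centers for 2 true clusters. KMeans with
# k-means++ seeding is used as a convenient approximate solver.
y_hat = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Proxy for the distance to a refinement: within each estimated cluster,
# points outside its majority true cluster count as misassigned.
misassigned = 0
for j in np.unique(y_hat):
    counts = np.bincount(y_true[y_hat == j], minlength=2)
    misassigned += counts.sum() - counts.max()
print(f"fraction outside the majority refinement: {misassigned / len(X):.3f}")
```

Merging each estimated cluster into the true cluster that dominates it then recovers the ground-truth partition up to the printed misassignment fraction, which is the kind of statement the paper's guarantees make precise under bounds on the point-to-center distances.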

Supplementary Material

Additional material (3540261.3540870_supp.pdf)
Supplemental material.

Information

Published In

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021
30517 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited


