DOI: 10.5555/3540261.3540870

Label consistency in overfitted generalized k-means

Published: 10 June 2024

Abstract

We provide theoretical guarantees for label consistency in generalized k-means problems, with an emphasis on the overfitted case, where the number of clusters used by the algorithm exceeds the number of true clusters. We give conditions under which the estimated labels are close to a refinement of the true cluster labels, covering both exact and approximate recovery. Our results hold for any constant-factor approximation to the k-means problem. They are also model-free, relying only on bounds on the maximum or average distance of the data points to the true cluster centers; these centers themselves are loosely defined and can be taken to be any set of points for which the aforementioned distances can be controlled. We demonstrate the usefulness of the results with applications to some manifold clustering problems.
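To make the overfitted regime concrete, here is a minimal sketch (our illustration, not the paper's experiments or code), assuming numpy and scikit-learn are available. Two concentric circles serve as a toy manifold clustering problem on which plain 2-means fails; running k-means with k = 10 > 2 centers and counting the points that fall outside the majority true cluster of their estimated cluster gives a simple proxy for how far the estimated labels are from a refinement of the true labels. scikit-learn's KMeans with k-means++ seeding stands in here for the approximate k-means solver assumed by the theory.

```python
# Minimal sketch of the overfitted regime (not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles

# Toy manifold data: two concentric circles (true clusters 0 and 1).
X, y_true = make_circles(n_samples=600, factor=0.3, noise=0.04, random_state=0)

# Overfitted k-means: k = 10 centers for 2 true clusters. KMeans with
# k-means++ seeding is used as a convenient approximate solver.
y_hat = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Proxy for the distance to a refinement: within each estimated cluster,
# points outside its majority true cluster count as misassigned.
misassigned = 0
for j in np.unique(y_hat):
    counts = np.bincount(y_true[y_hat == j], minlength=2)
    misassigned += counts.sum() - counts.max()
print(f"fraction outside the majority refinement: {misassigned / len(X):.3f}")
```

Merging each estimated cluster into the true cluster that dominates it then recovers the ground-truth partition up to the printed misassignment fraction, which is the kind of statement the paper's guarantees make precise under bounds on the point-to-center distances.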

Supplementary Material

Additional material (3540261.3540870_supp.pdf)
Supplemental material.

Information

Published In

NIPS '21: Proceedings of the 35th International Conference on Neural Information Processing Systems
December 2021
30517 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited


