
Approximation bounds for hierarchical clustering: average linkage, bisecting k-means, and local search

Published: 06 March 2024

Abstract

Hierarchical clustering is a data analysis method that has been used for decades. Despite its widespread use, the method has an underdeveloped analytical foundation. A well-understood foundation would both support the currently used methods and help guide future improvements. The goal of this paper is to give an analytic framework to better understand observations seen in practice. This paper considers the dual of a problem framework for hierarchical clustering introduced by Dasgupta (2016). The main result is that one of the most popular algorithms used in practice, average linkage agglomerative clustering, has a small constant approximation ratio for this objective. In contrast, this paper establishes that several other popular algorithms, including bisecting k-means divisive clustering, have very poor approximation ratios for the same objective. However, we show that there are divisive algorithms that perform well with respect to this objective by giving two constant approximation algorithms. This paper is among the first work to establish guarantees on widely used hierarchical clustering algorithms for a natural objective function. This objective and analysis give insight into what these popular algorithms are optimizing and when they will perform well.
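The dual objective referred to above is the reward form of Dasgupta's cost: for a hierarchy T over n points with pairwise similarities w_ij, each pair contributes w_ij · (n − |leaves(T[i ∨ j])|), where T[i ∨ j] is the subtree rooted at the least common ancestor of i and j, so similar points are rewarded for being merged low in the tree. The sketch below is illustrative only, not the authors' implementation; the function and variable names (average_linkage, revenue, sim) are assumptions. It runs plain average-linkage agglomerative clustering on a similarity matrix and evaluates this dual objective on the resulting tree.

```python
# Illustrative sketch (not the paper's code): average-linkage agglomerative
# clustering on a similarity matrix, plus an evaluation of the dual ("reward")
# objective: sum over pairs i<j of w[i][j] * (n - |leaves of the subtree
# rooted at the least common ancestor of i and j|).
import itertools


def average_linkage(sim):
    """Greedily merge the two clusters with the largest average pairwise
    similarity; return the merge tree as nested tuples of leaf indices."""
    n = len(sim)
    clusters = {i: (frozenset([i]), i) for i in range(n)}  # id -> (leaf set, subtree)
    next_id = n
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(clusters, 2):
            leaves_a, _ = clusters[a]
            leaves_b, _ = clusters[b]
            avg = sum(sim[i][j] for i in leaves_a for j in leaves_b) / (
                len(leaves_a) * len(leaves_b))
            if best is None or avg > best[0]:
                best = (avg, a, b)
        _, a, b = best
        leaves_a, tree_a = clusters.pop(a)
        leaves_b, tree_b = clusters.pop(b)
        clusters[next_id] = (leaves_a | leaves_b, (tree_a, tree_b))
        next_id += 1
    (_, tree), = clusters.values()
    return tree


def revenue(tree, sim):
    """Dual objective: each pair {i, j} contributes
    sim[i][j] * (n - size of the subtree rooted at lca(i, j))."""
    n = len(sim)

    def walk(t):
        if isinstance(t, int):          # leaf
            return {t}, 0.0
        left, rev_l = walk(t[0])
        right, rev_r = walk(t[1])
        merged = left | right
        # pairs split between the two children have their lca at this node
        cross = sum(sim[i][j] for i in left for j in right)
        return merged, rev_l + rev_r + cross * (n - len(merged))

    return walk(tree)[1]


if __name__ == "__main__":
    # toy symmetric similarity matrix on 4 points (zeros on the diagonal)
    w = [[0, 9, 1, 1],
         [9, 0, 1, 1],
         [1, 1, 0, 8],
         [1, 1, 8, 0]]
    t = average_linkage(w)
    print(t, revenue(t, w))  # e.g. ((0, 1), (2, 3)) with reward 34.0
```

On the toy instance the algorithm first merges the two high-similarity pairs, which is exactly the behavior the constant-factor guarantee for average linkage speaks to; the quadratic-per-merge loop is only for clarity, not efficiency.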

References

[1]
Amir Abboud, Vincent Cohen-Addad, and Hussein Houdrougé. Subquadratic high-dimensional hierarchical clustering. In Advances in Neural Information Processing Systems, pages 11576-11586, 2019.
[2]
Margareta Ackerman and Shai Ben-David. A characterization of linkage-based hierarchical clustering. Journal of Machine Learning Research, 17:232:1-232:17, 2016.
[3]
Margareta Ackerman, Shai Ben-David, Simina Brânzei, and David Loker. Weighted clustering. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada., 2012.
[4]
Sanjeev Arora, Satish Rao, and Umesh V. Vazirani. Expander flows, geometric embeddings and graph partitioning. J. ACM, 56(2):5:1-5:37, 2009.
[5]
Pranjal Awasthi, Afonso S Bandeira, Moses Charikar, Ravishankar Krishnaswamy, Soledad Villar, and Rachel Ward. Relax, no need to round: Integrality of clustering formulations. In Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pages 191-200. ACM, 2015.
[6]
Shai Ben-David and Margareta Ackerman. Measures of clustering quality: A working set of axioms for clustering. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 121-128, 2008. URL http://papers.nips.cc/paper/3491-measures-of-clustering-quality-a-working-set-of-axioms-for-clustering.
[7]
Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, Hotel Porta Fira, January 16-19, pages 841-854, 2017.
[8]
Moses Charikar, Vaggos Chatziafratis, and Rad Niazadeh. Hierarchical clustering better than average-linkage. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2291-2304, 2019a.
[9]
Moses Charikar, Vaggos Chatziafratis, Rad Niazadeh, and Grigory Yaroslavtsev. Hierarchical clustering for euclidean data. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, pages 2721-2730, 2019b.
[10]
Giovanni Chierchia and Benjamin Perret. Ultrametric fitting by gradient descent. In Advances in neural information processing systems, pages 3175-3186, 2019.
[11]
Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, and Claire Mathieu. Hierarchical clustering: Objective functions and algorithms. CoRR, abs/1704.02147, 2017.
[12]
Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118-127, 2016.
[13]
Trevor Hastie, Robert Tibshirani, and Jerome Friedman. Unsupervised Learning, pages 485-585. Springer New York, New York, NY, 2009.
[14]
Katherine A. Heller and Zoubin Ghahramani. Bayesian hierarchical clustering. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, pages 297-304, 2005.
[15]
Anil K. Jain. Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8):651-666, 2010. ISSN 0167-8655. URL http://www.sciencedirect.com/science/article/pii/S0167865509002323.
[16]
Akshay Krishnamurthy, Sivaraman Balakrishnan, Min Xu, and Aarti Singh. Efficient active algorithms for hierarchical clustering. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
[17]
Silvio Lattanzi, Thomas Lavastida, Kefu Lu, and Benjamin Moseley. A framework for parallelizing hierarchical clustering methods. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 73-89. Springer, 2019.
[18]
Xiaofei Ma and Satya Dhavala. Hierarchical clustering with prior knowledge. arXiv preprint arXiv:1806.03432, 2018.
[19]
Aditya Krishna Menon, Anand Rajagopalan, Baris Sumengen, Gui Citovsky, Qin Cao, and Sanjiv Kumar. Online hierarchical clustering approximations. arXiv preprint arXiv:1909.09667, 2019.
[20]
Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, and Amr Ahmed. Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 714-722, 2019.
[21]
Fionn Murtagh and Pedro Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86-97, 2012.
[22]
Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2316-2324, 2016.
[23]
Dingkang Wang and Yusu Wang. An improved cost function for hierarchical cluster trees. arXiv preprint arXiv:1812.02715, 2018.
[24]
Yuyan Wang and Ben Moseley. An objective for hierarchical clustering in euclidean space and its connection to bisecting k-means. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), 2020.
[25]
Reza Zadeh and Shai Ben-David. A uniqueness theorem for clustering. In UAI 2009, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, June 18-21, 2009, pages 639-646, 2009.


Published In

The Journal of Machine Learning Research, Volume 24, Issue 1, January 2023, 18881 pages.
ISSN: 1532-4435. EISSN: 1533-7928. License: CC-BY 4.0.
Publisher: JMLR.org.

Publication History

Published: 06 March 2024
Accepted: 01 January 2023
Revised: 01 May 2020
Received: 01 February 2018
