
Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

Published: 30 May 2024

Abstract

We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for reducing the number of points -- while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former offers no accuracy guarantee while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time -- within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available at https://github.com/Andrew-Draganov/Fast-Coreset-Generation and includes scripts to recreate the experiments.
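The sensitivity-sampling idea the abstract refers to can be sketched in a few lines. The sketch below is an illustrative textbook-style construction, not the paper's optimized near-linear-time algorithm: sensitivities are upper-bounded from a rough k-means++ solution as cost(p)/total_cost + 1/|cluster(p)|, and points are sampled proportionally with inverse-probability weights. All function names here are ours.

```python
import numpy as np

def kmeanspp_seed(X, k, rng):
    # D^2 sampling: each new center is drawn proportionally to the
    # squared distance from the centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def sensitivity_coreset(X, k, m, seed=0):
    """Sample an m-point weighted coreset via sensitivity sampling (sketch)."""
    rng = np.random.default_rng(seed)
    C = kmeanspp_seed(X, k, rng)
    d2 = ((X[:, None, :] - C[None]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    cost = d2[np.arange(len(X)), assign]
    sizes = np.bincount(assign, minlength=k)
    # Standard sensitivity upper bound from a constant-factor solution.
    sens = cost / cost.sum() + 1.0 / sizes[assign]
    probs = sens / sens.sum()
    idx = rng.choice(len(X), size=m, p=probs, replace=True)
    # Inverse-probability weights keep the weighted cost unbiased.
    weights = 1.0 / (m * probs[idx])
    return X[idx], weights
```

The expensive step in this naive version is the two passes of distance computations against the k seeds; the paper's contribution is showing that this pipeline can be carried out within log-factors of the time needed to read the data.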

Supplemental Material

MP4 File
We show that one can obtain k-means and k-median coresets in effectively linear time. Any algorithm that runs faster must make compromises, and we analyze these tradeoffs in depth.
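The fast-but-crude end of the tradeoff spectrum discussed above is plain uniform sampling, which can be sketched in a few lines. This is a generic illustration, not code from the paper's repository; note that uniform sampling gives no accuracy guarantee when clusters are unbalanced or contain distant outliers, which is exactly the compromise the analysis examines.

```python
import numpy as np

def uniform_sample(X, m, seed=0):
    # Uniform sampling with equal weights n/m: runs in (sub)linear time,
    # but small or outlying clusters may be missed entirely.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx], np.full(m, len(X) / m)
```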



Published In

Proceedings of the ACM on Management of Data, Volume 2, Issue 3 (SIGMOD), June 2024. 1953 pages. EISSN: 2836-6573. DOI: 10.1145/3670010.
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published in PACMMOD Volume 2, Issue 3

Author Tags

  1. coresets
  2. k-means clustering
  3. k-median clustering
  4. nearly-linear time

Qualifiers

  • Research-article

Funding Sources

  • European Union's Horizon 2020 research and innovation programme
