DOI: 10.1145/3313276.3316318
research-article

Oblivious dimension reduction for k-means: beyond subspaces and the Johnson-Lindenstrauss lemma

Published: 23 June 2019

Abstract

We show that for n points in d-dimensional Euclidean space, a data-oblivious random projection of the columns onto m = O((log k + log log n) · ε^{-6} · log(1/ε)) dimensions is sufficient to approximate the cost of all k-means clusterings up to a multiplicative (1 ± ε) factor. The previous best upper bounds on m are O(log n · ε^{-2}), given by a direct application of the Johnson-Lindenstrauss lemma, and O(k · ε^{-2}), given by [Cohen et al., STOC'15].
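To make the guarantee concrete, the sketch below (ours, not the paper's construction) draws an oblivious Gaussian projection, fixes a k-means partition computed in the projected space, and checks that the cost of that same partition agrees in the original and projected spaces up to roughly a (1 ± ε) factor. For simplicity the target dimension m uses the classical Johnson-Lindenstrauss bound O(log n / ε²) rather than the paper's sharper bound; all parameter values and helper names are illustrative assumptions.

```python
# Minimal sketch: oblivious random projection for k-means cost preservation.
# "Oblivious" means the projection S is drawn independently of the data X.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k, eps = 2000, 500, 10, 0.5          # illustrative sizes, not from the paper

X = rng.normal(size=(n, d))                 # n points in R^d

m = int(np.ceil(np.log(n) / eps**2))        # JL-style target dimension (for illustration)
S = rng.normal(size=(d, m)) / np.sqrt(m)    # random Gaussian map, drawn without seeing X
Y = X @ S                                   # projected points in R^m

def partition_cost(P, labels):
    """k-means cost of a fixed partition: squared distances to cluster centroids."""
    cost = 0.0
    for c in np.unique(labels):
        pts = P[labels == c]
        cost += ((pts - pts.mean(axis=0)) ** 2).sum()
    return cost

# Fix one clustering (found in the low-dimensional space) and evaluate
# its cost in both spaces; the theorem says the ratio is within 1 +/- eps.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
ratio = partition_cost(Y, labels) / partition_cost(X, labels)
print(f"projected / original cost of the same partition: {ratio:.3f}")
```

Note that the theorem is stronger than what this sketch tests: it preserves the cost of every k-means partition simultaneously, so an approximately optimal clustering of the projected points is approximately optimal for the original points.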

References

[1] Dimitris Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J. Comput. Syst. Sci. 66, 4 (2003), 671–687.
[2] Dimitris Achlioptas and Frank McSherry. 2005. On Spectral Learning of Mixtures of Distributions. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings. 458–469.
[3] Nir Ailon and Bernard Chazelle. 2006. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, Seattle, WA, USA, May 21-23, 2006. 557–563.
[4] Nir Ailon and Edo Liberty. 2008. Fast dimension reduction using Rademacher series on dual BCH codes. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2008, San Francisco, California, USA, January 20-22, 2008. 1–9.
[5] Nir Ailon and Edo Liberty. 2013. An Almost Optimal Unrestricted Fast Johnson-Lindenstrauss Transform. ACM Trans. Algorithms 9, 3 (2013), 21:1–21:12.
[6] Noga Alon and Bo'az Klartag. 2017. Optimal Compression of Approximate Inner Products and Dimension Reduction. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017. 639–650.
[7] Alexandr Andoni, Piotr Indyk, and Mihai Patrascu. 2006. On the Optimality of the Dimensionality Reduction Method. In 47th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2006, Berkeley, California, USA, October 21-24, 2006. 449–458.
[8] Pranjal Awasthi and Or Sheffet. 2012. Improved Spectral-Norm Bounds for Clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 15th International Workshop, APPROX 2012, and 16th International Workshop, RANDOM 2012, Cambridge, MA, USA, August 15-17, 2012, Proceedings. 37–49.
[9] Joshua D. Batson, Daniel A. Spielman, and Nikhil Srivastava. 2012. Twice-Ramanujan Sparsifiers. SIAM J. Comput. 41, 6 (2012), 1704–1721.
[10] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. 2015. Toward a Unified Theory of Sparse Dimensionality Reduction in Euclidean Space. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015. 499–508.
[11] Christos Boutsidis and Malik Magdon-Ismail. 2013. Deterministic Feature Selection for k-Means Clustering. IEEE Trans. Information Theory 59, 9 (2013), 6099–6110.
[12] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. 2009. Unsupervised Feature Selection for the k-means Clustering Problem. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, Vancouver, British Columbia, Canada, December 7-10, 2009. 153–161.
[13] Christos Boutsidis, Anastasios Zouzias, and Petros Drineas. 2010. Random Projections for k-means Clustering. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, Vancouver, British Columbia, Canada, December 6-9, 2010. 298–306.
[14] Christos Boutsidis, Anastasios Zouzias, Michael W. Mahoney, and Petros Drineas. 2015. Randomized Dimensionality Reduction for k-Means Clustering. IEEE Trans. Information Theory 61, 2 (2015), 1045–1062.
[15] Vladimir Braverman, Dan Feldman, and Harry Lang. 2016. New Frameworks for Offline and Streaming Coreset Constructions. CoRR abs/1612.00889 (2016). http://arxiv.org/abs/1612.00889
[16] Vladimir Braverman, Harry Lang, Keith Levin, and Morteza Monemizadeh. 2016. Clustering Problems on Sliding Windows. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016. 1374–1390.
[17] S. Charles Brubaker and Santosh Vempala. 2008. Isotropic PCA and Affine-Invariant Clustering. In 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA. 551–560.
[18] Ke Chen. 2009. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM J. Comput. 39, 3 (2009), 923–947.
[19] Kenneth L. Clarkson, Petros Drineas, Malik Magdon-Ismail, Michael W. Mahoney, Xiangrui Meng, and David P. Woodruff. 2016. The Fast Cauchy Transform and Faster Robust Linear Regression. SIAM J. Comput. 45, 3 (2016), 763–810.
[20] Kenneth L. Clarkson and David P. Woodruff. 2009. Numerical linear algebra in the streaming model. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC). 205–214.
[21] Kenneth L. Clarkson and David P. Woodruff. 2013. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013. 81–90.
[22] Michael B. Cohen. 2016. Nearly Tight Oblivious Subspace Embeddings by Trace Inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016. 278–287.
[23] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. 2015. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, Portland, OR, USA, June 14-17, 2015. 163–172.
[24] Michael B. Cohen, T. S. Jayram, and Jelani Nelson. 2018. Simple Analyses of the Sparse Johnson-Lindenstrauss Transform. In 1st Symposium on Simplicity in Algorithms, SOSA 2018, January 7-10, 2018, New Orleans, LA, USA. 15:1–15:9.
[25] Michael B. Cohen, Cameron Musco, and Christopher Musco. 2017. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, Barcelona, Spain, January 16-19, 2017. 1758–1777.
[26] Vincent Cohen-Addad and Chris Schwiegelshohn. 2017. On the Local Structure of Stable Clustering Instances. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017. 49–60.
[27] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. 2010. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, June 5-8, 2010. 341–350.
[28] Petros Drineas, Alan M. Frieze, Ravi Kannan, Santosh Vempala, and V. Vinay. 2004. Clustering Large Graphs via the Singular Value Decomposition. Machine Learning 56, 1-3 (2004), 9–33.
[29] Michael Elkin, Arnold Filtser, and Ofer Neiman. 2017. Terminal embeddings. Theor. Comput. Sci. 697 (2017), 1–36.
[30] Dan Feldman and Michael Langberg. 2011. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC 2011, San Jose, CA, USA, June 6-8, 2011. 569–578.
[31] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. 2007. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry, Gyeongju, South Korea, June 6-8, 2007.
[32] Dan Feldman, Melanie Schmidt, and Christian Sohler. 2013. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013. 1434–1453.
[33] Gereon Frahling and Christian Sohler. 2005. Coresets in dynamic geometric data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC). 209–217.
[34] Sariel Har-Peled and Akash Kushal. 2007. Smaller Coresets for k-Median and k-Means Clustering. Discrete & Computational Geometry 37, 1 (2007), 3–19.
[35] Sariel Har-Peled and Soham Mazumdar. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004. 291–300.
[36] T. S. Jayram and David P. Woodruff. 2013. Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Subconstant Error. ACM Trans. Algorithms 9, 3 (2013), 26:1–26:17.
[37] William Johnson and Joram Lindenstrauss. 1984. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics 26 (1984), 189–206.
[38] Daniel M. Kane, Raghu Meka, and Jelani Nelson. 2011. Almost Optimal Explicit Johnson-Lindenstrauss Families. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques - 14th International Workshop, APPROX 2011, and 15th International Workshop, RANDOM 2011, Princeton, NJ, USA, August 17-19, 2011, Proceedings. 628–639.
[39] Daniel M. Kane and Jelani Nelson. 2014. Sparser Johnson-Lindenstrauss Transforms. J. ACM 61, 1 (2014), 4:1–4:23.
[40] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. 2008. The Spectral Method for General Mixture Models. SIAM J. Comput. 38, 3 (2008), 1141–1156.
[41] Amit Kumar and Ravindran Kannan. 2010. Clustering with Spectral Norm and the k-Means Algorithm. In 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA. 299–308.
[42] Michael Langberg and Leonard J. Schulman. 2010. Universal ε-approximators for Integrals. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2010, Austin, Texas, USA, January 17-19, 2010. 598–607.
[43] Kasper Green Larsen and Jelani Nelson. 2016. The Johnson-Lindenstrauss Lemma Is Optimal for Linear Dimensionality Reduction. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy. 82:1–82:11.
[44] Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss Lemma. In 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017. 633–638.
[45] Sepideh Mahabadi, Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. 2018. Nonlinear dimension reduction via outer Bi-Lipschitz extensions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018. 1088–1101.
[46] Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. 2016. A Bi-Criteria Approximation Algorithm for k-Means. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2016, September 7-9, 2016, Paris, France. 14:1–14:20.
[47] Xiangrui Meng and Michael W. Mahoney. 2013. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013. 91–100.
[48] Shyam Narayanan and Jelani Nelson. 2018. Optimal terminal dimensionality reduction in Euclidean space. CoRR abs/1810.09250 (2018).
[49] Jelani Nelson and Huy L. Nguyen. 2013. OSNAP: Faster Numerical Linear Algebra Algorithms via Sparser Subspace Embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, Berkeley, CA, USA. 117–126.
[50] Jelani Nelson and Huy L. Nguyen. 2013. Sparsity lower bounds for dimensionality reducing maps. In Symposium on Theory of Computing Conference, STOC'13, Palo Alto, CA, USA, June 1-4, 2013. 101–110.
[51] Jelani Nelson and Huy L. Nguyên. 2014. Lower Bounds for Oblivious Subspace Embeddings. In Automata, Languages, and Programming - 41st International Colloquium, ICALP 2014, Copenhagen, Denmark, July 8-11, 2014, Proceedings, Part I. 883–894.
[52] Tamás Sarlós. 2006. Improved Approximation Algorithms for Large Matrices via Random Projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS). 143–152.
[53] Christian Sohler and David P. Woodruff. 2018. Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018.
[54] Santosh Vempala and Grant Wang. 2004. A spectral algorithm for learning mixture models. J. Comput. Syst. Sci. 68, 4 (2004), 841–860.



Published In

cover image ACM Conferences
STOC 2019: Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing
June 2019
1258 pages
ISBN:9781450367059
DOI:10.1145/3313276
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2019


Author Tags

  1. dimension reduction
  2. k-means
  3. random projections

Qualifiers

  • Research-article

Conference

STOC '19

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 54
  • Downloads (Last 6 weeks): 4
Reflects downloads up to 07 Jan 2025

Cited By

  • (2024) Making old things new. Proceedings of the 41st International Conference on Machine Learning, 12046–12086. DOI: 10.5555/3692070.3692550. Online publication date: 21-Jul-2024.
  • (2024) Optimal coresets for low-dimensional geometric median. Proceedings of the 41st International Conference on Machine Learning, 262–270. DOI: 10.5555/3692070.3692081. Online publication date: 21-Jul-2024.
  • (2024) Fair Projections as a Means toward Balanced Recommendations. ACM Transactions on Intelligent Systems and Technology 16, 1, 1–32. DOI: 10.1145/3664929. Online publication date: 30-Dec-2024.
  • (2024) Settling Time vs. Accuracy Tradeoffs for Clustering Big Data. Proceedings of the ACM on Management of Data 2, 3, 1–25. DOI: 10.1145/3654976. Online publication date: 30-May-2024.
  • (2024) Sensitivity Sampling for k-Means: Worst Case and Stability Optimal Coreset Bounds. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), 1707–1723. DOI: 10.1109/FOCS61266.2024.00106. Online publication date: 27-Oct-2024.
  • (2024) MapReduce Algorithms for Robust Center-Based Clustering in Doubling Metrics. Journal of Parallel and Distributed Computing, article 104966. DOI: 10.1016/j.jpdc.2024.104966. Online publication date: Aug-2024.
  • (2024) Distributed estimation and inference for spatial autoregression model with large scale networks. Journal of Econometrics 238, 2, article 105629. DOI: 10.1016/j.jeconom.2023.105629. Online publication date: Jan-2024.
  • (2024) On Coresets for Fair Clustering in Metric and Euclidean Spaces and Their Applications. Journal of Computer and System Sciences, article 103506. DOI: 10.1016/j.jcss.2024.103506. Online publication date: Jan-2024.
  • (2024) Coresets for kernel clustering. Machine Learning 113, 8, 5891–5906. DOI: 10.1007/s10994-024-06540-z. Online publication date: 22-Apr-2024.
  • (2023) On generalization bounds for projective clustering. Proceedings of the 37th International Conference on Neural Information Processing Systems, 71723–71754. DOI: 10.5555/3666122.3669262. Online publication date: 10-Dec-2023.
  • Show More Cited By
