[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3357713.3384331acmconferencesArticle/Chapter ViewAbstractPublication PagesstocConference Proceedingsconference-collections
research-article

Non-adaptive adaptive sampling on turnstile streams

Published: 22 June 2020 Publication History

Abstract

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying matrix A∈ℝ n× d , where nd, with probabilities proportional to their distances to the subspace of the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams and using space poly(d,k,logn), where k is the number of adaptive sampling rounds to be performed.
Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model. We give the first relative-error algorithm for column subset selection on turnstile streams. We show our adaptive sampling algorithm also gives the first relative-error algorithm for subspace approximation on turnstile streams that returns k noisy rows of A. The quality of the output can be improved to a (1+є)-approximation at the tradeoff of a bicriteria algorithm that outputs a larger number of rows. We then give the first algorithm for projective clustering on turnstile streams that uses space sublinear in n. In fact, we use space poly(d,k,s,1/є,logn) to output a (1+є)-approximation, where s is the number of k-dimensional subspaces. Our adaptive sampling primitive also provides the first algorithm for volume maximization on turnstile streams. We complement our volume maximization algorithmic results with lower bounds that are tight up to lower order terms, even for multi-pass algorithms. By a similar construction, we also obtain lower bounds for volume maximization in the row-arrival model, which we match with competitive upper bounds.

References

[1]
Pankaj K Agarwal, Sariel Har-Peled, and Kasturi R Varadarajan. 2005. Geometric approximation via coresets. Combinatorial and computational geometry, 52 (2005), 1–30.
[2]
Pankaj K. Agarwal and R. Sharathkumar. 2015. Streaming Algorithms for Extent Problems in High Dimensions. Algorithmica, 72, 1 (2015), 83–98.
[3]
Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58, 1 (1999), 137–147.
[4]
Alexandr Andoni, Khanh Do Ba, Piotr Indyk, and David P. Woodruff. 2009. Efficient Sketches for Earth-Mover Distance, with Applications. In 50th Annual IEEE Symposium on Foundations of Computer Science, FOCS. 324–330.
[5]
Alexandr Andoni, Robert Krauthgamer, and Krzysztof Onak. 2011. Streaming Algorithms via Precision Sampling. In IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS. 363–372.
[6]
Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. 2002. Approximate clustering via core-sets. In Proceedings on 34th Annual ACM Symposium on Theory of Computing. 250–257.
[7]
Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, and D. Sivakumar. 2004. An information statistics approach to data stream and communication complexity. J. Comput. Syst. Sci., 68, 4 (2004), 702–732.
[8]
Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. 2009. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA. 968–977.
[9]
Timothy M Chan. 2006. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 35, 1-2 (2006), 20–35.
[10]
Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theor. Comput. Sci., 312, 1 (2004), 3–15.
[11]
Ke Chen. 2009. On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications. SIAM J. Comput., 39, 3 (2009), 923–947.
[12]
Ali Çivril and Malik Magdon-Ismail. 2009. On selecting a maximum volume sub-matrix of a matrix and related problems. Theor. Comput. Sci., 410, 47-49 (2009), 4801–4811.
[13]
Kenneth L. Clarkson and David P. Woodruff. 2015. Input Sparsity and Hardness for Robust Subspace Approximation. In IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS. 310–329.
[14]
Michael B. Cohen, Cameron Musco, and Christopher Musco. 2017. Input Sparsity Time Low-rank Approximation via Ridge Leverage Score Sampling. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017. 1758–1777.
[15]
Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. 2006. Matrix Approximation and Projective Clustering via Volume Sampling. Theory of Computing, 2, 12 (2006), 225–247.
[16]
Amit Deshpande and Kasturi R. Varadarajan. 2007. Sampling-based dimension reduction for subspace approximation. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing. 641–650.
[17]
Amit Deshpande and Santosh Vempala. 2006. Adaptive Sampling and Fast Low-Rank Matrix Approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, 9th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, APPROX and 10th International Workshop on Randomization and Computation, RANDOM, Proceedings. 292–303.
[18]
Dan Feldman, Morteza Monemizadeh, Christian Sohler, and David P. Woodruff. 2010. Coresets and Sketches for High Dimensional Subspace Approximation Problems. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA. 630–649.
[19]
Venkatesan Guruswami and Ali Kemal Sinop. 2012. Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA. 1207–1214.
[20]
Sariel Har-Peled and Soham Mazumdar. 2004. On coresets for k-means and k-median clustering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing. 291–300.
[21]
Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, and Alireza Rezaei. 2019. Composable Core-sets for Determinant Maximization: A Simple Near-Optimal Algorithm. In International Conference on Machine Learning. 4254–4263.
[22]
Piotr Indyk, Sepideh Mahabadi, Shayan Oveis Gharan, and Alireza Rezaei. 2020. Composable Core-sets for Determinant Maximization Problems via Spectral Spanners. In Proceedings of the thirty-first annual ACM-SIAM symposium on Discrete algorithms.
[23]
Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. 2014. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS. 100–108.
[24]
Rajesh Jayaram and David P. Woodruff. 2018. Perfect Lp Sampling in a Data Stream. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS. 544–555.
[25]
Hossein Jowhari, Mert Saglam, and Gábor Tardos. 2011. Tight bounds for Lp samplers, finding duplicates in streams, and related problems. In Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS. 49–58.
[26]
Bala Kalyanasundaram and Georg Schnitger. 1992. The Probabilistic Communication Complexity of Set Intersection. SIAM J. Discrete Math., 5, 4 (1992), 545–557.
[27]
Michael Kerber and Sharath Raghvendra. 2015. Approximation and Streaming Algorithms for Projective Clustering via Random Projections. In Proceedings of the 27th Canadian Conference on Computational Geometry, CCCG.
[28]
Roie Levin, Anish Prasad Sevekari, and David P. Woodruff. 2018. Robust Subspace Approximation in a Stream. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS. 10706–10716.
[29]
Sepideh Mahabadi, Ilya Razenshteyn, David P. Woodruff, and Samson Zhou. 2020. Non-Adaptive Adaptive Sampling on Turnstile Streams. CoRR, abs/2004.10969 (2020).
[30]
Morteza Monemizadeh and David P. Woodruff. 2010. 1-Pass Relative-Error L_ p-Sampling with Applications. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms, SODA. 1143–1160.
[31]
Cecilia M. Procopiuc. 2017. Projective Clustering. In Encyclopedia of Machine Learning and Data Mining. 1018–1025.
[32]
Alexander A. Razborov. 1992. On the Distributional Complexity of Disjointness. Theor. Comput. Sci., 106, 2 (1992), 385–390.
[33]
Nariankadu D. Shyamalkumar and Kasturi R. Varadarajan. 2012. Efficient Subspace Approximation Algorithms. Discrete & Computational Geometry, 47, 1 (2012), 44–63.
[34]
Christian Sohler and David P. Woodruff. 2011. Subspace embeddings for the L_ 1-norm with applications. In Proceedings of the 43rd ACM Symposium on Theory of Computing, STOC. 755–764.
[35]
Christian Sohler and David P. Woodruff. 2018. Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS. 802–813.
[36]
David P. Woodruff and Qin Zhang. 2012. Tight bounds for distributed functional monitoring. In Proceedings of the 44th Symposium on Theory of Computing Conference, STOC. 941–960.

Cited By

View all
  • (2024)Improving the Bit Complexity of Communication for Distributed Convex OptimizationProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649787(1130-1140)Online publication date: 10-Jun-2024
  • (2023)Provable data subset selection for efficient neural networks trainingProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3619846(34533-34555)Online publication date: 23-Jul-2023
  • (2023)Streaming Euclidean Max-Cut: Dimension vs Data ReductionProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585170(170-182)Online publication date: 2-Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
STOC 2020: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
June 2020
1429 pages
ISBN:9781450369794
DOI:10.1145/3357713
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 22 June 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. computational geometry
  2. determinantal point processes
  3. streaming algorithms
  4. volume maximization

Qualifiers

  • Research-article

Conference

STOC '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%

Upcoming Conference

STOC '25
57th Annual ACM Symposium on Theory of Computing (STOC 2025)
June 23 - 27, 2025
Prague , Czech Republic

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)24
  • Downloads (Last 6 weeks)6
Reflects downloads up to 31 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Improving the Bit Complexity of Communication for Distributed Convex OptimizationProceedings of the 56th Annual ACM Symposium on Theory of Computing10.1145/3618260.3649787(1130-1140)Online publication date: 10-Jun-2024
  • (2023)Provable data subset selection for efficient neural networks trainingProceedings of the 40th International Conference on Machine Learning10.5555/3618408.3619846(34533-34555)Online publication date: 23-Jul-2023
  • (2023)Streaming Euclidean Max-Cut: Dimension vs Data ReductionProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585170(170-182)Online publication date: 2-Jun-2023
  • (2023)New Subset Selection Algorithms for Low Rank Approximation: Offline and OnlineProceedings of the 55th Annual ACM Symposium on Theory of Computing10.1145/3564246.3585100(1802-1813)Online publication date: 2-Jun-2023
  • (2023)One-Pass Additive-Error Subset Selection for $$\ell _{p}$$ Subspace Approximation and (k, p)-ClusteringAlgorithmica10.1007/s00453-023-01124-085:10(3144-3167)Online publication date: 11-May-2023
  • (2022)Lazy and fast greedy MAP inference for determinantal point processProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3600471(2776-2789)Online publication date: 28-Nov-2022
  • (2022)Truly Perfect Samplers for Data Streams and Sliding WindowsProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3524139(29-40)Online publication date: 12-Jun-2022
  • (2022)Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS52979.2021.00116(1183-1196)Online publication date: Feb-2022
  • (2020)Near Optimal Linear Algebra in the Online and Sliding Window Models2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS)10.1109/FOCS46700.2020.00055(517-528)Online publication date: Nov-2020

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media