Export Citations
Save this search
Please login to be able to save your searches and receive alerts for new content matching your search criteria.
- research-articleMay 2024
Settling Time vs. Accuracy Tradeoffs for Clustering Big Data
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 3Article No.: 173, Pages 1–25https://doi.org/10.1145/3654976We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the ...
- research-articleJuly 2023
Lossy Kernelization of Same-Size Clustering
Theory of Computing Systems (TOCSYS), Volume 67, Issue 4Pages 785–824https://doi.org/10.1007/s00224-023-10129-9AbstractIn this work, we study the k-median clustering problem with an additional equal-size constraint on the clusters from the perspective of parameterized preprocessing. Our main result is the first lossy (2-approximate) polynomial kernel for this ...
- rapid-communicationAugust 2022
On the k-means/median cost function
Highlights- We try to understand how the optimal k-means cost behaves as a function of the k (the number of centers/clusters).
- We show that D 2 sampling is a useful method for designing pseudo-approximation algorithm and movement-based coreset for ...
In this work, we study the k-means cost function. Given a dataset X ⊆ R d and an integer k, the goal of the Euclidean k-means problem is to find a set of k centers C ⊆ R d such that Φ ( C , X ) ≡ ∑ x ∈ X min c ∈ C ‖ x − c ‖ 2 is minimized. Let ...
- ArticleJune 2022
Lossy Kernelization of Same-Size Clustering
AbstractIn this work, we study the k-median clustering problem with an additional equal-size constraint on the clusters, from the perspective of parameterized preprocessing. Our main result is the first lossy (2-approximate) polynomial kernel for this ...
- research-articleOctober 2015
Subsampled Exponential Mechanism: Differential Privacy in Large Output Spaces
AISec '15: Proceedings of the 8th ACM Workshop on Artificial Intelligence and SecurityPages 25–33https://doi.org/10.1145/2808769.2808776In the last several years, differential privacy has become the leading framework for private data analysis. It provides bounds on the amount that a randomized function can change as the result of a modification to one record of a database. This ...
- articleJanuary 2015
Probabilistic k-Median Clustering in Data Streams
Theory of Computing Systems (TOCSYS), Volume 56, Issue 1Pages 251–290https://doi.org/10.1007/s00224-014-9539-7The focus of our work is introducing and constructing probabilistic coresets. A probabilistic coreset can contain probabilistic points, and the number of these points should be polylogarithmic in the input size. However, the overall storage size is also ...
- research-articleSeptember 2010
Clustering for metric and nonmetric distance measures
ACM Transactions on Algorithms (TALG), Volume 6, Issue 4Article No.: 59, Pages 1–26https://doi.org/10.1145/1824777.1824779We study a generalization of the k-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set P of size n, our goal is to find a set C of size k such that the sum of errors D(P,C) = ∑p ∈ P minc ∈ C {D(p,c)} is minimized. The ...
- articleMarch 2007
A framework for statistical clustering with constant time approximation algorithms for K-median and K-means clustering
We consider a framework of sample-based clustering . In this setting, the input to a clustering algorithm is a sample generated i.i.d by some unknown arbitrary distribution. Based on such a sample, the algorithm has to output a clustering of the full ...