Abstract: We present algorithms that create coresets in an online setting for clustering problems based on a wide subset of Bregman divergences. Notably, our coresets have a small additive error, similar in magnitude to the gap between expected and empirical loss (Bachem et. al. 2017), and take update time $O(d)$ for every incoming point where $d$ is the dimension of the point. Our first algorithm gives online coresets of size $\tilde{O}(\mbox{poly}(k,d,\epsilon,\mu))$ for $k$-clusterings according to any $\mu$-similar Bregman divergence. We further extend this algorithm to show the existence of non-parametric coresets, where the coreset size is independent of $k$, the number of clusters, for the same subclass of Bregman divergences. Our non-parametric coresets also function as coresets for non-parametric versions of the Bregman clustering like DP-Means. While these coresets provide additive error guarantees, they are significantly smaller for high dimensional data than the (relative-error) coresets obtained in (Bachem et. al 2015) for DP-Means--- for the input of size $n$ our coresets grow as $O(\log n)$ while being independent of $d$ as opposed to $O(d^d)$ for points in $\~R^d$ (Bachem et. al 2015). We also present experiments to compare the performance of our algorithms with other sampling techniques.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have updated the camera ready version with all the changes requested by the reviewers.
Assigned Action Editor: ~Raman_Arora1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 57
Loading