Optimal Sub-Gaussian Mean Estimation in Very High Dimensions

Author Details

Jasper C.H. Lee

University of Wisconsin-Madison, WI, USA

Paul Valiant

Purdue University, West Lafayette, IN, USA

Cite As

Jasper C.H. Lee and Paul Valiant. Optimal Sub-Gaussian Mean Estimation in Very High Dimensions. In 13th Innovations in Theoretical Computer Science Conference (ITCS 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 215, pp. 98:1-98:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.ITCS.2022.98

Abstract

We address the problem of mean estimation in very high dimensions, in the high-probability regime parameterized by failure probability δ. For a distribution with covariance Σ, let its "effective dimension" be d_eff = Tr(Σ)/λ_max(Σ). For the regime where d_eff = ω(log^2(1/δ)), we show the first algorithm whose sample complexity is optimal to within a 1+o(1) factor. The algorithm has a surprisingly simple structure: 1) re-center the samples using a known sub-Gaussian estimator, 2) carefully choose an easy-to-compute positive integer t and then remove the t samples farthest from the origin, and 3) return the sample mean of the remaining samples. The core of the analysis relies on a novel vector Bernstein-type tail bound, showing that under general conditions, the sample mean of a bounded high-dimensional distribution is highly concentrated around a spherical shell.
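The three-step structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several pieces are assumptions. The abstract does not say which "known sub-Gaussian estimator" is used, so a coordinate-wise median-of-means (the hypothetical helper `coordinatewise_median_of_means`) stands in for it here, even though it does not carry the same guarantees in high dimensions. The abstract also does not give the rule for choosing t, so `t` is simply passed in by the caller. The names `trimmed_mean_estimate` and `num_groups`, the reuse of the same samples for the initial estimate, and the shift back to the original coordinates at the end are likewise illustrative choices.

```python
import math
import statistics


def coordinatewise_median_of_means(samples, num_groups):
    """Stand-in initial estimate (assumption): per-coordinate median of group means."""
    dim = len(samples[0])
    # Split the samples into num_groups interleaved groups.
    groups = [samples[i::num_groups] for i in range(num_groups)]
    group_means = [
        [sum(x[j] for x in g) / len(g) for j in range(dim)] for g in groups
    ]
    return [statistics.median(m[j] for m in group_means) for j in range(dim)]


def trimmed_mean_estimate(samples, t, num_groups=5):
    """Re-center, drop the t samples farthest from the origin, average the rest."""
    n = len(samples)
    if not 0 <= t < n:
        raise ValueError("t must satisfy 0 <= t < number of samples")
    if not 1 <= num_groups <= n:
        raise ValueError("num_groups must be between 1 and the number of samples")
    dim = len(samples[0])

    # Step 1: re-center the samples around the initial estimate.
    center = coordinatewise_median_of_means(samples, num_groups)
    centered = [[x[j] - center[j] for j in range(dim)] for x in samples]

    # Step 2: order by squared Euclidean norm and drop the t farthest points.
    # The value of t is supplied by the caller; the paper's rule is not shown here.
    centered.sort(key=lambda v: math.fsum(c * c for c in v))
    kept = centered[: n - t]

    # Step 3: sample mean of the remaining points, shifted back by the center.
    return [center[j] + sum(v[j] for v in kept) / len(kept) for j in range(dim)]


# Nine identical points plus one far outlier; trimming one sample removes it.
samples = [[1.0, 2.0]] * 9 + [[1000.0, -1000.0]]
print(trimmed_mean_estimate(samples, t=1))  # [1.0, 2.0]
```

With `t=0` the same call reduces to the plain sample mean, which the single outlier pulls far from (1, 2). The toy data only illustrates the trimming step and says nothing about the sample-complexity guarantee claimed in the abstract.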

Subject Classification

ACM Subject Classification

Mathematics of computing → Nonparametric statistics
Mathematics of computing → Multivariate statistics
Theory of computation → Sample complexity and generalization bounds
Theory of computation → Streaming, sublinear and near linear time algorithms

Keywords

High-dimensional mean estimation

