DOI: 10.1145/3564246.3585206
research-article
Open access

A Moment-Matching Approach to Testable Learning and a New Characterization of Rademacher Complexity

Published: 02 June 2023

Abstract

A remarkable recent paper by Rubinfeld and Vasilyan (2022) initiated the study of testable learning, where the goal is to replace hard-to-verify distributional assumptions (such as Gaussianity) with efficiently testable ones and to require that the learner succeed whenever the unknown distribution passes the corresponding test. In this model, they gave an efficient algorithm for learning halfspaces under testable assumptions that are provably satisfied by Gaussians.
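
To make the model concrete, here is a schematic statement of the tester-learner guarantee sketched above; the notation is introduced here for illustration and is not taken verbatim from the paper. A tester $T$ and learner $A$ for a concept class $\mathcal{C}$ with target marginal $D^*$ (for instance, the standard Gaussian) should satisfy, for samples $S$ drawn from an unknown labeled distribution $D_{\mathcal{XY}}$ with marginal $D_{\mathcal{X}}$:

\[
\text{(soundness)}\quad \Pr\big[\,T(S) \text{ accepts and } \mathrm{err}_{D_{\mathcal{XY}}}(A(S)) > \mathrm{opt}(\mathcal{C}, D_{\mathcal{XY}}) + \epsilon\,\big] \le \delta,
\]
\[
\text{(completeness)}\quad D_{\mathcal{X}} = D^* \;\Longrightarrow\; \Pr\big[\,T(S) \text{ accepts}\,\big] \ge 1 - \delta,
\]

where $\mathrm{opt}(\mathcal{C}, D_{\mathcal{XY}})$ denotes the error of the best classifier in $\mathcal{C}$. Completeness forces the test to accept the target distribution, while soundness forbids it from certifying a distribution on which the learner would fail.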
In this paper we give a powerful new approach for developing algorithms for testable learning using tools from moment matching and metric distances in probability. We obtain efficient testable learners for any concept class that admits low-degree sandwiching polynomials, capturing most important examples for which we have ordinary agnostic learners. We recover the results of Rubinfeld and Vasilyan as a corollary of our techniques while achieving improved, near-optimal sample complexity bounds for a broad range of concept classes and distributions.
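
For readers unfamiliar with the terminology, the following is the standard notion of sandwiching polynomials, paraphrased from the pseudorandomness literature rather than quoted from this paper: a concept $f$ admits degree-$d$, $\epsilon$-sandwiching polynomials with respect to a target distribution $D^*$ if there exist degree-$d$ polynomials $p_{\mathrm{down}}$ and $p_{\mathrm{up}}$ such that

\[
p_{\mathrm{down}}(x) \;\le\; f(x) \;\le\; p_{\mathrm{up}}(x) \quad \text{for all } x,
\qquad
\mathbb{E}_{x \sim D^*}\big[p_{\mathrm{up}}(x) - p_{\mathrm{down}}(x)\big] \;\le\; \epsilon.
\]

Since expectations of degree-$d$ polynomials are determined by moments of degree at most $d$, any distribution whose low-degree moments approximately match those of $D^*$ behaves almost like $D^*$ with respect to such concepts, which is what makes a moment-matching tester natural in this setting.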
Surprisingly, we show that the information-theoretic sample complexity of testable learning is tightly characterized by the Rademacher complexity of the concept class, one of the most well-studied measures in statistical learning theory. In particular, uniform convergence is necessary and sufficient for testable learning. This leads to a fundamental separation from (ordinary) distribution-specific agnostic learning, where uniform convergence is sufficient but not necessary.
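
For reference, the Rademacher complexity in question is the standard quantity

\[
\mathfrak{R}_m(\mathcal{C}) \;=\; \mathbb{E}_{x_1,\dots,x_m \sim D}\,\mathbb{E}_{\sigma \sim \{\pm 1\}^m}\left[\, \sup_{f \in \mathcal{C}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right],
\]

and the classical uniform convergence bound (see, e.g., Bartlett and Mendelson [10]) states that, with high probability over an i.i.d. sample of size $m$, every $f \in \mathcal{C}$ has empirical error within roughly $2\mathfrak{R}_m(\mathcal{C})$, plus a lower-order confidence term, of its true error. The characterization above says that, information-theoretically, this same quantity governs testable learning, whereas ordinary distribution-specific agnostic learning can sometimes succeed with fewer samples than uniform convergence requires.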

References

[1]
Noga Alon, Oded Goldreich, and Yishay Mansour. 2003. Almost k-wise independence versus k-wise independence. Inform. Process. Lett., 88, 3 (2003), 107–110.
[2]
D. Bakry and Michel Émery. 1985. Diffusions hypercontractives. In Séminaire de probabilités, XIX, 1983/84 (Lecture Notes in Math., Vol. 1123). Springer, Berlin, 177–206. https://doi.org/10.1007/BFb0075847
[3]
Ainesh Bakshi, Ilias Diakonikolas, Samuel B. Hopkins, Daniel Kane, Sushrut Karmalkar, and Pravesh K. Kothari. 2020. Outlier-Robust Clustering of Gaussians and Other Non-Spherical Mixtures. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, NC, USA, November 16-19, 2020, Sandy Irani (Ed.). IEEE, 149–159. https://doi.org/10.1109/FOCS46700.2020.00023
[4]
Ainesh Bakshi and Pravesh Kothari. 2020. Outlier-Robust Clustering of Non-Spherical Mixtures. CoRR, abs/2005.02970 (2020). arXiv:2005.02970
[5]
Ainesh Bakshi and Pravesh K. Kothari. 2021. List-Decodable Subspace Recovery: Dimension Independent Error in Polynomial Time. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, Dániel Marx (Ed.). SIAM, 1279–1297. https://doi.org/10.1137/1.9781611976465.78
[6]
Peter L Bartlett. 2014. UC Berkeley CS281B/Stat241B: Statistical Learning Theory, Lecture 7. URL: https://www.stat.berkeley.edu/~bartlett/courses/2014spring-cs281bstat241b/lectures/07-notes.pdf
[7]
Peter L Bartlett, Stéphane Boucheron, and Gábor Lugosi. 2002. Model selection and error estimation. Machine Learning, 48, 1 (2002), 85–113.
[8]
Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. 2005. Local Rademacher complexities. The Annals of Statistics, 33, 4 (2005), 1497–1537.
[9]
Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. 2020. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117, 48 (2020), 30063–30070.
[10]
Peter L Bartlett and Shahar Mendelson. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3, Nov (2002), 463–482.
[11]
Peter L Bartlett, Andrea Montanari, and Alexander Rakhlin. 2021. Deep learning: a statistical viewpoint. Acta Numerica, 30 (2021), 87–201.
[12]
Louay MJ Bazzi. 2009. Polylogarithmic independence can fool DNF formulas. SIAM J. Comput., 38, 6 (2009), 2220–2272.
[13]
Mikhail Belkin. 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica, 30 (2021), 203–248.
[14]
Gyora M Benedek and Alon Itai. 1991. Learnability with respect to fixed distributions. Theoretical Computer Science, 86, 2 (1991), 377–389.
[15]
Lucien Birgé and Pascal Massart. 1997. From model selection to adaptive estimation. In Festschrift for Lucien Le Cam. Springer, 55–87.
[16]
Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. 2003. Introduction to statistical learning theory. In Summer school on machine learning. 169–207.
[17]
Mark Braverman. 2010. Polylogarithmic independence fools AC^0 circuits. Journal of the ACM (JACM), 57, 5 (2010), 1–10.
[18]
Nader H Bshouty and Christino Tamon. 1996. On the Fourier spectrum of monotone functions. Journal of the ACM (JACM), 43, 4 (1996), 747–770.
[19]
Amit Daniely. 2016. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing. 105–117.
[20]
Amit Daniely and Shai Shalev-Shwartz. 2016. Complexity theoretic limitations on learning DNF’s. In Conference on Learning Theory. 815–830.
[21]
Anindya De, Shivam Nadimpalli, and Rocco Servedio. 2022. Convex Influences. Innovations in Theoretical Computer Science.
[22]
Ilias Diakonikolas, Parikshit Gopalan, Ragesh Jaiswal, Rocco A Servedio, and Emanuele Viola. 2010. Bounded independence fools halfspaces. SIAM J. Comput., 39, 8 (2010), 3441–3462.
[23]
Ilias Diakonikolas, Daniel Kane, and Nikos Zarifis. 2020. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33 (2020), 13586–13596.
[24]
Ilias Diakonikolas, Daniel M Kane, and Jelani Nelson. 2010. Bounded independence fools degree-2 threshold functions. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science. 11–20.
[25]
Ilias Diakonikolas, Daniel M Kane, Thanasis Pittas, and Nikos Zarifis. 2021. The optimality of polynomial regression for agnostic learning under gaussian marginals in the SQ model. In Conference on Learning Theory. 1552–1584.
[26]
Vitaly Feldman, Venkatesan Guruswami, Prasad Raghavendra, and Yi Wu. 2012. Agnostic learning of monomials by halfspaces is hard. SIAM J. Comput., 41, 6 (2012), 1558–1590.
[27]
Yu R Gabovich. 1981. Stability of the characterization of the multivariate normal distribution in the Skitovich-Darmois theorem. Journal of Soviet Mathematics, 16, 5 (1981), 1341–1349.
[28]
Surbhi Goel, Aravind Gollakota, and Adam Klivans. 2020. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33 (2020), 2147–2158.
[29]
Venkatesan Guruswami and Prasad Raghavendra. 2009. Hardness of learning halfspaces with noise. SIAM J. Comput., 39, 2 (2009), 742–765.
[30]
Prahladh Harsha and Srikanth Srinivasan. 2019. On polynomial approximations to AC^0. Random Structures & Algorithms, 54, 2 (2019), 289–303.
[31]
Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. 2022. Surprises in high-dimensional ridgeless least squares interpolation. The Annals of Statistics, 50, 2 (2022), 949–986.
[32]
Misha Ivkov and Pravesh K. Kothari. 2022. List-decodable covariance estimation. In STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, Stefano Leonardi and Anupam Gupta (Eds.). ACM, 1276–1283. https://doi.org/10.1145/3519935.3520006
[33]
Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. 2008. Agnostically learning halfspaces. SIAM J. Comput., 37, 6 (2008), 1777–1805.
[34]
Daniel Kane, Adam Klivans, and Raghu Meka. 2013. Learning halfspaces under log-concave densities: Polynomial approximations and moment matching. In Conference on Learning Theory. 522–545.
[35]
Daniel M Kane. 2011. The Gaussian surface area and noise sensitivity of degree-d polynomial threshold functions. Computational Complexity, 20, 2 (2011), 389–412.
[36]
Sushrut Karmalkar, Adam R. Klivans, and Pravesh Kothari. 2019. List-decodable Linear Regression. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (Eds.). 7423–7432. https://proceedings.neurips.cc/paper/2019/hash/7f5fc754c7af0a6370c9bf91314e79f4-Abstract.html
[37]
Michael Kearns and Leslie Valiant. 1994. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41, 1 (1994), 67–95.
[38]
Michael J Kearns, Robert E Schapire, and Linda M Sellie. 1992. Toward efficient agnostic learning. In Proceedings of the fifth annual workshop on Computational learning theory. 341–352.
[39]
LB Klebanov and ST Rachev. 1996. Proximity of probability measures with common marginals in a finite number of directions. Lecture Notes-Monograph Series, 162–174.
[40]
Adam Klivans and Raghu Meka. 2013. Moment-matching polynomials. arXiv preprint arXiv:1301.0820.
[41]
Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. 2008. Learning geometric concepts via Gaussian surface area. In 2008 49th Annual IEEE Symposium on Foundations of Computer Science. 541–550.
[42]
Adam R Klivans and Alexander A Sherstov. 2009. Cryptographic hardness for learning intersections of halfspaces. J. Comput. System Sci., 75, 1 (2009), 2–12.
[43]
Vladimir Koltchinskii. 2001. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47, 5 (2001), 1902–1914.
[44]
Vladimir Koltchinskii. 2006. Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics, 34, 6 (2006), 2593–2656.
[45]
Vladimir Koltchinskii and Dmitriy Panchenko. 2000. Rademacher processes and bounding the risk of function learning. In High dimensional probability II. Springer, 443–457.
[46]
Pravesh K Kothari and Roi Livni. 2018. Improper learning by refuting. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018).
[47]
Gil Kur and Alexander Rakhlin. 2021. On the Minimal Error of Empirical Risk Minimization. In Conference on Learning Theory. 2849–2852.
[48]
Michel Ledoux. 2001. The Concentration of Measure Phenomenon (Mathematical Surveys and Monographs). American Mathematical Society.
[49]
Nathan Linial, Yishay Mansour, and Noam Nisan. 1993. Constant depth circuits, Fourier transform, and learnability. Journal of the ACM (JACM), 40, 3 (1993), 607–620.
[50]
Vaishnavh Nagarajan and J Zico Kolter. 2019. Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32 (2019).
[51]
Svetlozar T Rachev, Lev B Klebanov, Stoyan V Stoyanov, and Frank Fabozzi. 2013. The methods of distances in the theory of probability and statistics. Vol. 10. Springer.
[52]
Prasad Raghavendra and Morris Yau. 2020. List Decodable Learning via Sum of Squares. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020, Shuchi Chawla (Ed.). SIAM, 161–180. https://doi.org/10.1137/1.9781611975994.10
[53]
Prasad Raghavendra and Morris Yau. 2020. List Decodable Subspace Recovery. arxiv:2002.03004.
[54]
Herbert Robbins. 1955. A remark on Stirling’s formula. The American Mathematical Monthly, 62, 1 (1955), 26–29.
[55]
Ronitt Rubinfeld and Arsen Vasilyan. 2022. Testing distributional assumptions of learning algorithms. arXiv preprint arXiv:2204.07196v1 (April 2022).
[56]
Ronitt Rubinfeld and Arsen Vasilyan. 2022. Testing distributional assumptions of learning algorithms (v2). Private communication.
[57]
Adrien Saumard and Jon A Wellner. 2014. Log-concavity and strong log-concavity: a review. Statistics surveys, 8 (2014), 45.
[58]
Shai Shalev-Shwartz and Shai Ben-David. 2014. Understanding machine learning: From theory to algorithms. Cambridge University Press.
[59]
Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. 2010. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11 (2010), 2635–2670.
[60]
Alexander Shapiro. 2001. On duality theory of conic linear problems. In Semi-infinite programming. Springer, 135–165.
[61]
Avishay Tal. 2017. Tight bounds on the Fourier spectrum of AC^0. In 32nd Computational Complexity Conference (CCC 2017).
[62]
A.B. Tsybakov. 2008. Introduction to Nonparametric Estimation. Springer New York. ISBN 9780387790527.
[63]
Salil Vadhan. 2017. On learning vs. refutation. In Conference on Learning Theory. 1835–1848.
[64]
Sara A Van de Geer. 2000. Empirical Processes in M-estimation. Vol. 6. Cambridge University Press.
[65]
V.N. Vapnik. 1998. Statistical Learning Theory. Wiley. ISBN 9780471030034.
[66]
Roman Vershynin. 2018. High-dimensional probability: An introduction with applications in data science. Vol. 47. Cambridge University Press.
[67]
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2021. Understanding deep learning (still) requires rethinking generalization. Commun. ACM, 64, 3 (2021), 107–115.
[68]
Vladimir Mikhailovich Zolotarev. 1984. Probability metrics. Theory of Probability & Its Applications, 28, 2 (1984), 278–302.

Published In

STOC 2023: Proceedings of the 55th Annual ACM Symposium on Theory of Computing
June 2023
1926 pages
ISBN:9781450399135
DOI:10.1145/3564246
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. PAC learning
  2. Rademacher complexity
  3. generalization
  4. moment matching
  5. sandwiching polynomials

Qualifiers

  • Research-article

Conference

STOC '23

Acceptance Rates

Overall Acceptance Rate 1,469 of 4,586 submissions, 32%
