[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Free access

Optimal bounds between f-divergences and integral probability metrics

Published: 01 January 2021 Publication History

Abstract

The families of f-divergences (e.g. the Kullback-Leibler divergence) and Integral Probability Metrics (e.g. total variation distance or maximum mean discrepancies) are widely used to quantify the similarity between probability distributions. In this work, we systematically study the relationship between these two families from the perspective of convex duality. Starting from a tight variational representation of the f-divergence, we derive a generalization of the moment-generating function, which we show exactly characterizes the best lower bound of the f-divergence as a function of a given IPM. Using this characterization, we obtain new bounds while also recovering in a unified manner well-known results, such as Hoeffding's lemma, Pinsker's inequality and its extension to subgaussian functions, and the Hammersley-Chapman-Robbins bound. This characterization also allows us to prove new results on topological properties of the divergence which may be of independent interest.

References

[1]
R. Agrawal and T. Horel. Optimal Bounds between f-Divergences and Integral Probability Metrics. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), volume 119 of Proceedings of Machine Learning Research. PMLR, July 2020.
[2]
S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B, 28(1):131-142, 1966. ISSN 00359246.
[3]
Y. Altun and A. Smola. Unifying divergence minimization and statistical inference via convex duality. In G. Lugosi and H. U. Simon, editors, Learning Theory, pages 139-153, Berlin, Heidelberg, 2006. Springer. ISBN 978-3-540-35296-9.
[4]
M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm. Mutual information neural estimation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531-540, Stockholmsmässan, Stockholm Sweden, 10-15 Jul 2018. PMLR.
[5]
A. Ben-Tal and A. Charnes. A dual optimization framework for some problems of information theory and statistics. Technical report, Center for Cybernetic Studies, University of Texas, Austin, 1977.
[6]
A. Ben-Tal and A. Charnes. A dual optimization framework for some problems of information theory and statistics. Problems of Control and Information Theory. Problemy Upravlenija i Teorii Informacii, 8(5-6):387-401, 1979. ISSN 0370-2529.
[7]
C. Berg, J. P. R. Christensen, and P. Ressel. Introduction to Locally Convex Topological Vector Spaces and Dual Pairs, pages 1-15. Springer, New York, NY, 1984. ISBN 978-1-4612-1128-0.
[8]
S. G. Bobkov and F. Götze. Exponential integrability and transportation cost related to logarithmic Sobolev inequalities. Journal of Functional Analysis, 163(1):1-28, Apr. 1999. ISSN 0022-1236.
[9]
F. Bolley and C. Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Annales de la Faculté des sciences de Toulouse : Mathématiques, 14(3):331-352, 2005.
[10]
J. M. Borwein and A. S. Lewis. Duality relationships for entropy-like minimization problems. SIAM Journal on Control and Optimization, 29(2):325-338, 1991.
[11]
J. M. Borwein and A. S. Lewis. Partially-finite programming in L1 and the existence of maximum entropy estimates. SIAM Journal on Optimization, 3(2):248-267, 1993.
[12]
S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, 2013. ISBN 9780199535255.
[13]
N. Bourbaki. Topological Vector Spaces. Elements of Mathematics. Springer-Verlag, Berlin, 1987. ISBN 978-3-540-13627-9. Translated by H.G. Eggleston & S. Madan from Espaces vectoriels topologiques, Masson, Paris, 1981.
[14]
J. Bretagnolle and C. Huber. Estimation des densités: risque minimax. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 47(2):119-137, Jan 1979. ISSN 1432-2064.
[15]
A. Brøndsted. Conjugate convex functions in topological vector spaces. Matematisk-fysiske Meddelelser udgivet af det Kongelige Danske Videnskabernes Selskab, 34(2):27, 1964. ISSN 0023-3323.
[16]
M. Broniatowski and A. Keziou. Minimization of Φ-divergences on sets of signed measures. Studia Scientiarum Mathematicarum Hungarica, 43(4):403-442, 2006.
[17]
A. Carbery, M. Christ, and J. Wright. Multidimensional van der Corput and sublevel set estimates. Journal of the American Mathematical Society, 12(4):981-1015, 1999. ISSN 1088-6834.
[18]
I. Csiszár. Informationstheoretische Konvergenzbegriffe im Raum der Wahrschein-lichkeitsverteilungen. A Magyar Tudományos Akadémia. Matematikai Kutató Intézetének Közleményei, 7:137-158, 1962. ISSN 0541-9514.
[19]
I. Csiszár. über topologische und metrische Eigenschaften der relativen Information der Ordnung a. In Transactions of the Third Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, 1962, pages 63-73. Publishing House of the Czechoslovak Academy of Science, Prague, 1964.
[20]
I. Csiszár. On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica, 2:329-339, 1967.
[21]
I. Csiszár and F. Matúš. Information projections revisited. IEEE Transactions on Information Theory, 49(6):1474-1490, June 2003. ISSN 0018-9448.
[22]
I. Csiszár and F. Matúš. Generalized minimizers of convex integral functionals, Bregman distance, Pythagorean identities. Kybernetika, 48(4):637-689, 2012. ISSN 0023-5954.
[23]
I. Csiszár. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. A Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei, 8(1-2):85-108, 1963.
[24]
I. Csiszár. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar, 2:299-318, 1967.
[25]
I. Csiszár. I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3(1):146-158, 02 1975.
[26]
I. Csiszár, F. Gamboa, and E. Gassiat. MEM pixel correlated solutions for generalized moment and interpolation problems. IEEE Transactions on Information Theory, 45(7): 2253-2270, Nov. 1999.
[27]
M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time--III. Communications on Pure and Applied Mathematics, 29 (4):389-461, 1976.
[28]
R. M. Dudley. On sequential convergence. Transactions of the American Mathematical Society, 112(3):483-507, 1964. ISSN 0002-9947, 1088-6850.
[29]
R. M. Dudley. Consistency of M-Estimators and One-Sided Bracketing. In E. Eberlein, M. Hahn, and M. Talagrand, editors, High Dimensional Probability, pages 33-58. Birkhäuser Basel, Basel, 1998. ISBN 978-3-0348-9790-7 978-3-0348-8829-5.
[30]
G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, UAI'15, page 258-267, Arlington, Virginia, USA, 2015. AUAI Press. ISBN 9780996643108.
[31]
G. A. Edgar and L. Sucheston. On maximal inequalities in Orlicz spaces. In R. D. Mauldin, R. M. Shortt, and C. E. Silva, editors, Contemporary Mathematics, volume 94, pages 113-129. American Mathematical Society, Providence, Rhode Island, 1989. ISBN 978-0-8218-5099-2 978-0-8218-7682-4.
[32]
I. Ekeland and R. Témam. Convex Analysis and Variational Problems. Society for Industrial and Applied Mathematics, 1999.
[33]
A. A. Fedotov, P. Harremoës, and F. Topsøe. Refinements of Pinsker's inequality. IEEE Transactions on Information Theory, 49(6):1491-1498, June 2003.
[34]
G. L. Gilardoni. On the minimum f-divergence for given total variation. Comptes Rendus Mathematique, 343(11):763-766, 2006. ISSN 1631-073X.
[35]
G. L. Gilardoni. An improvement on Vajda's inequality. In V. Sidoravicius and M. E. Vares, editors, In and Out of Equilibrium 2, volume 60 of Progress in Probability, pages 299-304. Birkhäuser Basel, Basel, 2008. ISBN 978-3-7643-8786-0.
[36]
G. L. Gilardoni. On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences. IEEE Transactions on Information Theory, 56(11):5377-5386, Nov 2010.
[37]
G. Gorni. Conjugation and second-order properties of convex functions. Journal of Mathematical Analysis and Applications, 158(2):293-315, July 1991. ISSN 0022-247X.
[38]
N. Gozlan and C. Léonard. Transport inequalities. A survey. Markov Processes and Related Fields, 16(4):635-736, 2010. ISSN 1024-2953. URL http://math-mprf.org/journal/articles/id1224/.
[39]
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. J. Mach. Learn. Res., 13(25):723-773, Mar. 2012. ISSN 1532-4435.
[40]
A. Guntuboyina, S. Saha, and G. Schiebinger. Sharp inequalities for f-divergences. IEEE Transactions on Information Theory, 60(1):104-121, Jan 2014.
[41]
P. Harremoës. Information Topologies with Applications. In I. Csiszár, G. O. H. Katona, G. Tardos, and G. Wiener, editors, Entropy, Search, Complexity, volume 16, pages 113-150. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007. ISBN 978-3-540-32573-4. Series Title: Bolyai Society Mathematical Studies.
[42]
P. Harremoës and I. Vajda. On Pairs of f-Divergences and Their Joint Range. IEEE Transactions on Information Theory, 57(6):3230-3235, June 2011. ISSN 0018-9448, 1557-9654.
[43]
H. H. Herda. On non-symmetric modular spaces. Colloquium Mathematicum, 17(2):333-346, 1967. ISSN 0010-1354.
[44]
J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I, volume 305 of Grundlehren Der Mathematischen Wissenschaften. Springer Berlin Heidelberg, Berlin, Heidelberg, 1993. ISBN 978-3-642-08161-3 978-3-662-02796-7.
[45]
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963. ISSN 01621459.
[46]
A. D. Ioffe and V. M. Tikhomirov. On minimization of integral functionals. Functional Analysis and Its Applications, 3(3):218-227, Jul 1969. ISSN 1573-8485.
[47]
G. J. O. Jameson. Convex series. Mathematical Proceedings of the Cambridge Philosophical Society, 72(1):37-47, July 1972. ISSN 1469-8064, 0305-0041.
[48]
J. Jiao, Y. Han, and T. Weissman. Dependence measures bounding the exploration bias for general measurements. In 2017 IEEE International Symposium on Information Theory (ISIT), pages 1475-1479, June 2017.
[49]
J. H. B. Kemperman. On the optimum rate of transmitting information. The Annals of Mathematical Statistics, 40(6):2156-2177, 12 1969.
[50]
M. Khosravifard, D. Fooladivanda, and T. A. Gulliver. Exceptionality of the Variational Distance. In Proceedings of the 2006 IEEE Information Theory Workshop, pages 274-276, Chengdu, China, Oct. 2006. IEEE. ISBN 978-1-4244-0067-6 978-1-4244-0068-3.
[51]
M. Khosravifard, D. Fooladivanda, and T. A. Gulliver. Confliction of the Convexity and Metric Properties in f-Divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E90-A(9):1848-1853, Sept. 2007. ISSN 0916-8508.
[52]
J. Kisynski. Convergence du type L. Colloquium Mathematicum, 7(2):205-211, 1960. ISSN 0010-1354.
[53]
H. König. Theory and applications of superconvex spaces. In Aspects of Positivity in Functional Analysis (Tübingen, 1985), volume 122 of North-Holland Math. Stud., pages 79-118. North-Holland, Amsterdam, 1986.
[54]
S. Kullback. Information theory and statistics. Wiley, New York, 1959.
[55]
S. Kullback. A lower bound for discrimination information in terms of variation. IEEE Transactions on Information Theory, 13(1):126-127, January 1967.
[56]
S. Kullback and R. A. Leibler. On Information and Sufficiency. Annals of Mathematical Statistics, 22(1):79-86, Mar. 1951. ISSN 0003-4851, 2168-8990.
[57]
C. Léonard. Minimizers of energy functionals. Acta Mathematica Hungarica, 93(4):281-325, 2001a. ISSN 02365294.
[58]
C. Léonard. Minimization of Energy Functionals Applied to Some Inverse Problems. Applied Mathematics and Optimization, 44(3):273-297, Jan. 2001b. ISSN 0095-4616, 1432-0606.
[59]
C. Léonard. Orlicz Spaces. https://leonard.perso.math.cnrs.fr/papers/Leonard-Orlicz%20spaces.pdf, Apr. 2007. URL https://leonard.perso.math.cnrs.fr/papers/Leonard-Orlicz%20spaces.pdf.
[60]
V. L. Levin. Some properties of support functionals. Mathematical Notes of the Academy of Sciences of the USSR, 4(6):900-906, Dec. 1968. ISSN 0001-4346, 1573-8876.
[61]
W. Luxemburg and A. Zaanen. Conjugate spaces of Orlicz spaces. Indagationes Mathematicae (Proceedings), 59:217-228, 1956. ISSN 1385-7258.
[62]
K. Marton. A simple proof of the blowing-up lemma (corresp.). IEEE Transactions on Information Theory, 32(3):445-446, May 1986. ISSN 1557-9654.
[63]
J. J. Moreau. Sur la fonction polaire d'une fonction semi-continue supérieurement. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 258:1128-1130, 1964.
[64]
T. Morimoto. Markov processes and the H-theorem. Journal of the Physical Society of Japan, 18(3):328-331, Mar. 1963. ISSN 0031-9015, 1347-4073.
[65]
M. Morse and W. Transue. Functionals f Bilinear Over the Product A × B of Two Pseudo-Normed Vector Spaces: II. Admissible Spaces A. Annals of Mathematics, 51(3):576-614, 1950. ISSN 0003-486X.
[66]
A. Müller. Integral probability metrics and their generating classes of functions. Advances in Applied Probability, 29(2):429-443, 1997. ISSN 00018678.
[67]
J. Musielak. Orlicz Spaces and Modular Spaces, volume 1034 of Lecture Notes in Mathematics. Springer Berlin Heidelberg, Berlin, Heidelberg, 1983. ISBN 978-3-540-38692-6.
[68]
H. Nakano. Modulared Semi-Ordered Linear Spaces. Tokyo Mathematical Book Series,v. 1. Maruzen Co., Tokyo, 1950.
[69]
X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1089-1096. Curran Associates, Inc., 2008.
[70]
X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theor., 56(11):5847-5861, Nov. 2010. ISSN 0018-9448.
[71]
R. Nock, Z. Cranko, A. K. Menon, L. Qu, and R. C. Williamson. f-GANs in an information geometric nutshell. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, page 456-464, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964.
[72]
S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, page 271-279, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
[73]
M. S. Pinsker. Informatsiya i informatsionnaya ustoichivost' sluchainykh velichin i protsessov. Probl. Peredachi Inf., 7, 1960.
[74]
M. S. Pinsker. Information and Information Stability of Random Variables and Processes. Holden-Day, 1964. Translation of Pinsker (1960) by Amiel Feinstein.
[75]
M. M. Rao and Z. D. Ren. Theory of Orlicz Spaces. Number 146 in Monographs and Textbooks in Pure and Applied Mathematics. M. Dekker, New York, 1991. ISBN 978-0-8247-8478-2.
[76]
M. D. Reid and R. C. Williamson. Generalised Pinsker inequalities. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009, 2009.
[77]
M. D. Reid and R. C. Williamson. Information, divergence and risk for binary experiments. J. Mach. Learn. Res., 12:731-817, 2011. ISSN 1532-4435.
[78]
A. Rényi. On measures of entropy and information. In J. Neyman, editor, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume I, pages 547-561. University of California Press, 1961.
[79]
R. T. Rockafellar. Level sets and continuity of conjugate convex functions. Transactions of the American Mathematical Society, 123(1):46-63, 1966. ISSN 0002-9947, 1088-6850.
[80]
R. T. Rockafellar. Integrals which are convex functionals. Pacific J. Math., 24(3):525-539, 1968.
[81]
R. T. Rockafellar. Integrals which are convex functionals. II. Pacific J. Math., 39(2):439-469, 1971.
[82]
R. T. Rockafellar. Integral functionals, normal integrands and measurable selections. In J. P. Gossez, E. J. Lami Dozo, J. Mawhin, and L. Waelbroeck, editors, Nonlinear Operators and the Calculus of Variations, pages 157-207, Berlin, Heidelberg, 1976. Springer Berlin Heidelberg. ISBN 978-3-540-38075-7.
[83]
R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317 of Grundlehren Der Mathematischen Wissenschaften. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998a. ISBN 978-3-642-02431-3.
[84]
R. T. Rockafellar and R. J.-B. Wets. Measurability. In Rockafellar and Wets (1998a), chapter 14, pages 642-683. ISBN 978-3-642-02431-3.
[85]
A. Ruderman, M. Reid, D. García-García, and J. Petterson. Tighter variational representations of f-divergences via restriction to probability measures. In J. Langford and J. Pineau, editors, Proceedings of the 29th International Conference on Machine Learning (ICML '12), pages 671-678, New York, NY, USA, July 2012. Omnipress. ISBN 978-1-4503-1285-1.
[86]
D. Russo and J. Zou. How much does your data exploration overfit? Controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302-323, Jan 2020. ISSN 1557-9654.
[87]
I. N. Sanov. On the probability of large deviations of random variables. Mat. Sb. (N.S.), 42 (84):11-44, 1957.
[88]
I. Sason. On f-Divergences: Integral Representations, Local Behavior, and Inequalities. Entropy, 20(5):383, May 2018.
[89]
I. Sason and S. Verdú. f-Divergence Inequalities. IEEE Transactions on Information Theory, 62(11):5973-6006, Nov. 2016.
[90]
S. Simons. The occasional distributivity of o over e and the change of variable formula for conjugate functions. Nonlinear Analysis: Theory, Methods & Applications, 14(12): 1111-1120, Jan. 1990. ISSN 0362546X.
[91]
M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171-176, Mar. 1958. ISSN 0030-8730, 0030-8730.
[92]
B. K. Sriperumbudur, A. Gretton, K. Fukumizu, G. R. G. Lanckriet, and B. Schölkopf. A note on integral probability metrics and Φ-divergences. CoRR, abs/0901.2698v1, 2009.
[93]
B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. G. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6: 1550-1599, 2012.
[94]
E. M. Stein. Harmonic Analysis: Real-Variable Methods, Orthogonality, and Oscillatory Integrals. Number 43 in Princeton Mathematical Series. Princeton University Press, Princeton, NJ, 1993. ISBN 978-0-691-03216-0.
[95]
E. Szpilrajn. Remarques sur les fonctions complètement additives d'ensemble et sur les ensembles jouissant de la propriété de Baire. Fundamenta Mathematicae, 22(1):303-311, 1934. ISSN 0016-2736.
[96]
M. Teboulle and I. Vajda. Convergence of best Φ-entropy estimates. IEEE Transactions on Information Theory, 39(1):297-301, 1993.
[97]
I. Vajda. Note on discrimination information and variation. IEEE Transactions on Information Theory, 16(6):771-773, November 1970.
[98]
I. Vajda. On the f-divergence and singularity of probability measures. Periodica Mathematica Hungarica, 2(1):223-234, Mar 1972. ISSN 1588-2829.
[99]
I. Vajda. χα-divergence and generalized Fisher's information. In Transactions of the Sixth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pages 873-886. Academia, Prague, 1973.
[100]
M. Valadier. Intégration de convexes fermés notamment d'épigraphes. Inf-convolution continue. Rev. Française Informat. Recherche Opérationnelle, 4(Sér. R-2):57-73, 1970.
[101]
R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Number 47 in Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018. ISBN 978-1-108-41519-4.
[102]
C. Zălinescu. Convex Analysis in General Vector Spaces. World Scientific, River Edge, N.J.; London, 2002. ISBN 978-981-238-067-8.

Index Terms

  1. Optimal bounds between f-divergences and integral probability metrics
              Index terms have been assigned to the content through auto-classification.

              Recommendations

              Comments

              Please enable JavaScript to view thecomments powered by Disqus.

              Information & Contributors

              Information

              Published In

              cover image The Journal of Machine Learning Research
              The Journal of Machine Learning Research  Volume 22, Issue 1
              January 2021
              13310 pages
              ISSN:1532-4435
              EISSN:1533-7928
              Issue’s Table of Contents
              CC-BY 4.0

              Publisher

              JMLR.org

              Publication History

              Accepted: 01 May 2021
              Revised: 01 May 2021
              Published: 01 January 2021
              Received: 01 August 2020
              Published in JMLR Volume 22, Issue 1

              Author Tags

              1. f-divergence
              2. integral probability metrics
              3. probability inequalities
              4. convex analysis
              5. convergence of measures

              Qualifiers

              • Research-article

              Contributors

              Other Metrics

              Bibliometrics & Citations

              Bibliometrics

              Article Metrics

              • 0
                Total Citations
              • 170
                Total Downloads
              • Downloads (Last 12 months)71
              • Downloads (Last 6 weeks)6
              Reflects downloads up to 05 Jan 2025

              Other Metrics

              Citations

              View Options

              View options

              PDF

              View or Download as a PDF file.

              PDF

              eReader

              View online with eReader.

              eReader

              Login options

              Full Access

              Media

              Figures

              Other

              Tables

              Share

              Share

              Share this Publication link

              Share on social media