[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Next Article in Journal
Nonequilibrium Thermodynamics and Scale Invariance
Next Article in Special Issue
Information Submanifold Based on SPD Matrices and Its Applications to Sensor Networks
Previous Article in Journal
Packer Detection for Multi-Layer Executables Using Entropy Analysis
Previous Article in Special Issue
Witnessing Multipartite Entanglement by Detecting Asymmetry
You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

On Hölder Projective Divergences

1
Computer Science Department LIX, École Polytechnique, 91128 Palaiseau Cedex, France
2
Sony Computer Science Laboratories Inc., Tokyo 141-0022, Japan
3
Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
4
Computer Vision and Multimedia Laboratory (Viper), University of Geneva, CH-1211 Geneva, Switzerland
*
Author to whom correspondence should be addressed.
Entropy 2017, 19(3), 122; https://doi.org/10.3390/e19030122
Submission received: 20 January 2017 / Revised: 8 March 2017 / Accepted: 10 March 2017 / Published: 16 March 2017
(This article belongs to the Special Issue Information Geometry II)
Graphical abstract
">
Figure 1
<p>Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (tri-parametric) encompass Cauchy–Schwarz divergence and skew Bhattacharyya divergence.</p> ">
Figure 2
<p>First row: the Hölder pseudo divergence (HPD) <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> for <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>}</mo> </mrow> </semantics> </math>, KL divergence and reverse KL divergence. Remaining rows: the HD <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mi>γ</mi> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> for <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1.5</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from top to bottom) and <math display="inline"> <semantics> <mrow> <mi>γ</mi> <mo>∈</mo> <mo>{</mo> <mn>0.5</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>5</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from left to right). The reference distribution <math display="inline"> <semantics> <msub> <mi>p</mi> <mi>r</mi> </msub> </semantics> </math> is presented as “★”. The minimizer of <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>, if different from <math display="inline"> <semantics> <msub> <mi>p</mi> <mi>r</mi> </msub> </semantics> </math>, is presented as “•”. Notice that <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>2</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mo>=</mo> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> </mrow> </semantics> </math>. (<b>a</b>) Reference categorical distribution <math display="inline"> <semantics> <mrow> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>=</mo> <mrow> <mo stretchy="false">(</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>; (<b>b</b>) reference categorical distribution <math display="inline"> <semantics> <mrow> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>=</mo> <mrow> <mo stretchy="false">(</mo> <mn>1</mn> <mo>/</mo> <mn>2</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>6</mn> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>.</p> ">
Figure 2 Cont.
<p>First row: the Hölder pseudo divergence (HPD) <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> for <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>}</mo> </mrow> </semantics> </math>, KL divergence and reverse KL divergence. Remaining rows: the HD <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mi>γ</mi> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> for <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1.5</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from top to bottom) and <math display="inline"> <semantics> <mrow> <mi>γ</mi> <mo>∈</mo> <mo>{</mo> <mn>0.5</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>5</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from left to right). The reference distribution <math display="inline"> <semantics> <msub> <mi>p</mi> <mi>r</mi> </msub> </semantics> </math> is presented as “★”. The minimizer of <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>, if different from <math display="inline"> <semantics> <msub> <mi>p</mi> <mi>r</mi> </msub> </semantics> </math>, is presented as “•”. Notice that <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>2</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mo>=</mo> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> </mrow> </semantics> </math>. (<b>a</b>) Reference categorical distribution <math display="inline"> <semantics> <mrow> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>=</mo> <mrow> <mo stretchy="false">(</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>; (<b>b</b>) reference categorical distribution <math display="inline"> <semantics> <mrow> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>=</mo> <mrow> <mo stretchy="false">(</mo> <mn>1</mn> <mo>/</mo> <mn>2</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1</mn> <mo>/</mo> <mn>6</mn> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>.</p> ">
Figure 3
<p>First row: <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>, where <math display="inline"> <semantics> <msub> <mi>p</mi> <mi>r</mi> </msub> </semantics> </math> is the standard Gaussian distribution and <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>}</mo> </mrow> </semantics> </math> compared to the KL divergence. The rest of the rows: <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mi>α</mi> <mo>,</mo> <mi>γ</mi> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mrow> <mo stretchy="false">(</mo> <msub> <mi>p</mi> <mi>r</mi> </msub> <mo>:</mo> <mi>p</mi> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> for <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>∈</mo> <mo>{</mo> <mn>4</mn> <mo>/</mo> <mn>3</mn> <mo>,</mo> <mn>1.5</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>4</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from top to bottom) and <math display="inline"> <semantics> <mrow> <mi>γ</mi> <mo>∈</mo> <mo>{</mo> <mn>0.5</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>2</mn> <mo>,</mo> <mn>5</mn> <mo>,</mo> <mn>10</mn> <mo>}</mo> </mrow> </semantics> </math> (from left to right). Notice that <math display="inline"> <semantics> <mrow> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>2</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> <mo>=</mo> <msubsup> <mi>D</mi> <mrow> <mn>2</mn> <mo>,</mo> <mn>1</mn> <mo>,</mo> <mn>1</mn> </mrow> <mi mathvariant="monospace">H</mi> </msubsup> </mrow> </semantics> </math>. The coordinate system is formed by <span class="html-italic">μ</span> (mean) and <span class="html-italic">σ</span> (standard deviation).</p> ">
Figure 4
<p>Variational <span class="html-italic">k</span>-means clustering results on a toy dataset consisting of a set of 2D Gaussians organized into two or three clusters. The cluster centroids are represented by contour plots using the same density levels. (<b>a</b>) <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>=</mo> <mi>γ</mi> <mo>=</mo> <mn>1.1</mn> </mrow> </semantics> </math> (Hölder clustering); (<b>b</b>) <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>=</mo> <mi>γ</mi> <mo>=</mo> <mn>2</mn> </mrow> </semantics> </math> (Cauchy–Schwarz clustering); (<b>c</b>) <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>=</mo> <mi>γ</mi> <mo>=</mo> <mn>1.1</mn> </mrow> </semantics> </math> (Hölder clustering); (<b>d</b>) <math display="inline"> <semantics> <mrow> <mi>α</mi> <mo>=</mo> <mi>γ</mi> <mo>=</mo> <mn>2</mn> </mrow> </semantics> </math> (Cauchy–Schwarz clustering).</p> ">
Versions Notes

Abstract

:
We describe a framework to build distances by measuring the tightness of inequalities and introduce the notion of proper statistical divergences and improper pseudo-divergences. We then consider the Hölder ordinary and reverse inequalities and present two novel classes of Hölder divergences and pseudo-divergences that both encapsulate the special case of the Cauchy–Schwarz divergence. We report closed-form formulas for those statistical dissimilarities when considering distributions belonging to the same exponential family provided that the natural parameter space is a cone (e.g., multivariate Gaussians) or affine (e.g., categorical distributions). Those new classes of Hölder distances are invariant to rescaling and thus do not require distributions to be normalized. Finally, we show how to compute statistical Hölder centroids with respect to those divergences and carry out center-based clustering toy experiments on a set of Gaussian distributions which demonstrate empirically that symmetrized Hölder divergences outperform the symmetric Cauchy–Schwarz divergence.

Graphical Abstract">

Graphical Abstract

1. Introduction: Inequality, Proper Divergence and Improper Pseudo-Divergence

1.1. Statistical Divergences from Inequality Gaps

An inequality [1] is denoted mathematically by lhs rhs , where lhs and rhs denote respectively the left-hand-side and right-hand-side of the inequality. One can build dissimilarity measures from inequalities lhs rhs by measuring the inequality tightness: For example, we may quantify the tightness of an inequality by its difference gap:
Δ = rhs lhs 0 .
When lhs > 0 , the inequality tightness can also be gauged by the log-ratio gap:
D = log rhs lhs = log lhs rhs 0 .
We may further compose this inequality tightness value measuring non-negative gaps with a strictly monotonically increasing function f (with f ( 0 ) = 0 ).
A bi-parametric inequality lhs ( p , q ) rhs ( p , q ) is called proper if it is strict for p q (i.e., lhs ( p , q ) < rhs ( p , q ) , p q ) and tight if and only if (iff) p = q (i.e., lhs ( p , q ) = rhs ( p , q ) , p = q ). Thus a proper bi-parametric inequality allows one to define dissimilarities such that D ( p , q ) = 0 iff p = q . Such a dissimilarity is called proper. Otherwise, an inequality or dissimilarity is said to be improper. Note that there are many equivalent words used in the literature instead of (dis-)similarity: distance (although often assumed to have metric properties; here, we used the notion of distance as a dissimilarity that may be asymmetric), pseudo-distance, discrimination, proximity, information deviation, etc.
A statistical dissimilarity between two discrete or continuous distributions p ( x ) and q ( x ) on a support X can thus be defined from inequalities by summing up or taking the integral for the inequalities instantiated on the observation space X :
x X , D x ( p , q ) = rhs ( p ( x ) , q ( x ) ) lhs ( p ( x ) , q ( x ) ) D ( p , q ) = x X rhs ( p ( x ) , q ( x ) ) lhs ( p ( x ) , q ( x ) ) discrete   case , X rhs ( p ( x ) , q ( x ) ) lhs ( p ( x ) , q ( x ) ) d x continuous   case .
In such a case, we get a separable divergence by construction. Some non-separable inequalities induce a non-separable divergence. For example, the renowned Cauchy–Schwarz divergence [2] is not separable because in the inequality:
X p ( x ) q ( x ) d x X p ( x ) 2 d x X q ( x ) 2 d x ,
the rhs is not separable.
Furthermore, a proper dissimilarity is called a divergence in information geometry [3] when it is C 3 (i.e., three times differentiable, thus allowing one to define a metric tensor [4] and a cubic tensor [3]).
Many familiar distances can be reinterpreted as inequality gaps in disguise. For example, Bregman divergences [5] and Jensen divergences [6] (also called Burbea–Rao divergences [7,8]) can be reinterpreted as inequality difference gaps and the Cauchy–Schwarz distance [2] as an inequality log-ratio gap:
Example 1 (Bregman divergence as a Bregman score-induced gap divergence).
A proper score function [9] S ( p : q ) induces a gap divergence D ( p : q ) = S ( p : q ) S ( p : p ) 0 . A Bregman divergence [5] B F ( p : q ) for a strictly convex and differentiable real-valued generator F ( x ) is induced by the Bregman score S F ( p : q ) . Let S F ( p : q ) = F ( q ) p q , F ( q ) denote the Bregman proper score minimized for p = q . Then, the Bregman divergence is a gap divergence: B F ( p : q ) = S F ( p : q ) S F ( p : p ) 0 . When F is strictly convex, the Bregman score is proper, and the Bregman divergence is proper.
Example 2 (Cauchy–Schwarz distance as a log-ratio gap divergence).
Consider the Cauchy–Schwarz inequality X p ( x ) q ( x ) d x X p ( x ) 2 d x X q ( x ) 2 d x . Then, the Cauchy–Schwarz distance [2] between two continuous distributions is defined by CS ( p : q ) = log X p ( x ) q ( x ) d x ( X p ( x ) 2 d x ) ( X q ( x ) 2 d x ) 0 .
Note that we use the modern notation D ( p : q ) to emphasize that the divergence is potentially asymmetric: D ( p : q ) D ( q : p ) ; see [3]. In information theory [10], the older notation “ ” is often used instead of the “:” that is used in information geometry [3].
To conclude this introduction, let us finally introduce the notion of projective statistical distances. A statistical distance D ( p : q ) is said to be projective when it is invariant to scaling of p ( x ) and q ( x ) , that is,
D ( λ p : λ q ) = D ( p : q ) , λ , λ > 0 .
The Cauchy–Schwarz distance is a projective divergence. Another example of such a projective divergence is the parametric γ-divergence [11].
Example 3 (γ-divergence as a projective score-induced gap divergence).
The γ-divergence [11,12] D γ ( p : q ) for γ > 0 is projective:
D γ ( p : q ) = S γ ( p : q ) S γ ( p : p ) , w i t h S γ ( p : q ) = 1 γ ( 1 + γ ) p ( x ) q ( x ) γ d x q ( x ) 1 + γ d x γ 1 + γ .
The γ-divergence is related to the proper pseudo-spherical score [11].
The γ-divergences have been proven useful for robust statistical inference [11] in the presence of heavy outlier contamination. In general, bi-parametric homogeneous inequalities yield corresponding log-ratio projective divergences: Let lhs ( p : q ) and rhs ( p : q ) be homogeneous functions of degree k N (i.e., lhs ( λ p : λ q ) = ( λ λ ) k lhs ( p : q ) and rhs ( λ p : λ q ) = ( λ λ ) k rhs ( p : q ) ); then, it comes that:
D ( λ p : λ q ) = log lhs ( λ p : λ q ) rhs ( λ p : λ q ) = log ( λ λ ) k lhs ( p : q ) ( λ λ ) k rhs ( p : q ) = log lhs ( p : q ) rhs ( p : q ) = D ( p : q ) .
For example, Hölder and Cauchy–Schwarz inequalities are homogeneous inequalities of degree one that yield projective log-ratio divergences.
There are many works studying classes of (statistical) divergences and their properties. For example, Zhang [13] studied the relationships between divergences, duality and convex analysis by defining the class of divergences:
D F ( α ) ( p : q ) = 4 1 α 2 1 α 2 F ( p ) + 1 + α 2 F ( q ) F 1 α 2 p + 1 + α 2 q , α 1 ,
for a real-valued convex generator function F. Interestingly, this divergence can be interpreted as a gap divergence derived from the Jensen convex inequality:
1 α 2 F ( p ) + 1 + α 2 F ( q ) F 1 α 2 p + 1 + α 2 q .
This work is further extended in [14] where Zhang stresses the two different types of duality in information geometry: the referential duality and the representational duality (with the study of the ( ρ , τ ) -geometry for monotone embeddings).
It is well-known that Rényi divergence generalizes the Kullback–Leibler divergence: Rényi divergence is induced by Rényi entropy, which generalizes Shannon entropy, while keeping the important feature of being additive. Another generalization of Shannon entropy is Tsallis entropy, which is non-additive in general and allows one to define the Tsallis divergence. Both the Rényi and Tsallis entropies can be unified by the biparametric family of Sharma–Mittal entropies [15], and the corresponding Sharma–Mittal divergences can be defined. There are many ways to extend the definitions of Sharma–Mittal divergences. For example, in [16], a generalization of Rényi divergences is proposed, and its induced geometry is investigated.

1.2. Pseudo-Divergences and the Axiom of Indiscernibility

Consider a broader class of statistical pseudo-divergences based on improper inequalities, where the tightness of an inequality lhs ( p , q ) rhs ( p , q ) does not imply that p = q . This family of dissimilarity measures has interesting properties that have not been studied before.
Formally, statistical pseudo-divergences are defined with respect to density measures p ( x ) and q ( x ) with x X , where X denotes the support. By definition, pseudo-divergences satisfy the following three fundamental properties:
  • Non-negativeness: D ( p : q ) 0 for any p ( x ) , q ( x ) ;
  • Reachable indiscernibility:
    • p ( x ) , there exists q ( x ) , such that D ( p : q ) = 0 ,
    • q ( x ) , there exists p ( x ) , such that D ( p : q ) = 0 .
  • Positive correlation: if D ( p : q ) = 0 , then p ( x 1 ) p ( x 2 ) q ( x 1 ) q ( x 2 ) 0 for any x 1 , x 2 X .
As compared to statistical divergence measures, such as the Kullback–Leibler (KL) divergence:
KL ( p : q ) = X p ( x ) log p ( x ) q ( x ) d x ,
pseudo-divergences do not require D ( p : p ) = 0 . Instead, any pair of distributions p ( x ) and q ( x ) with D ( p : q ) = 0 only have to be “positively correlated” such that p ( x 1 ) p ( x 2 ) implies q ( x 1 ) q ( x 2 ) , and vice versa. Any divergence with D ( p : q ) = 0 p ( x ) = q ( x ) (law of indiscernibles) automatically satisfies this weaker condition, and therefore, any divergence belongs to the broader class of pseudo-divergences. Indeed, if p ( x ) = q ( x ) , then ( p ( x 1 ) p ( x 2 ) ) ( q ( x 1 ) q ( x 2 ) ) = ( p ( x 1 ) p ( x 2 ) ) 2 0 . However, the converse is not true. As we shall describe in the remainder, the family of pseudo-divergences is not limited to proper divergence measures. In the remainder, the term “pseudo-divergence” refers to such divergences that are not proper divergence measures.
We study two novel statistical dissimilarity families: one family of statistical improper pseudo-divergences and one family of proper statistical divergences. Within the class of pseudo-divergences, this work concentrates on defining a tri-parametric family of dissimilarities called Hölder log-ratio gap divergence that we concisely abbreviate as HPD for “Hölder pseudo divergence” in the remainder. We also study its proper divergence counterpart termed HD for “Hölder divergence”.

1.3. Prior Work and Contributions

The term “Hölder divergence” was first coined in 2014 based on the definition of the Hölder score [17,18]: The score-induced Hölder divergence D ( p : q ) is a proper gap divergence that yields a scale-invariant divergence. Let p a , σ ( x ) = a σ p ( σ x ) for a , σ > 0 be a transformation. Then, a scale-invariant divergence satisfies D ( p a , σ : q a , σ ) = κ ( a , σ ) D ( p : q ) for a function κ ( a , σ ) > 0 . This gap divergence is proper since it is based on the so-called Hölder score, but is not projective and does not include the Cauchy–Schwarz divergence. Due to these differences, the Hölder log-ratio gap divergence introduced here shall not be confused with the Hölder gap divergence induced by the Hölder score that relies both on a scalar γ and a function ϕ ( · ) .
We shall introduce two novel families of log-ratio projective gap divergences based on Hölder ordinary (or forward) and reverse inequalities that extend the Cauchy–Schwarz divergence, study their properties and consider as an application clustering Gaussian distributions: We experimentally show better clustering results when using symmetrized Hölder divergences than using the Cauchy–Schwarz divergence. To contrast with the “Hölder composite score-induced divergences” of [18], our Hölder divergences admit closed-form expressions between distributions belonging to the same exponential families [19] provided that the natural parameter space is a cone or affine.
Our main contributions are summarized as follows:
  • Define the tri-parametric family of Hölder improper pseudo-divergences (HPDs) in Section 2 and the bi-parametric family of Hölder proper divergences in Section 3 (HDs) for positive and probability measures, and study their properties (including their relationships with skewed Bhattacharyya distances [8] via escort distributions);
  • Report closed-form expressions of those divergences for exponential families when the natural parameter space is a cone or affine (including, but not limited to the cases of categorical distributions and multivariate Gaussian distributions) in Section 4;
  • Provide approximation techniques to compute those divergences between mixtures based on log-sum-exp inequalities in Section 4.6;
  • Describe a variational center-based clustering technique based on the convex-concave procedure for computing Hölder centroids, and report our experimental results in Section 5.

1.4. Organization

This paper is organized as follows: Section 2 introduces the definition and properties of Hölder pseudo-divergences (HPDs). It is followed by Section 3 that describes Hölder proper divergences (HDs). In Section 4, closed-form expressions for those novel families of divergences are reported for the categorical, multivariate Gaussian, Bernoulli, Laplace and Wishart distributions. Section 5 defines Hölder statistical centroids and presents a variational k-means clustering technique: we show experimentally that using Hölder divergences improves clustering quality over the Cauchy–Schwarz divergence. Finally, Section 6 concludes this work and hints at further perspectives from the viewpoint of statistical estimation and manifold learning. In Appendix A, we recall the proof of the ordinary and reverse Hölder’s inequalities.

2. Hölder Pseudo-Divergence: Definition and Properties

Hölder’s inequality (see [20,21] and Appendix A for a proof) states for positive real-valued functions p ( x ) and q ( x ) defined on the support X that:
X p ( x ) q ( x ) d x X p ( x ) α d x 1 α X q ( x ) β d x 1 β ,
where exponents α and β satisfy α β > 0 , as well as the exponent conjugacy condition: 1 α + 1 β = 1 . In a more general form, Hölder’s inequality holds for any real and complex valued functions. In this work, we only focus on real positive functions that are densities of positive measures. We also write β = α ¯ = α α 1 , meaning that α and β are conjugate Hölder exponents. We check that α > 1 and β > 1 . Hölder inequality holds even if the lhs is infinite (meaning that the integral diverges), since the rhs is also infinite in that case.
The reverse Hölder inequality holds for conjugate exponents 1 α + 1 β = 1 with α β < 0 (then 0 < α < 1 and β < 0 , or α < 0 and 0 < β < 1 ):
X p ( x ) q ( x ) d x X p ( x ) α d x 1 α X q ( x ) β d x 1 β .
Both Hölder’s inequality and the reverse Hölder inequality turn tight when p ( x ) α q ( x ) β (see proof in Appendix A).

2.1. Definition

Let ( X , F , μ ) be a measurable space where μ is the Lebesgue measure, and let L γ ( X , μ ) denote the Lebesgue space of functions that have their γ-th power of absolute value Lebesgue integrable, for any γ > 0 (when γ 1 , L γ ( X , μ ) is a Banach space). We define the following pseudo-divergence:
Definition 1 (Hölder statistical pseudo-divergence).
For conjugate exponents α and β with α β > 0 and σ , τ > 0 , the Hölder pseudo-divergence (HPD) between two densities p ( x ) L α σ ( X , μ ) and q ( x ) L β τ ( X , μ ) of positive measures absolutely continuous with respect to (w.r.t.) μ is defined by the following log-ratio gap:
D α , σ , τ H ( p : q ) = log X p ( x ) σ q ( x ) τ d x X p ( x ) α σ d x 1 α X q ( x ) β τ d x 1 β .
When 0 < α < 1 and β = α ¯ = α α 1 < 0 , or α < 0 and 0 < β < 1 , and σ , τ > 0 , the reverse HPD is defined by:
D α , σ , τ H ( p : q ) = log X p ( x ) σ q ( x ) τ d x X p ( x ) α σ d x 1 α X q ( x ) β τ d x 1 β .
By Hölder’s inequality and the reverse Hölder inequality, D α , σ , τ H ( p : q ) 0 with D α , σ , τ H ( p : q ) = 0 iff p ( x ) α σ q ( x ) β τ or equivalently q ( x ) p ( x ) α σ β τ = p ( x ) σ τ ( α 1 ) . When α > 1 , x σ τ ( α 1 ) is monotonically increasing, and D α , σ , τ H is indeed a pseudo-divergence. However, the reverse HPD is not a pseudo-divergence because x σ τ ( α 1 ) will be monotonically decreasing if α < 0 or 0 < α < 1 . Therefore, we only consider HPD with α > 1 in the remainder, and leave here the notion of reverse Hölder divergence for future studies.
When α = β = 2 , σ = τ = 1 , the HPD becomes the Cauchy–Schwarz divergence CS [22]:
D 2 , 1 , 1 H ( p : q ) = CS ( p : q ) = log X p ( x ) q ( x ) d x X p ( x ) 2 d x 1 2 X q ( x ) 2 d x 1 2 ,
which has been proven useful to get closed-form divergence formulas between mixtures of exponential families with conic or affine natural parameter spaces [23].
The Cauchy–Schwarz divergence is proper for probability densities since the Cauchy–Schwarz inequality becomes an equality iff q ( x ) = λ p ( x ) σ τ ( α 1 ) = λ p ( x ) implying that λ = X λ p ( x ) d x = X q ( x ) d x = 1 . It is however not proper for positive densities.
Fact 1 (CS is only proper for probability densities).
The Cauchy–Schwarz divergence CS ( p : q ) is proper for square-integrable probability densities p ( x ) , q ( x ) L 2 ( X , μ ) , but not proper for positive square-integrable densities.

2.2. Properness and Improperness

In the general case, the divergence D α , σ , τ H is not even proper for normalized (probability) densities, not to mention general unnormalized (positive) densities. Indeed, when p ( x ) = q ( x ) , we have:
D α , σ , τ H ( p : p ) = log p ( x ) σ + τ d x p ( x ) α σ d x 1 α p ( x ) β τ d x 1 β 0   when   α σ β τ .
Let us consider the general case. For unnormalized positive distributions p ˜ ( x ) and q ˜ ( x ) (the tilde notation stems from the notation of homogeneous coordinates in projective geometry), the inequality becomes an equality when: p ˜ ( x ) α σ q ˜ ( x ) β τ , i.e., p ( x ) α σ q ( x ) β τ , or q ( x ) p ( x ) α σ / α ¯ τ = p ( x ) σ τ ( α 1 ) . We can check that D α , σ , τ H ( p : λ p σ τ ( α 1 ) ) = 0 for any λ > 0 :
log p ( x ) σ λ τ p ( x ) σ ( α 1 ) d x p ( x ) α σ d x 1 α λ β τ p ( x ) ( α 1 ) β σ d x 1 β = log λ τ p ( x ) α σ d x p ( x ) α σ d x 1 α λ β τ p ( x ) α σ d x 1 β = 0 ,
since ( α 1 ) β = ( α 1 ) α ¯ = ( α 1 ) α α 1 = α .
Fact 2 (HPD is improper).
The Hölder pseudo-divergences are improper statistical distances.

2.3. Reference Duality

In general, Hölder divergences are asymmetric when α β ( 2 ) or σ τ , but enjoy the following reference duality [24]:
D α , σ , τ H ( p : q ) = D β , τ , σ H ( q : p ) = D α α 1 , τ , σ H ( q : p ) .
Fact 3 (Reference duality HPD).
The Hölder pseudo-divergences satisfy the reference duality β = α ¯ = α α 1 : D α , σ , τ H ( p : q ) = D β , τ , σ H ( q : p ) = D α α 1 , τ , σ H ( q : p ) .
An arithmetic symmetrization of the HPD yields a symmetric HPD S α , σ , τ H , given by:
S α , σ , τ H ( p : q ) = S α , σ , τ H ( q : p ) = D α , σ , τ H ( p : q ) + D α , σ , τ H ( q : p ) 2 , = 1 2 log p ( x ) σ q ( x ) τ d x p ( x ) τ q ( x ) σ d x p ( x ) α σ d x 1 α p ( x ) β τ d x 1 β q ( x ) α σ d x 1 α q ( x ) β τ d x 1 β .

2.4. HPD is a Projective Divergence

In the above definition, densities p ( x ) and q ( x ) can either be positive or normalized probability distributions. Let p ˜ ( x ) and q ˜ ( x ) denote positive (not necessarily normalized) measures, and w ( p ˜ ) = X p ˜ ( x ) d x the overall mass so that p ( x ) = p ˜ ( x ) w ( p ˜ ) is the corresponding normalized probability measure. Then, we check that HPD is a projective divergence [11] since:
D α , σ , τ H ( p ˜ : q ˜ ) = D α , σ , τ H ( p : q ) ,
or in general:
D α , σ , τ H ( λ p : λ q ) = D α , σ , τ H ( p : q )
for all prescribed constants λ , λ > 0 . Projective divergences may also be called “angular divergences” or “cosine divergences”, since they do not depend on the total mass of the density measures.
Fact 4 (HPD is projective).
The Hölder pseudo-divergences are projective distances.

2.5. Escort Distributions and Skew Bhattacharyya Divergences

Let us define with respect to the probability measures p ( x ) L 1 α ( X , μ ) and q ( x ) L 1 β ( X , μ ) the following escort probability distributions [3]:
p α E ( x ) = p ( x ) 1 α p ( x ) 1 α d x ,
and
q β E ( x ) = q ( x ) 1 β q ( x ) 1 β d x .
Since HPD is a projective divergence, we compute with respect to the conjugate exponents α and β the Hölder escort divergence (HED):
D α HE ( p : q ) = D α , 1 , 1 H ( p α E : q β E ) = D α , 1 α , 1 β H ( p : q ) = log X p ( x ) 1 / α q ( x ) 1 / β d x = B 1 / α ( p : q ) ,
which turns out to be the familiar skew Bhattacharyya divergence B 1 / α ( p : q ) ; see [8].
Fact 5 (HED as a skew Bhattacharyya divergence).
The Hölder escort divergence amounts to a skew Bhattacharyya divergence: D α HE ( p : q ) = B 1 / α ( p : q ) for any α > 0 .
In particular, the Cauchy–Schwarz escort divergence CS HE ( p : q ) amounts to the Bhattacharyya distance [25] B ( p : q ) = log X p ( x ) q ( x ) d x :
CS HE ( p : q ) = D 2 HE ( p : q ) = D 2 , 1 , 1 H ( p 2 E : q 2 E ) = D 2 , 1 2 , 1 2 H ( p : q ) = B 1 / 2 ( p : q ) = B ( p : q ) .
Observe that the Cauchy–Schwarz escort distributions are the square root density representations [26] of distributions.

3. Proper Hölder Divergence

3.1. Definition

To get a proper HD between probability distributions p ( x ) and q ( x ) , we need to have p ( x ) α σ q ( x ) β τ . That is, we have α σ = β τ , or equivalently, we set τ = ( α 1 ) σ for free prescribed parameters α > 1 and σ > 0 . Alternatively, as we shall consider in the remainder, one may set α σ = β τ = γ as a free prescribed parameter, which yields σ = γ / α and τ = γ / β . Thus, in general, we define a bi-parametric family of proper Hölder divergence on probability distributions D α , γ H .
Let p ( x ) and q ( x ) be positive measures in L γ ( X , μ ) for a prescribed scalar value γ > 0 . Plugging σ = γ / α and τ = γ / β into the definition of HPD D α , σ , τ H , we get the following definition:
Definition 2 (Proper Hölder divergence).
For conjugate exponents α , β > 0 and γ > 0 , the proper Hölder divergence (HD) between two densities p ( x ) and q ( x ) is defined by:
D α , γ H ( p : q ) = D α , γ α , γ β H ( p : q ) = log X p ( x ) γ / α q ( x ) γ / β d x ( X p ( x ) γ d x ) 1 / α ( X q ( x ) γ d x ) 1 / β .
Following Hölder’s inequality, we can check that D α , γ H ( p : q ) 0 and D α , γ H ( p : q ) = 0 iff p ( x ) γ q ( x ) γ , i.e., p ( x ) q ( x ) (see Appendix A). If p ( x ) and q ( x ) belong to the statistical probability manifold, then D α , γ H ( p : q ) = 0 iff p ( x ) = q ( x ) almost everywhere. This says that HD is a proper divergence for probability measures, and it becomes a pseudo-divergence for positive measures. Note that we have abused the notation D H to denote both the Hölder pseudo-divergence (with three subscripts) and the Hölder divergence (with two subscripts).
Similar to HPD, HD is asymmetric when α β with the following reference duality:
D α , γ H ( p : q ) = D α ¯ , γ H ( q : p ) .
HD can be symmetrized as:
S α , γ H ( p : q ) = D α , γ H ( p : q ) + D α , γ H ( q : p ) 2 = 1 2 log X p ( x ) γ / α q ( x ) γ / β d x X p ( x ) γ / β q ( x ) γ / α d x X p ( x ) γ d x X q ( x ) γ d x .
Furthermore, one can easily check that HD is a projective divergence.
For conjugate exponents α , β > 0 and γ > 0 , we rewrite the definition of HD as:
D α , γ H ( p : q ) = log X p ( x ) γ X p ( x ) γ d x 1 / α q ( x ) γ X q ( x ) γ d x 1 / β d x , = log p 1 / γ E ( x ) 1 / α q 1 / γ E ( x ) 1 / β d x = B 1 α ( p 1 / γ E : q 1 / γ E ) .
Therefore, HD can be reinterpreted as the skew Bhattacharyya divergence [8] between the escort distributions. In particular, when γ = 1 , we get:
D α , 1 H ( p : q ) = log X p ( x ) 1 / α q ( x ) 1 / β d x = B 1 α ( p : q ) .
Fact 6.
The bi-parametric family of statistical Hölder divergences D α , γ H passes through the one-parametric family of skew Bhattacharyya divergences when γ = 1 .

3.2. Special Case: The Cauchy–Schwarz Divergence

Within the family of Hölder divergence, we set α = β = γ = 2 and get the Cauchy–Schwarz (CS) divergence.
D 2 , 2 H ( p : q ) = D 2 , 1 , 1 H ( p : q ) = CS ( p : q ) .
Figure 1 displays a diagram of those divergence classes with their inclusion relationships.
As stated earlier, notice that the Cauchy–Schwarz inequality
p ( x ) q ( x ) d x p ( x ) 2 d x p ( x ) 2 d x
is not proper as it is an equality when p ( x ) and q ( x ) are linearly dependent (i.e., p ( x ) = λ q ( x ) for λ > 0 ). The arguments of the CS divergence are square-integrable real-valued density functions p ( x ) and q ( x ) . Thus, the Cauchy–Schwarz divergence is not proper for positive measures, but is proper for normalized probability distributions, since in this case, p ( x ) d x = λ q ( x ) d x = 1 implies that λ = 1 .

3.3. Limit Cases of Hölder Divergences and Statistical Estimation

Let us define the inner product of unnormalized densities as:
p ˜ ( x ) , q ˜ ( x ) = X p ˜ ( x ) q ˜ ( x ) d x
(for L 2 ( X , μ ) integrable functions), and define the L α norm of densities as p ˜ ( x ) α = ( X p ˜ ( x ) α d x ) 1 / α for α 1 . Then, the CS divergence can be concisely written as:
CS ( p ˜ : q ˜ ) = log p ˜ ( x ) , q ˜ ( x ) p ˜ ( x ) 2 q ˜ ( x ) 2 ,
and the Hölder pseudo-divergence is written as:
D α , 1 , 1 H ( p ˜ : q ˜ ) = log p ˜ ( x ) , q ˜ ( x ) p ˜ ( x ) α q ˜ ( x ) α ¯ .
When α 1 + , we have α ¯ = α / ( α 1 ) + . Then, it comes that:
lim α 1 + D α , 1 , 1 H ( p ˜ : q ˜ ) = log p ˜ ( x ) , q ˜ ( x ) p ˜ ( x ) 1 q ˜ ( x ) = log p ˜ ( x ) , q ˜ ( x ) + log X p ˜ ( x ) d x + log max x X q ˜ ( x ) .
When α + and α ¯ 1 + , we have:
lim α + D α , 1 , 1 H ( p ˜ : q ˜ ) = log p ˜ ( x ) , q ˜ ( x ) p ˜ ( x ) q ˜ ( x ) 1 = log p ˜ ( x ) , q ˜ ( x ) + log max x X p ˜ ( x ) + log X q ˜ ( x ) d x .
Now, consider a pair of probability densities p ( x ) and q ( x ) . We have:
lim α 1 + D α , 1 , 1 H ( p : q ) = log p ( x ) , q ( x ) + max x X log q ( x ) , lim α + D α , 1 , 1 H ( p : q ) = log p ( x ) , q ( x ) + max x X log p ( x ) , CS ( p : q ) = log p ( x ) , q ( x ) + log p ( x ) 2 + log q ( x ) 2 .
In an estimation scenario, p ( x ) is fixed, and q ( x | θ ) = q θ ( x ) is free along a parametric manifold M ; then, minimizing Hölder divergence reduces to:
arg   min θ M lim α 1 + D α , 1 , 1 H ( p : q θ ) = arg   min θ M log p ( x ) , q θ ( x ) + max x X log q θ ( x ) , arg   min θ M lim α + D α , 1 , 1 H ( p : q ) = arg   min θ M log p ( x ) , q θ ( x ) , arg   min θ M CS ( p : q ) = arg   min θ M log p ( x ) , q θ ( x ) + log q θ ( x ) 2 .
Therefore, when α varies from 1 to + , only the regularizer in the minimization problem changes. In any case, Hölder divergence always has the term log p ( x ) , q ( x ) , which shares a similar form as the Bhattacharyya distance [25]:
B ( p : q ) = log X p ( x ) q ( x ) d x = log p ( x ) , q ( x ) .
HPD between p ˜ ( x ) and q ˜ ( x ) is also closely related to their cosine similarity p ˜ ( x ) , q ˜ ( x ) p ˜ ( x ) 2 q ˜ ( x ) 2 . When α = 2 , σ = τ = 1 , HPD is exactly the cosine similarity after a non-linear transformation.

4. Closed-Form Expressions of HPD and HD for Conic and Affine Exponential Families

We report closed-form formulas for the HPD and HD between two distributions belonging to the same exponential family provided that the natural parameter space is a cone or affine. A cone Ω is a convex domain, such that for P , Q Ω and any λ > 0 , we have P + λ Q Ω . For example, the set of positive measures absolutely continuous with a base measure μ is a cone. Recall that an exponential family [19] has a density function p ( x ; θ ) that can be written canonically as:
p ( x ; θ ) = exp t ( x ) , θ F ( θ ) + k ( x ) .
In this work, we consider the auxiliary carrier measure term k ( x ) = 0 . The base measure is either the Lebesgue measure μ or the counting measure μ C . A conic or affine exponential family (CAEF) is an exponential family with the natural parameter space Θ being a cone or affine. The log-normalizer F ( θ ) is a strictly convex function also called the cumulant generating function [3].
Lemma 1 (HPD and HD for CAEFs).
For distributions p ( x ; θ p ) and p ( x ; θ q ) belonging to the same exponential family with conic or affine natural parameter space [23], both the HPD and HD are available in closed-form:
D α , σ , τ H ( p : q ) = 1 α F ( α σ θ p ) + 1 β F ( β τ θ q ) F ( σ θ p + τ θ q ) ,
D α , γ H ( p : q ) = 1 α F ( γ θ p ) + 1 β F ( γ θ q ) F γ α θ p + γ β θ q .
Proof. 
Consider k ( x ) = 0 and a conic or affine natural space Θ (see [23]); then, for all a , b > 0 , we have:
p ( x ) a d x 1 b = exp 1 b F ( a θ p ) a b F ( θ p ) ,
since a θ p Θ . Indeed, we have:
p ( x ) a d x 1 / b = exp a θ , t ( x ) a F ( θ ) d x 1 / b = exp a θ , t ( x ) F ( a θ ) + F ( a θ ) a F ( θ ) d x 1 / b = exp 1 b F ( a θ ) a b F ( θ ) exp a θ , t ( x ) F ( a θ ) d x = 1 1 / b .
Similarly, we have for all a , b > 0 (details omitted),
p ( x ) a q ( x ) b d x = exp ( F ( a θ p + b θ q ) a F ( θ p ) b F ( θ q ) ) ,
since a θ p + b θ q Θ . Therefore, we get:
D α , σ , τ H ( p : q ) = log p ( x ) σ q ( x ) τ d x p ( x ) α σ d x 1 α q ( x ) β τ d x 1 β = F ( σ θ p + τ θ q ) + F ( σ θ p ) + F ( τ θ q ) + 1 α F ( α σ θ p ) F ( σ θ p ) + 1 β F ( β τ θ q ) F ( τ θ q ) = 1 α F ( α σ θ p ) + 1 β F ( β τ θ q ) F ( σ θ p + τ θ q ) 0 , D α , γ H ( p : q ) = log p ( x ) γ / α q ( x ) γ / β d x p ( x ) γ d x 1 α q ( x ) γ d x 1 β = F γ α θ p + γ β θ q + γ α F ( θ p ) + γ β F ( θ q ) + 1 α F ( γ θ p ) γ α F ( θ p ) + 1 β F ( γ θ q ) γ β F ( θ q ) = 1 α F ( γ θ p ) + 1 β F ( γ θ q ) F γ α θ p + γ β θ q 0 .
When 1 > α > 0 , we have β = α α 1 < 0 . To get similar results for the reverse Hölder divergence, we need the natural parameter space to be affine (e.g., isotropic Gaussians or multinomials; see [27]). ☐
In particular, if p ( x ) and q ( x ) belong to the same exponential family so that p ( x ) = exp ( θ p , t ( x ) F ( θ p ) ) and q ( x ) = exp ( θ q , t ( x ) F ( θ q ) ) , one can easily check that D α , 1 , 1 H ( p : q ) = 0 iff θ q = ( α 1 ) θ p . For HD, we can check that D α , γ H ( p : p ) = 0 is proper since 1 α + 1 β = 1 .
The following result is straightforward from Lemma 1.
Lemma 2 (Symmetric HPD and HD for CAEFs).
For distributions p ( x ; θ p ) and p ( x ; θ q ) belonging to the same exponential family with conic or affine natural parameter space [23], the symmetric HPD and HD are available in closed-form:
S α , σ , τ H ( p : q ) = 1 2 1 α F ( α σ θ p ) + 1 β F ( β τ θ p ) + 1 α F ( α σ θ q ) + 1 β F ( β τ θ q ) F ( σ θ p + τ θ q ) F ( τ θ p + σ θ q ) ; S α , γ H ( p : q ) = 1 2 F ( γ θ p ) + F ( γ θ q ) F γ α θ p + γ β θ q F γ β θ p + γ α θ q .
Remark 1.
By reference duality,
S α , σ , τ H ( p : q ) = S α ¯ , τ , σ H ( p : q ) ; S α , γ H ( p : q ) = S α ¯ , γ H ( p : q ) .
Note that the Hölder score-induced divergence [18] does not admit in general closed-form formulas for exponential families since it relies on a function ϕ ( · ) (see Definition 4 of [18]).
Note that CAEF convex log-normalizers satisfy:
1 α F ( α θ p ) + 1 β F ( β θ q ) F ( θ p + θ q ) .
A necessary condition is that F ( λ θ ) λ F ( θ ) for λ > 0 (take θ p = θ , θ q = 0 and F ( 0 ) = 0 in the above equality).
The escort distribution for an exponential family is given by:
p α E ( x ; θ ) = e F ( θ ) α F ( θ α ) p ( x ; θ ) 1 α .
The Hölder equality holds when p ( x ) α q ( x ) β or p ( x ) α q ( x ) β 1 . For exponential families, this condition is satisfied when α θ p β θ q Θ . That is, we need to have:
α θ p 1 α 1 θ q Θ .
Thus, we may choose small enough α = 1 + ϵ > 1 so that the condition is not satisfied for fixed θ p and θ q for many exponential distributions. Since multinomials have affine natural space [27], this condition is always met, but not for non-affine natural parameter spaces like normal distributions.
Notice the following fact:
Fact 7 (Density of a CAEF in L γ ( X , μ ) ).
The density of exponential families with conic or affine natural parameter space belongs to L γ ( X , μ ) for any γ > 0 .
Proof. 
We have X ( exp ( θ , t ( x ) F ( θ ) ) ) γ d μ ( x ) = e F ( γ θ ) γ F ( θ ) < for any γ > 0 provided that γ θ belongs to the natural parameter space. When Θ is a cone or affine, the condition is satisfied. ☐
Let p ˜ ( x ; θ ) = exp t ( x ) , θ denote the unnormalized positive exponential family density and p ( x ; θ ) = p ˜ ( x ; θ ) Z ( θ ) the normalized density with Z ( θ ) = exp ( F ( θ ) ) the partition function. Although HD is a projective divergence since we have D α , σ , τ H ( p 1 : p 2 ) = D α , σ , τ H ( p ˜ 1 : p ˜ 2 ) , observe that the HD value depends on the log-normalizer F ( θ ) (since the HD is an integral on the support; see also [12] for a similar argument with the γ-divergence [11]).
In practice, even when the log-normalizer is computationally intractable, we may still estimate the HD by Monte Carlo techniques: Indeed, we can sample a distribution p ˜ ( x ) either by rejection sampling [12] or by the Markov chain Monte Carlo (MCMC) Metropolis–Hasting technique: It just requires to be able to sample a proposal distribution that has the same support.
We shall now instantiate the HPD and HD formulas for several exponential families with conic or affine natural parameter spaces.

4.1. Case Study: Categorical Distributions

Let p = ( p 0 , , p m ) and q = ( q 0 , , q m ) be two categorical distributions in the m-dimensional probability simplex Δ m . We rewrite p in the canonical form of exponential families [19] as:
p i = exp ( θ p ) i log 1 + i = 1 m exp ( θ p ) i , i { 1 , , m } ,
with the redundant parameter:
p 0 = 1 i = 1 m p i = 1 1 + i = 1 m exp ( θ p ) i .
From Equation (47), the convex cumulant generating function has the form F ( θ ) = log 1 + i = 1 m exp ( θ p ) i . The inverse transformation from p to θ is therefore given by:
θ i = log p i p 0 , i { 1 , , m } .
The natural parameter space Θ is affine (hence conic), and by applying Lemma 1, we get the following closed-form formula:
D α , σ , τ H ( p : q ) = 1 α log 1 + i = 1 m exp ( α σ ( θ p ) i ) + 1 β log 1 + i = 1 m exp ( β τ ( θ q ) i ) log 1 + i = 1 m exp ( σ ( θ p ) i + τ ( θ q ) i ) ,
D α , γ H ( p : q ) = 1 α log 1 + i = 1 m exp ( γ ( θ p ) i ) + 1 β log 1 + i = 1 m exp ( γ ( θ q ) i ) log 1 + i = 1 m exp γ α ( θ p ) i + γ β ( θ q ) i .
To get some intuitions, Figure 2 shows the Hölder divergence from a given reference distribution p r to any categorical distribution ( p 0 , p 1 , p 2 ) in the 2D probability simplex Δ 2 . A main observation is that the Kullback–Leibler (KL) divergence exhibits a barrier near the boundary Δ 2 with large values. This is not the case for Hölder divergences: D α , 1 , 1 H ( p r : p ) does not have a sharp increase near the boundary (although it still penalizes the corners of Δ 2 ). For example, let p = ( 0 , 1 / 2 , 1 / 2 ) , p r = ( 1 / 3 , 1 / 3 , 1 / 3 ) , then KL ( p r : p ) , but D 2 , 1 , 1 H ( p r : p ) = 2 / 3 . Another observation is that the minimum D ( p r : p ) can be reached at some point p p r (see for example D 4 , 1 , 1 H ( p r : p ) in Figure 2b; the bluest area corresponding to the minimum of D ( p r : p ) is not in the same location as the reference point).
Consider an HPD ball of center c and prescribed radius r w.r.t. the HPD. Since p ( x ) α 1 for α 2 does not belong to the probability manifold, but to the positive measure manifold, and since the distance is projective, we deduce that the displaced ball center c of a ball c lying in the probability manifold can be computed as the intersection of the ray λ p ( x ) α 1 anchored at the origin 0 and passing through p ( x ) α 1 with the probability manifold. For the discrete probability simplex Δ, since we have λ x X p ( x ) α 1 = 1 , we deduce that the displaced ball center is at:
c = c x X p ( x ) α 1
This center is displayed as “•” in Figure 2.
In general, the HPD bisector [28] between two distributions belonging to the same CAEF is defined by:
1 α ( F ( α θ 1 ) F ( α θ 2 ) ) = F ( θ 2 + θ ) F ( θ 1 + θ ) .

4.2. Case Study: Bernoulli Distribution

The Bernoulli distribution is just a special case of the category distribution when the number of categories is two (i.e., m = 1 ). To be consistent with the previous section, we rewrite a Bernoulli distribution p = ( p 0 , p 1 ) in the canonical form:
p 1 = exp θ p log 1 + exp ( θ p ) = exp ( θ p ) 1 + exp ( θ p ) ,
so that:
p 0 = 1 1 + exp ( θ p ) .
Then, the cumulant generating function becomes F ( θ p ) = log 1 + exp ( θ p ) . By Lemma 1,
D α , σ , τ H ( p : q ) = 1 α log 1 + exp ( α σ θ p ) + 1 β log 1 + exp ( β τ θ q ) log 1 + exp ( σ θ p + τ θ q ) ,
D α , γ H ( p : q ) = 1 α log 1 + exp ( γ θ p ) + 1 β log 1 + exp ( γ θ q ) log 1 + exp γ α θ p + γ β θ q .

4.3. Case Study: MultiVariate Normal Distributions

Let us now instantiate the formulas for multivariate normals (Gaussian distributions). We have the log-normalizer F ( θ ) expressed using the usual parameters as [15]:
F ( θ ) = F ( μ ( θ ) , Σ ( θ ) ) = 1 2 log ( 2 π ) d | Σ | + 1 2 μ Σ 1 μ .
Since:
θ = ( Σ 1 μ , 1 2 Σ 1 ) = ( v , M ) , μ = 1 2 M 1 v , Σ = 1 2 M 1 .
It follows that:
θ p + θ q = θ p + q = ( v p + v q , M p + M q ) = Σ p 1 μ p + Σ q 1 μ q , 1 2 Σ p 1 1 2 Σ q 1 .
Therefore, we have:
μ p + q = ( Σ p 1 + Σ q 1 ) 1 ( Σ p 1 μ p + Σ q 1 μ q ) , Σ p + q = ( Σ p 1 + Σ q 1 ) 1
We thus get the following closed-form formula for p N ( μ p , Σ p ) and q N ( μ q , Σ q ) :
D α , σ , τ H ( N ( μ p , Σ p ) : N ( μ q , Σ q ) ) = 1 2 α log Σ p α σ + σ 2 μ p Σ p 1 μ p + 1 2 β log Σ q β τ + τ 2 μ q Σ q 1 μ q + 1 2 log σ Σ p 1 + τ Σ q 1 1 2 ( σ Σ p 1 μ p + τ Σ q 1 μ q ) ( σ Σ p 1 + τ Σ q 1 ) 1 ( σ Σ p 1 μ p + τ Σ q 1 μ q ) ; D α , γ H ( N ( μ p , Σ p ) : N ( μ q , Σ q ) ) = 1 2 α log Σ p γ + γ 2 α μ p Σ p 1 μ p + 1 2 β log Σ q γ + γ 2 β μ q Σ q 1 μ q + 1 2 log γ α Σ p 1 + γ β Σ q 1 1 2 γ α Σ p 1 μ p + γ β Σ q 1 μ q γ α Σ p 1 + γ β Σ q 1 1 γ α Σ p 1 μ p + γ β Σ q 1 μ q .
Figure 3 shows HPD and HD for univariate Gaussian distributions as compared to the KL divergence. Again, HPD and HD have more tolerance for distributions near the boundary σ = 0 , which is in contrast to the (reverse) KL divergence.

4.4. Case Study: Zero-Centered Laplace Distribution

The zero-centered Laplace distribution is defined on the support ( , ) with the pdf:
p ( x ; s ) = 1 2 s exp | x | s = exp | x | s log ( 2 s ) .
We have θ = 1 s , F ( θ ) = log ( 2 θ ) . Therefore, it comes that:
D α , σ , τ H ( p : q ) = 1 α log 2 α σ θ p + 1 β log 2 β τ θ q log 2 σ θ p + τ θ q = 1 α log s p α σ + 1 β log s q β τ + log σ s p + τ s q ,
D α , γ H ( p : q ) = 1 α log 2 γ θ p + 1 β log 2 γ θ q log 2 γ α θ p + γ β θ q = 1 α log s p + 1 β log s q + log 1 α s p + 1 β s p .
In this special case, D α , γ H ( p : q ) does not vary with γ.

4.5. Case Study: Wishart Distribution

The Wishart distribution is defined on the d × d positive definite cone with the density:
p ( X ; n , S ) = | X | n d 1 2 exp 1 2 tr ( S 1 X ) 2 n d 2 | S | n 2 Γ d n 2 ,
where n > d 1 is the degree of freedom and S 0 is a positive-definite scale matrix. We rewrite it in the canonical form:
p ( X ; n , S ) = exp 1 2 tr ( S 1 X ) + n d 1 2 log | X | n d 2 log 2 n 2 log | S | log Γ d n 2 .
We can see that θ = ( θ 1 , θ 2 ) , θ 1 = 1 2 S 1 , θ 2 = n d 1 2 , and:
F ( θ ) = n d 2 log 2 + n 2 log | S | + log Γ d n 2 = ( θ 2 + d + 1 2 ) d log 2 + ( θ 2 + d + 1 2 ) log | 1 2 ( θ 1 ) 1 | + log Γ d θ 2 + d + 1 2 .
The resulting D α , σ , τ H ( p : q ) and D α , γ H ( p : q ) are straightforward from the above expression of F ( θ ) and Lemma 1. We will omit these tedious expressions for brevity.

4.6. Approximating Hölder Projective Divergences for Statistical Mixtures

Given two finite mixture models m ( x ) = i = 1 k w i p i ( x ) and m ( x ) = j = 1 k w j p j ( x ) , we derive analytic bounds of their Hölder divergences. When only an approximation is needed, one may compute Hölder divergences based on Monte Carlo stochastic sampling.
Let us assume that all mixture components are in an exponential family [19], so that p i ( x ) = p ( x ; θ i ) = exp ( θ i , t ( x ) F ( θ i ) ) and p j ( x ) = p ( x ; θ j ) = exp ( θ j , t ( x ) F ( θ j ) ) are densities (w.r.t. the Lebesgue measure μ).
Without loss of generality, we only consider the pseudo Hölder divergence D α , 1 , 1 H . We rewrite it in the form:
D α , 1 , 1 H ( m : m ) = log X m ( x ) m ( x ) d x + 1 α log X m ( x ) α d x + 1 β log X m ( x ) β d x .
To compute the first term, we observe that a product of mixtures is also a mixture:
X m ( x ) m ( x ) d x = i = 1 k j = 1 k w i w j X p i ( x ) p j ( x ) d x = i = 1 k j = 1 k w i w j X exp θ i + θ j , t ( x ) F ( θ i ) F ( θ j ) d x = i = 1 k j = 1 k w i w j exp F ( θ i + θ j ) F ( θ i ) F ( θ j ) ,
which can be computed in O ( k k ) time.
The second and third terms in Equation (68) are not straightforward to calculate and shall be bounded. Based on computational geometry, we adopt the log-sum-exp bounding technique of [29] and divide the support X into L pieces of elementary intervals X = l = 1 L I l . In each interval I l , the indices:
δ l = arg   min i   w i p i ( x ) and ϵ l = arg   min i   w i p i ( x )
represent the unique dominating component and the dominated component. Then, we bound as follows:
max I l k α w ϵ l α p ϵ l ( x ) α d x , I l w δ l α p δ l ( x ) α d x I l m ( x ) α d x I l k α w δ l α p δ l ( x ) α d x .
All terms on the lhs and rhs of Equation (71) can be computed exactly by noticing that:
I p i ( x ) α d x = I exp ( α θ i , t ( x ) α F ( θ i ) ) = exp ( F ( α θ i ) α F ( θ i ) ) I p ( x ; α θ i ) d x .
When α θ Θ where Θ denotes the natural parameter space, the integral I p ( x ; α θ i ) d x converges; see [29] for further details.
Then, the bounds of X m ( x ) α d x can be obtained by summing the bounds in Equation (71) over all elementary intervals. Thus, D α , 1 , 1 H ( m : m ) can be both lower and upper bounded.

5. Hölder Centroids and Center-Based Clustering

We study the application of HPD and HD for clustering distributions [30], specially clustering Gaussian distributions [31,32,33], which have been used in sound processing [31], sensor network [32], statistical debugging [32], quadratic invariants of switched systems [34], etc. Other potential applications of HD may include nonnegative matrix factorization [35], and clustering von Mises–Fisher [36,37] (log-normalizer expressed using Bessel functions).

5.1. Hölder Centroids

We study center-based clustering of a finite set of distributions belonging to the same exponential family. By a slight abuse of notation, we shall write D α , σ , τ H ( θ : θ ) instead of D α , σ , τ H ( p θ : p θ ) . Given a list of distributions belonging to the same conic exponential family with natural parameters { θ 1 , , θ n } and their associated positive weights { w 1 , , w n } with i = 1 n w i = 1 , consider their centroids based on HPD and HD as follows:
C α ( { θ i , w i } ) = arg   min C i = 1 n w i D α , 1 , 1 H ( θ i : C ) ,
C α , γ ( { θ i , w i } ) = arg   min C i = 1 n w i D α , γ H ( θ i : C ) .
By an abuse of notation, C denotes both the HPD centroid and HD centroid. When the context is clear, the parameters in parentheses can be omitted so that these centroids are simply denoted as C α and C α , γ . Both of them are defined as the right-sided centroids. The corresponding left-handed centroids are obtained according to reference duality, i.e.,
C α ¯ = arg   min C i = 1 n w i D α , 1 , 1 H ( C : θ i ) ,
C α ¯ , γ = arg   min C i = 1 n w i D α , γ H ( C : θ i ) .
By Lemma 1, these centroids can be obtained for distributions belonging to the same exponential family as follows:
C α = arg   min C 1 β F ( β C ) i = 1 n w i F ( θ i + C ) ,
C α , γ = arg   min C 1 β F ( γ C ) i = 1 n w i F γ α θ i + γ β C .
Let γ = α ; we get:
C α , α ( { θ i , w i } ) = arg   min C 1 β F ( α C ) i = 1 n w i F θ i + α β C = β α C α = 1 α 1 C α ( { θ i , w i } ) ,
meaning that the HPD centroid is just a special case of HD centroid up to a scaling transformation in the natural parameters space. Let γ = β ; we get:
C α , β ( { θ i , w i } ) = arg   min C 1 β F ( β C ) i = 1 n w i F β α θ i + C = C α β α θ i , w i = C α 1 α 1 θ i , w i .
Let us consider the general HD centroid C α , γ . Since F is convex, the minimization energy is the sum of a convex function 1 β F ( γ C ) with a concave function i = 1 n w i F γ α θ i + γ β C . We can therefore use the concave-convex procedure (CCCP) [8] that optimizes the difference of convex programs (DCPs): We start with C α , γ 0 = i = 1 n w i θ i (the barycenter, belonging to Θ) and then update:
C α , γ t + 1 = 1 γ ( F ) 1 i = 1 n w i F γ α θ i + γ β C α , γ t
for t = 0 , 1 , until convergence. This can be done by noting that η = F ( θ ) are the dual parameters that are also known as the expectation parameters (or moment parameters). Therefore, F and ( F ) 1 can be computed through Legendre transformations between the natural parameter space and the dual parameter space.
This iterative optimization is guaranteed to converge to a local minimum, with a main advantage of bypassing the learning rate parameter of gradient descent algorithms. Since F is strictly convex, F is monotonous, and the rhs expression can be interpreted as a multi-dimensional quasi-arithmetic mean. In fact, it is a barycenter on unnormalized weights scaled by β = α ¯ .
For exponential families, the symmetric HPD centroid is:
O α = arg   min O i = 1 n w i S α , 1 , 1 H ( θ i : O ) = arg   min O 1 2 α F ( α O ) + 1 2 β F ( β O ) i = 1 n w i F ( θ i + O ) .
In this case, the CCCP update rule is not in closed form because we cannot easily inverse the sum of gradients (but when α = β , the two terms collapse, so the CS centroid can be calculated using CCCP). Nevertheless, we can implement the reciprocal operation numerically. Interestingly, the symmetric HD centroid can be solved by CCCP! It amounts to solving:
$$O_{\alpha,\gamma} = \arg\min_O \sum_{i=1}^n w_i\, S^{\mathtt{H}}_{\alpha,\gamma}(\theta_i : O) = \arg\min_O\; F(\gamma O) - \sum_{i=1}^n w_i \left[ F\!\left(\frac{\gamma}{\alpha}\theta_i + \frac{\gamma}{\beta} O\right) + F\!\left(\frac{\gamma}{\beta}\theta_i + \frac{\gamma}{\alpha} O\right) \right].$$
One can apply CCCP to iteratively update the centroid based on:
$$O^{t+1}_{\alpha,\gamma} = \frac{1}{\gamma} (\nabla F)^{-1}\!\left( \sum_{i=1}^n w_i \left[ \frac{1}{\beta}\, \nabla F\!\left(\frac{\gamma}{\alpha}\theta_i + \frac{\gamma}{\beta} O^t_{\alpha,\gamma}\right) + \frac{1}{\alpha}\, \nabla F\!\left(\frac{\gamma}{\beta}\theta_i + \frac{\gamma}{\alpha} O^t_{\alpha,\gamma}\right) \right] \right).$$
Notice the similarity with the updating procedure of $C^t_{\alpha,\gamma}$.
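Under the same assumptions, and reusing F, grad_F and grad_F_inv from the previous sketch, the symmetric update can be transcribed as follows (again, ours):

```python
def sym_holder_centroid(thetas, weights, alpha, gamma, iters=100):
    """CCCP iterations for the symmetric HD centroid O_{alpha,gamma},
    reusing F, grad_F and grad_F_inv from the previous sketch."""
    beta = alpha / (alpha - 1.0)
    O = np.average(thetas, axis=0, weights=weights)   # barycenter initialization
    for _ in range(iters):
        rhs = sum(w * (grad_F(gamma / alpha * t + gamma / beta * O) / beta +
                       grad_F(gamma / beta * t + gamma / alpha * O) / alpha)
                  for t, w in zip(thetas, weights))
        O = grad_F_inv(rhs) / gamma
    return O
```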
Once the centroid, say $O_{\alpha,\gamma}$, has been computed, we calculate the associated Hölder information:
$$\sum_{i=1}^n w_i\, S^{\mathtt{H}}_{\alpha,\gamma}(\theta_i : O_{\alpha,\gamma}),$$
which generalizes the notion of variance and Bregman information [5] to the case of Hölder distances.

5.2. Clustering Based on Symmetric Hölder Divergences

Given a set of fixed densities $\{p_1, \ldots, p_n\}$, we can perform variational $k$-means [6] with respect to the Hölder divergence to minimize the cost function:
$$E(O_1, \ldots, O_L, l_1, \ldots, l_n) = \sum_{i=1}^n S^{\mathtt{H}}_{\alpha,\gamma}(p_i : O_{l_i}),$$
where $O_1, \ldots, O_L$ are the cluster centers and $l_i \in \{1, \ldots, L\}$ is the cluster label of $p_i$. The procedure is given by Algorithm 1. Notice that one does not need to wait for the CCCP iterations to converge: it suffices to improve the cost function $E$ before updating the assignments. We implemented the algorithm based on the symmetric HD; it can easily be modified to use HPD and other variants.
Algorithm 1: Hölder variational k-means.
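As an illustration of the variational loop just described, here is a minimal sketch (ours, not the released implementation [51]) for the same categorical family as in the previous sketches; the HD closed form is transcribed from the exponential-family centroid energies above, and all function names are our own.

```python
def holder_div(theta_p, theta_q, alpha, gamma):
    """HD between two members of the same exponential family, transcribed
    from the closed-form centroid energies above (our transcription)."""
    beta = alpha / (alpha - 1.0)
    return (F(gamma * theta_p) / alpha + F(gamma * theta_q) / beta
            - F(gamma / alpha * theta_p + gamma / beta * theta_q))

def sym_holder_div(theta_p, theta_q, alpha, gamma):
    """Symmetrized HD S^H_{alpha,gamma}."""
    return 0.5 * (holder_div(theta_p, theta_q, alpha, gamma)
                  + holder_div(theta_q, theta_p, alpha, gamma))

def holder_kmeans(thetas, L, alpha, gamma, rounds=20, cccp_iters=5, seed=0):
    """Variational k-means: alternate assignments with a few CCCP centroid
    iterations per round (the CCCP need not be run to convergence)."""
    rng = np.random.default_rng(seed)
    centers = thetas[rng.choice(len(thetas), size=L, replace=False)].copy()
    labels = np.zeros(len(thetas), dtype=int)
    for _ in range(rounds):
        # Assignment step: nearest center with respect to the symmetric HD.
        labels = np.array([np.argmin([sym_holder_div(t, c, alpha, gamma)
                                      for c in centers]) for t in thetas])
        # Centroid step: a few CCCP iterations per non-empty cluster.
        for k in range(L):
            members = thetas[labels == k]
            if len(members):
                w = np.ones(len(members)) / len(members)
                centers[k] = sym_holder_centroid(members, w, alpha, gamma,
                                                 iters=cccp_iters)
    return labels, centers
```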
We built a toy dataset generator that randomly draws $n$ 2D Gaussians with an underlying structure of two or three clusters. In the first cluster, the mean of each Gaussian $G(\mu, \Sigma)$ has the prior distribution $\mu \sim G((-2, 0), I)$; the covariance matrix is obtained by first generating $\sigma_1 \sim \Gamma(7, 0.01)$ and $\sigma_2 \sim \Gamma(7, 0.003)$, where $\Gamma$ denotes a gamma distribution with the prescribed shape and scale, and then rotating the covariance matrix $\mathrm{diag}(\sigma_1, \sigma_2)$ so that the resulting Gaussian has a "radial direction" with respect to the cluster center $(-2, 0)$. The second and third clusters are built in the same way, the only difference being that their $\mu$'s are centered around $(2, 0)$ and $(0, 2\sqrt{3})$, respectively. See Figure 4 for an illustration of the toy dataset.
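A sketch (ours) of such a generator is given below; the three cluster centers $(-2,0)$, $(2,0)$, $(0, 2\sqrt{3})$ and the radial orientation rule are our reading of the description above.

```python
def make_toy_gaussians(n, n_clusters=3, rng=None):
    """Our reconstruction of the toy generator: each sample is a 2D Gaussian
    (mu, Sigma) whose mean is drawn around one of the cluster centers and
    whose covariance diag(sigma1, sigma2) is rotated to point radially."""
    rng = np.random.default_rng() if rng is None else rng
    centers = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 2.0 * np.sqrt(3.0)]])
    gaussians, labels = [], []
    for _ in range(n):
        k = int(rng.integers(n_clusters))
        c = centers[k]
        mu = rng.normal(loc=c, scale=1.0)               # mu ~ G(c, I)
        s1 = rng.gamma(shape=7.0, scale=0.01)           # sigma_1 ~ Gamma(7, 0.01)
        s2 = rng.gamma(shape=7.0, scale=0.003)          # sigma_2 ~ Gamma(7, 0.003)
        angle = np.arctan2(mu[1] - c[1], mu[0] - c[0])  # "radial" orientation
        R = np.array([[np.cos(angle), -np.sin(angle)],
                      [np.sin(angle),  np.cos(angle)]])
        gaussians.append((mu, R @ np.diag([s1, s2]) @ R.T))
        labels.append(k)
    return gaussians, np.array(labels)
```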
To reduce the number of parameters to be tuned, we only investigate the case $\alpha = \gamma$. If we choose $\alpha = \gamma = 2$, then $S^{\mathtt{H}}_{\alpha,\gamma}$ becomes the CS divergence, and Algorithm 1 reduces to traditional CS clustering. From Figure 4, we observe that the clustering result does vary with the settings of $\alpha$ and $\gamma$. We performed clustering experiments for two settings of the number of clusters and two settings of the sample size. Table 1 reports the clustering accuracy, measured as the percentage of "correctly-clustered" Gaussians, i.e., those for which the label output by the clustering algorithm coincides with the true label of the data-generating process. The large variance of the clustering accuracy arises because different runs use different random datasets produced by the same generator. We see that the symmetric Hölder divergence can give significantly better clustering results than CS clustering. Intuitively, the symmetric Hölder centroid with $\alpha$ and $\gamma$ close to one has a smaller variance (see Figure 4); it can therefore better capture the clustering structure. This hints that one should consider the general Hölder divergence as a replacement for CS in similar clustering applications [22,38]. Although one faces the problem of tuning the parameters $\alpha$ and $\gamma$, Hölder divergences can potentially give better results, which is expected because CS is just one particular member of the class of Hölder divergences.

6. Conclusions and Perspectives

We introduced the notion of pseudo-divergences, which generalizes the concept of divergences in information geometry [3]: pseudo-divergences are smooth non-metric statistical distances that are not required to obey the law of the indiscernibles. Pseudo-divergences can be built from inequalities by considering the inequality difference gap or its log-ratio gap. We then defined two classes of statistical measures based on Hölder's ordinary and reverse inequalities: the tri-parametric family of Hölder pseudo-divergences and the bi-parametric family of Hölder divergences. By construction, the Hölder divergences are proper divergences between probability densities. Both statistical Hölder distance families are projective divergences that do not require distributions to be normalized, and they admit closed-form expressions for exponential families with conic or affine natural parameter spaces (like multinomials or multivariate normals). Both families of distances can be symmetrized and encompass the Cauchy–Schwarz divergence as well as the family of skew Bhattacharyya divergences. Since the Cauchy–Schwarz divergence is often used in distribution clustering applications [22], we carried out preliminary experiments demonstrating that the symmetrized Hölder divergences improve over the Cauchy–Schwarz divergence on a toy dataset of Gaussians. We also briefly touched upon the use of these novel divergences in statistical estimation theory. These projective Hölder (pseudo-)divergences differ from the recently introduced composite score-induced Hölder divergences [17,18], which are not projective divergences and do not admit closed-form expressions for exponential families in general.
We elicited the special role of escort distributions [3] for Hölder divergences in our framework: Escort distributions transform distributions to allow one:
  • To reveal that Hölder pseudo-divergences on escort distributions amount to skew Bhattacharyya divergences [8],
  • To transform the improper Hölder pseudo-divergences into proper Hölder divergences, and vice versa.
It is interesting to consider other potential applications of Hölder divergences and compare their efficiency against the reference Cauchy–Schwarz divergence: For example, HD t-SNE (Stochastic Neighbor Embedding) compared to CS t-SNE [39], HD vector quantization (VQ) compared to CS VQ [40], HD saliency vs. CS saliency detection in images [41], etc.
Let us conclude with a perspective note on pseudo-divergences, statistical estimators and manifold learning. Proper divergences have been widely used to build families of statistical estimators [42,43]. Similarly, given a prescribed density $p_0(x)$, a pseudo-divergence yields a corresponding estimator by minimizing $D(p_0 : q)$ with respect to $q(x)$. However, in this case, the resulting $q(x)$ is potentially biased and is not guaranteed to recover the input $p_0(x)$. Furthermore, the minimizer of $D(p_0 : q)$ may not be unique, i.e., there could be more than one probability density $q(x)$ yielding $D(p_0 : q) = 0$.
How can pseudo-divergences be useful? We have the following two simple arguments:
  • In an estimation scenario, we can usually pre-compute $p_1(x) \neq p_0(x)$ according to $D(p_1 : p_0) = 0$. Then, the estimation $q(x) = \arg\min_q D(p_1 : q)$ automatically targets $p_0(x)$. We call this technique "pre-aim."
    For example, given a positive measure $p(x)$, we first find $p_0(x)$ satisfying $D^{\mathtt{H}}_{\alpha,1,1}(p_0 : p) = 0$; the choice $p_0(x) = p(x)^{\frac{1}{\alpha-1}}$ satisfies this condition. Then, a proper divergence between $p(x)$ and $q(x)$ can be obtained by aiming $q(x)$ towards $p_0(x)$. For conjugate exponents $\alpha$ and $\beta$,
    $$D^{\mathtt{H}}_{\alpha,1,1}(p_0 : q) = -\log \frac{\int_{\mathcal{X}} p_0(x)\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p_0(x)^\alpha \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} = -\log \frac{\int_{\mathcal{X}} p(x)^{\frac{1}{\alpha-1}}\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^{\frac{\alpha}{\alpha-1}} \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} = -\log \frac{\int_{\mathcal{X}} p(x)^{\frac{\beta}{\alpha}}\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\beta \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} = D^{\mathtt{H}}_{\alpha,\beta}(p : q).$$
    This means that the pre-aim technique applied to HPD is equivalent to the HD $D^{\mathtt{H}}_{\alpha,\gamma}$ with $\gamma = \beta$ (a numerical sketch of this identity is given after this list).
    As an alternative implementation of pre-aim, since $D^{\mathtt{H}}_{\alpha,1,1}(p : p^{\alpha-1}) = 0$, a proper divergence between $p(x)$ and $q(x)$ can also be constructed by measuring:
    $$D^{\mathtt{H}}_{\alpha,1,1}(q : p^{\alpha-1}) = -\log \frac{\int_{\mathcal{X}} p(x)^{\frac{\alpha}{\beta}}\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} q(x)^\alpha \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\beta}} = D^{\mathtt{H}}_{\beta,\alpha}(p : q),$$
    which again turns out to belong to the class of HD.
    In practice, HD as a bi-parametric family may be used less often than HPD with pre-aim, because of the difficulty of choosing the parameter $\gamma$ and because HD has a slightly more complicated expression. The family of HD connecting the CS divergence with the skew Bhattacharyya divergences [8] is nevertheless of theoretical importance.
  • In manifold learning [44,45,46,47], an essential task is to align two categorical distributions $p_0(x)$ and $q(x)$, corresponding respectively to the input and the output [47], both for learning and for performance evaluation. In this case, the dimensionality of the statistical manifold encompassing $p_0(x)$ and $q(x)$ is so high that preserving $p_0(x)$ monotonically in the resulting $q(x)$ is already a difficult non-linear optimization and may be sufficient for the application, while preserving the input $p_0(x)$ exactly is not so meaningful because of input noise. It is then much easier to define pseudo-divergences from inequalities, which do not need to be proper and offer potentially more choices. On the other hand, projective divergences, including the Hölder divergences introduced in this work, are more meaningful in manifold learning than the (widely used) KL divergence because they are scale invariant with respect to the probability densities: one can define positive similarities and then directly align these similarities, which is guaranteed to be equivalent to aligning the corresponding distributions. This could potentially provide a unified perspective bridging similarity-based manifold learning [46] and the probabilistic approach [44].
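The following numerical sketch (ours) checks the pre-aim identity of the first item on discrete positive arrays, with sums replacing the integrals in the expressions displayed above; the general hd function transcribes the bi-parametric HD in the same way and is used here only at $\gamma = \beta$.

```python
import numpy as np

def hpd(p, q, alpha):
    """HPD D^H_{alpha,1,1} for discrete positive arrays (sums replace integrals)."""
    beta = alpha / (alpha - 1.0)
    return -np.log(np.sum(p * q)
                   / (np.sum(p ** alpha) ** (1 / alpha)
                      * np.sum(q ** beta) ** (1 / beta)))

def hd(p, q, alpha, gamma):
    """HD D^H_{alpha,gamma} for discrete positive arrays (our transcription)."""
    beta = alpha / (alpha - 1.0)
    return -np.log(np.sum(p ** (gamma / alpha) * q ** (gamma / beta))
                   / (np.sum(p ** gamma) ** (1 / alpha)
                      * np.sum(q ** gamma) ** (1 / beta)))

rng = np.random.default_rng(1)
p, q = rng.random(5) + 0.1, rng.random(5) + 0.1   # arbitrary positive "measures"
alpha = 3.0
beta = alpha / (alpha - 1.0)
p0 = p ** (1.0 / (alpha - 1.0))                   # pre-aim transform of p
print(hpd(p0, p, alpha))                          # ~0: p0 aims exactly at p
print(hpd(p0, q, alpha), hd(p, q, alpha, beta))   # equal: pre-aimed HPD == HD
```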
Hölder-type inequalities have been generalized to sets [48] instead of pairs of objects, and to positive functional spaces as well [49]. We also note that some divergences, like the Csiszár f-divergences, themselves enjoy Hölder-type inequalities [50].
We expect that these two novel parametric Hölder classes of statistical divergences and pseudo-divergences will open up new insights and applications in statistics and information sciences. Furthermore, the framework for building divergences or pseudo-divergences from proper or improper bi-parametric inequalities [1] offers novel classes of divergences to study.
Reproducible source code is available online [51].

Acknowledgments

The authors gratefully thank the referees for their comments. Ke Sun is funded by King Abdullah University of Science and Technology (KAUST).

Author Contributions

Frank Nielsen discussed the seminal ideas with Ke Sun and Stéphane Marchand-Maillet. Frank Nielsen and Ke Sun contributed to the theoretical results as well as to the writing of the article. Ke Sun implemented the methods and performed the numerical experiments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

$D^{\mathtt{HS}}_\alpha$: Hölder proper non-projective score-induced divergence [18]
$D^{\mathtt{H}}_{\alpha,\sigma,\tau}$: Hölder improper projective pseudo-divergence (new)
$D^{\mathtt{H}}_{\alpha,\gamma}$: Hölder proper projective divergence (new)
$D^{\mathtt{HE}}_\alpha$: Hölder proper projective escort divergence (new)
KL: Kullback–Leibler divergence [10]
CS: Cauchy–Schwarz divergence [2]
$B$: Bhattacharyya distance [25]
$B_{1/\alpha}$: skew Bhattacharyya distance [8]
$D_\gamma$: $\gamma$-divergence (score-induced) [11]
$p^E_\alpha, q^E_\beta$: escort distributions
$\alpha, \beta$: Hölder conjugate pair of exponents: $\frac{1}{\alpha} + \frac{1}{\beta} = 1$
$\bar\alpha, \beta$: Hölder conjugate exponent: $\bar\alpha = \beta = \frac{\alpha}{\alpha-1}$
$\theta_p, \theta_q$: natural parameters of exponential family distributions
$\mathcal{X}$: support of the distributions
$\mu$: Lebesgue measure
$L^\gamma(\mathcal{X}, \mu)$: Lebesgue space of functions $f$ such that $\int_{\mathcal{X}} |f(x)|^\gamma \mathrm{d}x < \infty$

Appendix A. Proof of Hölder Ordinary and Reverse Inequalities

We extend the proof ([52], p. 78) to prove both the (ordinary or forward) Hölder inequality and the reverse Hölder inequality.
Proof. 
First, let us observe that $-\log(x)$ is strictly convex on $(0, +\infty)$ since $(-\log(x))'' = \frac{1}{x^2} > 0$. It follows for $0 < a < 1$ that:
$$-\log(a x_1 + (1-a) x_2) \leq -a \log(x_1) - (1-a) \log(x_2), \tag{A1}$$
where equality holds iff $x_1 = x_2$.
Conversely, when $a < 0$ or $a > 1$, we have:
$$-\log(a x_1 + (1-a) x_2) \geq -a \log(x_1) - (1-a) \log(x_2), \tag{A2}$$
where equality holds iff $x_1 = x_2$.
Equivalently, we can write these two inequalities as follows:
$$x_1^a x_2^{1-a} \leq a x_1 + (1-a) x_2 \quad (\text{if } 0 < a < 1); \qquad x_1^a x_2^{1-a} \geq a x_1 + (1-a) x_2 \quad (\text{if } a < 0 \text{ or } a > 1), \tag{A3}$$
both of them being tight iff $x_1 = x_2$.
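As a quick sanity check (ours) of both branches, take $x_1 = 1$ and $x_2 = 4$:
$$a = \tfrac{1}{2}:\quad 1^{1/2}\, 4^{1/2} = 2 \leq \tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 4 = 2.5; \qquad a = 2:\quad 1^{2}\, 4^{-1} = \tfrac{1}{4} \geq 2\cdot 1 - 1\cdot 4 = -2.$$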
Let $P$ and $Q$ be positive measures with Radon–Nikodym densities $p(x) > 0$ and $q(x) > 0$ with respect to the reference Lebesgue measure $\mu$; the densities are strictly positive on the support $\mathcal{X}$. Plugging:
$$a = \frac{1}{\alpha}, \quad 1 - a = \frac{1}{\beta}, \quad x_1 = \frac{p(x)^\alpha}{\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x}, \quad x_2 = \frac{q(x)^\beta}{\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x},$$
into Equation (A3), we get:
$$\frac{p(x)}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\alpha}} \frac{q(x)}{\left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} \leq \frac{1}{\alpha} \frac{p(x)^\alpha}{\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x} + \frac{1}{\beta} \frac{q(x)^\beta}{\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x} \quad \text{if } \alpha > 0 \text{ and } \beta > 0,$$
$$\frac{p(x)}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\alpha}} \frac{q(x)}{\left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} \geq \frac{1}{\alpha} \frac{p(x)^\alpha}{\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x} + \frac{1}{\beta} \frac{q(x)^\beta}{\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x} \quad \text{if } \alpha < 0 \text{ or } \beta < 0.$$
Assume that $p(x) \in L^\alpha(\mathcal{X}, \mu)$ and $q(x) \in L^\beta(\mathcal{X}, \mu)$, so that both $\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x$ and $\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x$ converge. Integrating both sides over $\mathcal{X}$, we get:
$$\frac{\int_{\mathcal{X}} p(x)\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} \leq 1 \quad \text{if } \alpha > 0 \text{ and } \beta > 0, \qquad \frac{\int_{\mathcal{X}} p(x)\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} \geq 1 \quad \text{if } \alpha < 0 \text{ or } \beta < 0.$$
The necessary and sufficient condition for equality is that:
$$\frac{p(x)^\alpha}{\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x} = \frac{q(x)^\beta}{\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x},$$
almost everywhere. That is, there exists a positive constant λ > 0 , such that:
$$p(x)^\alpha = \lambda\, q(x)^\beta, \quad \lambda > 0, \quad \text{almost everywhere}.$$
The Hölder conjugate exponents $\alpha$ and $\beta$ satisfy $\frac{1}{\alpha} + \frac{1}{\beta} = 1$, that is, $\beta = \frac{\alpha}{\alpha-1}$. Thus, when $\alpha < 0$, we necessarily have $\beta > 0$, and vice versa.
We can unify the ordinary and reverse Hölder inequalities into a single inequality by considering the sign of $\alpha\beta = \frac{\alpha^2}{\alpha-1}$: we get the general Hölder inequality:
$$\mathrm{sign}(\alpha\beta)\, \frac{\int_{\mathcal{X}} p(x)\, q(x)\, \mathrm{d}x}{\left(\int_{\mathcal{X}} p(x)^\alpha \mathrm{d}x\right)^{1/\alpha} \left(\int_{\mathcal{X}} q(x)^\beta \mathrm{d}x\right)^{1/\beta}} \leq \mathrm{sign}(\alpha\beta).$$
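A short numerical sketch (ours) of both branches of this unified inequality, using arbitrary discrete positive vectors under the counting measure:

```python
import numpy as np

def holder_ratio(p, q, alpha):
    """Ratio appearing in the unified Hölder inequality, for positive vectors."""
    beta = alpha / (alpha - 1.0)
    return (np.sum(p * q)
            / (np.sum(p ** alpha) ** (1 / alpha) * np.sum(q ** beta) ** (1 / beta)))

rng = np.random.default_rng(0)
p, q = rng.random(100) + 0.1, rng.random(100) + 0.1
print(holder_ratio(p, q, 2.0))   # alpha, beta > 0: ordinary Hölder, ratio <= 1
print(holder_ratio(p, q, 0.5))   # beta = -1 < 0: reverse Hölder, ratio >= 1
```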
When α = β = 2 , the Hölder inequality becomes the Cauchy–Schwarz inequality:
$$\int_{\mathcal{X}} p(x)\, q(x)\, \mathrm{d}x \leq \sqrt{\int_{\mathcal{X}} p(x)^2\, \mathrm{d}x}\, \sqrt{\int_{\mathcal{X}} q(x)^2\, \mathrm{d}x}.$$
Historically, Cauchy stated the discrete sum inequality in 1821, while Schwarz reported the integral form of the inequality in 1888.

References

  1. Mitrinovic, D.S.; Pecaric, J.; Fink, A.M. Classical and New Inequalities in Analysis; Springer Science & Business Media: New York, NY, USA, 2013; Volume 61.
  2. Budka, M.; Gabrys, B.; Musial, K. On accuracy of PDF divergence estimators and their applicability to representative data sampling. Entropy 2011, 13, 1229–1266.
  3. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences series; Springer: Tokyo, Japan, 2016.
  4. Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
  5. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
  6. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 2016–2020.
  7. Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, 28, 489–495.
  8. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
  9. Gneiting, T.; Raftery, A.E. Strictly proper scoring rules, prediction, and estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
  10. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  11. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
  12. Nielsen, F.; Nock, R. Patch Matching with Polynomial Exponential Families and Projective Divergences. In Proceedings of the 9th International Conference on Similarity Search and Applications, Tokyo, Japan, 24–26 October 2016; pp. 109–116.
  13. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
  14. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
  15. Nielsen, F.; Nock, R. A closed-form expression for the Sharma–Mittal entropy of exponential families. J. Phys. A Math. Theor. 2011, 45, 032003.
  16. De Souza, D.C.; Vigelis, R.F.; Cavalcante, C.C. Geometry Induced by a Generalization of Rényi Divergence. Entropy 2016, 18, 407.
  17. Kanamori, T.; Fujisawa, H. Affine invariant divergences associated with proper composite scoring rules and their applications. Bernoulli 2014, 20, 2278–2304.
  18. Kanamori, T. Scale-invariant divergences for density functions. Entropy 2014, 16, 2611–2628.
  19. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards. arXiv 2009, arXiv:0911.4863.
  20. Rogers, L.J. An extension of a certain theorem in inequalities. Messenger Math. 1888, 17, 145–150.
  21. Hölder, O.L. Über einen Mittelwertssatz. Nachr. Akad. Wiss. Gottingen Math. Phys. Kl. 1889, 44, 38–47.
  22. Hasanbelliu, E.; Giraldo, L.S.; Principe, J.C. Information theoretic shape matching. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2436–2451.
  23. Nielsen, F. Closed-form information-theoretic divergences for statistical mixtures. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), Tsukuba, Japan, 11–15 November 2012; pp. 1723–1726.
  24. Zhang, J. Reference duality and representation duality in information geometry. Am. Inst. Phys. Conf. Ser. 2015, 1641, 130–146.
  25. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109.
  26. Srivastava, A.; Jermyn, I.; Joshi, S. Riemannian analysis of probability density functions with applications in vision. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
  27. Nielsen, F.; Nock, R. On the Chi Square and Higher-Order Chi Distances for Approximating f-Divergences. IEEE Signal Process. Lett. 2014, 1, 10–13.
  28. Nielsen, F.; Nock, R. Skew Jensen-Bregman Voronoi diagrams. In Transactions on Computational Science XIV; Springer: New York, NY, USA, 2011; pp. 102–128.
  29. Nielsen, F.; Sun, K. Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using Piecewise Log-Sum-Exp Inequalities. Entropy 2016, 18, 442.
  30. Notsu, A.; Komori, O.; Eguchi, S. Spontaneous clustering via minimum gamma-divergence. Neural Comput. 2014, 26, 421–448.
  31. Rigazio, L.; Tsakam, B.; Junqua, J.C. An optimal Bhattacharyya centroid algorithm for Gaussian clustering with applications in automatic speech recognition. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Istanbul, Turkey, 5–9 June 2000; Volume 3, pp. 1599–1602.
  32. Davis, J.V.; Dhillon, I.S. Differential Entropic Clustering of Multivariate Gaussians. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 7–10 December 2006; pp. 337–344.
  33. Nielsen, F.; Nock, R. Clustering multivariate normal distributions. In Emerging Trends in Visual Computing; Springer: New York, NY, USA, 2009; pp. 164–174.
  34. Allamigeon, X.; Gaubert, S.; Goubault, E.; Putot, S.; Stott, N. A scalable algebraic method to infer quadratic invariants of switched systems. In Proceedings of the 12th International Conference on Embedded Software, Amsterdam, The Netherlands, 4–9 October 2015; pp. 75–84.
  35. Sun, D.L.; Févotte, C. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 6201–6205.
  36. Banerjee, A.; Dhillon, I.S.; Ghosh, J.; Sra, S. Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res. 2005, 6, 1345–1382.
  37. Gopal, S.; Yang, Y. Von Mises-Fisher Clustering Models. J. Mach. Learn. Res. 2014, 32, 154–162.
  38. Rami, H.; Belmerhnia, L.; Drissi El Maliani, A.; El Hassouni, M. Texture Retrieval Using Mixtures of Generalized Gaussian Distribution and Cauchy-Schwarz Divergence in Wavelet Domain. Image Commun. 2016, 42, 45–58.
  39. Bunte, K.; Haase, S.; Biehl, M.; Villmann, T. Stochastic neighbor embedding (SNE) for dimension reduction and visualization using arbitrary divergences. Neurocomputing 2012, 90, 23–45.
  40. Villmann, T.; Haase, S. Divergence-based vector quantization. Neural Comput. 2011, 23, 1343–1392.
  41. Huang, J.B.; Ahuja, N. Saliency detection via divergence analysis: A unified perspective. In Proceedings of the 2012 21st International Conference on Pattern Recognition (ICPR), Tsukuba, Japan, 11–15 November 2012; pp. 2748–2751.
  42. Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Abingdon, UK, 2005.
  43. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Abingdon, UK, 2011.
  44. Hinton, G.E.; Roweis, S.T. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems 15 (NIPS); MIT Press: Vancouver, BC, Canada, 2002; pp. 833–840.
  45. Maaten, L.V.D.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
  46. Carreira-Perpiñán, M.Á. The Elastic Embedding Algorithm for Dimensionality Reduction. In Proceedings of the International Conference on Machine Learning, Haifa, Israel, 21–25 June 2010; pp. 167–174.
  47. Sun, K.; Marchand-Maillet, S. An Information Geometry of Statistical Manifold Learning. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1–9.
  48. Cheung, W.S. Generalizations of Hölder's inequality. Int. J. Math. Math. Sci. 2001, 26, 7–10.
  49. Hazewinkel, M. Encyclopedia of Mathematics; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2001.
  50. Chen, G.S.; Shi, X.J. Generalizations of Hölder inequalities for Csiszár's f-divergence. J. Inequal. Appl. 2013, 2013, 151.
  51. Nielsen, F.; Sun, K.; Marchand-Maillet, S. On Hölder Projective Divergences. 2017. Available online: https://www.lix.polytechnique.fr/~nielsen/HPD/ (accessed on 16 March 2017).
  52. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
Figure 1. Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (tri-parametric) encompass Cauchy–Schwarz divergence and skew Bhattacharyya divergence.
Figure 2. First row: the Hölder pseudo-divergence (HPD) $D^{\mathtt{H}}_{\alpha,1,1}(p_r : p)$ for $\alpha \in \{4/3, 2, 4\}$, KL divergence and reverse KL divergence. Remaining rows: the HD $D^{\mathtt{H}}_{\alpha,\gamma}(p_r : p)$ for $\alpha \in \{4/3, 1.5, 2, 4, 10\}$ (from top to bottom) and $\gamma \in \{0.5, 1, 2, 5, 10\}$ (from left to right). The reference distribution $p_r$ is presented as "★". The minimizer of $D^{\mathtt{H}}_{\alpha,1,1}(p_r : p)$, if different from $p_r$, is presented as "•". Notice that $D^{\mathtt{H}}_{2,2} = D^{\mathtt{H}}_{2,1,1}$. (a) Reference categorical distribution $p_r = (1/3, 1/3, 1/3)$; (b) reference categorical distribution $p_r = (1/2, 1/3, 1/6)$.
Figure 3. First row: $D^{\mathtt{H}}_{\alpha,1,1}(p_r : p)$, where $p_r$ is the standard Gaussian distribution and $\alpha \in \{4/3, 2, 4\}$, compared to the KL divergence. Remaining rows: $D^{\mathtt{H}}_{\alpha,\gamma}(p_r : p)$ for $\alpha \in \{4/3, 1.5, 2, 4, 10\}$ (from top to bottom) and $\gamma \in \{0.5, 1, 2, 5, 10\}$ (from left to right). Notice that $D^{\mathtt{H}}_{2,2} = D^{\mathtt{H}}_{2,1,1}$. The coordinate system is formed by $\mu$ (mean) and $\sigma$ (standard deviation).
Figure 4. Variational $k$-means clustering results on a toy dataset consisting of a set of 2D Gaussians organized into two or three clusters. The cluster centroids are represented by contour plots using the same density levels. (a) $\alpha = \gamma = 1.1$ (Hölder clustering); (b) $\alpha = \gamma = 2$ (Cauchy–Schwarz clustering); (c) $\alpha = \gamma = 1.1$ (Hölder clustering); (d) $\alpha = \gamma = 2$ (Cauchy–Schwarz clustering).
Table 1. Clustering accuracy of the 2D Gaussian dataset (based on 1000 independent runs). CS, Cauchy–Schwarz. Bold numbers indicate the best obtained performance.

| k (#Clusters) | n (#Samples) | α = γ = 1.1 | α = γ = 1.5 | α = γ = 2 (CS) | α = γ = 10 |
|---|---|---|---|---|---|
| 2 | 50 | **94.5% ± 10.5%** | 89.9% ± 13.2% | 89.4% ± 13.5% | 88.9% ± 14.0% |
| 2 | 100 | **96.9% ± 6.8%** | 94.3% ± 9.9% | 93.8% ± 10.6% | 93.1% ± 11.6% |
| 3 | 50 | **84.6% ± 15.5%** | 79.3% ± 14.8% | 79.0% ± 14.7% | 78.7% ± 14.5% |
| 3 | 100 | **89.6% ± 13.8%** | 83.9% ± 14.6% | 83.1% ± 14.5% | 82.8% ± 14.4% |
