[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Next Article in Journal
The Entropic Linkage between Equity and Bond Market Dynamics
Next Article in Special Issue
Conjugate Representations and Characterizing Escort Expectations in Information Geometry
Previous Article in Journal / Special Issue
The Mehler-Fock Transform in Signal Processing
You seem to have javascript disabled. Please note that many of the page functionalities won't work as expected without javascript enabled.
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space †

by
Loïc Devilliers
1,*,
Stéphanie Allassonnière
2,
Alain Trouvé
3 and
Xavier Pennec
1
1
Université Côte d’Azur, Inria, France
2
INSERM, UMRS 1138, CRC, team 22, Paris Descartes University, UPMC, Paris, France
3
CMLA, ENS Cachan, CNRS, Université Paris-Saclay, 94235 Cachan, France
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Information Processing in Medical Imaging 2017, Boone, NC, USA, 25–30 June 2017.
Entropy 2017, 19(6), 288; https://doi.org/10.3390/e19060288
Submission received: 27 April 2017 / Revised: 7 June 2017 / Accepted: 17 June 2017 / Published: 20 June 2017
(This article belongs to the Special Issue Information Geometry II)
Figure 1
<p>Three functions defined on the interval <math display="inline"> <semantics> <mrow> <mo>[</mo> <mn>0</mn> <mo>,</mo> <mn>1</mn> <mo>]</mo> </mrow> </semantics> </math>. The blue one (<math display="inline"> <semantics> <msub> <mi>f</mi> <mn>0</mn> </msub> </semantics> </math>) is a step function, the red one (<math display="inline"> <semantics> <msub> <mi>f</mi> <mn>1</mn> </msub> </semantics> </math>) is a translated version of the blue one when noise has been added, and the green one (<math display="inline"> <semantics> <msub> <mi>f</mi> <mn>3</mn> </msub> </semantics> </math>) is the null function.</p> ">
Figure 2
<p>Due to the invariant action, the orbits are parallel. Here the orbits are circles centred at 0. This is the case when the group <span class="html-italic">G</span> is the group of rotations.</p> ">
Figure 3
<p>Representation of the three functions <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>1</mn> </msub> </semantics> </math>, <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>2</mn> </msub> </semantics> </math> and <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>3</mn> </msub> </semantics> </math> with <math display="inline"> <semantics> <mrow> <mi>h</mi> <mo>=</mo> <mn>0.05</mn> </mrow> </semantics> </math>. the functions <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>2</mn> </msub> </semantics> </math> and <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>3</mn> </msub> </semantics> </math> are registered with respect to <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>1</mn> </msub> </semantics> </math>. However <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>2</mn> </msub> </semantics> </math> and <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>3</mn> </msub> </semantics> </math> are not registered with each other, since it is more profitable to shift <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>2</mn> </msub> </semantics> </math> in order to align the highest parts of <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>2</mn> </msub> </semantics> </math> and <math display="inline"> <semantics> <msub> <mi>f</mi> <mn>3</mn> </msub> </semantics> </math>.</p> ">
Figure 4
<p>We minimize the variance on each half-line <math display="inline"> <semantics> <mrow> <msup> <mi mathvariant="double-struck">R</mi> <mo>+</mo> </msup> <mi>v</mi> </mrow> </semantics> </math> where <math display="inline"> <semantics> <mrow> <mo stretchy="false">∥</mo> <mi>v</mi> <mo stretchy="false">∥</mo> <mo>=</mo> <mn>1</mn> </mrow> </semantics> </math>. The element which minimizes the variance on such a half-line is <math display="inline"> <semantics> <mrow> <mover accent="true"> <mi>λ</mi> <mo>˜</mo> </mover> <mrow> <mo stretchy="false">(</mo> <mi>v</mi> <mo stretchy="false">)</mo> </mrow> <mi>v</mi> </mrow> </semantics> </math>, where <math display="inline"> <semantics> <mrow> <mover accent="true"> <mi>λ</mi> <mo>˜</mo> </mover> <mrow> <mo stretchy="false">(</mo> <mi>v</mi> <mo stretchy="false">)</mo> </mrow> <mo>≥</mo> <mn>0</mn> </mrow> </semantics> </math>. We get a surface in <span class="html-italic">M</span> by <math display="inline"> <semantics> <mrow> <mi>S</mi> <mo>∈</mo> <mi>v</mi> <mo>↦</mo> <mover accent="true"> <mi>λ</mi> <mo>˜</mo> </mover> <mrow> <mo stretchy="false">(</mo> <mi>v</mi> <mo stretchy="false">)</mo> </mrow> <mi>v</mi> </mrow> </semantics> </math> (which is a curve in this figure since we draw it in dimension 2). The Proof of Theorem 1 states that if <math display="inline"> <semantics> <mrow> <mo>[</mo> <msub> <mi>m</mi> <mo>☆</mo> </msub> <mo>]</mo> </mrow> </semantics> </math> is a Fréchet mean then <math display="inline"> <semantics> <msub> <mi>m</mi> <mo>☆</mo> </msub> </semantics> </math> is an extreme point of this surface. On this picture there are four extreme points which are in the same orbit: we took here the simple example of the group of rotations of 0, 90, 180 and 270 degrees.</p> ">
Figure 5
<p>Iterative minimization of the function <span class="html-italic">J</span> on the two axis, the horizontal axis represents the variable in the space <span class="html-italic">M</span>, the vertical axis represents the set of all the possible registrations <math display="inline"> <semantics> <msup> <mi>G</mi> <mi>I</mi> </msup> </semantics> </math>. Once the convergence is reached, the point <math display="inline"> <semantics> <mrow> <mo stretchy="false">(</mo> <msub> <mi>m</mi> <mi>n</mi> </msub> <mo>,</mo> <msub> <mi>g</mi> <mi>n</mi> </msub> <mo stretchy="false">)</mo> </mrow> </semantics> </math> is the minimum of the function <span class="html-italic">J</span> on the two axis in green. Is this point the minimum of <span class="html-italic">J</span> on its whole domain? There are two pitfalls: firstly this point could be a saddle point, it can be avoided with Proposition 2, secondly this point could be a local (but not global) minimum, this is discussed in <a href="#sec2dot5dot3-entropy-19-00288" class="html-sec">Section 2.5.3</a>.</p> ">
Figure 6
<p>Template <math display="inline"> <semantics> <msub> <mi>t</mi> <mn>0</mn> </msub> </semantics> </math> and template estimation <math display="inline"> <semantics> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> </semantics> </math> on <a href="#entropy-19-00288-f006" class="html-fig">Figure 6</a>a. Empirical variance at the template and template estimation with the max-max algorithm as a function of the size of the sample on <a href="#entropy-19-00288-f006" class="html-fig">Figure 6</a>b. (<b>a</b>) Example of a template (a step function) and the estimated template <math display="inline"> <semantics> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> </semantics> </math> with a sample size <math display="inline"> <semantics> <msup> <mn>10</mn> <mn>5</mn> </msup> </semantics> </math> in <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mn>64</mn> </msup> </semantics> </math>, <math display="inline"> <semantics> <mi>ϵ</mi> </semantics> </math> is Gaussian noise and <math display="inline"> <semantics> <mrow> <mi>σ</mi> <mo>=</mo> <mn>10</mn> </mrow> </semantics> </math>. At the discontinuity points of the template, we observe a Gibbs-like phenomena; (<b>b</b>) Variation of <math display="inline"> <semantics> <mrow> <msub> <mi>F</mi> <mi>I</mi> </msub> <mrow> <mo stretchy="false">(</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> (in blue) and of <math display="inline"> <semantics> <mrow> <msub> <mi>F</mi> <mi>I</mi> </msub> <mrow> <mo stretchy="false">(</mo> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math> (in red) as a function of <span class="html-italic">I</span> the size of the sample. Since convergence is already reached, <math display="inline"> <semantics> <mrow> <mi>F</mi> <mo stretchy="false">(</mo> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> <mo stretchy="false">)</mo> </mrow> </semantics> </math>, which is the limit of red curve, is below <math display="inline"> <semantics> <mrow> <mi>F</mi> <mo stretchy="false">(</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> <mo stretchy="false">)</mo> </mrow> </semantics> </math>: <math display="inline"> <semantics> <mrow> <mi>F</mi> <mo stretchy="false">(</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> <mo stretchy="false">)</mo> </mrow> </semantics> </math> is the limit of the blue curve. Due to the inconsistency, <math display="inline"> <semantics> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> </semantics> </math> is an example of point such that <math display="inline"> <semantics> <mrow> <mi>F</mi> <mrow> <mo stretchy="false">(</mo> <mover accent="true"> <mi>m</mi> <mo stretchy="false">^</mo> </mover> <mo stretchy="false">)</mo> </mrow> <mo>&lt;</mo> <mi>F</mi> <mrow> <mo stretchy="false">(</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> <mo stretchy="false">)</mo> </mrow> </mrow> </semantics> </math>.</p> ">
Figure 7
<p>Example of an other template (here a discretization of a continuous function) and template estimation with a sample size <math display="inline"> <semantics> <msup> <mn>10</mn> <mn>3</mn> </msup> </semantics> </math> in <math display="inline"> <semantics> <msup> <mi mathvariant="double-struck">R</mi> <mn>64</mn> </msup> </semantics> </math>, <math display="inline"> <semantics> <mi>ϵ</mi> </semantics> </math> is Gaussian noise and <math display="inline"> <semantics> <mrow> <mi>σ</mi> <mo>=</mo> <mn>10</mn> </mrow> </semantics> </math>. Even with a continuous function the inconsistency appears.</p> ">
Figure 8
<p>Example of three orbits, when <math display="inline"> <semantics> <msub> <mover accent="true"> <mi>d</mi> <mo>˜</mo> </mover> <mi>Q</mi> </msub> </semantics> </math> does not satisfy the inequality triangular.</p> ">
Figure 9
<p>Representation of the three cases, on each we can find an <span class="html-italic">x</span> in the support of the noise such as <math display="inline"> <semantics> <mrow> <mfenced separators="" open="&#x2329;" close="&#x232A;"> <mi>x</mi> <mo>,</mo> <msub> <mi>g</mi> <mn>0</mn> </msub> <mo>·</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mfenced> <mo>&gt;</mo> <mfenced separators="" open="&#x2329;" close="&#x232A;"> <mi>x</mi> <mo>,</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mfenced> </mrow> </semantics> </math> and by continuity of the dot product <math display="inline"> <semantics> <mrow> <mfenced separators="" open="&#x2329;" close="&#x232A;"> <mi>ϵ</mi> <mo>,</mo> <msub> <mi>g</mi> <mn>0</mn> </msub> <mo>·</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mfenced> <mo>&gt;</mo> <mfenced separators="" open="&#x2329;" close="&#x232A;"> <mi>ϵ</mi> <mo>,</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mfenced> </mrow> </semantics> </math> with is an event with a non zero probability, (for instance the ball in gray). This is enough in order to show that <math display="inline"> <semantics> <mrow> <mi>θ</mi> <mo stretchy="false">(</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> <mo stretchy="false">)</mo> <mo>&gt;</mo> <mn>0</mn> </mrow> </semantics> </math>. (<b>a</b>) Case 1: <math display="inline"> <semantics> <msub> <mi>t</mi> <mn>0</mn> </msub> </semantics> </math> and <math display="inline"> <semantics> <mrow> <mi>g</mi> <mo>·</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mrow> </semantics> </math> are linearly independent; (<b>b</b>) Case 2: <math display="inline"> <semantics> <mrow> <mi>g</mi> <mo>·</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mrow> </semantics> </math> is proportional to <math display="inline"> <semantics> <msub> <mi>t</mi> <mn>0</mn> </msub> </semantics> </math> with a factor <math display="inline"> <semantics> <mrow> <mo>&gt;</mo> <mn>1</mn> </mrow> </semantics> </math>; (<b>c</b>) Case 3: <math display="inline"> <semantics> <mrow> <mi>g</mi> <mo>·</mo> <msub> <mi>t</mi> <mn>0</mn> </msub> </mrow> </semantics> </math> is proportional to <math display="inline"> <semantics> <msub> <mi>t</mi> <mn>0</mn> </msub> </semantics> </math> with a factor <math display="inline"> <semantics> <mrow> <mo>&lt;</mo> <mn>1</mn> </mrow> </semantics> </math>.</p> ">
Figure 10
<p>In the case of affine translation by vectors of <span class="html-italic">V</span>, the orbits are affine subspace parallel to <span class="html-italic">V</span>. The distance between two orbits <math display="inline"> <semantics> <mrow> <mo>[</mo> <mi>x</mi> <mo>]</mo> </mrow> </semantics> </math> and <math display="inline"> <semantics> <mrow> <mo>[</mo> <mi>y</mi> <mo>]</mo> </mrow> </semantics> </math> is given by the distance between the orthogonal projection of <span class="html-italic">x</span> and <span class="html-italic">y</span> in <math display="inline"> <semantics> <msup> <mi>V</mi> <mo>⊥</mo> </msup> </semantics> </math>. This is an example where template estimation is consistent.</p> ">
Versions Notes

Abstract

:
We tackle the problem of template estimation when data have been randomly deformed under a group action in the presence of noise. In order to estimate the template, one often minimizes the variance when the influence of the transformations have been removed (computation of the Fréchet mean in the quotient space). The consistency bias is defined as the distance (possibly zero) between the orbit of the template and the orbit of one element which minimizes the variance. In the first part, we restrict ourselves to isometric group action, in this case the Hilbertian distance is invariant under the group action. We establish an asymptotic behavior of the consistency bias which is linear with respect to the noise level. As a result the inconsistency is unavoidable as soon as the noise is enough. In practice, template estimation with a finite sample is often done with an algorithm called “max-max”. In the second part, also in the case of isometric group finite, we show the convergence of this algorithm to an empirical Karcher mean. Our numerical experiments show that the bias observed in practice can not be attributed to the small sample size or to a convergence problem but is indeed due to the previously studied inconsistency. In a third part, we also present some insights of the case of a non invariant distance with respect to the group action. We will see that the inconsistency still holds as soon as the noise level is large enough. Moreover we prove the inconsistency even when a regularization term is added.

1. Introduction

1.1. General Introduction

Template estimation is a well known issue in different fields such as statistics on signals [1], shape theory, computational anatomy [2,3,4] etc. In these fields, the template (which can be viewed as the prototype of our data) can be (according to different vocabulary) shifted, transformed, wrapped or deformed due to different groups acting on data. Moreover, due to a limited precision in the measurement, the presence of noise is almost always unavoidable. These mixed effects on data lead us to study the consistency of algorithms which claim to compute the template. A popular algorithm consists in the minimization of the variance, in other words, the computation of the Fréchet mean in quotient space. This method has been already proved to be inconsistent [5,6,7]. In [5] the authors proves the inconsistency with a lower bound of the expectation of the error between the original template and the estimated template with a finite sample, they deduce that this expectation does not go to zero as the size of the sample goes to infinity. This work was done in a functional space, where functions only observed at a finite number of points of the functions were observed. In this case one can model these observable values on a grid. When the resolution of the grid goes to zero, one can show the consistency [8] by using the Fréchet mean with the Wasserstein distance on the space of measures rather than in the space of functions. However, in (medical) images the number of pixels or voxels is finite.
In [6], the authors demonstrated the inconsistency in a finite dimensional manifold with Gaussian noise, when the noisel level tends to zero. In our previous work [7], we focused our study on the inconsistency with Hilbert Space (including infinite dimensional case) as ambient space. This current paper is an extension of a conference paper [9].

1.2. Why Using a Group Action? Comparison with the Standard Norm

In the following, we take a simple example which justifies the use of the group action in order to compare the shape of two functions:
On Figure 1, suppose that you want to compare these functions. The simplest way to compare f 0 with f 1 would be to compute the L 2 -norm (or any other norm) of f 0 f 1 , if we do that we have that f 0 f 1 0.6 . Likewise f 0 f 2 0.6 , therefore the norm tells us that f 0 is at the same distance from f 1 and from f 2 . Yet, our eyes would say that f 0 , f 1 have the same shape, contrarily to f 0 and f 2 . Therefore the simple use of the L 2 -norm in the space of functions is not enough. To have a relevant way to compare functions, one can register functions first. Firstly, we estimate the better time translation which aligns f 0 and f 1 and secondly, we compute the L 2 -norm after this alignment step. On this example, we find that the distance is now ≃0.02. On the contrarily, after alignment the distance between f 0 and f 2 is still ≃0.6. With this new way of comparing functions, the functions f 0 looks like f 1 but do not look like f 2 . This fits with our intuition. That is why we use a group action in order to perform statistics. In the following paragraph, we precise how to do it in general.
This idea of using deformations/transformation in order to compare things is not new. It was already proposed by Darcy Thompson [10] in the beginning of the 20th century, in order to classify species.

1.3. Settings and Notation

In this paper, we suppose that observations belong to a Hilbert space ( M , · , · ) , we denote by · the norm associated to the dot product · , · . We also consider a group of transformation G which acts on M the space of observations. This means that g · ( g · x ) = ( g g ) · x and e · x = x for all x M , g , g G , where e is the identity element of G. Note that in this article, g · x is the result of the action of g on x, and · should not to be confused with the multiplication of real numbers noted ×.
The generative model is the following: we transform an unknown template  t 0 M with Φ a random and unknown element of the group G and we add some noise. Let σ be a positive noise level and ϵ a standardized noise: E ( ϵ ) = 0 , E ( ϵ 2 ) = 1 . Moreover we suppose that ϵ and Φ are independent random variables. Finally, the only observable random variable is:
Y = Φ · t 0 + σ ϵ .
This generative model is commonly used in Computational anatomy in diverse frameworks, for instance with currents [11,12], varifolds [13], LDDMM on images [14] but also in functional data analysis [1]. All these works are applied in different spaces, for instance, the varifold builds an embedding of the surfaces into an Hilbert space, and a group of diffeomorphisms have the ability of deform these surfaces. Supposing a general group action on a space with the generative model (1) allows us to embed all these various situations into one abstract model, and to study template estimation in this abstract model.
Example of noise: if we assume that the noise is independent and identically distributed on each pixel or voxel with a standard deviation w, then σ = N w , where N is the number of pixels/voxels. However, the noise which we consider can be more general: we do not require the fact that the noise is independent over each region of the space M.
Note that the inconsistency of Template estimation can be also studied with an alternative generative model, called backward model where Y = Φ · ( t 0 + σ ϵ ) [7]. Some authors also use the term perturbation model see [15,16,17].
Quotient space: the random transformation of the template by the group leads us to project the observation Y into the quotient space. The quotient space is defined as the set containing all the orbit [ x ] = { g · x , g G } for x M . The set which is constituted of all orbits is call the quotient space M by the group G and is noted by:
Q = M / G = { [ x ] , x M } .
As we want to do statistics on this space, we aim to equip the quotient with a metric. One often requires that d M the distance in the ambient space is invariant under the group action G, this means that
m , n M , g G d M ( g · m , g · n ) = d M ( m , n ) .
If d M is invariant and if the orbits are closed sets (if the orbits are not closed sets, it is possible to have d Q ( [ a ] , [ b ] ) = 0 even if [ a ] [ b ] , in this case we call d Q a pseudo-distance. Nevertheless, this has no consequence in this paper if d Q is only a pseudo-distance), then
d Q ( [ x ] , [ y ] ) = inf g G d M ( x , g · y ) ,
is well defined, and d Q is a distance in the quotient space. The quotient distance d Q ( [ x ] , [ y ] ) is the distance between x and y where y is the registration of y with respect to x. We say in this case that y is in optimal position with respect to x.
One particular distance in the ambient space M, which we use in all this article, is the distance given by the norm of the Hilbert space: d M ( a , b ) = a b . Moreover we say that G acts isometrically on M, if x g · x is a linear map which leaves the norm unchanged. In this case d M the distance given by the norm of the Hilbert space is invariant under the group action. The quotient (pseudo)-distance is, in this case (see Figure 2), d Q ( [ a ] , [ b ] ) = inf g G a g · b .
Remark 1.
When G acts isometrically on M a Hilbert space, by expansion of the squared norm we have:
d Q ( [ a ] , [ b ] ) 2 = a 2 2 sup g G a , g · b + b 2
Thus, even if the quotient space is not a linear space, we have a “polarization identity” in the quotient space:
sup g G a , g · b = 1 2 a 2 + b 2 d Q 2 ( [ a ] , [ b ] = 1 2 d Q 2 ( [ a ] , [ 0 ] ) + d Q 2 ( [ b ] , [ 0 ] ) d Q 2 ( [ a ] , [ b ]
When the distance given by the norm is invariant under the group action, we define the variance of the random orbit [ Y ] as the expectation of the (pseudo)-distance between the random orbit [ Y ] and the orbit of a point x in M:
F ( x ) = E ( d Q 2 ( [ x ] , [ Y ] ) ) = E ( inf g G g · x Y 2 ) = E ( inf g G x g · Y 2 ) .
Note that F ( x ) is well defined for all x M because E ( Y 2 ) is finite. Moreover, since F ( g · x ) = F ( x ) , for all x M and g G , the variance F is well defined in the quotient space: [ x ] F ( x ) does have a sense.
Moreover, in presence of a sample of the observable variable Y noted Y 1 , , Y n , one can define the empirical variance of a point x in M:
F n ( x ) = k = 1 n ( inf g G g · x Y i 2 ) = k = 1 n ( inf g G x g · Y i 2 ) .
Definition 1.
Template estimation is performed by minimizing F n :
t 0 ^ n = argmin x M F n ,
In order to study this estimation method, one can look the limit of this estimator when the number of data n tends to + , in this case, the estimation becomes:
t 0 ^ = argmin x M F
If m H minimizes F, then [ m ] is called a Fréchet mean of [ Y ] .
Definition 2.
We say that the estimation is consistent if t 0 minimizes F. Moreover the consistency bias, noted C B , is the (pseudo)-distance between the orbit of the template [ t 0 ] and [ m ] : C B = d Q ( [ t 0 ] , [ m ] ) . If such a m does not exist, then the consistency bias is infinite.
Note that, if the action is not isometric and is not either invariant, a priori d Q is no longer a (pseudo)-distance in the quotient space (this point is discussed in Section 3). However one can still define F and wonder if the minimization of F is a consistent estimator of t 0 . In this case, we call F a pre-variance.

1.4. Questions and Contributions

This setting leads us to wonder about few things listed below:
Questions:
  • Is t 0 a minimum of the variance or the pre-variance?
  • What is the behavior of the consistency bias with respect to the noise level?
  • How to perform such a minimization of the variance? Indeed, in practice we have only a sample and not the whole distribution.
Contribution: In the case of an isometric action, we provide a Taylor expansion of the consistency bias when the noise level σ tends to infinity. As we do not have the whole distribution, we minimize the empirical variance given a sample. An element which minimizes this empirical variance is called an empirical Fréchet mean. We already know that the empirical Fréchet mean converges to the Fréchet mean when the sample size tends to infinity [18]. Therefore our problem is reduced to finding an empirical Fréchet mean with a finite but sufficiently large sample. One algorithm called the “max-max” algorithm [19] aims to compute such an empirical Fréchet mean. We establish some properties of the convergence of this algorithm. In particular, when the group is finite, the algorithm converges in a finite number of steps to an empirical Karcher mean (a local minimum of the empirical variance given a sample). This helps us to illustrate the inconsistency in this very simple framework.
We would like to insist on this point: the noise is created in the ambient space with our generative model and the computation of the Fréchet mean is done in the quotient space, this interaction induces an inconsistency. On the opposite, if one models the noise directly in the quotient space and compute the Fréchet mean in the quotient space, we have no reason to suspect any inconsistency.
Moreover it is also possible to define and use isometric actions on curves [1,20] or on surfaces [21] where our work can be directly applied. The previous works related to the inconsistency of template estimation [5,6,7] focused on isometric action, which is a restriction to real applications. That is why we provide, in Section 3, some insights of the non invariant case: the inconsistency also appears as soon as the noise level is large enough.
This article is organized as follows: Section 2 is dedicated for isometric action. More precisely, in Section 2.2, we study the presence of the inconsistency and we establish the asymptotic behavior when the noise parameter σ tends to ∞. In Section 2.4 we detail the max-max algorithm and its properties. In Section 2.5 we illustrate the inconsistency with synthetic data. Finally in Section 3, we prove the inconsistency for more general group action, when the noise level is large enough. We do it in two settings, the first one is that the group contains a subgroup acting isometrically on M, the second one is that the group acts linearly on the space M.

2. Inconsistency of Template Estimation with an Isometric Action

2.1. Congruent Section and Computation of Fréchet Mean in Quotient Space

Given points m and y, there is a priori no closed formed expression in order to compute the quotient distance inf g G g · m y . Therefore computing and minimizing the variance in the quotient does not seem straightforward. There is one case where it may be possible: the existence of a congruent section. We say that s : Q M is a section if π s = I d , where π : M Q is the canonical projection into the quotient space. Moreover we say that the section s is congruent if:
o , o Q s ( o ) s ( o ) = d Q ( o , o ) .
Then S = s ( Q ) the image of the quotient by the section is a part of M which has an interesting property:
p , q S , p q = d Q ( [ p ] , [ q ] ) .
In other words, the section gives us a part of M containing a point of each orbit such that all points in S are already registered. Moreover, if s is a section, s : [ m ] g · s ( [ m ] ) is also a section, without loss of generality we can assume that t 0 = s ( [ t 0 ] ) .
In this case, the variance is equal to:
F ( m ) = E ( s ( [ m ] ) s ( [ Y ] ) 2 ) ,
where we recognize the variance of the random variable s ( [ Y ] ) . As we know that the element which minimizes the variance in a linear space is given by the expected value, we have that:
F ( m ) F ( E ( s ( [ Y ] ) ) ) .
Moreover this inequality is strict if and only if m and E ( s ( [ Y ] ) ) are not in the same orbit.
Therefore, we have a method in order to know if the estimation is consistent or not: computing E ( s ( [ Y ] ) ) and verifying if t 0 and E ( s ( [ Y ] ) ) are in the same orbit, and the consistency bias is given by d Q ( [ t 0 ] , [ E ( s ( [ Y ] ) ) ] ) . Moreover if we take m S , we have F ( m ) = E ( m s ( [ Y ] ) 2 ) and it is now straightforward that F | S the restriction of F to S is differentiable on S (we say that F | S is differentiable on S , even if S is not open, because m E ( m s ( [ Y ] ) 2 ) is defined and differentiable on M, and is equal to F | S ), and that F | S ( m ) = m E ( s ( [ Y ] ) in particular F | S ( t 0 ) = t 0 E ( s ( [ Y ] ) ) gives us the value of the bias.
Example 1.
The action of rotations: G = S O ( n ) acts isometrically on M = R n . We notice that the quotient distance is d Q ( [ x ] , [ y ] ) = | x y | . We can check that s ( [ x ] ) = x v is a section for v an unitary vector. Therefore the computation of the bias is given by d Q ( [ t 0 ] , [ E ( s ( [ Y ] ) ] ) = | E ( Y ) t 0 ) | .
Unfortunately, the congruent section generally does not exist. Let us give an example:
Example 2.
Taking N N with N 3 , we consider the action of G = Z / N Z on M = R N by time translation, for k ¯ Z / N Z , and ( x 1 , x 2 , , x N ) :
k ¯ · ( x 1 , x 2 , , x N ) = ( x 1 + k , x 2 + k , , x N + k ) ,
where indexes are taken modulo N. If we take p 1 = ( 0 , 5 , 0 , , 0 ) , p 2 = ( 0 , 3 , 2 , 0 , , 0 ) , p 3 = ( 2 , 3 , 0 , , 0 ) . By hand we can check that there is no x [ p 1 ] , y [ p 2 ] and z [ p 3 ] such that x y = d Q ( [ p 1 ] , [ p 2 ] ) , x z = d Q ( [ p 1 ] , [ p 3 ] ) , and y z = d Q ( [ p 2 ] , [ p 3 ] ) . Thus, a congruent section in Q = M / G does not exists.
We can generalize this simple example by taking a non finite group:
Example 3.
Let us take M = L 2 ( R / Z ) the set of 1-periodic functions such that 0 1 f 2 ( t ) d t < + . G = R / Z acts on L 2 ( R / Z ) by time translation defined by:
τ R / Z , f L 2 ( R / Z ) f τ   w i t h   f ( x ) = f ( x + τ ) .
Then a section in Q = M / G does not exists.
Proof. 
Let us take f 1 = 1 [ 1 4 , 3 4 ] , f 2 = f 1 + 2 1 [ 1 4 , 1 4 + η ] and  f 2 = f 1 + 2 1 [ 1 4 + η , 1 4 + 2 η ] for some η ( 0 , 1 4 ) (see Figure 3). Let us suppose that a section s exists, then without loss of generality we can assume that s ( [ f 1 ] ) = f 1 , then we should have f 1 s ( [ f 2 ] ) = s ( [ f 1 ] ) s ( [ f 2 ] ) = d Q ( [ f 1 ] , [ f 2 ] ) in other words, s ( [ f 2 ] ) should be registered with respect to f 1 . For τ R / Z we can verify that f 1 τ · f 2 f 1 f 2 and that this inequality is strict as soon as τ 0 . Then f 2 is the only element of [ f 2 ] registered with f 1 then s ( [ f 2 ] ) = f 2 . Likewise for s ( [ f 3 ] ) = f 3 , then we should have:
d Q ( [ f 2 ] , [ f 3 ] ) = f 2 f 3 ,
However it is easy to verify that d Q 2 ( [ f 2 ] , [ f 3 ] ) η · f 2 f 3 2 = 2 η < 8 η = f 2 f 3 2 = d Q ( [ f 2 ] , [ f 3 ] ) . This is a contradiction. Therefore, a congruent section does not exist. ☐
When the congruent section exists, then the quotient can be included in a part S of the ambient space M and the metric d M and d Q are corresponding. The existence of a congruent section indicates us that the quotient space is not so complicated. Indeed when there is an existence of a congruent section, the quotient space is embedded in the ambient space with respect to the distances in the quotient space and in the ambient space. In that case computations are easier, projecting data on this part S and taking the mean. Then when such a congruent section does not exist, computing the Fréchet mean in quotient space is not so obvious. However, we can established proofs of inconsistency which are less tight. In this article we prove that the method is inconsistent when the noise is large.

2.2. Inconsistency and Quantification of the Consistency Bias

We start with Theorem 1 which gives us an asymptotic behavior of the consistency bias when the noise level σ tends to infinity. One key notion in Theorem 1 is the concept of fixed point under the action G: a point x M is a fixed point if for all g G , g · x = x . We require that the support of the noise ϵ is not included in the set of fixed points. However, this condition is almost always fulfilled. For instance in R n the set of fixed points under a linear group action is a null set for the Lebesgue measure (unless the action is trivial: g · x = x for all g G but this situation is irrelevant).
Theorem 1.
Let us suppose that the support of the noise ϵ is not included in the set of fixed points under the group action. Let Y be the observable variable defined in Equation (1). If the Fréchet mean of [ Y ] exists, then we have the following lower and upper bounds of the consistency bias noted CB:
σ K 2 t 0 C B σ K + 2 t 0 ,
where K = sup v = 1 E sup g G v , g · ϵ ( 0 , 1 ] , K is a constant which depends only of the standardized noise and of the group action. The consistency bias has the following asymptotic behavior when the noise level σ tends to infinity:
C B = σ K + o ( σ ) a s σ + .
In the following we note by S the unit sphere of M. For v S , we call θ ( v ) = E sup g G v , g · ϵ , so that K = sup v S θ ( v ) . The sketch of the proof is the following:
  • K > 0 because the support of ϵ is not included in the set of fixed points under the action of G.
  • K 1 is the consequence of the Cauchy-Schwarz inequality.
  • The proof of Inequalities (3) is based on the triangular inequalities:
    m t 0 C B = inf g G t 0 g · m t 0 + m ,
    where m minimizes F: having a piece of information about the norm of m is enough to deduce a piece of information about the consistency bias.
  • The asymptotic Taylor expansion of the consistency bias (4) is the direct consequence of inequalities (3).
Proof of Theorem 1.
We note S the unit sphere in M. In order to prove that K > 0 , we take x in the support of ϵ such that x is not a fixed point under the action of G. It exists g 0 G such that g 0 · x x . We note v 0 = g 0 · x x S , we have v 0 , g 0 · x = x > v 0 , x and by continuity of the dot product it exists r > 0 such that: y B ( x , r ) v 0 , g 0 · y > v 0 , y as x is in the support of ϵ we have P ( ϵ B ( x , r ) ) > 0 , it follows:
P sup g G v 0 , g · ϵ > v 0 , ϵ > 0 .
Thanks to Inequality (5) and the fact that sup g G v 0 , g · ϵ v 0 , ϵ we have:
θ ( v 0 ) = E sup g G v 0 , g · ϵ > E ( v 0 , ϵ ) = v 0 , E ( ϵ ) = v 0 , 0 = 0 .
Then we get K θ ( v 0 ) > 0 . Moreover, if we use the Cauchy-Schwarz inequality:
K sup v S E ( v × ϵ ) E ( ϵ 2 ) 1 2 = 1 .
In order to prove Inequalities (3), we use the “polar” coordinates of a point in M (see Figure 4), every point in M can be represented by ( r , v ) where r 0 is the radius, and v belong to S the unit sphere in M, v represents the “angle”. We compute F ( m ) as a function of ( r , v ) . In a first step, we minimize this expression as a function of r, in a second step we minimize this expression as a function of v. This makes appear the constant K. As we said, let us take r 0 and v S , we expand the variance at the point r v :
F ( r v ) = E inf g G r v g · Y 2 = r 2 2 r E sup g G v , g · Y + E ( Y 2 ) .
Indeed g · Y = Y thanks to the isometric action. We note x + = max ( x , 0 ) the positive part of x. Moreover we define the two following functions:
λ ( v ) = E ( sup g G v , g · Y ) = E ( sup g G g · Y , v ) and   λ ˜ ( v ) = λ ( v ) + for   v S ,
since that f : x R + x 2 2 b x + c reaches its minimum at the point r = b + and f ( b + ) = c ( b + ) 2 , the r 0 which minimizes (6) is λ ˜ ( v ) and the minimum value of the variance restricted to the half line R + v is:
F ( λ ˜ ( v ) v ) = E ( Y 2 ) λ ˜ ( v ) 2 .
To find [ m ] the Fréchet mean of  [ Y ] , we need to maximize λ ˜ ( v ) 2 with respect to v S :
m = λ ( v ) v with   v argmax v S λ ( v ) .
Note that we remove the positive part and the square because argmax λ = argmax ( λ + ) 2 indeed λ takes a non negative value. In order to prove it let us remark that:
λ ( v ) E ( v , Φ · t 0 + ϵ ) = v , E ( Φ · t 0 ) + 0 ,
then there is two cases: if E ( Φ · t 0 ) = 0 then for any v S we have λ ( v ) 0 , if w = E ( Φ · t 0 ) 0 then we take v = w w S , and we get λ ( v ) w w , w = w 0 .
As we said in the sketch of the proof we are interested in getting information about the norm of m :
m = λ ( v ) = sup v S λ .
Let v S , we have: t 0 v , g Φ · t 0 t 0 because the action is isometric. Now we decompose Y = Φ · t 0 + σ ϵ and we get:
λ ( v ) = E sup g G v , g · Y = E sup g G v , g · σ ϵ + v , g Φ · t 0
λ ( v ) E sup g G v , g · σ ϵ + t 0 = σ E sup g G v , g · ϵ + t 0
λ ( v ) E sup g G v , g · σ ϵ t 0 = σ E sup g G v , g · ϵ t 0 .
By taking the largest value in these inequalities with respect to v S , we get by definition of K:
t 0 + σ K m = sup v S λ ( v ) t 0 + σ K .
Moreover we recall the triangular inequalities:
m t 0 C B = inf g G t 0 g · m t 0 + m ,
Thanks to (10) and to (11), Inequalities (3) are proved. ☐

2.3. Remarks about Theorem 1 and Its Proof

We can ensure the presence of inconsistency as soon as the signal to noise ratio satisfies t 0 σ < K 2 . Moreover, if the signal to noise ratio verifies t 0 σ < K 3 then the consistency bias is not smaller than t 0 i.e., C B t 0 . In other words, the Fréchet mean in quotient space is too far from the template: the template estimation with the Fréchet mean in quotient space is useless in this case. In [7] we also gave lower and upper bounds as a function of σ but these bounds were less informative than bounds given by Theorem 1. These bounds did not give the asymptotic behaviour of the consistency bias. Moreover, in [7] the lower bound goes to zero when the template becomes closed to fixed points. This may suggest that the consistency bias was small for this kind of template. We prove here that it is not the case.
Note that Theorem 1 is not a contradiction with [1] where the authors proved the consistency of template estimation with the Fréchet mean in quotient space for all σ > 0 . Indeed their noise was included in the set of constant functions which are the fixed points under their group action.
The constant K appearing in the asymptotic behaviour of the consistency bias (4) is a constant of interest. We can give several (but similar) interpretations of K:
  • It follows from Equation (3) that K is the consistency bias with a null template t 0 = 0 and a standardized noise ( σ = 1 ).
  • From the proof of Theorem 1 we know that 0 < K E ( ϵ ) 1 . On the one hand, if G is the group of rotations then K = E ( ϵ ) , because for all v s.t. v = 1 , sup g G v , g ϵ = ϵ , by aligning v and ϵ . On the other hand if G acts trivially (which means that g · x = x for all g G , x M ) then K = 0 . The general case for K is between two extreme cases: the group where the orbits are minimal (one point) and the group for which the orbits are maximal (the whole sphere). We can state that the more the group action has the ability to align the elements, the larger the constant K is and the larger the consistency bias is.
  • The squared quotient distance between two points is:
    d Q ( [ a ] , [ b ] ) 2 = a 2 2 sup g G a , g · b + b 2 ,
    thus the larger sup g G a , g · b , the smaller d Q ( [ a ] , [ b ] ) . K = 1 1 2 inf v = 1 E ( d Q 2 ( [ v ] , [ ϵ ] ) ) , encodes the level of contraction of the quotient distance (or folding). The larger K is, the more contracted the quotient space is.
One disadvantage of Theorem 1 is that it ensures the presence of inconsistency for σ large enough but it says nothing when σ is small, in this case one can refer to [6] or [7].
We can remark that this Theorem can be used as an alternating proof the following Theorem (which was already proved in [7]), proving and quantifying inconsistency when the template is a fixed point:
Corollary 1.
Let G acting isometrically on M an Hilbert space. Let t 0 be a fixed point, and ϵ a standardized noise which support is not included in the set of fixed points. Then estimating the template with the Fréchet mean is inconsistent. Moreover if the Fréchet mean in quotient space exists then the consistency bias is equal to:
C B = σ K .
Indeed for t 0 = 0 which is a particular fixed point we have C B = σ K thanks to Theorem 1. If t 0 is a fixed point non necessarily equal to 0, we can define Y = Y t 0 = 0 + σ ϵ , in this random variable 0 is the template we can apply the formula C B = σ K to the random variable Y , which concludes.
In the proof of Theorem 1, we have seen that the minimum of the variance restricted to the half-line R + v for v S , was
E ( Y 2 ) E inf g G v , g · Y + 2 .
therefore λ ˜ ( v ) = E inf g G v , g · Y + is a registration score: λ ˜ ( v ) tells you how much it is a good idea to search the Fréchet mean of [ Y ] in the direction pointed by v: the more λ ˜ ( v ) is large, the more v is a good choice. On the contrary when this value is equal to zero, it is useless to search the Fréchet mean in this direction.
Likewise, for v S , θ ( v ) = E ( sup g G g · v , ϵ ) is a registration score with respect to the noise, the larger θ ( v ) , the more the unit vector v looks like to the noise ϵ after registration.
If [ m ] is a Fréchet mean of [ Y ] we have seen that its norm verifies:
m = sup v = 1 E ( sup g G v , g · Y ) .
Then if there is two different Fréchet means of [ Y ] noted [ m ] and [ n ] , we can deduce that m = n . Even if there is no uniqueness of the Fréchet mean in the quotient space, we can state that the representants of the different Fréchet means have all the same norm.
Remark 2.
We can also wonder if the converse of Theorem 1 is true: if ϵ is a non biased noise always included in the set of fixed point, is [ t 0 ] a Fréchet mean of [ Φ · t 0 + σ ϵ ] ? A simple computation show that t 0 is a minimum of the variance:
F ( m ) = E inf g G m g · ( Φ t 0 + σ ϵ ) 2 = m 2 + E ( Φ t 0 + σ ϵ 2 ) 2 E ( sup g m , g Φ t 0 + m , g σ ϵ ) = m 2 + E ( Φ t 0 + σ ϵ 2 ) 2 E sup g G m , g · t 0 2 m , E ( σ ϵ ) = m 2 + E ( Φ t 0 + σ ϵ 2 ) 2 E sup g G m , g · t 0
We see that the element m which minimizes (12) does not depend of σ, in particular we can assume σ = 0 , and wonder which elements minimizes F ( m ) = E ( inf g G m g Φ · t 0 2 ) , it becomes clear that only the points in the orbit of t 0 can minimize this variance. Then when ϵ is included in the set of fixed points, the estimation is always consistent for all σ. This is an alternative proof of the Theorem of consistency done by Kurtek et al. [1].
In the proof of Theorem 1, we have seen that the direction of the Fréchet mean of [ Y ] is given by the supremum of this quantity (7):
E sup g G v , g · σ ϵ + v , g Φ · t 0 .
This Equation is a good illustration of the difficulty to compute the Fréchet mean in quotient space. Indeed, we have on one side the contribution of the noise v , g · σ ϵ and on the other side the contribution of the template v , g Φ · t 0 , and we take the supremum of the sum of these two contributions over g G . Unfortunately the supremum of the sum of two terms is not equal to the sum of the supremum of each of these terms. Hence, it is difficult to separate these two contributions. However, we can intuit that when the noise is large, v , g σ ϵ prevails over v , g Φ · t 0 , and the use of the Cauchy-Schwarz inequality in Equations (8) and (9) proves it rigorously. We can conclude that, when the noise is large, the direction of the Fréchet mean in the quotient space depends more on the noise than on the template.

2.4. Template Estimation with the Max-Max Algorithm

2.4.1. Max-Max Algorithm Converges to a Local Minima of the Empirical Variance

Section 2.2 can be understood as follows: if we want to estimate the template by minimizing the Fréchet mean with quotient space, then there is a bias. This supposes that we are able to compute such a Fréchet mean. In practice, we cannot minimize the exact variance in quotient space, because we have only a finite sample and not the whole distribution. In this section we study the estimation of the empirical Fréchet mean with the max-max algorithm. We assume that the group is finite. In this case, the registration can always be found by an exhaustive search. Hence, the numeric experiments which we conduct in Section 2.5 lead to an empirical Karcher mean in a finite number of steps. In a compact group acting continuously, the registration also exists but is not necessarily computable without approximation.
If we have a sample: Y 1 , , Y I of independent and identically distributed copies of Y, then we define the empirical variance in the quotient space:
M x F I ( x ) = 1 I i = 1 I d Q 2 ( [ x ] , [ Y i ] ) = 1 I i = 1 I min g i G x g i · Y i 2 = 1 I i = 1 I min g i G g i · x Y i 2 .
The empirical variance is an approximation of the variance. Indeed thanks to the law of large number we have lim I F I ( x ) = F ( x ) for all x M . One element which minimizes globally (respectively locally) F I is called an empirical Fréchet mean (respectively an empirical Karcher mean). For x M and g ̲ G I : g ̲ = ( g 1 , , g I ) where g i G for all i = 1 . . I we define J an auxiliary function by:
J ( x , g ̲ ) = 1 I I i = 1 x g i · Y i 2 = 1 I I i = 1 g i 1 · x Y i 2 .
The max-max Algorithm 1 iteratively minimizes the function J in the variable x M and in the variable g ̲ G I (see Figure 5):
Algorithm 1 Max-Max Algorithm.
Require: A starting point m 0 M , a sample Y 1 , , Y I .
n = 0 .
while Convergence is not reached do
  Minimizing g ̲ G I J ( m n , g ̲ ) : we get g i n by registering Y i with respect to m n .
  Minimizing x M J ( x , g ̲ n ) : we get m n + 1 = 1 I i = 1 I g i n Y i .
   n = n + 1 .
end while
m ^ = m n
First, we note that this algorithm is sensitive to the the starting point. However we remark that m 1 = 1 I i = 1 I g i · Y i for some g i G , thus without loss of generality, we can start from m 1 = 1 I i = 1 I g i · Y i for some g i G . The empirical variance does not increase at each step of the algorithm since:
F I ( m n ) = J ( m n , g ̲ n ) J ( m n + 1 , g ̲ n ) J ( m n + 1 , g ̲ n + 1 ) = F I ( m n + 1 )
Proposition 1.
As the group is finite, the convergence is reached in a finite number of steps.
Proof of Proposition 1.
The sequence ( F I ( m n ) ) n N is non-increasing. Moreover the sequence ( m n ) n N takes value in a finite set which is: { 1 I i = 1 I g i · Y i , g i G } . Therefore, the sequence ( F I ( m n ) ) n N is stationary. Let n N such that F I ( m n ) = F I ( m n + 1 ) . Hence the empirical variance did not decrease between step n and step n + 1 and we have:
F I ( m n ) = J ( m n , g ̲ n ) = J ( m n + 1 , g ̲ n ) = J ( m n + 1 , g ̲ n + 1 ) = F I ( m n + 1 ) ,
as m n + 1 is the unique element which minimizes m J ( m , g ̲ n ) we conclude that m n + 1 = m n . ☐
This proposition gives us a shutoff parameter in the max-max algorithm: we stop the algorithm as soon as m n = m n + 1 . Let us call m ^ the final result of the max-max algorithm. It may seem logical that m ^ is at least a local minimum of the empirical variance. However this intuition may be wrong: let us give a toy counterexample, suppose that we observe Y 1 , , Y I , due to the transformation of the group it is possible that i = 1 n Y i = 0 . We can start from m 1 = 0 in the max-max algorithm, as Y i and 0 are already registered, the max-max algorithm does not transform Y i . At step two, we still have m 2 = 0 , by induction the max-max algorithm stays at 0 even if 0 is not a Fréchet or Karcher mean of [ Y ] . Because 0 is equally distant from all the points in the orbit of Y i , 0 is called a focal point of [ Y i ] . The notion of focal point is important for the consistency of the Fréchet mean in manifold [22]. Fortunately, the situation where m ^ is not a Karcher mean is almost always avoided due to the following statement:
Proposition 2.
Let m ^ be the result of the max-max algorithm. If the registration of Y i with respect to m ^ is unique, in other words, if m ^ is not a focal point of Y i for all i 1 . . I then m ^ is a local minimum of F I : [ m ^ ] is an empirical Karcher mean of [ Y ] .
Note that, if we call z the registration of y with respect to m, then the registration is unique if and only if m , z g · z 0 for all g G { e } . Once the max-max algorithm has reached convergence, it suffices to test this condition for m ^ obtained by the max-max algorithm and Y i for all i. This condition is in fact generic and is always obtained in practice.
Proof of Proposition 2.
We call g i the unique element in G which register Y i with respect to m ^ , for all h G { g i } , m ^ g i · Y i < m ^ h i · Y i . By continuity of the norm we have for a close enough to m: a g i · Y i < a h i · Y i for all h i g i (note that this argument requires a finite group). The registrations of Y i with respect to m and to a are the same:
F I ( a ) = 1 I i = 1 I a g i · Y i 2 = J ( a , g ̲ ) J ( m ^ , g ̲ ) = F I ( m ^ ) ,
because m J ( m , g ̲ ) has one unique local minimum m ^ . ☐
Remark 3.
We remark the max-max algorithm is in fact a gradient descent. The gradient descent is a general method to find the minimum of a differentiable function. Here we are interested in the minimum of the variance F: let m 0 M and we define by induction the gradient descent of the variance m n + 1 = m n ρ F ( m n ) , where ρ > 0 and F the variance in the quotient space. In [7], the gradient of the variance in quotient space for finite group and for a regular point m was computed (m is regular as soon as g · m = m implies g = e ), this leads to:
m n + 1 = m n 2 ρ m n E ( g ( Y , m n ) · Y ) ,
where g ( Y , m n ) is the almost-surely unique element of the group which registers Y with respect to m n . Now if we have a set of data Y 1 , , Y n we can approximated the expectation which leads to the following approximated gradient descent:
m n + 1 = m n ( 1 2 ρ ) + ρ 2 I i = 1 I g ( Y i , m n ) · Y i ,
now by taking ρ = 1 2 we get m n + 1 = 1 I i = 1 I g ( Y i , m n ) · Y i . So the approximated gradient descent with ρ = 1 2 is exactly the max-max algorithm. However, the max-max algorithm for finite group, is proved to be converging in a finite number of steps which is not the case for gradient descent in general.

2.5. Simulation on Synthetic Data

In this Section, we consider data in an Euclidean space R N equipped with its canonical dot product · , · , and G = Z / N Z acts on R N by time translation on coordinates:
( k ¯ Z / N Z , ( x 1 , , x N ) R N ) ( x 1 + k , x 2 + k , x N + k ) ,
where indexes are taken modulo N. This space models the discretization of functions defined on [ 0 , 1 ] with N points. This action is found in [19] and used for neuroelectric signals in [20]. The registration between two vectors can be made by an exhaustive research but it is faster with the fast Fourier transform [23].

2.5.1. Max-Max Algorithm with a Step Function as Template

We display an example of a template and template estimation with the max-max algorithm on Figure 6a. This experiment was already conducted in [19], but no explanation of the appearance of the bias was provided. We know from Section 2.4 that the max-max output is an empirical Karcher mean, and that this result can be obtained in a finite number of steps. Taking σ = 10 may seem extremely high, however the standard deviation of the noise at each point is not 10 but σ N = 1.25 which is reasonable.
The sample size is 10 5 , the algorithm stopped after 247 steps, and m ^ the estimated template (in red on the Figure 6a) is not a focal points of the orbits [ Y i ] , then Proposition 2 applies. We call empirical bias (noted EB) the quotient distance between the true template and the point m ^ given by the max-max result. On this experiment we have E B σ 0.11 . Of course, one could think that we estimate the template with an empirical bias due to a too small sample size which induces fluctuation. To reply to this objection, we keep in memory m ^ obtained with the max-max algorithm. If there was no inconsistency then we would have F ( t 0 ) F ( m ^ ) . We do not know the value of the variance F at these points, but thanks to the law of large number, we know that:
F ( t 0 ) = lim I F I ( t 0 )   and   F ( m ^ ) = lim I F I ( m ^ ) ,
Given a sample, we compute F I ( t 0 ) and F I ( m ^ ) thanks to the definition of the empirical variance F I (13). We display the result on Figure 6b, this tends to confirm that F ( t 0 ) > F ( m ^ ) . In other word, the variance at the template is larger that the variance at the point given by the max-max algorithm.

2.5.2. Max-Max Algorithm with a Continuous Template

Figure 6a shows that the main source of the inconsistency was the discontinuity of the template. One may think that a continuous template would lead to a better behaviour. However, it is not the case as presented on Figure 7. Even with a large number of observations created from a continuous template we do not observe a convergence to the template. In the example of Figure 7, the empirical bias satisfies E B σ = 0.23 . In green we also display the mean of data knowing transformations, this produces a much better result, since that in this case we have E B σ = 0.04 .

2.5.3. Does the Max-Max Algorithm Give Us a Global Minimum or Only a Local Minimum of the Variance?

Proposition 2 tells us that the output of the max-max algorithm is a Karcher mean of the variance, but we do not know whether it is Fréchet mean of the variance. In other words, is the output a global minimum of the variance? In fact, F I has a lot of local minima which are not global. To illustrate this, we may use the max-max algorithm with different starting points and we observe different outputs (which are all local minima thanks to Proposition 2) with different empirical variance on Table 1.

3. Inconsistency in the Case of Non Invariant Distance under the Group Action

3.1. Notation and Hypothesis

In this Section, data still come from an Hilbert space M. However, we take a group of deformation G which acts in a non invariant way on M. Starting from a template t 0 we consider a random deformation in the group G namely a random variable Φ which takes value in G and ϵ an standardized noise in M independent of Φ . We suppose that our observable random variable is:
Y = Φ · t 0 + σ ϵ   with   σ > 0 , E ( ϵ ) = 0 , E ( ϵ 2 ) = 1 ,
where σ is the noise level. We suppose that E ( Y 2 ) < + , and we define the pre-variance of Y in M / G as the map defined by:
F ( m ) = E inf g G g · m Y 2 .
In this part we still study the inconsistency of template estimation by minimizing F.
We present two frameworks where we can ensure the presence of inconsistency: in Section 3.3 we suppose that the group G contains a non trivial group H which acts isometrically on M. However, some groups do not satisfy this hypothesis, that is why, in Section 3.4 we do not suppose that G contains a subgroup acting isometrically but we require that G acts linearly on M. In both sections we prove inconsistency as soon as the variance σ 2 is large enough.
These hypothesis are not unacceptable as for example, deformations that are considered in computational anatomy may include rotations which form a subgroup H of the diffeomorphic deformations which acts isometrically. Concerning the second case, an important example is:
Example 4.
Let G be a subgroup of the group of C diffeomorphisms on R n G acts linearly on L 2 ( R n ) with the map:
φ G f L 2 ( R n ) φ · f = f φ 1 .
Note that this action is not isometric: indeed, f φ 1 has generally a different L 2 -norm than f, because a Jacobian determinant appears in the computation of the integral.

3.2. Where Did We Need an Isometric Action Previously?

Let M be an Hilbert space, and G a group acting on M. Can we define a distance in the quotient space Q = M / G defined as the set which contains all the orbits? When the action is invariant, the orbits are parallel in the sense where d M ( m , n ) = d M ( g · m , g · n ) for all m , n M and for all g G . This implies that:
d Q ( [ m ] , [ n ] ) = inf g G m g · n ,
is a distance on Q. However, it is not necessarily the case when the action is no longer invariant. Let us take the following example:
Example 5.
We call C d i f f ( R 2 ) the set of the C diffeomorphisms of R 2 . We equip R 2 with its canonical Euclidean structure. We take p = ( 1 , 1 ) , q = ( 1 , 1 ) and r = ( 2 , 0 ) (see Figure 8),
G = f C d i f f ( R 2 ) | f ( q ) = ( q ) , f ( p ) = ( p ) , x R f ( x , 0 ) R r ,
G acts on R 2 by f · ( x , y ) = f ( x , y ) . Then q and p are fixed points under this group action and the orbit of r is the horizontal line { ( x , 0 ) , x R } . On this example:
inf g G q g · r = q ( 1 , 0 ) = 1 h o w e v e r inf g G r g · q = r q = 2 ,
then the function d Q is not symmetric. One could think define a distance by:
d ˜ Q ( [ m ] , [ n ] ) = inf h , g G h · m g · n .
Unfortunately, in this case d ˜ Q ( [ p ] , [ q ] ) = p q = 2 2 and d ˜ Q ( [ p ] , [ r ] ) = 1 = d ˜ Q ( [ r ] , [ q ] ) then we do not have d ˜ Q ( [ p ] , [ q ] ) d ˜ Q ( [ p ] , [ r ] ) + d ˜ Q ( [ r ] , [ q ] ) . In other words we do not have the triangular inequality.
Therefore when the action is no longer invariant, a priori one cannot define a distance in the quotient anymore. If Y is a random variable in M, F ( m ) = E ( inf g G g · m Y 2 ) cannot be interpreted as the variance of [ Y ] .
However inf g G g · a b is positive and is equal to zero if a = b , then inf g G g · a b is a pre-distance in M. Then inf g G g · m Y measures the discrepancy between the random point Y and the current point m. Even if the discrepancy measure is not symmetric or does not satisfy the triangular inequality, we can still define F ( x ) = E ( inf g G g · x Y 2 ) and call it the pre-variance of the projection of Y into M / G , if E ( Y 2 ) < + . The elements which minimize this function are the element whose orbit are the closest of the random point Y. Hence, we wonder if the template can be estimated by minimizing this pre-variance. Note that, once again F ( x ) = F ( g · x ) for all x M and g G . Then the pre-variance is well defined in the quotient space by [ x ] F ( x ) .
It is not surprising to use a discrepancy measure which is not a distance, for instance the Kullback-Leibler divergence [24] is not symmetric although it is commonly used.
In the proof of inconsistency of Theorem 1, we used that the action was isometric in order to simplify the expansion of the variance in Equation (6):
F ( m ) = E inf g G m g · Y 2 = E inf g G m 2 m , g · Y + g · Y 2 ,
with g · Y 2 = Y 2 there was only one term which depends on g: g · m , Y and the two other terms could be pulled out of the infimum. When the action is no longer isometric we cannot do this trick anymore. To remedy this situation, in this article, we require that the orbit of the template is a bounded set.
In the following, we prove inconsistency even with non isometric action (but only when the noise level is large enough if the template is not a fixed point). The sketches of the different proofs are always the same: finding a point m such that F ( m ) < F ( t 0 ) , in order to do that it suffices to find an upper bound of F ( m ) and a lower bound of F ( t 0 ) and to compare these two bounds.

3.3. Non Invariant Group Action, with a Subgroup Acting Isometrically

In this subsection G acts on M an Hilbert space. We assume that there exists a subgroup H G such that H acts isometrically on M. As H is included in G, we deduce a useful link between the variance of Y projected in M / H and the pre-variance of Y projected in M / G :
F ( m ) = E ( inf g G g · m Y 2 ) E ( inf h H h · m Y 2 ) = F H ( m ) .
The orbit of a point m under the group action G is [ m ] = { g · m , g G } , whereas the orbit of the point m under the group action H is [ m ] H = { h · m , h H } . Moreover, we call F H the variance of [ Y ] H in the quotient space M / H , and F the variance of [ Y ] in the quotient space M / G .

3.3.1. Inconsistency when the Template Is a Fixed Point

We begin by assuming that the template t 0 is a fixed point under the action of G:
Proposition 3.
Suppose that t 0 is a fixed point under the group action G. Let ϵ be a standardized noise which support is not included in the fixed points under the group action of H, and Y = Φ · t 0 + σ ϵ = t 0 + σ ϵ . Then t 0 is not a minimum of the pre-variance F.
Proof. 
We have:
  • Thanks to Corollary 1 of Section 2.2 we know that [ t 0 ] H = [ E ( Y ) ] H is not the Fréchet mean of [ Y ] H the projection of Y into M / H : we can find m M such that:
    F H ( m ) < F H ( t 0 ) .
    Note that in order to apply Corollary 1, we do not need that Φ is included in H, because t 0 is a fixed point.
  • Because we take the infimum over more elements we have:
    F ( m ) F H ( m ) .
  • As t 0 is a fixed point under the action of G and under the action of H:
    F H ( t 0 ) = F ( t 0 ) = E ( t 0 Y 2 ) .
With Equations (15)–(17), we conclude that t 0 does not minimize F. ☐

3.3.2. Inconsistency in the General Case for the Template

The following Proposition 4 tells us that when σ is large enough then there is an inconsistency.
Proposition 4.
We suppose that the template is not a fixed point and that its orbit under the group G is bounded. We consider A sup g G g · t 0 t 0 and a inf g G g · t 0 t 0 , note that a 1 A and we have:
g G a t 0 g · t 0 A t 0 .
We note:
θ ( t 0 ) = 1 t 0 E ( sup g G g · t 0 , ϵ )   a n d   θ H = 1 t 0 E sup h H h · t 0 , ϵ .
We suppose that θ H > 0 . If σ is bigger than a critical noise level noted σ c defined as:
σ c = t 0 θ H θ ( t 0 ) θ H + A + θ ( t 0 ) θ H + A 2 + A 2 a 2 .
Then we have inconsistency.
Note that in Section 2.2 we have proved inconsistency in the isometric case as soon as σ > 2 t 0 K , where K θ H , then we find in this theorem an analogical sufficient condition on σ where θ ( t 0 ) θ H + A + θ ( t 0 ) θ H + A 2 + A 2 a 2 is a corrective term due to the non invariant action.
We have shown in [7] that if the orbit of the template [ t 0 ] H is a manifold, then θ H > 0 as soon as the support of ϵ is not included in T t 0 [ t 0 ] (the normal space of the orbit of the template t 0 at the point t 0 ). If [ t 0 ] is not a manifold, we have also seen in [7] that θ H > 0 as soon as t 0 is an accumulation point of [ t 0 ] H and the support of ϵ contains a ball B ( 0 , r ) . Hence, θ H > 0 is a rather generic condition. Condition (18) can be reformulated as follows: as soon as the signal to noise ratio t 0 σ is sufficiently small:
t 0 σ < θ H θ ( t 0 ) θ H + A + θ ( t 0 ) θ H + A 2 + A 2 a 2 ,
then there is inconsistency.
We remark the presence of the constants θ ( t 0 ) and θ H in Proposition 4. This kind of constants were already here in the isometric case under the form θ ( t 0 t 0 ) = 1 t 0 E ( sup g G t 0 , g · ϵ ) , due to the polarization identity (2), we can state that it measures how much the template looks like to the noise after registration, but only in the isometric case. However we can intuit that this constant plays a analogical role in the non isometric case.
Example 6.
Let G acting on M, we suppose that G contains H = O ( M ) the orthogonal group of M. Assume that G can modifying the norm of the template by multiplying its norm by at most 2. Then we can set up A = 2 and a = 0 . By aligning ϵ and t 0 we have θ H = E ( ϵ ) > 0 , and θ ( t 0 ) = A E ( ϵ ) then when the signal to noise ratio t 0 σ is smaller that E ( ϵ ) 4 + 20 then there is inconsistency. By Cauchy-Schwarz inequality we have E ( ϵ ) E ( ϵ 2 ) = 1 , thus the signal to noise ratio has to be rather small in order to fulfill this condition.

3.3.3. Proof of Proposition 4

We define the following values:
λ H = 1 t 0 2 E sup h H h · t 0 , Y   and   λ ( t 0 ) = 1 t 0 2 E sup g G g · t 0 , Y .
Note that λ H and λ ( t 0 ) are registration scores which definitions are the same than the registration score used in the proof of Theorem 1 in Section 2 (only the normalization by t 0 is different). The proof of Proposition 4 is based on the following Lemma:
Lemma 1.
If:
λ H 0 ,
a 2 2 λ ( t 0 ) + λ H 2 > 0 ,
then t 0 is not a minimizer of the pre-variance of [ Y ] in M / G .
How condition (20) can be understood? In order to answer to that question, let us imagine that G = H acts isometrically, then a can be set up to 1, and λ ( t 0 ) = λ H the condition (20) becomes λ H 2 2 λ H + 1 = ( λ H 1 ) 2 > 0 and the conditions of Theorem 4.2 of [7] aimed to ensure that λ H > 1 . Now let us return to the non invariant case: if H is strictly included in G such that a is closed enough to 1 and λ ( t 0 ) closed enough to λ H , then on can think that condition (20) still holds. However, the closed enough seems hard to be quantified.
Proof of Lemma 1.
The proof is based on the following points:
  • F ( λ H t 0 ) F H ( λ H t 0 ) ,
  • F H ( λ H t 0 ) < F ( t 0 ) .
With items 1 and 2 we get that F ( λ H t 0 ) < F ( t 0 ) . items 1 is just based on the fact that in the map F, we take the infimum on a larger set than on F H . We now prove item 2, in order to do that we expand the two quantities, firstly:
F H ( λ H t 0 ) = E inf h H h · λ H t 0 2 + Y 2 2 h · λ H t 0 , Y
= λ H 2 t 0 2 + E ( Y 2 ) 2 λ H E sup h H h · t 0 , Y = E ( Y 2 ) λ H 2 t 0 2 ,
We use the fact that H acts isometrically between Equations (21) and (22) and the fact that λ H 0 because inf a A λ a = λ sup a A a is true for any A subset of R if λ 0 . Secondly:
F ( t 0 ) = E inf g G g · t 0 2 + Y 2 2 g · t 0 , Y a 2 t 0 2 + E ( Y 2 ) 2 E sup g G g · t 0 , Y a 2 t 0 2 + E ( Y 2 ) 2 λ ( t 0 ) t 0 2
Then:
F ( t 0 ) F H ( λ H t 0 ) t 0 2 a 2 2 λ ( t 0 ) + λ H 2 > 0 ,
thanks to hypothesis (20). ☐
Proof of Proposition 4.
In order to prove Proposition 4, all we have to do is proving λ H 0 and proving that Condition (20) is fulfilled when σ > σ c . Firstly, thanks to Cauchy-Schwarz inequality, we have:
λ H = 1 t 0 2 E sup h H h · t 0 , Φ · t 0 + σ ϵ 1 t 0 2 A t 0 2 + E ( sup h H h · t 0 , σ ϵ ) A + σ θ H t 0
Note that as σ > σ c A t 0 θ H we get λ H 0 , this proves (19). We also have:
λ ( t 0 ) = 1 t 0 2 E sup g G g · t 0 , Φ · t 0 + σ ϵ 1 t 0 2 A 2 t 0 2 + σ E sup g G g · t 0 , ϵ A 2 + σ θ ( t 0 ) t 0 ,
Then we can find a lower bound of a 2 2 λ ( t 0 ) + λ H 2 :
a 2 2 λ ( t 0 ) + λ H 2 a 2 A 2 + σ θ ( t 0 ) t 0 + σ θ H t 0 A 2 a 2 A 2 2 σ θ H t 0 θ ( t 0 ) θ H + A + σ θ H t 0 2 : = P ( σ )
For σ > σ c where σ c is the biggest solution of the quadratic Equation P ( σ ) = 0 , we get a 2 2 λ ( t 0 ) + λ H 2 > 0 and template estimation is inconsistent thanks to Lemma 1. The critical σ c is exactly the one given by Proposition 4. ☐

3.4. Linear Action

The result of the previous part has a drawback, it requires that the group of deformations contains a non trivial subgroup which acts isometrically. We know remove this hypothesis, but we require that the group acts linearly on data.

3.4.1. Inconsistency

In this Subsection we suppose that the group G acts linearly on M. Once again, we can give a criteria on the noise level which leads to inconsistency:
Proposition 5.
We suppose that the orbit of the template is bounded with:
a 0 , A > 0   s u c h   t h a t   g G a t 0 g · t 0 A t 0 .
We suppose that A < 2 . In other words, the deformation of the template can multiply the norm of the template by less than 2 . We also suppose that:
θ ( t 0 ) = 1 t 0 E sup g G g · t 0 , ϵ > 0 .
There is inconsistency as soon as
σ σ c = t 0 θ ( t 0 ) A 2 + 1 + 1 a 2 ( 2 A 2 ) 2 A 2 .
Example 7.
For instance if A 1.2 , then there is inconsistency if σ 7 t 0 θ ( t 0 ) .
Once again we find a condition which is similar to the isometric case, but due to the non invariant action we have here a corrective term which depends on A and a. Note that as G does not act isometrically, results in [7] do not apply in order to fulfill Condition (23). However it is easy to fulfill this Condition thanks to the following Proposition:
Proposition 6.
If t 0 is not a fixed point, and if the support of ϵ contains a ball B ( 0 , ρ ) for ρ > 0 then
θ ( t 0 ) = 1 t 0 E sup g G g · t 0 , ϵ > 0 .
Remark 4.
It is possible to remove the condition A < 2 in Proposition 5. Indeed Let be h G such that:
sup g G g · t 0 h · t 0 < 2 .
The template t 0 can be replaced by h · t 0 since Φ t 0 + σ ϵ is equal to Φ h 1 · h t 0 and applying Proposition 5 to the new template h · t 0 . We get that h · t 0 does not minimize the variance F with A 2 (because the new template is h · t 0 ) . Since h · t 0 does not minimize F, the original template t 0 does not minimize the pre-variance F neither, since F ( t 0 ) = F ( h · t 0 ) .
This changes the critical σ c since we apply Proposition 5 to h · t 0 instead of t 0 itself.

3.4.2. Proofs of Proposition 5 and Proposition 6

As in Section 3.3 we first prove a Lemma:
Lemma 2.
We define:
λ ( t 0 ) = 1 t 0 2 E sup g G g · t 0 , Y .
Suppose that λ ( t 0 ) 0 and that:
a 2 2 λ ( t 0 ) + λ ( t 0 ) 2 ( 2 A 2 ) > 0 .
Then t 0 is not a minimum of F.
Proof of Lemma 2.
Since
g G a t 0 g · t 0 A t 0 ,
then by linearity of the action we get:
g G , μ R a μ t 0 g · μ t 0 A μ t 0 .
We remind that:
F ( m ) = E inf g G g · m 2 2 g · m , Y + Y 2 .
By using Equations (25) and (26) we get:
F ( t 0 ) a 2 t 0 2 2 λ ( t 0 ) t 0 2 + E ( Y 2 ) ,
We get:
F ( λ ( t 0 ) t 0 ) E A 2 λ ( t 0 ) t 0 2 + Y 2 + inf g G 2 λ ( t 0 ) g · t 0 , Y A 2 λ ( t 0 ) 2 t 0 2 + E ( Y 2 ) 2 λ ( t 0 ) 2 t 0 2 .
Note that we use the fact that the action is linear in Equation (27). We obtain that t 0 is not the minimum of the F:
F ( t 0 ) F ( λ ( t 0 ) t 0 ) t 0 2 a 2 2 λ ( t 0 ) + λ ( t 0 ) 2 ( 2 A 2 ) > 0 .
Proof of Proposition 5.
By solving the following quadratic inequality we remark that:
a 2 2 λ ( t 0 ) + ( 2 A 2 ) λ ( t 0 ) 2 > 0   if   λ ( t 0 ) > 1 + 1 a 2 ( 2 A 2 ) 2 A 2 ,
Besides, as in Section 3.3.2 we can take a lower bound of λ ( t 0 ) by decomposing Y = Φ · t 0 + σ ϵ and applying Cauchy-Schwarz inequality Φ · t 0 , g · t 0 A 2 t 0 2 , we get:
λ ( t 0 ) A 2 + σ t 0 θ ( t 0 ) .
Thanks to Condition (28) and the fact that σ > σ c we get:
λ ( t 0 ) A 2 + σ t 0 θ ( t 0 ) > 1 + 1 a 2 ( 2 A 2 ) 2 A 2
Then λ ( t 0 ) 0 and Condition (24) is fulfilled. Thus, there is inconsistency, according to Lemma 2. ☐
Proof of Proposition 6.
First we notice that:
t 0 θ ( t 0 ) = E sup g G g · t 0 , ϵ E ( t 0 , ϵ ) = t 0 , E ( ϵ ) = 0 .
In order to have θ ( t 0 ) > 0 , first we show that it exists x B ( 0 , ρ ) and g 0 G such that
sup g G g · t 0 , x g 0 · t 0 , x > t 0 , x .
Let g 0 G such that g 0 · t 0 t 0 . There are three cases to be distinguished (see Figure 9):
  • The vectors g 0 · t 0 and t 0 are linearly independent. In this case t 0 ¬ ( g 0 · t 0 ) , then we can find x t 0 and x ( g · t 0 ) . Then t 0 , x = 0 and g · t 0 , x 0 , without loss of generality we can assume that g · t 0 , x > 0 (replacing x by x if necessary). We also can assume that x B ( 0 , ρ ) (replacing x by x ρ 2 x if necessary. Then we have x B ( 0 , ρ ) and:
    g 0 · t 0 , x > 0 = t 0 , x .
  • If g 0 · t 0 = w t 0 with w > 1 , we take x = ρ 2 t 0 t 0 B ( 0 , ρ ) and we have:
    g · t 0 , x = w ρ 2 t 0 > ρ 2 t 0 = t 0 , x .
  • If g 0 · t 0 = w t 0 with w < 1 we take x = ρ 2 t 0 t 0 B ( 0 , ρ ) and we have:
    g 0 · t 0 , x = w ρ 2 t 0 > ρ 2 t 0 = t 0 , x .
In all these cases we can find x such that g 0 · t 0 , x > t 0 , x By continuity it exists r > 0 such that for all y on this ball we have g · t 0 , y > t 0 , y . Then the event { sup g G g · t 0 , ϵ > t 0 , ϵ } has non zero probability, since x is in the support of ϵ we have P ( ϵ B ( x , r ) ) > 0 . Thus Inequality in (29) will be strict. This proves that θ ( t 0 ) > 0 . ☐

3.5. Example of a Template Estimation Which is Consistent

In order to underline the importance of the hypotheses, we give an example where the method is consistent:
Example 8.
Let M be an Hilbert space and V a closed sub-linear space of M. Then G = V acts on M by (see Figure 10):
( v , m ) G × M m + v .
This action is not isometric, indeed m m + v is not linear (except if v = 0 ). However this action is invariant, let us consider V the orthogonal space of V. The variance in the quotient space is:
F ( m ) = E inf v V m + v Y 2 = E ( p ( m ) p ( Y ) 2 ) = E ( p ( m ) p ( t 0 ) + ϵ 2 ) ,
where p : M V the orthogonal projection on V . Then it is clear that t 0 minimizes F. In fact, s : [ m ] p ( m ) is just a congruent section of the quotient (see Section 2.1). Here, once again, we see the role played by the the congruent section (when it exists) in order to study the consistency.
Hence, is there a contradiction with Proposition 4 or Proposition 5 which prove inconsistency as soon as the noise level is large enough? In Proposition 4, we require that there is a subgroup acting isometrically, in this example the only element which acts linearly is the identity element m m + 0 , then H = { 0 } is the only possibility, however the support of the noise should not be included in the set of fixed point under the group action of H. Here, all points are fixed under H, hence it is not possible to fulfill this condition. Example 8 is not a contradiction with Proposition 4, it is also not a contradiction with Proposition 5 since it does not act linearly on data.

3.6. Inconsistency with Non Invariant Action and Regularization

In practice people add a regularization term in the function they minimize in LDDMM [11,14], or in Demons [25] etc. Because, if one considers two points, one does not want necessarily to fit one with the other. Indeed, even if one deformation matches exactly these two points, it may be an unrealistic deformation. So far, we did not study the use of such a term in the inconsistency.

3.6.1. Case of Deformations Closed to the Identity Element of G

If we suppose that the deformations Φ of the template is closed to identity, it is useless to take the infimum over G because G contains big deformations. Perhaps one of these big deformations can reaches the infimum in F, but this element is not the one which deformed the template in the generative model. Then such big deformations should not be taken into account. That is why, if we suppose that G can be equipped with a distance d G , then we can assume that there exists r > 0 such that the deformation Φ belongs almost surely to
B = B ( e , r ) = { g G , d G ( e , g ) < r } .
Instead of defining F ( m ) as E ( inf g G g · m Y 2 ) , one can define F ( m ) = E ( inf g B g · m Y 2 ) , and the previous proofs will still be true, when replacing for instance λ ( t 0 ) by λ ( t 0 ) = 1 t 0 2 E ( sup g B g · t 0 , Y ) etc. Likewise we need to replace the hypothesis “the support of ϵ is not included in the set of fixed points” by ”the support of ϵ in not included is the set of fixed points under the action restricted to B ”.
Note that restraining ourselves to B is equivalent to add a following regularization on the function F:
F ( m ) = E inf g G g · m Y 2 + R e g ( g ) with R e g ( g ) = 0 if g B + if g B .
Moreover considering only the elements in B will automatically satisfy the condition A < 2 in Proposition 5 as long as the group G acts continuously on the template, if r is small enough.

3.6.2. Inconsistency in the Case of a Group Acting Linearly with a Bounded Regularization

In this Section we suppose that the group G acts linearly. We also suppose that A < 2 . The regularization term is a bounded map R e g : G [ 0 , Ω ] . With this framework, we still able to prove that there is inconsistency as soon as the noise level is large enough:
Proposition 7.
Let G be a group acting linearly on M. We suppose that the orbit of the template t 0 is bounded with A = sup g G g · t 0 t 0 < 2 , the generative model is still Y = Φ · t 0 + σ ϵ . We define the pre-variance as:
F ( m ) = E inf g G Y g · m 2 + R e g ( g ) .
Then as soon as the noise level is large enough, i.e.,:
σ > σ c = t 0 θ ( t 0 ) A 2 + 1 + 1 ( a 2 + Ω t 0 2 ) ( 2 A 2 ) 2 A 2 .
Then t 0 is not a minimizser of F.
The proof is exactly the same as the Proof of Proposition 5, we take 0 as a lower bound of the the regularization term in the lower bound of F ( t 0 ) , and we take Ω as a upper bound of the regularization term in the upper bound of F ( λ ( t 0 ) t 0 ) . We solve a similar quadratic equation in order to find the critical σ .

4. Conclusions and Discussion

We provided an asymptotic behavior of the consistency bias when the noise level σ tends to infinity in the case of isometric action. As a consequence, the inconsistency can not be neglected when σ is large. When the action is no longer isometric, inconsistency has been also shown when the noise level is large.
However, we have not answered this question: can the inconsistency be neglected? When the noise level is small enough, then the consistency bias is small [6,7], hence it can be neglected. Note that the quotient space is not a manifold, this prevents us to use a priori the Central Limit theorem for manifold proved in [22]. However, if the Central Limit theorem could be applied to quotient space, the fluctuations induces an error which would be approximately equal to σ I and if K 1 I , then the inconsistency could be neglected because it is small compared to fluctuation. One way to avoid the inconsistency is to use another framework, for a instance a Bayesian paradigm [26].
In the numerical experiments we presented, we have seen that the estimated template is more crispy that the true template. The intuition is that the estimated template in computational anatomy with a group of diffeomorphisms is also more detailed. However, the true template is almost always unknown. It is then possible that one think that the computation of the template succeeded to capture small details of the template while it is just an artifact due to the inconsistency. Moreover in order to tackle this question, one needs to have a good modeliation of the noise, for instance in [1], the observations are curves, what is a relevant noise in the space of curves?
In this article, we have considered actions which do not let the distance invariant. Although we have only shown the inconsistency as soon as the noise level is large enough, the inequality used where not optimal at all, surely future works could improve this work and prove that inconsistency appears for small noise level. Moreover a quantification of the inconsistency should be established.

Author Contributions

Loïc Devilliers elaborated the main results and wrote the article; Stéphanie Allassonnière, Alain Trouvé and Xavier Pennec have co-supervised this work.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kurtek, S.A.; Srivastava, A.; Wu, W. Signal Estimation Under Random Time-Warpings and Nonlinear Signal Alignment. In Advances in Neural Information Processing Systems 24; Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2011; pp. 675–683. [Google Scholar]
  2. Guimond, A.; Meunier, J.; Thirion, J.P. Average Brain Models: A Convergence Study. Comput. Vis. Image Underst. 2000, 77, 192–210. [Google Scholar] [CrossRef]
  3. Joshi, S.; Davis, B.; Jomier, M.; Gerig, G. Unbiased diffeomorphic atlas construction for computational anatomy. Neuroimage 2004, 23, S151–S160. [Google Scholar] [CrossRef] [PubMed]
  4. Cootes, T.F.; Marsland, S.; Twining, C.J.; Smith, K.; Taylor, C.J. Groupwise diffeomorphic non-rigid registration for automatic model building. In Computer Vision—ECCV 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 316–327. [Google Scholar]
  5. Bigot, J.; Charlier, B. On the consistency of Fréchet means in deformable models for curve and image analysis. Electron. J. Stat. 2011, 5, 1054–1089. [Google Scholar] [CrossRef]
  6. Miolane, N.; Holmes, S.; Pennec, X. Template shape estimation: Correcting an asymptotic bias. SIAM J. Imaging Sci. 2017, 10, 808–844. [Google Scholar] [CrossRef]
  7. Devilliers, L.; Allassonnière, S.; Trouvé, A.; Pennec, X. Template estimation in computational anatomy: Fréchet means in top and quotient spaces are not consistent. SIAM J. Imaging Sci. 2017, in press. [Google Scholar]
  8. Panaretos, V.M.; Zemel, Y. Amplitude and phase variation of point processes. Ann. Stat. 2016, 44, 771–812. [Google Scholar] [CrossRef]
  9. Devilliers, L.; Pennec, X.; Allassonnière, S. Inconsistency of Template Estimation with the Fréchet Mean in Quotient Space. In Information Processing in Medical Imaging, Proceedings of the 25th International Conference (IPMI 2017), Boone, NC, USA, 25–30 June 2017; Niethammer, M., Styner, M., Aylward, S., Zhu, H., Oguz, I., Yap, P.T., Shen, D., Eds.; Springer: Cham, Switzerland, 2017; pp. 16–27. [Google Scholar]
  10. Thompson, D.W. On Growth and Form; Cambridge University Press: Cambridge, UK, 1942. [Google Scholar]
  11. Durrleman, S.; Prastawa, M.; Charon, N.; Korenberg, J.R.; Joshi, S.; Gerig, G.; Trouvé, A. Morphometry of anatomical shape complexes with dense deformations and sparse parameters. NeuroImage 2014, 101, 35–49. [Google Scholar] [CrossRef] [PubMed]
  12. Glasbey, C.; Mardia, K. A penalized likelihood approach to image warping. J. R. Stat. Soc. Ser. B Stat. Methodol. 2001, 63, 465–492. [Google Scholar] [CrossRef]
  13. Charlier, B. Necessary and sufficient condition for the existence of a Fréchet mean on the circle. ESAIM Probab. Stat. 2013, 17, 635–649. [Google Scholar] [CrossRef]
  14. Beg, M.F.; Miller, M.I.; Trouvé, A.; Younes, L. Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vis. 2005, 61, 139–157. [Google Scholar] [CrossRef]
  15. Huckemann, S. Inference on 3d procrustes means: Tree bole growth, rank deficient diffusion tensors and perturbation models. Scand. J. Stat. 2011, 38, 424–446. [Google Scholar] [CrossRef]
  16. Rohlf, F.J. Bias and error in estimates of mean shape in geometric morphometrics. J. Hum. Evol. 2003, 44, 665–683. [Google Scholar] [CrossRef]
  17. Goodall, C. Procrustes methods in the statistical analysis of shape. J. R. Stat. Soc. Ser. B Methodol. 1991, 53, 285–339. [Google Scholar]
  18. Ziezold, H. On expected figures and a strong law of large numbers for random elements in quasi-metric spaces. In Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians; Springer: Dordrecht, The Netherlands, 1977; pp. 591–602. [Google Scholar]
  19. Allassonnière, S.; Amit, Y.; Trouvé, A. Towards a coherent statistical framework for dense deformable template estimation. J. R. Stat. Soc. Ser. B Stat. Methodol. 2007, 69, 3–29. [Google Scholar] [CrossRef]
  20. Hitziger, S.; Clerc, M.; Gramfort, A.; Saillet, S.; Bénar, C.; Papadopoulo, T. Jitter-adaptive dictionary learning-application to multi-trial neuroelectric signals. arXiv, 2013; arXiv:1301.3611. [Google Scholar]
  21. Kurtek, S.; Klassen, E.; Ding, Z.; Avison, M.J.; Srivastava, A. Parameterization-invariant shape statistics and probabilistic classification of anatomical surfaces. In Proceedings of the 22nd International Conference on Information Processing in Medical Imaging, Monastery Irsee, Germany, 3–8 July 2011; pp. 147–158. [Google Scholar]
  22. Bhattacharya, A.; Bhattacharya, R. Statistics on Riemannian manifolds: Asymptotic distribution and curvature. Proc. Am. Math. Soc. 2008, 136, 2959–2967. [Google Scholar] [CrossRef]
  23. Cooley, J.W.; Tukey, J.W. An algorithm for the machine calculation of complex Fourier series. Math. Comput. 1965, 19, 297–301. [Google Scholar] [CrossRef]
  24. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  25. Lombaert, H.; Grady, L.; Pennec, X.; Ayache, N.; Cheriet, F. Spectral Log-Demons: Diffeomorphic Image Registration with Very Large Deformations. Int. J. Comput. Vis. 2013, 107, 254–271. [Google Scholar] [CrossRef]
  26. Cheng, W.; Dryden, I.L.; Huang, X. Bayesian registration of functions and curves. Bayesian Anal. 2016, 11, 447–475. [Google Scholar] [CrossRef]
Figure 1. Three functions defined on the interval [ 0 , 1 ] . The blue one ( f 0 ) is a step function, the red one ( f 1 ) is a translated version of the blue one when noise has been added, and the green one ( f 3 ) is the null function.
Figure 1. Three functions defined on the interval [ 0 , 1 ] . The blue one ( f 0 ) is a step function, the red one ( f 1 ) is a translated version of the blue one when noise has been added, and the green one ( f 3 ) is the null function.
Entropy 19 00288 g001
Figure 2. Due to the invariant action, the orbits are parallel. Here the orbits are circles centred at 0. This is the case when the group G is the group of rotations.
Figure 2. Due to the invariant action, the orbits are parallel. Here the orbits are circles centred at 0. This is the case when the group G is the group of rotations.
Entropy 19 00288 g002
Figure 3. Representation of the three functions f 1 , f 2 and f 3 with h = 0.05 . the functions f 2 and f 3 are registered with respect to f 1 . However f 2 and f 3 are not registered with each other, since it is more profitable to shift f 2 in order to align the highest parts of f 2 and f 3 .
Figure 3. Representation of the three functions f 1 , f 2 and f 3 with h = 0.05 . the functions f 2 and f 3 are registered with respect to f 1 . However f 2 and f 3 are not registered with each other, since it is more profitable to shift f 2 in order to align the highest parts of f 2 and f 3 .
Entropy 19 00288 g003
Figure 4. We minimize the variance on each half-line R + v where v = 1 . The element which minimizes the variance on such a half-line is λ ˜ ( v ) v , where λ ˜ ( v ) 0 . We get a surface in M by S v λ ˜ ( v ) v (which is a curve in this figure since we draw it in dimension 2). The Proof of Theorem 1 states that if [ m ] is a Fréchet mean then m is an extreme point of this surface. On this picture there are four extreme points which are in the same orbit: we took here the simple example of the group of rotations of 0, 90, 180 and 270 degrees.
Figure 4. We minimize the variance on each half-line R + v where v = 1 . The element which minimizes the variance on such a half-line is λ ˜ ( v ) v , where λ ˜ ( v ) 0 . We get a surface in M by S v λ ˜ ( v ) v (which is a curve in this figure since we draw it in dimension 2). The Proof of Theorem 1 states that if [ m ] is a Fréchet mean then m is an extreme point of this surface. On this picture there are four extreme points which are in the same orbit: we took here the simple example of the group of rotations of 0, 90, 180 and 270 degrees.
Entropy 19 00288 g004
Figure 5. Iterative minimization of the function J on the two axis, the horizontal axis represents the variable in the space M, the vertical axis represents the set of all the possible registrations G I . Once the convergence is reached, the point ( m n , g n ) is the minimum of the function J on the two axis in green. Is this point the minimum of J on its whole domain? There are two pitfalls: firstly this point could be a saddle point, it can be avoided with Proposition 2, secondly this point could be a local (but not global) minimum, this is discussed in Section 2.5.3.
Figure 5. Iterative minimization of the function J on the two axis, the horizontal axis represents the variable in the space M, the vertical axis represents the set of all the possible registrations G I . Once the convergence is reached, the point ( m n , g n ) is the minimum of the function J on the two axis in green. Is this point the minimum of J on its whole domain? There are two pitfalls: firstly this point could be a saddle point, it can be avoided with Proposition 2, secondly this point could be a local (but not global) minimum, this is discussed in Section 2.5.3.
Entropy 19 00288 g005
Figure 6. Template t 0 and template estimation m ^ on Figure 6a. Empirical variance at the template and template estimation with the max-max algorithm as a function of the size of the sample on Figure 6b. (a) Example of a template (a step function) and the estimated template m ^ with a sample size 10 5 in R 64 , ϵ is Gaussian noise and σ = 10 . At the discontinuity points of the template, we observe a Gibbs-like phenomena; (b) Variation of F I ( t 0 ) (in blue) and of F I ( m ^ ) (in red) as a function of I the size of the sample. Since convergence is already reached, F ( m ^ ) , which is the limit of red curve, is below F ( t 0 ) : F ( t 0 ) is the limit of the blue curve. Due to the inconsistency, m ^ is an example of point such that F ( m ^ ) < F ( t 0 ) .
Figure 6. Template t 0 and template estimation m ^ on Figure 6a. Empirical variance at the template and template estimation with the max-max algorithm as a function of the size of the sample on Figure 6b. (a) Example of a template (a step function) and the estimated template m ^ with a sample size 10 5 in R 64 , ϵ is Gaussian noise and σ = 10 . At the discontinuity points of the template, we observe a Gibbs-like phenomena; (b) Variation of F I ( t 0 ) (in blue) and of F I ( m ^ ) (in red) as a function of I the size of the sample. Since convergence is already reached, F ( m ^ ) , which is the limit of red curve, is below F ( t 0 ) : F ( t 0 ) is the limit of the blue curve. Due to the inconsistency, m ^ is an example of point such that F ( m ^ ) < F ( t 0 ) .
Entropy 19 00288 g006
Figure 7. Example of an other template (here a discretization of a continuous function) and template estimation with a sample size 10 3 in R 64 , ϵ is Gaussian noise and σ = 10 . Even with a continuous function the inconsistency appears.
Figure 7. Example of an other template (here a discretization of a continuous function) and template estimation with a sample size 10 3 in R 64 , ϵ is Gaussian noise and σ = 10 . Even with a continuous function the inconsistency appears.
Entropy 19 00288 g007
Figure 8. Example of three orbits, when d ˜ Q does not satisfy the inequality triangular.
Figure 8. Example of three orbits, when d ˜ Q does not satisfy the inequality triangular.
Entropy 19 00288 g008
Figure 9. Representation of the three cases, on each we can find an x in the support of the noise such as x , g 0 · t 0 > x , t 0 and by continuity of the dot product ϵ , g 0 · t 0 > ϵ , t 0 with is an event with a non zero probability, (for instance the ball in gray). This is enough in order to show that θ ( t 0 ) > 0 . (a) Case 1: t 0 and g · t 0 are linearly independent; (b) Case 2: g · t 0 is proportional to t 0 with a factor > 1 ; (c) Case 3: g · t 0 is proportional to t 0 with a factor < 1 .
Figure 9. Representation of the three cases, on each we can find an x in the support of the noise such as x , g 0 · t 0 > x , t 0 and by continuity of the dot product ϵ , g 0 · t 0 > ϵ , t 0 with is an event with a non zero probability, (for instance the ball in gray). This is enough in order to show that θ ( t 0 ) > 0 . (a) Case 1: t 0 and g · t 0 are linearly independent; (b) Case 2: g · t 0 is proportional to t 0 with a factor > 1 ; (c) Case 3: g · t 0 is proportional to t 0 with a factor < 1 .
Entropy 19 00288 g009
Figure 10. In the case of affine translation by vectors of V, the orbits are affine subspace parallel to V. The distance between two orbits [ x ] and [ y ] is given by the distance between the orthogonal projection of x and y in V . This is an example where template estimation is consistent.
Figure 10. In the case of affine translation by vectors of V, the orbits are affine subspace parallel to V. The distance between two orbits [ x ] and [ y ] is given by the distance between the orthogonal projection of x and y in V . This is an example where template estimation is consistent.
Entropy 19 00288 g010
Table 1. Empirical variances at 5 different outputs of the max-max algorithm coming from the same sample of size 10 4 , but with different starting points. We use σ = 10 and the action of time translation in R 64 . Conclusion: on these five points, only m ^ 3 is an eventual global minima.
Table 1. Empirical variances at 5 different outputs of the max-max algorithm coming from the same sample of size 10 4 , but with different starting points. We use σ = 10 and the action of time translation in R 64 . Conclusion: on these five points, only m ^ 3 is an eventual global minima.
PointsTemplate t 0 m ^ 1 m ^ 2 m ^ 3 m ^ 4 m ^ 5
Empirical variance at these points96.71495.68495.68195.67695.67795.682

Share and Cite

MDPI and ACS Style

Devilliers, L.; Allassonnière, S.; Trouvé, A.; Pennec, X. Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space. Entropy 2017, 19, 288. https://doi.org/10.3390/e19060288

AMA Style

Devilliers L, Allassonnière S, Trouvé A, Pennec X. Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space. Entropy. 2017; 19(6):288. https://doi.org/10.3390/e19060288

Chicago/Turabian Style

Devilliers, Loïc, Stéphanie Allassonnière, Alain Trouvé, and Xavier Pennec. 2017. "Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space" Entropy 19, no. 6: 288. https://doi.org/10.3390/e19060288

APA Style

Devilliers, L., Allassonnière, S., Trouvé, A., & Pennec, X. (2017). Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space. Entropy, 19(6), 288. https://doi.org/10.3390/e19060288

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop