Abstract
3D action recognition was shown to benefit from a covariance representation of the input data (3D positions of the joints). A kernel machine fed with such features is an effective paradigm for 3D action recognition, yielding state-of-the-art results. Yet, the whole framework is affected by the well-known scalability issue: in general, the kernel function has to be evaluated for all pairs of instances, inducing a Gram matrix whose complexity is quadratic in the number of samples. In this work we reduce such complexity to be linear by proposing a novel and explicit feature map to approximate the kernel function. This allows us to train a linear classifier with an explicit feature encoding, which implicitly implements a Log-Euclidean machine in a scalable fashion. Not only do we prove that the proposed approximation is unbiased, but we also work out an explicit strong bound for its variance, attesting a theoretical superiority of our approach with respect to existing ones. Experimentally, we verify that our representation provides a compact encoding and outperforms other approximation schemes on a number of publicly available benchmark datasets for 3D action recognition.
1 Introduction
Action recognition is a key research domain in video/image processing and computer vision, being nowadays ubiquitous in human-robot interaction, autonomous driving, elderly care and video-surveillance, to name a few applications [21]. Yet, challenging difficulties arise due to visual ambiguities (illumination variations, texture of clothing, general background noise, view heterogeneity, occlusions). As an effective countermeasure, joint-based skeletal representations (extracted from depth images) are a viable solution.
Combined with a skeletal representation, the symmetric and positive definite (SPD) covariance operator scores a sound performance in 3D action recognition [5, 9, 22]. Indeed, while properly modeling the skeletal dynamics with a second-order statistic, the covariance operator is also naturally able to handle the different temporal durations of action instances. This avoids slow pre-processing stages such as time warping or interpolation [20]. In addition, the superiority of such a representation is attested by the state-of-the-art performance achieved with a relatively simple classification pipeline [5, 22] where, basically, a non-linear Support Vector Machine (SVM) is trained using the Log-Euclidean kernel
\(K_{\ell E}(\mathbf {X},\mathbf {Y}) = \exp \left( -\frac{1}{2\sigma ^2} \Vert \log \mathbf {X} - \log \mathbf {Y} \Vert _F^2 \right) \qquad (1)\)
to compare covariance operators \(\mathbf {X}\), \(\mathbf {Y}\). In (1), for any SPD matrix \(\mathbf {X}\), we define
\(\log \mathbf {X} = \mathbf {U} \, \mathrm{diag}\left( \log \lambda _1, \dots , \log \lambda _d \right) \mathbf {U}^\top \qquad (2)\)
where \(\mathbf {U}\) is the matrix of eigenvectors which diagonalizes \(\mathbf {X}\) in terms of the eigenvalues \(\lambda _1 \ge \dots \ge \lambda _d > 0\). Very intuitively, for any fixed bandwidth \(\sigma > 0\), \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) computes a radial basis Gaussian function that compares the covariance operators \(\mathbf {X}\) and \(\mathbf {Y}\) by means of the Frobenius norm \(\Vert \cdot \Vert _F\) (after \(\mathbf {X},\mathbf {Y}\) have been log-projected). Computationally, the latter stage is not problematic (see Sect. 3) and can be performed for each covariance operator before computing the kernel. Together with its formal properties in Riemannian geometry, this makes (1) widely used in practice [5, 9, 22].
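For concreteness, (1)-(2) can be sketched in a few lines of NumPy (illustrative code with our own naming, not part of the original pipeline):

```python
import numpy as np

def spd_log(X):
    """Matrix logarithm of an SPD matrix via eigendecomposition, as in (2)."""
    lam, U = np.linalg.eigh(X)                 # eigenvalues lam > 0, eigenvectors U
    return U @ np.diag(np.log(lam)) @ U.T

def log_euclidean_kernel(X, Y, sigma=1.0):
    """Log-Euclidean kernel (1): Gaussian RBF on the log-projected SPD matrices."""
    D = spd_log(X) - spd_log(Y)
    return np.exp(-np.linalg.norm(D, 'fro') ** 2 / (2.0 * sigma ** 2))
```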
However, the modern big data regime undermines the applicability of such a kernel function. Indeed, since (1) has to be computed for every pair of instances in the dataset, the resulting Gram matrix has prohibitive size: its storage becomes time- and memory-expensive, and the related computations (required to train the model) are simply unfeasible.
This inconvenience can be solved as follows. According to well-established kernel theory [2], the kernel (1) induces an infinite-dimensional feature map \(\varphi \), meaning that \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = \langle \varphi (\mathbf {X}), \varphi (\mathbf {Y}) \rangle \). However, if we are able to obtain an explicit feature map \(\varPhi \) such that \(K_{\ell E}(\mathbf {X},\mathbf {Y}) \approx \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \), we can directly compute a finite-dimensional feature representation \(\varPhi (\mathbf {X})\) for each action instance separately. Then, with a compact \(\varPhi \), we can train a linear SVM instead of its kernelized version, which is totally feasible and quite efficient even in the big data regime [7]. Therefore, the whole pipeline actually provides a scalable implementation of a Log-Euclidean SVM, whose cost is reduced from quadratic to linear.
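To make the complexity argument concrete, the two training regimes can be contrasted with scikit-learn (an illustration under our naming, not the paper's code; the value \(C = 10\) anticipates the experimental setup of Sect. 4):

```python
from sklearn.svm import SVC, LinearSVC

# Exact route: an N x N Log-Euclidean Gram matrix must be precomputed and stored,
# so memory and training cost grow quadratically with the number of instances N.
def train_exact(gram_matrix, labels):
    return SVC(kernel='precomputed', C=10.0).fit(gram_matrix, labels)

# Approximated route: each instance is encoded separately as Phi(X) in R^nu, so only
# an N x nu feature matrix is needed and a linear SVM is trained at cost linear in N.
def train_approximated(feature_matrix, labels):
    return LinearSVC(C=10.0).fit(feature_matrix, labels)
```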
In our work we specifically tackle the aforementioned issue through the following main contributions.
1. We propose a novel compact and explicit feature map to approximate the Log-Euclidean kernel within a probabilistic framework.
2. We provide a rigorous mathematical formulation, proving that the proposed approximation has null bias and bounded variance.
3. We compare the proposed feature map approximation against alternative approximation schemes, showing the formal superiority of our framework.
4. We experimentally evaluate our method against the very same approximation schemes on six 3D action recognition datasets, confirming our theoretical findings in practice.
The rest of the paper is outlined as follows. In Sect. 2 we review the most relevant related literature. Section 3 proposes the novel approximation and discusses its foundation. We compare it with alternative paradigms in Sect. 4. Section 5 draws conclusions, and Appendix A reports all proofs of our theoretical results.
2 Related Work
In this Section, we summarize the most relevant works in both covariance-based 3D action recognition and kernels’ approximations.
Originally envisaged for image classification and detection tasks, the covariance operator has attracted growing interest in action recognition, giving rise to many different research trends: [9] extends it to the infinite-dimensional case, while [10] hierarchically combines it in a temporal pyramid; [12, 22] investigate the conceptual analogy with trial-specific kernel matrices, and [5] further proposes a new kernelization to model arbitrary, non-linear relationships conveyed by the raw data. However, such kernel methods usually do not scale up easily to big datasets due to demanding storage and computational costs. As a solution, the exact kernel representation can be replaced by an approximated, more efficient version. In the literature, this is done according to the following mainstream approaches.
(i) The kernel Gram matrix is replaced with a surrogate low-rank version, in order to alleviate both memory and computational costs. Within these methods, [1] applied Cholesky decomposition and [24] exploited Nyström approximation.

(ii) Instead of the exact kernel function k, an explicit feature map \(\varPhi \) is computed, so that the induced linear kernel \(\langle \varPhi (\mathbf {x}),\varPhi (\mathbf {y}) \rangle \) approximates \(k(\mathbf {x},\mathbf {y})\). Our work belongs to this class of methods.
In this context, Rahimi and Recht [17] exploited the formalism of the Fourier transform to approximate shift-invariant kernels \(k(\mathbf {x},\mathbf {y}) = k(\mathbf {x}-\mathbf {y})\) through an expansion of trigonometric functions. Leveraging a similar idea, Le et al. [13] sped up the computation by exploiting the Walsh-Hadamard transform, downgrading the running cost of [17] from linear to log-linear with respect to the data dimension. Recently, Kar and Karnick [11] proposed approximated feature maps for dot product kernels \(k(\mathbf {x},\mathbf {y}) = k(\langle \mathbf {x},\mathbf {y} \rangle )\) by directly exploiting the MacLaurin expansion of the kernel function.
Instead of considering a generic class of kernels, our work specifically focuses on the Log-Euclidean one, approximating it through a novel unbiased estimator which ensures an explicit bound on the variance (as only provided by [13]) and results in a superior classification performance with respect to [11, 13, 17].
3 The Proposed Approximated Feature Map
In this Section, we present the main theoretical contribution of this work, namely (i) a random, explicit feature map \(\varPhi \) such that \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \approx K_{\ell E}(\mathbf {X},\mathbf {Y})\), (ii) the proof of its unbiasedness and (iii) a strong theoretical bound on its variance.
Construction of the Approximated Feature Map. In order to construct a \(\nu \)-dimensional feature map \(\mathbf {X} \mapsto \varPhi (\mathbf {X}) = [\varPhi _1(\mathbf {X}),\dots ,\varPhi _\nu (\mathbf {X})] \in \mathbb {R}^{\nu }\) for any \(d \times d\) SPD matrix \(\mathbf {X}\), fix a probability distribution \(\rho \) supported over \(\mathbb {N}\). Each component \(\varPhi _1, \dots , \varPhi _\nu \) of the map is then independently computed according to the sampling scheme in (3).
The genesis of (3) can be explained by inspecting the feature map \(\varphi \) associated with the kernel \(K(x,y)=\exp \left( -\frac{1}{2\sigma ^2}|x-y|^2\right) \), where \(x,y \in \mathbb {R}\) for simplicity. One obtains \(\varphi (x) \propto \left[ 1, \sqrt{\frac{1}{1!\sigma ^2}}x,\sqrt{\frac{1}{2!\sigma ^4}}x^2,\sqrt{\frac{1}{3!\sigma ^6}}x^3,\dots \right] .\) Intuitively, (3) approximates the infinite-dimensional \(\varphi (x)\) by randomly selecting one of its components: this is the role played by \(n \sim \rho \). In addition, we introduce the \(\log \) mapping and replace the exponentiation with a Kronecker product. As a consequence, the random weights \(\mathbf {W}\) ensure that \(\varPhi (\mathbf {X})\) achieves a sound approximation of (1), in terms of unbiasedness and rapidly decreasing variance.
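The construction can be illustrated with the following sketch (reusing spd_log from Sect. 1). The sampling structure mirrors (3): for each component, draw \(n \sim \rho \), draw Gaussian weights \(\mathbf {W}\), and take a trace inner product with the n-fold Kronecker power of \(\log \mathbf {X}\). The normalization used below is one choice that makes \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) unbiased for (1) under the unit-norm assumption (our own derivation); the exact constants prescribed by (3) may differ.

```python
import math
import numpy as np

def draw_map(nu, d, theta, rng):
    """Draw the random parameters (n_j, W_j) of the map once: the SAME draw must be
    reused for every input matrix, otherwise <Phi(X), Phi(Y)> does not estimate (1)."""
    params = []
    for _ in range(nu):
        n = int(rng.geometric(theta)) - 1            # n ~ geometric rho, support {0, 1, 2, ...}
        W = rng.standard_normal((d ** n, d ** n))    # i.i.d. standard Gaussian weights
        params.append((n, W))
    return params

def phi(X, params, sigma, theta):
    """Component j is a normalized trace inner product between W_j and the n_j-fold
    Kronecker power of log(X); the normalization makes the induced linear kernel
    unbiased for (1) when ||log X||_F = 1 (our choice of constants)."""
    L = spd_log(X)
    L = L / np.linalg.norm(L, 'fro')                 # enforce the unit Frobenius norm assumption
    nu = len(params)
    out = np.empty(nu)
    for j, (n, W) in enumerate(params):
        Ln = np.ones((1, 1))
        for _ in range(n):
            Ln = np.kron(Ln, L)                      # log(X)^{(x) n}, size d^n x d^n (costly for large n)
        rho_n = theta * (1.0 - theta) ** n           # geometric pmf rho(n) = theta * (1 - theta)^n
        coeff = math.exp(-0.5 / sigma ** 2) / math.sqrt(
            nu * rho_n * math.factorial(n) * sigma ** (2 * n))
        out[j] = coeff * float(np.sum(W * Ln))       # tr(W^T log(X)^{(x) n})
    return out
```

The parameters are drawn once with draw_map and shared by all instances, so the whole dataset is encoded into an \(N \times \nu \) matrix that can be fed directly to a linear SVM.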
In the rest of this Section we discuss the theoretical foundation of our analysis; all proofs are deferred to Appendix A for convenience.
Unbiased Estimation. In order for a statistical estimator to be reliable, we need it to be at least unbiased, i.e., its expected value must be equal to the exact function it is approximating. The unbiasedness of the feature map \(\varPhi \) of Eq. (3) for the Log-Euclidean kernel (1) is proved by the following result.
Theorem 1
Let \(\rho \) be a generic probability distribution over \(\mathbb {N}\) and consider \(\mathbf {X}\) and \(\mathbf {Y}\), two generic SPD matrices such that \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _{F} = 1.\) Then, \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) is an unbiased estimator of (1). That is,
\(\mathbb {E}_{n,\mathbf {W}}\left[ \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \right] = K_{\ell E}(\mathbf {X},\mathbf {Y}), \qquad (4)\)
where the expectation is computed over n and \(\mathbf {W}\) which define \(\varPhi _j(\mathbf {X})\) as in (3).
Averaging over all possible realizations of n sampled from \(\rho \) and of the Gaussian weights \(\mathbf {W}\), Theorem 1 guarantees that the linear kernel \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) induced by \(\varPhi \) equals \(K_{\ell E}(\mathbf {X},\mathbf {Y})\). This formalizes the unbiasedness of our approximation.
On the Assumption \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _{F} = 1\). From a practical point of view, this assumption may seem limiting, but this is not the case. The reason is provided by Eq. (2), which gives a very convenient way to compute the logarithm of an SPD matrix. Since in (3) \(\varPhi (\mathbf {X})\) depends on \(\mathbf {X}\) only through \(\log \mathbf {X}\), we can simply use (2) and then divide each entry of the obtained matrix by \(\Vert \log \mathbf {X} \Vert _F\). This is a non-restrictive strategy to satisfy our assumption, analogous to requiring input vectors to have unitary norm, which is very common in machine learning [2].
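In code, this normalization amounts to a single rescaling (our illustration, reusing spd_log from the sketch in Sect. 1):

```python
import numpy as np

def normalized_log(X):
    """Log-project an SPD matrix via (2) and rescale it to unit Frobenius norm,
    so that the assumption of Theorems 1 and 2 holds."""
    L = spd_log(X)
    return L / np.linalg.norm(L, 'fro')
```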
Low-Variance. One can note that, in Theorem 1, even by choosing \(\nu = 1\) (a scalar feature map), \(\varPhi (\mathbf {X}) = [\varPhi _1(\mathbf {X})] \in \mathbb {R}\) is unbiased for (1). However, since \(\varPhi \) is an approximated finite version of the exact infinite feature map associated to (1), one would expect the quality of the approximation to be very bad in the scalar case, and to improve as \(\nu \) grows larger. This is indeed true, as proved by the following statement.
Theorem 2
The variance of \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) as an estimator of (1) can be explicitly bounded. Precisely,
\(\mathbb {V}_{n,\mathbf {W}}\left[ \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \right] \le \frac{\mathcal {C}_\rho }{\nu ^3} \exp \left( \frac{3 - 2\sigma ^2}{\sigma ^4} \right) , \qquad (5)\)
where \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _F = 1\) and the variance is computed over all possible realizations of \(n \sim \rho \) and \(\mathbf {W}\), the latter being element-wise sampled from a \(\mathcal {N}(0,\sigma ^2/\sqrt{n})\) distribution. Also, \(\mathcal {C}_\rho {\mathop {=}\limits ^\mathrm{def}} \sum _{n = 0}^{\infty } \frac{1}{\rho (n) n!}\), the series being convergent.
Let us discuss the bound on the variance provided by Theorem 2. Since the bandwidth \(\sigma \) of the kernel function (1) we want to approximate is fixed, the term \(\exp \left( \frac{3 - 2\sigma ^2}{\sigma ^4} \right) \) can be left out of our analysis. The bound in (5) is linear in \(\mathcal {C}_\rho \) and inversely cubic in \(\nu \). When \(\nu \) grows, the increased dimensionality of our feature map \(\varPhi \) makes the variance vanish rapidly, ensuring that the approximated kernel \(K_\varPhi (\mathbf {X}, \mathbf {Y})= \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) converges to the target one, that is \(K_{\ell E}\). Such a trend may be hampered by large values of \(\mathcal {C}_\rho .\) Since the latter depends on the distribution \(\rho \), let us fix it to be the geometric distribution \(\mathcal {G}(\theta )\) with parameter \(0 \le \theta < 1\). This yields the closed form (6) for \(\mathcal {C}_\rho \).
There is a trade-off between a low variance (i.e., a small \(\mathcal {C}_\rho \)) and a reduced computational cost for \(\varPhi \) (i.e., a small n). Indeed, choosing \(\theta \approx 1\) makes \(\mathcal {C}_\rho \) big in (6); in this case, the integer n sampled from \(\rho = \mathcal {G}(\theta )\) is small with high probability, which leads to a reduced number of Kronecker products to be computed in \(\log (\mathbf {X})^{\otimes n}\). Conversely, when \(\theta \approx 0\), although n and the related computational cost of \(\log (\mathbf {X})^{\otimes n}\) are likely to grow, \(\mathcal {C}_\rho \) is smaller, ensuring a lower variance for the estimator.
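For a geometric distribution parametrized as \(\rho (n) = \theta (1-\theta )^n\), \(n \in \mathbb {N}\) (one common convention, assumed here for illustration), the constant defined in Theorem 2 can be worked out explicitly:
\(\mathcal {C}_\rho = \sum _{n = 0}^{\infty } \frac{1}{\rho (n)\, n!} = \frac{1}{\theta } \sum _{n = 0}^{\infty } \frac{1}{n!} \left( \frac{1}{1-\theta }\right) ^{n} = \frac{1}{\theta } \exp \left( \frac{1}{1-\theta }\right) ,\)
which diverges rapidly as \(\theta \rightarrow 1\) and grows only as \(e/\theta \) when \(\theta \rightarrow 0\), consistently with the trade-off just described.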
As a final theoretical result, Theorems 1 and 2 immediately yield that
\(\mathbb {P}\left[ \left| K_\varPhi (\mathbf {X},\mathbf {Y}) - K_{\ell E}(\mathbf {X},\mathbf {Y}) \right| \ge \epsilon \right] \le \frac{\mathcal {C}_\rho }{\nu ^3 \epsilon ^2} \exp \left( \frac{3 - 2\sigma ^2}{\sigma ^4} \right) \qquad (7)\)
for every pair of unit Frobenius norm SPD matrices \(\mathbf {X},\mathbf {Y}\) and every \(\epsilon > 0\), as a straightforward implication of the Chebyshev inequality. This ensures that \(K_\varPhi \) differs in modulus from \(K_{\ell E}\) by more than \(\epsilon \) with a (low) probability \(\mathbb {P}\), which is inversely cubic in \(\nu \) and inversely quadratic in \(\epsilon \).
Final Remarks. To sum up, we have presented a constructive algorithm to compute a \(\nu \)-dimensional feature map \(\varPhi \) whose induced linear kernel is an unbiased estimator of the Log-Euclidean one. Additionally, we ensure an explicit bound on the variance which rapidly vanishes as \(\nu \) grows (inversely cubic decrease). This implies that \(\langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \) and \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) are close with very high probability, even at low \(\nu \) values. This implements a Log-Euclidean kernel in a scalable manner, downgrading the quadratic cost of computing \(K_{\ell E}(\mathbf {X},\mathbf {Y})\) for every pair \(\mathbf {X},\mathbf {Y}\) to the linear cost of evaluating the feature map \(\varPhi (\mathbf {X})\) as in (3) for every \(\mathbf {X}\). Practically, this achieves a linear implementation of the Log-Euclidean SVM.
4 Results
In this Section, we compare our proposed approximated feature map with the alternative ones by Rahimi and Recht [17], Kar and Karnick [11] and Le et al. [13] (see Sect. 2).
Theoretical Comparison. Let us notice that all the approaches [11, 13, 17] are also applicable to the Log-Euclidean kernel (1). Indeed, [13, 17] include our case of study, since \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = k(\log \mathbf {X} - \log \mathbf {Y})\) is shift invariant in log-space. At the same time, thanks to the assumption \(\Vert \log \mathbf {X} \Vert _F = \Vert \log \mathbf {Y} \Vert _F = 1\) as in Theorem 1, we obtain \(K_{\ell E}(\mathbf {X},\mathbf {Y}) = k(\langle \log \mathbf {X}, \log \mathbf {Y} \rangle )\) (see (13) in Appendix A), thus satisfying the hypothesis of Kar and Karnick [11].
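For concreteness, this is how the random Fourier feature construction of [17] specializes to (1) once instances are represented in log-space (a sketch under our conventions, reusing spd_log from Sect. 1; not the authors' code):

```python
import numpy as np

def rff_map(nu, dim, sigma, rng):
    """Random Fourier features for the Gaussian kernel, applied to x = vec(log X):
    omega is drawn from the kernel's spectral density N(0, I / sigma^2) and b from
    U[0, 2*pi]; the same draw is shared by all inputs."""
    omega = rng.standard_normal((nu, dim)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=nu)
    def encode(X):
        x = spd_log(X).ravel()                       # vectorized log-projection, as in (2)
        return np.sqrt(2.0 / nu) * np.cos(omega @ x + b)
    return encode
```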
As we proved in Theorem 1 for our approach, all the works [11, 13, 17] can also guarantee an unbiased estimation of the exact kernel function.
What makes our approach superior is the explicit bound on the variance (see Table 1). Indeed, [11, 17] are totally lacking in this respect. Moreover, although an analogous bound is provided in [13, Theorem 4], it only ensures an \(O(1/\nu )\) decrease rate for the variance with respect to the feature dimensionality \(\nu \), whereas we can guarantee an \(O(1/{\nu ^3})\) trend. This implies that we achieve a better approximation of the kernel with a lower-dimensional feature representation, which eases the training of the linear SVM [7].
Experimental Comparison. We report here the experimental comparison on 3D action recognition between our proposed approximation and the paradigms of [11, 13, 17].
Datasets. For the experiments, we considered UTKinect [23], Florence3D [19], MSR-Action-Pairs (MSR-pairs) [16], MSR-Action3D [3, 14], HDM-05 [15] and MSRC-Kinect12 [8] datasets.
For each dataset, we follow the usual training and testing splits proposed in the literature. For Florence3D and UTKinect, we use the protocols of [20]. For MSR-Action3D, we adopt the splits originally proposed by [14]. On MSRC-Kinect12, once highly corrupted action instances are removed as in [10], training is performed on odd-index subjects, while testing is done on the even-index ones. On HDM-05, training exploits all instances of the "bd" and "mm" subjects, with "bk", "dg" and "tr" left out for testing [22], using the 65-action-class protocol of [6].
Data Preprocessing. As a common pre-processing step, we normalize the data by computing, for each timestamp, the relative displacements between the \(x\)-\(y\)-\(z\) coordinates of all joints and those of the hip (central) joint.
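For illustration, one possible implementation of this preprocessing and of the covariance encoding used throughout is sketched below (our sketch: the (T, J, 3) array layout, the hip index and the small regularizer are assumptions, not specified by the paper):

```python
import numpy as np

def action_covariance(joints, hip_index=0, eps=1e-6):
    """joints: array of shape (T, J, 3) with the 3D joint positions of one action.
    Each frame is expressed relative to the hip (central) joint, flattened to a
    3J-dimensional vector, and the covariance over the T frames is returned."""
    rel = joints - joints[:, hip_index:hip_index + 1, :]   # relative displacements w.r.t. the hip
    feats = rel.reshape(joints.shape[0], -1)               # (T, 3J) per-frame descriptors
    C = np.cov(feats, rowvar=False)                        # (3J, 3J) covariance operator
    return C + eps * np.eye(C.shape[0])                    # small regularizer (our addition) to keep it SPD
```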
Results. Figure 1 reports the quantitative performance while varying \(\nu \) in the range 10, 20, 50, 100, 200, 500, 1000, 2000, 5000. When comparing with [13], since the input dimensionality must be a power of 2, we zero-padded our vectorized representation to 4096 and (whenever possible) 2048 and 1024 input dimensions. These cases are then compared with the results related to \(\nu =\) 5000, 2000, 1000 for our approach (RGW) and [11, 17], respectively. Since all approaches have a random component, we performed ten repetitions for each method and dimensionality setup, averaging the classification performances obtained through a linear SVM with \(C = 10\). We employed the publicly available code for [11, 13, 17]. Finally, we also report the classification performance of the exact method, obtained by feeding an SVM with the Log-Euclidean kernel whose bandwidth \(\sigma \) is chosen via cross-validation.
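A schematic version of this protocol is given below (our sketch: dataset loading and the specific encoder are placeholders, while C = 10 and the ten repetitions follow the setup just described):

```python
import numpy as np
from sklearn.svm import LinearSVC

def evaluate(make_encoder, train_set, test_set, nu, n_runs=10, C=10.0, seed=0):
    """Average test accuracy over n_runs independent redraws of a nu-dimensional
    random feature map.  make_encoder(nu, rng) must return a fixed map
    phi: covariance -> R^nu, so that one draw is shared by training and test data."""
    accuracies = []
    for run in range(n_runs):
        encode = make_encoder(nu, np.random.default_rng(seed + run))
        X_tr = np.stack([encode(cov) for cov, _ in train_set])
        y_tr = np.array([label for _, label in train_set])
        X_te = np.stack([encode(cov) for cov, _ in test_set])
        y_te = np.array([label for _, label in test_set])
        clf = LinearSVC(C=C).fit(X_tr, y_tr)
        accuracies.append(float((clf.predict(X_te) == y_te).mean()))
    return float(np.mean(accuracies))
```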
Discussion. For large \(\nu \) values, all methods are able to reproduce the performance of the Log-Euclidean kernel (black dotted line in Fig. 1). Still, in almost all cases, our approximation outperforms the competitors: for instance, we clearly surpass Rahimi and Recht on both MSR-Pairs and MSR-Action3D, while Kar and Karnick score a much lower performance on HDM-05 and Florence3D. When compared to Le et al., the performance is closer, but this happens for all the methods which are able to match the performance of the Log-Euclidean kernel at \(\nu = 2000, 5000\). The true superiority of our approach is evident for small \(\nu \) values (\(\nu =\) 10, 20, 50). Indeed, our approximation always provides a much more rapidly growing accuracy (MSR-Action3D, Florence3D and UTKinect), with only a few cases where the gap is thinner (Kar and Karnick [11] on MSR-Pairs and Rahimi and Recht [17] on MSRC-Kinect12). Therefore, our approach ensures a more descriptive and compact representation, providing a superior classification performance.
5 Conclusions
In this work we propose a novel scalable implementation of a Log-Euclidean SVM to perform proficient classification of SPD (covariance) matrices. We achieve a linear complexity by providing an explicit random feature map whose induced linear kernel is an unbiased estimator of the exact kernel function.
Our approach proved to be more effective than alternative approximations [11, 13, 17], both theoretically and experimentally. Theoretically, we achieve an explicit bound on the variance of the estimator (such a result is totally absent in [11, 17]), which decreases at an inversely cubic pace versus the inversely linear one of [13]. Experimentally, through a broad evaluation, we assess the superiority of our representation, which provides a superior classification performance at a lower dimensionality.
References
Bach, F.R., Jordan, M.I.: Predictive low-rank decomposition for kernel methods. In: ICML (2005)
Bishop, C.M.: Pattern Recognition and Machine Learning - Information Science and Statistics. Springer, New York (2006)
Bloom, V., Makris, D., Argyriou, V.: G3D: a gaming action dataset and real time action recognition evaluation framework. In: CVPR (2012)
Casella, G., Berger, R.: Statistical Inference. Duxbury Advanced Series in Statistics and Decision Sciences. Thomson Learning, Boston (2002)
Cavazza, J., Zunino, A., San Biagio, M., Murino, V.: Kernelized covariance for action recognition. In: ICPR (2016)
Cho, K., Chen, X.: Classifying and visualizing motion capture sequences using deep neural networks. CoRR 1306.3874 (2014)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. JMLR 9, 1871–1874 (2008)
Fothergill, S., Mentis, H.M., Kohli, P., Nowozin, S.: Instructing people for training gestural interactive systems. In: ACM-CHI (2012)
Harandi, M., Salzmann, M., Porikli, F.: Bregman divergences for infinite dimensional covariance matrices. In: CVPR (2014)
Hussein, M., Torki, M., Gowayyed, M., El-Saban, M.: Human action recognition using a temporal hierarchy of covariance descriptors on 3D joint locations. In: IJCAI (2013)
Kar, P., Karnick, H.: Random feature maps for dot product kernels. In: AISTATS (2012)
Koniusz, P., Cherian, A., Porikli, F.: Tensor representations via kernel linearization for action recognition from 3D skeletons. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016 Part IV. LNCS, vol. 9908, pp. 37–53. Springer, Cham (2016). doi:10.1007/978-3-319-46493-0_3
Le, Q., Sarlos, T., Smola, A.: Fastfood - approximating kernel expansion in loglinear time. In: ICML (2013)
Li, W., Zhang, Z., Liu, Z.: Action recognition based on a bag of 3D points. In: CVPR Workshop (2010)
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: HDM-05 doc. Technical report (2007)
Oreifej, O., Liu, Z.: HON4D: histogram of oriented 4D normals for activity recognition from depth sequences. In: CVPR (2013)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: NIPS (2007)
Rudin, W.: Real and Complex Analysis, 3rd edn. McGraw-Hill Inc., New York (1987)
Seidenari, L., Varano, V., Berretti, S., Bimbo, A.D., Pala, P.: Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses. In: CVPR Workshops (2013)
Vemulapalli, R., Arrate, F., Chellappa, R.: Human action recognition by representing 3D skeletons as points in a lie group. In: CVPR, June 2014
Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods. Front. Robot. AI 2, 28 (2015)
Wang, L., Zhang, J., Zhou, L., Tang, C., Li, W.: Beyond covariance: feature representation with nonlinear kernel matrices. In: ICCV (2015)
Xia, L., Chen, C.C., Aggarwal, J.: View invariant human action recognition using histograms of 3D joints. In: CVPR Workshops (2012)
Zhang, K., Tsang, I.W., Kwok, J.T.: Improved Nyström low-rank approximation. In: ICML (2008)
A Proofs of All Theoretical Results
In this Appendix we report the formal proofs for both the unbiased approximation (Theorem 1) and the related rapidly decreasing variance (Theorem 2).
Proof of Theorem 1. Using the definition of (3) and the linearity of the expectation, we get that \(\mathbb {E}_{n,\mathbf {W}}\left[ \langle \varPhi (\mathbf {X}), \varPhi (\mathbf {Y}) \rangle \right] \) equals (8), by simply noticing that the dependence on \(\mathbf {W}\) involves only the terms inside the trace operators. Let us focus on the term \(\mathrm{tr}\left( \mathbf {W}^\top \log (\mathbf {X})^{\otimes n} \right) \), which can be expanded as (9) by using the definition of \(\log (\mathbf {X})^{\otimes n}\) and the properties of the trace operator. In Eq. (9), we replace the random coefficient \(w_{i_1,\dots ,i_{2n}}\) with the product of \(u^{(1)}_{i_1,i_2},\dots ,u^{(n)}_{i_{2n - 1},i_{2n}}\), independent and identically distributed (i.i.d.) according to a \(\mathcal {N}(0,\sigma ^2)\) distribution. We can then notice that (9) can be rewritten as (10).
Making use of (10) in (8), and exploiting the independence of the \(u^{(\alpha )}_{i,j}\), we can rewrite \(\mathbb {E}_{n,\mathbf {W}}\left[ K_\varPhi (\mathbf {X},\mathbf {Y})\right] \) as (11). By furthermore using the fact that \(\mathbb {E}_{\mathbf {W}}\left[ u^{(1)}_{i,j}u^{(1)}_{h,k} \right] = 0\) whenever \(i \ne h\) or \(j \ne k\), together with the formula for the variance of a Gaussian distribution, we get (12), where \(\langle \mathbf {A}, \mathbf {B} \rangle _F = \sum _{i,j = 1}^d \mathbf {A}_{ij} \mathbf {B}_{ij}\) denotes the Frobenius inner product between matrices \(\mathbf {A}\) and \(\mathbf {B}\). By expanding the expectation over \(\rho \), (12) becomes (13). The thesis easily follows from (13) by using the Taylor expansion of the exponential function and the assumption \(\Vert \log (\mathbf {X})\Vert _F = \Vert \log (\mathbf {Y})\Vert _F = 1\). \(\square \)
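For the reader's convenience, the key expectation identity underlying steps (9)-(12) can be stated compactly (our reconstruction under the reparametrization above, not the paper's exact display): writing \(\mathbf {A} = \log \mathbf {X}\) and \(\mathbf {B} = \log \mathbf {Y}\), the independence of the \(u^{(k)}_{ij} \sim \mathcal {N}(0,\sigma ^2)\) across both \(k\) and the index pairs gives
\(\mathbb {E}_{\mathbf {W}}\left[ \mathrm{tr}\left( \mathbf {W}^\top \mathbf {A}^{\otimes n}\right) \, \mathrm{tr}\left( \mathbf {W}^\top \mathbf {B}^{\otimes n}\right) \right] = \prod _{k=1}^{n} \left( \sum _{i,j = 1}^{d} \mathbb {E}\left[ \big( u^{(k)}_{ij}\big) ^2\right] \mathbf {A}_{ij} \mathbf {B}_{ij} \right) = \sigma ^{2n} \, \langle \mathbf {A}, \mathbf {B} \rangle _F^{\,n},\)
so that the expectation over \(n \sim \rho \) rebuilds the exponential series of \(K_{\ell E}\) term by term.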
Proof of Theorem 2. Due to the independence of the components of \(\varPhi \), by definition of inner product we get \( \mathbb {V}_{n,\mathbf {W}} \left[ \langle \varPhi (\mathbf {X}),\varPhi (\mathbf {Y}) \rangle \right] = \nu \mathbb {V}_{n,\mathbf {W}} \left[ \varPhi _1(\mathbf {X}) \varPhi _1(\mathbf {Y}) \right] \). Then \(\mathbb {V}_{n,\mathbf {W}} \left[ \langle \varPhi (\mathbf {X}),\varPhi (\mathbf {Y}) \rangle \right] \le \nu \mathbb {E}_{n,\mathbf {W}} \left[ \varPhi _1(\mathbf {X})^2 \varPhi _1(\mathbf {Y})^2 \right] \) by definition of variance. Taking advantage of (3) then yields (14), where \(u^{(1)}_{i_1,i_2},\dots ,u^{(n)}_{i_{2n - 1},i_{2n}}\) are i.i.d. from a \(\mathcal {N}(0,\sigma ^2)\) distribution and are used to re-parametrize the original weights \(\mathbf {W}\). Exploiting the independence of the \(u^{(\alpha )}_{ij}\), we rewrite (14) as (15). We then exploit the zero correlation of the weights in \(\mathbf {U}\) and the formula \(\mathbb {E}[(\mathcal {N}(0,\sigma ^2))^4] = 3\sigma ^4\) [4], obtaining (16). Since \(\sum _{i,j=1}^d \log (\mathbf {X})^2_{ij} \log (\mathbf {Y})^2_{ij} \le \left( \sum _{i,j=1}^d \log (\mathbf {X})^2_{ij} \right) \left( \sum _{i,j=1}^d \log (\mathbf {Y})^2_{ij} \right) = 1\), due to the assumption of unitary Frobenius norm for both \( \log \mathbf {X}\) and \( \log \mathbf {Y}\), we get (17).
We can now expand the expectation over \(\rho \) in (17), achieving (18), since the series of the products is bounded by the product of the series, provided that both converge. This is actually the case: by exploiting the MacLaurin expansion of the exponential function, we easily get \(\sum _{n = 0}^\infty \left( \frac{3}{\sigma ^4}\right) ^n \frac{1}{n!} = \exp \left( \frac{3}{\sigma ^4} \right) \). On the other hand, since \(\rho \) is a probability distribution supported over \(\mathbb {N}\), so that \(\sum _{n = 0}^\infty \rho (n) = 1\), it must be that \( \lim _{n \rightarrow \infty } \frac{\rho (n+1)}{\rho (n)} = L\) with \(0 < L \le 1\). Then, since \(\lim _{n \rightarrow \infty } \frac{\rho (n)}{\rho (n+1)} = \frac{1}{L} < \infty \) and \(\lim _{n \rightarrow \infty } \frac{1}{n +1} = 0\), by the ratio criterion for positive-term series [18], there must exist a constant \(\mathcal {C}_\rho > 0\) such that (19) holds. Therefore, by combining (19) with (18), we obtain (20),
which is the thesis. \(\square \)