Abstract
The FastICA algorithm is one of the most popular algorithms in the domain of independent component analysis (ICA). Despite its success, it is observed that FastICA occasionally yields outcomes that do not correspond to any true solutions (known as demixing vectors) of the ICA problem. These outcomes are commonly referred to as spurious solutions. Although FastICA is a well-studied ICA algorithm, the occurrence of spurious solutions is not yet completely understood by the ICA community. In this contribution, we aim to address this issue. In the first part of this work, we characterize the relationship between the demixing vectors, the local optimizers of the contrast function and the (attractive or unattractive) fixed points of the FastICA algorithm. We show that there exists an inclusion relationship between these sets. In the second part, we investigate the possible scenarios in which spurious solutions occur. It is shown that when certain bimodal Gaussian mixture distributions are involved, there may exist spurious solutions that are attractive fixed points of FastICA. In this case, popular nonlinearities such as “Gauss” or “tanh” tend to yield spurious solutions, whereas “kurtosis” gives much more reliable results.
Notes
In this paper, notation \(\subset\) stands for the “subset” rather than the “proper subset” inclusion. Hence, \({\mathbb {D}}\subset {\mathbb {L}}\) does not exclude \({\mathbb {D}}={\mathbb {L}}\).
Acknowledgments
The author would like to express his deepest gratitude to Prof. A. Dermoune for his invaluable guidance. The author would also like to thank the anonymous referees for carefully reading the manuscript and for their many helpful and constructive suggestions, which resulted in the present work.
Appendix
1.1 Proof of Theorem 7
Without loss of generality, we may suppose that the mixing matrix \({\mathbf {A}}\) is an identity matrix. Let \({\mathbf {v}}\) be a fixed point, and write \({\mathbf {v}}=(v_1,\ldots ,v_d)^{\mathsf {T}}\). By (14), we have for any \(i\in \{1,\ldots ,d\}\),
or equivalently, after some algebraic simplifications,
where \(\kappa _i\mathop {=}\limits ^{\mathrm {def}}{\mathbb {E}}[s_i^4]-3\ne 0\) by assumption \({\mathcal {A}}_2\) (see Sect. 3.1). Denote by \({\mathcal {I}}\) the set of indices i such that \(v_i\ne 0\). It follows from (26) that for \(i\in {\mathcal {I}}\)
Since \(\sum _{i=1}^d v_i^2=1\), we deduce from (27)
or equivalently
Then, we can rewrite (26) as
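A sketch (ours) of the missing chain of identities (26)–(28), assuming the kurtosis nonlinearity \(g(y)=y^3\); this assumption is consistent with the identities \({\mathbb {E}}[g'({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}})]=3\) and \({\mathbf {M}}=3(2{\mathbf {v}}{\mathbf {v}}^{\mathsf {T}}+{\mathbf {I}}+{\mathbf {D}})\) used below, but the exact form and numbering of the original displays are not reproduced here:

```latex
% Sketch (our reconstruction, assuming g(y) = y^3, A = I, x = s, ||v|| = 1).
% Since E[(v^T s)^3 s_i] = 3 v_i + \kappa_i v_i^3 and E[g'(v^T s)] = 3,
% the fixed-point relation (14) reduces to
\kappa_i\, v_i^{3} = \lambda\, v_i \quad (i = 1,\ldots,d),
\qquad\text{hence}\qquad
\kappa_i\, v_i^{2} = \lambda \quad (i \in \mathcal{I}),
% for some scalar \lambda \neq 0. Summing v_i^2 = \lambda/\kappa_i over
% i \in \mathcal{I} and using \sum_i v_i^2 = 1 gives
\lambda = \Bigl(\sum_{j \in \mathcal{I}} \kappa_j^{-1}\Bigr)^{-1},
\qquad
v_i^{2} = \frac{\kappa_i^{-1}}{\sum_{j \in \mathcal{I}} \kappa_j^{-1}}
\quad (i \in \mathcal{I}).
% One checks that \lambda = -\alpha(v) with
% \alpha(w) := E[g'(w^T s) - g(w^T s) w^T s],
% which matches the identity D_i = \kappa_i v_i^2 = -\alpha(v) invoked below.
```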
Now let us calculate \({\mathbf {M}}\mathop {=}\limits ^{\mathrm {def}}{\mathbb {E}}[g'({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}}){\mathbf {s}}{\mathbf {s}}^{\mathsf {T}}]\). We have
From this, we deduce that \({\mathbf {M}}=3(2{\mathbf {v}}{\mathbf {v}}^{\mathsf {T}}+ {\mathbf {I}}+{\mathbf {D}})\), where \({\mathbf {D}}\) is a diagonal matrix with the ith diagonal entry \({\mathbf {D}}_i=\kappa _i v_i^2\). Since
it follows from (27) that \({\mathbf {D}}_i=-\alpha ({\mathbf {v}})\) for \(i\in {\mathcal {I}}\) and \({\mathbf {D}}_i=0\) otherwise. Besides, since \({\mathbb {E}}[g'({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}}){\mathbf {I}}]=3{\mathbf {I}}\), we have \({\mathbb {E}}[g'({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}})({\mathbf {I}}-{\mathbf {s}}{\mathbf {s}}^{\mathsf {T}})]=3{\mathbf {I}}-{\mathbf {M}}=-3({\mathbf {D}}+ 2{\mathbf {v}}{\mathbf {v}}^{\mathsf {T}})\). From this, we deduce that
where \(\bar{\mathbf {D}}\mathop {=}\limits ^{\mathrm {def}}-{\mathbf {D}}/|\alpha ({\mathbf {v}})|\). Clearly, the diagonal entries of \(\bar{\mathbf {D}}\) satisfy \(\bar{\mathbf {D}}_i=1\) for \(i\in {\mathcal {I}}\) and \(\bar{\mathbf {D}}_i=0\) otherwise. Besides, it is easy to see that \(\bar{\mathbf {D}}{\mathbf {v}}={\mathbf {v}}\), which implies \({\mathrm {span}}({\mathbf {v}})\subset {\mathrm {range}}(\bar{\mathbf {D}})\). Denote by \(\#{\mathcal {I}}\) the cardinality of \({\mathcal {I}}\). Since \(\dim ({\mathrm {span}}({\mathbf {v}}))=1\) and \(\dim ({\mathrm {range}}(\bar{\mathbf {D}}))=\#{\mathcal {I}}\ge 1\), this inclusion becomes an equality if and only if \(\#{\mathcal {I}}=1\), i.e., there is exactly one entry \(v_i\ne 0\), or equivalently, \({\mathbf {v}}={\mathbf {e}}_i\in {\mathbb {D}}\) for some i. If this is the case, then we have immediately \(\varvec{f}'({\mathbf {v}})=0\) by (28). Otherwise, take any vector \({\mathbf {u}}=(u_1,\ldots ,u_d)^{\mathsf {T}}\) such that \(\Vert {\mathbf {u}}\Vert =1\), \({\mathbf {u}}^{\mathsf {T}}{\mathbf {v}}=0\) and \(u_i=0\) for \(i\notin {\mathcal {I}}\). We have
On the one hand, by the submultiplicativity of the spectral norm,
on the other hand, there also holds
It then follows that \(\Vert \varvec{f}'({\mathbf {v}})\Vert =3\) for any \({\mathbf {v}}\in {\mathbb {F}}\backslash {\mathbb {D}}\). This fact also implies \({\mathbb {D}}={\mathbb {L}}\).
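As a numerical illustration (ours; the choice of two uniform sources is an assumption for the example), the following Python sketch checks that a unit vector with \(v_i^2\propto \kappa _i^{-1}\) on its support satisfies the one-unit FastICA fixed-point relation, i.e., that \({\mathbb {E}}[{\mathbf {s}}\,g({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}})]-{\mathbb {E}}[g'({\mathbf {v}}^{\mathsf {T}}{\mathbf {s}})]{\mathbf {v}}\) is parallel to \({\mathbf {v}}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent unit-variance uniform sources: excess kurtosis kappa_i = -1.2,
# of the same sign, so the spurious direction v_i^2 = kappa_i^{-1}/sum_j kappa_j^{-1} exists.
n = 2_000_000
s = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, n))

kappa = (s**4).mean(axis=1) - 3.0            # empirical excess kurtosis kappa_i

# Candidate spurious fixed point: v_i^2 proportional to 1/kappa_i (see the sketch above).
v = np.sqrt((1.0 / kappa) / (1.0 / kappa).sum())

g, dg = (lambda y: y**3), (lambda y: 3.0 * y**2)   # "kurtosis" nonlinearity
y = v @ s
h = (s * g(y)).mean(axis=1) - dg(y).mean() * v     # one-unit FastICA direction

print(h / v)  # the two ratios agree (~ -0.6), so h is parallel to v: a fixed point
```

By Theorem 7, such a \({\mathbf {v}}\) is a fixed point but not an attractive one (\(\Vert \varvec{f}'({\mathbf {v}})\Vert =3\)), so FastICA iterations started nearby drift away from it.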
1.2 Proof of Proposition 8
Without loss of generality, in what follows we take \({\mathbf {A}}={\mathbf {I}}\) for simplicity.
1.2.1 Case \(d=2\)
Let us first consider the simplest case \(d=2\). If \(\alpha ({\mathbf {e}}_1)\) and \(\alpha ({\mathbf {e}}_2)\) have the same sign, say, positive, then \({\mathbf {e}}_1\) and \({\mathbf {e}}_2\) are local minimizers of the contrast function \({\mathcal {J}}(\cdot )\) on \({\mathcal {S}}\). Write \(f(\theta )\mathop {=}\limits ^{\mathrm {def}}{\mathbb {E}}[G(\cos (\theta )s_1 + \sin (\theta )s_2)]\).
Then, it is easy to see that \(\theta _1=0\) and \(\theta _2=\pi /2\) are local minimizers of \(f(\theta )\) on \({\mathbb {R}}\). From this, we deduce immediately that f reaches a local maximum at some interior point \(\theta _0\in (\theta _1,\theta _2)\). The corresponding vector \({\mathbf {v}}\mathop {=}\limits ^{\mathrm {def}}\big (\cos (\theta _0), \sin (\theta _0)\big )^{\mathsf {T}}\) is then a local maximizer of \({\mathcal {J}}({\mathbf {w}})\) on \({\mathcal {S}}\).
In fact, we have proved the following result:
Lemma 10
Let \(s_1\) and \(s_2\) be two random variables such that the quantity \({\mathbb {E}}[g'(s_i) - g(s_i)s_i]\) has the same sign for \(i=1,2.\) Then, \(f(\theta )\mathop {=}\limits ^{\mathrm {def}}{\mathbb {E}}[G(\cos (\theta )s_1 + \sin (\theta )s_2)]\) has a local optimum at some \(\theta _0\in (0,2\pi )\). Moreover, the angle \(\theta _0\) satisfies
Equality (31) comes directly from (14).
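A sketch (ours) of what equality (31) plausibly states, obtained by setting \(f'(\theta _0)=0\) and writing \({\mathbf {w}}(\theta )=(\cos \theta ,\sin \theta )^{\mathsf {T}}\), \({\mathbf {s}}_{12}=(s_1,s_2)^{\mathsf {T}}\):

```latex
% Our sketch of (31): differentiating f(\theta) = E[G(w(\theta)^T s_{12})]
% and setting f'(\theta_0) = 0 gives
\mathbb{E}\!\left[ g\!\left(\mathbf{w}(\theta_0)^{\mathsf T}\mathbf{s}_{12}\right)
  \left(-\sin(\theta_0)\,s_1 + \cos(\theta_0)\,s_2\right) \right] = 0,
% or equivalently
\mathbb{E}\!\left[ g\!\left(\mathbf{w}(\theta_0)^{\mathsf T}\mathbf{s}_{12}\right)
  \mathbf{s}_{12} \right] \propto \mathbf{w}(\theta_0),
% which is the fixed-point condition (14) restricted to the (s_1, s_2)-plane.
```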
1.2.2 Case \(d>2\)
Suppose \({\mathbf {e}}_i\) and \({\mathbf {e}}_j\) are two demixing vectors such that both \(\alpha ({\mathbf {e}}_i)\) and \(\alpha ({\mathbf {e}}_j)\) are positive. Write \(f^{(i,j)}(\theta )\mathop {=}\limits ^{\mathrm {def}}{\mathbb {E}}[G(\cos (\theta )s_i + \sin (\theta )s_j)]\).
By Lemma 10, there exists \(\theta '\in (0,2\pi )\) such that \(\theta '\) maximizes \(f^{(i,j)}(\theta )\) and satisfies
where \({\mathbf {s}}_{ij}=(s_i,s_j)^{\mathsf {T}}\). Consider the vector \({\mathbf {u}}=(u_1,\ldots ,u_d)^{\mathsf {T}}\) with \(u_i=\cos (\theta ')\), \(u_j=\sin (\theta ')\) and \(u_k=0\) for \(k\ne i,j\). Clearly, \({\mathbf {u}}^{\mathsf {T}}{\mathbf {s}}=\cos (\theta ')s_i + \sin (\theta ')s_j={\mathbf {w}}(\theta ')^{\mathsf {T}}{\mathbf {s}}_{ij}\). It then follows from (32) that
which implies \({\mathbf {u}}\in {\mathbb {F}}\) by (14).
In the particular case where \(s_i\) and \(s_j\) have the same distribution, we must have \(\cos (\theta ')=\sin (\theta ')=1/\sqrt{2}\) by symmetry. This means \(\theta '=\pi /4\).
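The symmetry argument can be checked numerically. The following Python sketch (ours; \(\mu =0.95\) and the sample size are arbitrary choices) draws two i.i.d. Bimod\((\mu )\) sources (see Sect. 1.4.2) and verifies that the derivative \(f'(\theta )={\mathbb {E}}[g({\mathbf {w}}(\theta )^{\mathsf {T}}{\mathbf {s}}_{12})(-\sin (\theta )s_1+\cos (\theta )s_2)]\) vanishes at \(\theta '=\pi /4\) up to sampling error, here with the “tanh” nonlinearity:

```python
import numpy as np

rng = np.random.default_rng(1)

def bimod(mu, size, rng):
    """Sample Bimod(mu) of Sect. 1.4.2: an equal-weight mixture of
    N(+mu, 1 - mu^2) and N(-mu, 1 - mu^2); zero mean, unit variance."""
    signs = np.where(rng.random(size) < 0.5, mu, -mu)
    return rng.normal(signs, np.sqrt(1.0 - mu**2))

n, mu = 1_000_000, 0.95
s1, s2 = bimod(mu, n, rng), bimod(mu, n, rng)

g = np.tanh                                  # "tanh" nonlinearity, g = G'
theta = np.pi / 4                            # candidate critical angle
y = np.cos(theta) * s1 + np.sin(theta) * s2

# f'(theta) = E[g(y)(-sin(theta) s1 + cos(theta) s2)]; ~0 at theta = pi/4 by symmetry.
print(g(y) @ (-np.sin(theta) * s1 + np.cos(theta) * s2) / n)
```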
1.3 Proof of Proposition 9
Suppose that \({\mathbf {u}}\in {\mathbb {R}}^2\) is a spurious attractive fixed point in the case \(d=2\) with \(s_1{\sim } {\mathcal {D}}_1\) and \(s_2{\sim } {\mathcal {D}}_2\). Then \(\Vert \varvec{f}'({\mathbf {u}})\Vert <1\) and \({\mathbf {u}}^{\mathsf {T}}{\mathbf {x}}=cs_1 + \sqrt{1-c^2}s_2\)
for some real scalar \(c\in (0,1)\). Now let us consider the case \(d=n>2\) with \(s_i{\sim } {\mathcal {D}}_1\) and \(s_j{\sim }{\mathcal {D}}_2\) for some indices \(i\ne j\). In the sequel, we assume \(i=1\), \(j=2\) for simplicity. Take \({\mathbf {v}}=(c,\sqrt{1-c^2},0,\ldots ,0)^{\mathsf {T}}\in {\mathbb {R}}^n\).
It is easy to see that \({\mathbf {v}}^{\mathsf {T}}{\mathbf {x}}=cs_1 + \sqrt{1-c^2}s_2\) and \({\mathbf {v}}\in {\mathbb {F}}\). Next, we show \(\Vert \varvec{f}'({\mathbf {v}})\Vert =\Vert \varvec{f}'({\mathbf {u}})\Vert <1\), where \(\varvec{f}'({\mathbf {v}})\) and \(\varvec{f}'({\mathbf {u}})\) are, respectively, \(n\times n\) and \(2\times 2\) matrices. Note that these two “\(\varvec{f}'(\cdot )\)” are different mappings, for they are determined by different ICA models. Denote respectively by \({\mathbf {A}}_{\mathbf {u}}\) and \({\mathbf {A}}_{\mathbf {v}}\) the mixing matrices of the two models. Since the mixing matrix \({\mathbf {A}}_{\mathbf {u}}\) is orthogonal, for any \({\mathbf {w}}\in {\mathbb {R}}^2\) we have
A similar equality also holds for \({\mathbf {A}}_{\mathbf {v}}\) and \({\mathbf {w}}\in {\mathbb {R}}^n\). Denote
Notice that
by the construction of \({\mathbf {u}}\) and \({\mathbf {v}}\). This implies \({\mathbf {u}}^{\mathsf {T}}{\mathbf {x}}\) and \({\mathbf {v}}^{\mathsf {T}}{\mathbf {x}}\) have the same distribution and therefore \(\alpha ({\mathbf {u}})=\alpha ({\mathbf {v}})\). Besides, it is easily seen that
From (36), we can deduce that \(\Vert {\mathbf {B}}_{\mathbf {u}}\Vert =\Vert {\mathbf {B}}_{\mathbf {v}}\Vert\). Finally, combining this result with (35) gives \(\Vert \varvec{f}'({\mathbf {v}})\Vert <1\).
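Proposition 9 thus reduces the existence of spurious attractive fixed points in dimension \(n>2\) to the two-dimensional case. As an exploratory illustration (ours; \(\mu =0.95\), the sample size and the initial angle are arbitrary choices), the following Python sketch iterates the one-unit FastICA update with the “tanh” nonlinearity on two i.i.d. Bimod\((\mu )\) sources; for sufficiently bimodal sources the iterates may settle near the spurious angle \(\pi /4\) instead of a demixing direction:

```python
import numpy as np

rng = np.random.default_rng(2)

def bimod(mu, size, rng):
    """Equal-weight mixture of N(+/-mu, 1 - mu^2): the Bimod(mu) of Sect. 1.4.2."""
    signs = np.where(rng.random(size) < 0.5, mu, -mu)
    return rng.normal(signs, np.sqrt(1.0 - mu**2))

n, mu = 1_000_000, 0.95
x = np.vstack([bimod(mu, n, rng), bimod(mu, n, rng)])   # A = I, so x = s

g  = np.tanh                                            # "tanh" nonlinearity
dg = lambda y: 1.0 - np.tanh(y)**2

w = np.array([np.cos(0.55), np.sin(0.55)])              # start between e1 and pi/4
for _ in range(100):
    y = w @ x
    w_new = (x * g(y)).mean(axis=1) - dg(y).mean() * w  # one-unit FastICA update
    w_new /= np.linalg.norm(w_new)
    if w_new @ w < 0:        # fixed points are defined up to sign
        w_new = -w_new
    w = w_new

# Compare the limiting angle with 0 or pi/2 (demixing) and pi/4 (spurious).
print(np.arctan2(w[1], w[0]), np.pi / 4)
```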
1.4 Probability distributions used in Table 1
1.4.1 Generalized Gaussian distribution
The generalized Gaussian density function with parameter \(\alpha\), zero mean and unit variance is given by
where \(\alpha\) is a positive parameter that controls the distribution's exponential rate of decay, \(\Gamma\) is the Gamma function and
This generalized Gaussian family encompasses the standard normal distribution for \(\alpha =2\), the Laplace distribution for \(\alpha =1\) and the uniform distribution in the limit \(\alpha \rightarrow \infty\).
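A sketch (ours) of the density and of the scale constant in a standard zero-mean, unit-variance parametrization consistent with the limiting cases just stated:

```latex
% Our sketch of a standard zero-mean, unit-variance generalized Gaussian density:
f_\alpha(x) \;=\; \frac{\alpha}{2\lambda\,\Gamma(1/\alpha)}
  \exp\!\left( -\left( \frac{|x|}{\lambda} \right)^{\alpha} \right),
\qquad
\lambda \;=\; \sqrt{\frac{\Gamma(1/\alpha)}{\Gamma(3/\alpha)}} .
% Checks: \alpha = 2 gives \lambda = \sqrt{2} and the standard normal density;
% \alpha = 1 gives \lambda = 1/\sqrt{2} and the unit-variance Laplace density.
```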
1.4.2 Bimodal distribution with Gaussian mixture
The bimodal distribution used in this paper consists of a mixture of two Gaussian distributions. Define the random variable \(X=ZY_1+(1-Z)Y_2\),
where \(Y_i{\sim }{\mathcal {N}}(\mu _i,\sigma _i^2)\) and \(Z{\sim }{\mathcal {B}}(p)\) are mutually independent random variables. Here, \({\mathcal {B}}(p)\) denotes the Bernoulli distribution with parameter p, i.e., \({\mathbb {P}}(Z=1)=p\) and \({\mathbb {P}}(Z=0)=1-p\). It is easy to see that the probability density function (PDF) of X is given by \(f_X=pf_{Y_1}+(1-p)f_{Y_2}\),
where \(f_{Y_i}\) is the PDF of \(Y_i\) for \(i=1,2\).
Now take any \(\mu _1,\mu _2\) such that \(\mu _1\mu _2<0\) and \(|\mu _1\mu _2|\le 1\), then let \(\sigma ^2_1=\sigma ^2_2=1-|\mu _1\mu _2|\) and \(p=-\mu _2/(\mu _1-\mu _2)\).
Defined in such a way, X is a random variable with zero mean, unit variance and two modes at \(\mu _1\) and \(\mu _2\). Since the PDF of X is completely determined by \(\mu _1,\mu _2\), we use them as controlling parameters and denote by “Bimod\((\mu _1,\mu _2)\)” the distribution of X. Note that if \(\mu _1=-\mu _2\), then \(p=1/2\) and the distribution of X is symmetric. In this case, we write simply “Bimod\((\mu )\)” with \(\mu =|\mu _1|\). Note also that the “bpsk” distribution is actually Bimod(1).
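As a quick sanity check (ours; the parameter values are arbitrary), the following Python snippet samples Bimod\((\mu _1,\mu _2)\) with the mixture weight \(p=-\mu _2/(\mu _1-\mu _2)\) used above and verifies the zero-mean, unit-variance normalization empirically:

```python
import numpy as np

rng = np.random.default_rng(3)

def bimod2(mu1, mu2, size, rng):
    """Sample Bimod(mu1, mu2): X = Z*Y1 + (1-Z)*Y2 with Z ~ B(p),
    Y_i ~ N(mu_i, 1 - |mu1*mu2|) and p = -mu2/(mu1 - mu2)."""
    assert mu1 * mu2 < 0 and abs(mu1 * mu2) <= 1
    p = -mu2 / (mu1 - mu2)
    sd = np.sqrt(1.0 - abs(mu1 * mu2))
    z = rng.random(size) < p
    return np.where(z, rng.normal(mu1, sd, size), rng.normal(mu2, sd, size))

x = bimod2(1.2, -0.5, 1_000_000, rng)
print(x.mean(), x.var())   # approximately 0 and 1
```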