Abstract
Nonnegative matrix factorization (NMF) is a standard linear dimensionality reduction technique for nonnegative data sets. To measure the discrepancy between the input data and the low-rank approximation, the Kullback–Leibler (KL) divergence is one of the most widely used objective functions for NMF. It corresponds to the maximum likelihood estimator when the underlying statistics of the observed data sample follow a Poisson distribution, and KL NMF is particularly meaningful for count data sets, such as documents. In this paper, we first collect important properties of the KL objective function that are essential to study the convergence of KL NMF algorithms. Second, together with reviewing existing algorithms for solving KL NMF, we propose three new algorithms that guarantee the non-increasingness of the objective function. We also provide a global convergence guarantee for one of our proposed algorithms. Finally, we conduct extensive numerical experiments to provide a comprehensive picture of the performance of the KL NMF algorithms.
References
Ang, A.M.S., Gillis, N.: Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Comput. 31(2), 417–439 (2019)
Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)
Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017). https://doi.org/10.1287/moor.2016.0817
Bioucas-Dias, J.M., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., Chanussot, J.: Hyperspectral unmixing overview: geometrical, statistical, and sparse regression-based approaches. IEEE J. Select. Top. Appl. Earth Obs. Remote Sens. 5(2), 354–379 (2012)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1), 459–494 (2014)
Carbonetto, P., Luo, K., Dey, K., Hsiao, J., Stephens, M.: fastTopics: fast algorithms for fitting topic models and non-negative matrix factorizations to count data (2021). R package version 0.4-11. https://github.com/stephenslab/fastTopics
Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)
Chi, E.C., Kolda, T.G.: On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl. 33(4), 1272–1299 (2012)
Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley (2009)
Dey, K.K., Hsiao, C.J., Stephens, M.: Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genet. 13(3), e1006599 (2017)
Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput. Stat. Data Anal. 52(8), 3913–3927 (2008)
Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)
Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)
Févotte, C., Cemgil, A.T.: Nonnegative matrix factorizations as probabilistic inference in composite models. In: 2009 17th European Signal Processing Conference, pp. 1913–1917 (2009)
Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural Comput. 23(9), 2421–2456 (2011). https://doi.org/10.1162/NECO_a_00168
Finesso, L., Spreij, P.: Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra Appl. 416(2), 270–287 (2006). https://doi.org/10.1016/j.laa.2005.11.012
Fu, X., Huang, K., Sidiropoulos, N.D., Ma, W.K.: Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. IEEE Signal Process. Mag. 36(2), 59–80 (2019)
Gillis, N.: The why and how of nonnegative matrix factorization. In: Suykens, J., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, Machine Learning and Pattern Recognition chap 12, pp. 257–291. Chapman & Hall/CRC, Boca Raton, Florida (2014)
Gillis, N.: Introduction to nonnegative matrix factorization. SIAG/OPT Views News 25(1), 7–16 (2017)
Gillis, N., Hien, L.T.K., Leplat, V., Tan, V.Y.: Distributionally robust and multi-objective nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell. (2021). https://doi.org/10.1109/TPAMI.2021.3058693
Gonzalez, E.F., Zhang, Y.: Accelerating the Lee-Seung algorithm for nonnegative matrix factorization. Technical report, Rice University (2005). https://scholarship.rice.edu/handle/1911/102034
Guillamet, D., Vitrià, J.: Non-negative matrix factorization for face recognition. In: Escrig, M.T., Toledo, F., Golobardes, E. (eds.) Topics in Artificial Intelligence, pp. 336–344. Springer, Berlin (2002)
Hasinoff, S.W.: Photon, Poisson Noise (2014). http://people.csail.mit.edu/hasinoff/pubs/hasinoff-photon-2011-preprint.pdf
Hien, L.T.K., Gillis, N., Patrinos, P.: Inertial block proximal method for non-convex non-smooth optimization. In: Thirty-seventh International Conference on Machine Learning (ICML 2020) (2020)
Ho, N.D., Dooren, P.V.: Non-negative matrix factorization with fixed row and column sums. Linear Algebra Appl. 429(5), 1020–1025 (2008). Special issue devoted to selected papers presented at the 13th Conference of the International Linear Algebra Society
Hong, M., Razaviyayn, M., Luo, Z.Q., Pang, J.S.: A unified algorithmic framework for block-structured optimization involving big data: with applications in machine learning and signal processing. IEEE Signal Process. Mag. 33(1), 57–77 (2015)
Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)
Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072 (2011)
Huang, K., Sidiropoulos, N.D., Liavas, A.P.: A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process. 64(19), 5052–5065 (2016)
Kim, C., Kim, Y., Klabjan, D.: Scale invariant power iteration (2019). arXiv:1905.09882
Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)
Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001)
Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 2nd edn. Cambridge University Press (2014). http://mmds.org
Li, L., Lebanon, G., Park, H.: Fast Bregman divergence NMF using Taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-12), pp. 307–315. Association for Computing Machinery, New York, NY, USA (2012)
Lin, C.J.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 18(6), 1589–1596 (2007)
Lin, X., Boutros, P.C.: Optimization and expansion of non-negative matrix factorization. BMC Bioinform. 21(1), 1–10 (2020)
Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28, 333–354 (2018)
Lucy, L.B.: An iterative technique for the rectification of observed distributions. Astron. J. 79, 745 (1974)
Ma, W., Bioucas-Dias, J.M., Chan, T., Gillis, N., Gader, P., Plaza, A.J., Ambikapathi, A., Chi, C.: A signal processing perspective on hyperspectral unmixing: insights from remote sensing. IEEE Signal Process. Mag. 31(1), 67–81 (2014)
Nesterov, Y.: Lectures on Convex Optimization. Springer (2018)
Razaviyayn, M., Hong, M., Luo, Z.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)
Richardson, W.H.: Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62(1), 55–59 (1972)
Shahnaz, F., Berry, M.W., Pauca, V., Plemmons, R.J.: Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 42(2), 373–386 (2006). http://www.sciencedirect.com/science/article/pii/S0306457304001542
Sun, D.L., Févotte, C.: Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205 (2014)
Takahashi, N., Hibi, R.: Global convergence of modified multiplicative updates for nonnegative matrix factorization. Comput. Optim. Appl. 57(2), 417–440 (2014)
Teboulle, M., Vaisbourd, Y.: Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci. 13(1), 381–421 (2020)
Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 16(12), 371–416 (2015)
Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Tech. rep. (2008)
Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013). https://doi.org/10.1137/120887795
Yanez, F., Bach, F.: Primal-dual algorithms for non-negative matrix factorization with the Kullback-Leibler divergence. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2257–2261 (2017)
Zhong, S., Ghosh, J.: Generative model-based document clustering: a comparative study. Knowl. Inf. Syst. 8(3), 374–384 (2005)
Acknowledgements
We thank the anonymous reviewers for their insightful comments that helped us improve the paper. We also thank Peter Carbonetto for helpful feedback and useful references.
Funding
This work was supported by the European Research Council (ERC starting grant n\(^\text {o}\) 679515), and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47.
Ethics declarations
Conflict of interest
Not applicable.
Availability of data and material
The data sets that support the findings of this study are freely available online, and available from the corresponding author, Nicolas Gillis, upon reasonable request.
Code availability
The codes used to perform the experiments in this paper are available from https://github.com/LeThiKhanhHien/KLNMF.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A Technical Proofs
A.1 Proof of Proposition 1
Proposition 1: (A) Given \(\varepsilon \ge 0\) and a nonnegative matrix V, Problem (2) attains its minimum, that is, it has at least one optimal solution.
(B) Let \( \nu =\sum _{i=1}^m\sum _{j=1}^n V_{ij}\), and let \(D^*(V,\varepsilon )\) denote the optimal value of Problem (2). We have \(D^*(V,\varepsilon ) \le D^*(V,0) + (\min \{ n+ mr, m+nr\} \sqrt{\nu }+mn\varepsilon )\varepsilon . \)
Proof
Problem (2) can be rewritten as follows
We note that
Let us now prove the parts (A) and (B) of Proposition 1 separately.
(A) We note that
Then we have
Let us now consider Problem (19) with V such that \(\sum _{i=1}^m\sum _{j=1}^n V_{ij}=1\). In the following, we separate the proof into two cases: \(\varepsilon =0\), which corresponds to the original KL NMF Problem (1), and \(\varepsilon >0\), for which the objective of Problem (19) is finite at every pair of nonnegative matrices (W, H).
Case 1 \(\varepsilon =0\). We use the methodology in the proof of [16, Proposition 2.1]. We first observe that \(WH=\sum _{i=1}^r W_{:,i} H_{i,:}\); hence, without loss of generality, we can consider the case where H has no zero rows, since otherwise we could consider Problem (1) with a rank smaller than r. Given a feasible solution (W, H) of (1), let \(h_k=\sum _{j=1}^n H_{kj}\) and let \(\mathrm{Diag}(h)\) be the diagonal matrix whose diagonal is h. Then \((\tilde{W},\tilde{H})\), where \(\tilde{W}= W \mathrm{Diag}(h)\) and \(\tilde{H}= (\mathrm{Diag}(h))^{-1} H\), is also a feasible solution of (1), and it preserves the objective value since \(\tilde{W} \tilde{H}=WH\). We observe that \(\sum _{j=1}^n \tilde{H}_{kj}=1\). Hence, we can restrict our search for an optimal solution to the set
We note that \( \lim _{x\rightarrow 0} (x - V_{ij} \log x) = +\infty \) for a given \(V_{ij}>0\), and that \(x \mapsto (x - V_{ij} \log x)\) is decreasing for \(x<V_{ij}\). Therefore, there exists a positive constant \(\delta \) such that \((WH)_{ij} \ge \delta \) for all i, j with \(V_{ij}>0\); otherwise, if for every \(\delta >0\) there exists (i, j) with \(V_{ij}>0\) such that \((WH)_{ij} \le \delta \), then we could choose an arbitrarily small \(\delta <\min \{V_{ij}:V_{ij}>0\}\), which would make \((WH)_{ij} - V_{ij} \log (WH)_{ij}\) arbitrarily large. Therefore, we can further restrict the constraint set of (1) to the set
It follows from [32, Theorem 2] (see also Sect. 3.1) that, given a feasible solution (W, H), a multiplicative update step \(W_{ik}^+= W_{ik} \big (\sum _{l=1,V_{il}>0}^n \frac{H_{kl} V_{il}}{(WH)_{il}} \big )\big / \big (\sum _{l=1}^n H_{kl} \big ) \), for \(i\in [m], k\in [r]\), provided it is well defined, does not increase the objective function. Hence, given \((W,H)\in \mathcal A_2\), we observe that
Therefore, we can further restrict our search for the optimal solutions of (1) to
By choosing an appropriate \(\delta \), we note that \((W,1/n\, ee^\top ) \in \mathcal {A}_3\) with \(W_{ik}=1/m \sum _{l} V_{il}\) for \(i\in [m], k\in [r]\); hence, the set \(\mathcal {A}_3\) is non-empty. Since \(\mathcal {A}_3\) is closed and bounded and D(V|WH) is continuous on \(\mathcal {A}_3\), Problem (1) has at least one solution.
Case 2 \(\varepsilon >0\). When \(\varepsilon >0\), we see that (0, 0) is a feasible solution of (19). Let
We can then restrict our search for the optimal solutions of Problem (19) to the constraint set \(\mathcal A_4=\{(W,H): W,H \ge 0, f_\varepsilon (W,H)\le C\}. \) Using the inequality \(\log x \le x-1\) for all \(x>0\), we obtain
where \(v_{\max }=\max _{ij} V_{ij}\); we use \((WH)_{ij}\ge 0\) in the second inequality and \(\sum _{ij}V_{ij}=1\) in the third inequality. Hence, \(f_\varepsilon (W,H)\le C\) implies that W and H are bounded; therefore, \(\mathcal A_4\) is bounded. We also see that \(\mathcal A_4\) is closed; hence, it is a compact set. The objective of (19) is continuous on this set since \(\varepsilon >0\). Therefore, Problem (19) has at least one solution.
(B) Let us fix \(\varepsilon >0\). Note that all points of \(\mathcal {A}_3\) are feasible solutions of Problem (19). Hence, we have
where in the first inequality we use the fact that the optimal value over a larger set does not exceed the optimal value over a smaller set. On the other hand, it follows from (20) and the case \(\sum _{i,j} V_{ij}=1\) that, for \((W,H)\in \mathcal A_3\), we have
Hence \(D^*(V,\varepsilon ) \le D^*(V,0) + n \varepsilon + mr \varepsilon + mn \varepsilon ^2\). By exchanging the role of W and \(H^\top \) and noting that \(D(V|WH)=D(V^\top |H^\top W^\top )\), we can prove a similar bound \(D^*(V,\varepsilon ) \le D^*(V,0) + m \varepsilon + nr \varepsilon + mn \varepsilon ^2\). Together with (21), we obtain Result (B). \(\square \)
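For clarity, let us sketch how the normalized case \(\sum _{i,j} V_{ij}=1\) treated above yields the bound with a general \(\nu =\sum _{i,j}V_{ij}\) in (B); this sketch assumes, as in the MU updates derived in the next subsection, that Problem (2) imposes the entrywise lower bounds \(W\ge \varepsilon \) and \(H\ge \varepsilon \). Setting \(V'=V/\nu \), \(W'=W/\sqrt{\nu }\) and \(H'=H/\sqrt{\nu }\) maps the feasible pairs of (2) with data V and parameter \(\varepsilon \) onto those with data \(V'\) and parameter \(\varepsilon /\sqrt{\nu }\), and
\[
D(V\,|\,WH)=\nu \, D(V'\,|\,W'H'), \qquad \text{hence} \qquad D^*(V,\varepsilon )=\nu \, D^*\big (V',\varepsilon /\sqrt{\nu }\big ).
\]
Applying the normalized bound with \(\varepsilon /\sqrt{\nu }\) in place of \(\varepsilon \) and multiplying by \(\nu \) gives \(D^*(V,\varepsilon ) \le D^*(V,0) + (n+mr)\sqrt{\nu }\,\varepsilon + mn\varepsilon ^2\), and similarly with \(m+nr\), which is the bound stated in (B).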
A.2 BSUM Framework for the MU
The MU is a block successive upper-bound minimization (BSUM) algorithm in which each column of H and each row of W is updated by minimizing a majorized function of the KL objective. Let us first introduce BSUM, and then derive the MU for solving (2).
BSUM was proposed in [41] to solve the minimization problem in (12).
In the notation of (12), let us first formally define a majorized function.
Definition 4
A function \(u_i : \mathcal X_i \times \mathcal X \rightarrow \mathbb R\) is said to majorize f(x) if
Using the majorized functions, at iteration k, BSUM fixes the latest values of block \(j\ne i\) and updates block \(x_i\) by
From Definition 4 we have
In other words, BSUM produces a non-increasing sequence \(\{f(x^k)\}\).
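More precisely, when a single block i is updated at iteration k (so that \(x^{k+1}\) differs from \(x^{k}\) only in block i), the majorization property of \(u_i\), the optimality of \(x_i^{k+1}\) in the BSUM subproblem, and the tightness of \(u_i\) at \(x^k\) combine into the standard chain of inequalities
\[
f\big (x^{k+1}\big ) \;\le \; u_i\big (x_i^{k+1},x^{k}\big ) \;\le \; u_i\big (x_i^{k},x^{k}\big ) \;=\; f\big (x^{k}\big ).
\]
In the following, we give a majorized function for \(D(t|Wh)\) defined in (7).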
Proposition 10
[31, Lemma 3] Let
Then \(u_\mathrm{MU}(h,h^k)\) majorizes the function \(h\mapsto D(t|Wh)\).
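For the reader's convenience, one standard way to write such a majorizer (a reconstruction based on Jensen's inequality, which may differ from the exact display of [31, Lemma 3] by constant terms) is
\[
u_\mathrm{MU}(h,h^k) \;=\; \sum _{i=1}^m \Big ( (Wh)_i - t_i \sum _{l=1}^r \lambda ^k_{il} \log \frac{W_{il}\, h_l}{\lambda ^k_{il}} \Big ) + \sum _{i=1}^m \big ( t_i \log t_i - t_i \big ), \qquad \text{where } \lambda ^k_{il} = \frac{W_{il}\, h^k_l}{(Wh^k)_i}.
\]
Indeed, since \(\sum _{l=1}^r \lambda ^k_{il}=1\) and \(\log \) is concave, we have \(\log (Wh)_i \ge \sum _{l=1}^r \lambda ^k_{il} \log \frac{W_{il} h_l}{\lambda ^k_{il}}\), so that \(u_\mathrm{MU}(h,h^k) \ge D(t|Wh)\) with equality at \(h=h^k\); minimizing \(u_\mathrm{MU}(\cdot ,h^k)\) over \(h\ge \varepsilon \) yields exactly the update given below.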
When BSUM uses \(u_\mathrm{MU}(\cdot ,\cdot )\) to update a column h of H (similarly for a row of W) by \(h^{k+1} = \arg \min _{h\ge \varepsilon } u_\mathrm{MU}(h,h^k)\), we obtain \( (h^{k+1})_l=\max \left\{ \frac{(h^k)_l}{\sum _{i=1}^m W_{il}}\sum _{i=1}^m \frac{t_i W_{il}}{(Wh^k)_i}, \varepsilon \right\} , \) leading to the MU. More specifically, the updates of MU for solving Problem (2) are
We note that the non-increasingness of \(f(x^k)\) produced by BSUM for Problem (12) does not by itself guarantee the convergence of \( \{x^k\}\). To guarantee some form of convergence for \( \{x^k\}\), we need \(\varepsilon > 0\) so that the objective function is directionally differentiable, which is not the case when \(\varepsilon = 0\), \((WH)_{ij}=0\) and \(V_{ij}>0\) for some i, j.
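As an illustration of the MU described above, here is a minimal NumPy sketch of the alternating updates with the lower bound \(\varepsilon \). The function name, random initialization and fixed iteration count are ours for illustration only; this is not the authors' implementation, which is available at the repository given under "Code availability".

import numpy as np

def mu_kl_nmf(V, r, eps=1e-8, n_iter=500, seed=0):
    # Multiplicative updates for KL NMF, keeping all entries of W and H >= eps.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # H <- max(eps, H * (W^T (V / (WH))) / (W^T 1))
        H = np.maximum(eps, H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None])
        # W <- max(eps, W * ((V / (WH)) H^T) / (1 H^T))
        W = np.maximum(eps, W * ((V / (W @ H)) @ H.T) / H.sum(axis=1)[None, :])
    return W, H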
A.3 Proof of Theorem 1
Theorem 1: Suppose Assumption 1 is satisfied. Let \(\{x^k\}\) be the sequence generated by Algorithm 1. Let also \(\varPhi (x)= f (x)+ \sum _i I_{\mathcal X_i}(x_i)\), where \(I_{\mathcal X_i}\) is the indicator function of \(\mathcal X_i\). We have
(i) \(\varPhi \big (x^k\big )\) is non-increasing;

(ii) Suppose \(\mathcal X_i \subset \mathrm{int\, dom} \, \kappa _i\). If \(\{x^k\}\) is bounded and \(\kappa _i(x_i)\) is strongly convex on bounded subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then every limit point of \(\{x^k\}\) is a critical point of \(\varPhi \);

(iii) If together with the conditions in (ii) we assume that \(\nabla \kappa _i\) and \(\nabla _i f\) are Lipschitz continuous on bounded convex subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then the whole sequence \(\{x^k\}\) converges to a critical point of \(\varPhi \).
Proof
We follow the methodology established in [5], which is based on the following theorem, to prove the global convergence of BMD.
Theorem 3
Let \(\varPhi : \mathbb {R}^N\rightarrow (-\infty ,+\infty ]\) be a proper and lower semicontinuous function which is bounded from below. Let \(\mathcal {A}\) be a generic algorithm which generates a bounded sequence \(\{x^{k}\}\) by \( x^{k+1}\in \mathcal A(x^{k})\), \(k=0,1,\ldots \) Assume that the following conditions are satisfied.
(B1) Sufficient decrease property: There exists some \(\rho _1>0\) such that
(B2) Boundedness of subgradient: There exists some \(\rho _2>0\) such that
(B3) KL property: \(\varPhi \) is a KL function.
(B4) A continuity condition: If a subsequence \(\{x^{k_n}\}\) converges to \(\bar{x}\) then \(\varPhi (x^{k_n})\rightarrow \varPhi (\bar{x})\) as \(n\rightarrow \infty \).
Then \( \sum _{k=1}^\infty \Vert x^{k}-x^{k+1}\Vert <\infty \) and \(\{x^{k}\}\) converges to a critical point of \(\varPhi \).
We will prove Theorem 1 (iii) by verifying all the conditions in Theorem 3 for \(\varPhi \). Let us first recall the following three-point property.
Proposition 11
(Property 1 of [48]) Let \( z^+= \arg \min _{u \in \mathcal X_i} \phi (u)+\mathcal B_{\kappa }(u,z)\), where \(\phi \) is a proper convex function, \(\mathcal X_i\) is convex and \(\mathcal B_{\kappa }\) is the Bregman distance with respect to \(\kappa _i\). Then for all \(u\in \mathcal X_i\) we have
Denote \(\varPhi _i^k(x_i)=f_i^k(x_i)+I_{\mathcal X_i}(x_i)\). We have
where the first inequality uses Assumption 1, the second uses Proposition 11 applied to (13), and the third uses the convexity of \(f_i\). Note that \(I_{\mathcal X_i}(x)=0\) if \(x\in \mathcal {X}_i\) and \(I_{\mathcal X_i}(x)=+\infty \) otherwise. Hence, for all \(x_i\),
Choosing \(x_i=x_i^k\) in (26), we have \( \varPhi ^k_i\big (x_i^{k+1}\big ) \le \varPhi ^k_i(x_i^k) - \underline{L}_i \mathcal B_{\kappa _i} \big (x_i^k,x_i^{k+1} \big ). \) Therefore,
From (27), we see that \(\varPhi \big (x^{k}\big )\) is non-increasing. Theorem 1(i) is proved.
As \(\{x^k\}\) is assumed to be bounded, there exist positive constants \(A_i\) such that \(\Vert x_i^k \Vert \le A_i\), \(i=1,\ldots ,s\), for all k. Suppose a subsequence \(x^{k_n} \rightarrow x^*\). As \( \mathcal X\) is a closed convex set and \(x^{k_n} \in \mathcal X\), we have \(x^*\in \mathcal X\). Hence, \(\varPhi (x^{k_n}) \rightarrow \varPhi (x^*)\), i.e., the continuity condition (B4) is satisfied.
Since \(\kappa _i\) is strongly convex on the sets \(\{x_i \in \mathcal X_i: \Vert x_i \Vert \le A_i \}\), we have \(\mathcal {B}_{\kappa _i}\big (x_i^k,x_i^{k+1} \big )\ge \sigma _i/2\,\Vert x_i^k-x_i^{k+1}\Vert ^2 \) for some constant \(\sigma _i>0\). Hence, from (27) we derive the sufficient decrease property (B1) and \(\sum _{k=1}^\infty \Vert x^k-x^{k+1}\Vert ^2 <+\infty \). Consequently, \(\Vert x^k-x^{k+1}\Vert \rightarrow 0\), and thus \(x^{k_n+1}\rightarrow x^*\). Letting \(k=k_n\rightarrow \infty \) in (26) and noting that \(x_i^*\in \mathcal X_i \subset \mathrm{int\, dom}\, \kappa _i\), we obtain \(\varPhi (x^*)\le \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\) for all \(x_i\). This means that \(x_i^*\) is a minimizer of \(x_i\mapsto \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\). Hence \(0\in \partial _i \varPhi (x^*)\) for \(i=1,\ldots ,s\); in other words, \(0\in \partial \varPhi (x^*)\), i.e., Theorem 1 (ii) is proved.
To prove Theorem 1 (iii), it remains to verify the boundedness of subgradients. From [2, Proposition 2.1] we have
It follows from (13) that \(0\in \nabla f_i^k(x_i^k) + \partial I_{\mathcal X_i} (x_i^{k+1}) + L_i^{k+1} \big (\nabla \kappa _i (x_i^{k+1})-\nabla \kappa _i (x_i^{k})\big )\). Hence,
As \(\nabla _i f\) and \(\nabla \kappa _i\) are Lipschitz continuous on \(\{x_i: \Vert x_i \Vert \le A_i \}\), we derive from (28) that there exists a constant \(a_i\) such that \(\Vert \omega _i^{k+1}\Vert \le a_i \Vert x^{k}-x^{k+1}\Vert \), for \(i=1,\ldots ,s\). Hence the boundedness of subgradients in (B2) is satisfied. Theorem 1 is proved. \(\square \)
A.4 Proof of Proposition 9
Proposition 9: The objective function for the perturbed KL-NMF problem (1) is non-increasing under the updates of Algorithm 2.
Proof
We note that if g(x) is a self-concordant function with constant \(\mathbf{c}\), then \(\mathbf{c}^2\, g(x)\) is a standard self-concordant function. Hence, using the result of [47, Theorem 6] (see also Sect. 2.4), we derive that the objective function of (1) is strictly decreasing when a damped Newton step is used. We now prove the proposition for the case when a full Newton step is used, that is, when the gradient \(f'_{W_{ik}}\le 0\) or \(\lambda \le 0.683802\); the update of \(W_{ik}\) is then \( W_{ik}^+= s = W_{ik} - \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \).
Consider the case \(f'_{W_{ik}}\le 0\). Let us use \(f_{W_{ik}}\) to denote the objective of (1) viewed as a function of \(W_{ik}\). Considering the function \(W_{ik}\mapsto g(W_{ik})=f'_{W_{ik}}\) for \(W_{ik}\ge 0\), we see that \(g\) is concave. Hence, we have \(g(W_{ik}^+)-g(W_{ik}) \le g'(W_{ik}) (W_{ik}^+-W_{ik}) \), which implies that \( f'_{W_{ik}^+} \le f'_{W_{ik}} - f''_{W_{ik}} \frac{f'_{W_{ik}}}{f''_{W_{ik}}}=0. \) Since \(W_{ik} -W_{ik}^+ = \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \le 0\) and \(W_{ik} \mapsto f_{W_{ik}}\) is convex, we obtain \(f_{W_{ik}} \ge f_{W_{ik}^+} + f'_{W_{ik}^+}(W_{ik}-W_{ik}^+) \ge f_{W_{ik}^+}. \) Hence, the objective of (1) is non-increasing when we update \(W_{ik}\) to \(W_{ik}^+\).
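For concreteness, the derivatives used in this argument can be written out explicitly; the following display is a sketch based on the standard formulas for the KL objective, assuming \(f_{W_{ik}}\) denotes \(W_{ik} \mapsto D(V|WH)\) with all other entries fixed:
\[
f'_{W_{ik}} = \sum _{j=1}^n H_{kj}\Big (1-\frac{V_{ij}}{(WH)_{ij}}\Big ), \qquad f''_{W_{ik}} = \sum _{j=1}^n \frac{V_{ij}\, H_{kj}^2}{(WH)_{ij}^2}.
\]
In particular, \(\frac{\mathrm{d}^2}{\mathrm{d}W_{ik}^2} f'_{W_{ik}} = -2\sum _{j=1}^n \frac{V_{ij} H_{kj}^3}{(WH)_{ij}^3} \le 0\), which confirms the concavity of \(g\) used above.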
Consider the case \(\lambda \le 0.683802\). Denote \(W_{ik}^\alpha =W_{ik} + \alpha d\). When \(\alpha =1\) we have \(W_{ik}^\alpha =s\), that is, a full proximal Newton step is applied. It follows from [47, Inequality (58)] that \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - \alpha \lambda ^2 + \omega _*(\alpha \lambda ) \) for \(\alpha \in (0,1]\), where \(\omega _*(t)=- t - \log (1-t)\). Hence, when \(\alpha =1\), we get \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - (\lambda ^2 + \lambda + \log (1-\lambda )).\) The quantity \(\lambda ^2 + \lambda + \log (1-\lambda )\) is positive when \(\lambda \le 0.683802\) (a short verification is given below). Hence, in this case, a full proximal Newton step also decreases the objective function.\(\square \)
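To verify the sign claim used in the last step, let \(\psi (\lambda )=\lambda ^2+\lambda +\log (1-\lambda )\) on \((0,1)\); this short calculation is included for completeness. We have
\[
\psi (0)=0, \qquad \psi '(\lambda )=2\lambda +1-\frac{1}{1-\lambda }=\frac{\lambda (1-2\lambda )}{1-\lambda },
\]
so \(\psi \) increases on \((0,1/2)\) and decreases on \((1/2,1)\). Since \(\psi (1/2)=3/4+\log (1/2)>0\) and \(\psi (\lambda )\rightarrow -\infty \) as \(\lambda \rightarrow 1^-\), the function \(\psi \) has a unique root \(\lambda ^*\approx 0.683802\) in \((1/2,1)\) and is positive on \((0,\lambda ^*)\), which justifies the threshold used in the proof.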
Cite this article
Hien, L.T.K., Gillis, N. Algorithms for Nonnegative Matrix Factorization with the Kullback–Leibler Divergence. J Sci Comput 87, 93 (2021). https://doi.org/10.1007/s10915-021-01504-0