Algorithms for Nonnegative Matrix Factorization with the Kullback–Leibler Divergence

Abstract

Nonnegative matrix factorization (NMF) is a standard linear dimensionality reduction technique for nonnegative data sets. In order to measure the discrepancy between the input data and the low-rank approximation, the Kullback–Leibler (KL) divergence is one of the most widely used objective functions for NMF. It corresponds to the maximum likelihood estimator when the observed data follow a Poisson distribution, and KL NMF is particularly meaningful for count data sets, such as documents. In this paper, we first collect important properties of the KL objective function that are essential to study the convergence of KL NMF algorithms. Second, together with reviewing existing algorithms for solving KL NMF, we propose three new algorithms that guarantee the non-increasingness of the objective function. We also provide a global convergence guarantee for one of our proposed algorithms. Finally, we conduct extensive numerical experiments to provide a comprehensive picture of the performance of the KL NMF algorithms.

Notes

  1. http://www.cs.utexas.edu/~cjhsieh/nmf.

  2. http://statweb.stanford.edu/~dlsun/admm.html.

  3. https://github.com/felipeyanez/nmf.

References

  1. Ang, A.M.S., Gillis, N.: Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Comput. 31(2), 417–439 (2019)

  2. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

  3. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017). https://doi.org/10.1287/moor.2016.0817

  4. Bioucas-Dias, J.M., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., Chanussot, J.: Hyperspectral unmixing overview: geometrical, statistical, and sparse regression-based approaches. IEEE J. Select. Top. Appl. Earth Obs. Remote Sens. 5(2), 354–379 (2012)

  5. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1), 459–494 (2014)

  6. Carbonetto, P., Luo, K., Dey, K., Hsiao, J., Stephens, M.: fastTopics: fast algorithms for fitting topic models and non-negative matrix factorizations to count data (2021). R package version 0.4-11. https://github.com/stephenslab/fastTopics

  7. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  8. Chi, E.C., Kolda, T.G.: On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl. 33(4), 1272–1299 (2012)

  9. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley (2009)

  10. Dey, K.K., Hsiao, C.J., Stephens, M.: Visualizing the structure of RNA-SEQ expression data using grade of membership models. PLoS Genet. 13(3), e1006599 (2017)

  11. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput. Stat. Data Anal. 52(8), 3913–3927 (2008)

  12. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)

  13. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)

  14. Févotte, C., Cemgil, A.T.: Nonnegative matrix factorizations as probabilistic inference in composite models. In: 2009 17th European Signal Processing Conference, pp. 1913–1917 (2009)

  15. Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural Comput. 23(9), 2421–2456 (2011). https://doi.org/10.1162/NECO_a_00168

  16. Finesso, L., Spreij, P.: Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra Appl. 416(2), 270–287 (2006). https://doi.org/10.1016/j.laa.2005.11.012

  17. Fu, X., Huang, K., Sidiropoulos, N.D., Ma, W.K.: Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. IEEE Signal Process. Mag. 36(2), 59–80 (2019)

  18. Gillis, N.: The why and how of nonnegative matrix factorization. In: Suykens, J., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, Machine Learning and Pattern Recognition chap 12, pp. 257–291. Chapman & Hall/CRC, Boca Raton, Florida (2014)

  19. Gillis, N.: Introduction to nonnegative matrix factorization. SIAG/OPT Views News 25(1), 7–16 (2017)

  20. Gillis, N., Hien, L.T.K., Leplat, V., Tan, V.Y.: Distributionally robust and multi-objective nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell. (2021). https://doi.org/10.1109/TPAMI.2021.3058693

  21. Gonzalez, E.F., Zhang, Y.: Accelerating the Lee-Seung algorithm for nonnegative matrix factorization. Technical report, Rice University (2005). https://scholarship.rice.edu/handle/1911/102034

  22. Guillamet, D., Vitrià, J.: Non-negative matrix factorization for face recognition. In: Escrig, M.T., Toledo, F., Golobardes, E. (eds.) Topics in Artificial Intelligence, pp. 336–344. Springer, Berlin (2002)

  23. Hasinoff, S.W.: Photon, Poisson Noise (2014). http://people.csail.mit.edu/hasinoff/pubs/hasinoff-photon-2011-preprint.pdf

  24. Hien, L.T.K., Gillis, N., Patrinos, P.: Inertial block proximal method for non-convex non-smooth optimization. In: Thirty-seventh International Conference on Machine Learning (ICML 2020) (2020)

  25. Ho, N.D., Dooren, P.V.: Non-negative matrix factorization with fixed row and column sums. Linear Algebra Appl. 429(5), 1020–1025 (2008). Special issue devoted to selected papers presented at the 13th Conference of the International Linear Algebra Society

  26. Hong, M., Razaviyayn, M., Luo, Z.Q., Pang, J.S.: A unified algorithmic framework for block-structured optimization involving big data: with applications in machine learning and signal processing. IEEE Signal Process. Mag. 33(1), 57–77 (2015)

  27. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)

  28. Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072 (2011)

  29. Huang, K., Sidiropoulos, N.D., Liavas, A.P.: A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process. 64(19), 5052–5065 (2016)

  30. Kim, C., Kim, Y., Klabjan, D.: Scale invariant power iteration (2019). arXiv:1905.09882

  31. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)

  32. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001)

  33. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 2nd edn. Cambridge University Press (2014). http://mmds.org

  34. Li, L., Lebanon, G., Park, H.: Fast Bregman divergence NMF using Taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-12), pp. 307–315. Association for Computing Machinery, New York, NY, USA (2012)

  35. Lin, C.J.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 18(6), 1589–1596 (2007)

  36. Lin, X., Boutros, P.C.: Optimization and expansion of non-negative matrix factorization. BMC Bioinform. 21(1), 1–10 (2020)

  37. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28, 333–354 (2018)

  38. Lucy, L.B.: An iterative technique for the rectification of observed distributions. Astron. J. 79, 745 (1974)

  39. Ma, W., Bioucas-Dias, J.M., Chan, T., Gillis, N., Gader, P., Plaza, A.J., Ambikapathi, A., Chi, C.: A signal processing perspective on hyperspectral unmixing: insights from remote sensing. IEEE Signal Process. Mag. 31(1), 67–81 (2014)

  40. Nesterov, Y.: Lectures on Convex Optimization. Springer (2018)

  41. Razaviyayn, M., Hong, M., Luo, Z.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)

  42. Richardson, W.H.: Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62(1), 55–59 (1972)

  43. Shahnaz, F., Berry, M.W., Pauca, V., Plemmons, R.J.: Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 42(2), 373–386 (2006). http://www.sciencedirect.com/science/article/pii/S0306457304001542

  44. Sun, D.L., Févotte, C.: Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205 (2014)

  45. Takahashi, N., Hibi, R.: Global convergence of modified multiplicative updates for nonnegative matrix factorization. Comput. Optim. Appl. 57(2), 417–440 (2014)

  46. Teboulle, M., Vaisbourd, Y.: Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci. 13(1), 381–421 (2020)

  47. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 16(12), 371–416 (2015)

  48. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Tech. rep. (2008)

  49. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013). https://doi.org/10.1137/120887795

  50. Yanez, F., Bach, F.: Primal-dual algorithms for non-negative matrix factorization with the Kullback-Leibler divergence. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2257–2261 (2017)

  51. Zhong, S., Ghosh, J.: Generative model-based document clustering: a comparative study. Knowl. Inf. Syst. 8(3), 374–384 (2005)

Acknowledgements

We thank the anonymous reviewers for their insightful comments that helped us improve the paper. We also thank Peter Carbonetto for helpful feedback and useful references.

Funding

This work was supported by the European Research Council (ERC starting grant n\(^\text {o}\) 679515), and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47.

Author information

Corresponding author

Correspondence to Nicolas Gillis.

Ethics declarations

Conflict of interest

Not applicable.

Availability of data and material

The data sets that support the findings of this study are freely available online, and available from the corresponding author, Nicolas Gillis, upon reasonable request.

Code availability

The codes used to perform the experiments in this paper are available from https://github.com/LeThiKhanhHien/KLNMF.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Technical Proofs

A.1 Proof of Proposition 1

Proposition 1: (A) Given \(\varepsilon \ge 0\) and a nonnegative matrix V, Problem (2) attains its minimum, that is, it has at least one optimal solution.

(B) Let \( \nu =\sum _{i=1}^m\sum _{j=1}^n V_{ij}\) and let \(D^*(V,\varepsilon )\) denote the optimal value of Problem (2). We have \(D^*(V,\varepsilon ) \le D^*(V,0) + (\min \{ n+ mr, m+nr\} \sqrt{\nu }+mn\varepsilon )\varepsilon . \)

Proof

Problem (2) can be rewritten as follows

$$\begin{aligned} \min _{W,H}&\quad D\big (V|(W+\varepsilon ee^\top )(H+\varepsilon ee^\top ) \big )\nonumber \\ \text { such that }&W_{ik}\ge 0, H_{kj} \ge 0 \text { for } i \in [m], k \in [r],j \in [n]. \end{aligned}$$
(19)

We note that

$$\begin{aligned} (W+\varepsilon ee^\top )(H+\varepsilon ee^\top )=WH+ \varepsilon (W ee^\top + ee^\top H)+\varepsilon ^2 ee^\top . \end{aligned}$$
(20)

Let us now prove parts (A) and (B) of Proposition 1 separately.

(A) We note that

$$\begin{aligned} \frac{1}{\nu }D\big (V|(W+\varepsilon ee^\top )(H+\varepsilon ee^\top )\big )=D\Big (\frac{V}{\nu }\Big |\big (\frac{W}{\sqrt{\nu }}+\frac{\varepsilon }{\sqrt{\nu }} ee^\top \big )\big (\frac{H}{\sqrt{\nu }}+\frac{\varepsilon }{\sqrt{\nu }} ee^\top \big )\Big ). \end{aligned}$$

Then we have

$$\begin{aligned} D^*(V,\varepsilon )=\nu D^*(V/\nu ,\varepsilon /\sqrt{\nu }). \end{aligned}$$
(21)

Let us now consider Problem (19) with V such that \(\sum _{i=1}^m\sum _{j=1}^n V_{ij}=1\). In the following, we separate the proof into two cases: \(\varepsilon =0\), which corresponds to the original KL NMF Problem (1), and \(\varepsilon >0\), for which the objective of Problem (19) is finite at every pair of nonnegative matrices (W, H).

Case 1 \(\varepsilon =0\). We use the methodology in the proof of [16, Proposition 2.1]. We first observe that \(WH=\sum _{i=1}^r W_{:,i} H_{i,:}\); hence, without loss of generality, we can consider the case where H has no zero rows, since otherwise we could consider Problem (1) with a rank smaller than r. Given a feasible solution (W, H) of (1), let \(h_k=\sum _{j=1}^n H_{kj}\) and let \(\mathrm{Diag}(h)\) be the diagonal matrix whose diagonal is h. Then \((\tilde{W},\tilde{H})\), where \(\tilde{W}= W \mathrm{Diag}(h)\) and \(\tilde{H}= (\mathrm{Diag}(h))^{-1} H\), is also a feasible solution of (1) with the same objective value since \(\tilde{W} \tilde{H}=WH\). We observe that \(\sum _{j=1}^n \tilde{H}_{kj}=1\). Hence, we can restrict our search for an optimal solution to the set

$$\begin{aligned} \mathcal A_1=\{(W,H): W,H \ge 0, He = e\}. \end{aligned}$$

We note that \( \lim _{x\rightarrow 0} (x - V_{ij} \log x) = +\infty \) for a given \(V_{ij}>0\), and \(x \mapsto (x - V_{ij} \log x)\) is a decreasing function when \(x<V_{ij}\). Therefore, there exists a positive constant \(\delta \) such that \((WH)_{ij} \ge \delta \) for all (i, j) with \(V_{ij}>0\); otherwise, if for every \(\delta >0\) there exists (i, j) with \(V_{ij}>0\) such that \((WH)_{ij} \le \delta \), then we can choose an arbitrarily small \(\delta <\min \{V_{ij}:V_{ij}>0\}\), which makes \((WH)_{ij} - V_{ij} \log (WH)_{ij}\) arbitrarily large. Therefore, we can further restrict the constraint set of (1) to the set

$$\begin{aligned} \mathcal A_2=\big \{(W,H): (W,H)\in \mathcal A_1, (WH)_{ij}\ge \delta \forall (i,j) \in \{(i,j):V_{ij} >0\} \big \}. \end{aligned}$$

It follows from [32, Theorem 2] (see also Sect. 3.1) that, given a feasible solution (W, H), a multiplicative update step \(W_{ik}^+= W_{ik} \big (\sum _{l=1,V_{il}>0}^n \frac{H_{kl} V_{il}}{(WH)_{il}} \big )\big / \big (\sum _{l=1}^n H_{kl} \big ) \), for \(i\in [m], k\in [r]\), assumed to be well defined, does not increase the objective function. Hence, given \((W,H)\in \mathcal A_2\), we observe that

$$\begin{aligned} \sum _{k=1}^r W_{ik}^+=\sum _{k=1}^r W_{ik} \sum _{l=1}^n \frac{H_{kl} V_{il}}{(WH)_{il}}=\sum _{l=1}^n \sum _{k=1}^r \frac{W_{ik}H_{kl} V_{il}}{(WH)_{il}}=\sum _{l=1}^n V_{il}. \end{aligned}$$

Therefore, we can further restrict our search for the optimal solutions of (1) to

$$\begin{aligned} \mathcal A_3=\big \{(W,H): (W,H)\in \mathcal A_1, W e =V e, (WH)_{ij}\ge \delta \forall (i,j) \in \{(i,j):V_{ij} >0\} \big \}. \end{aligned}$$

By choosing an appropriate \(\delta \), we note that \((W,\frac{1}{n} ee^\top ) \in \mathcal {A}_3\) with \(W_{ik}=\frac{1}{r} \sum _{l} V_{il}\) for \(i\in [m], k\in [r]\); hence, the set \(\mathcal {A}_3\) is non-empty. The set \(\mathcal {A}_3\) is closed and bounded, and D(V|WH) is continuous on \(\mathcal {A}_3\); hence, Problem (1) has at least one solution.

Case 2 \(\varepsilon >0\). When \(\varepsilon >0\), we see that (0, 0) is a feasible solution of (19). Let

$$\begin{aligned} C=f_\varepsilon (0 , 0)= mnr\varepsilon ^2-\log (r\varepsilon ^2) +\sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}-1. \end{aligned}$$

We can then restrict our search for the optimal solutions of Problem (19) to the constraint set \(\mathcal A_4=\{(W,H): W,H \ge 0, f_\varepsilon (W,H)\le C\}. \) Using the inequality \(\log x \le x-1\) for all \(x>0\), we obtain

$$\begin{aligned} \begin{aligned} f_\varepsilon (W,H)&\ge \sum _{i=1}^m\sum _{j=1}^n (1-V_{ij})\big ((WH)_{ij}+\varepsilon (W ee^\top + ee^\top H)_{ij}+\varepsilon ^2 \big )+ \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij} \\&\ge \sum _{i=1}^m\sum _{j=1}^n (1-V_{ij})\big (\varepsilon \sum _{k}W_{ik}+\varepsilon \sum _{k}H_{kj} +\varepsilon ^2\big )+ \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}\\&\ge (mn-1)\varepsilon ^2+ (1-v_{\max })\varepsilon \big (n \sum _{ik}W_{ik} + m \sum _{kj}H_{kj}\big ) + \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}, \end{aligned} \end{aligned}$$

where \(v_{\max }=\max _{ij} V_{ij}\); here we use \((WH)_{ij}\ge 0\) in the second inequality and \(\sum _{ij}V_{ij}=1\) in the third inequality. Hence, from \(f_\varepsilon (W,H)\le C\), W and H are bounded, and therefore \(\mathcal A_4\) is bounded. We also see that \(\mathcal A_4\) is closed; hence, it is a compact set. The objective of (19) is continuous on this set since \(\varepsilon >0\). Therefore, Problem (19) has at least one solution.

(B) Let us fix \(\varepsilon >0\). Note that all points of \(\mathcal {A}_3\) are feasible solutions of Problem (19). We hence have

$$\begin{aligned} \begin{aligned} D^*(V,\varepsilon )&\le \min _{(W,H)\in \mathcal {A}_3} f_\varepsilon (W,H)\\ {}&\le \min _{(W,H)\in \mathcal {A}_3} D(V|WH) + \max _{(W,H)\in \mathcal {A}_3} (f_\varepsilon (W,H)-D(V|WH))\\&= D^*(V,0) + \max _{(W,H)\in \mathcal {A}_3} (f_\varepsilon (W,H)-D(V|WH)), \end{aligned} \end{aligned}$$
(22)

where in the first inequality we use the fact that the optimal value over a larger set does not exceed the optimal value over a smaller set. On the other hand, it follows from (20) and the normalization \(\sum _{i,j} V_{ij}=1\) that, for \((W,H)\in \mathcal A_3\), we have

$$\begin{aligned} \begin{aligned} f_\varepsilon (W,H)-D(V|WH)&= \sum _{i=1}^m\sum _{j=1}^n \big (\varepsilon \sum _{k=1}^r W_{ik} + \varepsilon \sum _{k=1}^r H_{kj} + \varepsilon ^2\big )\\&\qquad \qquad -\sum _{i=1}^m\sum _{j=1}^n V_{ij} \log \frac{((W+\varepsilon ee^\top )(H+\varepsilon ee^\top ))_{ij}}{(WH)_{ij}}\\&\le n \varepsilon + mr \varepsilon + mn \varepsilon ^2. \end{aligned} \end{aligned}$$

Hence \(D^*(V,\varepsilon ) \le D^*(V,0) + n \varepsilon + mr \varepsilon + mn \varepsilon ^2\). By exchanging the roles of W and \(H^\top \) and noting that \(D(V|WH)=D(V^\top |H^\top W^\top )\), we can prove the analogous bound \(D^*(V,\varepsilon ) \le D^*(V,0) + m \varepsilon + nr \varepsilon + mn \varepsilon ^2\). Together with (21), we obtain result (B). \(\square \)
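As a quick numerical sanity check (not part of the proof), the scaling identity behind (21) and the gap bound derived above can be verified on a small random instance. The snippet below is a minimal NumPy sketch; it assumes the generalized KL divergence \(D(V|X)=\sum _{ij}\big (V_{ij}\log (V_{ij}/X_{ij}) - V_{ij} + X_{ij}\big )\) with the convention \(0\log 0=0\), and all variable names are illustrative only.

import numpy as np

def kl_div(V, X):
    # Generalized KL divergence D(V|X) with the convention 0 log 0 = 0.
    mask = V > 0
    return np.sum(np.where(mask, V * np.log(np.where(mask, V, 1.0) / X) - V, 0.0) + X)

rng = np.random.default_rng(0)
m, n, r, eps = 6, 5, 3, 1e-2
V, W, H = rng.random((m, n)), rng.random((m, r)), rng.random((r, n))
E_W, E_H = np.ones((m, r)), np.ones((r, n))   # the all-ones matrices e e^T
nu = V.sum()

# Scaling identity behind (21): D(V|(W+eps E)(H+eps E)) = nu * D(V/nu | scaled factors).
lhs = kl_div(V, (W + eps * E_W) @ (H + eps * E_H))
rhs = nu * kl_div(V / nu, ((W + eps * E_W) / np.sqrt(nu)) @ ((H + eps * E_H) / np.sqrt(nu)))
print(abs(lhs - rhs))   # equal up to rounding error

# Gap bound for a normalized V and a pair (W, H) in A_3 (rows of H sum to 1, W e = V e).
Vn = V / nu
H3 = H / H.sum(axis=1, keepdims=True)
W3 = np.tile(Vn.sum(axis=1, keepdims=True) / r, (1, r))
gap = kl_div(Vn, (W3 + eps * E_W) @ (H3 + eps * E_H)) - kl_div(Vn, W3 @ H3)
print(gap <= n * eps + m * r * eps + m * n * eps**2)   # True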

A.2 BSUM Framework for the MU

The MU is a block successive upper-bound minimization (BSUM) algorithm in which each column of H and each row of W is updated by minimizing a function that majorizes the KL objective. Let us first introduce BSUM, and then derive the MU for solving (2).

BSUM was proposed in [41] to solve the minimization problem (12). Using the notation of (12), let us first formally define what it means for a function to majorize f.

Definition 4

A function \(u_i : \mathcal X_i \times \mathcal X \rightarrow \mathbb R\) is said to majorize f(x) if

$$\begin{aligned} \begin{aligned}&u_i (x_i , x) = f(x), \forall \, x \in \mathcal X, \\&u_i (y_i , x) \ge f(x_1,\ldots ,x_{i-1},y_i,x_{i+1},\ldots ,x_s), \forall \, y_i \in \mathcal {X}_i, x \in \mathcal X. \end{aligned} \end{aligned}$$
(23)

Using these majorizing functions, at iteration k, BSUM fixes the latest values of the blocks \(j\ne i\) and updates block \(x_i\) by

$$\begin{aligned} x_i^{k}=\arg \min _{x_i\in \mathcal X_i} u_i(x_i,x_1^k,\ldots ,x_{i-1}^k,x_i^{k-1}, x_{i+1}^{k-1},\ldots ,x_s^{k-1}). \end{aligned}$$
(24)

From Definition 4 we have

$$\begin{aligned} \begin{aligned} f(x^k)&=u_1(x_1^k,x_1^k,\ldots ,x_s^k) \ge u_1(x_1^{k+1},x_1^k,\ldots ,x_s^k)\\ {}&\ge f(x_1^{k+1},x_2^k,\ldots ,x_s^k)=u_2(x_2^k,x_1^{k+1},x_2^k,\ldots ,x_s^k)\\&\ge u_2(x_2^{k+1},x_1^{k+1},x_2^k,\ldots ,x_s^k) \ge f(x_1^{k+1},x_2^{k+1},\ldots ,x_{s-1}^k,x_s^k) \ge \cdots \ge f(x^{k+1}). \end{aligned} \end{aligned}$$

In other words, BSUM produces a non-increasing sequence \(\{f(x^k)\}\). In the following, we give a majorizing function for \(h\mapsto D(t|Wh)\) defined in (7).

Proposition 10

[31, Lemma 3] Let

$$\begin{aligned} u_\mathrm{MU}(h,h^k):=\sum _{i=1}^m\Big ((Wh)_i -t_i\sum _{l=1}^r \frac{W_{il}(h^k)_l}{(Wh^k)_i}\Big (\log (W_{il}h_l)-\log \frac{W_{il}(h^k)_l}{(Wh^k)_i}\Big )+ t_i \log t_i -t_i \Big ). \end{aligned}$$

Then \(u_\mathrm{MU}(h,h^k)\) majorizes the function \(h\mapsto D(t|Wh)\).

When BSUM uses \(u_\mathrm{MU}(\cdot ,\cdot )\) to update a column h of H (similarly for a row of W) by \(h^{k+1} = \arg \min _{h\ge \varepsilon } u_\mathrm{MU}(h,h^k)\), we obtain \( (h^{k+1})_l=\max \left\{ \frac{(h^k)_l}{\sum _{i=1}^m W_{il}}\sum _{i=1}^m \frac{t_i W_{il}}{(Wh^k)_i}, \varepsilon \right\} , \) leading to the MU. More specifically, the updates of MU for solving Problem (2) are

$$\begin{aligned} W_{ik}^+= \max \Big \{ W_{ik} \frac{\sum _{l=1}^n \frac{H_{kl} V_{il}}{(WH)_{il}} }{\sum _{l=1}^n H_{kl}},\varepsilon \Big \}, H_{kj}^+= \max \Big \{ H_{kj} \frac{\sum _{l=1}^m \frac{W_{lk} V_{lj}}{(WH)_{lj}} }{\sum _{l=1}^m W_{lk}},\varepsilon \Big \}. \end{aligned}$$
(25)

We note that the non-increasing property of \(f(x^k)\) produced by BSUM for Problem (12) does not guarantee the convergence of \( \{x^k\}\). To guarantee some convergence for \( \{x^k\}\), we need \(\varepsilon > 0\) for the objective function to be directionally differentiable, which is not the case when \(\varepsilon = 0\), \((WH)_{ij}=0\) and \(V_{ij}>0\) for some ij.
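To make the updates (25) concrete, the following is a minimal NumPy sketch of the resulting MU iteration with the lower bound \(\varepsilon \). It is illustrative only (the function name, initialization and fixed iteration count are assumptions, not the authors' implementation); the code actually used in the experiments is available from the repository cited above (https://github.com/LeThiKhanhHien/KLNMF).

import numpy as np

def kl_nmf_mu(V, r, eps=1e-16, max_iter=500, seed=0):
    # Multiplicative updates (25) for KL NMF, with the entries of W and H kept >= eps.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = np.maximum(rng.random((m, r)), eps)
    H = np.maximum(rng.random((r, n)), eps)
    for _ in range(max_iter):
        # W_ik <- max{ W_ik * (sum_l H_kl V_il / (WH)_il) / (sum_l H_kl), eps }
        W = np.maximum(W * ((V / (W @ H)) @ H.T) / H.sum(axis=1), eps)
        # H_kj <- max{ H_kj * (sum_l W_lk V_lj / (WH)_lj) / (sum_l W_lk), eps }
        H = np.maximum(H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None], eps)
    return W, H

Updating W and H alternately, each with the other factor fixed at its latest value, is exactly the BSUM scheme described above, so the objective is non-increasing along these iterations.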

A.3 Proof of Theorem 1

Theorem 1: Suppose Assumption 1 is satisfied. Let \(\{x^k\}\) be the sequence generated by Algorithm 1. Let also \(\varPhi (x)= f (x)+ \sum _i I_{\mathcal X_i}(x_i)\), where \(I_{\mathcal X_i}\) is the indicator function of \(\mathcal X_i\). We have

  (i) \(\varPhi \big (x^k\big )\) is non-increasing;

  (ii) Suppose \(\mathcal X_i \subset \mathrm{int\, dom} \, \kappa _i\). If \(\{x^k\}\) is bounded and \(\kappa _i(x_i)\) is strongly convex on bounded subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then every limit point of \(\{x^k\}\) is a critical point of \(\varPhi \);

  (iii) If together with the conditions in (ii) we assume that \(\nabla \kappa _i\) and \(\nabla _i f\) are Lipschitz continuous on bounded convex subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then the whole sequence \(\{x^k\}\) converges to a critical point of \(\varPhi \).

Proof

We follow the methodology established in [5], which is based on the following theorem, to prove the global convergence of BMD.

Theorem 3

Let \(\varPhi : \mathbb {R}^N\rightarrow (-\infty ,+\infty ]\) be a proper and lower semicontinuous function which is bounded from below. Let \(\mathcal {A}\) be a generic algorithm which generates a bounded sequence \(\{x^{k}\}\) by \( x^{k+1}\in \mathcal A(x^{k})\), \(k=0,1,\ldots \) Assume that the following conditions are satisfied.

(B1) Sufficient decrease property: There exists some \(\rho _1>0\) such that

$$\begin{aligned} \rho _1 \Vert x^{k}-x^{k+1}\Vert ^2 \le \varPhi (x^{k}) - \varPhi (x^{k+1}), \forall \,k=0,1,\ldots . \end{aligned}$$

(B2) Boundedness of subgradient: There exists some \(\rho _2>0\) such that

$$\begin{aligned} \Vert w^{k+1}\Vert \le \rho _2 \Vert x^{k}-x^{k+1}\Vert \quad \text {for some } w^{k+1}\in \partial \varPhi (x^{k+1}), \ \forall \, k=0,1,\ldots \end{aligned}$$

(B3) KL property: \(\varPhi \) is a KL function.

(B4) A continuity condition: If a subsequence \(\{x^{k_n}\}\) converges to \(\bar{x}\) then \(\varPhi (x^{k_n})\rightarrow \varPhi (\bar{x})\) as \(n\rightarrow \infty \).

Then \( \sum _{k=1}^\infty \Vert x^{k}-x^{k+1}\Vert <\infty \) and \(\{x^{k}\}\) converges to a critical point of \(\varPhi \).

We will prove Theorem 1 (iii) by verifying all the conditions in Theorem 3 for \(\varPhi \). Let us first recall the following three-point property.

Proposition 11

(Property 1 of [48]) Let \( z^+= \arg \min _{u \in \mathcal X_i} \phi (u)+\mathcal B_{\kappa }(u,z)\), where \(\phi \) is a proper convex function, \(\mathcal X_i\) is convex and \(\mathcal B_{\kappa }\) is the Bregman distance with respect to \(\kappa _i\). Then for all \(u\in \mathcal X_i\) we have

$$\begin{aligned} \phi (u) + \mathcal B_{\kappa }(u,z) \ge \phi (z^+) + \mathcal B_{\kappa }(z^+,z) + \mathcal B_{\kappa }(u,z^+). \end{aligned}$$

Denote \(\varPhi _i^k(x_i)=f_i^k(x_i)+I_{\mathcal X_i}(x_i)\). We have

$$\begin{aligned} \begin{aligned} f^k_i(x_i^{k+1})&\le f_i^k(x_i^{k} ) + \langle \nabla f_i^k(x_i^{k} ),x_i^{k+1}-x_i^{k} \rangle + L_i^{k} \mathcal B_{\kappa _i} (x_i^{k+1},x_i^{k} )\\&\le f_i^k(x_i^{k} ) + \langle \nabla f_i^k(x_i^{k} ),x_i-x_i^{k} \rangle + L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k+1} )\\&\le f_i^k(x_i) + L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k+1} ), \end{aligned} \end{aligned}$$

where the first inequality uses Assumption 1, the second inequality uses Proposition 11 applied to (13), and the third uses the convexity of \(f_i^k\). Note that \(I_{\mathcal X_i}(x_i)=0\) if \(x_i\in \mathcal {X}_i\) and \(I_{\mathcal X_i}(x_i)=+\infty \) otherwise. Hence, for all \(x_i\),

$$\begin{aligned} \varPhi ^k_i(x_i^{k+1}) \le \varPhi (x_1^{k+1},\ldots ,x_{i-1}^{k+1},x_i,x_{i+1}^k,\ldots ,x_s^k)+ L_i^{k}\big ( \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - \mathcal B_{\kappa _i} (x_i,x_i^{k+1})\big ). \end{aligned}$$
(26)

Choosing \(x_i=x_i^k\) in (26), we have \( \varPhi ^k_i\big (x_i^{k+1}\big ) \le \varPhi ^k_i(x_i^k) - \underline{L}_i \mathcal B_{\kappa _i} \big (x_i^k,x_i^{k+1} \big ). \) Therefore,

$$\begin{aligned} \begin{aligned} \varPhi \big (x^{k+1}\big ) - \varPhi \big (x^{k}\big )&= \sum _{i=1}^s \varPhi ^k_i\big (x_i^{k+1}\big ) - \varPhi ^k_i\big (x_i^{k}\big ) \le - \sum _{i=1}^s \underline{L}_i \mathcal B_{\kappa _i} \big (x_i^k,x_i^{k+1} \big ). \end{aligned} \end{aligned}$$
(27)

From (27), we see that \(\varPhi \big (x^{k}\big )\) is non-increasing. Theorem 1(i) is proved.

As \(\{x^k\}\) is assumed to be bounded, there exist positive constants \(A_i\) such that \(\Vert x_i^k \Vert \le A_i\), \(i=1,\ldots ,s\), for all k. Suppose a subsequence \(x^{k_n} \rightarrow x^*\). As \( \mathcal X\) is a closed convex set and \(x^{k_n} \in \mathcal X\), we have \(x^*\in \mathcal X\). Hence, \(\varPhi (x^{k_n}) \rightarrow \varPhi (x^*)\), i.e., the continuity condition (B4) is satisfied.

Since \(\kappa _i\) is strongly convex on the sets \(\{x_i: x_i \in \mathcal X_i, \Vert x_i \Vert \le A_i \}\), we have \(\mathcal {B}_{\kappa _i}\big (x_i^k,x_i^{k+1} \big )\ge \sigma _i/2\Vert x_i^k-x_i^{k+1}\Vert ^2 \) for some constant \(\sigma _i>0\). Hence, from (27) we derive the sufficient decrease property (B1) and \(\sum _{k=1}^\infty \Vert x^k-x^{k+1}\Vert ^2 <+\infty \). Consequently, \(\Vert x^k-x^{k+1}\Vert \rightarrow 0\). Then we also have \(x^{k_n+1}\rightarrow x^*\). Letting \(k=k_n\rightarrow \infty \) in (26) and noting that \(x_i^*\in \mathcal X_i \subset \mathrm{int\, dom}\, \kappa _i\), we obtain \(\varPhi (x^*)\le \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\), \(\forall \,x_i\). This means \(x_i^*\) is a minimizer of \(x_i\mapsto \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\). Hence \(0\in \partial _i \varPhi (x^*)\) for \(i=1,\ldots ,s\). In other words, we have \(0\in \partial \varPhi (x^*)\), i.e., Theorem 1 (ii) is proved.

To prove Theorem 1 (iii), it remains to verify the boundedness of subgradients. From [2, Proposition 2.1] we have

$$\begin{aligned} \partial \varPhi \big (x^{k}\big )=\partial _1 \varPhi \big (x^{k}\big )\times \cdots \times \partial _s \varPhi \big (x^{k}\big )=\{\nabla _1 f\big (x^{k}\big )+ \partial I_{\mathcal X_1}(x_1^k)\} \times \cdots \times \{\nabla _s f\big (x^{k}\big )+ \partial I_{\mathcal X_s}(x_s^k)\}. \end{aligned}$$

It follows from (13) that \(0\in \nabla f_i^k(x_i^k) + \partial I_{\mathcal X_i} (x_i^{k+1}) + L_i^{k} \big (\nabla \kappa _i (x_i^{k+1})-\nabla \kappa _i (x_i^{k})\big )\). Hence,

$$\begin{aligned} \begin{aligned} \omega _i^{k+1}&:=\nabla _i f\big (x^{k+1}\big )-\nabla f_i^k(x_i^k) + L_i^{k} \big (\nabla \kappa _i (x_i^{k})-\nabla \kappa _i (x_i^{k+1})\big )\\&\ \in \nabla _i f\big (x^{k+1}\big )+ \partial I_{\mathcal X_i}(x_i^{k+1}) = \partial _i \varPhi \big (x^{k+1}\big ). \end{aligned} \end{aligned}$$
(28)

As \(\nabla _i f\) and \(\nabla \kappa _i\) are Lipschitz continuous on \(\{x_i: \Vert x_i \Vert \le A_i \}\), we derive from (28) that there exists some constant \(a_i\) such that \(\Vert \omega _i^{k+1}\Vert \le a_i \Vert x^{k}-x^{k+1}\Vert \), for \(i=1,\ldots ,s\). Then the boundedness of subgradients in (B2) is satisfied. Theorem 1 is proved. \(\square \)
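For intuition, the following sketch spells out one block update of the form (13), whose optimality condition is invoked just before (28). It assumes, purely for illustration, the negative-entropy kernel \(\kappa (x)=\sum _j (x_j\log x_j - x_j)\), for which the update has the closed form \(x^{k+1}=x^k\odot \exp (-\nabla f_i^k(x_i^k)/L)\) over the nonnegative orthant; this kernel and the constant L below are assumptions and not necessarily the \(\kappa _i\) and \(L_i^k\) used by Algorithm 1.

import numpy as np

def bregman_block_step(x_k, g, L):
    # x^+ = argmin_{x >= 0} <g, x> + L * B_kappa(x, x^k), with kappa the negative entropy.
    # The first-order condition g + L * (log(x) - log(x_k)) = 0 gives the closed form below.
    return x_k * np.exp(-g / L)

# Toy block subproblem: one column h of H, with block objective h -> D(t | W h).
rng = np.random.default_rng(1)
m, r = 4, 2
W = rng.random((m, r)) + 0.1
t = rng.random(m) + 0.1
grad = lambda h: W.T @ (1.0 - t / (W @ h))   # gradient of h -> D(t|Wh)
h = np.ones(r)
L = 5.0                                      # hypothetical step constant for illustration
for _ in range(100):
    h = bregman_block_step(h, grad(h), L)
print(h)                                     # iterates remain strictly positive by construction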

A.4 Proof of Proposition 9

Proposition 9: The objective function for the perturbed KL-NMF problem (1) is non-increasing under the updates of Algorithm 2.

Proof

We note that if g(x) is a self-concordant function with constant \(\mathbf{c}\), then \(\mathbf{c}^2\, g(x)\) is a standard self-concordant function. Hence, using the result of [47, Theorem 6] (see also Sect. 2.4), we derive that the objective function of (1) is strictly decreasing when a damped Newton step is used. We now prove the proposition for the case when a full Newton step is used, that is, when the gradient \(f'_{W_{ik}}\le 0\) or \(\lambda \le 0.683802\), and the update of \(W_{ik}\) then is \( W_{ik}^+= s = W_{ik} - \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \).

Consider the case \(f'_{W_{ik}}\le 0\). Let us use \(f_{W_{ik}}\) to denote the objective of (1) with respect to \(W_{ik}\). Considering the function \(W_{ik}\mapsto g(W_{ik})=f'_{W_{ik}}\) with \(W_{ik}\ge 0\), we see that \(g(W_{ik})\) is a concave function. Hence, we have \(g(W_{ik}^+)-g(W_{ik}) \le g'(W_{ik}) (W_{ik}^+-W_{ik}). \) This implies that \( f'_{W_{ik}^+} \le f'_{W_{ik}} - f''_{W_{ik}} \frac{f'_{W_{ik}}}{f''_{W_{ik}}}=0. \) Since \(W_{ik} -W_{ik}^+ = \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \le 0\) and \(W_{ik} \mapsto f_{W_{ik}}\) is convex, we then obtain \(f_{W_{ik}} \ge f_{W_{ik}^+} + f'_{W_{ik}^+}(W_{ik}-W_{ik}^+) \ge f_{W_{ik}^+}. \) Hence, the objective of (1) is non-increasing when we update \(W_{ik}\) to \(W_{ik}^+\).

Consider the case \(\lambda \le 0.683802\). Denote \(W_{ik}^\alpha =W_{ik} + \alpha d\). When \(\alpha =1\) we have \(W_{ik}^\alpha =s\), that is, a full proximal Newton step is applied. It follows from [47, Inequality (58)] that \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - \alpha \lambda ^2 + \omega _*(\alpha \lambda ) \) for \(\alpha \in (0,1]\), where \(\omega _*(t)=- t - \log (1-t)\). Hence, when \(\alpha =1\), we get \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - (\lambda ^2 + \lambda + \log (1-\lambda )).\) It is not difficult to see that when \(\lambda \le 0.683802\) the value of \(\lambda ^2 + \lambda + \log (1-\lambda )\) is nonnegative. Hence, in this case, a full proximal Newton step also does not increase the objective function.\(\square \)
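To illustrate the case analysis above, here is a hedged sketch of a single coordinate-wise Newton update on \(W_{ik}\). The gradient and Hessian formulas are those of \(W_{ik}\mapsto D(V|WH)\); the Newton decrement \(\lambda \) is assumed to be supplied as computed by Algorithm 2 (not reproduced here), the damped step size \(1/(1+\lambda )\) is the standard choice for self-concordant minimization [47] and is an assumption rather than a transcription of Algorithm 2, and any safeguard used by Algorithm 2 (such as a projection onto \(W_{ik}\ge \varepsilon \)) is omitted.

import numpy as np

def newton_step_W_ik(V, W, H, i, k, lam):
    # One coordinate update of W_ik for the KL objective D(V|WH).
    WH_i = W[i, :] @ H                                    # i-th row of WH
    g = np.sum(H[k, :] * (1.0 - V[i, :] / WH_i))          # f'_{W_ik}
    h = np.sum(H[k, :] ** 2 * V[i, :] / WH_i ** 2)        # f''_{W_ik}
    if g <= 0 or lam <= 0.683802:
        return W[i, k] - g / h                            # full Newton step (Proposition 9)
    return W[i, k] - g / (h * (1.0 + lam))                # damped Newton step (assumed form)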

Cite this article

Hien, L.T.K., Gillis, N. Algorithms for Nonnegative Matrix Factorization with the Kullback–Leibler Divergence. J Sci Comput 87, 93 (2021). https://doi.org/10.1007/s10915-021-01504-0
