Algorithms for Nonnegative Matrix Factorization with the Kullback–Leibler Divergence

Abstract

Nonnegative matrix factorization (NMF) is a standard linear dimensionality reduction technique for nonnegative data sets. In order to measure the discrepancy between the input data and the low-rank approximation, the Kullback–Leibler (KL) divergence is one of the most widely used objective functions for NMF. It corresponds to the maximum likelihood estimator when the observed data follow a Poisson distribution, and KL NMF is particularly meaningful for count data sets, such as documents. In this paper, we first collect important properties of the KL objective function that are essential to study the convergence of KL NMF algorithms. Second, together with reviewing existing algorithms for solving KL NMF, we propose three new algorithms that guarantee the non-increasingness of the objective function. We also provide a global convergence guarantee for one of our proposed algorithms. Finally, we conduct extensive numerical experiments to provide a comprehensive picture of the performance of the KL NMF algorithms.

Notes

  1. http://www.cs.utexas.edu/~cjhsieh/nmf.

  2. http://statweb.stanford.edu/~dlsun/admm.html.

  3. https://github.com/felipeyanez/nmf.

References

  1. Ang, A.M.S., Gillis, N.: Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Comput. 31(2), 417–439 (2019)

  2. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

  3. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42(2), 330–348 (2017). https://doi.org/10.1287/moor.2016.0817

  4. Bioucas-Dias, J.M., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., Chanussot, J.: Hyperspectral unmixing overview: geometrical, statistical, and sparse regression-based approaches. IEEE J. Select. Top. Appl. Earth Obs. Remote Sens. 5(2), 354–379 (2012)

  5. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1), 459–494 (2014)

  6. Carbonetto, P., Luo, K., Dey, K., Hsiao, J., Stephens, M.: fastTopics: fast algorithms for fitting topic models and non-negative matrix factorizations to count data (2021). R package version 0.4-11. https://github.com/stephenslab/fastTopics

  7. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  8. Chi, E.C., Kolda, T.G.: On tensors, sparsity, and nonnegative factorizations. SIAM J. Matrix Anal. Appl. 33(4), 1272–1299 (2012)

  9. Cichocki, A., Zdunek, R., Phan, A.H., Amari, S.: Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. Wiley (2009)

  10. Dey, K.K., Hsiao, C.J., Stephens, M.: Visualizing the structure of RNA-SEQ expression data using grade of membership models. PLoS Genet. 13(3), e1006599 (2017)

  11. Ding, C., Li, T., Peng, W.: On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Comput. Stat. Data Anal. 52(8), 3913–3927 (2008)

  12. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201–213 (2002)

  13. Févotte, C., Bertin, N., Durrieu, J.L.: Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis. Neural Comput. 21(3), 793–830 (2009)

  14. Févotte, C., Cemgil, A.T.: Nonnegative matrix factorizations as probabilistic inference in composite models. In: 2009 17th European Signal Processing Conference, pp. 1913–1917 (2009)

  15. Févotte, C., Idier, J.: Algorithms for nonnegative matrix factorization with the \(\beta \)-divergence. Neural Comput. 23(9), 2421–2456 (2011). https://doi.org/10.1162/NECO_a_00168

  16. Finesso, L., Spreij, P.: Nonnegative matrix factorization and I-divergence alternating minimization. Linear Algebra Appl. 416(2), 270–287 (2006). https://doi.org/10.1016/j.laa.2005.11.012

  17. Fu, X., Huang, K., Sidiropoulos, N.D., Ma, W.K.: Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications. IEEE Signal Process. Mag. 36(2), 59–80 (2019)

  18. Gillis, N.: The why and how of nonnegative matrix factorization. In: Suykens, J., Signoretto, M., Argyriou, A. (eds.) Regularization, Optimization, Kernels, and Support Vector Machines, Machine Learning and Pattern Recognition chap 12, pp. 257–291. Chapman & Hall/CRC, Boca Raton, Florida (2014)

  19. Gillis, N.: Introduction to nonnegative matrix factorization. SIAG/OPT Views News 25(1), 7–16 (2017)

  20. Gillis, N., Hien, L.T.K., Leplat, V., Tan, V.Y.: Distributionally robust and multi-objective nonnegative matrix factorization. IEEE Trans. Pattern Anal. Mach. Intell. (2021). https://doi.org/10.1109/TPAMI.2021.3058693

  21. Gonzalez, E.F., Zhang, Y.: Accelerating the Lee-Seung algorithm for nonnegative matrix factorization. Technical report, Rice University (2005). https://scholarship.rice.edu/handle/1911/102034

  22. Guillamet, D., Vitrià, J.: Non-negative matrix factorization for face recognition. In: Escrig, M.T., Toledo, F., Golobardes, E. (eds.) Topics in Artificial Intelligence, pp. 336–344. Springer, Berlin (2002)

  23. Hasinoff, S.W.: Photon, Poisson Noise (2014). http://people.csail.mit.edu/hasinoff/pubs/hasinoff-photon-2011-preprint.pdf

  24. Hien, L.T.K., Gillis, N., Patrinos, P.: Inertial block proximal method for non-convex non-smooth optimization. In: Thirty-seventh International Conference on Machine Learning (ICML 2020) (2020)

  25. Ho, N.D., Dooren, P.V.: Non-negative matrix factorization with fixed row and column sums. Linear Algebra Appl. 429(5), 1020–1025 (2008). Special issue devoted to selected papers presented at the 13th Conference of the International Linear Algebra Society

  26. Hong, M., Razaviyayn, M., Luo, Z.Q., Pang, J.S.: A unified algorithmic framework for block-structured optimization involving big data: with applications in machine learning and signal processing. IEEE Signal Process. Mag. 33(1), 57–77 (2015)

  27. Hoyer, P.O.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)

  28. Hsieh, C.J., Dhillon, I.S.: Fast coordinate descent methods with variable selection for non-negative matrix factorization. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1064–1072 (2011)

  29. Huang, K., Sidiropoulos, N.D., Liavas, A.P.: A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Trans. Signal Process. 64(19), 5052–5065 (2016)

  30. Kim, C., Kim, Y., Klabjan, D.: Scale invariant power iteration (2019). arXiv:1905.09882

  31. Lee, D.D., Seung, H.S.: Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788–791 (1999)

  32. Lee, D.D., Seung, H.S.: Algorithms for non-negative matrix factorization. In: Advances in Neural Information Processing Systems, pp. 556–562 (2001)

  33. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 2nd edn. Cambridge University Press (2014). http://mmds.org

  34. Li, L., Lebanon, G., Park, H.: Fast Bregman divergence NMF using Taylor expansion and coordinate descent. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-12), pp. 307–315. Association for Computing Machinery, New York, NY, USA (2012)

  35. Lin, C.J.: On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Trans. Neural Netw. 18(6), 1589–1596 (2007)

  36. Lin, X., Boutros, P.C.: Optimization and expansion of non-negative matrix factorization. BMC Bioinform. 21(1), 1–10 (2020)

  37. Lu, H., Freund, R.M., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28, 333–354 (2018)

  38. Lucy, L.B.: An iterative technique for the rectification of observed distributions. Astron. J. 79, 745 (1974)

  39. Ma, W., Bioucas-Dias, J.M., Chan, T., Gillis, N., Gader, P., Plaza, A.J., Ambikapathi, A., Chi, C.: A signal processing perspective on hyperspectral unmixing: insights from remote sensing. IEEE Signal Process. Mag. 31(1), 67–81 (2014)

  40. Nesterov, Y.: Lectures on Convex Optimization. Springer (2018)

  41. Razaviyayn, M., Hong, M., Luo, Z.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)

  42. Richardson, W.H.: Bayesian-based iterative method of image restoration. J. Opt. Soc. Am. 62(1), 55–59 (1972)

  43. Shahnaz, F., Berry, M.W., Pauca, V., Plemmons, R.J.: Document clustering using nonnegative matrix factorization. Inf. Process. Manag. 42(2), 373–386 (2006). http://www.sciencedirect.com/science/article/pii/S0306457304001542

  44. Sun, D.L., Févotte, C.: Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205 (2014)

  45. Takahashi, N., Hibi, R.: Global convergence of modified multiplicative updates for nonnegative matrix factorization. Comput. Optim. Appl. 57(2), 417–440 (2014)

  46. Teboulle, M., Vaisbourd, Y.: Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM J. Imaging Sci. 13(1), 381–421 (2020)

  47. Tran-Dinh, Q., Kyrillidis, A., Cevher, V.: Composite self-concordant minimization. J. Mach. Learn. Res. 16(12), 371–416 (2015)

  48. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Tech. rep. (2008)

  49. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013). https://doi.org/10.1137/120887795

  50. Yanez, F., Bach, F.: Primal-dual algorithms for non-negative matrix factorization with the Kullback-Leibler divergence. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2257–2261 (2017)

  51. Zhong, S., Ghosh, J.: Generative model-based document clustering: a comparative study. Knowl. Inf. Syst. 8(3), 374–384 (2005)

Acknowledgements

We thank the anonymous reviewers for their insightful comments that helped us improve the paper. We also thank Peter Carbonetto for helpful feedback and useful references.

Funding

This work was supported by the European Research Council (ERC starting grant n\(^\text {o}\) 679515), and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47.

Author information

Corresponding author

Correspondence to Nicolas Gillis.

Ethics declarations

Conflict of interest

Not applicable.

Availability of data and material

The data sets that support the findings of this study are freely available online, and available from the corresponding author, Nicolas Gillis, upon reasonable request.

Code availability

The codes used to perform the experiments in this paper are available from https://github.com/LeThiKhanhHien/KLNMF.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Technical Proofs

A.1 Proof of Proposition 1

Proposition 1: (A) Given \(\varepsilon \ge 0\) and a nonnegative matrix V, Problem (2) attains its minimum, that is, it has at least one optimal solution.

(B) Let \( \nu =\sum _{i=1}^m\sum _{j=1}^n V_{ij}\) and let \(D^*(V,\varepsilon )\) denote the optimal value of Problem (2). We have \(D^*(V,\varepsilon ) \le D^*(V,0) + (\min \{ n+ mr, m+nr\} \sqrt{\nu }+mn\varepsilon )\varepsilon . \)

Proof

Problem (2) can be rewritten as follows

$$\begin{aligned} \min _{W,H}&\quad D\big (V|(W+\varepsilon ee^\top )(H+\varepsilon ee^\top ) \big )\nonumber \\ \text { such that }&W_{ik}\ge 0, H_{kj} \ge 0 \text { for } i \in [m], k \in [r],j \in [n]. \end{aligned}$$
(19)

We note that

$$\begin{aligned} (W+\varepsilon ee^\top )(H+\varepsilon ee^\top )=WH+ \varepsilon (W ee^\top + ee^\top H)+\varepsilon ^2 ee^\top . \end{aligned}$$
(20)

Let us now prove parts (A) and (B) of Proposition 1 separately.

(A) We note that

$$\begin{aligned} \frac{1}{\nu }D\big (V|(W+\varepsilon ee^\top )(H+\varepsilon ee^\top )\big )=D\Big (\frac{V}{\nu }\Big |\big (\frac{W}{\sqrt{\nu }}+\frac{\varepsilon }{\sqrt{\nu }} ee^\top \big )\big (\frac{H}{\sqrt{\nu }}+\frac{\varepsilon }{\sqrt{\nu }} ee^\top \big )\Big ). \end{aligned}$$

Then we have

$$\begin{aligned} D^*(V,\varepsilon )=\nu D^*(V/\nu ,\varepsilon /\sqrt{\nu }). \end{aligned}$$
(21)

Let us now consider Problem (19) with V such that \(\sum _{i=1}^m\sum _{j=1}^n V_{ij}=1\). In the following, we separate the proof into two cases: \(\varepsilon =0\), which corresponds to the original KL NMF Problem (1), and \(\varepsilon >0\), for which the objective of Problem (19) is finite at every pair of nonnegative matrices (W, H).

Case 1 \(\varepsilon =0\). We use the methodology in the proof of [16, Proposition 2.1]. We first observe that \(WH=\sum _{i=1}^r W_{:,i} H_{i,:}\); hence, without loss of generality, we can consider the case where H has no zero rows, since otherwise we could consider Problem (1) with a rank smaller than r. Given a feasible solution (W, H) of (1), let \(h_k=\sum _{j=1}^n H_{kj}\) and let \(\mathrm{Diag}(h)\) be the diagonal matrix whose diagonal is h. Then \((\tilde{W},\tilde{H})\), where \(\tilde{W}= W \mathrm{Diag}(h)\) and \(\tilde{H}= (\mathrm{Diag}(h))^{-1} H\), is also a feasible solution of (1) with the same objective value since \(\tilde{W} \tilde{H}=WH\). We observe that \(\sum _{j=1}^n \tilde{H}_{kj}=1\). Hence, we can restrict our search for an optimal solution to the set

$$\begin{aligned} \mathcal A_1=\{(W,H): W,H \ge 0, He = e\}. \end{aligned}$$

We note that \( \lim _{x\rightarrow 0} (x - V_{ij} \log x) = +\infty \) for a given \(V_{ij}>0\), and \(x \mapsto (x - V_{ij} \log x)\) is a decreasing function when \(x<V_{ij}\). Therefore, there exists a positive constant \(\delta \) such that \((WH)_{ij} \ge \delta \) for all (i, j) with \(V_{ij}>0\); otherwise, if for every \(\delta >0\) there exists (i, j) with \(V_{ij}>0\) such that \((WH)_{ij} \le \delta \), then we can choose an arbitrarily small \(\delta <\min \{V_{ij}:V_{ij}>0\}\), which makes \((WH)_{ij} - V_{ij} \log (WH)_{ij}\) arbitrarily large. Therefore, we can further restrict the constraint set of (1) to the set

$$\begin{aligned} \mathcal A_2=\big \{(W,H): (W,H)\in \mathcal A_1, (WH)_{ij}\ge \delta \forall (i,j) \in \{(i,j):V_{ij} >0\} \big \}. \end{aligned}$$

It follows from [32, Theorem 2] (see also Sect. 3.1) that, given a feasible solution (W, H), a multiplicative update step \(W_{ik}^+= W_{ik} \big (\sum _{l=1,V_{il}>0}^n \frac{H_{kl} V_{il}}{(WH)_{il}} \big )\big / \big (\sum _{l=1}^n H_{kl} \big ) \), for \(i\in [m], k\in [r]\), assumed to be well defined, does not increase the objective function. Hence, given \((W,H)\in \mathcal A_2\), we observe that

$$\begin{aligned} \sum _{k=1}^r W_{ik}^+=\sum _{k=1}^r W_{ik} \sum _{l=1}^n \frac{H_{kl} V_{il}}{(WH)_{il}}=\sum _{l=1}^n \sum _{k=1}^r \frac{W_{ik}H_{kl} V_{il}}{(WH)_{il}}=\sum _{l=1}^n V_{il}. \end{aligned}$$

Therefore, we can further restrict our search for the optimal solutions of (1) to

$$\begin{aligned} \mathcal A_3=\big \{(W,H): (W,H)\in \mathcal A_1, W e =V e, (WH)_{ij}\ge \delta \forall (i,j) \in \{(i,j):V_{ij} >0\} \big \}. \end{aligned}$$

By choosing an appropriate \(\delta \), we note that \((W,\frac{1}{n} ee^\top ) \in \mathcal {A}_3\) with \(W_{ik}=\frac{1}{r} \sum _{l} V_{il}\) for \(i\in [m], k\in [r]\); hence, the set \(\mathcal {A}_3\) is non-empty. The set \(\mathcal {A}_3\) is closed and bounded, and D(V|WH) is continuous on \(\mathcal {A}_3\); hence, Problem (1) has at least one solution.

Case 2 \(\varepsilon >0\). When \(\varepsilon >0\), we see that (0, 0) is a feasible solution of (19). Let

$$\begin{aligned} C=f_\varepsilon (0 , 0)= mnr\varepsilon ^2-\log (r\varepsilon ^2) +\sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}-1. \end{aligned}$$

We can then restrict our search for the optimal solutions of Problem (19) to the constraint set \(\mathcal A_4=\{(W,H): W,H \ge 0, f_\varepsilon (W,H)\le C\}. \) Using the inequality \(\log x \le x-1\) for all \(x>0\), we obtain

$$\begin{aligned} \begin{aligned} f_\varepsilon (W,H)&\ge \sum _{i=1}^m\sum _{j=1}^n (1-V_{ij})\big ((WH)_{ij}+\varepsilon (W ee^\top + ee^\top H)_{ij}+\varepsilon ^2 \big )+ \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij} \\&\ge \sum _{i=1}^m\sum _{j=1}^n (1-V_{ij})\big (\varepsilon \sum _{k}W_{ik}+\varepsilon \sum _{k}H_{kj} +\varepsilon ^2\big )+ \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}\\&\ge (mn-1)\varepsilon ^2+ (1-v_{\max })\varepsilon \big (n \sum _{ik}W_{ik} + m \sum _{kj}H_{kj}\big ) + \sum _{i=1}^m\sum _{j=1}^n V_{ij}\log V_{ij}, \end{aligned} \end{aligned}$$

where \(v_{\max }=\max _{ij} V_{ij}\); here we use \((WH)_{ij}\ge 0\) in the second inequality and \(\sum _{ij}V_{ij}=1\) in the third inequality. Hence, from \(f_\varepsilon (W,H)\le C\), W and H are bounded, and therefore \(\mathcal A_4\) is bounded. We also see that \(\mathcal A_4\) is closed; hence, it is a compact set. The objective of (19) is continuous on this set since \(\varepsilon >0\). Therefore, Problem (19) has at least one solution.

(B) Let us fix \(\varepsilon >0\). Note that all points of \(\mathcal {A}_3\) are feasible solutions of Problem (19). We hence have

$$\begin{aligned} \begin{aligned} D^*(V,\varepsilon )&\le \min _{(W,H)\in \mathcal {A}_3} f_\varepsilon (W,H)\\ {}&\le \min _{(W,H)\in \mathcal {A}_3} D(V|WH) + \max _{(W,H)\in \mathcal {A}_3} (f_\varepsilon (W,H)-D(V|WH))\\&= D^*(V,0) + \max _{(W,H)\in \mathcal {A}_3} (f_\varepsilon (W,H)-D(V|WH)), \end{aligned} \end{aligned}$$
(22)

where in the first inequality we use the fact that the optimal value over a larger set does not exceed the optimal value over a smaller set. On the other hand, it follows from (20) and the normalization \(\sum _{i,j} V_{ij}=1\) that, for \((W,H)\in \mathcal A_3\), we have

$$\begin{aligned} \begin{aligned} f_\varepsilon (W,H)-D(V|WH)&= \sum _{i=1}^m\sum _{j=1}^n \big (\varepsilon \sum _{k=1}^r W_{ik} + \varepsilon \sum _{k=1}^r H_{kj} + \varepsilon ^2\big )\\&\qquad \qquad -\sum _{i=1}^m\sum _{j=1}^n V_{ij} \log \frac{((W+\varepsilon ee^\top )(H+\varepsilon ee^\top ))_{ij}}{(WH)_{ij}}\\&\le n \varepsilon + mr \varepsilon + mn \varepsilon ^2. \end{aligned} \end{aligned}$$

Hence \(D^*(V,\varepsilon ) \le D^*(V,0) + n \varepsilon + mr \varepsilon + mn \varepsilon ^2\). By exchanging the roles of W and \(H^\top \) and noting that \(D(V|WH)=D(V^\top |H^\top W^\top )\), we can prove the analogous bound \(D^*(V,\varepsilon ) \le D^*(V,0) + m \varepsilon + nr \varepsilon + mn \varepsilon ^2\). Together with (21), we obtain result (B). \(\square \)
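As a quick numerical sanity check (not part of the proof), the scaling identity behind (21) and the gap bound derived above can be verified on a small random instance. The snippet below is a minimal NumPy sketch; it assumes the generalized KL divergence \(D(V|X)=\sum _{ij}\big (V_{ij}\log (V_{ij}/X_{ij}) - V_{ij} + X_{ij}\big )\) with the convention \(0\log 0=0\), and all variable names are illustrative only.

import numpy as np

def kl_div(V, X):
    # Generalized KL divergence D(V|X) with the convention 0 log 0 = 0.
    mask = V > 0
    return np.sum(np.where(mask, V * np.log(np.where(mask, V, 1.0) / X) - V, 0.0) + X)

rng = np.random.default_rng(0)
m, n, r, eps = 6, 5, 3, 1e-2
V, W, H = rng.random((m, n)), rng.random((m, r)), rng.random((r, n))
E_W, E_H = np.ones((m, r)), np.ones((r, n))   # the all-ones matrices e e^T
nu = V.sum()

# Scaling identity behind (21): D(V|(W+eps E)(H+eps E)) = nu * D(V/nu | scaled factors).
lhs = kl_div(V, (W + eps * E_W) @ (H + eps * E_H))
rhs = nu * kl_div(V / nu, ((W + eps * E_W) / np.sqrt(nu)) @ ((H + eps * E_H) / np.sqrt(nu)))
print(abs(lhs - rhs))   # equal up to rounding error

# Gap bound for a normalized V and a pair (W, H) in A_3 (rows of H sum to 1, W e = V e).
Vn = V / nu
H3 = H / H.sum(axis=1, keepdims=True)
W3 = np.tile(Vn.sum(axis=1, keepdims=True) / r, (1, r))
gap = kl_div(Vn, (W3 + eps * E_W) @ (H3 + eps * E_H)) - kl_div(Vn, W3 @ H3)
print(gap <= n * eps + m * r * eps + m * n * eps**2)   # True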

A.2 BSUM Framework for the MU

The MU is a block successive upper-bound minimization (BSUM) algorithm in which each column of H and each row of W is updated by minimizing a function that majorizes the KL objective. Let us first introduce BSUM, and then derive the MU for solving (2).

BSUM was proposed in [41] to solve the minimization problem (12). Using the notation of (12), let us first formally define what it means for a function to majorize f.

Definition 4

A function \(u_i : \mathcal X_i \times \mathcal X \rightarrow \mathbb R\) is said to majorize f(x) if

$$\begin{aligned} \begin{aligned}&u_i (x_i , x) = f(x), \forall \, x \in \mathcal X, \\&u_i (y_i , x) \ge f(x_1,\ldots ,x_{i-1},y_i,x_{i+1},\ldots ,x_s), \forall \, y_i \in \mathcal {X}_i, x \in \mathcal X. \end{aligned} \end{aligned}$$
(23)

Using these majorizing functions, at iteration k, BSUM fixes the latest values of the blocks \(j\ne i\) and updates block \(x_i\) by

$$\begin{aligned} x_i^{k}=\arg \min _{x_i\in \mathcal X_i} u_i(x_i,x_1^k,\ldots ,x_{i-1}^k,x_i^{k-1}, x_{i+1}^{k-1},\ldots ,x_s^{k-1}). \end{aligned}$$
(24)

From Definition 4 we have

$$\begin{aligned} \begin{aligned} f(x^k)&=u_1(x_1^k,x_1^k,\ldots ,x_s^k) \ge u_1(x_1^{k+1},x_1^k,\ldots ,x_s^k)\\ {}&\ge f(x_1^{k+1},x_2^k,\ldots ,x_s^k)=u_2(x_2^k,x_1^{k+1},x_2^k,\ldots ,x_s^k)\\&\ge u_2(x_2^{k+1},x_1^{k+1},x_2^k,\ldots ,x_s^k) \ge f(x_1^{k+1},x_2^{k+1},\ldots ,x_{s-1}^k,x_s^k) \ge \cdots \ge f(x^{k+1}). \end{aligned} \end{aligned}$$

In other words, BSUM produces a non-increasing sequence \(\{f(x^k)\}\). In the following, we give a majorizing function for \(h\mapsto D(t|Wh)\) defined in (7).

Proposition 10

[31, Lemma 3] Let

$$\begin{aligned} u_\mathrm{MU}(h,h^k):=\sum _{i=1}^m\Big ((Wh)_i -t_i\sum _{l=1}^r \frac{W_{il}(h^k)_l}{(Wh^k)_i}\Big (\log (W_{il}h_l)-\log \frac{W_{il}(h^k)_l}{(Wh^k)_i}\Big )+ t_i \log t_i -t_i \Big ). \end{aligned}$$

Then \(u_\mathrm{MU}(h,h^k)\) majorizes the function \(h\mapsto D(t|Wh)\).

When BSUM uses \(u_\mathrm{MU}(\cdot ,\cdot )\) to update a column h of H (similarly for a row of W) by \(h^{k+1} = \arg \min _{h\ge \varepsilon } u_\mathrm{MU}(h,h^k)\), we obtain \( (h^{k+1})_l=\max \left\{ \frac{(h^k)_l}{\sum _{i=1}^m W_{il}}\sum _{i=1}^m \frac{t_i W_{il}}{(Wh^k)_i}, \varepsilon \right\} , \) leading to the MU. More specifically, the updates of MU for solving Problem (2) are

$$\begin{aligned} W_{ik}^+= \max \Big \{ W_{ik} \frac{\sum _{l=1}^n \frac{H_{kl} V_{il}}{(WH)_{il}} }{\sum _{l=1}^n H_{kl}},\varepsilon \Big \}, H_{kj}^+= \max \Big \{ H_{kj} \frac{\sum _{l=1}^m \frac{W_{lk} V_{lj}}{(WH)_{lj}} }{\sum _{l=1}^m W_{lk}},\varepsilon \Big \}. \end{aligned}$$
(25)

We note that the non-increasing property of \(f(x^k)\) produced by BSUM for Problem (12) does not guarantee the convergence of \( \{x^k\}\). To guarantee some convergence for \( \{x^k\}\), we need \(\varepsilon > 0\) for the objective function to be directionally differentiable, which is not the case when \(\varepsilon = 0\), \((WH)_{ij}=0\) and \(V_{ij}>0\) for some ij.
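To make the updates (25) concrete, the following is a minimal NumPy sketch of the resulting MU iteration with the lower bound \(\varepsilon \). It is illustrative only (the function name, initialization and fixed iteration count are assumptions, not the authors' implementation); the code actually used in the experiments is available from the repository cited above (https://github.com/LeThiKhanhHien/KLNMF).

import numpy as np

def kl_nmf_mu(V, r, eps=1e-16, max_iter=500, seed=0):
    # Multiplicative updates (25) for KL NMF, with the entries of W and H kept >= eps.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = np.maximum(rng.random((m, r)), eps)
    H = np.maximum(rng.random((r, n)), eps)
    for _ in range(max_iter):
        # W_ik <- max{ W_ik * (sum_l H_kl V_il / (WH)_il) / (sum_l H_kl), eps }
        W = np.maximum(W * ((V / (W @ H)) @ H.T) / H.sum(axis=1), eps)
        # H_kj <- max{ H_kj * (sum_l W_lk V_lj / (WH)_lj) / (sum_l W_lk), eps }
        H = np.maximum(H * (W.T @ (V / (W @ H))) / W.sum(axis=0)[:, None], eps)
    return W, H

Updating W and H alternately, each with the other factor fixed at its latest value, is exactly the BSUM scheme described above, so the objective is non-increasing along these iterations.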

A.3 Proof of Theorem 1

Theorem 1: Suppose Assumption 1 is satisfied. Let \(\{x^k\}\) be the sequence generated by Algorithm 1. Let also \(\varPhi (x)= f (x)+ \sum _i I_{\mathcal X_i}(x_i)\), where \(I_{\mathcal X_i}\) is the indicator function of \(\mathcal X_i\). We have

  (i) \(\varPhi \big (x^k\big )\) is non-increasing;

  (ii) Suppose \(\mathcal X_i \subset \mathrm{int\, dom} \, \kappa _i\). If \(\{x^k\}\) is bounded and \(\kappa _i(x_i)\) is strongly convex on bounded subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then every limit point of \(\{x^k\}\) is a critical point of \(\varPhi \);

  (iii) If together with the conditions in (ii) we assume that \(\nabla \kappa _i\) and \(\nabla _i f\) are Lipschitz continuous on bounded convex subsets of \(\mathcal X_i\) that contain \(\{x_i^k\}\), then the whole sequence \(\{x^k\}\) converges to a critical point of \(\varPhi \).

Proof

We follow the methodology established in [5], which is based on the following theorem, to prove the global convergence of BMD.

Theorem 3

Let \(\varPhi : \mathbb {R}^N\rightarrow (-\infty ,+\infty ]\) be a proper and lower semicontinuous function which is bounded from below. Let \(\mathcal {A}\) be a generic algorithm which generates a bounded sequence \(\{x^{k}\}\) by \( x^{k+1}\in \mathcal A(x^{k})\), \(k=0,1,\ldots \) Assume that the following conditions are satisfied.

(B1) Sufficient decrease property: There exists some \(\rho _1>0\) such that

$$\begin{aligned} \rho _1 \Vert x^{k}-x^{k+1}\Vert ^2 \le \varPhi (x^{k}) - \varPhi (x^{k+1}), \forall \,k=0,1,\ldots . \end{aligned}$$

(B2) Boundedness of subgradient: There exists some \(\rho _2>0\) such that

$$\begin{aligned} \Vert w^{k+1}\Vert \le \rho _2 \Vert x^{k}-x^{k+1}\Vert \quad \text {for some } w^{k+1}\in \partial \varPhi (x^{k+1}), \ \forall \, k=0,1,\ldots \end{aligned}$$

(B3) KL property: \(\varPhi \) is a KL function.

(B4) A continuity condition: If a subsequence \(\{x^{k_n}\}\) converges to \(\bar{x}\) then \(\varPhi (x^{k_n})\rightarrow \varPhi (\bar{x})\) as \(n\rightarrow \infty \).

Then \( \sum _{k=1}^\infty \Vert x^{k}-x^{k+1}\Vert <\infty \) and \(\{x^{k}\}\) converges to a critical point of \(\varPhi \).

We will prove Theorem 1 (iii) by verifying all the conditions in Theorem 3 for \(\varPhi \). Let us first recall the following three-point property.

Proposition 11

(Property 1 of [48]) Let \( z^+= \arg \min _{u \in \mathcal X_i} \phi (u)+\mathcal B_{\kappa }(u,z)\), where \(\phi \) is a proper convex function, \(\mathcal X_i\) is convex and \(\mathcal B_{\kappa }\) is the Bregman distance with respect to \(\kappa _i\). Then for all \(u\in \mathcal X_i\) we have

$$\begin{aligned} \phi (u) + \mathcal B_{\kappa }(u,z) \ge \phi (z^+) + \mathcal B_{\kappa }(z^+,z) + \mathcal B_{\kappa }(u,z^+). \end{aligned}$$

Denote \(\varPhi _i^k(x_i)=f_i^k(x_i)+I_{\mathcal X_i}(x_i)\). We have

$$\begin{aligned} \begin{aligned} f^k_i(x_i^{k+1})&\le f_i^k(x_i^{k} ) + \langle \nabla f_i^k(x_i^{k} ),x_i^{k+1}-x_i^{k} \rangle + L_i^{k} \mathcal B_{\kappa _i} (x_i^{k+1},x_i^{k} )\\&\le f_i^k(x_i^{k} ) + \langle \nabla f_i^k(x_i^{k} ),x_i-x_i^{k} \rangle + L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k+1} )\\&\le f_i^k(x_i) + L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - L_i^{k} \mathcal B_{\kappa _i} (x_i,x_i^{k+1} ), \end{aligned} \end{aligned}$$

where the first inequality uses Assumption 1, the second inequality uses Proposition 11 applied to (13), and the third uses the convexity of \(f_i^k\). Note that \(I_{\mathcal X_i}(x_i)=0\) if \(x_i\in \mathcal {X}_i\) and \(I_{\mathcal X_i}(x_i)=+\infty \) otherwise. Hence, for all \(x_i\),

$$\begin{aligned} \varPhi ^k_i(x_i^{k+1}) \le \varPhi (x_1^{k+1},\ldots ,x_{i-1}^{k+1},x_i,x_{i+1}^k,\ldots ,x_s^k)+ L_i^{k}\big ( \mathcal B_{\kappa _i} (x_i,x_i^{k} ) - \mathcal B_{\kappa _i} (x_i,x_i^{k+1})\big ). \end{aligned}$$
(26)

Choosing \(x_i=x_i^k\) in (26), we have \( \varPhi ^k_i\big (x_i^{k+1}\big ) \le \varPhi ^k_i(x_i^k) - \underline{L}_i \mathcal B_{\kappa _i} \big (x_i^k,x_i^{k+1} \big ). \) Therefore,

$$\begin{aligned} \begin{aligned} \varPhi \big (x^{k+1}\big ) - \varPhi \big (x^{k}\big )&= \sum _{i=1}^s \varPhi ^k_i\big (x_i^{k+1}\big ) - \varPhi ^k_i\big (x_i^{k}\big ) \le - \sum _{i=1}^s \underline{L}_i \mathcal B_{\kappa _i} \big (x_i^k,x_i^{k+1} \big ). \end{aligned} \end{aligned}$$
(27)

From (27), we see that \(\varPhi \big (x^{k}\big )\) is non-increasing. Theorem 1(i) is proved.

As \(\{x^k\}\) is assumed to be bounded, there exist positive constants \(A_i\) such that \(\Vert x_i^k \Vert \le A_i\), \(i=1,\ldots ,s\), for all k. Suppose a subsequence \(x^{k_n} \rightarrow x^*\). As \( \mathcal X\) is a closed convex set and \(x^{k_n} \in \mathcal X\), we have \(x^*\in \mathcal X\). Hence, \(\varPhi (x^{k_n}) \rightarrow \varPhi (x^*)\), i.e., the continuity condition (B4) is satisfied.

Since \(\kappa _i\) is strongly convex on the sets \(\{x_i: x_i \in \mathcal X_i, \Vert x_i \Vert \le A_i \}\), we have \(\mathcal {B}_{\kappa _i}\big (x_i^k,x_i^{k+1} \big )\ge \sigma _i/2\Vert x_i^k-x_i^{k+1}\Vert ^2 \) for some constant \(\sigma _i>0\). Hence, from (27) we derive the sufficient decrease property (B1) and \(\sum _{k=1}^\infty \Vert x^k-x^{k+1}\Vert ^2 <+\infty \). Consequently, \(\Vert x^k-x^{k+1}\Vert \rightarrow 0\). Then we also have \(x^{k_n+1}\rightarrow x^*\). Letting \(k=k_n\rightarrow \infty \) in (26) and noting that \(x_i^*\in \mathcal X_i \subset \mathrm{int\, dom}\, \kappa _i\), we obtain \(\varPhi (x^*)\le \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\), \(\forall \,x_i\). This means \(x_i^*\) is a minimizer of \(x_i\mapsto \varPhi (x_1^*,\ldots ,x_i,\ldots ,x_s^*)\). Hence \(0\in \partial _i \varPhi (x^*)\) for \(i=1,\ldots ,s\). In other words, we have \(0\in \partial \varPhi (x^*)\), i.e., Theorem 1 (ii) is proved.

To prove Theorem 1 (iii), it remains to verify the boundedness of subgradients. From [2, Proposition 2.1] we have

$$\begin{aligned} \partial \varPhi \big (x^{k}\big )=\partial _1 \varPhi \big (x^{k}\big )\times \cdots \times \partial _s \varPhi \big (x^{k}\big )=\{\nabla _1 f\big (x^{k}\big )+ \partial I_{\mathcal X_1}(x_1^k)\} \times \cdots \times \{\nabla _s f\big (x^{k}\big )+ \partial I_{\mathcal X_s}(x_s^k)\}. \end{aligned}$$

It follows from (13) that \(0\in \nabla f_i^k(x_i^k) + \partial I_{\mathcal X_i} (x_i^{k+1}) + L_i^{k} \big (\nabla \kappa _i (x_i^{k+1})-\nabla \kappa _i (x_i^{k})\big )\). Hence,

$$\begin{aligned} \begin{aligned} \omega _i^{k+1}&:=\nabla _i f\big (x^{k+1}\big )-\nabla f_i^k(x_i^k) + L_i^{k} \big (\nabla \kappa _i (x_i^{k})-\nabla \kappa _i (x_i^{k+1})\big )\\&\ \in \nabla _i f\big (x^{k+1}\big )+ \partial I_{\mathcal X_i}(x_i^{k+1}) = \partial _i \varPhi \big (x^{k+1}\big ). \end{aligned} \end{aligned}$$
(28)

As \(\nabla _i f\) and \(\nabla \kappa _i\) are Lipschitz continuous on \(\{x_i: \Vert x_i \Vert \le A_i \}\), we derive from (28) that there exists some constant \(a_i\) such that \(\Vert \omega _i^{k+1}\Vert \le a_i \Vert x^{k}-x^{k+1}\Vert \), for \(i=1,\ldots ,s\). Then the boundedness of subgradients in (B2) is satisfied. Theorem 1 is proved. \(\square \)
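For intuition, the following sketch spells out one block update of the form (13), whose optimality condition is invoked just before (28). It assumes, purely for illustration, the negative-entropy kernel \(\kappa (x)=\sum _j (x_j\log x_j - x_j)\), for which the update has the closed form \(x^{k+1}=x^k\odot \exp (-\nabla f_i^k(x_i^k)/L)\) over the nonnegative orthant; this kernel and the constant L below are assumptions and not necessarily the \(\kappa _i\) and \(L_i^k\) used by Algorithm 1.

import numpy as np

def bregman_block_step(x_k, g, L):
    # x^+ = argmin_{x >= 0} <g, x> + L * B_kappa(x, x^k), with kappa the negative entropy.
    # The first-order condition g + L * (log(x) - log(x_k)) = 0 gives the closed form below.
    return x_k * np.exp(-g / L)

# Toy block subproblem: one column h of H, with block objective h -> D(t | W h).
rng = np.random.default_rng(1)
m, r = 4, 2
W = rng.random((m, r)) + 0.1
t = rng.random(m) + 0.1
grad = lambda h: W.T @ (1.0 - t / (W @ h))   # gradient of h -> D(t|Wh)
h = np.ones(r)
L = 5.0                                      # hypothetical step constant for illustration
for _ in range(100):
    h = bregman_block_step(h, grad(h), L)
print(h)                                     # iterates remain strictly positive by construction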

A.4 Proof of Proposition 9

Proposition 9: The objective function for the perturbed KL-NMF problem (1) is non-increasing under the updates of Algorithm 2.

Proof

We note that if g(x) is a self-concordant function with constant \(\mathbf{c}\), then \(\mathbf{c}^2\, g(x)\) is a standard self-concordant function. Hence, using the result of [47, Theorem 6] (see also Sect. 2.4), we derive that the objective function of (1) is strictly decreasing when a damped Newton step is used. We now prove the proposition for the case when a full Newton step is used, that is, when the gradient \(f'_{W_{ik}}\le 0\) or \(\lambda \le 0.683802\), and the update of \(W_{ik}\) then is \( W_{ik}^+= s = W_{ik} - \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \).

Consider the case \(f'_{W_{ik}}\le 0\). Let us use \(f_{W_{ik}}\) to denote the objective of (1) with respect to \(W_{ik}\). Considering the function \(W_{ik}\mapsto g(W_{ik})=f'_{W_{ik}}\) with \(W_{ik}\ge 0\), we see that \(g(W_{ik})\) is a concave function. Hence, we have \(g(W_{ik}^+)-g(W_{ik}) \le g'(W_{ik}) (W_{ik}^+-W_{ik}). \) This implies that \( f'_{W_{ik}^+} \le f'_{W_{ik}} - f''_{W_{ik}} \frac{f'_{W_{ik}}}{f''_{W_{ik}}}=0. \) Since \(W_{ik} -W_{ik}^+ = \frac{f'_{W_{ik}}}{f''_{W_{ik}}} \le 0\) and \(W_{ik} \mapsto f_{W_{ik}}\) is convex, we then obtain \(f_{W_{ik}} \ge f_{W_{ik}^+} + f'_{W_{ik}^+}(W_{ik}-W_{ik}^+) \ge f_{W_{ik}^+}. \) Hence, the objective of (1) is non-increasing when we update \(W_{ik}\) to \(W_{ik}^+\).

Consider the case \(\lambda \le 0.683802\). Denote \(W_{ik}^\alpha =W_{ik} + \alpha d\). When \(\alpha =1\) we have \(W_{ik}^\alpha =s\), that is, a full proximal Newton step is applied. It follows from [47, Inequality (58)] that \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - \alpha \lambda ^2 + \omega _*(\alpha \lambda ) \) for \(\alpha \in (0,1]\), where \(\omega _*(t)=- t - \log (1-t)\). Hence, when \(\alpha =1\), we get \(f_{W_{ik}^\alpha } \le f_{W_{ik}} - (\lambda ^2 + \lambda + \log (1-\lambda )).\) It is not difficult to see that when \(\lambda \le 0.683802\) the value of \(\lambda ^2 + \lambda + \log (1-\lambda )\) is nonnegative. Hence, in this case, a full proximal Newton step also does not increase the objective function.\(\square \)
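To illustrate the case analysis above, here is a hedged sketch of a single coordinate-wise Newton update on \(W_{ik}\). The gradient and Hessian formulas are those of \(W_{ik}\mapsto D(V|WH)\); the Newton decrement \(\lambda \) is assumed to be supplied as computed by Algorithm 2 (not reproduced here), the damped step size \(1/(1+\lambda )\) is the standard choice for self-concordant minimization [47] and is an assumption rather than a transcription of Algorithm 2, and any safeguard used by Algorithm 2 (such as a projection onto \(W_{ik}\ge \varepsilon \)) is omitted.

import numpy as np

def newton_step_W_ik(V, W, H, i, k, lam):
    # One coordinate update of W_ik for the KL objective D(V|WH).
    WH_i = W[i, :] @ H                                    # i-th row of WH
    g = np.sum(H[k, :] * (1.0 - V[i, :] / WH_i))          # f'_{W_ik}
    h = np.sum(H[k, :] ** 2 * V[i, :] / WH_i ** 2)        # f''_{W_ik}
    if g <= 0 or lam <= 0.683802:
        return W[i, k] - g / h                            # full Newton step (Proposition 9)
    return W[i, k] - g / (h * (1.0 + lam))                # damped Newton step (assumed form)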

Cite this article

Hien, L.T.K., Gillis, N. Algorithms for Nonnegative Matrix Factorization with the Kullback–Leibler Divergence. J Sci Comput 87, 93 (2021). https://doi.org/10.1007/s10915-021-01504-0
