
A Sparse Smoothing Newton Method for Solving Discrete Optimal Transport Problems

Published: 21 October 2024

Abstract

The discrete optimal transport (OT) problem, which offers an effective computational tool for comparing two discrete probability distributions, has recently attracted much attention and played essential roles in many modern applications. This paper proposes to solve the discrete OT problem by applying a squared smoothing Newton method via the Huber smoothing function for solving the corresponding KKT system directly. The proposed algorithm admits appealing convergence properties and is able to take advantage of the solution sparsity to greatly reduce computational costs. Moreover, the algorithm can be extended to solve problems with similar structures, including the Wasserstein barycenter (WB) problem with fixed supports. To verify the practical performance of the proposed method, we conduct extensive numerical experiments to solve a large set of discrete OT and WB benchmark problems. Our numerical results show that the proposed method is efficient compared to state-of-the-art linear programming (LP) solvers. Moreover, the proposed method consumes less memory than existing LP solvers, which demonstrates the potential usage of our algorithm for solving large-scale OT and WB problems.

1 Introduction

We are interested in the discrete optimal transport (OT) problem that takes the form of
\begin{align}\min_{X\in\mathbb{R}^{m\times n}}\quad\left\langle C,X\right\rangle\quad \mathrm{s.t.}\quad Xe_{n}=a,\;X^{T}e_{m}=b,\;X\geq 0,\end{align}
(1)
where \(C\in\mathbb{R}^{m\times n}_{+}\) denotes a certain cost matrix, \(a\in\mathbb{R}^{m}_{+}\) and \(b\in\mathbb{R}^{n}_{+}\) denote two discrete probability distributions satisfying \(e_{m}^{T}a=e_{n}^{T}b=1\), and \(e_{m}\) (resp. \(e_{n}\)) denotes the vector of all ones in \(\mathbb{R}^{m}\) (resp. \(\mathbb{R}^{n}\)). By the standard Lagrangian duality theory, the dual problem of (1) is given as the following maximization problem
\begin{align}\max_{f\in\mathbb{R}^{m}, g\in\mathbb{R}^{n}, Z\in\mathbb{R}^{m\times n}}\quad \left\langle a,f\right\rangle+\left\langle b,g\right\rangle\quad\mathrm{s.t.} \quad fe_{n}^{T}+e_{m}g^{T}+Z=C,\;Z\geq 0.\end{align}
(2)
Obviously, the feasible set for problem (1) is nonempty and compact. Hence, the primal problem (1) has a finite optimal value that is attainable. Since the strong duality for linear programming (LP) always holds, it is clear that the following Karush–Kuhn–Tucker (KKT) optimality conditions for (1) and (2)
\begin{align}Xe_{n}=a,\quad X^{T}e_{m}=b,\quad fe_{n}^{T}+e_{m}g^{T}+Z=C,\quad X\geq 0,\;Z \geq 0,\;\left\langle X,Z\right\rangle=0\end{align}
(3)
admits at least one solution. Note that the last three conditions in Equation (3) can be reformulated as the nonsmooth system \(X-\Pi_{+}(X-\sigma Z)=0\), where \(\Pi_{+}(\cdot)\) is the projection operation onto the nonnegative orthant \(\mathbb{R}^{m\times n}_{+}\) and \(\sigma \gt 0\) is any positive constant.
The OT problem has a large number of important applications, due partly to the metric property of its optimal objective value. In particular, the optimal objective value of the OT problem defines a distance (i.e., the Wasserstein distance) between two probability distributions if the cost matrix \(C\) is chosen based on a distance; see for instance [Fu et al., 2006; Grauman and Darrell, 2004; Kendal et al., 2013; Peyré and Cuturi, 2019; Rubner et al., 2000; Santambrogio, 2015; Villani, 2009] and references therein for fundamental properties of OT problems together with many important applications in a wide variety of fields including imaging science, engineering, statistics and machine learning, management, finance, and beyond. For example, a representative application of the Wasserstein distance is the Wasserstein barycenter (WB) problem, which aims to compute the mean of a set of discrete probability distributions under the Wasserstein distance [Agueh and Carlier, 2011]. It has shown promising performance in many fields such as machine learning [Cuturi and Doucet, 2014; Peyré and Cuturi, 2019; Ye and Li, 2014], image processing [Rabin et al., 2012], and economics [Chiappori et al., 2010; Galichon, 2018]. For a set of discrete probability distributions with finite support points, a WB problem with its support points pre-specified can be reformulated as an LP problem [Anderes et al., 2016] that shares a similar structure with the OT problem; see Section 4 for the detailed description of the WB problem with fixed supports.
As real-world applications typically produce large-scale OT problems, computing the optimal transportation plans efficiently and accurately at scale has received growing attention. Indeed, the last few years have seen a blossoming of interest in developing efficient solution methods for solving OT problems. First, if we treat the OT problem as a special case of a general LP problem, it is clear that any solver for LP can be applied. In particular, it is commonly accepted that primal-dual interior point methods (IPMs) [Wright, 1997] and simplex methods [Dantzig, 2016], such as the highly optimized commercial solvers GUROBI, MOSEK, and CPLEX, and the open-source solver HiGHS, are the most efficient and robust solvers for general LPs. Moreover, when an LP problem (including the OT problem) has many more variables than the number of linear constraints, [Li et al., 2020] provides strong evidence that the dual-based inexact augmented Lagrangian method is also highly efficient and consumes less memory than IPMs. On the other hand, it is known from the literature (see e.g., [Peyré and Cuturi, 2019; Chapter 3]) that the OT problem can be viewed as an optimal network flow problem built upon a bipartite graph having \((m+n)\) nodes for which \(a\) and \(b\) are considered as the sources and sinks, respectively. Then, the network simplex method [Orlin, 1997] can also be applied to solve the OT problem exactly. In fact, as demonstrated in [Schrieber et al., 2016], the network simplex method (implemented in CPLEX) and its variant (such as the transportation simplex method [Luenberger and Ye, 1984]) are very efficient and robust. Lastly, for applications for which accurate solutions are not required, entropic regularized algorithms (see e.g., the Sinkhorn algorithm [Cuturi, 2013] and its stabilized variants [Schmitzer, 2019]), entropic and/or Bregman proximal point methods (see e.g., [Chu et al., 2020; Yang and Toh, 2022]) and multiscale approaches [Liu et al., 2022; Mérigot, 2011; Schmitzer, 2016] are applicable and popular.
When it comes to the WB problem, the problem scale is usually much larger than that of the OT problem due to the need to compute multiple transport plans for the given discrete distributions. Thus, it is quite challenging to solve WB problems by classical methods like IPMs or simplex methods. To overcome the computational challenges, one promising approach is to consider the entropic regularized problem [Benamou et al., 2015; Cuturi and Doucet, 2014], for which scalable Sinkhorn-based algorithms can be developed for solving large-scale problems (see e.g., [Yang et al., 2021; Ye et al., 2017; Zhang et al., 2022]). However, the entropic regularized problem typically can only produce a rough approximate solution to the original WB problem. To get better approximate solutions, it is necessary to choose a small regularization parameter, but this generally causes numerical instability and makes Sinkhorn-based algorithms less efficient. Thus, in this article, we aim to solve the OT and WB problems directly without introducing an entropic regularization.
In this article, we focus on solving the KKT system (3) directly by Newton-type methods. We know that the OT problem (1) has many more variables than the number of linear constraints. As a consequence, an optimal solution (not necessarily unique) of an OT problem is generally highly sparse. In fact, as an optimal network flow problem, the OT problem admits an optimal solution having at most \(m+n\) nonzero entries (see e.g., [Peyré and Cuturi, 2019; Section 3.4]). It is well-known that a primal-dual IPM applies the Newton method for solving a sequence of perturbed KKT systems. This approach is shown to be highly efficient and robust in general. However, as we will explain in Section 2, the primal-dual IPM needs to solve a sequence of dense and highly ill-conditioned linear systems which require excessive computational costs. Moreover, as the iterates are required to stay in the positive orthant, IPM is incapable of exploiting the solution sparsity when solving the OT problem.
The main issue that we want to address in this article is: How can we exploit the solution sparsity in designing a Newton-type method for solving the KKT system (3) directly? One popular way is to combine an IPM with the column generation technique; see, e.g., [Tits et al., 2006; Ye, 1992] for more details. However, the column generation technique is effective only if one is able to identify the correct active set efficiently, which requires sophisticated pricing strategies that are usually highly heuristic. Another approach is to apply the semismooth Newton (SSN) method [Qi and Sun, 1993] to solve the original KKT system directly. It is shown that, under suitable conditions, the SSN method has a local superlinear or even quadratic convergence rate. However, there exist some practical and theoretical issues when applying the SSN method, as we shall explain later in Section 2. Among these issues, the most critical one is that the global convergence of the SSN method cannot be guaranteed due to the lack of a suitable merit function for line search.
To address the question just raised, we propose a squared smoothing Newton (SqSN) method via the Huber function for solving OT problems. The proposed method admits appealing global convergence properties, and it is able to take advantage of the underlying sparsity structure of the optimal solution. In addition, the proposed method enjoys superlinear convergence to an accumulation point. Extensive numerical experiments conducted on solving OT problems generated from the DOTmark collection [Schrieber et al., 2016] show the promising practical performance of our proposed method when compared to highly optimized LP solvers and the network simplex method. Besides the successful application to OT problems, we further extend the method to solve WB problems with fixed supports and provide an efficient way to solve the corresponding Newton system. Numerical results on some real image datasets indicate that the proposed method outperforms both the primal-dual IPM and dual simplex method implemented in the highly optimized commercial solver Gurobi.
The rest of this paper is organized as follows. In Section 2, to motivate the development of our proposed algorithm, we briefly introduce the main ideas of the primal-dual IPM and the SSN method for solving OT problems, and explain some of the limitations of both methods. Then in Section 3, we present our main algorithm, i.e., an SqSN method via the Huber smoothing function, and conduct rigorous convergence analysis. In Section 4, we extend our algorithm to solve WB problems with fixed supports, and explain how to solve the underlying Newton equations efficiently. To evaluate the practical performance of the proposed method, we conduct numerical experiments on numerous OT and WB problems with different sizes in Section 5. Finally, we conclude our paper in Section 6.

2 Preliminary

In this article, the following notations are used. For any finite dimensional Euclidean space \(\mathbb{E}\) equipped with an inner product \(\left\langle\cdot,\cdot\right\rangle\), we denote its induced norm by \(\left\lVert\cdot\right\rVert\). We denote the nonnegative orthant of \(\mathbb{R}^{n}\) as \(\mathbb{R}^{n}_{+}\). For any positive integer \(n\), \(e_{n}\in\mathbb{R}^{n}\) denotes the vector of all ones and \(I_{n}\in\mathbb{R}^{n\times n}\) denotes the identity matrix. For \(X\in\mathbb{R}^{m\times n}\), we use \(\mathrm{vec}(X)\) to denote the vector in \(\mathbb{R}^{mn}\) obtained by stacking the columns of \(X\). Conversely, \(X=\mathrm{Mat}(x)\) is the unique matrix such that \(x=\mathrm{vec}(X)\). For given matrices \(A\) and \(B\), \(A\otimes B\) denotes the Kronecker product of \(A\) and \(B\). Additionally, for any matrix \(X\) such that \(AXB\) is well-defined, it holds that \(\mathrm{vec}(AXB)=(B^{T}\otimes A)\mathrm{vec}(X)\). We also use \(\circ\) to denote the element-wise multiplication operator of two vectors/matrices of the same size. For any vector \(x\), \(\mathrm{Diag}(x)\) denotes the diagonal matrix with \(x\) on its diagonal. Conversely, for any square matrix \(X\), \(\mathrm{diag}(X)\) is the vector consisting of the diagonal entries of \(X\). For any vector \(x\), the projection of \(x\) onto \(\mathbb{R}^{n}_{+}\) is denoted by \(\Pi_{+}(x):=\mathrm{argmin}\{\left\lVert z-x\right\rVert^{2}\;:\;z\geq 0\}= \max\{x,0\}\), where the \(\max\) operation is applied elementwise on \(x\).
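For illustration, the following Python sketch (the sizes and random data are hypothetical) numerically verifies the identity \(\mathrm{vec}(AXB)=(B^{T}\otimes A)\mathrm{vec}(X)\), where \(\mathrm{vec}\) stacks columns (column-major order):

```python
# Numerical check of vec(A X B) = (B^T kron A) vec(X) with column-major vec.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")        # stack columns of M
assert np.allclose(vec(A @ X @ B), np.kron(B.T, A) @ vec(X))
```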
In this section, we shall briefly discuss the primal-dual IPM and the SSN method, which are both Newton-type methods for handling the KKT system (3). We also mention some theoretical and practical issues that we may face when applying these two methods to solving the OT problem.
We start by introducing some notation that would simplify our presentation. Let \(x:=\mathrm{vec}(X)\in\mathbb{R}^{mn}\). By using the properties of the Kronecker product, we see that
\begin{align*}Xe_{n}=(e_{n}^{T}\otimes I_{m})x,\quad X^{T}e_{m}=(I_{n}\otimes e_{m}^{T})x.\end{align*}
Denote \(z:=\mathrm{vec}(Z)\in\mathbb{R}^{mn}\), \(c:=\mathrm{vec}(C)\in\mathbb{R}^{mn}\), \(d:=(a;b)\in\mathbb{R}^{m+n}\), \(y:=(f;g)\in\mathbb{R}^{m+n}\) and \(A:=\begin{pmatrix}e_{n}^{T}\otimes I_{m}\\ I_{n}\otimes e_{m}^{T}\end{pmatrix}\). Then we can rewrite the KKT systems (3) as follows:
\begin{align}Ax=d,\quad A^{T}y+z=c,\quad x\circ z=0,\;x\geq 0,\;z\geq 0.\end{align}
(4)

2.1 Primal-Dual IPMs

Recall that a primal-dual IPM solves the KKT system (3) indirectly by solving a sequence of perturbed KKT systems of the form:
\begin{align}Ax=d,\quad A^{T}y+z=c,\quad x\circ z=\mu e_{mn},\;x\geq 0,\;z\geq 0,\end{align}
(5)
via the Newton method, where \(\mu \gt 0\) is the barrier parameter that is driven to zero. At each IPM iteration, the following Newton equation is solved
\begin{align}\begin{pmatrix}A & 0 & 0\\0 & A^{T} & I_{mn}\\\mathrm{Diag}(z) & 0 & \mathrm{Diag}(x)\end{pmatrix}\begin{pmatrix}\Delta x\\\Delta y\\\Delta z\end{pmatrix}=\begin{pmatrix}r_{p}:=d-Ax\\r_{d}:=c-A^{T}y-z\\r_{c}:=\gamma\mu e_{mn}-x\circ z\end{pmatrix}, \end{align}
(6)
where \((x,y,z)\in\mathbb{R}^{mn}_{++}\times\mathbb{R}^{m+n}\times\mathbb{R}^{mn}_{++}\) is the current primal-dual iterate, \((\Delta x,\Delta y,\Delta z)\in\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\times \mathbb{R}^{mn}\) is the Newton direction, \(\mu\) is set to be \(\langle x,z\rangle/(mn)\), and \(\gamma\in[0,1]\) is a given parameter. To reduce the computational cost, the large-scale system of \(2mn+m+n\) linear equations is typically reduced to the so-called normal system of \(m+n\) equations in \(\Delta y\) through the block elimination of \(\Delta x\) and \(\Delta z\). Specifically, we solve the normal equation
\begin{align}A\Theta A^{T}\Delta y=r_{p}+A\Theta(r_{d}-\mathrm{Diag}(x)^{-1}r_{c}),\end{align}
(7)
where \(\Theta:=\mathrm{Diag}(\theta)\in\mathbb{R}^{mn\times mn}\) is a diagonal matrix whose diagonal entries are \(\theta_{i}=x_{i}/z_{i}\), for \(i=1,\dots,mn\). After obtaining \(\Delta y\) by solving (7), we can compute \(\Delta x\) and \(\Delta z\) accordingly. To solve (7) efficiently, the following structure of \(A\Theta A^{T}\) is useful.
Fact 1.
For any diagonal matrix \(\Theta:=\mathrm{Diag}(\theta)\in\mathbb{R}^{mn\times mn}\), let \(V:=\mathrm{Mat}(\theta)\in\mathbb{R}^{m\times n}\), then it holds that
\begin{align*}A\Theta A^{T}=\begin{pmatrix}\mathrm{Diag}(Ve_{n}) & V\\V^{T} & \mathrm{Diag}(V^{T}e_{m})\end{pmatrix}.\end{align*}
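The following Python sketch (the sizes \(m\), \(n\) and the vector \(\theta\) are hypothetical) verifies Fact 1 numerically by forming \(A\) explicitly from Kronecker products:

```python
# Numerical check of Fact 1: A Diag(theta) A^T has the stated 2x2 block form.
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(1)
theta = rng.random(m * n)
V = theta.reshape((m, n), order="F")                      # V = Mat(theta)

A = np.vstack([np.kron(np.ones((1, n)), np.eye(m)),       # e_n^T kron I_m
               np.kron(np.eye(n), np.ones((1, m)))])      # I_n kron e_m^T
lhs = A @ np.diag(theta) @ A.T
rhs = np.block([[np.diag(V @ np.ones(n)), V],
                [V.T, np.diag(V.T @ np.ones(m))]])
assert np.allclose(lhs, rhs)
```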
As we can see, the iterates \(x\) and \(z\) generated by an IPM should be positive because of the third equation in (5). As a consequence, the matrix \(V=\mathrm{Mat}(\theta)\) is dense and \(A\Theta A^{T}\) contains two fully dense blocks. Therefore, factorizing \(A\Theta A^{T}\) in Equation (7) can be very expensive when \(m+n\) is large. As a result, solving the normal system by an iterative solver is the only viable option. Moreover, it is obvious that \(A\Theta A^{T}\) is always singular, since the vector \((e_{m};-e_{n})\in\mathbb{R}^{m+n}\) is an eigenvector for \(A\Theta A^{T}\) corresponding to the zero eigenvalue. Note that the singularity of the coefficient matrix may not be an issue since one can remove the last row of \(A\) in a preprocessing phase, which results in a normal equation having a coefficient matrix of the form
\begin{align*}\begin{pmatrix}\mathrm{Diag}(Ve_{n}) & \hat{V}\\\hat{V}^{T} & \mathrm{Diag}(\hat{V}^{T}e_{m})\end{pmatrix}\in\mathbb{R}^{(m+n-1) \times(m+n-1)},\end{align*}
where \(\hat{V}\in\mathbb{R}^{m\times(n-1)}\) is formed from the first \(n-1\) columns of \(V\). Then, by [Hu et al., 2024; Lemma 2.3], we see that the resulting coefficient matrix is nonsingular. However, ill-conditioning can still prevent an iterative solver from converging quickly. In view of the above discussions, IPMs may not perform as efficiently as one would expect when solving large-scale OT problems.

2.2 The SSN Method

In this subsection, we shall discuss the SSN method. We first revisit the concept of semismoothness, which was originally introduced in [Mifflin, 1977] for functionals and later extended to vector-valued functions in [Qi and Sun, 1993]. Let \(\mathbb{X}\) and \(\mathbb{Y}\) be two finite dimensional Euclidean spaces and \(F:\mathbb{X}\rightarrow\mathbb{Y}\) be a locally Lipschitz continuous function. According to Rademacher's Theorem [Rademacher, 1919], \(F\) is F-differentiable almost everywhere, and Clarke's generalized Jacobian [Clarke, 1990] of \(F\) at \(x\in\mathbb{X}\), denoted by \(\partial F(x)\), is well-defined and it holds that
\begin{align*}\partial F(x):=\mathrm{conv}(\partial_{B}F(x)),\quad\partial_{B}F(x):=\left\{\lim_{D_{F}\ni z\rightarrow x}F^{\prime}(z)\right\},\quad \forall\;x\in\mathbb{X},\end{align*}
where “\(\mathrm{conv}\)” denotes the convex hull operation and \(D_{F}\subset\mathbb{X}\) is the set of points at which \(F\) is F-differentiable.
If \(F\) is directionally differentiable at \(x\in\mathbb{X}\), i.e., the limit
\begin{align*}F^{\prime}(x;d):=\lim_{t\downarrow 0}\frac{F(x+td)-F(x)}{t}\end{align*}
exists for all \(d\in\mathbb{X}\), and \(\left\lVert F(x+d)-F(x)-Vd\right\rVert=o(\left\lVert d\right\rVert),\;\forall V \in\partial F(x+d),\;\forall d\rightarrow 0,\) we say that \(F\) is semismooth at \(x\). If \(F\) is semismooth at \(x\in\mathbb{X}\) and it holds that
\begin{align*}\left\lVert F(x+d)-F(x)-Vd\right\rVert=O(\left\lVert d\right\rVert^{2}),\quad \forall V\in\partial F(x+d),\;\forall d\rightarrow 0,\end{align*}
then \(F\) is said to be strongly semismooth at \(x\). Moreover, \(F\) is said to be (strongly) semismooth everywhere on \(\mathbb{X}\), if \(F\) is (strongly) semismooth at any point \(x\in\mathbb{X}\).
Convex functions, smooth functions, and piecewise linear functions are examples of semismooth functions. In particular, one can verify that the projection operator onto the nonnegative orthant is strongly semismooth everywhere. Hence, the KKT mapping \(F(x,y,z)\) defined in the following equation is also strongly semismooth everywhere:
\begin{align}0\;=\;F(x,y,z):=\begin{pmatrix}Ax-d\\-A^{T}y-z+c\\x-\Pi_{+}(x-\sigma z)\end{pmatrix},\quad(x,y,z)\in\mathbb{R}^{mn}\times\mathbb {R}^{m+n}\times\mathbb{R}^{mn},\end{align}
(8)
where \(\sigma \gt 0\) is a given constant. Therefore, a natural idea is to apply the SSN method [Qi and Sun, 1993], i.e., a nonsmooth version of Newton method, for solving (8). In particular, at each iteration of the SSN method, the following linear system is solved
\begin{align}\begin{pmatrix}A & 0 & 0\\0 & -A^{T} & -I_{mn}\\I_{mn}-V & 0 & \sigma V\end{pmatrix}\begin{pmatrix}\Delta x\\\Delta y\\\Delta z\end{pmatrix}=\begin{pmatrix}r_{p}:=d-Ax\\r_{d}:=A^{T}y+z-c\\r_{c}:=\Pi_{+}(x-\sigma z)-x\end{pmatrix},\end{align}
(9)
where \((x,y,z)\) denotes the current iterate and \(V=\mathrm{Diag}(v)\in\mathbb{R}^{mn\times mn}\) is an element of Clarke's generalized Jacobian (see [Clarke, 1990]) of \(\Pi_{+}(x-\sigma z)\). Specifically, for given \((x,z)\), we may choose
\begin{align*}v_{i}=\left\{\begin{array}{ll}1, & {\rm if}\;\;(x-\sigma z)_{i}\geq 0, \\0, & {\rm otherwise},\end{array}\right.\quad i=1,\dots,mn.\end{align*}
By the expression of \(V\), we see that when \((x,z)\) is nearly optimal, \(V\) could have a large number of zero diagonal elements. Thus, the sparsity structure of the optimal solution can be utilized in the SSN method, which is an attractive feature when compared with the IPM. However, there are two issues that prevent us from applying the SSN method directly:
(1)
The mapping \(F\) is nonsmooth due to the nonsmoothness of \(\Pi_{+}\). Hence, it is difficult to find a valid merit function for which a line search scheme for the SSN method is well-defined. As a consequence, globalizing the SSN method is a challenge that has yet to be resolved.
(2)
It is not clear how to reduce the large-scale linear system (9), which is typically singular, to a certain normal linear system to save computational cost as in the case of the IPM.

3 An SqSN Method

Note that the KKT Equation (8) can be reduced to finding the root of the following mapping \(E(x,y)\)
\begin{align}0\;=\;E(x,y):=\begin{pmatrix}Ax-d\\x-\Pi_{+}(x+\sigma(A^{T}y-c))\end{pmatrix},\quad\;(x,y)\in\mathbb{R}^{mn} \times\mathbb{R}^{m+n},\end{align}
(10)
by eliminating the variable \(z=c-A^{T}y\). In this section, we focus on presenting our main algorithm, the SqSN method via the Huber smoothing function, for solving the nonsmooth system (10). Existing smoothing Newton methods have been studied extensively, mainly for solving conic programming problems, complementarity problems, and variational inequalities. We refer the readers to [Chen et al., 1998; Kong et al., 2008; Qi et al., 2000; Sun et al., 2004] for more details.

3.1 The Huber Smoothing Function

The key idea of a smoothing method [Sun and Qi, 2001] is to construct a smoothing approximation function \(\mathcal{E}:\mathbb{R}_{++}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n} \rightarrow\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\) for the mapping \(E\), and then apply the Newton method for solving the smoothed system \(\mathcal{E}(\epsilon,x,y)=0\), for \((\epsilon,x,y)\in\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\). To this end, it is essential to construct a smoothing approximation of the projection operator \(\Pi_{+}(\cdot)\), namely, \(\Phi:\mathbb{R}_{++}\times\mathbb{R}^{mn}\rightarrow\mathbb{R}^{mn}\), such that for any \(\epsilon \gt 0\), \(\Phi(\epsilon,\cdot)\) is continuously differentiable on \(\mathbb{R}^{mn}\) and for any \(x\in\mathbb{R}^{mn}\), it holds that \(\left\lVert\Phi(\epsilon,x)-\Pi_{+}(x)\right\rVert\rightarrow 0\) as \(\epsilon\downarrow 0\).
Note that a smoothing approximation function of \(\Pi_{+}(\cdot)\) strongly depends on the smoothing function for the plus function
\begin{align*}\pi(t)=\max\{0,t\},\quad\forall\;t\in\mathbb{R}.\end{align*}
To the best of our knowledge, the Chen-Harker-Kanzow-Smale (CHKS) smoothing function [Chen and Harker, 1993; Kanzow, 1996; Smale, 2000], defined as
\begin{align*}\xi(\epsilon,t):=\frac{1}{2}\left(\sqrt{t^{2}+4\epsilon^{2}}+t\right),\quad \forall\;(\epsilon,t)\in\mathbb{R}\times\mathbb{R},\end{align*}
is the most commonly used smoothing function for \(\pi(\cdot)\) in the literature. Readers are referred to [Qi et al., 2000] for other smoothing functions, whose properties are also well studied. It is easy to see that \(\xi(\epsilon,t)\) maps any negative number \(t\) to a positive value when \(\epsilon\neq 0\), while the function \(\pi(\cdot)\) maps any negative number \(t\) to zero. Hence, the CHKS smoothing function does not preserve the sparsity structure of \(\pi(\cdot)\). As a result, a smoothing Newton method developed based on the CHKS smoothing function is not able to exploit the sparsity in the solutions of the OT problem to reduce the cost of solving the linear system at each iteration of the algorithm.
To resolve the issue just mentioned, we propose to use the Huber smoothing function (a translation of the function considered in [Zang, 1980]) which also maps any negative number to zero. In particular, the Huber smoothing function we shall use in this article is defined as:
\begin{align}h(\epsilon,t):=\left\{\begin{array}{ll}t-\frac{\lvert\epsilon\rvert} {2}, & t\geq\lvert\epsilon\rvert,\\\frac{t^{2}}{2\lvert\epsilon\rvert}, & 0 < t < \lvert\epsilon\rvert,\\0, & t\leq 0,\end{array}\right.\quad\forall\;(\epsilon,t)\in\mathbb{R}\backslash \{0\}\times\mathbb{R},\quad h(0,t)=\pi(t),\quad\forall t\in \mathbb{R}.\end{align}
(11)
Clearly, \(h(\epsilon,t)\) is continuously differentiable for \(\epsilon\neq 0\) and \(t\in\mathbb{R}\). In fact, we see that for \(\epsilon\neq 0\),
\begin{align*}h_{1}^{\prime}(\epsilon,t):=\frac{\partial h}{\partial\epsilon}(\epsilon,t)= \left\{\begin{array}{ll}-\frac{1}{2}\mathrm{sign}(\epsilon), & t\geq \lvert\epsilon\rvert,\\-\frac{t^{2}}{2\epsilon^{2}}\mathrm{sign}(\epsilon), & 0 \lt t \lt \lvert\epsilon\rvert, \\0, & t\leq 0,\end{array}\right.\quad h_{2}^{\prime}(\epsilon,t):=\frac{\partial h }{\partial t}(\epsilon,t)=\left\{\begin{array}{ll}1, & t\geq\lvert \epsilon\rvert,\\\frac{t}{\lvert\epsilon\rvert}, & 0 \lt t \lt \lvert\epsilon\rvert,\\0, & t\leq 0.\end{array}\right.\end{align*}
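For concreteness, a minimal Python implementation of the Huber smoothing function (11) and the two partial derivatives displayed above (vectorized over \(t\), assuming \(\epsilon\neq 0\)) could look as follows; the function and variable names are ours, not part of the paper:

```python
# A sketch of h(eps, t), h_1'(eps, t), and h_2'(eps, t) from (11).
import numpy as np

def huber(eps, t):
    a = abs(eps)
    return np.where(t >= a, t - a / 2.0,
                    np.where(t > 0.0, t * t / (2.0 * a), 0.0))

def huber_d_eps(eps, t):                       # partial derivative in eps
    a, s = abs(eps), np.sign(eps)
    return np.where(t >= a, -0.5 * s,
                    np.where(t > 0.0, -t * t / (2.0 * eps ** 2) * s, 0.0))

def huber_d_t(eps, t):                         # partial derivative in t
    a = abs(eps)
    return np.where(t >= a, 1.0, np.where(t > 0.0, t / a, 0.0))

t = np.linspace(-2.0, 2.0, 9)
print(huber(0.5, t))                           # zero for every t <= 0
```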
Moreover, it is easy to check that for any \(t\in\mathbb{R}\),
\begin{align*}\partial_{B}h(0,t)= & \;\left\{v\mid v=\lim_{{\epsilon^{k}\rightarrow 0}, t^{k} \rightarrow t}h^{\prime}(\epsilon^{k},t^{k}),\text{ if exists}\right\} =\left\{\begin{array}{ll}\left(\pm\frac{1}{2},1\right), & t \gt 0,\\\left\{\left(\pm\frac{1}{2}\xi^{2},\xi\right)\mid\xi\in[0,1]\right\}, & t=0,\\(0,0), & t \lt 0,\end{array}\right.\end{align*}
and consequently,
\begin{align}\partial h(0,t)=\mathrm{conv}(\partial_{B}h(0,t))=\left\{\begin{array}{ll}T_{+}, & t > 0,\\T_{0}, & t=0,\\\{(0,0)\}, & t < 0,\end{array}\right.\end{align}
(12)
where
\begin{align*}T_{+}:=\;\left\{(\nu,1)\mid\nu\in\left[-\frac{1}{2}, \frac{1}{2}\right]\right\},\quad{T_{0}:=\;\left\{\left(\nu,\xi \right)\mid\xi\in\left[0,1\right],\nu\in\left[-\frac{1}{2}\xi,\frac{1}{2}\xi \right]\right\}.}\end{align*}
Recall that the generalized Jacobian of the plus function \(\pi\) has the following form
\begin{align*}\partial\pi(t)=\left\{\begin{array}{ll}\{1\}, & t \gt 0,\\\left[0,1\right], & t=0,\\\{0\}, & t \lt 0,\end{array}\right.\quad\forall t\in\mathbb{R}.\end{align*}
Therefore, we can see that for any \(v\in\partial h(0,t)\), there exists a \(u\in\partial\pi(t)\) such that \(v(0,h)\equiv u(h)\) for any \(h\in\mathbb{R}\). One can also verify that the converse statement is true.

3.2 The Main Algorithm

Using the Huber function \(h(\cdot,\cdot)\), we can construct the smoothing approximation for the projection operator \(\Pi_{+}(\cdot)\) as
\begin{align}\Phi(\epsilon,x):=(h(\epsilon,x_{1}),\dots,h(\epsilon,x_{mn}))^{T},\quad \forall x\in\mathbb{R}^{mn},\;\epsilon\neq 0.\end{align}
(13)
Given the above smoothing function \(\Phi(\cdot)\) for \(\Pi_{+}(\cdot)\), a smoothing approximation of \(E(\cdot,\cdot)\) can be constructed accordingly as follows:
\begin{align}\mathcal{E}(\epsilon,x,y):=\begin{pmatrix}Ax+\kappa_{p}\epsilon y-d\\(1+\kappa_{c}\epsilon)x-\Phi(\epsilon,x+\sigma(A^{T}y-c))\end{pmatrix},\quad \forall(\epsilon,x,y)\in\mathbb{R}_{++}\times\mathbb{R}^{mn}\times\mathbb{R}^{ m+n},\end{align}
(14)
where \(\kappa_{p} \gt 0\) and \(\kappa_{c} \gt 0\) are two given parameters. Note that adding the perturbation terms \(\kappa_{p}\epsilon y\) and \(\kappa_{c}\epsilon x\) is essential in our algorithmic design, since these perturbations will ensure that the Jacobian of \(\mathcal{E}\) at \((\epsilon,x,y)\) is nonsingular when \(\epsilon \gt 0\).
In the current literature, smoothing Newton methods can be roughly divided into two groups, the Jacobian smoothing Newton methods and the SqSN methods (see [Qi and Sun, 1999] for a comprehensive survey). The global convergence of a Jacobian smoothing Newton method strongly relies on the so-called Jacobian consistency property and the existence of a positive constant \(\kappa\) such that
\begin{align}\left\lVert\mathcal{E}(\epsilon,x,y)-E(x,y)\right\rVert\leq\kappa\epsilon, \quad\forall\epsilon > 0,\;(x,y)\in\mathbb{R}^{mn}\times\mathbb{R}^{m+n}.\end{align}
(15)
Unfortunately, the property (15) does not hold for (14). Therefore, in our setting, the Jacobian smoothing Newton method is not applicable. On the other hand, the SqSN method tries to find a solution of \(\mathcal{E}(\epsilon,x,y)=0\) by solving the following system
\begin{align}\widehat{\mathcal{E}}(\epsilon,x,y):=\begin{pmatrix}\epsilon\\\mathcal{E}(\epsilon,x,y)\end{pmatrix}=0,\quad(\epsilon,x,y)\in\mathbb{R} \times\mathbb{R}^{mn}\times\mathbb{R}^{m+n},\end{align}
(16)
via the Newton method. As we shall see shortly, the global convergence of the SqSN method does not rely on strong conditions such as (15). This explains why we choose the SqSN method in the present paper.
Associated with the system (16), we can define the merit function \(\phi:\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\rightarrow\mathbb{R}\)
\begin{align*}\phi(\epsilon,x,y):=\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert ^{2},\quad(\epsilon,x,y)\in\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+ n},\end{align*}
to ensure that the line search procedure within the SqSN method is well-defined. We also define an auxiliary function \(\zeta:\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\rightarrow\mathbb{R}\)
\begin{align*}\zeta(\epsilon,x,y):=r\min\Big{\{}1,\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert^{1+\tau}\Big{\}},\quad\forall(\epsilon,x,y) \in\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n},\end{align*}
which controls how the smoothing parameter \(\epsilon\) is driven to zero, where \(r\in(0,1)\) and \(\tau\in(0,1]\) are two given constants. Using these notations, we can now describe our SqSN method via the Huber function in Algorithm 1.
Algorithm 1 An SqSN Method via the Huber Function
Require: Initial point \((x^{0},y^{0})\in\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\), \(\epsilon^{0} \gt 0\), \(r\in(0,1)\) such that \(\delta:=r\epsilon^{0} \lt 1\), \(\tau\in(0,1]\), \(\rho\in(0,1)\), and \(\mu\in(0,1/2)\).
1: for \(k\geq 0\) do
2: if \(\widehat{\mathcal{E}}(\epsilon^{k},x^{k},y^{k})=0\) then
3:  Output: \((\epsilon^{k},x^{k},y^{k})\);
4: else
5:  Compute \(\zeta_{k}=\zeta(\epsilon^{k},x^{k},y^{k})\);
6:  Find \(\Delta^{k}:=(\Delta\epsilon^{k};\Delta x^{k};\Delta y^{k})\) via solving the following linear system of equations
\begin{align*}\displaystyle\widehat{\mathcal{E}}(\epsilon^{k},x^{k},y^{k})+\widehat{\mathcal {E}}^{\prime}(\epsilon^{k},x^{k},y^{k})\Delta^{k}=(\zeta_{k}\epsilon^{0};0;0);\end{align*}
(17)
7:  Compute \(\ell_{k}\) as the smallest nonnegative integer \(\ell\) satisfying \(\phi(\epsilon^{k}+\rho^{\ell}\Delta\epsilon^{k},x^{k}+\rho^{\ell}\Delta x^{k}, y^{k}+\rho^{\ell}\Delta y^{k})\leq\left[1-2\mu(1-\delta)\rho^{\ell}\right]\phi (\epsilon^{k},x^{k},y^{k});\)
8:  Update \((\epsilon^{k+1},x^{k+1},y^{k+1})=(\epsilon^{k}+\rho^{\ell_{k}}\Delta\epsilon^{ k},x^{k}+\rho^{\ell_{k}}\Delta x^{k},y^{k}+\rho^{\ell_{k}}\Delta y^{k})\);
9: end if
10: end for
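To make the steps of Algorithm 1 concrete, the following self-contained Python sketch implements the iteration with dense linear algebra on a small random OT instance. It is only an illustration under assumed parameter values (\(\kappa_{p}=\kappa_{c}=1\), \(\sigma=1\), \(\epsilon^{0}=1\), \(r=0.5\), \(\tau=0.5\), \(\rho=0.5\), \(\mu=0.25\)); a practical implementation would instead reduce the Newton system to a smaller normal equation and exploit sparsity, as discussed next.

```python
# A hedged, dense-algebra sketch of Algorithm 1 on a tiny OT instance.
# All parameter values and data below are illustrative assumptions; the
# smoothing parameter eps stays positive before termination (cf. Lemma 3).
import numpy as np

def huber(eps, t):                       # smoothing of max(0, t), eps > 0
    return np.where(t >= eps, t - eps / 2.0,
                    np.where(t > 0.0, t * t / (2.0 * eps), 0.0))

def huber_d_eps(eps, t):                 # d h / d eps, eps > 0
    return np.where(t >= eps, -0.5,
                    np.where(t > 0.0, -t * t / (2.0 * eps ** 2), 0.0))

def huber_d_t(eps, t):                   # d h / d t, eps > 0
    return np.where(t >= eps, 1.0, np.where(t > 0.0, t / eps, 0.0))

def squared_smoothing_newton(C, a, b, sigma=1.0, kp=1.0, kc=1.0, eps0=1.0,
                             r=0.5, tau=0.5, rho=0.5, mu=0.25,
                             tol=1e-8, maxit=200):
    m, n = C.shape
    A = np.vstack([np.kron(np.ones((1, n)), np.eye(m)),   # e_n^T kron I_m
                   np.kron(np.eye(n), np.ones((1, m)))])  # I_n kron e_m^T
    c, d, N = C.reshape(-1, order="F"), np.concatenate([a, b]), m * n
    x, y, eps = np.full(N, 1.0 / N), np.zeros(m + n), eps0
    delta = r * eps0                                       # must satisfy delta < 1

    def Ehat(eps, x, y):                                   # the map in (16)
        w = x + sigma * (A.T @ y - c)
        return np.concatenate(([eps], A @ x + kp * eps * y - d,
                               (1.0 + kc * eps) * x - huber(eps, w)))

    for _ in range(maxit):
        F = Ehat(eps, x, y)
        if np.linalg.norm(F) < tol:
            break
        zeta = r * min(1.0, np.linalg.norm(F) ** (1.0 + tau))
        w = x + sigma * (A.T @ y - c)
        V1, V2 = huber_d_eps(eps, w), huber_d_t(eps, w)
        J = np.zeros((1 + N + m + n, 1 + N + m + n))       # Jacobian of Ehat
        J[0, 0] = 1.0
        J[1:1 + m + n, 0] = kp * y
        J[1:1 + m + n, 1:1 + N] = A
        J[1:1 + m + n, 1 + N:] = kp * eps * np.eye(m + n)
        J[1 + m + n:, 0] = kc * x - V1
        J[1 + m + n:, 1:1 + N] = np.diag((1.0 + kc * eps) - V2)
        J[1 + m + n:, 1 + N:] = -sigma * V2[:, None] * A.T
        rhs = np.concatenate(([zeta * eps0], np.zeros(m + n + N))) - F
        step = np.linalg.solve(J, rhs)                     # Newton system (17)
        de, dx, dy = step[0], step[1:1 + N], step[1 + N:]
        phi, alpha = np.linalg.norm(F) ** 2, 1.0
        while (np.linalg.norm(Ehat(eps + alpha * de, x + alpha * dx,
                                   y + alpha * dy)) ** 2
               > (1.0 - 2.0 * mu * (1.0 - delta) * alpha) * phi
               and alpha > 1e-12):                         # backtracking line search
            alpha *= rho                                   # (1e-12 is a practical guard)
        eps, x, y = eps + alpha * de, x + alpha * dx, y + alpha * dy
    return x.reshape((m, n), order="F"), y

rng = np.random.default_rng(0)                             # toy usage
C = rng.random((5, 6))
a = rng.random(5); a /= a.sum()
b = rng.random(6); b /= b.sum()
X_plan, y_dual = squared_smoothing_newton(C, a, b)
```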
Before establishing the convergence properties of Algorithm 1, let us first see how to solve the linear system (17) and also explain the potential advantage of the SqSN method compared to the IPM and the SSN method. Suppose \(\epsilon^{k} \gt 0\). Then we see that
\begin{align*}\widehat{\mathcal{E}}^{\prime}(\epsilon^{k},x^{k},y^{k})=\begin{pmatrix}1 & 0 & 0 \\\kappa_{p}y^{k} & A & \kappa_{p}\epsilon^{k} I_{m+n}\\\kappa_{c}x^{k}-V_{1}^{k} & (1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k} & -\sigma V_{2}^{k}A^{T}\end{pmatrix},\end{align*}
where \(V_{1}^{k}:=\Phi_{1}^{\prime}(\epsilon^{k},w^{k})\) and \(V_{2}^{k}:=\Phi_{2}^{\prime}(\epsilon^{k},w^{k})\) denote the partial derivatives of \(\Phi\) with respect to the first and second arguments at the referenced point with \(w^{k}:=x^{k}-\sigma(c-A^{T}y^{k})\), respectively. Note also that \(\left\lvert(V_{1}^{k})_{ii}\right\rvert\in[0,1/2]\) and \(\left\lvert(V_{2}^{k})_{ii}\right\rvert\in[0,1]\) for any \(i=1,\dots,mn\). From (17), we get \(\Delta\epsilon^{k}=-\epsilon^{k}+\zeta_{k}\epsilon^{0}\) and
\begin{align*}\begin{pmatrix}A & \kappa_{p}\epsilon^{k}I_{m+n}\\(1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k} & -\sigma V_{2}^{k}A^{T}\end{pmatrix} \begin{pmatrix}\Delta x^{k}\\\Delta y^{k}\end{pmatrix}=\begin{pmatrix}r_{p}^{k}\\r_{c}^{k}\end{pmatrix},\end{align*}
where
\begin{align*}r_{p}^{k}:= & \;d-Ax^{k}-\kappa_{p}\epsilon^{k}y^{k}-\kappa_{p}y^{k}\Delta \epsilon^{k}, \\r_{c}^{k}:= & \;\Phi(\epsilon^{k},x^{k}+\sigma(A^{T}y^{k}-c))-(1+\kappa_{c} \epsilon^{k})x^{k}-(\kappa_{c}x^{k}-V_{1}^{k})\Delta\epsilon^{k}.\end{align*}
It then follows that \(\Delta x^{k}=\left[(1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k}\right]^{-1}\left (r_{c}^{k}+\sigma V_{2}^{k}A^{T}\Delta y^{k}\right).\) Furthermore, we have that
\begin{align}\left[\kappa_{p}\epsilon^{k}I_{m+n}+\sigma A\left((1+\kappa_{c}\epsilon^{k})I_ {mn}-V_{2}^{k}\right)^{-1}V_{2}^{k}A^{T}\right]\Delta y^{k}=r_{p}^{k}-A\left((1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k}\right)^{-1}r_{c}^{k}.\end{align}
(18)
Denote \(v^{k}:=\mathrm{diag}\left[\left((1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k} \right)^{-1}V_{2}^{k}\right]\in\mathbb{R}^{mn}\) and recall that \(w^{k}:=x^{k}-\sigma(c-A^{T}y^{k})\). It is clear that
\begin{align*}v_{i}^{k}\left\{\begin{array}{ll} \gt 0 & \text{if }w_{i}^{k} \gt 0\\=0 & \text{if }w_{i}^{k}\leq 0\end{array}\right.,\quad\forall i=1,\dots,mn.\end{align*}
Keeping in mind that \(z^{k}=c-A^{T}y^{k}\) for \(k\geq 0\), we see that when \((x^{k},y^{k})\) is close to the optimal solution pair, \(w^{k}\) is expected to have a large number of nonpositive elements and hence \(v^{k}\) is expected to be highly sparse. Moreover, let \(V^{k}:=\mathrm{Mat}(v^{k})\in\mathbb{R}^{m\times n}\). By Fact 1, we see that
\begin{align}A\left((1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k}\right)^{-1}V_{2}^{k}A^{T}= \begin{pmatrix}\mathrm{Diag}(V^{k}e_{n}) & V^{k}\\(V^{k})^{T} & \mathrm{Diag}((V^{k})^{T}e_{m})\end{pmatrix}\in\mathbb{S}^{m+n}.\end{align}
(19)
When \(V^{k}\) is sparse, it is clear that the coefficient matrix for computing \(\Delta y^{k}\) is also sparse. This explains the appealing feature of Algorithm 1 as compared to the IPM, since the attractive sparsity structure of the optimal solution is exploited in Algorithm 1. Moreover, we are always able to reduce (17) to the smaller-scale normal system (18) in Algorithm 1, which is not the case in the SSN method. We can also see that the matrix \(A\) is not explicitly required in the implementation of Algorithm 1, which is another advantage of the proposed method compared to an IPM. Lastly, let \(J:=\mathrm{diag}(I_{m},-I_{n})\), then one can see that
\begin{align}JA\left((1+\kappa_{c}\epsilon^{k})I_{mn}-V_{2}^{k}\right)^{-1}V_{2}^{k}A^{T}J= \begin{pmatrix}\mathrm{Diag}(V^{k}e_{n}) & -V^{k}\\-(V^{k})^{T} & \mathrm{Diag}((V^{k})^{T}e_{m})\end{pmatrix},\end{align}
(20)
which corresponds to the Laplacian matrix of a sparse bipartite graph defined by \(V^{k}\). First, one can identify all the connected components of the bipartite graph, which are pairwise disjoint. The matrix entries linking different components are always zero. Next, by permuting the coefficient matrix of the Newton equation, it can be converted into a block diagonal matrix. Finally, the Newton equation can be decomposed into several smaller equations, each of whose coefficient matrices has the same pattern as (20). In this way, the computational effort for computing the search direction may be further reduced.
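As an illustration of this reduction, the following Python sketch (scipy-based; \(V^{k}\), \(\lambda=\kappa_{p}\epsilon^{k}\), \(\sigma\), and the right-hand side are assumed to be given, and the function name is ours) assembles the coefficient matrix of (18) via Fact 1 and solves it blockwise over the connected components of the bipartite graph induced by the nonzeros of \(V^{k}\):

```python
# Assemble the normal matrix of (18) from V^k = Mat(v^k) and solve it
# component by component; lam * I adds no coupling between components.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import connected_components

def solve_normal_equation(Vk, lam, sigma, rhs):
    Vk = sp.csr_matrix(Vk)
    m, n = Vk.shape
    row_sums = np.asarray(Vk.sum(axis=1)).ravel()          # V^k e_n
    col_sums = np.asarray(Vk.sum(axis=0)).ravel()          # (V^k)^T e_m
    M = sigma * sp.bmat([[sp.diags(row_sums), Vk],
                         [Vk.T, sp.diags(col_sums)]], format="csr") \
        + lam * sp.eye(m + n, format="csr")
    # Connected components of the bipartite graph defined by V^k's nonzeros.
    adj = sp.bmat([[None, Vk], [Vk.T, None]], format="csr")
    _, labels = connected_components(adj, directed=False)
    dy = np.zeros(m + n)
    for comp in np.unique(labels):
        idx = np.flatnonzero(labels == comp)
        block = M[idx, :][:, idx].toarray()                # small dense block
        dy[idx] = np.linalg.solve(block, rhs[idx])
    return dy
```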

3.3 Convergence Properties of Algorithm 1

For the rest of this section, we shall establish the convergence properties of Algorithm 1 which allow us to resolve the two difficulties of the SSN method as mentioned at the end of the last section. The tools for our analysis are adapted from those in the literature (see, e.g., [Sun et al., 2004]).
We first show that the coefficient matrix in Equation (17) is nonsingular whenever \(\epsilon^{k} \gt 0\).
Lemma 1.
\(\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)\) is nonsingular for any \((\epsilon,x,y)\in\mathbb{R}_{++}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\).
Proof.
For any \(\epsilon \gt 0\), it is easy to see that \(\widehat{\mathcal{E}}\) is continuously differentiable, and it holds that
\begin{align*}\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)=\begin{pmatrix}1 & 0 & 0\\\kappa_{p}y & A & \kappa_{p}\epsilon I_{m+n}\\\kappa_{c}x-V_{1} & (1+\kappa_{c}\epsilon)I_{mn}-V_{2} & -\sigma V_{2}A^{T}\end{pmatrix},\end{align*}
where \(V_{1}:=\Phi_{1}^{\prime}(\epsilon,w)\) and \(V_{2}:=\Phi_{2}^{\prime}(\epsilon,w)\) denote the partial derivatives of \(\Phi\) with respect to the first and the second arguments at the referenced point with \(w:=x-\sigma(c-A^{T}y)\), respectively. To show that \(\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)\) is nonsingular, it suffices to show that the following linear system
\begin{align*}\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)(\Delta\epsilon;\Delta x;\Delta y)=0\end{align*}
only has a trivial solution. It is obvious that \(\Delta\epsilon=0\). Since
\begin{align*}\left[(1+\kappa_{c}\epsilon)I_{mn}-V_{2}\right]\Delta x-\sigma V_{2}A^{T} \Delta y=0,\end{align*}
it follows that \(\Delta x=\sigma\left[(1+\kappa_{c}\epsilon)I_{mn}-V_{2}\right]^{-1}V_{2}A^{T}\Delta y\). As a consequence, we get
\begin{align*}\left[\kappa_{p}\epsilon I_{m+n}+\sigma A\left((1+\kappa_{c}\epsilon)I_{mn}-V_ {2}\right)^{-1}V_{2}A^{T}\right]\Delta y=0.\end{align*}
Clearly, the coefficient matrix of the above equation is symmetric positive definite. It then follows that \(\Delta y=0\) which further implies that \(\Delta x=0\). Therefore, the proof is completed. ∎
Lemma 1 shows that the linear system in Equation (17) is always solvable, and the Newton direction \((\Delta\epsilon^{k},\Delta x^{k},\Delta y^{k})\) is well-defined. Next, we shall show that the line search procedure in Algorithm 1 is well-defined, i.e., \(\ell_{k}\) is always finite as long as \(\epsilon^{k}\) is positive.
Lemma 2.
For any \((\epsilon,x,y)\) with \(\epsilon \gt 0\), there exists a positive scalar \(\bar{\alpha}\in(0,1]\) such that for any \(\alpha\in(0,\bar{\alpha}]\), it holds that
\begin{align*}\phi(\epsilon+\alpha\Delta\epsilon,x+\alpha\Delta x,y+\alpha\Delta y)\leq\left [1-2\mu(1-\delta)\alpha\right]\phi(\epsilon,x,y),\end{align*}
where \((\Delta\epsilon,\Delta x,\Delta y)\) satisfies
\begin{align*}\widehat{\mathcal{E}}(\epsilon,x,y)+\widehat{\mathcal{E}}^{\prime}(\epsilon,x, y)(\Delta\epsilon;\Delta x;\Delta y)=(\zeta(\epsilon,x,y)\epsilon^{0};0;0),\end{align*}
\(\mu\in(0,1/2)\) , \(\epsilon^{0} \gt 0\), and \(\delta:=r\epsilon^{0} \lt 1\).
Proof.
First, since \(\epsilon \gt 0\), by Lemma 1, we see that \(\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)\) is nonsingular. Hence, \((\Delta\epsilon,\Delta x,\Delta y)\) is well-defined. Next, we can check that
\begin{align}& \left\langle\nabla\phi(\epsilon,x,y),(\Delta\epsilon,\Delta x,\Delta y)\right\rangle\;=\;\left\langle 2\nabla\widehat{\mathcal{E}}(\epsilon,x,y)\widehat{\mathcal{E}}(\epsilon,x,y),(\Delta\epsilon;\Delta x;\Delta y)\right\rangle \\= & \;\left\langle 2\widehat{\mathcal{E}}(\epsilon,x,y),\widehat{\mathcal{E}}^{\prime}(\epsilon,x,y)(\Delta\epsilon;\Delta x;\Delta y)\right\rangle\;=\;\left\langle 2\widehat{\mathcal{E}}(\epsilon,x,y),(\zeta(\epsilon,x,y)\epsilon^{0};0;0)-\widehat{\mathcal{E}}(\epsilon,x,y)\right\rangle \\\leq & \;-2\phi(\epsilon,x,y)+2r\epsilon^{0}\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert\min\left\{1,\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert^{1+\tau}\right\}.\end{align}
(21)
We consider two possible cases: \(\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert \gt 1\) and \(\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert\leq 1\). If \(\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert \gt 1\), then (21) implies that
\begin{align*}\left\langle\nabla\phi(\epsilon,x,y),(\Delta\epsilon,\Delta x,\Delta y)\right\rangle\leq-2\phi(\epsilon,x,y)+2r\epsilon^{0}\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert\leq-2(1-\delta)\phi(\epsilon,x,y).\end{align*}
If \(\left\lVert\widehat{\mathcal{E}}(\epsilon,x,y)\right\rVert\leq 1\), then (21) implies that
\begin{align*}\left\langle\nabla\phi(\epsilon,x,y),(\Delta\epsilon,\Delta x, \Delta y)\right\rangle&\leq\; -2\phi(\epsilon,x,y)+2r\epsilon^{0}\left\lVert\widehat{\mathcal {E}}(\epsilon,x,y)\right\rVert^{2+\tau} \\&\leq \;-2\phi(\epsilon,x,y)+2r\epsilon^{0}\phi(\epsilon,x,y) \\&= \;-2(1-\delta)\phi(\epsilon,x,y).\end{align*}
Thus, in both cases, it always holds that
\begin{align*}\left\langle\nabla\phi(\epsilon,x,y),(\Delta\epsilon,\Delta x,\Delta y)\right \rangle\leq-2(1-\delta)\phi(\epsilon,x,y).\end{align*}
Now by the Taylor expansion, we get
\begin{align*}\phi(\epsilon+\alpha\Delta\epsilon,x+\alpha\Delta x,y+\alpha \Delta y)= & \;\phi(\epsilon,x,y)+\alpha\left\langle\nabla\phi(\epsilon,x,y),(\Delta\epsilon,\Delta x,\Delta y)\right\rangle+o(\alpha) \\\leq & \;\phi(\epsilon,x,y)-2\alpha(1-\delta)\phi(\epsilon,x,y)+o(\alpha).\end{align*}
Using the fact that \(\mu\in(0,1/2)\), there exists \(\bar{\alpha}\in(0,1]\) such that for \(\alpha\in(0,\bar{\alpha}]\), it always holds that
\begin{align*}\phi(\epsilon+\alpha\Delta\epsilon,x+\alpha\Delta x,y+\alpha\Delta y)\leq\left [1-2\mu(1-\delta)\alpha\right]\phi(\epsilon,x,y),\end{align*}
which completes the proof. ∎
Lemma 2 indicates that when \(\epsilon^{k} \gt 0\), the line search procedure in Algorithm 1 is always well-defined. Additionally, we need to show that Algorithm 1 generates a sequence \(\{\epsilon^{k}\}\) that is always positive before termination. The next lemma shows that \(\epsilon^{k}\) is indeed lower bounded by \(\zeta_{k}\epsilon^{0}\), which is positive before termination.
Lemma 3.
Suppose at the \(k\)th iteration of Algorithm 1 that \(\epsilon^{k} \gt 0\) and \(\epsilon^{k}\geq\zeta(\epsilon^{k},x^{k},y^{k})\epsilon^{0}\). Then, for any \(\alpha\in[0,1]\) satisfying
\begin{align}\phi(\epsilon^{k}+\alpha\Delta\epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+ \alpha\Delta y^{k})\leq\left[1-2\mu(1-\delta)\alpha\right]\phi(\epsilon^{k},x^ {k},y^{k}),\end{align}
(22)
it holds that \(\epsilon^{k}+\alpha\Delta\epsilon^{k}\geq\zeta(\epsilon^{k}+\alpha\Delta \epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+\alpha\Delta y^{k})\epsilon^{0}.\)
Proof.
From (17), it is not difficult to see that \(\Delta\epsilon^{k}=-\epsilon^{k}+\zeta(\epsilon^{k},x^{k},y^{k})\epsilon^{0}\). Hence, \(\Delta\epsilon^{k}\leq 0\), and for any \(\alpha\in[0,1]\), it holds that
\begin{align*}& \;\epsilon^{k}+\alpha\Delta\epsilon^{k}-\zeta(\epsilon^{k}+\alpha \Delta\epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+\alpha\Delta y^{k})\epsilon^ {0} \\\geq & \;\epsilon^{k}+\Delta\epsilon^{k}-\zeta(\epsilon^{k}+\alpha\Delta \epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+\alpha\Delta y^{k})\epsilon^{0} \\= & \;\zeta(\epsilon^{k},x^{k},y^{k})\epsilon^{0}-\zeta(\epsilon^{k}+ \alpha\Delta\epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+\alpha\Delta y^{k}) \epsilon^{0} \\\geq & \;0,\end{align*}
where in the last inequality we have used the fact that (22) implies that
\begin{align*}\zeta(\epsilon^{k},x^{k},y^{k})\geq\zeta(\epsilon^{k}+\alpha\Delta\epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+\alpha\Delta y^{k})\end{align*}
by the definition of the function \(\zeta\). Therefore, the proof is completed. ∎
The above lemma also explains the role of the auxiliary function \(\zeta\) in the algorithmic design. In particular, if we choose \(r\) to be small or \(\tau\) to be close to one, the lower bound \(\zeta_{k}\epsilon^{0}\) is small. As a result, \(\epsilon^{k}\) will also be small. However, a smaller \(\epsilon^{k}\) will worsen the conditioning of the linear system (17). On the other hand, if we choose \(r\) to be large or \(\tau\) to be close to zero, then \(\epsilon^{k}\) will stay further away from zero and the algorithm may need more iterations to converge to an optimal solution. Therefore, there is a tradeoff in choosing the parameters \(r\) and \(\tau\) in the practical implementation of Algorithm 1.
Now, we are ready to state the first convergence result of the Algorithm 1.
Theorem 1.
Algorithm 1 is well-defined and generates an infinite sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\) such that \(\epsilon^{k}\geq\zeta(\epsilon^{k},x^{k},y^{k})\epsilon^{0}\) with the property that any accumulation point \((\bar{\epsilon},\bar{x},\bar{y})\) of the sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\) is a solution of the nonlinear system \(\widehat{\mathcal{E}}(\epsilon,x,y)=0\).
Proof.
It follows from Lemmas 1–3 that Algorithm 1 is well-defined, and it generates an infinite sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\) such that \(\epsilon^{k}\geq\zeta(\epsilon^{k},x^{k},y^{k})\epsilon^{0}\). Let \((\bar{\epsilon},\bar{x},\bar{y})\) be any accumulation point of the sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\), if it exists. By taking a subsequence if necessary, we may assume without loss of generality that \(\{(\epsilon^{k},x^{k},y^{k})\}\) converges to \((\bar{\epsilon},\bar{x},\bar{y})\). Since the line-search scheme is well-defined, it follows that
\begin{align*}\phi(\epsilon^{k+1},x^{k+1},y^{k+1}) \lt \phi(\epsilon^{k},x^{k},y^{k}),\quad \forall k\geq 0.\end{align*}
That is, the sequence \(\{\phi(\epsilon^{k},x^{k},y^{k})\}\) is monotonically decreasing. By the definition of the function \(\zeta\), we see that the sequence \(\{\zeta_{k}\}\) is also monotonically decreasing. Hence, there exist \(\bar{\phi}\) and \(\bar{\zeta}\) such that
\begin{align*}\phi(\epsilon^{k},x^{k},y^{k})\rightarrow\bar{\phi},\quad\zeta_{k}\rightarrow \bar{\zeta},\quad{\rm as}\;k\rightarrow\infty.\end{align*}
As a consequence of the continuity of the function \(\phi\), we see that
\begin{align*}\phi(\bar{\epsilon},\bar{x},\bar{y})=\bar{\phi}\geq 0,\quad\zeta(\bar{\epsilon},\bar{x},\bar{y})=\bar{\zeta}\geq 0,\quad\bar{\epsilon}\geq\zeta(\bar{ \epsilon},\bar{x},\bar{y})\epsilon^{0}.\end{align*}
We next show that \(\bar{\phi}=\phi(\bar{\epsilon},\bar{x},\bar{y})=0\) by contradiction. To this end, let us assume that \(\bar{\phi} \gt 0\). This implies that \(\bar{\zeta} \gt 0\) and that \(\bar{\epsilon}\geq\bar{\zeta}\epsilon^{0} \gt 0\). Then, by Lemma 2, we see that there exist an open neighbourhood \(\mathcal{U}\) of \((\bar{\epsilon},\bar{x},\bar{y})\) and \(\bar{\alpha}\in(0,1]\) such that, for all \(k\geq 0\) sufficiently large, \((\epsilon^{k},x^{k},y^{k})\in\mathcal{U}\) with \(\epsilon^{k} \gt 0\), and for any \(\alpha\in(0,\bar{\alpha}]\), it holds that
\begin{align*}\phi(\epsilon^{k}+\alpha\Delta\epsilon^{k},x^{k}+\alpha\Delta x^{k},y^{k}+ \alpha\Delta y^{k})\leq\left[1-2\mu(1-\delta)\alpha\right]\phi(\epsilon^{k},x^ {k},y^{k}).\end{align*}
The existence of the fixed number \(\bar{\alpha}\in(0,1]\) further indicates that there exists a fixed nonnegative integer \(\ell\) such that \(\rho^{\ell}\in(0,\bar{\alpha}]\) and \(\rho^{\ell_{k}}\geq\rho^{\ell}\) for all \(k\geq 0\) sufficiently large. Therefore, we can check that
\begin{align*}\phi(\epsilon^{k+1},x^{k+1},y^{k+1})\leq\left[1-2\mu(1-\delta)\rho^{\ell_{k}} \right]\phi(\epsilon^{k},x^{k},y^{k})\leq\left[1-2\mu(1-\delta)\rho^{\ell} \right]\phi(\epsilon^{k},x^{k},y^{k}),\end{align*}
for all \(k\geq 0\) sufficiently large. Then, by letting \(k\rightarrow\infty\), we see that \(\bar{\phi}\leq\left[1-2\mu(1-\delta)\rho^{\ell}\right]\bar{\phi}\). This implies that \(\bar{\phi}\leq 0\), which contradicts the assumption that \(\bar{\phi} \gt 0\). Therefore, we conclude that \(\bar{\phi}=0\), i.e., \(\widehat{\mathcal{E}}(\bar{\epsilon},\bar{x},\bar{y})=0\). This completes the proof. ∎
From our previous convergence analysis, we see how the natural merit function \(\phi\) helps us to establish the convergence properties. Without such a continuously differentiable merit function, stronger assumptions and a much more complicated analysis may be required. Therefore, by employing the SqSN method via the Huber function, we obtain a Newton-type method that exploits the solution sparsity and admits global convergence properties. Following the classical results on the local convergence rate of Newton-type methods, we next show that Algorithm 1 has a local superlinear convergence rate under the assumption that every element of \(\partial\widehat{\mathcal{E}}(\bar{\epsilon},\bar{x},\bar{y})\) is nonsingular.
Theorem 2.
Let \((\bar{\epsilon},\bar{x},\bar{y})\) be any accumulation point of the sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\) (if it exists) generated by Algorithm 1. Suppose that every element of \(\partial\widehat{\mathcal{E}}(\bar{\epsilon},\bar{x},\bar{y})\) is nonsingular. Then the whole sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\) converges to \((\bar{\epsilon},\bar{x},\bar{y})\) superlinearly. In particular, it holds that
\begin{align*}\left\lVert\begin{pmatrix}\epsilon^{k+1}-\bar{\epsilon}\\x^{k+1}-\bar{x}\\y^{k+1}-\bar{y}\end{pmatrix}\right\rVert=O\left(\left\lVert\begin{pmatrix} \epsilon^{k}-\bar{\epsilon}\\x^{k}-\bar{x}\\y^{k}-\bar{y}\end{pmatrix}\right\rVert^{1+\tau}\right).\end{align*}
Proof.
The proof is given in Appendix A. ∎
It turns out that the conditions ensuring the local convergence rate of Algorithm 1 are highly related to those for the SSN method. Classical results (see e.g., [Qi and Sun, 1993; Theorem 3.2]) show that the SSN method admits a local quadratic convergence rate under the conditions that \(F\) is strongly semismooth and every element of \(\partial F(\bar{x},\bar{y},\bar{z})\) is nonsingular, where \((\bar{x},\bar{y},\bar{z})\) is a solution to the KKT system \(F(x,y,z)=0\) in Equation (8). The following lemma shows the connection between the nonsingularity of \(\partial F(\bar{x},\bar{y},\bar{z})\) and the nonsingularity of \(\partial\widehat{\mathcal{E}}(0,\bar{x},\bar{y})\).
Lemma 4.
Let \((\bar{x},\bar{y},\bar{z})\) be such that \(F(\bar{x},\bar{y},\bar{z})=0\). Then the following two statements are equivalent.
(1)
Every element of \(\partial F(\bar{x},\bar{y},\bar{z})\) is nonsingular.
(2)
Every element of \(\partial\widehat{\mathcal{E}}(0,\bar{x},\bar{y})\) is nonsingular.
Proof.
We first show statement 1 implies statement 2. Suppose that \(U\in\partial\widehat{\mathcal{E}}(0,\bar{x},\bar{y})\). Then there exists \(V\in{\partial\Phi(0,\bar{x}-\sigma(c-A^{T}\bar{y}))}\) such that
\begin{align*}U(\Delta\epsilon,\Delta x,\Delta y)=\begin{pmatrix}\Delta\epsilon\\A\Delta x+\kappa_{p}\bar{y}\Delta\epsilon\\\Delta x+\kappa_{c}\bar{x}\Delta\epsilon-V(\Delta\epsilon,\Delta x+\sigma A^{T}\Delta y)\end{pmatrix},\quad\forall(\Delta\epsilon,\Delta x,\Delta y)\in\mathbb{R}\times\mathbb{R}^{mn}\times\mathbb{R}^{m+n}.\end{align*}
To show that \(U\) is nonsingular, it suffices to show that \(U(\Delta\epsilon,\Delta x,\Delta y)=0\), implies that \((\Delta\epsilon,\Delta x,\Delta y)=0\). To this end, we assume that \(U(\Delta\epsilon,\Delta x,\Delta y)=0\). Then clearly, it holds that \(\Delta\epsilon=0\) and
\begin{align}\begin{pmatrix}A\Delta x\\-A^{T}\Delta y-\Delta z\\\Delta x-V(0,\Delta x-\sigma\Delta z)\end{pmatrix}=\begin{pmatrix}0\\0\\0\end{pmatrix},\end{align}
(23)
by denoting that \(\Delta z:=-A^{T}\Delta y\). By the expression of \(\partial h(0,t)\) given in Equation (12), we see that there exists \(W\in\partial\Pi_{+}(\bar{x}-\sigma(c-A^{T}\bar{y}))\) such that \(V(0,h)=W(h)\) for any \(h\in\mathbb{R}^{mn}\). Then the last equation of linear system (23) can be rewritten as \(\Delta x-W(\Delta x-\sigma\Delta z)=0\), and the resulting linear system only has the trivial solution since statement 1 holds true. Therefore, \(U\) is nonsingular. Similarly, one can show that statement 2 implies statement 1. The proof is completed. ∎
By the above lemma, we can see that the conditions for ensuring the local convergence rate of Algorithm 1 are not stronger than those for the SSN method. Furthermore, the nonsingularity assumption is equivalent to the primal and dual linear independence constraint qualifications and the local Lipschitz homeomorphism of \(F(x,y,z)\) near \((\bar{x},\bar{y},\bar{z})\) (see e.g., [Chan and Sun, 2008; Kong et al., 2011]). These equivalent conditions are quite strong and may not hold in general. Surprisingly, in our numerical experiments, we indeed observe such local superlinear convergence empirically. This is reasonable because the nonsingularity assumption may hold in the selected experiments. Moreover, the nonsingularity assumption is sufficient but may not be necessary for superlinear convergence. For example, the superlinear convergence of the projected SSN method is established in [Hu et al., 2022] under local smoothness and error bound conditions on a submanifold, without the nonsingularity assumption. More recently, the local quadratic convergence of a modified two-step SSN-type method for the nearest doubly stochastic matrix problem is established in [Hu et al., 2024] by further exploiting the problem structure, again without the nonsingularity assumption.

4 Extension to the WB Problem

In this section, we proceed to extend our Algorithm 1 to solve the WB problem.

4.1 Problem Statement

First, we briefly recall the Wasserstein distance and describe the problem of computing a WB for a set of discrete probability distributions with finite support points. Given two discrete distributions \(\mathcal{P}^{(1)}=\{(a_{i}^{(1)}, {q}_{i}^{(1)}):i=1,\dots,m_{1}\}\) and \(\mathcal{P}^{(2)}=\{(a_{i}^{(2)},{q}_{i}^{(2)}):i=1,\dots,m_{2}\}\), the \(p\)-Wasserstein distance \(\mathcal{W}_{p}(\mathcal{P}^{(1)},\mathcal{P}^{(2)})\) is defined by the following OT problem:
\begin{align*}\left(\mathcal{W}_{p}\left(\mathcal{P}^{(1)},\mathcal{P}^{(2)}\right)\right)^{p}:=\min\limits_{X\in\mathbb{R}^{m_{1}\times m_{2}}}\Big\{\left\langle X,\mathcal{D}\left(\mathcal{P}^{(1)},\mathcal{P}^{(2)}\right)\right\rangle\mid X{e}_{m_{2}}={a}^{(1)},\;X^{\top}{e}_{m_{1}}={a}^{(2)},\;X\geq 0\Big\},\end{align*}
where \(\mathcal{D}(\mathcal{P}^{(1)},\mathcal{P}^{(2)})\in\mathbb{R}^{m_{1}\times m_{ 2}}\) is the distance matrix with \(\mathcal{D}(\mathcal{P}^{(1)},\mathcal{P}^{(2)})_{ij}=\lVert q_{i}^{(1)}-q_{j} ^{(2)}\rVert_{p}^{p}\) and \(p\geq 1\). Then, given a set of discrete probability distributions \(\{\mathcal{P}^{(t)}\}_{t=1}^{N}\) with \(\mathcal{P}^{(t)}=\{(a_{i}^{(t)}, {q}_{i}^{(t)}):1\leq i\leq n\}\), a \(p\)-WB \(\mathcal{P}:=\{(w_{i}, {q}_{i}):i=1,\dots,m\}\) with \(m\) support points is an optimal solution to the following problem
\begin{align}\min\Big{\{}\sum_{t=1}^{N}\gamma_{t}\left(\mathcal{W}_{p}\left(\mathcal{P},\mathcal{P}^{(t)}\right)\right)^{p}\mid w\in\mathbb{R}^{m}_{+},\;e _{m}^{\top}w=1,\;q_{1},\ldots,q_{m}\in\mathbb{R}^{d}\Big{\}}\end{align}
(24)
for given weights \(\left(\gamma_{1},\dots,\gamma_{N}\right)\) satisfying \(\sum_{t=1}^{N}\gamma_{t}=1\) and \(\gamma_{t} \gt 0,\;t=1,\dots,N\). Note that problem (24) is a non-convex optimization problem, in which one needs to find the optimal support \({q}:=\{q_{1},\dots,q_{m}\}\) and the optimal weight vector \({w}:=(w_{1},\dots,w_{m})\) of a barycenter simultaneously. In many real applications, the support \({q}\) of a barycenter is pre-specified. Thus, one only needs to find the weight vector \({w}\) of a barycenter. In view of this, from now on, we assume that the support \({q}\) is given. Consequently, the problem (24) reduces to the WB problem with fixed support points
\begin{align}\begin{array}{cl} \min\limits_{{w},\{\Pi^{(t)}\!\}}&\sum\nolimits^{N}_{t=1}\langle D^{(t)}, \Pi^{(t)}\rangle\\\mathrm{s.t.} & \Pi^{(t)}{e}_{m_{t}}={w}, {}(\Pi^{(t)})^{\top}{e}_{m}={a}^{(t)}, {}\Pi^{(t)}\geq 0, {} {}t=1,\dots,N,\\& {e}^{\top}_{m}{w}=1, {}{w}\geq 0,\end{array}\end{align}
(25)
where \(D^{(t)}\) denotes \(\gamma_{t}\mathcal{D}(\mathcal{P}, \mathcal{P}^{(t)})\) for simplicity, for \(t=1,\dots,N\). For convenience, we assume that the dimensions of the given distributions \(a^{(t)}\) are all equal to \(n\); the case with different dimensions can be handled in a straightforward manner. Since the last two constraints \({e}^{\top}_{m}{w}=1,{}{w}\geq 0\) are redundant, we remove them and reformulate the problem (25) as the following LP
\begin{align}\begin{array}{cl}\min\limits_{x}\big{\{}\langle c,x\rangle\;\mid\;{A }x={b},\;x\geq 0\big{\}},\end{array}\end{align}
(26)
where
\(x=\left(\operatorname{vec}\left(\Pi^{(1)}\right);\dots;\operatorname{vec}\left(\Pi^{(N)}\right);{w}\right)\in\mathbb{R}^{Nmn+m}\);
\(b=\left({a}^{(1)};{a}^{(2)};\dots;{a}^{(N)};0_{m};\dots;0_{m}\right)\in\mathbb {R}^{Nn+Nm}\);
\(c=\left(\operatorname{vec}\left(D^{(1)}\right);\dots;\operatorname{vec}\left(D ^{(N)}\right);0_{m}\right)\in\mathbb{R}^{Nmn+m}\);
\(A=\left(\begin{array}{cc}A_{1}&0\\ A_{2}&A_{3}\end{array}\right)\in\mathbb{R}^{(Nn+Nm)\times(Nmn+m)}\), \(A_{1}=\operatorname{Diag}\left(I_{n}\otimes{e}_{m}^{\top},\dots,I_{n}\otimes{e }_{m}^{\top}\right)\in\mathbb{R}^{Nn\times Nmn},\)
\(A_{2}=\operatorname{Diag}\left({e}_{n}^{\top}\otimes I_{m},\dots,{e}_{n}^{\top }\otimes I_{m}\right)\in\mathbb{R}^{Nm\times Nmn}\), and \(A_{3}=-{e}_{N}\otimes I_{m}\in\mathbb{R}^{Nm\times m}.\)
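For concreteness, the following Julia sketch assembles the LP data \((A,b,c)\) in (26) from the (already weighted) cost matrices \(D^{(t)}\) and the sampling distributions \(a^{(t)}\). It is an illustration only; the function name and input layout are chosen for exposition and do not reflect the actual implementation.

```julia
# Illustrative assembly of the LP data (A, b, c) in (26).
using SparseArrays, LinearAlgebra

function build_wb_lp(D::Vector{Matrix{Float64}}, a::Vector{Vector{Float64}})
    N = length(D)
    m, n = size(D[1])                                   # barycenter size m, sample size n
    emT, enT = sparse(ones(1, m)), sparse(ones(1, n))
    A1 = blockdiag((kron(sparse(1.0I, n, n), emT) for _ in 1:N)...)  # Diag(I_n ⊗ e_m', ...)
    A2 = blockdiag((kron(enT, sparse(1.0I, m, m)) for _ in 1:N)...)  # Diag(e_n' ⊗ I_m, ...)
    A3 = -kron(sparse(ones(N, 1)), sparse(1.0I, m, m))               # -e_N ⊗ I_m
    A  = [A1 spzeros(N * n, m); A2 A3]
    b  = vcat(a..., zeros(N * m))                       # (a^(1); ...; a^(N); 0; ...; 0)
    c  = vcat(vec.(D)..., zeros(m))                     # (vec(D^(1)); ...; vec(D^(N)); 0)
    return A, b, c
end
```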

4.2 Newton Equations

Algorithm 1 can be directly extended to solve the WB problem (26). Similar to the linear system (18) for OT problems, the most expensive step is to tackle a structured system of linear equations w.r.t. \(\Delta y\) of the following form
\begin{align}\left(\lambda I+A\Theta A^{T}\right)\Delta y=R,\end{align}
(27)
where \(\lambda=\kappa_{p}\epsilon^{k} \gt 0\) is a scalar, and
\begin{align*}\begin{array}{ll} \Theta:=\left(\left(1+\kappa_{c}\epsilon^{k}\right)I-V_{2}^{k}\right)^{-1}V_{2}^{k} & \in\mathbb{R}^{(Nmn+m)\times(Nmn+m)},\\R:=r_{p}^{k}-A\left(\left(1+\kappa_{c}\epsilon^{k}\right)I-V_{2}^{k}\right)^{-1}r_{c}^{k} & \in\mathbb{R}^{Nn+Nm}.\end{array}\end{align*}
We aim to accelerate the computation by exploiting the special structure of the matrix \(A\). Recall that for the OT problem, we can leverage Fact 1 to simplify the coefficient matrix of \(\Delta y\) to Equation (19). Here, for the WB problem, we use a similar technique to simplify the coefficient matrix in Equation (27) as follows. First, one can verify that the matrix \(\Theta\) is a nonnegative diagonal matrix. In particular, \(\Theta\) takes the following form
\begin{align*}\Theta:=\left(\begin{array}{cccc}\operatorname{Diag}(\theta^{(1)}) & & & \\ & \ddots & & \\ & & \operatorname{Diag}(\theta^{(N)}) & \\ & & & \operatorname{Diag}(\bar{\theta})\end{array}\right)\!,\end{align*}
where \(\theta^{(t)}\in\mathbb{R}^{mn}_{+}\) for \(t=1,\ldots,N\) and \(\bar{\theta}\in\mathbb{R}^{m}_{+}\). Define \(V_{t}:=\mathrm{Mat}(\theta^{(t)})\in\mathbb{R}^{m\times n}\) for \(t=1,\dots,N\). Then the system of linear Equation (27) can be equivalently written as
\begin{align}(\lambda I+A\Theta A^{T})\Delta y=\left(\begin{array}{cc}E_{1} & E_{2}\\E_{2}^{T} & E_{3}\end{array}\right)\left(\begin{array}{l}\Delta y_{1}\\\Delta y_{2}\end{array}\right)=\left(\begin{array}{c}R_{1}\\R_{2}\end{array}\right),\end{align}
(28)
where \(\Delta y:=(\Delta y_{1};\Delta y_{2})\in\mathbb{R}^{Nn+Nm},R:=(R_{1};R_{2})\in \mathbb{R}^{Nn+Nm}\), and
\begin{align*}\begin{array}{ll}E_{1}:={\operatorname{Diag}\left(\left(V_{1}^{T}e_{m};\dots;V_{N}^{T}e_{m}\right)\right)}+\lambda I & \in\mathbb{R}^{Nn\times Nn},\\E_{2}:=\operatorname{Diag}\left({V}_{1}^{T},\dots,{V}_{N}^{T}\right) & \in \mathbb{R}^{Nn\times Nm},\\E_{3}:={\mathrm{Diag}\left(\left({V}_{1}e_{n};\dots;{V}_{N}e_{n}\right) \right)}+(e_{N}e_{N}^{T})\otimes\mathrm{Diag}({\bar{\theta}})+\lambda I & \in \mathbb{R}^{{Nm\times Nm}}.\end{array}\end{align*}
To reduce the size of the linear system, we next eliminate \(\Delta y_{1}\) and compute \(\Delta y_{2}\) by
\begin{align}{S}\Delta y_{2}={R}_{3},\end{align}
(29)
where \({R}_{3}:=R_{2}-E_{2}^{T}E_{1}^{-1}R_{1}\in\mathbb{R}^{Nm}\) and \(S:=E_{3}-E_{2}^{T}E_{1}^{-1}E_{2}\) is the Schur complement matrix such that
\begin{align*}\begin{array}{lll}{S} &: =\mathrm{Diag}\left({S}_{1},\dots,{S}_{N}\right){+ \big{(}e_{N}\otimes\mathrm{Diag}(\sqrt{\bar{\theta}})\big{)}\big{(}e_{N} \otimes\mathrm{Diag}(\sqrt{\bar{\theta}})\big{)}^{\top}} & \in\mathbb{R}^{Nm \times Nm},\\{S}_{t} &: ={\mathrm{Diag}({V}_{t}e_{n}+{\lambda e_{m}})-{V_{t}}\mathrm{Diag} \left(V_{t}^{T}e_{m}+{\lambda e_{n}}\right)^{-1}{V}_{t}^{T}} & \in\mathbb{R}^{m \times m},\;{t=1,\dots,N}.\end{array}\end{align*}
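To make this structure concrete, the following Julia sketch assembles \(S\) from the matrices \(V_{t}\) and the vector \(\bar{\theta}\); it is illustrative only, with the helper name and input layout chosen for exposition.

```julia
# Illustrative assembly of the Schur complement S in (29).
using SparseArrays, LinearAlgebra

function assemble_schur(V::Vector{<:SparseMatrixCSC}, thetabar::Vector{Float64}, lambda::Float64)
    N = length(V)
    m, n = size(V[1])
    blocks = map(V) do Vt
        rowsum = Vt * ones(n) .+ lambda               # V_t e_n + lambda e_m
        colinv = 1.0 ./ (Vt' * ones(m) .+ lambda)     # (V_t' e_m + lambda e_n)^{-1}, elementwise
        spdiagm(0 => rowsum) - Vt * spdiagm(0 => colinv) * Vt'   # the block S_t
    end
    U = kron(sparse(ones(N, 1)), spdiagm(0 => sqrt.(thetabar)))  # e_N ⊗ Diag(sqrt(thetabar))
    return blockdiag(blocks...) + U * U'              # block-diagonal part plus low-rank term
end
```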
It is clear that the coefficient matrix \(S\) is sparse and positive definite, so (29) can be solved either directly by a sparse Cholesky decomposition or iteratively by the preconditioned conjugate gradient (PCG) method. Note that when forming the matrices \(S_{t}\) for \(t=1,\dots,N\), matrix inversions are only applied to diagonal matrices. Therefore, the cost of forming the coefficient matrix is relatively modest. Finally, it follows from (28) that
\begin{align*}\Delta y_{1}=E_{1}^{-1}\left(R_{1}-E_{2}\Delta y_{2}\right)\!.\end{align*}
We can also compute \(\Delta y_{1}\) efficiently thanks to the diagonal block structure of \(E_{1}\) and \(E_{2}\). To see this, we denote \(\Delta y_{1}:=(\Delta y_{1}^{1};\dots;\Delta y_{1}^{N})\in\mathbb{R}^{Nn}\), \(\Delta y_{2}:=(\Delta y_{2}^{1};\dots;\Delta y_{2}^{N})\in\mathbb{R}^{Nm}\) and \(R_{1}:=(R_{1}^{1};\dots;R_{1}^{N})\in\mathbb{R}^{Nn}\), then
\begin{align*}\Delta y_{1}^{t}=(R_{1}^{t}-{V}_{t}^{T}\Delta y_{2}^{t})\oslash({V}_{t}^{T}e_{ m}+\lambda e_{n}),\quad t=1,\dots,N,\end{align*}
where “\(\oslash\)” denotes the element-wise division operator. Moreover, observe that the coefficient matrix in Equation (29) is the sum of a symmetric positive definite block-diagonal matrix and a low-rank matrix. Thus, one can also solve (29) efficiently by using the Sherman–Morrison–Woodbury formula. In our numerical experiments, we solve Equation (29) iteratively by the PCG method with an incomplete Cholesky factorization as the preconditioner during the first few iterations, and switch to a direct solve by sparse Cholesky decomposition once the number of PCG steps exceeds 80.
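The following sketch puts these two steps together: it solves (29) by a sparse Cholesky factorization of \(S\) (for instance, the one assembled above) and then recovers \(\Delta y_{1}\) block by block via the element-wise division. In the actual implementation this direct solve is interchanged with PCG as described, and the function name is illustrative.

```julia
# Illustrative direct solve of (29) and blockwise recovery of Δy1.
using SparseArrays, LinearAlgebra

function solve_reduced_system(S::SparseMatrixCSC, V::Vector{<:SparseMatrixCSC},
                              lambda::Float64, R1::Vector{Float64}, R3::Vector{Float64})
    dy2 = cholesky(Symmetric(S)) \ R3                 # sparse Cholesky solve of S Δy2 = R3
    N = length(V)
    m, n = size(V[1])
    dy1 = similar(R1)
    for t in 1:N
        i1 = (t - 1) * n + 1 : t * n                  # block of Δy1 and R1 for the t-th marginal
        i2 = (t - 1) * m + 1 : t * m                  # block of Δy2 for the t-th marginal
        dy1[i1] = (R1[i1] .- V[t]' * dy2[i2]) ./ (V[t]' * ones(m) .+ lambda)
    end
    return dy1, dy2
end
```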

5 Numerical Experiments

In this section, we conduct a set of numerical experiments and present the corresponding computational results to demonstrate the efficiency of the proposed Algorithm 1.

5.1 Experimental Settings

We implement our algorithm purely in Julia (version 1.8.2) and compare it with the highly optimized commercial and open-source LP solvers Gurobi (version 9.5.1), HiGHS (version 1.3.0), and CPLEX (version 22.1.0.0), which are called via their Julia interfaces through JuMP (Footnote 3). In particular, we compare our algorithm with the barrier methods of these solvers and with the network simplex method implemented in CPLEX (Footnote 4). Our experiments are run on a Linux PC equipped with Intel Xeon E5-2680 v3 CPUs and 96 GB of RAM.
Given an approximate KKT solution \((x,y,z)\in\mathbb{R}^{mn}\times\mathbb{R}^{m+n}\times\mathbb{R}^{mn}\), we define the maximal relative KKT residue as follows
\begin{align*}\eta_{\rm max}:=\max\left\{\eta_{p}:=\frac{\left\lVert Ax-d\right \rVert}{1+\left\lVert d\right\rVert},\eta_{d}:=\frac{\left\lVert A^{T}y+z-c \right\rVert}{1+\left\lVert c\right\rVert},\eta_{c}:=\frac{\left\lVert x-\Pi_{ +}(x-z)\right\rVert}{1+\left\lVert x\right\rVert+\left\lVert z\right\rVert} \right\}.\end{align*}
Note that \(\eta_{d}\) is always zero in our case based on our algorithmic design. To measure the duality gap, we also define the following relative gap
\begin{align*}\eta_{g}:=\frac{\left\lvert\left\langle c,x\right\rangle-\left\langle d,y \right\rangle\right\rvert}{1+\left\lvert\left\langle c,x\right\rangle\right \rvert+\left\lvert\left\langle d,y\right\rangle\right\rvert}.\end{align*}
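These quantities are straightforward to evaluate from the returned triple \((x,y,z)\) and the problem data; a minimal (illustrative) helper is given below, where \(\Pi_{+}\) is realized by an element-wise maximum with zero.

```julia
# Illustrative computation of the relative KKT residues and the relative duality gap.
using LinearAlgebra

function kkt_measures(A, d, c, x, y, z)
    eta_p = norm(A * x - d) / (1 + norm(d))
    eta_d = norm(A' * y + z - c) / (1 + norm(c))
    eta_c = norm(x - max.(x - z, 0)) / (1 + norm(x) + norm(z))
    eta_g = abs(dot(c, x) - dot(d, y)) / (1 + abs(dot(c, x)) + abs(dot(d, y)))
    return max(eta_p, eta_d, eta_c), eta_g            # (eta_max, eta_g)
end
```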
For a user-specified stopping tolerance \(\tt{tol} \gt 0\), we terminate the algorithm when the smoothing parameter \(\epsilon\) is smaller than \(\tt{tol}\times 10^{-2}\), when \(\max\{\eta_{\rm max},\eta_{g}\}\leq\tt{tol}\), or when the maximum running time \(\tt{TimeLimit}\) or the maximum number of iterations \(\tt{maxIter}\) is reached. In our implementation, we set \(\tt{tol}=10^{-8}\), \(\tt{TimeLimit}=24\) hours, and \(\tt{maxIter}=1000\). For the barrier methods and the network simplex method, we also set their stopping tolerances to \(\tt{tol}\) and compute the corresponding KKT residues \(\eta_{\rm max}\) from the returned approximate solutions. For OT problems, we turn off the crossover phase for all LP solvers, as it is too time-consuming and in general does not improve the quality of the output solutions; we turn on the presolving phase for the barrier methods, since it usually enhances numerical stability and improves convergence; and we turn off the presolving phase for the network simplex method in CPLEX, since it consumes considerable computational time without improving the performance of the network simplex method. We also note that the barrier method implemented in HiGHS uses only one thread. For WB problems, we turn off the presolving phase for the barrier method, since it does not improve the performance in our numerical tests, but we turn it on for the dual simplex method, since it can dramatically decrease the computational time based on our numerical experience. To ensure fair comparisons, we restrict all methods to a single thread for OT problems, while allowing multi-threading for WB problems.
We also mention that for the matrix \(A\) generated from an OT problem, the last row of \(A\) is always redundant. Thus, to improve the practical performance of all the solvers, we always drop the last equality constraint when passing the input data to a solver. Moreover, we notice that applying a suitable scaling scheme to the problem data can improve the efficiency of Algorithm 1. Therefore, in our implementation, for given input data \(d\) and \(c\), Algorithm 1 is executed on \(\hat{d}=d/\left\lVert d\right\rVert\) and \(\hat{c}=c/\left\lVert c\right\rVert\), while the relative KKT residues and the relative duality gap are evaluated on the original data. Finally, we set \(\epsilon^{0}=1\), \(r=0.75\), \(\tau=0.25\), \(\rho=0.5\), \(\mu=10^{-8}\), \(\sigma=\min\{10^{3},\left\lVert c\right\rVert\}\), and \(\kappa_{p}=\kappa_{c}=1\).
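As a small illustration of the two preprocessing steps just described (dropping the redundant equality and rescaling the data), one could use a helper of the following form; the name and return layout are ours.

```julia
# Illustrative preprocessing before calling Algorithm 1 on an OT instance.
using SparseArrays, LinearAlgebra

function preprocess_ot(A::SparseMatrixCSC, d::Vector{Float64}, c::Vector{Float64})
    A, d = A[1:end-1, :], d[1:end-1]        # the last equality constraint is redundant
    dhat, chat = d / norm(d), c / norm(c)   # Algorithm 1 is run on the scaled data
    return A, dhat, chat
end
```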

5.2 Computational Results on DOTmark for OT Problems

In our experiments, the cost matrix \(C\) is chosen based on the squared Euclidean distance. In particular, for \(1\leq i\leq m\) and \(1\leq j\leq n\), \(C_{ij}:=(\ell_{1}-\ell_{2})^{2}+(k_{1}-k_{2})^{2}\), where the indices \(i\) and \(j\) correspond to the \((\ell_{1},k_{1})\) and \((\ell_{2},k_{2})\) pixel positions in the two images, respectively. Since \(C\) may contain very large elements, we normalize it by dividing by its maximum value. The discrete probability distributions \(a\) and \(b\) are obtained by normalizing any two images. We consider images from the DOTmark collection [Schrieber et al., 2016] (a benchmark dataset for discrete OT problems), which contains ten classes of images arising from different scenarios; see Figure 1 for examples of images from each class. Within a specific class containing ten different images, we can pair any two different images and compute the OT between them. Thus, each class generates 45 OT problems, resulting in 450 problems over the ten classes. The DOTmark collection offers images at different resolutions, ranging from \(32\times 32\) to \(512\times 512\). However, due to limited available memory, we only present results for the \(32\times 32\) and \(64\times 64\) resolutions, together with partial results for the \(128\times 128\) resolution. As a consequence, \(450+450+10=910\) OT problems are tested in our experiments. Table 1 summarizes the problem sizes with respect to different image resolutions. We see that for OT problems generated from \(128\times 128\) images, the problem sizes are typically too large to be handled by the LP solvers on our machine.
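For illustration, an instance of this form can be generated from two grayscale images of the same resolution as follows; the helper name is ours, and the pixel ordering matches Julia's column-major vec.

```julia
# Illustrative construction of a DOTmark-style OT instance from two images.
function ot_instance(img1::Matrix{Float64}, img2::Matrix{Float64})
    r, s = size(img1)
    coords = [(l, k) for k in 1:s for l in 1:r]        # pixel positions in column-major order
    C = Float64[(p[1] - q[1])^2 + (p[2] - q[2])^2 for p in coords, q in coords]
    C ./= maximum(C)                                   # normalize the cost matrix
    a = vec(img1) ./ sum(img1)                         # discrete marginals
    b = vec(img2) ./ sum(img2)
    return C, a, b
end
```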
Fig. 1. Example images from the DOTmark collection [Schrieber et al., 2016].
Resolution | #constraints | #variables
\(32\times 32\) | 2,048 | 1,048,576
\(64\times 64\) | 8,192 | 16,777,216
\(128\times 128\) | 32,768 | 268,435,456
Table 1. OT Problem Sizes with Different Image Resolutions
Notice that for most images from the “Shapes” and “MicroscopyImages” classes, the generated probability distributions may contain zero entries. In particular, if the \(i\)th (\(1\leq i\leq m\)) element in \(a\) is zero, then the \(i\)th row of the optimal transportation plan \(X\) is a zero vector. Similarly, if the \(j\)th (\(1\leq j\leq n\)) element in \(b\) is zero, then the \(j\)th column of the optimal transportation plan \(X\) is a zero vector. Therefore, one can drop those zero rows and/or columns in a preprocessing phase to obtain a smaller-scale OT problem. As a consequence, we always solve the resulting smaller OT problems.
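This preprocessing step can be expressed compactly as follows (illustrative only).

```julia
# Illustrative removal of support points with zero mass.
function drop_zero_supports(C::Matrix{Float64}, a::Vector{Float64}, b::Vector{Float64})
    rows, cols = findall(>(0), a), findall(>(0), b)    # indices with positive mass
    return C[rows, cols], a[rows], b[cols]
end
```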
The detailed computational results for the instances with \(64\times 64\) resolution are presented in Table 2. (For brevity, we do not present the results for the instances with \(32\times 32\) resolution, but we note that the relative performance of the various methods is similar to that observed in Table 2.) In the table, the barrier method and the network simplex method in CPLEX are denoted by “CPLEX-Bar” and “CPLEX-Net,” respectively. For each image class, we present the mean values over the 45 OT problems. The column “Iter” reports the mean number of iterations taken by each solver; since CPLEX-Net performs network simplex iterations, whose number is usually very large and which we currently do not know how to extract through JuMP, we set Iter to “-” for CPLEX-Net. The column “Time” reports the mean computational time of each solver for comparing their practical efficiency. The remaining four columns report the mean relative KKT residues and relative duality gap (i.e., \(\eta_{p}\), \(\eta_{d}\), \(\eta_{c}\), and \(\eta_{g}\)) in order to compare the accuracy of the solutions.
Image | Solver | Time (s) | Iter | \(\eta_{p}\) | \(\eta_{d}\) | \(\eta_{c}\) | \(\eta_{g}\)
CauchyDensity | HiGHS | 5.0e+02 | 29 | 2.3e-06 | 8.2e-10 | 2.1e-10 | 1.8e-06
 | Gurobi | 3.0e+02 | 16 | 5.7e-11 | 1.6e-16 | 2.7e-09 | 2.2e-09
 | CPLEX-Bar | 6.0e+02 | 26 | 1.5e-10 | 3.4e-16 | 1.3e-10 | 1.5e-09
 | CPLEX-Net | 3.6e+01 | - | 1.3e-18 | 6.2e-17 | 0.0e+00 | 7.6e-18
 | SmoothNewton | 7.8e+01 | 57 | 3.1e-12 | 0.0e+00 | 3.9e-09 | 4.9e-09
GRFmoderate | HiGHS | 4.4e+02 | 27 | 1.1e-02 | 9.0e-02 | 8.4e-10 | 1.7e-02
 | Gurobi | 2.8e+02 | 15 | 2.5e-12 | 1.5e-16 | 1.6e-09 | 1.8e-09
 | CPLEX-Bar | 5.9e+02 | 25 | 5.2e-10 | 3.1e-15 | 3.4e-10 | 5.8e-09
 | CPLEX-Net | 3.3e+01 | - | 6.8e-19 | 4.6e-17 | 0.0e+00 | 2.6e-18
 | SmoothNewton | 7.1e+01 | 54 | 1.3e-12 | 0.0e+00 | 3.2e-09 | 4.9e-09
GRFsmooth | HiGHS | 4.6e+02 | 28 | 1.8e-06 | 1.5e-12 | 7.5e-10 | 1.1e-06
 | Gurobi | 3.0e+02 | 16 | 1.3e-11 | 1.6e-16 | 2.8e-09 | 2.9e-09
 | CPLEX-Bar | 6.2e+02 | 27 | 4.4e-10 | 3.0e-15 | 4.6e-10 | 4.8e-09
 | CPLEX-Net | 3.8e+01 | - | 7.5e-19 | 5.6e-17 | 0.0e+00 | 3.1e-18
 | SmoothNewton | 7.8e+01 | 58 | 1.9e-12 | 0.0e+00 | 3.0e-09 | 5.0e-09
LogitGRF | HiGHS | 5.2e+02 | 30 | 1.8e-05 | 1.2e-12 | 6.4e-10 | 2.2e-05
 | Gurobi | 3.0e+02 | 16 | 2.7e-12 | 1.5e-16 | 2.4e-09 | 2.6e-09
 | CPLEX-Bar | 6.8e+02 | 30 | 3.6e-10 | 2.4e-15 | 4.1e-10 | 2.9e-09
 | CPLEX-Net | 3.5e+01 | - | 9.1e-19 | 5.0e-17 | 0.0e+00 | 2.9e-18
 | SmoothNewton | 7.7e+01 | 59 | 1.3e-12 | 0.0e+00 | 3.3e-09 | 5.1e-09
Shapes | HiGHS | 9.2e+01 | 21 | 6.4e-07 | 1.6e-09 | 3.1e-10 | 7.0e-08
 | Gurobi | 6.2e+01 | 13 | 1.7e-11 | 9.2e-15 | 4.2e-10 | 1.8e-09
 | CPLEX-Bar | 1.0e+02 | 20 | 7.8e-10 | 9.6e-16 | 2.4e-10 | 4.5e-09
 | CPLEX-Net | 7.5e+00 | - | 3.3e-18 | 3.8e-17 | 0.0e+00 | 8.2e-18
 | SmoothNewton | 2.1e+01 | 52 | 6.3e-11 | 0.0e+00 | 8.8e-09 | 3.6e-09
ClassicImages | HiGHS | 5.1e+02 | 30 | 1.4e-06 | 1.0e-09 | 2.2e-10 | 2.5e-08
 | Gurobi | 3.0e+02 | 15 | 2.6e-12 | 1.4e-16 | 3.2e-09 | 2.5e-09
 | CPLEX-Bar | 5.9e+02 | 25 | 2.3e-10 | 2.0e-15 | 1.7e-10 | 2.4e-09
 | CPLEX-Net | 3.5e+01 | - | 7.6e-19 | 4.1e-17 | 0.0e+00 | 2.0e-18
 | SmoothNewton | 7.1e+01 | 55 | 1.2e-12 | 0.0e+00 | 3.2e-09 | 5.0e-09
GRFrough | HiGHS | 4.9e+02 | 29 | 5.5e-04 | 2.0e-02 | 5.6e-10 | 6.1e-04
 | Gurobi | 2.9e+02 | 15 | 1.5e-12 | 1.4e-16 | 1.8e-09 | 2.2e-09
 | CPLEX-Bar | 5.6e+02 | 24 | 2.5e-10 | 6.1e-15 | 5.0e-10 | 3.5e-09
 | CPLEX-Net | 3.4e+01 | - | 7.3e-19 | 2.5e-17 | 0.0e+00 | 5.7e-19
 | SmoothNewton | 7.4e+01 | 56 | 8.8e-13 | 0.0e+00 | 2.7e-09 | 4.9e-09
LogGRF | HiGHS | 9.1e+02 | 47 | 2.1e-03 | 5.5e-02 | 8.9e-10 | 5.3e-03
 | Gurobi | 3.1e+02 | 16 | 3.0e-12 | 1.7e-16 | 1.5e-09 | 2.3e-09
 | CPLEX-Bar | 5.9e+02 | 25 | 4.7e-10 | 1.9e-14 | 5.4e-10 | 6.1e-09
 | CPLEX-Net | 4.4e+01 | - | 1.6e-18 | 6.3e-17 | 0.0e+00 | 8.7e-18
 | SmoothNewton | 7.6e+01 | 58 | 3.2e-12 | 0.0e+00 | 3.8e-09 | 5.1e-09
MicroscopyImages | HiGHS | 1.1e+02 | 40 | 1.1e-06 | 3.0e-09 | 3.6e-10 | 8.7e-08
 | Gurobi | 3.7e+01 | 15 | 3.6e-12 | 1.5e-16 | 5.3e-09 | 1.2e-09
 | CPLEX-Bar | 6.1e+01 | 25 | 6.6e-10 | 1.1e-14 | 4.6e-10 | 3.4e-09
 | CPLEX-Net | 4.1e+00 | - | 3.6e-18 | 5.1e-17 | 0.0e+00 | 6.4e-18
 | SmoothNewton | 9.8e+00 | 46 | 1.2e-10 | 0.0e+00 | 9.1e-09 | 2.0e-09
WhiteNoise | HiGHS | 4.8e+02 | 27 | 1.0e-02 | 9.5e-02 | 5.0e-10 | 6.4e-03
 | Gurobi | 3.0e+02 | 16 | 8.4e-13 | 1.4e-16 | 1.8e-09 | 1.9e-09
 | CPLEX-Bar | 6.1e+02 | 27 | 3.4e-10 | 4.3e-15 | 5.2e-10 | 4.5e-09
 | CPLEX-Net | 3.5e+01 | - | 1.1e-18 | 1.6e-17 | 0.0e+00 | 2.8e-19
 | SmoothNewton | 6.9e+01 | 56 | 7.9e-13 | 0.0e+00 | 3.3e-09 | 5.1e-09
Table 2. Computational Results for \(64\times 64\) Images
From the computational results, we observe that all the solvers except HiGHS are able to solve all the OT problems accurately. Indeed, the solutions provided by HiGHS are usually less accurate than those of the other solvers. Among all the solvers, CPLEX-Net shows the best performance, which coincides with the results presented in [Schrieber et al., 2016]. Comparing the barrier methods, we observe that CPLEX-Bar and Gurobi have comparable performance, while HiGHS takes more iterations and more computational time than CPLEX-Bar and Gurobi. Although Algorithm 1 takes more iterations than the barrier methods, it is about 2–4 times faster than CPLEX-Bar and Gurobi. This indicates that Algorithm 1 has a lower per-iteration cost than the barrier methods, which is precisely a consequence of exploiting the solution sparsity.
We also present some computational results for images with \(128\times 128\) resolution. Since solving each OT problem of such a huge size is very time-consuming, we only conduct numerical experiments on the first two images in each class; hence only partial results are presented in Table 3. Note also that, except for Algorithm 1, all the other solvers usually require more than 96 GB of RAM to handle these OT problems and hence run out of memory. Therefore, only the numerical results for Algorithm 1 are presented in Table 3. We can see that the accuracy of the computed solutions is still very high. Moreover, Algorithm 1 takes less than 40 minutes to converge on each instance, even though each problem contains more than 268 million nonnegative variables. This indicates that Algorithm 1 is quite efficient and robust. More importantly, it consumes much less memory than the state-of-the-art solvers.
Image | Time (s) | Iter | \(\eta_{p}\) | \(\eta_{d}\) | \(\eta_{c}\) | \(\eta_{g}\)
CauchyDensity | 2.21e+03 | 96 | 1.1e-12 | 0.0e+00 | 2.8e-09 | 3.4e-09
GRFmoderate | 2.17e+03 | 97 | 6.8e-13 | 0.0e+00 | 4.0e-10 | 5.4e-09
GRFsmooth | 2.18e+03 | 93 | 8.7e-13 | 0.0e+00 | 3.7e-09 | 5.2e-09
LogitGRF | 1.85e+03 | 86 | 8.0e-13 | 0.0e+00 | 4.1e-10 | 5.6e-09
Shapes | 4.92e+02 | 48 | 3.1e-12 | 0.0e+00 | 2.6e-09 | 4.1e-09
ClassicImages | 2.22e+03 | 101 | 6.4e-13 | 0.0e+00 | 4.3e-10 | 5.7e-09
GRFrough | 2.30e+03 | 97 | 7.6e-13 | 0.0e+00 | 4.4e-10 | 5.8e-09
LogGRF | 2.28e+03 | 108 | 9.2e-13 | 0.0e+00 | 4.1e-10 | 5.5e-09
MicroscopyImages | 5.74e+02 | 88 | 1.4e-12 | 0.0e+00 | 1.2e-09 | 5.4e-09
WhiteNoise | 2.21e+03 | 99 | 8.0e-13 | 0.0e+00 | 4.4e-10 | 5.3e-09
Table 3. Partial Computational Results for Algorithm 1 on \(128\times 128\) Images
To further evaluate the memory consumption of each solver, we also measure the peak memory used by each solver when solving the problems. We use the program “/usr/bin/time” provided by the Linux system for our measurements; its output contains the entry “maximum resident set size” (maximum RSS), which provides a reasonable estimate of the peak memory usage. Since collecting these measurements is quite time-consuming, we only measure the memory usage of each solver on the OT problems derived from the first two images in each image class. The results on memory usage are summarized in Table 4. We can see that Algorithm 1 indeed uses much less memory than the other solvers, which further demonstrates its potential for solving OT problems at scale.
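For instance, on Linux the maximum RSS of a single run can be captured as follows, where the script name and instance path are placeholders.

```julia
# Illustrative memory measurement via GNU time; "-v" prints the maximum resident set size.
logfile = "memory.log"
run(pipeline(`/usr/bin/time -v julia run_ot_instance.jl data/CauchyDensity_1_2.csv`,
             stderr = logfile))
# Look for the line "Maximum resident set size (kbytes): ..." in memory.log.
```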
Image | Resolution | HiGHS | Gurobi | CPLEX-Bar | CPLEX-Net | SmoothNewton
CauchyDensity | \(64\times 64\) | 22282 | 22604 | 19489 | 20224 | 3456
GRFmoderate | \(64\times 64\) | 22281 | 22503 | 19500 | 20227 | 3751
GRFsmooth | \(64\times 64\) | 22281 | 24152 | 19486 | 20225 | 3550
LogitGRF | \(64\times 64\) | 22331 | 21639 | 19492 | 20217 | 3850
Shapes | \(64\times 64\) | 7524 | 9236 | 9515 | 8924 | 2254
ClassicImages | \(64\times 64\) | 22282 | 21668 | 19470 | 20228 | 3744
GRFrough | \(64\times 64\) | 22281 | 24155 | 19499 | 20237 | 3541
LogGRF | \(64\times 64\) | 22330 | 24158 | 19491 | 20226 | 3803
MicroscopyImages | \(64\times 64\) | 2632 | 3049 | 2305 | 2471 | 919
WhiteNoise | \(64\times 64\) | 22282 | 24158 | 19489 | 20229 | 3572
Table 4. Maximum RSS (MB) for the First Two Images in Each Class

5.3 Computational Results on Real Data for WB Problems

In this subsection, we test Algorithm 1 by computing the WB of real image data. For a given set of images, we first rescale them to a pre-specified resolution, then vectorize and normalize them so that each resulting vector sums to one. We generate the distance matrices \(D^{(t)}\), \(t=1,\dots,N\), in the same manner as before. In our experiments, we set \(m=n\) so that the barycenter has the same dimension as the given images.
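A possible (illustrative) preparation of such an instance is sketched below; its output can be passed to the LP assembly sketch in Section 4.1.

```julia
# Illustrative preparation of a fixed-support WB instance from N equal-size images,
# using gamma_t = 1/N and p = 2 with the pixel grid as the fixed support.
function wb_instance(imgs::Vector{Matrix{Float64}})
    N = length(imgs)
    r, s = size(imgs[1])
    coords = [(l, k) for k in 1:s for l in 1:r]
    D0 = Float64[(p[1] - q[1])^2 + (p[2] - q[2])^2 for p in coords, q in coords]
    D0 ./= maximum(D0)                                 # same normalization as for OT
    a = [vec(img) ./ sum(img) for img in imgs]         # sampling distributions a^(t)
    D = [D0 ./ N for _ in 1:N]                         # D^(t) = gamma_t * D with gamma_t = 1/N
    return D, a
end
```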
We conduct experiments on the MNIST dataset [LeCun et al., 1998], the Coil20 dataset [Nene et al., 1996], and the Yale-Face dataset [Belhumeur et al., 1997]. For the MNIST dataset, we randomly select 10 images for each digit \((0,\dots,9)\) at the original resolution \(28\times 28\). For the Coil20 dataset, we select three representative objects: Car, Duck, and Pig; for each object, we choose 10 images and rescale them to \(32\times 32\). For the Yale-Face dataset, we select three human subjects: Yale01, Yale02, and Yale03; for each subject, we randomly choose 10 images and rescale them to \(30\times 40\), \(36\times 48\), and \(54\times 72\), respectively. See Figure 2 for example images from each class. A summary of the problem sizes is shown in Table 5. We set the weights \(\gamma_{t}=1/N\) for all \(t=1,\dots,N\) and take \(p=2\).
Database | N | Resolution | #constraints | #variables
MNIST | 10 | \(28\times 28\) | 15,680 | 6,147,344
Car | 10 | \(32\times 32\) | 20,480 | 10,486,784
Duck | 10 | \(32\times 32\) | 20,480 | 10,486,784
Pig | 10 | \(32\times 32\) | 20,480 | 10,486,784
Yale01 | 10 | \(30\times 40\) | 24,000 | 14,401,200
Yale02 | 10 | \(36\times 48\) | 34,560 | 29,861,568
Yale03 | 10 | \(54\times 72\) | 77,760 | 151,169,328
Table 5. WB Problem Sizes with Different Image Resolutions
Fig. 2. Example images from different databases.
The results are summarized in Table 6, where Gurobi-Bar and Gurobi-Spx denote the barrier method and the dual simplex method in Gurobi, respectively. From the computational results, we observe that SmoothNewton and Gurobi-Bar are able to solve all the WB problems on image data successfully, whereas Gurobi-Spx cannot solve the largest Yale-Face problem within the 24-hour limit, since the corresponding LP is of very large size. On the larger problems, Gurobi-Bar is faster than Gurobi-Spx, but SmoothNewton outperforms both Gurobi-Spx and Gurobi-Bar on all the image WB problems. For the WB instances on the Duck, Pig, and Yale-Face images, SmoothNewton is about ten times faster than both Gurobi-Spx and Gurobi-Bar.
Image | Solver | Time (s) | Iter | \(\eta_{p}\) | \(\eta_{d}\) | \(\eta_{c}\) | \(\eta_{g}\)
MNIST | Gurobi-Spx | 4.9e+01 | 103,070 | 2.8e-16 | 8.1e-17 | 7.3e-19 | 3.5e-18
 | Gurobi-Bar | 1.1e+02 | 30 | 5.0e-11 | 5.1e-15 | 6.1e-11 | 1.2e-09
 | SmoothNewton | 1.3e+01 | 91 | 1.3e-13 | 0.0e+00 | 2.1e-10 | 8.2e-09
Car | Gurobi-Spx | 1.2e+02 | 38,608 | 1.6e-16 | 3.0e-17 | 2.0e-19 | 1.7e-18
 | Gurobi-Bar | 2.2e+02 | 33 | 8.6e-12 | 8.4e-14 | 6.7e-13 | 6.1e-11
 | SmoothNewton | 2.6e+01 | 111 | 6.2e-14 | 0.0e+00 | 1.6e-11 | 6.2e-10
Duck | Gurobi-Spx | 3.2e+02 | 536,738 | 2.6e-16 | 8.2e-17 | 1.8e-11 | 5.1e-18
 | Gurobi-Bar | 2.6e+02 | 43 | 3.0e-11 | 4.0e-15 | 1.6e-13 | 2.5e-12
 | SmoothNewton | 2.6e+01 | 100 | 4.5e-13 | 0.0e+00 | 2.7e-10 | 9.1e-09
Pig | Gurobi-Spx | 5.5e+02 | 1,306,882 | 1.6e-16 | 8.0e-17 | 4.2e-18 | 1.7e-17
 | Gurobi-Bar | 2.4e+02 | 38 | 3.8e-11 | 5.2e-14 | 4.4e-12 | 7.4e-09
 | SmoothNewton | 2.3e+01 | 96 | 1.2e-12 | 0.0e+00 | 1.8e-10 | 8.9e-09
Yale01 | Gurobi-Spx | 3.7e+03 | 6,126,017 | 1.9e-16 | 8.7e-17 | 3.1e-17 | 1.9e-17
 | Gurobi-Bar | 3.0e+02 | 32 | 3.6e-11 | 1.1e-15 | 4.1e-10 | 8.3e-09
 | SmoothNewton | 3.1e+01 | 97 | 7.0e-13 | 0.0e+00 | 1.6e-10 | 6.6e-09
Yale02 | Gurobi-Spx | 2.0e+04 | 18,892,197 | 2.6e-16 | 8.5e-17 | 1.9e-11 | 2.5e-17
 | Gurobi-Bar | 2.1e+03 | 45 | 1.4e-10 | 1.3e-13 | 2.2e-11 | 5.3e-09
 | SmoothNewton | 1.8e+02 | 115 | 1.0e-12 | 0.0e+00 | 1.7e-10 | 9.7e-09
Yale03 | Gurobi-Spx | (exceeds 24 hours) | - | - | - | - | -
 | Gurobi-Bar | 3.3e+04 | 26 | 1.9e-13 | 2.4e-12 | 4.1e-14 | 2.2e-09
 | SmoothNewton | 2.5e+03 | 177 | 2.3e-13 | 0.0e+00 | 1.2e-10 | 8.6e-09
Table 6. Computational Results on Real Data for WB Problems

6 Conclusions

In this article, we have proposed an SqSN method via the Huber smoothing function for solving the KKT system of the discrete OT problem directly. Besides having appealing convergence properties, the proposed method is able to exploit the solution sparsity of the OT problem to greatly reduce the computational costs. We have also extended the method to solve WB problems. Our numerical experiments on solving a large set of OT and WB instances have demonstrated the excellent practical performance of the proposed method when compared to the state-of-the-art solvers.
There are some issues concerning the proposed method that deserve further investigation. For example, the condition for ensuring the local superlinear convergence rate is rather strong and cannot be verified a priori in general. Since we have observed fast local convergence empirically in our numerical tests, there may exist more relaxed conditions that guarantee such strong convergence properties, and this deserves further study. Moreover, the method can clearly also be applied to solving general LP problems, but its efficiency and robustness in that setting require further investigation.

Acknowledgment

We would like to thank the editors and referees for providing numerous valuable suggestions which have helped to improve the quality of this paper.

Footnotes

1
Based on the benchmark conducted by Hans D. Mittelmann, see http://plato.asu.edu/
2
The bipartite graph has \(m\) sources and \(n\) sinks. There is an edge from source \(i\) to sink \(j\) if and only if \(V^{k}_{ij} \gt 0\) for \(1\leq i\leq m\) and \(1\leq j\leq n\).
3
Julia: https://julialang.org/; Gurobi.jl: https://github.com/jump-dev/Gurobi.jl; HiGHS.jl: https://github.com/jump-dev/HiGHS.jl; CPLEX.jl: https://github.com/jump-dev/CPLEX.jl; and JuMP.jl: https://github.com/jump-dev/JuMP.jl. For the commercial solvers Gurobi and CPLEX, we use their academic licenses.
4
Based on the benchmark in [Schrieber et al., 2016], the network simplex method and its variants are the most efficient algorithms for solving OT problems to very high accuracy. However, only the network simplex method implemented in CPLEX has a friendly interface in JuMP, to the best of our knowledge. Therefore, we only compare Algorithm 1 with the network simplex method provided by CPLEX.

References

[1]
Martial Agueh and Guillaume Carlier. 2011. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis 43, 2 (2011), 904–924.
[2]
Ethan Anderes, Steffen Borgwardt, and Jacob Miller. 2016. Discrete Wasserstein barycenters: Optimal transport for discrete data. Mathematical Methods of Operations Research 84 (2016), 389–409.
[3]
Peter N. Belhumeur, Joao P. Hespanha, and David J. Kriegman. 1997. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19, 7 (1997), 711–720.
[4]
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. 2015. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37, 2 (2015), A1111–A1138.
[5]
Zi Xian Chan and Defeng Sun. 2008. Constraint nondegeneracy, strong regularity, and nonsingularity in semidefinite programming. SIAM J. Optimization 19, 1 (2008), 370–396.
[6]
Bintong Chen and Patrick T. Harker. 1993. A non-interior-point continuation method for linear complementarity problems. SIAM Journal on Matrix Analysis and Applications 14, 4 (1993), 1168–1190.
[7]
Xiaojun Chen, Liqun Qi, and Defeng Sun. 1998. Global and superlinear convergence of the smoothing Newton method and its application to general box constrained variational inequalities. Mathematics of Computation 67, 222 (1998), 519–540.
[8]
Pierre-André Chiappori, Robert J. McCann, and Lars P. Nesheim. 2010. Hedonic price equilibria, stable matching, and optimal transport: equivalence, topology, and uniqueness. Economic Theory 42 (2010), 317–354.
[9]
Hong T. M. Chu, Ling Liang, Kim-Chuan Toh, and Lei Yang. 2020. An efficient implementable inexact entropic proximal point algorithm for a class of linear programming problems. Computational Optimization and Applications 85 (2020), 107–146.
[10]
Frank H. Clarke. 1990. Optimization and Nonsmooth Analysis. SIAM.
[11]
Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 26.
[12]
Marco Cuturi and Arnaud Doucet. 2014. Fast computation of Wasserstein barycenters. In Proceedings of the International Conference on Machine Learning. PMLR, 685–693.
[13]
George Dantzig. 2016. Linear Programming and Extensions. Princeton University Press.
[14]
Anthony Y. Fu, Liu Wenyin, and Xiaotie Deng. 2006. Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD). IEEE Transactions on Dependable and Secure Computing 3, 4 (2006), 301–311.
[15]
Alfred Galichon. 2018. Optimal Transport Methods in Economics. Princeton University Press.
[16]
Kristen Grauman and Trevor Darrell. 2004. Fast contour matching using approximate earth mover's distance. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’04), Vol. 1. IEEE, I–I.
[17]
Hao Hu, Xinxin Li, Haesol Im, and Henry Wolkowicz. 2024. A semismooth Newton-type method for the nearest doubly stochastic matrix problem. Mathematics of Operations Research 49, 2 (2024), 729–751.
[18]
Jiang Hu, Tonghua Tian, Shaohua Pan, and Zaiwen Wen. 2022. On the local convergence of the semismooth Newton method for composite optimization. arXiv:2211.01127. Retrieved from https://arxiv.org/abs/2211.01127
[19]
Christian Kanzow. 1996. Some noninterior continuation methods for linear complementarity problems. SIAM Journal on Matrix Analysis and Applications 17, 4 (1996), 851–868.
[20]
Dave Kendal, Cindy E. Hauser, Georgia E. Garrard, Sacha Jellinek, Katherine M. Giljohann, and Joslin L. Moore. 2013. Quantifying plant colour and colour difference as perceived by humans using digital images. PLoS One 8, 8 (2013), e72296.
[21]
Lingchen Kong, Jie Sun, and Naihua Xiu. 2008. A regularized smoothing Newton method for symmetric cone complementarity problems. SIAM Journal on Optimization 19, 3 (2008), 1028–1047.
[22]
Lingchen Kong, Levent Tunçel, and Naihua Xiu. 2011. Equivalent conditions for Jacobian nonsingularity in linear symmetric cone programming. Journal of Optimization Theory and Applications 148, 2 (2011), 364–389.
[23]
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.
[24]
Xudong Li, Defeng Sun, and Kim-Chuan Toh. 2020. An asymptotically superlinearly convergent semismooth Newton augmented Lagrangian method for linear programming. SIAM Journal on Optimization 30, 3 (2020), 2410–2440.
[25]
Yiyang Liu, Zaiwen Wen, and Wotao Yin. 2022. A multiscale semi-smooth Newton method for optimal transport. Journal of Scientific Computing 91, 2 (2022), 39.
[26]
David G. Luenberger and Yinyu Ye. 1984. Linear and Nonlinear Programming. Vol. 2. Springer.
[27]
Quentin Mérigot. 2011. A multiscale approach to optimal transport. In Computer Graphics Forum, Mario Botsch and Scott Schaefer (Eds.), Vol. 30, Wiley Online Library, 1583–1592.
[28]
Robert Mifflin. 1977. Semismooth and semiconvex functions in constrained optimization. SIAM Journal on Control and Optimization 15, 6 (1977), 959–972.
[29]
Sameer A. Nene, Shree K. Nayar, and Hiroshi Murase. 1996. Columbia Object Image Library (COIL-20). Technical Report, Columbia University.
[30]
James B. Orlin. 1997. A polynomial time primal network simplex algorithm for minimum cost flows. Mathematical Programming 78, 2 (1997), 109–129.
[31]
Gabriel Peyré and Marco Cuturi. 2019. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11, 5–6 (2019), 355–607.
[32]
Liqun Qi and Defeng Sun. 1999. A survey of some nonsmooth equations and smoothing Newton methods. In Progress in Optimization. Andrew Eberhard, Robin Hill, Daniel Ralph, and Barney M. Glover (Eds.), Springer, 121–146.
[33]
Liqun Qi, Defeng Sun, and Guanglu Zhou. 2000. A new look at smoothing Newton methods for nonlinear complementarity problems and box constrained variational inequalities. Mathematical Programming 87, 1 (2000), 1–35.
[34]
Liqun Qi and Jie Sun. 1993. A nonsmooth version of Newton's method. Mathematical Programming 58, 1 (1993), 353–367.
[35]
Julien Rabin, Gabriel Peyré, Julie Delon, and Marc Bernot. 2012. Wasserstein barycenter and its application to texture mixing. In Proceedings of the 3rd International Conference on Scale Space and Variational Methods in Computer Vision (SSVM ’11), Revised Selected Papers 3. Springer, 435–446.
[36]
Hans Rademacher. 1919. Über partielle und totale differenzierbarkeit von Funktionen mehrerer Variabeln und über die Transformation der Doppelintegrale. Mathematische Annalen 79, 4 (1919), 340–359.
[37]
Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 2000. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40, 2 (2000), 99–121.
[38]
Filippo Santambrogio. 2015. Optimal Transport for Applied Mathematicians. Birkhäuser, NY.
[39]
Bernhard Schmitzer. 2016. A sparse multiscale algorithm for dense optimal transport. Journal of Mathematical Imaging and Vision 56, 2 (2016), 238–259.
[40]
Bernhard Schmitzer. 2019. Stabilized sparse scaling algorithms for entropy regularized transport problems. SIAM Journal on Scientific Computing 41, 3 (2019), A1443–A1481.
[41]
Jörn Schrieber, Dominic Schuhmacher, and Carsten Gottschlich. 2016. DOTmark–a benchmark for discrete optimal transport. IEEE Access 5 (2016), 271–282.
[42]
Steve Smale. 2000. Algorithms for solving equations. In The Collected Papers of Stephen Smale, Felipe Cucker and Roderick S. C. Wong (Eds.), Vol. 3, World Scientific, 1263–1286.
[43]
Defeng Sun and Liqun Qi. 2001. Solving variational inequality problems via smoothing-nonsmooth reformulations. Journal of Computational and Applied Mathematics 129, 1–2 (2001), 37–62.
[44]
Jie Sun, Defeng Sun, and Liqun Qi. 2004. A squared smoothing Newton method for nonsmooth matrix equations and its applications in semidefinite optimization problems. SIAM Journal on Optimization 14, 3 (2004), 783–806.
[45]
André L. Tits, Pierre-Antoine Absil, and William P. Woessner. 2006. Constraint reduction for linear programs with many inequality constraints. SIAM Journal on Optimization 17, 1 (2006), 119–146.
[46]
Cédric Villani. 2009. Optimal Transport: Old and New. Vol. 338. Springer.
[47]
Stephen J. Wright. 1997. Primal-Dual Interior-Point Methods. SIAM.
[48]
Lei Yang, Jia Li, Defeng Sun, and Kim-Chuan Toh. 2021. A fast globally linearly convergent algorithm for the computation of Wasserstein barycenters. Journal of Machine Learning Research 22, 1 (2021), 984–1020.
[49]
Lei Yang and Kim-Chuan Toh. 2022. Bregman proximal point algorithm revisited: A new inexact version and its inertial variant. SIAM Journal on Optimization 32, 3 (2022), 1523–1554.
[50]
Jianbo Ye and Jia Li. 2014. Scaling up discrete distribution clustering using ADMM. In Proceedings of the IEEE International Conference on Image Processing (ICIP ’14). IEEE, 5267–5271.
[51]
Jianbo Ye, Panruo Wu, James Z. Wang, and Jia Li. 2017. Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing 65, 9 (2017), 2317–2332.
[52]
Yinyu Ye. 1992. A potential reduction algorithm allowing column generation. SIAM Journal on Optimization 2, 1 (1992), 7–20.
[53]
Israel Zang. 1980. A smoothing-out technique for min—max optimization. Mathematical Programming 19, 1 (1980), 61–77.
[54]
Guojun Zhang, Yancheng Yuan, and Defeng Sun. 2022. An efficient HPR algorithm for the Wasserstein barycenter problem with \(O(Dim(P)/\varepsilon)\) computational complexity. arXiv:2211.14881. Retrieved from https://arxiv.org/abs/2211.14881

Appendix A Proofs

Proof of Theorem 2.
Since \((\bar{\epsilon},\bar{x},\bar{y})\) is an accumulation point of the sequence \(\{(\epsilon^{k},x^{k},y^{k})\}\), Theorem 1 implies that \(\widehat{\mathcal{E}}(\bar{\epsilon},\bar{x},\bar{y})=0\). In particular, it holds that \(\bar{\epsilon}=0\) and \(\zeta_{k}=r\lVert\widehat{\mathcal{E}}(\epsilon^{k},x^{k},y^{k})\rVert^{1+\tau}\) for \(k\geq 0\) sufficiently large. Since \(\epsilon^{k} \gt 0\) and every element of \(\partial\widehat{\mathcal{E}}(\bar{\epsilon},\bar{x},\bar{y})\) is nonsingular, it follows (see e.g., [Qi and Sun, 1993, Proposition 3.1]) that
\begin{align}\left\lVert\widehat{\mathcal{E}}^{\prime}(\epsilon^{k},x^{k},y^{k})^{-1}\right \rVert=O(1),\end{align}
(A1)
for \(k\geq 0\) sufficiently large. For notational simplicity, we let \(w^{k}=(\epsilon^{k};x^{k};y^{k})\), \(\bar{w}=(\bar{\epsilon};\bar{x};\bar{y})\), and \(\Delta w^{k}=(\Delta\epsilon^{k};\Delta x^{k};\Delta y^{k})\). Using (A1), we see that
\begin{align} & \;\left\|w^{k}+\Delta w^{k}-\bar{w}\right\|=\;\left\|w^{k}+ \widehat{\mathcal{E}}^{\prime}(w^{k})^{-1}\left(\begin{pmatrix}\zeta_{k} \epsilon^{0}\\ 0\\ 0 \end{pmatrix}-\widehat{\mathcal{E}}(w^{k})\right)-\bar{w}\right\| \notag\\ = & \;\left\lVert-\widehat{\mathcal{E}}^{\prime}(w^{k})^{-1}\left( \widehat{\mathcal{E}}(w^{k})-\widehat{\mathcal{E}}^{\prime}(w^{k})(w^{k}-\bar{ w})-\begin{pmatrix}\zeta_{k}\epsilon^{0}\\ 0\\ 0 \end{pmatrix}\right)\right\rVert\\ = & \;O\left(\left\|\widehat{\mathcal{E}}(w^{k})-\widehat{\mathcal{E} }^{\prime}(w^{k})(w^{k}-\bar{w})\right\|\right)+O\left(\zeta_{k}\epsilon^{0}\right) \notag\\ = & \;O\left(\left\|\widehat{\mathcal{E}}(w^{k})-\widehat{\mathcal{E}}^{\prime}(w^{k})(w^{k}-\bar{w})\right\|\right)+O\left(\left\lVert\widehat{\mathcal{E}}(\epsilon^{k},x^{k},y^{k})\right\rVert^{1+\tau}\right).\notag \end{align}
(A2)
Since \(\widehat{\mathcal{E}}\) is strongly semismooth everywhere, we have that, for \(k\geq 0\) sufficiently large,
\begin{align}\left\|\widehat{\mathcal{E}}(w^{k})-\widehat{\mathcal{E}}^{\prime}(w^{k})(w^{k }-\bar{w})\right\|=O\left(\left\|w^{k}-\bar{w}\right\|^{2}\right).\end{align}
(A3)
Moreover, since \(\widehat{\mathcal{E}}\) is locally Lipschitz continuous at \(\bar{w}\), it follows that
\begin{align}\left\lVert\widehat{\mathcal{E}}(w^{k})\right\rVert^{1+\tau}=\left\lVert \widehat{\mathcal{E}}(w^{k})-\widehat{\mathcal{E}}(\bar{w})\right\rVert^{1+ \tau}=O\left(\left\|w^{k}-\bar{w}\right\|^{1+\tau}\right).\end{align}
(A4)
By combining (A2)–(A4), we get for \(k\geq 0\) sufficiently large,
\begin{align}\left\|w^{k}+\Delta w^{k}-\bar{w}\right\|= & \;O\left(\left\|w^{k}-\bar{w}\right\|^{2}\right)+O\left(\left\|w^ {k}-\bar{w}\right\|^{1+\tau}\right)\;=\;O\left(\left\|w^{k}-\bar{w}\right\|^{1 +\tau}\right).\end{align}
(A5)
On the other hand, by (A1), the norms \(\big\|(\widehat{\mathcal{E}}^{\prime}(w^{k}))^{-1}\big\|\) are uniformly bounded, which means that the smallest singular values of \(\widehat{\mathcal{E}}^{\prime}(w^{k})\) are bounded away from zero for \(k\geq 0\) sufficiently large. Therefore, it follows from (A3) that, for \(k\geq 0\) sufficiently large,
\begin{align}\left\|w^{k}-\bar{w}\right\|=O\left(\left\lVert\widehat{\mathcal{E}}(\epsilon^ {k},x^{k},y^{k})\right\rVert\right).\end{align}
(A6)
Using Equations (A5) and (A6) and the fact that \(\phi\) and \(\widehat{\mathcal{E}}\) are both locally Lipschitz continuous at \((\bar{\epsilon},\bar{x},\bar{y})\), we get for \(k\geq 0\) sufficiently large that
\begin{align*}\phi(w^{k}+\Delta w^{k})&=\;\left\lVert\widehat{\mathcal{E}}(w^{ k}+\Delta w^{k})-\widehat{\mathcal{E}}(\bar{w})\right\rVert^{2}=\;O\left(\left \|w^{k}+\Delta w^{k}-\bar{w}\right\|^{2}\right)=\;O\left(\left\|w^{k}-\bar{w} \right\|^{2(1+\tau)}\right) \\&= \;O\left(\left\lVert\widehat{\mathcal{E}}(w^{k})\right\rVert^{2(1 +\tau)}\right)=\;O\left(\phi(w^{k})^{1+\tau}\right)=\;o\left(\phi(w^{k})\right).\end{align*}
This shows that, for \(k\geq 0\) sufficiently large, \((\epsilon^{k+1},x^{k+1},y^{k+1})=(\epsilon^{k},x^{k},y^{k})+(\Delta\epsilon^{k },\Delta x^{k},\Delta y^{k})\), i.e., the unit step size is eventually accepted. Therefore, the proof is completed. ∎
