Abstract
We analyze a class of exact distributed first order methods under a general setting on the underlying network and step-sizes. In more detail, we allow simultaneously for time-varying uncoordinated step sizes and time-varying directed weight-balanced networks, jointly connected over bounded intervals. The analyzed class of methods subsumes several existing algorithms like the unified Extra and unified DIGing (Jakovetić in IEEE Trans Signal Inf Process Netw 5(1):31–46, 2019), or the exact spectral gradient method (Jakovetić et al. in Comput Optim Appl 74:703–728, 2019) that have been analyzed before under more restrictive assumptions. Under the assumed setting, we establish R-linear convergence of the methods and present several implications that our results have on the literature. Most notably, we show that the unification strategy in Jakovetić (2019) and the spectral step-size selection strategy in Jakovetić et al. (2019) exhibit a high degree of robustness to uncoordinated time-varying step sizes and to time-varying networks.
1 Introduction
We consider a set of n computational agents and the following unconstrained optimization problem
$$\begin{aligned} \min _{y\in {\mathbb {R}}^{d}} f(y) = \sum _{j=1}^{n} f_j(y), \end{aligned}$$(1)
where each \(f_j\) is a real-valued function of \({\mathbb {R}}^{d}\) and is held privately by one of the agents, and the agents can communicate according to a given network. Problems of this form arise in many practical applications such as sensor networks [15], distributed control [17], distributed learning [4] and many others. Several distributed methods [10,11,12,13, 18, 21, 26, 28, 35] have been proposed in the literature for the solution of (1) that achieve exact convergence to the minimizer with a fixed step-size, when the objective function is convex and Lipschitz-differentiable. In [18, 21, 26] two exact gradient-based methods were proposed, and the convergence was proved for the case where the underlying network is undirected, connected, and remains constant through the entire execution of the algorithm. In [13] a unified analysis of a class of first-order distributed methods is presented. In [28] the convergence of several first-order methods was generalized to the case of a time-varying network, provided that the network is connected at each iteration, while in [18] the convergence analysis of [21] is extended to the time-varying and directed case, assuming joint-connectivity of the sequence of networks and weight-balance of each graph.Footnote 1 Interestingly, the exact first order methods are also related to augmented Lagrangian algorithms, e.g., [20]. For example, the methods in [13, 18, 21] have been shown to be equivalent to certain primal-dual methods that optimize an augmented Lagrangian function associated with the original problem. In [24] an accelerated gradient-based method for the time-varying directed case is proposed, with weaker assumptions on the underlying networks. In [16] the authors propose a mirror descent method that assumes time-varying jointly-strongly-connected networks.
In [8] the authors considered the problem of minimizing \(f(y)+G(y)\) over a closed and convex set \({\mathcal {K}}\), where f is a possibly non-convex function as in (1) and G is a convex non-separable term, and they propose a gradient-tracking method that achieves convergence in the case of time-varying directed jointly-connected networks for diminishing synchronized step sizes. In [25] the method proposed in [8] is extended with constant step-sizes to a more general framework, while in [27] R-linear convergence is proved for [25] with strongly convex f(y). A unifying framework of these methods is presented in [34] and, for the case of constant and undirected networks, in [1]. In all the above methods the sequence of step-sizes is assumed to be fixed and coordinated among all the agents. In [19, 32, 33, 35, 36] the case of uncoordinated time-constant step sizes is considered, that is, each node has a different step-size, but these step-sizes are constant across iterations. In [14] a modification of [21] is proposed, with step-sizes varying both across nodes and iterations, and it is proved that there exist suitable safeguards for the steps, depending on the regularity properties of the objective function and the network, such that R-linear convergence of the generated sequence to the solution of (1) holds. This result is obtained for undirected and stationary networks. In [29, 30] asynchronous modifications of [25] are proposed.
Of special interest to the current paper is the spectral gradient method (or Barzilai-Borwein method). This method is very popular in centralized optimization due to its efficiency, as reported in numerous studies, for example [5, 9]. In general, the method avoids the famous zig-zag behaviour of the steepest descent and converges much faster. The method was first proposed by Barzilai and Borwein [3]. This reference proves the method’s convergence for two-dimensional problems and convex quadratic functions. The analysis is then extended to arbitrary dimensions and convex quadratic functions by Raydan [22]. Minimization of generic functions is considered in [23] in combination with a nonmonotone line search. R-linear convergence for convex quadratic functions has been proved in [6]. In summary, despite their excellent numerical performance, spectral gradient methods are proved to converge without any safeguarding lower and upper bounds on the step size only for strongly convex quadratic costs. Convergence for generic functions beyond convex quadratics is proved only under step-size safeguarding, coupled with a line search strategy. Distributed variants of spectral gradient methods over fixed network topologies are studied in [14].
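As a point of reference, the following is a minimal sketch of the centralized spectral (Barzilai-Borwein) gradient method on a strongly convex quadratic; the quadratic data below are randomly generated and purely illustrative, and the simple safeguards are an assumption of the sketch.

```python
import numpy as np

# Minimal sketch of the centralized Barzilai-Borwein (spectral) gradient
# method on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x,
# using the BB step alpha_k = s^T s / s^T y with simple safeguards.
# The quadratic below is an illustrative example, not from the paper.

def bb_gradient(A, b, x0, iters=500, alpha0=1e-3, a_min=1e-8, a_max=1e3):
    x = x0.copy()
    g = A @ x - b                # gradient of the quadratic
    alpha = alpha0
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g          # BB difference vectors
        if s @ y > 0:                        # always holds for SPD A unless converged
            alpha = float(np.clip(s @ s / (s @ y), a_min, a_max))
        x, g = x_new, g_new
    return x

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)          # symmetric positive definite Hessian
b = rng.standard_normal(5)
x = bb_gradient(A, b, np.zeros(5))
print(np.linalg.norm(A @ x - b))  # gradient norm near zero
```

For strongly convex quadratics the curvature condition \(s^Ty>0\) holds automatically, which is why no line search is needed in this regime, in line with the convergence theory of [6, 22].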
We now summarize this paper’s contributions. We establish R-linear convergence of a class of exact distributed first-order methods under the general setting of time-varying directed weight-balanced networks, without the requirement of network connectedness at each iteration, and in the presence of time-varying uncoordinated step sizes. While there have been several existing studies of exact distributed methods under general settings, our study implies several new contributions to the literature; these contributions cannot be derived from existing works and are novelties of this paper.
- We prove that the methods proposed in [13], referred to here (and also in [28]) as the unified Extra and the unified DIGing, are robust to time-varying directed networks and time-varying uncoordinated step sizes, i.e., they converge R-linearly in this setting. Up to now, it was only known that these methods converge under static undirected networks [13] or time-varying networks where the network is connected at each iteration [28]. Moreover, these methods had previously been considered only for time-invariant coordinated step sizes.
- We prove that the method proposed in [14] is robust to time-varying directed networks. Before the current paper, the method was only known to converge for static, undirected networks.
- It is shown in [28] that the Extra method [26] may diverge over time-varying networks, even when the network is connected at every iteration. On the other hand, as we show here, the unified Extra, a variant of Extra proposed in [13], is robust to time-varying networks. Hence, our results reveal that the unified Extra can be considered as a means to modify Extra and make it robust.
- We provide a thorough numerical study, and an analytical study for a special problem structure, demonstrating that the unification strategy in [13] and the spectral gradient-like step-size selection strategy in [14] exhibit a high degree of robustness to time-varying networks and uncoordinated time-varying step-sizes. More precisely, we show that these strategies converge, when working on time-varying networks, for wider step-size ranges than commonly used strategies such as constant coordinated step-sizes and DIGing algorithmic forms. In addition, we show by simulation that a combination of the unification and the spectral step-size strategies further improves robustness.
Technically, while considering weight-balanced digraphs instead of undirected graphs does not lead to significant differences in the analysis, the major technical differences here with respect to prior work correspond to the analysis of the unification strategy [13] under time-varying networks and time-and-node-varying step-sizes, and of the spectral strategies [14] under time-varying networks.
This paper is organized as follows. In Sect. 2 we describe the computational framework that we consider and we present the methods that we analyse. In Sect. 3 we recall a few preliminary results from the literature and we prove a convergence theorem for the methods introduced in Sect. 2. In Sect. 4, we show analytically and by simulation that the unification and spectral step-size selection strategies increase robustness of the methods to time-varying networks and uncoordinated step-sizes. Finally, in Sect. 5, we conclude the paper and outline some future research directions.
2 The model and the class of considered methods
We make the following regularity assumptions for the local cost functions \(f_i\).
Assumption A1
- Each function \(f_{i}: {\mathbb {R}}^d \rightarrow {\mathbb {R}}\), \(i=1,\dots ,n,\) is twice continuously differentiable;
- There exist constants \(0\le \mu _i\le L_i\), \(i=1,\ldots ,n\), such that for every \(y \in {\mathbb {R}}^d,\)
$$\begin{aligned} \mu _i I \preceq \nabla ^2 f_i (y) \preceq L_i I \end{aligned}$$(2)
where we write \(A\preceq B\) if the matrix \(B-A\) is positive semi-definite. That is, we assume that each of the local functions is \(\mu _i\)-strongly convex and has Lipschitz continuous gradient with constant \(L_i\). Denoting \(L= \sum _{i=1}^n L_i\) and \(\mu = \sum _{i=1}^{n} \mu _i,\) we have that the aggregate function f is \(\mu\)-strongly convex and \(\nabla f\) is Lipschitz continuous with constant L.
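As a small numerical check of these aggregate constants for quadratic local costs \(f_i(y)=\frac{1}{2}y^TA_iy\), where \(\mu_i\) and \(L_i\) are the extreme eigenvalues of the Hessian \(A_i\); the random Hessians below are an illustrative example, not from the paper.

```python
import numpy as np

# For quadratic local costs f_i(y) = 0.5 y^T A_i y, mu_i and L_i are the
# extreme eigenvalues of A_i, and by Weyl's inequality the Hessian of the
# aggregate f satisfies (sum mu_i) I <= sum A_i <= (sum L_i) I.

rng = np.random.default_rng(1)
n, d = 4, 3
hessians = []
for _ in range(n):
    M = rng.standard_normal((d, d))
    hessians.append(M @ M.T + 0.1 * np.eye(d))   # SPD -> strongly convex f_i

mu_i = [np.linalg.eigvalsh(A)[0] for A in hessians]   # smallest eigenvalue
L_i = [np.linalg.eigvalsh(A)[-1] for A in hessians]   # largest eigenvalue

H = sum(hessians)                                     # Hessian of f = sum f_i
mu, L = sum(mu_i), sum(L_i)
eigs = np.linalg.eigvalsh(H)
print(mu <= eigs[0], eigs[-1] <= L)   # mu I <= H <= L I holds
```

Note that \(\mu\) and L obtained this way are valid bounds, though not necessarily the tightest constants for the aggregate function.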
Given \(x_1,\dots ,x_n\in {\mathbb {R}}^{d}\) we define the stacked vector
$$\begin{aligned} x = (x_1^T,\dots ,x_n^T)^T\in {\mathbb {R}}^{nd}. \end{aligned}$$(3)
We denote by e the vector of length n with all components equal to 1. For a matrix \(A\in {\mathbb {R}}^{n\times n}\) we denote by \(\nu _{\max }(A)\) the largest singular value of A. Moreover, given a sequence of matrices \(\{M^k\}_k\) and \(m\in {\mathbb {N}}\), let
It is assumed that at iteration k the n agents are the nodes of a given network \(G^k = (\{1,\dots ,n\}, E_k)\), where \(E_k\) denotes the set of the edges of the network, and to each \(G^k\) we associate a consensus matrix \(W^k\in {\mathbb {R}}^{n\times n}\). The assumptions on the sequences \(\{G^k\}\) and \(\{W^k\},\) which are the same hypotheses considered in [18], are stated below.
Assumption A2
For every \(k =0,1,\ldots\), \(G^k= (\{1,\dots ,n\}, E_k)\) is a directed graph and \(W^k\) is an \(n\times n\) doubly stochastic matrix with \(w^k_{ij}=0\) if \(i\ne j\) and \((i,j)\notin E_k\). Moreover, there exists a positive integer m such that \(\sup _{k=0:m-1}\nu _k<1\), where \(\nu _k = \nu _{\max }(W^k_m-\frac{1}{n}ee^T)\).
Remark 2.1
Assumption A2 is weaker than requiring each graph \(G^k\) to be connected. For example, it can be proved (see [18]) that, in the case of undirected networks, if the sequence is jointly-connected then we can ensure Assumption A2 by taking \(W^k\) as, e.g., the Metropolis matrix (Xiao et al. [31]) associated with \(G^k.\) In more detail, the following can be shown. Assume that the positive entries of the weight matrices \(W^k\) are always bounded from below by a positive constant \(\underline{w}\) (including the diagonal entries, i.e., assume that the diagonal entries of \(W^k\) are always greater than or equal to \(\underline{w}\)). Furthermore, assume network connectedness over bounded intercommunication intervals. That is, for any fixed iteration k, consider the graph \(G_{k}^m = (\{1,\dots ,n\}, E_k^m)\), \(E_k^m = \cup _{\ell =k-m+1}^{k}E_\ell\), whose set of links is the union of the sets of links of the graphs at time instances \(\ell =k-m+1,\dots ,k\). Assume that \(G_{k}^m\) is strongly connected for every k. It is then easy to show that the above assumptions imply \(\nu _{\max } \left( W^k_m-\frac{1}{n}ee^T\right) <1\).Footnote 2
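As a concrete sketch of this construction, the following builds the Metropolis weight matrix [31] for an assumed small undirected graph (a path graph, chosen only for illustration) and verifies the properties required by Assumption A2, here with m = 1 since the example graph is connected.

```python
import numpy as np

# Metropolis weights for an undirected graph given as an edge list:
# w_ij = 1 / (1 + max(deg_i, deg_j)) for (i,j) in E, w_ii set so rows sum
# to one. For a connected graph, W is symmetric doubly stochastic and
# nu_max(W - (1/n) e e^T) < 1, matching Assumption A2 with m = 1.

def metropolis(n, edges):
    deg = np.zeros(n, dtype=int)
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))   # make each row sum to one
    return W

n = 5
W = metropolis(n, [(0, 1), (1, 2), (2, 3), (3, 4)])   # path graph on 5 nodes
J = np.ones((n, n)) / n
nu = np.linalg.svd(W - J, compute_uv=False)[0]        # largest singular value
print(np.allclose(W.sum(axis=0), 1), np.allclose(W.sum(axis=1), 1), nu < 1)
```

For a disconnected \(G^k\) the same construction still yields a doubly stochastic \(W^k\), but \(\nu_{\max}(W^k-\frac{1}{n}ee^T)=1\); Assumption A2 then relies on the product \(W^k_m\) over a jointly connected window.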
We also comment on the role of the quantity m in the convergence of (5). Our main result, Theorem 2 ahead, certifies that there exists a choice of step-size lower and upper bounds \(d_{min}\) and \(d_{max}\) such that R-linear convergence of the method (5) holds. The result holds for any choice of m. Clearly, the specific values of \(d_{min}\) and \(d_{max}\) in general depend on m. Intuitively, we can expect that, for larger m, the maximal admissible step-size is lower; also, for fixed step-size choices \(d_{min}\) and \(d_{max}\) that lead to R-linear convergence, larger m leads to slower R-linear convergence, i.e., to a worse R-linear convergence factor.
We consider the following class of methods. Assume that at each iteration node i holds two vectors \(x_i^k\) and \(u_i^k\) in \({\mathbb {R}}^d\) and that the global vectors \(x^k,\ u^k\in {\mathbb {R}}^{nd}\), defined as in (3), are updated according to the following rules:
where \({\mathcal {W}}^k:=(W^k\otimes I)\in {\mathbb {R}}^{nd\times nd}\), \(D^k = diag(d^k_1I,\dots ,d^k_nI)\) with \(d^k_i\) being the step-size for node i at iteration k and \(B^k\) is a symmetric \(n\times n\) matrix that respects the sparsity structure of the communication network \(G^k\) and such that for every \(y\in {\mathbb {R}}^d\) we have \(B^k(1\otimes y) = c(1\otimes y)\) for some \(c\in {\mathbb {R}}\). Moreover, we assume that \(x^0\in {\mathbb {R}}^{nd}\) is an arbitrary vector and \(u^0 = 0\in {\mathbb {R}}^{nd}\).
For \(B^k = 0\) and an appropriate choice of the step-sizes \(d^k_i\) we get the method introduced in [14]. For \(D^k = \alpha I\), if \(B^k = bI\) or \(B^k = b{\mathcal {W}}\) we retrieve the class of methods analyzed in [13], while if \(B^k = 0\) we retrieve the DIGing method proposed in [18, 21]. For \(D^k = \alpha I\) and \(B^k = b{\mathcal {W}}\) with \(b = \frac{1}{\alpha }\) we obtain the EXTRA method [26]; however, while EXTRA can be described with this choice of the parameters in Eq. (5), it is not included in the class of methods we consider. Namely, the theoretical analysis that we carry out in Sect. 3 requires the parameter b to be independent of the step-sizes, thus ruling out the choice \(b=\frac{1}{\alpha }\) that yields the EXTRA method. This is in line with [28], which shows that EXTRA may not converge in general for time-varying networks.
In our analysis, we consider the cases \(B^k = bI\) and \(B^k = b{\mathcal {W}}^k\) with b a non-negative constant, and \(d_{min}\le d^k_j\le d_{max}\) for every k and every \(j=1,\dots ,n\), for appropriately chosen safeguards \(0<d_{min}<d_{max}.\)
A possible choice for uncoordinated and time-varying step-sizes was proposed in [14] where we have \(d^k_i = (\sigma _i^k)^{-1}\) with \(\sigma _i^k\) given by
where \(s^{k-1}_i = x^{k}_i-x^{k-1}_i\) and \(y^{k-1}_i = \nabla f_i(x^k_i) - \nabla f_i(x^{k-1}_i)\). Here, \({{\mathcal {P}}}_U\) denotes the projection onto the closed set U, \(\sigma _{min}=1/d_{max}\), and \(\sigma _{max} =1/d_{min}\). We refer to [14] for details on the derivation and intuition behind this step-size choice. For static networks, this step-size choice incurs no communication overhead per iteration; see [14]. However, for time-varying networks, the communication and storage protocol to implement this step size needs to be adapted. One way to ensure that, at node i and iteration k, the quantity \(s_j^{k-1}\) is available for every \((i,j)\in E^k\) is to have each node j broadcast, at every iteration, the d-dimensional vector \(s_j^k\) to all its current neighbors, at the cost of one additional d-dimensional transmission per node per iteration. The method described by Eq. (5) thus combines [13, 14] into a more general method.
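As a minimal sketch of the safeguarding mechanism only: the exact node-wise rule for \(\sigma_i^k\) follows [14] and involves the neighbors' vectors \(s_j^{k-1}\); the scalar quotient below is the classical Barzilai-Borwein choice, used here as an assumed stand-in to illustrate the projection \({\mathcal {P}}_U\).

```python
import numpy as np

# Hedged sketch of a safeguarded spectral step-size: sigma is a
# Barzilai-Borwein-type quotient y^T s / s^T s projected onto the interval
# U = [sigma_min, sigma_max], so that the step d = 1/sigma always lies in
# [d_min, d_max] = [1/sigma_max, 1/sigma_min]. The exact node-wise rule
# of [14] uses neighbors' s_j vectors and is not reproduced here.

def safeguarded_sigma(s, y, sigma_min, sigma_max):
    quotient = (y @ s) / (s @ s)            # local curvature estimate
    return float(np.clip(quotient, sigma_min, sigma_max))   # projection P_U

s = np.array([0.2, -0.1])    # s^{k-1}_i = x^k_i - x^{k-1}_i
y = np.array([1.0, -0.3])    # y^{k-1}_i = gradient difference at node i
sigma = safeguarded_sigma(s, y, sigma_min=0.5, sigma_max=10.0)
d = 1.0 / sigma              # step size, guaranteed to lie in [0.1, 2.0]
print(sigma, d)
```

Whatever the unprojected quotient evaluates to, the projection guarantees \(d^k_i\in[d_{min},d_{max}]\), which is exactly the property used in the convergence analysis of Sect. 3.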
3 Convergence analysis
We now study the convergence of the method described in (5). Specifically, denoting by \(y^*\) the solution of (1) and defining
we prove that, if Assumptions A1 and A2 hold, there exist \(0<d_{min}<d_{max}\) such that the sequence \(\{x^k\}\) generated by (5) converges to \(x^*\).
Given a vector \(v\in {\mathbb {R}}^{nd},\) denote by \(\bar{v}\) the average \(\bar{v}=\frac{1}{n}\sum _{j=1}^{n}v_j\in {\mathbb {R}}^{d}\) and by J the \(n\times n\) matrix \((I-\frac{1}{n}ee^T)\), where \(e^T =(1,\dots ,1)\in {\mathbb {R}}^{n}\). Recalling the definition of \(x^k\) and \(u^k\) given in (5), we define the following quantities, which will be used further on:
To simplify the notation, in the rest of the section we assume that the dimension d of the problem is given by \(d=1\), but the same results can be proved analogously in the general case.
A few results listed below will be needed for the convergence result presented in this paper. Since \(W^k\) is doubly stochastic, we have that \(\frac{1}{n}ee^T(W^k-I) = 0\). Using this equality and the definition of \(u^{k+1}\) we get
and by the initialization \(u^0=0\), we have that
Directly by the definition of \(\tilde{u}^k\) and (9) we get
From Assumption A1, for every k there exists a matrix \(H_k\preceq LI\) such that
Lemma 1
[18] If the matrix sequence \(\{W^k\}_k\) satisfies Assumption A2, then for every \(k\ge m\) we have
Lemma 2
[2] If the function f satisfies Assumption A1 and \(0<\alpha <\frac{1}{L}\), then
where \(\tau = \max \{|1-\alpha \mu |,|1-\alpha L|\}.\)
Following the idea presented in [18], our convergence result relies on the Small Gain Theorem [7], which we now briefly recall. Denote by \(\textbf{a}:=\{a^k\}\) an infinite sequence of vectors, \(a^k \in {\mathbb {R}}^d\) for \(k=0,1,\ldots\). For a fixed \(\lambda \in (0,1)\) we define
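This weighted quantity is, following the small-gain setup of [7, 18] and stated here for completeness under the assumption that the same definition is adopted, the exponentially weighted (truncated) maximum of the sequence:

```latex
% Weighted "norms" used in the Small Gain Theorem, as in [7, 18]:
% truncated and untruncated exponentially weighted maxima of a = {a^k}.
\Vert \mathbf{a}\Vert^{\lambda,K} \;=\; \max_{k=0,\dots,K} \frac{1}{\lambda^{k}}\,\Vert a^{k}\Vert,
\qquad
\Vert \mathbf{a}\Vert^{\lambda} \;=\; \sup_{k\ge 0} \frac{1}{\lambda^{k}}\,\Vert a^{k}\Vert.
```

Boundedness of \(\Vert \mathbf{a}\Vert^{\lambda}\) implies \(\Vert a^k\Vert \le \lambda^k \Vert \mathbf{a}\Vert^{\lambda}\), i.e., R-linear convergence of \(a^k\) to zero with factor \(\lambda\).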
Theorem 1
[7] Let \(\textbf{a} = \{a^k\}\) and \(\textbf{b} =\{b^k\}\) be two vector sequences, with \(a^k, b^k \in {\mathbb {R}}^d.\) If there exists \(\lambda \in (0,1)\) such that for all \(K = 0,1,\ldots ,\) the following inequalities hold
with \(\gamma _1\cdot \gamma _2 \in [0,1)\), then
and
We will use the following technical Lemma to show that the sequences \(\Vert \bar{q}^k\Vert\) and \(\Vert \tilde{x}^k\Vert\) satisfy the hypotheses of Theorem 1.
Lemma 3
Given \(b, \mu , L \ge 0\), \(\nu \in (0,1)\) and \(n,m\in {\mathbb {N}}\), where we denote by \({\mathbb {N}}\) the set of positive integers, there exist \(\lambda \in (0,1)\) and \(0< d_{min}<d_{max}\) such that the following conditions hold:
1. \(\nu <\lambda ^m;\)
2. \(\frac{d_{min}}{n}<\frac{2}{L};\)
3. \(1-\mu d_{min}+\Delta L<\lambda ;\)
4. \(\gamma \beta _2<1;\)
5. \(\beta _3<1;\)
6. \(\frac{\beta _5\gamma }{1-\beta _3}<1;\)
7. \(\frac{\beta _1+\gamma \beta _2}{1-\gamma \beta _2} \cdot \frac{\beta _4+\gamma \beta _5}{1-\beta _3-\gamma \beta _5}<1,\)
where
Proof
Take \(\lambda ^m>\nu\) and \(d_{min}<\frac{2n}{L}\) so that 1 and 2 hold. For \(d_{max}>d_{min}\) and close enough to \(d_{min}\) one can ensure that
holds. By the previous inequality, we have \(1-\mu d_{min}+\Delta L<1\) and therefore, for fixed \(d_{max}\) and \(d_{min}\), we can always take \(\lambda \in (0,1)\) such that 3 is satisfied and 1 still holds. Moreover, we can take \(d_{min}\) arbitrarily small and \(d_{max}\) arbitrarily close to \(d_{min}\) without violating conditions 1–3. Notice that \(C = \frac{\lambda (1-\lambda ^m)}{1-\lambda }\) is an increasing function of \(\lambda\).
Let us now consider condition 4 given by
The left hand side expression is an increasing function of \(\Delta\) and it is equal to 0 for \(\Delta =0.\) Therefore, taking \(d_{max}\) close enough to \(d_{min}\), condition 4 holds.
Condition 5 holds for \(d_{max}<\frac{\lambda ^m-\nu }{\lambda ^mLC}\).
Consider now condition 6,
The left hand side expression is an increasing function of \(d_{max}\) and taking \(d_{max}\) small enough we conclude that the previous inequality holds. Since we need \(d_{max}>d_{min}\), in order to be able to take \(d_{max}\) small, we need to take \(d_{min}\) small enough, but this can be done without violating the previous conditions.
By definition, \(\frac{\beta _1+\gamma \beta _2}{1-\gamma \beta _2}\) and \(\frac{\beta _4+\gamma \beta _5}{1-\beta _3-\gamma \beta _5}\) are also increasing functions of \(d_{max}\) and \(\Delta\). Thus, we can apply the same reasoning used for conditions 4 and 6 to make each of the two factors in condition 7 smaller than 1. In particular, we can take \(d_{min}\) and \(d_{max}\) such that condition 7 holds. \(\square\)
Theorem 2
Let \(B^k\) be defined as \(B^k = b W^k\) or \(B^k = bI\) for a positive constant b, or as \(B^k = 0.\) If Assumptions A1 and A2 hold, then there exist \(d_{min}<d_{max}\) such that the sequence \(\{x^k\}\) generated by (5) converges R-linearly to \(x^*.\)
Proof
Define \(\nu = \displaystyle \sup _{k=0:m-1}\nu _k<1\) where \(\nu _k,\ m\) are given in Assumption A2, and take \(\lambda \in (0,1), 0\le d_{min}<d_{max}\) given by Lemma 3. We prove that \(n^{1/2}\bar{q}^k\) and \(\tilde{x}^k\) satisfy inequalities (12), thus ensuring R-linear convergence by Theorem 1.
We have \(B^k=bI\) or \(B^k = bW^k\), in both cases, \(B^kx^* = bx^*\), therefore \((W^k-I)B^kx^* = 0\) and thus
For \(k\ge m-1\), using (5), the previous equality and (11), we get
and by (11), the definition of \(B^k\) and the fact that \(W^k\) is doubly stochastic, we get
Taking the norm in (14) and using the two previous inequalities, we have that for \(k\ge m-1\)
Notice that the above inequality also holds for the third case considered, i.e. for \(B^k = 0,\) taking \(b=0\). Multiplying by \(\frac{1}{\lambda ^{k+1}}\), taking the maximum for \(k=-1:\bar{k}-1\), and defining
we get
Since by condition 1 in Lemma 3 we have \(\nu <\lambda ^m\), reordering the terms in the previous inequality and using the fact that \(q^k = \tilde{x}^k+e\bar{q}^k\), we get
with
Let us now consider \(\bar{q}^k\).
Taking the norm, using the Lipschitz continuity of the gradient, and denoting \(\Delta = d_{max}-d_{min}\), we have
Now \(\frac{d_{min}}{n}<\frac{2}{L}\), thus Lemma 2 gives a bound for the first term in the right hand side of the last inequality, and we get
Multiplying by \(\frac{1}{\lambda ^{k+1}}\) and taking the maximum for \(k=-1:\bar{k}-1\) we get
By Lemma 3 we have \(\tau = 1-\mu d_{min}\) and \(\tau +\Delta L<\lambda\), thus reordering and using (15), we get
where \(\beta _1\) and \(\beta _2\) are defined in Lemma 3. Take
From 4 in Lemma 3 we get
Finally, let us consider \(\tilde{x}^k\). For \(k\ge m-1\), by definition of \(x^k, \tilde{u}^k\) and \(q^k,\) and Eq. (11) we have
Taking the norm, applying Lemma 1 and (11), we get
Multiplying by \(\frac{1}{\lambda ^{k+1}}\) and taking the maximum for \(k=-1:\bar{k}-1\) we get
where \(\tilde{\omega }_3 = \max _{k=-1:m-1} \left\{ \frac{1}{\lambda ^{k+1}}\Vert \tilde{x}^{k+1}\Vert \right\}\) and \(\beta _3, \beta _4, \beta _5\) are defined in Lemma 3. In particular, we have \(\beta _3<1\), and can rearrange the terms of the previous inequality to get
Now, applying (15) and 6 from Lemma 3, we obtain
with
We thus proved
with \(\gamma _2\gamma _3<1\) by condition 7 in Lemma 3. By the Small Gain Theorem, we have that \(\Vert \bar{q}^k\Vert\) and \(\Vert \tilde{x}^k\Vert\) converge to 0, and thus \(\Vert q^k\Vert\) converges to zero, which completes the proof. \(\square\)
The above theorem states the R-linear convergence of the method, and hence one might naturally ask what the convergence factor is and how it compares with similar methods, in particular with DIGing. The setting here is rather general, allowing for node-specific and iteration-specific step sizes on time-varying networks, and this generality of the encompassed methods and settings makes an analytical closed-form expression of the convergence factor unfeasible. One comparison between the convergence factors of DIGing and a method of the class considered here, under the restrictive assumption of a static network, is given in [13], Remark 5. Therein, it is shown that the class of methods considered here can have a better convergence factor than DIGing. Specifically, the favorable convergence factor is achieved in [13] for a method instance within the class when the parameter b is set differently from the choice that recovers DIGing. Although such a comparison is derived for a narrow class of problems, with fixed step-sizes and static networks, contrary to the more general setting of DIGing and the setting considered here, it serves as an indication for the comparison between DIGing and the methods considered here. Furthermore, the numerical experiments presented in Sect. 4 compare the method considered here with DIGing and show faster convergence and an increased degree of robustness of the method considered here.
4 Analytical and numerical studies of robustness of the methods
Theorem 2 and Lemma 3 ensure convergence of the considered class of methods. Namely, they establish the existence of bounds \(d_{min}<d_{max}\) such that the methods converge R-linearly under the given assumptions. However, they do not provide any information about the difference \(\Delta = d_{max}-d_{min}\), and thus about how much the steps employed by different nodes and at different iterations can differ. In this section we address this issue by investigating in practice the length of the interval of admissible step-sizes. First we show a particular example where the method converges without any upper bound \(d_{max}\); then we present a set of numerical results that show how the step bounds influence the convergence and the performance of the methods.
We consider the same framework as in [14] (Sect. 4.2) and we prove that, even if we allow the consensus matrix to change from iteration to iteration, the method converges. Consider the following objective function
and assume that at iteration k the consensus matrix is given by
Lemma 4
Assume that \(\theta _k\in (\frac{1}{3}, \frac{3}{4})\) for every k, and that \(\{x^k\}\) is the sequence generated by (5) with \(b=0\), \(e^Tx_0 = e^Ta\) and \(e^T(u_0+\nabla F(x_0)) = 0\).
If \(d^k_i = \alpha\) for every \(i=1,\dots ,n\) and for every k, then the method converges R-linearly to the solution of (17) if \(\alpha _{min}\le \alpha \le \frac{2}{3}\) for some sufficiently small \(\alpha _{min}>0\). On the other hand, for any \(\alpha >2\), there exists a sequence \(\{\theta _k\},\ k=0,1,2,\dots\) satisfying the assumptions of the Lemma such that the method diverges, i.e., \(\Vert x^k\Vert \rightarrow \infty\).
If \(d_i^k = (\sigma ^k_i)^{-1}\) with
with \(s^k_i = x^{k+1}_i-x^k_i\), \(\sigma _{min} = 0\), \(\sigma _{max} = 3/2\) and \(\sigma ^0_i = \sigma \in (\sigma _{min}, \sigma _{max})\) for every \(i=1,\dots ,n\), then \(\{x^k\}\) converges R-linearly to the solution of (17).
Proof
In the case we are considering, (5) is equivalent to
where \(D^k = diag(d^k_1I,\dots ,d^k_nI)\).
Let us consider the case with fixed step-size \(d^k_i = \alpha\) and let us denote with \(\xi ^k\) the vector \((q^k, z^k)\in {\mathbb {R}}^{2n}\). We can see that for every k we have \(\xi ^{k+1} = A_k\xi ^k\) where the matrix \(A_k\) is given by
In order to prove the first part of the Lemma, it is enough to show that there exists \(\mu <1\) such that \(\Vert A_k\Vert ^2_2<\mu\) for every iteration index k. That is, we have to prove that there exists \(\mu <1\) such that the spectral radius of \(A_k^TA_k\) is smaller than \(\mu\) for every k. Denoting by \(\{1, \lambda ^k_2,\dots , \lambda ^k_n\}\) the eigenvalues of \(W^k\), it can be proved that the eigenvalues of \(A_k^TA_k\) are given by the eigenvalues of the \(2\times 2\) matrices \(M^k_i\) defined as
for \(i=2,\dots ,n\). By direct computation we can see that the eigenvalues of \(M^k_1\) are given by 0 and \(2\alpha ^2-2\alpha +1<1-\frac{2}{3}\alpha _{min}\), and therefore it is enough to take \(\mu >1-\frac{2}{3}\alpha _{min}.\) Denoting by \(p^k_i(t)\) the characteristic polynomial of \(M^k_i\), we can see that, with the bounds on \(\theta _k\) and \(\alpha\) given by the assumptions, we can always find \(1-\frac{2}{3}\alpha _{min}<\mu <1\) such that \(p^k_i(\mu )>0\) and \(p^k_i(-\mu )>0\), and thus such that the eigenvalues of \(M^k_i\) belong to \((-\mu ,\mu )\) for every k and for every \(i=1,\dots ,n.\) To prove that if \(\alpha >2\) the method is in general not convergent, it is enough to consider the case where \(\theta _k = \theta _0\) for every iteration index k. In this case we have that \(A_k = A_0\) for every k and thus \(\xi ^{k}= A_0^k\xi ^0\). We can then see [14] that \(1-\alpha\) is an eigenvalue of \(A_0\) and therefore, if \(\alpha >2\), we have \(\rho (A_0)>1\), so the sequence \(\{\xi ^k\}\) does not converge. This concludes the first part of the proof.
Assume now that the step-sizes are computed as in (18). Proceeding as in the proof of Proposition 4.3 in [14], we can prove that \(\sigma ^{k+1}_i = \sigma ^{k+1}\) for every i, with \(\sigma ^{k+1}\) given by
where \(\hat{\sigma }^{k+1} =1+\theta _k+\theta _k\theta _{k-1}+\dots +\prod _{j=1}^k\theta _j+\sigma ^0\prod _{j=0}^k\theta _j\). Using the fact that \(\theta _k > 1/3\) and \(\sigma _{max} = 3/2\), we can prove that there exists \(\bar{k}\) such that \(\sigma ^k = \sigma _{max}\) for every \(k>\bar{k}.\) Therefore, for \(k>\bar{k}\) the step-size becomes the same for all nodes and equal to \(d^k_i = \sigma _{max}^{-1} =2/3\), and thus the method converges by the first part of the Lemma. \(\square\)
The above Lemma certifies convergence of the spectral-like method [14] for time-varying networks and a very specific problem structure, with an all-to-all communication network and consensus quadratic costs. It is worth noting that, for generic quadratic cost functions and sparse time-varying networks, an upper bound on the step-size is necessary (see Figs. 1 and 2 below). We now draw an analogy between the achieved results for the distributed spectral-like method [14] and the spectral (Barzilai-Borwein) gradient method from centralized optimization. Namely, in the centralized setting, the spectral gradient method’s convergence without step-size safeguarding has been proved only for strictly convex quadratic cost functions. For generic functions beyond strictly convex quadratics, some safeguards \(\Delta _{min}\) and \(\Delta _{max}\) are necessary even in the centralized case, though there they can be taken arbitrarily small (\(\Delta _{min}\)) and arbitrarily large (\(\Delta _{max}\)). Therefore, the need for safeguards is to be expected in the distributed optimization scenario as well, and this matches the results that we present here. It turns out that the price to be paid in the distributed time-varying networks scenario is two-fold: (1) the no-safeguards case occurs in a more restricted cost-function setting, namely the consensus quadratic costs (see Lemma 4); and (2) the safeguard step-size bounds in the general case are no longer arbitrary and take a network-dependent form.
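To complement the discussion, the following sketch simulates DIGing (the \(B^k=0\), constant coordinated step-size member of the class) in its standard form [18, 21] on a consensus problem of the type (17), \(f_i(y)=\frac{1}{2}(y-a_i)^2\), whose solution is the average of the \(a_i\). The alternating pair of matching-based weight matrices and the data are assumed examples, chosen so that the network sequence is only jointly connected, not connected at every iteration.

```python
import numpy as np

# Illustrative DIGing simulation on the consensus problem
# f_i(y) = 0.5 (y - a_i)^2 over a time-varying, jointly connected network:
#   x^{k+1} = W^k x^k - alpha u^k
#   u^{k+1} = W^k u^k + grad F(x^{k+1}) - grad F(x^k)
# The alternating matchings below are an assumed example, not from the paper.

rng = np.random.default_rng(2)
n = 6
a = rng.standard_normal(n)

def matching_weights(edges):
    # symmetric doubly stochastic matrix supported on a matching
    W = np.eye(n)
    for i, j in edges:
        W[i, j] = W[j, i] = 0.25
        W[i, i] -= 0.25
        W[j, j] -= 0.25
    return W

W_even = matching_weights([(0, 1), (2, 3), (4, 5)])
W_odd = matching_weights([(1, 2), (3, 4), (5, 0)])  # union of graphs = ring

grad = lambda x: x - a       # local gradients of f_i(y) = 0.5 (y - a_i)^2
x = np.zeros(n)
u = grad(x)                  # gradient-tracking initialization u^0 = grad F(x^0)
alpha = 0.05                 # small constant coordinated step size
for k in range(4000):
    W = W_even if k % 2 == 0 else W_odd
    x_new = W @ x - alpha * u
    u = W @ u + grad(x_new) - grad(x)
    x = x_new
print(np.max(np.abs(x - a.mean())))   # all nodes close to the average of a
```

Neither graph in the alternating pair is connected on its own; convergence here relies precisely on joint connectivity over a window of m = 2 iterations, as in Assumption A2.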
We also have the following Lemma where we continue to assume the consensus problem but relax the requirement that the network is fully connected at all times. When the network is not fully connected, in general we need safeguarding for global convergence. However, as explained below, the following Lemma sheds some light on the behavior of the spectral-like distributed method. While it is not to be considered as a global convergence result, it highlights that the next step size has a controlled length provided that the current solution estimate is close to consensus.
Lemma 5
Let us assume that the objective function is given by (17), and that \(x^0,z^0\) are such that \(e^Tx^0 =e^Ta\), \(z^0_i = \nabla f(x^0_i) = x^0_i\). Moreover, for every \(i=1,\dots ,n\) let the local stepsize \(d^k_i\) be defined as \(d^0_i =d^0>0\) and, for every \(k\ge 1\), \(d^k_i = 1/\sigma ^k_i\), with
where \(s^k_j = x^{k+1}_j-x^k_j\). Moreover, let us assume that at each iteration Assumption A2 holds with \(m=1\).
Given any \(\hat{d}>1\), if \(\Vert x^0-e\bar{x}^0\Vert \le \hat{\varepsilon }\) with \(\hat{\varepsilon }\) satisfying
then \(d^1_i\le \hat{d}\) for every \(i=1,\dots ,n\).
Proof
Let us denote by \(J\in {\mathbb {R}}^{n\times n}\) the matrix \(\frac{1}{n}ee^T\) and set \(v^k = s^k-e\bar{s}^k\). From the assumptions and the double stochasticity of the matrix \(W^k\), we have
where we define \(\varepsilon = (\nu _0+d^0)\hat{\varepsilon }\). Moreover,
These imply that, for every \(j=1,\dots ,n\)
Replacing these bounds in (20), and defining \(\sigma ^0 =1/d^0\), we get
It is easy to see that the first inequality, together with the assumption on \(\hat{\varepsilon }\), implies \(\sigma ^1_i\ge 1/\hat{d}\), which in turn implies the thesis. \(\square\)
Intuitively, the Lemma above says that, for the considered problem, if algorithm (5) with step size (20) starts from a point close to consensus (i.e., a point where the solution estimates across different nodes are mutually close), then the next step size at each node will not be too large. More precisely, the size of the next step size is controlled by the consensus neighborhood \(\hat{\varepsilon }\) that we start from. In other words, if the next step size is to be upper bounded by an arbitrary constant \(\hat{d}>1\), we can find a problem-dependent constant \(\hat{\varepsilon }\) such that, starting at most \(\hat{\varepsilon }\) away from consensus, the next step size at each node is at most \(\hat{d}\). To further explain this, suppose that all the ratios \(s_j^k/s_i^k\) are \(\epsilon ^\prime\)-close to one, i.e., \(|s_j^k/s_i^k| \in (1-\epsilon ^\prime , 1+\epsilon ^\prime )\), for all nodes i, j. Then, in view of (21), the quantity \(\sigma _i^{k+1}\), for all nodes i, is approximated as:
In other words, for the special case of the consensus problem, provided that all the ratios \(s_j^k/s_i^k\) are \(\epsilon ^\prime\)-close to one, the next step-size \(1/\sigma _i^{k+1}\) lies in a neighborhood of one and is hence bounded.
We now present some numerical results. We consider the problem of minimizing a logistic loss function with \(l_2\) regularization, that is, we assume the local objective function \(f_{i}\) at node i is given by
where \(a_i\in {\mathbb {R}}^d,\ b_i\in \{-1,1\}\) and \(R>0.\) We compare three different choices of the matrix B in (5) and three different definitions of the step-sizes \(d^k_i\), resulting in nine methods. For increasing values of \(d_{max}\) we run each method on the given problem and plot in Fig. 1 the number of iterations needed to reach convergence.
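A minimal sketch of such a local objective and its gradient follows; the exact form \(\ln (1+e^{-b_i a_i^Ty}) + R\Vert y\Vert ^2\), with one sample per node, is our assumption based on the description above.

```python
import numpy as np

def local_loss(y, a_i, b_i, R):
    """Node-i logistic loss with l2 regularization (assumed form:
    ln(1 + exp(-b_i * a_i^T y)) + R * ||y||^2)."""
    return np.log1p(np.exp(-b_i * (a_i @ y))) + R * (y @ y)

def local_grad(y, a_i, b_i, R):
    """Gradient of local_loss with respect to y."""
    t = b_i * (a_i @ y)
    # d/dy ln(1 + exp(-t)) = -b_i * a_i / (1 + exp(t))
    return -b_i * a_i / (1.0 + np.exp(t)) + 2.0 * R * y
```

The loss is strongly convex for any \(R>0\), which places the problem within the setting of the convergence analysis.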
The methods are specified as follows. The convergence analysis carried out in Sect. 3 does not rely on any particular definition of the step-sizes \(d^k_i;\) therefore, we need to specify how each node chooses its step-size at each iteration. We consider two cases here. The first, referred to as spectral in Fig. 1, is the case where \(d^k_i = (\sigma ^k_i)^{-1}\) with \(\sigma ^k_i\) as in (18). The second is the case where each node performs a local line search, employing a backtracking strategy that starts at \(d_{max}\) to satisfy the classical Armijo condition on the local objective function. That is, to satisfy
with \(c = 10^{-3}\) and \(z^k_i = u^k_i+\nabla f_i(x^k_i).\) We refer to this method as line search. It is worth noting that there are no convergence guarantees for the line search method; the rationale for including it in the comparison is to show that the method [14] exhibits a significantly higher degree of robustness with respect to a meaningful time-varying and node-varying local step-size strategy that can be employed. For comparison, we also consider the method with fixed step-size \(d^k_i = d_{max}\) for every k and every \(i=1,\dots ,n.\) The choices of the matrix \(B^k\) are \(B^k=0\) (plot (a) in Fig. 1), \(B^k=d_{max}^{-1}I\) (plot (b)) and \(B^k = d_{max}^{-1}{\mathcal {W}}^k\) (plot (c)), where for the case \(B\ne 0\) the choice is made following [13]. Notice that the case \(d_i^k=d_{max}\) and \(B^k=0\) corresponds to [18, 21] with constant, coordinated step-sizes. We consider increasing values of \(d_{max}\) in \([\frac{1}{50L}, \frac{10}{L}]\), while we fix \(d_{min}=10^{-8}\) since, in the considered framework, we observed that its choice does not significantly influence the performance of the methods.
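The node-local backtracking strategy can be sketched as follows. The sufficient-decrease condition and the start at \(d_{max}\) follow the description above; the search direction \(-z^k_i\) and the halving factor are our assumptions for the sketch.

```python
import numpy as np

def armijo_backtracking(f_i, grad_i, x, z, d_max, c=1e-3, beta=0.5, max_halvings=50):
    """Local backtracking line search starting at d_max.

    Returns the first step d (halved at most max_halvings times) with
        f_i(x - d*z) <= f_i(x) - c * d * grad_i(x)^T z,
    a node-local Armijo-type rule on the local objective f_i.
    """
    fx = f_i(x)
    gx = grad_i(x)
    d = d_max
    for _ in range(max_halvings):
        if f_i(x - d * z) <= fx - c * d * (gx @ z):
            return d
        d *= beta                # backtrack: halve the trial step
    return d
```

Each node runs this independently on its own \(f_i\), so the resulting step-sizes are both time-varying and uncoordinated across nodes.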
In Fig. 1 we plot the results in the case where the underlying network is symmetric and time-varying, defined as follows: we consider an undirected, connected network G with \(n=25\) nodes, generated as a random geometric graph with communication radius \(\sqrt{n^{-1}\ln (n)}\), and we define the sequence of networks \(\{G^k\}\) by deleting each edge independently with probability \(\frac{1}{4}.\) We carried out analogous tests in the cases where G is symmetric and constant and where it is given by a directed ring; the obtained results were comparable to those presented here. We also observed in practice that double stochasticity of the consensus matrices appears to be essential for the convergence of the considered methods.
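The construction of the time-varying network sequence can be sketched with the standard library as follows; placing the nodes uniformly in the unit square is the usual convention for random geometric graphs and is an assumption here.

```python
import math
import random

def random_geometric_graph(n, radius, rng):
    """Undirected random geometric graph: n points uniform in the unit
    square, with an edge (i, j) iff the points are within radius."""
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(pts[i], pts[j]) <= radius:
                edges.add((i, j))
    return edges

def thin_edges(edges, p_drop, rng):
    """One realization G^k: each edge of G deleted independently with
    probability p_drop (1/4 in the experiments)."""
    return {e for e in edges if rng.random() >= p_drop}

rng = random.Random(0)
n = 25
G = random_geometric_graph(n, math.sqrt(math.log(n) / n), rng)
Gk = thin_edges(G, 0.25, rng)   # the network used at iteration k
```

Note that a single realization \(G^k\) need not be connected; connectivity is required only jointly over bounded intervals, which matches Assumption A2.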
We set the dimension d equal to 10 and generate the quantities involved in the definition of the local objective functions (22) as follows. For \(i = 1,\dots ,n\) we define \(a_i = (a_{i1},\dots ,a_{i,d-1},1)^T\), where the components \(a_{ij}\) are independent draws from the standard normal distribution, and \(b_i = sign(a_i^Ty^*+\epsilon _i)\), where \(y^*\in {\mathbb {R}}^d\) has independent components drawn from the standard normal distribution and the \(\epsilon _i\) are drawn from the normal distribution with mean 0 and standard deviation 0.4. Finally, we take the regularization parameter \(R = 0.25.\) The initial vectors \(x^0_i\) are generated independently, with components drawn from the uniform distribution on [0, 1], and at each iteration we define the consensus matrix \(W^k\) as the Metropolis matrix [31].
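The Metropolis matrix [31] admits a simple closed form; a sketch for an undirected graph follows, using the standard rule \(W_{ij} = 1/(1+\max (\deg _i,\deg _j))\) on edges with diagonal entries filling each row to one.

```python
def metropolis_weights(n, edges):
    """Metropolis weight matrix of an undirected graph given as a list
    of edges (i, j) with i < j. The result is symmetric and doubly
    stochastic, with positive diagonal entries."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i][j] = W[j][i] = w
    for i in range(n):
        W[i][i] = 1.0 - sum(W[i])   # fill the row sum to one
    return W
```

Since the rule uses only the degrees of a node and its neighbors, each node can assemble its own row of \(W^k\) from local information, which is what makes this choice natural in the distributed, time-varying setting.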
We are interested in the number of iterations required by each method to reach a prescribed accuracy. More precisely, we evaluate the iteration number \(\bar{k}\) at which \(\max _{i=1,...,n}\Vert x_i^{\bar{k}}-y^*\Vert < \varepsilon\), where \(\varepsilon =10^{-5}\). In Fig. 1, on the x-axis we show the upper bound \(d_{max}\) while on the y-axis we show \(\bar{k}\) for each method. To facilitate the comparison among the methods, in Fig. 2 we plot the same results, with y-axis cut at 2000.
We can see from Fig. 1 that, for all considered choices of the matrix B, the spectral method allows for a maximum step-size at least 10 times larger than that of the method with fixed step-size, while line search allows for a maximum step-size equal to 2 and 3 times that allowed by the fixed step-size method, for B equal to 0 and to bI or \(b{{\mathcal {W}}}\), respectively. Moreover, we can see that choosing \(B = bI\) seems to increase the maximum value of \(d_{max}\) that yields convergence for all the considered methods. Finally, in Fig. 2, we can notice that for most of the tested values of \(d_{max}\) the spectral method requires a smaller number of iterations than the method with fixed step-size. That is, in the considered framework, using the uncoordinated time-varying step-sizes of [14] significantly improves both the robustness and the performance of the method. Notice also that the spectral step-size strategy exhibits a “stable”, practically unchanged, performance for a wide range of \(d_{max}\); hence, it is not sensitive to the tuning of \(d_{max}\). This is in contrast with the constant step-size strategy, which is very sensitive to the choice of \(d_{max}\). It is also worth noting that Theorem 2 requires a conservative upper bound on the step-size \(d_{max}\) and a conservative upper bound on the step-size differences \(\Delta\), both of which depend on multiple global system parameters (Lemma 3). However, the simulations presented here and other extensive numerical studies suggest that an a priori upper bound on \(\Delta\) is not required for convergence. In addition, \(d_{min}\) can be set to a small value independent of the system parameters, e.g., \(d_{min}=10^{-8}\), and setting \(d_{max}\) requires only a coarse upper bound on the quantity 1/L.
5 Conclusions
We proved that a class of distributed first-order methods, including those proposed in [13, 14], is robust to time-varying and uncoordinated step-sizes and to time-varying weight-balanced digraphs, wherein connectedness of the network at each iteration is not required, unlike, e.g., in the recent work [28]. The achieved results provide a solid improvement in the understanding of the robustness of exact distributed first-order methods to time-varying networks and uncoordinated time-varying step-sizes. Most notably, we show that the unification strategy in [13] and the spectral-like step-size selection strategy in [14], as well as combinations thereof, exhibit a high degree of robustness. This paper considers weight-balanced directed networks. Extensions to weight-imbalanced networks require redefining the algorithmic class and the respective analysis, and represent an interesting future research direction.
Notes
That is, we assume that it is possible to define a doubly stochastic consensus matrix associated with each of the networks.
To see this, first note that, clearly, matrix \(W_k^m\) is doubly stochastic. Furthermore, it is easy to show (e.g., by induction) that, for any \((i,j) \in E_k^m\), and for any \(i=1,...,n\), we have that \([W_k^m]_{ij} \ge \underline{w}^m\). This means that \(W_k^m\) is a doubly stochastic matrix with positive diagonal entries, and, moreover, all its off-diagonal entries at the positions that correspond to links of \(G_k^m\) are strictly positive. Using standard arguments on doubly stochastic matrices, this implies that \(\nu _{\textrm{max}}\left( W_k^m-\frac{1}{n}ee^T\right) < 1\).
References
Alghunaim, S.A., Ryu, E.K., Yuan, K., Sayed, A.H.: Decentralized proximal gradient algorithms with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)
Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1997)
Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Numer. Anal. 8(1), 141–148 (1988)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Dai, Y.H., Huang, Y., Liu, X.W.: A family of spectral gradient methods for optimization. Comput. Optim. Appl. 74(1), 43–65 (2019)
Dai, Y.H., Liao, L.Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Numer. Anal. 22(1), 1–10 (2002)
Desoer, C., Vidyasagar, M.: Feedback Systems: Input–Output Properties. SIAM, Philadelphia (2009)
Di Lorenzo, P., Scutari, G.: NEXT: in-network nonconvex optimization. IEEE Trans. Signal Inf. Process. Netw. 2(2), 120–136 (2016)
Di Serafino, D., Ruggiero, V., Toraldo, G., Zanni, L.: On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput. 318, 176–195 (2018)
Jakovetić, D., Xavier, J., Moura, J.M.F.: Fast distributed gradient methods. IEEE Trans. Autom. Control 59(5), 1131–1146 (2014)
Jakovetić, D., Moura, J.M.F., Xavier, J.: Distributed Nesterov-like gradient algorithms. In: CDC’12, 51st IEEE conference on decision and control, pp 5459–5464 (2012)
Jakovetić, D., Bajović, D., Krejić, N., Krklec Jerinkić, N.: Newton-like method with diagonal correction for distributed optimization. SIAM J. Optim. 27(2), 1171–1203 (2017)
Jakovetić, D.: A unification and generalization of exact distributed first-order methods. IEEE Trans. Signal Inf. Process. Netw. 5(1), 31–46 (2019)
Jakovetić, D., Krejić, N., Krklec Jerinkić, N.: Exact spectral-like gradient method for distributed optimization. Comput. Optim. Appl. 74, 703–728 (2019)
Kar, S., Moura, J.M.F., Ramanan, K.: Distributed parameter estimation in sensor networks: nonlinear observation models and imperfect communication. IEEE Trans. Inf. Theory 58(6), 3575–3605 (2012)
Li, J., Li, G., Wu, Z., Wu, C.: Stochastic mirror descent method for distributed multi-agent optimization. Optim. Lett. 12(6), 1179–1197 (2018)
Mota, J., Xavier, J., Aguiar, P., Püschel, M.: Distributed optimization with local domains: applications in MPC and network flows. IEEE Trans. Autom. Control 60(7), 2004–2009 (2015)
Nedic, A., Olshevsky, A., Shi, W.: Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 27(4), 2597–2633 (2017)
Nedic, A., Olshevsky, A., Shi, W., Uribe, C.A.: Geometrically convergent distributed optimization with uncoordinated step sizes. In: American Control Conference, pp. 3950–3955 (2017)
Patrascu, A., Necoara, I., Tran-Dinh, Q.: Adaptive inexact fast augmented Lagrangian methods for constrained convex optimization. Optim. Lett. 11, 609–626 (2017)
Qu, G., Li, N.: Harnessing smoothness to accelerate distributed optimization. In: IEEE 55th Conference on Decision and Control, pp. 159–166 (2016)
Raydan, M.: On the Barzilai and Borwein choice of steplength for the gradient method. IMA J. Numer. Anal. 13(3), 321–326 (1993)
Raydan, M.: The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J. Optim. 7(1), 26–33 (1997)
Saadatniaki, F., Xin, R., Khan, U.A.: Decentralized optimization over time-varying directed graphs with row and column-stochastic matrices. IEEE Trans. Autom. Control 65, 4769–4780 (2018)
Scutari, G., Sun, Y.: Distributed nonconvex constrained optimization over time-varying digraphs. Math. Program. 176(1–2), 497–544 (2019)
Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: an exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)
Sun, Y., Daneshmand, A., Scutari, G.: Convergence rate of distributed optimization algorithms based on gradient tracking. arXiv preprint arXiv:1905.02637 (2019)
Sundararajan, A., Van Scoy, B., Lessard, L.: Analysis and design of first-order distributed optimization algorithms over time-varying graphs. arXiv preprint arXiv:1907.05448 (2019)
Tian, Y., Sun, Y., Scutari, G.: Achieving linear convergence in distributed asynchronous multi-agent optimization. IEEE Trans. Autom. Control (2020)
Tian, Y., Sun, Y., Scutari, G.: Asynchronous decentralized successive convex approximation. arXiv preprint arXiv:1909.10144 (2020)
Xiao, L., Boyd, S., Lall, S.: Distributed average consensus with time-varying Metropolis weights. Automatica (2006)
Xin, R., Khan, U.A.: Distributed heavy-ball: a generalization and acceleration of first-order methods with gradient tracking. IEEE Trans. Autom. Control. 26, 2627–2633 (2018)
Xin, R., Xi, C., Khan, U.A.: FROST—fast row-stochastic optimization with uncoordinated step-sizes. EURASIP J. Adv. Signal Process. Special issue on optimization, learning, and adaptation over networks, 1 (2019)
Xu, J., Tian, Y., Sun, Y., Scutari, G.: Distributed algorithms for composite optimization: unified framework and convergence analysis. arXiv preprint arXiv:2002.11534 (2020)
Xu, J., Zhu, S., Soh, Y.C., Xie, L.: Augmented distributed gradient methods for multi-agent optimization under uncoordinated constant step sizes. In: IEEE Conference on Decision and Control, pp. 2055–2060 (2015)
Yuan, K., Ying, B., Zhao, X., Sayed, A.H.: Exact diffusion for distributed optimization and learning—Part I: algorithm development. IEEE Trans. Sig. Proc. 67(3), 708–723 (2019)
Acknowledgements
This work is supported by the European Union’s Horizon 2020 programme under the Marie Skłodowska-Curie Grant Agreement No. 812912. The work of Jakovetić and Krejić is partially supported by Serbian Ministry of Education, Science and Technological Development grant.
Funding
Open access funding provided by Università degli Studi di Firenze within the CRUI-CARE Agreement.
Cite this article
Malaspina, G., Jakovetić, D. & Krejić, N. Linear convergence rate analysis of a class of exact first-order distributed methods for weight-balanced time-varying networks and uncoordinated step sizes. Optim Lett 18, 825–846 (2024). https://doi.org/10.1007/s11590-023-02011-x