Abstract
A key question in many low-rank problems throughout optimization, machine learning, and statistics is to characterize the convex hulls of simple low-rank sets and judiciously apply these convex hulls to obtain strong yet computationally tractable relaxations. We invoke the matrix perspective function—the matrix analog of the perspective function—to characterize explicitly the convex hull of epigraphs of simple matrix convex functions under low-rank constraints. Further, we combine the matrix perspective function with orthogonal projection matrices—the matrix analog of binary variables which capture the row-space of a matrix—to develop a matrix perspective reformulation technique that reliably obtains strong relaxations for a variety of low-rank problems, including reduced rank regression, non-negative matrix factorization, and factor analysis. Moreover, we establish that these relaxations can be modeled via semidefinite constraints and thus optimized over tractably. The proposed approach parallels and generalizes the perspective reformulation technique in mixed-integer optimization and leads to new relaxations for a broad class of problems.
1 Introduction
Over the past decade, a considerable amount of attention has been devoted to finding high-quality solutions to low-rank optimization problems, resulting in theoretically and practically efficient algorithms for problems as disparate as matrix completion, reduced rank regression, and computer vision. In spite of these advances, comparatively little progress has been made on developing strong lower bounds for low-rank problems. Accordingly, this paper proposes a procedure for obtaining novel and strong lower bounds.
We consider the following low-rank optimization problem:
where \(\varvec{C}, \varvec{A}_{1}, \ldots \varvec{A}_m \in {\mathcal {S}}^n\) are \(n \times n\) symmetric matrices, \(b_1, \ldots b_m \in \mathbb {R}\) are scalars, [n] denotes the set of running indices \(\{1, ..., n\}\), \({\mathcal {S}}^n_+\) denotes the \(n \times n\) positive semidefinite cone, and \(\mu \in \mathbb {R}_+, k \in \mathbb {N}\) are parameters which control the complexity of \(\varvec{X}\) by respectively penalizing and constraining its rank. The set \({\mathcal {K}}\) is a proper cone, i.e., closed, convex, solid and pointed (c.f. [15, Section 2.4.1]), and \(\Omega (\varvec{X})=\textrm{tr}(f(\varvec{X}))\) for some matrix convex function f; see formal definitions and assumptions in Sect. 3.
For optimization problems with logical constraints, strong relaxations can be obtained by formulating them as mixed-integer optimization (MIO) problems and applying the so-called perspective reformulation technique (see [37, 42]). In this paper, we develop a matrix analog of the perspective reformulation technique to obtain strong yet computationally tractable relaxations of low-rank optimization problems of the form (1).
1.1 Motivating example
In this section, we illustrate our results on a statistical learning example. To emphasize the analogy with the perspective reformulation technique in MIO, we first consider the best subset selection problem and review its perspective relaxations. We then consider a reduced-rank regression problem—the rank-analog of best subset selection—and provide new relaxations that naturally arise from our Matrix Perspective Reformulation Technique (MPRT).
Best Subset Selection: Given a data matrix \(\varvec{X} \in \mathbb {R}^{n \times p}\) and a response vector \(\varvec{y} \in \mathbb {R}^{n}\), the \(\ell _0-\ell _2\) regularized best subset selection problem is to solve (c.f. [3, 6, 7, 9, 65, 78]):
where \(\mu , \gamma >0\) are parameters which control \(\varvec{w}\)’s sparsity and sensitivity to noise respectively.
Early attempts at solving Problem (2) exactly relied upon weak implicit or big-M formulations of logical constraints which supply low-quality relaxations and therefore do not scale well (see [14, 44], for discussions). However, very similar algorithms now solve these problems to certifiable optimality with millions of features. Perhaps the key ingredient in modernizing these (previously inefficient) algorithms was invoking the perspective reformulation technique—a technique for obtaining high-quality convex relaxations of non-convex sets—first stated in Stubbs’ PhD thesis [73] (see also [21, 74]) and popularized by [1, 37, 42] among others.
Relaxation via the Perspective Reformulation Technique: By applying the perspective reformulation technique [1, 37, 42] to the term \(\mu \Vert \varvec{w}\Vert _0+\frac{1}{2\gamma } \Vert \varvec{w}\Vert _2^2\), we obtain the following reformulation:
where \(\varvec{e}\) denotes a vector of all ones of appropriate dimension.
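As an illustration, the following is a minimal cvxpy sketch of this perspective relaxation, assuming the standard form in which the regularizer becomes \(\mu \varvec{e}^\top \varvec{z} + \frac{1}{2\gamma }\sum _i w_i^2/z_i\) with \(\varvec{z} \in [0,1]^p\); the data and parameter values below are hypothetical:

```python
import cvxpy as cp
import numpy as np

# Hypothetical data and parameters (for illustration only)
n, p = 100, 20
rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
mu, gamma = 0.5, 1.0

w = cp.Variable(p)
z = cp.Variable(p)  # interval relaxation of the binary indicators

# Perspective terms w_i^2 / z_i, each second-order cone representable
persp = sum(cp.quad_over_lin(w[i], z[i]) for i in range(p))
objective = cp.Minimize(0.5 * cp.sum_squares(y - X @ w)
                        + mu * cp.sum(z)
                        + persp / (2 * gamma))
prob = cp.Problem(objective, [z >= 0, z <= 1])
prob.solve()
print("Perspective relaxation lower bound:", prob.value)
```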
Interestingly, this formulation can be represented using second-order cones [42, 65] and optimized over efficiently using projected subgradient descent [9]. Moreover, it reliably supplies near-exact relaxations for most practically relevant cases of best subset selection [6, 65]. In instances where it is not already tight, one can apply a refinement of the perspective reformulation technique to the term \(\Vert \varvec{y}-\varvec{X}\varvec{w}\Vert _2^2\) and thereby obtain the following (tighter yet more expensive) relaxation [26]:
Recently, a class of even tighter relaxations was developed by Atamtürk and Gómez [3, 39, 43]. As these relaxations were developed by considering multiple binary variables simultaneously and therefore do not, to our knowledge, generalize readily to the low-rank case (where we often have a single low-rank matrix), we do not discuss (nor generalize) them here.
Reduced Rank Regression: Given m observations of a response vector \(\varvec{Y}_j \in \mathbb {R}^n\) and a predictor \(\varvec{X}_j \in \mathbb {R}^p\), an important problem in high-dimensional statistics is to recover a low-complexity model which relates \(\varvec{X}\) and \(\varvec{Y}\). A popular choice for doing so is to assume that \(\varvec{X}, \varvec{Y}\) are related via \(\varvec{Y}=\varvec{X}\varvec{\beta }+\varvec{E}\), where \(\varvec{\beta } \in \mathbb {R}^{p \times n}\) is a coefficient matrix which we assume to be low-rank, \(\varvec{E}\) is a matrix of noise, and we require that the rank of \(\varvec{\beta }\) is small so that the linear model is parsimonious [57]. Introducing Frobenius regularization gives rise to the problem:
where \(\gamma , \mu >0\) control the robustness to noise and the complexity of the estimator respectively, and we normalize the ordinary least squares loss by dividing by m, the number of observations.
Existing attempts at solving this problem generally involve replacing the low-rank term with a nuclear norm term [57], which succeeds under some strong assumptions on the problem data but not in general. Recently, we proposed a new framework to model rank constraints, using orthogonal projection matrices which satisfy \(\varvec{Y}^2=\varvec{Y}\) instead of binary variables which satisfy \(z^2=z\) [11]. Building on this work, in this paper we propose a generalization of the perspective function to matrix-valued functions with positive semidefinite arguments and develop a matrix analog of the perspective reformulation technique from MIO which uses projection matrices instead of binary variables.
Relaxations via the Matrix Perspective Reformulation Technique: By applying the matrix perspective reformulation technique (Theorem 1) to the term \({\frac{1}{2\gamma }}\Vert \varvec{\beta }\Vert _F^2+\mu \cdot \textrm{Rank}(\varvec{\beta })\), we will prove that the following problem is a valid—and numerically high-quality—relaxation of (5):
The analogy between Problems (2) and (5) and their relaxations (3) and (6) is striking. The goal of the present paper is to develop the corresponding theory to support and derive the relaxation (6). Interestingly, the main argument that led [26] to the improved relaxation (4) for (2) can be extended to reduced-rank regression. Combined with our MPRT, it leads to the relaxation:
It is not too hard to see that this is a valid semidefinite relaxation: if \(\varvec{W}\) is a rank-k projection matrix then, by the Schur complement lemma [16, Equation 2.41], \(\varvec{\beta }=\varvec{\beta }\varvec{W}\), and thus the rank of \(\varvec{\beta }\) is at most k. Moreover, if we let \(\varvec{B}=\varvec{\beta }\varvec{\beta }^\top \) in a solution, we recover a low-rank solution to the original problem. Actually, as we show in Sect. 3.3, a similar technique can be applied to any instance of Problem (1), which has numerous applications beyond matrix regression.
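To make this concrete, here is a minimal cvxpy sketch of a relaxation in the spirit of (6), assuming the Schur-complement structure described above, with \(\varvec{W}\) a relaxed projection matrix acting on the row space of \(\varvec{\beta }\) and \(\varvec{B}\) modeling \(\varvec{\beta }\varvec{\beta }^\top \); the data, dimensions and parameter values are hypothetical:

```python
import cvxpy as cp
import numpy as np

# Hypothetical data: m observations, p predictors, n responses
m, p, n = 50, 10, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((m, p))
Y = rng.standard_normal((m, n))
mu, gamma = 1.0, 10.0

beta = cp.Variable((p, n))
W = cp.Variable((n, n), symmetric=True)  # relaxed projection matrix, 0 <= W <= I
B = cp.Variable((p, p), symmetric=True)  # models beta @ beta.T

constraints = [
    cp.bmat([[W, beta.T], [beta, B]]) >> 0,  # Schur complement: B >= beta W^+ beta^T
    W >> 0,
    np.eye(n) - W >> 0,
]
objective = cp.Minimize(cp.sum_squares(Y - X @ beta) / (2 * m)
                        + mu * cp.trace(W)
                        + cp.trace(B) / (2 * gamma))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)
print("Lower bound on the reduced rank regression problem:", prob.value)
```

At a projection matrix \(\varvec{W}\), the Schur complement forces \(\varvec{\beta }=\varvec{\beta }\varvec{W}\) and \(\textrm{tr}(\varvec{B})\ge \Vert \varvec{\beta }\Vert _F^2\), so the optimal value of this problem is a valid lower bound on (5).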
1.2 Literature review
Three classes of approaches have been proposed for solving Problem (1): (a) heuristics, which prioritize computational efficiency and obtain typically high-quality solutions to low-rank problems efficiently but without optimality guarantees (see [59], for a review); (b) relax-and-round approaches, which balance computational efficiency and accuracy concerns by relaxing the rank constraint and rounding a solution to the relaxation to obtain a provably near-optimal low-rank matrix [11, Section 1.2.1]; and (c) exact approaches, which prioritize accuracy over computational efficiency and solve Problem (1) exactly in exponential time [11, Section 1.2.1].
Of the three classes of approaches, heuristics currently dominate the literature, because their superior runtime and memory usage allows them to address larger-scale problems. However, recent advances in algorithmic theory and computational power have drastically improved the scalability of exact and approximate methods, to the point where they can now solve moderately sized problems which are relevant in practice [11]. Moreover, strong exact formulations often give rise to very efficient heuristics (obtained by rounding their tight relaxations) which outperform existing heuristics. This suggests that heuristic approaches may not maintain their dominance going forward, and motivates the exploration of tight yet affordable relaxations of low-rank problems.
1.3 Contributions and structure
The main contributions of this paper are twofold. First, we propose a general reformulation technique for obtaining high-quality relaxations of low-rank optimization problems: introducing an orthogonal projection matrix to model a low-rank constraint, and strengthening the formulation by taking the matrix perspective of an appropriate substructure of the problem. This technique can be viewed as a generalization of the perspective reformulation technique for obtaining strong relaxations of sparse or logically constrained problems [10, 37, 42, 43]. Second, by applying this technique, we obtain explicit characterizations of convex hulls of low-rank sets which frequently arise in low-rank problems. As the interplay between convex hulls of indicator sets and perspective functions has given rise to algorithms which outperform state-of-the-art heuristics in sparse linear regression [6, 44] and sparse portfolio selection [10, 80], we hope that this work will empower similar developments for low-rank problems.
The rest of the paper is structured as follows: In Sect. 2 we supply some background on perspective functions and review their role in developing tight formulations of mixed-integer problems. In Sect. 3, we introduce the matrix perspective function and its properties, extend the function’s definition to allow semidefinite in addition to positive definite arguments, and propose a matrix perspective reformulation technique (MPRT) which successfully obtains high-quality relaxations for low-rank problems which commonly arise in the literature. We also connect the matrix perspective function to the convex hulls of epigraphs of simple matrix convex functions under rank constraints. In Sect. 4, we illustrate the utility of this connection by deriving tighter relaxations of several low-rank problems than are currently available in the literature. Finally, in Sect. 5, we numerically verify the utility of our approach on reduced rank regression, D-optimal design and non-negative matrix factorization problems.
Notation: We let nonbold face characters such as b denote scalars, lowercase bold faced characters such as \(\varvec{x}\) denote vectors, uppercase bold faced characters such as \(\varvec{X}\) denote matrices, and calligraphic uppercase characters such as \({\mathcal {Z}}\) denote sets. We let \(\mathbb {N}\) denote the set of positive integers. We let \(\textbf{e}\) denote a vector of all 1’s, \(\varvec{0}\) denote a vector of all 0’s, and \(\mathbb {I}\) denote the identity matrix. We let \({\mathcal {S}}^n_{+} \cap \mathbb {R}^{n \times n}_{+}\) denote the cone of \(n \times n\) doubly non-negative matrices, and \({\mathcal {C}}^n_+:=\{\varvec{U}\varvec{U}^\top : \varvec{U} \in \mathbb {R}^{n \times n}_{+}\}\) denote the cone of \(n \times n\) completely positive matrices. Finally, we let \(\varvec{X}^\dag \) denote the Moore-Penrose pseudoinverse of a matrix \(\varvec{X}\); see [13, 47] for general theories of matrix operators. Less common matrix operators will be defined as they are needed.
2 Background on Perspective Functions
In this section, we review perspective functions and their interplay with tight formulations of logically constrained problems. This prepares the ground for and motivates our study of matrix perspective functions and their interplay with tight formulations of low-rank problems. Many of our subsequent results can be viewed as (nontrivial) generalizations of the results in this section, since a rank constraint is a cardinality constraint on the singular values.
2.1 Preliminaries
Consider a proper closed convex function \(f : {\mathcal {X}} \rightarrow \mathbb {R}\), where \({\mathcal {X}}\) is a convex subset of \(\mathbb {R}^n\). The perspective function of f is commonly defined for any \(\varvec{x} \in \mathbb {R}^n\) and any \(t > 0\) as \((\varvec{x}, t) \mapsto t f(\varvec{x}/t)\). Its closure is defined by continuity for \(t=0\) and is equal to (c.f. [46, Proposition IV.2.2.2]):
where \(f_\infty \) is the recession function of f, as originally stated in [68, p. 67], which is given by
for any \(\varvec{x}_0\) in the domain of f. That is, \(f_\infty (\varvec{x})\) is the asymptotic slope of f in the direction of \(\varvec{x}\).
The perspective function was first investigated by [68], who made the important observation that f is convex in \(\varvec{x}\) if and only if \(g_f\) is convex in \((\varvec{x},t)\). Among other properties, we have that, for any \(t>0\), \((\varvec{x},t,s) \in \textrm{epi}(g_f)\) if and only if \((\varvec{x}/t, s/t) \in \textrm{epi}(f)\) [46, Proposition IV.2.2.1]. We refer to the review by [23] for further properties of perspective functions.
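For instance, for the convex quadratic \(f(\varvec{x})=\Vert \varvec{x}\Vert _2^2\), the (closure of the) perspective is \(g_f(\varvec{x}, t)=\Vert \varvec{x}\Vert _2^2/t\) for \(t>0\), \(g_f(\varvec{0}, 0)=0\), and \(g_f(\varvec{x}, t)=+\infty \) otherwise; this is the function which underlies the relaxation (3).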
Throughout this work, we refer to \(g_f\) as the perspective function of f, although it is technically the closure of the perspective. We also consider a family of convex functions f which satisfy:
Assumption 1
The function \(f : {\mathcal {X}} \rightarrow \mathbb {R}\) is proper, closed, and convex, \(\varvec{0} \in {\mathcal {X}}\), and for any \(\varvec{x} \ne \varvec{0}\), \(f_\infty (\varvec{x}) = + \infty \).
The condition \(f_\infty (\varvec{x}) = +\infty , \forall \varvec{x} \ne \varvec{0}\) is equivalent to \(\lim _{\Vert \varvec{x}\Vert \rightarrow \infty } {f(\varvec{x})}/{\Vert \varvec{x}\Vert } = +\infty ,\) and means that, asymptotically, f increases to infinity faster than any affine function. In particular, it is satisfied if the domain of f is bounded or if f is strictly convex. Under Assumption 1, the definition of the perspective function of f simplifies to
2.2 The perspective reformulation technique
Ever since [21], building upon the work of [68, Theorem 9.8], derived the convex hull of a disjunction of convex constraints, a number of authors have observed that optimization problems over binary and continuous variables admit tight reformulations involving perspective functions of appropriate substructures of the problem. To motivate our study of the matrix perspective function in the sequel, we now demonstrate that a class of logically-constrained problems admit reformulations in terms of perspective functions. We remark that this development bears resemblance to other works on perspective reformulations, including [10, 39, 43].
Consider a logically-constrained problem of the form
where \({\mathcal {Z}} \subseteq \{0,1\}^n\), \(\varvec{c} \in \mathbb {R}^n\) is a cost vector, \(f(\cdot )\) is a generic convex function which possibly models convex constraints \(\varvec{x} \in {\mathcal {X}}\) for a convex set \({\mathcal {X}} \subseteq \mathbb {R}^n\) implicitly, by requiring that \(f(\varvec{x})=+\infty \) if \(\varvec{x} \notin {\mathcal {X}}\), and \(\Omega (\cdot )\) is a regularization function which satisfies the following assumption:
Assumption 2
(Separability) \(\Omega (\varvec{x})=\sum _{i \in [n]} \Omega _i(x_i)\), where each \(\Omega _i\) satisfies Assumption 1.
Since \(z_i\) is binary, imposing the logical constraint “\(x_i=0\) if \(z_i=0\)” and including the term \(\Omega _i(x_i)\) in the objective is equivalent to including the term \(g_{\Omega _i}(x_i, z_i)+(1-z_i)\Omega _i(0)\) in the objective, where \(g_{\Omega _i}\) is the perspective function of \(\Omega _i\), and thus Problem (9) is equivalent to:
Notably, while Problems (9)–(10) have the same feasible regions, (10) often has substantially stronger relaxations, as frequently noted in the perspective reformulation literature [10, 36, 37, 42].
For completeness, we provide a formal proof of equivalence between (9) and (10); note that a related (although dual, and weaker as it requires \(\Omega (\varvec{0})=0\)) result can be found in [10, Theorem 2.5]:
Lemma 1
Suppose (9) attains a finite optimal value. Then, (10) attains the same value.
Proof
It suffices to establish that the following equality holds:
Indeed, this equality shows that any feasible solution to one problem is a feasible solution to the other with equal cost. We prove this by considering the cases where \(z_i=0\), \(z_i=1\) separately.
-
Suppose \(z_i=1\). Then, \(g_{\Omega _i}(x_i, z_i)=z_i \Omega _i(x_i/z_i) =\Omega _i(x_i)\) and \(x_i=z_i\cdot x_i\), so the result holds.
-
Suppose \(z_i=0\). If \(x_i=0\) we have \(g_{\Omega _i}(0,0)+\Omega _i(0)=\Omega _i(0)\), and moreover the right-hand-side of the equality is certainly \(\Omega _i(0)\). Alternatively, if \(x_i \ne 0\) then both sides equal \(+\infty \). \(\square \)
In Table 1, we present examples of penalties \(\Omega \) for which Assumption 1 holds and the perspective reformulation technique is applicable. We remind the reader that the exponential cone is (c.f. [22]):
while the power cone is defined for any \(\alpha \in (0, 1)\) as (c.f. [22]):
2.3 Perspective cuts
Another computationally useful application of the perspective reformulation technique has been to derive a class of cutting-planes for MIOs with logical constraints [37]. To motivate our generalization of these cuts to low-rank problems, we now briefly summarize their main result.
Consider the following problem:
where \(\{x_i: \varvec{A}^i x_i \le 0\}=\{0\}\), which implies the set of feasible \(\varvec{x}\) is bounded, \(\Omega _i(x_i)\) is a closed convex function, we take \(\Omega _i(0)=0\) as in [37] for simplicity, and \(f(\varvec{x})\) is a convex function. Then, letting \(\rho _i\) model the epigraph of \(\Omega _i(x_i)+c_i z_i\) and \(s_i\) be a subgradient of \(\Omega _i\) at \(\bar{x}_i\), i.e., \(s_i \in \partial \Omega _i(\bar{x}_i)\), we have the following result [37, 42]:
Proposition 1
The following cut
is valid for the equivalent MINLO:
Remark 1
In the special case where \(\Omega _i(x_i)=x_i^2\), the cut reduces to:
The cutting planes defined in Proposition 1 are commonly referred to as perspective cuts, because they define a linear lower approximation of the perspective function of \(\Omega _i(x_i)\), \(g_{\Omega _i}(x_i, z_i)\). Consequently, Proposition 1 implies that a perspective reformulation of (11) is equivalent to adding all (infinitely many) perspective cuts (12). This may be helpful where the original problem is nonlinear, as a sequence of linear MIOs can be easier to solve than one nonlinear MIO (see [38], for a comparison).
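As a small numerical illustration (a sketch, assuming the standard quadratic perspective cut \(\rho _i \ge 2\bar{x}_i x_i - \bar{x}_i^2 z_i + c_i z_i\) described in Remark 1), the following code checks that the cut under-estimates the perspective \(x_i^2/z_i + c_i z_i\) on a grid and is tight at the trial point; the trial point and cost are hypothetical:

```python
import numpy as np

def quadratic_perspective_cut(x_bar, c):
    """Coefficients (a, b) of the cut rho >= a*x + b*z for Omega(x) = x^2,
    where rho models the epigraph of x^2/z + c*z."""
    return 2.0 * x_bar, c - x_bar ** 2

x_bar, c = 1.5, 0.3  # hypothetical trial point and fixed cost
a, b = quadratic_perspective_cut(x_bar, c)

# The cut should under-estimate z*(x/z)^2 + c*z = x^2/z + c*z for all z in (0, 1]
xs = np.linspace(-3.0, 3.0, 61)
zs = np.linspace(1e-3, 1.0, 50)
gap = min((x ** 2 / z + c * z) - (a * x + b * z) for x in xs for z in zs)
print("Minimum gap over the grid (should be >= 0):", gap)

# The cut is tight at (x, z) = (x_bar, 1), where it touches the perspective
assert abs((x_bar ** 2 + c) - (a * x_bar + b)) < 1e-12
```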
3 The Matrix Perspective Function and its Applications
In this section, we generalize the perspective function from vectors to matrices, and invoke the matrix perspective function to propose a new technique for generating strong yet efficient relaxations of a diverse family of low-rank problems, which we call the Matrix Perspective Reformulation Technique (MPRT). Selected background on matrix analysis (see [13], for a general theory) and semidefinite optimization (see [77], for a general theory) which we use throughout this section can be found in Appendix A.
3.1 A matrix perspective function
To generalize the ideas from the previous section to low-rank constraints, we require a more expressive transform than the perspective transform, which introduces a single (scalar) additional degree of freedom and cannot control the eigenvalues of a matrix. Therefore, we invoke a generalization from quantum mechanics: the matrix perspective function defined in [27, 28], building upon the work of [29]; see also [24, 54,55,56] for a related generalization of perspective functions to perspective functionals.
Definition 1
For a matrix-valued function \(f: {\mathcal {X}} \rightarrow {\mathcal {S}}^n_+\) where \({\mathcal {X}} \subseteq {\mathcal {S}}^n\) is a convex set, the matrix perspective function of f, \(g_f\), is defined as
Remark 2
If \(\varvec{X}\) and \(\varvec{Y}\) commute and f is analytic, then Definition 1 simplifies into \(\varvec{Y} f\left( \varvec{Y}^{-1}\varvec{X}\right) \), which is the analog of the usual definition of the perspective function originally stated in [29]. Definition 1, however, generalizes this definition to the case where \(\varvec{X}\) and \(\varvec{Y}\) do not commute by ensuring that \(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\) is nonetheless symmetric, in a manner reminiscent of the development of interior point methods (see, e.g., [2]). In particular, if \(\varvec{Y}\) is a projection matrix such that \(\varvec{X}=\varvec{Y}\varvec{X}\)—as occurs for the exact formulations of the low-rank problems we consider in this paper—then it is safe to assume that \(\varvec{X}, \varvec{Y}\) commute. However, when \(\varvec{Y}\) is not a projection matrix, this cannot be assumed in general.
The matrix perspective function generalizes the definition of the perspective transformation to matrix-valued functions and satisfies analogous properties:
Proposition 2
Let f be a matrix-valued function and \(g_f\) its matrix perspective function. Then:
-
(a)
f is matrix convex, i.e.,
$$\begin{aligned} t f(\varvec{X})+(1-t) f(\varvec{W}) \succeq f(t \varvec{X}+(1-t)\varvec{W}) \quad \forall \varvec{X}, \varvec{W} \in {\mathcal {S}}^n, \ t \in [0,1], \end{aligned}$$ (14)

if and only if \(g_f\) is matrix convex in \((\varvec{X}, \varvec{Y})\).
-
(b)
\(g_f\) is a positively homogeneous function, i.e., for any \(\mu > 0\) we have
$$\begin{aligned} g_f(\mu \varvec{X}, \mu \varvec{Y})=\mu g_f(\varvec{X}, \varvec{Y}). \end{aligned}$$ (15)

-
(c)
Let \(\varvec{Y} \succ \varvec{0}\) be a positive definite matrix. Then, letting the epigraph of f be denoted by
$$\begin{aligned} \textrm{epi}(f):=\{(\varvec{X}, \varvec{\theta }): \varvec{X} \in \textrm{dom}(f), f(\varvec{X}) \preceq \varvec{\theta }\},\end{aligned}$$ (16)

we have \((\varvec{X}, \varvec{Y}, \varvec{\theta }) \in \textrm{epi}(g_f)\) if and only if \((\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}, \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}}) \in \textrm{epi}(f)\).
Proof
We prove the claims successively:
-
(a)
This is precisely the main result of [27, Theorem 2.2].
-
(b)
For \(\mu > 0\), \(g_f(\mu \varvec{X}, \mu \varvec{Y})=\mu \varvec{Y}^{\frac{1}{2}} f\left( (\mu \varvec{Y})^{-\frac{1}{2}}\mu \varvec{X}(\mu \varvec{Y})^{-\frac{1}{2}}\right) \varvec{Y}^{\frac{1}{2}}=\mu g_f(\varvec{X}, \varvec{Y})\).
-
(c)
By generalizing the main result in [15, Chapter 3.2.6], for any \(\varvec{Y} \succ \varvec{0}\) we have that
$$\begin{aligned} (\varvec{X}, \varvec{Y}, \varvec{\theta }) \in \textrm{epi}(g_f)&\iff \varvec{Y}^\frac{1}{2}f(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{Y}^\frac{1}{2}\preceq \varvec{\theta },\\&\iff f(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}) \preceq \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}},\\&\iff (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}, \varvec{Y}^{-\frac{1}{2}}\varvec{\theta }\varvec{Y}^{-\frac{1}{2}}) \in \textrm{epi}(f). \end{aligned}$$\(\square \)
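As a quick numerical sanity check of Definition 1 and of property (b) above, the following sketch uses the quadratic operator function \(f(\varvec{X})=\varvec{X}^2\), for which a direct computation gives \(g_f(\varvec{X},\varvec{Y})=\varvec{X}\varvec{Y}^{-1}\varvec{X}\) whenever \(\varvec{Y}\succ \varvec{0}\):

```python
import numpy as np

def mat_sqrt(Y):
    # Symmetric square root via an eigendecomposition
    vals, vecs = np.linalg.eigh(Y)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def matrix_perspective_quadratic(X, Y):
    # g_f(X, Y) = Y^{1/2} f(Y^{-1/2} X Y^{-1/2}) Y^{1/2}, with f(Z) = Z @ Z
    Y_half = mat_sqrt(Y)
    Y_inv_half = np.linalg.inv(Y_half)
    Z = Y_inv_half @ X @ Y_inv_half
    return Y_half @ (Z @ Z) @ Y_half

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); X = A + A.T              # symmetric argument
B = rng.standard_normal((n, n)); Y = B @ B.T + np.eye(n)  # positive definite argument

G = matrix_perspective_quadratic(X, Y)
assert np.allclose(G, X @ np.linalg.inv(Y) @ X)            # closed form for f(X) = X^2
mu = 2.7
assert np.allclose(matrix_perspective_quadratic(mu * X, mu * Y), mu * G)  # homogeneity
print("Matrix perspective checks passed.")
```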
We now specialize our attention to matrix-valued functions defined by a scalar convex function, as suggested in the introduction.
3.2 Matrix perspectives of operator functions
From any function \(\omega : \mathbb {R} \rightarrow \mathbb {R}\), we can define its extension to the set of symmetric matrices, \(f_\omega : {\mathcal {S}}^n \rightarrow {\mathcal {S}}^n\) as
where \(\varvec{X} = \varvec{U} {\text {Diag}}(\lambda _1^x,\dots , \lambda _n^x) \varvec{U}^\top \) is an eigendecomposition of \(\varvec{X}\). Functions of this form are called operator functions (see [13], for a general theory). In particular, one can show that \(f_\omega (\varvec{X})\) is well-defined, i.e., it does not depend on the choice of eigenbasis \(\varvec{U}\) of \(\varvec{X}\). Among other examples, taking \(\omega (x) = \exp (x)\) (resp. \(\log (x)\)) provides a matrix generalization of the exponential (resp. logarithm) function; see Appendix A.1.
Central to our analysis is that we can explicitly characterize the closure of the matrix perspective of \(f_\omega \) under some assumptions on \(\omega \), i.e., define by continuity \(g_{f_\omega }(\varvec{X},\varvec{Y})\) for rank-deficient matrices \(\varvec{Y}\):
Proposition 3
Consider a function \(\omega : \mathbb {R} \rightarrow \mathbb {R}\) satisfying Assumption 1. Then, the closure of the matrix perspective of \(f_\omega \) is, for any \(\varvec{X} \in {\mathcal {S}}^n\), \(\varvec{Y} \in {\mathcal {S}}_+^n\),
where \(\varvec{Y}^{-\frac{1}{2}}\) denotes the pseudo-inverse of the square root of \(\varvec{Y}\).
Remark 3
Note that in the expression of \(g_{f_\omega }\) above, the matrix \(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\) is unambiguously defined if and only if \({\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})\) (otherwise, its value depends on how we define the pseudo-inverse of \(\varvec{Y}^{\frac{1}{2}}\) outside of its range). Accordingly, in the remainder of the paper, we omit the condition \({\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})\) whenever the analytic expression for \(g_{f_\omega }\) explicitly involves \(\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\).
The proof of Proposition 3 is deferred to Appendix B.1. In the appendix, we also present an immediate extension where additional constraints, \(\varvec{X} \in {\mathcal {X}}\), are imposed on the argument of \(f_\omega \). As in our prior work [11], we will reformulate the rank constraints in (1) by introducing a projection matrix \(\varvec{Y}\) to encode for the span of \(\varvec{X}\). Naturally, \(\varvec{Y}\) should be rank-deficient. Hence, Proposition 3 ensures that having \({\text {tr}}(g_{f_\omega }(\varvec{X},\varvec{Y})) < \infty \) is a sufficient condition for \(\varvec{Y}\) to indeed control \({\text {Span}}(\varvec{X})\).
To gain intuition on how the matrix perspective function transforms \(\varvec{X}\) and \(\varvec{Y}\), we now provide an interesting connection between the matrix perspective of \(f_\omega \) and the perspective of \(\omega \) in the case where \(\varvec{X}\) and \(\varvec{Y}\) commute.
Proposition 4
Consider two matrices \(\varvec{X} \in {\mathcal {S}}^n, \varvec{Y} \in {\mathcal {S}}^n_+\) that commute and such that \(\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})\). Hence, there exists an orthogonal matrix \(\varvec{U}\) which jointly diagonalizes \(\varvec{X}\) and \(\varvec{Y}\). Let \(\lambda _1^x, \dots , \lambda _n^x\) and \(\lambda _1^y, \dots , \lambda _n^y\) denote the eigenvalues of \(\varvec{X}\) and \(\varvec{Y}\) respectively, ordered according to this basis \(\varvec{U}\). Consider an operator function \(f_\omega \) with \(\omega \) satisfying Assumption 1. Then, we have that:
Proof
By simultaneously diagonalizing \(\varvec{X}\) and \(\varvec{Y}\), we get
\(\square \)
Note that if \(\varvec{Y}\) is a projection matrix such that \(\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})\) then we necessarily have that \(\varvec{X}=\varvec{Y}\varvec{X}=\varvec{X}\varvec{Y}\) and the assumptions of Proposition 4 hold.
In the general case where \(\varvec{X}\) and \(\varvec{Y}\) do not commute, we cannot simultaneously diagonalize them and connect \(g_{f_\omega }\) with \(g_\omega \). However, we can still project \(\varvec{Y}\) onto the space of matrices that commute with \(\varvec{X}\) and obtain the following result when \(g_{f_\omega }\) is matrix convex (proof deferred to Appendix B.2):
Lemma 2
Let \(\varvec{X} \in {\mathcal {S}}^n\) and \(\varvec{Y} \in {\mathcal {S}}_+^n\) be matrices, and denote by \({\mathcal {C}}_{\varvec{X}} := \{ \varvec{M} \ : \ \varvec{MX} = \varvec{XM} \}\) the commutant of \(\varvec{X}\), i.e., the set of matrices which commute with \(\varvec{X}\). For any matrix \(\varvec{M}\), let \(\varvec{M}_{|\varvec{X}}\) denote the orthogonal projection of \(\varvec{M}\) onto \({\mathcal {C}}_{\varvec{X}}\). Then, we have that
Moreover, if \(\varvec{Y} \mapsto g_{f_\omega }(\varvec{X},\varvec{Y})\) is matrix convex, then we have
3.3 The matrix perspective reformulation technique
Definition 1 and Proposition 3 supply the necessary language to lay out our Matrix Perspective Reformulation Technique (MPRT). Therefore, we now state the technique; details regarding its implementation will become clearer throughout the paper.
Let us revisit Problem (1), and assume that the term \(\Omega (\varvec{X})\) satisfies the following properties:
Assumption 3
\(\Omega (\varvec{X})= {\text {tr}}\left( f_\omega (\varvec{X}) \right) \), where \(\omega \) is a function satisfying Assumption 1 and whose associated operator function, \(f_\omega \), is matrix convex.
Assumption 3 implies that the regularizer can be rewritten as operating on the eigenvalues of \(\varvec{X}\), \(\lambda _i(\varvec{X})\), directly: \(\Omega (\varvec{X})= \sum _{i \in [n]} \omega (\lambda _i(\varvec{X}))\). As we discuss in the next section, a broad class of functions satisfy this property. For ease of notation, we refer to \(f_\omega \) as f in the remainder of the paper (and accordingly denote by \(g_f\) its matrix perspective function).
After letting an orthogonal projection matrix \(\varvec{Y}\) model the rank of \(\varvec{X}\)—as per [11]—Problem (1) admits the equivalent mixed-projection reformulation:
where \({\mathcal {Y}}^k_n\) is the set of \(n \times n\) orthogonal projection matrices with trace at most k:
Note that for \(k \in \mathbb {N}\), the convex hull of \({\mathcal {Y}}^k_n\) is given by \( \textrm{Conv}({\mathcal {Y}}^k_n)=\{\varvec{Y} \in {\mathcal {S}}^n_{+}: \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k\}\), which is a well-studied object in its own right [52, 60,61,62].
Since \(\varvec{Y}\) is an orthogonal projection matrix, imposing the nonlinear constraint \(\varvec{X}=\varvec{Y}\varvec{X}\) and introducing the term \(\Omega (\varvec{X})=\textrm{tr}(f(\varvec{X}))\) in the objective is equivalent to introducing the following term in the objective:
where \(g_f\) is the matrix perspective of f, and thus Problem (18) is equivalent to:
Let us formally state and verify the equivalence between Problems (18)–(19) via:
Theorem 1
Problems (18)–(19) attain the same optimal objective value.
Proof
It suffices to show that for any feasible solution to (18) we can construct a feasible solution to (19) with an equal or lower cost, and vice versa:
-
Let \((\varvec{X}, \varvec{Y})\) be a feasible solution to (18). Since \(\varvec{X} = \varvec{Y} \varvec{X} \in {\mathcal {S}}^n\), \(\varvec{X}\) and \(\varvec{Y}\) commute. Hence, by Proposition 4, we have (using the same notation as in Proposition 4):
$$\begin{aligned} {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) = \sum _{i \in [n]} g_\omega \left( \lambda ^x_i,\lambda ^y_i \right) = \sum _{i \in [n]} 1\left\{ \lambda _i^y>0\right\} \omega (\lambda ^x_i), \end{aligned}$$

where \(1\left\{ \lambda _i^y>0\right\} \) is an indicator function which denotes whether the ith eigenvalue of \(\varvec{Y}\) (which is either 0 or 1) is strictly positive. Moreover, since \(\varvec{X} = \varvec{Y}\varvec{X}\), \(\lambda _i^y = 0 \implies \lambda _i^x = 0\) and

$$\begin{aligned} {\text {tr}}\left( f(\varvec{X})\right)&= \sum _{i\in [n]} \omega (\lambda _i^x) = {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) + \sum _{i \in [n]} 1\left\{ \lambda _i^y=0\right\} \omega (0) \nonumber \\&= {\text {tr}}\left( g_f(\varvec{X}, \varvec{Y}) \right) + (n-\textrm{tr}(\varvec{Y}))\omega (0). \end{aligned}$$ (20)

This establishes that \((\varvec{X}, \varvec{Y})\) is feasible in (19) with the same cost.
-
Let \((\varvec{X}, \varvec{Y})\) be a feasible solution to (19) with finite objective value. Then, it follows that \(\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})\), which implies that \(\varvec{X}=\varvec{Y}\varvec{X}\) since \(\varvec{Y}\) is a projection matrix. Therefore, (20) holds, which establishes that \((\varvec{X}, \varvec{Y})\) is feasible in (18) with the same cost.
\(\square \)
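As a numerical illustration of the identity (20) (a sketch for the quadratic case \(\omega (x)=x^2\), where \(\omega (0)=0\) and \({\text {tr}}(g_f(\varvec{X},\varvec{Y}))={\text {tr}}(\varvec{X}\varvec{Y}^\dag \varvec{X})\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 6, 2

# A random rank-k orthogonal projection matrix Y
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
U = Q[:, :k]
Y = U @ U.T

# A symmetric X with X = Y X, i.e., Span(X) contained in Span(Y)
S = rng.standard_normal((k, k)); S = S + S.T
X = U @ S @ U.T

# For omega(x) = x^2: tr(f(X)) = tr(X^2) and tr(g_f(X, Y)) = tr(X Y^+ X)
lhs = np.trace(X @ X)
rhs = np.trace(X @ np.linalg.pinv(Y) @ X)  # + (n - tr(Y)) * omega(0), which vanishes here
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```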
Ultimately, relaxing \(\varvec{Y} \in {\mathcal {Y}}^k_n\) to \(\varvec{Y} \in \textrm{Conv}({\mathcal {Y}}^k_n)\) in Problem (19) supplies relaxations that are as strong as—and sometimes significantly stronger than—those obtained by any other technique we are aware of, as we explore in Sect. 4.
Remark 4
Note that, based on the proof of Theorem 1, we could replace \(g_f(\varvec{X}, \varvec{Y})\) in (19) by any function \(\tilde{g}(\varvec{X}, \varvec{Y})\) such that \(g_f(\varvec{X}, \varvec{Y}) = \tilde{g}(\varvec{X}, \varvec{Y})\) for \(\varvec{X}, \varvec{Y}\) that commute, with no impact on the objective value. However, it might impact tractability if \(\tilde{g}(\varvec{X}, \varvec{Y})\) is not convex in \((\varvec{X}, \varvec{Y})\).
Remark 5
Under Assumption 3, the regularization term \(\Omega (\varvec{X})\) penalizes all eigenvalues of \(f_{\omega }(\varvec{X})\) equally. The MPRT can be extended to a wider class of regularization functions that penalize the largest eigenvalues more heavily, at the price of (a significant amount of) additional notation. For brevity, we lay out this extension in Appendix C.
Theorem 1 only uses the fact that f is an operator function with \(\omega \) satisfying Assumption 1, not the fact that f is matrix convex. In other words, (19) is always an equivalent reformulation of (18). An interesting question is to identify the set of necessary conditions for the objective of (19) to be convex in \((\varvec{X},\varvec{Y})\) – f being matrix convex is clearly sufficient. The objective in (19) is convex only as long as \({\text {tr}}\left( g_f \right) \) is. Interestingly, this is not equivalent to the convexity of \(\textrm{tr}(f)\). See Appendix B.3 for a counter-example. It is, however, an open question whether a weaker notion than matrix convexity could ensure the joint convexity of \(\textrm{tr}(g_f)\). It would also be interesting to investigate the benefits and the tractability of non-convex penalties (either by having f not matrix convex or \(\omega \) non-convex), given the successes of non-convex penalty functions in sparse regression problems [30, 79].
3.4 Convex hulls of low-rank sets and the MPRT
We now show that, for a general class of low-rank sets, applying the MPRT is equivalent to taking the convex hull of the set. This is significant, because we are not aware of any general-purpose techniques for taking convex hulls of low-rank sets. Formally, we have the following result:
Theorem 2
Consider an operator function \(f=f_\omega \) satisfying Assumption 3. Let
be a set where \(t \in \mathbb {R},k \in \mathbb {N}\) are fixed. Then, an extended formulation of the convex hull of \({\mathcal {T}}\) is given by:
where \(\textrm{Conv}({\mathcal {Y}}^k_n)=\{\varvec{Y} \in {\mathcal {S}}^n_+: \varvec{Y} \preceq \mathbb {I}, \textrm{tr}(\varvec{Y}) \le k \}\) is the convex hull of trace-k projection matrices, and \(g_f\) is the matrix perspective function of f.
Proof
We prove the two directions sequentially:
-
\(\textrm{Conv}\left( {\mathcal {T}}\right) \subseteq {\mathcal {T}}^c\): let \(\varvec{X} \in {\mathcal {T}}\). Then, since the rank of \(\varvec{X}\) is at most k, there exists some \(\varvec{Y} \in {\mathcal {Y}}^k_n\) such that \(\varvec{X}=\varvec{Y}\varvec{X}\) and \(\textrm{tr}(\varvec{Y})=\textrm{Rank}(\varvec{X})\). Moreover, by the same argument as in the proof of Theorem 1, it follows that (20) holds and \(\textrm{tr}(g_f(\varvec{X}, \varvec{Y}))+\mu \cdot \textrm{tr}(\varvec{Y})+(n-\textrm{tr}(\varvec{Y}))\omega (0) \le t\), which confirms that \((\varvec{X}, \varvec{Y}) \in {\mathcal {T}}^c\). Since \({\mathcal {T}}^c\) is a convex set, we therefore have \(\textrm{Conv}\left( {\mathcal {T}}\right) \subseteq {\mathcal {T}}^c\).
-
\({\mathcal {T}}^c \subseteq \textrm{Conv}\left( {\mathcal {T}}\right) \): let \((\varvec{X}, \varvec{Y}) \in {\mathcal {T}}^c\). Let \(\varvec{Y}_{|\varvec{X}}\) denote the projection of \(\varvec{Y}\) onto the set of matrices that commute with \(\varvec{X}\), \(\{ \varvec{M} : \varvec{X}\varvec{M} = \varvec{M}\varvec{X} \}\). By Lemma 2, we have that \(\varvec{Y}_{|\varvec{X}} \in {\text {Conv}}({\mathcal {Y}}_n^k)\) and \({\text {tr}}\left( g_{f}(\varvec{X}, \varvec{Y}_{|\varvec{X}}) \right) \le {\text {tr}}\left( g_{f}(\varvec{X}, \varvec{Y}) \right) < \infty \), so \((\varvec{X}, \varvec{Y}_{|\varvec{X}}) \in {\mathcal {T}}^c\) as well. Hence, without loss of generality, by renaming \(\varvec{Y} \leftarrow \varvec{Y}_{|\varvec{X}}\), we can assume that \(\varvec{X}\) and \(\varvec{Y}\) commute. Then, it follows from Proposition 4 that the vectors of eigenvalues of \(\varvec{X}\) and \(\varvec{Y}\) (ordered according to a shared eigenbasis \(\varvec{U}\)), \(\left( \varvec{\lambda }(\varvec{X}), \varvec{\lambda }(\varvec{Y})\right) \), belong to the set
$$\begin{aligned}&\left\{ (\varvec{x}, \varvec{y}) \in \mathbb {R}^n \times [0,1]^n: \sum _i y_i \le k, \sum _{i=1}^n y_i \omega \left( \tfrac{x_i}{y_i}\right) +\mu \sum _i y_i +\left( n-\sum _i y_i\right) \omega (0) \le t \right\} , \end{aligned}$$

which is the closure of the convex hull of

$$\begin{aligned} {\mathcal {U}}^c:=&\left\{ (\varvec{x}, \varvec{y}) \in \mathbb {R}^n \times \{0,1\}^n: \sum _i y_i \le k,\sum _{i=1}^n \omega ({x_i})+\mu \sum _i y_i \le t, x_i = 0 \ \text {if} \ y_i = 0 \ \forall i \in [n]\right\} , \end{aligned}$$

as proved by [76, Theorem 3] for the case \(\omega (0) = 0\). We provide a self-contained proof of the generalization of this statement to \(\omega (0) \ne 0\) in Appendix B.5 (Proposition 8). Let us decompose \((\varvec{\lambda }(\varvec{X}), \varvec{\lambda }(\varvec{Y}))\) into \(\varvec{\lambda }(\varvec{X}) = \sum _{k} \alpha _k \varvec{x}^{(k)}\), \(\varvec{\lambda }(\varvec{Y}) = \sum _{k} \alpha _k \varvec{y}^{(k)}\), with \(\alpha _k \ge 0\), \(\sum _k \alpha _k = 1\), and \((\varvec{x}^{(k)}, \varvec{y}^{(k)}) \in {\mathcal {U}}^c\). By definition,

$$\begin{aligned} \varvec{T}^{(k)} := \varvec{U} \text {Diag}(\varvec{x}^{(k)}) \varvec{U}^\top \in {\mathcal {T}}, \end{aligned}$$

and \(\varvec{X} = \sum _k \alpha _k \varvec{T}^{(k)}\). Therefore, we have that \(\varvec{X} \in \textrm{Conv}({\mathcal {T}})\), as required. \(\square \)
Remark 6
Since linear optimization problems over convex sets admit extremal optima, Theorem 2 demonstrates that unconstrained low-rank problems with spectral objectives can be recast as linear semidefinite problems, where the rank constraint is dropped without loss of optimality. This suggests that work on hidden convexity in low-rank optimization, i.e., deriving conditions under which low-rank linear optimization problems admit exact relaxations where the rank constraint is omitted (see, e.g., [12, 62, 75]), could be extended to incorporate spectral functions.
3.5 Examples of the matrix perspective function
Theorem 2 demonstrates that, for spectral functions under low-rank constraints, taking the matrix perspective is equivalent to taking the convex hull. To highlight the utility of Theorems 1–2, we therefore supply the perspective functions of some spectral regularization functions which frequently arise in the low-rank matrix literature, and summarize them in Table 2. We also discuss how these functions and their perspectives can be efficiently optimized over. Note that all functions introduced in this section are either matrix convex or the trace of a matrix convex function, and thus supply valid convex relaxations when used as regularizers for the MPRT.
Spectral constraint: Let \(\omega (x) = 0\) if \(|x| \le M\), \(+\infty \) otherwise. Then,
for \(\varvec{X} \in {\mathcal {S}}^n\), where \(\Vert \cdot \Vert _\sigma \) denotes the spectral norm, i.e., the largest eigenvalue of \(\varvec{X}\) in absolute magnitude. Observe that the condition \(\Vert \varvec{X}\Vert _\sigma \le M\) can be expressed via the semidefinite constraints \(-M\mathbb {I} \preceq \varvec{X}\preceq M\mathbb {I}\). The perspective function \(g_f\) can then be expressed as
If \(\varvec{X}\) and \(\varvec{Y}\) commute, \(g_f(\varvec{X},\varvec{Y})\) requires that \(\vert \lambda _j(\varvec{X}) \vert \le M \lambda _j(\varvec{Y}) \ \forall j \in [n]\)—the spectral analog of a big-M constraint. This constraint can be modeled using two semidefinite cones, and can thus be handled by semidefinite solvers.
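For instance, one way to impose this coupling between \(\varvec{X}\) and a (relaxed) projection matrix \(\varvec{Y}\) in a semidefinite solver is via the two linear matrix inequalities \(-M\varvec{Y} \preceq \varvec{X} \preceq M\varvec{Y}\); a minimal cvxpy sketch, with a hypothetical dimension, rank budget and big-M constant:

```python
import cvxpy as cp
import numpy as np

n, k, M = 10, 3, 5.0  # hypothetical dimension, rank budget and big-M constant

X = cp.Variable((n, n), symmetric=True)
Y = cp.Variable((n, n), symmetric=True)

spectral_big_M = [
    M * Y - X >> 0,        # X <=  M Y
    M * Y + X >> 0,        # X >= -M Y
    Y >> 0,
    np.eye(n) - Y >> 0,    # Y in Conv(Y^k_n)
    cp.trace(Y) <= k,
]
# These constraints can be combined with any convex objective in (X, Y); at a
# projection matrix Y they force X to vanish on the kernel of Y and bound the
# eigenvalues of X by M in magnitude.
```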
Convex quadratic: For \(\omega (x) = x^2\), \(f(\varvec{X})=\varvec{X}^\top \varvec{X}\). Then, the perspective function \(g_f\) is
Observe that this function’s epigraph is semidefinite-representable. Indeed, by the Schur complement lemma [16, Equation 2.41], minimizing the trace of \(g_f(\varvec{X}, \varvec{Y})\) is equivalent to solving
Interestingly, this perspective function allows us to rewrite the rank-k SVD problem
as a linear optimization problem over the set of orthogonal projection matrices, which implies that the orthogonal projection constraint can be relaxed to its convex hull without loss of optimality (since some extremal solution will be optimal for the relaxation). This is significant, because while rank-k SVD is commonly thought of as a non-convex problem which “surprisingly” admits a closed-form solution, the MPRT shows that it actually admits an exact convex reformulation:
Note that, in the above formulation, we extended our results for symmetric matrices to rectangular matrices \(\varvec{X} \in \mathbb {R}^{n \times m}\) without justification. We rigorously derive this extension for \(f(\varvec{X}) = \varvec{X}^\top \varvec{X}\) in Appendix D and defer the study of the general case to future research.
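The following cvxpy sketch illustrates this claim numerically for a rectangular matrix, assuming the Schur-complement model of the quadratic perspective described above; up to solver tolerance, the optimal value should match the squared error of the rank-k truncated SVD computed by numpy:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
n, m, k = 8, 6, 2
A = rng.standard_normal((n, m))

X = cp.Variable((n, m))
Y = cp.Variable((m, m), symmetric=True)  # relaxed projection onto the row space of X
B = cp.Variable((n, n), symmetric=True)  # models X X^T

constraints = [
    cp.bmat([[Y, X.T], [X, B]]) >> 0,    # B >= X Y^+ X^T, and Span(X^T) within Span(Y)
    Y >> 0,
    np.eye(m) - Y >> 0,
    cp.trace(Y) <= k,
]
# ||A - X||_F^2 = ||A||_F^2 - 2<A, X> + ||X||_F^2, with ||X||_F^2 relaxed to tr(B)
objective = cp.Minimize(np.sum(A ** 2) - 2 * cp.trace(A.T @ X) + cp.trace(B))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)

sigma = np.linalg.svd(A, compute_uv=False)
print("Convex formulation :", prob.value)
print("Rank-k SVD residual:", np.sum(sigma[k:] ** 2))
```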
Spectral plus convex quadratic: Let
for \(\varvec{X} \in {\mathcal {S}}^n\). Then, the perspective function \(g_f\) is
This can be interpreted as the spectral analog of combining a big-M and a ridge penalty.
Convex quadratic over completely positive cone: Consider the following optimization problem
where \({\mathcal {C}}^n_+=\{\varvec{X}: \varvec{X}=\varvec{U}\varvec{U}^\top , \varvec{U} \in \mathbb {R}^{n \times n}_+\} \subseteq {\mathcal {S}}^n_+\) denotes the completely positive cone. Then, by denoting \(f(\varvec{X}) = \varvec{X}^\top \varvec{X}\) and \(g_f\) its perspective function, we obtain a valid relaxation by minimizing \(\textrm{tr}(g_f)\), which, by the Schur complement lemma (see [16, Equation 2.41]), can be reformulated as
Unfortunately, this formulation cannot be tractably optimized over, since separating over the completely positive cone is NP-hard. However, by relaxing the completely positive cone to the doubly non-negative cone—\({\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_{+}\)—we obtain a tractable and near-exact relaxation. Indeed, as we shall see in our numerical experiments, combining this relaxation with a state-of-the-art heuristic supplies certifiably near-optimal solutions in both theory and practice.
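As an illustration, here is a minimal cvxpy sketch of such a doubly non-negative relaxation for the least-squares instance revisited in Sect. 4.5 (the data matrix and rank budget are hypothetical, and the Schur-complement block is one way to model \(\textrm{tr}(g_f)\)):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(11)
n, k = 8, 2
U = rng.random((n, k))
A = U @ U.T  # a completely positive matrix to approximate

X = cp.Variable((n, n), symmetric=True)
Y = cp.Variable((n, n), symmetric=True)
B = cp.Variable((n, n), symmetric=True)  # models X X^T via the Schur complement

constraints = [
    X >= 0, X >> 0,                  # doubly non-negative relaxation of X in C^n_+
    cp.bmat([[Y, X], [X, B]]) >> 0,  # tr(B) >= tr(X Y^+ X)
    Y >> 0, np.eye(n) - Y >> 0, cp.trace(Y) <= k,
]
# 0.5*||A - X||_F^2 = 0.5*||A||_F^2 - <A, X> + 0.5*||X||_F^2, relaxing ||X||_F^2 to tr(B)
objective = cp.Minimize(0.5 * np.sum(A ** 2) - cp.trace(A @ X) + 0.5 * cp.trace(B))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)
print("Doubly non-negative lower bound:", prob.value)
```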
Note that we could have obtained an alternative relaxation by instead considering the perspective of
Remark 7
One can obtain a nearly identical formulation over the copositive cone (c.f. [17]).
Power: Let \(f(\varvec{X})=\varvec{X}^\alpha \) for \(\alpha \in [0, 1]\) and \(\varvec{X} \in {\mathcal {S}}^n_+\). The matrix perspective function is
The expression above can be simplified into \(\varvec{X}^\alpha \varvec{Y}^{1-\alpha }\) when \(\varvec{X}\) and \(\varvec{Y}\) commute and, per Remark 4, the former expression can be used equivalently for optimization purposes, even when \(\varvec{X}\) and \(\varvec{Y}\) do not commute.
Remark 8
(Matrix Power Cone) This function’s hypograph, the matrix power cone, i.e.,
is a closed convex cone which is semidefinite representable for any rational \(\alpha \in [0, 1]\) [32, Equation 3]. Consequently, it is a tractable object which successfully models the matrix power function (and its perspective), and we shall make repeated use of it when we apply the MPRT to several important low-rank problems in Sect. 4.
Logarithm: Let \(f(\varvec{X})=-\log (\varvec{X})\) be the matrix logarithm function. We have that
Observe that when \(\varvec{X}\) and \(\varvec{Y}\) commute, \(g_f(\varvec{X},\varvec{Y})\) can be rewritten as \(\varvec{Y}(\log (\varvec{Y}) - \log (\varvec{X}))\), whose trace is the (Umegaki) quantum relative entropy function (see [33], for a general theory). We remark that the domain of \(\log (\varvec{X})\) requires that \(\varvec{X}\) is full-rank, which at first glance makes the use of this function problematic for low-rank optimization. Accordingly, we consider the \(\epsilon \)-logarithm function, i.e., \(\log _{\epsilon }(\varvec{X})=\log (\varvec{X} + \epsilon \mathbb {I})\) for \(\epsilon >0\), as advocated by [35] in a different context. Background on matrix exponential and logarithm functions can be found in Appendix A.1.
Observe that \(\textrm{tr}(\log (\varvec{X}))=\log \det (\varvec{X})\) while \(\textrm{tr}(g_f)=\textrm{tr}(\varvec{X}(\log (\varvec{X})-\log (\varvec{Y})))\). Thus, the matrix logarithm and its trace verify the concavity of the log determinant function—which has numerous applications in low-rank problems [35] and interior point methods [67] among others—while the perspective of the matrix logarithm provides an elementary proof of the convexity of the quantum relative entropy: a task for which perspective-free proofs are technically demanding [29].
Von Neumann entropy: Let \(f(\varvec{X})=\varvec{X}\log (\varvec{X})\) denote the von Neumann quantum entropy of a density matrix \(\varvec{X}\). Then, its perspective function is \(g_f(\varvec{X}, \varvec{Y})= \varvec{X}\varvec{Y}^{-\frac{1}{2}}\log (\varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}})\varvec{Y}^\frac{1}{2}\). When \(\varvec{X}\) and \(\varvec{Y}\) commute, this perspective can be equivalently written as
Note that various generalizations of the relative entropy for matrices have been proposed in the quantum physics literature [45]. However, these different definitions agree on the set of commuting matrices, hence can be used interchangeably for optimization purposes (see Remark 4).
Remark 9
(Quantum relative entropy cone) Note that the epigraph of \(g_f\), namely,
is a convex cone which can be approximated using semidefinite cones and optimized over using either the Matlab package CVXQuad (see [33]) or, directly, an interior point method for asymmetric cones [49]. Consequently, this is a tractable object which models the matrix logarithm and von Neumann entropy (and their perspectives).
Finally, Table 2 relates the matrix perspectives discussed above with their scalar analogs.
3.6 Matrix perspective cuts
We now generalize the perspective cuts of [37, 42] from vectors to matrices and cardinality to rank constraints. Let us reconsider the previously defined mixed-projection optimization problem:
where similarly to [37] we assume that \(f(\varvec{0})=\varvec{0}\) to simplify the cut derivation procedure. Letting \(\varvec{\theta }\) model the epigraph of f via \(\varvec{\theta } \succeq f(\varvec{X})\) and \(\varvec{S}\) be a subgradient of f at \(\bar{\varvec{X}}\), we have:
which if \(f(\varvec{X})=\varvec{X}^2\) —as discussed previously—reduces to
which is precisely the analog of perspective cuts in the vector case. Note, however, that imposing these cuts requires semidefinite constraints, which suggests they may not be as practically useful as their vector counterparts. For instance, the outer-approximation scheme for low-rank problems in our prior work [11] has a non-convex QCQOP master problem, which can currently only be solved using Gurobi, while Gurobi does not support semidefinite constraints.
We remark however that the inner product of Equation (23) with an arbitrary PSD matrix supplies a valid linear inequality. Two interesting cases of this observation arise when we take the inner product of the cut with either a rank-one matrix or the identity matrix.
Taking an inner product with the identity matrix supplies the inequality:
Moreover, by analogy to [10, Section 3.4], if we “project out” the \(\varvec{X}\) variables by decomposing the problem into a master problem in \(\varvec{Y}\) and subproblems in \(\varvec{X}\), then this cut recovers the Generalized Benders Decomposition cuts derived in our prior work [11, Equation 17].
Alternatively, taking the inner product of the cut with a rank-one matrix \(\varvec{b}\varvec{b}^\top \) gives:
A further improvement is actually possible: rather than requiring that the semidefinite inequality is non-negative with respect to one rank-one matrix, we can require that it is simultaneously non-negative in the directions \(\varvec{v}^1\) and \(\varvec{v}^2\). This supplies the second-order cone [64, Equation 8] cut:
The analysis in this section suggests that applying a perspective cut decomposition scheme out-of-the-box may be impractical, but leaves the door open to adaptations of the scheme which account for the projection matrix structure.
4 Examples and Perspective Relaxations
In this section, we apply the MPRT to several important low-rank problems, in addition to the previously discussed reduced-rank regression problem (Sect. 1.1). We also recall Theorem 2 to demonstrate that applying the MPRT to spectral functions which feature in these problems actually gives the convex hull of relevant substructures.
4.1 Matrix completion
Given a sample \((A_{i,j} : (i, j) \in {\mathcal {I}} \subseteq [n] \times [n])\) of a matrix \(\varvec{A} \in {\mathcal {S}}^n_+\), the matrix completion problem is to reconstruct the entire matrix, by assuming \(\varvec{A}\) is approximately low-rank [19]. Letting \(\mu , \gamma >0\) be penalty multipliers, this problem admits the formulation:
Applying the MPRT to the \(\Vert \varvec{X}\Vert _F^2=\textrm{tr}(\varvec{X}^\top \varvec{X})\) term demonstrates that this problem is equivalent to the mixed-projection problem:
and relaxing \(\varvec{Y} \in {\mathcal {Y}}^n_n\) to \(\varvec{Y} \in \textrm{Conv}({\mathcal {Y}}^n_n) =\{\varvec{Y} \in {\mathcal {S}}^n: \varvec{0} \preceq \varvec{Y} \preceq \mathbb {I}\}\) supplies a valid relaxation. We now argue that this relaxation is often high-quality, by demonstrating that the MPRT supplies the convex envelope of \(t\ge \frac{1}{2\gamma }\Vert \varvec{X}\Vert _F^2+\mu \cdot \textrm{Rank}(\varvec{X})\), via the following corollary to Theorem 2:
Corollary 1
Let
be a set where \(\ell , u \in \mathbb {R}_+\). Then, this set’s convex hull is given by:
4.2 Tensor completion
A central problem in machine learning is to reconstruct a d-tensor \(\varvec{\mathscr {X}}\) given a subsample of its entries \((A_{i_1, \ldots i_d}: (i_1, \ldots i_d) \in {\mathcal {I}} \subseteq [n_1] \times [n_2] \times \ldots \times [n_d])\), by assuming that the tensor is low-rank. Since even evaluating the rank of a tensor is NP-hard [50], a popular approach for solving this problem is to minimize the reconstruction error while constraining the ranks of different unfoldings of the tensor (see, e.g., [40]). After imposing Frobenius norm regularization and letting \(\Vert \cdot \Vert _{HS}=\sqrt{\sum _{i_{1}=1}^{n_{1}} \ldots \sum _{{i_d}=1}^{n_d} X_{i_1, \ldots , i_d}^2}\) denote the (second-order cone representable) Hilbert-Schmidt norm of a tensor, this leads to optimization problems of the form:
Similarly to low-rank matrix completion, it is tempting to apply the MPRT to model the \(\varvec{X}_{(i)}^\top \varvec{X}_{(i)}\) term for each mode-\(i\) unfolding. We now demonstrate this supplies a tight approximation of the convex hull of the sum of the regularizers, via the following lemma (proof omitted; it follows in the spirit of [42, Lemma 4]):
Lemma 3
be a set where \(l_i, u_i, q_i \in \mathbb {R}^n_+ \ \forall i \in [m]\), and \({\mathcal {S}}_i\) is a set of the same form as \({\mathcal {S}}\), but l, u are replaced by \(l_i, u_i\). Then, an extended formulation of this set’s convex hull is given by:
Lemma 3 suggests that the MPRT may improve algorithms which aim to recover tensors of low slice rank. For instance, in low-rank tensor problems where (26) admits multiple local solutions, solving the convex relaxation coming from \({\mathcal {Q}}^c\) and greedily rounding may give a high-quality initial point for an alternating minimization method such as the method of [31], and indeed allow such a strategy to return better solutions than if it were initialized at a random point.
Note however that Lemma 3 does not necessarily give the convex hull of the sum of the regularizers, since the regularization terms involve different slices of the same tensor and thus interact; see also [69] for a related proof that the tensor trace norm does not give the convex envelope of the sum of ranks of slices.
4.3 Low-rank factor analysis
An important problem in statistics, psychometrics and economics is to decompose a covariance matrix \(\varvec{\Sigma } \in {\mathcal {S}}^n_+\) into a low-rank matrix \(\varvec{X} \in {\mathcal {S}}^n_+\) plus a diagonal matrix \(\varvec{\Phi } \in {\mathcal {S}}^n_+\), as explored by [8] and references therein. This corresponds to solving:
where \(q\ge 1\), \(\Vert \varvec{X} \Vert _q= \left( \sum _{i=1}^n \lambda _i(\varvec{X})^q\right) ^\frac{1}{q}\) denotes the matrix q-norm, and we constrain the spectral norm of \(\varvec{X}\) via a big-M constraint for the sake of tractability.
This problem’s objective involves minimizing \(\textrm{tr}\left( \varvec{\Sigma }-\varvec{\Phi }-\varvec{X}\right) ^q\), and it is not immediately obvious how to either apply the technique in the presence of the \(\varvec{\Phi }\) variables or, alternatively, separate out the \(\varvec{\Phi }\) term and apply the MPRT to an appropriate (\(\varvec{\Phi }\)-free) substructure. To proceed, let us therefore first consider its scalar analog, obtaining the convex closure of the following set:
where \(d \in \mathbb {R}\) and \(q \ge 1\) are fixed constants, and we require that \(\vert x\vert \le M\) for the sake of tractability. We obtain the convex closure via the following proposition (proof deferred to Appendix B.4):
Proposition 5
The convex closure of the set \({\mathcal {T}}\), \({\mathcal {T}}^c\), is given by:
Remark 10
To check that this set is indeed a valid convex relaxation, observe that if \(z=0\) then \(x=0\) and \(x=-\beta \implies \beta =0\) and \(t \ge \vert y-d\vert ^q\), while if \(z=1\) then \(y=\beta \) and \(t \ge \vert x+y-d\vert ^q\).
Observe that \({\mathcal {T}}^c\) can be modeled using two power cones and one inequality constraint.
Proposition 5 suggests that we can obtain high-quality convex relaxations for low-rank factor analysis problems via a judicious use of the matrix power cone. Namely, introduce an epigraph matrix \(\varvec{\theta }\) to model the eigenvalues of \((\varvec{\Sigma }-\varvec{\Phi }-\varvec{X})^q\) and an orthogonal projection matrix \(\varvec{Y}_2\) to model the span of \(\varvec{X}\). This then leads to the following matrix power cone representable relaxation:
4.4 Optimal experimental design
Letting \(\varvec{A} \in \mathbb {R}^{n \times m}\) where \(m \ge n\) be a matrix of linear measurements of the form \(y_i=\varvec{a}_i^\top \varvec{\beta }+\epsilon _i\) from an experimental setting, the D-optimal experimental design problem (a.k.a. the sensor selection problem) is to pick \(k \le m\) of these experiments in order to make the most accurate estimate of \(\varvec{\beta }\) possible, by solving (see [48, 71], for a modern approach):
where we define \(\log \det _{\epsilon }(\varvec{X})= \log \det (\varvec{X}+\epsilon \mathbb {I})\) for \(\epsilon >0\) to be the pseudo log-determinant of a rank-deficient PSD matrix, which can be thought of as imposing an uninformative prior of importance \(\epsilon \) on the experimental design process. Since \(\log \det (\varvec{X})=\textrm{tr}(\log (\varvec{X}))\), a valid convex relaxation is given by:
which can be modeled using the quantum relative entropy cone, via \((-\varvec{\theta }, \mathbb {I}, \varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top +\epsilon \mathbb {I}) \in {\mathcal {K}}^{\text {rel, op}}_{\text {mat}}\). This is equivalent to perhaps the most common relaxation of D-optimal design, as proposed by [15, Equation 7.2.6]. Viewed through the lens of the quantum relative entropy cone, the presence of the identity term suggests that this relaxation leaves something “on the table”.
In this direction, let us apply the MPRT. Observe that \(\varvec{X}:=\sum _{i \in [m]} z_i \varvec{a}_i \varvec{a}_i^\top \) has rank at most k, and thus at an optimal solution to the original problem there is some orthogonal projection matrix \(\varvec{Y}\) such that \(\varvec{X}=\varvec{Y}\varvec{X}\). Therefore, we can take the perspective function of \(f(\varvec{X})=\log (\varvec{X}{+\epsilon \mathbb {I}})\), and thereby obtain the following valid—and potentially much tighter when \(k<n\)—convex relaxation:
which can be modeled via the quantum relative entropy cone: \((-\varvec{\theta }, \varvec{Y}, \varvec{A}\textrm{Diag}(\varvec{z})\varvec{A}^\top +\epsilon \varvec{Y}) \in {\mathcal {K}}^{\text {rel, op}}_{\text {mat}}\). We now argue that this relaxation is high-quality, by demonstrating that the MPRT supplies the convex envelope of \(t \ge -\log \det _\epsilon (\varvec{X})\) under a low-rank constraint, via the following corollary to Theorem 2:
Corollary 2
be a set where \(\epsilon , k,t\) are fixed. Then, this set’s convex hull is:
Remark 11
Observe that (29)’s relaxation is not useful in the over-determined regime where \(k \ge n\), since setting \(\varvec{Y}=\mathbb {I}\) recovers (28)’s Boolean relaxation, which is considerably cheaper to optimize over. Accordingly, we only consider the under-determined regime in our experiments.
4.5 Non-negative matrix optimization
Many important problems in combinatorial optimization, statistics and computer vision (see, e.g., [17]) reduce to optimizing over the space of low-rank matrices with non-negative factors. An important special case is when we would like to find the low-rank completely positive matrix \(\varvec{X}\) which best approximates (in a least-squares sense) a given matrix \(\varvec{A} \in {\mathcal {S}}^n_+\), i.e., perform non-negative principal component analysis. Formally, we have the problem:
where \({\mathcal {C}}^n_+:=\{\varvec{U}\varvec{U}^\top : \varvec{U} \in \mathbb {R}^{n \times n}_{+}\}\) denotes the cone of \(n \times n\) completely positive matrices.
Applying the MPRT to the strongly convex \(\frac{1}{2}\Vert \varvec{X}\Vert _F^2\) term in the objective therefore yields the following completely positive program:
Interestingly, since (31)’s reformulation has a linear objective, some extreme point in its relaxation is optimal, which means we can relax the requirement that \(\varvec{Y}\) is a projection matrix without loss of optimality; the computational complexity of the problem is then entirely concentrated in the completely positive cone. However, completely positive optimization is itself intractable. Nonetheless, it can be approximated by replacing the completely positive cone with the doubly non-negative cone, \({\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_{+}\). Namely, we instead solve
Unfortunately, rounding a solution to (32) to obtain a completely positive \(\varvec{X}\) is non-trivial. Indeed, according to [41], there is currently no effective mechanism for rounding doubly non-negative programs. Nonetheless, as we shall see in our numerical results, there are already highly effective heuristic methods for completely positive matrix factorization, and combining our relaxation with such a procedure offers certificates of near optimality in a tractable fashion.
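To make the conic modeling pattern concrete, the following JuMP sketch solves the stripped-down problem \(\min _{\varvec{X} \in {\mathcal {S}}^n_+ \cap \mathbb {R}^{n \times n}_+} \tfrac{1}{2}\Vert \varvec{X}-\varvec{A}\Vert _F^2\); it deliberately omits the MPRT epigraph variables appearing in (31)–(32), so it illustrates only how the doubly non-negative constraint and the Frobenius-norm epigraph are passed to a conic solver, not the relaxation we actually solve in Sect. 5.2.

```julia
using JuMP, MosekTools, LinearAlgebra

# Minimal sketch: minimize 0.5 * ||X - A||_F^2 over the doubly non-negative cone.
# This omits the MPRT epigraph terms of (31)-(32); it only illustrates the conic
# modeling pattern (PSD cone + elementwise non-negativity + rotated SOC epigraph).
function dnn_fit(A::Matrix{Float64})
    n = size(A, 1)
    model = Model(Mosek.Optimizer)
    set_silent(model)
    @variable(model, X[1:n, 1:n], PSD)      # X in the positive semidefinite cone
    @constraint(model, X .>= 0)             # elementwise non-negativity
    @variable(model, t >= 0)
    # 2 * t * 1 >= ||vec(X - A)||^2, i.e., t >= 0.5 * ||X - A||_F^2; we use a
    # rotated second-order cone because Mosek does not accept a quadratic
    # objective together with semidefinite constraints.
    @constraint(model, vcat(t, 1.0, vec(X .- A)) in RotatedSecondOrderCone())
    @objective(model, Min, t)
    optimize!(model)
    return value.(X), objective_value(model)
end
```

As discussed in Sect. 5.1, the rotated second-order cone epigraph is the standard workaround for combining a quadratic loss with semidefinite constraints in solvers such as Mosek.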
Remark 12
If \(\varvec{X}=\varvec{D}\varvec{\Pi }\) is a monomial matrix, i.e., decomposable as the product of a diagonal matrix \(\varvec{D}\) and a permutation matrix \(\varvec{\Pi }\), as occurs in binary optimization problems such as k-means clustering problems among others (c.f. [63]), then it follows that \((\varvec{X}^\top \varvec{X})^\dag \ge \varvec{0}\) (see [66]) and thus \(\varvec{Y}:=\varvec{X}(\varvec{X}^\top \varvec{X})^\dag \varvec{X}^\top \) is elementwise non-negative. In this case, the doubly non-negative relaxation (32) should be strengthened by requiring that \(\varvec{Y} \ge \varvec{0}\).
5 Numerical results
In this section, we evaluate the algorithmic strategies derived in the previous section, implemented in Julia 1.5 using JuMP.jl 0.21.6 and Mosek 9.1 to solve the conic problems considered here. Except where indicated otherwise, all experiments were performed on an Intel Xeon E5–2690 v4 2.6GHz CPU core using 32 GB RAM. To bridge the gap between theory and practice, we have made our code freely available on GitHub at github.com/ryancorywright/MatrixPerspectiveSoftware.
5.1 Reduced rank regression
In this section, we compare our convex relaxations for reduced rank regression developed in the introduction and laid out in (6)–(7)—which we refer to as “Persp” and “DCL” respectively—against the nuclear norm estimator proposed by [57] (“NN”), who solve
Similarly to [57], we attempt to recover a rank-\(k_{true}\) matrix \(\varvec{\beta }_{\text {true}}=\varvec{U}\varvec{V}^\top \), where each entry of \(\varvec{U} \in \mathbb {R}^{p \times k_{true}}\) and \(\varvec{V} \in \mathbb {R}^{n \times k_{true}}\) is i.i.d. standard Gaussian \({\mathcal {N}}(0,1)\), the design matrix \(\varvec{X} \in \mathbb {R}^{m \times p}\) contains i.i.d. standard Gaussian \({\mathcal {N}}(0,1)\) entries, \(\varvec{Y}=\varvec{X}\varvec{\beta }_{\text {true}}+\varvec{E}\), and \(E_{i,j} \sim {\mathcal {N}}(0, \sigma )\) injects a small amount of i.i.d. noise. We set \(n=p=50\), \(k=10\), \(\gamma =10^6\), \(\sigma =0.05\) and vary m. To ensure a fair comparison, we cross-validate \(\mu \) for both of our relaxations and [57]’s approach so as to minimize the MSE on a validation set. For each m, we evaluate 20 different values of \(\mu \), distributed uniformly in logspace between \(10^{-4}\) and \(10^{4}\), across 50 random instances for our convex relaxations, and report results on 100 different random instances with the “best” \(\mu \) for each method and each m.
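For reproducibility, the synthetic instances described above can be generated in a few lines of Julia; the helper name `generate_rrr_instance` below is ours and does not appear in the released code.

```julia
using LinearAlgebra, Random

# Synthetic reduced-rank regression instance, following the recipe above:
# beta_true = U * V' has rank k_true, X has i.i.d. N(0, 1) entries, and
# Y = X * beta_true + E where E has i.i.d. Gaussian noise of scale sigma.
function generate_rrr_instance(m, n, p, k_true; sigma = 0.05, rng = MersenneTwister(1234))
    U = randn(rng, p, k_true)
    V = randn(rng, n, k_true)
    beta_true = U * V'                 # p x n coefficient matrix of rank k_true
    X = randn(rng, m, p)               # design matrix
    Y = X * beta_true + sigma .* randn(rng, m, n)
    return X, Y, beta_true
end

X, Y, beta_true = generate_rrr_instance(100, 50, 50, 10)
rank(beta_true)                        # equals 10
```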
Rank recovery and statistical accuracy: Figures 1a–c report the relative accuracy (\(\Vert \varvec{\beta }_{\text {est}}-\varvec{\beta }_{\text {true}}\Vert _F/\Vert \varvec{\beta }_{\text {true}}\Vert _F\)), the rank (i.e., the number of singular values of \(\varvec{\beta }_{\text {est}}\) which exceed \(10^{-4}\)), and the out-of-sample MSE (see footnote 5), \(\Vert \varvec{X}_{\text {new}}\varvec{\beta }_{\text {est}}-\varvec{Y}_{\text {new}}\Vert _F^2\), normalized by the out-of-sample MSE of the ground truth, \(\Vert \varvec{X}_{\text {new}}\varvec{\beta }_{\text {true}}-\varvec{Y}_{\text {new}}\Vert _F^2\). Results are averaged over 100 random instances per value of m. We observe that—even though we did not supply the true rank of the optimal solution in our formulation—Problem (7)’s relaxation returns solutions of the correct rank (\(k_{true}=10\)) with better MSE and accuracy, while our more “naive” perspective relaxation (6) and the nuclear norm approach (33) return solutions of higher rank and lower accuracy. This suggests that (7)’s formulation should be preferred as an estimator for reduced-rank problems, and empirically confirms that the MPRT can lead to significant improvements in statistical accuracy.
Scalability w.r.t. m: Figure 1d reports the average time for Mosek to converge (see footnote 6) to an optimal solution, over 100 random instances per m. Surprisingly, although (7) is a stronger relaxation than (6), it is one to two orders of magnitude faster to solve than (6) and (33). The relative scalability of (7)’s formulation as m—the number of observations—increases can be explained as follows: (7) involves a linear inner product of the Gram matrix \(\varvec{X}^\top \varvec{X}\) with a semidefinite matrix \(\varvec{B}\) (the size of which does not vary with m), while Problems (6) and (33) involve a quadratic inner product \(\langle \varvec{\beta }\varvec{\beta }^\top , \varvec{X}^\top \varvec{X}\rangle \), which must be modeled using a rotated second-order cone constraint (the size of which depends on m), since modern conic solvers such as Mosek do not allow quadratic objective terms and semidefinite constraints to be simultaneously present (if they did, we believe all three formulations would scale similarly).
Scalability w.r.t. p: Next, we evaluate the scalability of all three approaches in terms of their solve times and peak memory usage (measured using the slurm command MaxRSS) as \(n=p\) increases. Figure 2 depicts the average time to converge to an optimal solution (a) and the peak memory consumption (b) of each method as we vary \(n=p\) with \(m=n\), \(k=10\), \(\gamma =10^6\), each \(\mu \) fixed to the average cross-validated value found in the previous experiment, a peak memory budget of 120GB, a runtime budget of 12 hours, and otherwise the same experimental setup as previously (averaged over 20 random instances per n). We observe that (7)’s relaxation is dramatically more scalable than the other two approaches: it can solve problems of nearly twice the size (four times as many variables), and it solves problems of a similar size in substantially less time and with substantially less peak memory (40s vs. 1000s when \(n=100\)). All in all, the proposed relaxation (7) appears to be the best method of the three considered.
5.2 Non-negative matrix factorization
In this section, we benchmark the quality of our dual bound for non-negative matrix factorization laid out in Sect. 4.5, using the non-linear reformulation strategy proposed by [18] (alternating least squares, or ALS) to obtain upper bounds. Namely, we compute upper bounds by solving for local minima of the problem
In our implementation of ALS, we obtain a local minimum by introducing a dummy variable \(\varvec{V}\) which equals \(\varvec{U}\) at optimality and alternating between solving the following two problems
where we set \(\rho _t=\min (10^{-4}\times 2^{t-1}, 10^5)\) at the tth iteration so that the matrix returned is (approximately) positive semidefinite, as advocated in [5, Section 5.2.3] (we cap \(\rho _t\) to avoid numerical instability). We alternate between solving these two problems from a random initialization point \(\varvec{V}_{0}\)—where each \(V_{0,i,j}\) is i.i.d. standard uniform—until either the objective value changes by less than \(10^{-4}\) between iterations or we exceed the maximum number of allowable iterations, which we set to 100.
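A minimal sketch of this alternating scheme appears below. For brevity, each non-negative least-squares subproblem is solved approximately via a few projected-gradient steps rather than exactly, so the sketch illustrates the loop structure (the coupling variable \(\varvec{V}\), the penalty schedule \(\rho _t\), and the stopping rules) rather than reproducing our implementation; all function names are ours.

```julia
using LinearAlgebra, Random

# Approximately solve  min_{U >= 0} ||A - U * W'||_F^2 + rho * ||U - W||_F^2
# with a few projected-gradient steps (step size from a Lipschitz estimate).
function update_block(A, W, U0, rho; inner_iters = 50)
    U = copy(U0)
    L = 2 * (opnorm(W)^2 + rho)                       # Lipschitz constant of the gradient
    for _ in 1:inner_iters
        grad = 2 .* ((U * W' .- A) * W) .+ 2 .* rho .* (U .- W)
        U = max.(U .- grad ./ L, 0.0)                 # gradient step + projection onto U >= 0
    end
    return U
end

# Alternating scheme with the increasing (capped) penalty rho_t and the
# stopping rules described above.
function als_nmf(A, k; max_iters = 100, tol = 1e-4, rng = MersenneTwister(1))
    n = size(A, 1)
    V = rand(rng, n, k)                               # random initialization, i.i.d. uniform
    U = copy(V)
    obj_old = Inf
    for t in 1:max_iters
        rho = min(1e-4 * 2.0^(t - 1), 1e5)
        U = update_block(A, V, U, rho)                # update U with V fixed
        V = update_block(A', U, V, rho)               # update V with U fixed (A is symmetric)
        obj = norm(A - U * V')^2
        abs(obj_old - obj) < tol && break
        obj_old = obj
    end
    return U, V, U * V'                               # U * V' ≈ U * U' ⪰ 0 once rho forces U ≈ V
end
```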
To generate problem instances, we let \(\varvec{A}=\varvec{U}\varvec{U}^\top +\varvec{E}\) where \(\varvec{U} \in \mathbb {R}^{n \times k_{true}}\), each \(U_{i,j}\) is uniform on [0, 1], \(E_{i,j} \sim {\mathcal {N}}(0, 0.0125 k_{true})\), and set \(A_{i,j}=0\) if \(A_{i,j}<0\). We set \(n=50, k_{true}=10\). We use the ALS heuristic to compute a feasible solution \(\varvec{X}\) and an upper-bound on the problem’s objective value. By comparing it with the lower bound derived from our MPRT, we can assess the sub-optimality of the heuristic solution, which previously lacked optimality guarantees.
Figure 3 depicts the average relative in-sample MSE of the heuristic (\(\Vert \varvec{X}-\varvec{A}\Vert _F/\Vert \varvec{A}\Vert _F\)) and the relative bound gap—(UB-LB)/UB—as we vary the target rank, averaged over 100 random synthetic instances. We observe that the method is most accurate and has the lowest MSE when k is set to \(k_{true}=10\), which confirms that the method can recover solutions of the correct rank. In addition, by combining the solution from ALS with our lower bound, we can compute a duality gap and certify that the heuristic solution is within \(0\%{-}3\%\) of optimality, with the gap peaking at \(k = k_{true}\) and stabilizing as \(k \rightarrow n\). This echoes similar findings in k-means clustering and alternating current optimal power flow problems, where the SDO relaxation need not be near-tight in theory but is nonetheless nearly exact in practice [51, 63]. Further, this suggests our convex relaxation may be a powerful tool for providing optimality gaps for non-negative matrix factorization heuristics, and in particular for detecting when they are performing well or can be further improved.
Figure 4 reports the time needed to compute both the upper bound and the lower bound as we vary the target rank.
5.3 Optimal experimental design
In this section, we benchmark our dual bound for D-optimal experimental design (29) against the convex relaxation (28) and a greedy submodular maximization approach, in terms of both bound quality and the ability of all three approaches to generate high-quality feasible solutions. We round both relaxations to generate feasible solutions greedily, by setting the k largest \(z_i\)’s in a continuous relaxation to 1. For the submodular maximization approach, \({\mathcal {S}}\) is initially an empty set and we iteratively set the jth index of \(\varvec{z}\) to 1, where at each step we take
Interestingly, the greedy rounding approach enjoys rigorous approximation guarantees (see [48, 71]), while the submodular maximization approach enjoys strong guarantees of its own (see [58]).
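Both procedures are straightforward to implement. The sketch below (function names are ours) performs the greedy rounding of a continuous relaxation and the standard greedy rule, which at each step adds the experiment with the largest marginal gain in \(\log \det (\cdot +\epsilon \mathbb {I})\).

```julia
using LinearAlgebra

# Greedy rounding of a continuous relaxation: keep the k largest z_i's.
function round_relaxation(z::Vector{Float64}, k::Int)
    zr = zeros(length(z))
    zr[partialsortperm(z, 1:k, rev = true)] .= 1.0
    return zr
end

# Greedy submodular maximization: starting from the empty set, repeatedly add the
# experiment j maximizing log det(sum_{i in S ∪ {j}} a_i a_i' + eps_reg * I).
function greedy_design(A::Matrix{Float64}, k::Int; eps_reg = 1e-6)
    n, m = size(A)
    S = Int[]
    M = eps_reg .* Matrix(1.0I, n, n)
    for _ in 1:k
        best_j, best_val = 0, -Inf
        for j in setdiff(1:m, S)
            val = logdet(Symmetric(M + A[:, j] * A[:, j]'))
            if val > best_val
                best_j, best_val = j, val
            end
        end
        push!(S, best_j)
        M += A[:, best_j] * A[:, best_j]'
    end
    z = zeros(m)
    z[S] .= 1.0
    return z, S
end
```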
We benchmark all methods in terms of their performance on synthetic D-optimal experimental design problems, where we let \(\varvec{A} \in \mathbb {R}^{n \times m}\) be a matrix with i.i.d. \({\mathcal {N}}(0, \frac{1}{\sqrt{n}})\) entries. We set \(n=20, m=10, \epsilon =10^{-6}\) and vary \(k <m\) over 20 random instances. Table 3 depicts the average relative bound gap, objective values, and runtimes for all three methods (we use the lower bound from (28)’s relaxation to compute the submodular bound gap). Note that all results for this experiment were generated on a standard MacBook Pro laptop with a 2.9 GHz 6-core Intel i9 CPU using 16 GB DDR4 RAM, CVX version 1.22, Matlab R2021a, and Mosek 9.1. Moreover, we optimize over (29)’s relaxation using the CVXQuad package developed by [33].
Relaxation quality: We observe that (29)’s relaxation is dramatically stronger than (28), offering bound gaps on the order of \(0\%{-}3\%\) when \(k \le 7\), rather than gaps of \(90\%\) or more. This confirms the efficacy of the MPRT, and demonstrates the value of taking low-rank constraints into account when designing convex relaxations, even when not obviously present.
Scalability: We observe that (29)’s relaxation is around two orders of magnitude slower than the other proposed approaches, largely because semidefinite approximations of quantum relative entropy are expensive, but is still tractable for moderate sizes. We believe, however, that the relaxation would scale significantly better if it were optimized over using an interior point method for non-symmetric cones (see, e.g., [49, 72]), or an alternating minimization approach (see [34]). As such, (29)’s relaxation is potentially useful at moderate problem sizes with off-the-shelf software, or at larger problem sizes with problem-specific techniques such as alternating minimization.
6 Conclusion
In this paper, we introduced the Matrix Perspective Reformulation Technique (MPRT), a new technique for deriving tractable and often high-quality relaxations of a wide variety of low-rank problems. We also invoked the technique to derive the convex hulls of some frequently-studied low-rank sets, and provided examples where the technique proves useful in practice. This is significant and potentially useful to the community, because substantial progress on producing tractable upper bounds for low-rank problems has been made over the past decade, but until now almost no progress on tractable lower bounds has followed.
Future work could take three directions: (1) automatically detecting structures where the MPRT could be applied, as is already done for perspective reformulations in the MIO case by CPLEX and Gurobi, (2) developing scalable semidefinite-free techniques for solving the semidefinite relaxations proposed in this paper, and (3) combining the ideas in this paper and in our prior work [11] with custom branching strategies to solve low-rank problems to optimality at scale.
Notes
Observe that the constraints in Problem (4) are equivalent to the block matrix constraint \(\begin{pmatrix} \textrm{Diag}(\varvec{z}) &{} \textrm{Diag}(\varvec{w}) \\ \textrm{Diag}(\varvec{w}) &{} \varvec{W}\end{pmatrix} \succeq \varvec{0}.\) This verifies that the reduced rank regression formulation is indeed a generalization of [26]’s formulation for sparse regression.
Note that \(f(\varvec{X})\) and its perspective are concave functions; hence we model their hypographs, not epigraphs.
We only consider the PSD case for notational convenience. However, the symmetric case follows in much the same manner, after splitting \(\varvec{X}=\varvec{X}_{+}-\varvec{X}_{-}: \varvec{X}_{+}, \varvec{X}_{-} \succeq \varvec{0}, \langle \varvec{X}_{+}, \varvec{X}_{-}\rangle =0\) and replacing \(\varvec{X}\) with \(\varvec{X}_{+}+\varvec{X}_{-}\).
Specifically, if we are interested in quantum relative entropy problems where we minimize the trace of \(\varvec{X}_1\), as occurs in the context of the MPRT, we may achieve this using the domain-driven solver developed by [49]. However, we are not aware of any IPMs which can currently optimize over the full quantum relative entropy cone.
Evaluated on \(m=1000\) new observations of \(\varvec{X}_j, \varvec{Y}_k\) generated from the same distribution.
We model the convex quadratic \(\Vert \varvec{X}\varvec{\beta }-\varvec{Y}\Vert _F^2\) using a rotated second-order cone for formulations (6) and (33) (the quadratic term does not appear directly in (7)), model the nuclear norm term in (33) by introducing matrices \(\varvec{U}, \varvec{V}\) such that \(\begin{pmatrix} \varvec{U} &{} \varvec{\beta }\\ \varvec{\beta }^\top &{} \varvec{V}\end{pmatrix} \succeq \varvec{0}\) and minimizing \(\textrm{tr}(\varvec{U})+\textrm{tr}(\varvec{V})\), and use default Mosek parameters for all approaches.
References
Aktürk, M.S., Atamtürk, A., Gürel, S.: A strong conic quadratic reformulation for machine-job assignment with controllable processing times. Oper. Res. Lett. 37(3), 187–191 (2009)
Alizadeh, F.: Interior point methods in semidefinite programming with applications to combinatorial optimization. SIAM J. Optim. 5(1), 13–51 (1995)
Atamtürk, A., Gómez, A.: Rank-one convexification for sparse regression, (2019). arXiv:1901.10334
Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization: Analysis, algorithms, and engineering applications, vol. 2. SIAM Philadelphia, PA (2001)
Bertsekas, D. P.: Nonlinear programming. Athena Scientific Belmont MA, 3rd edition, (2016)
Bertsimas, D., Van Parys, B.: Sparse high-dimensional regression: exact scalable algorithms and phase transitions. Ann. Stat. 48(1), 300–323 (2020)
Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. The Annals of Statistics, 813–852 (2016)
Bertsimas, D., Copenhaver, M.S., Mazumder, R.: Certifiably optimal low rank factor analysis. J. Mach. Learn. Res. 18(1), 907–959 (2017)
Bertsimas, D., Pauphilet, J., Van Parys, B.: Sparse regression: scalable algorithms and empirical performance. Stat. Sci. 35(4), 555–578 (2020)
Bertsimas, D., Cory-Wright, R., Pauphilet, J.: A unified approach to mixed-integer optimization problems with logical constraints. SIAM J. Optim. 31(3), 2340–2367 (2021)
Bertsimas, D., Cory-Wright, R., Pauphilet, J.: Mixed-projection conic optimization: a new paradigm for modeling rank constraints. Oper. Res. (2021b)
Bertsimas, D., Cory-Wright, R., Pauphilet, J.: Solving large-scale sparse PCA to certifiable (near) optimality. J. Mach. Learn. Res. 23(13), 1–35 (2022)
Bhatia, R.: Matrix analysis, volume 169. Springer Science & Business Media New York, (2013)
Bienstock, D.: Eigenvalue techniques for convex objective, nonconvex optimization problems. In: International Conference on Integer Programming and Combinatorial Optimization, pages 29–42. Springer, (2010)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004)
Boyd, S., El Ghaoui, L., Feron, E., Balakrishnan, V.: Linear Matrix Inequalities in System and Control Theory, volume 15. Studies in Applied Mathematics, Society for Industrial and Applied Mathematics, Philadelphia, PA, (1994)
Burer, S.: On the copositive representation of binary and continuous nonconvex quadratic programs. Math. Program. 120(2), 479–495 (2009)
Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)
Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717 (2009)
Carlen, E.: Trace inequalities and quantum entropy: an introductory course. Entropy Quantum 529, 73–140 (2010)
Ceria, S., Soares, J.: Convex programming for disjunctive convex optimization. Math. Program. 86(3), 595–614 (1999)
Chares, R.: Cones and interior-point algorithms for structured convex optimization involving powers and exponentials. PhD thesis, UCL-Université Catholique de Louvain, (2009)
Combettes, P.L.: Perspective functions: Properties, constructions, and examples. Set-Valued Var. Anal. 26(2), 247–264 (2018)
Dacorogna, B., Maréchal, P.: The role of perspective functions in convexity, polyconvexity, rank-one convexity and separate convexity. J. Convex Anal. 15(2), 271–284 (2008)
Davis, C.: Various averaging operations onto subalgebras. Ill. J. Math. 3(4), 538–553 (1959)
Dong, H., Chen, K., Linderoth, J.: Regularization vs. relaxation: a conic optimization perspective of statistical variable selection (2015). arXiv:1510.06083
Ebadian, A., Nikoufar, I., Gordji, M.E.: Perspectives of matrix convex functions. Proc. Natl. Acad. Sci. 108(18), 7313–7314 (2011)
Effros, E., Hansen, F.: Non-commutative perspectives. Ann. Funct. Anal. 5(2), 74–79 (2014)
Effros, E.G.: A matrix convexity approach to some celebrated quantum inequalities. Proc. Natl. Acad. Sci. 106(4), 1006–1008 (2009)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Farias, V.F., Li, A.A.: Learning preferences with side information. Manage. Sci. 65(7), 3131–3149 (2019)
Fawzi, H., Saunderson, J.: Lieb’s concavity theorem, matrix geometric means, and semidefinite optimization. Linear Algebra Appl. 513, 240–263 (2017)
Fawzi, H., Saunderson, J., Parrilo, P.A.: Semidefinite approximations of the matrix logarithm. Found. Comput. Math. 19(2), 259–296 (2019)
Faybusovich, L., Zhou, C.: Self-concordance and matrix monotonicity with applications to quantum entanglement problems. Appl. Math. Comput. 375, 125071 (2020)
Fazel, M., Hindi, H., Boyd, S.P.: Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices. In: Proceedings of the 2003 American Control Conference, volume 3, pages 2156–2162. IEEE, (2003)
Fischetti, M., Ljubić, I., Sinnl, M.: Redesigning Benders decomposition for large-scale facility location. Manage. Sci. 63(7), 2146–2162 (2016)
Frangioni, A., Gentile, C.: Perspective cuts for a class of convex 0–1 mixed integer programs. Math. Program. 106(2), 225–236 (2006)
Frangioni, A., Gentile, C.: A computational comparison of reformulations of the perspective relaxation: SOCP vs. cutting planes. Oper. Res. Lett. 37(3), 206–210 (2009)
Frangioni, A., Gentile, C., Hungerford, J.: Decompositions of semidefinite matrices and the perspective reformulation of nonseparable quadratic programs. Math. Oper. Res. 45(1), 15–33 (2020)
Gandy, S., Recht, B., Yamada, I.: Tensor completion and low-n-rank tensor recovery via convex optimization. Inverse Prob. 27(2), 025010 (2011)
Ge, D., Ye, Y.: On doubly positive semidefinite programming relaxations. Optimization Online, (2010)
Günlük, O., Linderoth, J.: Perspective reformulations of mixed integer nonlinear programs with indicator variables. Math. Program. 124(1–2), 183–205 (2010)
Han, S., Gómez, A., Atamtürk, A.: 2x2 convexifications for convex quadratic optimization with indicator variables (2020). arXiv:2004.07448
Hazimeh, H., Mazumder, R., Saab, A.: Sparse regression at scale: branch-and-bound rooted in first-order optimization. Math. Program. articles in advance, 1–42 (2021)
Hiai, F., Petz, D.: The proper formula for relative entropy and its asymptotics in quantum probability. Commun. Math. Phys. 143(1), 99–114 (1991)
Hiriart-Urruty, J.-B., Lemaréchal, C.: Convex analysis and minimization algorithms I: Fundamentals, volume 305. Springer Science & Business Media Berlin, (2013)
Horn, R.A., Johnson, C.R.: Matrix analysis. Cambridge University Press, New York (1985)
Joshi, S., Boyd, S.: Sensor selection via convex optimization. IEEE Trans. Signal Process. 57(2), 451–462 (2008)
Karimi, M., Tunçel, L.: Domain-driven solver (DDS): a MATLAB-based software package for convex optimization problems in domain-driven form (2019). arXiv preprint arXiv:1908.03075
Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
Lavaei, J., Low, S.H.: Zero duality gap in optimal power flow problem. IEEE Trans. Power Syst. 27(1), 92–107 (2011)
Lewis, A.S.: Convex analysis on the Hermitian matrices. SIAM J. Optim. 6(1), 164–177 (1996)
Lieb, E.H., Ruskai, M.B.: Proof of the strong subadditivity of quantum-mechanical entropy with an appendix by B. Simon. J. Math. Phys. 14, 1938–1941 (1973)
Maréchal, P.: On the convexity of the multiplicative potential and penalty functions and related topics. Math. Program. 89(3), 505–516 (2001)
Maréchal, P.: On a functional operation generating convex functions, part 1: duality. J. Optim. Theory Appl. 126(1), 175–189 (2005)
Maréchal, P.: On a functional operation generating convex functions, part 2: algebraic properties. J. Optim. Theory Appl. 126(2), 357–366 (2005)
Negahban, S., Wainwright, M.J.: Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann. Stat., 1069–1097 (2011)
Nemhauser, G.L., Wolsey, L.A., Fisher, M.L.: An analysis of approximations for maximizing submodular set functions-I. Math. Program. 14(1), 265–294 (1978)
Nguyen, L.T., Kim, J., Shim, B.: Low-rank matrix completion: a contemporary survey. IEEE Access 7, 94215–94237 (2019)
Overton, M.L., Womersley, R.S.: On the sum of the largest eigenvalues of a symmetric matrix. SIAM J. Matrix Anal. Appl. 13(1), 41–45 (1992)
Overton, M.L., Womersley, R.S.: Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Math. Program. 62(1–3), 321–357 (1993)
Pataki, G.: On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math. Oper. Res. 23(2), 339–358 (1998)
Peng, J., Wei, Y.: Approximating K-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007)
Permenter, F., Parrilo, P.: Partial facial reduction: simplified, equivalent SDPs via approximations of the PSD cone. Math. Program. 171(1–2), 1–54 (2018)
Pilanci, M., Wainwright, M.J., El Ghaoui, L.: Sparse learning via Boolean relaxations. Math. Program. 151(1), 63–87 (2015)
Plemmons, R., Cline, R.: The generalized inverse of a nonnegative matrix. In: Proceedings of the American Mathematical Society, pages 46–50, (1972)
Renegar, J.: A mathematical view of interior-point methods in convex optimization, volume 3. Society for Industrial and Applied Mathematics, (2001)
Rockafellar, R.T.: Convex Analysis. Number 28. Princeton university press, (1970)
Romera-Paredes, B., Pontil, M.: A new convex relaxation for tensor completion (2013). arXiv preprint arXiv:1307.4653
Sagnol, G.: On the semidefinite representation of real functions applied to symmetric matrices. Linear Algebra Appl. 439(10), 2829–2843 (2013)
Singh, M., Xie, W.: Approximation algorithms for D-optimal design. Math. Oper. Res. 45, 1193–1620 (2020)
Skajaa, A., Ye, Y.: A homogeneous interior-point algorithm for nonsymmetric convex conic optimization. Math. Program. 150(2), 391–422 (2015)
Stubbs, R.A.: Branch-and-Cut Methods for Mixed 0-1 Convex Programming. PhD thesis, Northwestern University, (1996)
Stubbs, R.A., Mehrotra, S.: A branch-and-cut method for 0–1 mixed convex programming. Math. Program. 86(3), 515–532 (1999)
Wang, A.L., Kılınç-Karzan, F.: On the tightness of SDP relaxations of QCQPs. Math. Program. 1–41 (2021)
Wei, L., Gómez, A., Küçükyavuz, S.: Ideal formulations for constrained convex optimization problems with indicator variables. Math. Program. 192(1), 57–88 (2022)
Wolkowicz, H., Saigal, R., Vandenberghe, L.: Handbook of Semidefinite Programming: Theory, Algorithms, and Applications, volume 27. Springer Science & Business Media, (2012)
Xie, W., Deng, X.: Scalable algorithms for the sparse ridge regression. SIAM J. Optim. 30(4), 3359–3386 (2020)
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zheng, X., Sun, X., Li, D.: Improving the performance of MIQP solvers for quadratic programs with cardinality and minimum threshold constraints: a semidefinite program approach. INFORMS J. Comput. 26(4), 690–703 (2014)
Acknowledgements
We are very grateful to two anonymous referees for useful and constructive comments. In particular, we would like to thank reviewer \(\#1\) for a very helpful refinement of our definition of the matrix perspective function, and reviewer \(\#2\) for suggesting the name matrix perspective function and supplying some new references on perspective operator functions.
Appendices
A Background on Operator Functions
In this work, we make repeated use of operator functions, i.e., functions defined from the spectral decomposition of a matrix. Namely, for any function \(\omega : \mathbb {R} \rightarrow \mathbb {R}\), its corresponding operator function \(f_\omega : {\mathcal {S}}^n \rightarrow {\mathcal {S}}^n\) is defined as \(f_\omega (\varvec{X}) := \varvec{U} {\text {Diag}}\left( \omega (\lambda _1^x),\dots , \omega (\lambda _n^x)\right) \varvec{U}^\top \),
where \(\varvec{X} = \varvec{U} {\text {Diag}}(\lambda _1^x,\dots , \lambda _n^x) \varvec{U}^\top \) is an eigendecomposition of \(\varvec{X}\). In this appendix, we present some common examples and useful properties of operator functions.
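Numerically, evaluating an operator function amounts to applying \(\omega \) to the eigenvalues of \(\varvec{X}\). The short Julia sketch below does exactly this and, for the logarithm, checks two of the properties listed in Sect. A.1 (the function name `operator_function` is ours).

```julia
using LinearAlgebra

# Evaluate the operator function f_omega(X) = U * Diag(omega.(lambda)) * U'
# from a spectral decomposition X = U * Diag(lambda) * U'.
function operator_function(omega, X::AbstractMatrix)
    F = eigen(Symmetric(Matrix(X)))
    return F.vectors * Diagonal(omega.(F.values)) * F.vectors'
end

B = randn(5, 5)
X = Symmetric(B * B' + I)                  # a random positive definite matrix

logX = operator_function(log, X)           # matrix logarithm
expX = operator_function(exp, X)           # matrix exponential

# Sanity checks (up to floating-point error):
tr(logX) ≈ logdet(X)                       # Jacobi formula I
operator_function(exp, logX) ≈ Matrix(X)   # functional inversion
```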
A.1 Examples: matrix exponential and logarithm
To keep the paper self-contained, we now define the matrix exponential and logarithm functions and summarize their properties. These results are well known and can be found in modern matrix analysis textbooks (see, e.g., [13]).
Definition 2
(Matrix exponential) Let \(\varvec{X} \in {\mathcal {S}}^n\) be a symmetric matrix with eigendecomposition \(\varvec{X}=\varvec{U}\varvec{\Lambda }\varvec{U}^\top \). Letting \(\exp (\varvec{\Lambda })=\textrm{diag}(e^{\lambda _1}, e^{\lambda _2}, \ldots , e^{\lambda _n})\), we define \(\exp (\varvec{X}):=\varvec{U}\exp (\varvec{\Lambda })\varvec{U}^\top \).
Proposition 6
The matrix exponential, \(\exp : {\mathcal {S}}^n \rightarrow {\mathcal {S}}^n_+\), satisfies the following properties:
-
Power series expansion: \(\exp (\varvec{X})=\mathbb {I}+\sum _{i=1}^\infty \frac{1}{i!}\varvec{X}^i\).
-
Trace monotonicity: \(\varvec{X} \preceq \varvec{Y} \implies \textrm{tr}(\exp (\varvec{X}))\le \textrm{tr}(\exp (\varvec{Y}))\).
-
Golden–Thompson inequality: \(\textrm{tr}(\exp (\varvec{X}+\varvec{Y})) \le \textrm{tr}\left( \exp (\varvec{X})\exp (\varvec{Y})\right) \).
Remark 13
The matrix exponential is not operator monotone; see [13, Chapter V].
Definition 3
(Matrix logarithm) Let \(\varvec{X} \in {\mathcal {S}}^n\) be a symmetric matrix with eigendecomposition \(\varvec{X}=\varvec{U}\varvec{\Lambda }\varvec{U}^\top \). Letting \(\log (\varvec{\Lambda })=\textrm{diag}(\log (\lambda _1), \log (\lambda _2), \ldots , \log (\lambda _n))\), we have \(\log (\varvec{X}):=\varvec{U}\log (\varvec{\Lambda })\varvec{U}^\top \).
Proposition 7
The matrix logarithm, \(\log (\varvec{X}): {\mathcal {S}}^n_{++} \rightarrow {\mathcal {S}}^n\), satisfies the following properties:
-
Operator monotonicity: \(\varvec{X} \preceq \varvec{Y} \implies \log (\varvec{X}) \preceq \log (\varvec{Y})\).
-
Functional inversion: \(\log (\exp (\varvec{X}))=\varvec{X} \quad \forall \varvec{X} \in {\mathcal {S}}^n\).
-
Jacobi formula I: \(\textrm{tr}(\log (\varvec{X}))=\log \det (\varvec{X})\).
-
Jacobi formula II: \(\exp \left( \frac{1}{n}\textrm{tr}\log (\varvec{X})\right) =\det (\varvec{X})^\frac{1}{n}\).
A.2 Properties of operator functions
Among other properties, one can show that the trace of an operator function is invariant under orthogonal conjugation, i.e., \(\textrm{tr}(f_\omega (\varvec{X}))=\textrm{tr}(f_\omega (\varvec{U}^\top \varvec{X}\varvec{U}))\) for any orthogonal matrix \(\varvec{U}\). Also, if \(\omega \) is analytic, then \(f_\omega \) is analytic with the same Taylor expansion.
In our analysis (in particular the proof of Proposition 3), we will use this simple bound on \(\varvec{v}^\top f_\omega (\varvec{A})\varvec{v}\) in the case where \(\omega \) is convex:
Lemma 4
Consider a convex function \(\omega : \mathbb {R} \rightarrow \mathbb {R}\), a symmetric matrix \(\varvec{A} \in {\mathcal {S}}^n\), and a unit vector \(\varvec{v}\). Then, \(\varvec{v}^\top f_\omega (\varvec{A})\varvec{v} \ge \omega \left( \varvec{v}^\top \varvec{A}\varvec{v}\right) \).
Proof
Consider a spectral decomposition of \(\varvec{A}\), \(\varvec{A} = \sum _{i=1}^n \lambda _i \varvec{u}_i \varvec{u}_i^\top \). Then, \(f_\omega (\varvec{A}) = \sum _{i=1}^n \omega (\lambda _i) \varvec{u}_i\varvec{u}_i^\top \) and \(\varvec{v}^\top f_\omega (\varvec{A})\varvec{v} = \sum _{i=1}^n \omega (\lambda _i) (\varvec{u}_i^\top \varvec{v})^2 \ge \omega \left( \sum _{i=1}^n \lambda _i (\varvec{u}_i^\top \varvec{v})^2\right) = \omega \left( \varvec{v}^\top \varvec{A}\varvec{v}\right) ,\)
where the inequality comes from the convexity of \(\omega \) since \(\varvec{v}^\top \varvec{u}_i \varvec{u}_i^\top \varvec{v} = (\varvec{u}_i^\top \varvec{v})^2 \ge 0\) and \(\sum _{i =1}^n \varvec{v}^\top \varvec{u}_i \varvec{u}_i^\top \varvec{v} = \varvec{v}^\top \left( \sum _{i =1}^n \varvec{u}_i \varvec{u}_i^\top \right) \varvec{v} = \Vert \varvec{v} \Vert ^2 = 1\). \(\square \)
B Omitted Proofs
In this section, we supply all omitted proofs, in the order the results were stated.
B.1 Proof of Proposition 3
Proof
Fix \(\varvec{X} \in {\mathcal {S}}^n\). For \(\varvec{Y} \succ \varvec{0}\), the perspective of \(f_\omega \) is well-defined according to Definition 1. Now, consider an arbitrary \(\varvec{Y} \succeq \varvec{0}\) and define \(\varvec{P}\) as the orthogonal projection onto the kernel of \(\varvec{Y}\), which is orthogonal to \({\text {Span}}(\varvec{Y})\). Then, \(\varvec{Y}_\varepsilon := \varvec{Y} + \varepsilon \varvec{P}\) for \(\varepsilon > 0\) is invertible. The closure of the matrix perspective of \(f_\omega \) is defined by continuity as the limit of \(\varvec{M}_\varepsilon := \varvec{Y}_\varepsilon ^{\frac{1}{2}} f_\omega \left( \varvec{Y}_\varepsilon ^{-\frac{1}{2}}\varvec{X}\varvec{Y}_\varepsilon ^{-\frac{1}{2}} \right) \varvec{Y}_\varepsilon ^{\frac{1}{2}}\) for \(\varepsilon \rightarrow 0\).
Since the ranges of \(\varvec{Y}\) and \(\varvec{P}\) are orthogonal (\(\varvec{Y} \varvec{P} = \varvec{P} \varvec{Y} = \varvec{0}\)), we have \(\varvec{Y}_\varepsilon ^{-\frac{1}{2}} = \varvec{Y}^{-\frac{1}{2}} + {\varepsilon }^{-\frac{1}{2}} \varvec{P}\), and
Note that \(\lim _{\varepsilon \rightarrow 0} \varvec{Y}_\varepsilon ^{\frac{1}{2}} = \varvec{Y}^{\frac{1}{2}}\) but \( \lim _{\varepsilon \rightarrow 0} \varvec{Y}_\varepsilon ^{-\frac{1}{2}} \ne \varvec{Y}^{-\frac{1}{2}}\). We now distinguish two cases.
Case 1: If \({\text {span}}(\varvec{X}) \subseteq {\text {span}}(\varvec{Y})\), \(\varvec{X} \varvec{P} = \varvec{P} \varvec{X} = \varvec{0}\) so
Case 2: If \({\text {span}}(\varvec{X}) \not \subseteq {\text {span}}(\varvec{Y})\), consider an orthonormal basis of \(\mathbb {R}^n\) such that \(\varvec{u}_1, \dots , \varvec{u}_k\) is an eigenbasis of \({\text {Span}}(\varvec{Y})\) (with respective eigenvalues \(\lambda ^y_{1}, \dots \lambda ^y_{k}\)) and \(\varvec{u}_{k+1}, \dots , \varvec{u}_n\) is a basis of \({\text {Span}}(\varvec{Y})^\perp = {\text {Ker}}(\varvec{Y})\). By assumption, \(k < n\) and there exists \(j > k\) such that \(\varvec{u}_j^\top \varvec{X} \varvec{u}_j \ne 0\). Without loss of generality, we shall assume \(\varvec{u}_n^\top \varvec{X} \varvec{u}_n \ne 0\). We show that the matrix \(\varvec{M}_\varepsilon \) goes to infinity as \(\varepsilon \rightarrow 0\) by showing that \(\varvec{u}_n^\top \varvec{M}_\varepsilon \varvec{u}_n\) diverges.
Since \(\varvec{Y}_\varepsilon ^{\pm \frac{1}{2}} \varvec{u}_n = \varepsilon ^{\pm \frac{1}{2}} \varvec{u}_n\), we have
where the inequality follows from the convexity of \(\omega \) and Lemma 4. By Assumption 1,
because \(\varvec{u}_n^\top \varvec{X}\varvec{u}_n \ne 0\) and \(\omega \) is coercive. \(\square \)
We now provide a simple extension of Proposition 3 that will prove useful later in our exposition.
Corollary 3
Consider a function \(\omega : \mathbb {R} \rightarrow \mathbb {R}\) satisfying Assumption 1 and denote its associated operator function \(f_\omega \). Consider a closed set \({\mathcal {X}} \subseteq {\mathcal {S}}^n\) and define
Then, the closure of the matrix perspective of f is, for any \(\varvec{X} \in {\mathcal {S}}^n\), \(\varvec{Y} \in {\mathcal {S}}_+^n\),
where \(\varvec{Y}^{-\frac{1}{2}}\) denotes the pseudo-inverse of the square root of \(\varvec{Y}\).
Proof
Fix \(\varvec{X} \in {\mathcal {S}}^n\) and \(\varvec{Y} \in {\mathcal {S}}^n_+\). From Proposition 3, we know that \(g_f(\varvec{X},\varvec{Y}) = + \infty \) if \({\text {Span}}(\varvec{X}) \not \subseteq {\text {Span}}(\varvec{Y})\). Let us assume that \({\text {Span}}(\varvec{X}) \subseteq {\text {Span}}(\varvec{Y})\). Following the same construction as in the proof of Proposition 3, we obtain a sequence \(\varvec{Y}_\varepsilon \) that converges to \(\varvec{Y}\) as \(\varepsilon \rightarrow 0\) and such that \(\varvec{Y}_\varepsilon ^{-\frac{1}{2}}\varvec{X}\varvec{Y}_\varepsilon ^{-\frac{1}{2}} = \varvec{Y}^{-\frac{1}{2}}\varvec{X}\varvec{Y}^{-\frac{1}{2}}\), which concludes the proof.
\(\square \)
B.2 Proof of Lemma 2
Proof
First, let us observe that \({\mathcal {C}}_{\varvec{X}} = \{ \varvec{M} \ : \ \varvec{MX} = \varvec{XM} \}\) is a closed subset of \({\mathcal {S}}^n\), contains the identity, and is closed under multiplication and transposition; it is therefore a von Neumann subalgebra (see [20, Section 4] for a detailed treatment of projections onto subalgebras). The orthogonal projection of a semidefinite matrix onto \( {\mathcal {C}}_{\varvec{X}}\) is also semidefinite and has the same trace [20, Theorem 4.13], so
Furthermore, since \(\varvec{Y} \mapsto g_{f_\omega }(\varvec{X},\varvec{Y})\) is matrix convex, [20, Theorem 4.16] yields
Taking the trace on both sides and using that \({\text {tr}}\left( g_{f_\omega }(\varvec{X}, \varvec{Y})_{|{ \varvec{X}}} \right) = {\text {tr}}\left( g_{f_\omega }(\varvec{X}, \varvec{Y}) \right) \) concludes the proof. \(\square \)
Remark 14
If \(\varvec{X} = \sum _{i \in [n]} \lambda _{i} \varvec{u}_i \varvec{u}_i^\top \) is a spectral decomposition of \(\varvec{X}\), then the projection of any matrix \(\varvec{Y}\) onto the commutant of \(\varvec{X}\) can be computed as \(\varvec{Y}_{|\varvec{X}} = \sum _{i \in [n]} \varvec{u}_i \varvec{u}_i^\top \varvec{Y} \varvec{u}_i \varvec{u}_i^\top \). This operation is known in the literature as pinching [20, 25].
In other words, taking the projection of \(\varvec{Y}\) onto the commutant of \(\varvec{X}\) is a trace-preserving operation that can only reduce the value of \({\text {tr}}\left( g_{f_\omega }(\varvec{X}, \cdot ) \right) \). In this paper, we invoke the projection onto the commutant of \(\varvec{X}\) for theoretical purposes rather than computational ones, so we are not concerned with how to compute \(\varvec{Y}_{|{ \varvec{X}}}\) efficiently in practice. Note that, according to Proposition 2(a), Lemma 2 holds if \(f_\omega \) is matrix convex.
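Although we never need to compute \(\varvec{Y}_{|\varvec{X}}\) in practice, the pinching operation is easy to implement and to sanity-check numerically. The sketch below keeps the diagonal of \(\varvec{Y}\) in an eigenbasis of \(\varvec{X}\) (when \(\varvec{X}\) has repeated eigenvalues, one would instead keep whole diagonal blocks) and verifies trace preservation and commutation with \(\varvec{X}\); the function name `pinch` is ours.

```julia
using LinearAlgebra

# Pinching: keep only the diagonal of Y in an eigenbasis of X, i.e., compute
# sum_i u_i (u_i' * Y * u_i) u_i'. (With repeated eigenvalues of X, one would
# keep whole diagonal blocks rather than single diagonal entries.)
function pinch(Y::AbstractMatrix, X::AbstractMatrix)
    U = eigen(Symmetric(Matrix(X))).vectors
    return U * Diagonal(diag(U' * Y * U)) * U'
end

B = randn(4, 4); X = Symmetric(B + B')
C = randn(4, 4); Y = Symmetric(C + C')
Ypin = pinch(Y, X)

tr(Ypin) ≈ tr(Y)               # pinching preserves the trace
norm(Ypin * X - X * Ypin)      # ≈ 0: the pinched matrix commutes with X
```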
B.3 Counterexample to joint convexity of trace of matrix perspective of cube
In this section, we demonstrate by counterexample that if \(\omega \) is a convex and continuous function then, even though the trace of its matrix extension, \(\textrm{tr}(f_\omega )\), is convex (c.f. [20], Theorem 2.10), the trace of its matrix perspective need not be convex.
Specifically, let us consider \(\omega (x)=x^3\). In this case, \(\omega \) is convex on \(\mathbb {R}_+\), \(f_\omega \) is not matrix convex, but \(\textrm{tr}(f_\omega )\) is convex. We have that \(\textrm{tr}\left( g_{f_\omega }(\varvec{X}, \varvec{Y})\right) =\textrm{tr}\left( \varvec{X}\varvec{Y}^\dag \varvec{X}\varvec{Y}^\dag \varvec{X}\right) \) for \(\textrm{Span}(\varvec{X}) \subseteq \textrm{Span}(\varvec{Y})\), \(\varvec{X}, \varvec{Y} \in {\mathcal {S}}^n_+\). Let us now consider
Then, some elementary algebra reveals that
while
which verifies that \(\textrm{tr}(g_{f_\omega }(\varvec{X}, \varvec{Y}))\) is not midpoint convex in \((\varvec{X}, \varvec{Y})\), despite \(\textrm{tr}(f_\omega )\) being convex.
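The failure of midpoint convexity can also be probed numerically: for positive definite \(\varvec{Y}\), the trace above reduces to \(\textrm{tr}(\varvec{X}\varvec{Y}^{-1}\varvec{X}\varvec{Y}^{-1}\varvec{X})\), and the sketch below reports the largest midpoint-convexity violation found by a random search. Any strictly positive value certifies non-convexity; the explicit counterexample above is, of course, a sharper certificate than what a short random search may find. The function names are ours.

```julia
using LinearAlgebra, Random

# Trace of the matrix perspective of the cube for positive definite Y:
# tr(X * Y^{-1} * X * Y^{-1} * X).
trace_persp_cube(X, Y) = tr(X * inv(Y) * X * inv(Y) * X)

randpsd(rng, n) = (B = randn(rng, n, n); Symmetric(B * B' + 1e-3I))

# Midpoint-convexity gap: a strictly positive value certifies that the trace of
# the matrix perspective of the cube is not (midpoint) convex at this pair.
function midpoint_gap(X1, Y1, X2, Y2)
    mid = trace_persp_cube((X1 + X2) / 2, (Y1 + Y2) / 2)
    return mid - (trace_persp_cube(X1, Y1) + trace_persp_cube(X2, Y2)) / 2
end

# Random search over pairs of PSD/PD matrices; report the largest gap found.
function largest_gap(; n = 2, trials = 10_000, rng = MersenneTwister(0))
    best = -Inf
    for _ in 1:trials
        g = midpoint_gap(randpsd(rng, n), randpsd(rng, n),
                         randpsd(rng, n), randpsd(rng, n))
        best = max(best, g)
    end
    return best
end
```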
B.4 Proof of Proposition 5
Proof
We use the proof technique laid out in [43, Section 3.1], namely writing \({\mathcal {T}}\) as the disjunction of two convex sets driven by whether z is active and applying Fourier-Motzkin elimination. That is, we have \({\mathcal {T}}={\mathcal {T}}^1 \cup {\mathcal {T}}^2\) where:
Moreover, a point (x, y, z, t) is in the convex hull \({\mathcal {T}}^c\) if and only if it can be written as a convex combination of points in \({\mathcal {T}}^1, {\mathcal {T}}^2\). Letting \(\lambda _1, \lambda _2\) denote the weight of points in this system, we then have that \((x,y,z,t) \in {\mathcal {T}}^c\) if and only if the following system admits a solution:
For ease of computation, we now eliminate variables. First, one can replace \(t_1, t_2\) with their respective lower bounds in the definition of t and replace \(\lambda _2\) with z to obtain
Next, we substitute x/z for \(x_2\) and \((y-z y_2)/\lambda _1\) for \(y_1\) to obtain
Finally, we let \(z y_2\) be the free variable \(\beta \) and set \(\lambda _1=1-z\) to obtain the required convex set. \(\square \)
B.5 Convex closure of epigraph of separable function under cardinality constraint
In this section, to keep this paper self-contained, we verify that the convex closure of the set
is given by the following set:
By fixing t, we then obtain precisely the result claimed in the proof of Theorem 2.
We remark that this result is essentially due to [76, Theorem 3]; the result we prove here is a minor extension which allows for \(\omega (0)\ne 0\). Formally, we have the following result:
Proposition 8
Suppose that \(\omega \) satisfies Assumption 1. Then, \({\mathcal {H}}^c\) is the closure of the convex hull of \({\mathcal {H}}\).
Proof
Let \((\varvec{a}, \varvec{b}) \in \mathbb {R}^{2n}\) and consider the following two optimization problems
Then, since \({\mathcal {H}}^c\) is a convex outer approximation of \({\mathcal {H}}\), it suffices to show that the above two optimization problems are equivalent, i.e., there exists an optimal solution to (43) which is optimal for (42) with the same objective value (as argued in [76, Theorem 1], we can assume that the coefficient of t equals 1 without loss of generality). Therefore, let us denote by \(\omega ^\star \) the convex conjugate of \(\omega \), i.e.,
and let \(\Gamma :=\{\gamma \in \mathbb {R}: \omega ^\star (\gamma )< \infty \}\). Then, by Fenchel’s inequality on the perspective function, for any \(\gamma \in \Gamma \) and any \(\varvec{y} \ge \varvec{0}, \varvec{x} \in \mathbb {R}^n\) we have
Moreover, we must have that \(b_i \in \Gamma \) for each \(i \in [n]\); otherwise, both (42) and (43) are unbounded. Therefore, set \(\gamma =-b_i\) in the above expression for each i and replace t with the left-hand-side of the inequality in (43), which provides the following relaxation to (43):
This problem is linear in \(\varvec{y}\) with binary extreme points, and thus it admits an optimal solution which is binary. Moreover, setting \(x_i^\star \in \arg \sup _{w \in \mathbb {R}} \ -b_i w -\omega (w)\) if \(z_i=1\) and \(x_i^\star =0\) otherwise yields a feasible solution to (42) with the same objective. \(\square \)
C Generalizing the Matrix Perspective Reformulation Technique
We now demonstrate that the MPRT can be extended to incorporate a different separability-of-eigenvalues assumption, at the price of (a possibly significant amount of) additional notation. For any symmetric matrix \(\varvec{X}\), let us denote by \(\lambda _i^{\downarrow }(\varvec{X})\) the ith largest eigenvalue of \(\varvec{X}\). Before proceeding any further, we recall the following result, due to [4, Example 18.c], which provides a semidefinite representation of the sum of the k largest eigenvalues:
Lemma 5
(Representability of sums of largest eigenvalues) Let \(S_k(\varvec{X}):=\sum _{i=1}^k \lambda _i^{\downarrow }(\varvec{X})\) denote the sum of the k largest eigenvalues of a symmetric matrix \(\varvec{X} \in {\mathcal {S}}^n\). Then, the epigraph of \(S_k\), \(\{(\varvec{X}, t_k): S_k(\varvec{X}) \le t_k\}\), is semidefinite representable.
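Concretely, one standard way of writing this representation (see, e.g., [4] or [58, 59]) uses an auxiliary scalar \(s\) and an auxiliary matrix \(\varvec{Z}\):
\[
S_k(\varvec{X}) \le t_k \iff \exists \, s \in \mathbb {R},\ \varvec{Z} \in {\mathcal {S}}^n_+ : \quad t_k \ge k s + \textrm{tr}(\varvec{Z}), \qquad \varvec{Z} \succeq \varvec{X} - s \mathbb {I}.
\]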
Based on this result, we can relax the assumption that the penalty term \(\Omega (\varvec{X})\) corresponds to the trace of an operator function. Instead, we can assume:
Assumption 4
\(\Omega (\varvec{X})=\sum _{i \in [n]} p_i \lambda _i^{\downarrow }(f_\omega (\varvec{X}))\), where \(p_1 \ge \ldots \ge p_n \ge 0\) and where \(\omega \) is a function satisfying Assumption 1 and whose associated operator function, \(f_\omega \), is matrix convex.
This assumption is particularly suitable for Markov chain problems (see, e.g., [15, Chapter 4.6]), where we are interested in controlling the behaviour of the largest eigenvalue (which always equals 1) plus the second largest eigenvalue of a matrix. However, it might appear to be challenging to model, since, e.g., \(\lambda _2^{\downarrow }(\varvec{X})\) is a non-convex function. By applying a telescoping sum argument reminiscent of the one in [4, Proposition 4.2.1], namely
\(\sum _{i \in [n]} p_i \lambda _i^{\downarrow }\left( f_\omega (\varvec{X})\right) = \sum _{i \in [n]} (p_i - p_{i+1})\, S_i\left( f_\omega (\varvec{X})\right) \)
with the convention \(p_{n+1}=0\), Lemma 5 allows us to rewrite low-rank problems where \(\Omega (\varvec{X})\) satisfies Assumption 4 in the form:
where \(t_i\) models the sum of the i largest eigenvalues of \(f(\varvec{X})\). Applying the MPRT then yields the following extension to Theorem 1:
Proposition 9
Suppose Problem (44) attains a finite optimal value. Then, the following problem attains the same value:
The proof of this reformulation is almost identical to the proof of Theorem 1, after observing that (20) holds not only for the traces but for the matrices directly, i.e., if \(\varvec{X}\) and \(\varvec{Y} \in {\mathcal {Y}}^k_n\) commute, we have
Problem (45) involves n times as many variables as Problem (18) and therefore supplies substantially less tractable relaxations. Nonetheless, it could be useful in specific instances. In the aforementioned Markov chain mixing problem, \(p_i-p_{i+1}=0 \ \forall i \ge k\) with \(k=2\), so we can omit the variables \(t_i\) which model the sums of more than two eigenvalues.
D Extension to the Rectangular Case
In this section, we extend the MPRT to the case where \(\varvec{X}\) is a generic \(n \times m\) matrix and \(f(\varvec{X})\) is the convex quadratic penalty \(f(\varvec{X})=\varvec{X}^\top \varvec{X}\). In this case, \({\text {tr}}(f(\varvec{X}))=\Vert \varvec{X}\Vert _F^2\) is the squared Frobenius norm of \(\varvec{X}\).
First, observe that \(f : \mathbb {R}^{n \times m} \rightarrow {\mathcal {S}}^m_+\). Alternatively, one could have considered \(g(\varvec{X}) = \varvec{X}\varvec{X}^\top \in {\mathcal {S}}_+^n\) and obtained the same penalty, i.e., \({\text {tr}}(f(\varvec{X})) = {\text {tr}}(g(\varvec{X}))\). In other words, one can arbitrarily choose whether f preserves the row or the column space of \(\varvec{X}\). By the Schur complement lemma, the epigraph of f, \(\{(\varvec{X}, \varvec{\theta }) : \varvec{\theta } \succeq \varvec{X}^\top \varvec{X}\}\), is semidefinite representable via
\(\begin{pmatrix} \mathbb {I} &{} \varvec{X} \\ \varvec{X}^\top &{} \varvec{\theta } \end{pmatrix} \succeq \varvec{0},\)
so f is matrix convex.
In the symmetric case, we considered the matrix perspective of f at \((\varvec{X}, \varvec{Y})\), where \(\varvec{Y} \succeq \varvec{0}\) is a matrix controlling the range of \(\varvec{X}\). When \(\varvec{X}\) is no longer symmetric, it is natural to consider a matrix perspective function which involves two projection matrices, one of which models the row space and one which models the column space, as proposed in our prior work [11]. More precisely, for \(\varvec{Y}, \varvec{Z} \succ \varvec{0}\) we define a perspective of f as
For \(f(\varvec{X}) = \varvec{X}^\top \varvec{X}\), this function actually does not depend on \(\varvec{Z}\). Hence, we consider
Extending this function to positive semidefinite \(\varvec{Y}\) using the same proof technique as in Proposition 3, we then obtain
Proof
Fix \(\varvec{X} \in \mathbb {R}^{n \times m}\) and \(\varvec{Y} \succeq \varvec{0}\). As in the proof of Proposition 3, denote by \(\varvec{P}\) the orthogonal projection onto the kernel of \(\varvec{Y}\), and define \(\varvec{Y}_\varepsilon := \varvec{Y} + \varepsilon \varvec{P}\) for \(\varepsilon > 0\). Hence,
The right-hand side admits a finite limit if and only if
\(\square \)
Furthermore, using the Schur complement lemma as in [11], one can show that \(\tilde{g}_f\) is SDP-representable:
and hence matrix convex.
Finally, we can easily check that Theorem 1 still holds in this rectangular case, because (20)—which simplifies to \({\text {tr}}(f(\varvec{X})) = {\text {tr}}(\tilde{g}_f(\varvec{X}, \varvec{Y}))\) here—holds for any \(\varvec{Y} \in {\mathcal {Y}}^k_n\) such that \(\varvec{X} = \varvec{Y} \varvec{X}\).
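As a quick numerical sanity check of this identity, the sketch below assumes the closed form \(\tilde{g}_f(\varvec{X},\varvec{Y}) = \varvec{X}^\top \varvec{Y}^\dag \varvec{X}\) suggested by the Schur-complement representation (this closed form is an assumption on our part), and verifies that its trace equals \(\Vert \varvec{X}\Vert _F^2\) whenever \(\varvec{Y}\) is the orthogonal projection onto the column space of \(\varvec{X}\), i.e., whenever \(\varvec{Y}\varvec{X}=\varvec{X}\).

```julia
using LinearAlgebra

# Sanity check, assuming tilde_g(X, Y) = X' * pinv(Y) * X: if Y is the
# orthogonal projection onto the column space of X (so Y * X == X), then
# tr(tilde_g(X, Y)) recovers tr(X' * X) = ||X||_F^2.
tilde_g(X, Y) = X' * pinv(Y) * X

n, m, k = 8, 5, 3
X = randn(n, k) * randn(k, m)      # a rank-k, n x m matrix
U = svd(X).U[:, 1:k]               # orthonormal basis for the column space of X
Y = U * U'                         # orthogonal projection onto col(X)

norm(Y * X - X)                    # ≈ 0
tr(tilde_g(X, Y)) ≈ norm(X)^2      # equals ||X||_F^2
```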
Remark 15
We believe the approach outlined above could be generalized to a broader class of functions which extend operator functions to the non-symmetric case. Namely, we could consider functions of the form
where \(\varvec{X}=\varvec{U}\textrm{Diag}\left( \sigma _1^x, \dots , \sigma _m^x\right) \varvec{V}^\top \) is a singular value decomposition of \(\varvec{X}\) and \(\omega \) is a convex function satisfying Assumption 1. Again, \(f_\omega \) could arbitrarily be defined as preserving \(\varvec{U}\) or \(\varvec{V}\). For these functions, the perspective \(g_{f_\omega }(\varvec{X},\varvec{Y},\varvec{Z})\) is well defined for \(\varvec{Y},\varvec{Z} \succ \varvec{0}\). Unlike in the quadratic case, however, its value will depend on both \(\varvec{Y}\) and \(\varvec{Z}\). Developing the theoretical tools necessary to extend the MPRT to rectangular matrices is therefore a question for future research.