Certifiably optimal sparse inverse covariance estimation

  • Full Length Paper
  • Series A
  • Published in Mathematical Programming

Abstract

We consider the maximum likelihood estimation of sparse inverse covariance matrices. We demonstrate that current heuristic approaches primarily encourage robustness, instead of the desired sparsity. We give a novel approach that solves the cardinality constrained likelihood problem to certifiable optimality. The approach uses techniques from mixed-integer optimization and convex optimization, and provides a high-quality solution with a guarantee on its suboptimality, even if the algorithm is terminated early. Using a variety of synthetic and real datasets, we demonstrate that our approach can solve problems where the dimension of the inverse covariance matrix is in the 1000s. We also demonstrate that our approach produces significantly sparser solutions than Glasso and other popular learning procedures, and makes fewer false discoveries, while still maintaining state-of-the-art accuracy.


Notes

  1. Note that \(\overline{\varvec{\Sigma }}\) is not the only estimate of the covariance matrix. In particular, \(\tfrac{n}{n-1} \overline{\varvec{\Sigma }}\) is a widely used unbiased estimator of the covariance matrix. In this paper, we only consider \(\overline{\varvec{\Sigma }}\), which we refer to as the empirical or sample covariance matrix.

  2. Available at https://cran.r-project.org/web/packages/glasso/.

  3. Convexity of \({h}({\mathbf {Z}})\) can also be proved from the primal formulation (6) directly. Take two matrices \({\mathbf {Z}}_1\) and \( {\mathbf {Z}}_2\), \(\lambda \in (0,1)\), \({\mathbf {Z}} := \lambda {\mathbf {Z}}_1 + (1-\lambda ) {\mathbf {Z}}_2\), then it follows from the definition (6) that \( h({\mathbf {Z}}) \leqslant \lambda h({\mathbf {Z}}_1) + (1-\lambda ) h({\mathbf {Z}}_2)\).

  4. For each instance, we generate a sparse precision matrix \(\varvec{\Theta }_0\) as in Sect. 5.1 and n samples from the corresponding multivariate normal distribution.

References

  1. Atamtürk, A., Narayanan, V.: Conic mixed-integer rounding cuts. Math. Program. 122(1), 1–20 (2010)

  2. Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv:1509.00426

  3. Banerjee, O., El Ghaoui, L., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9(Mar), 485–516 (2008)

  4. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)

  5. Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. 270, 931–942 (2018)

  6. Bertsimas, D., Mazumder, R.: Least quantile regression via modern optimization. Ann. Stat. 42, 2494–2525 (2014)

  7. Bertsimas, D., Van Parys, B.: Sparse high-dimensional regression: Exact scalable algorithms and phase transitions (2017). arXiv:1709.10029

  8. Bertsimas, D., Brown, D.B., Caramanis, C.: Theory and applications of robust optimization. SIAM Rev. 53(3), 464–501 (2011)

  9. Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

  10. Bickel, P.J., Levina, E., et al.: Covariance regularization by thresholding. Ann. Stat. 36(6), 2577–2604 (2008)

  11. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  12. Cai, T., Liu, W., Luo, X.: A constrained \(\ell _1\) minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)

  13. Çezik, M.T., Iyengar, G.: Cuts for mixed 0–1 conic programming. Math. Program. 104(1), 179–202 (2005)

  14. Chickering, D.M.: Learning Bayesian Networks is NP-Complete. In: Fisher, D., Lenz, HJ. (eds.) Learning from Data. Lecture Notes in Statistics, vol 112. Springer, New York, NY (1996)

  15. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)

  16. Dahl, J., Vandenberghe, L., Roychowdhury, V.: Covariance selection for nonchordal graphs via chordal embedding. Optim. Methods Softw. 23(4), 501–520 (2008)

  17. Dempster, A.P.: Covariance selection. Biometrics 28, 157–175 (1972)

  18. Drton, M., Maathuis, M.H.: Structure learning in graphical modeling. Ann. Rev. Stat. Appl. 4, 365–393 (2017)

  19. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Math. Program. 36(3), 307–339 (1986)

  20. El Karoui, N.: High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Stat. 38(6), 3487–3566 (2010)

  21. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  22. Fan, J., Fan, Y., Lv, J.: High dimensional covariance matrix estimation using a factor model. J. Econom. 147(1), 186–197 (2008)

  23. Fan, J., Feng, Y., Wu, Y.: Network exploration via the adaptive lasso and SCAD penalties. Ann. Appl. Stat. 3(2), 521 (2009)

  24. Fan, J., Zhang, J., Ke, Y.: Vast portfolio selection with gross-exposure constraints. J. Am. Stat. Assoc. 107(498), 592–606 (2012)

  25. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014)

  26. Fan, J., Liao, Y., Liu, H.: An overview of the estimation of large covariance and precision matrices. Econom. J. 19(1), C1–C32 (2016)

  27. Fattahi, S., Sojoudi, S.: Graphical lasso and thresholding: equivalence and closed-form solutions (2017). arXiv:1708.09479

  28. Foygel, R., Drton, M.: Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 604–612. Curran Associates, Inc. (2010)

  29. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)

  30. Gally, T., Pfetsch, M.E., Ulbrich, S.: A framework for solving mixed-integer semidefinite programs. Optim. Methods Softw. 33(3), 594–632 (2018)

  31. Geoffrion, A.M.: Generalized Benders decomposition. J. Optim. Theory Appl. 10(4), 237–260 (1972)

  32. Gurobi Optimization Inc. Gurobi Optimizer Reference Manual (2015). http://www.gurobi.com

  33. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

  34. Helmberg, C., Rendl, F.: Solving quadratic (0, 1)-problems by semidefinite programs and cutting planes. Math. Program. 82(3), 291–315 (1998)

  35. Hess, K.R., Keith Anderson, W., Symmans, F., Valero, V., Ibrahim, N., Mejia, J.A., Booser, D., Theriault, R.L., Buzdar, A.U., Dempsey, P.J., et al.: Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J. Clin. Oncol. 24(26), 4236–4244 (2006)

  36. Hsieh, C.-J., Dhillon, I.S., Ravikumar, P.K., Sustik, M.A.: Sparse inverse covariance matrix estimation using quadratic approximation. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 2330–2338. Curran Associates, Inc. (2011)

  37. Hsieh, C.-J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.K., Poldrack, R.: BIG & QUIC: sparse inverse covariance estimation for a million variables. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3165–3173. Curran Associates, Inc. (2013)

  38. IBM ILOG: CPLEX Optimizer (2012). http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer

  39. Krishnamurthy, V., Ahipasaoglu, S.D., d’Aspremont, A.: A pathwise algorithm for covariance selection. Optim. Mach. Learn., p. 479 (2011). arXiv:0908.0143v2

  40. Lam, C., Fan, J.: Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Stat. 37(6B), 4254 (2009)

  41. Lauritzen, S.L.: Graphical Models, vol. 17. Clarendon Press, Oxford (1996)

  42. Liu, Z., Lin, S., Deng, N., McGovern, D.P.B., Piantadosi, S.: Sparse inverse covariance estimation with \(\ell _0\) penalty for network construction with omics data. J. Comput. Biol. 23(3), 192–202 (2016)

  43. Lubin, M., Dunning, I.: Computing in operations research using Julia. INFORMS J. Comput. 27(2), 238–248 (2015)

  44. Lubin, M., Yamangil, E., Bent, R., Vielma, J.P.: Polyhedral approximation in mixed-integer convex optimization. Math. Program. 172(1–2), 139–168 (2018)

  45. Ma, J., Zhao, F., Xu, J.: Structure learning constrained by node-specific degree distribution. In: Meila, M., Heskes, T. (eds.) Uncertainty in Artificial Intelligence 33, pp. 533–541. AUAI Press, Corvallis, Oregon (2015)

  46. Marjanovic, G., Hero, A.O.: \(\ell _0\) sparse inverse covariance estimation. IEEE Trans. Signal Process. 63(12), 3218–3231 (2015)

  47. Martin, R.K.: Using separation algorithms to generate mixed integer model reformulations. Oper. Res. Lett. 10(3), 119–128 (1991)

  48. Mazumder, R., Hastie, T.: Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. 13(Mar), 781–794 (2012a)

  49. Mazumder, R., Hastie, T.: The graphical lasso: new insights and alternatives. Electron. J. Stat. 6, 2125 (2012b)

  50. Meinshausen, N., Bühlmann, P., et al.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34(3), 1436–1462 (2006)

  51. Meyer, C.D.: Matrix Analysis and Applied Linear Algebra, vol. 71. SIAM, Philadelphia (2000)

  52. Oztoprak, F., Nocedal, J., Rennie, S., Olsen, P.A.: Newton-like methods for sparse inverse covariance estimation. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 755–763. Curran Associates, Inc. (2012)

  53. Rendl, F., Rinaldi, G., Wiegele, A.: Solving max-cut to optimality by intersecting semidefinite and polyhedral relaxations. Math. Program. 121(2), 307 (2010)

  54. Rigollet, P., Tsybakov, A.: Estimation of covariance matrices under sparsity constraints (2012). arXiv:1205.1210

  55. Rothman, A.J., Bickel, P.J., Levina, E., Zhu, J., et al.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)

  56. Santhanam, N.P., Wainwright, M.J.: Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Trans. Inf. Theory 58(7), 4117–4134 (2012)

  57. Scheinberg, K., Rish, I.: SINCO: a greedy coordinate ascent method for sparse inverse covariance selection problem. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY (2009)

  58. Scheinberg, K., Ma, S., Goldfarb, D.: Sparse inverse covariance selection via alternating linearization methods. In: Advances in Neural Information Processing Systems, pp. 2101–2109 (2010)

  59. Shao, J.: Linear model selection by cross-validation. J. Am. Stat. Assoc. 88(422), 486–494 (1993)

  60. Sotirov, R.: SDP relaxations for some combinatorial optimization problems. In: Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 795–819. Springer (2012)

  61. Tan, K.M., London, P., Mohan, K., Lee, S.-I., Fazel, M., Witten, D.: Learning graphical models with hubs. J. Mach. Learn. Res. 15(1), 3297–3331 (2014)

  62. Tokuda, T., Goodrich, B., Van Mechelen, I., Gelman, A., Tuerlinckx, F.: Visualizing distributions of covariance matrices. Columbia University, New York, USA, Technical Report, p. 18 (2011)

  63. Vandenberghe, L., Andersen, M.S., et al.: Chordal graphs and semidefinite optimization. Found. Trends Optim. 1(4), 241–433 (2015)

  64. Xu, H., Caramanis, C., Mannor, S.: Robust regression and lasso. In: Advances in Neural Information Processing Systems, pp. 1801–1808 (2009)

  65. Yonekura, K., Kanno, Y.: Global optimization of robust truss topology via mixed integer semidefinite programming. Optim. Eng. 11(3), 355–379 (2010)

  66. Yuan, M.: High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11(Aug), 2261–2286 (2010)

  67. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)

  68. Zhang, C.-H., et al.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  69. Zhang, F.: The Schur Complement and Its Applications, vol. 4. Springer, Berlin (2006)

Author information

Corresponding author

Correspondence to Dimitris Bertsimas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs of Theorem 2 and corollaries

In this section, we detail the proof of Theorem 2. We first specify the assumptions required on the regularizer \(\varOmega \), prove Theorem 2 and finally investigate some special cases of interest.

A.1 Assumptions

We first assume that the function \(\varOmega \) is decomposable, i.e., there exist scalar functions \(\varOmega _{ij}\) such that

$$\begin{aligned} \forall \, \varvec{\Phi }, \quad \varOmega (\varvec{\Phi }) = \sum _{i,j} \varOmega _{ij}(\Phi _{ij}). \end{aligned}$$
(A1)

In addition, we assume that for all (ij), \(\varOmega _{ij}\) is convex and tends to regularize towards zero. Formally,

$$\begin{aligned} \forall \, (i,j), \quad \min _x \, \varOmega _{ij}(x) = \varOmega _{ij}(0). \end{aligned}$$
(A2)

These first two assumptions are not highly restrictive and are satisfied, among others, by the \(\ell _\infty \)-norm constraint (big-M), \(\ell _1\)-norm regularization (LASSO), and \(\Vert \cdot \Vert _2^2\)-regularization.

For any function f, we denote with a superscript \(\star \) its Fenchel conjugate (see [11], chap.  3.3) defined as

$$\begin{aligned} f^\star (y) := \sup _x \langle x, y \rangle - f(x). \end{aligned}$$

In particular, the Fenchel conjugate of any function f is convex. Given Assumption (A1),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle - \varOmega (\varvec{\Phi }), \\&= \sum _{i,j} \sup _{\Phi _{ij}} \Phi _{ij} R_{ij} - \varOmega _{ij}(\Phi _{ij}), \\&= \sum _{i,j} \varOmega _{ij}^\star (R_{ij}). \end{aligned}$$

As a result, it is easy to see that if \(\varOmega \) satisfies (A1) and (A2), so does its Fenchel conjugate.

Let \({\mathbf {A}} \circ {\mathbf {B}}\) denote the Hadamard, or component-wise, product of the matrices \({\mathbf {A}}\) and \({\mathbf {B}}\). Consider a matrix \({\mathbf {R}}\) and a support matrix \({\mathbf {Z}} \in \{0,1\}^{p \times p}\). The function \({\mathbf {Z}} \mapsto \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} )\) is convex in \({\mathbf {Z}}\), by convexity of \(\varOmega ^\star \). We now assume that it is in fact linear in \({\mathbf {Z}}\), that is, that there exists a function \(\varvec{\Omega }^\star : \, {\mathbb {R}}^{p \times p} \rightarrow {\mathbb {R}}^{p \times p}\) satisfying:

$$\begin{aligned} \forall \, {\mathbf {Z}} \in \{0,1\}^{p \times p}, \forall \, {\mathbf {R}} \in {\mathbb {R}}^{p \times p}, \, \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} ) = \langle {\mathbf {Z}}, \varvec{\Omega }^\star ({\mathbf {R}}) \rangle . \end{aligned}$$
(A3)

A.2 Proof of Theorem 2

Given \({\mathbf {Z}} \in \{0,1\}^{p \times p}\) such that \({Z}_{ii} = 1\) for all \(i = 1,\dots ,p\), we first prove that under assumptions (A1) and (A2):

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&:= \quad \min _{\varvec{\Theta } \succ {\mathbf {0}}} \quad \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Theta }) \quad \hbox { s.t. } {\varTheta }_{ij} = 0 \hbox { if } {Z}_{ij} = 0 \, \forall (i,j), \\&= \quad \max _{{\mathbf {R}}:\overline{\varvec{\Sigma }} + {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}}). \end{aligned}$$

Then, Assumption (A3) will conclude the proof.

Proof

We decompose the minimization problem à la Fenchel.

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succ {\mathbf {0}}} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Theta }) \, \hbox { s.t. } {\varTheta }_{ij} = 0 \hbox { if } {Z}_{ij} = 0, \\&= \min _{\varvec{\Theta }\succ {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega ({\mathbf {Z}} \circ \varvec{\Phi }) \, \hbox { s.t. } {\varTheta }_{ij} = {Z}_{ij} {\Phi }_{ij}, \\&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega ({\mathbf {Z}} \circ \varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }. \end{aligned}$$

In the last equality, we omitted the constraint \(\varvec{\Theta } \succ {\mathbf {0}}\), which is implied by the domain of \(\log \det \). Assuming (A1) and (A2) hold, the regularization term \(\varOmega ({\mathbf {Z}} \circ \varvec{\Phi })\) can be replaced by \(\varOmega (\varvec{\Phi })\) and

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }. \end{aligned}$$

The above objective function is convex in \((\varvec{\Theta }, \varvec{\Phi })\), the feasible set is a non-empty convex set (\(\varvec{\Theta } = \varvec{\Phi } = {\mathbf {I}}_p\) is feasible), and Slater’s condition is satisfied. Hence, strong duality holds.

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }, \\&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \max _{{\mathbf {R}}} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, + \langle \varvec{\Theta } - {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle , \\&= \max _{{\mathbf {R}}} \, \min _{\varvec{\Theta } \succeq {\mathbf {0}}} \, \left[ \langle \overline{\varvec{\Sigma }} + {\mathbf {R}}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } \right] + \min _{\varvec{\Phi }} \left[ \varOmega ( \varvec{\Phi }) - \langle {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle \right] . \end{aligned}$$

For the first inner-minimization problem, first-order conditions \(\overline{\varvec{\Sigma }} + {\mathbf {R}} - \varvec{\Theta }^{-1} = {\mathbf {0}}\) lead to the constraint \(\overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ 0\) and the objective value is \(p + \log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}})\). The second inner-minimization problem is almost the definition of the Fenchel conjugate:

$$\begin{aligned} \min _{\varvec{\Phi }} \varOmega (\varvec{\Phi }) - \langle {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle&= - \max _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {Z}} \circ {\mathbf {R}} \rangle - \varOmega ( \varvec{\Phi }), \\&= - \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}} ) \end{aligned}$$

Hence,

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ {\mathbf {0}}} \, p + \log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} ). \end{aligned}$$

\(\square \)

Remark

Notice that we have proved that \({\tilde{h}}({\mathbf {Z}})\) can be written as a point-wise maximum of functions that are concave in \({\mathbf {Z}}\), which by itself does not imply convexity. Assumption (A3) ensures that each function inside the maximum is in fact linear in \({\mathbf {Z}}\), so that \({\tilde{h}}({\mathbf {Z}})\) is a point-wise maximum of affine functions of \({\mathbf {Z}}\) and is therefore convex.

A.3 Special cases and corollaries

A.3.1 No regularization

We first consider the unregularized case of (6) where \(\forall \, \varvec{\Phi }, \, \varOmega (\varvec{\Phi }) = 0\). Assumptions (A1) and (A2) are obviously satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}}) = \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle = {\left\{ \begin{array}{ll} 0 &{} \hbox {if } {\mathbf {R}} = {\mathbf {0}}, \\ +\,\infty &{} \hbox {otherwise.} \end{array}\right. } \end{aligned}$$

With the convention that \(0 \times \infty = 0\), Assumption (A3) is satisfied and Theorem 2 holds:

$$\begin{aligned} {h}({\mathbf {Z}})&= \quad \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \langle {\mathbf {Z}} , \varvec{\Omega }^\star ( {\mathbf {R}}) \rangle , \\&= \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }} + {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }}+ {\mathbf {R}}) \quad \hbox { s.t. } {Z}_{ij} {R}_{ij} = 0, \, \forall (i,j). \end{aligned}$$

In particular, this reformulation proves that \({h}({\mathbf {Z}})\) is convex (see also Footnote 3), but the coordinates of its sub-gradient \(-\varvec{\Omega }^\star ({\mathbf {R}}^\star ({\mathbf {Z}}))\) are either 0 or \(-\infty \), hence uninformative. Note that the same conclusion holds for \(\ell _1\)-regularization, as the computation below illustrates.
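
For illustration, the corresponding computation for \(\ell _1\)-regularization reads as follows (the penalty weight \(\lambda > 0\) below is generic and used only for this sketch). Taking \(\varOmega _{ij}(x) = \lambda |x|\), which satisfies (A1) and (A2),

$$\begin{aligned} \varOmega _{ij}^\star (r) = \sup _x \, x r - \lambda |x| = {\left\{ \begin{array}{ll} 0 &{} \hbox {if } |r| \leqslant \lambda , \\ +\,\infty &{} \hbox {otherwise,} \end{array}\right. } \end{aligned}$$

so that, with the same convention \(0 \times \infty = 0\), Assumption (A3) holds with \(\varvec{\Omega }^\star ({\mathbf {R}})_{ij} = \varOmega _{ij}^\star (R_{ij}) \in \{0, +\infty \}\), and the coordinates of the sub-gradient \(-\varvec{\Omega }^\star ({\mathbf {R}}^\star ({\mathbf {Z}}))\) are again either 0 or \(-\infty \).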

From the proof of Theorem 2, one can derive a lower bound on \( \Vert \varvec{\Theta }^\star \Vert _\infty \) which will be useful for big-M regularization.

Theorem 3

The solution of (8) satisfies \(\Vert \varvec{\Theta }^\star \Vert _\infty \geqslant \frac{p}{\Vert \overline{\varvec{\Sigma }} \Vert _1}\)

Proof

For a feasible support \({\mathbf {Z}}\), denote by \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \({\mathbf {R}}^\star ({\mathbf {Z}})\) the optimal primal and dual variables, respectively. There is no duality gap and the KKT condition \(\varvec{\Theta }^\star ({\mathbf {Z}})^{-1}= \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}})\) holds, so that \( \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle = p\). From Hölder’s inequality, \(p = \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _1 \, \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _\infty \), which gives the desired lower bound. \(\square \)

A.3.2 Big-M regularization

For the big-M regularization,

$$\begin{aligned} \varOmega (\varvec{\Theta }) = {\left\{ \begin{array}{ll} 0 &{} \hbox { if } |\varTheta _{ij}| \leqslant M_{ij}, \\ +\,\infty &{} \hbox { otherwise} \end{array}\right. }, \end{aligned}$$

is decomposable with \(\varOmega _{i,j}(\varTheta _{ij}) = 0\) if \(| \varTheta _{ij} | \leqslant M_{ij}\), \(+\,\infty \) otherwise. Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi } \, : \, \Vert \varvec{\Phi }\Vert _\infty \leqslant {\mathbf {M}}} \langle \varvec{\Phi }, {\mathbf {R}} \rangle = \Vert {\mathbf {M}} \circ {\mathbf {R}} \Vert _1. \end{aligned}$$

In particular, for any binary matrix \({\mathbf {Z}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \sum _{i,j} | M_{ij} Z_{ij} R_{ij} | = \sum _{i,j} M_{ij} Z_{ij} | R_{ij} |, \end{aligned}$$

so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( M_{ij} |R_{ij}| \right) _{ij}\).

A.3.3 Ridge regularization

For the \(\ell _2^2\)-regularization,

$$\begin{aligned} \varOmega (\varvec{\Theta }) = \dfrac{1}{2\gamma } \Vert \varvec{\Theta } \Vert _2^2, \end{aligned}$$

is decomposable with \(\varOmega _{i,j}(\varTheta _{ij}) = \tfrac{1}{2\gamma } \varTheta _{ij}^2\). Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle - \dfrac{1}{2\gamma } \Vert \varvec{\Phi } \Vert _2^2 = \dfrac{\gamma }{2} \Vert {\mathbf {R}} \Vert _2^2 \end{aligned}$$

In particular, for any binary matrix \({\mathbf {Z}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \dfrac{\gamma }{2} \sum _{i,j} ( Z_{ij} R_{ij} )^2 = \dfrac{\gamma }{2}\sum _{i,j} Z_{ij} R_{ij}^2, \end{aligned}$$

since \(Z_{ij}^2 = Z_{ij}\), so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( \tfrac{\gamma }{2} R_{ij}^2 \right) _{ij}.\)
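
As a numerical sanity check of Theorem 2 with this regularizer (a self-contained sketch, not the paper's implementation; the dimension, data and support \({\mathbf {Z}}\) below are made up for illustration), one can solve the restricted primal problem and its dual with a generic conic modeling tool and verify that the two optimal values coincide:

```python
import numpy as np
import cvxpy as cp

# Illustrative data: a small positive definite empirical covariance and a support Z.
rng = np.random.default_rng(0)
p, gamma = 5, 1.0
A = rng.standard_normal((p, 2 * p))
Sigma = A @ A.T / (2 * p)                      # stands in for the empirical covariance
Z = np.eye(p, dtype=int)
Z[0, 1] = Z[1, 0] = 1                          # allowed support: diagonal plus one edge

# Primal: ridge-penalized log-likelihood restricted to the support Z.
Theta = cp.Variable((p, p), symmetric=True)
primal = cp.Problem(
    cp.Minimize(cp.trace(Sigma @ Theta) - cp.log_det(Theta)
                + cp.sum_squares(Theta) / (2 * gamma)),
    [Theta[i, j] == 0 for i in range(p) for j in range(p) if Z[i, j] == 0])
primal.solve()

# Dual of Theorem 2: max_R  p + log det(Sigma + R) - (gamma/2) * sum_ij Z_ij R_ij^2.
R = cp.Variable((p, p), symmetric=True)
dual = cp.Problem(
    cp.Maximize(p + cp.log_det(Sigma + R)
                - (gamma / 2) * cp.sum(cp.multiply(Z, cp.square(R)))))
dual.solve()

print(primal.value, dual.value)                # agree up to solver tolerance
```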

Moreover, from the proof of Theorem 2, one can connect the norm of \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \(\gamma \).

Theorem 4

For any support \({\mathbf {Z}}\), the norm of the optimal precision matrix \(\varvec{\Theta }^\star ({\mathbf {Z}})\) is bounded by

$$\begin{aligned} \dfrac{\gamma }{2} \Vert \overline{\varvec{\Sigma }} \Vert _2 \left( \sqrt{1 + \dfrac{4 p}{\gamma \Vert \overline{\varvec{\Sigma }}\Vert _2^2}} -1\right) \leqslant \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2 \leqslant \sqrt{p \gamma }. \end{aligned}$$

Proof

There is no duality gap:

$$\begin{aligned} \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle&- \log \det \varvec{\Theta }^\star ({\mathbf {Z}}) + \dfrac{1}{2\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2\\&= p + \log \det ( \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}}) ) -\dfrac{\gamma }{2} \Vert {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}}) \Vert _2^2. \end{aligned}$$

In addition, the following KKT conditions hold

$$\begin{aligned} \varvec{\Theta }^\star ( {\mathbf {Z}})^{-1}&= \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}}), \\ \varvec{\Phi }^\star ( {\mathbf {Z}})&= \gamma {\mathbf {Z}} \circ {\mathbf {R}}^\star ( {\mathbf {Z}}), \end{aligned}$$

where the second condition follows from the inner minimization problem defining \(\varOmega ^\star \): for the ridge penalty, the supremum \(\sup _{\varvec{\Phi }} \, \langle \varvec{\Phi }, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}}) \rangle - \tfrac{1}{2\gamma } \Vert \varvec{\Phi } \Vert _2^2\) is attained at \(\varvec{\Phi } = \gamma \, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}})\). Combining the zero duality gap with these two conditions, we have

$$\begin{aligned} \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle + \dfrac{1}{\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2 = p. \end{aligned}$$

Since \(\overline{\varvec{\Sigma }}\) and \(\varvec{\Theta }^\star ({\mathbf {Z}})\) are positive semi-definite matrices, \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \geqslant 0\). Hence,

$$\begin{aligned} \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2 \leqslant \sqrt{p \gamma }. \end{aligned}$$

To obtain the lower bound, note first that \(\varvec{\Theta }^\star ({\mathbf {Z}}) = {\mathbf {Z}} \circ \varvec{\Phi }^\star ({\mathbf {Z}}) = \varvec{\Phi }^\star ({\mathbf {Z}})\), since \(\varvec{\Phi }^\star ({\mathbf {Z}}) = \gamma \, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}})\) is already supported on \({\mathbf {Z}}\). We then apply the Cauchy–Schwarz inequality \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _2 \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2\) and solve the resulting quadratic inequality

$$\begin{aligned} \dfrac{1}{\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2 + \Vert \overline{\varvec{\Sigma }} \Vert _2 \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2 - p \geqslant 0. \end{aligned}$$

\(\square \)

In particular, the lower bound in Theorem 4 is controlled by the factor \(\tfrac{4 p}{\gamma \Vert \overline{\varvec{\Sigma }} \Vert _2^2}\), suggesting an appropriate scaling of \(\gamma \) to start a grid search with.
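
As a small illustration (a sketch, not part of the paper; \(\overline{\varvec{\Sigma }}\) is assumed to be available as a numpy array, and \(\Vert \cdot \Vert _2\) is the entry-wise matrix norm used in the penalty), the two bounds of Theorem 4 can be evaluated directly to choose a sensible starting grid for \(\gamma \):

```python
import numpy as np

def theorem4_bounds(Sigma, gamma):
    """Lower and upper bounds of Theorem 4 on the norm of the optimal precision matrix."""
    p = Sigma.shape[0]
    s = np.linalg.norm(Sigma, "fro")          # ||Sigma||_2 in the paper's (entry-wise) notation
    lower = 0.5 * gamma * s * (np.sqrt(1.0 + 4.0 * p / (gamma * s ** 2)) - 1.0)
    upper = np.sqrt(p * gamma)
    return lower, upper

# Example: scan a coarse grid of gamma values (Sigma_bar is a placeholder PSD matrix).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
Sigma_bar = A.T @ A / 50
for gamma in (0.01, 0.1, 1.0, 10.0):
    print(gamma, theorem4_bounds(Sigma_bar, gamma))
```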

B An optimization approach for finding big-M values

In this section, we present a method for obtaining suitable constants \({\mathbf {M}}\). The approach involves solving two optimization problems for each off-diagonal entry of the matrix being estimated. The problems provide lower and upper bounds for each entry of the optimal solution. First we present the problems, then we discuss how they are solved.

B.1 Bound optimization problems

Let \(\hat{\varvec{\Theta }}\) be a feasible solution for (3) and define,

$$\begin{aligned} u := \langle \hat{\varvec{\Theta }}, \overline{\varvec{\Sigma }} \rangle - \log \det \hat{\varvec{\Theta }}. \end{aligned}$$

A simple way to obtain lower bounds for the ijth entry of the optimal solution is to solve

$$\begin{aligned} \begin{aligned}&\min _{\varvec{\Theta } \succ {\mathbf {0}}} \quad&\varTheta _{ij} \\&\text { s.t. }&\langle \overline{\varvec{\Sigma }}, \mathbf { \varTheta } \rangle - \log \det \varvec{\Theta } \leqslant u. \end{aligned} \end{aligned}$$
(11)

Likewise, to obtain upper bounds we solve

$$\begin{aligned} \begin{aligned}&\max _{\varvec{\Theta } \succ {\mathbf {0}}} \quad&\varTheta _{ij} \\&\text { s.t. }&\langle \overline{\varvec{\Sigma }}, \mathbf { \varTheta } \rangle - \log \det \varvec{\Theta } \leqslant u. \end{aligned} \end{aligned}$$
(12)

Note that it is sufficient to find any feasible solution \(\hat{\varvec{\Theta }}\) to formulate (11) and (12); a feasible solution with a smaller objective value leads to tighter bounds.
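
The two bound problems can be written almost verbatim with a conic modeling tool. The following cvxpy sketch (an illustration, not the paper's implementation) takes the empirical covariance, any feasible \(\hat{\varvec{\Theta }}\), and an index pair, and returns the corresponding lower and upper bounds:

```python
import numpy as np
import cvxpy as cp

def entry_bounds(Sigma, Theta_hat, i, j):
    """Solve Problems (11) and (12) for one entry (i, j).
    Sigma: empirical covariance; Theta_hat: any feasible (positive definite) solution of (3)."""
    p = Sigma.shape[0]
    # u = <Theta_hat, Sigma> - log det(Theta_hat), computed from the feasible solution.
    u = np.trace(Theta_hat @ Sigma) - np.linalg.slogdet(Theta_hat)[1]

    Theta = cp.Variable((p, p), symmetric=True)
    level_set = [cp.trace(Sigma @ Theta) - cp.log_det(Theta) <= u]

    lower = cp.Problem(cp.Minimize(Theta[i, j]), level_set).solve()
    upper = cp.Problem(cp.Maximize(Theta[i, j]), level_set).solve()
    return lower, upper
```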

B.2 Solution approach

We describe the approach for the lower bound Problem (11) only, the upper bound Problem (12) being similar.

First, we make the additional assumption that \(\overline{\varvec{\Sigma }}\) is invertible. This assumption cannot hold in the high-dimensional setting where \(p > n\). Numerically, one can always argue that the smallest eigenvalues of \(\overline{\varvec{\Sigma }}\) are never exactly zero but strictly positive; in that case, however, they are close to machine precision, which makes matrix inversion very unstable. Note that this extra assumption is required for Problems (11) and (12) to be bounded.

Problem (11) is a semidefinite optimization problem, and there are \(\frac{p(p+1)}{2}\) entries to bound, so solving that many SDPs directly would be too expensive. Instead, one can solve the dual of (11) very efficiently. An additional advantage of working with the dual is that we do not need to solve it to optimality to obtain a valid bound. Using basic arguments from convex duality, similar to the ones invoked in Sect. A.2, the dual problem of (11) reads

$$\begin{aligned} \max _{\lambda > 0} \left\{ \lambda \left( p - u + \log \det \left( \frac{1}{2 \lambda } (e_i e_j^T + e_j e_i^T) + \overline{\varvec{\Sigma }} \right) \right) \right\} \end{aligned}$$
(13)

Computationally, problem (13) is easier to solve because it is a convex optimization problem with a scalar decision variable \(\lambda \).

Denote \(g(\lambda )\) the objective function in the dual Problem (13). Algebraic manipulations yield

$$\begin{aligned} g(\lambda )&:= \lambda \left[ p - u + \log \det \left( \frac{1}{2 \lambda } (e_i e_j^T + e_j e_i^T) + \overline{\varvec{\Sigma }} \right) \right] , \\&= \lambda \left[ p - u + \log \det (\overline{\varvec{\Sigma }}) + \log \left( 1 + \dfrac{\varTheta _{ij} }{\lambda } + \dfrac{\varTheta _{ij}^2 - \varTheta _{ii} \varTheta _{jj} }{4 \lambda ^2} \right) \right] , \end{aligned}$$

where \(\varvec{\Theta } = \overline{\varvec{\Sigma }}^{-1}\). We can then easily derive the first and second derivatives of g and apply Newton’s method to solve Problem (13).
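
For completeness, here is a minimal Python sketch of this one-dimensional Newton scheme (an illustration, not the authors' code). With \(q(\lambda ) = 1 + a/\lambda + b/\lambda ^2\), \(a = \varTheta _{ij}\) and \(b = (\varTheta _{ij}^2 - \varTheta _{ii} \varTheta _{jj})/4\), it uses the closed-form derivatives of g, the simplification \(2 q'(\lambda ) + \lambda q''(\lambda ) = 2b/\lambda ^3\), and a damped step to keep the iterates in the domain \(q(\lambda ) > 0\):

```python
import numpy as np

def lower_bound_ij(Sigma, u, i, j, tol=1e-10, max_iter=100):
    """Newton's method on the scalar dual (13); returns a valid lower bound on Theta_ij.
    Assumes, as in the text, that Sigma is invertible and u comes from a feasible solution of (3)."""
    p = Sigma.shape[0]
    Theta = np.linalg.inv(Sigma)                              # Theta = Sigma^{-1}
    c = p - u + np.linalg.slogdet(Sigma)[1]                   # p - u + log det(Sigma)
    a = Theta[i, j]
    b = (Theta[i, j] ** 2 - Theta[i, i] * Theta[j, j]) / 4.0  # b < 0 for off-diagonal entries

    # q(lam) must stay positive; lam_min is the largest root of lam^2 + a*lam + b = 0.
    lam_min = 0.5 * (-a + np.sqrt(a * a - 4.0 * b)) if b < 0 else 0.0

    def value_and_derivatives(lam):
        q = 1.0 + a / lam + b / lam ** 2
        dq = -a / lam ** 2 - 2.0 * b / lam ** 3
        g = lam * (c + np.log(q))
        d1 = c + np.log(q) + lam * dq / q                     # g'(lam)
        d2 = 2.0 * b / (lam ** 3 * q) - lam * (dq / q) ** 2   # g''(lam), via 2q' + lam*q'' = 2b/lam^3
        return g, d1, d2

    lam = max(1.0, 2.0 * lam_min + 1e-8)                      # start safely inside the domain
    for _ in range(max_iter):
        _, d1, d2 = value_and_derivatives(lam)
        if abs(d1) < tol:
            break
        step = -d1 / d2                                       # Newton step (g is concave)
        while lam + step <= lam_min:                          # damp the step to remain feasible
            step /= 2.0
        lam += step
    return value_and_derivatives(lam)[0]                      # any feasible lam gives a valid bound
```

The upper bound of Problem (12) can be handled with the same routine by writing \(\max \, \varTheta _{ij} = -\min \, (-\varTheta _{ij})\): this amounts to replacing a by \(-a\) above and negating the returned value.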

C Additional material on computational performance of the cutting-plane algorithm

In this section, we consider the runtime of the cutting-plane algorithm on synthetic problems as in Sect. 5.1. In Sect. 5.1.1, we illustrated how the regularization parameter M or \(\gamma \) can impact the convergence of the cutting-plane algorithm, so we focus in this section on the impact of the problem sizes n, p and k.

In particular, we study the time needed by the algorithm to find the optimal solution (opt-time) and to verify the solution’s optimality (ver-time), as well as the number of cuts required (laz-cons). We carry out all experiments by generating 10 instances of synthetic data (see Footnote 4) for \((p,k_{true}) \in \{30,50,80,120,200\} \times \{5,10\}\) and different values of n. We solve each instance of (8) with big-M regularization for \(k = k_{true}\) and \(M=0.5\), and report average performance in Table 5. These computations are performed on 4 Intel E5-2690 v4 2.6 GHz CPUs (14 cores per CPU, no hyper-threading) with 16 GB of RAM in total. We fix \(M=0.5\) in order to isolate the impact of p, k and n on computational time; the specific value 0.5 is informed by knowledge of the ground truth.

In general, the algorithm provides an optimal solution in a matter of seconds, and a certificate of optimality in seconds or minutes even for p in the 100s. Optimality verification is significantly faster when the sample size n is larger, because the sparsity pattern of the underlying matrix is easier to recover; finding the optimal solution, on the other hand, is not as affected by the sample size n. As p or k increases, the time to find the optimal solution also does not change significantly, but optimality verification generally becomes significantly harder. Similar observations have been made for mixed-integer formulations of the best subset selection problem in linear and logistic regression [9]. We also observe that changes in k have a more substantial impact on the runtime than changes in n or p, especially when p is large. Finally, Meinshausen and Bühlmann’s approximation is used as a warm start, and we observe that it is often optimal, especially when n/p is large.

Thus, the cutting-plane algorithm in general provides an optimal or near-optimal solution quickly, but verifying optimality strongly depends on p, k, and n. Nonetheless, we observe that optimality of solutions can be verified in a matter of minutes for p in the 100s and k in the 10s.

Table 5 Average performance on instances of synthetic data with \(k = k_{true}\)

D Additional comparisons on statistical performance

We report here additional results from the experiments conducted in Sect. 5.1.

D.1 Comparisons for varying sample sizes n/p

See Figs. 6 and 7.

Fig. 6

Impact of the number of samples n / p on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(p=200\), \(t=1\%\)

Fig. 7

Impact of the number of samples n / p on support recovery. Results are averaged over 10 instances with \(p=200\), \(t=1\%\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

D.2 Comparisons for varying sparsity levels t

See Figs. 8 and 9.

Fig. 8

Impact of the sparsity level t on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(p=200\), \(n=p\)

Fig. 9

Impact of the sparsity level t on support recovery. Results are averaged over 10 instances with \(p=200\), \(n=p\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

D.3 Comparisons for varying dimensions p

See Figs. 10, 11, 12 and 13.

Fig. 10

Impact of the dimension p on support recovery. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Hyper-parameters are tuned using \(-LL\)

Fig. 11

Impact of the dimension p on support recovery. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

Fig. 12

Impact of the dimension p on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(n=p\), \(t=1\%\)

Fig. 13

Impact of the dimension p on computational time. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Recall that discrete formulations big-M and ridge are stopped after 5 min

Cite this article

Bertsimas, D., Lamperski, J. & Pauphilet, J. Certifiably optimal sparse inverse covariance estimation. Math. Program. 184, 491–530 (2020). https://doi.org/10.1007/s10107-019-01419-7
