Certifiably optimal sparse inverse covariance estimation

  • Full Length Paper
  • Series A
  • Published in Mathematical Programming

Abstract

We consider the maximum likelihood estimation of sparse inverse covariance matrices. We demonstrate that current heuristic approaches primarily encourage robustness, instead of the desired sparsity. We give a novel approach that solves the cardinality constrained likelihood problem to certifiable optimality. The approach uses techniques from mixed-integer optimization and convex optimization, and provides a high-quality solution with a guarantee on its suboptimality, even if the algorithm is terminated early. Using a variety of synthetic and real datasets, we demonstrate that our approach can solve problems where the dimension of the inverse covariance matrix is in the 1000s. We also demonstrate that our approach produces significantly sparser solutions than Glasso and other popular learning procedures, and makes fewer false discoveries, while still maintaining state-of-the-art accuracy.


Notes

  1. Note that \(\overline{\varvec{\Sigma }}\) is not the only estimate of the covariance matrix. In particular, \(\tfrac{n}{n-1} \overline{\varvec{\Sigma }}\) is a widely used unbiased estimator of the covariance matrix. In this paper, we only consider \(\overline{\varvec{\Sigma }}\), which we refer to as the empirical or sample covariance matrix.

  2. Available at https://cran.r-project.org/web/packages/glasso/.

  3. Convexity of \({h}({\mathbf {Z}})\) can also be proved from the primal formulation (6) directly. Take two matrices \({\mathbf {Z}}_1\) and \( {\mathbf {Z}}_2\), \(\lambda \in (0,1)\), \({\mathbf {Z}} := \lambda {\mathbf {Z}}_1 + (1-\lambda ) {\mathbf {Z}}_2\), then it follows from the definition (6) that \( h({\mathbf {Z}}) \leqslant \lambda h({\mathbf {Z}}_1) + (1-\lambda ) h({\mathbf {Z}}_2)\).

  4. For each instance, we generate a sparse precision matrix \(\varvec{\Theta }_0\) as in Sect. 5.1 and n samples from the corresponding multivariate normal distribution.

References

  1. Atamtürk, A., Narayanan, V.: Conic mixed-integer rounding cuts. Math. Program. 122(1), 1–20 (2010)

  2. Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv:1509.00426

  3. Banerjee, O., El Ghaoui, L., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9(Mar), 485–516 (2008)

  4. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)

  5. Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. 270, 931–942 (2018)

  6. Bertsimas, D., Mazumder, R.: Least quantile regression via modern optimization. Ann. Stat. 42, 2494–2525 (2014)

  7. Bertsimas, D., Van Parys, B.: Sparse high-dimensional regression: Exact scalable algorithms and phase transitions (2017). arXiv:1709.10029

  8. Bertsimas, D., Brown, D.B., Caramanis, C.: Theory and applications of robust optimization. SIAM Rev. 53(3), 464–501 (2011)

  9. Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)

  10. Bickel, P.J., Levina, E., et al.: Covariance regularization by thresholding. Ann. Stat. 36(6), 2577–2604 (2008)

  11. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  12. Cai, T., Liu, W., Luo, X.: A constrained \(\ell _1\) minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)

  13. Çezik, M.T., Iyengar, G.: Cuts for mixed 0–1 conic programming. Math. Program. 104(1), 179–202 (2005)

  14. Chickering, D.M.: Learning Bayesian Networks is NP-Complete. In: Fisher, D., Lenz, HJ. (eds.) Learning from Data. Lecture Notes in Statistics, vol 112. Springer, New York, NY (1996)

  15. Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)

  16. Dahl, J., Vandenberghe, L., Roychowdhury, V.: Covariance selection for nonchordal graphs via chordal embedding. Optim. Methods Softw. 23(4), 501–520 (2008)

  17. Dempster, A.P.: Covariance selection. Biometrics 28, 157–175 (1972)

  18. Drton, M., Maathuis, M.H.: Structure learning in graphical modeling. Ann. Rev. Stat. Appl. 4, 365–393 (2017)

  19. Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Math. Program. 36(3), 307–339 (1986)

  20. El Karoui, N.: High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Stat. 38(6), 3487–3566 (2010)

  21. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  22. Fan, J., Fan, Y., Lv, J.: High dimensional covariance matrix estimation using a factor model. J. Econom. 147(1), 186–197 (2008)

  23. Fan, J., Feng, Y., Wu, Y.: Network exploration via the adaptive lasso and SCAD penalties. Ann. Appl. Stat. 3(2), 521 (2009)

  24. Fan, J., Zhang, J., Ke, Y.: Vast portfolio selection with gross-exposure constraints. J. Am. Stat. Assoc. 107(498), 592–606 (2012)

  25. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014)

  26. Fan, J., Liao, Y., Liu, H.: An overview of the estimation of large covariance and precision matrices. Econom. J. 19(1), C1–C32 (2016)

  27. Fattahi, S., Sojoudi, S.: Graphical lasso and thresholding: equivalence and closed-form solutions (2017). arXiv:1708.09479

  28. Foygel, R., Drton, M.: Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 604–612. Curran Associates, Inc. (2010)

  29. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)

  30. Gally, T., Pfetsch, M.E., Ulbrich, S.: A framework for solving mixed-integer semidefinite programs. Optim. Methods Softw. 33(3), 594–632 (2018)

  31. Geoffrion, A.M.: Generalized Benders decomposition. J. Optim. Theory Appl. 10(4), 237–260 (1972)

  32. Gurobi Optimization Inc. Gurobi Optimizer Reference Manual (2015). http://www.gurobi.com

  33. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

  34. Helmberg, C., Rendl, F.: Solving quadratic (0, 1)-problems by semidefinite programs and cutting planes. Math. Program. 82(3), 291–315 (1998)

  35. Hess, K.R., Keith Anderson, W., Symmans, F., Valero, V., Ibrahim, N., Mejia, J.A., Booser, D., Theriault, R.L., Buzdar, A.U., Dempsey, P.J., et al.: Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J. Clin. Oncol. 24(26), 4236–4244 (2006)

  36. Hsieh, C.-J., Dhillon, I.S., Ravikumar, P.K., Sustik, M.A.: Sparse inverse covariance matrix estimation using quadratic approximation. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 2330–2338. Curran Associates, Inc. (2011)

  37. Hsieh, C.-J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.K., Poldrack, R.: BIG & QUIC: sparse inverse covariance estimation for a million variables. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3165–3173. Curran Associates, Inc. (2013)

  38. IBM ILOG: CPLEX Optimizer (2012). http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer

  39. Krishnamurthy, V., Ahipasaoglu, S.D., d’Aspremont, A.: A pathwise algorithm for covariance selection. Optim. Mach. Learn., p. 479 (2011). arXiv:0908.0143v2

  40. Lam, C., Fan, J.: Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Stat. 37(6B), 4254 (2009)

  41. Lauritzen, S.L.: Graphical Models, vol. 17. Clarendon Press, Oxford (1996)

  42. Liu, Z., Lin, S., Deng, N., McGovern, D.P.B., Piantadosi, S.: Sparse inverse covariance estimation with \(\ell _0\) penalty for network construction with omics data. J. Comput. Biol. 23(3), 192–202 (2016)

  43. Lubin, M., Dunning, I.: Computing in operations research using Julia. INFORMS J. Comput. 27(2), 238–248 (2015)

  44. Lubin, M., Yamangil, E., Bent, R., Vielma, J.P.: Polyhedral approximation in mixed-integer convex optimization. Math. Program. 172(1–2), 139–168 (2018)

  45. Ma, J., Zhao, F., Xu, J.: Structure learning constrained by node-specific degree distribution. In: Meila, M., Heskes, T. (eds.) Uncertainty in Artificial Intelligence 33, pp. 533–541. AUAI Press, Corvallis, Oregon (2015)

  46. Marjanovic, G., Hero, A.O.: \(\ell _0\) sparse inverse covariance estimation. IEEE Trans. Signal Process. 63(12), 3218–3231 (2015)

  47. Martin, R.K.: Using separation algorithms to generate mixed integer model reformulations. Oper. Res. Lett. 10(3), 119–128 (1991)

  48. Mazumder, R., Hastie, T.: Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. 13(Mar), 781–794 (2012a)

  49. Mazumder, R., Hastie, T.: The graphical lasso: new insights and alternatives. Electron. J. Stat. 6, 2125 (2012b)

  50. Meinshausen, N., Bühlmann, P., et al.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34(3), 1436–1462 (2006)

  51. Meyer, C.D.: Matrix Analysis and Applied Linear Algebra, vol. 71. SIAM, Philadelphia (2000)

  52. Oztoprak, F., Nocedal, J., Rennie, S., Olsen, P.A.: Newton-like methods for sparse inverse covariance estimation. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 755–763. Curran Associates, Inc. (2012)

  53. Rendl, F., Rinaldi, G., Wiegele, A.: Solving max-cut to optimality by intersecting semidefinite and polyhedral relaxations. Math. Program. 121(2), 307 (2010)

  54. Rigollet, P., Tsybakov, A.: Estimation of covariance matrices under sparsity constraints (2012). arXiv:1205.1210

  55. Rothman, A.J., Bickel, P.J., Levina, E., Zhu, J., et al.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)

  56. Santhanam, N.P., Wainwright, M.J.: Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Trans. Inf. Theory 58(7), 4117–4134 (2012)

  57. Scheinberg, K., Rish, I.: SINCO: a greedy coordinate ascent method for sparse inverse covariance selection problem. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY (2009)

  58. Scheinberg, K., Ma, S., Goldfarb, D.: Sparse inverse covariance selection via alternating linearization methods. In: Advances in Neural Information Processing Systems, pp. 2101–2109 (2010)

  59. Shao, J.: Linear model selection by cross-validation. J. Am. Stat. Assoc. 88(422), 486–494 (1993)

  60. Sotirov, R.: SDP relaxations for some combinatorial optimization problems. In: Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 795–819. Springer (2012)

  61. Tan, K.M., London, P., Mohan, K., Lee, S.-I., Fazel, M., Witten, D.: Learning graphical models with hubs. J. Mach. Learn. Res. 15(1), 3297–3331 (2014)

  62. Tokuda, T., Goodrich, B., Van Mechelen, I., Gelman, A., Tuerlinckx, F.: Visualizing distributions of covariance matrices. Columbia University, New York, USA, Technical Report, p. 18 (2011)

  63. Vandenberghe, L., Andersen, M.S., et al.: Chordal graphs and semidefinite optimization. Found. Trends Optim. 1(4), 241–433 (2015)

  64. Xu, H., Caramanis, C., Mannor, S.: Robust regression and lasso. In: Advances in Neural Information Processing Systems, pp. 1801–1808 (2009)

  65. Yonekura, K., Kanno, Y.: Global optimization of robust truss topology via mixed integer semidefinite programming. Optim. Eng. 11(3), 355–379 (2010)

  66. Yuan, M.: High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11(Aug), 2261–2286 (2010)

  67. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)

  68. Zhang, C.-H., et al.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  69. Zhang, F.: The Schur Complement and Its Applications, vol. 4. Springer, Berlin (2006)

Author information

Corresponding author

Correspondence to Dimitris Bertsimas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs of Theorem 2 and corollaries

In this section, we detail the proof of Theorem 2. We first specify the assumptions required on the regularizer \(\varOmega \), prove Theorem 2 and finally investigate some special cases of interest.

A.1 Assumptions

We first assume that the function \(\varOmega \) is decomposable, i.e., there exist scalar functions \(\varOmega _{ij}\) such that

$$\begin{aligned} \forall \, \varvec{\Phi }, \quad \varOmega (\varvec{\Phi }) = \sum _{i,j} \varOmega _{ij}(\Phi _{ij}). \end{aligned}$$
(A1)

In addition, we assume that for all (ij), \(\varOmega _{ij}\) is convex and tends to regularize towards zero. Formally,

$$\begin{aligned} \forall \, (i,j), \quad \min _x \, \varOmega _{ij}(x) = \varOmega _{ij}(0). \end{aligned}$$
(A2)

These first two assumptions are not highly restrictive and are satisfied, among others, by the \(\ell _\infty \)-norm constraint (big-M), \(\ell _1\)-norm regularization (LASSO), and \(\Vert \cdot \Vert _2^2\)-regularization.

For any function f, we denote with a superscript \(\star \) its Fenchel conjugate (see [11], chap.  3.3) defined as

$$\begin{aligned} f^\star (y) := \sup _x \langle x, y \rangle - f(x). \end{aligned}$$

In particular, the Fenchel conjugate of any function f is convex. Given Assumption (A1),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle - \varOmega (\varvec{\Phi }), \\&= \sum _{i,j} \sup _{\Phi _{ij}} \Phi _{ij} R_{ij} - \varOmega _{ij}(\Phi _{ij}), \\&= \sum _{i,j} \varOmega _{ij}^\star (R_{ij}). \end{aligned}$$

As a result, it is easy to see that if \(\varOmega \) satisfies (A1) and (A2), so does its Fenchel conjugate.

Let \({\mathbf {A}} \circ {\mathbf {B}}\) denote the Hadamard, or component-wise, product of the matrices \({\mathbf {A}}\) and \({\mathbf {B}}\). Consider a matrix \({\mathbf {R}}\) and a support matrix \({\mathbf {Z}} \in \{0,1\}^{p \times p}\). The function \({\mathbf {Z}} \mapsto \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} )\) is convex in \({\mathbf {Z}}\), by convexity of \(\varOmega ^\star \). We now assume that it is in fact linear in \({\mathbf {Z}}\), that is, that there exists a function \(\varvec{\Omega }^\star : \, {\mathbb {R}}^{p \times p} \rightarrow {\mathbb {R}}^{p \times p}\) satisfying:

$$\begin{aligned} \forall \, {\mathbf {Z}} \in \{0,1\}^{p \times p}, \forall \, {\mathbf {R}} \in {\mathbb {R}}^{p \times p}, \, \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} ) = \langle {\mathbf {Z}}, \varvec{\Omega }^\star ({\mathbf {R}}) \rangle . \end{aligned}$$
(A3)

A.2 Proof of Theorem 2

Given \({\mathbf {Z}} \in \{0,1\}^{p \times p}\) such that \({Z}_{ii} = 1\) for all \(i = 1,\dots ,p\), we first prove that under assumptions (A1) and (A2):

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&:= \quad \min _{\varvec{\Theta } \succ {\mathbf {0}}} \quad \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Theta }) \quad \hbox { s.t. } {\varTheta }_{ij} = 0 \hbox { if } {Z}_{ij} = 0 \, \forall (i,j), \\&= \quad \max _{{\mathbf {R}}:\overline{\varvec{\Sigma }} + {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}}). \end{aligned}$$

Then, Assumption (A3) will conclude the proof.

Proof

We decompose the minimization problem à la Fenchel.

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succ {\mathbf {0}}} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Theta }) \, \hbox { s.t. } {\varTheta }_{ij} = 0 \hbox { if } {Z}_{ij} = 0, \\&= \min _{\varvec{\Theta }\succ {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega ({\mathbf {Z}} \circ \varvec{\Phi }) \, \hbox { s.t. } {\varTheta }_{ij} = {Z}_{ij} {\Phi }_{ij}, \\&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega ({\mathbf {Z}} \circ \varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }. \end{aligned}$$

In the last equality, we omitted the constraint \(\varvec{\Theta } \succ {\mathbf {0}}\), which is implied by the domain of \(\log \det \). Assuming (A1) and (A2) hold, the regularization term \(\varOmega ({\mathbf {Z}} \circ \varvec{\Phi })\) can be replaced by \(\varOmega (\varvec{\Phi })\) and

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }. \end{aligned}$$

The above objective function is convex in \((\varvec{\Theta }, \varvec{\Phi })\), the feasible set is a non-empty convex set (\(\varvec{\Theta } = \varvec{\Phi } = {\mathbf {I}}_p\) is feasible), and Slater’s condition is satisfied. Hence, strong duality holds.

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, \hbox { s.t. } \varvec{\Theta } = {\mathbf {Z}} \circ \varvec{\Phi }, \\&= \min _{\varvec{\Theta } \succeq {\mathbf {0}}, \varvec{\Phi }} \, \max _{{\mathbf {R}}} \, \langle \overline{\varvec{\Sigma }}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } + \varOmega (\varvec{\Phi }) \, + \langle \varvec{\Theta } - {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle , \\&= \max _{{\mathbf {R}}} \, \min _{\varvec{\Theta } \succeq {\mathbf {0}}} \, \left[ \langle \overline{\varvec{\Sigma }} + {\mathbf {R}}, \varvec{\Theta } \rangle - \log \det \varvec{\Theta } \right] + \min _{\varvec{\Phi }} \left[ \varOmega ( \varvec{\Phi }) - \langle {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle \right] . \end{aligned}$$

For the first inner-minimization problem, first-order conditions \(\overline{\varvec{\Sigma }} + {\mathbf {R}} - \varvec{\Theta }^{-1} = {\mathbf {0}}\) lead to the constraint \(\overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ 0\) and the objective value is \(p + \log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}})\). The second inner-minimization problem is almost the definition of the Fenchel conjugate:

$$\begin{aligned} \min _{\varvec{\Phi }} \varOmega (\varvec{\Phi }) - \langle {\mathbf {Z}} \circ \varvec{\Phi }, {\mathbf {R}} \rangle&= - \max _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {Z}} \circ {\mathbf {R}} \rangle - \varOmega ( \varvec{\Phi }), \\&= - \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}} ) \end{aligned}$$

Hence,

$$\begin{aligned} {\tilde{h}}({\mathbf {Z}})&= \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ {\mathbf {0}}} \, p + \log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} ). \end{aligned}$$

\(\square \)

Remark

Notice that we have proved that \({\tilde{h}}({\mathbf {Z}})\) can be written as a point-wise maximum of functions that are concave in \({\mathbf {Z}}\), which by itself does not imply convexity. Assumption (A3) ensures that each function inside the maximum is in fact linear in \({\mathbf {Z}}\), so that \({\tilde{h}}({\mathbf {Z}})\) is a point-wise maximum of affine functions of \({\mathbf {Z}}\) and is therefore convex.

A.3 Special cases and corollaries

A.3.1 No regularization

We first consider the unregularized case of (6) where \(\forall \, \varvec{\Phi }, \, \varOmega (\varvec{\Phi }) = 0\). Assumptions (A1) and (A2) are obviously satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}}) = \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle = {\left\{ \begin{array}{ll} 0 &{} \hbox {if } {\mathbf {R}} = {\mathbf {0}}, \\ +\,\infty &{} \hbox {otherwise.} \end{array}\right. } \end{aligned}$$

With the convention that \(0 \times \infty = 0\), Assumption (A3) is satisfied and Theorem 2 holds:

$$\begin{aligned} {h}({\mathbf {Z}})&= \quad \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}}) - \langle {\mathbf {Z}} , \varvec{\Omega }^\star ( {\mathbf {R}}) \rangle , \\&= \max _{{\mathbf {R}}: \overline{\varvec{\Sigma }} + {\mathbf {R}} \succ {\mathbf {0}}} \, p +\log \det (\overline{\varvec{\Sigma }}+ {\mathbf {R}}) \quad \hbox { s.t. } {Z}_{ij} {R}_{ij} = 0, \, \forall (i,j). \end{aligned}$$

In particular, this reformulation proves that \({h}({\mathbf {Z}})\) is convex (see also Footnote 3), but the coordinates of its sub-gradient \(-\varvec{\Omega }^\star ({\mathbf {R}}^\star ({\mathbf {Z}}))\) are either 0 or \(-\infty \), hence uninformative. Note that the same conclusion holds for \(\ell _1\)-regularization, as the computation below illustrates.
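
For illustration, the corresponding computation for \(\ell _1\)-regularization reads as follows (the penalty weight \(\lambda > 0\) below is generic and used only for this sketch). Taking \(\varOmega _{ij}(x) = \lambda |x|\), which satisfies (A1) and (A2),

$$\begin{aligned} \varOmega _{ij}^\star (r) = \sup _x \, x r - \lambda |x| = {\left\{ \begin{array}{ll} 0 &{} \hbox {if } |r| \leqslant \lambda , \\ +\,\infty &{} \hbox {otherwise,} \end{array}\right. } \end{aligned}$$

so that, with the same convention \(0 \times \infty = 0\), Assumption (A3) holds with \(\varvec{\Omega }^\star ({\mathbf {R}})_{ij} = \varOmega _{ij}^\star (R_{ij}) \in \{0, +\infty \}\), and the coordinates of the sub-gradient \(-\varvec{\Omega }^\star ({\mathbf {R}}^\star ({\mathbf {Z}}))\) are again either 0 or \(-\infty \).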

From the proof of Theorem 2, one can derive a lower bound on \( \Vert \varvec{\Theta }^\star \Vert _\infty \) which will be useful for big-M regularization.

Theorem 3

The solution of (8) satisfies \(\Vert \varvec{\Theta }^\star \Vert _\infty \geqslant \frac{p}{\Vert \overline{\varvec{\Sigma }} \Vert _1}\)

Proof

For a feasible support \({\mathbf {Z}}\), denote by \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \({\mathbf {R}}^\star ({\mathbf {Z}})\) the optimal primal and dual variables, respectively. There is no duality gap and the KKT condition \(\varvec{\Theta }^\star ({\mathbf {Z}})^{-1}= \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}})\) holds, so that \( \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle = p\). From Hölder’s inequality, \(p = \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _1 \, \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _\infty \), which gives the desired lower bound. \(\square \)

A.3.2 Big-M regularization

For the big-M regularization,

$$\begin{aligned} \varOmega (\varvec{\Theta }) = {\left\{ \begin{array}{ll} 0 &{} \hbox { if } |\varTheta _{ij}| \leqslant M_{ij}, \\ +\,\infty &{} \hbox { otherwise} \end{array}\right. }, \end{aligned}$$

is decomposable with \(\varOmega _{i,j}(\varTheta _{ij}) = 0\) if \(| \varTheta _{ij} | \leqslant M_{ij}\), \(+\,\infty \) otherwise. Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi } \, : \, \Vert \varvec{\Phi }\Vert _\infty \leqslant {\mathbf {M}}} \langle \varvec{\Phi }, {\mathbf {R}} \rangle = \Vert {\mathbf {M}} \circ {\mathbf {R}} \Vert _1. \end{aligned}$$

In particular, for any binary matrix \({\mathbf {Z}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \sum _{i,j} | M_{ij} Z_{ij} R_{ij} | = \sum _{i,j} M_{ij} Z_{ij} | R_{ij} |, \end{aligned}$$

so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( M_{ij} |R_{ij}| \right) _{ij}\).

A.3.3 Ridge regularization

For the \(\ell _2^2\)-regularization,

$$\begin{aligned} \varOmega (\varvec{\Theta }) = \dfrac{1}{2\gamma } \Vert \varvec{\Theta } \Vert _2^2, \end{aligned}$$

is decomposable with \(\varOmega _{i,j}(\varTheta _{ij}) = \tfrac{1}{2\gamma } \varTheta _{ij}^2\). Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {R}})&= \sup _{\varvec{\Phi }} \langle \varvec{\Phi }, {\mathbf {R}} \rangle - \dfrac{1}{2\gamma } \Vert \varvec{\Phi } \Vert _2^2 = \dfrac{\gamma }{2} \Vert {\mathbf {R}} \Vert _2^2 \end{aligned}$$

In particular, for any binary matrix \({\mathbf {Z}}\),

$$\begin{aligned} \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \dfrac{\gamma }{2} \sum _{i,j} ( Z_{ij} R_{ij} )^2 = \dfrac{\gamma }{2}\sum _{i,j} Z_{ij} R_{ij}^2, \end{aligned}$$

since \(Z_{ij}^2 = Z_{ij}\), so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( \tfrac{\gamma }{2} R_{ij}^2 \right) _{ij}.\)
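
As a numerical sanity check of Theorem 2 with this regularizer (a self-contained sketch, not the paper's implementation; the dimension, data and support \({\mathbf {Z}}\) below are made up for illustration), one can solve the restricted primal problem and its dual with a generic conic modeling tool and verify that the two optimal values coincide:

```python
import numpy as np
import cvxpy as cp

# Illustrative data: a small positive definite empirical covariance and a support Z.
rng = np.random.default_rng(0)
p, gamma = 5, 1.0
A = rng.standard_normal((p, 2 * p))
Sigma = A @ A.T / (2 * p)                      # stands in for the empirical covariance
Z = np.eye(p, dtype=int)
Z[0, 1] = Z[1, 0] = 1                          # allowed support: diagonal plus one edge

# Primal: ridge-penalized log-likelihood restricted to the support Z.
Theta = cp.Variable((p, p), symmetric=True)
primal = cp.Problem(
    cp.Minimize(cp.trace(Sigma @ Theta) - cp.log_det(Theta)
                + cp.sum_squares(Theta) / (2 * gamma)),
    [Theta[i, j] == 0 for i in range(p) for j in range(p) if Z[i, j] == 0])
primal.solve()

# Dual of Theorem 2: max_R  p + log det(Sigma + R) - (gamma/2) * sum_ij Z_ij R_ij^2.
R = cp.Variable((p, p), symmetric=True)
dual = cp.Problem(
    cp.Maximize(p + cp.log_det(Sigma + R)
                - (gamma / 2) * cp.sum(cp.multiply(Z, cp.square(R)))))
dual.solve()

print(primal.value, dual.value)                # agree up to solver tolerance
```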

Moreover, from the proof of Theorem 2, one can connect the norm of \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \(\gamma \).

Theorem 4

For any support \({\mathbf {Z}}\), the norm of the optimal precision matrix \(\varvec{\Theta }^\star ({\mathbf {Z}})\) is bounded by

$$\begin{aligned} \dfrac{\gamma }{2} \Vert \overline{\varvec{\Sigma }} \Vert _2 \left( \sqrt{1 + \dfrac{4 p}{\gamma \Vert \overline{\varvec{\Sigma }}\Vert _2^2}} -1\right) \leqslant \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2 \leqslant \sqrt{p \gamma }. \end{aligned}$$

Proof

There is no duality gap:

$$\begin{aligned} \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle&- \log \det \varvec{\Theta }^\star ({\mathbf {Z}}) + \dfrac{1}{2\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2\\&= p + \log \det ( \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}}) ) -\dfrac{\gamma }{2} \Vert {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}}) \Vert _2^2. \end{aligned}$$

In addition, the following KKT conditions hold

$$\begin{aligned} \varvec{\Theta }^\star ( {\mathbf {Z}})^{-1}&= \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}}), \\ \varvec{\Phi }^\star ( {\mathbf {Z}})&= \gamma {\mathbf {Z}} \circ {\mathbf {R}}^\star ( {\mathbf {Z}}), \end{aligned}$$

where the second condition follows from the inner minimization problem defining \(\varOmega ^\star \): for the ridge penalty, the supremum \(\sup _{\varvec{\Phi }} \, \langle \varvec{\Phi }, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}}) \rangle - \tfrac{1}{2\gamma } \Vert \varvec{\Phi } \Vert _2^2\) is attained at \(\varvec{\Phi } = \gamma \, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}})\). Combining the zero duality gap with these two conditions, we have

$$\begin{aligned} \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle + \dfrac{1}{\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2 = p. \end{aligned}$$

Since \(\overline{\varvec{\Sigma }}\) and \(\varvec{\Theta }^\star ({\mathbf {Z}})\) are positive semi-definite matrices, \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \geqslant 0\). Hence,

$$\begin{aligned} \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2 \leqslant \sqrt{p \gamma }. \end{aligned}$$

To obtain the lower bound, note first that \(\varvec{\Theta }^\star ({\mathbf {Z}}) = {\mathbf {Z}} \circ \varvec{\Phi }^\star ({\mathbf {Z}}) = \varvec{\Phi }^\star ({\mathbf {Z}})\), since \(\varvec{\Phi }^\star ({\mathbf {Z}}) = \gamma \, {\mathbf {Z}} \circ {\mathbf {R}}^\star ({\mathbf {Z}})\) is already supported on \({\mathbf {Z}}\). We then apply the Cauchy–Schwarz inequality \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _2 \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2\) and solve the resulting quadratic inequality

$$\begin{aligned} \dfrac{1}{\gamma } \Vert \varvec{\Phi }^\star ({\mathbf {Z}}) \Vert _2^2 + \Vert \overline{\varvec{\Sigma }} \Vert _2 \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2 - p \geqslant 0. \end{aligned}$$

\(\square \)

In particular, the lower bound in Theorem 4 is controlled by the factor \(\tfrac{4 p}{\gamma \Vert \overline{\varvec{\Sigma }} \Vert _2^2}\), suggesting an appropriate scaling of \(\gamma \) to start a grid search with.
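
As a small illustration (a sketch, not part of the paper; \(\overline{\varvec{\Sigma }}\) is assumed to be available as a numpy array, and \(\Vert \cdot \Vert _2\) is the entry-wise matrix norm used in the penalty), the two bounds of Theorem 4 can be evaluated directly to choose a sensible starting grid for \(\gamma \):

```python
import numpy as np

def theorem4_bounds(Sigma, gamma):
    """Lower and upper bounds of Theorem 4 on the norm of the optimal precision matrix."""
    p = Sigma.shape[0]
    s = np.linalg.norm(Sigma, "fro")          # ||Sigma||_2 in the paper's (entry-wise) notation
    lower = 0.5 * gamma * s * (np.sqrt(1.0 + 4.0 * p / (gamma * s ** 2)) - 1.0)
    upper = np.sqrt(p * gamma)
    return lower, upper

# Example: scan a coarse grid of gamma values (Sigma_bar is a placeholder PSD matrix).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
Sigma_bar = A.T @ A / 50
for gamma in (0.01, 0.1, 1.0, 10.0):
    print(gamma, theorem4_bounds(Sigma_bar, gamma))
```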

B An optimization approach for finding big-M values

In this section, we present a method for obtaining suitable constants \({\mathbf {M}}\). The approach involves solving two optimization problems for each off-diagonal entry of the matrix being estimated. The problems provide lower and upper bounds for each entry of the optimal solution. First we present the problems, then we discuss how they are solved.

B.1 Bound optimization problems

Let \(\hat{\varvec{\Theta }}\) be a feasible solution for (3) and define,

$$\begin{aligned} u := \langle \hat{\varvec{\Theta }}, \overline{\varvec{\Sigma }} \rangle - \log \det \hat{\varvec{\Theta }}. \end{aligned}$$

A simple way to obtain lower bounds for the ijth entry of the optimal solution is to solve

$$\begin{aligned} \begin{aligned}&\min _{\varvec{\Theta } \succ {\mathbf {0}}} \quad&\varTheta _{ij} \\&\text { s.t. }&\langle \overline{\varvec{\Sigma }}, \mathbf { \varTheta } \rangle - \log \det \varvec{\Theta } \leqslant u. \end{aligned} \end{aligned}$$
(11)

Likewise, to obtain upper bounds we solve

$$\begin{aligned} \begin{aligned}&\max _{\varvec{\Theta } \succ {\mathbf {0}}} \quad&\varTheta _{ij} \\&\text { s.t. }&\langle \overline{\varvec{\Sigma }}, \mathbf { \varTheta } \rangle - \log \det \varvec{\Theta } \leqslant u. \end{aligned} \end{aligned}$$
(12)

Note that it is sufficient to find any feasible solution \(\hat{\varvec{\Theta }}\) to formulate (11) and (12); a feasible solution with a smaller objective value leads to tighter bounds.
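
The two bound problems can be written almost verbatim with a conic modeling tool. The following cvxpy sketch (an illustration, not the paper's implementation) takes the empirical covariance, any feasible \(\hat{\varvec{\Theta }}\), and an index pair, and returns the corresponding lower and upper bounds:

```python
import numpy as np
import cvxpy as cp

def entry_bounds(Sigma, Theta_hat, i, j):
    """Solve Problems (11) and (12) for one entry (i, j).
    Sigma: empirical covariance; Theta_hat: any feasible (positive definite) solution of (3)."""
    p = Sigma.shape[0]
    # u = <Theta_hat, Sigma> - log det(Theta_hat), computed from the feasible solution.
    u = np.trace(Theta_hat @ Sigma) - np.linalg.slogdet(Theta_hat)[1]

    Theta = cp.Variable((p, p), symmetric=True)
    level_set = [cp.trace(Sigma @ Theta) - cp.log_det(Theta) <= u]

    lower = cp.Problem(cp.Minimize(Theta[i, j]), level_set).solve()
    upper = cp.Problem(cp.Maximize(Theta[i, j]), level_set).solve()
    return lower, upper
```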

B.2 Solution approach

We describe the approach for the lower bound Problem (11) only, the upper bound Problem (12) being similar.

First, we make the additional assumption that \(\overline{\varvec{\Sigma }}\) is invertible. This assumption cannot hold in the high-dimensional setting where \(p > n\). Numerically, one can always argue that the smallest eigenvalues of \(\overline{\varvec{\Sigma }}\) are never exactly zero but strictly positive; in that case, however, they are close to machine precision, which makes matrix inversion very unstable. Note that this extra assumption is required for Problems (11) and (12) to be bounded.

Problem (11) is a semidefinite optimization problem, and there are \(\frac{p(p+1)}{2}\) entries to bound, so solving that many SDPs directly would be too expensive. Instead, one can solve the dual of (11) very efficiently. An additional advantage of working with the dual is that we do not need to solve it to optimality to obtain a valid bound. Using basic arguments from convex duality, similar to the ones invoked in Sect. A.2, the dual problem of (11) reads

$$\begin{aligned} \max _{\lambda > 0} \left\{ \lambda \left( p - u + \log \det \left( \frac{1}{2 \lambda } (e_i e_j^T + e_j e_i^T) + \overline{\varvec{\Sigma }} \right) \right) \right\} \end{aligned}$$
(13)

Computationally, problem (13) is easier to solve because it is a convex optimization problem with a scalar decision variable \(\lambda \).

Denote \(g(\lambda )\) the objective function in the dual Problem (13). Algebraic manipulations yield

$$\begin{aligned} g(\lambda )&:= \lambda \left[ p - u + \log \det \left( \frac{1}{2 \lambda } (e_i e_j^T + e_j e_i^T) + \overline{\varvec{\Sigma }} \right) \right] , \\&= \lambda \left[ p - u + \log \det (\overline{\varvec{\Sigma }}) + \log \left( 1 + \dfrac{\varTheta _{ij} }{\lambda } + \dfrac{\varTheta _{ij}^2 - \varTheta _{ii} \varTheta _{jj} }{4 \lambda ^2} \right) \right] , \end{aligned}$$

where \(\varvec{\Theta } = \overline{\varvec{\Sigma }}^{-1}\). We can then easily derive the first and second derivatives of g and apply Newton’s method to solve Problem (13).
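
For completeness, here is a minimal Python sketch of this one-dimensional Newton scheme (an illustration, not the authors' code). With \(q(\lambda ) = 1 + a/\lambda + b/\lambda ^2\), \(a = \varTheta _{ij}\) and \(b = (\varTheta _{ij}^2 - \varTheta _{ii} \varTheta _{jj})/4\), it uses the closed-form derivatives of g, the simplification \(2 q'(\lambda ) + \lambda q''(\lambda ) = 2b/\lambda ^3\), and a damped step to keep the iterates in the domain \(q(\lambda ) > 0\):

```python
import numpy as np

def lower_bound_ij(Sigma, u, i, j, tol=1e-10, max_iter=100):
    """Newton's method on the scalar dual (13); returns a valid lower bound on Theta_ij.
    Assumes, as in the text, that Sigma is invertible and u comes from a feasible solution of (3)."""
    p = Sigma.shape[0]
    Theta = np.linalg.inv(Sigma)                              # Theta = Sigma^{-1}
    c = p - u + np.linalg.slogdet(Sigma)[1]                   # p - u + log det(Sigma)
    a = Theta[i, j]
    b = (Theta[i, j] ** 2 - Theta[i, i] * Theta[j, j]) / 4.0  # b < 0 for off-diagonal entries

    # q(lam) must stay positive; lam_min is the largest root of lam^2 + a*lam + b = 0.
    lam_min = 0.5 * (-a + np.sqrt(a * a - 4.0 * b)) if b < 0 else 0.0

    def value_and_derivatives(lam):
        q = 1.0 + a / lam + b / lam ** 2
        dq = -a / lam ** 2 - 2.0 * b / lam ** 3
        g = lam * (c + np.log(q))
        d1 = c + np.log(q) + lam * dq / q                     # g'(lam)
        d2 = 2.0 * b / (lam ** 3 * q) - lam * (dq / q) ** 2   # g''(lam), via 2q' + lam*q'' = 2b/lam^3
        return g, d1, d2

    lam = max(1.0, 2.0 * lam_min + 1e-8)                      # start safely inside the domain
    for _ in range(max_iter):
        _, d1, d2 = value_and_derivatives(lam)
        if abs(d1) < tol:
            break
        step = -d1 / d2                                       # Newton step (g is concave)
        while lam + step <= lam_min:                          # damp the step to remain feasible
            step /= 2.0
        lam += step
    return value_and_derivatives(lam)[0]                      # any feasible lam gives a valid bound
```

The upper bound of Problem (12) can be handled with the same routine by writing \(\max \, \varTheta _{ij} = -\min \, (-\varTheta _{ij})\): this amounts to replacing a by \(-a\) above and negating the returned value.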

C Additional material on computational performance of the cutting-plane algorithm

In this section, we consider the runtime of the cutting-plane algorithm on synthetic problems as in Sect. 5.1. In Sect. 5.1.1, we illustrated how the regularization parameter M or \(\gamma \) can impact the convergence of the cutting-plane algorithm, so we focus in this section on the impact of the problem sizes n, p and k.

In particular, we study the time needed by the algorithm to find the optimal solution (opt-time) and to verify the solution’s optimality (ver-time), as well as the number of cuts required (laz-cons). We carry out all experiments by generating 10 instances of synthetic data (see Footnote 4) for \((p,k_{true}) \in \{30,50,80,120,200\} \times \{5,10\}\) and different values of n. We solve each instance of (8) with big-M regularization for \(k = k_{true}\) and \(M=0.5\), and report average performance in Table 5. These computations are performed on 4 Intel E5-2690 v4 2.6 GHz CPUs (14 cores per CPU, no hyper-threading) with 16 GB of RAM in total. We fix \(M=0.5\) in order to isolate the impact of p, k and n on computational time; the specific value 0.5 is informed by knowledge of the ground truth.

In general, the algorithm provides an optimal solution in a matter of seconds, and a certificate of optimality in seconds or minutes even for p in the 100s. Optimality verification is significantly faster when the sample size n is larger, because the sparsity pattern of the underlying matrix is easier to recover; finding the optimal solution, on the other hand, is not as affected by the sample size n. As p or k increases, the time to find the optimal solution also does not change significantly, but optimality verification generally becomes significantly harder. Similar observations have been made for mixed-integer formulations of the best subset selection problem in linear and logistic regression [9]. We also observe that changes in k have a more substantial impact on the runtime than changes in n or p, especially when p is large. Finally, Meinshausen and Bühlmann’s approximation is used as a warm start, and we observe that it is often optimal, especially when n/p is large.

Thus, the cutting-plane algorithm in general provides an optimal or near-optimal solution quickly, but verifying optimality strongly depends on p, k, and n. Nonetheless, we observe that optimality of solutions can be verified in a matter of minutes for p in the 100s and k in the 10s.

Table 5 Average performance on instances of synthetic data with \(k = k_{true}\)

D Additional comparisons on statistical performance

We report here additional results from the experiments conducted in Sect. 5.1.

D.1 Comparisons for varying sample sizes n/p

See Figs. 6 and 7.

Fig. 6

Impact of the number of samples n / p on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(p=200\), \(t=1\%\)

Fig. 7

Impact of the number of samples n / p on support recovery. Results are averaged over 10 instances with \(p=200\), \(t=1\%\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

D.2 Comparisons for varying sparsity levels t

See Figs. 8 and 9.

Fig. 8

Impact of the sparsity level t on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(p=200\), \(n=p\)

Fig. 9

Impact of the sparsity level t on support recovery. Results are averaged over 10 instances with \(p=200\), \(n=p\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

D.3 Comparisons for varying dimensions p

See Figs. 10, 11, 12 and 13.

Fig. 10

Impact of the dimension p on support recovery. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Hyper-parameters are tuned using \(-LL\)

Fig. 11

Impact of the dimension p on support recovery. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Hyper-parameters are tuned using the \(BIC_{1/2}\) criterion

Fig. 12

Impact of the dimension p on out-of-sample negative log-likelihood. Results are averaged over 10 instances with \(n=p\), \(t=1\%\)

Fig. 13

Impact of the dimension p on computational time. Results are averaged over 10 instances with \(n=p\), \(t=1\%\). Recall that discrete formulations big-M and ridge are stopped after 5 min

Cite this article

Bertsimas, D., Lamperski, J. & Pauphilet, J. Certifiably optimal sparse inverse covariance estimation. Math. Program. 184, 491–530 (2020). https://doi.org/10.1007/s10107-019-01419-7
