Abstract
We consider the maximum likelihood estimation of sparse inverse covariance matrices. We demonstrate that current heuristic approaches primarily encourage robustness, instead of the desired sparsity. We give a novel approach that solves the cardinality constrained likelihood problem to certifiable optimality. The approach uses techniques from mixed-integer optimization and convex optimization, and provides a high-quality solution with a guarantee on its suboptimality, even if the algorithm is terminated early. Using a variety of synthetic and real datasets, we demonstrate that our approach can solve problems where the dimension of the inverse covariance matrix is in the 1000s. We also demonstrate that our approach produces significantly sparser solutions than Glasso and other popular learning procedures and makes fewer false discoveries, while still maintaining state-of-the-art accuracy.
Notes
Note that \(\overline{\varvec{\Sigma }}\) is not the only estimate of the covariance matrix. In particular, \(\tfrac{n}{n-1} \overline{\varvec{\Sigma }}\) is a widely-used unbiased estimator of the covariance matrix. In this paper, we will only consider \(\overline{\varvec{\Sigma }}\), which we might refer to as the empirical or sample covariance matrix.
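Explicitly, given samples \({\mathbf {x}}_1,\dots ,{\mathbf {x}}_n\) with sample mean \(\bar{{\mathbf {x}}}\), the estimator referred to here is the standard one (consistent with the unbiasedness remark above):
\[ \overline{\varvec{\Sigma }} = \frac{1}{n} \sum _{i=1}^n ({\mathbf {x}}_i - \bar{{\mathbf {x}}})({\mathbf {x}}_i - \bar{{\mathbf {x}}})^\top . \]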
Available at https://cran.r-project.org/web/packages/glasso/.
Convexity of \({h}({\mathbf {Z}})\) can also be proved from the primal formulation (6) directly. Take two matrices \({\mathbf {Z}}_1\) and \( {\mathbf {Z}}_2\) and \(\lambda \in (0,1)\), and set \({\mathbf {Z}} := \lambda {\mathbf {Z}}_1 + (1-\lambda ) {\mathbf {Z}}_2\); it then follows from the definition (6) that \( h({\mathbf {Z}}) \leqslant \lambda h({\mathbf {Z}}_1) + (1-\lambda ) h({\mathbf {Z}}_2)\).
For each instance, we generate a sparse precision matrix \(\varvec{\Theta }_0\) as in Sect. 5.1 and n samples from the corresponding multivariate normal distribution.
References
Atamtürk, A., Narayanan, V.: Conic mixed-integer rounding cuts. Math. Program. 122(1), 1–20 (2010)
Atchadé, Y.F., Mazumder, R., Chen, J.: Scalable computation of regularized precision matrices via stochastic optimization (2015). arXiv:1509.00426
Banerjee, O., El Ghaoui, L., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9(Mar), 485–516 (2008)
Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
Bertsimas, D., Copenhaver, M.S.: Characterization of the equivalence of robustification and regularization in linear and matrix regression. Eur. J. Oper. Res. 270(3), 931–942 (2018)
Bertsimas, D., Mazumder, R.: Least quantile regression via modern optimization. Ann. Stat. 42, 2494–2525 (2014)
Bertsimas, D., Van Parys, B.: Sparse high-dimensional regression: Exact scalable algorithms and phase transitions (2017). arXiv:1709.10029
Bertsimas, D., Brown, D.B., Caramanis, C.: Theory and applications of robust optimization. SIAM Rev. 53(3), 464–501 (2011)
Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. Ann. Stat. 44(2), 813–852 (2016)
Bickel, P.J., Levina, E., et al.: Covariance regularization by thresholding. Ann. Stat. 36(6), 2577–2604 (2008)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Cai, T., Liu, W., Luo, X.: A constrained \(\ell _1\) minimization approach to sparse precision matrix estimation. J. Am. Stat. Assoc. 106(494), 594–607 (2011)
Çezik, M.T., Iyengar, G.: Cuts for mixed 0–1 conic programming. Math. Program. 104(1), 179–202 (2005)
Chickering, D.M.: Learning Bayesian networks is NP-complete. In: Fisher, D., Lenz, H.J. (eds.) Learning from Data. Lecture Notes in Statistics, vol. 112. Springer, New York (1996)
Chow, C., Liu, C.: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 14(3), 462–467 (1968)
Dahl, J., Vandenberghe, L., Roychowdhury, V.: Covariance selection for nonchordal graphs via chordal embedding. Optim. Methods Softw. 23(4), 501–520 (2008)
Dempster, A.P.: Covariance selection. Biometrics 28, 157–175 (1972)
Drton, M., Maathuis, M.H.: Structure learning in graphical modeling. Ann. Rev. Stat. Appl. 4, 365–393 (2017)
Duran, M.A., Grossmann, I.E.: An outer-approximation algorithm for a class of mixed-integer nonlinear programs. Math. Programm. 36(3), 307–339 (1986)
El Karoui, N.: High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: risk underestimation. Ann. Stat. 38(6), 3487–3566 (2010)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)
Fan, J., Fan, Y., Lv, J.: High dimensional covariance matrix estimation using a factor model. J. Econom. 147(1), 186–197 (2008)
Fan, J., Feng, Y., Yichao, W.: Network exploration via the adaptive lasso and SCAD penalties. Ann. Appl. Stat. 3(2), 521 (2009)
Fan, J., Zhang, J., Ke, Y.: Vast portfolio selection with gross-exposure constraints. J. Am. Stat. Assoc. 107(498), 592–606 (2012)
Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1(2), 293–314 (2014)
Fan, J., Liao, Y., Liu, H.: An overview of the estimation of large covariance and precision matrices. Econom. J. 19(1), C1–C32 (2016)
Fattahi, S., Sojoudi, S.: Graphical lasso and thresholding: equivalence and closed-form solutions (2017). arXiv:1708.09479
Foygel, R., Drton, M.: Extended Bayesian information criteria for Gaussian graphical models. In: Lafferty, J.D., Williams, C.K.I., Shawe-Taylor, J., Zemel, R.S., Culotta, A. (eds.) Advances in Neural Information Processing Systems 23, pp. 604–612. Curran Associates, Inc. (2010)
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441 (2008)
Gally, T., Pfetsch, M.E., Ulbrich, S.: A framework for solving mixed-integer semidefinite programs. Optim. Methods Softw. 33(3), 594–632 (2018)
Geoffrion, A.M.: Generalized Benders decomposition. J. Optim. Theory Appl. 10(4), 237–260 (1972)
Gurobi Optimization Inc. Gurobi Optimizer Reference Manual (2015). http://www.gurobi.com
Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)
Helmberg, C., Rendl, F.: Solving quadratic (0, 1)-problems by semidefinite programs and cutting planes. Math. Program. 82(3), 291–315 (1998)
Hess, K.R., Anderson, K., Symmans, W.F., Valero, V., Ibrahim, N., Mejia, J.A., Booser, D., Theriault, R.L., Buzdar, A.U., Dempsey, P.J., et al.: Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J. Clin. Oncol. 24(26), 4236–4244 (2006)
Hsieh, C.-J., Dhillon, I.S., Ravikumar, P.K., Sustik, M.A.: Sparse inverse covariance matrix estimation using quadratic approximation. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 2330–2338. Curran Associates, Inc. (2011)
Hsieh, C.-J., Sustik, M.A., Dhillon, I.S., Ravikumar, P.K., Poldrack, R.: Big & quic: sparse inverse covariance estimation for a million variables. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3165–3173. Curran Associates, Inc. (2013)
IBM ILOG. CPLEX Optimizer (2012). http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer
Krishnamurthy, V., Ahipasaoglu, S.D., d’Aspremont, A.: A pathwise algorithm for covariance selection. Optim. Mach. Learn., p. 479 (2011). arXiv:0908.0143v2
Lam, C., Fan, J.: Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Stat. 37(6B), 4254 (2009)
Lauritzen, S.L.: Graphical Models, vol. 17. Clarendon Press, Oxford (1996)
Liu, Z., Lin, S., Deng, N., McGovern, D.P.B., Piantadosi, S.: Sparse inverse covariance estimation with \(\ell _0\) penalty for network construction with omics data. J. Comput. Biol. 23(3), 192–202 (2016)
Lubin, M., Dunning, I.: Computing in operations research using Julia. INFORMS J. Comput. 27(2), 238–248 (2015)
Lubin, M., Yamangil, E., Bent, R., Vielma, J.P.: Polyhedral approximation in mixed-integer convex optimization. Math. Program. 172(1–2), 139–168 (2018)
Ma, J., Zhao, F., Xu, J.: Structure learning constrained by node-specific degree distribution. In: Meila, M., Heskes, T. (eds.) Uncertainty in Artificial Intelligence 33, pp. 533–541. AUAI Press Corvalis, Oregon (2015)
Marjanovic, G., Hero, A.O.: \(\ell _0\) sparse inverse covariance estimation. IEEE Trans. Signal Process. 63(12), 3218–3231 (2015)
Martin, R.K.: Using separation algorithms to generate mixed integer model reformulations. Oper. Res. Lett. 10(3), 119–128 (1991)
Mazumder, R., Hastie, T.: Exact covariance thresholding into connected components for large-scale graphical lasso. J. Mach. Learn. Res. 13(Mar), 781–794 (2012a)
Mazumder, R., Hastie, T.: The graphical lasso: new insights and alternatives. Electron. J. Stat. 6, 2125 (2012b)
Meinshausen, N., Bühlmann, P., et al.: High-dimensional graphs and variable selection with the lasso. Ann. Stat. 34(3), 1436–1462 (2006)
Meyer, C.D.: Matrix Analysis and Applied Linear Algebra, vol. 71. SIAM, Philadelphia (2000)
Oztoprak, F., Nocedal, J., Rennie, S., Olsen, P.A.: Newton-like methods for sparse inverse covariance estimation. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 755–763. Curran Associates, Inc. (2012)
Rendl, F., Rinaldi, G., Wiegele, A.: Solving max-cut to optimality by intersecting semidefinite and polyhedral relaxations. Math. Program. 121(2), 307 (2010)
Rigollet, P., Tsybakov, A.: Estimation of covariance matrices under sparsity constraints (2012). arXiv:1205.1210
Rothman, A.J., Bickel, P.J., Levina, E., Zhu, J., et al.: Sparse permutation invariant covariance estimation. Electron. J. Stat. 2, 494–515 (2008)
Santhanam, N.P., Wainwright, M.J.: Information-theoretic limits of selecting binary graphical models in high dimensions. IEEE Trans. Inf. Theory 58(7), 4117–4134 (2012)
Scheinberg, K., Rish, I.: SINCO: a greedy coordinate ascent method for the sparse inverse covariance selection problem. Technical report, IBM T. J. Watson Research Center, Yorktown Heights, NY (2009)
Scheinberg, K., Ma, S., Goldfarb, D.: Sparse inverse covariance selection via alternating linearization methods. In: Advances in Neural Information Processing Systems, pp. 2101–2109 (2010)
Shao, J.: Linear model selection by cross-validation. J. Am. Stat. Assoc. 88(422), 486–494 (1993)
Sotirov, R.: SDP relaxations for some combinatorial optimization problems. In: Handbook on Semidefinite, Conic and Polynomial Optimization, pp. 795–819. Springer (2012)
Tan, K.M., London, P., Mohan, K., Lee, S.-I., Fazel, M., Witten, D.: Learning graphical models with hubs. J. Mach. Learn. Res. 15(1), 3297–3331 (2014)
Tokuda, T., Goodrich, B., Van Mechelen, I., Gelman, A., Tuerlinckx, F.: Visualizing distributions of covariance matrices. Technical report, Columbia University, New York (2011)
Vandenberghe, L., Andersen, M.S., et al.: Chordal graphs and semidefinite optimization. Found. Trends Optim. 1(4), 241–433 (2015)
Xu, H., Caramanis, C., Mannor, S.: Robust regression and lasso. In: Advances in Neural Information Processing Systems, pp. 1801–1808 (2009)
Yonekura, K., Kanno, Y.: Global optimization of robust truss topology via mixed integer semidefinite programming. Optim. Eng. 11(3), 355–379 (2010)
Yuan, M.: High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11(Aug), 2261–2286 (2010)
Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)
Zhang, C.-H., et al.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)
Zhang, F.: The Schur Complement and Its Applications, vol. 4. Springer, Berlin (2006)
Appendices
Proofs of Theorem 2 and corollaries
In this section, we detail the proof of Theorem 2. We first specify the assumptions required on the regularizer \(\varOmega \), then prove Theorem 2, and finally investigate some special cases of interest.
1.1 Assumptions
We first assume that the function \(\varOmega \) is decomposable, i.e., there exist scalar functions \(\varOmega _{ij}\) such that
\[ \varOmega (\varvec{\Theta }) = \sum _{i,j} \varOmega _{ij}(\varTheta _{ij}). \qquad \mathrm {(A1)} \]
In addition, we assume that for all (i, j), \(\varOmega _{ij}\) is convex and tends to regularize towards zero. Formally,
\[ \varOmega _{ij} \text { is convex and } \varOmega _{ij}(\theta ) \geqslant \varOmega _{ij}(0) = 0 \quad \text {for all } \theta \in {\mathbb {R}}. \qquad \mathrm {(A2)} \]
These first two assumptions are not highly restrictive and are satisfied by the \(\ell _\infty \)-norm constraint (big-M), \(\ell _1\)-norm regularization (LASSO), or \(\Vert \cdot \Vert _2^2\)-regularization (ridge), among others.
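For concreteness, writing \(\mu > 0\) for a generic lasso weight (the big-M and ridge forms restate the special cases detailed below), these regularizers decompose entry-wise as
\[
\varOmega _{ij}(\varTheta _{ij}) =
\begin{cases}
0 \text { if } | \varTheta _{ij} | \leqslant M_{ij}, \; +\infty \text { otherwise} & \text {(big-M)}, \\
\mu \, | \varTheta _{ij} | & \text {(lasso)}, \\
\tfrac{1}{2\gamma } \, \varTheta _{ij}^2 & \text {(ridge)},
\end{cases}
\]
and each \(\varOmega _{ij}\) is convex and minimized at zero.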
For any function \(f\), we denote by a superscript \(\star \) its Fenchel conjugate (see [11], Chap. 3.3), defined as
\[ f^\star ({\mathbf {R}}) := \sup _{\varvec{\Theta }} \; \left\{ \langle {\mathbf {R}}, \varvec{\Theta } \rangle - f(\varvec{\Theta }) \right\} . \]
In particular, the Fenchel conjugate of any function f is convex. Given Assumption (A1),
\[ \varOmega ^\star ({\mathbf {R}}) = \sum _{i,j} \varOmega _{ij}^\star (R_{ij}). \]
As a result, it is easy to see that if \(\varOmega \) satisfies (A1) and (A2), so does its Fenchel conjugate.
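Briefly, with (A2) as stated above, the verification reads
\[ \varOmega _{ij}^\star (0) = \sup _\theta \{ - \varOmega _{ij}(\theta ) \} = 0 \quad \text {and} \quad \varOmega _{ij}^\star (r) \geqslant 0 \cdot r - \varOmega _{ij}(0) = 0 \quad \text {for all } r, \]
so each \(\varOmega _{ij}^\star \) is again convex and minimized at zero.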
Let us denote by \({\mathbf {A}} \circ {\mathbf {B}}\) the Hadamard or component-wise product of the matrices \({\mathbf {A}} \) and \({\mathbf {B}}\). Consider a matrix \({\mathbf {R}}\) and a support matrix \({\mathbf {Z}} \in \{0,1\}^{p \times p}\). The function \({\mathbf {Z}} \mapsto \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} )\) is convex in \({\mathbf {Z}}\), by convexity of \(\varOmega ^\star \). We now assume that it is linear in \({\mathbf {Z}}\), that is, there exists a function \(\varvec{\Omega }^\star : \, {\mathbb {R}}^{p \times p} \rightarrow {\mathbb {R}}^{p \times p}\) satisfying
\[ \varOmega ^\star ( {\mathbf {Z}} \circ {\mathbf {R}} ) = \langle {\mathbf {Z}}, \varvec{\Omega }^\star ({\mathbf {R}}) \rangle . \qquad \mathrm {(A3)} \]
1.2 Proof of Theorem 2
Given \({\mathbf {Z}} \in \{0,1\}^{p \times p}\) such that \({Z}_{ii} = 1\) for all \(i = 1,\dots ,p\), we first prove that under assumptions (A1) and (A2):
Then, Assumption (A3) will conclude the proof.
Proof
We decompose the minimization problem à la Fenchel.
In the last equality, we omitted the constraint \(\varvec{\Theta } \succ {\mathbf {0}}\), which is implied by the domain of \(\log \det \). Assuming (A1) and (A2) hold, the regularization term \(\varOmega ({\mathbf {Z}} \circ \varvec{\Phi })\) can be replaced by \(\varOmega (\varvec{\Phi })\) and
The above objective function is convex in \((\varvec{\Theta }, \varvec{\Phi })\), the feasible set is a non-empty convex set (\(\varvec{\Theta } = \varvec{\Phi } = {\mathbf {I}}_p\) is feasible), and Slater's condition is satisfied. Hence, strong duality holds.
For the first inner-minimization problem, first-order conditions \(\overline{\varvec{\Sigma }} + {\mathbf {R}} - \varvec{\Theta }^{-1} = {\mathbf {0}}\) lead to the constraint \(\overline{\varvec{\Sigma }}+ {\mathbf {R}} \succ 0\) and the objective value is \(p + \log \det (\overline{\varvec{\Sigma }} + {\mathbf {R}})\). The second inner-minimization problem is almost the definition of the Fenchel conjugate:
Hence,
\(\square \)
Remark
Notice that we proved that \({\tilde{h}}({\mathbf {Z}})\) can be written as a point-wise maximum of concave functions of \({\mathbf {Z}}\). Assumption (A3) is needed to ensure that each function inside the maximization is at the same time convex in \({\mathbf {Z}}\), so that the point-wise maximum itself is convex.
1.3 Special cases and corollaries
1.3.1 No regularization
We first consider the unregularized case of (6) where \(\varOmega (\varvec{\Phi }) = 0\) for all \(\varvec{\Phi }\). Assumptions (A1) and (A2) are obviously satisfied. Moreover, for any \({\mathbf {R}}\),
\[ \varOmega _{ij}^\star (R_{ij}) = \begin{cases} 0 & \text {if } R_{ij} = 0, \\ +\infty & \text {otherwise.} \end{cases} \]
With the convention that \(0 \times \infty = 0\), Assumption (A3) is satisfied and Theorem 2 holds:
In particular, this reformulation proves that \({h}({\mathbf {Z}})\) is convex (see footnote 3), but that the coordinates of its sub-gradient \(-\varvec{\Omega }^\star ({\mathbf {R}}^\star ({\mathbf {Z}}))\) are either 0 or \(-\infty \), and hence uninformative. Note that the same conclusion holds for \(\ell _1\)-regularization.
From the proof of Theorem 2, one can derive a lower bound on \( \Vert \varvec{\Theta }^\star \Vert _\infty \) which will be useful for big-M regularization.
Theorem 3
The solution of (8) satisfies \(\Vert \varvec{\Theta }^\star \Vert _\infty \geqslant \frac{p}{\Vert \overline{\varvec{\Sigma }} \Vert _1}\)
Proof
For a feasible support \({\mathbf {Z}}\), denote the optimal primal and dual variables \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \({\mathbf {R}}^\star ({\mathbf {Z}})\) respectively. There is no duality gap and KKT condition \(\varvec{\Theta }^\star ({\mathbf {Z}})^{-1}= \overline{\varvec{\Sigma }} + {\mathbf {R}}^\star ({\mathbf {Z}})\) holds, so that \( \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle = p\). From Hölder’s inequality, we obtain the desired lower bound. \(\square \)
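Spelled out, the Hölder step used above is the entry-wise \(\ell _1\)/\(\ell _\infty \) inequality
\[ p = \langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _1 \, \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _\infty , \]
which rearranges into the bound of Theorem 3.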
1.3.2 Big-M regularization
For the big-M regularization,
\[ \varOmega (\varvec{\Theta }) = \begin{cases} 0 & \text {if } | \varTheta _{ij} | \leqslant M_{ij} \text { for all } (i,j), \\ +\infty & \text {otherwise,} \end{cases} \]
is decomposable with \(\varOmega _{ij}(\varTheta _{ij}) = 0\) if \(| \varTheta _{ij} | \leqslant M_{ij}\), \(+\,\infty \) otherwise. Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),
\[ \varOmega ^\star ({\mathbf {R}}) = \sum _{i,j} M_{ij} \, | R_{ij} |. \]
In particular, for any binary matrix \({\mathbf {Z}}\),
\[ \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \sum _{i,j} Z_{ij} \, M_{ij} \, | R_{ij} | = \langle {\mathbf {Z}}, \left( M_{ij} | R_{ij} | \right) _{ij} \rangle , \]
so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( M_{ij} |R_{ij}| \right) _{ij}\).
1.3.3 Ridge regularization
For the \(\ell _2^2\)-regularization,
\[ \varOmega (\varvec{\Theta }) = \frac{1}{2\gamma } \sum _{i,j} \varTheta _{ij}^2 \]
is decomposable with \(\varOmega _{ij}(\varTheta _{ij}) = \tfrac{1}{2\gamma } \varTheta _{ij}^2\). Assumptions (A1) and (A2) are satisfied. Moreover, for any \({\mathbf {R}}\),
\[ \varOmega ^\star ({\mathbf {R}}) = \frac{\gamma }{2} \sum _{i,j} R_{ij}^2. \]
In particular, for any binary matrix \({\mathbf {Z}}\),
\[ \varOmega ^\star ({\mathbf {Z}} \circ {\mathbf {R}}) = \frac{\gamma }{2} \sum _{i,j} Z_{ij}^2 R_{ij}^2 = \frac{\gamma }{2} \sum _{i,j} Z_{ij} R_{ij}^2 = \langle {\mathbf {Z}}, \left( \tfrac{\gamma }{2} R_{ij}^2 \right) _{ij} \rangle \]
since \(Z_{ij}^2 = Z_{ij}\), so that Assumption (A3) is satisfied with \(\varvec{\Omega }^\star ({\mathbf {R}}) = \left( \tfrac{\gamma }{2} R_{ij}^2 \right) _{ij}.\)
Moreover, from the proof of Theorem 2, one can connect the norm of \(\varvec{\Theta }^\star ({\mathbf {Z}})\) and \(\gamma \).
Theorem 4
For any support \({\mathbf {Z}}\), the norm of the optimal precision matrix \(\varvec{\Theta }^\star ({\mathbf {Z}})\) is bounded by
Proof
There is no duality gap:
In addition, the following KKT conditions hold
where the second condition follows from the inner minimization problem defining \(\varOmega ^\star \). All in all, we have
Since \(\overline{\varvec{\Sigma }}\) and \(\varvec{\Theta }^\star ({\mathbf {Z}})\) are positive semi-definite matrices, \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \geqslant 0\). Hence,
To obtain the lower bound, we apply the Cauchy-Schwarz inequality \(\langle \overline{\varvec{\Sigma }}, \varvec{\Theta }^\star ({\mathbf {Z}}) \rangle \leqslant \Vert \overline{\varvec{\Sigma }} \Vert _2 \Vert \varvec{\Theta }^\star ({\mathbf {Z}}) \Vert _2\) and solve the quadratic equation
\(\square \)
In particular, the lower bound in Theorem 4 is controlled by the factor \(\tfrac{4 p}{\gamma \Vert \overline{\varvec{\Sigma }} \Vert _2^2}\), suggesting an appropriate scaling of \(\gamma \) to start a grid search with.
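As a rough illustration (a reading of this factor rather than a recommendation made in the text), taking \(\gamma \) of the order of \(4p / \Vert \overline{\varvec{\Sigma }} \Vert _2^2\) makes this factor equal to one, so a logarithmically spaced grid of \(\gamma \) values centered at this scale is a natural starting point for a grid search.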
An optimization approach for finding big-M values
In this section, we present a method for obtaining suitable constants \({\mathbf {M}}\). The approach involves solving two optimization problems for each off-diagonal entry of the matrix being estimated. The problems provide lower and upper bounds for each entry of the optimal solution. First we present the problems, then we discuss how they are solved.
1.1 Bound optimization problems
Let \(\hat{\varvec{\Theta }}\) be a feasible solution for (3) and define,
A simple way to obtain lower bounds for the ijth entry of the optimal solution is to solve
Likewise, to obtain upper bounds we solve
Note that it is sufficient to find a feasible solution \(\hat{\varvec{\Theta }}\) to formulate (11) and (12), and that a feasible solution with a smaller objective value leads to tighter bounds.
1.2 Solution approach
We describe the approach for the lower bound Problem (11) only, the upper bound Problem (12) being similar.
First, we make the additional assumption that \(\overline{\varvec{\Sigma }}\) is invertible. We know this assumption cannot hold in the high-dimensional setting where \(p > n\). Numerically, one could argue that the smallest eigenvalues of \(\overline{\varvec{\Sigma }}\) are never exactly zero but only strictly positive; in that case, however, they are small and close to machine precision, which makes the matrix inversion very unstable. Note that this extra assumption is required for problems (11) and (12) to be bounded.
Problem (11) is a semidefinite optimization problem, and there are \(\frac{p(p+1)}{2}\) entries to bound, so it is necessary to solve (11) efficiently rather than solve that many full SDPs. Instead, one can solve the dual of (11) very efficiently. An additional advantage of working with the dual is that we do not need to solve it to optimality to obtain a valid bound. Using basic arguments from convex duality theory, similar to the ones invoked in Sect. A.2, the dual problem of (11) reads
Computationally, problem (13) is easier to solve because it is a convex optimization problem with a scalar decision variable \(\lambda \).
Let \(g(\lambda )\) denote the objective function of the dual problem (13). Algebraic manipulations yield
where \(\varvec{\Theta } = \overline{\varvec{\Sigma }}^{-1}\). We can then easily derive the first and second derivatives of g and apply Newton’s method to solve Problem (13).
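Since the closed form of \(g\) and its derivatives depends on the display above, we only sketch the generic one-dimensional solve. The following is a minimal, illustrative Python implementation of a damped Newton iteration, assuming user-supplied callables dg and d2g for the first and second derivatives of g and a starting point lam0; the function name, tolerance, and damping rule are our own choices, not taken from the paper.

def newton_scalar(dg, d2g, lam0, tol=1e-10, max_iter=100):
    """Find a stationary point of g by Newton's method on g'(lam) = 0,
    with simple backtracking to keep |g'| decreasing."""
    lam = lam0
    for _ in range(max_iter):
        grad = dg(lam)
        if abs(grad) < tol:
            break
        step = grad / d2g(lam)   # Newton step for the equation g'(lam) = 0
        t = 1.0
        while abs(dg(lam - t * step)) >= abs(grad) and t > 1e-8:
            t *= 0.5             # damp the step if it does not reduce |g'|
        lam -= t * step
    return lam

# Toy check on a concave quadratic g(lam) = -(lam - 2)^2, whose maximizer is lam = 2:
lam_star = newton_scalar(lambda l: -2.0 * (l - 2.0), lambda l: -2.0, lam0=0.0)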
Additional material on computational performance of the cutting-plane algorithm
In this section, we consider the runtime of the cutting-plane algorithm on synthetic problems as in Sect. 5.1. In Sect. 5.1.1, we illustrated how the regularization parameter M or \(\gamma \) can impact the convergence of the cutting-plane algorithm, so we focus in this section on the impact of the problem sizes n, p and k.
In particular, we study the time needed by the algorithm to find the optimal solution (opt-time) and to verify the solution’s optimality (ver-time), as well as the number of cuts required (laz-cons). We carry out all experiments by generating 10 instances of synthetic data (see footnote 4) for \((p,k_{true}) \in \{30,50,80,120,200\} \times \{5,10\}\) and different values of n. We solve each instance of (8) with big-M regularization for \(k = k_{true}\), \(M=0.5\), and report average performance in Table 5. These computations are performed on 4 Intel E5-2690 v4 2.6 GHz CPUs (14 cores per CPU, no hyper-threading) with 16 GB of RAM in total. We chose to fix the value \(M=0.5\) in order to isolate the impact of p, k and n on computational time, the specific value 0.5 being informed by knowledge of the ground truth.
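For reference, the sketch below shows one way such an instance could be generated. It is an illustrative construction only (random upper-triangular support, a diagonal-dominance shift to enforce positive definiteness, numpy as the toolkit, and the name synthetic_instance are our choices), not the exact generator of Sect. 5.1, which is not reproduced here.

import numpy as np

def synthetic_instance(p, k_true, n, seed=0):
    """Draw a sparse precision matrix with k_true off-diagonal (upper-triangular)
    nonzeros and n Gaussian samples; return the ground truth and the sample covariance."""
    rng = np.random.default_rng(seed)
    Theta0 = np.zeros((p, p))
    rows, cols = np.triu_indices(p, k=1)
    idx = rng.choice(rows.size, size=k_true, replace=False)
    Theta0[rows[idx], cols[idx]] = rng.uniform(0.5, 1.0, k_true) * rng.choice([-1.0, 1.0], k_true)
    Theta0 += Theta0.T
    Theta0 += np.diag(np.abs(Theta0).sum(axis=1) + 1.0)   # diagonal dominance => positive definite
    X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta0), size=n)
    Xc = X - X.mean(axis=0)
    Sigma_bar = Xc.T @ Xc / n   # empirical covariance with denominator n (cf. footnote 1)
    return Theta0, Sigma_bar

Theta0, Sigma_bar = synthetic_instance(p=30, k_true=5, n=100)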
In general, the algorithm provides an optimal solution in a matter of seconds, and a certificate of optimality in seconds or minutes, even for p in the 100s. Verifying optimality is significantly faster when the sample size n is larger, because the sparsity pattern of the underlying matrix is easier to recover; finding the optimal solution, however, is much less affected by n. As p or k increases, finding the optimal solution also does not change significantly, but verifying optimality generally becomes significantly harder. Similar observations have been made for mixed-integer formulations of the best subset selection problem in linear and logistic regression [9]. We also observe that changes in k have a more substantial impact on the runtime than changes in n or p, especially when p is large. Finally, Meinshausen and Bühlmann’s approximation is used as a warm start, and we observe that it is often optimal, especially when n / p is large.
Thus, the cutting-plane algorithm in general provides an optimal or near-optimal solution quickly, but the time needed to verify optimality depends strongly on p, k, and n. Nonetheless, we observe that optimality of solutions can be verified in a matter of minutes for p in the 100s and k in the 10s.
Additional comparisons on statistical performance
We report here additional results from the experiments conducted in Sect. 5.1.
1.1 Comparisons for varying sample sizes n / p
1.2 Comparisons for varying sparsity levels t
1.3 Comparisons for varying dimensions p
About this article
Cite this article
Bertsimas, D., Lamperski, J. & Pauphilet, J. Certifiably optimal sparse inverse covariance estimation. Math. Program. 184, 491–530 (2020). https://doi.org/10.1007/s10107-019-01419-7
Keywords
- Sparse inverse covariance estimation
- Semidefinite optimization
- Mixed-integer optimization
- Glasso
- Maximum likelihood estimation