Abstract
We propose a multistart algorithm to identify all local minima of a constrained, nonconvex stochastic optimization problem. The algorithm uniformly samples points in the domain and then starts a local stochastic optimization run from any point that is the “probabilistically best” point in its neighborhood. Under certain conditions, our algorithm is shown to asymptotically identify all local optima with high probability, even though it almost surely starts only finitely many local stochastic optimization runs. We demonstrate the performance of an implementation of our algorithm on nonconvex stochastic optimization problems, including identifying optimal variational parameters for the quantum approximate optimization algorithm.
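The sampling-and-start rule described above admits a compact illustration. The following Python sketch is hypothetical (it is not the authors' implementation); the noisy test objective, the neighborhood radius `r`, the replication count, and the confidence-interval comparison inside `probabilistically_best` are all placeholder choices. It only shows how uniformly sampled points can be filtered down to a set of start points for local stochastic optimization runs.

```python
# Hypothetical multistart sketch (not the paper's implementation): uniformly sample
# the domain, estimate the noisy objective at each sample, and start a local
# stochastic optimization run only from points that are "probabilistically best"
# in their neighborhood, i.e., no nearby sample is statistically significantly better.
import numpy as np


def noisy_f(x, rng, n_reps=30):
    """Sample mean and standard error of a noisy stand-in objective at x."""
    vals = np.sum((x - 0.5) ** 2) + 0.1 * rng.standard_normal(n_reps)
    return vals.mean(), vals.std(ddof=1) / np.sqrt(n_reps)


def probabilistically_best(i, X, means, sems, r):
    """True if no sampled neighbor of X[i] within radius r is significantly better."""
    for j in range(len(X)):
        if j != i and np.linalg.norm(X[j] - X[i]) <= r:
            # Placeholder rule: neighbor j beats point i when j's upper confidence
            # bound lies below i's lower confidence bound (minimization).
            if means[j] + 2 * sems[j] < means[i] - 2 * sems[i]:
                return False
    return True


def multistart_starts(dim=2, n_samples=200, r=0.15, seed=0):
    """Return the sampled points from which local runs would be started."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, dim))  # uniform samples in [0, 1]^d
    stats = [noisy_f(x, rng) for x in X]
    means = np.array([m for m, _ in stats])
    sems = np.array([s for _, s in stats])
    return [X[i] for i in range(n_samples)
            if probabilistically_best(i, X, means, sems, r)]


if __name__ == "__main__":
    starts = multistart_starts()
    print(f"{len(starts)} local stochastic optimization runs would be started")
```

In a full implementation, the number of samples, the neighborhood radius, and the statistical comparison would be tied to the theory (for example, a radius \(r_k\) shrinking with the sample size, as in the appendix).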
Data availability
No specific datasets were used in this study; all analyses and results are based on simulated data. Accordingly, no data will be made available.
Change history
02 July 2024
A Correction to this paper has been published: https://doi.org/10.1007/s11590-024-02135-8
References
Gheribi, A.E., Robelin, C., Digabel, S.L., Audet, C., Pelton, A.D.: Calculating all local minima on liquidus surfaces using the FactSage software and databases and the mesh adaptive direct search algorithm. J. Chem. Thermodyn. 43(9), 1323–1330 (2011). https://doi.org/10.1016/j.jct.2011.03.021
Adamcik, J., Mezzenga, R.: Amyloid polymorphism in the protein folding and aggregation energy landscape. Angew. Chem. Int. Ed. 57(28), 8370–8382 (2018). https://doi.org/10.1002/anie.201713416
Floudas, C., Klepeis, J., Pardalos, P.: Global optimization approaches in protein folding and peptide docking. In: DIMACS Series in Discrete Mathematics and Theoretical Computer Science vol. 47, pp. 141–171. Am. Math. Soc. (1999). https://doi.org/10.1090/dimacs/047/07
Li, Z., Scheraga, H.A.: Monte Carlo-minimization approach to the multiple-minima problem in protein folding. Proc. Natl. Acad. Sci. U.S.A. 84(19), 6611–6615 (1987). https://doi.org/10.1073/pnas.84.19.6611
di Serafino, D., Gomez, S., Milano, L., Riccio, F., Toraldo, G.: A genetic algorithm for a global optimization problem arising in the detection of gravitational waves. J. Global Optim. 48(1), 41–55 (2010). https://doi.org/10.1007/s10898-010-9525-9
Krishnamoorthy, M., Schulz, H., Ju, X., Wang, W., Leyffer, S., Marshall, Z., Mrenna, S., Müller, J., Kowalkowski, J.B.: Apprentice for event generator tuning. EPJ Web of Conf. 251, 03060 (2021). https://doi.org/10.1051/epjconf/202125103060
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18(1), 6765–6816 (2017)
Maclaurin, D., Duvenaud, D., Adams, R.: Gradient-based hyperparameter optimization through reversible learning. In: International Conference on Machine Learning, pp. 2113–2122 (2015)
Farhi, E., Goldstone, J., Gutmann, S.: A quantum approximate optimization algorithm. arXiv:1411.4028 (2014). https://doi.org/10.48550/arXiv.1411.4028
Farhi, E., Goldstone, J., Gutmann, S.: A quantum approximate optimization algorithm applied to a bounded occurrence constraint problem. arXiv:1412.6062 (2014). https://doi.org/10.48550/arXiv.1412.6062
Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods part I: Clustering methods. Math. Program. 39(1), 27–56 (1987). https://doi.org/10.1007/bf02592070
Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods part II: Multi level methods. Math. Program. 39(1), 57–78 (1987). https://doi.org/10.1007/bf02592071
Shashaani, S., Hashemi, F.S., Pasupathy, R.: ASTRO-DF: A class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM J. Optim. 28(4), 3145–3176 (2018). https://doi.org/10.1137/15m1042425
Larson, J., Wild, S.M.: Asynchronously parallel optimization solver for finding multiple minima. Math. Program. Comput. 10(3), 303–332 (2018). https://doi.org/10.1007/s12532-017-0131-4
Kushner, H.J.: A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise. J. Basic Eng. 86(1), 97 (1964). https://doi.org/10.1115/1.3653121
Frazier, P.I.: Bayesian optimization. In: Recent Advances in Optimization and Modeling of Contemporary Problems, pp. 255–278. INFORMS TutORials in Operations Research (2018). https://doi.org/10.1287/educ.2018.0188
Mathesen, L., Pedrielli, G., Ng, S.H., Zabinsky, Z.B.: Stochastic optimization with adaptive restart: A framework for integrated local and global learning. J. Global Optim. 79(1), 87–110 (2020). https://doi.org/10.1007/s10898-020-00937-5
Locatelli, M.: Relaxing the assumptions of the multilevel single linkage algorithm. J. Global Optim. 13(1), 25–42 (1998). https://doi.org/10.1023/a:1008246031222
Nguyen, V., Rana, S., Gupta, S., Li, C., Venkatesh, S.: Budgeted batch Bayesian optimization with unknown batch sizes. arXiv:1703.04842 (2017). https://doi.org/10.48550/arXiv.1703.04842
Wessing, S., Preuss, M.: The true destination of EGO is multi-local optimization. In: IEEE Latin American Conference on Computational Intelligence (2017). https://doi.org/10.1109/la-cci.2017.8285677
Krityakierne, T., Shoemaker, C.A.: SOMS: SurrOgate MultiStart algorithm for use with nonlinear programming for global optimization. Int. Trans. Oper. Res. 24(5), 1139–1172 (2015). https://doi.org/10.1111/itor.12190
Peri, D., Tinti, F.: A multistart gradient-based algorithm with surrogate model for global optimization. Commun. Appl. Ind. Math. 3(1) (2012). https://doi.org/10.1685/journal.caim.393
Regis, R.G., Shoemaker, C.A.: A quasi-multistart framework for global optimization of expensive functions using response surface models. J. Global Optim. 56(4), 1719–1753 (2012). https://doi.org/10.1007/s10898-012-9940-1
Žilinskas, A., Gillard, J., Scammell, M., Zhigljavsky, A.: Multistart with early termination of descents. J. Global Optim. 79(2), 447–462 (2019). https://doi.org/10.1007/s10898-019-00814-w
Zheng, R., Li, M.: Multistart global optimization with tunnelling and an evolutionary strategy supervised by a martingale. Eng. Optim., 1–19 (2021). https://doi.org/10.1080/0305215x.2021.1940989
Jin, C., Liu, L.T., Ge, R., Jordan, M.I.: On the local minima of the empirical risk. In: Advances in Neural Information Processing Systems, pp. 4896–4905 (2018)
Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155(1–2), 267–305 (2014). https://doi.org/10.1007/s10107-014-0846-1
Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2015). https://doi.org/10.1007/s10107-015-0871-8
Hu, W., Li, C.J., Li, L., Liu, J.-G.: On the diffusion approximation of nonconvex stochastic gradient descent. Ann. Math. Sci. Appl. 4(1), 3–32 (2019). https://doi.org/10.4310/amsa.2019.v4.n1.a1
Durrett, R.: Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press (2010). https://doi.org/10.1017/CBO9780511779398
Brooks, S.H.: A discussion of random methods for seeking maxima. Oper. Res. 6(2), 244–251 (1958). https://doi.org/10.1287/opre.6.2.244
Moré, J.J., Wild, S.M.: Benchmarking derivative-free optimization algorithms. SIAM J. Optim. 20(1), 172–191 (2009). https://doi.org/10.1137/080724083
Forrester, A., Sobester, A., Keane, A.: Engineering Design via Surrogate Modelling, pp. 195–203. John Wiley & Sons, Ltd (2008). https://doi.org/10.1002/9780470770801.app1
Zhou, L., Wang, S.-T., Choi, S., Pichler, H., Lukin, M.D.: Quantum approximate optimization algorithm: Performance, mechanism, and implementation on near-term devices. Phys. Rev. X 10(2) (2020). https://doi.org/10.1103/physrevx.10.021067
Shaydulin, R., Safro, I., Larson, J.: Multistart methods for quantum approximate optimization. In: Proceedings of the IEEE High Performance Extreme Computing Conference (2019). https://doi.org/10.1109/hpec.2019.8916288
Cartis, C., Fiala, J., Marteau, B., Roberts, L.: Improving the flexibility and robustness of model-based derivative-free optimization solvers. ACM Trans. Math. Softw. 45(3), 1–41 (2019). https://doi.org/10.1145/3338517
Huyer, W., Neumaier, A.: SNOBFIT - stable noisy optimization by branch and fit. ACM Trans. Math. Softw. 35(2), 1–25 (2008). https://doi.org/10.1145/1377612.1377613
Nogueira, F.: Bayesian Optimization: Open source constrained global optimization tool for Python (2014–). https://github.com/fmfn/BayesianOptimization
Acknowledgements
This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Accelerated Research for Quantum Computing program, under contract number DE-AC02-06CH11357.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised due to change in the copyright line of the article.
Appendices
Proofs
Below are the proofs of Lemma 1 and Lemma 2.
Proof of Lemma 1
First, consider the sets
Now, observe that using Chebyshev’s inequality and Assumption 2, for all \(x \in \mathcal {G}(a;r;n)\),
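A plausible form of this Chebyshev step (presumably Eq. (23), whose display is not reproduced here), assuming \({\hat{f}}_n(x)-{\hat{f}}_n(a)\) has mean \(f(x)-f(a)\) and using the definition \({\epsilon }_n(x;a) = \sqrt{{\text {Var}}_{\xi }[ {\hat{f}}_n(x)-{\hat{f}}_n(a)]/\beta }\) recalled later in the proof, is the following sketch:
\[
  \mathbb {P}\Bigl( \bigl|{\hat{f}}_n(x)-{\hat{f}}_n(a) - [f(x)-f(a)]\bigr| \ge {\epsilon }_n(x;a) \Bigr)
  \;\le\; \frac{{\text {Var}}_{\xi }\bigl[{\hat{f}}_n(x)-{\hat{f}}_n(a)\bigr]}{{\epsilon }_n(x;a)^2}
  \;=\; \beta .
\]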
Since \(x \in \mathcal {G}(a;r;n)\) implies \({\epsilon }_n(x;a)> f(x)- f(a)\), it follows that
Next, recall the definition of \(\mathcal {A}\left( a;r;n;\beta \right)\), and observe that Eqs. (23) and (24) together imply that
for all \(x \in \mathcal {G}(a;r;n)\). Since \({\epsilon }_n(x;a) = \sqrt{\frac{{\text {Var}}_{\xi }\left[ {\hat{f}}_n(x)-{\hat{f}}_n(a) \right] }{\beta }}\), Eq. (25) implies that
Observe that \(\mathcal {C}(a;r)\subseteq \mathcal {G}(a;r;n)\) and \(\mathcal {A}\left( a;r;n;\beta \right) \cap \mathcal {G}(a;r;n) \subseteq \mathcal {A}\left( a;r;n;\beta \right)\); therefore, it follows from (26) that
Recall that \(\mathcal {T}_{\omega }\) is the union of \(\omega\)-radius balls centered at stationary points of f in \(\mathcal {D}\). Now consider \(a \in \mathcal {D}{\setminus } (\mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })\). Define the set
where \(\rho\) is the largest eigenvalue of \(\nabla ^2 f(x)\) over all \(x \in \mathcal {D}\). Using the Taylor expansion of f around a, we know that for all \(x\in \mathcal {D}\) with \(\Vert x-a\Vert \le r\), there exists a \(\theta \in [0,1]\) such that
For ease of reference, let \(H = \nabla ^2 f(a + \theta (x-a) )\) and \(v = x-a\).
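The bound stated next is the standard Rayleigh-quotient inequality; a one-line check, writing the spectral decomposition \(H = Q \Lambda Q^T\) with orthogonal \(Q\) and eigenvalues at most \(\rho\) (by the definition of \(\rho\) above):
\[
  v^T H v = (Q^T v)^T \Lambda \,(Q^T v) \;\le\; \rho \,\Vert Q^T v\Vert ^2 \;=\; \rho \, v^T v .
\]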
Since f is twice continuously differentiable by Assumption 1.1, its Hessian is real and symmetric; because \(\rho\) bounds its largest eigenvalue on \(\mathcal {D}\), we have \(v^T H v \le \rho \, v^Tv\) for all \(v \in \mathbb {R}^d\). It follows from (28) that
Equation (29) implies that \(\mathcal {E}(a;r;\rho ) \subseteq \mathcal {C}(a;r) \subseteq \mathcal {A}\left( a;r;n;\beta \right)\). For a given \(a\in \mathcal {D}\), \(m( \mathcal {B}\left( a;r \right) ) \le r^d \frac{\pi ^{d/2} }{ \mathrm {\Gamma }(1+ d/2)}\), where \(\mathrm {\Gamma }(\cdot )\) is the gamma function. Now using the lower bound on \(m(\mathcal {E}(a;r;\rho ))\) derived in Lemma 7 of [11], we obtain
where \(p = \min \{ \Vert \nabla f(a)\Vert : a \in \mathcal {D}{\setminus } (\mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega }) \}\) for any \(k \in \mathbb {N}\). (Note that \(p > 0\) by construction.) Therefore, (30) implies that for all \(a \in \mathcal {D}{\setminus } (\mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })\)
\(\square\)
Proof of Lemma 2
For any \(a\in S_k\setminus ( \mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })\), using the fact that \(S_k\) is sampled uniformly and that \(\{r_k\}\) is a vanishing sequence, Lemma 1 implies that there exists a \(k_0\ge 1\) such that for all \(k\ge k_0\),
where the first inequality bounds the probability that, for some sampled point \(a \in S_k\setminus ( \mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })\), none of the remaining sampled points lies in \(\mathcal {A}\left( a;r_k;n;\beta \right)\) (see (S1) of Table 1). The second inequality follows from Boole’s inequality, and the last inequality in (31) is due to Lemma 1 and the fact that \(|S_k\setminus ( \mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })|< |S_k|\).
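A plausible shape of this chain of inequalities (a sketch only; the exact constants appear in the original display (31)), writing \(\alpha _k\) as a placeholder for the lower bound provided by Lemma 1 on the probability that an independent uniform sample falls in \(\mathcal {A}\left( a;r_k;n;\beta \right)\), is
\[
  \mathbb {P}\Bigl( \exists \, a \in S_k\setminus ( \mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega }) :
  S_k \cap \mathcal {A}\left( a;r_k;n;\beta \right) \setminus \{a\} = \emptyset \Bigr)
  \;\le\; \sum _{a \in S_k\setminus ( \mathcal {Q}_{\tau }\cup \mathcal {T}_{\omega })}
  \bigl( 1 - \alpha _k \bigr)^{|S_k|-1}
  \;\le\; |S_k|\, \bigl( 1 - \alpha _k \bigr)^{|S_k|-1}.
\]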
Recall that for any \(a\in \mathcal {D}\) and \(r_k\) sufficiently small such that \(\mathcal {B}\left( a;r_k \right) \subseteq \mathcal {D}\), \(m( \mathcal {B}\left( a;r_k \right) ) = r_k^d \frac{\pi ^{d/2} }{ \mathrm {\Gamma }(1+ d/2)}\). Combining this fact with the choice of sequence \(r_k = \frac{1}{\sqrt{\pi }} \root d \of {\mathrm {\Gamma }(1+{d/2})\, m(\mathcal {D})\, \sigma \frac{\log |S_k|}{|S_k|}}\) for any \(\sigma >0\), we have
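A short worked substitution, using only the two formulas above (and assuming \(\mathcal {B}\left( a;r_k \right) \subseteq \mathcal {D}\), as stated), shows why this \(r_k\) is chosen: under this choice, each ball has uniform-sampling probability \(\sigma \frac{\log |S_k|}{|S_k|}\),
\[
  m\bigl( \mathcal {B}\left( a;r_k \right) \bigr)
  = r_k^d \,\frac{\pi ^{d/2}}{\mathrm {\Gamma }(1+ d/2)}
  = \frac{1}{\pi ^{d/2}}\,\mathrm {\Gamma }(1+ d/2)\, m(\mathcal {D})\, \sigma \frac{\log |S_k|}{|S_k|}
    \cdot \frac{\pi ^{d/2}}{\mathrm {\Gamma }(1+ d/2)}
  = m(\mathcal {D})\, \sigma \frac{\log |S_k|}{|S_k|},
\]
so that \(m( \mathcal {B}\left( a;r_k \right) )/m(\mathcal {D}) = \sigma \frac{\log |S_k|}{|S_k|}\).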
Since \(|S_k|\rightarrow \infty\) as \(k\rightarrow \infty\), the fact that \(e^{-\frac{\sigma }{2} \frac{\log |S_k|}{|S_k|}}\ge 1-\frac{\sigma }{2} \frac{\log |S_k|}{|S_k|}\) implies
\(\square\)
Shekel with \(d=6\) and \(d=8\)
MANSO with heterogeneous variance
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jaiswal, P., Larson, J. Multistart algorithm for identifying all optima of nonconvex stochastic functions. Optim Lett 18, 1335–1360 (2024). https://doi.org/10.1007/s11590-024-02114-z