Abstract
Multi-level methods are widely used for the solution of large-scale problems because of their computational advantages and their exploitation of the complementarity between the involved sub-problems. After re-interpreting multi-level methods from a block-coordinate point of view, we propose a multi-level algorithm for the solution of nonlinear optimization problems and analyze its evaluation complexity. We apply it to the solution of partial differential equations using physics-informed neural networks (PINNs) and consider two different types of neural architectures, a generic feedforward network and a frequency-aware network. We show that our approach is particularly effective when coupled with these specialized architectures and that this coupling results in better solutions and significant computational savings.
Data availability
The tests presented in this manuscript do not rely on any particular dataset, and they can be reproduced from the information provided in the numerical results sections.
Notes
This assumption is standard in complexity analysis and prevents arbitrarily large changes in the gradient over short steps. In particular, it prevents the occurrence of infinite gradients at iterates, which would cause the algorithm to fail. It is more general than assuming bounded gradients, as can be seen by considering convex quadratics. See also the discussion at the end of the present section.
Technically, the objective function must be bounded below, have a bounded directional subgradient map and a finite nonconvexity modulus.
References
Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial, 2nd edn. SIAM, Philadelphia, USA (2000)
Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Elsevier, Amsterdam, The Netherlands (2001)
Nash, S.G.: A multigrid approach to discretized optimization problems. Optim. Methods Softw. 14, 99–116 (2000)
Gratton, S., Sartenaer, A., Toint, Ph.L.: On recursive multiscale trust-region algorithms for unconstrained minimization. In: Jarre, F., Lemaréchal, C., Zowe, J. (eds.) Oberwolfach Reports: Optimization and Applications (2005)
Gratton, S., Sartenaer, A., Toint, Ph.L.: Second-order convergence properties of trust-region methods using incomplete curvature information, with an application to multigrid optimization. J. Comput. Appl. Math. 24(6), 676–692 (2006)
Gratton, S., Sartenaer, A., Toint, Ph.L.: Recursive trust-region methods for multiscale nonlinear optimization. SIAM J. Optim. 19(1), 414–444 (2008)
Calandra, H., Gratton, S., Riccietti, E., Vasseur, X.: On high-order multilevel optimization strategies. SIAM J. Optim. 31(1), 307–330 (2021)
Kopaničáková, A., Krause, R.: Globally convergent multilevel training of deep residual networks. SIAM J. Sci. Comput. 45, S254–S280 (2022)
Haber, E., Ruthotto, L., Holtham, E., Jun, S.-H.: Learning across scales—multiscale methods for convolution neural networks. In: Thirty-Second AAAI Conference on Artificial Intelligence, pp. 3142–3148 (2018)
Wu, C.-Y., Girshick, R., He, K., Feichtenhofer, Ch., Krahenbuhl, Ph.: A multigrid method for efficiently training video models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 153–162 (2020)
von Planta, C., Kopaničáková, A., Krause, R.: Training of deep residual networks with stochastic MG/OPT. arXiv:2108.04052 (2021)
Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse Prob. 34(1), 014004 (2017)
Ke, T.-W., Maire, M., Yu, S.X.: Multigrid neural architectures. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6665–6673 (2017)
Chung, J., Ahn, S., Bengio, Y.: Hierarchical multiscale recurrent neural networks. arXiv:1609.01704 (2016)
Raissi, M., Perdikaris, P., Karniadakis, G.E.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019)
Mishra, S., Molinaro, R.: Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs. IMA J. Numer. Anal. 42, 981–1022 (2022)
Shin, Y.: On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. Commun. Comput. Phys. 28(5), 2042–2074 (2020)
Cai, S., Mao, Z., Wang, Z., Yin, M., Karniadakis, G.E.: Physics-informed neural networks (PINNs) for fluid mechanics: a review. Acta. Mech. Sin. 37(12), 1727–1738 (2021)
Wang, S., Wang, H., Perdikaris, P.: On the eigenvector bias of Fourier feature networks: from regression to solving multi-scale PDEs with physics-informed neural networks. Comput. Methods Appl. Mech. Eng. 384, 113938 (2021)
Xu, Z.-Q.J.: Frequency principle: Fourier analysis sheds light on deep neural networks. Commun. Comput. Phys. 28(5), 1746–1767 (2020)
Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., Courville, A.: On the spectral bias of neural networks. In: International Conference on Machine Learning, pp. 5301–5310 (2019)
Ronen, B., Jacobs, D., Kasten, Y., Kritchman, S.: The convergence rate of neural networks for learned functions of different frequencies. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 4761–4771 (2019)
Wang, S., Yu, X., Perdikaris, P.: When and why PINNs fail to train: a neural tangent kernel perspective. J. Comput. Phys. 449, 110768 (2022)
Farhani, G., Kazachek, A., Wang, B.: Momentum diminishes the effect of spectral bias in physics-informed neural networks. arXiv:2206.14862 (2022)
Liu, Z., Cai, W., Xu, Z.-Q.J.: Multi-scale deep neural network (MscaleDNN) for solving Poisson-Boltzmann equation in complex domains. Commun. Comput. Phys. 28(5), 1970–2001 (2020)
Li, X.-A.: A multi-scale DNN algorithm for nonlinear elliptic equations with multiple scales. Commun. Comput. Phys. 28(5), 1886–1906 (2020)
Hackbusch, W.: Multi-Grid Methods and Applications. Springer Series in Computational Mathematics, vol. 4. Springer, Berlin, Heidelberg, New York (1995)
Borzi, A., Kunisch, K.: A globalisation strategy for the multigrid solution of elliptic optimal control problems. Optim. Methods Softw. 21(3), 445–459 (2006)
Gross, Ch., Krause, R.: On the globalization of ASPIN employing trust-region control strategies - convergence analysis and numerical examples. Technical Report 2011-03, Università della Svizzera italiana, Lugano (2011)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)
Nesterov, Y.: Efficiency of coordinate descent methods for huge-scale optimization problems. SIAM J. Optim. 22, 341–362 (2012)
Powell, M.J.D.: On search directions for minimization algorithms. Math. Program. 4, 193–201 (1973)
Wright, S.J.: Coordinate descent algorithms. Math. Program., Series A 151(1), 3–34 (2015)
Amaral, V.S., Andreani, R., Birgin, E.G., Marcondes, D.S., Martínez, J.M.: On complexity and convergence of high-order coordinate descent algorithms for smooth nonconvex box-constrained minimization. J. Global Optim. 84(3), 527–561 (2022)
Cartis, C., Gould, N.I.M., Toint, Ph.L.: Evaluation Complexity of Algorithms for Nonconvex Optimization. MOS-SIAM Series on Optimization, vol. 30. SIAM, Philadelphia, USA (2022)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Applied Optimization. Kluwer Academic Publishers, Dordrecht, The Netherlands (2004)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Bellavia, S., Della Santa, F., Papini, A.: Alternate training of shared and task-specific parameters for multi-task neural networks. arXiv:2312.16340 (2023)
Jordan, M.I., Lin, T., Zampetakis, M.: On the complexity of deterministic nonsmooth and nonconvex optimization. arXiv:2209.12403 (2022)
Tian, L., So, A.M.-C.: Computing Goldstein \((\epsilon,\delta)\)-stationary points of Lipschitz functions in \(\mathcal{O}(\epsilon^{-3}\delta^{-1})\) iterations via random conic perturbation. arXiv:2112.09002 (2021)
Zhang, J., Lin, H., Jegelka, S., Sra, S., Jadbabaie, A.: Complexity of finding stationary points of nonconvex nonsmooth functions. In: Proceedings of the 37th International Conference on Machine Learning, PMLR, vol. 119, pp. 11173–11182 (2020)
Kong, S., Lewis, A.S.: The cost of nonconvexity in nonconvex nonsmooth optimization. arXiv:2210.00652 (2022)
Goldstein, A.A.: Optimization of Lipschitz continuous functions. Math. Program., Series A 13(1), 14–22 (1977)
Rapaka, N.R., Samtaney, R.: An efficient Poisson solver for complex embedded boundary domains using the multi-grid and fast multipole methods. J. Comput. Phys. 410, 109387 (2020)
Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (ICLR) (2015)
Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. Adv. Neural Inf. Process. Syst. 33, 7537–7547 (2020)
Li, X.-A., Xu, Z.-Q.J., Zhang, L.: Subspace decomposition based DNN algorithm for elliptic type multi-scale PDEs. arXiv:2112.06660 (2021)
E, W., Yu, B.: The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Commun. Math. Stat. 6, 1–12 (2018)
Funding
Work partially supported by the 3IA Artificial and Natural Intelligence Toulouse Institute (ANITI) under the French "Investing for the Future - PIA3" program, grant agreement ANR-19-PI3A-0004, and by the GDR ISIS project MOMIGS.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix
Proofs of convergence theorems
1.1 Proof of Theorem 3.1
Consider iteration k of ML-BCD and suppose that this iteration occurs in the minimization of subproblem \(\mathcal{A}_i\) on the variables given by \(x_i\). From the Lipschitz continuity of the gradient and (15), we obtain that
Since the minimization of subproblem \(\mathcal{A}_i\) has not yet terminated, we must have that \(\Vert \nabla ^1_{x_i}f(x_k)\Vert \ge \frac{\epsilon }{\sqrt{2}}\), and thus
Summing this inequality on all iterations, we deduce that, for all k before termination of the ML-BCD algorithm,
This implies that the algorithm cannot take more than
iterations before it terminates, proving the theorem with
\(\square \)
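For the reader's convenience, the chain of inequalities used in this proof can be sketched in generic form. The display below is an illustrative reconstruction only: it assumes a gradient step of length \(1/L\) on the selected block, so the exact constants produced by (15) may differ from those shown here. Under this assumption, the descent lemma gives
\[
f(x_k)-f(x_{k+1}) \;\ge\; \frac{1}{2L}\,\Vert \nabla^1_{x_i} f(x_k)\Vert^2 \;\ge\; \frac{\epsilon^2}{4L},
\]
so that summing over the first \(k\) iterations yields \(f(x_0)-f_{\rm low}\ge k\,\epsilon^2/(4L)\), where \(f_{\rm low}\) denotes a lower bound on \(f\), and hence \(k\le 4L\,(f(x_0)-f_{\rm low})\,\epsilon^{-2}\), a bound of the \(\mathcal{O}(\epsilon^{-2})\) type established in the theorem.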
1.2 Proof of Theorem 3.2
Consider iteration k of ML-BCD and suppose that this iteration occurs in the minimization of subproblem \(\mathcal{A}_i\) on the variables given by \(x_i\). From the Lipschitz continuity of the gradient and the SG update (16) of the selected block, we obtain:
Using assumption (17) and the bound on \(\alpha \) we deduce that
and, if k is a major iteration, we have that
Taking the total expectation gives that
Consider now a major iteration k and denote by \(x_{k+\ell }\) (for \(\ell \in \{1,\dots ,\ell _k\}\) and \(\ell _k\le {\bar{\ell }}\)) the iterates produced on the selected subproblem. We deduce from (28) that
If the minimization of the subproblem is not stopped, this means that \(\Vert g_i(x_{k+\ell },\xi _{k+\ell })\Vert \ge c_\epsilon \), so that \(\mathbb {E}_{\xi _{k+\ell }}(\Vert g_{i}(x_{k+\ell },\xi _{k+\ell })\Vert ^2)\ge c_\epsilon ^2\). From (17), we have that
and, since k is a major iteration, from the condition for the choice of the blocks we have \( \Vert \nabla ^1_{x_i}f(x_k)\Vert >\tau \Vert \nabla ^1_x f(x_k)\Vert . \) Thus
Taking into account the fact that f is bounded from below, summing over K major iterations and using the assumption \(c_\epsilon >2\sigma _1^2\), we obtain:
Thus
where the last inequality follows from the fact that \({\bar{\ell }} \sigma _1^2 \left( \frac{\alpha L}{\mu }- \frac{1}{\sigma _2^2}\right) <0\) by the assumption on \(\alpha \). Taking \({\bar{\ell }}\) large, we can thus improve the bound, as long as \(\epsilon \) remains larger than the variance. \(\square \)
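In the same spirit, the expected-decrease mechanism exploited at the start of this proof can be sketched generically. The display below is illustrative only: it assumes that the stochastic block gradient satisfies \(\mathbb{E}_{\xi_k}\big[\nabla^1_{x_i}f(x_k)^\top g_i(x_k,\xi_k)\big]\ge \mu\,\Vert\nabla^1_{x_i}f(x_k)\Vert^2\) and \(\mathbb{E}_{\xi_k}\big[\Vert g_i(x_k,\xi_k)\Vert^2\big]\le \sigma_1^2+\sigma_2^2\,\Vert\nabla^1_{x_i}f(x_k)\Vert^2\), which may differ from the precise form of (17). Combining the descent lemma with the update \(x_{k+1}=x_k-\alpha\, g_i(x_k,\xi_k)\) then gives
\[
\mathbb{E}_{\xi_k}\big[f(x_{k+1})\big]\;\le\; f(x_k)\;-\;\alpha\Big(\mu-\frac{\alpha L \sigma_2^2}{2}\Big)\Vert\nabla^1_{x_i}f(x_k)\Vert^2\;+\;\frac{\alpha^2 L}{2}\,\sigma_1^2,
\]
so that, for \(\alpha\) small enough, the expected decrease is proportional to \(\Vert\nabla^1_{x_i}f(x_k)\Vert^2\) up to a variance term of order \(\alpha^2\sigma_1^2\); it is this trade-off that drives the summation over major iterations above.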
A Mscale FAML variant and its performance
1.1 Another frequency-aware architecture
This appendix reports on tests conducted using a frequency-aware network architecture based on the MscaleDNN defined in [25], instead of the WWP networks used in Sect. 5. A large part of the success of Mscale networks is due to their wavelet-inspired activation functions. These compactly supported functions have good scale-separation properties and are constructed so that the bandwidth of their Fourier transform increases with their input scaling. Several activation functions have been proposed in [25], but the most efficient one from a practical point of view was introduced in [26] and is given by
where z is the input scaling. This is a continuous function which decays faster and has better localization properties in the frequency domain than the previously proposed sReLU. The amplitude peaks of the differently scaled s2ReLU functions are well separated in the Fourier domain. They are indicated by black stars in Fig. 11, and we observe their expected monotonic growth with the scaling.
As in Sect. 5, we use these functions to create a neural network composed of a sum of subnetworks, each targeting a different frequency range (analogously to Fig. 4). In this modified architecture, the lower-frequency subnetworks still use \(\tanh \) activation functions, while the higher-frequency subnetworks now use (31). For each subnetwork, the first activation function is an SFM function associated with a fixed input scaling, as in standard Mscale networks. The details of this architecture are given in Table 4. Each of the subnetworks is composed of 3 hidden layers of 100 neurons each.
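To make the construction concrete, the following PyTorch-style sketch shows one possible implementation of such a sum of scaled subnetworks. It is illustrative only: the exact form (31) of the activation, the scalings of Table 4 and the sinusoidal stand-in for the SFM first layer are assumptions made here for the sake of the example (s2ReLU is taken as \(\sin(2\pi x)\,\mathrm{ReLU}(x)\,\mathrm{ReLU}(1-x)\), following the MscaleDNN literature), not a verbatim description of the networks used in our experiments.

import math
import torch
import torch.nn as nn


def s2relu(x):
    # Assumed form of the compactly supported activation (31); see the remark above.
    return torch.sin(2 * math.pi * x) * torch.relu(x) * torch.relu(1 - x)


class SubNet(nn.Module):
    # One subnetwork: fixed input scaling, an SFM-like first activation (a sine is used
    # here as a stand-in), then hidden layers with tanh (low frequency) or s2ReLU (high frequency).
    def __init__(self, dim_in, scale, width=100, depth=3, high_freq=False):
        super().__init__()
        self.scale = scale
        self.act = s2relu if high_freq else torch.tanh
        self.layers = nn.ModuleList(
            [nn.Linear(dim_in, width)] + [nn.Linear(width, width) for _ in range(depth - 1)]
        )
        self.out = nn.Linear(width, 1)

    def forward(self, x):
        h = self.scale * x                        # fixed input scaling z
        for i, layer in enumerate(self.layers):
            h = layer(h)
            h = torch.sin(h) if i == 0 else self.act(h)
        return self.out(h)


class MscaleFANet(nn.Module):
    # Sum of subnetworks, each targeting a different frequency range.
    def __init__(self, dim_in=1, scales=(1, 2, 4, 8)):
        super().__init__()
        self.nets = nn.ModuleList([SubNet(dim_in, s, high_freq=(s > 1)) for s in scales])

    def forward(self, x):
        return sum(net(x) for net in self.nets)

The additive output of MscaleFANet reproduces the structure of Fig. 4; the scales (1, 2, 4, 8) are placeholders for the values listed in Table 4.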
Combined with the ML-BCD algorithm (exactly as in Sect. 5), this modified architecture defines an Mscale variant of the FAML approach.
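The following sketch illustrates, again in PyTorch-like code, how the block-coordinate mechanism of ML-BCD can be coupled with this decomposition: the parameters of each subnetwork form one block, a block whose partial gradient is a sufficiently large fraction of the full gradient (in the sense \(\Vert \nabla^1_{x_i} f\Vert > \tau \Vert \nabla^1_x f\Vert\)) is selected at each major iteration, and a bounded number of inner steps is performed on it. The function pinn_loss, the use of Adam as inner solver and the specific constants are illustrative assumptions and do not reproduce the exact algorithmic choices of Sect. 5.

import torch


def block_grad_norm(params):
    # Euclidean norm of the gradient restricted to one block of parameters.
    return torch.sqrt(sum((p.grad ** 2).sum() for p in params if p.grad is not None))


def ml_bcd(model, pinn_loss, tau=0.5, eps=1e-4, max_inner=50, max_major=200):
    blocks = [list(net.parameters()) for net in model.nets]   # one block per subnetwork
    for _ in range(max_major):
        model.zero_grad()
        pinn_loss(model).backward()
        full = block_grad_norm([p for b in blocks for p in b])
        if full < eps:
            break                                             # approximate first-order stationarity
        # Select the block with the largest partial gradient norm; it satisfies
        # ||g_i|| >= ||g|| / sqrt(#blocks) >= tau * ||g|| whenever tau <= 1/sqrt(#blocks),
        # in the spirit of the block-selection rule used in the proofs above.
        i = max(range(len(blocks)), key=lambda j: float(block_grad_norm(blocks[j])))
        inner_opt = torch.optim.Adam(blocks[i], lr=1e-3)
        for _ in range(max_inner):                            # bounded number of inner iterations
            inner_opt.zero_grad()
            pinn_loss(model).backward()
            if block_grad_norm(blocks[i]) < eps / 2 ** 0.5:
                break                                         # inner test: ||g_i|| < eps / sqrt(2)
            inner_opt.step()
    return model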
1.2 Numerical results
We tested this approach using the same experimental setup and methodology as that of Sect. 5 and again compared its performance to that of the standard single-level training applied to the complete network. The results are reported in Table 5.
As was the case when using P-WWP networks, the Mscale FAML variant produced lower losses than the standard training in each case. The differences are particularly significant for a small computational budget, where the improvement amounts to several orders of magnitude; this is also almost always the case for the associated MSE. These results thus remain excellent, despite the fact that our complexity theory does not formally cover the Mscale FAML variant because of the non-smoothness of (31).
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gratton, S., Mercier, V., Riccietti, E. et al. A block-coordinate approach of multi-level optimization with an application to physics-informed neural networks. Comput Optim Appl 89, 385–417 (2024). https://doi.org/10.1007/s10589-024-00597-1