
MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures

Published: 16 October 2024

Abstract

MAGMA (Matrix Algebra for GPU and Multicore Architectures) is a pivotal open-source library in the landscape of GPU-enabled dense and sparse linear algebra computations. With a repertoire of approximately 750 numerical routines across four precisions, MAGMA is deeply ingrained in the DOE software stack, playing a crucial role in high-performance computing. Notable projects such as ExaConstit, HiOP, MARBL, and STRUMPACK, among others, directly harness the capabilities of MAGMA. In addition, the MAGMA development team has been acknowledged multiple times for contributing to the vendors’ numerical software stacks. Looking back over the lifetime of the Exascale Computing Project (ECP), we highlight how MAGMA has adapted to recent changes in modern HPC systems, especially the growing gap between CPU and GPU compute capabilities, as well as the introduction of low-precision arithmetic in modern GPUs. We also describe MAGMA’s direct impact on several ECP projects. Maintaining portable performance across NVIDIA and AMD GPUs, and with current efforts toward supporting Intel GPUs, MAGMA ensures its adaptability and relevance in the ever-evolving landscape of GPU architectures.
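The low-precision arithmetic mentioned in the abstract is typically exploited through mixed-precision iterative refinement: factor the matrix cheaply in low precision, then recover full double-precision accuracy with a few residual-correction steps carried out in high precision. The following NumPy sketch illustrates the idea only; it is not MAGMA code (MAGMA performs the low-precision factorization on the GPU, e.g. using FP16 tensor cores, and reuses the factors across refinement steps).

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b via low-precision factorization + double-precision refinement.

    Illustrative sketch only: np.linalg.solve re-factorizes on each call,
    whereas a real implementation (e.g. LAPACK's dsgesv and its GPU
    counterparts) factors once in low precision and reuses the factors.
    """
    A32 = A.astype(np.float32)  # the "cheap" low-precision factorization
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x           # residual computed in double precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                  # correction step
    return x
```

For well-conditioned systems, each refinement step reduces the error by roughly a factor of κ(A)·ε_fp32, so a handful of iterations recovers double-precision accuracy while the dominant O(n³) work runs at low-precision speed.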



    Published In

    cover image International Journal of High Performance Computing Applications
    International Journal of High Performance Computing Applications  Volume 38, Issue 5
    Sep 2024
    165 pages

    Publisher

    Sage Publications, Inc.

    United States


Author Tags

1. The MAGMA library
2. numerical linear algebra
3. GPU computing
4. performance portability

Qualifiers

• Research-article
