
MAGMA: Enabling exascale performance with accelerated BLAS and LAPACK for diverse GPU architectures

Published: 16 October 2024

Abstract

MAGMA (Matrix Algebra for GPU and Multicore Architectures) is a pivotal open-source library in the landscape of GPU-enabled dense and sparse linear algebra computations. With a repertoire of approximately 750 numerical routines across four precisions, MAGMA is deeply ingrained in the DOE software stack, playing a crucial role in high-performance computing. Notable projects such as ExaConstit, HiOP, MARBL, and STRUMPACK, among others, directly harness the capabilities of MAGMA. In addition, the MAGMA development team has been acknowledged multiple times for contributing to the vendors’ numerical software stacks. Looking back over the lifetime of the Exascale Computing Project (ECP), we highlight how MAGMA has adapted to recent changes in modern HPC systems, especially the growing gap between CPU and GPU compute capabilities, as well as the introduction of low-precision arithmetic in modern GPUs. We also describe MAGMA’s direct impact on several ECP projects. Maintaining portable performance across NVIDIA and AMD GPUs, and with current efforts toward supporting Intel GPUs, MAGMA ensures its adaptability and relevance in the ever-evolving landscape of GPU architectures.
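The low-precision arithmetic mentioned in the abstract is typically exploited through mixed-precision iterative refinement: factor the matrix cheaply in low precision, then recover full double-precision accuracy with a few residual-correction steps carried out in high precision. The following NumPy sketch illustrates the idea only; it is not MAGMA code (MAGMA performs the low-precision factorization on the GPU, e.g. using FP16 tensor cores, and reuses the factors across refinement steps).

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve Ax = b via low-precision factorization + double-precision refinement.

    Illustrative sketch only: np.linalg.solve re-factorizes on each call,
    whereas a real implementation (e.g. LAPACK's dsgesv and its GPU
    counterparts) factors once in low precision and reuses the factors.
    """
    A32 = A.astype(np.float32)  # the "cheap" low-precision factorization
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x           # residual computed in double precision
        d = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += d                  # correction step
    return x
```

For well-conditioned systems, each refinement step reduces the error by roughly a factor of κ(A)·ε_fp32, so a handful of iterations recovers double-precision accuracy while the dominant O(n³) work runs at low-precision speed.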



    Published In

    cover image International Journal of High Performance Computing Applications
    International Journal of High Performance Computing Applications  Volume 38, Issue 5
    Sep 2024
    165 pages

    Publisher

    Sage Publications, Inc.

    United States


Author Tags

1. The MAGMA library
2. numerical linear algebra
3. GPU computing
4. performance portability

Qualifiers

• Research-article
