Abstract
The design of systems exceeding 1 Pflop/s and inching towards 1 Eflop/s has forced a dramatic shift in hardware design. Various physical and engineering constraints have led to the introduction of massive parallelism and functional hybridization through the use of accelerator units. This paradigm change poses a serious challenge for application developers, since the management of multicore proliferation and heterogeneity rests on software, and it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines, using automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
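To make the DAG formulation concrete, the sketch below enumerates the task graph of a tiled Cholesky factorization, the canonical example of this class of algorithms. This is a minimal illustration in plain C, not PLASMA's actual API; the tile count NT is a made-up parameter. Each printed task reads tiles written by earlier tasks, and those read-after-write pairs are exactly the DAG edges a runtime scheduler would track.

/* Illustrative sketch only: enumerate the tasks of a right-looking
 * tiled Cholesky factorization of an NT-by-NT grid of tiles A[m][n].
 * A runtime such as PLASMA's would dispatch these tasks to worker
 * threads as their input tiles become available. */
#include <stdio.h>

#define NT 4  /* number of tile rows/columns (hypothetical) */

int main(void) {
    for (int k = 0; k < NT; k++) {
        /* Factor the diagonal tile; depends on all prior updates to A[k][k]. */
        printf("POTRF(A[%d][%d])\n", k, k);
        for (int m = k + 1; m < NT; m++) {
            /* Triangular solve below the diagonal; depends on POTRF(A[k][k]). */
            printf("TRSM(A[%d][%d], A[%d][%d])\n", k, k, m, k);
        }
        for (int m = k + 1; m < NT; m++) {
            /* Symmetric rank-k update of a trailing diagonal tile;
             * depends on TRSM(A[m][k]). */
            printf("SYRK(A[%d][%d], A[%d][%d])\n", m, k, m, m);
            for (int n = k + 1; n < m; n++) {
                /* Off-diagonal trailing update; depends on the TRSMs that
                 * produced both input tiles. */
                printf("GEMM(A[%d][%d], A[%d][%d], A[%d][%d])\n",
                       m, k, n, k, m, n);
            }
        }
    }
    return 0;
}

Printing the tasks in loop order recovers a valid sequential schedule; a DAG runtime instead executes any task whose inputs are ready, which is where the parallelism comes from.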
Notes
1. In terms of numerical accuracy, the incremental pivoting used in PLASMA’s LU implementation has a higher upper bound on the backward error than the LU with partial pivoting featured in LAPACK and MKL.
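For context, the note can be made precise with the standard Wilkinson-style backward-error bound; the form below is the usual textbook statement, not a figure taken from the PLASMA analysis:

\[
  \hat{L}\hat{U} = A + \Delta A, \qquad
  \|\Delta A\|_\infty \;\le\; c\, n\, \rho_n\, u\, \|A\|_\infty ,
\]

where u is the unit roundoff, c a modest constant, and \rho_n the growth factor. Partial pivoting guarantees \rho_n \le 2^{n-1}, whereas incremental (tile) pivoting only admits a larger bound on \rho_n, hence the weaker guarantee stated in the note.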
Copyright information
© 2012 Springer-Verlag London Limited
Cite this chapter
Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S. (2012). Dense Linear Algebra on Accelerated Multicore Hardware. In: Berry, M., et al. High-Performance Scientific Computing. Springer, London. https://doi.org/10.1007/978-1-4471-2437-5_5
DOI: https://doi.org/10.1007/978-1-4471-2437-5_5
Publisher Name: Springer, London
Print ISBN: 978-1-4471-2436-8
Online ISBN: 978-1-4471-2437-5