
Dense Linear Algebra on Accelerated Multicore Hardware

  • Chapter
High-Performance Scientific Computing

Abstract

The design of systems exceeding 1 Pflop/s and inching towards 1 Eflop/s has forced a dramatic shift in hardware design. Various physical and engineering constraints have resulted in the introduction of massive parallelism and functional hybridization through the use of accelerator units. This paradigm change poses a serious challenge for application developers, as the management of multicore proliferation and heterogeneity now rests with software, and it is reasonable to expect that this situation will not change in the foreseeable future. This chapter presents a methodology for dealing with this issue in three common scenarios. In the context of shared-memory multicore installations, we show how high performance and scalability go hand in hand when well-known linear algebra algorithms are recast in terms of Directed Acyclic Graphs (DAGs), which are then transparently scheduled at runtime inside the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) project. Similarly, Matrix Algebra on GPU and Multicore Architectures (MAGMA) schedules DAG-driven computations on multicore processors and accelerators. Finally, Distributed PLASMA (DPLASMA) takes the approach to distributed-memory machines with the use of automatic dependence analysis and the Directed Acyclic Graph Engine (DAGuE) to deliver high performance at the scale of many thousands of cores.
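To make the DAG formulation concrete, the following is a minimal sequential sketch of the tile Cholesky factorization whose four kernel types (POTRF, TRSM, SYRK, GEMM) form the task graph that PLASMA-style runtimes schedule dynamically. This sketch is illustrative only: it executes the tasks in one valid topological order of the DAG using NumPy kernels, rather than dispatching them through a runtime scheduler as PLASMA, MAGMA, or DAGuE do.

```python
import numpy as np

def tile_cholesky(A, nb):
    """Right-looking tiled Cholesky; overwrites A and returns the lower factor L.

    Each kernel invocation below corresponds to one task (node) in the DAG;
    the loop nest enumerates the tasks in a valid dependency order.
    """
    n = A.shape[0]
    assert n % nb == 0, "illustrative sketch assumes n divisible by tile size"
    p = n // nb
    T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]  # view of tile (i, j)
    for k in range(p):
        # POTRF: factor the diagonal tile A_kk = L_kk L_kk^T
        T(k, k)[:] = np.linalg.cholesky(T(k, k))
        for i in range(k+1, p):
            # TRSM: L_ik = A_ik L_kk^{-T}  (depends on POTRF(k))
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
        for i in range(k+1, p):
            # SYRK: A_ii -= L_ik L_ik^T  (depends on TRSM(i, k))
            T(i, i)[:] -= T(i, k) @ T(i, k).T
            for j in range(k+1, i):
                # GEMM: A_ij -= L_ik L_jk^T  (depends on TRSM(i,k), TRSM(j,k))
                T(i, j)[:] -= T(i, k) @ T(j, k).T
    return np.tril(A)

# Usage: factor a small symmetric positive definite matrix tile by tile.
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
L = tile_cholesky(A.copy(), nb=2)
```

The point of the tile decomposition is that tasks on independent tiles (e.g. the TRSMs of one panel, or GEMMs on disjoint trailing tiles) carry no mutual dependence edges, so a runtime is free to run them concurrently on whatever cores or accelerators are idle.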


Notes

  1.

    In terms of numerical accuracy, the incremental pivoting used in PLASMA’s LU implementation has a higher upper bound on the backward error than the LU with partial pivoting featured in LAPACK and MKL.
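The quantity this note compares is the normwise backward error of the computed factorization. As an illustrative sketch (a textbook LU with partial pivoting written in NumPy, not PLASMA's incremental-pivoting code), one can compute the residual ‖PA − LU‖ / ‖A‖ whose bound the note refers to:

```python
import numpy as np

def lu_partial_pivot(A):
    """Textbook LU with partial (row) pivoting: returns P, L, U with P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Partial pivoting: swap in the row with the largest entry in column k
        piv = k + np.argmax(np.abs(U[k:, k]))
        if piv != k:
            U[[k, piv], k:] = U[[piv, k], k:]
            L[[k, piv], :k] = L[[piv, k], :k]
            P[[k, piv]] = P[[piv, k]]
        # Eliminate below the pivot
        L[k+1:, k] = U[k+1:, k] / U[k, k]
        U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
    return P, L, np.triu(U)

rng = np.random.default_rng(1)
A = rng.standard_normal((64, 64))
P, L, U = lu_partial_pivot(A)
# Normwise backward error of the factorization
err = np.linalg.norm(P @ A - L @ U) / np.linalg.norm(A)
```

Incremental pivoting restricts pivot selection to within a tile (which is what makes the factorization expressible as a tile DAG), and the price is a larger worst-case bound on this residual than partial pivoting's.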


Author information

Correspondence to Jack Dongarra.


Copyright information

© 2012 Springer-Verlag London Limited

About this chapter

Dongarra, J., Kurzak, J., Luszczek, P., Tomov, S. (2012). Dense Linear Algebra on Accelerated Multicore Hardware. In: Berry, M., et al. High-Performance Scientific Computing. Springer, London. https://doi.org/10.1007/978-1-4471-2437-5_5

  • DOI: https://doi.org/10.1007/978-1-4471-2437-5_5

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-2436-8

  • Online ISBN: 978-1-4471-2437-5

  • eBook Packages: Computer Science (R0)
