
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

Published: 01 March 2015

Abstract

We present a review of current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand-alone manycore coprocessors, GPUs, and combinations of these. Of particular interest is the evolution of the programming models for DLA libraries: from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts, PLASMA for multicore CPUs and MAGMA for heterogeneous architectures, as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models, especially with regard to hardware trends and the ease of programming the high-performance numerical software that current applications need. Our goal is to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.


Cited By

  • (2022) A Task-Parallel Runtime for Heterogeneous Multi-node Vector Systems. Parallel and Distributed Computing, Applications and Technologies, pp. 331-343. DOI: 10.1007/978-3-031-29927-8_26. Online publication date: 7-Dec-2022.
  • (2019) Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software, 45(2):1-28. DOI: 10.1145/3267101. Online publication date: 3-May-2019.
  • (2019) DuctTeip. Parallel Computing, 90(C). DOI: 10.1016/j.parco.2019.102582. Online publication date: 1-Dec-2019.


Published In

Supercomputing Frontiers and Innovations: an International Journal, Volume 2, Issue 4
March 2015, 83 pages
ISSN: 2409-6008
EISSN: 2313-8734

Publisher

South Ural State University

Chelyabinsk, Russian Federation


Author Tags

  1. GPU
  2. HPC
  3. Programming models
  4. dense linear algebra
  5. multicore
  6. runtime

Qualifiers

  • Article
