
Parallel Programming Models for Dense Linear Algebra on Heterogeneous Systems

Published: 01 March 2015

Abstract

We present a review of current best practices in parallel programming models for dense linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand-alone manycore coprocessors, GPUs, and combinations of these. Of particular interest is the evolution of the programming models for DLA libraries: from the popular LAPACK and ScaLAPACK libraries to their modernized counterparts, PLASMA for multicore CPUs and MAGMA for heterogeneous architectures, as well as other programming models and libraries. Besides providing insights into the programming techniques of the libraries considered, we outline our view of the current strengths and weaknesses of their programming models, especially with regard to hardware trends and the ease of programming the high-performance numerical software that current applications need. Our goal is to motivate work and future directions for the next generation of parallel programming models for high-performance linear algebra libraries on heterogeneous systems.


Cited By

  • (2022) A Task-Parallel Runtime for Heterogeneous Multi-node Vector Systems. Parallel and Distributed Computing, Applications and Technologies, pp. 331-343. DOI: 10.1007/978-3-031-29927-8_26. Online publication date: 7-Dec-2022.
  • (2019) Batched Triangular Dense Linear Algebra Kernels for Very Small Matrix Sizes on GPUs. ACM Transactions on Mathematical Software, 45(2):1-28. DOI: 10.1145/3267101. Online publication date: 3-May-2019.
  • (2019) DuctTeip. Parallel Computing, 90(C). DOI: 10.1016/j.parco.2019.102582. Online publication date: 1-Dec-2019.


Published In

Supercomputing Frontiers and Innovations: an International Journal, Volume 2, Issue 4
March 2015, 83 pages
ISSN: 2409-6008
EISSN: 2313-8734

Publisher

South Ural State University

Chelyabinsk, Russian Federation


Author Tags

  1. GPU
  2. HPC
  3. Programming models
  4. dense linear algebra
  5. multicore
  6. runtime

Qualifiers

  • Article
