More Web Proxy on the site http://driver.im/

article

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

Authors:

Stanimire Tomov,

Jack DongarraAuthors Info & Claims

Parallel Computing, Volume 36, Issue 12

Pages 645 - 654

https://doi.org/10.1016/j.parco.2010.06.001

Published: 01 December 2010 Publication History

Abstract

We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25x the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the system's hybrid components. The results described in this paper are significant because the HR has not been properly accelerated before on homogeneous multicore architectures, and it plays a significant role in solving non-symmetric eigenvalue problems. Moreover, the ideas from the hybrid HR are used to develop a hybrid tridiagonal reduction algorithm (for symmetric eigenvalue problems) and a bidiagonal reduction algorithm (for singular value decomposition problems). Our approach demonstrates a methodology that streamlines the development of a large and important class of algorithms on modern computer architectures of multicore and GPUs. The new algorithms can be directly used in the software stack that relies on LAPACK.

References

[1]

}}Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D., LAPACK User's Guide. 1999. third ed. SIAM.

[2]

}}M. Baboulin, J. Dongarra, S. Tomov, Some issues in dense linear algebra for multicore and special purpose architectures, in: Proceedings of the International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), Trondheim, Norway, 2008.

[3]

}}G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing Communication in Linear Algebra, Tech. Report, LAPACK Working Note 218, May 2009.

[4]

}}S. Barrachina, M. Castillo, F.D. Igual, R. Mayo, E.S. Quintana-Ortí, Solving Dense Linear Systems on Graphics Processors, Technical Report ICC 02-02-2008, Universidad Jaime I, February, 2008.

Digital Library

[5]

}}C. Bischof C. Van Loan, The WY representation for products of householder matrices, SIAM J. Sci. Stat. Comput. 8(1) (1987) S2-S13, Parallel processing for scientific computing (Norfolk, VA, 1985). MR 88f:65070.

Digital Library

[6]

}}A. Buttari, J. Dongarra, J. Kurzak, J. Langou, S. Tomov, The impact of multicore on math software, in: PARA 2006, Umea, Sweden, 2006.

[7]

}}A. Buttari, J. Langou, J. Kurzak, J. Dongarra, A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, Technical Report UT-CS-07-600, LAPACK Working Note 191, University of Tennessee, 2007.

[8]

}}J. Dongarra, S. Moore, G. Peterson, S. Tomov, J. Allred, V. Natoli, D. Richie, Exploring new architectures in accelerating CFD for air force applications, in: Proceedings of the HPCMP UGC08, July 14-17, 2008.

[9]

}}Fatahalian, K., Sugerman, J. and Hanrahan, P., Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (New York, NY, USA), ACM. pp. 133-137.

[10]

}}Fatica, M., Accelerating linpack with CUDA on heterogenous clusters. In: GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (New York, NY, USA), ACM. pp. 46-51.

[11]

}}Galoppo, N., Govindaraju, N., Henson, M. and Manocha, D., LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing (Washington, DC, USA), IEEE Computer Society. pp. 3

[12]

}}Golub, G.H. and Van Loan, C.F., Matrix Computations. 1989. second ed. Baltimore, MD, USA.

[13]

}}W. Gruener, Larrabee, CUDA and the quest for the free lunch, 08/2008, TGDaily, <http://www.tgdaily.com/content/view/38750/113/>.

[14]

}}Hammarling, S., Sorensen, D. and Dongarra, J., Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. v27. 215-227.

[15]

}}Li, Y., Dongarra, J. and Tomov, S., A note on auto-tuning GEMM for GPUs. In: Proceedings of the 9th ICCS '09 (Baton Rouge, LA), vol. 5544. Springer, Verlag. pp. 884-892.

[16]

}}Van Loan, C.F., Using the Hessenberg Decomposition in Control Theory. 1982. North-Holland, Amsterdam.

[17]

}}NVIDIA, Nvidia Tesla doubles the performance for CUDA developers, Computer Graphics World, 06/30/2008.

[18]

}}NVIDIA, NVIDIA CUDA Programming Guide, Version 2.0, 06/07/2008.

[19]

}}Owens, J., Luebke, D., Govindaraju, N., Harris, M., Krger, J., Lefohn, A. and Purcell, T., A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum. v26 i1. 80-113.

[20]

}}H. Ltaief, S. Tomov, R. Nath, P. Du, J. Dongarra, A Scalable High Performance Cholesky Factorization for Multicore with GPU Accelerators, Tech. Report, LAPACK Working Note 223, November 2009.

[21]

}}E. Ayguadé, R. Badia, F. Igual, J. Labarta, R. Mayo, E. Quintana-Ortí, An extension of the StarSs programming model for platforms with multiple GPUs, in: Proceedings of the Euro-Par '09, Delft, The Netherlands, 2009, pp. 851-862.

[22]

}}Quintana-Ortí, G., Quintana-Ortí, E.S., van de Geijn, R., Van Zee, F.G. and Chan, E., Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans. Math. Softw. v36 i3. 1-26.

[23]

}}Schreiber, R. and Van Loan, C., A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. v10 i1. 53-57.

[24]

}}S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, in press.

[25]

}}V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Proceedings of the SC '08, November 15-21, 2008, Austin, Texas.

[26]

}}V. Volkov, J.W. Demmel, Using GPUs to accelerate linear algebra routines, Poster at PAR lab winter retreat, January 9, 2008, <http://www.eecs.berkeley.edu/~volkov/volkov08-parlab.pdf>.

[27]

}}General-purpose computation using graphics hardware, <http://www.gpgpu.org>.

[28]

}}NVIDIA CUDA ZONE, <http://www.nvidia.com/object/cuda_home.html>.

[29]

}}P. Bientinesi, F. Igual, D. Kressner, E. Quintana-Orti, Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures, Aachen Institute for Computational Engineering Science, RWTH Aachen, AICES-2009-11, March 2009.

[30]

}}James W. Demmel, Laura Grigori, Mark Frederick Hoemmen, Julien Langou, Communication-Optimal Parallel and Sequential QR and LU Factorizations, Tech. Report, LAPACK Working Note 204, August 2008.

[31]

}}H. Ltaief, J. Kurzak, J. Dongarra, Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited, Tech. Report, LAPACK Working Note 208, August 2009.

[32]

}}H. Ltaief, J. Kurzak, J. Dongarra, Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures, Tech. Report, LAPACK Working Note 209, October 2009.

[33]

}}G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Communication-Optimal Parallel and Sequential Cholesky Decomposition, Tech. Report, LAPACK Working Note 215, February 2009.

[34]

}}S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: Proceedings of the IPDPS 2010, Atlanta, GA, April 2010.

[35]

}}S. Tomov, R. Nath, P. Du, J. Dongarra, MAGMA version 0.2 Users' Guide, November 2009, <http://icl.cs.utk.edu/magma>.

Cited By

Heroux MAbdelfattah ABeams NCarson RGhysels PKolev TStitt TVargas ATomov SDongarra J(2024)MAGMAInternational Journal of High Performance Computing Applications10.1177/1094342024126196038:5(468-490)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1177/10943420241261960
Li ZXiong GZhang XShen ZLuo CShang XDong XBian GWang XWang F(2019)A GPU Based Parallel Genetic Algorithm for the Orientation Optimization Problem in 3D Printing*2019 International Conference on Robotics and Automation (ICRA)10.1109/ICRA.2019.8793989(2786-2792)Online publication date: 20-May-2019
https://dl.acm.org/doi/10.1109/ICRA.2019.8793989
Abdelfattah AKeyes DLtaief H(2016)KBLASACM Transactions on Mathematical Software10.1145/281831142:3(1-31)Online publication date: 10-May-2016
https://dl.acm.org/doi/10.1145/2818311
Show More Cited By

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
1. General and reference
  1. Cross-computing tools and techniques
2. Networks

Recommendations

An Improved Magma Gemm For Fermi Graphics Processing Units

We present an improved matrixâ matrix multiplication routine (General Matrix Multiply [GEMM]) in the MAGMA BLAS library that targets the NVIDIA Fermi graphics processing units (GPUs) using Compute Unified Data Architecture (CUDA). We show how to modify ...
Orthogonal Hessenberg Reduction and Orthogonal Krylov Subspace Bases

We study necessary and sufficient conditions that a nonsingular matrix A can be B-orthogonally reduced to upper Hessenberg form with small bandwidth. By this we mean the existence of a decomposition AV=VH, where H is upper Hessenberg with few nonzero ...
Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures

We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Parallel Computing

Parallel Computing Volume 36, Issue 12

December, 2010

69 pages

ISSN:0167-8191

Issue’s Table of Contents

Copyright © Elsevier B.V. © 2010.

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 December 2010

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

13
Total Citations
View Citations
0
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 10 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Heroux MAbdelfattah ABeams NCarson RGhysels PKolev TStitt TVargas ATomov SDongarra J(2024)MAGMAInternational Journal of High Performance Computing Applications10.1177/1094342024126196038:5(468-490)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1177/10943420241261960
Li ZXiong GZhang XShen ZLuo CShang XDong XBian GWang XWang F(2019)A GPU Based Parallel Genetic Algorithm for the Orientation Optimization Problem in 3D Printing*2019 International Conference on Robotics and Automation (ICRA)10.1109/ICRA.2019.8793989(2786-2792)Online publication date: 20-May-2019
https://dl.acm.org/doi/10.1109/ICRA.2019.8793989
Abdelfattah AKeyes DLtaief H(2016)KBLASACM Transactions on Mathematical Software10.1145/281831142:3(1-31)Online publication date: 10-May-2016
https://dl.acm.org/doi/10.1145/2818311
Kabir KHaidar ATomov SDongarra JWatson LWeinbub JSosonkina MThacker W(2015)Performance analysis and design of a hessenberg reduction using stabilized blocked elementary transformations for new architecturesProceedings of the Symposium on High Performance Computing10.5555/2872599.2872616(135-142)Online publication date: 12-Apr-2015
https://dl.acm.org/doi/10.5555/2872599.2872616
Mittal SVetter J(2015)A Survey of CPU-GPU Heterogeneous Computing TechniquesACM Computing Surveys10.1145/278839647:4(1-35)Online publication date: 21-Jul-2015
https://dl.acm.org/doi/10.1145/2788396
Granat RKågström BKressner DShao M(2015)Algorithm 953ACM Transactions on Mathematical Software10.1145/269947141:4(1-23)Online publication date: 12-Oct-2015
https://dl.acm.org/doi/10.1145/2699471
Kabir KHaidar ATomov SDongarra J(2015)Performance Analysis and Optimisation of Two-sided Factorization Algorithms for Heterogeneous PlatformProcedia Computer Science10.1016/j.procs.2015.05.22251:C(180-190)Online publication date: 1-Sep-2015
https://dl.acm.org/doi/10.1016/j.procs.2015.05.222
Mahfoudhi RMahjoub ZNasri W(2015)Parallel Communication-Avoiding Algorithm for Triangular Matrix Inversion on Homogeneous and Heterogeneous PlatformsInternational Journal of Parallel Programming10.1007/s10766-014-0310-043:4(631-655)Online publication date: 1-Aug-2015
https://dl.acm.org/doi/10.1007/s10766-014-0310-0
Haidar AGates MTomov SDongarra JMalony ANemirovsky MMidkiff S(2013)Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communicationProceedings of the 27th international ACM conference on International conference on supercomputing10.1145/2464996.2465438(223-232)Online publication date: 10-Jun-2013
https://dl.acm.org/doi/10.1145/2464996.2465438
Bosner NBujanović ZDrmač Z(2013)Efficient generalized Hessenberg form and applicationsACM Transactions on Mathematical Software10.1145/2450153.245015739:3(1-19)Online publication date: 3-May-2013
https://dl.acm.org/doi/10.1145/2450153.2450157
Show More Cited By

View Options

View options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Issue’s Table of Contents