[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

Published: 01 December 2010 Publication History

Abstract

We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25x the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the system's hybrid components. The results described in this paper are significant because the HR has not been properly accelerated before on homogeneous multicore architectures, and it plays a significant role in solving non-symmetric eigenvalue problems. Moreover, the ideas from the hybrid HR are used to develop a hybrid tridiagonal reduction algorithm (for symmetric eigenvalue problems) and a bidiagonal reduction algorithm (for singular value decomposition problems). Our approach demonstrates a methodology that streamlines the development of a large and important class of algorithms on modern computer architectures of multicore and GPUs. The new algorithms can be directly used in the software stack that relies on LAPACK.

References

[1]
}}Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A. and Sorensen, D., LAPACK User's Guide. 1999. third ed. SIAM.
[2]
}}M. Baboulin, J. Dongarra, S. Tomov, Some issues in dense linear algebra for multicore and special purpose architectures, in: Proceedings of the International Workshop on State-of-the-Art in Scientific and Parallel Computing (PARA), Trondheim, Norway, 2008.
[3]
}}G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Minimizing Communication in Linear Algebra, Tech. Report, LAPACK Working Note 218, May 2009.
[4]
}}S. Barrachina, M. Castillo, F.D. Igual, R. Mayo, E.S. Quintana-Ortí, Solving Dense Linear Systems on Graphics Processors, Technical Report ICC 02-02-2008, Universidad Jaime I, February, 2008.
[5]
}}C. Bischof C. Van Loan, The WY representation for products of householder matrices, SIAM J. Sci. Stat. Comput. 8(1) (1987) S2-S13, Parallel processing for scientific computing (Norfolk, VA, 1985). MR 88f:65070.
[6]
}}A. Buttari, J. Dongarra, J. Kurzak, J. Langou, S. Tomov, The impact of multicore on math software, in: PARA 2006, Umea, Sweden, 2006.
[7]
}}A. Buttari, J. Langou, J. Kurzak, J. Dongarra, A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures, Technical Report UT-CS-07-600, LAPACK Working Note 191, University of Tennessee, 2007.
[8]
}}J. Dongarra, S. Moore, G. Peterson, S. Tomov, J. Allred, V. Natoli, D. Richie, Exploring new architectures in accelerating CFD for air force applications, in: Proceedings of the HPCMP UGC08, July 14-17, 2008.
[9]
}}Fatahalian, K., Sugerman, J. and Hanrahan, P., Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. In: HWWS '04: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (New York, NY, USA), ACM. pp. 133-137.
[10]
}}Fatica, M., Accelerating linpack with CUDA on heterogenous clusters. In: GPGPU-2: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units (New York, NY, USA), ACM. pp. 46-51.
[11]
}}Galoppo, N., Govindaraju, N., Henson, M. and Manocha, D., LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. In: SC '05: Proceedings of the 2005 ACM/IEEE conference on Supercomputing (Washington, DC, USA), IEEE Computer Society. pp. 3
[12]
}}Golub, G.H. and Van Loan, C.F., Matrix Computations. 1989. second ed. Baltimore, MD, USA.
[13]
}}W. Gruener, Larrabee, CUDA and the quest for the free lunch, 08/2008, TGDaily, <http://www.tgdaily.com/content/view/38750/113/>.
[14]
}}Hammarling, S., Sorensen, D. and Dongarra, J., Block reduction of matrices to condensed forms for eigenvalue computations. J. Comput. Appl. Math. v27. 215-227.
[15]
}}Li, Y., Dongarra, J. and Tomov, S., A note on auto-tuning GEMM for GPUs. In: Proceedings of the 9th ICCS '09 (Baton Rouge, LA), vol. 5544. Springer, Verlag. pp. 884-892.
[16]
}}Van Loan, C.F., Using the Hessenberg Decomposition in Control Theory. 1982. North-Holland, Amsterdam.
[17]
}}NVIDIA, Nvidia Tesla doubles the performance for CUDA developers, Computer Graphics World, 06/30/2008.
[18]
}}NVIDIA, NVIDIA CUDA Programming Guide, Version 2.0, 06/07/2008.
[19]
}}Owens, J., Luebke, D., Govindaraju, N., Harris, M., Krger, J., Lefohn, A. and Purcell, T., A survey of general-purpose computation on graphics hardware. Comput. Graph. Forum. v26 i1. 80-113.
[20]
}}H. Ltaief, S. Tomov, R. Nath, P. Du, J. Dongarra, A Scalable High Performance Cholesky Factorization for Multicore with GPU Accelerators, Tech. Report, LAPACK Working Note 223, November 2009.
[21]
}}E. Ayguadé, R. Badia, F. Igual, J. Labarta, R. Mayo, E. Quintana-Ortí, An extension of the StarSs programming model for platforms with multiple GPUs, in: Proceedings of the Euro-Par '09, Delft, The Netherlands, 2009, pp. 851-862.
[22]
}}Quintana-Ortí, G., Quintana-Ortí, E.S., van de Geijn, R., Van Zee, F.G. and Chan, E., Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Trans. Math. Softw. v36 i3. 1-26.
[23]
}}Schreiber, R. and Van Loan, C., A storage-efficient WY representation for products of Householder transformations. SIAM J. Sci. Stat. Comput. v10 i1. 53-57.
[24]
}}S. Tomov, J. Dongarra, M. Baboulin, Towards dense linear algebra for hybrid GPU accelerated manycore systems, Parallel Computing, in press.
[25]
}}V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Proceedings of the SC '08, November 15-21, 2008, Austin, Texas.
[26]
}}V. Volkov, J.W. Demmel, Using GPUs to accelerate linear algebra routines, Poster at PAR lab winter retreat, January 9, 2008, <http://www.eecs.berkeley.edu/~volkov/volkov08-parlab.pdf>.
[27]
}}General-purpose computation using graphics hardware, <http://www.gpgpu.org>.
[28]
}}NVIDIA CUDA ZONE, <http://www.nvidia.com/object/cuda_home.html>.
[29]
}}P. Bientinesi, F. Igual, D. Kressner, E. Quintana-Orti, Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures, Aachen Institute for Computational Engineering Science, RWTH Aachen, AICES-2009-11, March 2009.
[30]
}}James W. Demmel, Laura Grigori, Mark Frederick Hoemmen, Julien Langou, Communication-Optimal Parallel and Sequential QR and LU Factorizations, Tech. Report, LAPACK Working Note 204, August 2008.
[31]
}}H. Ltaief, J. Kurzak, J. Dongarra, Parallel Block Hessenberg Reduction using Algorithms-By-Tiles for Multicore Architectures Revisited, Tech. Report, LAPACK Working Note 208, August 2009.
[32]
}}H. Ltaief, J. Kurzak, J. Dongarra, Parallel Band Two-Sided Matrix Bidiagonalization for Multicore Architectures, Tech. Report, LAPACK Working Note 209, October 2009.
[33]
}}G. Ballard, J. Demmel, O. Holtz, O. Schwartz, Communication-Optimal Parallel and Sequential Cholesky Decomposition, Tech. Report, LAPACK Working Note 215, February 2009.
[34]
}}S. Tomov, R. Nath, H. Ltaief, J. Dongarra, Dense linear algebra solvers for multicore with GPU accelerators, in: Proceedings of the IPDPS 2010, Atlanta, GA, April 2010.
[35]
}}S. Tomov, R. Nath, P. Du, J. Dongarra, MAGMA version 0.2 Users' Guide, November 2009, <http://icl.cs.utk.edu/magma>.

Cited By

View all
  • (2024)MAGMAInternational Journal of High Performance Computing Applications10.1177/1094342024126196038:5(468-490)Online publication date: 1-Sep-2024
  • (2019)A GPU Based Parallel Genetic Algorithm for the Orientation Optimization Problem in 3D Printing*2019 International Conference on Robotics and Automation (ICRA)10.1109/ICRA.2019.8793989(2786-2792)Online publication date: 20-May-2019
  • (2016)KBLASACM Transactions on Mathematical Software10.1145/281831142:3(1-31)Online publication date: 10-May-2016
  • Show More Cited By
  1. Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Parallel Computing
      Parallel Computing  Volume 36, Issue 12
      December, 2010
      69 pages

      Publisher

      Elsevier Science Publishers B. V.

      Netherlands

      Publication History

      Published: 01 December 2010

      Author Tags

      1. Bidiagonalization
      2. Dense linear algebra
      3. GPUs
      4. Hessenberg reduction
      5. Hybrid computing
      6. Tridiagonalization
      7. Two-sided factorizations

      Qualifiers

      • Article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 10 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)MAGMAInternational Journal of High Performance Computing Applications10.1177/1094342024126196038:5(468-490)Online publication date: 1-Sep-2024
      • (2019)A GPU Based Parallel Genetic Algorithm for the Orientation Optimization Problem in 3D Printing*2019 International Conference on Robotics and Automation (ICRA)10.1109/ICRA.2019.8793989(2786-2792)Online publication date: 20-May-2019
      • (2016)KBLASACM Transactions on Mathematical Software10.1145/281831142:3(1-31)Online publication date: 10-May-2016
      • (2015)Performance analysis and design of a hessenberg reduction using stabilized blocked elementary transformations for new architecturesProceedings of the Symposium on High Performance Computing10.5555/2872599.2872616(135-142)Online publication date: 12-Apr-2015
      • (2015)A Survey of CPU-GPU Heterogeneous Computing TechniquesACM Computing Surveys10.1145/278839647:4(1-35)Online publication date: 21-Jul-2015
      • (2015)Algorithm 953ACM Transactions on Mathematical Software10.1145/269947141:4(1-23)Online publication date: 12-Oct-2015
      • (2015)Performance Analysis and Optimisation of Two-sided Factorization Algorithms for Heterogeneous PlatformProcedia Computer Science10.1016/j.procs.2015.05.22251:C(180-190)Online publication date: 1-Sep-2015
      • (2015)Parallel Communication-Avoiding Algorithm for Triangular Matrix Inversion on Homogeneous and Heterogeneous PlatformsInternational Journal of Parallel Programming10.1007/s10766-014-0310-043:4(631-655)Online publication date: 1-Aug-2015
      • (2013)Toward a scalable multi-GPU eigensolver via compute-intensive kernels and efficient communicationProceedings of the 27th international ACM conference on International conference on supercomputing10.1145/2464996.2465438(223-232)Online publication date: 10-Jun-2013
      • (2013)Efficient generalized Hessenberg form and applicationsACM Transactions on Mathematical Software10.1145/2450153.245015739:3(1-19)Online publication date: 3-May-2013
      • Show More Cited By

      View Options

      View options

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media