DOI: 10.1145/1274971.1274986

An operation stacking framework for large ensemble computations

Published: 17 June 2007

Abstract

Iterative solutions of sparse problems often achieve only a small fraction of the peak theoretical performance on modern architectures. This problem is highly challenging because sparse matrix storage schemes require data to be accessed irregularly, which leads to massive cache misses. Furthermore, the inner loop of typical sparse matrix operations accesses only a small and variable amount of data, which not only leads to low utilization of floating point registers, but also prevents optimization techniques that improve instruction level parallelism (ILP), such as unroll and jam. Although a general solution to this problem has not been found, significant performance improvements can be made for at least one important special case, namely large ensemble computations, which run the same application repeatedly on different data sets. In this paper, we present the Operation Stacking Framework (OSF), which runs multiple sparse problems simultaneously, stacking their data and solving them as one, thus improving both cache and ILP utilization. Programmers can use stacked solvers transparently in their applications. Moreover, OSF provides an API that makes it simple to convert existing solvers such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods into a stacked form. Our experimental results show that stacking can reduce the number of L2 misses by 25% to 44%, resulting in performance improvements of up to 1.95x with an average of 1.60x for stacked CG and GMRES algorithms on a single CPU.
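To make the stacking idea concrete, the following is a minimal sketch in C of a stacked sparse matrix-vector multiply, the kernel at the heart of CG and GMRES. It is an illustration under simplifying assumptions, not the OSF implementation from the paper: it assumes K systems that happen to share one sparsity pattern, so a single CSR index structure (row_ptr, col_idx) serves all of them, and the K numerical values for each nonzero are stored contiguously. The names spmv_stacked and K are hypothetical.

```c
#include <stddef.h>

enum { K = 4 }; /* number of stacked systems; hypothetical parameter */

/* Stacked CSR SpMV sketch (illustrative only, not the paper's OSF code).
 * Layout assumption: vals[p*K + s] holds the value of nonzero p in
 * system s; the vectors x and y are interleaved the same way per unknown.
 * Computes y_s = A_s * x_s for all systems s = 0..K-1 in one pass. */
void spmv_stacked(size_t n, const size_t *row_ptr, const size_t *col_idx,
                  const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double acc[K] = {0.0};
        for (size_t p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            size_t j = col_idx[p];
            /* One irregular column-index lookup now feeds K independent
             * multiply-adds, instead of a single one as in plain CSR SpMV. */
            for (int s = 0; s < K; s++)
                acc[s] += vals[p * K + s] * x[j * K + s];
        }
        for (int s = 0; s < K; s++)
            y[i * K + s] = acc[s];
    }
}
```

The contrast with an unstacked kernel illustrates the ILP claim in the abstract: in plain CSR SpMV each nonzero yields a single multiply-add behind an irregular load, leaving little for unroll and jam to exploit, whereas the fixed-trip inner loop over s above can be fully unrolled, its K accumulators kept in registers, and each cache line of stacked values consumed completely.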



Information

Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN: 9781595937681
DOI: 10.1145/1274971

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CG
  2. GMRES
  3. irregular applications
  4. loop unroll and jam
  5. operation stacking
  6. sparse computation

Qualifiers

  • Article

Conference

ICS '07: International Conference on Supercomputing
June 17 - 21, 2007
Seattle, Washington

Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions (29%)

Cited By

  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing, 185:C. https://doi.org/10.1016/j.jpdc.2023.104799. Online publication date: 4 March 2024.
  • (2009) Operation Stacking for Ensemble Computations With Variable Convergence. The International Journal of High Performance Computing Applications, 24(2):194-212. https://doi.org/10.1177/1094342009347892. Online publication date: 3 November 2009.
