DOI: 10.1145/1274971.1274986

An operation stacking framework for large ensemble computations

Published: 17 June 2007

Abstract

Iterative solutions of sparse problems often achieve only a small fraction of the peak theoretical performance on modern architectures. This problem is highly challenging because sparse matrix storage schemes require data to be accessed irregularly, which leads to massive cache misses. Furthermore, the inner loop of typical sparse matrix operations accesses only a small and variable amount of data, which not only leads to low utilization of floating point registers, but also prevents optimization techniques that improve instruction level parallelism (ILP), such as unroll and jam. Although a general solution to this problem has not been found, significant performance improvements can be made for at least one important special case, namely large ensemble computations, which run the same application repeatedly on different data sets. In this paper, we present the Operation Stacking Framework (OSF), which runs multiple sparse problems simultaneously, stacking their data and solving them as one, thus improving both cache and ILP utilization. Programmers can use stacked solvers transparently in their applications. Moreover, OSF provides an API that makes it simple to convert existing solvers such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods into a stacked form. Our experimental results show that stacking can reduce the number of L2 misses by 25% to 44%, resulting in performance improvements of up to 1.95x with an average of 1.60x for stacked CG and GMRES algorithms on a single CPU.
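To make the stacking idea concrete, the following is a minimal sketch in C of a stacked sparse matrix-vector multiply, the kernel at the heart of CG and GMRES. It is an illustration under simplifying assumptions, not the OSF implementation from the paper: it assumes K systems that happen to share one sparsity pattern, so a single CSR index structure (row_ptr, col_idx) serves all of them, and the K numerical values for each nonzero are stored contiguously. The names spmv_stacked and K are hypothetical.

```c
#include <stddef.h>

enum { K = 4 }; /* number of stacked systems; hypothetical parameter */

/* Stacked CSR SpMV sketch (illustrative only, not the paper's OSF code).
 * Layout assumption: vals[p*K + s] holds the value of nonzero p in
 * system s; the vectors x and y are interleaved the same way per unknown.
 * Computes y_s = A_s * x_s for all systems s = 0..K-1 in one pass. */
void spmv_stacked(size_t n, const size_t *row_ptr, const size_t *col_idx,
                  const double *vals, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double acc[K] = {0.0};
        for (size_t p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            size_t j = col_idx[p];
            /* One irregular column-index lookup now feeds K independent
             * multiply-adds, instead of a single one as in plain CSR SpMV. */
            for (int s = 0; s < K; s++)
                acc[s] += vals[p * K + s] * x[j * K + s];
        }
        for (int s = 0; s < K; s++)
            y[i * K + s] = acc[s];
    }
}
```

The contrast with an unstacked kernel illustrates the ILP claim in the abstract: in plain CSR SpMV each nonzero yields a single multiply-add behind an irregular load, leaving little for unroll and jam to exploit, whereas the fixed-trip inner loop over s above can be fully unrolled, its K accumulators kept in registers, and each cache line of stacked values consumed completely.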



Information

Published In

ICS '07: Proceedings of the 21st annual international conference on Supercomputing
June 2007
315 pages
ISBN: 9781595937681
DOI: 10.1145/1274971

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CG
  2. GMRES
  3. irregular applications
  4. loop unroll and jam
  5. operation stacking
  6. sparse computation

Qualifiers

  • Article

Conference

ICS '07: International Conference on Supercomputing
June 17 - 21, 2007
Seattle, Washington

Acceptance Rates

Overall acceptance rate: 629 of 2,180 submissions (29%)

Cited By

  • (2024) Revisiting thread configuration of SpMV kernels on GPU. Journal of Parallel and Distributed Computing, 185:C. https://doi.org/10.1016/j.jpdc.2023.104799. Online publication date: 4 March 2024.
  • (2009) Operation Stacking for Ensemble Computations With Variable Convergence. The International Journal of High Performance Computing Applications, 24(2):194-212. https://doi.org/10.1177/1094342009347892. Online publication date: 3 November 2009.
