Article

An experimental comparison of cache-oblivious and cache-conscious programs

Authors:

Keshav Pingali,

Fred Gustavson

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

Pages 93 - 104

https://doi.org/10.1145/1248377.1248394

Published: 09 June 2007

Abstract

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.
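
The contrast described above can be illustrated with a minimal sketch. This is not the code evaluated in the paper. The function names `multiply_oblivious` and `multiply_conscious`, the `leaf` cutoff, and the `tile` parameter are all assumptions introduced here for illustration. The first routine halves the largest dimension recursively and never refers to a cache size; the second takes an explicit tile size that a cache-conscious program would tune per machine. Pure Python will not show the cache effects themselves; the sketch only shows the structure of the two approaches.

```python
def multiply_oblivious(A, B, leaf=4):
    """Divide-and-conquer product C = A * B on lists of lists.

    `leaf` is a hypothetical cutoff that only bounds recursion overhead;
    it is not derived from any cache parameter.
    """
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]

    def rec(i0, i1, j0, j1, k0, k1):
        di, dj, dk = i1 - i0, j1 - j0, k1 - k0
        if max(di, dj, dk) <= leaf:
            # Base case: small sub-problem, plain triple loop.
            for i in range(i0, i1):
                for k in range(k0, k1):
                    a = A[i][k]
                    for j in range(j0, j1):
                        C[i][j] += a * B[k][j]
        elif di >= dj and di >= dk:
            # Split the rows of A and C.
            mid = i0 + di // 2
            rec(i0, mid, j0, j1, k0, k1)
            rec(mid, i1, j0, j1, k0, k1)
        elif dj >= dk:
            # Split the columns of B and C.
            mid = j0 + dj // 2
            rec(i0, i1, j0, mid, k0, k1)
            rec(i0, i1, mid, j1, k0, k1)
        else:
            # Split the shared dimension; both halves accumulate into C.
            mid = k0 + dk // 2
            rec(i0, i1, j0, j1, k0, mid)
            rec(i0, i1, j0, j1, mid, k1)

    rec(0, n, 0, m, 0, p)
    return C


def multiply_conscious(A, B, tile):
    """Tiled product C = A * B; `tile` is an explicit, machine-tuned block size."""
    n, p, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, p, tile):
            for jj in range(0, m, tile):
                # Multiply one tile of A by one tile of B.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, p)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[k][j]
    return C
```

For example, `multiply_oblivious([[1, 2], [3, 4]], [[5, 6], [7, 8]])` and `multiply_conscious([[1, 2], [3, 4]], [[5, 6], [7, 8]], tile=1)` both return `[[19, 22], [43, 50]]`. The only tuning knob that depends on the memory hierarchy is `tile` in the second routine, which is the sense in which the first one is oblivious.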

An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.

This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cache-conscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.

Index Terms

An experimental comparison of cache-oblivious and cache-conscious programs
1. Mathematics of computing
  1. Mathematical software
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation

Published In

SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures

June 2007, 376 pages

ISBN: 9781595936677

DOI: 10.1145/1248377

General Chair: Phillip B. Gibbons (Intel Research, USA)

Program Chair: Christian Scheideler (Technische Universität München, Germany)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SPAA07: 19th ACM Symposium on Parallelism in Algorithms and Architectures

June 9 - 11, 2007

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
