[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1248377.1248394acmconferencesArticle/Chapter ViewAbstractPublication PagesspaaConference Proceedingsconference-collections
Article

An experimental comparison of cache-oblivious and cache-conscious programs

Published: 09 June 2007 Publication History

Abstract

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.
An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.
This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cacheconscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.

References

[1]
Basic Linear Algebra Routines (BLAS). http://www.netlib.org/blas.
[2]
R. Allan and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers, 2002.
[3]
E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, editors. LAPACK Users' Guide. Second Edition. SIAM, Philadelphia, 1995.
[4]
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78--101, 1966.
[5]
David A. Berson, Rajiv Gupta, and Mary Lou Soffa. Integrated instruction scheduling and register allocation techniques. In LCPC '98, pages 247--262, London, UK, 1999. Springer-Verlag.
[6]
Gianfranco Bilardi, 2005. Personal communication.
[7]
Gianfranco Bilardi, Paolo D'Alberto, and Alex Nicolau. Fractal matrix multiplication: A case study on portability of cache performance. In Algorithm Engineering: 5th International Workshop, WAE, 2001.
[8]
David Callahan, Steve Carr, and Ken Kennedy. Improving register allocation for subscripted variables. In PLDI, pages 53--65, 1990.
[9]
Ernie Chan, Enrique S. Quintana-Orti, Gregorio Quintana-Orti, and Robert van de Geijn. Supermatrix out-of-order scheduling of matrix operations for smp and multi-core architectures. In Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2007.
[10]
Siddhartha Chatterjee, Alvin R. Lebeck, Praveen K. Patnala, and Mithuna Thottethodi. Recursive array layouts and fast parallel matrix multiplication. In ACM Symposium on Parallel Algorithms and Architectures, pages 222--231, 1999.
[11]
Rezaul Chowdhury and Vijaya Ramachandran. The cache-oblivious gaussian elimination paradigm: Theoretical framework, parallelization and experimental evaluation. In Symposium on Parallelism in Algorithms and Architectures (SPAA), June 2007.
[12]
S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In PLDI, 1995.
[13]
Keith D. Cooper, Devika Subramanian, and Linda Torczon. Adaptive optimizing compilers for the 21st century. J. Supercomput., 23(1):7--22, 2002.
[14]
J. J. Dongarra, F. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, 1984.
[15]
Matteo Frigo, 2005. Personal communication.
[16]
Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. Cache-oblivious algorithms. In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, page 285. IEEE Computer Society, 1999.
[17]
J. R. Goodman and W.-C. Hsu. Code scheduling and register allocation in large basic blocks. In ICS '88, pages 442--452, New York, NY, USA, 1988. ACM Press.
[18]
Jia Guo, María Jesús Garzarán, and David Padua. The power of Belady's algorithm in register allocation for long basic blocks. In Proc. 16th International Workshop in Languages and Parallel Computing, pages 374--390, 2003.
[19]
Fred Gustavson. Recursion leads to automatic variable blocking for dense linear-algebra algorithms. IBM Journal of Research and Development, 41(6):737--755, 1997.
[20]
Jia-Wei Hong and H. T. Kung. I/O complexity: The red-blue pebble game. In Proc. of the thirteenth annual ACM symposium on Theory of computing, pages 326--333, 1981.
[21]
Piyush Kumar. Cache-oblivious algorithms. In Lecture Notes in Computer Science 2625. Springer-Verlag, 1998.
[22]
W. Li and K. Pingali. Access Normalization: Loop restructuring for NUMA compilers. ACM Transactions on Computer Systems, 1993.
[23]
Cindy Norris and Lori L. Pollock. An experimental study of several cooperative register allocation and instruction scheduling strategies. In MICRO 28, pages 169--179, Los Alamitos, CA, USA, 1995. IEEE Computer Society Press.
[24]
Robert Schreiber and Jack Dongarra. Automatic blocking of nested loops. Technical Report CS-90-108, Knoxville, TN 37996, USA, 1990.
[25]
Sivan Toledo. A survey of out-of-core algorithms in numerical linear algebra. In External memory algorithms. American Mathematical Society, Boston, MA, 1999.
[26]
Clint Whaley. personal communication, 2005.
[27]
R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.
[28]
M. Wolfe. Iteration space tiling for memory hierarchies. In Third SIAM Conference on Parallel Processing for Scientific Computing, December 1987.
[29]
Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, and Paul Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2), 2005.

Cited By

View all
  • (2022)Exploring Strategies to Improve Locality Across Many-Core AffinitiesEuro-Par 2021: Parallel Processing Workshops10.1007/978-3-031-06156-1_3(29-40)Online publication date: 9-Jun-2022
  • (2021)Analytical modeling of matrix–vector multiplication on multicore processorsMathematical Methods in the Applied Sciences10.1002/mma.704545:15(8769-8799)Online publication date: 14-Jan-2021
  • (2020)Ideal and Predictable Hit Ratio for Matrix Transposition in Data CachesMathematics10.3390/math80201848:2(184)Online publication date: 3-Feb-2020
  • Show More Cited By

Index Terms

  1. An experimental comparison of cache-oblivious and cache-conscious programs

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures
      June 2007
      376 pages
      ISBN:9781595936677
      DOI:10.1145/1248377
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 June 2007

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. cache-conscious algorithms
      2. cache-oblivious algorithms
      3. memory bandwidth
      4. memory hierarchy
      5. memory latency
      6. numerical software

      Qualifiers

      • Article

      Conference

      SPAA07

      Acceptance Rates

      Overall Acceptance Rate 447 of 1,461 submissions, 31%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)17
      • Downloads (Last 6 weeks)1
      Reflects downloads up to 14 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2022)Exploring Strategies to Improve Locality Across Many-Core AffinitiesEuro-Par 2021: Parallel Processing Workshops10.1007/978-3-031-06156-1_3(29-40)Online publication date: 9-Jun-2022
      • (2021)Analytical modeling of matrix–vector multiplication on multicore processorsMathematical Methods in the Applied Sciences10.1002/mma.704545:15(8769-8799)Online publication date: 14-Jan-2021
      • (2020)Ideal and Predictable Hit Ratio for Matrix Transposition in Data CachesMathematics10.3390/math80201848:2(184)Online publication date: 3-Feb-2020
      • (2020)Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilersProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377916(317-329)Online publication date: 22-Feb-2020
      • (2020)Closing the Gap Between Cache-oblivious and Cache-adaptive AnalysisProceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures10.1145/3350755.3400274(63-73)Online publication date: 6-Jul-2020
      • (2020)A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-22020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD49847.2020.00051(321-328)Online publication date: Sep-2020
      • (2020)Cache Oblivious Strategies to Exploit Multi-Level Memory on Manycore Systems2020 IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC)10.1109/MCHPC51950.2020.00011(42-51)Online publication date: Nov-2020
      • (2019)qwLSHProceedings of the 2019 on International Conference on Multimedia Retrieval10.1145/3323873.3325048(329-333)Online publication date: 5-Jun-2019
      • (2017)IDE-Based Learning Analytics for Computing EducationACM Transactions on Computing Education10.1145/310575917:3(1-26)Online publication date: 29-Aug-2017
      • (2017)Locality Transformations for Nested Recursive Iteration SpacesACM SIGARCH Computer Architecture News10.1145/3093337.303772045:1(281-295)Online publication date: 4-Apr-2017
      • Show More Cited By

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media