DOI: 10.5555/645989.674318

Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Published: 22 September 2002

Abstract

In this paper, we describe an algorithm for, and an implementation of, locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. In contrast to optimizations that exploit reuse in cache, the compiler must also manage replacement and therefore explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3X to 2.8X on the 6 programs compared with using SLP alone, and we eliminate the majority of memory accesses.
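The idea of keeping a superword in a register across iterations can be illustrated with a minimal sketch in portable C. This is not the paper's algorithm or its generated code. The `superword` struct, the width `SW = 4`, the helpers `sw_load`, `sw_add`, and `sw_store`, and the kernel `out[i] = a[i] + a[i+SW]` are all hypothetical stand-ins chosen for illustration; a real target would use SSE or AltiVec registers instead. The `loads` counter only serves to make the saved memory accesses visible.

```c
#include <assert.h>
#include <string.h>

#define SW 4  /* floats per superword; assumed width, e.g. a 128-bit register */

/* Hypothetical stand-in for a superword register. */
typedef struct { float lane[SW]; } superword;

static long loads;  /* number of superword loads issued to memory */

static superword sw_load(const float *p) {
    superword v;
    memcpy(v.lane, p, sizeof v.lane);
    loads++;
    return v;
}

static superword sw_add(superword x, superword y) {
    superword r;
    for (int k = 0; k < SW; k++) r.lane[k] = x.lane[k] + y.lane[k];
    return r;
}

static void sw_store(float *p, superword v) {
    memcpy(p, v.lane, sizeof v.lane);
}

/* Baseline: out[i] = a[i] + a[i+SW], both operands loaded in every iteration.
   The array a must hold at least n + SW floats. */
void sum_naive(float *out, const float *a, int n) {
    assert(n % SW == 0);
    for (int i = 0; i < n; i += SW) {
        superword lo = sw_load(&a[i]);
        superword hi = sw_load(&a[i + SW]);
        sw_store(&out[i], sw_add(lo, hi));
    }
}

/* With reuse: the superword loaded as hi in one iteration is the lo operand
   of the next, so it stays in a register instead of being reloaded. */
void sum_reuse(float *out, const float *a, int n) {
    assert(n % SW == 0);
    superword lo = sw_load(&a[0]);
    for (int i = 0; i < n; i += SW) {
        superword hi = sw_load(&a[i + SW]);
        sw_store(&out[i], sw_add(lo, hi));
        lo = hi;  /* register-to-register copy, no memory access */
    }
}
```

For `n = 16` the baseline issues 8 superword loads and the reuse version issues 5, with identical results. The copy `lo = hi` is a simplification: a compiler that names registers explicitly could plausibly remove it by unrolling the loop and rotating register names.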

Published In

PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques
September 2002
168 pages
ISBN: 0769516203

Publisher

IEEE Computer Society, United States

Qualifiers

• Article

Conference

PACT02

Acceptance Rates

Overall Acceptance Rate: 121 of 471 submissions, 26%

Cited By

• (2019) Compiler auto-vectorization with imitation learning. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 14625-14635. DOI: 10.5555/3454287.3455597. Online publication date: 8-Dec-2019
• (2018) goSLP: globally optimized superword level parallelism framework. Proceedings of the ACM on Programming Languages, 2:OOPSLA, pages 1-28. DOI: 10.1145/3276480. Online publication date: 24-Oct-2018
• (2018) SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors. Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing, pages 1-8. DOI: 10.1145/3178433.3178436. Online publication date: 24-Feb-2018
• (2016) Exploiting mixed SIMD parallelism by reducing data reorganization overhead. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 59-69. DOI: 10.1145/2854038.2854054. Online publication date: 29-Feb-2016
• (2015) Automatic Vectorization of Interleaved Data Revisited. ACM Transactions on Architecture and Code Optimization, 12:4, pages 1-25. DOI: 10.1145/2838735. Online publication date: 8-Dec-2015
• (2015) Optimizing Overlapped Memory Accesses in User-directed Vectorization. Proceedings of the 29th ACM on International Conference on Supercomputing, pages 393-404. DOI: 10.1145/2751205.2751224. Online publication date: 8-Jun-2015
• (2013) Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization, 9:4, pages 1-23. DOI: 10.1145/2400682.2400713. Online publication date: 20-Jan-2013
• (2013) Parallel execution of Java loops on Graphics Processing Units. Science of Computer Programming, 78:5, pages 458-480. DOI: 10.1016/j.scico.2011.06.004. Online publication date: 1-May-2013
• (2012) Extending a C-like language for portable SIMD programming. ACM SIGPLAN Notices, 47:8, pages 65-74. DOI: 10.1145/2370036.2145825. Online publication date: 25-Feb-2012
• (2012) A compiler framework for extracting superword level parallelism. ACM SIGPLAN Notices, 47:6, pages 347-358. DOI: 10.1145/2345156.2254106. Online publication date: 11-Jun-2012
