Article

Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

Authors:

Jacqueline Chame,

Mary W. Hall

PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques

Pages 45 - 55

Published: 22 September 2002

Abstract

In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations to exploit reuse in cache, the compiler must also manage replacement, and thus, explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3 to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.
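
The idea of treating superword registers as a compiler-controlled cache can be illustrated with a small load-counting model. The sketch below is not the authors' algorithm or implementation. It assumes a superword of four words, and the names `Memory`, `load_superword`, `scale_rows_slp_only`, and `scale_rows_register_cached` are hypothetical helpers introduced only for illustration. It compares a loop that reloads the same superwords on every outer iteration with one that loads them once and keeps them in named "registers".

```python
SUPERWORD = 4  # assumption: four words per superword (e.g. 4 x 32-bit in 128 bits)


class Memory:
    """An array in memory that counts how many superword loads it serves."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0

    def load_superword(self, start):
        # One load brings in SUPERWORD consecutive elements, so a single
        # memory access serves several scalar references (spatial reuse).
        self.loads += 1
        return self.values[start:start + SUPERWORD]


def scale_rows_slp_only(x, scales):
    """Without register caching: every row reloads every superword of x."""
    out = []
    for s in scales:
        row = []
        for j in range(0, len(x.values), SUPERWORD):
            sw = x.load_superword(j)
            row.extend(s * v for v in sw)
        out.append(row)
    return out


def scale_rows_register_cached(x, scales):
    """With register caching: load each superword of x once, then reuse it
    across all rows (temporal reuse). Assumes x fits in the register file."""
    regs = [x.load_superword(j) for j in range(0, len(x.values), SUPERWORD)]
    out = []
    for s in scales:
        row = []
        for sw in regs:  # each entry stands for one explicitly named register
            row.extend(s * v for v in sw)
        out.append(row)
    return out


data = [float(i) for i in range(8)]  # two superwords
scales = [1.0, 2.0, 3.0]             # three outer iterations
plain, cached = Memory(data), Memory(data)
assert scale_rows_slp_only(plain, scales) == scale_rows_register_cached(cached, scales)
print(plain.loads, cached.loads)  # prints "6 2"
```

In this toy setting, both versions compute the same result, but the cached version issues two superword loads instead of six. A real compiler must also decide which superwords to keep when the register file is too small, which is the replacement problem the abstract mentions.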

Index Terms
1. Hardware
  1. Integrated circuits
    1. Semiconductor memory
2. Software and its engineering
  1. Software notations and tools

Published In

PACT '02: Proceedings of the 2002 International Conference on Parallel Architectures and Compilation Techniques

September 2002

168 pages

ISBN: 0769516203

Copyright © 2002 Institute of Electrical and Electronics Engineers, Inc. All rights reserved.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Computer Society

United States

Conference

PACT02: Parallel Architectures and Compilation Techniques

September 22 - 25, 2002

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Cited By

Mendis C, Yang C, Pu Y, Amarasinghe S, Carbin M, Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E (2019). Compiler auto-vectorization with imitation learning. Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 14625-14635. Online publication date: 8-Dec-2019.
https://dl.acm.org/doi/10.5555/3454287.3455597

Mendis C, Amarasinghe S (2018). goSLP: globally optimized superword level parallelism framework. Proceedings of the ACM on Programming Languages, 2:OOPSLA, pages 1-28. Online publication date: 24-Oct-2018.
https://dl.acm.org/doi/10.1145/3276480

Rodrigues C, Phaosawasdi A, Wu P (2018). SIMDization of Small Tensor Multiplication Kernels for Wide SIMD Vector Processors. Proceedings of the 2018 4th Workshop on Programming Models for SIMD/Vector Processing, pages 1-8. Online publication date: 24-Feb-2018.
https://dl.acm.org/doi/10.1145/3178433.3178436

Zhou H, Xue J, Franke B, Wu Y, Rastello F (2016). Exploiting mixed SIMD parallelism by reducing data reorganization overhead. Proceedings of the 2016 International Symposium on Code Generation and Optimization, pages 59-69. Online publication date: 29-Feb-2016.
https://dl.acm.org/doi/10.1145/2854038.2854054

Anderson A, Malik A, Gregg D (2015). Automatic Vectorization of Interleaved Data Revisited. ACM Transactions on Architecture and Code Optimization, 12:4, pages 1-25. Online publication date: 8-Dec-2015.
https://dl.acm.org/doi/10.1145/2838735

Caballero D, Royuela S, Ferrer R, Duran A, Martorell X, Bhuyan L, Chong F, Sarkar V (2015). Optimizing Overlapped Memory Accesses in User-directed Vectorization. Proceedings of the 29th ACM on International Conference on Supercomputing, pages 393-404. Online publication date: 8-Jun-2015.
https://dl.acm.org/doi/10.1145/2751205.2751224

Verdoolaege S, Carlos Juega J, Cohen A, Ignacio Gómez J, Tenllado C, Catthoor F (2013). Polyhedral parallel code generation for CUDA. ACM Transactions on Architecture and Code Optimization, 9:4, pages 1-23. Online publication date: 20-Jan-2013.
https://dl.acm.org/doi/10.1145/2400682.2400713

Leung A, Lhoták O, Lashari G (2013). Parallel execution of Java loops on Graphics Processing Units. Science of Computer Programming, 78:5, pages 458-480. Online publication date: 1-May-2013.
https://dl.acm.org/doi/10.1016/j.scico.2011.06.004

Leißa R, Hack S, Wald I (2012). Extending a C-like language for portable SIMD programming. ACM SIGPLAN Notices, 47:8, pages 65-74. Online publication date: 25-Feb-2012.
https://dl.acm.org/doi/10.1145/2370036.2145825

Liu J, Zhang Y, Jang O, Ding W, Kandemir M (2012). A compiler framework for extracting superword level parallelism. ACM SIGPLAN Notices, 47:6, pages 347-358. Online publication date: 11-Jun-2012.
https://dl.acm.org/doi/10.1145/2345156.2254106
