DOI: 10.5555/2561828.2561929

An efficient compiler framework for cache bypassing on GPUs

Published: 18 November 2013

Abstract

Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Early GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or the compiler can explicitly direct each global load instruction to access or bypass the cache. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.
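The per-load control the abstract refers to is exposed at the PTX level through cache operators: for example, `ld.global.ca` caches a load in both L1 and L2, while `ld.global.cg` bypasses L1 and caches only in L2. The selection problem can be illustrated with a small sketch; the scoring metric (estimated reuses per line of cache footprint), the greedy strategy, and all names below are hypothetical stand-ins for illustration, not the paper's actual metrics or algorithm.

```python
# Hypothetical sketch of compiler-directed cache bypassing: given per-load
# estimates of cache reuse and footprint, greedily mark global load
# instructions for cache access or bypass. Illustrative only.
from dataclasses import dataclass

@dataclass
class GlobalLoad:
    name: str
    est_reuses: int      # estimated cache-line reuses if the load is cached
    est_footprint: int   # estimated number of cache lines the load touches

def select_cache_loads(loads, cache_lines):
    """Greedily cache the loads with the best reuse per line of footprint,
    bypassing the rest once the estimated footprint exceeds capacity."""
    ranked = sorted(loads,
                    key=lambda l: l.est_reuses / l.est_footprint,
                    reverse=True)
    cached, used = set(), 0
    for ld in ranked:
        if ld.est_reuses > 0 and used + ld.est_footprint <= cache_lines:
            cached.add(ld.name)
            used += ld.est_footprint
    return {ld.name: ("cache" if ld.name in cached else "bypass")
            for ld in loads}

loads = [
    GlobalLoad("ld_A", est_reuses=8, est_footprint=64),
    GlobalLoad("ld_B", est_reuses=0, est_footprint=512),  # streaming, no reuse
    GlobalLoad("ld_C", est_reuses=4, est_footprint=32),
]
print(select_cache_loads(loads, cache_lines=128))
# → {'ld_A': 'cache', 'ld_B': 'bypass', 'ld_C': 'cache'}
```

The streaming load with no estimated reuse is bypassed outright (so it does not pollute the cache), while the reusable loads are cached until the estimated footprint reaches capacity.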


Cited By

  • (2024) Cross-core Data Sharing for Energy-efficient GPUs. ACM Transactions on Architecture and Code Optimization, 21(3):1-32. DOI: 10.1145/3653019
  • (2019) Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3337821.3337886
  • (2019) CuLDA_CGS. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 435-436. DOI: 10.1145/3293883.3301496
  • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 229-241. DOI: 10.1145/3293883.3295734
  • (2018) LTRF. ACM SIGPLAN Notices, 53(2):489-502. DOI: 10.1145/3296957.3173211
  • (2018) MASK. ACM SIGPLAN Notices, 53(2):503-518. DOI: 10.1145/3296957.3173169
  • (2018) Quantifying Data Locality in Dynamic Parallelism in GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):1-24. DOI: 10.1145/3287318
  • (2018) Harvesting Row-Buffer Hits via Orchestrated Last-Level Cache and DRAM Scheduling for Heterogeneous Multicore Systems. ACM Transactions on Design Automation of Electronic Systems, 24(1):1-27. DOI: 10.1145/3269982
  • (2018) cuMBIR. Proceedings of the 2018 International Conference on Supercomputing, 184-194. DOI: 10.1145/3205289.3205309
  • (2018) LTRF. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 489-502. DOI: 10.1145/3173162.3173211



Published In

ICCAD '13: Proceedings of the International Conference on Computer-Aided Design
November 2013
871 pages
ISBN:9781479910694
  • General Chair:
  • Jörg Henkel

Publisher

IEEE Press


Author Tags

  1. GPU
  2. cache bypassing
  3. compiler optimization

Qualifiers

  • Research-article

Conference

ICCAD '13: The International Conference on Computer-Aided Design
November 18-21, 2013
San Jose, California

Acceptance Rates

ICCAD '13 paper acceptance rate: 92 of 354 submissions (26%)
Overall acceptance rate: 457 of 1,762 submissions (26%)


