DOI: 10.5555/2561828.2561929

An efficient compiler framework for cache bypassing on GPUs

Published: 18 November 2013

Abstract

Graphics Processing Units (GPUs) have become ubiquitous for general purpose applications due to their tremendous computing power. Early GPUs employed only scratchpad memory as on-chip memory. Though scratchpad memory benefits many applications, it is not ideal for general purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches in conjunction with scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or the compiler can explicitly direct each global load instruction to access or bypass the cache. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to efficiently utilize the configurable cache and improve the overall performance of general purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics to estimate cache reuses and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.
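The per-load control the abstract refers to is exposed at the PTX level through cache operators: for example, `ld.global.ca` caches a load in both L1 and L2, while `ld.global.cg` bypasses L1 and caches only in L2. The selection problem can be illustrated with a small sketch; the scoring metric (estimated reuses per line of cache footprint), the greedy strategy, and all names below are hypothetical stand-ins for illustration, not the paper's actual metrics or algorithm.

```python
# Hypothetical sketch of compiler-directed cache bypassing: given per-load
# estimates of cache reuse and footprint, greedily mark global load
# instructions for cache access or bypass. Illustrative only.
from dataclasses import dataclass

@dataclass
class GlobalLoad:
    name: str
    est_reuses: int      # estimated cache-line reuses if the load is cached
    est_footprint: int   # estimated number of cache lines the load touches

def select_cache_loads(loads, cache_lines):
    """Greedily cache the loads with the best reuse per line of footprint,
    bypassing the rest once the estimated footprint exceeds capacity."""
    ranked = sorted(loads,
                    key=lambda l: l.est_reuses / l.est_footprint,
                    reverse=True)
    cached, used = set(), 0
    for ld in ranked:
        if ld.est_reuses > 0 and used + ld.est_footprint <= cache_lines:
            cached.add(ld.name)
            used += ld.est_footprint
    return {ld.name: ("cache" if ld.name in cached else "bypass")
            for ld in loads}

loads = [
    GlobalLoad("ld_A", est_reuses=8, est_footprint=64),
    GlobalLoad("ld_B", est_reuses=0, est_footprint=512),  # streaming, no reuse
    GlobalLoad("ld_C", est_reuses=4, est_footprint=32),
]
print(select_cache_loads(loads, cache_lines=128))
# → {'ld_A': 'cache', 'ld_B': 'bypass', 'ld_C': 'cache'}
```

The streaming load with no estimated reuse is bypassed outright (so it does not pollute the cache), while the reusable loads are cached until the estimated footprint reaches capacity.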


Cited By

  • (2024) Cross-core Data Sharing for Energy-efficient GPUs. ACM Transactions on Architecture and Code Optimization, 21(3):1-32. DOI: 10.1145/3653019
  • (2019) Compiler-Assisted GPU Thread Throttling for Reduced Cache Contention. Proceedings of the 48th International Conference on Parallel Processing, 1-10. DOI: 10.1145/3337821.3337886
  • (2019) CuLDA_CGS. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 435-436. DOI: 10.1145/3293883.3301496
  • (2019) A coordinated tiling and batching framework for efficient GEMM on GPUs. Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 229-241. DOI: 10.1145/3293883.3295734
  • (2018) LTRF. ACM SIGPLAN Notices, 53(2):489-502. DOI: 10.1145/3296957.3173211
  • (2018) MASK. ACM SIGPLAN Notices, 53(2):503-518. DOI: 10.1145/3296957.3173169
  • (2018) Quantifying Data Locality in Dynamic Parallelism in GPUs. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2(3):1-24. DOI: 10.1145/3287318
  • (2018) Harvesting Row-Buffer Hits via Orchestrated Last-Level Cache and DRAM Scheduling for Heterogeneous Multicore Systems. ACM Transactions on Design Automation of Electronic Systems, 24(1):1-27. DOI: 10.1145/3269982
  • (2018) cuMBIR. Proceedings of the 2018 International Conference on Supercomputing, 184-194. DOI: 10.1145/3205289.3205309
  • (2018) LTRF. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 489-502. DOI: 10.1145/3173162.3173211



Published In

ICCAD '13: Proceedings of the International Conference on Computer-Aided Design
November 2013
871 pages
ISBN:9781479910694
  • General Chair:
  • Jörg Henkel

Publisher

IEEE Press


Author Tags

  1. GPU
  2. cache bypassing
  3. compiler optimization

Qualifiers

  • Research-article

Conference

ICCAD '13: The International Conference on Computer-Aided Design
November 18-21, 2013
San Jose, California

Acceptance Rates

ICCAD '13 paper acceptance rate: 92 of 354 submissions (26%)
Overall acceptance rate: 457 of 1,762 submissions (26%)


