Research Article | Open Access

An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns

Published: 17 June 2019

Abstract

GPUs provide high-bandwidth, low-latency on-chip shared memory and L1 cache to efficiently service a large number of concurrent memory requests. Specifically, concurrent memory requests accessing contiguous memory space are coalesced into warp-wide accesses. To support such large accesses to the L1 cache with low latency, the L1 cache line is no smaller than a warp-wide access. However, this L1 cache architecture cannot always be utilized efficiently when applications generate many memory requests with irregular access patterns, especially due to branch and memory divergence, which leave requests uncoalesced and small. Furthermore, unlike the L1 cache, the shared memory of GPUs often goes unused, since exploiting it is left to the programmer. In this article, we propose Elastic-Cache, which efficiently supports both fine- and coarse-grained L1 cache line management for applications with both regular and irregular memory access patterns, thereby improving L1 cache efficiency. Specifically, it can store 32- or 64-byte words from non-contiguous memory space in a single 128-byte cache line. Furthermore, it neither requires an extra memory structure nor reduces the capacity of the L1 cache for tag storage, since it stores the auxiliary tags for fine-grained L1 cache line management in the shared memory space that is not fully used in many applications. To improve the L1 cache bandwidth utilization of Elastic-Cache for fine-grained accesses, we further propose Elastic-Plus, which issues 32-byte memory requests in parallel to reduce the processing latency of memory instructions and improve GPU throughput. Our experimental results show that Elastic-Cache improves the geometric-mean performance of applications with irregular memory access patterns by 104% without degrading the performance of applications with regular memory access patterns. Elastic-Plus outperforms Elastic-Cache, improving the performance of applications with irregular memory access patterns by 131%.
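To make the two access regimes concrete, here is a minimal CUDA sketch (the kernel names are ours, purely illustrative) contrasting a coalesced read, in which the 32 threads of a warp touch one contiguous 128-byte region that the coalescer merges into a single warp-wide request, with an index-driven gather, in which the same warp scatters across memory and issues the many small, uncoalesced requests that Elastic-Cache targets:

    // Regular pattern: thread i reads element i. With 4-byte floats, a warp
    // of 32 threads covers exactly one contiguous 128-byte region, so the
    // coalescer merges the warp's loads into a single warp-wide access.
    __global__ void coalesced_read(const float* __restrict__ in,
                                   float* __restrict__ out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Irregular pattern: the address depends on idx[i], so a warp's loads
    // can land in up to 32 different 128-byte lines. Each load then becomes
    // a small (e.g., 32-byte) request that underuses a 128-byte L1 line.
    __global__ void irregular_gather(const float* __restrict__ in,
                                     const int* __restrict__ idx,
                                     float* __restrict__ out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[idx[i]];
    }

Likewise, the following host-side sketch shows one way a 128-byte line could be managed at two granularities. It reflects our reading of the abstract, not the authors' actual hardware design; the ElasticLine structure and hit() helper are hypothetical. In fine-grained mode the line holds four independent 32-byte words, each checked against its own auxiliary tag (tags that Elastic-Cache would keep in otherwise idle shared-memory space rather than in an enlarged tag array):

    #include <cstdint>

    // Conceptual model only: one 128-byte L1 line that acts either as one
    // coarse-grained block or as four independently tagged 32-byte words.
    struct ElasticLine {
        uint64_t base_tag;      // coarse mode: 128B-aligned line address
        bool     fine_grained;  // true: four independent 32B words
        uint64_t word_tag[4];   // fine mode: auxiliary per-word tags
        bool     word_valid[4]; // fine mode: per-word valid bits

        // Hit check for the 32-byte word containing byte address 'addr'.
        bool hit(uint64_t addr) const {
            uint64_t line = addr >> 7;          // 128B line address
            unsigned word = (addr >> 5) & 0x3;  // 32B slot within the line
            if (!fine_grained)
                return base_tag == line;        // classic full-line match
            // Fine-grained: the slot's auxiliary tag must match the full
            // 32B-word address, so the four words need not be contiguous.
            return word_valid[word] && word_tag[word] == (addr >> 5);
        }
    };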


Published In

ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 3 (September 2019), 347 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3341169

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 17 June 2019
    Accepted: 01 March 2019
    Revised: 01 February 2019
    Received: 01 November 2018
    Published in TACO Volume 16, Issue 3


    Author Tags

    1. GPU
    2. cache
    3. shared memory
    4. thread

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Fundamental Research Funds for the Central Universities of Civil Aviation University of China
    • Scientific Research Foundation of Civil Aviation University of China
    • National Natural Science Foundation of China
    • U.S. National Science Foundation
    • Natural Science Foundation of Tianjin

Cited By

• (2024) GPU Performance Optimization via Intergroup Cache Cooperation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:11, 4142-4153. DOI: 10.1109/TCAD.2024.3443707. Online publication date: Nov 2024.
• (2024) e-CLAS: Effective GPUDirect I/O Classification Scheme. Computational Science and Its Applications – ICCSA 2024, 3-16. DOI: 10.1007/978-3-031-64608-9_1. Online publication date: 2 Jul 2024.
• (2023) Snake: A Variable-length Chain-based Prefetching for GPUs. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 728-741. DOI: 10.1145/3613424.3623782. Online publication date: 28 Oct 2023.
• (2023) ATA-Cache: Contention Mitigation for GPU Shared L1 Cache With Aggregated Tag Array. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43:5, 1429-1441. DOI: 10.1109/TCAD.2023.3337192. Online publication date: 29 Nov 2023.
• (2022) Locality-Aware Optimizations for Improving Remote Memory Latency in Multi-GPU Systems. Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 304-316. DOI: 10.1145/3559009.3569649. Online publication date: 8 Oct 2022.
• (2022) REMOC. Proceedings of the 19th ACM International Conference on Computing Frontiers, 1-11. DOI: 10.1145/3528416.3530229. Online publication date: 17 May 2022.
• (2022) OSM: Off-Chip Shared Memory for GPUs. IEEE Transactions on Parallel and Distributed Systems 33:12, 3415-3429. DOI: 10.1109/TPDS.2022.3154315. Online publication date: 1 Dec 2022.
• (2022) Morpheus: Extending the Last Level Cache Capacity in GPU Systems Using Idle GPU Core Resources. 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), 228-244. DOI: 10.1109/MICRO56248.2022.00029. Online publication date: Oct 2022.
• (2022) Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 676-691. DOI: 10.1109/HPCA53966.2022.00056. Online publication date: Apr 2022.
• (2021) Virtual-Cache: A cache-line borrowing technique for efficient GPU cache architectures. Microprocessors and Microsystems, 104301. DOI: 10.1016/j.micpro.2021.104301. Online publication date: Jun 2021.
