research-article
Open access

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP

Published: 22 March 2018

Abstract

Graphics Processing Units (GPUs) leverage massive thread-level parallelism (TLP) to achieve high computation throughput and to hide long memory latency. However, recent studies have shown that GPU performance does not scale with GPU occupancy or with the degree of TLP that a GPU supports, especially for memory-intensive workloads. The conventional explanation attributes this to L1 D-cache contention or off-chip memory bandwidth saturation. In this article, we perform a novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between the L1 D-caches and the L2 partitions. We show that the interconnect bandwidth is a critical bound on GPU performance scalability.
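
To make the analysis concrete, here is a minimal sketch of the per-component throughput-utilization check described above. The component list, the peak bandwidths, and the 90% saturation threshold are illustrative assumptions for this sketch, not the paper's measured configuration:

    # A minimal sketch of the throughput-utilization analysis: map each
    # component's achieved bandwidth to a utilization ratio and flag the ones
    # that saturate. The peak numbers below are illustrative placeholders.
    PEAK_BW_GB_S = {
        "DRAM": 177.0,   # off-chip memory bandwidth
        "L2":   350.0,   # aggregate L2 partition bandwidth
        "NoC":  300.0,   # interconnect between L1 D-caches and L2 partitions
        "L1D": 1300.0,   # aggregate L1 D-cache bandwidth
    }

    def utilization(achieved_bw_gb_s):
        """Per-component utilization: achieved bandwidth over peak bandwidth."""
        return {c: achieved_bw_gb_s[c] / peak for c, peak in PEAK_BW_GB_S.items()}

    def scalability_bound(achieved_bw_gb_s, saturation=0.9):
        """Return the saturated components; an empty result suggests that
        performance should still scale with additional TLP."""
        return [c for c, u in utilization(achieved_bw_gb_s).items() if u >= saturation]

    # Example: a memory-intensive kernel that saturates the interconnect first.
    measured = {"DRAM": 120.0, "L2": 200.0, "NoC": 285.0, "L1D": 400.0}
    print(scalability_bound(measured))  # ['NoC'] -> interconnect-bound

Under this view, an application for which no component saturates (the function returns an empty list) is exactly the kind of workload whose performance should keep scaling with additional TLP.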
For applications that do not saturate the throughput of any particular resource, performance scales well with increased TLP. To improve TLP for such applications efficiently, we propose a fast context-switching approach: when a warp/thread block (TB) is stalled by a long-latency operation, its context is spilled to spare on-chip resources so that a new warp/TB can be launched, and the switched-out warp/TB is switched back in when another warp/TB completes or is switched out. With this fine-grained fast context switching, higher TLP can be supported without increasing the sizes of critical resources such as the register file. Our experiments show that performance improves by up to 47%, with a geometric mean of 22%, for a set of applications with unsaturated throughput utilization. Compared to the state-of-the-art TLP-improvement scheme, our proposed scheme achieves 12% higher performance on average across all benchmarks, and 16% higher for the unsaturated ones.
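
The switch-out/switch-in rules stated above can be summarized in a few lines. The following toy model is our own illustration of the policy, not the paper's hardware interface; the class, method names, and slot counts are hypothetical:

    # A toy model of the fast context-switching policy from the abstract:
    # a warp/TB stalled on a long-latency operation is parked in spare
    # on-chip storage so another warp/TB can occupy its slot, and it is
    # switched back in when a slot frees up via completion or a later stall.
    from collections import deque

    class FastContextSwitchScheduler:
        def __init__(self, hw_slots, pending_tbs):
            self.hw_slots = hw_slots            # contexts the SM can hold live
            self.pending = deque(pending_tbs)   # TBs not yet launched
            self.spilled = deque()              # contexts parked on-chip
            self.resident = []                  # currently executing warps/TBs

        def fill(self):
            """Fill free slots: switched-out contexts first, then new TBs."""
            while len(self.resident) < self.hw_slots:
                if self.spilled:
                    self.resident.append(self.spilled.popleft())
                elif self.pending:
                    self.resident.append(self.pending.popleft())
                else:
                    break

        def on_long_latency_stall(self, tb):
            """Spill a stalled context so another warp/TB can use its slot."""
            self.resident.remove(tb)
            self.fill()              # an earlier spilled context or a new TB runs
            self.spilled.append(tb)  # the stalled context waits its turn

        def on_complete(self, tb):
            """A completed warp/TB frees a slot for a spilled or new context."""
            self.resident.remove(tb)
            self.fill()

For example, with two hardware slots and three pending TBs, a stall on TB0 launches TB2 into the freed slot, and TB0 switches back in once TB1 completes:

    sched = FastContextSwitchScheduler(hw_slots=2, pending_tbs=["TB0", "TB1", "TB2"])
    sched.fill()                        # resident: TB0, TB1
    sched.on_long_latency_stall("TB0")  # resident: TB1, TB2; TB0 parked on-chip
    sched.on_complete("TB1")            # resident: TB2, TB0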

    Information & Contributors

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1
    March 2018
    401 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/3199680
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2018
    Accepted: 01 December 2017
    Revised: 01 December 2017
    Received: 01 July 2017
    Published in TACO Volume 15, Issue 1


    Author Tags

    1. GPGPU
    2. TLP
    3. context switching
    4. latency hiding

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 164
    • Downloads (Last 6 weeks): 25
    Reflects downloads up to 17 Jan 2025

    Cited By

    • (2024) Many-BSP: an analytical performance model for CUDA kernels. Computing 106:5, 1519-1555. DOI: 10.1007/s00607-023-01255-w. Online publication date: 1-May-2024.
    • (2022) POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming. In 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), 490-495. DOI: 10.1109/ICPICS55264.2022.9873685. Online publication date: 29-Jul-2022.
    • (2021) PSSM. In Proceedings of the 35th ACM International Conference on Supercomputing, 139-151. DOI: 10.1145/3447818.3460374. Online publication date: 3-Jun-2021.
    • (2021) A Survey of GPGPU Parallel Processing Architecture Performance Optimization. In 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall), 75-82. DOI: 10.1109/ICISFall51598.2021.9627400. Online publication date: 13-Oct-2021.
    • (2020) A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs. ACM Transactions on Architecture and Code Optimization 17:1, 1-26. DOI: 10.1145/3377138. Online publication date: 4-Mar-2020.
    • (2019) Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization 16:3, 1-27. DOI: 10.1145/3326124. Online publication date: 17-Jun-2019.
    • (2018) Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory. ACM Transactions on Architecture and Code Optimization 15:4, 1-24. DOI: 10.1145/3280849. Online publication date: 16-Nov-2018.
