research-article
Open access

GPU Performance vs. Thread-Level Parallelism: Scalability Analysis and a Novel Way to Improve TLP

Published: 22 March 2018

Abstract

Graphics Processing Units (GPUs) leverage massive thread-level parallelism (TLP) to achieve high computation throughput and to hide long memory latency. However, recent studies have shown that GPU performance does not scale with GPU occupancy or with the degree of TLP that a GPU supports, especially for memory-intensive workloads. The conventional explanation attributes this to L1 D-cache contention or off-chip memory bandwidth saturation. In this article, we perform a novel scalability analysis from the perspective of throughput utilization of various GPU components, including off-chip DRAM, multiple levels of caches, and the interconnect between the L1 D-caches and the L2 partitions. We show that the interconnect bandwidth is a critical bound on GPU performance scalability.
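
To make the analysis concrete, here is a minimal sketch of the per-component throughput-utilization check described above. The component list, the peak bandwidths, and the 90% saturation threshold are illustrative assumptions for this sketch, not the paper's measured configuration:

    # A minimal sketch of the throughput-utilization analysis: map each
    # component's achieved bandwidth to a utilization ratio and flag the ones
    # that saturate. The peak numbers below are illustrative placeholders.
    PEAK_BW_GB_S = {
        "DRAM": 177.0,   # off-chip memory bandwidth
        "L2":   350.0,   # aggregate L2 partition bandwidth
        "NoC":  300.0,   # interconnect between L1 D-caches and L2 partitions
        "L1D": 1300.0,   # aggregate L1 D-cache bandwidth
    }

    def utilization(achieved_bw_gb_s):
        """Per-component utilization: achieved bandwidth over peak bandwidth."""
        return {c: achieved_bw_gb_s[c] / peak for c, peak in PEAK_BW_GB_S.items()}

    def scalability_bound(achieved_bw_gb_s, saturation=0.9):
        """Return the saturated components; an empty result suggests that
        performance should still scale with additional TLP."""
        return [c for c, u in utilization(achieved_bw_gb_s).items() if u >= saturation]

    # Example: a memory-intensive kernel that saturates the interconnect first.
    measured = {"DRAM": 120.0, "L2": 200.0, "NoC": 285.0, "L1D": 400.0}
    print(scalability_bound(measured))  # ['NoC'] -> interconnect-bound

Under this view, an application for which no component saturates (the function returns an empty list) is exactly the kind of workload whose performance should keep scaling with additional TLP.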
For applications that do not saturate the throughput of any particular resource, performance scales well with increased TLP. To improve TLP for such applications efficiently, we propose a fast context-switching approach: when a warp/thread block (TB) is stalled by a long-latency operation, its context is spilled to spare on-chip resources so that a new warp/TB can be launched, and the switched-out warp/TB is switched back in when another warp/TB completes or is switched out. With this fine-grained fast context switching, higher TLP can be supported without increasing the sizes of critical resources such as the register file. Our experiments show that performance improves by up to 47%, with a geometric mean of 22%, for a set of applications with unsaturated throughput utilization. Compared to the state-of-the-art TLP-improvement scheme, our proposed scheme achieves 12% higher performance on average across all benchmarks, and 16% higher for the unsaturated ones.
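
The switch-out/switch-in rules stated above can be summarized in a few lines. The following toy model is our own illustration of the policy, not the paper's hardware interface; the class, method names, and slot counts are hypothetical:

    # A toy model of the fast context-switching policy from the abstract:
    # a warp/TB stalled on a long-latency operation is parked in spare
    # on-chip storage so another warp/TB can occupy its slot, and it is
    # switched back in when a slot frees up via completion or a later stall.
    from collections import deque

    class FastContextSwitchScheduler:
        def __init__(self, hw_slots, pending_tbs):
            self.hw_slots = hw_slots            # contexts the SM can hold live
            self.pending = deque(pending_tbs)   # TBs not yet launched
            self.spilled = deque()              # contexts parked on-chip
            self.resident = []                  # currently executing warps/TBs

        def fill(self):
            """Fill free slots: switched-out contexts first, then new TBs."""
            while len(self.resident) < self.hw_slots:
                if self.spilled:
                    self.resident.append(self.spilled.popleft())
                elif self.pending:
                    self.resident.append(self.pending.popleft())
                else:
                    break

        def on_long_latency_stall(self, tb):
            """Spill a stalled context so another warp/TB can use its slot."""
            self.resident.remove(tb)
            self.fill()              # an earlier spilled context or a new TB runs
            self.spilled.append(tb)  # the stalled context waits its turn

        def on_complete(self, tb):
            """A completed warp/TB frees a slot for a spilled or new context."""
            self.resident.remove(tb)
            self.fill()

For example, with two hardware slots and three pending TBs, a stall on TB0 launches TB2 into the freed slot, and TB0 switches back in once TB1 completes:

    sched = FastContextSwitchScheduler(hw_slots=2, pending_tbs=["TB0", "TB1", "TB2"])
    sched.fill()                        # resident: TB0, TB1
    sched.on_long_latency_stall("TB0")  # resident: TB1, TB2; TB0 parked on-chip
    sched.on_complete("TB1")            # resident: TB2, TB0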

    Information & Contributors

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 1
    March 2018
    401 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/3199680
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 March 2018
    Accepted: 01 December 2017
    Revised: 01 December 2017
    Received: 01 July 2017
    Published in TACO Volume 15, Issue 1


    Author Tags

    1. GPGPU
    2. TLP
    3. context switching
    4. latency hiding

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Bibliometrics & Citations

    Article Metrics

    • Downloads (Last 12 months): 164
    • Downloads (Last 6 weeks): 25
    Reflects downloads up to 17 Jan 2025

    Cited By

    • (2024) Many-BSP: an analytical performance model for CUDA kernels. Computing 106:5, 1519-1555. DOI: 10.1007/s00607-023-01255-w. Online publication date: 1-May-2024.
    • (2022) POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming. In 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), 490-495. DOI: 10.1109/ICPICS55264.2022.9873685. Online publication date: 29-Jul-2022.
    • (2021) PSSM. In Proceedings of the 35th ACM International Conference on Supercomputing, 139-151. DOI: 10.1145/3447818.3460374. Online publication date: 3-Jun-2021.
    • (2021) A Survey of GPGPU Parallel Processing Architecture Performance Optimization. In 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall), 75-82. DOI: 10.1109/ICISFall51598.2021.9627400. Online publication date: 13-Oct-2021.
    • (2020) A Model-Based Software Solution for Simultaneous Multiple Kernels on GPUs. ACM Transactions on Architecture and Code Optimization 17:1, 1-26. DOI: 10.1145/3377138. Online publication date: 4-Mar-2020.
    • (2019) Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution. ACM Transactions on Architecture and Code Optimization 16:3, 1-27. DOI: 10.1145/3326124. Online publication date: 17-Jun-2019.
    • (2018) Improving Thread-level Parallelism in GPUs Through Expanding Register File to Scratchpad Memory. ACM Transactions on Architecture and Code Optimization 15:4, 1-24. DOI: 10.1145/3280849. Online publication date: 16-Nov-2018.
