research-article

A Lightweight Method for Handling Control Divergence in GPGPUs

Authors:

Li Shen

HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

Pages 120 - 127

https://doi.org/10.1145/3293320.3293331

Published: 14 January 2019

Abstract

Graphics processing units (GPUs) are now widely used for scientific and high-performance acceleration in general-purpose computing, largely because of the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully exploit the advantages of SIMD parallel computing. However, when threads in a warp do not follow the same execution path, control divergence occurs and reduces hardware utilization. To address this problem, warp regrouping methods have been proposed that combine threads executing the same branch path, which can significantly improve thread-level parallelism. However, not all warps can be regrouped effectively, because regrouping may introduce substantial unnecessary overhead, which limits further performance improvement. In this paper, we analyze the sources of this overhead and propose a lightweight warp regrouping method, Partial Warp Regrouping (PWR), which controls the scope of reorganization and avoids most unnecessary warp regrouping by setting thresholds. The method also reduces the complexity of the hardware design. Our experimental results show that this mechanism improves performance by 12% on average and by up to 27% compared with the immediate post-dominator approach.
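The effect the abstract describes can be illustrated with a toy model. The sketch below is not the paper's PWR mechanism, whose details are not given here. It assumes a single two-way branch, a hypothetical warp size of 4, and a hypothetical rule under which a warp joins the regrouping pool only when its minority path holds at least `threshold` threads. All function names are illustrative, and hardware costs such as lane conflicts and regrouping latency are ignored.

```python
from math import ceil

WARP_SIZE = 4  # kept small for readability; this value is an assumption


def baseline_slots(warps):
    """Issue slots without regrouping: one slot per distinct path per warp."""
    return sum(len(set(w)) for w in warps)


def regrouped_slots(warps, threshold):
    """Issue slots under a hypothetical threshold-based regrouping rule."""
    pooled = {True: 0, False: 0}  # threads per path waiting to be repacked
    slots = 0
    for w in warps:
        taken = sum(w)
        minority = min(taken, len(w) - taken)
        if minority >= max(threshold, 1):
            # Divergent enough: hand the warp's threads to the pool.
            pooled[True] += taken
            pooled[False] += len(w) - taken
        else:
            # Uniform or only slightly divergent: execute the warp as-is.
            slots += len(set(w))
    # Pooled threads are repacked into as few warps as possible per path.
    slots += sum(ceil(n / WARP_SIZE) for n in pooled.values())
    return slots


def utilization(total_threads, slots):
    """Fraction of SIMD lanes doing useful work across all issue slots."""
    return total_threads / (slots * WARP_SIZE)


# Each inner list is one warp; True/False is the branch outcome per thread.
warps = [
    [True, True, False, False],
    [False, False, True, True],
    [True, True, True, True],
    [True, False, False, False],
]
threads = sum(len(w) for w in warps)
print(utilization(threads, baseline_slots(warps)))
print(utilization(threads, regrouped_slots(warps, threshold=2)))
print(utilization(threads, regrouped_slots(warps, threshold=1)))
```

In this example, regrouping with `threshold=2` and with `threshold=1` needs the same number of issue slots, so pulling the fourth warp into the pool buys nothing. That is the kind of unnecessary regrouping a threshold is meant to filter out, though the thresholds the paper actually uses may be defined differently.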

Cited By

Cuneo B, Bailey M (2023). Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation, 34:1 (1-25). Online publication date: 19-Oct-2023.
https://dl.acm.org/doi/10.1145/3626957
Wang Y, Chen X, Hu X (2020). Associative Thread Compaction for Efficient Control Flow Handling in GPGPUs. 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (228-233). Online publication date: Jul-2020.
https://doi.org/10.1109/ISVLSI49217.2020.00049

Index Terms

A Lightweight Method for Handling Control Divergence in GPGPUs
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data


Information & Contributors

Information

Published In

HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region

January 2019

143 pages

ISBN:9781450366328

DOI:10.1145/3293320

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Sun Yat-Sen University
CCF: China Computer Federation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

HPC Asia 2019

HPC Asia 2019: International Conference on High Performance Computing in Asia-Pacific Region

January 14 - 16, 2019

Guangzhou, China

Acceptance Rates

HPCAsia '19 Paper Acceptance Rate 15 of 32 submissions, 47%;

Overall Acceptance Rate 69 of 143 submissions, 48%

Bibliometrics & Citations

Bibliometrics

Total Citations: 2
Total Downloads: 73
Downloads (Last 12 months): 11
Downloads (Last 6 weeks): 1

Reflects downloads up to 13 Dec 2024
