DOI: 10.1145/3293320.3293331
research-article

A Lightweight Method for Handling Control Divergence in GPGPUs

Published: 14 January 2019

Abstract

Graphics processing units (GPUs) are now widely used for scientific and high-performance acceleration in general-purpose computing, a use that relies on the SIMT (Single-Instruction, Multiple-Thread) execution model. With SIMT, GPUs can fully exploit SIMD parallelism. However, when threads in a warp do not follow the same execution path, control divergence arises and reduces hardware utilization. To address this problem, warp regrouping methods have been proposed that combine threads executing the same branch path, which can significantly improve thread-level parallelism. However, not all warps can be regrouped effectively, because regrouping may introduce substantial unnecessary overhead that limits further performance improvement. In this paper, we analyze the sources of this overhead and propose a lightweight warp regrouping method, Partial Warp Regrouping (PWR), which controls the scope of reorganization and avoids most unnecessary warp regrouping by setting thresholds. The method also reduces the complexity of the hardware design. Our experimental results show that this mechanism improves performance by 12% on average and by up to 27% compared with the immediate post-dominator approach.
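The effect described in the abstract can be illustrated with a toy issue-slot count. The sketch below is not the paper's PWR mechanism: the abstract does not say what the thresholds measure, so the rule used here (regroup a warp only if its less-populated branch path holds at least `threshold` threads) is a hypothetical stand-in, as are the warp width of 32 and all function names. Regrouping is also idealized, with no lane conflicts or regrouping latency.

```python
from math import ceil

WARP_SIZE = 32  # assumed warp width (threads per warp)


def minority(taken):
    """Threads on the less-populated side of the branch in one warp."""
    return min(taken, WARP_SIZE - taken)


def serialized_issues(warps):
    """Issue slots when every warp runs its own branch paths one after another.

    `warps` lists, per warp, how many threads take the branch.
    A diverged warp needs two slots (one per path); a uniform warp needs one.
    """
    return sum((taken > 0) + (taken < WARP_SIZE) for taken in warps)


def regrouped_issues(warps):
    """Issue slots if threads on the same path are packed into full warps.

    Idealized: ignores lane conflicts and regrouping latency.
    """
    taken = sum(warps)
    not_taken = len(warps) * WARP_SIZE - taken
    return ceil(taken / WARP_SIZE) + ceil(not_taken / WARP_SIZE)


def partial_regroup_issues(warps, threshold):
    """Hypothetical threshold rule (an assumption, not taken from the paper):
    regroup only warps whose minority path holds at least `threshold`
    threads, and leave the remaining warps serialized."""
    chosen = [t for t in warps if minority(t) >= threshold]
    skipped = [t for t in warps if minority(t) < threshold]
    return regrouped_issues(chosen) + serialized_issues(skipped)


warps = [16, 16, 31, 32]  # threads taking the branch in each of four warps
baseline = serialized_issues(warps)
full = regrouped_issues(warps)
partial = partial_regroup_issues(warps, threshold=8)
print(baseline, full, partial)  # 7 5 5
```

For this input, regrouping only the two evenly split warps needs as few issue slots as regrouping all four, which is the intuition behind limiting the scope of regrouping. Real hardware costs, and the paper's actual threshold criteria, are not modeled.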

Cited By

  • (2023) Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation 34:1 (1-25). DOI: 10.1145/3626957. Online publication date: 19-Oct-2023
  • (2020) Associative Thread Compaction for Efficient Control Flow Handling in GPGPUs. 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (228-233). DOI: 10.1109/ISVLSI49217.2020.00049. Online publication date: Jul-2020

    Published In

    HPCAsia '19: Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region
    January 2019
    143 pages
    ISBN:9781450366328
    DOI:10.1145/3293320

    In-Cooperation

    • Sun Yat-Sen University
    • CCF: China Computer Federation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Control Divergence
    2. GPU
    3. SIMD
    4. Threshold
    5. Warp Regrouping

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    HPC Asia 2019

    Acceptance Rates

    HPCAsia '19 Paper Acceptance Rate 15 of 32 submissions, 47%;
    Overall Acceptance Rate 69 of 143 submissions, 48%
