CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Published: 09 June 2012

Abstract

Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
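
The idea of predicting whether a branch benefits from compaction can be illustrated with a small sketch. This is not the paper's implementation: the abstract only states that the predictor is a simple single-level, branch-predictor-inspired structure. Everything below is an assumption made for illustration, including the names `warps_after_compaction`, `compaction_helps`, and `AdequacyPredictor`, the one-bit table entries, the modulo indexing by branch PC, the default prediction, and the rule that a thread must stay in its home SIMD lane when warps are compacted.

```python
from typing import List


def warps_after_compaction(masks: List[int], width: int) -> int:
    """Warps needed for one branch path after compaction.

    masks holds one active-lane bitmask per warp. Assumption: a thread
    cannot change lanes, so the busiest lane bounds the warp count.
    """
    return max(sum((m >> lane) & 1 for m in masks) for lane in range(width))


def compaction_helps(taken: List[int], not_taken: List[int], width: int) -> bool:
    """True if compaction issues fewer warps than plain masked execution."""
    # Without compaction, every warp with at least one active lane on a
    # path is issued once for that path.
    before = sum(1 for m in taken if m) + sum(1 for m in not_taken if m)
    after = (warps_after_compaction(taken, width)
             + warps_after_compaction(not_taken, width))
    return after < before


class AdequacyPredictor:
    """Hypothetical single-level table with one bit per entry.

    Each entry records whether compaction helped the last time a branch
    mapping to that entry diverged.
    """

    def __init__(self, entries: int = 64, default: bool = True) -> None:
        self.entries = entries
        # Assumed default: stall for compaction until history says otherwise.
        self.table = [default] * entries

    def _index(self, pc: int) -> int:
        # Hypothetical indexing: branch PC modulo table size.
        return pc % self.entries

    def predict_stall(self, pc: int) -> bool:
        """Should warps reaching this branch wait to be compacted?"""
        return self.table[self._index(pc)]

    def update(self, pc: int, was_adequate: bool) -> None:
        """Record the observed outcome once the branch has resolved."""
        self.table[self._index(pc)] = was_adequate


if __name__ == "__main__":
    width = 4
    # Two warps whose taken lanes do not overlap: compaction merges them.
    print(compaction_helps([0b0011, 0b1100], [0b1100, 0b0011], width))
    # Two warps that diverge in the same lanes: nothing can be merged.
    print(compaction_helps([0b0011, 0b0011], [0b1100, 0b1100], width))
```

In this sketch, a scheduler would call `predict_stall` when a warp reaches a divergent branch and call `update` with the result of `compaction_helps` after the masks are known, so that branches whose divergence pattern never compacts well stop paying the synchronization cost.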

    Published In

    ACM SIGARCH Computer Architecture News, Volume 40, Issue 3
    ISCA '12
    June 2012
    559 pages
    ISSN: 0163-5964
    DOI: 10.1145/2366231

    ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
    June 2012
    584 pages
    ISBN: 9781450316422
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Cited By

    • (2023) Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation. DOI: 10.1145/3626957. Online publication date: 19-Oct-2023
    • (2023) Reducing branch divergence to speed up parallel execution of unit testing on GPUs. The Journal of Supercomputing 79:16 (18340-18374). DOI: 10.1007/s11227-023-05375-0. Online publication date: 13-May-2023
    • (2021) Characterizing Massively Parallel Polymorphism. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (205-216). DOI: 10.1109/ISPASS51385.2021.00037. Online publication date: Mar-2021
    • (2021) A Survey of GPGPU Parallel Processing Architecture Performance Optimization. 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall) (75-82). DOI: 10.1109/ICISFall51598.2021.9627400. Online publication date: 13-Oct-2021
    • (2019) A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (120-127). DOI: 10.1145/3293320.3293331. Online publication date: 14-Jan-2019
    • (2018) On-GPU Thread-Data Remapping for Branch Divergence Reduction. ACM Transactions on Architecture and Code Optimization 15:3 (1-24). DOI: 10.1145/3242089. Online publication date: 1-Oct-2018
    • (2017) Unleashing the power of GPU for physically-based rendering via dynamic ray shuffling. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (560-573). DOI: 10.1145/3123939.3124532. Online publication date: 14-Oct-2017
    • (2016) MIMD synchronization on SIMT architectures. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-14). DOI: 10.5555/3195638.3195652. Online publication date: 15-Oct-2016
    • (2016) Dynamic Per-Warp Reconvergence Stack for Efficient Control Flow Handling in GPUs. 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (176-181). DOI: 10.1109/ISVLSI.2016.35. Online publication date: Jul-2016
    • (2016) Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs. 2016 45th International Conference on Parallel Processing (ICPP) (179-184). DOI: 10.1109/ICPP.2016.27. Online publication date: Aug-2016
