research-article

CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Authors:

Mattan Erez

ACM SIGARCH Computer Architecture News, Volume 40, Issue 3

Pages 61 - 71

https://doi.org/10.1145/2366231.2337167

Published: 09 June 2012

Abstract

Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
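
The two ideas in the abstract (wasted issue slots under masked execution, and predicting per branch whether compaction is worth a stall) can be illustrated with a small sketch. Everything here is an assumption for illustration rather than the paper's design: the 4-lane `WARP_WIDTH`, the rule that a thread stays in its original lane when compacted, the function and class names, the single history bit per branch PC, and the default prediction for unseen branches.

```python
from typing import Dict, List

WARP_WIDTH = 4  # assumed SIMD width, kept small for readability

Masks = List[List[bool]]  # masks[w][lane] is True if that thread takes the branch


def issue_slots_masked(masks: Masks) -> int:
    """Issue slots without compaction: each warp issues once per path
    on which it has at least one active lane."""
    slots = 0
    for mask in masks:
        if any(mask):
            slots += 1  # taken path
        if not all(mask):
            slots += 1  # not-taken path
    return slots


def issue_slots_compacted(masks: Masks) -> int:
    """Issue slots with lane-preserving compaction (an assumption):
    the number of compacted warps on a path is the largest number of
    active threads found in any single lane."""
    lanes = list(zip(*masks))  # lanes[l] holds the outcome of lane l in every warp
    taken = max(sum(col) for col in lanes)
    not_taken = max(len(col) - sum(col) for col in lanes)
    return taken + not_taken


class CompactionPredictor:
    """Hypothetical single-level table: one history bit per branch PC."""

    def __init__(self) -> None:
        self.history: Dict[int, bool] = {}

    def predict(self, pc: int) -> bool:
        # Assumed default: stall and attempt compaction for unseen branches.
        return self.history.get(pc, True)

    def update(self, pc: int, masks: Masks) -> None:
        # Record whether compaction actually saved issue slots this time.
        self.history[pc] = issue_slots_compacted(masks) < issue_slots_masked(masks)


# Two warps diverging on complementary lanes: compaction halves the issue slots.
complementary = [[True, True, False, False], [False, False, True, True]]
# Two warps diverging on the same lanes: compaction cannot help.
aligned = [[True, True, False, False], [True, True, False, False]]

predictor = CompactionPredictor()
predictor.update(0x40, aligned)
predictor.update(0x80, complementary)
```

In this sketch the branch at the hypothetical PC `0x40` would no longer stall its warps, because its last outcome showed no gain from compaction, while the branch at `0x80` would keep waiting for compaction.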

CAPRI: prediction of compaction-adequacy for handling control-divergence in GPGPU architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures

Published In

ACM SIGARCH Computer Architecture News Volume 40, Issue 3

ISCA '12

June 2012

559 pages

ISSN:0163-5964

DOI:10.1145/2366231

ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture
June 2012
584 pages
ISBN:9781450316422
General Chair: Shih-Lien Lu (Intel)
Program Chair: Josep Torrellas (University of Illinois)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Bibliometrics

Total Citations: 47
Total Downloads: 591
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 1

Reflects downloads up to 13 Dec 2024

Cited By

Cuneo B, Bailey M (2023). Divergence Reduction in Monte Carlo Neutron Transport with On-GPU Asynchronous Scheduling. ACM Transactions on Modeling and Computer Simulation. 10.1145/3626957. Online publication date: 19-Oct-2023.
https://dl.acm.org/doi/10.1145/3626957

Bagies T, Le W, Sheaffer J, Jannesari A (2023). Reducing branch divergence to speed up parallel execution of unit testing on GPUs. The Journal of Supercomputing 79:16 (18340-18374). 10.1007/s11227-023-05375-0. Online publication date: 13-May-2023.
https://dl.acm.org/doi/10.1007/s11227-023-05375-0

Zhang M, Alawneh A, Rogers T (2021). Characterizing Massively Parallel Polymorphism. 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) (205-216). 10.1109/ISPASS51385.2021.00037. Online publication date: Mar-2021.
https://doi.org/10.1109/ISPASS51385.2021.00037

Jia S, Tian Z, Ma Y, Sun C, Zhang Y, Zhang Y (2021). A Survey of GPGPU Parallel Processing Architecture Performance Optimization. 2021 IEEE/ACIS 20th International Fall Conference on Computer and Information Science (ICIS Fall) (75-82). 10.1109/ICISFall51598.2021.9627400. Online publication date: 13-Oct-2021.
https://doi.org/10.1109/ICISFall51598.2021.9627400

Yang Y, Zhang S, Shen L (2019). A Lightweight Method for Handling Control Divergence in GPGPUs. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region (120-127). 10.1145/3293320.3293331. Online publication date: 14-Jan-2019.
https://dl.acm.org/doi/10.1145/3293320.3293331

Lin H, Wang C, Liu H (2018). On-GPU Thread-Data Remapping for Branch Divergence Reduction. ACM Transactions on Architecture and Code Optimization 15:3 (1-24). 10.1145/3242089. Online publication date: 1-Oct-2018.
https://dl.acm.org/doi/10.1145/3242089

Lü Y, Huang L, Shen L, Wang Z, Hunter H, Moreno J, Emer J, Sanchez D (2017). Unleashing the power of GPU for physically-based rendering via dynamic ray shuffling. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (560-573). 10.1145/3123939.3124532. Online publication date: 14-Oct-2017.
https://dl.acm.org/doi/10.1145/3123939.3124532

ElTantawy A, Aamodt T, Hsu W, Yang C, Lipasti M, Lee H (2016). MIMD synchronization on SIMT architectures. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-14). 10.5555/3195638.3195652. Online publication date: 15-Oct-2016.
https://dl.acm.org/doi/10.5555/3195638.3195652

Wang Y, Chen X, Wang D, Liu S (2016). Dynamic Per-Warp Reconvergence Stack for Efficient Control Flow Handling in GPUs. 2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (176-181). 10.1109/ISVLSI.2016.35. Online publication date: Jul-2016.
https://doi.org/10.1109/ISVLSI.2016.35

Yu Z, Eeckhout L, Xu C (2016). Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs. 2016 45th International Conference on Parallel Processing (ICPP) (179-184). 10.1109/ICPP.2016.27. Online publication date: Aug-2016.
https://doi.org/10.1109/ICPP.2016.27
