Taming Control Divergence in GPUs through Control Flow Linearization

Jayvant Anantpur¹⁷ &
Govindarajan R.¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8409))

Included in the following conference series:

International Conference on Compiler Construction

1967 Accesses
12 Citations
3 Altmetric

Abstract

Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack.

In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads.

We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.

Download to read the full chapter text

Chapter PDF

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Article 21 January 2015

Checking Data-Race Freedom of GPU Kernels, Compositionally

Block-Size Independence for GPU Programs

Keywords

References

Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers Principles, Techniques and Tools, 2nd edn. Pearson
Google Scholar
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing CUDA workloads using a detailed GPU simulator. In: ISPASS (2009)
Google Scholar
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.H., Skadron, K.: Rodinia: A Benchmark Suite for Heterogeneous Computing. In: IISWC (2009)
Google Scholar
Che, S., Sheaffer, J.W., Boyer, M., Szafaryn, L.G., Wang, L., Skadron, K.: A Characterization of the Rodinia Benchmark Suite With Comparison to Contemporary CMP Workloads. In: IISWC (2010)
Google Scholar
Carter, L., Ferrante, J., Thomborson, C.: Folklore Confirmed: Reducible Flow Graphs are Exponentially Larger. In: POPL (2003)
Google Scholar
Coutinho, B., Sampaio, D., Pereira, F.M.Q., Meira Jr., W.: Divergence Analysis and Optimizations. In: PACT (2011)
Google Scholar
Diamos, G., Ashbaugh, B., Maiyuran, S., Kerr, A., Wu, J., Yalamanchili, S.: SIMD Re-Convergence At Thread Frontiers. In: MICRO (2011)
Google Scholar
Diamos, G., Kerr, A., Yalamanchili, S., Clark, N.: Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems. In: PACT (2010)
Google Scholar
Fang, Q., Boss, D.A.: Monte Carlo Simulation of Photon Migration in 3D Turbid Media Accelerated by Graphics Processing Units. Optics Express 17(22), 20178–20190 (2009)
Article Google Scholar
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient gpu control flow. In: MICRO (2007)
Google Scholar
Fung, W.W.L., Aamodt, T.M.: Thread block compaction for efficient simt control flow. In: HPCA (2011)
Google Scholar
Han, T.D., Abdelrahman, T.S.: Reducing Divergence in GPGPU Programs with Loop Merging. In: GPGPU (2013)
Google Scholar
OpenCL, http://www.khronos.org/opencl
László, T., Kiss, Á.: Obfuscating C++ programs via control flow flattening. Annales Universitatis Scientarum Budapestinensis de Rolando Ëotv̈os Nominatae, Sectio Computatorica 30, 3–19 (2009)
MATH Google Scholar
Meng, J., Tarjan, D., Skadron, K.: Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In: ISCA (2010)
Google Scholar
Rhu, M., Erez, M.: The Dual-Path Execution Model for Efficient GPU Control Flow. In: HPCA (2013)
Google Scholar
Nvidia. CUDA C Programming Guide (October 2010)
Google Scholar
Nvidia, https://developer.nvidia.com/cuda-toolkit-42-archive
Nvidia, http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
Stratton, J.A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., Hwu, W.W.: Efficient Compilation of Fine-Grained SPMD-threaded Programs for Multicore CPUs. In: CGO (2010)
Google Scholar
Wang, S., Hung, M., Hwang, Y., Ju, R.D., Lee, J.: Pointer Based Divergence Analysis in the SSA Form. In: CPC (2013)
Google Scholar
Wu, H., Diamos, G., Li, S., Yalamanchili, S.: Characterization and Transformation of Unstructured Control Flow in GPU Applications. In: The First International Workshop on Characterizing Applications for Heterogeneous Exascale Systems, CACHES (June 2011)
Google Scholar
Zhang, F., D’Hollander, E.H.: Using hammock graphs to structure programs. IEEE Trans. Softw. Eng., 231–245 (2004)
Google Scholar

Download references

Author information

Authors and Affiliations

Supercomputer Education and Research Centre, Indian Institute of Science, India
Jayvant Anantpur & Govindarajan R.

Authors

Jayvant Anantpur
View author publications
You can also search for this author in PubMed Google Scholar
Govindarajan R.
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Département d’Informatique, INRIA and École Normale Supérieure, 45 rue d’Ulm, 75005, Paris, France
Albert Cohen

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Anantpur, J., R., G. (2014). Taming Control Divergence in GPUs through Control Flow Linearization. In: Cohen, A. (eds) Compiler Construction. CC 2014. Lecture Notes in Computer Science, vol 8409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54807-9_8

Download citation

DOI: https://doi.org/10.1007/978-3-642-54807-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54806-2
Online ISBN: 978-3-642-54807-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Taming Control Divergence in GPUs through Control Flow Linearization

Abstract

Chapter PDF

Similar content being viewed by others

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Checking Data-Race Freedom of GPU Kernels, Compositionally

Block-Size Independence for GPU Programs

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Taming Control Divergence in GPUs through Control Flow Linearization

Abstract

Chapter PDF

Similar content being viewed by others

CUDA-NP: Realizing Nested Thread-Level Parallelism in GPGPU Applications

Checking Data-Race Freedom of GPU Kernels, Compositionally

Block-Size Independence for GPU Programs

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation