Abstract
Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack.
In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads.
We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.
Chapter PDF
Similar content being viewed by others
References
Aho, A.V., Lam, M.S., Sethi, R., Ullman, J.D.: Compilers Principles, Techniques and Tools, 2nd edn. Pearson
Bakhoda, A., Yuan, G., Fung, W., Wong, H., Aamodt, T.: Analyzing CUDA workloads using a detailed GPU simulator. In: ISPASS (2009)
Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.H., Skadron, K.: Rodinia: A Benchmark Suite for Heterogeneous Computing. In: IISWC (2009)
Che, S., Sheaffer, J.W., Boyer, M., Szafaryn, L.G., Wang, L., Skadron, K.: A Characterization of the Rodinia Benchmark Suite With Comparison to Contemporary CMP Workloads. In: IISWC (2010)
Carter, L., Ferrante, J., Thomborson, C.: Folklore Confirmed: Reducible Flow Graphs are Exponentially Larger. In: POPL (2003)
Coutinho, B., Sampaio, D., Pereira, F.M.Q., Meira Jr., W.: Divergence Analysis and Optimizations. In: PACT (2011)
Diamos, G., Ashbaugh, B., Maiyuran, S., Kerr, A., Wu, J., Yalamanchili, S.: SIMD Re-Convergence At Thread Frontiers. In: MICRO (2011)
Diamos, G., Kerr, A., Yalamanchili, S., Clark, N.: Ocelot: A dynamic compiler for bulk-synchronous applications in heterogeneous systems. In: PACT (2010)
Fang, Q., Boss, D.A.: Monte Carlo Simulation of Photon Migration in 3D Turbid Media Accelerated by Graphics Processing Units. Optics Express 17(22), 20178–20190 (2009)
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.M.: Dynamic warp formation and scheduling for efficient gpu control flow. In: MICRO (2007)
Fung, W.W.L., Aamodt, T.M.: Thread block compaction for efficient simt control flow. In: HPCA (2011)
Han, T.D., Abdelrahman, T.S.: Reducing Divergence in GPGPU Programs with Loop Merging. In: GPGPU (2013)
OpenCL, http://www.khronos.org/opencl
László, T., Kiss, Á.: Obfuscating C++ programs via control flow flattening. Annales Universitatis Scientarum Budapestinensis de Rolando Ëotv̈os Nominatae, Sectio Computatorica 30, 3–19 (2009)
Meng, J., Tarjan, D., Skadron, K.: Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In: ISCA (2010)
Rhu, M., Erez, M.: The Dual-Path Execution Model for Efficient GPU Control Flow. In: HPCA (2013)
Nvidia. CUDA C Programming Guide (October 2010)
Nvidia, https://developer.nvidia.com/cuda-toolkit-42-archive
Stratton, J.A., Grover, V., Marathe, J., Aarts, B., Murphy, M., Hu, Z., Hwu, W.W.: Efficient Compilation of Fine-Grained SPMD-threaded Programs for Multicore CPUs. In: CGO (2010)
Wang, S., Hung, M., Hwang, Y., Ju, R.D., Lee, J.: Pointer Based Divergence Analysis in the SSA Form. In: CPC (2013)
Wu, H., Diamos, G., Li, S., Yalamanchili, S.: Characterization and Transformation of Unstructured Control Flow in GPU Applications. In: The First International Workshop on Characterizing Applications for Heterogeneous Exascale Systems, CACHES (June 2011)
Zhang, F., D’Hollander, E.H.: Using hammock graphs to structure programs. IEEE Trans. Softw. Eng., 231–245 (2004)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Anantpur, J., R., G. (2014). Taming Control Divergence in GPUs through Control Flow Linearization. In: Cohen, A. (eds) Compiler Construction. CC 2014. Lecture Notes in Computer Science, vol 8409. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54807-9_8
Download citation
DOI: https://doi.org/10.1007/978-3-642-54807-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54806-2
Online ISBN: 978-3-642-54807-9
eBook Packages: Computer ScienceComputer Science (R0)