Abstract
Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach to parallel reduction that employs loop unrolling, persistent threads, and algebraic expressions. It avoids thread divergence and outperforms the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVidia hardware, under both OpenCL and CUDA, making it portable.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. We also thank the research funding agency FAPEG – Fundação de Amparo à Pesquisa do Estado de Goiás – for the scholarships provided to researchers involved in the project.
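The implementation itself is not reproduced on this page. As a rough sketch of the techniques the abstract and the notes below refer to (persistent threads, loop unrolling via `#pragma unroll`, and a divergence-free tree reduction), consider the CUDA kernel below. The kernel name, the block size `BLOCK`, and the unrolling factor `F` are illustrative assumptions, not values taken from the paper.

```cuda
// Illustrative sketch only -- not the authors' implementation.
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed value)
#define F     4    // unrolling factor (assumed; note 4 reports F = 6 on older hardware)

__global__ void reduce_sum(const float *in, float *out, size_t n) {
    __shared__ float sdata[BLOCK];
    float acc = 0.0f;

    // Persistent threads: a fixed-size grid strides over the entire
    // input, so the launch configuration is independent of n.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t base = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         base < n; base += stride * F) {
        // Loop unrolling: the compiler replicates the body F times.
        #pragma unroll
        for (int k = 0; k < F; ++k) {
            size_t j = base + (size_t)k * stride;
            if (j < n) acc += in[j];
        }
    }
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction with sequential addressing: all threads of a warp
    // take the same branch, avoiding intra-warp divergence.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, sdata[0]);  // one partial sum per block
}
```

A host would zero `*out` (e.g. with `cudaMemset`) and launch a fixed grid, say `reduce_sum<<<120, BLOCK>>>(d_in, d_out, n)`; because of the grid-stride loop, the same grid handles vectors of any length.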
Notes
1. Although associativity holds, mathematically, for numbers in any set, in computational terms the situation is more complicated. The property holds for the integers, for instance, but not for floating-point numbers, due to the inherent imprecision that arises when combining (adding, multiplying, etc.) numbers with different exponents, which leads to the absorption of the lower bits during the combining operation. As an example, the result of \((1.5 + 4^{50} - 4^{50})\) is mathematically the same no matter the order in which the terms are added, whereas the floating-point computation can yield 0 or 1.5, depending on the sequence in which the operations are performed [7, 12, 17] (a short demonstration follows these notes).
2. A pragma is a language directive that provides additional information to the compiler, specifying how to process its input. This information usually goes beyond what can be expressed in the language itself.
3. Reduction operations on vectors with millions of values are common when computing cost functions in some real-life optimization problems. For instance, traffic assignment computation for large urban networks involves such vectors.
4. The ideal unrolling factor strongly depends on the GPU model. In preliminary tests, performed on older hardware not listed here, the highest gains were obtained with F defined as 6.
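To make note 1 concrete, the short host-side program below (ours, not from the paper) evaluates the two orderings of \(1.5 + 4^{50} - 4^{50}\):

```cuda
#include <cmath>
#include <cstdio>

int main() {
    // 4^50 = 2^100; a double near 2^100 has a unit in the last place
    // of 2^48, so adding 1.5 to it is absorbed completely.
    const double big = std::ldexp(1.0, 100);  // 4^50
    std::printf("%g\n", (1.5 + big) - big);   // prints 0:   1.5 is absorbed first
    std::printf("%g\n", 1.5 + (big - big));   // prints 1.5: exact cancellation first
    return 0;
}
```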
References
Billeter, M., Olsson, O., Assarsson, U.: Efficient stream compaction on wide SIMD many-core architectures. In: Proceedings of the Conference on High Performance Graphics 2009, HPG 2009, pp. 159–166. ACM, New York (2009). https://doi.org/10.1145/1572769.1572795
Catanzaro, B.: OpenCL optimization case study: simple reductions, August 2014. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/. Published by Advanced Micro Devices. Accessed 05 Jan 2014
Chakroun, I., Mezmaz, M., Melab, N., Bendjoudi, A.: Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurr. Comput.: Pract. Exp. 25(8), 1121–1136 (2013)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)
Fog, A.: Optimizing subroutines in assembly language: an optimization guide for x86 platforms. Technical University of Denmark (2013)
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.: Dynamic warp formation and scheduling for efficient GPU control flow. In: 2007 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2007, pp. 407–420, December 2007. https://doi.org/10.1109/MICRO.2007.30
Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163
Gupta, K., Stuart, J.A., Owens, J.D.: A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative Parallel Computing (InPar), pp. 1–14. IEEE (2012)
Khronos OpenCL Working Group, et al.: The OpenCL specification, version 1.0, revision 29 (2008)
Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pp. 3:1–3:8. ACM, New York (2011). https://doi.org/10.1145/1964179.1964184
Harris, M.: Optimizing Parallel Reduction in CUDA (2007). https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction. Published by NVidia Corporation. Accessed 10 Sept 2018
Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2002)
Huang, J.C., Leng, T.: Generalized loop-unrolling: a method for program speed-up. In: Proceedings of the IEEE Symposium on Application-Specific Systems and Software Engineering and Technology, pp. 244–248 (1997)
Kiefer, J.C.: Sequential minimax search for a maximum. Proc. Am. Math. Soc. 4, 502–506 (1953)
Luitjens, J.: Faster parallel reductions on Kepler. White Paper, February 2014. http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/. Published by NVidia Inc. Accessed 25 July 2014
Meng, J., Tarjan, D., Skadron, K.: Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News 38(3), 235–246 (2010). https://doi.org/10.1145/1816038.1815992
Muller, J., et al.: Handbook of Floating-Point Arithmetic. Birkhäuser, Boston (2009)
Narasiman, V., Shebanow, M., Lee, C.J., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 308–317. ACM, New York (2011). https://doi.org/10.1145/2155620.2155656
Nasre, R., Burtscher, M., Pingali, K.: Data-driven versus topology-driven irregular computations on GPUs. In: 2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS), pp. 463–474 (2013). https://doi.org/10.1109/IPDPS.2013.28
Parhami, B.: Introduction to Parallel Processing: Algorithms and Architectures. Plenum Series in Computer Science. Plenum Press, London (1999). http://books.google.com/books?id=ekBsZkIYfUgC
Sarkar, V.: Optimized unrolling of nested loops. Int. J. Parallel Program. 29(5), 545–581 (2001). https://doi.org/10.1023/A:1012246031671
Steinberger, M., Kainz, B., Kerbl, B., Hauswiesner, S., Kenzel, M., Schmalstieg, D.: Softshell: dynamic scheduling on GPUs. ACM Trans. Graph. 31(6), 161:1–161:11 (2012). https://doi.org/10.1145/2366145.2366180
Wilt, N.: The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, White Plains (2013)
Zhang, E.Z., Jiang, Y., Guo, Z., Shen, X.: Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS 2010, pp. 115–126. ACM, New York (2010). https://doi.org/10.1145/1810085.1810104
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Jradi, W.A.R., do Nascimento, H.A.D., Martins, W.S. (2020). A GPU-Based Parallel Reduction Implementation. In: Bianchini, C., Osthoff, C., Souza, P., Ferreira, R. (eds) High Performance Computing Systems. WSCAD 2018. Communications in Computer and Information Science, vol 1171. Springer, Cham. https://doi.org/10.1007/978-3-030-41050-6_11
Print ISBN: 978-3-030-41049-0
Online ISBN: 978-3-030-41050-6