Abstract
Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach to parallel reduction that employs loop unrolling, persistent threads, and algebraic expressions. It avoids thread divergence and outperforms the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVidia hardware, under both OpenCL and CUDA, making it portable.
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. We also thank the research funding agency FAPEG – Fundação de Amparo à Pesquisa do Estado de Goiás – for the scholarships provided to researchers involved in the project.
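The implementation itself is not reproduced on this page. As a rough sketch of the techniques the abstract and the notes below refer to (persistent threads, loop unrolling via `#pragma unroll`, and a divergence-free tree reduction), consider the CUDA kernel below. The kernel name, the block size `BLOCK`, and the unrolling factor `F` are illustrative assumptions, not values taken from the paper.

```cuda
// Illustrative sketch only -- not the authors' implementation.
#include <cuda_runtime.h>

#define BLOCK 256  // threads per block (assumed value)
#define F     4    // unrolling factor (assumed; note 4 reports F = 6 on older hardware)

__global__ void reduce_sum(const float *in, float *out, size_t n) {
    __shared__ float sdata[BLOCK];
    float acc = 0.0f;

    // Persistent threads: a fixed-size grid strides over the entire
    // input, so the launch configuration is independent of n.
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t base = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         base < n; base += stride * F) {
        // Loop unrolling: the compiler replicates the body F times.
        #pragma unroll
        for (int k = 0; k < F; ++k) {
            size_t j = base + (size_t)k * stride;
            if (j < n) acc += in[j];
        }
    }
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction with sequential addressing: all threads of a warp
    // take the same branch, avoiding intra-warp divergence.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(out, sdata[0]);  // one partial sum per block
}
```

A host would zero `*out` (e.g. with `cudaMemset`) and launch a fixed grid, say `reduce_sum<<<120, BLOCK>>>(d_in, d_out, n)`; because of the grid-stride loop, the same grid handles vectors of any length.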
Notes
1. Although associativity holds, mathematically, for numbers in any set, in computational terms the situation is more complicated. The property holds for the integers, for instance, but not for floating-point numbers, due to the inherent imprecision that arises when combining (adding, multiplying, etc.) numbers with different exponents, which leads to the absorption of the lower bits during the combining operation. As an example, the result of \((1.5 + 4^{50} - 4^{50})\) is mathematically the same no matter the order in which the terms are added, whereas the floating-point computation can yield 0 or 1.5, depending on the sequence in which the operations are performed [7, 12, 17] (a short demonstration follows these notes).
2. A pragma is a language directive that provides additional information to the compiler, specifying how to process its input. This information usually goes beyond what can be expressed in the language itself.
3. Reduction operations on vectors with millions of values are common when computing cost functions in some real-life optimization problems. For instance, traffic assignment computation for large urban networks involves such vectors.
4. The ideal unrolling factor strongly depends on the GPU model. In preliminary tests, performed on older hardware not listed here, the highest gains were obtained with F defined as 6.
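To make note 1 concrete, the short host-side program below (ours, not from the paper) evaluates the two orderings of \(1.5 + 4^{50} - 4^{50}\):

```cuda
#include <cmath>
#include <cstdio>

int main() {
    // 4^50 = 2^100; a double near 2^100 has a unit in the last place
    // of 2^48, so adding 1.5 to it is absorbed completely.
    const double big = std::ldexp(1.0, 100);  // 4^50
    std::printf("%g\n", (1.5 + big) - big);   // prints 0:   1.5 is absorbed first
    std::printf("%g\n", 1.5 + (big - big));   // prints 1.5: exact cancellation first
    return 0;
}
```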
References
Billeter, M., Olsson, O., Assarsson, U.: Efficient stream compaction on wide SIMD many-core architectures. In: Proceedings of the Conference on High Performance Graphics 2009, HPG 2009, pp. 159–166. ACM, New York (2009). https://doi.org/10.1145/1572769.1572795
Catanzaro, B.: OpenCL optimization case study: simple reductions, August 2014. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/. Published by Advanced Micro Devices. Accessed 05 Jan 2014
Chakroun, I., Mezmaz, M., Melab, N., Bendjoudi, A.: Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurr. Comput.: Pract. Exp. 25(8), 1121–1136 (2013)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)
Fog, A.: Optimizing subroutines in assembly language: an optimization guide for x86 platforms. Technical University of Denmark (2013)
Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.: Dynamic warp formation and scheduling for efficient GPU control flow. In: 2007 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2007, pp. 407–420, December 2007. https://doi.org/10.1109/MICRO.2007.30
Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163
Gupta, K., Stuart, J.A., Owens, J.D.: A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative Parallel Computing (InPar), pp. 1–14. IEEE (2012)
Khronos OpenCL Working Group, et al.: The OpenCL specification, version 1.0, revision 29 (2008)
Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pp. 3:1–3:8. ACM, New York (2011). https://doi.org/10.1145/1964179.1964184
Harris, M.: Optimizing Parallel Reduction in CUDA (2007). https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction. Published by NVidia Corporation. Accessed 10 Sept 2018
Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2002)
Huang, J.C., Leng, T.: Generalized loop-unrolling: a method for program speed-up. In: Proceedings of the IEEE Symposium on Application-Specific Systems and Software Engineering and Technology, pp. 244–248 (1997)
Kiefer, J.C.: Sequential minimax search for a maximum. Proc. Am. Math. Soc. 4, 502–506 (1953)
Luitjens, J.: Faster parallel reductions on Kepler. White Paper, February 2014. http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/. Published by NVidia Inc. Accessed 25 July 2014
Meng, J., Tarjan, D., Skadron, K.: Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News 38(3), 235–246 (2010). https://doi.org/10.1145/1816038.1815992
Muller, J., et al.: Handbook of Floating-Point Arithmetic. Birkhäuser, Boston (2009)
Narasiman, V., Shebanow, M., Lee, C.J., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 308–317. ACM, New York (2011). https://doi.org/10.1145/2155620.2155656
Nasre, R., Burtscher, M., Pingali, K.: Data-driven versus topology-driven irregular computations on GPUs. In: 2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS), pp. 463–474 (2013). https://doi.org/10.1109/IPDPS.2013.28
Parhami, B.: Introduction to Parallel Processing: Algorithms and Architectures. Plenum Series in Computer Science. Plenum Press, London (1999). http://books.google.com/books?id=ekBsZkIYfUgC
Sarkar, V.: Optimized unrolling of nested loops. Int. J. Parallel Program. 29(5), 545–581 (2001). https://doi.org/10.1023/A:1012246031671
Steinberger, M., Kainz, B., Kerbl, B., Hauswiesner, S., Kenzel, M., Schmalstieg, D.: Softshell: dynamic scheduling on GPUs. ACM Trans. Graph. 31(6), 161:1–161:11 (2012). https://doi.org/10.1145/2366145.2366180
Wilt, N.: The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, White Plains (2013)
Zhang, E.Z., Jiang, Y., Guo, Z., Shen, X.: Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS 2010, pp. 115–126. ACM, New York (2010). https://doi.org/10.1145/1810085.1810104
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Jradi, W.A.R., do Nascimento, H.A.D., Martins, W.S. (2020). A GPU-Based Parallel Reduction Implementation. In: Bianchini, C., Osthoff, C., Souza, P., Ferreira, R. (eds) High Performance Computing Systems. WSCAD 2018. Communications in Computer and Information Science, vol 1171. Springer, Cham. https://doi.org/10.1007/978-3-030-41050-6_11
Print ISBN: 978-3-030-41049-0
Online ISBN: 978-3-030-41050-6