A GPU-Based Parallel Reduction Implementation

  • Conference paper
High Performance Computing Systems (WSCAD 2018)

Abstract

Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach to parallel reduction that employs techniques such as loop unrolling, persistent threads, and algebraic expressions. The approach avoids thread divergence and is able to outperform the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVIDIA hardware, under both OpenCL and CUDA, making it portable.
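Since the chapter itself is paywalled here, the sketch below is not the authors' implementation; it is only a minimal CUDA illustration of how the techniques named in the abstract typically fit together: a fixed, persistent grid of threads sweeps the input with a grid-stride loop, and each block then combines its partial sums with warp shuffles, so every lane executes the same instructions and no divergent branch appears in the hot path. The kernel name reduceSum, the launch geometry, and the use of __shfl_down_sync are illustrative assumptions.

    #include <cuda_runtime.h>

    // Minimal grid-stride sum reduction (a sketch, not the paper's code).
    // Assumes blockDim.x is a multiple of warpSize and that *out is
    // zeroed before launch.
    __global__ void reduceSum(const float *in, float *out, size_t n) {
        float sum = 0.0f;
        // Persistent-thread style: a fixed grid strides over the input,
        // so each thread accumulates many elements sequentially.
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += (size_t)gridDim.x * blockDim.x)
            sum += in[i];

        // Warp-level tree reduction via shuffles: every lane runs every
        // step, so there is no divergent branching.
        for (int offset = warpSize / 2; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);

        // Lane 0 of each warp parks its partial sum in shared memory.
        __shared__ float warpSums[32];
        int lane = threadIdx.x % warpSize;
        int warp = threadIdx.x / warpSize;
        if (lane == 0) warpSums[warp] = sum;
        __syncthreads();

        // The first warp reduces the per-warp partials; one atomic per
        // block folds the block results into *out.
        if (warp == 0) {
            int nWarps = blockDim.x / warpSize;
            sum = (lane < nWarps) ? warpSums[lane] : 0.0f;
            for (int offset = warpSize / 2; offset > 0; offset >>= 1)
                sum += __shfl_down_sync(0xffffffff, sum, offset);
            if (lane == 0) atomicAdd(out, sum);
        }
    }

A host would zero the output and launch something like reduceSum<<<256, 256>>>(d_in, d_out, n); because the grid size is fixed rather than proportional to n, the same kernel serves any input length.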

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – "Finance Code 001". We also thank the research funding agency FAPEG (Fundação de Amparo à Pesquisa do Estado de Goiás) for the scholarships provided to researchers involved in the project.


Notes

  1. Although associativity holds mathematically for numbers in any set, in computational terms the situation is more complicated. The property holds for the integers, but not for floating-point numbers, because of the inherent imprecision that arises when combining (adding, multiplying, etc.) numbers with different exponents: the lower-order bits of the smaller operand are absorbed during the operation. For example, mathematically the result of \((1.5 + 4^{50} - 4^{50})\) is the same no matter the order in which the terms are added, whereas the floating-point computation can yield 0 or 1.5 depending on the sequence in which the operations are performed [7, 12, 17]. A short host-side demonstration appears after these notes.

  2. A pragma is a language directive that provides additional information to the compiler, specifying how to process its input. This information usually goes beyond what can be expressed in the language itself; the unrolling sketch after these notes shows a typical use.

  3. Reduction operations on vectors with millions of values are common when computing cost functions in some real-life optimization problems. For instance, traffic assignment computation for large urban networks involves such vectors.

  4. The ideal unrolling factor strongly depends on the GPU model. In preliminary tests, performed on older hardware not listed here, the highest gains were obtained with F set to 6 (see the unrolling sketch after these notes).
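The numeric example from note 1 can be reproduced directly on the host. The sketch below (plain C, also valid as CUDA host code) uses ldexp to build \(4^{50} = 2^{100}\) exactly, so the only rounding is in the additions themselves.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double big = ldexp(1.0, 100);  /* 4^50 = 2^100, exactly representable */
        /* Left to right: 1.5 is absorbed when added to 2^100, result is 0. */
        printf("(1.5 + 4^50) - 4^50 = %g\n", (1.5 + big) - big);
        /* Reordered: the large terms cancel first, result is 1.5. */
        printf("1.5 + (4^50 - 4^50) = %g\n", 1.5 + (big - big));
        return 0;
    }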
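Regarding notes 2 and 4, the sketch below shows one common way to combine a pragma with a compile-time unrolling factor F in a CUDA reduction kernel. The kernel name, the template parameter, and the per-thread partial-sum output are illustrative assumptions, not the paper's code.

    // F is a compile-time unrolling factor (note 4 reports F = 6 as best
    // on older hardware). "#pragma unroll" asks the compiler to replicate
    // the loop body F times, removing per-iteration loop overhead.
    template <int F>
    __global__ void reduceUnrolled(const float *in, float *partial, size_t n) {
        size_t tid    = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        size_t stride = (size_t)gridDim.x * blockDim.x;
        float sum = 0.0f;
        size_t i = tid;
        // Main loop: each iteration consumes F strided elements.
        while (i + (size_t)(F - 1) * stride < n) {
            #pragma unroll
            for (int k = 0; k < F; ++k)
                sum += in[i + (size_t)k * stride];
            i += (size_t)F * stride;
        }
        // Remainder loop for the final, partially filled chunk.
        for (; i < n; i += stride)
            sum += in[i];
        // One partial sum per launched thread; a second pass (or a
        // shuffle-based kernel) reduces the partial array.
        partial[tid] = sum;
    }
    // e.g. reduceUnrolled<6><<<blocks, threads>>>(d_in, d_partial, n);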

References

  1. Billeter, M., Olsson, O., Assarsson, U.: Efficient stream compaction on wide SIMD many-core architectures. In: Proceedings of the Conference on High Performance Graphics 2009, HPG 2009, pp. 159–166. ACM, New York (2009). https://doi.org/10.1145/1572769.1572795

  2. Catanzaro, B.: OpenCL optimization case study: simple reductions, August 2014. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/. Published by Advanced Micro Devices. Accessed 05 Jan 2014

  3. Chakroun, I., Mezmaz, M., Melab, N., Bendjoudi, A.: Reducing thread divergence in a GPU-accelerated branch-and-bound algorithm. Concurr. Comput.: Pract. Exp. 25(8), 1121–1136 (2013)

  4. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. The MIT Press, Cambridge (2009)

  5. Fog, A.: Optimizing subroutines in assembly language: an optimization guide for x86 platforms. Technical University of Denmark (2013)

  6. Fung, W.W.L., Sham, I., Yuan, G., Aamodt, T.: Dynamic warp formation and scheduling for efficient GPU control flow. In: 2007 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2007, pp. 407–420, December 2007. https://doi.org/10.1109/MICRO.2007.30

  7. Goldberg, D.: What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23(1), 5–48 (1991). https://doi.org/10.1145/103162.103163

  8. Khronos OpenCL Working Group et al.: The OpenCL Specification, Version 1.0, Revision 29 (2008)

  9. Gupta, K., Stuart, J.A., Owens, J.D.: A study of persistent threads style GPU programming for GPGPU workloads. In: Innovative Parallel Computing (InPar), pp. 1–14. IEEE (2012)

  10. Han, T.D., Abdelrahman, T.S.: Reducing branch divergence in GPU programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pp. 3:1–3:8. ACM, New York (2011). https://doi.org/10.1145/1964179.1964184

  11. Harris, M.: Optimizing Parallel Reduction in CUDA (2007). https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction. Published by NVidia Corporation. Accessed 10 Sept 2018

  12. Higham, N.: Accuracy and Stability of Numerical Algorithms, 2nd edn. Society for Industrial and Applied Mathematics, Philadelphia (2002)

  13. Huang, J.C., Leng, T.: Generalized loop-unrolling: a method for program speed-up. In: Proceedings of the IEEE Symposium on Application-Specific Systems and Software Engineering and Technology, pp. 244–248 (1997)

  14. Kiefer, J.C.: Sequential minimax search for a maximum. Proc. Am. Math. Soc. 4, 502–506 (1953)

  15. Luitjens, J.: Faster parallel reductions on Kepler. White paper, February 2014. http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/. Published by NVIDIA Inc. Accessed 25 July 2014

  16. Meng, J., Tarjan, D., Skadron, K.: Dynamic warp subdivision for integrated branch and memory divergence tolerance. SIGARCH Comput. Archit. News 38(3), 235–246 (2010). https://doi.org/10.1145/1816038.1815992

  17. Muller, J., et al.: Handbook of Floating-Point Arithmetic. Birkhäuser, Boston (2009)

  18. Narasiman, V., Shebanow, M., Lee, C.J., Miftakhutdinov, R., Mutlu, O., Patt, Y.N.: Improving GPU performance via large warps and two-level warp scheduling. In: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, pp. 308–317. ACM, New York (2011). https://doi.org/10.1145/2155620.2155656

  19. Nasre, R., Burtscher, M., Pingali, K.: Data-driven versus topology-driven irregular computations on GPUs. In: 2013 IEEE 27th International Symposium on Parallel Distributed Processing (IPDPS), pp. 463–474 (2013). https://doi.org/10.1109/IPDPS.2013.28

  20. Parhami, B.: Introduction to Parallel Processing Algorithms and Architectures. Plenum Series in Computer Science. Plenum Press, London (1999). http://books.google.com/books?id=ekBsZkIYfUgC

  21. Sarkar, V.: Optimized unrolling of nested loops. Int. J. Parallel Program. 29(5), 545–581 (2001). https://doi.org/10.1023/A:1012246031671

  22. Steinberger, M., Kainz, B., Kerbl, B., Hauswiesner, S., Kenzel, M., Schmalstieg, D.: Softshell: dynamic scheduling on GPUs. ACM Trans. Graph. 31(6), 161:1–161:11 (2012). https://doi.org/10.1145/2366145.2366180

  23. Wilt, N.: The CUDA Handbook: A Comprehensive Guide to GPU Programming. Pearson Education, White Plains (2013)

  24. Zhang, E.Z., Jiang, Y., Guo, Z., Shen, X.: Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping. In: Proceedings of the 24th ACM International Conference on Supercomputing, ICS 2010, pp. 115–126. ACM, New York (2010). https://doi.org/10.1145/1810085.1810104


Author information

Corresponding author

Correspondence to Walid Abdala Rfaei Jradi.



Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jradi, W.A.R., do Nascimento, H.A.D., Martins, W.S. (2020). A GPU-Based Parallel Reduction Implementation. In: Bianchini, C., Osthoff, C., Souza, P., Ferreira, R. (eds) High Performance Computing Systems. WSCAD 2018. Communications in Computer and Information Science, vol 1171. Springer, Cham. https://doi.org/10.1007/978-3-030-41050-6_11

  • DOI: https://doi.org/10.1007/978-3-030-41050-6_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-41049-0

  • Online ISBN: 978-3-030-41050-6

  • eBook Packages: Computer Science, Computer Science (R0)
