[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2908080.2908089acmconferencesArticle/Chapter ViewAbstractPublication PagespldiConference Proceedingsconference-collections
research-article

Higher-order and tuple-based massively-parallel prefix sums

Published: 02 June 2016 Publication History

Abstract

Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.

References

[1]
G.E. Blelloch. “Scans as Primitive Parallel Operations.” IEEE Transactions on Computers, C-38(ll):1526-1538, 1989.
[2]
G.E. Blelloch. “Prefix Sums and Their Applications.” In John H. Reif (Ed.), Synthesis of Parallel Algorithms, Morgan Kaufmann, 1990.
[3]
S. Chatterjee, G.E. Blelloch, and M. Zagha. “Scan primitives for vector computers.” Proceedings of the 1990 Conference on Supercomputing, pp. 666–675, 1990.
[4]
G. Chaurasia, J.R. Kelley, S. Paris, G. Drettakis, and F. Durand. “Compiling High Performance Recursive Filters.” Proceedings of the 7th Conference on High-Performance Graphics, pp 85–94, 2015.
[5]
CUB: https://github.com/NVlabs/cub
[6]
CUDPP: https://github.com/cudpp
[7]
Y. Dotsenko, N.K. Govindaraju, P.-P. Sloan, C. Boyd, and J. Manferdelli. “Fast scan algorithms on graphics processors.” Proceedings of the 22nd Annual Int. Conference on Supercomputing, pp. 205–213, 2008.
[8]
G. Gautam and S. Rajopadhye. “Simplifying Reductions.” Proceedings of the 33rd ACM SIGPLANSIGACT Symposium on Principles of Programming Languages, pp. 30–41, 2006.
[9]
A. Greß, M. Guthe, and R. Klein. “GPU-based Collision Detection for Deformable Parameterized Surfaces.” Computer Graphics Forum 25, 2006.
[10]
A. Greß and G. Zachmann. “GPUABiSort: Optimal Parallel Sorting on Stream Architectures.” Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, 2006.
[11]
K. Gupta, J.A. Stuart, and J.D. Owens. “A Study of Persistent Threads Style GPU Programming for GPGPU Workloads.” Proceedings of Innovative Parallel Computing, 2012.
[12]
M. Harris, S. Sengupta, and J.D. Owens, “Parallel prefix sum (scan) with CUDA.” GPU Gems 3, 2007.
[13]
J. Hensley, T. Scheuermann, G. Coombe, M. Singh, and A. Lastra. “Fast summed-area table generation and its applications.” Computer Graphics Forum, 24(3):547– 555, 2005.
[14]
W.D. Hillis and G.L. Steele Jr. “Data Parallel Algorithms.” Communications of the ACM: 29(12), pp. 1170–1183. 1986.
[15]
D. Horn. “Stream reduction operations for GPGPU applications.” In M. Pharr (Ed.), GPU Gems 2, chapter 36, pp. 573–589. Addison Wesley, 2005.
[16]
K.E. Iverson. “A Programming Language.” Wiley, 1962.
[17]
R.E. Ladner and M.J. Fischer. “Parallel prefix computation.” Journal of the ACM, 27(4):831–838, 1980.
[18]
D. Merrill and M. Garland. “Single-pass Parallel Prefix Scan with Decoupled Look-back.” NVIDIA Technical Report NVR-2016-002, NVIDIA Corporation. 2016.
[19]
B. Merry. “A performance comparison of sort and scan libraries for GPUs.” World Scientific Publishing Company, 2014.
[20]
MGPU: http://nvlabs.github.io/moderngpu/
[21]
D. Nehab, A. Maximo, R. Lima, and H. Hoppe. “GPUefficient Recursive Filtering and Summed-area Tables.” ACM Transactions on Graphics (SIGGRAPH Asia), 30:6, 2011.
[22]
S. Sengupta, M. Harris, and M. Garland. “Efficient parallel scan algorithms for GPUs.” In NVIDIA, Santa Clara, CA, 2008 - gpucomputing.net.
[23]
S. Sengupta, M. Harris, M. Garland, and J.D. Owens. “Efficient Parallel Scan Algorithms for many-core GPUs”. In J. Kurzak, D.A. Bader, and J. Dongarra (Eds.), Scientific Computing with Multicore and Accelerators, Chapman & Hall/CRC Computational Science, chapter 19, pp. 413–442, 2011.
[24]
S. Sengupta, M. Harris, Y. Zhang, and J.D. Owens. “Scan primitives for GPU computing.” Graphics Hardware 2007, pp. 97–106, 2007.
[25]
S. Sengupta, A.E. Lefohn, and J.D. Owens. “A Work-Efficient Step-Efficient Prefix Sum Algorithm.” Proceedings of the Workshop on Edge Computing Using New Commodity Architectures, pp. D-26–27, 2006.
[26]
Thrust: https://developer.nvidia.com/thrust
[27]
S. Yan, G. Long, and Y. Zhang. “StreamScan: Fast Scan Algorithms for GPUs without Global Barrier Synchronization.” Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 229–238, 2013.

Cited By

View all
  • (2022)Fiblets for Real‐Time Rendering of Massive Brain TractogramsComputer Graphics Forum10.1111/cgf.1448641:2(447-460)Online publication date: 24-May-2022
  • (2022)Work-Stealing Prefix Scan: Addressing Load Imbalance in Large-Scale Image RegistrationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.309523033:3(523-535)Online publication date: 1-Mar-2022
  • (2022)An Efficient Framework for Load Balancing using MapReduce Algorithm for Bigdata2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC)10.1109/ICAAIC53929.2022.9792840(791-794)Online publication date: 9-May-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2016
726 pages
ISBN:9781450342612
DOI:10.1145/2908080
  • General Chair:
  • Chandra Krintz,
  • Program Chair:
  • Emery Berger
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 51, Issue 6
    PLDI '16
    June 2016
    726 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2980983
    • Editor:
    • Andy Gill
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPU programming
  2. Higher-order prefix sums
  3. carry propagation
  4. tuple-based prefix sums

Qualifiers

  • Research-article

Conference

PLDI '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)30
  • Downloads (Last 6 weeks)8
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)Fiblets for Real‐Time Rendering of Massive Brain TractogramsComputer Graphics Forum10.1111/cgf.1448641:2(447-460)Online publication date: 24-May-2022
  • (2022)Work-Stealing Prefix Scan: Addressing Load Imbalance in Large-Scale Image RegistrationIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.309523033:3(523-535)Online publication date: 1-Mar-2022
  • (2022)An Efficient Framework for Load Balancing using MapReduce Algorithm for Bigdata2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC)10.1109/ICAAIC53929.2022.9792840(791-794)Online publication date: 9-May-2022
  • (2021)Specifying and testing GPU workgroup progress modelsProceedings of the ACM on Programming Languages10.1145/34855085:OOPSLA(1-30)Online publication date: 15-Oct-2021
  • (2021)GPU efficient 1D and 3D recursive filteringDigital Signal Processing10.1016/j.dsp.2021.103076114(103076)Online publication date: Jul-2021
  • (2020)Scaling out speculative execution of finite-state machines with parallel mergeProceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming10.1145/3332466.3374524(160-172)Online publication date: 19-Feb-2020
  • (2019)Using Butterfly-patterned Partial Sums to Draw from Discrete DistributionsACM Transactions on Parallel Computing10.1145/33656626:4(1-30)Online publication date: 19-Nov-2019
  • (2019)Accelerating reduction and scan using tensor core unitsProceedings of the ACM International Conference on Supercomputing10.1145/3330345.3331057(46-57)Online publication date: 26-Jun-2019
  • (2018)Automatic Hierarchical Parallelization of Linear RecurrencesACM SIGPLAN Notices10.1145/3296957.317316853:2(128-138)Online publication date: 19-Mar-2018
  • (2018)Automatic Hierarchical Parallelization of Linear RecurrencesProceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3173162.3173168(128-138)Online publication date: 19-Mar-2018
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media