
research-article

Higher-order and tuple-based massively-parallel prefix sums

Authors:

Sepideh Maleki,

Martin Burtscher

PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

Pages 539 - 552

https://doi.org/10.1145/2908080.2908089

Published: 02 June 2016

Abstract

Prefix sums are an important parallel primitive, especially in massively-parallel programs. This paper discusses two orthogonal generalizations thereof, which we call higher-order and tuple-based prefix sums. Moreover, it describes and evaluates SAM, a GPU-friendly algorithm for computing prefix sums and other scans that directly supports higher orders and tuple values. Its templated CUDA implementation unifies all of these computations in a single 100-statement kernel. SAM is communication-efficient in the sense that it minimizes main-memory accesses. When computing prefix sums of a million or more values, it outperforms Thrust and CUDPP on both a Titan X and a K40 GPU. On the Titan X, SAM reaches memory-copy speeds for large input sizes, which cannot be surpassed. SAM outperforms CUB, the currently fastest conventional prefix sum implementation, by up to a factor of 2.9 on eighth-order prefix sums and by up to a factor of 2.6 on eight-tuple prefix sums.
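The two generalizations named in the abstract can be illustrated with a small sequential reference. This is only a sketch under stated assumptions, not the SAM kernel: it assumes that an order-k prefix sum means applying an inclusive prefix sum k times, and that an s-tuple prefix sum treats the input as interleaved s-tuples whose components are summed independently. The function names `tuple_prefix_sum` and `higher_order_prefix_sum` are hypothetical and do not come from the paper.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assumed definition: inclusive prefix sum over interleaved tuples.
// Element i accumulates element i - tuple_size, so each tuple component
// is summed independently. tuple_size == 1 is the conventional prefix sum.
std::vector<long long> tuple_prefix_sum(std::vector<long long> v,
                                        std::size_t tuple_size) {
    if (tuple_size == 0) {
        return v;  // a zero-width tuple is not meaningful; return input unchanged
    }
    for (std::size_t i = tuple_size; i < v.size(); ++i) {
        v[i] += v[i - tuple_size];
    }
    return v;
}

// Assumed definition: an order-k prefix sum applies the prefix sum k times.
// Because the two generalizations are orthogonal, the order and the tuple
// size can be combined freely.
std::vector<long long> higher_order_prefix_sum(std::vector<long long> v,
                                               std::size_t order,
                                               std::size_t tuple_size = 1) {
    for (std::size_t pass = 0; pass < order; ++pass) {
        v = tuple_prefix_sum(std::move(v), tuple_size);
    }
    return v;
}
```

For example, under these assumptions `higher_order_prefix_sum({1, 1, 1, 1}, 2)` yields `{1, 3, 6, 10}`, and `tuple_prefix_sum({1, 10, 2, 20, 3, 30}, 2)` yields `{1, 10, 3, 30, 6, 60}`. This naive version makes one pass over the data per order, whereas the abstract describes SAM as minimizing main-memory accesses while supporting higher orders directly.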


Cited By

Schertzer J, Mercier C, Rousseau S, Boubekeur T (2022). Fiblets for Real‐Time Rendering of Massive Brain Tractograms. Computer Graphics Forum 41:2 (447-460). Online publication date: 24-May-2022
https://doi.org/10.1111/cgf.14486
Copik M, Grosser T, Hoefler T, Bientinesi P, Berkels B (2022). Work-Stealing Prefix Scan: Addressing Load Imbalance in Large-Scale Image Registration. IEEE Transactions on Parallel and Distributed Systems 33:3 (523-535). Online publication date: 1-Mar-2022
https://doi.org/10.1109/TPDS.2021.3095230
Chandra V, Sai P, Mandapati S (2022). An Efficient Framework for Load Balancing using MapReduce Algorithm for Bigdata. 2022 International Conference on Applied Artificial Intelligence and Computing (ICAAIC) (791-794). Online publication date: 9-May-2022
https://doi.org/10.1109/ICAAIC53929.2022.9792840


Information & Contributors

Information

Published In


PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

June 2016

726 pages

ISBN:9781450342612

DOI:10.1145/2908080

General Chair: Chandra Krintz, University of California at Santa Barbara, USA
Program Chair: Emery Berger, University of Massachusetts at Amherst, USA

ACM SIGPLAN Notices Volume 51, Issue 6
PLDI '16
June 2016
726 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2980983
Editor: Andy Gill, University of Kansas, Lawrence, KS

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

PLDI '16

Sponsor: SIGPLAN

PLDI '16: ACM SIGPLAN Conference on Programming Language Design and Implementation

June 13 - 17, 2016

Santa Barbara, CA, USA

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%

Bibliometrics

Total Citations: 17
Total Downloads: 492
Downloads (Last 12 months): 30
Downloads (Last 6 weeks): 8

Reflects downloads up to 11 Dec 2024

Citations

Cited By

Sorensen T, Salvador L, Raval H, Evrard H, Wickerson J, Martonosi M, Donaldson A (2021). Specifying and testing GPU workgroup progress models. Proceedings of the ACM on Programming Languages 5:OOPSLA (1-30). Online publication date: 15-Oct-2021
https://dl.acm.org/doi/10.1145/3485508
Maximo A (2021). GPU efficient 1D and 3D recursive filtering. Digital Signal Processing 114 (103076). Online publication date: Jul-2021
https://doi.org/10.1016/j.dsp.2021.103076
Xia Y, Jiang P, Agrawal G, Gupta R, Shen X (2020). Scaling out speculative execution of finite-state machines with parallel merge. Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (160-172). Online publication date: 19-Feb-2020
https://dl.acm.org/doi/10.1145/3332466.3374524
Jr. G, Tristan J (2019). Using Butterfly-patterned Partial Sums to Draw from Discrete Distributions. ACM Transactions on Parallel Computing 6:4 (1-30). Online publication date: 19-Nov-2019
https://dl.acm.org/doi/10.1145/3365662
Dakkak A, Li C, Xiong J, Gelado I, Hwu W, Eigenmann R, Ding C, McKee S (2019). Accelerating reduction and scan using tensor core units. Proceedings of the ACM International Conference on Supercomputing (46-57). Online publication date: 26-Jun-2019
https://dl.acm.org/doi/10.1145/3330345.3331057
Maleki S, Burtscher M (2018). Automatic Hierarchical Parallelization of Linear Recurrences. ACM SIGPLAN Notices 53:2 (128-138). Online publication date: 19-Mar-2018
https://dl.acm.org/doi/10.1145/3296957.3173168
Maleki S, Burtscher M, Shen X, Tuck J, Bianchini R, Sarkar V (2018). Automatic Hierarchical Parallelization of Linear Recurrences. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (128-138). Online publication date: 19-Mar-2018
https://dl.acm.org/doi/10.1145/3173162.3173168
