Article

Software Pipelined Execution of Stream Programs on GPUs

Authors:

Abhishek Udupa,

R. Govindarajan,

Matthew J. Thazhuthaveetil

CGO '09: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization

Pages 200 - 209

https://doi.org/10.1109/CGO.2009.20

Published: 22 March 2009

Abstract

The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general-purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. StreamIt graphs describe task, data, and pipeline parallelism, which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem (both the scheduling and the assignment of filters to processors) as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs that facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in the GPU, to exploit data parallelism, and the multiprocessors, to exploit task and pipeline parallelism. Further, it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single-threaded CPU.
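
As a rough illustration of the assignment half of this formulation, the sketch below enumerates every filter-to-processor mapping and keeps the one whose busiest processor has the least work. That maximum load is a lower bound on the initiation interval of a software-pipelined steady state. This is a brute-force stand-in, not the paper's ILP: the function name `best_assignment`, the per-filter work values, and the two-processor setup are hypothetical, and dependences, synchronization cost, and bandwidth limits are all ignored.

```python
from itertools import product


def best_assignment(work, num_procs):
    """Exhaustively search filter-to-processor assignments.

    work[i] is the (hypothetical) steady-state work of filter i.
    Returns (ii, assignment), where ii is the load of the busiest
    processor and assignment[i] is the processor chosen for filter i.
    """
    best_ii, best_assign = None, None
    for assign in product(range(num_procs), repeat=len(work)):
        # Accumulate the total work placed on each processor.
        load = [0] * num_procs
        for filt, proc in enumerate(assign):
            load[proc] += work[filt]
        ii = max(load)  # the busiest processor bounds the initiation interval
        if best_ii is None or ii < best_ii:
            best_ii, best_assign = ii, assign
    return best_ii, best_assign


# Hypothetical example: four filters mapped onto two processors.
work = [4, 3, 2, 1]
ii, assign = best_assignment(work, 2)
print(ii, assign)
```

The search is exponential in the number of filters, so it only works for toy graphs. An ILP solver, as used in the paper, explores the same space with constraints for dependences and resources instead of enumerating it.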

Index Terms

Software Pipelined Execution of Stream Programs on GPUs
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Information

Published In

CGO '09: Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization

March 2009, 299 pages

ISBN: 9780769535760

Publisher

IEEE Computer Society, United States

Conference

CGO '09: 7th Annual IEEE/ACM International Symposium on Code Generation and Optimization, March 22 - 25, 2009

Acceptance Rates

CGO '09 Paper Acceptance Rate: 26 of 70 submissions, 37%

Overall Acceptance Rate: 312 of 1,061 submissions, 29%

Bibliometrics

Total Citations: 52

Total Downloads: 626

Downloads (Last 12 months): 5

Downloads (Last 6 weeks): 0

Reflects downloads up to 30 Dec 2024
Cited By

Gurdeep Singh R, Scholliers C (2023). Gaiwan. Science of Computer Programming 230:C. Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.1016/j.scico.2023.102989
Rockenbach D, Löff J, Araujo G, Griebler D, Fernandes L (2022). High-Level Stream and Data Parallelism in C++ for GPUs. Proceedings of the XXVI Brazilian Symposium on Programming Languages, pp. 41-49. Online publication date: 6-Oct-2022
https://dl.acm.org/doi/10.1145/3561320.3561327
Cong J, Lau J, Liu G, Neuendorffer S, Pan P, Vissers K, Zhang Z (2022). FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems 15:4, pp. 1-42. Online publication date: 8-Aug-2022
https://dl.acm.org/doi/10.1145/3530775
Papadimitriou M, Markou E, Fumero J, Stratikopoulos A, Blanaru F, Kotselidis C, Titzer B, Xu H, Zhang I (2021). Multiple-tasks on multiple-devices (MTMD): exploiting concurrency in heterogeneous managed runtimes. Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 125-138. Online publication date: 7-Apr-2021
https://dl.acm.org/doi/10.1145/3453933.3454019
Zheng Z, Oh C, Zhai J, Shen X, Yi Y, Chen W, Bahar I, Herlihy M, Witchel E, Lebeck A (2019). HiWayLib. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 153-166. Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304032
Hagedorn B, Stoltzfus L, Steuwer M, Gorlatch S, Dubach C, Knoop J, Schordan M, Johnson T, O'Boyle M (2018). High performance stencil code generation with Lift. Proceedings of the 2018 International Symposium on Code Generation and Optimization, pp. 100-112. Online publication date: 24-Feb-2018
https://dl.acm.org/doi/10.1145/3168824
Lin S, Wu J, Bhattacharyya S (2018). Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms. ACM Transactions on Embedded Computing Systems 17:2, pp. 1-25. Online publication date: 30-Jan-2018
https://dl.acm.org/doi/10.1145/3157669
Steuwer M, Remmelg T, Dubach C, Reddi V, Smith A, Tang L (2017). Lift: a functional data-parallel IR for high-performance GPU code generation. Proceedings of the 2017 International Symposium on Code Generation and Optimization, pp. 74-85. Online publication date: 4-Feb-2017
https://dl.acm.org/doi/10.5555/3049832.3049841
Frust T, Juckeland G, Bieberle A (2016). Scalable and modular online data processing for ultrafast computed tomography using CUDA pipelines. Proceedings of the 2nd Workshop on In Situ Infrastructures for Enabling Extreme-scale Analysis and Visualization, pp. 7-11. Online publication date: 13-Nov-2016
https://dl.acm.org/doi/10.5555/3018859.3018861
Liu B, Qiu W, Jiang L, Gong Z (2016). Software pipelining for graphic processing unit acceleration. International Journal of High Performance Computing Applications 30:2, pp. 169-185. Online publication date: 1-May-2016
https://dl.acm.org/doi/10.1177/1094342015585845
