
On-the-Fly Pipeline Parallelism

Published: 08 September 2015

Abstract

Pipeline parallelism organizes a parallel program as a linear sequence of stages. Each stage processes elements of a data stream, passing each processed data element to the next stage, and then taking on a new element before the subsequent stages have necessarily completed their processing. Pipeline parallelism is used especially in streaming applications that perform video, audio, and digital signal processing. Three out of 13 benchmarks in PARSEC, a popular software benchmark suite designed for shared-memory multiprocessors, can be expressed as pipeline parallelism.
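The following minimal sketch, written with plain C++17 threads rather than the Cilk-P linguistics described below, and with all names invented for the example, shows the pattern in its simplest form: a two-stage pipeline in which stage 1 transforms elements of an integer stream and stage 2 consumes them through a small bounded queue, so the two stages overlap in time.

    // Illustrative sketch only (plain C++17 threads, not Cilk-P): a two-stage
    // pipeline over a stream of integers. Stage 1 transforms each element and
    // hands it to stage 2 through a bounded queue, so the two stages overlap.
    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <optional>
    #include <thread>

    // Minimal bounded, thread-safe queue connecting two pipeline stages.
    template <typename T>
    class BoundedQueue {
     public:
      explicit BoundedQueue(std::size_t cap) : cap_(cap) {}
      void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_; });
        q_.push_back(std::move(v));
        not_empty_.notify_one();
      }
      std::optional<T> pop() {  // returns nullopt once the stream is closed
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop_front();
        not_full_.notify_one();
        return v;
      }
      void close() {  // called by the upstream stage when the stream ends
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
      }
     private:
      std::mutex m_;
      std::condition_variable not_full_, not_empty_;
      std::deque<T> q_;
      std::size_t cap_;
      bool closed_ = false;
    };

    int main() {
      BoundedQueue<int> q(4);

      // Stage 1: process each element of the input stream and pass it on,
      // then take on the next element without waiting for stage 2.
      std::thread stage1([&] {
        for (int i = 0; i < 16; ++i) q.push(i * i);
        q.close();
      });

      // Stage 2: consume each processed element as soon as it arrives.
      std::thread stage2([&] {
        while (auto v = q.pop()) std::cout << *v << '\n';
      });

      stage1.join();
      stage2.join();
      return 0;
    }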
Whereas most concurrency platforms that support pipeline parallelism use a “construct-and-run” approach, this article investigates “on-the-fly” pipeline parallelism, where the structure of the pipeline emerges as the program executes rather than being specified a priori. On-the-fly pipeline parallelism allows the number of stages to vary from iteration to iteration and dependencies to be data dependent. We propose simple linguistics for specifying on-the-fly pipeline parallelism and describe a provably efficient scheduling algorithm, the Piper algorithm, which integrates pipeline parallelism into a work-stealing scheduler, allowing pipeline and fork-join parallelism to be arbitrarily nested. The Piper algorithm automatically throttles the parallelism, precluding “runaway” pipelines. Given a pipeline computation with T1 work and T∞ span (critical-path length), Piper executes the computation on P processors in TP ≤ T1/P + O(T∞ + lg P) expected time. Piper also limits stack space, ensuring that it does not grow unboundedly with running time.
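Restated in standard work-span notation, the performance bound above reads as follows; the second display is the routine consequence that when the parallelism T1/T∞ greatly exceeds the processor count P, Piper's bound gives near-linear speedup.

    % Piper's expected running time on P processors, as stated in the abstract:
    \[
      T_P \;\le\; \frac{T_1}{P} + O\!\left(T_\infty + \lg P\right).
    \]
    % When the parallelism T_1/T_\infty is much larger than P, the T_1/P term
    % dominates, so the bound implies near-linear speedup:
    \[
      \frac{T_1}{T_\infty} \gg P \;\Longrightarrow\; T_P \approx \frac{T_1}{P}.
    \]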
We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system. Our prototype Cilk-P implementation exploits optimizations such as “lazy enabling” and “dependency folding.” We have ported the three PARSEC benchmarks that exhibit pipeline parallelism to run on Cilk-P. One of these, x264, cannot readily be executed by systems that support only construct-and-run pipeline parallelism. Benchmark results indicate that Cilk-P has low serial overhead and good scalability. On x264, for example, Cilk-P exhibits a speedup of 13.87 over its respective serial counterpart when running on 16 processors.

Published In

ACM Transactions on Parallel Computing, Volume 2, Issue 3
Special Issue for SPAA 2013
October 2015
196 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/2821462
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 September 2015
Accepted: 01 June 2015
Revised: 01 May 2015
Received: 01 October 2013
Published in TOPC Volume 2, Issue 3

Author Tags

  1. Cilk
  2. multicore
  3. multithreading
  4. on-the-fly pipelining
  5. parallel programming
  6. pipeline parallelism
  7. scheduling
  8. work stealing

Qualifiers

  • Research-article
  • Research
  • Refereed
