DOI: 10.1145/3293883.3295725
research-article
Open access

Provably and practically efficient granularity control

Published: 16 February 2019

Abstract

Over the past decade, many programming languages and systems for parallel computing have been developed, e.g., Fork/Join and Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction for writing parallel codes, performance continues to require labor-intensive optimizations for coarsening the granularity of parallel executions. In this paper, we present provably and practically efficient techniques for controlling granularity within the run-time system of the language. Our starting point is "oracle-guided scheduling", a result from the functional-programming community that shows that granularity can be controlled by an "oracle" that can predict the execution time of parallel codes. We give an algorithm for implementing such an oracle and prove that it has the desired theoretical properties under the nested-parallel programming model. We implement the oracle in C++ by extending Cilk and evaluate its practical performance. The results show that our techniques can essentially eliminate hand tuning while closely matching the performance of hand-tuned codes.
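To make the idea concrete, the sketch below shows, in plain C++ rather than the paper's Cilk-based implementation, how an oracle can gate parallelism: a cost function reports the abstract work of a task, the oracle multiplies it by a cost constant to predict a running time, and the task runs sequentially whenever the prediction falls below a grain-size threshold. The names parallel_map, predict_usec, kappa_usec, and usec_per_unit, as well as the fixed cost constant, are illustrative assumptions; the paper's oracle and its analysis are considerably more careful about how predictions are obtained and updated.

#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Grain-size threshold and cost constant (illustrative values, not from the paper).
static constexpr double kappa_usec = 20.0;       // run sequentially below ~20 microseconds
static std::atomic<double> usec_per_unit{0.01};  // predicted microseconds per unit of work

// Oracle: predicted running time for `work` abstract units of work.
inline double predict_usec(double work) {
  return usec_per_unit.load(std::memory_order_relaxed) * work;
}

// Parallel map over v[lo, hi): fork only when the oracle predicts the range
// is expensive enough to amortize the cost of creating a parallel task.
template <class F>
void parallel_map(std::vector<int>& v, std::size_t lo, std::size_t hi, F f) {
  double work = static_cast<double>(hi - lo);      // cost function: range size
  if (hi - lo <= 1 || predict_usec(work) <= kappa_usec) {
    for (std::size_t i = lo; i < hi; i++) f(v[i]); // sequential body
    // A real oracle would time this sequential run and refine usec_per_unit.
    return;
  }
  std::size_t mid = lo + (hi - lo) / 2;
  auto left = std::async(std::launch::async,
                         [&] { parallel_map(v, lo, mid, f); });
  parallel_map(v, mid, hi, f);
  left.wait();
}

int main() {
  std::vector<int> v(1 << 16, 1);
  parallel_map(v, 0, v.size(), [](int& x) { x += 1; });
  return 0;
}

Even in this crude form, the structure of the approach is visible: the only tuning knob is the threshold kappa_usec, which does not depend on the particular algorithm, whereas manual coarsening would require a hand-picked cutoff for each parallel call site.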

Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN: 9781450362252
DOI: 10.1145/3293883

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. granularity control
  2. parallel programming languages

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 paper acceptance rate: 29 of 152 submissions (19%)
Overall acceptance rate: 230 of 1,014 submissions (23%)
