DOI: 10.1145/3293883.3295725
research-article
Open access

Provably and practically efficient granularity control

Published: 16 February 2019

Abstract

Over the past decade, many programming languages and systems for parallel computing have been developed, e.g., Fork/Join and Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction for writing parallel codes, performance continues to require labor-intensive optimizations for coarsening the granularity of parallel executions. In this paper, we present provably and practically efficient techniques for controlling granularity within the run-time system of the language. Our starting point is "oracle-guided scheduling", a result from the functional-programming community that shows that granularity can be controlled by an "oracle" that can predict the execution time of parallel codes. We give an algorithm for implementing such an oracle and prove that it has the desired theoretical properties under the nested-parallel programming model. We implement the oracle in C++ by extending Cilk and evaluate its practical performance. The results show that our techniques can essentially eliminate hand tuning while closely matching the performance of hand-tuned codes.
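To make the idea concrete, the sketch below shows, in plain C++ rather than the paper's Cilk-based implementation, how an oracle can gate parallelism: a cost function reports the abstract work of a task, the oracle multiplies it by a cost constant to predict a running time, and the task runs sequentially whenever the prediction falls below a grain-size threshold. The names parallel_map, predict_usec, kappa_usec, and usec_per_unit, as well as the fixed cost constant, are illustrative assumptions; the paper's oracle and its analysis are considerably more careful about how predictions are obtained and updated.

#include <atomic>
#include <cstddef>
#include <future>
#include <vector>

// Grain-size threshold and cost constant (illustrative values, not from the paper).
static constexpr double kappa_usec = 20.0;       // run sequentially below ~20 microseconds
static std::atomic<double> usec_per_unit{0.01};  // predicted microseconds per unit of work

// Oracle: predicted running time for `work` abstract units of work.
inline double predict_usec(double work) {
  return usec_per_unit.load(std::memory_order_relaxed) * work;
}

// Parallel map over v[lo, hi): fork only when the oracle predicts the range
// is expensive enough to amortize the cost of creating a parallel task.
template <class F>
void parallel_map(std::vector<int>& v, std::size_t lo, std::size_t hi, F f) {
  double work = static_cast<double>(hi - lo);      // cost function: range size
  if (hi - lo <= 1 || predict_usec(work) <= kappa_usec) {
    for (std::size_t i = lo; i < hi; i++) f(v[i]); // sequential body
    // A real oracle would time this sequential run and refine usec_per_unit.
    return;
  }
  std::size_t mid = lo + (hi - lo) / 2;
  auto left = std::async(std::launch::async,
                         [&] { parallel_map(v, lo, mid, f); });
  parallel_map(v, mid, hi, f);
  left.wait();
}

int main() {
  std::vector<int> v(1 << 16, 1);
  parallel_map(v, 0, v.size(), [](int& x) { x += 1; });
  return 0;
}

Even in this crude form, the structure of the approach is visible: the only tuning knob is the threshold kappa_usec, which does not depend on the particular algorithm, whereas manual coarsening would require a hand-picked cutoff for each parallel call site.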

Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN: 9781450362252
DOI: 10.1145/3293883

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Author Tags

  1. granularity control
  2. parallel programming languages

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 paper acceptance rate: 29 of 152 submissions (19%)
Overall acceptance rate: 230 of 1,014 submissions (23%)
