DOI: 10.1145/3572848.3577509
research-article
Open access

OpenCilk: A Modular and Extensible Software Infrastructure for Fast Task-Parallel Code

Published: 21 February 2023

Abstract

This paper presents OpenCilk, an open-source software infrastructure for task-parallel programming that allows for substantial code reuse and easy exploration of design choices in language abstraction, compilation strategy, runtime mechanism, and productivity-tool development.
The OpenCilk infrastructure consists of three main components: a compiler designed to compile fork-join task-parallel code, an efficient work-stealing runtime scheduler, and a productivity-tool development framework, based on compiler instrumentation, for fork-join parallel computations. OpenCilk is modular: modifying one component, for the most part, does not necessitate modifications to the other components. It is also easy to extend: its construction naturally encourages code reuse. Despite being modular and easy to extend, OpenCilk produces high-performing code.
We investigated OpenCilk's modularity, extensibility, and performance through several case studies, including a study in which we extended OpenCilk to support multiple parallel runtime systems: Cilk Plus, OpenMP, and oneTBB. OpenCilk's design enables rapid prototyping of new compiler back ends that target different parallel-runtime ABIs; each back end required fewer than 2000 new lines of code. We also examined the OpenCilk runtime's performance empirically on 15 benchmark Cilk programs and found that it outperforms the other runtimes by a geometric mean of 4%--26% on 1 core and 10%--120% on 48 cores.
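To make the fork-join model concrete, the sketch below shows the kind of Cilk program that the OpenCilk compiler and work-stealing runtime handle. It is an illustration rather than code from the paper: the cilk_spawn and cilk_sync keywords and the <cilk/cilk.h> header follow the Cilk language, and the -fopencilk flag in the comment reflects how OpenCilk's clang is commonly invoked (an assumption about tooling, not a claim taken from this abstract).

/* fib.c -- minimal fork-join Cilk sketch of the style OpenCilk compiles.
 * Assumed build command (OpenCilk clang): clang -fopencilk -O3 fib.c -o fib
 */
#include <cilk/cilk.h>   /* cilk_spawn, cilk_sync */
#include <stdio.h>
#include <stdlib.h>

static long fib(long n) {
    if (n < 2)
        return n;
    long x, y;
    x = cilk_spawn fib(n - 1);  /* child strand; may be stolen and run in parallel */
    y = fib(n - 2);             /* parent continues with the other subproblem */
    cilk_sync;                  /* join: wait for the spawned child before using x */
    return x + y;
}

int main(int argc, char *argv[]) {
    long n = (argc > 1) ? atol(argv[1]) : 30;
    printf("fib(%ld) = %ld\n", n, fib(n));
    return 0;
}

The same source, unchanged, is also the input to the productivity-tool framework described above: tools attach through compiler-inserted instrumentation (for example, by recompiling with a tool-specific flag) rather than by modifying the program.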



          Published In

          PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
          February 2023
          480 pages
          ISBN: 9798400700156
          DOI: 10.1145/3572848
          This work is licensed under a Creative Commons Attribution 4.0 International License.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Author Tags

          1. OpenCilk
          2. OpenMP
          3. bitcode
          4. cilk
          5. compiler-inserted instrumentation
          6. compiling
          7. fork-join parallelism
          8. oneTBB
          9. parallel computing
          10. parallel runtime system
          11. productivity tools
          12. tapir
          13. task parallelism

          Qualifiers

          • Research-article

          Conference

          PPoPP '23

          Acceptance Rates

          Overall Acceptance Rate 230 of 1,014 submissions, 23%

          Article Metrics

          • Downloads (Last 12 months): 515
          • Downloads (Last 6 weeks): 70
          Reflects downloads up to 13 Dec 2024

          Cited By

          • (2024) Optimizing a Super-Fast Eigensolver for Hierarchically Semiseparable Matrices. Proceedings of the 53rd International Conference on Parallel Processing, 32-41. DOI: 10.1145/3673038.3673119. Online publication date: 12-Aug-2024
          • (2024) An Efficient Task-Parallel Pipeline Programming Framework. Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 95-106. DOI: 10.1145/3635035.3635037. Online publication date: 18-Jan-2024
          • (2024) When Is Parallelism Fearless and Zero-Cost with Rust? Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 27-40. DOI: 10.1145/3626183.3659966. Online publication date: 17-Jun-2024
          • (2024) Multi Bucket Queues: Efficient Concurrent Priority Scheduling. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, 113-124. DOI: 10.1145/3626183.3659962. Online publication date: 17-Jun-2024
          • (2024) Speedcode: Software Performance Engineering Education via the Coding of Didactic Exercises. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 391-394. DOI: 10.1109/IPDPSW63119.2024.00087. Online publication date: 27-May-2024
          • (2024) Teaching Parallel Algorithms Using the Binary-Forking Model. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 346-351. DOI: 10.1109/IPDPSW63119.2024.00080. Online publication date: 27-May-2024
          • (2024) Helping Faculty Teach Software Performance Engineering. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 338-341. DOI: 10.1109/IPDPSW63119.2024.00078. Online publication date: 27-May-2024
          • (2024) HardCilk: Cilk-like Task Parallelism for FPGAs. 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 140-150. DOI: 10.1109/FCCM60383.2024.00025. Online publication date: 5-May-2024
          • (2024) X-OpenMP — eXtreme fine-grained tasking using lock-less work stealing. Future Generation Computer Systems, 159, 444-458. DOI: 10.1016/j.future.2024.05.019. Online publication date: Oct-2024
          • (2024) Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters. SN Computer Science, 5(3). DOI: 10.1007/s42979-024-02624-8. Online publication date: 13-Mar-2024
