More Web Proxy on the site http://driver.im/

research-article

Public Access

Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs

Authors:

Sriram Krishnamoorthy,

Milind KulkarniAuthors Info & Claims

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 117 - 130

https://doi.org/10.1145/3018743.3018763

Published: 26 January 2017 Publication History

Abstract

Modern hardware contains parallel execution resources that are well-suited for data-parallelism-vector units-and task parallelism-multicores. However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently as multicores. Our framework allows us to define schedulers that can dynamically select between executing task- blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel pro- grams into task block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14×-108× speedup over sequential baselines.

References

[1]

N. S. Arora, R. D. Blumofe, and C. G. Plaxton. Thread Scheduling for Multiprogrammed Multiprocessors. In SPAA, pages 119--129, 1998.

Digital Library

[2]

J. Barnes and P. Hut. A hierarchical o(nlogn) forcecalculation algorithm. Nature, 324(4):446--449, December 1986.

[3]

R. D. Blumofe and C. E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. J. ACM, 46(5):720--748, 1999.

Digital Library

[4]

R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55--69, August 25, 1996.

Digital Library

[5]

Cilk. Cilk 5.4.6. http://supertech.csail.mit.edu/cilk/.

[6]

I. Corp. Intel Cilk Plus Language Extension Specification, 2011.

[7]

Y. Jo and M. Kulkarni. Enhancing Locality for Recursive Traversals of Recursive Structures. In OOPSLA, pages 463--482, 2011.

Digital Library

[8]

Y. Jo, M. Goldfarb, and M. Kulkarni. Automatic Vectorization of Tree Traversals. In PACT, pages 363--374, 2013.

Digital Library

[9]

Y. A. Liu and S. D. Stoller. From Recursion to Iteration: What are the Optimizations? In PEPM, pages 73--82, 2000.

[10]

S. Maleki, Y. Gao, M. J. Garzaran, T. Wong, and D. A. Padua. An Evaluation of Vectorizing Compilers. In PACT, pages 372--382, 2011.

Digital Library

[11]

OpenMP Architecture Review Board. OpenMP Specification and Features. http://openmp.org/wp/, May 2008.

[12]

M. Pharr and W. R. Mark. ispc: A SPMD Compiler for High-performance CPU Programming. In Innovative Parallel Computing (InPar), 2012, pages 1--13. IEEE, 2012.

[13]

J. Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O'Reilly, 2007.

[14]

B. Ren, G. Agrawal, J. R. Larus, T. Mytkowicz, T. Poutanen, and W. Schulte. SIMD Parallelization of Applications that Traverse Irregular Data Structures. In 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pages 1--10. IEEE, 2013.

Digital Library

[15]

B. Ren, Y. Jo, S. Krishnamoorthy, K. Agrawal, and M. Kulkarni. Efficient Execution of Recursive Programs on Commodity Vector Hardware. In PLDI, pages 509--520, 2015.

Digital Library

[16]

M. Rinard and P. C. Diniz. Commutativity analysis: a new analysis technique for parallelizing compilers. ACM Trans. Program. Lang. Syst., 19(6):942--991, 1997. ISSN 0164-0925.

Digital Library

[17]

R. Rugina and M. C. Rinard. Recursion Unrolling for Divide and Conquer Programs. In LCPC, pages 34--48, 2000.

[18]

X10. The X10 Programming Language. www.research.ibm.com/x10/, Mar. 2006.

Cited By

Zhang YLiu TSun ZChen ZLi XZuo ZJust RFraser G(2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598098
Shehri WKhemakhem MBasuhail AEassa F(2019)A Proposed Architecture for Parallel HPC-based Resource Management System for Big Data ApplicationsAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0401054:1Online publication date: 2019
https://doi.org/10.25046/aj040105
Moll SHack S(2018)Partial control-flow linearizationACM SIGPLAN Notices10.1145/3296979.319241353:4(543-556)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192413
Show More Cited By

Index Terms

Exploiting Vector and Multicore Parallelism for Recursive, Data- and Task-Parallel Programs
1. Computing methodologies
  1. Concurrent computing methodologies
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Recommendations

Extracting SIMD Parallelism from Recursive Task-Parallel Programs

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations ...
Efficient execution of recursive programs on commodity vector hardware
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel ...
Efficient execution of recursive programs on commodity vector hardware
PLDI '15

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '17: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

January 2017

476 pages

ISBN:9781450344937

DOI:10.1145/3018743

General Chair:
Vivek Sarkar
Rice University, USA
,
Program Chair:
Lawrence Rauchwerger
Texas A&M University, USA

ACM SIGPLAN Notices Volume 52, Issue 8
PPoPP '17
August 2017
442 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3155284
Editor:
Matthew Fluet
Issue’s Table of Contents

Copyright © 2017 ACM.

© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 January 2017

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

PPoPP '17

Sponsor:

SIGPLAN

PPoPP '17: 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 4 - 8, 2017

Texas, Austin, USA

Acceptance Rates

PPoPP '17 Paper Acceptance Rate 29 of 132 submissions, 22%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
1,011
Total Downloads

Downloads (Last 12 months)111
Downloads (Last 6 weeks)14

Reflects downloads up to 01 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhang YLiu TSun ZChen ZLi XZuo ZJust RFraser G(2023)Catamaran: Low-Overhead Memory Safety Enforcement via Parallel AccelerationProceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3597926.3598098(816-828)Online publication date: 12-Jul-2023
https://dl.acm.org/doi/10.1145/3597926.3598098
Shehri WKhemakhem MBasuhail AEassa F(2019)A Proposed Architecture for Parallel HPC-based Resource Management System for Big Data ApplicationsAdvances in Science, Technology and Engineering Systems Journal10.25046/aj0401054:1Online publication date: 2019
https://doi.org/10.25046/aj040105
Moll SHack S(2018)Partial control-flow linearizationACM SIGPLAN Notices10.1145/3296979.319241353:4(543-556)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3296979.3192413
Shaker ABaker JSharma GYuan M(2018)Performance Comparison of NVIDIA accelerators with SIMD, Associative, and Multi-core Processors for Air Traffic ManagementWorkshop Proceedings of the 47th International Conference on Parallel Processing10.1145/3229710.3229757(1-10)Online publication date: 13-Aug-2018
https://dl.acm.org/doi/10.1145/3229710.3229757
Moll SHack SFoster JGrossman D(2018)Partial control-flow linearizationProceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3192366.3192413(543-556)Online publication date: 11-Jun-2018
https://dl.acm.org/doi/10.1145/3192366.3192413
Al Farhan MKeyes D(2018)Optimizations of Unstructured Aerodynamics Computations for Many-core ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.282653329:10(2317-2332)Online publication date: 1-Oct-2018
https://doi.org/10.1109/TPDS.2018.2826533
Wu HRavi JBecchi M(2018)Compiling SIMT Programs on Multi- and Many-Core Processors with Wide Vector Units: A Case Study with CUDA2018 IEEE 25th International Conference on High Performance Computing (HiPC)10.1109/HiPC.2018.00022(123-132)Online publication date: Dec-2018
https://doi.org/10.1109/HiPC.2018.00022
Kim JJiang STorng CWang MSrinath SIlbeyi BAl-Hawaj KBatten CHunter HMoreno JEmer JSanchez D(2017)Using intra-core loop-task accelerators to improve the productivity and performance of task-based parallel programsProceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3123939.3136952(759-773)Online publication date: 14-Oct-2017
https://dl.acm.org/doi/10.1145/3123939.3136952
Hou KLiu WWang HFeng WGropp WBeckman PLi ZCazorla F(2017)Fast segmented sort on GPUsProceedings of the International Conference on Supercomputing10.1145/3079079.3079105(1-10)Online publication date: 14-Jun-2017
https://dl.acm.org/doi/10.1145/3079079.3079105
Yadav SGupta NReverdell AKaiser H(2021)Parallel SIMD - A Policy Based Solution for Free Speed-Up using C++ Data-Parallel Types2021 IEEE/ACM 6th International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)10.1109/ESPM254806.2021.00008(20-29)Online publication date: Nov-2021
https://doi.org/10.1109/ESPM254806.2021.00008
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents