DOI: 10.1145/3649169.3649247

Research article · Open access

Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures

Published: 06 March 2024

Abstract

In many computational problems, memory throughput is the performance bottleneck. The problem is even more pronounced on parallel platforms, where the ratio between computing elements and memory bandwidth shifts towards computing. To reach a decent level of performance efficiency, software needs to be attuned to hardware features such as cache architectures or memory banks. This can be achieved by selecting the right memory layouts for data structures or by changing the order in which the data structures are traversed. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can be adapted to particular memory layouts, semi-automatically parallelized, or autotuned without altering their internal code. The proposed solution is implemented as an extension of the Noarr library, which simplifies layout-agnostic design of regular data structures. It is implemented entirely using C++ template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC. We evaluate the performance and expressiveness of our approach on the PolyBench/C benchmarks.

Supplementary Material

klepl.zip — supplemental movie, appendix, image, and software files for "Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures"


Cited By

  • (2024) Abstractions for C++ code optimizations in parallel high-performance applications. Parallel Computing, 103096. DOI: 10.1016/j.parco.2024.103096. Online publication date: Aug 2024.


Published In

PMAM '24: Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores
March 2024, 65 pages
ISBN: 9798400705991
DOI: 10.1145/3649169
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Iteration order
  2. Layout agnostic
  3. Memory optimizations
  4. Parallel programming
  5. Traverser

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '24

Acceptance Rates

Overall acceptance rate: 53 of 97 submissions, 55%

Article Metrics

  • Downloads (last 12 months): 356
  • Downloads (last 6 weeks): 46
Reflects downloads up to 28 Jan 2025

