DOI: 10.1145/3649169.3649247

Research article · Open access

Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures

Published: 06 March 2024

Abstract

In many computational problems, memory throughput is the performance bottleneck. The problem is even more pronounced on parallel platforms, where the ratio between computing elements and memory bandwidth shifts towards computing. To reach a decent level of performance efficiency, software needs to be attuned to hardware features such as cache architectures or memory banks. This can be achieved by selecting the right memory layouts for data structures or by changing the order in which the data structures are traversed. In this work, we present an abstraction for traversing a set of regular data structures (e.g., multidimensional arrays) that allows the design of traversal-agnostic algorithms. Such algorithms can be adapted to particular memory layouts, semi-automatically parallelized, or autotuned without altering their internal code. The proposed solution is implemented as an extension of the Noarr library, which simplifies layout-agnostic design of regular data structures. It is implemented entirely using C++ template meta-programming without any nonstandard dependencies, so it is fully compatible with existing compilers, including CUDA NVCC. We evaluate the performance and expressiveness of our approach on the PolyBench/C benchmarks.

Supplementary Material

klepl.zip — supplemental movie, appendix, image, and software files for "Pure C++ Approach to Optimized Parallel Traversal of Regular Data Structures"


Cited By

  • (2024) Abstractions for C++ code optimizations in parallel high-performance applications. Parallel Computing, 103096. DOI: 10.1016/j.parco.2024.103096. Online publication date: Aug 2024.


Published In

PMAM '24: Proceedings of the 15th International Workshop on Programming Models and Applications for Multicores and Manycores
March 2024, 65 pages
ISBN: 9798400705991
DOI: 10.1145/3649169
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Iteration order
  2. Layout agnostic
  3. Memory optimizations
  4. Parallel programming
  5. Traverser

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

PPoPP '24

Acceptance Rates

Overall acceptance rate: 53 of 97 submissions, 55%

Article Metrics

  • Downloads (last 12 months): 356
  • Downloads (last 6 weeks): 46
Reflects downloads up to 28 Jan 2025

