DOI: 10.1145/3649411.3649415

Cache Cohort GPU Scheduling

Published: 28 April 2024

Abstract

With the ever-improving computation capability of GPUs, there is an increasing demand for higher memory bandwidth to supply the GPU cores with data. One way to improve effective memory bandwidth is with larger caches, and GPU vendors have recently invested in very large last-level caches (LLCs). The key challenge is to maximize cache utilization and keep the active working set on-chip during the lifetime of a kernel. Software tiling optimizations can help, but programmers alone cannot anticipate dynamic cache capacity contention. To this end, we introduce the concept of cache cohorts: groups of kernels that share the same working set and thus can be efficiently scheduled together. Our approach uses the GPU command processor (CP) to track the occupancy of the LLC and throttles scheduling of the next cohort if there is no room. To evaluate the design space, we modify the existing GPU firmware on the AMD RX 7600 and RX 7900 XTX GPUs to prototype the scheduling scheme on real hardware. Our preliminary results show up to a 27% reduction in execution time using the cache-optimized scheduling technique on the RX 7900 XTX.
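The admission policy the abstract describes can be illustrated with a small simulation. This is a hypothetical sketch, not the paper's actual firmware: kernels sharing a working set are grouped into a `Cohort`, and a `CommandProcessor` class (all names and sizes here are illustrative) admits the next pending cohort only if its estimated working set fits in the remaining LLC capacity, throttling it otherwise until an in-flight cohort retires.

```python
# Illustrative sketch of cache-cohort scheduling: the command processor (CP)
# tracks LLC occupancy and throttles the next cohort when there is no room.
# Names, structure, and capacity estimates are assumptions for illustration.
from collections import deque
from dataclasses import dataclass, field

LLC_CAPACITY_MB = 96  # e.g. roughly the Infinity Cache size of an RX 7900 XTX

@dataclass
class Cohort:
    name: str
    working_set_mb: int                       # estimated shared working set
    kernels: list = field(default_factory=list)

class CommandProcessor:
    def __init__(self, llc_capacity_mb=LLC_CAPACITY_MB):
        self.capacity = llc_capacity_mb
        self.occupied = 0                     # LLC space held by in-flight cohorts
        self.pending = deque()                # cohorts waiting to be dispatched
        self.running = []                     # cohorts currently executing

    def submit(self, cohort):
        self.pending.append(cohort)
        self._try_dispatch()

    def _try_dispatch(self):
        # Admit cohorts in order only while their working sets fit in the
        # remaining LLC capacity; otherwise throttle until space frees up.
        while (self.pending and
               self.occupied + self.pending[0].working_set_mb <= self.capacity):
            cohort = self.pending.popleft()
            self.occupied += cohort.working_set_mb
            self.running.append(cohort)

    def retire(self, cohort):
        # A cohort finished: release its LLC footprint and re-check the queue.
        self.running.remove(cohort)
        self.occupied -= cohort.working_set_mb
        self._try_dispatch()
```

Under this policy, two cohorts of 60 MB each cannot run concurrently on a 96 MB LLC: the second is held in the pending queue until the first retires, which is the throttling behavior the abstract attributes to the CP.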



Published In

GPGPU '24: Proceedings of the 16th Workshop on General Purpose Processing Using GPU
March 2024, 37 pages
ISBN: 9798400718175
DOI: 10.1145/3649411

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Cohort Scheduling
  2. Firmware
  3. GPU Command Processor
  4. Input Cache Locality


Conference

GPGPU '24. Overall acceptance rate: 57 of 129 submissions, 44%.
