DOI: 10.1145/3649411.3649415

Cache Cohort GPU Scheduling

Published: 28 April 2024

Abstract

With the ever-improving computation capability of GPUs, there is an increasing demand for higher memory bandwidth to supply the GPU cores with data. One way to improve effective memory bandwidth is with larger caches, and GPU vendors have recently invested in very large last-level caches (LLCs). The key challenge is to maximize cache utilization and keep the active working set on-chip during the lifetime of a kernel. Software tiling optimizations can help, but programmers alone cannot anticipate dynamic cache capacity contention. To this end, we introduce the concept of cache cohorts: groups of kernels that share the same working set and thus can be efficiently scheduled together. Our approach uses the GPU command processor (CP) to track the occupancy of the LLC and throttles scheduling of the next cohort if there is no room. To evaluate the design space, we modify the existing GPU firmware on the AMD RX 7600 and RX 7900 XTX GPUs to prototype the scheduling scheme on real hardware. Our preliminary results show up to a 27% reduction in execution time using the cache-optimized scheduling technique on the RX 7900 XTX.
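The admission policy the abstract describes can be illustrated with a small simulation. This is a hypothetical sketch, not the paper's actual firmware: kernels sharing a working set are grouped into a `Cohort`, and a `CommandProcessor` class (all names and sizes here are illustrative) admits the next pending cohort only if its estimated working set fits in the remaining LLC capacity, throttling it otherwise until an in-flight cohort retires.

```python
# Illustrative sketch of cache-cohort scheduling: the command processor (CP)
# tracks LLC occupancy and throttles the next cohort when there is no room.
# Names, structure, and capacity estimates are assumptions for illustration.
from collections import deque
from dataclasses import dataclass, field

LLC_CAPACITY_MB = 96  # e.g. roughly the Infinity Cache size of an RX 7900 XTX

@dataclass
class Cohort:
    name: str
    working_set_mb: int                       # estimated shared working set
    kernels: list = field(default_factory=list)

class CommandProcessor:
    def __init__(self, llc_capacity_mb=LLC_CAPACITY_MB):
        self.capacity = llc_capacity_mb
        self.occupied = 0                     # LLC space held by in-flight cohorts
        self.pending = deque()                # cohorts waiting to be dispatched
        self.running = []                     # cohorts currently executing

    def submit(self, cohort):
        self.pending.append(cohort)
        self._try_dispatch()

    def _try_dispatch(self):
        # Admit cohorts in order only while their working sets fit in the
        # remaining LLC capacity; otherwise throttle until space frees up.
        while (self.pending and
               self.occupied + self.pending[0].working_set_mb <= self.capacity):
            cohort = self.pending.popleft()
            self.occupied += cohort.working_set_mb
            self.running.append(cohort)

    def retire(self, cohort):
        # A cohort finished: release its LLC footprint and re-check the queue.
        self.running.remove(cohort)
        self.occupied -= cohort.working_set_mb
        self._try_dispatch()
```

Under this policy, two cohorts of 60 MB each cannot run concurrently on a 96 MB LLC: the second is held in the pending queue until the first retires, which is the throttling behavior the abstract attributes to the CP.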



Published In

GPGPU '24: Proceedings of the 16th Workshop on General Purpose Processing Using GPU
March 2024, 37 pages
ISBN: 9798400718175
DOI: 10.1145/3649411

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Cohort Scheduling
  2. Firmware
  3. GPU Command Processor
  4. Input Cache Locality


Conference

GPGPU '24. Overall acceptance rate: 57 of 129 submissions, 44%.
