Research article · Open access
DOI: 10.1145/3673038.3673075

AdCoalescer: An Adaptive Coalescer to Reduce the Inter-Module Traffic in MCM-GPUs

Published: 12 August 2024 Publication History

Abstract

The demand for greater computing power has driven the development of multi-chip-module GPUs (MCM-GPUs), which greatly improve parallel processing capability. Unfortunately, MCM-GPUs face a notable challenge: a performance bottleneck caused by remote accesses over the inter-module network. In this work, we find significant data-access redundancy among the streaming multiprocessors (SMs) within a GPU module; these redundant accesses can be coalesced to reduce network pressure. However, how to design a coalescing scheme that identifies memory addresses with high data locality remains unclear.
Previously proposed techniques, such as intra-cluster coalescing (ICC), which merges all memory requests, and the L1.5 cache, which relies on its cache replacement policy to identify locality, are ineffective in multi-chip-module GPUs. As this paper identifies, this ineffectiveness stems from the large number of SMs within a GPU module and the uniform sharing behavior of memory addresses. We then propose a simple but effective framework, the Adaptive Coalescer (AdCoalescer), based on the key observation that program counter (PC) values can be exploited to identify memory requests with high data locality. AdCoalescer consists of two key components: the PC filter and the merge table. It adaptively merges memory requests sent from different SMs to the same cache line, especially requests that may be issued concurrently by multiple SMs, as identified by the PC value. Compared to traditional designs, AdCoalescer achieves an average performance improvement of 22.5% (up to 71.9%) with minimal hardware cost. This substantially outperforms the ICC design (7.7% improvement) and the L1.5 cache design (7.1% improvement), both of which incur higher hardware cost.
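The coalescing idea described above can be sketched in a toy model: a PC filter learns which program counters issue loads shared across multiple SMs, and concurrent requests from those "hot" PCs to the same cache line are merged into a single inter-module request. All names, the sharer threshold, and the batch-based timing model here are illustrative assumptions, not the paper's actual microarchitecture.

```python
from collections import defaultdict

CACHE_LINE = 128  # bytes; a common GPU cache-line granularity (assumption)

class PCFilter:
    """Mark a PC 'hot' once loads from enough distinct SMs hit one line."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.sharers = defaultdict(set)  # (pc, line) -> {sm_id, ...}
        self.hot_pcs = set()

    def observe(self, pc, sm_id, line):
        self.sharers[(pc, line)].add(sm_id)
        if len(self.sharers[(pc, line)]) >= self.threshold:
            self.hot_pcs.add(pc)

def remote_requests(batches, pc_filter):
    """Count inter-module requests. Each inner list holds requests
    (sm_id, pc, addr) issued concurrently within one GPU module."""
    total = 0
    for batch in batches:
        merge_table = set()  # lines with an outstanding merged request
        for sm_id, pc, addr in batch:
            line = addr // CACHE_LINE
            pc_filter.observe(pc, sm_id, line)
            if pc in pc_filter.hot_pcs and line in merge_table:
                continue  # coalesced: ride the outstanding request
            merge_table.add(line)
            total += 1    # a new request crosses the inter-module network
    return total

# Four SMs execute the same load (PC 0x40) to one cache line concurrently:
batch = [(sm, 0x40, 0x1000) for sm in range(4)]
print(remote_requests([batch], PCFilter()))             # 1 merged request
print(remote_requests([batch], PCFilter(threshold=99))) # 4 when the PC stays cold
```

In this toy model, leaving the PC cold reproduces the baseline (one remote request per SM), while a hot PC collapses the four concurrent requests into one, which is the traffic reduction the abstract attributes to PC-guided merging.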



Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024, 1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multi-chip-module GPUs (MCM-GPUs)
  2. Network-on-Chip (NoC)
  3. Adaptive coalescer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Natural Science Foundation of China Youth Science Foundation Project
  • Beijing Nova Program

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
