Research article · Open access
DOI: 10.1145/3673038.3673075

AdCoalescer: An Adaptive Coalescer to Reduce the Inter-Module Traffic in MCM-GPUs

Published: 12 August 2024 Publication History

Abstract

The demand for greater computing power has driven the development of multi-chip-module GPUs (MCM-GPUs), which greatly improve parallel processing capability. Unfortunately, MCM-GPUs face a notable challenge: a performance bottleneck caused by remote accesses over the inter-module network. In this work, we find significant data-access redundancy among the streaming multiprocessors (SMs) within a GPU module; these redundant accesses can be coalesced to reduce network pressure. However, how to design a coalescing scheme that identifies memory addresses with high data locality remains unclear.
Previously proposed techniques, such as intra-cluster coalescing (ICC), which merges all memory requests, and the L1.5 cache, which relies on its cache replacement policy to identify locality, are ineffective in multi-chip-module GPUs. As this paper identifies, this ineffectiveness stems from the large number of SMs within a GPU module and the uniform sharing behavior of memory addresses. We then propose a simple but effective framework, the Adaptive Coalescer (AdCoalescer), based on the key observation that program counter (PC) values can be exploited to identify memory requests with high data locality. AdCoalescer consists of two key components: the PC filter and the merge table. It adaptively merges memory requests sent from different SMs to the same cache line, especially requests that may be issued concurrently by multiple SMs, as identified by the PC value. Compared to traditional designs, AdCoalescer achieves an average performance improvement of 22.5% (up to 71.9%) with minimal hardware cost. This substantially outperforms the ICC design (7.7% improvement) and the L1.5 cache design (7.1% improvement), both of which incur higher hardware cost.
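The coalescing idea described above can be sketched in a toy model: a PC filter learns which program counters issue loads shared across multiple SMs, and concurrent requests from those "hot" PCs to the same cache line are merged into a single inter-module request. All names, the sharer threshold, and the batch-based timing model here are illustrative assumptions, not the paper's actual microarchitecture.

```python
from collections import defaultdict

CACHE_LINE = 128  # bytes; a common GPU cache-line granularity (assumption)

class PCFilter:
    """Mark a PC 'hot' once loads from enough distinct SMs hit one line."""
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.sharers = defaultdict(set)  # (pc, line) -> {sm_id, ...}
        self.hot_pcs = set()

    def observe(self, pc, sm_id, line):
        self.sharers[(pc, line)].add(sm_id)
        if len(self.sharers[(pc, line)]) >= self.threshold:
            self.hot_pcs.add(pc)

def remote_requests(batches, pc_filter):
    """Count inter-module requests. Each inner list holds requests
    (sm_id, pc, addr) issued concurrently within one GPU module."""
    total = 0
    for batch in batches:
        merge_table = set()  # lines with an outstanding merged request
        for sm_id, pc, addr in batch:
            line = addr // CACHE_LINE
            pc_filter.observe(pc, sm_id, line)
            if pc in pc_filter.hot_pcs and line in merge_table:
                continue  # coalesced: ride the outstanding request
            merge_table.add(line)
            total += 1    # a new request crosses the inter-module network
    return total

# Four SMs execute the same load (PC 0x40) to one cache line concurrently:
batch = [(sm, 0x40, 0x1000) for sm in range(4)]
print(remote_requests([batch], PCFilter()))             # 1 merged request
print(remote_requests([batch], PCFilter(threshold=99))) # 4 when the PC stays cold
```

In this toy model, leaving the PC cold reproduces the baseline (one remote request per SM), while a hot PC collapses the four concurrent requests into one, which is the traffic reduction the abstract attributes to PC-guided merging.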



Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024, 1279 pages
ISBN: 9798400717932
DOI: 10.1145/3673038
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Multi-chip-module GPUs (MCM-GPUs)
  2. Network-on-Chip (NoC)
  3. Adaptive coalescer

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Natural Science Foundation of China Youth Science Foundation Project
  • Beijing Nova Program

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
