Research Article
Open access

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems

Published: 04 September 2018

Abstract

To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques that traditional GPU systems use to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with the efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when data and compute are misaligned, and nondeterministic thread block scheduling, used to improve compute resource utilization, impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach.
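The misalignment problem described above can be illustrated with a minimal, hypothetical sketch. All constants and function names here are illustrative, not from the paper: with fine-grained interleaving, consecutive chunks of an address range rotate across GPUs, so a thread block's contiguous working set is spread over every GPU, while coarse-grained (page-level) placement can keep it on one GPU.

```python
# Illustrative model of memory interleaving across GPUs.
# FINE_GRAIN and PAGE_SIZE are assumed example values, not the paper's.
NUM_GPUS = 4
FINE_GRAIN = 256    # bytes per fine-grained interleave chunk
PAGE_SIZE = 4096    # bytes per page for coarse-grained interleaving

def owner_fine(addr):
    # Fine-grained interleaving: each 256 B chunk rotates to the next GPU.
    return (addr // FINE_GRAIN) % NUM_GPUS

def owner_coarse(addr):
    # Coarse-grained interleaving: a whole page lives on one GPU.
    return (addr // PAGE_SIZE) % NUM_GPUS

def remote_fraction(owner, base, size, home_gpu):
    # Fraction of 256 B chunks in [base, base+size) not owned by home_gpu.
    chunks = range(base, base + size, FINE_GRAIN)
    remote = sum(1 for a in chunks if owner(a) != home_gpu)
    return remote / len(chunks)

# A thread block on GPU 0 touching one contiguous 4 KiB region:
print(remote_fraction(owner_fine, 0, 4096, 0))    # 0.75: 3 of every 4 chunks are remote
print(remote_fraction(owner_coarse, 0, 4096, 0))  # 0.0: the whole page is local
```

Under this toy model, fine-grained interleaving makes three quarters of the block's accesses remote, whereas page-level placement keeps them all local, which is the tension the abstract describes.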
To this end, we propose a mechanism that identifies exclusively accessed data and places that data in the same GPU as the thread block that accesses it. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) with an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using a dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose a different memory interleaving for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
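The affinity idea in point (2) can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: each thread block's exclusive region is placed on whole pages, and the scheduler assigns the block to the GPU that owns those pages.

```python
# Illustrative affinity-based thread block scheduling.
# Page size, block layout, and all names are assumptions for this sketch.
NUM_GPUS = 4
PAGE_SIZE = 4096

def home_gpu_of_block(block_id, bytes_per_block):
    # The block's exclusive region starts at a page boundary; because whole
    # pages are placed on a single GPU, the first page identifies the home GPU.
    first_page = (block_id * bytes_per_block) // PAGE_SIZE
    return first_page % NUM_GPUS

def schedule(num_blocks, bytes_per_block):
    # Affinity scheduling: co-place each thread block with its exclusive pages,
    # instead of assigning blocks to GPUs nondeterministically.
    return {b: home_gpu_of_block(b, bytes_per_block) for b in range(num_blocks)}

print(schedule(8, 4096))  # blocks 0..7 map to GPUs 0,1,2,3,0,1,2,3
```

With this policy every block's exclusive accesses stay local by construction; only data shared between blocks still generates remote traffic, which is why the mechanism targets exclusively accessed data.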



Published In

ACM Transactions on Architecture and Code Optimization, Volume 15, Issue 3
September 2018, 322 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3274266

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 04 September 2018
Accepted: 01 June 2018
Revised: 01 May 2018
Received: 01 February 2018
Published in TACO Volume 15, Issue 3


Author Tags

  1. Multiple GPUs
  2. compiler technique
  3. compute and data localization
  4. hybrid data layout
  5. profiling

Qualifiers

  • Research-article
  • Research
  • Refereed


Article Metrics

  • Downloads (Last 12 months)340
  • Downloads (Last 6 weeks)29
Reflects downloads up to 10 Dec 2024

Cited By

  • CUDASTF: Bridging the Gap Between CUDA and Task Parallelism. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'24), 1–17, Nov. 2024. DOI: 10.1109/SC41406.2024.00049
  • CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 700–717, Nov. 2024. DOI: 10.1109/MICRO61859.2024.00058
  • SPGPU: Spatially Programmed GPU. IEEE Computer Architecture Letters 23, 2 (Jul. 2024), 223–226. DOI: 10.1109/LCA.2024.3499339
  • Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 834–847, Jun. 2024. DOI: 10.1109/ISCA59077.2024.00065
  • Salus: Efficient Security Support for CXL-Expanded GPU Memory. In 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1–15, Mar. 2024. DOI: 10.1109/HPCA57654.2024.00027
  • Characterizing Multi-Chip GPU Data Sharing. ACM Transactions on Architecture and Code Optimization 20, 4 (Oct. 2023), 1–24. DOI: 10.1145/3629521
  • Spica: Exploring FPGA Optimizations to Enable an Efficient SpMV Implementation for Computations at Edge. In 2023 IEEE International Conference on Edge Computing and Communications (EDGE), 36–42, Jul. 2023. DOI: 10.1109/EDGE60047.2023.00018
  • FILL: A heterogeneous resource scheduling system addressing the low throughput problem in GROMACS. CCF Transactions on High Performance Computing 6, 1 (Sep. 2023), 17–31. DOI: 10.1007/s42514-023-00169-5
  • EPSILOD: Efficient parallel skeleton for generic iterative stencil computations in distributed GPUs. The Journal of Supercomputing 79, 9 (Jan. 2023), 9409–9442. DOI: 10.1007/s11227-022-05040-y
  • Designing Virtual Memory System of MCM GPUs. In Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'22), 404–422, Oct. 2022. DOI: 10.1109/MICRO56248.2022.00036
