research-article

Public Access

Filtering Translation Bandwidth with Virtual Caching

Authors:

Jason Lowe-Power,

Gurindar S. Sohi

ACM SIGPLAN Notices, Volume 53, Issue 2

Pages 113 - 127

https://doi.org/10.1145/3296957.3173195

Published: 19 March 2018

Abstract

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).
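
The filtering effect described in the abstract can be illustrated with a toy model. The Python sketch below is not the paper's design or its simulator. It assumes a single fully associative LRU private TLB, a single fully associative LRU virtually addressed cache, 4 KB pages, and 64-byte lines, and it ignores synonyms, permissions, and coherence. All names (`LRUSet`, `shared_requests_physical`, `shared_requests_virtual`, `strided_trace`) and all sizes are hypothetical. It counts how many requests would reach the shared translation hardware for a strided access pattern that touches many pages but only one line per page.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes per page (assumed)
LINE_SIZE = 64    # bytes per cache line (assumed)


class LRUSet:
    """Fully associative structure with LRU replacement, keyed by an integer tag."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, tag):
        """Return True on a hit; on a miss, insert the tag, evicting the LRU entry if full."""
        if tag in self.entries:
            self.entries.move_to_end(tag)
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[tag] = None
        return False


def shared_requests_physical(trace, tlb_entries):
    """Baseline: every access consults the private TLB; each miss goes to shared translation."""
    tlb = LRUSet(tlb_entries)
    return sum(0 if tlb.access(va // PAGE_SIZE) else 1 for va in trace)


def shared_requests_virtual(trace, cache_lines):
    """Virtual cache: only accesses that miss in the cache need a translation."""
    cache = LRUSet(cache_lines)
    return sum(0 if cache.access(va // LINE_SIZE) else 1 for va in trace)


def strided_trace(pages, passes):
    """Touch the first line of each page in turn, repeated for several passes."""
    return [p * PAGE_SIZE for _ in range(passes) for p in range(pages)]


trace = strided_trace(pages=256, passes=10)
baseline = shared_requests_physical(trace, tlb_entries=32)
filtered = shared_requests_virtual(trace, cache_lines=1024)
print(baseline, filtered)  # prints "2560 256"
```

In this contrived trace the 32-entry TLB thrashes on every access, while the 256 touched lines fit in the cache, so only the first pass generates translation requests. Real workloads fall between these extremes, and the numbers here say nothing about the paper's measured results.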

Index Terms

Filtering Translation Bandwidth with Virtual Caching
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Virtual memory

Information & Contributors

Information

Published In

ACM SIGPLAN Notices Volume 53, Issue 2

ASPLOS '18

February 2018

809 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3296957

Editor:
Matthew Fluet
Rochester Institute of Technology

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN:9781450349116
DOI:10.1145/3173162
General Chairs:
Xipeng Shen, North Carolina State University, USA
James Tuck, North Carolina State University, USA
Program Chairs:
Ricardo Bianchini, Microsoft Research, USA
Vivek Sarkar, Georgia Institute of Technology, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018

Published in SIGPLAN Volume 53, Issue 2

Qualifiers

Research-article

Funding Sources

University of Wisconsin Foundation (John P. Morgridge Professor)
National Science Foundation
William F. Vilas Trust Estate (Vilas Research Professor)
Wisconsin Alumni Research Foundation

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 20
Total Downloads: 1,016
Downloads (Last 12 months): 228
Downloads (Last 6 weeks): 29

Reflects downloads up to 13 Dec 2024

Citations

Cited By

Di B, Hu D, Xie Z, Sun J, Chen H, Ren J, Li D. (2022). TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling. ACM Transactions on Architecture and Code Optimization 19:1 (1-23). Online publication date: 31-Mar-2022.
https://doi.org/10.1145/3491218
Li B, Yin J, Zhang Y, Tang X. (2021). Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB Design. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture (1154-1168). Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480083
Gupta S, Bhattacharyya A, Oh Y, Bhattacharjee A, Falsafi B, Payer M, Martínez J, Duato J, John L. (2021). Rebooting virtual memory with midgard. Proceedings of the 48th Annual International Symposium on Computer Architecture (512-525). Online publication date: 14-Jun-2021.
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00047
Liu Q, Huang D, Costero L, Zapater M, Atienza D. (2024). Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloads. ACM Transactions on Architecture and Code Optimization 21:3 (1-23). Online publication date: 20-Apr-2024.
https://dl.acm.org/doi/10.1145/3659207
B P, Cox G, Vesely J, Basu A. (2024). SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (293-308). Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00030
Park J, Kwon O, Lee Y, Kim S, Byeon G, Yoon J, Nair P, Hong S. (2024). A Case for Speculative Address Translation with Rapid Validation for GPUs. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (278-292). Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00029
Li B, Guo Y, Wang Y, Jaleel A, Yang J, Tang X. (2023). IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE Invalidations. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (1163-1177). Online publication date: 28-Oct-2023.
https://dl.acm.org/doi/10.1145/3613424.3614269
Lee J, Lee J, Oh Y, Song W, Ro W. (2023). SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (1195-1207). Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10071063
Li B, Yin J, Holey A, Zhang Y, Yang J, Tang X. (2023). Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (456-470). Online publication date: Feb-2023.
https://doi.org/10.1109/HPCA56546.2023.10071054
Qiu Z, Liu K, Chen Y. (2022). BARM: A Batch-Aware Resource Manager for Boosting Multiple Neural Networks Inference on GPUs With Memory Oversubscription. IEEE Transactions on Parallel and Distributed Systems 33:12 (4612-4624). Online publication date: 1-Dec-2022.
https://doi.org/10.1109/TPDS.2022.3199806
