[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article
Public Access

Filtering Translation Bandwidth with Virtual Caching

Published: 19 March 2018 Publication History

Abstract

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).

References

[1]
{n. d.}. AMD and HSA. ({n. d.}). Retrieved Accessed: 2017-12-09 from http://www.amd.com/en-us/innovations/software-technologies/hsa
[2]
{n. d.}. The ARM CoreLink CCI-550 Cache Coherent Interconnect. ({n. d.}). Retrieved Accessed: 2017-12-09 from https://developer.arm.com/products/system-ip/corelink-interconnect/corelink-cache-coherent-interconnect-family/corelink-cci-550
[3]
Todd M. Austin and Gurindar S. Sohi. 1996. High-bandwidth Address Translation for Multiple-issue Processors. In Proceedings of the 23rd Annual International Symposium on Computer Architecture (ISCA '96). ACM, New York, NY, USA, 158-167.
[4]
Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. 2013. Efficient Virtual Memory for Big Memory Servers. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 237-248.
[5]
Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2012. Reducing Memory Reference Energy with Opportunistic Virtual Caching. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). IEEE Computer Society, Washington, DC, USA, 297-308. http://dl.acm.org/citation.cfm?id=2337159.2337194
[6]
Benjie Batanes. 2016. PS4 Pro Specs: How Does It Fare Against Xbox Project Scorpio? Which One Is Better? (November 2016). Retrieved Accessed: 2017-12-09 from http://www.itechpost.com/articles/50922/20161107/ps4-pro-specs-fare-against-xbox-project-scorpio-one-better.htm
[7]
A. Bhattacharjee. 2017. Preserving Virtual Memory by Mitigating the Address Translation Wall. IEEE Micro 37, 5 (September 2017), 6-10.
[8]
Abhishek Bhattacharjee, Daniel Lustig, and Margaret Martonosi. 2011. Shared Last-level TLBs for Chip Multiprocessors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA '11). IEEE Computer Society, Washington, DC, USA, 62-63. http://dl.acm.org/citation.cfm?id=2014698.2014896
[9]
Jeffrey S. Chase, Henry M. Levy, Michael J. Feeley, and Edward D. Lazowska. 1994. Sharing and Protection in a Single-address-space Operating System. ACM Trans. Comput. Syst. 12, 4 (Nov. 1994), 271-307.
[10]
Jeffrey S. Chase, Henry M. Levy, Edward D. Lazowska, and Miche Baker-Harvey. 1992. Lightweight Shared Objects in a 64-bit Operating System. In Conference Proceedings on Object-oriented Programming Systems, Languages, and Applications (OOPSLA '92). ACM, New York, NY, USA, 397-413.
[11]
Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt, and Kevin Skadron. 2013. Pannotia: Understanding irregular GPGPU graph applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), 2013 IEEE International Symposium on. IEEE, 185-195.
[12]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 44-54.
[13]
Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 343-355.
[14]
Ian Cutress. 2017. Hot Chips: Microsoft Xbox One X Scorpio Engine Live Blog. (August 2017). Retrieved Accessed: 2017-12-13 from https://www.anandtech.com/show/11740/hot-chips-microsoft-xbox-one-x-scorpio-engine-live-blog-930am-pt-430pm-utc
[15]
Koen De Bosschere, Albert Cohen, Jonas Maebe, and Harm Munk. 2015. HiPEAC Vision. (2015).
[16]
James R. Goodman. 1987. Coherency for Multiprocessor Virtual Address Caches. SIGPLAN Not. 22, 10 (Oct. 1987), 72-81.
[17]
Mark Hill, Susan Eggers, Jim Larus, George Taylor, Glenn Adams, B. K. Bose, Garth Gibson, Paul Hansen, Jon Keller, Shing Kong, Corinna Lee, Daebum Lee, Joan Pendleton, Scott Ritchie, David A. Wood, Ben Zorn, Paul Hilfinger, Dave Hodges, Randy Katz, John Ousterhout, and Dave Patterson. 1986. Design Decisions in SPUR. Computer 19, 11 (Nov. 1986), 8-22.
[18]
Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2014. Heterogeneous-race-free Memory Models. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 427-440.
[19]
Bruce Jacob. 2009. The Memory System: You Can'T Avoid It, You Can'T Ignore It, You Can'T Fake It. Morgan and Claypool Publishers.
[20]
Tomas Karnagel, Tal Ben-Nun, Matthias Werner, Dirk Habich, and Wolfgang Lehner. 2017. Big Data Causing Big (TLB) Problems: Taming Random Memory Accesses on the GPU. In Proceedings of the 13th International Workshop on Data Management on New Hardware (DAMON '17). ACM, New York, NY, USA, Article 6, 10 pages.
[21]
Stefanos Kaxiras and Alberto Ros. 2013. A New Perspective for Efficient Virtual-cache Coherence. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). ACM, New York, NY, USA, 535-546.
[22]
Andy Kegel, Paul Blinzer, Arka Basu, and Maggie Chan. 2016. Virtualizing IO through IO Memory Management Unit. (2016). Retrieved Accessed: 2017-12-09 from http://pages.cs.wisc.edu/~basu/iscaiommututorial/IOMMUTUTORIALASPLOS2016.pdf
[23]
Hyesoon Kim. 2012. Supporting Virtual Memory in GPGPU Without Supporting Precise Exceptions. In Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness (MSPC '12). ACM, New York, NY, USA, 70-71.
[24]
Sangman Kim, Seonggu Huh, Yige Hu, Xinya Zhang, Emmett Witchel, Amir Wated, and Mark Silberstein. 2014. GPUnet: Networking Abstractions for GPU Programs. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14). USENIX Association, Berkeley, CA, USA, 201-216. http://dl.acm.org/citation.cfm?id=2685048.2685065
[25]
Eric J. Koldinger, Jeffrey S. Chase, and Susan J. Eggers. 1992. Architecture Support for Single Address Space Operating Systems. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V). ACM, New York, NY, USA, 175-186.
[26]
Konstantinos Koukos, Alberto Ros, Erik Hagersten, and Stefanos Kaxiras. 2016. Building Heterogeneous Unified Virtual Memories (UVMs) Without the Overhead. ACM Trans. Archit. Code Optim. 13, 1, Article 1 (March 2016), 22 pages.
[27]
Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula. 2015. Fusion: Design Tradeoffs in Coherent Cache Hierarchies for Accelerators. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 733-745.
[28]
George Kyriazis. 2012. Heterogeneous system architecture: A technical review. AMD Fusion Developer Summit (2012).
[29]
Jaikrishnan Menon, Marc De Kruijf, and Karthikeyan Sankaralingam. 2012. iGPU: Exception Support and Speculative Execution on GPUs. In Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). IEEE Computer Society, Washington, DC, USA, 72-83. http://dl.acm.org/citation.cfm?id=2337159.2337168
[30]
Juan Navarro, Sitararn Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. SIGOPS Oper. Syst. Rev. 36, SI (Dec. 2002), 89-104.
[31]
Chang Hyun Park, Taekyung Heo, and Jaehyuk Huh. 2016. Efficient Synonym Filtering and Scalable Delayed Translation for Hybrid Virtual Caching. In Proceedings of the 43th Annual International Symposium on Computer Architecture (ISCA '16). IEEE Computer Society, Washington, DC, USA.
[32]
Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. 2012. CoLT: Coalesced Large-Reach TLBs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-45). IEEE Computer Society, Washington, DC, USA, 258-269.
[33]
Bharath Pichai, Lisa Hsu, and Abhishek Bhattacharjee. 2014. Architectural Support for Address Translation on GPUs: Designing Memory Management Units for CPU/GPUs with Unified Address Spaces. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 743-758.
[34]
Jason Power. 2017. Inferring Kaveri's Shared Virtual Memory Implementation. (July 2017). Retrieved Accessed: 2017-12-09 from http://www.lowepower.com/jason/inferring-kaveris-shared-virtual-memory-implementation.html
[35]
Jason Power, Arkaprava Basu, Junli Gu, Sooraj Puthoor, Bradford M. Beckmann, Mark D. Hill, Steven K. Reinhardt, and David A. Wood. 2013. Heterogeneous system coherence for integrated CPU-GPU systems. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, 457-467.
[36]
Jason Power, Joel Hestness, Marc Orr, Mark Hill, and David Wood. 2014. gem5-gpu: A Heterogeneous CPU-GPU Simulator. Computer Architecture Letters 13, 1 (Jan 2014).
[37]
Jason Power, Mark D. Hill, and David A. Wood. 2014. Supporting x86-64 address translation for 100s of GPU lanes. In Proceedings of the 2014 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA '14). IEEE, 568-578.
[38]
Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, and David A. Wood. 2015. Toward GPUs Being Mainstream in Analytic Processing: An Initial Argument Using Simple Scan-aggregate Queries. In Proceedings of the 11th International Workshop on Data Management on New Hardware (DaMoN'15). ACM, New York, NY, USA, Article 11, 8 pages.
[39]
Kiran Puttaswamy and Gabriel H. Loh. 2006. Thermal Analysis of a 3D Die-stacked High-performance Microprocessor. In Proceedings of the 16th ACM Great Lakes Symposium on VLSI (GLSVLSI '06). ACM, New York, NY, USA, 19-24.
[40]
Xiaogang Qiu and Michel Dubois. 2001. Towards virtually-addressed memory hierarchies. In Proceedings of the 2001 IEEE 7th International Symposium on High Performance Computer Architecture (HPCA '01). 51-62.
[41]
Xiaogang Qiu and Michel Dubois. 2008. The Synonym Lookaside Buffer: A Solution to the Synonym Problem in Virtual Caches. IEEE Trans. Comput. 57, 12 (Dec. 2008), 1585-1599.
[42]
Jude A. Rivers, Gary S. Tyson, Edward S. Davidson, and Todd M. Austin. 1997. On High-bandwidth Data Cache Design for Multi-issue Processors. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-30). IEEE Computer Society, Washington, DC, USA, 46-56. http://dl.acm.org/citation.cfm?id=266800.266805
[43]
Mark Silberstein, Bryan Ford, Idit Keidar, and Emmett Witchel. 2013. GPUfs: Integrating a File System with GPUs. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '13). ACM, New York, NY, USA, 485-498.
[44]
Abhayendra Singh, Shaizeen Aga, and Satish Narayanasamy. 2015. Efficiently Enforcing Strong Memory Ordering in GPUs. In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48). ACM, New York, NY, USA, 699-712.
[45]
Inderpreet Singh, Arrvindh Shriraman, Wilson W. L. Fung, Mike O'Connor, and Tor M. Aamodt. 2013. Cache Coherence for GPU Architectures. In Proceedings of the 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA '13). IEEE Computer Society, Washington, DC, USA, 578-590.
[46]
Avinash Sodani. 2011. Race to Exascale: Opportunities and Challenges (MICRO 2011 Keynote talk).
[47]
J. Vesely, A. Basu, M. Oskin, G. H. Loh, and A. Bhattacharjee. 2016. Observations and opportunities in architecting shared virtual memory for heterogeneous systems. In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 161-171.
[48]
W. H. Wang, J.-L. Baer, and H. M. Levy. 1989. Organization and Performance of a Two-level Virtual-real Cache Hierarchy. In Proceedings of the 16th Annual International Symposium on Computer Architecture (ISCA '89). ACM, New York, NY, USA, 140-148.
[49]
Neil H. E. Weste and Kamran Eshraghian. 1985. Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
[50]
H. Wong, M. M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. 2010. Demystifying GPU microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of Systems Software (ISPASS). 235-246.
[51]
D. A. Wood, S. J. Eggers, G. Gibson, M. D. Hill, and J. M. Pendleton. 1986. An In-cache Address Translation Mechanism. In Proceedings of the 13th Annual International Symposium on Computer Architecture (ISCA '86). IEEE Computer Society Press, Los Alamitos, CA, USA, 358-365. http://dl.acm.org/citation.cfm?id=17407.17398
[52]
H. Yoon and G. S. Sohi. 2016. Revisiting virtual L1 caches: A practical design using dynamic synonym remapping. In Proceedings of the 2016 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA '16). 212-224.
[53]
Lixin Zhang, Evan Speight, Ram Rajamony, and Jiang Lin. 2010. Enigma: Architectural and Operating System Support for Reducing the Impact of Address Translation. In Proceedings of the 24th ACM International Conference on Supercomputing (ICS '10). ACM, New York, NY, USA, 159-168.

Cited By

View all
  • (2022)TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware SchedulingACM Transactions on Architecture and Code Optimization10.1145/349121819:1(1-23)Online publication date: 31-Mar-2022
  • (2021)Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB DesignMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3466752.3480083(1154-1168)Online publication date: 18-Oct-2021
  • (2021)Rebooting virtual memory with midgardProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00047(512-525)Online publication date: 14-Jun-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 53, Issue 2
ASPLOS '18
February 2018
809 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3296957
Issue’s Table of Contents
  • cover image ACM Conferences
    ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2018
    827 pages
    ISBN:9781450349116
    DOI:10.1145/3173162
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018
Published in SIGPLAN Volume 53, Issue 2

Check for updates

Author Tags

  1. TLB
  2. address translation
  3. heterogeneous computing
  4. virtual caching
  5. virtual memory

Qualifiers

  • Research-article

Funding Sources

  • University of Wisconsin Foundation (John P. Morgridge Professor)
  • National Science Foundation
  • William F. Vilas Trust Estate (Vilas Research Professor)
  • Wisconsin Alumni Research Foundation

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)228
  • Downloads (Last 6 weeks)29
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2022)TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware SchedulingACM Transactions on Architecture and Code Optimization10.1145/349121819:1(1-23)Online publication date: 31-Mar-2022
  • (2021)Improving Address Translation in Multi-GPUs via Sharing and Spilling aware TLB DesignMICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3466752.3480083(1154-1168)Online publication date: 18-Oct-2021
  • (2021)Rebooting virtual memory with midgardProceedings of the 48th Annual International Symposium on Computer Architecture10.1109/ISCA52012.2021.00047(512-525)Online publication date: 14-Jun-2021
  • (2024)Intermediate Address Space: virtual memory optimization of heterogeneous architectures for cache-resident workloadsACM Transactions on Architecture and Code Optimization10.1145/365920721:3(1-23)Online publication date: 20-Apr-2024
  • (2024)SUV: Static Analysis Guided Unified Virtual Memory2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00030(293-308)Online publication date: 2-Nov-2024
  • (2024)A Case for Speculative Address Translation with Rapid Validation for GPUs2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00029(278-292)Online publication date: 2-Nov-2024
  • (2023)IDYLL: Enhancing Page Translation in Multi-GPUs via Light Weight PTE InvalidationsProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture10.1145/3613424.3614269(1163-1177)Online publication date: 28-Oct-2023
  • (2023)SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071063(1195-1207)Online publication date: Feb-2023
  • (2023)Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA56546.2023.10071054(456-470)Online publication date: Feb-2023
  • (2022)BARM: A Batch-Aware Resource Manager for Boosting Multiple Neural Networks Inference on GPUs With Memory OversubscriptionIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2022.319980633:12(4612-4624)Online publication date: 1-Dec-2022
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media