research-article
Open access

Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead

Published: 28 March 2016

Abstract

This work proposes a novel scheme for supporting unified virtual memory in heterogeneous systems. Prior research proposals implement coherence protocols that enforce sequential consistency (SC) between central processing unit (CPU) cores and between devices. Such mechanisms introduce severe bottlenecks in the system; we therefore adopt the heterogeneous-race-free (HRF) memory model instead. Using HRF simplifies both the coherence protocol and the graphics processing unit (GPU) memory management unit (MMU). Our protocol optimizes for CPU and GPU demands separately: the GPU part is simpler, while the CPU part is more elaborate and latency-aware. On average, we achieve a 45% speedup and a 45% reduction in energy-delay product (20% in energy) over the corresponding SC implementation.
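As a rough consistency check on the reported figures, the short sketch below recomputes the energy-delay product (EDP) reduction from the stated speedup and energy saving. It assumes that EDP is energy multiplied by execution time, that a 45% speedup means execution time is divided by 1.45, and that the averaged figures can be composed directly; the abstract states none of these, and the function name `edp_ratio` is purely illustrative.

```python
def edp_ratio(speedup: float, energy_reduction: float) -> float:
    """Return (new EDP) / (baseline EDP), taking EDP = energy * delay.

    Assumption: a speedup of s divides execution time by (1 + s).
    """
    delay_ratio = 1.0 / (1.0 + speedup)      # new delay relative to baseline
    energy_ratio = 1.0 - energy_reduction    # new energy relative to baseline
    return energy_ratio * delay_ratio


ratio = edp_ratio(speedup=0.45, energy_reduction=0.20)
print(f"EDP reduction: {1.0 - ratio:.1%}")  # prints "EDP reduction: 44.8%"
```

Under these assumptions, a 20% energy saving combined with a 45% speedup gives an EDP reduction of about 45%, in line with the figure quoted in the abstract.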



    Information & Contributors

    Information

    Published In

    ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 1
    April 2016
    347 pages
    ISSN: 1544-3566
    EISSN: 1544-3973
    DOI: 10.1145/2899032

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 March 2016
    Accepted: 01 November 2015
    Revised: 01 October 2015
    Received: 01 May 2015
    Published in TACO Volume 13, Issue 1

    Author Tags

    1. GPU MMU design
    2. Multicore
    3. directory-less protocol
    4. heterogeneous coherence
    5. virtual coherence protocol

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Swedish Research Council UPMARC Linnaeus Centre
    • European Commission FEDER funds
    • “Fundación Seneca-Agencia de Ciencia y Tecnología de la Región de Murcia”
    • the Spanish MINECO
    • EU Project LPGPU
    • the project “Jóvenes Líderes en Investigación”

    Article Metrics

    • Downloads (last 12 months): 137
    • Downloads (last 6 weeks): 18
    Reflects downloads up to 12 Dec 2024

    Cited By

    • (2024) CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 700-717. DOI: 10.1109/MICRO61859.2024.00058. Online publication date: 2-Nov-2024
    • (2024) Atomic Cache: Enabling Efficient Fine-Grained Synchronization with Relaxed Memory Consistency on GPGPUs Through In-Cache Atomic Operations. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 671-685. DOI: 10.1109/MICRO61859.2024.00056. Online publication date: 2-Nov-2024
    • (2023) Improving the Scalability of GPU Synchronization Primitives. IEEE Transactions on Parallel and Distributed Systems 34:1, 275-290. DOI: 10.1109/TPDS.2022.3218508. Online publication date: 1-Jan-2023
    • (2022) Demystifying BERT: System Design Implications. 2022 IEEE International Symposium on Workload Characterization (IISWC), 296-309. DOI: 10.1109/IISWC55918.2022.00033. Online publication date: Nov-2022
    • (2022) HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 756-771. DOI: 10.1109/HPCA53966.2022.00061. Online publication date: Apr-2022
    • (2022) Only Buffer When You Need To: Reducing On-chip GPU Traffic with Reconfigurable Local Atomic Buffers. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 676-691. DOI: 10.1109/HPCA53966.2022.00056. Online publication date: Apr-2022
    • (2020) Inter-kernel Reuse-aware Thread Block Scheduling. ACM Transactions on Architecture and Code Optimization 17:3, 1-27. DOI: 10.1145/3406538. Online publication date: 17-Aug-2020
    • (2020) Deterministic Atomic Buffering. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 981-995. DOI: 10.1109/MICRO50266.2020.00083. Online publication date: Oct-2020
    • (2020) HMG: Extending Cache Coherence Protocols Across Modern Hierarchical Multi-GPU Systems. 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 582-595. DOI: 10.1109/HPCA47549.2020.00054. Online publication date: Feb-2020
    • (2019) Compiler assisted hybrid implicit and explicit GPU memory management under unified address space. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.1145/3295500.3356141. Online publication date: 17-Nov-2019
