
Compiler assisted hybrid implicit and explicit GPU memory management under unified address space

Published: 17 November 2019

Abstract

To improve programmability and productivity, recent GPUs adopt a virtual memory address space shared with CPUs (e.g., NVIDIA's unified memory). Unified memory shifts the data management burden from programmers to system software and hardware, and enables GPUs to address datasets that exceed their memory capacity. Our experiments show that while the implicit data transfer of unified memory can improve data movement efficiency, page fault overhead and data thrashing can erase its benefits. In this paper, we propose several user-transparent unified memory management schemes to 1) achieve adaptive implicit and explicit data transfer and 2) prevent data thrashing. Unlike previous approaches, which mostly rely on the runtime and thus incur large overhead, we demonstrate the benefits of exploiting key information from compiler analyses, including data locality, access density, and target reuse distance, to accomplish these goals. We implement the proposed schemes to improve OpenMP GPU offloading performance. Our evaluation shows that they significantly improve GPU performance and memory efficiency.
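
For concreteness, the sketch below contrasts the two data-management modes the abstract refers to: explicit transfers through OpenMP map clauses versus implicit, fault-driven page migration under a unified address space. This is an illustration only, not the paper's compiler-assisted scheme; it assumes a toolchain with OpenMP 4.5+ GPU offloading, and the commented-out requires unified_shared_memory directive further assumes OpenMP 5.0 plus hardware and driver support for unified memory (e.g., NVIDIA's unified memory).

/*
 * Illustrative SAXPY offload (assumption: an OpenMP 4.5+ offloading toolchain).
 * Explicit mode: the map() clauses bulk-copy both arrays to the GPU before the
 * kernel runs and copy y back afterwards, whether or not the kernel touches
 * all of the data.
 * Implicit mode: with the requires directive enabled (OpenMP 5.0 plus unified
 * memory hardware), the map() clauses are no longer required and pages migrate
 * on demand via GPU page faults, the mode whose fault overhead and potential
 * thrashing the abstract discusses.
 */
#include <stdio.h>
#include <stdlib.h>

/* #pragma omp requires unified_shared_memory */  /* enable for implicit mode */

#define N (1 << 24)

static void saxpy(float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y)
        return 1;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(2.0f, x, y);
    printf("y[0] = %.1f (expected 4.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}

In explicit mode the transfer cost is paid up front but is predictable; in implicit mode only touched pages move, at the price of fault handling and possible thrashing when the working set exceeds GPU memory. Balancing that trade-off is what the proposed compiler analyses (data locality, access density, target reuse distance) are used for.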




    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. OpenMP
    3. compiler analysis
    4. implicit and explicit data transfer
    5. reuse distance
    6. runtime
    7. unified memory management

    Qualifiers

    • Research-article

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 293-308. DOI: 10.1109/MICRO61859.2024.00030. Online: 2 Nov 2024.
    • (2023) End-to-End LU Factorization of Large Matrices on GPUs. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 288-300. DOI: 10.1145/3572848.3577486. Online: 25 Feb 2023.
    • (2022) MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation. ACM Transactions on Architecture and Code Optimization 19(2), 1-26. DOI: 10.1145/3506705. Online: 24 Mar 2022.
    • (2022) gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems 33(4), 1015-1026. DOI: 10.1109/TPDS.2021.3090316. Online: 1 Apr 2022.
    • (2022) Extending OpenMP and OpenSHMEM for Efficient Heterogeneous Computing. 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), 1-12. DOI: 10.1109/PAW-ATM56565.2022.00006. Online: Nov 2022.
    • (2022) UVM Discard: Eliminating Redundant Memory Transfers for Accelerators. 2022 IEEE International Symposium on Workload Characterization (IISWC), 27-38. DOI: 10.1109/IISWC55918.2022.00013. Online: Nov 2022.
    • (2022) Raptor: Mitigating CPU-GPU False Sharing Under Unified Memory Systems. 2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC), 1-8. DOI: 10.1109/IGSC55832.2022.9969376. Online: 24 Oct 2022.
    • (2022) XUnified: A Framework for Guiding Optimal Use of GPU Unified Memory. IEEE Access 10, 82614-82625. DOI: 10.1109/ACCESS.2022.3196008. Online: 2022.
    • (2022) D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Frontiers of Computer Science 17(4). DOI: 10.1007/s11704-022-2160-z. Online: 12 Dec 2022.
    • (2021) To move or not to move? Proceedings of the 14th ACM International Conference on Systems and Storage, 1-12. DOI: 10.1145/3456727.3463766. Online: 14 Jun 2021.
