
Compiler assisted hybrid implicit and explicit GPU memory management under unified address space

Published: 17 November 2019

Abstract

To improve programmability and productivity, recent GPUs adopt a virtual memory address space shared with CPUs (e.g., NVIDIA's unified memory). Unified memory shifts the data management burden from programmers to system software and hardware, and enables GPUs to address datasets that exceed their memory capacity. Our experiments show that while the implicit data transfer of unified memory can improve data movement efficiency, page fault overhead and data thrashing can erase its benefits. In this paper, we propose several user-transparent unified memory management schemes to 1) achieve adaptive implicit and explicit data transfer and 2) prevent data thrashing. Unlike previous approaches, which mostly rely on the runtime and thus incur large overhead, we demonstrate the benefits of exploiting key information from compiler analyses, including data locality, access density, and target reuse distance, to accomplish these goals. We implement the proposed schemes to improve OpenMP GPU offloading performance. Our evaluation shows that they significantly improve GPU performance and memory efficiency.
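
For concreteness, the sketch below contrasts the two data-management modes the abstract refers to: explicit transfers through OpenMP map clauses versus implicit, fault-driven page migration under a unified address space. This is an illustration only, not the paper's compiler-assisted scheme; it assumes a toolchain with OpenMP 4.5+ GPU offloading, and the commented-out requires unified_shared_memory directive further assumes OpenMP 5.0 plus hardware and driver support for unified memory (e.g., NVIDIA's unified memory).

/*
 * Illustrative SAXPY offload (assumption: an OpenMP 4.5+ offloading toolchain).
 * Explicit mode: the map() clauses bulk-copy both arrays to the GPU before the
 * kernel runs and copy y back afterwards, whether or not the kernel touches
 * all of the data.
 * Implicit mode: with the requires directive enabled (OpenMP 5.0 plus unified
 * memory hardware), the map() clauses are no longer required and pages migrate
 * on demand via GPU page faults, the mode whose fault overhead and potential
 * thrashing the abstract discusses.
 */
#include <stdio.h>
#include <stdlib.h>

/* #pragma omp requires unified_shared_memory */  /* enable for implicit mode */

#define N (1 << 24)

static void saxpy(float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:N]) map(tofrom: y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y)
        return 1;
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(2.0f, x, y);
    printf("y[0] = %.1f (expected 4.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}

In explicit mode the transfer cost is paid up front but is predictable; in implicit mode only touched pages move, at the price of fault handling and possible thrashing when the working set exceeds GPU memory. Balancing that trade-off is what the proposed compiler analyses (data locality, access density, target reuse distance) are used for.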




    Published In

    SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2019
    1921 pages
    ISBN: 9781450362290
    DOI: 10.1145/3295500

    In-Cooperation

    • IEEE CS

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. OpenMP
    3. compiler analysis
    4. implicit and explicit data transfer
    5. reuse distance
    6. runtime
    7. unified memory management

    Qualifiers

    • Research-article

    Conference

    SC '19

    Acceptance Rates

    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

    Cited By

    • (2024) SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 293-308. DOI: 10.1109/MICRO61859.2024.00030. Online: 2 Nov 2024.
    • (2023) End-to-End LU Factorization of Large Matrices on GPUs. Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 288-300. DOI: 10.1145/3572848.3577486. Online: 25 Feb 2023.
    • (2022) MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation. ACM Transactions on Architecture and Code Optimization 19(2), 1-26. DOI: 10.1145/3506705. Online: 24 Mar 2022.
    • (2022) gSoFa: Scalable Sparse Symbolic LU Factorization on GPUs. IEEE Transactions on Parallel and Distributed Systems 33(4), 1015-1026. DOI: 10.1109/TPDS.2021.3090316. Online: 1 Apr 2022.
    • (2022) Extending OpenMP and OpenSHMEM for Efficient Heterogeneous Computing. 2022 IEEE/ACM Parallel Applications Workshop: Alternatives To MPI+X (PAW-ATM), 1-12. DOI: 10.1109/PAW-ATM56565.2022.00006. Online: Nov 2022.
    • (2022) UVM Discard: Eliminating Redundant Memory Transfers for Accelerators. 2022 IEEE International Symposium on Workload Characterization (IISWC), 27-38. DOI: 10.1109/IISWC55918.2022.00013. Online: Nov 2022.
    • (2022) Raptor: Mitigating CPU-GPU False Sharing Under Unified Memory Systems. 2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC), 1-8. DOI: 10.1109/IGSC55832.2022.9969376. Online: 24 Oct 2022.
    • (2022) XUnified: A Framework for Guiding Optimal Use of GPU Unified Memory. IEEE Access 10, 82614-82625. DOI: 10.1109/ACCESS.2022.3196008. Online: 2022.
    • (2022) D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Frontiers of Computer Science 17(4). DOI: 10.1007/s11704-022-2160-z. Online: 12 Dec 2022.
    • (2021) To move or not to move? Proceedings of the 14th ACM International Conference on Systems and Storage, 1-12. DOI: 10.1145/3456727.3463766. Online: 14 Jun 2021.
