DOI: 10.1145/3293883.3295727

Throughput-oriented GPU memory allocation

Published: 16 February 2019

Abstract

Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability limits, creating significant performance bottlenecks in modules, such as memory allocators, that use them. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that are efficient for use with very high levels of concurrency.
We formulate resource allocation as a two-stage process that decouples accounting for the number of available resources from tracking the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where many threads operate on the semaphore simultaneously. We similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegating deferred reclamation to already-blocked threads greatly improves efficiency.
Using all these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are on average 16.56 times higher than the counterpart implementation in the CUDA 9 toolkit.
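The two-stage scheme described above can be illustrated with a minimal single-threaded C++ sketch: one atomic counter handles the accounting stage (with a bulk, semaphore-style acquire that reserves several resources in a single atomic operation), while a separate structure tracks the concrete blocks. The class and method names (`BulkPool`, `try_acquire_bulk`) are illustrative, not from the paper, and the tracking stage here is a plain vector rather than the scalable concurrent structures the allocator actually uses.

```cpp
#include <atomic>
#include <vector>

// Stage 1 (accounting): an atomic count of available blocks.
// Stage 2 (tracking): a container of the block indices themselves.
class BulkPool {
public:
    explicit BulkPool(int nblocks) : avail_(nblocks) {
        for (int i = 0; i < nblocks; ++i) free_.push_back(i);
    }

    // Bulk-semaphore-style acquire: one CAS reserves `n` resources at once,
    // instead of n separate decrements by n separate threads.
    bool try_acquire_bulk(int n) {
        int cur = avail_.load();
        while (cur >= n) {
            if (avail_.compare_exchange_weak(cur, cur - n)) return true;
        }
        return false;  // not enough resources; the real design would block
    }

    // Tracking stage: hand out a concrete block. Serialized here for brevity;
    // only call after a successful bulk acquire has reserved capacity.
    int take_block() {
        int b = free_.back();
        free_.pop_back();
        return b;
    }

    // Return blocks, then publish them by re-crediting the counter in bulk.
    void release_bulk(const std::vector<int>& blocks) {
        for (int b : blocks) free_.push_back(b);
        avail_.fetch_add(static_cast<int>(blocks.size()));
    }

private:
    std::atomic<int> avail_;
    std::vector<int> free_;  // single-threaded stand-in for the tracking structure
};
```

The point of the split is that the cheap, contended operation (the counter update) can be aggregated across a group of threads into one atomic op, while the heavier tracking work proceeds with reserved capacity and no risk of over-subscription.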




Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN:9781450362252
DOI:10.1145/3293883
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU programming
  2. concurrency
  3. memory allocation

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 paper acceptance rate: 29 of 152 submissions (19%)
Overall acceptance rate: 230 of 1,014 submissions (23%)

Article Metrics

  • Downloads (last 12 months): 94
  • Downloads (last 6 weeks): 17
Reflects downloads up to 11 Dec 2024

Cited By

  • SyncMalloc: A Synchronized Host-Device Co-Management System for GPU Dynamic Memory Allocation across All Scales. Proceedings of the 53rd International Conference on Parallel Processing (Aug 2024), 179-188. DOI: 10.1145/3673038.3673069
  • FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs. IEEE Transactions on Parallel and Distributed Systems 35, 12 (Dec 2024), 2423-2434. DOI: 10.1109/TPDS.2024.3477431
  • ThreadFuser: A SIMT Analysis Framework for MIMD Programs. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (Nov 2024), 1013-1026. DOI: 10.1109/MICRO61859.2024.00078
  • Neos: A NVMe-GPUs Direct Vector Service Buffer in User Space. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (May 2024), 3767-3781. DOI: 10.1109/ICDE60146.2024.00289
  • Improving the Scalability of GPU Synchronization Primitives. IEEE Transactions on Parallel and Distributed Systems 34, 1 (Jan 2023), 275-290. DOI: 10.1109/TPDS.2022.3218508
  • Lock-based or Lock-less: Which Is Fresh? IEEE INFOCOM 2023 - IEEE Conference on Computer Communications (May 2023), 1-10. DOI: 10.1109/INFOCOM53939.2023.10229077
  • SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (Feb 2023), 1195-1207. DOI: 10.1109/HPCA56546.2023.10071063
  • Occamy: Memory-efficient GPU Compiler for DNN Inference. 2023 60th ACM/IEEE Design Automation Conference (DAC) (Jul 2023), 1-6. DOI: 10.1109/DAC56929.2023.10247839
  • Page-Size Aware Buddy Allocator With Unaligned Range Supports for TLB Coalescing. IEEE Access 11 (2023), 91850-91860. DOI: 10.1109/ACCESS.2023.3308591
  • TileSpGEMM. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Apr 2022), 90-106. DOI: 10.1145/3503221.3508431
