DOI: 10.1145/3293883.3295727

Throughput-oriented GPU memory allocation

Published: 16 February 2019

Abstract

Throughput-oriented architectures, such as GPUs, can sustain three orders of magnitude more concurrent threads than multicore architectures. This level of concurrency pushes typical synchronization primitives (e.g., mutexes) over their scalability limits, creating significant performance bottlenecks in modules, such as memory allocators, that use them. In this paper, we develop concurrent programming techniques and synchronization primitives, in support of a dynamic memory allocator, that are efficient for use with very high levels of concurrency.
We formulate resource allocation as a two-stage process that decouples accounting for the number of available resources from tracking the available resources themselves. To facilitate the accounting stage, we introduce a novel bulk semaphore abstraction that extends traditional semaphore semantics by optimizing for the case where many threads operate on the semaphore simultaneously. We similarly design new collective synchronization primitives that enable groups of cooperating threads to enter critical sections together. Finally, we show that delegating deferred reclamation to already-blocked threads greatly improves efficiency.
Using all these techniques, our throughput-oriented memory allocator delivers both high allocation rates and low memory fragmentation on modern GPUs. Our experiments demonstrate that it achieves allocation rates that are on average 16.56 times higher than the counterpart implementation in the CUDA 9 toolkit.
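The two-stage scheme described above can be illustrated with a minimal single-threaded C++ sketch: one atomic counter handles the accounting stage (with a bulk, semaphore-style acquire that reserves several resources in a single atomic operation), while a separate structure tracks the concrete blocks. The class and method names (`BulkPool`, `try_acquire_bulk`) are illustrative, not from the paper, and the tracking stage here is a plain vector rather than the scalable concurrent structures the allocator actually uses.

```cpp
#include <atomic>
#include <vector>

// Stage 1 (accounting): an atomic count of available blocks.
// Stage 2 (tracking): a container of the block indices themselves.
class BulkPool {
public:
    explicit BulkPool(int nblocks) : avail_(nblocks) {
        for (int i = 0; i < nblocks; ++i) free_.push_back(i);
    }

    // Bulk-semaphore-style acquire: one CAS reserves `n` resources at once,
    // instead of n separate decrements by n separate threads.
    bool try_acquire_bulk(int n) {
        int cur = avail_.load();
        while (cur >= n) {
            if (avail_.compare_exchange_weak(cur, cur - n)) return true;
        }
        return false;  // not enough resources; the real design would block
    }

    // Tracking stage: hand out a concrete block. Serialized here for brevity;
    // only call after a successful bulk acquire has reserved capacity.
    int take_block() {
        int b = free_.back();
        free_.pop_back();
        return b;
    }

    // Return blocks, then publish them by re-crediting the counter in bulk.
    void release_bulk(const std::vector<int>& blocks) {
        for (int b : blocks) free_.push_back(b);
        avail_.fetch_add(static_cast<int>(blocks.size()));
    }

private:
    std::atomic<int> avail_;
    std::vector<int> free_;  // single-threaded stand-in for the tracking structure
};
```

The point of the split is that the cheap, contended operation (the counter update) can be aggregated across a group of threads into one atomic op, while the heavier tracking work proceeds with reserved capacity and no risk of over-subscription.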




Published In

PPoPP '19: Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming
February 2019
472 pages
ISBN:9781450362252
DOI:10.1145/3293883
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU programming
  2. concurrency
  3. memory allocation

Qualifiers

  • Research-article

Conference

PPoPP '19

Acceptance Rates

PPoPP '19 paper acceptance rate: 29 of 152 submissions (19%)
Overall acceptance rate: 230 of 1,014 submissions (23%)

Article Metrics

  • Downloads (last 12 months): 94
  • Downloads (last 6 weeks): 17
Reflects downloads up to 11 Dec 2024

Cited By

  • SyncMalloc: A Synchronized Host-Device Co-Management System for GPU Dynamic Memory Allocation across All Scales. Proceedings of the 53rd International Conference on Parallel Processing (Aug 2024), 179-188. DOI: 10.1145/3673038.3673069
  • FastLoad: Speeding Up Data Loading of Both Sparse Matrix and Vector for SpMV on GPUs. IEEE Transactions on Parallel and Distributed Systems 35, 12 (Dec 2024), 2423-2434. DOI: 10.1109/TPDS.2024.3477431
  • ThreadFuser: A SIMT Analysis Framework for MIMD Programs. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO) (Nov 2024), 1013-1026. DOI: 10.1109/MICRO61859.2024.00078
  • Neos: A NVMe-GPUs Direct Vector Service Buffer in User Space. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (May 2024), 3767-3781. DOI: 10.1109/ICDE60146.2024.00289
  • Improving the Scalability of GPU Synchronization Primitives. IEEE Transactions on Parallel and Distributed Systems 34, 1 (Jan 2023), 275-290. DOI: 10.1109/TPDS.2022.3218508
  • Lock-based or Lock-less: Which Is Fresh? IEEE INFOCOM 2023 - IEEE Conference on Computer Communications (May 2023), 1-10. DOI: 10.1109/INFOCOM53939.2023.10229077
  • SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (Feb 2023), 1195-1207. DOI: 10.1109/HPCA56546.2023.10071063
  • Occamy: Memory-efficient GPU Compiler for DNN Inference. 2023 60th ACM/IEEE Design Automation Conference (DAC) (Jul 2023), 1-6. DOI: 10.1109/DAC56929.2023.10247839
  • Page-Size Aware Buddy Allocator With Unaligned Range Supports for TLB Coalescing. IEEE Access 11 (2023), 91850-91860. DOI: 10.1109/ACCESS.2023.3308591
  • TileSpGEMM. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Apr 2022), 90-106. DOI: 10.1145/3503221.3508431
