A 64-GB Sort at 28 GB/s on a 4-GPU POWER9 Node for Uniformly-Distributed 16-Byte Records with 8-Byte Keys

Gordon C. Fossum, Ting Wang & H. Peter Hofstee

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11203)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Govindaraju et al. [1] have shown that a hybrid CPU-GPU system is cost-performance effective at sorting large datasets on a single node, but thus far large clusters used on sorting benchmarks have been limited by network and storage performance, and such clusters have remained CPU-only. With network and storage bandwidths improving more rapidly than CPU throughput, the cost effectiveness of CPU-GPU clusters for large sorts should be re-examined. As a first step, we evaluate sort performance on a single GPU-accelerated node with initial and final data residing in system memory. Access to main memory is limited to two reads and two writes, while the partitioning and sort execute in GPU memory. On a dual-socket IBM POWER9 system with four NVLink-attached NVIDIA V100 GPUs, a single-node sort of 64 GB of records with 8-byte keys and 8-byte values completes in under 2.3 s, corresponding to a sort rate of over 28 GB/s. On a small (4-node) cluster with the same amount of data per node, the cluster sort completes in under 4.5 s. Sort performance is enabled by high system memory bandwidth, management of system-memory NUMA affinities, high CPU-GPU bandwidth, an efficient GPU-based partitioner, and an optimized GPU sort implementation. A cluster version of the algorithm benefits from minimizing copy operations by using RDMA. Matching the throughput of an optimized partitioner for our system would require a 50-100 GB/s network, which is feasible with a dual-socket POWER9 system.
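The partition-then-sort structure described in the abstract can be illustrated with a minimal CPU-only sketch. This is not the authors' GPU implementation: the function names are hypothetical, the choice of four buckets merely mirrors the four GPUs, and Python's built-in sort stands in for the per-GPU sort. The sketch assumes range partitioning on the most significant key bits, which works well for uniformly distributed keys because the buckets come out roughly equal in size.

```python
import random

KEY_BITS = 64  # 8-byte keys, as stated in the abstract


def partition_by_key(records, num_buckets):
    """Range-partition (key, value) records on the most significant key bits.

    Assumption: num_buckets is a power of two, so a bit shift picks the bucket.
    """
    assert num_buckets > 0 and num_buckets & (num_buckets - 1) == 0
    shift = KEY_BITS - (num_buckets.bit_length() - 1)
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[key >> shift].append((key, value))
    return buckets


def partition_then_sort(records, num_buckets=4):
    """Partition by key range, sort each bucket, and concatenate in bucket order."""
    out = []
    for bucket in partition_by_key(records, num_buckets):
        bucket.sort(key=lambda kv: kv[0])  # stand-in for the per-GPU sort
        out.extend(bucket)
    return out


def make_records(n, seed=0):
    """Generate n records with uniformly distributed 8-byte keys and values."""
    rng = random.Random(seed)
    return [(rng.getrandbits(KEY_BITS), rng.getrandbits(64)) for _ in range(n)]


if __name__ == "__main__":
    records = make_records(1000)
    result = partition_then_sort(records)
    keys = [k for k, _ in result]
    print("sorted:", keys == sorted(keys))
```

Because every key in bucket i is smaller than every key in bucket i+1, concatenating the sorted buckets yields a globally sorted sequence with no merge step, which is why the buckets can be sorted independently.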


References

1. Govindaraju, N., Gray, J., Kumar, R., Manocha, D.: GPUTeraSort: high performance graphics co-processor sorting for large database management. In: SIGMOD (2006)
2. www.sortbenchmark.org
3. Parallel Integer Sort/BigSort. https://asc.llnl.gov/coral-2-benchmarks/
4. Kruger, F.: CPU Bandwidth, the Worrisome 2020 Trend, March 2016. https://blog.westerndigital.com/cpu-bandwidth-the-worrisome-2020-trend/. Accessed Feb 2018
5. Jiang, J., et al.: Tencent Sort. http://www.sortbenchmark.org/TencentSort2016.pdf. Accessed Dec 2017
6. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for manycore GPUs. In: IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2009, pp. 1–10. IEEE (2009)
7. Arkhipov, D.I., Wu, D., Li, K., Regan, A.C.: Sorting with GPUs: a survey. arXiv:1709.02520, September 2017. www.arxiv.org/pdf/1709.02520
8. Stehle, E., Jacobsen, H.-A.: A memory bandwidth-efficient hybrid radix sort on GPUs. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 417–432. ACM, New York (2017)
9. Hoberock, J., Bell, N.: Thrust: A Parallel Template Library (2016). https://thrust.github.io
10. Cho, M., Brand, D., Bordawekar, R., Finkler, U., Kulandaisamy, V., Puri, R.: PARADIS: an efficient parallel algorithm for in-place radix sort. In: PVLDB (2015)

Acknowledgements

The authors would like to thank Mark Nutter for discussions early on in the project. We thank Bruce D’Amora for his support and feedback, and we thank the anonymous reviewers of an earlier version of this paper for their extensive feedback, which led to significant improvements in this version.

Author information

Authors and Affiliations

IBM Research, Austin, TX, USA
Gordon C. Fossum & H. Peter Hofstee
IBM Systems, Shanghai, China
Ting Wang
TU Delft, Delft, Netherlands
H. Peter Hofstee

Corresponding author

Correspondence to H. Peter Hofstee.

Editor information

Editors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Rio Yokota
University of Edinburgh, Edinburgh, UK
Michèle Weiland
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
John Shalf
Swiss National Supercomputing Centre, Lugano, Switzerland
Sadaf Alam

About this paper

Cite this paper

Fossum, G.C., Wang, T., Hofstee, H.P. (2018). A 64-GB Sort at 28 GB/s on a 4-GPU POWER9 Node for Uniformly-Distributed 16-Byte Records with 8-Byte Keys. In: Yokota, R., Weiland, M., Shalf, J., Alam, S. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science(), vol 11203. Springer, Cham. https://doi.org/10.1007/978-3-030-02465-9_25

DOI: https://doi.org/10.1007/978-3-030-02465-9_25
Published: 25 January 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02464-2
Online ISBN: 978-3-030-02465-9
eBook Packages: Computer Science, Computer Science (R0)
