research-article

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Authors: Nadathur Satish, Jatin Chhugani, Anthony D. Nguyen, Pradeep Dubey

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 351 - 362

https://doi.org/10.1145/1807167.1807207

Published: 06 June 2010

Abstract

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures: the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes; such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD width and bandwidth. Our models successfully predict the performance results we obtain. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that at large data sizes of 2⁴⁰ (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.
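
The abstract's point about key size can be illustrated with a small sketch. This is not the authors' implementation (theirs are tuned, multi-threaded SIMD kernels); it is a plain Python illustration under stated assumptions: an LSD radix sort with a hypothetical 8-bit digit, and a bottom-up two-way merge sort. The function names and the pass counters are introduced here only for illustration. It shows that the number of radix passes over the data grows with the key width, while the number of merge passes depends only on the number of records.

```python
import heapq


def lsd_radix_sort(keys, key_bits, digit_bits=8):
    """Sort non-negative integers below 2**key_bits, one digit per pass.

    Returns the sorted list and the number of passes over the data.
    The 8-bit digit is an illustrative assumption, not a tuned value.
    """
    mask = (1 << digit_bits) - 1
    passes = 0
    for shift in range(0, key_bits, digit_bits):
        # Stable scatter of every key into the bucket of its current digit.
        buckets = [[] for _ in range(1 << digit_bits)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
        passes += 1
    return keys, passes


def bottom_up_merge_sort(keys):
    """Two-way bottom-up merge sort.

    Returns the sorted list and the number of merge passes over the data.
    """
    runs = [[k] for k in keys]
    passes = 0
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(list(heapq.merge(runs[i], runs[i + 1])))
            else:
                merged.append(runs[i])  # odd run out, carried to next pass
        runs = merged
        passes += 1
    return (runs[0] if runs else []), passes


keys = [170, 45, 75, 90, 802, 24, 2, 66]
sorted32, radix_passes_32 = lsd_radix_sort(keys, key_bits=32)
sorted64, radix_passes_64 = lsd_radix_sort(keys, key_bits=64)
merged, merge_passes = bottom_up_merge_sort(keys)
print(radix_passes_32, radix_passes_64, merge_passes)
```

In this sketch, doubling the key width from 32 to 64 bits doubles the radix passes, and each pass reads and writes every record; the merge pass count is unchanged because it depends on the record count alone. Real implementations differ in many ways (digit width, cache blocking, multiway merging), so this only shows the direction of the trend, not its magnitude.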

Index Terms

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
1. Information systems
  1. Data management systems
    1. Database management system engines

Information & Contributors

Information

Published In

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

June 2010

1286 pages

ISBN: 9781450300322

DOI: 10.1145/1807167

General Chair: Ahmed Elmagarmid, Purdue University, USA

Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 June 2010

Qualifiers

Research-article

Conference

SIGMOD/PODS '10

Sponsor: SIGMOD

SIGMOD/PODS '10: International Conference on Management of Data

June 6 - 10, 2010

Indianapolis, Indiana, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 170
Total Downloads: 3,826

Downloads (Last 12 months): 121
Downloads (Last 6 weeks): 12

Reflects downloads up to 23 Dec 2024

Citations

Cited By

Jasny M, Thostrup L, Tamimi S, Koch A, István Z, Binnig C (2024) Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSs. Proceedings of the ACM on Management of Data 2:1 (1-28). Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639291
Habich D, Pietrzyk J, Barcelo P, Sanchez-Pi N, Meliou A, Sudarshan S (2024) SIMDified Data Processing - Foundations, Abstraction, and Advanced Techniques. Companion of the 2024 International Conference on Management of Data (613-621). Online publication date: 9-Jun-2024
https://dl.acm.org/doi/10.1145/3626246.3654694
Zulian P, Ben Bader S, Fourestey G, Krause R, Rossinelli D (2024) Data-centric workloads with MPI_Sort. Journal of Parallel and Distributed Computing (104833). Online publication date: Jan-2024
https://doi.org/10.1016/j.jpdc.2023.104833
Zhang J, Zhou J, Zhang X, Ma D, Gong C (2024) Fine-grained vectorized merge sorting on RISC-V: from register to cache. CCF Transactions on High Performance Computing. Online publication date: 18-Dec-2024
https://doi.org/10.1007/s42514-024-00201-2
Mühlig J, Teubner J (2023) Micro Partitioning: Friendly to the Hardware and the Developer. Proceedings of the 19th International Workshop on Data Management on New Hardware (27-34). Online publication date: 18-Jun-2023
https://dl.acm.org/doi/10.1145/3592980.3595310
Thostrup L, Doci G, Boeschen N, Luthra M, Binnig C (2023) Distributed GPU Joins on Fast RDMA-capable Networks. Proceedings of the ACM on Management of Data 1:1 (1-26). Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588709
Kobayashi R, Miura K, Fujita N, Boku T, Amagasa T (2022) An Open-source FPGA Library for Data Sorting. Journal of Information Processing 30 (766-777). Online publication date: 2022
https://doi.org/10.2197/ipsjjip.30.766
Arman A, Loguinov D (2022) Origami. Proceedings of the VLDB Endowment 15:2 (259-271). Online publication date: 4-Feb-2022
https://dl.acm.org/doi/10.14778/3489496.3489507
Habich D, Pietrzyk J, Krause A, Hildebrandt J, Lehner W (2022) To use or not to use the SIMD gather instruction? Proceedings of the 18th International Workshop on Data Management on New Hardware (1-5). Online publication date: 12-Jun-2022
https://dl.acm.org/doi/10.1145/3533737.3535089
Lutz C, Breß S, Zeuch S, Rabl T, Markl V, Ives Z, Bonifati A, El Abbadi A (2022) Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast Interconnects. Proceedings of the 2022 International Conference on Management of Data (1017-1032). Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517911
