[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1807167.1807207acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Published: 06 June 2010 Publication History

Abstract

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 240 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.

References

[1]
CUDPP: CUDA Data Parallel Primitives Library. gpgpu.org/developer/cudpp/.
[2]
Intel Performance Primitives. http://software.intel.com/en-us/intel-ipp/.
[3]
V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Comput. Surv., 27(3):367--432, 1995.
[4]
K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968.
[5]
C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving string compression for column stores. In SIGMOD, pages 283--296, 2009.
[6]
G. E. Blelloch. Vector models for data-parallel computing. MIT Press, Cambridge, MA, USA, 1990.
[7]
P. Bohannon, P. Mcllroy, and R. Rastogi. Main-memory index structures with fixed-size partial keys. In SIGMOD, pages 163--174, 2001.
[8]
S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, et al. Rock: A High-Performance Sparc CMT Processor. IEEE Micro, 29(2):6--16, 2009.
[9]
J. Chhugani, A. D. Nguyen, V. W. Lee, et al. Efficient implementation of sorting on multi-core SIMD CPU architectures. VLDB, 1(2):1313--1324, 2008.
[10]
T. Cormen, C. Leiserson, and R. Rivest. Intro. to Algorithms. MIT Press, 1990.
[11]
R. S. Francis, I. D. Mathieson, and L. Pannan. A fast, simple algorithm to balance a parallel multiway merge. In Proceedings of PARLE, 1993.
[12]
N. Govindaraju, J. Gray, R. Kumar, et al. GPUTeraSort: High Performance Graphics Co-processor Sorting. In SIGMOD, pages 325--336, 2006.
[13]
H. Inoue, T. Moriyama, H. Komatsu, et al. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT, pages 189--198, 2007.
[14]
Intel Advanced Vector Extensions Programming Reference. 2008, http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVXProgramming-Reference-31943302.pdf.
[15]
D. Jiménex-González, J. J. Navarro, and J.-L. Larriba-Pey. CC-Radix:a Cache Conscious Sorting Based on Radix sort. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, 0:101, 2003.
[16]
C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, et al. Sort vs. hash revisited: Fast join implementation on multi-core cpus. PVLDB, 2(2):1378--1389, 2009.
[17]
A. Lamarca and R. E. Ladner. The Influence of Caches on the Performance of Sorting. In Journal of Algorithms, pages 370--379, 1997.
[18]
N. Leischner, V. Osipov, and P. Sanders. Gpu sample sort, 2009.
[19]
NVIDIA. Fermi Architecture White Paper, 2009.
[20]
NVIDIA. NVIDIA CUDA Programming Guide 2.3. 2009.
[21]
M. Reilly. When multicore isn't enough: Trends and the future for multi-multicore systems. In HPEC, 2008.
[22]
N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1--10, 2009.
[23]
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008.
[24]
S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007.
[25]
E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on GPGPU, 2007.
[26]
K. Thearling and S. Smith. An improved supercomputer sorting benchmark. In Proceedings of Supercomputing '92, pages 14--19, 1992.
[27]
M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proceedings of Supercomputing '91.

Cited By

View all
  • (2024)Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSsProceedings of the ACM on Management of Data10.1145/36392912:1(1-28)Online publication date: 26-Mar-2024
  • (2024)SIMDified Data Processing - Foundations, Abstraction, and Advanced TechniquesCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654694(613-621)Online publication date: 9-Jun-2024
  • (2024)Data-centric workloads with MPI_SortJournal of Parallel and Distributed Computing10.1016/j.jpdc.2023.104833(104833)Online publication date: Jan-2024
  • Show More Cited By

Index Terms

  1. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
    June 2010
    1286 pages
    ISBN:9781450300322
    DOI:10.1145/1807167
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 June 2010

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. buffer
    2. databases
    3. many-core
    4. merge
    5. merge network
    6. performance
    7. radix
    8. simd
    9. sorting
    10. tlp

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS '10
    Sponsor:
    SIGMOD/PODS '10: International Conference on Management of Data
    June 6 - 10, 2010
    Indiana, Indianapolis, USA

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)113
    • Downloads (Last 6 weeks)8
    Reflects downloads up to 11 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Zero-sided RDMA: Network-driven Data Shuffling for Disaggregated Heterogeneous Cloud DBMSsProceedings of the ACM on Management of Data10.1145/36392912:1(1-28)Online publication date: 26-Mar-2024
    • (2024)SIMDified Data Processing - Foundations, Abstraction, and Advanced TechniquesCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654694(613-621)Online publication date: 9-Jun-2024
    • (2024)Data-centric workloads with MPI_SortJournal of Parallel and Distributed Computing10.1016/j.jpdc.2023.104833(104833)Online publication date: Jan-2024
    • (2024)Fine-grained vectorized merge sorting on RISC-V: from register to cacheCCF Transactions on High Performance Computing10.1007/s42514-024-00201-2Online publication date: 18-Dec-2024
    • (2023)Micro Partitioning: Friendly to the Hardware and the DeveloperProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595310(27-34)Online publication date: 18-Jun-2023
    • (2023)Distributed GPU Joins on Fast RDMA-capable NetworksProceedings of the ACM on Management of Data10.1145/35887091:1(1-26)Online publication date: 30-May-2023
    • (2022)An Open-source FPGA Library for Data SortingJournal of Information Processing10.2197/ipsjjip.30.76630(766-777)Online publication date: 2022
    • (2022)OrigamiProceedings of the VLDB Endowment10.14778/3489496.348950715:2(259-271)Online publication date: 4-Feb-2022
    • (2022)To use or not to use the SIMD gather instruction?Proceedings of the 18th International Workshop on Data Management on New Hardware10.1145/3533737.3535089(1-5)Online publication date: 12-Jun-2022
    • (2022)Triton Join: Efficiently Scaling to a Large Join State on GPUs with Fast InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517911(1017-1032)Online publication date: 10-Jun-2022
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media