[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2458523.2458524acmotherconferencesArticle/Chapter ViewAbstractPublication PagesgpgpuConference Proceedingsconference-collections
research-article

Comparison based sorting for systems with multiple GPUs

Published: 16 March 2013 Publication History

Abstract

As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.
In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.

References

[1]
P. Jetley, F. Gioachin, C. Mendes, L. Kale, and T. Quinn, "Massively parallel cosmological simulations with changa," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, april 2008, pp. 1--12.
[2]
N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09, 2009, pp. 18:1--18:11.
[3]
C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast bvh construction on gpus," Computer Graphics Forum, vol. 28, no. 2, pp. 375--384, 2009.
[4]
G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006.
[5]
R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using gpus," in Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, ser. UCHPC-MAW '09, 2009, pp. 1--6.
[6]
B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 260--269.
[7]
J. Stuart and J. Owens, "Multi-gpu mapreduce on gpu clusters," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, may 2011, pp. 1068--1079.
[8]
J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "Gpu computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879--899, may 2008.
[9]
E. Sintorn and U. Assarsson, "Fast parallel gpu-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1381--1388, 2008.
[10]
N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore gpus," in Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, may 2009, pp. 1--10.
[11]
X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core gpus," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.
[12]
N. Leischner, V. Osipov, and P. Sanders, "Gpu sample sort," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.
[13]
D. Cederman and P. Tsigas, "Gpu-quicksort: A practical quicksort algorithm for graphics processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4--4:1.24, Jan. 2010.
[14]
Y. Ye, Z. Du, and D. A. Bader, "Gpumemsort: A high performance graphic co-processors sorting algorithm for large scale in-memory data," 2010.
[15]
H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, "Parallel external sorting for cuda-enabled gpus with load balancing and low transfer overhead," Parallel and Distributed Processing Workshops and PhD Forum, 2011 IEEE International Symposium on, vol. 0, pp. 1--8, 2010.
[16]
NVIDIA, CUDA C Programming Guide, 2012.
[17]
C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10--16, January 1962.
[18]
D. Knuth, The Art of Computer Programming. Addison-Wesley, 1998, vol. 3 Sorting and Searching, noted by the author in p. 158.
[19]
P. Bakkum and K. Skadron, "Accelerating sql database operations on a gpu with cuda," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10, 2010, pp. 94--103.
[20]
"Amd fusion whitepaper." {Online}. Available: http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf
[21]
"Intel quickpath interconnect," 2012. {Online}. Available: http://www.intel.com/technology/quickpath
[22]
"Hypertransport interconnect," 2012. {Online}. Available: http://www.hypertransport.org
[23]
"Heterogeneous system architecture," 2012. {Online}. Available: www.hsafoundation.com
[24]
I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 347--358.
[25]
V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. mei Hwu, "Gpu clusters for high-performance computing," in Workshop on Parallel Programming on Accelerator Clusters. IEEE CLUSTER '09, 31 2009-sept. 4 2009, pp. 1--8.
[26]
NVIDIA, "Thrust parallel algorithms library," 2012. {Online}. Available: http://thrust.github.com
[27]
T. Hagerup and C. Rüb, "Optimal merging and sorting on the erew pram," Information Processing Letters, vol. 33, no. 4, pp. 181--185, 1989.
[28]
J. Singler, P. Sanders, and F. Putze, "Mcstl: The multi-core standard template library," in Euro-Par 2007 Parallel Processing, ser. Lecture Notes in Computer Science, A.-M. Kermarrec, L. Bougé, and T. Priol, Eds. Springer Berlin/Heidelberg, 2007, vol. 4641, pp. 682--694.
[29]
D. R. Musser, "Introspective sorting and selection algorithms," Software: Practice and Experience, vol. 27, no. 8, pp. 983--993, 1997.
[30]
K. E. Batcher, "Sorting networks and their applications," in Proceedings of the April 30--May 2, 1968, spring joint computer conference, ser. AFIPS '68 (Spring), 1968, pp. 307--314.
[31]
R. Cole, "Parallel merge sort," in Foundations of Computer Science, 1986., 27th Annual Symposium on, oct. 1986, pp. 511--516.
[32]
G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, vol. 31, pp. 135--167, 1998.
[33]
E. Solomonik and L. Kale, "Highly scalable parallel sorting," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--12.
[34]
N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, "Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort," in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD '10, 2010, pp. 351--362.
[35]
A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proceedings of Innovative Parallel Computing (InPar '12), May 2012.
[36]
O. Green, R. McColl, and D. A. Bader, "Gpu merge path: a gpu merging algorithm," in Proceedings of the 26th ACM international conference on Supercomputing, ser. ICS '12, 2012, pp. 331--340.
[37]
D. R. Helman, J. JáJá, and D. A. Bader, "A new deterministic parallel sorting algorithm with an experimental evaluation," J. Exp. Algorithmics, vol. 3, Sep. 1998.

Cited By

View all
  • (2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
  • (2020)3D-Sorter: 3D Design of a Resource-Aware Hardware Sorter for Edge Computing Platforms Under Area and Energy Consumption Constraints2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)10.1109/ISVLSI49217.2020.00018(42-47)Online publication date: Jul-2020
  • (2019)Suffix Array Construction on Multi-GPU SystemsProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing10.1145/3307681.3325961(183-194)Online publication date: 17-Jun-2019
  • Show More Cited By

Index Terms

  1. Comparison based sorting for systems with multiple GPUs

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units
    March 2013
    156 pages
    ISBN:9781450320177
    DOI:10.1145/2458523
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    In-Cooperation

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 March 2013

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. CUDA
    2. GPU
    3. parallel
    4. sorting

    Qualifiers

    • Research-article

    Funding Sources

    Conference

    GPGPU-6

    Acceptance Rates

    GPGPU-6 Paper Acceptance Rate 15 of 37 submissions, 41%;
    Overall Acceptance Rate 57 of 129 submissions, 44%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 15 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
    • (2020)3D-Sorter: 3D Design of a Resource-Aware Hardware Sorter for Edge Computing Platforms Under Area and Energy Consumption Constraints2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)10.1109/ISVLSI49217.2020.00018(42-47)Online publication date: Jul-2020
    • (2019)Suffix Array Construction on Multi-GPU SystemsProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing10.1145/3307681.3325961(183-194)Online publication date: 17-Jun-2019
    • (2017)A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUsProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3064043(417-432)Online publication date: 9-May-2017
    • (2017)A roadmap of parallel sorting algorithms using GPU computing2017 International Conference on Computing, Communication and Automation (ICCCA)10.1109/CCAA.2017.8229919(736-741)Online publication date: May-2017
    • (2017)Massive spatial query on the Kepler architecture2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)10.1109/ASAP.2017.7995267(111-118)Online publication date: Jul-2017
    • (2016)GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory CapacityIEEE Transactions on Big Data10.1109/TBDATA.2015.25110012:1(57-69)Online publication date: 1-Mar-2016
    • (2015)Automatic Parallelization of Kernels in Shared-Memory Multi-GPU NodesProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751218(3-13)Online publication date: 8-Jun-2015
    • (2015)GPU-SM: shared memory multi-GPU programmingProceedings of the 8th Workshop on General Purpose Processing using GPUs10.1145/2716282.2716286(13-24)Online publication date: 7-Feb-2015
    • (2015)Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator ApplicationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2014.231682526:5(1405-1418)Online publication date: 1-May-2015
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media