More Web Proxy on the site http://driver.im/

research-article

Comparison based sorting for systems with multiple GPUs

Authors:

Lluís Vilanova,

Javier Cabezas,

Wen-mei HwuAuthors Info & Claims

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

Pages 1 - 11

https://doi.org/10.1145/2458523.2458524

Published: 16 March 2013 Publication History

Abstract

As a basic building block of many applications, sorting algorithms that efficiently run on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose compuing, researches have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.

In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speed up of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.

References

[1]

P. Jetley, F. Gioachin, C. Mendes, L. Kale, and T. Quinn, "Massively parallel cosmological simulations with changa," in Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, april 2008, pp. 1--12.

[2]

N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09, 2009, pp. 18:1--18:11.

Digital Library

[3]

C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha, "Fast bvh construction on gpus," Computer Graphics Forum, vol. 28, no. 2, pp. 375--384, 2009.

[4]

G. Graefe, "Implementing sorting in database systems," ACM Comput. Surv., vol. 38, no. 3, Sep. 2006.

Digital Library

[5]

R. Wu, B. Zhang, and M. Hsu, "Clustering billions of data points using gpus," in Proceedings of the combined workshops on UnConventional high performance computing workshop plus memory access workshop, ser. UCHPC-MAW '09, 2009, pp. 1--6.

Digital Library

[6]

B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: a mapreduce framework on graphics processors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, ser. PACT '08, 2008, pp. 260--269.

Digital Library

[7]

J. Stuart and J. Owens, "Multi-gpu mapreduce on gpu clusters," in Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, may 2011, pp. 1068--1079.

Digital Library

[8]

J. Owens, M. Houston, D. Luebke, S. Green, J. Stone, and J. Phillips, "Gpu computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 879--899, may 2008.

[9]

E. Sintorn and U. Assarsson, "Fast parallel gpu-sorting using a hybrid algorithm," Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1381--1388, 2008.

Digital Library

[10]

N. Satish, M. Harris, and M. Garland, "Designing efficient sorting algorithms for manycore gpus," in Parallel Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, may 2009, pp. 1--10.

Digital Library

[11]

X. Ye, D. Fan, W. Lin, N. Yuan, and P. Ienne, "High performance comparison-based sorting algorithm on many-core gpus," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.

[12]

N. Leischner, V. Osipov, and P. Sanders, "Gpu sample sort," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--10.

[13]

D. Cederman and P. Tsigas, "Gpu-quicksort: A practical quicksort algorithm for graphics processors," J. Exp. Algorithmics, vol. 14, pp. 4:1.4--4:1.24, Jan. 2010.

Digital Library

[14]

Y. Ye, Z. Du, and D. A. Bader, "Gpumemsort: A high performance graphic co-processors sorting algorithm for large scale in-memory data," 2010.

[15]

H. Peters, O. Schulz-Hildebrandt, and N. Luttenberger, "Parallel external sorting for cuda-enabled gpus with load balancing and low transfer overhead," Parallel and Distributed Processing Workshops and PhD Forum, 2011 IEEE International Symposium on, vol. 0, pp. 1--8, 2010.

[16]

NVIDIA, CUDA C Programming Guide, 2012.

[17]

C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10--16, January 1962.

[18]

D. Knuth, The Art of Computer Programming. Addison-Wesley, 1998, vol. 3 Sorting and Searching, noted by the author in p. 158.

Digital Library

[19]

P. Bakkum and K. Skadron, "Accelerating sql database operations on a gpu with cuda," in Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, ser. GPGPU '10, 2010, pp. 94--103.

Digital Library

[20]

"Amd fusion whitepaper." {Online}. Available: http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf

[21]

"Intel quickpath interconnect," 2012. {Online}. Available: http://www.intel.com/technology/quickpath

[22]

"Hypertransport interconnect," 2012. {Online}. Available: http://www.hypertransport.org

[23]

"Heterogeneous system architecture," 2012. {Online}. Available: www.hsafoundation.com

[24]

I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ser. ASPLOS '10, 2010, pp. 347--358.

Digital Library

[25]

V. Kindratenko, J. Enos, G. Shi, M. Showerman, G. Arnold, J. Stone, J. Phillips, and W. mei Hwu, "Gpu clusters for high-performance computing," in Workshop on Parallel Programming on Accelerator Clusters. IEEE CLUSTER '09, 31 2009-sept. 4 2009, pp. 1--8.

[26]

NVIDIA, "Thrust parallel algorithms library," 2012. {Online}. Available: http://thrust.github.com

[27]

T. Hagerup and C. Rüb, "Optimal merging and sorting on the erew pram," Information Processing Letters, vol. 33, no. 4, pp. 181--185, 1989.

Digital Library

[28]

J. Singler, P. Sanders, and F. Putze, "Mcstl: The multi-core standard template library," in Euro-Par 2007 Parallel Processing, ser. Lecture Notes in Computer Science, A.-M. Kermarrec, L. Bougé, and T. Priol, Eds. Springer Berlin/Heidelberg, 2007, vol. 4641, pp. 682--694.

Digital Library

[29]

D. R. Musser, "Introspective sorting and selection algorithms," Software: Practice and Experience, vol. 27, no. 8, pp. 983--993, 1997.

Digital Library

[30]

K. E. Batcher, "Sorting networks and their applications," in Proceedings of the April 30--May 2, 1968, spring joint computer conference, ser. AFIPS '68 (Spring), 1968, pp. 307--314.

Digital Library

[31]

R. Cole, "Parallel merge sort," in Foundations of Computer Science, 1986., 27th Annual Symposium on, oct. 1986, pp. 511--516.

Digital Library

[32]

G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, "An experimental analysis of parallel sorting algorithms," Theory of Computing Systems, vol. 31, pp. 135--167, 1998.

[33]

E. Solomonik and L. Kale, "Highly scalable parallel sorting," in Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, april 2010, pp. 1--12.

[34]

N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey, "Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort," in Proceedings of the 2010 international conference on Management of data, ser. SIGMOD '10, 2010, pp. 351--362.

Digital Library

[35]

A. Davidson, D. Tarjan, M. Garland, and J. D. Owens, "Efficient parallel merge sort for fixed and variable length keys," in Proceedings of Innovative Parallel Computing (InPar '12), May 2012.

[36]

O. Green, R. McColl, and D. A. Bader, "Gpu merge path: a gpu merging algorithm," in Proceedings of the 26th ACM international conference on Supercomputing, ser. ICS '12, 2012, pp. 331--340.

Digital Library

[37]

D. R. Helman, J. JáJá, and D. A. Bader, "A new deterministic parallel sorting algorithm with an experimental evaluation," J. Exp. Algorithmics, vol. 3, Sep. 1998.

Digital Library

Cited By

Maltenberger TIlic ITolovski IRabl TIves ZBonifati AEl Abbadi A(2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517842
Norollah AKazemi ZHely D(2020)3D-Sorter: 3D Design of a Resource-Aware Hardware Sorter for Edge Computing Platforms Under Area and Energy Consumption Constraints2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)10.1109/ISVLSI49217.2020.00018(42-47)Online publication date: Jul-2020
https://doi.org/10.1109/ISVLSI49217.2020.00018
Büren FJünger DKobus RHundt CSchmidt BWeissman JButt ASmirni E(2019)Suffix Array Construction on Multi-GPU SystemsProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing10.1145/3307681.3325961(183-194)Online publication date: 17-Jun-2019
https://dl.acm.org/doi/10.1145/3307681.3325961
Show More Cited By

Index Terms

Comparison based sorting for systems with multiple GPUs
1. Computer systems organization
  1. Architectures
    1. Other architectures

Recommendations

Analysis-driven Engineering of Comparison-based Sorting Algorithms on GPUs
ICS '18: Proceedings of the 2018 International Conference on Supercomputing

We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally ...
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09: Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

State of the art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is ...
Electronic poster: a massively parallel lattice Monte Carlo algorithm in CUDA for thermal conduction simulations
SC '11 Companion: Proceedings of the 2011 companion on High Performance Computing Networking, Storage and Analysis Companion

We present a highly parallel CUDA kernel based on the Lattice Monte Carlo (LMC) method for transient thermal conduction, which achieves a peak acceleration of more than 100x over a single-threaded Fortran version. A number of memory and branching ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences

GPGPU-6: Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units

March 2013

156 pages

ISBN:9781450320177

DOI:10.1145/2458523

Editors:
John Cavazos
University of Delaware
,
Xiang Gong,
David Kaeli

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 March 2013

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

GPGPU-6

GPGPU-6: Sixth Workshop on General Purpose Processing Using GPUs

March 16, 2013

Texas, Houston, USA

Acceptance Rates

GPGPU-6 Paper Acceptance Rate 15 of 37 submissions, 41%;

Overall Acceptance Rate 57 of 129 submissions, 44%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
512
Total Downloads

Downloads (Last 12 months)11
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Maltenberger TIlic ITolovski IRabl TIves ZBonifati AEl Abbadi A(2022)Evaluating Multi-GPU Sorting with Modern InterconnectsProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517842(1795-1809)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517842
Norollah AKazemi ZHely D(2020)3D-Sorter: 3D Design of a Resource-Aware Hardware Sorter for Edge Computing Platforms Under Area and Energy Consumption Constraints2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)10.1109/ISVLSI49217.2020.00018(42-47)Online publication date: Jul-2020
https://doi.org/10.1109/ISVLSI49217.2020.00018
Büren FJünger DKobus RHundt CSchmidt BWeissman JButt ASmirni E(2019)Suffix Array Construction on Multi-GPU SystemsProceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing10.1145/3307681.3325961(183-194)Online publication date: 17-Jun-2019
https://dl.acm.org/doi/10.1145/3307681.3325961
Stehle EJacobsen HChirkova RYang JSuciu D(2017)A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUsProceedings of the 2017 ACM International Conference on Management of Data10.1145/3035918.3064043(417-432)Online publication date: 9-May-2017
https://dl.acm.org/doi/10.1145/3035918.3064043
Faujdar NSaraswat S(2017)A roadmap of parallel sorting algorithms using GPU computing2017 International Conference on Computing, Communication and Automation (ICCCA)10.1109/CCAA.2017.8229919(736-741)Online publication date: May-2017
https://doi.org/10.1109/CCAA.2017.8229919
Gong YTang JLi WYe Z(2017)Massive spatial query on the Kepler architecture2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)10.1109/ASAP.2017.7995267(111-118)Online publication date: Jul-2017
https://doi.org/10.1109/ASAP.2017.7995267
Shamoto HShirahata KDrozd ASato HMatsuoka S(2016)GPU-Accelerated Large-Scale Distributed Sorting Coping with Device Memory CapacityIEEE Transactions on Big Data10.1109/TBDATA.2015.25110012:1(57-69)Online publication date: 1-Mar-2016
https://doi.org/10.1109/TBDATA.2015.2511001
Cabezas JVilanova LGelado IJablin TNavarro NHwu WBhuyan LChong FSarkar V(2015)Automatic Parallelization of Kernels in Shared-Memory Multi-GPU NodesProceedings of the 29th ACM on International Conference on Supercomputing10.1145/2751205.2751218(3-13)Online publication date: 8-Jun-2015
https://dl.acm.org/doi/10.1145/2751205.2751218
Cabezas JJordà MGelado INavarro NHwu WKaeli DCavazos J(2015)GPU-SM: shared memory multi-GPU programmingProceedings of the 8th Workshop on General Purpose Processing using GPUs10.1145/2716282.2716286(13-24)Online publication date: 7-Feb-2015
https://dl.acm.org/doi/10.1145/2716282.2716286
Cabezas JGelado IStone JNavarro NKirk DHwu W(2015)Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator ApplicationsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2014.231682526:5(1405-1418)Online publication date: 1-May-2015
https://doi.org/10.1109/TPDS.2014.2316825
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents