DOI: 10.1145/3514221.3517842
research-article
Open access

Evaluating Multi-GPU Sorting with Modern Interconnects

Published: 11 June 2022

Abstract

GPUs have become a mainstream accelerator for database operations such as sorting. Most GPU sorting algorithms are single-GPU approaches. They neither harness the full computational power nor exploit the high-bandwidth peer-to-peer (P2P) interconnects of modern multi-GPU platforms. The latest NVLink 2.0 and NVLink 3.0-based NVSwitch interconnects promise unparalleled multi-GPU acceleration. So far, multi-GPU sorting has only been evaluated on systems with PCIe 3.0. In this paper, we analyze serial, parallel, and bidirectional data transfer rates to, from, and between multiple GPUs on systems with PCIe 3.0/4.0, NVLink 2.0/3.0, and NVSwitch. We measure up to 35x higher parallel P2P throughput with NVLink 3.0-based NVSwitch than with PCIe 3.0. To study GPU-accelerated sorting on today's hardware, we implement a P2P-based GPU-only (P2P sort) and a heterogeneous (HET sort) multi-GPU sorting algorithm and evaluate them on three modern platforms. We observe speedups over a state-of-the-art parallel CPU radix sort of up to 14x for P2P sort and 9x for HET sort. On systems with fast P2P interconnects, P2P sort outperforms HET sort by up to 1.65x. Finally, we show that overlapping GPU copy and compute operations does not mitigate the transfer bottleneck when sorting large out-of-core data.
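
The abstract does not spell out how the two algorithms work internally. As a rough illustration of the general pattern that a heterogeneous sort usually follows (sort fixed-size chunks on an accelerator, then merge the sorted runs on the CPU), here is a minimal Python sketch. It is not the authors' implementation: `sort_chunk` and `chunked_sort_then_merge` are hypothetical names, the built-in `sorted` only stands in for a single-GPU sort, and `heapq.merge` stands in for a CPU multiway merge.

```python
import heapq
from typing import List, Sequence


def sort_chunk(chunk: Sequence[int]) -> List[int]:
    # Assumption: stand-in for a single-GPU sort of one chunk.
    # A real system would copy the chunk to device memory, sort it
    # there, and copy the sorted run back to the host.
    return sorted(chunk)


def chunked_sort_then_merge(data: Sequence[int], chunk_size: int) -> List[int]:
    # Split the input into chunks of at most chunk_size elements,
    # sort each chunk independently, then k-way merge the sorted runs.
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    runs = [
        sort_chunk(data[start:start + chunk_size])
        for start in range(0, len(data), chunk_size)
    ]
    # heapq.merge lazily merges any number of already sorted inputs.
    return list(heapq.merge(*runs))


if __name__ == "__main__":
    print(chunked_sort_then_merge([5, 3, 1, 4, 2], chunk_size=2))
```

In this toy version the chunk size plays the role of per-GPU memory capacity: data larger than one chunk must cross the interconnect in pieces, which is where the transfer rates analyzed in the paper come into play.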


    Published In

    SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
    June 2022
    2597 pages
    ISBN: 9781450392495
    DOI: 10.1145/3514221
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. database acceleration
    2. high-speed interconnects
    3. multi-GPU sorting

    Qualifiers

    • Research-article

    Funding Sources

    • European Union Horizon 2020
    • German Research Foundation
    • German Ministry for Education and Research

    Conference

    SIGMOD/PODS '22
    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Article Metrics

    • Downloads (Last 12 months): 693
    • Downloads (Last 6 weeks): 111
    Reflects downloads up to 10 Dec 2024

    Cited By

    • (2024) BOSS - An Architecture for Database Kernel Composition. Proceedings of the VLDB Endowment 17:4 (877-890). DOI: 10.14778/3636218.3636239. Online publication date: 5-Mar-2024
    • (2024) How Does Software Prefetching Work on GPU Query Processing? Proceedings of the 20th International Workshop on Data Management on New Hardware (1-9). DOI: 10.1145/3662010.3663445. Online publication date: 9-Jun-2024
    • (2024) ArcaDB: A Disaggregated Query Engine for Heterogenous Computational Environments. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD) (42-53). DOI: 10.1109/CLOUD62652.2024.00015. Online publication date: 7-Jul-2024
    • (2023) GPU Database Systems Characterization and Optimization. Proceedings of the VLDB Endowment 17:3 (441-454). DOI: 10.14778/3632093.3632107. Online publication date: 1-Nov-2023
    • (2023) Analyzing Vectorized Hash Tables across CPU Architectures. Proceedings of the VLDB Endowment 16:11 (2755-2768). DOI: 10.14778/3611479.3611485. Online publication date: 1-Jul-2023
    • (2023) Optimizing Search and Sort Algorithms: Harnessing Parallel Programming for Efficient Processing of Large Datasets. 2023 2nd International Conference on Automation, Computing and Renewable Systems (ICACRS) (1439-1449). DOI: 10.1109/ICACRS58579.2023.10405268. Online publication date: 11-Dec-2023
    • (2023) A fast and accurate coupled meshless algorithm for the 2D/3D Gross–Pitaevskii equations on two GPUs. Computing 105:12 (2595-2620). DOI: 10.1007/s00607-023-01197-3. Online publication date: 11-Jul-2023
    • (2023) Single- and multi-GPU computing on NVIDIA- and AMD-based server platforms for solidification modeling application. Concurrency and Computation: Practice and Experience 36:9. DOI: 10.1002/cpe.8000. Online publication date: 27-Dec-2023
