research-article

Distributed GPU Joins on Fast RDMA-capable Networks

Authors: Lasse Thostrup, Manisha Luthra, Carsten Binnig

Proceedings of the ACM on Management of Data, Volume 1, Issue 1

Article No.: 29, Pages 1 - 26

https://doi.org/10.1145/3588709

Published: 30 May 2023

Abstract

In this paper, we present a novel pipelined GPU join that accelerates distributed DBMSs by leveraging GPU resources on fast networks. The key insight is that overlapping the network shuffle with the build and probe phases enables pipelined join execution and thereby significantly reduces GPU idle time. To demonstrate this, we propose novel algorithms for distributed pipelined GPU joins with RDMA and GPUDirect that support arbitrarily large probe-side and build-side tables. In our evaluation, we show that our pipelined distributed GPU join can reduce the overall runtime of a full query by up to 6× compared to a state-of-the-art CPU-only join.
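The overlap described in the abstract can be illustrated with a small CPU-only sketch. It is not the authors' algorithm and uses neither GPUs, RDMA, nor GPUDirect: Python threads and bounded queues stand in for the network shuffle, and a dictionary stands in for the GPU hash table. The names `shuffle` and `pipelined_join` are hypothetical and chosen only for this illustration. The point it shows is that the join consumes each chunk as soon as it arrives instead of waiting for the whole table to be shuffled.

```python
import queue
import threading
from collections import defaultdict


def shuffle(chunks, out):
    """Stand-in for the network shuffle: delivers one chunk at a time."""
    for chunk in chunks:
        out.put(chunk)  # blocks while the bounded queue is full
    out.put(None)       # end-of-stream marker


def pipelined_join(build_chunks, probe_chunks):
    """Hash join that processes chunks while later chunks are still in flight."""
    # Small bounded queues model a limited number of receive buffers.
    build_q = queue.Queue(maxsize=2)
    probe_q = queue.Queue(maxsize=2)
    threading.Thread(target=shuffle, args=(build_chunks, build_q), daemon=True).start()
    threading.Thread(target=shuffle, args=(probe_chunks, probe_q), daemon=True).start()

    # Build phase: insert each chunk into the hash table as soon as it arrives.
    table = defaultdict(list)
    while (chunk := build_q.get()) is not None:
        for key, payload in chunk:
            table[key].append(payload)

    # Probe phase: probe each arriving chunk against the finished hash table.
    results = []
    while (chunk := probe_q.get()) is not None:
        for key, payload in chunk:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))
    return results


build = [[(1, "a"), (2, "b")], [(3, "c")]]
probe = [[(2, "x"), (4, "y")], [(1, "z"), (2, "w")]]
print(pipelined_join(build, probe))
```

Because the queues are bounded, the sender can only run a fixed number of chunks ahead of the join. That is one simple way to model a fixed pool of receive buffers; the paper's actual buffer management may differ.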

Supplemental Material

MP4 File

Video presentation for the paper Distributed GPU Joins on Fast RDMA-capable Networks


Published In

Proceedings of the ACM on Management of Data Volume 1, Issue 1

PACMMOD

May 2023

2807 pages

EISSN:2836-6573

DOI:10.1145/3603164

Editor:
Divyakant Agrawal
UC Santa Barbara, United States

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Deutsche Forschungsgemeinschaft
