research-article

Distributed GPU Joins on Fast RDMA-capable Networks

Published: 30 May 2023

Abstract

In this paper, we present a novel pipelined GPU join that accelerates distributed DBMSs by leveraging GPU resources on fast networks. Our key insight is that overlapping the network shuffle with the build and probe phases enables pipelined join execution and thereby significantly reduces GPU idle time. To demonstrate this, we propose novel algorithms for distributed pipelined GPU joins with RDMA and GPUDirect that support arbitrarily large probe-side and build-side tables. In our evaluation, we show that our pipelined distributed GPU join can reduce the overall runtime of a full query by up to 6× compared to a state-of-the-art CPU-only join.
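The overlap described in the abstract can be illustrated with a small CPU-only sketch. This is not the paper's algorithm: it uses no GPU, RDMA, or GPUDirect, and the names `shuffle`, `stream`, and `pipelined_hash_join` are hypothetical. It assumes a background thread stands in for the network shuffle and a bounded queue stands in for a fixed set of receive buffers, so each chunk is built into or probed against the hash table as soon as it arrives rather than after the whole table has been transferred.

```python
import queue
import threading
from collections import defaultdict

_DONE = object()  # sentinel marking the end of a shuffled stream


def shuffle(chunks, out_queue):
    """Stand-in for the network shuffle (assumption): delivers chunks one by one."""
    for chunk in chunks:
        out_queue.put(chunk)  # blocks while all "receive buffers" are full
    out_queue.put(_DONE)


def stream(chunks, depth=2):
    """Yield chunks while a background thread keeps delivering the next ones."""
    q = queue.Queue(maxsize=depth)  # bounded buffer between transfer and compute
    t = threading.Thread(target=shuffle, args=(chunks, q), daemon=True)
    t.start()
    while True:
        chunk = q.get()
        if chunk is _DONE:
            break
        yield chunk
    t.join()


def pipelined_hash_join(build_chunks, probe_chunks):
    """Join (key, payload) tuples, processing each chunk as soon as it arrives."""
    table = defaultdict(list)
    # Build phase: insert each arriving chunk while later chunks are in flight.
    for chunk in stream(build_chunks):
        for key, payload in chunk:
            table[key].append(payload)
    results = []
    # Probe phase: probe each arriving chunk against the finished hash table.
    for chunk in stream(probe_chunks):
        for key, payload in chunk:
            for build_payload in table.get(key, ()):
                results.append((key, build_payload, payload))
    return results
```

Calling `pipelined_hash_join` with two lists of chunks, where each chunk is a list of `(key, payload)` tuples, returns the matching `(key, build_payload, probe_payload)` triples. The bounded queue is the design point of the sketch: transfer and compute run concurrently, but the producer can never run more than `depth` chunks ahead of the consumer.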

Published In

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. RDMA
  3. distributed joins
  4. networks

Qualifiers

  • Research-article
