More Web Proxy on the site http://driver.im/

research-article

Data-parallel query processing on non-uniform data

Authors:

Jens TeubnerAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 13, Issue 6

Pages 884 - 897

https://doi.org/10.14778/3380750.3380758

Published: 01 February 2020 Publication History

Abstract

Graphics processing units (GPUs) promise spectacular performance advantages when used as database coprocessors. Their massive compute capacity, however, is often hampered by control flow divergence caused by non-uniform data distributions. When data-parallel work items demand for different amounts or types of processing, instructions execute with lowered efficiency. Query compilation techniques---a recent advance in GPU-accelerated database processing---suffer from the problem even more, because divergence effects are amplified during the execution of fused pipeline operators.

In this work, we identify two types of control flow divergence---filter divergence and expansion divergence---that frequently occur in real world workloads. We quantify the problem for two poster cases and propose techniques to balance these divergence effects. By balancing divergence effects, our approach is able to restore processing efficiency even when pipelines contain heavily skewed operations. Our query compiler DogQC has a wider range of functionality than other query coprocessors and achieves performance improvements. We observe shorter execution times for TPC-H benchmark queries by factors up to 4.51x compared with existing GPU query compilers and by factors up to 4.54x compared with CPU-based systems.

References

[1]

P. Bakkum and S. Chakradhar. Efficient data management for GPU databases. High Performance Computing on Graphics Processing Units, 2012.

[2]

M. Billeter, O. Olsson, and U. Assarsson. Efficient stream compaction on wide SIMD many-core architectures. In Proceedings of the conference on high performance graphics 2009, pages 159--166. ACM, 2009.

Digital Library

[3]

C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving string compression for main memory column stores. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 283--296. ACM, 2009.

Digital Library

[4]

P. Boncz, T. Neumann, and O. Erling. TPC-H analyzed: Hidden messages and lessons learned from an influential benchmark. In Technology Conference on Performance Evaluation and Benchmarking, pages 61--76. Springer, 2013.

Digital Library

[5]

P. A. Boncz, S. Manegold, M. L. Kersten, et al. Database architecture optimized for the new bottleneck: Memory access. In VLDB, pages 54--65, 1999.

Digital Library

[6]

S. Borkar and A. A. Chien. The Future of Microprocessors. Communications of the ACM, 54(5):67--77, May 2011.

Digital Library

[7]

S. Breß, H. Funke, and J. Teubner. Robust query processing in co-processor-accelerated databases. In Proceedings of the 2016 International Conference on Management of Data, pages 1891--1906. ACM, 2016.

Digital Library

[8]

S. Breß, B. Köcher, H. Funke, S. Zeuch, T. Rabl, and V. Markl. Generating custom code for efficient query execution on heterogeneous processors. VLDB Journal, 27(6):797--822, 2018.

Digital Library

[9]

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer networks, 33(1-6):309--320, 2000.

Digital Library

[10]

G. Chen and X. Shen. Free launch: optimizing GPU dynamic kernel launches through thread reuse. In Proceedings of the 48th International Symposium on Microarchitecture, pages 407--419. ACM, 2015.

Digital Library

[11]

A. Davidson, S. Baxter, M. Garland, and J. D. Owens. Work-efficient parallel GPU methods for single-source shortest paths. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pages 349--359. IEEE, 2014.

Digital Library

[12]

H. Funke, S. Breß, S. Noll, V. Markl, and J. Teubner. Pipelined Query Processing in Coprocessor Environments. In Proceedings of the 2018 International Conference on Management of Data, pages 1603--1618. ACM, 2018.

Digital Library

[13]

M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. PVLDB, 6(9):709--720, 2013.

Digital Library

[14]

Intel Corporation. Accelerating x265 with Intel Advanced Vector Extensions 512 (Intel AVX-512). 2018.

[15]

B. Jang, D. Schaa, P. Mistry, and D. Kaeli. Exploiting memory access patterns to improve memory performance in data-parallel architectures. IEEE Transactions on Parallel and Distributed Systems, 22(1):105--118, 2010.

Digital Library

[16]

Z. Jia, M. Maggioni, J. Smith, and D. P. Scarpazza. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arXiv preprint arXiv:1903.07486, 2019.

[17]

T. Kaldewey, G. Lohman, R. Mueller, and P. Volk. GPU join processing revisited. In Proceedings of the Eighth International Workshop on Data Management on New Hardware, pages 55--62. ACM, 2012.

Digital Library

[18]

A. Kipf, H. Lang, V. Pandey, R. A. Persa, P. Boncz, T. Neumann, and A. Kemper. Approximate geospatial joins with precision guarantees. In 2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1360--1363. IEEE, 2018.

[19]

H. Lang, A. Kipf, L. Passing, P. Boncz, T. Neumann, and A. Kemper. Make the most out of your SIMD investments: counter control flow divergence in compiled query pipelines. In Proceedings of the 14th International Workshop on Data Management on New Hardware, page 5. ACM, 2018.

[20]

V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 743--754. ACM, 2014.

Digital Library

[21]

H. Liu and H. H. Huang. Enterprise: breadth-first graph traversal on GPUs. In SC'15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--12. IEEE, 2015.

Digital Library

[22]

L. Liu, Y. Zhang, M. Liu, C. Wang, and J. Wang. A-MapCG: an adaptive MapReduce framework for GPUs. In 2017 International Conference on Networking, Architecture, and Storage (NAS), pages 1--8. IEEE, 2017.

[23]

X. Mei and X. Chu. Dissecting GPU memory hierarchy through microbenchmarking. IEEE Transactions on Parallel and Distributed Systems, 28(1):72--86, 2016.

Digital Library

[24]

I. Müller, C. Ratsch, F. Faerber, et al. Adaptive String Dictionary Compression in In-Memory Column-Store Database Systems. In EDBT, pages 283--294, 2014.

[25]

T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011.

Digital Library

[26]

NVidia Corporation. NVidia Kepler GPU Architecture. 2012.

[27]

NVidia Corporation. NVidia Turing GPU Architecture. 2018.

[28]

OmniSci Incorporated. OmniSciDB. https://www.omnisci.com/, 2019.

[29]

J. Paul, J. He, and B. He. GPL: A GPU-based pipelined query processing engine. In Proceedings of the 2016 International Conference on Management of Data, pages 1935--1950. ACM, 2016.

Digital Library

[30]

H. Pirk, S. Manegold, M. L. Kersten, et al. Accelerating Foreign-Key Joins using Asymmetric Memory Channels. In ADMS@VLDB, pages 27--35, 2011.

[31]

O. Polychroniou and K. A. Ross. Vectorized Bloom filters for advanced SIMD processors. In Proceedings of the Tenth International Workshop on Data Management on New Hardware, page 6. ACM, 2014.

[32]

R. Rui and Y.-C. Tu. Fast Equi-Join Algorithms on GPUs: Design and Implementation. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, page 17. ACM, 2017.

[33]

M. Sha, Y. Li, and K.-L. Tan. GPU-based Graph Traversal on Compressed Graphs. In Proceedings of the 2019 International Conference on Management of Data, pages 775--792. ACM, 2019.

[34]

J. Sompolski, M. Zukowski, and P. Boncz. Vectorization vs. compilation in query execution. In Proceedings of the Seventh International Workshop on Data Management on New Hardware, pages 33--40. ACM, 2011.

Digital Library

[35]

G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis. Regular expression matching on graphics hardware for intrusion detection. In International Workshop on Recent Advances in Intrusion Detection, pages 265--283. Springer, 2009.

Digital Library

[36]

J. Wang and S. Yalamanchili. Characterization and analysis of dynamic parallelism in unstructured GPU applications. In 2014 IEEE International Symposium on Workload Characterization (IISWC), pages 51--60. IEEE, 2014.

[37]

H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel weaver: Automatically fusing database primitives for efficient GPU computation. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 107--118. IEEE, 2012.

Digital Library

[38]

H. Wu, G. Diamos, T. Sheard, M. Aref, S. Baxter, M. Garland, and S. Yalamanchili. Red fox: An execution environment for relational query processing on gpus. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, page 44. ACM, 2014.

[39]

Y. Yuan, R. Lee, and X. Zhang. The Yin and Yang of processing data warehousing queries on GPU devices. PVLDB, 6(10):817--828, 2013.

Digital Library

[40]

Y. Zu, M. Yang, Z. Xu, L. Wang, X. Tian, K. Peng, and Q. Dong. GPU-based NFA implementation for memory efficient high speed regular expression matching. In ACM SIGPLAN Notices, volume 47, pages 129--140. ACM, 2012.

Cited By

Deng YYan MTang B(2024)Accelerating Merkle Patricia Trie with GPUProceedings of the VLDB Endowment10.14778/3659437.365944317:8(1856-1869)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.14778/3659437.3659443
Yuan JGong CDeng YChen STang B(2024)Free Your Hash Join From Slow Synchronizations: A Fast Hash Join Implementation on GPUProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674437(106-107)Online publication date: 5-Jul-2024
https://dl.acm.org/doi/10.1145/3674399.3674437
Deng YChen SHong ZTang B(2024)How Does Software Prefetching Work on GPU Query Processing?Proceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663445(1-9)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663445
Show More Cited By

Data-parallel query processing on non-uniform data
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

A portable benchmark suite for highly parallel data intensive query processing
PPAA 2015: Proceedings of the 2nd Workshop on Parallel Programming for Analytics Applications

Traditionally, data warehousing workloads have been processed using CPU-focused clusters, such as those that make up the bulk of available machines in Amazon's EC2, and the focus on improving analytics performance has been to utilize a homogenous, ...
Parallel spatial query processing on GPUs using R-trees
BigSpatial '13: Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data

R-Trees are popular spatial indexing techniques that have been widely adopted in many geospatial applications. As commodity GPUs (Graphics Processing Units) are increasingly becoming available on personal workstations and cluster computers, there are ...
Efficient Query Processing on Many-core Architectures: A Case Study with Intel Xeon Phi Processor
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Recently, Intel Xeon Phi is emerging as a many-core processor with up to 61 x86 cores. In this demonstration, we present PhiDB, an OLAP query processor with simultaneous multi-threading (SMT) capabilities on Xeon Phi as a case study for parallel ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 13, Issue 6

February 2020

170 pages

ISSN:2150-8097

Editors:
Magdalena Balazinska
University of Washington
,
Xiaofang Zhou
University of Queensland, Australia

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 February 2020

Published in PVLDB Volume 13, Issue 6

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
491
Total Downloads

Downloads (Last 12 months)71
Downloads (Last 6 weeks)3

Reflects downloads up to 12 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Deng YYan MTang B(2024)Accelerating Merkle Patricia Trie with GPUProceedings of the VLDB Endowment10.14778/3659437.365944317:8(1856-1869)Online publication date: 1-Apr-2024
https://dl.acm.org/doi/10.14778/3659437.3659443
Yuan JGong CDeng YChen STang B(2024)Free Your Hash Join From Slow Synchronizations: A Fast Hash Join Implementation on GPUProceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674437(106-107)Online publication date: 5-Jul-2024
https://dl.acm.org/doi/10.1145/3674399.3674437
Deng YChen SHong ZTang B(2024)How Does Software Prefetching Work on GPU Query Processing?Proceedings of the 20th International Workshop on Data Management on New Hardware10.1145/3662010.3663445(1-9)Online publication date: 10-Jun-2024
https://dl.acm.org/doi/10.1145/3662010.3663445
Grulich PLepping ANugroho DPandey VDel Monte BZeuch SMarkl V(2024)Query Compilation Without RegretsProceedings of the ACM on Management of Data10.1145/36549682:3(1-28)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654968
Yang YZhang GWu YZhao ZFu Y(2024)Split-bucket partition (SBP): a novel execution model for top-K and selection algorithms on GPUsThe Journal of Supercomputing10.1007/s11227-024-06031-x80:11(15122-15160)Online publication date: 1-Jul-2024
https://dl.acm.org/doi/10.1007/s11227-024-06031-x
Doraiswamy HKalagi VRamachandra KHaritsa J(2023)A Case for Graphics-Driven Query ProcessingProceedings of the VLDB Endowment10.14778/3603581.360359016:10(2499-2511)Online publication date: 1-Jun-2023
https://dl.acm.org/doi/10.14778/3603581.3603590
Yogatama BMiller BWang YMarkall GHemstad JKimball GYu X(2023)Accelerating User-Defined Aggregate Functions (UDAF) with Block-wide Execution and JIT Compilation on GPUsProceedings of the 19th International Workshop on Data Management on New Hardware10.1145/3592980.3595307(19-26)Online publication date: 18-Jun-2023
https://dl.acm.org/doi/10.1145/3592980.3595307
Yang ZPan QXu C(2023)Fine-Grained Tuple Transfer for Pipelined Query Execution on CPU-GPU CoprocessorDatabase Systems for Advanced Applications10.1007/978-3-031-30637-2_2(19-34)Online publication date: 17-Apr-2023
https://dl.acm.org/doi/10.1007/978-3-031-30637-2_2
Liu HTang BZhang JDeng YYan XZheng XShen QZeng DMao ZZhang CYou ZWang ZJiang RWang FYiu MLi HHan MLi QLuo ZGavrilovska AAltınbüken DBinnig C(2022)GHiveProceedings of the 13th Symposium on Cloud Computing10.1145/3542929.3563503(158-172)Online publication date: 7-Nov-2022
https://dl.acm.org/doi/10.1145/3542929.3563503
Boeschen NBinnig CIves ZBonifati AEl Abbadi A(2022)GaccO - A GPU-accelerated OLTP DBMSProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3517876(1003-1016)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3517876
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents