In-Database Machine Learning with SQL on GPUs

Authors: Maximilian Schüle, Maximilian Springer, Thomas Neumann, Stephan Günnemann

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management

Pages 25 - 36

https://doi.org/10.1145/3468791.3468840

Published: 11 August 2021

Abstract

In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent.

In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline comprising data preprocessing, model training, and validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra by an operator for automatic differentiation for use within recursive tables: with the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at hardware level to allow a higher throughput and propose non-blocking synchronisation of multiple units.
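To make the idea of training inside a recursive table concrete, the following is a minimal sketch, not the paper's implementation. It assumes SQLite (through Python's standard `sqlite3` module) in place of Umbra, a hypothetical table `data(x, y)`, a linear model y = a*x + b with mean squared error, and hand-written derivatives instead of the automatic differentiation operator described above. The names `train_linear_model`, `gd`, `a`, `b`, and `iter` are illustrative only.

```python
import sqlite3


def train_linear_model(points, learning_rate=0.1, iterations=500):
    """Fit y = a*x + b by gradient descent expressed as a recursive SQL table.

    Assumption: SQLite stands in for the database system; the derivatives of
    the mean squared error are written by hand in the query.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (x REAL, y REAL)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", points)

    query = """
    WITH RECURSIVE gd (iter, a, b) AS (
        -- initial weights
        SELECT 0, 0.0, 0.0
        UNION ALL
        -- one gradient descent step per recursion; both weights are
        -- updated from the values of the previous iteration
        SELECT iter + 1,
               a - :lr * (SELECT avg(2 * (a * x + b - y) * x) FROM data),
               b - :lr * (SELECT avg(2 * (a * x + b - y)) FROM data)
        FROM gd
        WHERE iter < :n
    )
    SELECT a, b FROM gd ORDER BY iter DESC LIMIT 1
    """
    a, b = conn.execute(query, {"lr": learning_rate, "n": iterations}).fetchone()
    conn.close()
    return a, b


if __name__ == "__main__":
    # Points taken from the line y = 2x + 1
    print(train_linear_model([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]))
```

The training data never leaves the database: each recursion step reads the current weights from the previous row of `gd` and aggregates the gradient over `data`. The learning rate and iteration count here are arbitrary choices for the toy data set.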

In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.
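The benefit of cached subexpressions can be illustrated with a small reverse-mode automatic differentiation sketch. This is an assumption-laden toy in interpreted Python, not Umbra's operator, which generates machine code: the class `Node` and the functions `backward` and `squared_error_gradients` are hypothetical names. The point it shows is that the shared residual a*x + b - y is evaluated once and reused for every partial derivative, rather than being recomputed per derivative.

```python
class Node:
    """One node of an expression graph; its value is computed once and reused."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate derivatives from the output to all inputs in one pass."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # inputs are appended before their consumers

    visit(output)
    output.grad = 1.0
    # Reversed post-order: every consumer is processed before its inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


def squared_error_gradients(a, b, x, y):
    """Return the loss (a*x + b - y)^2 and its partial derivatives w.r.t. a and b."""
    a_n, b_n, x_n, y_n = Node(a), Node(b), Node(x), Node(y)
    residual = a_n * x_n + b_n - y_n  # shared subexpression, evaluated once
    loss = residual * residual
    backward(loss)
    return loss.value, a_n.grad, b_n.grad


if __name__ == "__main__":
    print(squared_error_gradients(1.0, 0.0, 2.0, 5.0))
```

Deriving each partial derivative as a separate expression would evaluate the residual again for every weight; with a shared graph, the work for common subexpressions is done once, which is the effect the evaluation above attributes to caching.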

Published In

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management

July 2021

275 pages

ISBN: 9781450384131

DOI: 10.1145/3468791

Editors:
Qiang Zhu (University of Michigan - Dearborn, USA)
Xingquan (Hill) Zhu (Florida Atlantic University, USA)
Yicheng Tu (University of South Florida, USA)
Zichen (Frank) Xu (Nanchang University, China)
Anand Kumar (Amazon Inc., USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SSDBM 2021: 33rd International Conference on Scientific and Statistical Database Management

July 6 - 7, 2021

Tampa, FL, USA

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%

Article Metrics

Total Citations: 7
Total Downloads: 529
Downloads (Last 12 months): 141
Downloads (Last 6 weeks): 17

Reflects downloads up to 05 Jan 2025

Cited By

Zhu Y, Cai L, Yu L, Fan A, Yan L, Jing Z, Yan B, Tiw P, Li Y, Tao Y, Yang Y (2024). MeMCISA: Memristor-Enabled Memory-Centric Instruction-Set Architecture for Database Workloads. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1678-1692. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00122

Sachnov A, Merzljak L, Schüle M (2024). Give a JIT on GPUs: NVRTC for Code-Generating Database Systems. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 384-387. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDEW61823.2024.00061

Schüle M, Hornung J (2024). Higher-Order SQL Lambda Functions. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5622-5628. Online publication date: 13-May-2024.
https://doi.org/10.1109/ICDE60146.2024.00450

Paganelli M, Sottovia P, Park K, Interlandi M, Guerra F (2023). Pushing ML Predictions Into DBMSs. IEEE Transactions on Knowledge and Data Engineering 35:10, 10295-10308. Online publication date: 1-Oct-2023.
https://dl.acm.org/doi/10.1109/TKDE.2023.3269592

Chakhar S, Brahmia Z (2023). Towards an Integrated Rough Set and Data Modelling Framework for Data Management and Knowledge Extraction. Artificial Intelligence and Smart Environment, 800-805. Online publication date: 8-Mar-2023.
https://doi.org/10.1007/978-3-031-26254-8_116

Schüle M, Kemper A, Neumann T (2022). Recursive SQL for Data Mining. 34th International Conference on Scientific and Statistical Database Management, 1-4. Online publication date: 23-Aug-2022.
https://doi.org/10.1145/3538712.3538746

Schüle M, Lang H, Springer M, Kemper A, Neumann T, Günnemann S (2022). Recursive SQL and GPU-support for in-database machine learning. Distributed and Parallel Databases 40:2-3, 205-259. Online publication date: 1-Sep-2022.
https://dl.acm.org/doi/10.1007/s10619-022-07417-7
