DOI: 10.1145/3468791.3468840
research-article

In-Database Machine Learning with SQL on GPUs

Published: 11 August 2021

Abstract

In machine learning, continuously retraining a model guarantees accurate predictions based on the latest data as training input. But to retrieve the latest data from a database, time-consuming extraction is necessary as database systems have rarely been used for operations such as matrix algebra and gradient descent.
In this work, we demonstrate that SQL with recursive tables makes it possible to express a complete machine learning pipeline comprising data preprocessing, model training, and validation. To facilitate the specification of loss functions, we extend the code-generating database system Umbra with an operator for automatic differentiation for use within recursive tables: with the loss function expressed in SQL as a lambda function, Umbra generates machine code for each partial derivative. We further use automatic differentiation for a dedicated gradient descent operator, which generates LLVM code to train a user-specified model on GPUs. We fine-tune GPU kernels at the hardware level to achieve higher throughput and propose non-blocking synchronisation of multiple units.
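As an illustration of training inside a recursive table, the following sketch runs gradient descent for a linear model y = a*x + b in a recursive common table expression. It is a minimal sketch under stated assumptions, not the paper's implementation: it uses SQLite through Python's sqlite3 module rather than Umbra, the names data, gd, and train_linear_model are hypothetical, and the partial derivatives of the mean squared error are written out by hand instead of being generated by automatic differentiation.

```python
import sqlite3


def train_linear_model(points, learning_rate=0.05, iterations=1000):
    """Fit y = a*x + b by gradient descent expressed as a recursive CTE.

    Assumption: SQLite stands in for the database system; the gradients
    of the mean squared error are hand-derived, not auto-differentiated.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (x REAL, y REAL)")
    conn.executemany("INSERT INTO data VALUES (?, ?)", points)
    row = conn.execute(
        """
        WITH RECURSIVE gd (iter, a, b) AS (
            -- base case: initial weights
            SELECT 0, 0.0, 0.0
            UNION ALL
            -- recursive step: one gradient descent update per iteration;
            -- a and b on the right-hand side are the previous weights
            SELECT iter + 1,
                   a - ? * (SELECT avg(2 * (a * x + b - y) * x) FROM data),
                   b - ? * (SELECT avg(2 * (a * x + b - y)) FROM data)
            FROM gd
            WHERE iter < ?
        )
        SELECT a, b FROM gd ORDER BY iter DESC LIMIT 1
        """,
        (learning_rate, learning_rate, iterations),
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    # Synthetic points on the line y = 2x + 1 (illustrative data only).
    sample = [(float(x), 2.0 * x + 1.0) for x in range(5)]
    print(train_linear_model(sample))
```

Each row of gd holds the weights after one iteration, so the final query only has to pick the row with the highest iteration counter. For the synthetic points above, the returned weights should approach a = 2 and b = 1.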
In our evaluation, automatic differentiation accelerated the runtime by the number of cached subexpressions compared to compiling each derivative separately. Our GPU kernels with independent models allowed maximal throughput even for small batch sizes, making machine learning pipelines within SQL more competitive.
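Why shared subexpressions matter can be sketched with reverse-mode automatic differentiation on a small expression graph. This is a generic illustration, not Umbra's operator: Node, backward, and squared_error_gradients are hypothetical names, and values are interpreted in Python rather than compiled to machine code. The residual a*x + b - y is built once and reused by both partial derivatives instead of being re-derived for each one.

```python
class Node:
    """A value in an expression graph with links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input node, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def backward(output):
    """Propagate d(output)/d(node) to every node in one reverse pass."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:  # a shared subexpression is visited only once
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed post-order processes each node before any of its inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


def squared_error_gradients(a, b, x, y):
    """Return (loss, dloss/da, dloss/db) for loss = (a*x + b - y)^2."""
    a_n, b_n, x_n, y_n = Node(a), Node(b), Node(x), Node(y)
    residual = a_n * x_n + b_n - y_n  # computed once, used twice below
    loss = residual * residual
    backward(loss)
    return loss.value, a_n.grad, b_n.grad


if __name__ == "__main__":
    print(squared_error_gradients(3.0, 1.0, 2.0, 5.0))
```

A single backward pass fills in the gradients for all inputs, so the cost of the shared residual is paid once regardless of how many weights the model has.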


Cited By

  • (2024) MeMCISA: Memristor-Enabled Memory-Centric Instruction-Set Architecture for Database Workloads. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 1678-1692. DOI: 10.1109/MICRO61859.2024.00122. Online publication date: 2-Nov-2024
  • (2024) Give a JIT on GPUs: NVRTC for Code-Generating Database Systems. 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW), 384-387. DOI: 10.1109/ICDEW61823.2024.00061. Online publication date: 13-May-2024
  • (2024) Higher-Order SQL Lambda Functions. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 5622-5628. DOI: 10.1109/ICDE60146.2024.00450. Online publication date: 13-May-2024


Information & Contributors

Information

Published In

SSDBM '21: Proceedings of the 33rd International Conference on Scientific and Statistical Database Management
July 2021
275 pages
ISBN: 9781450384131
DOI: 10.1145/3468791
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Automatic Differentiation
  2. GPU
  3. In-Database Machine Learning

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SSDBM 2021

Acceptance Rates

Overall Acceptance Rate 56 of 146 submissions, 38%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 141
  • Downloads (Last 6 weeks): 17
Reflects downloads up to 05 Jan 2025


Citations

  • (2023) Pushing ML Predictions Into DBMSs. IEEE Transactions on Knowledge and Data Engineering 35:10, 10295-10308. DOI: 10.1109/TKDE.2023.3269592. Online publication date: 1-Oct-2023
  • (2023) Towards an Integrated Rough Set and Data Modelling Framework for Data Management and Knowledge Extraction. Artificial Intelligence and Smart Environment, 800-805. DOI: 10.1007/978-3-031-26254-8_116. Online publication date: 8-Mar-2023
  • (2022) Recursive SQL for Data Mining. 34th International Conference on Scientific and Statistical Database Management, 1-4. DOI: 10.1145/3538712.3538746. Online publication date: 23-Aug-2022
  • (2022) Recursive SQL and GPU-support for in-database machine learning. Distributed and Parallel Databases 40:2-3, 205-259. DOI: 10.1007/s10619-022-07417-7. Online publication date: 1-Sep-2022
