[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

YeSQL: "you extend SQL" with rich and highly performant user-defined functions in relational databases

Published: 01 June 2022 Publication History

Abstract

The diversity and complexity of modern data management applications have led to the extension of the relational paradigm with syntactic and semantic support for User-Defined Functions (UDFs). Although well-established in traditional DBMS settings, UDFs have become central in many application contexts as well, such as data science, data analytics, and edge computing. Still, a critical limitation of UDFs is the impedance mismatch between their evaluation and relational processing. In this paper, we present YeSQL, an SQL extension with rich UDF support along with a pluggable architecture to easily integrate it with either server-based or embedded database engines. YeSQL currently supports Python UDFs fully integrated with relational queries as scalar, aggregator, or table functions. Key novel characteristics of YeSQL include easy implementation of complex algorithms and several performance enhancements, including tracing JIT compilation of Python UDFs, parallelism and fusion of UDFs, stateful UDFs, and seamless integration with a database engine. Our experimental analysis showcases the usability and expressiveness of YeSQL and demonstrates that our techniques of minimizing context switching between the relational engine and the Python VM are very effective and achieve significant speedups up to 68x in common, practical use cases compared to earlier approaches and alternative implementation choices.

References

[1]
Apache Arrow. 2022. Available at: https://arrow.apache.org/.
[2]
Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcín, Dag Sverre Seljebotn, and Kurt Smith. 2011. Cython: The Best of Both Worlds. Comput. Sci. Eng. 13, 2 (2011), 31--39.
[3]
Konstantina Bereta, Hervé Caumont, Ulrike Daniels, Erwin Goor, Manolis Koubarakis, Despina-Athanasia Pantazi, George Stamoulis, Sam Ubels, Valentijn Venus, and Firman Wahyudi. 2019. The Copernicus App Lab project: Easy Access to Copernicus Data. In EDBT. 501--511.
[4]
Konstantina Bereta, Hervé Caumont, Erwin Goor, Manolis Koubarakis, Despina-Athanasia Pantazi, George Stamoulis, Sam Ubels, Valentijn Venus, and Firman Wahyudi. 2018. From Copernicus Big Data to Big Information and Big Knowledge: A Demo from the Copernicus App Lab Project. In CIKM. 1911--1914.
[5]
Mark Blacher, Joachim Giesen, SÃüren Laue, Julien Klaus, and Viktor Leis. 2022. Machine Learning, Linear Algebra, and More: Is SQL All You Need?. In CIDR.
[6]
Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo. 2009. Tracing the meta-level: PyPy's tracing JIT compiler. In ICOOOLPS.
[7]
CFFI. 2022. Using CFFI for embedding. Available at: https://cffi.readthedocs.io/en/latest/embedding.html.
[8]
CFFI. 2022. Using the ffi/lib objects. Available at: https://cffi.readthedocs.io/en/latest/using.html#extern-python-new-style-callbacks.
[9]
CFFI Docs. 2022. Using the ffi/lib objects. Available at: https://cffi.readthedocs.io/en/latest/using.html.
[10]
Hanfeng Chen, Joseph Vinish D'silva, Hongji Chen, Bettina Kemme, and Laurie Hendren. 2018. HorseIR: Bringing Array Programming Languages Together with Database Query Processing. In ACM SIGPLAN DLS. 37âĂŞ49.
[11]
Hanfeng Chen, Joseph Vinish D'silva, Laurie J. Hendren, and Bettina Kemme. 2021. HorsePower: Accelerating Database Queries for Advanced Data Analytics. In EDBT. 361--366.
[12]
Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing database-backed applications with query synthesis. In SIGPLAN. 3--14.
[13]
Databases and Information Systems Group at TU Ilmenau. 2022. Compiled UDF engine. Available at: https://github.com/dbisilm/Compiled_UDF_engine/tree/main/exp.
[14]
Joseph Vinish D'silva, Florestan De Moor, and Bettina Kemme. 2018. AIDA - Abstraction for Advanced In-Database Analytics. PVLDB 11, 11 (2018), 1400--1413.
[15]
DuckSB. 2022. Available at: https://duckdb.org/.
[16]
Christian Duta and Torsten Grust. 2020. Functional-Style SQL UDFs With a Capital 'F'. In SIGMOD. 1273--1287.
[17]
Christian Duta, Denis Hirn, and Torsten Grust. 2020. Compiling PL/SQL Away. In CIDR.
[18]
Yannis Foufoulas, Alkis Simitsis, Lefteris Stamatogiannakis, and Yannis Ioannidis. 2022. The YeSQL language. Available at: https://github.com/athenarc/YeSQL/blob/main/Specs.md.
[19]
Yannis Foufoulas, Lefteris Stamatogiannakis, Harry Dimitropoulos, and Yannis Ioannidis. 2017. High-pass text filtering for citation matching. In TPDL.
[20]
Yannis Foufoulas, Lefteris Stamatogiannakis, Harry Dimitropoulos, and Yannis Ioannidis. 2017. High-pass text filtering for citation matching. In International Conference on Theory and Practice of Digital Libraries. Springer, 355--366.
[21]
Tasos Giannakopoulos, Yannis Foufoulas, Harry Dimitropoulos, and Natalia Manola. 2019. Interactive Text Analysis and Information Extraction. In IRCDL.
[22]
Theodoros Giannakopoulos, Eleftherios Stamatogiannakis, Ioannis Foufoulas, Harry Dimitropoulos, Natalia Manola, and Yannis Ioannidis. 2013. Content visualization of scientific corpora using an extensible relational database implementation. In TPDL.
[23]
Konstantinos Giannousis, Konstantina Bereta, Nikolaos Karalis, and Manolis Koubarakis. 2018. Distributed Execution of Spatial SQL Queries. In IEEE Big Data. 528--533.
[24]
Anna Gogolou, Marialena Kyriakidi, and Yannis E. Ioannidis. 2016. Data exploration: a roll call of all user-data interaction functionality. In ExploreDB. 31--33.
[25]
Google groops. 2022. Not releasing the GIL. Available at: https://groups.google.com/g/python-cffi/c/ytAHJ0_YYAs/m/lUdP8TWPDQAJ.
[26]
GraalVM. 2022. Available at: https://www.graalvm.org.
[27]
Lars Grammel. 2020. Wrangling US Flight Data - Part 1. 2015. Available at: https://www.trifacta.com/blog/wrangling-us-flight-data-part-1.
[28]
Surabhi Gupta, Sanket Purandare, and Karthik Ramachandra. 2020. Aggify: Lifting the Curse of Cursor Loops using Custom Aggregates. In SIGMOD. 559--573.
[29]
Surabhi Gupta and Karthik Ramachandra. 2021. Procedural Extensions of SQL: Understanding their usage in the wild. PVLDB 14, 8 (2021), 1378--1391.
[30]
Stefan Hagedorn, Steffen Kläbe, and Kai-Uwe Sattler. 2021. Putting Pandas in a Box. In CIDR.
[31]
Anna Herlihy, Periklis Chrysogelos, and Anastasia Ailamaki. 2022. Boosting Efficiency of External Pipelines by Blurring Application Boundaries. In CIDR.
[32]
Alekh Jindal, K. Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, and Hiren Patel. 2021. Magpie: Python at Speed and Scale using Cloud Backends. In CIDR.
[33]
Johannes Kepler University. 2022. The Truffle Language Implementation Framework. Available at: http://www.ssw.unilinz.ac.at/Research/Projects/JVM/Truffle.html.
[34]
Alfons Kemper and Thomas Neumann. 2011. HyPer: A hybrid OLTP&OLAP main memory database system based on virtual memory snapshots. In ICDE.
[35]
Ken Kennedy and Kathryn S. McKinley. 1993. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution. In LCPC, Vol. 768. 301--320.
[36]
Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter A. Boncz. 2018. Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. PVLDB 11, 13 (2018), 2209--2222.
[37]
Steffen KlÃd'be, Robert DeSantis, Stefan Hagedorn, and Kai-Uwe Sattler. 2022. Accelerating Python UDFs in Vectorized Query Execution. In CIDR.
[38]
Kostis Kyzirakos, Dimitrianos Savva, Ioannis Vlachopoulos, Alexandros Vasileiou, Nikolaos Karalis, Manolis Koubarakis, and Stefan Manegold. 2018. GeoTriples: Transforming geospatial data into RDF graphs using R2RML and RML mappings. J. Web Semant. 52--53 (2018), 16--32.
[39]
Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. 2015. Numba: A LLVM-Based Python JIT Compiler. In LLVM-SC.
[40]
Chris Lattner, Jacques A. Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeisman, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore's Law. CoRR abs/2002.11054 (2020).
[41]
Lua JIT, FFI library. 2022. Available at: https://luajit.org/ext_ffi.html.
[42]
Natalia Manola. 2019. Interactive Text Analysis and Information Extraction. In Digital Libraries: Supporting Open Science: 15th Italian Research Conference on Digital Libraries, IRCDL 2019, Pisa, Italy, January 31--February 1, 2019, Proceedings, Vol. 988. Springer, 340.
[43]
Kevin Modzelewski. 2014. Introducing Pyston: an upcoming, JIT-based Python implementation. Available at:. https://nuitka.net
[44]
Kevin Modzelewski. 2020. Pyston v2: 20% faster Python. Available at:. https://blog.pyston.org/2020/10/28/pyston-v2-20-faster-python
[45]
MonetDB. 2022. Base Types. Available at: https://www.monetdb.org/documentation-Jan2022/user-guide/sql-manual/data-types/base-types.
[46]
MonetDB. 2022. Index Definitions. Available at: https://www.monetdb.org/documentation-Jan2022/user-guide/sql-manual/data-definition/index-definitions.
[47]
MonetDB docs. 2022. Jit C UDFs. Available at: https://www.monetdb.org/documentation-Jan2022/user-guide/blog-archive/jit-c-udfs.
[48]
Nuitka the Python Compiler. 2022. Available at: https://nuitka.net.
[49]
Numpy. 2022. Available at: https://numpy.org/.
[50]
Shoumik Palkar, James J. Thomas, Anil Shanbhag, Malte Schwarzkopf, Saman P. Amarasinghe, and Matei Zaharia. 2017. A Common Runtime for High Performance Data Analysis. In CIDR.
[51]
Arash Partow. 2020. The Global Airport Database. Available at: https://www.partow.net/miscellaneous/airportdatabase.
[52]
Richard Plangger and Andreas Krall. 2016. Vectorization in PyPy's Tracing Just-In-Time Compiler. In SCOPES. 67âĂŞ76.
[53]
PostgreSQL. 2022. Procedural languages. In PostgreSQL User Manual.
[54]
PostgreSQL Documentation. 2022. Create type. Available at: https://www.postgresql.org/docs/14/sql-createtype.html.
[55]
Martin Prammer, Suryadev Sahadevan Rajesh, Junda Chen, and Jignesh Patel. 2022. Introducing a Query Acceleration Path for Analytics in SQLite3. In CIDR.
[56]
PyPY. 2022. PyPy's Python packages compatibility. Available at: http://packages.pypy.org.
[57]
PyPY. 2022. Python compatibility. Available at: https://www.pypy.org/compat.html.
[58]
PyPY. 2022. Unicode strings. Available at: https://www.pypy.org/posts/2019/03/pypy-v71-released-now-uses-utf-8-451324088028792912.html.
[59]
Mark Raasveldt and Hannes Mühleisen. 2016. Vectorized udfs in column-stores. In SSDBM.
[60]
Karthik Ramachandra, Kwanghyun Park, K. Venkatesh Emani, Alan Halverson, César A. Galindo-Legaria, and Conor Cunningham. 2017. Froid: Optimization of Imperative Programs in a Relational Database. PVLDB 11, 4(2017), 432--444.
[61]
Armin Rigo and Maciej Fijalkowski. 2021. CFFI documentation.
[62]
Viktor Rosenfeld, René Müller, Pinar Tözün, and Fatma Özcan. 2017. Processing Java UDFs in a C++ environment. In SoCC. 419--431.
[63]
Maximilian E. Schüle, Jakob Huber, Alfons Kemper, and Thomas Neumann. 2020. Freedom for the SQL-Lambda: Just-in-Time-Compiling User-Injected Functions in PostgreSQL. In SSDBM. 6:1--6:12.
[64]
Scipy. 2022. Available at: https://scipy.org.
[65]
Weisong Shi, Jie Cao, Quan Zhang, Youhuizi Li, and Lanyu Xu. 2016. Edge Computing: Vision and Challenges. IEEE Internet Things J. 3, 5 (2016), 637--646.
[66]
Varum Simhadri, Karthik Ramachandra, Arun Chaitanya, Ravindra Guravannavar, and S. Sudarshan. 2014. Decorrelation of user defined function invocations in queries. In ICDE. 532--543.
[67]
Spark documentation. 2022. User Defined Aggregate Functions. Available at: https://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html.
[68]
Leonhard Spiegelberg, Rahul Yesantharao, Malte Schwarzkopf, and Tim Kraska. 2021. Tuplex: Data Science in Python at Native Code Speed. In SIGMOD.
[69]
SQLite C Interface. 2022. Create Or Redefine SQL Functions. Available at: https://www.sqlite.org/c3ref/create_function.html.
[70]
SQLite documentation. 2022. Application-Defined SQL Functions. Available at: https://www.sqlite.org/appfunc.html.
[71]
TPC-H Benchmark. 2022. Available at: http://www.tpc.org/tpch.
[72]
Itamar Turner-Trauring. 2022. Faster string processing in Pandas. Available at: https://www.gresearch.co.uk/article/faster-string-processing-in-pandas.
[73]
U.S. Department of Transportation Bureau of Transportation Statistics. 2020. Carrier history lookup table. Available at: https://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_CARRIER_HISTORY.
[74]
Evert H Pieter Van Dijkhuizen, Orfeas Aidonopoulos, Nienke M Ter Haar, Denise Pires Marafon, Silvia Magni-Manzoni, Yannis E Ioannidis, Lorenza Putignani, Sebastiaan J Vastert, Clara Malattia, Fabrizio De Benedetti, et al. 2018. Prediction of inactive disease in juvenile idiopathic arthritis: a multicentre observational cohort study. Rheumatology 57, 10 (2018), 1752--1760.
[75]
Vertica documentation. 2022. UDAF Class Overview. Available at: https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ExtendingVertica/UDx/AggregateFunctions/UDAFClassOverview.htm.

Cited By

View all
  • (2024)HiPy: Extracting High-Level Semantics from Python Code for Data ProcessingProceedings of the ACM on Programming Languages10.1145/36897378:OOPSLA2(736-762)Online publication date: 8-Oct-2024
  • (2024)Query Compilation Without RegretsProceedings of the ACM on Management of Data10.1145/36549682:3(1-28)Online publication date: 30-May-2024
  • (2024)Sharing Queries with Nonequivalent User-defined Aggregate FunctionsACM Transactions on Database Systems10.1145/364913349:2(1-46)Online publication date: 10-Apr-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 15, Issue 10
June 2022
319 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 June 2022
Published in PVLDB Volume 15, Issue 10

Badges

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)100
  • Downloads (Last 6 weeks)6
Reflects downloads up to 12 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)HiPy: Extracting High-Level Semantics from Python Code for Data ProcessingProceedings of the ACM on Programming Languages10.1145/36897378:OOPSLA2(736-762)Online publication date: 8-Oct-2024
  • (2024)Query Compilation Without RegretsProceedings of the ACM on Management of Data10.1145/36549682:3(1-28)Online publication date: 30-May-2024
  • (2024)Sharing Queries with Nonequivalent User-defined Aggregate FunctionsACM Transactions on Database Systems10.1145/364913349:2(1-46)Online publication date: 10-Apr-2024
  • (2023)Declarative Sub-Operators for Universal Data ProcessingProceedings of the VLDB Endowment10.14778/3611479.361153916:11(3461-3474)Online publication date: 24-Aug-2023
  • (2023)Predicate Pushdown for Data Science PipelinesProceedings of the ACM on Management of Data10.1145/35892811:2(1-28)Online publication date: 20-Jun-2023
  • (2022)YeSQLProceedings of the VLDB Endowment10.14778/3554821.355488615:12(3730-3733)Online publication date: 29-Sep-2022

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media