[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3448016.3457311acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries

Published: 18 June 2021 Publication History

Abstract

Hybrid complex analytics workloads typically include (i) data management tasks (joins, selections, etc. ), easily expressed using relational algebra (RA)-based languages, and (ii) complex analytics tasks (regressions, matrix decompositions, etc.), mostly expressed in linear algebra (LA) expressions. Such workloads are common in many application areas, including scientific computing, web analytics, and business recommendation. Existing solutions for evaluating hybrid analytical tasks - ranging from LA-oriented systems, to relational systems (extended to handle LA operations), to hybrid systems - either optimize data management and complex tasks separately, exploit RA properties only while leaving LA-specific optimization opportunities unexploited, or focus heavily on physical optimization, leaving semantic query optimization opportunities unexplored. Additionally, they are not able to exploit precomputed (materialized) results to avoid recomputing (part of) a given mixed (RA and/or LA) computation.
In this paper, we take a major step towards filling this gap by proposing HADAD, an extensible lightweight approach for optimizing hybrid complex analytics queries, based on a common abstraction that facilitates unified reasoning: a relational model endowed with integrity constraints. Our solution can be naturally and portably applied on top of pure LA and hybrid RA-LA platforms without modifying their internals. An extensive empirical evaluation shows that HADAD yields significant performance gains on diverse workloads, ranging from LA-centered to hybrid.

Supplementary Material

Source Code (3448016.3457311_source_code.zip)
Read me (3448016.3457311_readme.pdf)
MP4 File (3448016.3457311.mp4)
Hybrid complex analytics workloads typically include (i) data management tasks (joins, filters, etc. ), easily expressed using relational algebra (RA)-based languages, and (ii) complex analytics tasks (regressions, matrix decompositions, etc.), mostly expressed in linear algebra (LA) expressions. Such workloads are common in a number of areas, including scientific computing, web analytics, business recommendation, natural language processing, speech recognition. Existing solutions for evaluating hybrid complex analytics queries ranging from LA-oriented systems, to relational systems (extended to handle LA operations), to hybrid systems fail to provide a unified optimization framework for such a hybrid setting. These systems either optimize data management and complex analytics tasks separately, or exploit RA properties only while leaving LA-specific optimization opportunities unexplored. Finally, they are not able to exploit precomputed (materialized) results to avoid computing again(part of) a given mixed (LA and RA) computation. We describe HADAD, an extensible lightweight approach for optimizinghybrid complex analytics queries, based on a common abstraction that facilitates unified reasoning: a relational model endowed with integrity constraints, which can be used to express the properties of the two computation formalisms. Our approach enables full exploration of LA properties and rewrites, as well as semantic query optimization. Importantly, our approach does not require modifying the internals of the existing systems. Our experimental evaluation shows significant performance gains on diverse workloads, from LA-centered ones to hybrid ones.

References

[1]
Amazon Review Data. https://nijianmo.github.io/amazon/index.html, Accessed August, 2020.
[2]
Breeze Wiki. https://github.com/scalanlp/breeze/wiki, Accessed June, 2020.
[3]
Cox Proportional-Hazards Model. https://github.com/apache/systemds/blob/master/scripts/algorithms/Cox-predict.dml, Accessed Feb, 2021.
[4]
Kaggle Survey. https://www.kaggle.com/kaggle-survey-2019, Accessed June, 2020.
[5]
Morpheus. https://github.com/lchen001/Morpheus, Accessed December, 2020.
[6]
Native Blas in SystemDS. https://apache.github.io/systemds/native-backend, Accessed June, 2020.
[7]
Netflix Movie Rating. https://www.kaggle.com/netflix-inc/netflix-prize-data, Accessed August, 2020.
[8]
NumPy. https://numpy.org/, Accessed June, 2020.
[9]
Project R. https://www.r-project.org/other-docs.html, Accessed June, 2020.
[10]
Python/NumPy in Monetdb. https://tinyurl.com/11ljy21v, Accessed June, 2020.
[11]
SparkMLlib. https://spark.apache.org/mllib, Accessed June, 2020.
[12]
Technical Report. https://arxiv.org/abs/2103.12317.
[13]
Twitter API. https://developer.twitter.com/en/docs, Accessed January, 2021.
[14]
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A System for Large-Scale Machine Learning. In USENIX, pages 265--283, 2016.
[15]
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995.
[16]
R. Alotaibi, D. Bursztyn, A. Deutsch, I. Manolescu, and S. Zampetakis. Towards Scalable Hybrid Stores: Constraint-based Rewriting to the Rescue. In SIGMOD, pages 1660--1677, 2019.
[17]
R. Alotaibi, B. Cautis, A. Deutsch, M. Latrache, I. Manolescu, and Y. Yang. ESTOCADA: Towards Scalable Polystore Systems. PVLDB, pages 2949--2952, 2020.
[18]
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. SparkSQl: Relational Data Processing in Spark. In SIGMOD, pages 1383--1394, 2015.
[19]
S. Axler. Linear Algebra Done Right. Springer, 2015.
[20]
P. Barceló, N. Higuera, J. Pé rez, and B. Subercaseaux. On the Expressiveness of LARA: A Unified Language for Linear and Relational Algebra. In ICDT, pages 6:1--6:20, 2020.
[21]
D. Baylor, E. Breck, H.-T. Cheng, N. Fiedel, C. Y. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, et al. Tfx: A Tnsorflow-based Production-Scale Machine Learning Platform. In SIGKDD, pages 1387--1395, 2017.
[22]
M. Boehm, M. W. Dusenberry, D. Eriksson, A. V. Evfimievski, F. M. Manshadi, N. Pansare, B. Reinwald, F. R. Reiss, P. Sen, A. C. Surve, et al. SystemML: Declarative Machine Learning on Spark. PVLDB, pages 1425--1436, 2016.
[23]
M. Boehm, A. V. Evfimievski, N. Pansare, and B. Reinwald. Declarative Machine Learning Classification of Basic Properties and Types. arXiv:1605.05826, 2016.
[24]
M. Boehm, B. Reinwald, D. Hutchison, P. Sen, A. V. Evfimievski, and N. Pansare. On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML. PVLDB, pages 1755--1768, 2018.
[25]
J.-H. Böse, V. Flunkert, J. Gasthaus, T. Januschowski, D. Lange, D. Salinas, S. Schelter, M. Seeger, and Y. Wang. Probabilistic Demand Forecasting at Scale. PVLDB, pages 1694--1705, 2017.
[26]
R. Brijder, F. Geerts, J. V. den Bussche, and T. Weerwag. On the Expressive Power of Query Languages for Matrices. ACM Trans. Database Syst., pages 15:1--15:31, 2019.
[27]
A. K. Chandra and P. M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Databases. In ACM symposium on Theory of computing, pages 77--90, 1977.
[28]
L. Chen, A. Kumar, J. Naughton, and J. M. Patel. Towards Linear Algebra Over Normalized Data. PVLDB, pages 1214--1225, 2017.
[29]
I. Ileana. Query Rewriting Using Views : a Theoretical and Practical Perspective . Theses, Télécom ParisTech, Oct. 2014.
[30]
I. Ileana, B. Cautis, A. Deutsch, and Y. Katsis. Complete Yet Practical Search for Minimal Query Reformulations Under Constraints. In SIGMOD, pages 1015--1026, 2014.
[31]
A. Johnson et al. MIMIC-III. http://www.nature.com/articles/sdata201635, 2016.
[32]
K. Karanasos, M. Interlandi, D. Xin, F. Psallidas, R. Sen, K. Park, I. Popivanov, S. Nakandal, S. Krishnan, M. Weimer, et al. Extending Relational Query Processing with ML Inference. In CIDR, 2020.
[33]
D. Kernert, F. Köhler, and W. Lehner. Bringing Linear Algebra Objects to Life in a Column-Criented in-Memory Database. In In Memory Data Management and Analysis, pages 44--55. Springer, 2013.
[34]
A. Kumar, M. Boehm, and J. Yang. Data Management in Machine Learning: Challenges, Techniques, and Systems. In SIGMOD, pages 1717--1722, 2017.
[35]
A. Kumar, R. McCann, J. Naughton, and J. M. Patel. Model Selection Management Systems: The Next Frontier of Advanced Analytics. In SIGMOD, pages 17--22, 2016.
[36]
A. Kunft, A. Katsifodimos, S. Schelter, S. Breß, T. Rabl, and V. Markl. An Intermediate Representation for Optimizing Machine Learning Pipelines. PVLDB, pages 1553--1567, 2019.
[37]
K. Kuttler. Linear Algebra: Theory and Applications. The Saylor Foundation, 2012.
[38]
S. Luo, Z. J. Gao, M. Gubanov, L. L. Perez, and C. Jermaine. Scalable Linear Algebra on a Relational Database System. In ICDE, pages 1224--1238, 2018.
[39]
J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pages 281--297, 1967.
[40]
X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, et al. MLlib: Machine Learning in Apache Spark. The Journal of Machine Learning Research, pages 1235--1241, 2016.
[41]
M. Milani, S. Hosseinpour, and H. Pehlivan. Rule-based Production of Mathematical Expressions. Mathematics, 6:254, 11 2018.
[42]
T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[43]
D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J.-F. Crespo, and D. Dennison. Hidden Technical Debt in Machine Learning Systems. In NeurIPS, pages 2503--2511, 2015.
[44]
J. Sommer, M. Boehm, A. V. Evfimievski, B. Reinwald, and P. J. Haas. MNC: Structure-Exploiting Sparsity Estimation for Matrix Expressions. In SIGMOD, pages 1607--1623, 2019.
[45]
E. R. Sparks, A. Talwalkar, D. Haas, M. J. Franklin, M. I. Jordan, and T. Kraska. Automating Model Search for Large Scale Machine Learning. In ACM Symposium on Cloud Computing, pages 368--380, 2015.
[46]
A. Thomas and A. Kumar. A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics. PVLDB, pages 2168--2182, 2018.
[47]
Y. R. Wang, S. Hutchison, J. Leang, B. Howe, and D. Suciu. SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra. PVLDB, pages 1919--1932, 2020.

Cited By

View all
  • (2024)Amalur: The Convergence of Data Integration and Machine LearningIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.335738936:12(7353-7367)Online publication date: Dec-2024
  • (2023)Krypton: Real-Time Serving and Analytical SQL Engine at ByteDanceProceedings of the VLDB Endowment10.14778/3611540.361154516:12(3528-3542)Online publication date: 1-Aug-2023
  • (2023)Optimizing Tensor Programs on Flexible StorageProceedings of the ACM on Management of Data10.1145/35887171:1(1-27)Online publication date: 30-May-2023
  • Show More Cited By

Index Terms

  1. HADAD: A Lightweight Approach for Optimizing Hybrid Complex Analytics Queries

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021
      2969 pages
      ISBN:9781450383431
      DOI:10.1145/3448016
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 18 June 2021

      Permissions

      Request permissions for this article.

      Check for updates

      Badges

      Author Tags

      1. chase
      2. integrity constraints
      3. linear algebra
      4. query rewriting

      Qualifiers

      • Research-article

      Conference

      SIGMOD/PODS '21
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)138
      • Downloads (Last 6 weeks)20
      Reflects downloads up to 12 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Amalur: The Convergence of Data Integration and Machine LearningIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2024.335738936:12(7353-7367)Online publication date: Dec-2024
      • (2023)Krypton: Real-Time Serving and Analytical SQL Engine at ByteDanceProceedings of the VLDB Endowment10.14778/3611540.361154516:12(3528-3542)Online publication date: 1-Aug-2023
      • (2023)Optimizing Tensor Programs on Flexible StorageProceedings of the ACM on Management of Data10.1145/35887171:1(1-27)Online publication date: 30-May-2023
      • (2023)Amalur: Data Integration Meets Machine Learning2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00301(3729-3739)Online publication date: Apr-2023

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media