[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3318464.3389741acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Deep Learning Models for Selectivity Estimation of Multi-Attribute Queries

Published: 31 May 2020 Publication History

Abstract

Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. Recently, deep learning has been applied to this problem with promising results. However, many of the proposed approaches often struggle to provide accurate results for multi attribute queries involving large number of predicates and with low selectivity. In this paper, we propose two complementary approaches that are effective for this scenario. Our first approach models selectivity estimation as a density estimation problem where one seeks to estimate the joint probability distribution from a finite number of samples. We leverage techniques from neural density estimation to build an accurate selectivity estimator. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions such that they satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We describe how to extend our algorithms for range queries. We also introduce and address a number of practical challenges arising when adapting deep learning for relational data. These include query/data featurization, incorporating query workload information in a deep learning framework and the dynamic scenario where both data and workload queries could be updated. Our extensive experiments with a special emphasis on queries with a large number of predicates and/or small result sizes demonstrates that our proposed techniques provide fast and accurate selective estimates with minimal space overhead.

Supplementary Material

MP4 File (3318464.3389741.mp4)
Presentation Video

References

[1]
M. Akdere, U. cC etintemel, M. Riondato, E. Upfal, and S. B. Zdonik. Learning-based query performance modeling and prediction. In ICDE, pages 390--401. IEEE, 2012.
[2]
N. Bruno and S. Chaudhuri. Conditional selectivity for statistics on query expressions. In SIGMOD, pages 311--322, 2004.
[3]
N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: A multidimensional workload-aware histogram. In SIGMOD, pages 211--222, 2001.
[4]
R. Cappuzzo, P. Papotti, and S. Thirumuruganathan. Creating embeddings of heterogeneous relational datasets for data integration tasks. SIGMOD, 2020.
[5]
F. Changyong, W. Hongyue, L. Naiji, C. Tian, H. Hua, L. Ying, et al. Log-transformation and its implications for data analysis. Shanghai archives of psychiatry, 26(2):105, 2014.
[6]
P. Chilinski and R. Silva. Neural likelihoods via cumulative distribution functions. arXiv preprint arXiv:1811.00974, 2018.
[7]
G. Cormode, M. Garofalakis, P. J. Haas, C. Jermaine, et al. Synopses for massive data: Samples, histograms, wavelets, sketches. Foundations and Trends in Databases, 4(1--3):1--294, 2011.
[8]
J. Dougherty, R. Kohavi, and M. Sahami. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings 1995, pages 194--202. Elsevier, 1995.
[9]
A. Dutt, C. Wang, A. Nazi, S. Kandula, V. Narasayya, and S. Chaudhuri. Selectivity estimation for range predicates using lightweight models. PVLDB, 12(9):1044 -- 1057, 2019.
[10]
M. Ebraheem, S. Thirumuruganathan, S. Joty, M. Ouzzani, and N. Tang. Distributed representations of tuples for entity resolution. PVLDB, 11(11):1454--1467, 2018.
[11]
M. Germain, K. Gregor, I. Murray, and H. Larochelle. Made: Masked autoencoder for distribution estimation. In ICML, pages 881--889, 2015.
[12]
L. Getoor, B. Taskar, and D. Koller. Selectivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, 2001.
[13]
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
[14]
I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[15]
K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. In ICML, pages 1242--1250, 2014.
[16]
D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. VLDB J., 14(2):137--154, 2005.
[17]
B. Hilprecht, A. Schmidt, M. Kulessa, A. Molina, K. Kersting, and C. Binnig. Deepdb: Learn from data, not from queries! PVLDB, 2020.
[18]
Y. E. Ioannidis. The history of histograms (abridged). In VLDB, 2003.
[19]
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, pages 275--286, 1998.
[20]
M. Kiefer, M. Heimel, S. Breß, and V. Markl. Estimating join selectivities using bandwidth-optimized kernel density models. PVLDB, 10(13):2085--2096, 2017.
[21]
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, abs/1412.6980, 2015.
[22]
A. Kipf, M. Freitag, D. Vorona, P. Boncz, T. Neumann, and A. Kemper. Estimating filtered group-by queries is hard: Deep learning to the rescue. 1st International Workshop on Applied AI for Database Systems and Applications, 2019.
[23]
A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper. Learned cardinalities: Estimating correlated joins with deep learning. CIDR, 2019.
[24]
A. Kipf, D. Vorona, J. Müller, T. Kipf, B. Radke, V. Leis, P. Boncz, T. Neumann, and A. Kemper. Estimating cardinalities with deep sketches. In Proceedings of the 2019 International Conference on Management of Data, pages 1937--1940, 2019.
[25]
F. Korn, T. Johnson, and H. Jagadish. Range selectivity estimation for continuous attributes. In ssdbm, page 244. IEEE, 1999.
[26]
T. Kraska, M. Alizadeh, A. Beutel, E. H. Chi, J. Ding, A. Kristo, G. Leclerc, S. Madden, H. Mao, and V. Nathan. Sagedb: A learned database system. 2019.
[27]
T. Kraska, A. Beutel, E. H. Chi, J. Dean, and N. Polyzotis. The case for learned index structures. In SIGMOD, pages 489--504. ACM, 2018.
[28]
S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
[29]
S. Lakshmi and S. Zhou. Selectivity estimation in extensible databases-a neural network approach. In VLDB, pages 623--627, 1998.
[30]
V. Leis, A. Gubichev, A. Mirchev, P. Boncz, A. Kemper, and T. Neumann. How good are query optimizers, really? PVLDB, 9(3), 2015.
[31]
V. Leis, B. Radke, A. Gubichev, A. Kemper, and T. Neumann. Cardinality estimation done right: Index-based join sampling. In CIDR, 2017.
[32]
G. P. Lepage. A new algorithm for adaptive multidimensional integration. Journal of Computational Physics, 27(2):192--203, 1978.
[33]
Z. Li and D. Hoiem. Learning without forgetting. IEEE TPAMI, 40(12):2935--2947, 2018.
[34]
R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling, volume 19. ACM, 1990.
[35]
R. Marcus, P. Negi, H. Mao, C. Zhang, M. Alizadeh, T. Kraska, O. Papaemmanouil, and N. Tatbul. Neo: A learned query optimizer. Proceedings of the VLDB Endowment, 12(11):1705--1718, 2019.
[36]
R. Marcus and O. Papaemmanouil. Deep reinforcement learning for join order enumeration. arXiv preprint arXiv:1803.00055, 2018.
[37]
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In ACM SIGMOD Record, volume 27, 1998.
[38]
M. Müller, G. Moerkotte, and O. Kolb. Improved selectivity estimation by combining knowledge from sampling and synopses. PVLDB, 11(9):1016--1028, 2018.
[39]
K. P. Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
[40]
J. Neter, W. Wasserman, and M. H. Kutner. Applied linear regression models. 1989.
[41]
J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. Learning state representations for query optimization with deep reinforcement learning. arXiv preprint arXiv:1803.08604, 2018.
[42]
J. Ortiz, M. Balazinska, J. Gehrke, and S. S. Keerthi. An empirical analysis of deep learning for cardinality estimation. arXiv preprint arXiv:1905.06425, 2019.
[43]
V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In ACM Sigmod Record, volume 25, 1996.
[44]
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, volume 97, pages 486--495, 1997.
[45]
S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In ICML, pages 833--840, 2011.
[46]
A. J. Smola and B. Schölkopf. A tutorial on support vector regression. Statistics and computing, 14(3):199--222, 2004.
[47]
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929--1958, 2014.
[48]
J. Sun and G. Li. An end-to-end learning-based cost estimator. arXiv preprint arXiv:1906.02560, 2019.
[49]
N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD, pages 428--439, 2002.
[50]
S. Thirumuruganathan, S. Hasan, N. Koudas, and G. Das. Approximate query processing for data exploration using deep generative models. ICDE, 2020.
[51]
S. Thirumuruganathan, N. Tang, M. Ouzzani, and A. Doan. Data curation with deep learning. EDBT, 2020.
[52]
K. Tzoumas, A. Deshpande, and C. S. Jensen. Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB, 4(11):852--863, 2011.
[53]
K. Tzoumas, T. Sellis, and C. S. Jensen. A reinforcement learning approach for adaptive query processing. History, 2008.
[54]
B. Uria, I. Murray, and H. Larochelle. NADE: the real-valued neural autoregressive density-estimator. CoRR, abs/1306.0186, 2013.
[55]
B. Uria, I. Murray, and H. Larochelle. A deep and tractable density estimator. In International Conference on Machine Learning, pages 467--475, 2014.
[56]
P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
[57]
D. Vorona, A. Kipf, T. Neumann, and A. Kemper. Deepspace: Approximate geospatial query processing with deep learning. In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 500--503, 2019.
[58]
Z. Wang, D. Cashman, M. Li, J. Li, M. Berger, J. A. Levine, R. Chang, and C. Scheidegger. Nncubes: Learned structures for visual data exploration. arXiv preprint arXiv:1808.08983, 2018.
[59]
Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica. Deep unsupervised cardinality estimation. Proceedings of the VLDB Endowment, 13(3):279--292, 2019.

Cited By

View all
  • (2024)A Spark Optimizer for Adaptive, Fine-Grained Parameter TuningProceedings of the VLDB Endowment10.14778/3681954.368202117:11(3565-3579)Online publication date: 1-Jul-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)Backbone Index and GNN Models for Skyline Path Query Evaluation over Multi-cost Road NetworksACM Transactions on Spatial Algorithms and Systems10.1145/366063210:4(1-45)Online publication date: 23-Apr-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN:9781450367356
DOI:10.1145/3318464
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 May 2020

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. cardinality estimation
  2. deep learning
  3. density estimation
  4. made
  5. neural autoregressive models
  6. selectivity estimation

Qualifiers

  • Research-article

Funding Sources

  • AT&T
  • COHESA NOE
  • National Science Foundation

Conference

SIGMOD/PODS '20
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)224
  • Downloads (Last 6 weeks)16
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)A Spark Optimizer for Adaptive, Fine-Grained Parameter TuningProceedings of the VLDB Endowment10.14778/3681954.368202117:11(3565-3579)Online publication date: 1-Jul-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)Backbone Index and GNN Models for Skyline Path Query Evaluation over Multi-cost Road NetworksACM Transactions on Spatial Algorithms and Systems10.1145/366063210:4(1-45)Online publication date: 23-Apr-2024
  • (2024)A Generic Machine Learning Model for Spatial Query Optimization based on Spatial EmbeddingsACM Transactions on Spatial Algorithms and Systems10.1145/365763310:4(1-33)Online publication date: 13-Apr-2024
  • (2024)LPLM: A Neural Language Model for Cardinality Estimation of LIKE-QueriesProceedings of the ACM on Management of Data10.1145/36393092:1(1-25)Online publication date: 26-Mar-2024
  • (2024)Machine Unlearning in Learned Databases: An Experimental AnalysisProceedings of the ACM on Management of Data10.1145/36393042:1(1-26)Online publication date: 26-Mar-2024
  • (2024)Controllable Tabular Data Synthesis Using Diffusion ModelsProceedings of the ACM on Management of Data10.1145/36392832:1(1-29)Online publication date: 26-Mar-2024
  • (2024)Precision Meets Resilience: Cross-Database Generalization with Uncertainty Quantification for Robust Cost EstimationProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679632(581-590)Online publication date: 21-Oct-2024
  • (2024)A Robust Federated Learning Approach for Combating Attacks Against IoT Systems Under Non-IID Challenges2024 International Conference on Smart Applications, Communications and Networking (SmartNets)10.1109/SmartNets61466.2024.10577749(1-6)Online publication date: 28-May-2024
  • (2024)Towards Exploratory Query Optimization for Template-Based SQL Workloads2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00019(151-164)Online publication date: 13-May-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media