[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2723372.2749438acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation

Published: 27 May 2015 Publication History

Editorial Notes

Computationally Reproducible. The experimental results of this paper were reproduced by a SIGMOD Review Committee and were found to support the central results reported in the paper. Details of the review process are found here:
http://db-reproducibility.seas.harvard.edu/#process

Abstract

Quickly and accurately estimating the selectivity of multidimensional predicates is a vital part of a modern relational query optimizer. The state-of-the art in this field are multidimensional histograms, which offer good estimation quality but are complex to construct and hard to maintain. Kernel Density Estimation (KDE) is an interesting alternative that does not suffer from these problems. However, existing KDE-based selectivity estimators can hardly compete with the estimation quality of state-of-the art methods.
In this paper, we substantially expand the state-of-the-art in KDE-based selectivity estimation by improving along three dimensions: First, we demonstrate how to numerically optimize a KDE model, leading to substantially improved estimates. Second, we develop methods to continuously adapt the estimator to changes in both the database and the query workload. Finally, we show how to drastically improve the performance by pushing computations onto a GPU.
We provide an implementation of our estimator and experimentally evaluate it on a variety of datasets and workloads, demonstrating that it efficiently scales up to very large model sizes, adapts itself to database changes, and typically outperforms the estimation quality of both existing Kernel Density Estimators as well as state-of-the-art multidimensional histograms.

References

[1]
Multivariate Density Estimation - Theory, Practice and Visualization. John Wiley & Sons, Inc., 1992.
[2]
W. Andrzejewski, A. Gramacki, and J. Gramacki. Graphics processing units in acceleration of bandwidth selection for kernel density estimation. Int. J. Appl. Math. Comput. Sci, 23(4):869--885, 2013.
[3]
K. Bache and M. Lichman. UCI machine learning repository, 2013.
[4]
B. Blohsfeld, D. Korus, and B. Seeger. A comparison of selectivity estimators for range queries on metric attributes. SIGMOD Record, 28(2):239--250, 1999.
[5]
A. W. Bowman. An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71(2):353--360, 1984.
[6]
S. Breß, M. Heimel, N. Siegmund, L. Bellatreche, and G. Saake. Gpu-accelerated database systems: Survey and open challenges. In Transactions on Large-Scale Data-and Knowledge-Centered Systems XV, pages 1--35. Springer, 2014.
[7]
N. Bruno, S. Chaudhuri, and L. Gravano. Stholes: a multidimensional workload-aware histogram. SIGMOD Record, 30(2):211--222, 2001.
[8]
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190--1208, 1995.
[9]
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. SIGMOD Record, 28(2):263--274, 1999.
[10]
S. Christodoulakis. Implications of certain assumptions in database performance evauation. ACM Transactions on Database Systems (TODS), 9(2):163--186, 1984.
[11]
T. Duong and M. L. Hazelton. Cross-validation bandwidth matrices for multivariate kernel density estimation. Scandinavian Journal of Statistics, 32(3):485--506, 2005.
[12]
R. Gemulla, W. Lehner, and P. J. Haas. Maintaining bounded-size sample synopses of evolving datasets. The VLDB Journal, 17(2):173--201, 2008.
[13]
P. B. Gibbons, Y. Matias, and V. Poosala. Maintaining a random sample of a relation in a database in the presence of updates to the relation, Jan. 4 2000. US Patent 6,012,064.
[14]
D. Gunopulos, G. Kollios, J. Tsotras, and C. Domeniconi. Selectivity estimators for multidimensional range queries over real attributes. The VLDB Journal, 14(2):137--154, 2005.
[15]
D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. ACM SIGMOD Record, 29(2):463--474, 2000.
[16]
M. Heimel and V. Markl. A first step towards gpu-assisted query optimization. In International Workshop on Accelerating Data Management Systems using Modern Processor and Storage Architectures (ADMS), Istanbul, Turkey, pages 33--44, 2012.
[17]
M. Heimel, M. Saecker, H. Pirk, S. Manegold, and V. Markl. Hardware-oblivious parallelism for in-memory column-stores. Proceedings of the VLDB Endowment, 6(9):709--720, 2013.
[18]
C. Heinz and B. Seeger. Cluster kernels: Resource-aware kernel density estimators over streaming data. Knowledge and Data Engineering, IEEE Transactions on, 20(7):880--893, 2008.
[19]
D. Horn. Stream reduction operations for gpgpu applications. Gpu gems, 2:573--589, 2005.
[20]
Y. Ioannidis. The history of histograms (abridged). In Proceedings of the 29th VLDB conference, pages 19--30. VLDB Endowment, 2003.
[21]
Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. SIGMOD Record, 20(2):268--277, 1991.
[22]
S. G. Johnson. The NLopt nonlinear-optimization package. http://ab-initio.mit.edu/nlopt.
[23]
M. C. Jones, J. S. Marron, and S. J. Sheather. A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association, 91(433):401--407, 1996.
[24]
A. R. Kan and G. Timmer. Stochastic global optimization methods part ii: Multi level methods. Mathematical Programming, 39(1):57--78, 1987.
[25]
P.-A. Larson, W. Lehner, J. Zhou, and P. Zabback. Cardinality estimation using sample views with quality assurance. In Proceedings of the 2007 ACM SIGMOD Conference, pages 175--186. ACM, 2007.
[26]
J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. SIGMOD Record, 28(2):205--214, 1999.
[27]
Q. Li and J. Racine. Nonparametric estimation of distributions with categorical and continuous data. journal of multivariate analysis, 86(2):266--292, 2003.
[28]
R. J. Lipton, J. F. Naughton, and D. A. Schneider. Practical selectivity estimation through adaptive sampling, volume 19. ACM, 1990.
[29]
V. Markl, P. J. Haas, M. Kutsch, N. Megiddo, U. Srivastava, and T. M. Tran. Consistent selectivity estimation via maximum entropy. The VLDB journal, 16(1):55--76, 2007.
[30]
Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. SIGMOD Record, 27(2):448--459, 1998.
[31]
G. Moerkotte, T. Neumann, and G. Steidl. Preventing bad plans by bounding the impact of cardinality estimation errors. Proceedings of the VLDB Endowment, 2(1):982--993, 2009.
[32]
M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. SIGMOD Record, 17(3):28--36, 1988.
[33]
F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Data Engineering, 1992. Proceedings. Eighth International Conference on, pages 632--641. IEEE, 1992.
[34]
V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proceedings of the 23rd VLDB conference, pages 486--495. VLDB Endowment, 1997.
[35]
N. Reddy and J. R. Haritsa. Analyzing plan diagrams of database query optimizers. In Proceedings of the 31st VLDB conference, pages 1228--1239. VLDB Endowment, 2005.
[36]
M. Riedmiller and H. Braun. A direct adaptive method for faster backpropagation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International Conference on, pages 586--591. IEEE, 1993.
[37]
S. R. Sain, K. A. Baggerly, and D. W. Scott. Cross-validation of multivariate densities. Journal of the American Statistical Association, 89(427):807--817, 1994.
[38]
U. Srivastava, P. J. Haas, V. Markl, M. Kutsch, and T. M. Tran. Isomer: Consistent histogram construction using query feedback. In Data Engineering, 2006. ICDE'06. Proceedings of the 22nd International Conference on, pages 39--39. IEEE, 2006.
[39]
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Online outlier detection in sensor data using non-parametric models. In Proceedings of the 32nd VLDB conference, pages 187--198. VLDB Endowment, 2006.
[40]
K. Svanberg. A class of globally convergent optimization methods based on conservative convex separable approximations. SIAM journal on optimization, 12(2):555--573, 2002.
[41]
G. R. Terrell and D. W. Scott. Variable kernel density estimation. The Annals of Statistics, pages 1236--1265, 1992.
[42]
T. Tieleman and G. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.
[43]
J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37--57, 1985.
[44]
M. Wand and M. Jones. Comparison of smoothing parameterizations in bivariate kernel density estimation. Journal of the American Statistical Association, 88(422):520--528, 1993.
[45]
M. Wand and M. Jones. Multivariate plug-in bandwidth selection. Computational Statistics, 9(2):97--116, 1994.
[46]
Z. Zhang, Y. Yang, R. Cai, D. Papadias, and A. Tung. Kernel-based skyline cardinality estimation. In Proceedings of the 2009 ACM SIGMOD Conference, pages 509--522. ACM, 2009.

Cited By

View all
  • (2024)Learned Query Optimization by Constraint-Based Query Plan AugmentationMathematics10.3390/math1219310212:19(3102)Online publication date: 3-Oct-2024
  • (2024)Presto's History-Based Query OptimizerProceedings of the VLDB Endowment10.14778/3685800.368582817:12(4077-4089)Online publication date: 1-Aug-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Permissions

Request permissions for this article.

Check for updates

Badges

Author Tags

  1. bandwidth selection
  2. cardinality estimation
  3. gpgpu
  4. gpu-accelerated databases
  5. gpu-assisted query optimization
  6. graphics cards
  7. kernel density estimator
  8. opencl
  9. query optimization
  10. selectivity estimation
  11. self-tuning databases

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)145
  • Downloads (Last 6 weeks)41
Reflects downloads up to 11 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Learned Query Optimization by Constraint-Based Query Plan AugmentationMathematics10.3390/math1219310212:19(3102)Online publication date: 3-Oct-2024
  • (2024)Presto's History-Based Query OptimizerProceedings of the VLDB Endowment10.14778/3685800.368582817:12(4077-4089)Online publication date: 1-Aug-2024
  • (2024)Accurate Sampling-Based Cardinality Estimation for Complex Graph QueriesACM Transactions on Database Systems10.1145/368920949:3(1-46)Online publication date: 17-Sep-2024
  • (2024)QardEst: Using Quantum Machine Learning for Cardinality Estimation of Join QueriesProceedings of the 1st Workshop on Quantum Computing and Quantum-Inspired Technology for Data-Intensive Systems and Applications10.1145/3665225.3665444(2-13)Online publication date: 9-Jun-2024
  • (2024)Learned Query Optimizer: What is New and What is NextCompanion of the 2024 International Conference on Management of Data10.1145/3626246.3654692(561-569)Online publication date: 9-Jun-2024
  • (2024)CERT: Finding Performance Issues in Database Systems Through the Lens of Cardinality EstimationProceedings of the IEEE/ACM 46th International Conference on Software Engineering10.1145/3597503.3639076(1-13)Online publication date: 20-May-2024
  • (2024)Towards Exploratory Query Optimization for Template-Based SQL Workloads2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00019(151-164)Online publication date: 13-May-2024
  • (2024)Automating localized learning for cardinality estimation based on XGBoostKnowledge and Information Systems10.1007/s10115-024-02142-266:7(3825-3854)Online publication date: 1-Jul-2024
  • (2024)LAF: A Local Depth Autoregressive Framework for Cardinality Estimation of Multi-attribute QueriesWeb and Big Data10.1007/978-981-97-2387-4_20(296-311)Online publication date: 28-Apr-2024
  • (2023)A Cardinality Estimator in Complex Database Systems Based on TreeLSTMSensors10.3390/s2317736423:17(7364)Online publication date: 23-Aug-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media