[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Hashed samples: selectivity estimators for set similarity selection queries

Published: 01 August 2008 Publication History

Abstract

We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query sensitive realtime sampling is more expensive than exact solutions (both in CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.

References

[1]
N. Alon, M. Dietzfelbinger, P. B. Miltersen, E. Petrank, and G. Tardos. Linear hash functions. Journal of the ACM (JACM), 46(5):667--683, 1999.
[2]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In Proc. of Very Large Data Bases (VLDB), pages 918--929, 2006.
[3]
AT&T Inc. Business listings from YellowPages.com. proprietary.
[4]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
[5]
R. A. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127--158, 1999.
[6]
K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In Proc. of ACM Management of Data (SIGMOD), pages 199--210, 2007.
[7]
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929--965, 1989.
[8]
J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18(2):143--154, 1979.
[9]
A. Chandel, O. Hassanzadeh, N. Koudas, M. Sadoghi, and D. Srivastava. Benchmarking declarative approximate selection predicates. In Proc. of ACM Management of Data (SIGMOD), pages 353--364, 2007.
[10]
M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. of ACM Symposium on Principles of Database Systems (PODS), pages 268--279, 2000.
[11]
S. Chaudhuri, G. Das, and U. Srivastava. Effective use of block-level sampling in statistics estimation. In Proc. of ACM Management of Data (SIGMOD), pages 287--298, 2004.
[12]
S. Chaudhuri, V. Ganti, and L. Gravano. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proc. of International Conference on Data Engineering (ICDE), pages 227--238, 2004.
[13]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. of International Conference on Data Engineering (ICDE), page 5, 2006.
[14]
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182--209, 1985.
[15]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proc. of Very Large Data Bases (VLDB), pages 491--500, 2001.
[16]
S. Guha, N. Koudas, D. Srivastava, and X. Yu. Reasoning about approximate match query results. In Proc. of International Conference on Data Engineering (ICDE), pages 8--18, 2006.
[17]
P. J. Haas and C. König. A bi-level bernoulli scheme for database sampling. In Proc. of ACM Management of Data (SIGMOD), pages 275--286, 2004.
[18]
M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In Proc. of International Conference on Data Engineering (ICDE), 2008.
[19]
D. Haussler and E. Welzl. Epsilon-nets and simplex range queries. In Proc. of Annual Symposium on Computational Geometry (SCG), pages 61--71, 1986.
[20]
IMDB. IMDB database. http:// www.imdb.com/interfaces.
[21]
H. V. Jagadish, O. Kapitskaia, R. T. Ng, and D. Srivastava. One-dimensional and multi-dimensional substring selectivity estimation. The VLDB Journal, 9(3):214--230, 2000.
[22]
L. Jin and C. Li. Selectivity estimation for fuzzy string predicates in large data sets. In Proc. of Very Large Data Bases (VLDB), pages 397--408, 2005.
[23]
H. Lee, R. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proc. of Very Large Data Bases (VLDB), pages 195--205, 2007.
[24]
M. Ley. DBLP database. http://dblp.uni-trier.de/xml.
[25]
A. Mazeika, M. H. Böhlen, N. Koudas, and D. Srivastava. Estimating the selectivity of approximate string queries. ACM Transactions on Database Systems (TODS), 32(2):12, 2007.
[26]
F. Olken. Random sampling from databases. PhD thesis, University of California, Berkeley, CA, 1993.
[27]
V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. of ACM Management of Data (SIGMOD), pages 294--305, 1996.
[28]
S. C. Sahinalp, M. Tasan, J. Macker, and Z. M. Özsoyoglu. Distance based indexing for string proximity search. In Proc. of International Conference on Data Engineering (ICDE), pages 125--136, 2003.
[29]
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of ACM Knowledge Discovery and Data Mining (SIGKDD), pages 269--278, 2002.
[30]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In Proc. of ACM Management of Data (SIGMOD), pages 743--754, 2004.
[31]
S. Tata and J. M. Patel. Estimating the selectivity of tf-idf based cosine similarity predicates. SIGMOD Record, 36(2):7--12, 2007.
[32]
V. N. Vapnik and A. Y. Chrervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. SIAM: Theory of Probability and its Applications (TPA), 16(2):264--280, 1971.
[33]
J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37--57, 1985.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 1, Issue 1
August 2008
1216 pages

Publisher

VLDB Endowment

Publication History

Published: 01 August 2008
Published in PVLDB Volume 1, Issue 1

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)10
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Selectivity Estimation for Queries Containing Predicates over Set-Valued AttributesProceedings of the ACM on Management of Data10.1145/36267551:4(1-26)Online publication date: 12-Dec-2023
  • (2021)Tanium revealProceedings of the VLDB Endowment10.14778/3476311.347638614:12(3096-3109)Online publication date: 28-Oct-2021
  • (2021)LES3Proceedings of the VLDB Endowment10.14778/3476249.347626314:11(2073-2086)Online publication date: 27-Oct-2021
  • (2021)PGMJoins: Random Join Sampling with Graphical ModelsProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457302(1610-1622)Online publication date: 9-Jun-2021
  • (2021)Keyword search over schema-less RDF datasets by SPARQL query compilationInformation Systems10.1016/j.is.2021.101814102:COnline publication date: 1-Dec-2021
  • (2020)Orca-SRProceedings of the VLDB Endowment10.14778/3415478.341552313:12(2977-2980)Online publication date: 1-Aug-2020
  • (2020)Efficient Source Selection for Error Detection via Matching DependenciesDatabase Systems for Advanced Applications10.1007/978-3-030-59410-7_13(211-227)Online publication date: 24-Sep-2020
  • (2019)Joins on samplesProceedings of the VLDB Endowment10.14778/3372716.337272613:4(547-560)Online publication date: 9-Dec-2019
  • (2019)Efficient Signal Reconstruction for a Broad Range of ApplicationsACM SIGMOD Record10.1145/3371316.337132748:1(42-49)Online publication date: 5-Nov-2019
  • (2019)Keyword Search Algorithm over Large RDF DatasetsAdvances in Conceptual Modeling10.1007/978-3-030-34146-6_21(230-238)Online publication date: 4-Nov-2019
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media