[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/1325851.1325877dlproceedingsArticle/Chapter ViewAbstractPublication PagesvldbConference Proceedingsconference-collections
research-article

Extending q-grams to estimate selectivity of string matching with low edit distance

Published: 23 September 2007 Publication History

Abstract

There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semi-lattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm Opt EQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA. Our experimental results show that Opt EQ delivers more accurate selectivity estimations.

References

[1]
A. Aboulnaga, A. R. Almeldeen and J. F. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. Proc. VLDB, pp. 591--600, 2001.
[2]
R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. VLDB, pp. 487--499, 1994.
[3]
R. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.
[4]
S. Burkhardt and J. Karkkianen. One-gapped q-Gram Filters for Levenshtein Distance, Springer, LNCS, pp. 225--234, 2002.
[5]
S. Burkhardt et al. Q-gram based database searching using a suffix array (QUASAR), Proc. RECOMB, pp. 77--83, 1999.
[6]
S. Chaudhuri, V. Ganti and L. Gravano. Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem. Proc. ICDE, pp. 227--238, 2004.
[7]
S. Chaudhuri et al. Robust and Efficient Fuzzy Match for Online Data Cleaning. Proc. SIGMOD, pp. 313--324, 2003.
[8]
W. Cohen, P. Ravikmar and and S. Fienberg. A Comparison of String Metrics for Matching Names and Records. Proc. of KDD Workshop on Data Cleaning, pp. 73--78, 2003.
[9]
R. Dixon and T. Martin. Automatic Speech and Speaker Recognition. IEEE Press, 1979.
[10]
Flamingo Project, http://www.ics.uci.edu/flamingo/.
[11]
L. Gravano et al. Approximate string joins in a database (almost) for free. Proc. VLDB, pp. 491--500, 2001.
[12]
D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge Univ. Press, 1997.
[13]
B. Hore et al. Indexing Text Data under Space Constraints. Proc. CIKM, pp. 198--207, 2004.
[14]
H. V. Jagadish, R. T. Ng and D. Srivastava. Substring Selectivity Estimation, Proc. PODS, pp. 249--260, 1999.
[15]
L. Jin and C. Li. Selectivity Estimation for Fuzzy String Predicates in Large Data Sets, Proc. VLDB, pp. 397--408, 2005.
[16]
L. Jin et al. Indexing Mixed Types for Approximate Retrieval, Proc. VLDB, pp. 793--804, 2005.
[17]
P. Krishnan, J. S. Vitter and B. Iyer. Estimating Alphanumeric Selectivity in the Presence of Wildcards. Proc. SIGMOD, pp. 282--293, 1996.
[18]
M. Ley. DBLP, http://www.fnformatick.uni-tier.de/ley/db.
[19]
L. Lim et al. XPathLearner: An on-line self-tuning Markov histogram for XML path selectivity estimation. Proc. VLDB, pp. 442--453, 2002.
[20]
G. Navarro. A Guided Tour to Approximate String Matching. ACM Computing Survey, 33, pp. 31--88, 2001.
[21]
G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. Journal of Experimental Algorithmics, volume 5, article 4, 2000.
[22]
S. Sarawagi and A. Bhamidipaty. Interactive Deduplication Using Active Learning. Proc. VLDB, pp. 269--278, 2002.
[23]
Winkler, W. E. The State of Record Linkage and Current Research Problems. Statistics of Income Division, Internal Revenue Service Publication R99/04.

Cited By

View all
  • (2024)LPLM: A Neural Language Model for Cardinality Estimation of LIKE-QueriesProceedings of the ACM on Management of Data10.1145/36393092:1(1-25)Online publication date: 26-Mar-2024
  • (2022)Cardinality estimation of approximate substring queries using deep learningProceedings of the VLDB Endowment10.14778/3551793.355185915:11(3145-3157)Online publication date: 1-Jul-2022
  • (2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image DL Hosted proceedings
VLDB '07: Proceedings of the 33rd international conference on Very large data bases
September 2007
1443 pages
ISBN:9781595936493

Sponsors

  • Yahoo! Research
  • Google Inc.
  • SAP
  • Intel: Intel
  • Microsoft Research: Microsoft Research
  • ORACLE: ORACLE
  • Connex.cc
  • HP invent
  • WKO
  • IBM: IBM

Publisher

VLDB Endowment

Publication History

Published: 23 September 2007

Qualifiers

  • Research-article

Funding Sources

  • College Information Technology Research Center Support Program

Conference

VLDB '07
Sponsor:
  • Intel
  • Microsoft Research
  • ORACLE
  • IBM

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 12 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)LPLM: A Neural Language Model for Cardinality Estimation of LIKE-QueriesProceedings of the ACM on Management of Data10.1145/36393092:1(1-25)Online publication date: 26-Mar-2024
  • (2022)Cardinality estimation of approximate substring queries using deep learningProceedings of the VLDB Endowment10.14778/3551793.355185915:11(3145-3157)Online publication date: 1-Jul-2022
  • (2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
  • (2020)Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning ApproachProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380570(1197-1212)Online publication date: 11-Jun-2020
  • (2016)Predicting the effectiveness of pattern-based entity extractor inferenceApplied Soft Computing10.1016/j.asoc.2016.05.02346:C(398-406)Online publication date: 1-Sep-2016
  • (2014)Extending string similarity join to tolerant fuzzy token matchingACM Transactions on Database Systems10.1145/253562839:1(1-45)Online publication date: 6-Jan-2014
  • (2013)A partition-based method for string similarity joins with edit-distance constraintsACM Transactions on Database Systems10.1145/2487259.248726138:2(1-33)Online publication date: 4-Jul-2013
  • (2013)Efficient top-k algorithms for approximate substring matchingProceedings of the 2013 ACM SIGMOD International Conference on Management of Data10.1145/2463676.2465324(385-396)Online publication date: 22-Jun-2013
  • (2013)Selectivity estimation for hybrid queries over text-rich data graphsProceedings of the 16th International Conference on Extending Database Technology10.1145/2452376.2452421(383-394)Online publication date: 18-Mar-2013
  • (2012)Supporting efficient top-k queries in type-ahead searchProceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval10.1145/2348283.2348333(355-364)Online publication date: 12-Aug-2012
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media