More Web Proxy on the site http://driver.im/

research-article

Extending q-grams to estimate selectivity of string matching with low edit distance

Authors:

Kyuseok ShimAuthors Info & Claims

VLDB '07: Proceedings of the 33rd international conference on Very large data bases

Pages 195 - 206

Published: 23 September 2007 Publication History

Abstract

There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semi-lattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm Opt EQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing Opt EQ with the state-of-the-art method SEPIA. Our experimental results show that Opt EQ delivers more accurate selectivity estimations.

References

[1]

A. Aboulnaga, A. R. Almeldeen and J. F. Naughton. Estimating the selectivity of XML path expressions for Internet scale applications. Proc. VLDB, pp. 591--600, 2001.

Digital Library

[2]

R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. Proc. VLDB, pp. 487--499, 1994.

Digital Library

[3]

R. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press/Addison-Wesley, 1999.

Digital Library

[4]

S. Burkhardt and J. Karkkianen. One-gapped q-Gram Filters for Levenshtein Distance, Springer, LNCS, pp. 225--234, 2002.

Digital Library

[5]

S. Burkhardt et al. Q-gram based database searching using a suffix array (QUASAR), Proc. RECOMB, pp. 77--83, 1999.

Digital Library

[6]

S. Chaudhuri, V. Ganti and L. Gravano. Selectivity Estimation for String Predicates: Overcoming the Underestimation Problem. Proc. ICDE, pp. 227--238, 2004.

Digital Library

[7]

S. Chaudhuri et al. Robust and Efficient Fuzzy Match for Online Data Cleaning. Proc. SIGMOD, pp. 313--324, 2003.

Digital Library

[8]

W. Cohen, P. Ravikmar and and S. Fienberg. A Comparison of String Metrics for Matching Names and Records. Proc. of KDD Workshop on Data Cleaning, pp. 73--78, 2003.

[9]

R. Dixon and T. Martin. Automatic Speech and Speaker Recognition. IEEE Press, 1979.

Digital Library

[10]

Flamingo Project, http://www.ics.uci.edu/flamingo/.

[11]

L. Gravano et al. Approximate string joins in a database (almost) for free. Proc. VLDB, pp. 491--500, 2001.

Digital Library

[12]

D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge Univ. Press, 1997.

Digital Library

[13]

B. Hore et al. Indexing Text Data under Space Constraints. Proc. CIKM, pp. 198--207, 2004.

Digital Library

[14]

H. V. Jagadish, R. T. Ng and D. Srivastava. Substring Selectivity Estimation, Proc. PODS, pp. 249--260, 1999.

Digital Library

[15]

L. Jin and C. Li. Selectivity Estimation for Fuzzy String Predicates in Large Data Sets, Proc. VLDB, pp. 397--408, 2005.

Digital Library

[16]

L. Jin et al. Indexing Mixed Types for Approximate Retrieval, Proc. VLDB, pp. 793--804, 2005.

Digital Library

[17]

P. Krishnan, J. S. Vitter and B. Iyer. Estimating Alphanumeric Selectivity in the Presence of Wildcards. Proc. SIGMOD, pp. 282--293, 1996.

Digital Library

[18]

M. Ley. DBLP, http://www.fnformatick.uni-tier.de/ley/db.

[19]

L. Lim et al. XPathLearner: An on-line self-tuning Markov histogram for XML path selectivity estimation. Proc. VLDB, pp. 442--453, 2002.

Digital Library

[20]

G. Navarro. A Guided Tour to Approximate String Matching. ACM Computing Survey, 33, pp. 31--88, 2001.

Digital Library

[21]

G. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. Journal of Experimental Algorithmics, volume 5, article 4, 2000.

Digital Library

[22]

S. Sarawagi and A. Bhamidipaty. Interactive Deduplication Using Active Learning. Proc. VLDB, pp. 269--278, 2002.

Digital Library

[23]

Winkler, W. E. The State of Record Linkage and Current Research Problems. Statistics of Income Division, Internal Revenue Service Publication R99/04.

Cited By

Aytimur MReiner SWörteler LChondrogiannis TGrossniklaus M(2024)LPLM: A Neural Language Model for Cardinality Estimation of LIKE-QueriesProceedings of the ACM on Management of Data10.1145/36393092:1(1-25)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639309
Kwon SJung WShim K(2022)Cardinality estimation of approximate substring queries using deep learningProceedings of the VLDB Endowment10.14778/3551793.355185915:11(3145-3157)Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.14778/3551793.3551859
Shetiya SThirumuruganathan SKoudas NDas G(2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
https://dl.acm.org/doi/10.14778/3436905.3436907
Show More Cited By

Extending q-grams to estimate selectivity of string matching with low edit distance
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

Approximate String Matching with $q$-grams and Maximal Matches
The string edit distance matching problem with moves

The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes, and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit ...
The string edit distance matching problem with moves
SODA '02: Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms

The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image DL Hosted proceedings

VLDB '07: Proceedings of the 33rd international conference on Very large data bases

September 2007

1443 pages

ISBN:9781595936493

General Chairs:
Wolfgang Klas
University of Vienna, Austria
,
Erich J. Neuhold
Research Studio Digital Memory Engineering, Austria

Sponsors

Yahoo! Research
Google Inc.
SAP
Intel: Intel
Microsoft Research: Microsoft Research
ORACLE: ORACLE
Connex.cc
HP invent
WKO
IBM: IBM

Publisher

VLDB Endowment

Publication History

Published: 23 September 2007

Qualifiers

Research-article

Funding Sources

College Information Technology Research Center Support Program

Conference

VLDB '07

Sponsor:

Intel
Microsoft Research
ORACLE
IBM

VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases

September 23 - 27, 2007

Vienna, Austria

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

21
Total Citations
View Citations
362
Total Downloads

Downloads (Last 12 months)0
Downloads (Last 6 weeks)0

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Aytimur MReiner SWörteler LChondrogiannis TGrossniklaus M(2024)LPLM: A Neural Language Model for Cardinality Estimation of LIKE-QueriesProceedings of the ACM on Management of Data10.1145/36393092:1(1-25)Online publication date: 26-Mar-2024
https://dl.acm.org/doi/10.1145/3639309
Kwon SJung WShim K(2022)Cardinality estimation of approximate substring queries using deep learningProceedings of the VLDB Endowment10.14778/3551793.355185915:11(3145-3157)Online publication date: 1-Jul-2022
https://dl.acm.org/doi/10.14778/3551793.3551859
Shetiya SThirumuruganathan SKoudas NDas G(2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
https://dl.acm.org/doi/10.14778/3436905.3436907
Wang YXiao CQin JCao XSun YWang WOnizuka MMaier DPottinger RDoan ATan WAlawini ANgo H(2020)Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning ApproachProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380570(1197-1212)Online publication date: 11-Jun-2020
https://dl.acm.org/doi/10.1145/3318464.3380570
Bartoli ADe Lorenzo AMedvet ETarlao F(2016)Predicting the effectiveness of pattern-based entity extractor inferenceApplied Soft Computing10.1016/j.asoc.2016.05.02346:C(398-406)Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1016/j.asoc.2016.05.023
Wang JLi GFeng J(2014)Extending string similarity join to tolerant fuzzy token matchingACM Transactions on Database Systems10.1145/253562839:1(1-45)Online publication date: 6-Jan-2014
https://dl.acm.org/doi/10.1145/2535628
Li GDeng DFeng J(2013)A partition-based method for string similarity joins with edit-distance constraintsACM Transactions on Database Systems10.1145/2487259.248726138:2(1-33)Online publication date: 4-Jul-2013
https://dl.acm.org/doi/10.1145/2487259.2487261
Kim YShim KRoss KSrivastava DPapadias D(2013)Efficient top-k algorithms for approximate substring matchingProceedings of the 2013 ACM SIGMOD International Conference on Management of Data10.1145/2463676.2465324(385-396)Online publication date: 22-Jun-2013
https://dl.acm.org/doi/10.1145/2463676.2465324
Wagner ABicer VTran TGuerrini GPaton N(2013)Selectivity estimation for hybrid queries over text-rich data graphsProceedings of the 16th International Conference on Extending Database Technology10.1145/2452376.2452421(383-394)Online publication date: 18-Mar-2013
https://dl.acm.org/doi/10.1145/2452376.2452421
Li GWang JLi CFeng JHersh WCallan JMaarek YSanderson M(2012)Supporting efficient top-k queries in type-ahead searchProceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval10.1145/2348283.2348333(355-364)Online publication date: 12-Aug-2012
https://dl.acm.org/doi/10.1145/2348283.2348333
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents