research-article

String similarity measures and joins with synonyms

Authors:

Haiyong Wang

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 373 - 384

https://doi.org/10.1145/2463676.2465313

Published: 22 June 2013

Abstract

A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching.
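The gap between syntactic and synonym-aware similarity can be illustrated with a minimal sketch. It assumes word-level tokens, Jaccard similarity, and a small hypothetical synonym table (`SYNONYMS`); it expands both strings with every applicable rule, which is a simplification for illustration and not necessarily the measure defined in the paper.

```python
def tokens(s):
    """Split a string into a set of lowercase word tokens."""
    return set(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def expand(toks, synonyms):
    """Add the right-hand side of every rule whose left-hand side is in toks."""
    out = set(toks)
    for lhs, rhs in synonyms.items():
        if lhs in toks:
            out.add(rhs)
    return out


# Hypothetical synonym rules, for illustration only.
SYNONYMS = {"bill": "william", "intl": "international"}

s1, s2 = "Bill Gates", "William Gates"
plain = jaccard(tokens(s1), tokens(s2))
with_syn = jaccard(expand(tokens(s1), SYNONYMS), expand(tokens(s2), SYNONYMS))
print(plain, with_syn)
```

Under these assumptions the purely syntactic score is 1/3 (only "gates" is shared among three distinct tokens), while the expanded score is 2/3, because "william" is added to the first token set.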

In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because computing similarity with synonyms, while expressive, is computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator that approximates the number of candidates, enabling online selection of signature filters to further improve efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time, making our method attractive both in theory and in practice. Finally, an empirical study of the algorithms verifies the effectiveness and efficiency of our approach.
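The idea of combining signature and length filtering can be sketched with the standard prefix-filtering technique for Jaccard joins. This is not the SI-tree and it ignores synonyms; the function name `similarity_join`, the word-level tokenization, and the frequency-based token order are assumptions made for illustration, and records are assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

EPS = 1e-9  # guards against floating-point error in threshold arithmetic


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def similarity_join(records, t):
    """Return index pairs (i, j), i < j, whose token sets have Jaccard >= t."""
    sets = [set(r.lower().split()) for r in records]
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(tok for s in sets for tok in s)
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    # Signature = prefix of length |s| - ceil(t * |s|) + 1 under the global order.
    # Two sets with Jaccard >= t must share at least one prefix token.
    index = defaultdict(set)  # token -> ids of records whose prefix contains it
    for i, toks in enumerate(ordered):
        prefix_len = len(toks) - math.ceil(t * len(toks) - EPS) + 1
        for tok in toks[:prefix_len]:
            index[tok].add(i)

    candidates = set()
    for ids in index.values():
        for i, j in combinations(sorted(ids), 2):
            small, large = sorted((len(sets[i]), len(sets[j])))
            # Length filter: Jaccard >= t implies small >= t * large.
            if small >= t * large - EPS:
                candidates.add((i, j))

    # Verification: compute the exact similarity only for surviving candidates.
    return sorted(p for p in candidates if jaccard(sets[p[0]], sets[p[1]]) >= t)


records = [
    "international business machines corp",
    "international business machines corporation",
    "business machines",
    "apple inc",
]
result = similarity_join(records, 0.6)
print(result)
```

The two filters only prune pairs that cannot reach the threshold, so the final verification step sees far fewer pairs than a nested-loop comparison of all records would.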

Index Terms

String similarity measures and joins with synonyms
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Information retrieval query processing

Information & Contributors

Information

Published In

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

June 2013

1322 pages

ISBN:9781450320375

DOI:10.1145/2463676

General Chairs: Kenneth Ross (Columbia University), Divesh Srivastava (AT&T Research)

Program Chair: Dimitris Papadias (HKUST)

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'13

Sponsor:

SIGMOD

SIGMOD/PODS'13: International Conference on Management of Data

June 22 - 27, 2013

New York, New York, USA

Acceptance Rates

SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 56

Total Downloads: 947

Downloads (Last 12 months): 36

Downloads (Last 6 weeks): 4

Reflects downloads up to 25 Dec 2024

Citations

Cited By

Wu J, Tang D, Chalapathi N, Chambers T, Ciccolini J, Phillips C, Pickoff-White L, Parameswaran A (2024). Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching. Proceedings of the VLDB Endowment 17:12 (4104-4116). Online publication date: 1-Aug-2024
https://dl.acm.org/doi/10.14778/3685800.3685830

Zheng L, Xiao Q, Cai X (2024). A Universal Sketch for Estimating Heavy Hitters and Per-Element Frequency Moments in Data Streams with Bounded Deletions. Proceedings of the ACM on Management of Data 2:6 (1-28). Online publication date: 20-Dec-2024
https://dl.acm.org/doi/10.1145/3698799

Yokoyama H, Iwamoto K, Takehara I, Ando K, Kamei H (2024). NLPDedup: Using Natural Language Processing for Data Deduplication. 2024 16th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI) (115-120). Online publication date: 6-Jul-2024
https://doi.org/10.1109/IIAI-AAI63651.2024.00031

Dong G, Bate A, Haguinet F, Westman G, Dürlich L, Hviid A, Sessa M (2023). Optimizing Signal Management in a Vaccine Adverse Event Reporting System: A Proof-of-Concept with COVID-19 Vaccines Using Signs, Symptoms, and Natural Language Processing. Drug Safety 47:2 (173-182). Online publication date: 7-Dec-2023
https://doi.org/10.1007/s40264-023-01381-6

Pan Z, Pan G, Monti A (2022). Semantic-Similarity-Based Schema Matching for Management of Building Energy Data. Energies 15:23 (8894). Online publication date: 24-Nov-2022
https://doi.org/10.3390/en15238894

Aguilar J, Diaz F, Altamiranda J, Garcia N, Pinto A (2022). An Adaptive System for Emerging Serious Games Using a Swarm Intelligence Algorithm. IEEE Transactions on Games 14:4 (598-609). Online publication date: Dec-2022
https://doi.org/10.1109/TG.2021.3118273

Xiao G, Wang J, Lin C, Zaniolo C (2022). Highly Efficient String Similarity Search and Join over Compressed Indexes. 2022 IEEE 38th International Conference on Data Engineering (ICDE) (232-244). Online publication date: May-2022
https://doi.org/10.1109/ICDE53745.2022.00022

Kosa V, Ermolayev V (2022). Related Work and Our Approach. Terminology Saturation (7-39). Online publication date: 16-Feb-2022
https://doi.org/10.1007/978-981-16-8630-6_2

Elyashar A, Puzis R, Fire M (2021). It Runs in the Family: Unsupervised Algorithm for Alternative Name Suggestion Using Digitized Family Trees. IEEE Transactions on Knowledge and Data Engineering (1-1). Online publication date: 2021
https://doi.org/10.1109/TKDE.2021.3096670

Song G, Shim K, Lee H (2021). Substring Similarity Search with Synonyms. 2021 IEEE 37th International Conference on Data Engineering (ICDE) (2003-2008). Online publication date: Apr-2021
https://doi.org/10.1109/ICDE51399.2021.00191
