More Web Proxy on the site http://driver.im/

research-article

String similarity joins: an experimental evaluation

Authors:

Wen-Syan LiAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 7, Issue 8

Pages 625 - 636

https://doi.org/10.14778/2732296.2732299

Published: 01 April 2014 Publication History

Abstract

String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the recent two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used for various scenarios. To address this problem, in this paper we provide a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights about the strengths and weaknesses of existing similarity join algorithms which can guide practitioners to select appropriate algorithms for various scenarios.

References

[1]

A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.

Digital Library

[2]

R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.

Digital Library

[3]

A. Behm, C. Li, and M. J. Carey. Answering approximate string queries on large data sets using external memory. In ICDE, pages 888--899, 2011.

Digital Library

[4]

S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.

Digital Library

[5]

P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426--435, 1997.

Digital Library

[6]

D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD, 2014.

Digital Library

[7]

D. Deng, G. Li, J. Feng, and W.-S. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013.

Digital Library

[8]

L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001.

Digital Library

[9]

J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string similarity joins. In SIGMOD Conference, pages 327--338, 2010.

Digital Library

[10]

Y. Kim and K. Shim. Parallel top-k similarity join algorithms using mapreduce. In ICDE, pages 510--521, 2012.

Digital Library

[11]

H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009.

Digital Library

[12]

H. Lee, R. T. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338--349, 2011.

Digital Library

[13]

C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008.

Digital Library

[14]

G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.

Digital Library

[15]

X. Lian and L. Chen. Set similarity join on probabilistic data. PVLDB, 3(1):650--659, 2010.

Digital Library

[16]

A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB, 5(8):704--715, 2012.

Digital Library

[17]

J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033--1044, 2011.

Digital Library

[18]

S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004.

Digital Library

[19]

V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012.

Digital Library

[20]

B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007. http://fastss.csg.uzh.ch/.

[21]

R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD Conference, pages 495--506, 2010.

Digital Library

[22]

J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219--1230, 2010.

Digital Library

[23]

J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012.

Digital Library

[24]

W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013.

Digital Library

[25]

C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008.

Digital Library

[26]

C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, pages 916--927, 2009.

Digital Library

[27]

C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008.

Digital Library

[28]

Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915--926, 2010.

Digital Library

Cited By

Wei XGuo ASun SWei MYu WWooldridge MDy JNatarajan S(2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
https://dl.acm.org/doi/10.1609/aaai.v38i14.29509
Wu JTang DChalapathi NChambers TCiccolini JPhillips CPickoff-White LParameswaran A(2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 8-Nov-2024
https://dl.acm.org/doi/10.14778/3685800.3685830
Chang YTanin ECong GJensen CQi J(2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 6-Aug-2024
https://dl.acm.org/doi/10.14778/3665844.3665858
Show More Cited By

String similarity joins: an experimental evaluation
1. Information systems
  1. Data management systems
    1. Database management system engines
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

String similarity measures and joins with synonyms
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two ...
Top-k String Similarity Joins
SSDBM '20: Proceedings of the 32nd International Conference on Scientific and Statistical Database Management

Top-k joins have been extensively studied in relational databases as ranking operations when every object has, among others, at least one ranking attribute. However, the focus has mostly been the case when the join attributes are of primitive data types ...
Similarity Joins

Similarity Joins are extensively used in multiple application domains and are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 7, Issue 8

April 2014

60 pages

ISSN:2150-8097

Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 April 2014

Published in PVLDB Volume 7, Issue 8

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

108
Total Citations
View Citations
456
Total Downloads

Downloads (Last 12 months)22
Downloads (Last 6 weeks)3

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Wei XGuo ASun SWei MYu WWooldridge MDy JNatarajan S(2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
https://dl.acm.org/doi/10.1609/aaai.v38i14.29509
Wu JTang DChalapathi NChambers TCiccolini JPhillips CPickoff-White LParameswaran A(2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 8-Nov-2024
https://dl.acm.org/doi/10.14778/3685800.3685830
Chang YTanin ECong GJensen CQi J(2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 6-Aug-2024
https://dl.acm.org/doi/10.14778/3665844.3665858
Widmoser MKocher DAugsten N(2024)Scalable Distributed Inverted List Indexes in Disaggregated MemoryProceedings of the ACM on Management of Data10.1145/36549742:3(1-27)Online publication date: 30-May-2024
https://dl.acm.org/doi/10.1145/3654974
Shimomura LYakovets NFletcher GSerra ESpezzano F(2024)Discovering Graph Generating Dependencies for Property Graph ProfilingProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679764(2067-2076)Online publication date: 21-Oct-2024
https://dl.acm.org/doi/10.1145/3627673.3679764
Shimomura LYakovets NFletcher G(2024)Reasoning on property graphs with graph generating dependenciesInformation Sciences: an International Journal10.1016/j.ins.2024.120675672:COnline publication date: 18-Jul-2024
https://dl.acm.org/doi/10.1016/j.ins.2024.120675
Neuhof FFisichella MPapadakis GNikoletos KAugsten NNejdl WKoubarakis M(2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
https://dl.acm.org/doi/10.1007/s00778-024-00868-7
Rozinek OBorkovcova MMares J(2024)Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big DataGood Practices and New Perspectives in Information Systems and Technologies10.1007/978-3-031-60328-0_18(181-191)Online publication date: 16-May-2024
https://doi.org/10.1007/978-3-031-60328-0_18
Huang WXiong WWang X(2023)PMMTss: A Parallel Multi-Way Merging-Based Trajectory Similarity Search for a Million Metro PassengersApplied Sciences10.3390/app1313798813:13(7988)Online publication date: 7-Jul-2023
https://doi.org/10.3390/app13137988
Papadakis GFisichella MSchoger FMandilaras GAugsten NNejdl W(2023)Benchmarking Filtering Techniques for Entity Resolution2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00389(653-666)Online publication date: Apr-2023
https://doi.org/10.1109/ICDE55515.2023.00389
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents