[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

String similarity joins: an experimental evaluation

Published: 01 April 2014 Publication History

Abstract

String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the recent two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used for various scenarios. To address this problem, in this paper we provide a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights about the strengths and weaknesses of existing similarity join algorithms which can guide practitioners to select appropriate algorithms for various scenarios.

References

[1]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.
[2]
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.
[3]
A. Behm, C. Li, and M. J. Carey. Answering approximate string queries on large data sets using external memory. In ICDE, pages 888--899, 2011.
[4]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5]
P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426--435, 1997.
[6]
D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD, 2014.
[7]
D. Deng, G. Li, J. Feng, and W.-S. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013.
[8]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001.
[9]
J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string similarity joins. In SIGMOD Conference, pages 327--338, 2010.
[10]
Y. Kim and K. Shim. Parallel top-k similarity join algorithms using mapreduce. In ICDE, pages 510--521, 2012.
[11]
H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009.
[12]
H. Lee, R. T. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338--349, 2011.
[13]
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008.
[14]
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.
[15]
X. Lian and L. Chen. Set similarity join on probabilistic data. PVLDB, 3(1):650--659, 2010.
[16]
A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB, 5(8):704--715, 2012.
[17]
J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033--1044, 2011.
[18]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004.
[19]
V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012.
[20]
B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007. http://fastss.csg.uzh.ch/.
[21]
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD Conference, pages 495--506, 2010.
[22]
J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219--1230, 2010.
[23]
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012.
[24]
W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013.
[25]
C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008.
[26]
C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, pages 916--927, 2009.
[27]
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008.
[28]
Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915--926, 2010.

Cited By

View all
  • (2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
  • (2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 8-Nov-2024
  • (2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 6-Aug-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 7, Issue 8
April 2014
60 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 April 2014
Published in PVLDB Volume 7, Issue 8

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)22
  • Downloads (Last 6 weeks)3
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Levenshtein distance embedding with poisson regression for DNA storageProceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artificial Intelligence10.1609/aaai.v38i14.29509(15796-15804)Online publication date: 20-Feb-2024
  • (2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 8-Nov-2024
  • (2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 6-Aug-2024
  • (2024)Scalable Distributed Inverted List Indexes in Disaggregated MemoryProceedings of the ACM on Management of Data10.1145/36549742:3(1-27)Online publication date: 30-May-2024
  • (2024)Discovering Graph Generating Dependencies for Property Graph ProfilingProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679764(2067-2076)Online publication date: 21-Oct-2024
  • (2024)Reasoning on property graphs with graph generating dependenciesInformation Sciences: an International Journal10.1016/j.ins.2024.120675672:COnline publication date: 18-Jul-2024
  • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 9-Jul-2024
  • (2024)Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big DataGood Practices and New Perspectives in Information Systems and Technologies10.1007/978-3-031-60328-0_18(181-191)Online publication date: 16-May-2024
  • (2023)PMMTss: A Parallel Multi-Way Merging-Based Trajectory Similarity Search for a Million Metro PassengersApplied Sciences10.3390/app1313798813:13(7988)Online publication date: 7-Jul-2023
  • (2023)Benchmarking Filtering Techniques for Entity Resolution2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00389(653-666)Online publication date: Apr-2023
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media