[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

String similarity joins: an experimental evaluation

Published: 01 April 2014 Publication History

Abstract

String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the recent two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used for various scenarios. To address this problem, in this paper we provide a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights about the strengths and weaknesses of existing similarity join algorithms which can guide practitioners to select appropriate algorithms for various scenarios.

References

[1]
A. Arasu, V. Ganti, and R. Kaushik. Efficient exact set-similarity joins. In VLDB, pages 918--929, 2006.
[2]
R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007.
[3]
A. Behm, C. Li, and M. J. Carey. Answering approximate string queries on large data sets using external memory. In ICDE, pages 888--899, 2011.
[4]
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, page 5, 2006.
[5]
P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426--435, 1997.
[6]
D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD, 2014.
[7]
D. Deng, G. Li, J. Feng, and W.-S. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013.
[8]
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001.
[9]
J. Jestes, F. Li, Z. Yan, and K. Yi. Probabilistic string similarity joins. In SIGMOD Conference, pages 327--338, 2010.
[10]
Y. Kim and K. Shim. Parallel top-k similarity join algorithms using mapreduce. In ICDE, pages 510--521, 2012.
[11]
H. Lee, R. T. Ng, and K. Shim. Power-law based estimation of set similarity join size. PVLDB, 2(1):658--669, 2009.
[12]
H. Lee, R. T. Ng, and K. Shim. Similarity join size estimation using locality sensitive hashing. PVLDB, 4(6):338--349, 2011.
[13]
C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008.
[14]
G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011.
[15]
X. Lian and L. Chen. Set similarity join on probabilistic data. PVLDB, 3(1):650--659, 2010.
[16]
A. Metwally and C. Faloutsos. V-smart-join: A scalable mapreduce framework for all-pair similarity joins of multisets and vectors. PVLDB, 5(8):704--715, 2012.
[17]
J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033--1044, 2011.
[18]
S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD Conference, pages 743--754, 2004.
[19]
V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. PVLDB, 5(5):430--441, 2012.
[20]
B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007. http://fastss.csg.uzh.ch/.
[21]
R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using mapreduce. In SIGMOD Conference, pages 495--506, 2010.
[22]
J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219--1230, 2010.
[23]
J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012.
[24]
W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013.
[25]
C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008.
[26]
C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, pages 916--927, 2009.
[27]
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008.
[28]
Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD Conference, pages 915--926, 2010.

Cited By

View all
  • (2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 1-Aug-2024
  • (2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 1-May-2024
  • (2024)Scalable Distributed Inverted List Indexes in Disaggregated MemoryProceedings of the ACM on Management of Data10.1145/36549742:3(1-27)Online publication date: 30-May-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 7, Issue 8
April 2014
60 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 April 2014
Published in PVLDB Volume 7, Issue 8

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)24
  • Downloads (Last 6 weeks)3
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity MatchingProceedings of the VLDB Endowment10.14778/3685800.368583017:12(4104-4116)Online publication date: 1-Aug-2024
  • (2024)Trajectory Similarity Measurement: An Efficiency PerspectiveProceedings of the VLDB Endowment10.14778/3665844.366585817:9(2293-2306)Online publication date: 1-May-2024
  • (2024)Scalable Distributed Inverted List Indexes in Disaggregated MemoryProceedings of the ACM on Management of Data10.1145/36549742:3(1-27)Online publication date: 30-May-2024
  • (2024)Discovering Graph Generating Dependencies for Property Graph ProfilingProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679764(2067-2076)Online publication date: 21-Oct-2024
  • (2024)Reasoning on property graphs with graph generating dependenciesInformation Sciences: an International Journal10.1016/j.ins.2024.120675672:COnline publication date: 1-Jun-2024
  • (2024)Open benchmark for filtering techniques in entity resolutionThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00868-733:5(1671-1696)Online publication date: 1-Sep-2024
  • (2024)MinJoin++: a fast algorithm for string similarity joins under edit distanceThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-023-00806-z33:2(281-299)Online publication date: 1-Mar-2024
  • (2022)TokenJoinProceedings of the VLDB Endowment10.14778/3574245.357426316:4(790-802)Online publication date: 1-Dec-2022
  • (2022)Bitmap filterInformation Systems10.1016/j.is.2019.10144988:COnline publication date: 21-Apr-2022
  • (2021)Learned Cardinality Estimation for Similarity QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452790(1745-1757)Online publication date: 9-Jun-2021
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media