Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter
Figure 1. The construction of a complete bipartite graph where every token vertex of the first record set, $\mathcal{R}_1$, is connected to every token vertex of the second record set, $\mathcal{R}_2$.
Figure 2. Maximum weighted bipartite matching of two records $\mathcal{R}_1$ and $\mathcal{R}_2$ and adjacent edges of token pairs $X_i, Y_j$ weighted by the normalized similarity metric $s_n(X_i, Y_j)$.
Figure 3. An example of a singularity of a trigram filter for $d = 1$ and $|X|_{sing} = 5$. Suppose the worst case for fixed $d = 1$ is the substitution in the position of character C destroying all trigrams {“CAC”, “ACH”, “CHE”}; hence, $t = 0$.
Figure 4. Sawtooth function of different string lengths $|X|$ and fixed $q$ and $\alpha$. The zero crossings of $t_\alpha$ illustrate the singularities of the Q-gram filter. This is similar to the results in [54].
Figure 5. Production system architecture of fuzzy search/matching engine.
Figure 6. Block diagram of the processing of records from the source in real-time by a two-stage system of Q-gram filter and FRMS.
Figure 7. Relative performance of selected similarity functions from the group of hybrid, edit, and Q-gram similarities.
Figure 8. Comparison of the relative performance of Q-gram filter+FRMS and hybrid similarity functions.
Figure 9. Comparison of the relative performance of Q-gram filter+FRMS in an ablation study.
Figure A1. Two examples of an asymmetric Monge–Elkan measure $sim_{ME}$.
Figure A2. Maximum weighted bipartite matching between two records $\boldsymbol{X}$ and $\boldsymbol{Y}$.
Abstract
1. Introduction
1.1. Phonetic Similarity
1.2. Heuristic Similarity
1.3. Edit Similarity
1.4. Q-Grams
1.5. Deep Learning
1.6. Hybrid Similarity
2. Similarity Space
- (S1) symmetry;
- (S2) triangle inequality;
- (S3) identity of indiscernibles;
- (S4) non-negativity.
- (N1) symmetry;
- (N2) triangle inequality;
- (N3) identity of indiscernibles;
- (N4) non-negativity.
3. Approximate Record Matching in Similarity Space
3.1. Problem Formulation
3.2. Survey of Current Token-Based Methods
- Fuzzy Dice similarity;
- Fuzzy cosine similarity;
- Fuzzy Jaccard similarity;
- Fuzzy overlap similarity.
- Second Threshold: These methods require two threshold parameters: a local threshold inside the internal (token-level) similarity function and a global threshold on the overall record similarity, which decides whether two records are a match. The global threshold can differ considerably from the local one. In addition, the local threshold of the internal function requires optimization and can reduce accuracy if chosen incorrectly; see the example in Appendix A.2.
- Not Entirely Fuzzy: A binary classifier at the token level decides whether a token pair counts as a match. Such strict classification rejects every token pair whose similarity falls below the local threshold and does not respect the natural, continuous human perception of overall similarity, because any similarity below the threshold is replaced by zero.
- Not a Similarity Metric: The triangle inequality (N2), and likewise (S2), is violated. This can be shown using the normalized Levenshtein similarity, which itself violates the triangle inequality, as discussed in the next subsection. For details, see Appendix A.3.
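The effect of the local token threshold described above can be illustrated with a minimal Python sketch. It assumes the usual fuzzy-overlap construction (maximum weighted bipartite matching over token pairs, with edges below the local threshold set to zero) and a length-normalized Levenshtein similarity as the internal function; both the normalization and all function names (`norm_sim`, `max_matching_weight`, `fuzzy_jaccard`) are illustrative assumptions, and the brute-force matching stands in for a proper assignment algorithm such as the Hungarian method.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def norm_sim(a: str, b: str) -> float:
    """Assumed internal similarity: 1 - lev / max length (not necessarily the paper's s_n)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def max_matching_weight(tokens_x, tokens_y, delta=0.0):
    """Brute-force maximum weighted bipartite matching (small token sets only).

    Edges whose similarity is below the local threshold delta contribute zero.
    """
    short, long_ = sorted((tokens_x, tokens_y), key=len)
    best = 0.0
    for perm in permutations(long_, len(short)):
        total = 0.0
        for a, b in zip(short, perm):
            s = norm_sim(a, b)
            total += s if s >= delta else 0.0
        best = max(best, total)
    return best


def fuzzy_jaccard(tokens_x, tokens_y, delta):
    """Fuzzy Jaccard: fuzzy overlap divided by the fuzzy union size."""
    overlap = max_matching_weight(tokens_x, tokens_y, delta)
    return overlap / (len(tokens_x) + len(tokens_y) - overlap)


x = ["jon", "smith"]
y = ["john", "smyth"]
with_threshold = fuzzy_jaccard(x, y, delta=0.78)    # "jon"/"john" (0.75) is zeroed
without_threshold = fuzzy_jaccard(x, y, delta=0.0)  # every token similarity is kept
```

In this example the pair "jon"/"john" has similarity 0.75, just below the local threshold of 0.78, so its contribution is discarded entirely and the record score drops sharply compared with the threshold-free case. The threshold-free call is shown only for contrast and is not claimed to reproduce the FRMS.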
3.3. Proposed Model of Fuzzy Record Similarity Metric (FRMS)
4. Optimal Q-Gram Filter
4.1. Count Filtering
4.2. Optimal Count Q-Gram Filter for Edit Similarity
4.3. Optimal Count Q-Gram Filter for Fuzzy Token Similarity
5. Results
5.1. Suggested Architecture
5.2. Experiment Setup and Evaluation Q-Gram Filter Efficiency
5.3. Time and Space Complexity
5.3.1. FRMS
5.3.2. Q-Gram Filter
6. Discussion
7. Conclusions
- Fuzzy Record Similarity Metric (FRMS): We developed the FRMS, a robust metric for measuring approximate record similarity. The FRMS satisfies the axioms of a similarity metric (symmetry, triangle inequality, identity of indiscernibles, and non-negativity), which makes it well suited to applications in text mining and cluster analysis. Our experiments demonstrate its superior performance compared to existing methods without the need for parameter tuning.
- Optimal Q-gram filter: We proposed an optimal Q-gram filtering method for token matching, ensuring the most efficient filtering based on shared token features. This filter serves as a foundational tool that is applicable to various similarity metrics.
- Approximate Q-gram filter: To enhance computational efficiency, we introduced an approximate Q-gram filter that operates in constant time. This approximation maintains high accuracy while significantly reducing processing time, as evidenced by our experimental results.
- Filter efficiency and properties: We analyzed the efficiency and key properties of the Q-gram filter, highlighting its effectiveness in relation to string lengths and similarity thresholds. Our analysis provides insights into optimizing filter performance for different applications.
- Padding extension of the Q-gram filter: We enhanced the Q-gram filter by incorporating padding techniques, which improve similarity measures by smoothing token boundaries and expanding the feature set. This extension leads to more accurate similarity assessments.
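The count-filtering idea and the effect of padding summarized above can be sketched in a few lines of Python. The sketch uses the classical count-filter bound (each edit operation destroys at most $q$ q-grams), not the optimal filter derived in this paper; the function names and the padding character `#` are illustrative assumptions. The string CACHE is taken from the trigram singularity figure caption.

```python
from collections import Counter


def qgrams(s: str, q: int, pad: bool = False, pad_char: str = "#"):
    """Return the list of overlapping q-grams, optionally padding both ends."""
    if pad:
        s = pad_char * (q - 1) + s + pad_char * (q - 1)
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def shared_qgrams(x: str, y: str, q: int, pad: bool = False) -> int:
    """Size of the multiset intersection of the two q-gram profiles."""
    cx, cy = Counter(qgrams(x, q, pad)), Counter(qgrams(y, q, pad))
    return sum((cx & cy).values())


def count_threshold(len_x: int, len_y: int, q: int, d: int, pad: bool = False) -> int:
    """Classical lower bound on shared q-grams for edit distance <= d.

    Unpadded strings have L - q + 1 q-grams, padded ones L + q - 1;
    each edit operation can destroy at most q of them.
    """
    longest = max(len_x, len_y)
    n_grams = longest + q - 1 if pad else longest - q + 1
    return n_grams - q * d


def passes_filter(x: str, y: str, q: int, d: int, pad: bool = False) -> bool:
    """Candidate pairs below the threshold can be discarded without verification."""
    return shared_qgrams(x, y, q, pad) >= count_threshold(len(x), len(y), q, d, pad)


# Substituting the middle character of "CACHE" destroys all three trigrams.
t_plain = count_threshold(5, 5, q=3, d=1)               # threshold is 0: nothing is pruned
n_plain = shared_qgrams("CACHE", "CAXHE", q=3)           # no shared trigrams
t_padded = count_threshold(5, 5, q=3, d=1, pad=True)     # padding raises the threshold
n_padded = shared_qgrams("CACHE", "CAXHE", q=3, pad=True)
```

Without padding the threshold for a five-character string with one allowed edit is zero, so the filter cannot reject any candidate; with padding the boundary q-grams survive the substitution and the threshold becomes positive.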
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Example of Asymmetric Matching
Appendix A.2. Example of Maximum Weighted Bipartite Matching
Appendix A.3. Example of Violation of the Triangle Inequality
- (D1) symmetry;
- (D2) triangle inequality;
- (D3) identity of indiscernibles.
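A violation of the triangle inequality (D2), $d(x, z) \le d(x, y) + d(y, z)$, is easy to check numerically. The following sketch assumes the Levenshtein distance normalized by the length of the longer string; the three strings are chosen here for illustration and are not necessarily the example used in this appendix.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def norm_dist(a: str, b: str) -> float:
    """Assumed normalization: edit distance divided by the longer string length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


x, y, z = "ab", "abc", "bc"
d_xy = norm_dist(x, y)  # one insertion over length 3
d_yz = norm_dist(y, z)  # one deletion over length 3
d_xz = norm_dist(x, z)  # two edits over length 2

# (D2) would require d_xz <= d_xy + d_yz; here it fails.
violates_triangle = d_xz > d_xy + d_yz
```

The detour through the longer string "abc" is cheaper than the direct path because the normalization divides by a larger length, which is exactly why this normalized distance is not a metric.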
References
- Raeesi, M.; Asadpour, M.; Shakery, A. Swash: A collective personal name matching framework. Expert Syst. Appl. 2020, 147, 113115.
- Bilenko, M.; Mooney, R.; Cohen, W.; Ravikumar, P.; Fienberg, S. Adaptive Name Matching in Information Integration. IEEE Intell. Syst. 2003, 18, 16–23.
- Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 2008, 19, 1–16.
- Gali, N.; Istodor, R.M.; Hostettler, D.; Fränti, P. Framework for syntactic string similarity measures. Expert Syst. Appl. 2019, 129, 169–185.
- Russell, R.C. Index. U.S. Patent 1,261,167, 2 April 1918.
- Philips, L. Hanging on the Metaphone. Comput. Lang. Mag. 1990, 7, 39–44.
- Philips, L. The Double Metaphone Search Algorithm. C/C++ Users J. 2000, 18, 38–43.
- Lait, A.; Randell, B. An Assessment of Name Matching Algorithms; Tech. Rep. 176; Department of Computer Science, University of Newcastle upon Tyne: Newcastle upon Tyne, UK, 1993.
- Gadd, T.N. PHONIX: The algorithm. Program Autom. Libr. Inf. Syst. 1990, 24, 363–366.
- Taft, R.L. Name Search Techniques. In Technical Report Special Report No. 1, State Identification and Intelligence System; Bureau of Systems Development: Albany, NY, USA, 1970.
- Holmes, D.; McCabe, C.M. Improving precision and recall for soundex retrieval. In Proceedings of the IEEE International Conference on Information Technology: Coding and Computing (ITCC), Las Vegas, NV, USA, 8–10 April 2002.
- Christen, P. A Comparison of Personal Name Matching: Techniques and Practical Issues; Technical Reports; The Australian National University: Canberra, Australia, 2006.
- Jaro, M.A. Advances in record linkage methodology as applied to the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 1989, 84, 414–420.
- Winkler, W.E. String Comparator Metrics and Enhanced Decision Rules in the Fellegi–Sunter Model of Record Linkage. In Proceedings of the Annual Meeting of the American Statistical Association, Anaheim, CA, USA, 6–9 August 1990; pp. 354–359.
- Cohen, W.W.; Ravikumar, P.; Fienberg, S.E. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 9–10 August 2003; pp. 73–78.
- Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
- Brill, E.; Moore, R.C. An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong, China, 3–6 October 2000; pp. 32–58.
- Damerau, F.J. A technique for computer detection and correction of spelling errors. Commun. ACM 1967, 7, 659–664.
- Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
- Smith, T.F.; Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol. 1981, 147, 195–197.
- Gotoh, O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982, 162, 705–708.
- Ristad, E.S.; Yianilos, P.N. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 522–532.
- Marzal, A.; Vidal, E. Computation of Normalized Edit Distance and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 926–932.
- Li, Y.; Liu, B. A Normalized Levenshtein Distance Metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
- Weigel, A.; Fein, F. Normalizing the weighted edit distance. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Jerusalem, Israel, 9–13 October 1994; p. 3.
- Kondrak, G. N-Gram Similarity and Distance. In String Processing and Information Retrieval; Consens, M., Navarro, G., Eds.; SPIRE 2005: Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3772, pp. 115–126.
- Ukkonen, E. Approximate String Matching with Q-grams and Maximal Matches. Theor. Comput. Sci. 1992, 92, 191–211.
- Christen, P. A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Trans. Knowl. Data Eng. 2012, 24, 1537–1555.
- Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 2020, 53, 31.
- Yu, M.; Li, G.; Deng, D. String similarity search and join: A survey. Front. Comput. Sci. 2016, 10, 399–417.
- Hosseini, K.; Nanni, F.; Ardanuy, M.C. DeezyMatch: A flexible deep learning approach to fuzzy string matching. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 62–69.
- Ferragina, P.; Scaiella, U.T. On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 1625–1628.
- Santos, R.; Murrieta-Flores, P.; Calado, P.; Martins, B. Toponym matching through deep neural networks. Int. J. Geogr. Inf. Sci. 2018, 32, 324–348.
- Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Qiu, L.; Yu, J.; Pu, Q.; Xiang, C. Knowledge entity learning and representation for ontology matching based on deep neural networks. Clust. Comput. 2017, 20, 969–977.
- Jimenez, S.; Becerra, C.; Gelbukh, A.; Gonzales, F. Generalized Mongue–Elkan Method for Approximate Text String Comparison. In Proceedings of the Computational Linguistics and Intelligent Text Processing, 10th International Conference, Mexico City, Mexico, 1–7 March 2009; pp. 559–570.
- Monge, A.E.; Elkan, C.P. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, USA, 2–4 August 1996; pp. 267–270.
- Moreau, E.; Yvon, F.; Cappé, O. Robust similarity measures for named entities matching. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK, 18–22 August 2008; Volume 1, pp. 593–600.
- Wang, J.; Li, G.; Feng, J. Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join. In Proceedings of the IEEE 27th International Conference on Data Engineering, Hannover, Germany, 11–16 April 2011; Volume 39, p. 7.
- Wang, J.; Li, G.; Feng, J. Extending string similarity join to tolerant fuzzy token matching. ACM Trans. Database Syst. 2014, 39, 7.
- Rozinek, O.; Mareš, J. The Duality of Similarity and Metric Spaces. Appl. Sci. 2021, 11, 1910.
- Deng, D.; Kim, A.; Madden, S.; Stonebraker, M. SILKMOTH: An efficient method for finding related sets with maximum matching constraints. Proc. VLDB Endow. 2017, 10.
- Gragera, A.; Suppakitpaisarn, V. Relaxed triangle inequality ratio of the Sørensen–Dice and Tversky indexes. Theor. Comput. Sci. 2018, 718, 37.
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
- Kuhn, H.W. Variants of the Hungarian method for assignment problems. Nav. Res. Logist. Q. 1956, 3, 253–258.
- Munkres, J. Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 1957, 5, 32–38.
- Duan, R.; Pettie, S. Linear-Time Approximation for Maximum Weight Matching. J. ACM 2014, 61, 1–23.
- Edmonds, J.; Karp, R.M. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM 1972, 19, 31–33.
- Xiao, C.; Wang, W.; Lin, X.; Yu, J.X.; Wang, G. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 2011, 36, 1.
- Jokinen, P.; Ukkonen, E. Two algorithms for approximate string matching in static texts. In Proceedings of the International Symposium on Mathematical Foundations of Computer Science, Kazimierz Dolny, Poland, 9–13 September 1991; pp. 240–248.
- Yang, Z.; Yu, J.; Kitsuregawa, M. Fast Algorithms for Top-k Approximate String Matching. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence 2010, Atlanta, GA, USA, 11–15 July 2010; pp. 1467–1473.
- Rasmussen, K.R.; Stoye, J.; Myers, E.W. Efficient Q-gram Filters for Finding All ϵ-Matches over a Given Length. J. Comput. Biol. 2006, 13, 296–308.
- Grzebala, P.; Cheatham, M. Private record linkage: Comparison of selected techniques for name matching. Eur. Semant. Web Conf. 2016.
- Sababa, H.; Stassopoulou, A. A Classifier to Distinguish Between Cypriot Greek and Standard Modern Greek. In Proceedings of the 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), Valencia, Spain, 15–18 October 2018.
- Rozinek, O.; Borkovcova, M.; Mareš, J. BipartiteJoin: Optimal Similarity Join for Fuzzy Bipartite Matching. In Proceedings of the World Conference on Information Systems and Technologies, Pisa, Italy, 4–6 April 2023; pp. 171–180.
- Rozinek, O.; Borkovcova, M.; Mareš, J. Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data. In Proceedings of the World Conference on Information Systems and Technologies, Pisa, Italy, 4–6 April 2023; pp. 181–191.
- Hu, T.C.; Kahng, A.B. Linear and Integer Programming Made Easy; Springer: Berlin, Germany, 2016.
- Okazaki, N.; Tsuji, J. Simple and Efficient Algorithm for Approximate Dictionary Matching. In Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010.
- Andoni, A.; Deza, M.; Gupta, A.; Indyk, P.; Raskhodnikova, A. Lower bounds for embedding edit distance into normed spaces. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, USA, 12–14 January 2003.
Similarity Measure | Subject to |
---|---|
Fuzzy Dice | |
Fuzzy Cosine | |
Fuzzy Jaccard | |
Fuzzy Overlap/FRMS | |
Name | Number of Strings | Name | Number of Strings |
---|---|---|---|
Animal | 5709 | Game | 911 |
Bird Kunkel | 336 | Park | 654 |
Bird Nybird | 982 | Restaurant | 863 |
Bird Scott1 | 38 | Ucd-people | 90 |
Bird Scott2 | 719 | Census | 841 |
Business | 2139 | | |
Similarity | F1-Score | Similarity | F1-Score |
---|---|---|---|
FRMS | 85.09% | Smith–Waterman | 75.71% |
3-gram filter+FRMS | 85.01% | Smith–Waterman-Gotoh | 75.54% |
2-gram filter+FRMS | 84.88% | Jaro | 75.29% |
Fuzzy Jaccard (Levenshtein) | 84.17% | Overlap 3-gram | 73.21% |
Jaro–Winkler | 81.45% | Jaccard 2-gram | 71.05% |
L2 Monge–Elkan (Levenshtein) | 80.80% | Dice 2-gram | 71.05% |
Damerau–Levenshtein | 76.86% | Jaccard 3-gram | 70.86% |
Levenshtein | 76.83% | Dice 3-gram | 70.86% |
Needleman–Wunsch | 76.25% | Overlap 2-gram | 66.92% |
Similarity | F1-Score | Similarity | F1-Score |
---|---|---|---|
FRMS | 85.09% | naive 2-gram ()+FRMS | 34.75% |
3-gram ()+FRMS | 85.01% | naive 2-gram ()+FRMS | 34.33% |
2-gram ()+FRMS | 84.88% | naive 2-gram ()+FRMS | 31.69% |
3-gram ()+FRMS | 84.84% | naive 3-gram ()+FRMS | 29.07% |
2-gram ()+FRMS | 84.74% | naive 3-gram ()+FRMS | 29.07% |
3-gram ()+FRMS | 84.71% | naive 3-gram ()+FRMS | 29.07% |
2-gram ()+FRMS | 84.38% | | |
Similarity | Elapsed Time (s) | Similarity | Elapsed Time (s) |
---|---|---|---|
Levenshtein | 13.426 | L2 Monge–Elkan (Levenshtein) | 14.209 |
Damerau–Levenshtein | 22.824 | Jaccard 2-gram | 10.542 |
Jaro | 3.902 | Jaccard 3-gram | 9.829 |
Jaro–Winkler | 3.772 | Dice 2-gram | 11.095 |
Needleman–Wunsch | 28.170 | Dice 3-gram | 10.717 |
Smith–Waterman | 28.600 | Overlap 3-gram | 10.251 |
FRMS | 13.474 | Overlap 2-gram | 11.549 |
Q-Gram Filter+FRMS | 0.220 | Fuzzy Jaccard () | 12.824 |
Similarity | Time Complexity | Space Complexity |
---|---|---|
Q-gram filter + FRMS | ||
FRMS | ||
Smith–Waterman-Gotoh | ||
Fuzzy Jaccard (Levenshtein) | ||
Jaro | ||
Jaro–Winkler | ||
L2 Monge–Elkan (Levenshtein) | ||
Damerau–Levenshtein | ||
Levenshtein | ||
Needleman–Wunsch | ||
Smith–Waterman | ||
Q-gram (2-gram, 3-gram) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Rozinek, O.; Marek, J.; Panuš, J.; Mareš, J. Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter. Algorithms 2025, 18, 150. https://doi.org/10.3390/a18030150