Open Access Article

Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter

Rozinet s.r.o., U Josefa 110, 532 10 Pardubice, Czech Republic

Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic

Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 166 34 Prague, Czech Republic

Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic

Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, 166 28 Prague, Czech Republic

Author to whom correspondence should be addressed.

Algorithms 2025, 18(3), 150; https://doi.org/10.3390/a18030150

Submission received: 16 September 2024 / Revised: 21 November 2024 / Accepted: 2 December 2024 / Published: 6 March 2025

(This article belongs to the Section Analysis of Algorithms and Complexity Theory)

Abstract

In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS utilizes a newly developed similarity space with favorable properties combined with a metric space, employing a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering over 99% of dissimilar records, with an approximately constant time complexity of $O (1)$. In the second stage, FRMS runs in polynomial time of approximately $O (n^{4})$ and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search that runs in real time.

Keywords:

fuzzy matching; Q-gram filter; approximate string matching; record linkage; entity resolution; similarity space

1. Introduction

Human reasoning and perception when comparing records are able to eliminate minor discrepancies better than computer algorithms. Methods that take into account various uncertainties in this comparison can achieve better results than algorithms based on exact string comparisons. Probabilistic and fuzzy approaches aim to become tools for implementing structures that model uncertainties in a way that is similar to human perception. From a biological perspective, people store information imprecisely in the brain, and such uncertainties arise in brain noise. Thus, imprecise information based on the modeling of human traits should, similarly, be incorporated in current search engines. Approximate string-matching algorithms have been developed to find imprecise similarities between records due to misspellings, typographical errors, phonetic errors, wrong input, misinterpretation, and other sources of imprecision [1]. For a short overview, we can group approximate string-matching algorithms into character-based (phonetics, heuristics, Q-gram, and edit similarity), hybrid similarity [2,3,4] (consisting of two levels of comparisons: the token level and the character level), and sequence-based (longest common sub-sequence and Ratcliff–Obershelp similarity) algorithms according to their features.

1.1. Phonetic Similarity

The large group of phonetic algorithms includes Soundex [5], Metaphone [6], Double Metaphone [7], Phonex [8], Phonix [9], NYSIIS [10], and Fuzzy Soundex [11]. These algorithms are characterized by a strong linguistic/phonetic dependence (usually on English), which leads to poor F measures [12], poor robustness against misspellings and typographical errors, and no notion of tokens. Due to their language specificity, they do not work well on multiple-language data sources. However, they are still implemented in many databases and software applications used by leading software companies. In our experience with IT companies, we have frequently encountered such algorithms being misapplied to datasets comprising various languages for which they are not intended.

1.2. Heuristic Similarity

The basic ideas of heuristics are based on linear programming, local searching, and transformations to a linear sum assignment problem with error correction. The choice of the edit distance metric varies depending on the particular heuristic applied. Jaro distance [13] and its modification, Jaro–Winkler distance [14], are well-known representatives that are very often used in the commercial sector for their simple implementation and high F measures [2,12,15]. They are appropriate only for short strings, e.g., names, and do not make assumptions with respect to tokens, as they are sensitive to any disorder in the sequence of tokens.

1.3. Edit Similarity

The edit distance between two strings represents the minimum cost required to perform a series of edit operations to convert one string into the other. Each operation has an associated cost, often 1. Well-known algorithms based on dynamic programming include the Hamming–Levenshtein [16] and Damerau–Levenshtein [17,18] algorithms for DNA or protein sequence comparison; these follow the methods proposed by Needleman and Wunsch [19], Smith and Waterman [20], Gotoh [21], and others. Learning the costs of specific edit transitions by supervised machine learning is also worth mentioning [22]. Unfortunately, these standard similarity metrics are not practically usable for a bag-of-words model, where permutations of words should also be involved. The other issue is normalization: the Levenshtein distance measures the absolute edit distance between terms and is therefore biased when words of different lengths are compared. Normalization factors that map the Levenshtein distance into the range $[0, 1]$ have been proposed [23,24,25]. However, some of these factors violate the axioms of a similarity metric, especially the triangle inequality.

1.4. Q-Grams

Q-grams (also called ‘n-grams’; see [26]) are short character sub-strings (of length q) of the record strings [27]. These methods have been applied in various forms, such as through the application of different approaches to spelling correction or the use of inverted indexes [28,29,30]. The Q-gram method is a favorite indexing method in enterprise and web search engines; it is used by tech companies with the largest market shares and capitalization. A comparison with other similarities yields quite good results; however, token-based algorithms outperform this method [12]. We explain this by the fact that Q-grams do not treat natural language encoding based on words; rather, they split the tokens into ‘artificial’ sub-strings, ignoring the information about which token generated the Q-gram. This concept achieves good results, with robustness to any disorder in the token sequence, but it comes at the cost of its overall accuracy. For this reason, such methods serve as a preprocessing step for more accurate approximate string-matching algorithms.
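The Q-gram idea described above can be made concrete with a minimal Python sketch. It is an illustration only: the padding symbol `#`, the default q = 2, and the function names are assumptions made here, not details taken from the cited methods.

```python
from collections import Counter

def qgrams(text: str, q: int = 2, pad: str = "#") -> Counter:
    """Return the multiset of q-grams of `text`, padded with q-1 pad symbols per side."""
    padded = pad * (q - 1) + text + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def shared_qgrams(a: str, b: str, q: int = 2) -> int:
    """Count the q-grams common to both strings (multiset intersection)."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())

# A single deleted character still leaves most bigrams in common.
print(shared_qgrams("john", "jon"))                # 3 of the 5 bigrams of "john"
# Swapping the token order only destroys the bigrams around the boundary.
print(shared_qgrams("john smith", "smith john"))   # 7 of 11 bigrams
```

The second call shows the robustness to token disorder mentioned above, and also why token information is lost: the shared bigrams carry no record of which word produced them.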

1.5. Deep Learning

An example of a deep-learning approach to string matching is given in [31], which introduces a software library for fuzzy string matching and candidate ranking. The method supports various deep neural network architectures, enabling both the training of new classifiers and the fine-tuning of existing models. That paper describes how the library was built and how the training dataset was tuned. The authors also compare their results with existing systems such as [32], which, however, relies on lookup tables, and with [33], which addresses toponym matching, that is, matching pairs of texts that refer to the same real-world location, a task that is common in the geosciences. The authors of [33] use gated recurrent units [34], a recurrent neural network used in modeling sequence data. An example of such sequence data is the sequence of bytes that represents each text being compared. Their approach uses the deep neural network architecture proposed by [35], where the parameters are learned from the training data. The whole model can then be trained end-to-end using back-propagation with the Adam optimization method proposed by [36]. That paper reports results on a large dataset collected from the GeoNames gazetteer at www.geonames.org.

Another perspective is presented by [37], who explore ontology matching using deep neural networks. They emphasize that some methods overlook higher-level correlations between different descriptions of the entities being evaluated. To address this, ref. [37] propose learning-based methods that account for such correlations. Ontology matching, as described in their study, involves measuring the similarity between two entities from distinct ontologies, where a pair of entities with high similarities is referred to as a mapping. Establishing such mappings is crucial for ensuring correct data exchange between ontology-based applications; yet, it is a challenging task due to the complexity of modern ontologies. By employing a deep learning model, their approach generates high-level abstract representations of the original data. This methodology is used to address the semantic heterogeneity problem, focusing on identifying semantically similar entities across different ontologies.

1.6. Hybrid Similarity

In structured documents, there is implicit information from the context, given by columns as a semantic topic. Each attribute is modeled using a bag-of-words model, where the context is very often not relevant in traditional databases and NoSQL storage. The previously mentioned methods are based on the comparison of characters in the complete string. A hybrid similarity method instead converts the string into a set of tokens, the so-called bag-of-words model, typically by splitting on a delimiter. This allows us to take the semantic meaning of the words into account and to process large texts. The comparison has two levels: a character-level comparison for each pair of tokens and a token-level comparison of all possible pairs of tokens. The similarity calculation is usually based on the Jaccard index or the Sørensen–Dice formula. For the token set, comparisons such as Monge–Elkan [38,39] or SoftTFIDF [15] are used. Unfortunately, these methods perform only an approximate combinatorial assignment and hence yield an asymmetric matching; see Appendix A.1. Maximum matching in a bipartite graph [40,41,42] is more suitable for practical use and achieves the best F-measures because it solves the optimal assignment problem and is symmetric. A significant advantage of these methods is that they preserve the encoding of natural language, particularly in terms of token resolution, treating words as distinct units of meaning. We find that this group of models deserves the most attention in research for its promising results.
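The asymmetry of greedy token assignment can be seen in a short sketch of a Monge–Elkan-style score. This is a simplified illustration under stated assumptions: the character-level similarity is the standard-library `difflib` ratio rather than any measure used in the cited works, and the records are made up.

```python
from difflib import SequenceMatcher

def token_sim(a: str, b: str) -> float:
    """Stand-in character-level similarity in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()

def monge_elkan(r1: list[str], r2: list[str]) -> float:
    """Average, over the tokens of r1, of the best similarity found in r2."""
    return sum(max(token_sim(x, y) for y in r2) for x in r1) / len(r1)

a = ["john"]
b = ["john", "smith"]
forward = monge_elkan(a, b)    # every token of `a` finds a perfect partner in `b`
backward = monge_elkan(b, a)   # "smith" has no good partner in `a`
print(forward, backward)
```

Because each token independently picks its best partner, the score depends on which record is scanned first, so `forward` and `backward` differ; an optimal one-to-one assignment does not have this problem.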

Unfortunately, those algorithms often have a quadratic time complexity, $O (n^{2})$, or even a cubic time complexity, $O (n^{3})$, or worse, $O (n^{4})$; this is especially the case for those providing the best recall/precision results on research datasets.

2. Similarity Space

Our goal is to design a similarity metric that could be used widely in data mining and text clustering methods and with great precision. This places demands on the properties of the similarity metric. We may ask ourselves the following questions: What is the ideal similarity function? What should it look like? What properties should it have? Compliance with the metric axioms is a very common requirement. As we have already shown, there are similarity metric axioms that define such nice properties, cf. [43]. We have also shown the duality of such similarity axioms with metric axioms. We can think of a similarity metric as equivalent to the axioms of an ‘inverse’ distance metric, or, vice versa, we can think of a distance metric as the ‘inversion’ of a similarity metric. Here, we recall some necessary definitions [43].

Definition 1

(Similarity Space [43]). Consider a non-empty set

X

and a function,

s : X \times X \to R

, which is a similarity metric if, for all elements,

x, y, z \in X

, it satisfies the following axioms:

(S1): $s (x, y) = s (y, x)$ (symmetry);
(S2): $s (x, z) + s (y, y) \geq s (x, y) + s (y, z)$ (triangle inequality);
(S3): $s (x, x) = s (x, y) = s (y, y) ⟺ x = y$ (identity of indiscernibles);
(S4): $s (x, y) \geq 0$ (non-negativity).

A similarity space is an ordered pair

(X, s)

Definition 2

(Normalized Similarity Metric). Let

X

be a non-empty set. A function,

s_{n} : X \times X \to [0, 1] \subset R

, is a normalized similarity metric if, for all elements,

x, y, z \in X

, it satisfies the following axioms:

(N1): $s_{n} (x, y) = s_{n} (y, x)$ (symmetry);
(N2): $s_{n} (x, z) + 1 \geq s_{n} (x, y) + s_{n} (y, z)$ (triangle inequality);
(N3): $s_{n} (x, y) = 1 ⟺ x = y$ (identity of indiscernibles);
(N4): $s_{n} (x, y) \geq 0$ (non-negativity).

A normalized similarity space is an ordered pair,

(X, s_{n})

Theorem 1

(Convex Combinations). Consider the vectors

x, y \in X^{m}

. A convex combination,

T_{C} : R^{m} \times X^{m} \times X^{m} \to R

, of normalized similarity metrics described by a vector,

α = (α_{1}, \dots, α_{m})

, of weight coefficients is again a normalized similarity metric:

\begin{aligned} s_{n} (x, y) &= T_{C} (α, x, y) = \sum_{i = 1}^{m} α_{i} s_{i} (x_{i}, y_{i}) \\ &= α_{1} s_{1} (x_{1}, y_{1}) + α_{2} s_{2} (x_{2}, y_{2}) + \dots + α_{m} s_{m} (x_{m}, y_{m}), \end{aligned}

(1)

such that

\sum_{i = 1}^{m} α_{i} = 1

and

0 \leq α_{i} \leq 1

Proof.

[43]. □

Theorem 2

(Generalized Rozinek Similarity [43]). Suppose we are given a distance metric,

d : X \times X \to R

, and a similarity metric,

s : X \times X \to R

. The generalized Rozinek similarity,

s_{R} : X \times X \to [0, 1] \subset R

, is a normalized similarity metric derived from an arbitrary distance metric,

d (x, y)

, along with positive self-similarities,

s (x, x)

and

s (y, y)

s_{R} (x, y) = \frac{s (x, x) + s (y, y) - d (x, y)}{s (x, x) + s (y, y) + d (x, y)} .

(2)

Proof.

[43]. □

Theorem 3

(Generalized Rozinek Distance [43]). Suppose we are given a normalized similarity metric,

s_{n}

, and an arbitrary similarity metric, s. The generalized Rozinek distance,

d_{R} : X \times X \to R

, is the distance metric derived from an arbitrary normalized similarity metric,

s_{n} (x, y)

, and self-similarities

s (x, x)

and

s (y, y)

d_{R} (x, y) = \frac{1 - s_{n} (x, y)}{1 + s_{n} (x, y)} (s (x, x) + s (y, y)) .

(3)

Proof.

[43]. □
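Equations (2) and (3) are inverse to each other, which a few lines of Python can confirm numerically. This is a minimal sketch: the function names are hypothetical, and the sample values (self-similarities 6 and 7, distance 3) are made up for illustration.

```python
import math

def rozinek_similarity(d: float, s_xx: float, s_yy: float) -> float:
    """Equation (2): normalized similarity from a distance and two self-similarities."""
    total = s_xx + s_yy
    return (total - d) / (total + d)

def rozinek_distance(s_n: float, s_xx: float, s_yy: float) -> float:
    """Equation (3): distance recovered from a normalized similarity."""
    return (1.0 - s_n) / (1.0 + s_n) * (s_xx + s_yy)

# Hypothetical values: self-similarities 6 and 7, distance 3.
s = rozinek_similarity(3, 6, 7)    # (13 - 3) / (13 + 3) = 0.625
d = rozinek_distance(s, 6, 7)      # recovers the distance 3 up to rounding
print(s, math.isclose(d, 3.0))
```

A distance of zero gives similarity 1, and the similarity decreases monotonically as the distance grows relative to the sum of the self-similarities.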

The previously introduced theorems are widely applicable because of their generality and unified theory. They have been mathematically proven to be valid for any measurable sets and functions, and include many existing distance and similarity metrics as special cases. The relationship in Theorem 2 should be preferred because it is the only known generalization of a normalized similarity metric obtained from self-similarities and distance metrics. The newly developed theory describes approaches to a normalized similarity space and its generic solution, and how to transform any distance or similarity metric between metric spaces and similarity spaces [43].

In the numerical study, we will work with datasets commonly employed for evaluating the effectiveness of approximate matching algorithms, including those analyzed in studies such as [2,4,15,28]. These datasets provide a benchmark for comparing algorithm performance across various domains. The evaluation of time complexity for the proposed methods will be presented in Section 5.3.

3. Approximate Record Matching in Similarity Space

3.1. Problem Formulation

We assume the given normalized similarity metric,

s_{n} (R_{1}, R_{2})

, which measures the similarity between two records,

R_{1}

and

R_{2}

, as a real number in the interval

[0, 1]

, with 1 representing complete similarity and 0 representing complete dissimilarity. Consider a fixed threshold $α$ with $0 < α < 1$. Applications use a binary classifier to separate record pairs into two classes: the set of matches (

M

) and the set of non-matches (

N

), consisting of ordered pairs of records

(R_{1}, R_{2})

. More formally,

(R_{1}, R_{2}) \in \begin{cases} M & \text{if } s_{n} (R_{1}, R_{2}) \geq α, \\ N & \text{if } s_{n} (R_{1}, R_{2}) < α . \end{cases}

(4)

The decision of where to set the match/non-match threshold,

α

, is a balancing act. The threshold should be chosen based on an acceptable sensitivity (or recall, which is the proportion of truly matching records that are correctly linked by the algorithm) and positive predictive value (or precision, which is the proportion of records linked by the algorithm that are indeed true matches). The task is to find an algorithm that classifies the sets of matches and non-matches as accurately as possible, corresponding to reality as judged by human observation. In the scientific literature, we also encounter the terms record linkage, data matching, fuzzy matching, entity resolution, etc.
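The trade-off behind the choice of α can be illustrated with a small Python sketch of the classifier in Equation (4) together with precision, recall, and the F measure. The pair identifiers, similarity scores, and ground truth below are hypothetical values invented for the example.

```python
def classify(scored_pairs, alpha):
    """Split (pair_id, similarity) tuples into matches M and non-matches N, Equation (4)."""
    matches = {p for p, s in scored_pairs if s >= alpha}
    non_matches = {p for p, s in scored_pairs if s < alpha}
    return matches, non_matches

def precision_recall_f1(predicted, truth):
    """Precision, recall and F measure of a predicted match set against true matches."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Hypothetical scores; p1..p3 are true matches, p4 is not.
scored = [("p1", 0.95), ("p2", 0.82), ("p3", 0.78), ("p4", 0.40)]
truth = {"p1", "p2", "p3"}

for alpha in (0.8, 0.7, 0.3):
    matches, _ = classify(scored, alpha)
    print(alpha, precision_recall_f1(matches, truth))
```

On this toy data, a high threshold keeps precision at 1 but misses a true match, while a very low threshold recovers every match at the cost of a false positive.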

In information retrieval theory and search engines, the analogous phrase is ‘retrieving relevant documents’. A record can be transformed into word tokens; that is, each record is split into a set of tokens,

R_{1} = \{X_{1}, \dots, X_{j}, \dots, X_{n}\}

. We denote the size of a record,

R_{1}

, by

| R_{1} |

, which is the number of tokens in

R_{1}

3.2. Survey of Current Token-Based Methods

In our survey, we have identified the most-cited algorithms based on fuzzy set relatedness [44]. Hence, we refer to them as the current state of the art. We refer readers to the commonly used similarity functions, fuzzy dice similarity, fuzzy cosine similarity, and fuzzy Jaccard similarity [41,42,44]. Our record-matching method will be based on this class of algorithms.

Definition 3

(Fuzzy Token Similarity [41,42]). Let us consider two sets of tokens,

R_{1}

and

R_{2}

. Write

R_{1} {\tilde{\cap}}_{δ} R_{2}

for the fuzzy overlap of

R_{1}

and

R_{2}

. The value δ is the token level threshold in the internal similarity function,

{sim}^{'} (X_{i}, Y_{j})

. The incident edge for the token pair is considered only if

{sim}^{'} (X_{i}, Y_{j}) \geq δ

. We will define the following similarity functions:

Fuzzy dice similarity, ${sim}_{D} (R_{1}, R_{2})$ ,

${sim}_{D} (R_{1}, R_{2}) = \frac{2 | R_{1} {\tilde{\cap}}_{δ} R_{2} |}{| R_{1} | + | R_{2} |} .$

(5)
Fuzzy cosine similarity, ${sim}_{C} (R_{1}, R_{2})$ ,

${sim}_{C} (R_{1}, R_{2}) = \frac{| R_{1} {\tilde{\cap}}_{δ} R_{2} |}{\sqrt{| R_{1} |} \sqrt{| R_{2} |}} .$

(6)
Fuzzy Jaccard similarity, ${sim}_{J} (R_{1}, R_{2})$ ,

${sim}_{J} (R_{1}, R_{2}) = \frac{| R_{1} {\tilde{\cap}}_{δ} R_{2} |}{| R_{2} | + | R_{1} | - | R_{1} {\tilde{\cap}}_{δ} R_{2} |} .$

(7)
Fuzzy overlap similarity, ${sim}_{O} (R_{1}, R_{2})$ ,

${sim}_{O} (R_{1}, R_{2}) = \frac{| R_{1} {\tilde{\cap}}_{δ} R_{2} |}{\min \{| R_{1} |, | R_{2} |\}} .$

(8)

The relaxed term ‘similarity function’ is used instead of ‘normalized similarity metric’ because, e.g., dice similarity is not a normalized similarity metric, as can be simply deduced from [45].
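Once the fuzzy overlap is known, Equations (5)–(8) differ only in their denominators. The sketch below makes this explicit; it assumes the fuzzy overlap has already been computed elsewhere and is passed in as a number, and the sample values are hypothetical.

```python
import math

def fuzzy_similarities(overlap: float, n1: int, n2: int) -> dict[str, float]:
    """Equations (5)-(8) for a given fuzzy overlap and record sizes |R1| = n1, |R2| = n2."""
    return {
        "dice": 2 * overlap / (n1 + n2),
        "cosine": overlap / (math.sqrt(n1) * math.sqrt(n2)),
        "jaccard": overlap / (n1 + n2 - overlap),
        "overlap": overlap / min(n1, n2),
    }

# Hypothetical example: records with 2 and 3 tokens and a fuzzy overlap of 1.6.
sims = fuzzy_similarities(1.6, 2, 3)
for name, value in sims.items():
    print(f"{name}: {value:.3f}")
```

For records of unequal size, the overlap similarity is the largest of the four because it normalizes by the smaller record only.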

In the definition of fuzzy token similarity, we also add a fuzzy overlap compared to the original articles because we will continue to work with it. However, the previously introduced models have some disadvantages from our perspective:

Second Threshold: There are actually two threshold parameters: the threshold $δ$ in the internal similarity function, ${sim}^{'} (X_{i}, Y_{j})$, and the overall threshold, $α$, on $sim (R_{1}, R_{2})$, with which we classify whether the records $R_{1}$ and $R_{2}$ match. The global threshold, $α$, can be quite different from the local threshold, $δ$. In addition, the internal threshold, $δ$, requires optimization and can reduce accuracy if selected incorrectly; see the example in Appendix A.2.
Not Entirely Fuzzy: The binary classifier ${sim}^{'} (X_{i}, Y_{j}) \geq δ$ decides whether a token pair $(X_{i}, Y_{j})$ counts as a match. Such a strict classification at the token level rejects token pairs whose similarity falls below the threshold $δ$ and does not respect the continuous way in which humans perceive overall similarity, because any similarity lower than $δ$ is replaced by zero.
Not a Similarity Metric: The triangle inequality given by (N2), respectively (S2), is violated. This statement is supported by the use of the normalized Levenshtein similarity, which violates the triangle inequality, as discussed in the next subsection. For details, see Appendix A.3.

3.3. Proposed Model of Fuzzy Record Similarity Metric (FRMS)

The proposed model is based on maximum weight matching in a bipartite graph [40,42,44], which satisfies the symmetry axiom (S1) mentioned above. The exact optimal solution to the combinatorial assignment problem can be found using the Kuhn–Munkres algorithm [46,47,48].

A graph,

G = (V, E)

, consists of a set,

V = R_{1} \cup R_{2}

, of vertices (the tokens of both records) and a set,

E

, of pairs of vertices, called edges. For an edge,

e = (X_{i}, Y_{j})

, we say that the endpoints of e are

X_{i}

and

Y_{j}

; we say that e is incident to

X_{i}

and

Y_{j}

. A graph,

G = (V, E)

, is bipartite if the vertex set, V, can be partitioned into two sets,

R_{1}

and

R_{2}

(the bipartition), such that

R_{1} \cap R_{2} = \emptyset

and no edge in

E

has both endpoints in the same set of the bipartition.

A matching

M \subseteq E

is a collection of edges such that every vertex is incident to at most one edge in

M

. If a vertex has no incident edge, it is referred to as exposed (or unmatched). A matching is perfect if no vertex is exposed; in other words, a matching is perfect if its cardinality is equal to

| R_{1} | = | R_{2} |

Theorem 4

(König). For any bipartite graph, the size of a maximum matching equals the size of a minimum vertex cover.

This theorem expresses the equality of the optimal values of the primal and dual problems in linear programming. The Kuhn–Munkres algorithm is based on this duality.

Definition 4

(Maximum Weight Matching Problem). Suppose we are given a weight,

c_{i, j}

, for all

(i, j) \in E

. Given a matching M, let its incidence vector be

x

, where

x_{i, j} = 1

if

(i, j) \in M

, and 0 otherwise. The maximum weight matching problem can be formulated with the objective function

x^{*} = \arg\max \sum_{i = 1}^{| R_{1} |} \sum_{j = 1}^{| R_{2} |} c_{i, j} x_{i, j} = \arg\max c^{T} x,

(9)

subject to

\sum_{i = 1}^{| R_{1} |} x_{i, j} = 1,

(10)

\sum_{j = 1}^{| R_{2} |} x_{i, j} = 1 .

(11)

In general graphs, the minimum vertex cover problem is NP-complete; meanwhile, in bipartite graphs, this problem belongs to the class P. The original Kuhn–Munkres algorithm for the maximum weight matching problem has a polynomial time complexity of

O (| V |^{4})

, but it has been shown that it can be modified to achieve a running time of

O (| V |^{3})

. Note that the best-known time complexity for the maximum weighted matching is

O (| V | | E | + | V |^{2} \log | V |)

[49,50].

For the calculation of textual similarity, we use the normalized edit similarity metric in the range

[0, 1]

, as defined in Theorem 5:

c_{i, j} = s_{n} (X_{i}, Y_{j}) .

(12)

Let us mention that several normalization factors have been developed [23,24,25], and most of them violate the triangle inequality as a condition of being a similarity metric. However, in [24], there is a normalization that yields a normalized similarity metric, and this has been further generalized into a similarity space [43]. On the other hand, it has also been found that the choice of normalization factor barely affects accuracy, with fluctuations in the test results of $\approx 1 \%$. Such a difference could be considered a statistically acceptable error.

We now recall a normalization of the edit distance $d (X_{i}, Y_{j})$ that yields a Levenshtein similarity in the interval $[0, 1]$ and is one of the most widely used [4,12,30,42]:

s (X_{i}, Y_{j}) = 1 - \frac{d (X_{i}, Y_{j})}{\max \{| X_{i} |, | Y_{j} |\}} .

(13)

It can be easily shown (see, for example, Appendix A.3) that this normalization also violates the triangle inequality; therefore, the distance and similarity functions built on this normalization in the cited articles are not metrics and are poorly suited to applications such as cluster analysis.

Theorem 5

(Normalized Edit Similarity Metric). Let Σ be a finite alphabet, and let

Σ^{*}

represent the set of all possible strings formed over Σ. Given

X, Y \in Σ^{*}

, the generalized Rozinek similarity over Σ is the normalized edit similarity metric:

s_{n} (X, Y) = \frac{s (X, X) + s (Y, Y) - d (X, Y)}{s (X, X) + s (Y, Y) + d (X, Y)} = \frac{| X | + | Y | - d (X, Y)}{| X | + | Y | + d (X, Y)},

(14)

where

s_{n} (X, Y) : Σ^{*} \times Σ^{*} \to [0, 1] \subset R

and

| . |

denotes the cardinality of a set, specifically the number of characters in the string.

Proof.

Without loss of generality, we can assume that the self-similarity

s (X, X)

is a set function. Specifically,

s (X, X)

represents a measure,

μ (X)

, defined on a

σ

-algebra over the finite alphabet

Σ

. The set of all strings,

Σ^{*}

, induces a

σ

-algebra on the finite alphabet

Σ

, which results in a measure space represented by the triple

(Σ, Σ^{*}, μ)

. In the abstract sense, the edit distance,

d (X, Y)

, can be interpreted as the symmetric difference of ordered sets, known as the Fréchet–Nikodym metric [43], where

d (X, Y) = μ (X ▵ Y)

is calculated by an algorithm of dynamic programming. For finite sets, the cardinality is a natural measure of size. In this case, we take the cardinality

| . |

of sets given by the number of characters in the sets. We express

\begin{aligned} s_{R} (X, Y) &= s_{n} (X, Y) = \frac{s (X, X) + s (Y, Y) - d (X, Y)}{s (X, X) + s (Y, Y) + d (X, Y)} \\ &= \frac{μ (X) + μ (Y) - μ (X ▵ Y)}{μ (X) + μ (Y) + μ (X ▵ Y)} = \frac{| X | + | Y | - d (X, Y)}{| X | + | Y | + d (X, Y)} . \end{aligned}

(15)

From Theorem 2, it follows that

s_{R} (X, Y)

is a normalized similarity metric, which completes the proof. Alternatively, we also give a second proof [24,43] to support our statement:

\begin{matrix} \frac{| X | + | Y | - d (X, Y)}{| X | + | Y | + d (X, Y)} = 1 - \frac{2 d (X, Y)}{| X | + | Y | + d (X, Y)} = 1 - d_{n} (X, Y), \end{matrix}

(16)

where

d_{n} (x, y)

is a normalized edit distance of the form

d_{n} (X, Y) = \frac{2 d (X, Y)}{| X | + | Y | + d (X, Y)} .

(17)

In [24], it is further proved that

d_{n}

is a normalized distance metric. Hence, by the duality between normalized similarity metrics and normalized distance metrics,

s_{n} (x, y) = 1 - d_{n} (x, y)

is also proven [43]. □
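As a small worked illustration, with strings chosen here only for the purpose of the example (they are not taken from the appendix), let X = ab, Y = abc, and Z = bc, for which the Levenshtein distances are d(X, Y) = 1, d(Y, Z) = 1, and d(X, Z) = 2. Evaluating axiom (N2) under the max-length normalization of Equation (13) and under the normalized edit similarity of Equation (14) gives:

```latex
% Max-length normalization, Equation (13): axiom (N2) fails
s (X, Y) = 1 - \tfrac{1}{3} = \tfrac{2}{3}, \quad
s (Y, Z) = 1 - \tfrac{1}{3} = \tfrac{2}{3}, \quad
s (X, Z) = 1 - \tfrac{2}{2} = 0,
\qquad
s (X, Z) + 1 = 1 < \tfrac{4}{3} = s (X, Y) + s (Y, Z) .

% Normalized edit similarity, Equation (14): axiom (N2) holds, here with equality
s_{n} (X, Y) = \tfrac{2 + 3 - 1}{2 + 3 + 1} = \tfrac{2}{3}, \quad
s_{n} (Y, Z) = \tfrac{3 + 2 - 1}{3 + 2 + 1} = \tfrac{2}{3}, \quad
s_{n} (X, Z) = \tfrac{2 + 2 - 2}{2 + 2 + 2} = \tfrac{1}{3},
\qquad
s_{n} (X, Z) + 1 = \tfrac{4}{3} \geq \tfrac{4}{3} = s_{n} (X, Y) + s_{n} (Y, Z) .
```

The two normalizations agree on the near-identical pairs and differ only on the most dissimilar pair, which is exactly where the max-length normalization breaks the triangle inequality.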

Now, consider a scenario in which we want to bound the edit distance knowing only the lengths of the strings X and Y. Computing the edit distance $d (X, Y)$ itself is expensive, taking quadratic time $O (| X | | Y |)$, so we would like at least a rough worst-case estimate of the edit distance that is still compatible with the minimum normalized edit similarity given by the threshold

s_{n} (X, Y) \geq α

. We introduce an expected edit distance,

{sup}_{α} d (X, Y) = d (α, | X |, | Y |)

, depending on already-known parameters—the threshold

α

and the lengths of the strings

| X |

and

| Y |

. Finally, we obtain a prediction which is computationally feasible in constant time,

O (1)

, and will help in developing an optimal filter in further chapters.

Definition 5

(Expected Edit Distance). Let the worst-case expected edit distance be

d (α, | X |, | Y |) :

R \times N^{+} \times N^{+} \to N_{0}

, representing the maximal possible edit distance,

d (X, Y) : Σ^{*} \times Σ^{*} \to N_{0}

, for string lengths

| X |

and

| Y |

, with a similarity determined by a fixed

α \in [0, 1]

Theorem 6

(Threshold of Normalized Edit Similarity Metric). Let us consider an edit distance of

d (X, Y)

. The worst case of the expected distance function is

d (α, | X |, | Y |)

. Then,

s_{n} (X, Y) \geq α ⟺ d (X, Y) \leq d (α, | X |, | Y |) = ⌊\frac{1 - α}{1 + α} (| X | + | Y |)⌋,

(18)

where

α \in [0, 1] \subset R

is a threshold of the normalized similarity metric given by inequality

s_{n} (X, Y) \geq α

. We use symbol

⌊ . ⌋

for the floor function.

Proof.

According to Theorem 3 and by substituting the self-similarities, which are equal to the corresponding cardinalities of the sets,

s (X, X) = | X |

and

s (Y, Y) = | Y |

, the expected distance

d (α, | X |, | Y |)

reaches its maximum when the similarity is at its minimum, corresponding to the lowest similarity given by the threshold α. This occurs if—and only if—we substitute

s_{n} (X, Y) = α

\begin{aligned} d (X, Y) &= ⌊d_{R}⌋ = ⌊\frac{1 - s_{n} (X, Y)}{1 + s_{n} (X, Y)} (s (X, X) + s (Y, Y))⌋ \\ &\leq \sup_{α} d (X, Y) = ⌊\frac{1 - α}{1 + α} (| X | + | Y |)⌋ = d (α, | X |, | Y |) . \end{aligned}

(19)

The edit distance is an integer, so we apply the floor function to discard the fractional part. □
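The bound of Equation (18) is cheap to evaluate, as the following Python sketch shows. Exact rational arithmetic (`fractions.Fraction`) is used here to sidestep floating-point rounding at the threshold; this is a choice made for the illustration, and the function names are hypothetical.

```python
import math
from fractions import Fraction

def expected_edit_distance(alpha: Fraction, len_x: int, len_y: int) -> int:
    """Equation (18): largest edit distance compatible with s_n(X, Y) >= alpha."""
    return math.floor((1 - alpha) / (1 + alpha) * (len_x + len_y))

def normalized_edit_similarity(d: int, len_x: int, len_y: int) -> Fraction:
    """Equation (14), evaluated from the two lengths and a given edit distance d."""
    total = len_x + len_y
    return Fraction(total - d, total + d)

alpha = Fraction(4, 5)
bound = expected_edit_distance(alpha, 5, 6)   # floor(11 / 9) = 1
print(bound)

# Exhaustive check of the equivalence in Equation (18) on small lengths.
ok = all(
    (normalized_edit_similarity(d, lx, ly) >= alpha)
    == (d <= expected_edit_distance(alpha, lx, ly))
    for lx in range(1, 9)
    for ly in range(1, 9)
    for d in range(max(lx, ly) + 1)
)
print(ok)
```

With a threshold of 0.8, two strings of lengths 5 and 6 may differ by at most one edit operation, so any candidate pair known to need more edits can be discarded without running the quadratic dynamic program.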

Definition 6

(Fuzzy Record Similarity Metric—FRMS). Let

M

be a matching, that is, a collection of edges connecting pairs of tokens, with cardinality

| M | = \min \{| R_{1} |, | R_{2} |\}

; this covers, in particular, the case of imperfect matching,

| R_{1} | \neq | R_{2} |

. We define an expected value of the similarity metric as an average of the normalized edit similarity metrics between

R_{1}

and

R_{2}

, as follows:

E [s_{n} (R_{1}, R_{2})] = s_{n} (R_{1}, R_{2}) = \frac{\sum_{(i, j) \in M} s_{n} (X_{i}, Y_{j})}{| M |} .

(20)
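Definition 6 can be traced end to end on small records with the following Python sketch. It is not the Kuhn–Munkres algorithm: for brevity, the optimal matching is found by exhaustive search over all assignments, which is only practical for records with a handful of tokens. The function names are hypothetical, and non-empty records and tokens are assumed.

```python
from itertools import permutations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

def token_similarity(x: str, y: str) -> float:
    """Normalized edit similarity of Equation (14)."""
    d = levenshtein(x, y)
    return (len(x) + len(y) - d) / (len(x) + len(y) + d)

def frms(r1: list[str], r2: list[str]) -> float:
    """Equation (20), using brute-force search for the maximum weight matching."""
    small, large = sorted((r1, r2), key=len)
    best = max(
        sum(token_similarity(x, y) for x, y in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(small)   # |M| = min(|R1|, |R2|)

print(frms(["john", "smith"], ["smith", "jon"]))        # 0.875
print(frms(["john", "smith"], ["john", "a", "smith"]))  # 1.0
```

The first call pairs "john" with "jon" (similarity 0.75) and "smith" with "smith" (similarity 1) regardless of token order. The second call shows a consequence of dividing by the smaller record size: an extra unmatched token does not lower the score.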

Theorem 7.

The Fuzzy Record Similarity Metric (FRMS) is a normalized similarity metric.

Proof.

FRMS satisfies (N1) because the maximum weight matching in a bipartite graph is a symmetric measure. We continue from Definition 6, applying Theorem 5 and convex combinations in Theorem 1; then, the remaining axioms are easily proven. □

Corollary 1

(Identity FRMS and Fuzzy Overlap Similarity). The Fuzzy Record Similarity Metric equals the fuzzy overlap similarity if—and only if—

δ = 0

s_{n} (R_{1}, R_{2}) = {sim}_{O} (R_{1}, R_{2}) .

(21)

Proof.

We can express directly

s_{n} (R_{1}, R_{2}) = \frac{\sum_{(i, j) \in M} s_{n} (X_{i}, Y_{j})}{| M |} = \frac{\sum_{(i, j) \in M} s_{n} (X_{i}, Y_{j})}{\min \{| R_{1} |, | R_{2} |\}} = \frac{| R_{1} {\tilde{\cap}}_{δ = 0} R_{2} |}{\min \{| R_{1} |, | R_{2} |\}} = {sim}_{O} (R_{1}, R_{2}),

(22)

whenever we remove a second threshold by setting

δ = 0

. Then, we obtain a complete bipartite graph; therefore,

\sum_{(i, j) \in M} s_{n} (X_{i}, Y_{j}) = | R_{1} {\tilde{\cap}}_{δ = 0} R_{2} |

. □

In contrast with the fuzzy Jaccard similarity from Definition 3, we offer a well-designed generalized similarity metric that is suitable for text mining and cluster analysis as the first introduced advanced fuzzy record similarity technique.

Graphically, the whole procedure is shown in Figure 1 and Figure 2 and the enumerated example is shown in Appendix A.2.
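Definition 6 can also be illustrated with a minimal sketch. The code below assumes that the normalized edit similarity takes the form s_n = (|X| + |Y| - d) / (|X| + |Y| + d), obtained by inverting Equation (19) with s(X, X) = |X|, and it finds the maximum-weight matching by exhaustive search. The function names are hypothetical, the second threshold δ is not modeled, and the exhaustive search is practical only for records with a handful of tokens; it is not the matching algorithm used in the article.

```python
from itertools import permutations


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit costs (row-by-row dynamic programming)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cx != cy)))  # substitution or match
        prev = cur
    return prev[-1]


def s_n(x: str, y: str) -> float:
    """Normalized edit similarity; ASSUMED form (|X|+|Y|-d) / (|X|+|Y|+d)."""
    total = len(x) + len(y)
    if total == 0:
        return 1.0  # two empty tokens are treated as identical
    d = edit_distance(x, y)
    return (total - d) / (total + d)


def frms(r1: list[str], r2: list[str]) -> float:
    """Average s_n over the best matching of size |M| = min(|R1|, |R2|)."""
    small, large = (r1, r2) if len(r1) <= len(r2) else (r2, r1)
    if not small:
        return 0.0
    # Try every assignment of the smaller record's tokens to distinct tokens
    # of the larger record and keep the one with the largest total weight.
    best = max(
        sum(s_n(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(small)
```

For example, `frms(["john", "smith"], ["jon", "smith"])` averages 0.75 (one deletion between "john" and "jon") and 1.0 (identical tokens).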

Regarding time complexity, note that in most large databases each attribute stores short strings with few tokens; hence, the computation is feasible and fast in many real-world scenarios. However, for large stores with millions of records, this time complexity is too high for real-time approximate string matching and search. A filter, also called a blocking technique, with much lower time complexity can discard many records and thereby significantly reduce the comparison space [28,29,30]. Practically all methods in use are sub-optimal filters that lose true-positive candidates during filtering. In our view, the reason is that the filtering and matching methods are not explicitly unified mathematically. The intention of this article is to show that, in general, it is possible to derive an optimal filter that is less time-consuming and serves as a lower bound on the threshold of a more time-consuming approximate string-matching algorithm. In our case, we propose a two-stage method in which a less time-consuming filter, such as a Q-gram filter, is used in front of our proposed token-based model, FRMS, which has polynomial time complexity. The Q-gram filter is also one of the most widely used indexing methods for search engines in real-world applications. Thresholding by the least number of shared Q-grams is known as the T-overlap similarity join [30]. In the following sections, we describe a new optimal Q-gram filter for FRMS, the purpose of which is to solve the T-overlap similarity join problem.

4. Optimal Q-Gram Filter

We distinguish some basic types of Q-gram filtering, such as count filtering, positional filtering, prefix filtering, and length filtering [29,51]. In this article, we describe two of them, giving a basic description of their ideas.

4.1. Count Filtering

The intuition behind count filtering is that strings with greater edit similarity than a threshold of

s_{n} \geq α

have a large number of Q-grams in common, based on Theorem 8 and Corollaries 3 and 9.

At the beginning of this section, we introduce the concept of Q-gram similarity, which lies in a similarity space.

Definition 7

(Q-gram Similarity). Let Σ be a finite alphabet, and let

Σ^{*}

denote the set of all strings over Σ. The function

s_{q} : Σ^{*} \times Σ^{*} \to N_{0}

is the Q-gram similarity function for the strings X and Y:

s_{q} (X, Y) = | Q_{X} \cap Q_{Y} |,

(23)

where

Q_{X}

and

Q_{Y}

are the corresponding Q-gram sets.

Corollary 2.

Q-gram similarity,

s_{q} (X, Y)

, is a similarity metric.

Proof.

It has already been shown in [43] that set intersection is an elementary similarity metric satisfying (S1), (S2), (S3), and (S4) from Definition 1. □

By dividing by

| Q_{X} \cup Q_{Y} |

, we obtain a normalized similarity metric called the Jaccard metric [43]. For further derivation, we consider only a simple intersection as a measurement of the Q-grams common to X and Y.

In [52], a relation is proposed between the edit distance and the Q-gram method. Suppose we are given a pattern string, X, with length

| X |

and a text string, Y, with length

| Y |

. Then, the following theorem is given in [53]:

Theorem 8

(Q-gram Count Filtering). Consider two strings, X and Y, and the edit distance

d (X, Y)

. Then, the Q-gram similarity,

s_{q} (X, Y)

, between X and Y is bounded below by

t = max {| X |, | Y |} - q + 1 - q d (X, Y),

(24)

where t is a Q-gram similarity threshold with respect to

d (X, Y)

Proof.

See [52]. □
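The bound of Theorem 8 can be checked on a small example. The sketch below is an illustration under one stated assumption: Q-grams are counted as a bag (multiset), so that a repeated Q-gram counts once per occurrence, which is the setting in which a string of length n yields exactly n - q + 1 Q-grams. The function names are hypothetical.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Bag of overlapping Q-grams of s (empty when len(s) < q)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(x: str, y: str, q: int) -> int:
    """Size of the bag intersection of the Q-grams of x and y."""
    return sum((qgrams(x, q) & qgrams(y, q)).values())


def count_threshold(len_x: int, len_y: int, q: int, d: int) -> int:
    """Lower bound t of Equation (24); t <= 0 means nothing is guaranteed."""
    return max(len_x, len_y) - q + 1 - q * d


def passes_count_filter(x: str, y: str, q: int, d_max: int) -> bool:
    """Keep the pair as a candidate only if it shares at least t Q-grams."""
    return shared_qgrams(x, y, q) >= count_threshold(len(x), len(y), q, d_max)
```

With q = 2, the strings "abcdef" and "abcxef" differ by one substitution and share the bigrams ab, bc, and ef, which meets the threshold t = 6 - 2 + 1 - 2 = 3 exactly.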

This observation is critical for a class of algorithms that target string similarity search and similarity joins based on edit distance constraints, with numerous practical applications in data cleaning, search engines, and integration. These algorithms extend traditional exact search-and-join operations in databases by allowing for errors and inconsistencies in the data—see [30].

One of the common problems is to find the threshold of the least-sharing Q-grams for an allowed edit distance with a fixed maximum distance,

d_{m a x}

, which can be set as a parameter by the user. Denote by

d \in [0, d_{m a x}]

an interval range of allowed edit distance. A singularity of a Q-gram filter,

d_{s i n g}

, occurs whenever this lower bound is less than or equal to zero, as will be explained further on.

Corollary 3

(Infimum of Q-gram for Edit Distance). Let

d \in [0, d_{m a x}]

be an edit distance. Then, there exists a lower bound

inf_{d} {| Q_{X} \cap Q_{Y} |} = max {| X |, | Y |} - q + 1 - q min {d_{s i n g}, d_{m a x}} .

(25)

Proof.

{inf}_{d} {| Q_{X} \cap Q_{Y} |} = {inf}_{d = d_{m a x}} {| Q_{X} \cap Q_{Y} |}

; the term

min {d_{s i n g}, d_{m a x}}

treats the filter underflow below 0. □

Corollary 4

(Supremum of Q-gram for Edit Distance). Suppose we are given an edit distance,

d \in [0, d_{m a x}]

. Then, there exists an upper bound

sup_{d} {| Q_{X} \cap Q_{Y} |} = max {| X |, | Y |} - q + 1 .

(26)

Proof.

Similarly, we obtain an upper bound by maximizing

| Q_{X} \cap Q_{Y} |

and use

d = 0

; hence,

{sup}_{d} {| Q_{X} \cap Q_{Y} |} = {sup}_{d = 0} {| Q_{X} \cap Q_{Y} |}

. □

In our definition, for the first time, we introduced an expression using infimum and supremum on the Q-gram set in relation to the edit distance, d. A similar kind of equation is also introduced in [52,53]. For simplicity, from now on, we write the threshold as

t = {inf}_{d} {| Q_{X} \cap Q_{Y} |}

Definition 8

(Q-gram Singularity). We say there is a Q-gram singularity if no common Q-gram is guaranteed; so, Q-gram filtering has no effect and

t = inf_{d} {| Q_{X} \cap Q_{Y} |} \leq 0 .

(27)

Corollary 5

(Length Singularity of Q-gram Filter). Let

{| X |}_{s i n g}

be the string length where

t = 0

for an edit distance,

d (X, Y) > 0

. We call it a length singularity of the Q-gram filter for edit distance

{| X |}_{s i n g} = (d (X, Y) + 1) q - 1 .

(28)

Proof.

We assume a given positive edit distance of

d > 0

. Then,

\begin{matrix} t = | X | - q + 1 - q d (X, Y), \end{matrix}

(29)

\begin{matrix} | X | = t + (d (X, Y) + 1) q - 1, \end{matrix}

(30)

and with

t = 0

, it is proven. □

Figure 3 shows an example. The underlined letter marks the worst-case edit position: no trigram exists without that letter.
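The length singularity of Corollary 5 can be reproduced numerically. The following sketch uses hypothetical function names and an illustrative pair of strings (not taken from Figure 3): for trigrams and a single edit, a string of length 5 reaches t = 0, and a substitution in the middle position destroys every trigram.

```python
def length_singularity(d: int, q: int) -> int:
    """String length at which the count threshold drops to zero, Equation (28)."""
    return (d + 1) * q - 1


def count_threshold(length: int, q: int, d: int) -> int:
    """Threshold t of Equation (29) for two strings of the same length."""
    return length - q + 1 - q * d


def qgram_set(s: str, q: int) -> set[str]:
    """Set of overlapping Q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


# Illustrative strings of length 5 that differ in the middle character only.
x, y = "abcde", "abXde"
common = qgram_set(x, 3) & qgram_set(y, 3)
```

Here `common` is empty, so the filter cannot separate this pair from completely unrelated strings of the same length.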

4.2. Optimal Count Q-Gram Filter for Edit Similarity

The normalization of the edit distance has received little attention in the literature, but we find it to be of great importance for many real-world applications that need to measure similarity independently of string length. An edit distance of

d = 1

on a string with a length of

| X | = 24

usually has a very different significance from the same distance on a string, Y, with a length of, e.g.,

| Y | = 6

. Normalization accounts for this difference and corresponds well to the human perception of the similarity of different objects. From similarity space theory, strings often have different self-similarities. Self-similarity can be measured by the number of extracted Q-grams or by string length. For instance, having German words

X =

‘Einkommensteuererklärung’ (income tax return) and

Y =

‘Steuer’ (tax), we have

s (X, X) \geq s (Y, Y)

when counting string lengths, common characters, or Q-grams. Whenever we compare two objects with different feature sizes, we ask the following question: Are these objects also of different importance? In our model, we say no; hence, we obtain the following argument for normalization among strings.

Similar to the previous (3), we reformulate our normalized least-sharing Q-gram pairs as a threshold similarity, as follows:

t_{α} = inf_{α} {| Q_{X} \cap Q_{Y} |} = max {| X |, | Y |} - q + 1 - q d (α, | X |, | Y |),

(31)

where we substitute for the edit distance the worst case scenario with

d (X, Y) = d (α, | X |, | Y |)

, and we use this case later in this paper.
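A short sketch of Equations (19) and (31) follows. It assumes that the threshold α is supplied as an exact rational number, which keeps the floor function free of floating-point rounding near integer boundaries; the function names are hypothetical.

```python
from fractions import Fraction
from math import floor


def expected_edit_distance(alpha: Fraction, len_x: int, len_y: int) -> int:
    """Worst-case edit distance d(alpha, |X|, |Y|) from Equation (19)."""
    return floor((1 - alpha) / (1 + alpha) * (len_x + len_y))


def t_alpha(alpha: Fraction, len_x: int, len_y: int, q: int) -> int:
    """Normalized count threshold of Equation (31)."""
    d = expected_edit_distance(alpha, len_x, len_y)
    return max(len_x, len_y) - q + 1 - q * d
```

For α = 4/5, two tokens of length 9 may differ by up to 2 edits, tokens of length 5 by 1 edit, and tokens of length 4 by none, so the shortest tokens must share all of their Q-grams.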

The occurrence of a singularity of the Q-gram filter,

t_{α}

, depends on the string lengths

| X |

and

| Y |

and the user-chosen threshold parameter,

α \in [0, 1]

Corollary 6.

Let

| X |

and

| Y |

be string lengths, where

t_{α} = 0

for a normalized edit similarity,

s_{n} (X, Y) \geq α

. We call this a singularity of the Q-gram filter.

{| X |}_{s i n g} = q d (α, | X |, | Y |) + q - 1 .

(32)

Proof.

With

t_{α} = 0

, it can be proven

\begin{matrix} 0 = | X | - q + 1 - q d (α, | X |, | Y |), \\ {| X |}_{s i n g} = q d (α, | X |, | Y |) + q - 1 . \end{matrix}

(33)

For simplicity, we assume the tokens have the same lengths,

| X | = | Y |

. For illustration, we introduce these examples: if

d (α, | X |, | Y |) = 1

, then we obtain

{| X |}_{s i n g} = {| Y |}_{s i n g} = 2 q - 1

for a trigram,

{| X |}_{s i n g} = 5

, and for a bigram,

{| X |}_{s i n g} = 3

. For another example, if

d (α, | X |, | Y |) = 2

, then

{| X |}_{s i n g} = {| Y |}_{s i n g} = 3 q - 1

for the trigram

{| X |}_{s i n g} = 8

and for the bigram

{| X |}_{s i n g} = 5

. □

Corollary 7

(Edit Distance Singularity of Q-gram Filter). Suppose that we are given an expected edit distance of

d_{s i n g} (α, | X |, | Y |)

, which is a singularity of the Q-gram filter for an edit distance, that is, where

t_{α} = 0

. Then,

d_{s i n g} (α, | X |, | Y |) = ⌈\frac{| Q_{X} |}{q}⌉

(34)

where

Q_{X}

is the Q-gram set of string X.

Proof.

Since

t_{α} = 0

, then, similar to before,

\begin{matrix} 0 & = max {| X |, | Y |} - q + 1 - q d_{s i n g} (α, | X |, | Y |), \\ d_{s i n g} (α, | X |, | Y |) & = ⌈\frac{max {| X |, | Y |} - q + 1}{q}⌉, \\ d_{s i n g} (α, | X |, | Y |) & = ⌈\frac{sup {| Q_{X} \cap Q_{Y} |}}{q}⌉ . \end{matrix}

(35)

For simplicity, again assume

| X | = | Y |

. Then, we obtain finally

d_{s i n g} (α, | X |, | Y |) = ⌈\frac{| Q_{X} |}{q}⌉ .

(36)

The fractional part leads to the remaining Q-grams, which have the size

| Q_{X} |

mod q. In addition, the remaining Q-grams must be destroyed with

d = 1

; hence, we use the ceiling function

⌈ . ⌉

to handle the fractional part. □

Simply explained, a Q-gram filter can work efficiently if

d (X, Y) \leq d_{s i n g} (α, | X |, | Y |)

. If equality holds, then all Q-grams could be destroyed, as the worst case scenario.

Corollary 8.

(Edit Similarity Singularity of Q-gram Filter) Let

α_{s i n g}

be a singularity threshold of a Q-gram filter, where

t_{α} = 0

α_{s i n g} = \frac{2 q | X | - | X | + q - 1}{2 q | X | + | X | - q + 1} = \frac{2 q | X | - | Q_{X} |}{2 q | X | + | Q_{X} |} .

(37)

Proof.

If we ignore the fractional part (i.e., drop the floor and ceiling functions), then we have

\begin{matrix} d (α, | X |, | Y |) & = d_{s i n g} (α, | X |, | Y |), \\ \frac{1 - α}{1 + α} (| X | + | Y |) & = \frac{sup {| Q_{X} \cap Q_{Y} |}}{q}, \\ \frac{1 - α}{1 + α} & = \frac{sup {| Q_{X} \cap Q_{Y} |}}{q (| X | + | Y |)}, \\ α & = \frac{q (| X | + | Y |) - sup {| Q_{X} \cap Q_{Y} |}}{q (| X | + | Y |) + sup {| Q_{X} \cap Q_{Y} |}} . \end{matrix}

(38)

This is proven by finding a root of the equation for

α

. If we assume the strings have equal lengths, then we can simplify this to

α = \frac{2 q | X | - | Q_{X} |}{2 q | X | + | Q_{X} |} = \frac{2 q | X | - | X | + q - 1}{2 q | X | + | X | - q + 1} .

(39)

□

In other words, we should always select a threshold,

α \geq α_{s i n g}

, for the corresponding token lengths, to have a working and efficient Q-gram filter that is capable of filtering any dissimilar records.
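The singularity quantities of Corollaries 7 and 8 are easy to tabulate for equal-length tokens. The sketch below uses exact rational arithmetic and hypothetical function names; it only evaluates Equations (34) and (37) and makes no claim about how the authors implemented them.

```python
from fractions import Fraction
from math import ceil


def qgram_count(length: int, q: int) -> int:
    """Number of Q-grams |Q_X| of a string of the given length."""
    return max(length - q + 1, 0)


def d_sing(length: int, q: int) -> int:
    """Edit distance at which no shared Q-gram is guaranteed, Equation (34)."""
    return ceil(Fraction(qgram_count(length, q), q))


def alpha_sing(length: int, q: int) -> Fraction:
    """Singularity threshold for equal-length tokens, Equation (37)."""
    n = qgram_count(length, q)
    return Fraction(2 * q * length - n, 2 * q * length + n)
```

For trigrams and tokens of length 5, `alpha_sing(5, 3)` evaluates to 9/11 (about 0.82), and `d_sing(5, 3)` is 1, in agreement with the length singularity of 5 obtained earlier for a single edit.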

Corollary 9.

Strings with a length of

| X | < ⌈\frac{1 + α}{2 - 2 α}⌉

must match exactly in all

| Q_{X} |

Q-grams to pass the Q-gram filter:

| X | < ⌈\frac{1 + α}{2 - 2 α}⌉ \Rightarrow t_{α} = | Q_{X} | = | X | - q + 1 .

(40)

Proof.

We show that an expected edit distance below one,

d (α, | X |, | Y |) < 1

, leads to an exact match after applying the floor function,

d (α, | X |, | Y |) < 1 \Rightarrow ⌊ d (α, | X |, | Y |) ⌋ = 0

\begin{matrix} ⌊\frac{2 | X | - 2 α | X |}{α + 1}⌋ & < 1, \\ | X | & < ⌈\frac{α + 1}{2 - 2 α}⌉ . \end{matrix}

(41)

□

Indeed, for

d = 1

and

α = 0.8

, we obtain

| X | = 5

. Shorter strings must always have

s_{n} (X, Y) < α

for

d > 0

A singularity of a Q-gram filter could cause performance problems on certain string lengths,

{| X |}_{s i n g}

, due to generating a large candidate set. This property is demonstrated in Figure 4. To resolve this kind of problem, we should extend our model with at least 1 padding Q-gram, denoted by

p \in Σ^{*}

. Intuitively, and based on empirical observation and some blocking techniques [28,29,30], increasing the importance of initial letters seems reasonable. As an initial solution, we could define a prefix or postfix ‘

p = #

’, or both, for each token, which extends each Q-gram set to

| Q_{X} | + | p |

Q-grams.

Definition 9

(Q-gram Set with Padding). Let X be a string over the alphabet, Σ, and

p \in Σ^{*}

be a padding string. The padded Q-gram set

Q_{X, p}

is obtained by extracting Q-grams from the concatenated string,

p \cdot X \cdot p

Corollary 10

(Q-gram Filter with Padding). Let

Q_{X}

and

Q_{Y}

be Q-gram sets for strings X and Y, and let p be a padding string. Then, for any normalized edit similarity threshold,

s_{n} (X, Y) \geq α

, and expected edit distance,

d (α, | X |, | Y |) > 0

, the padded Q-gram similarity satisfies:

t_{α_{p}} = t_{α} + | p |

(42)

where

t_{α}

is the unpadded Q-gram similarity threshold and

| p |

is the length of the padding string.

Proof.

For padded strings,

X_{p} = p \cdot X \cdot p

and

Y_{p} = p \cdot Y \cdot p

\begin{matrix} t_{α_{p}} & = max {| X_{p} |, | Y_{p} |} - q + 1 - q d (α, | X |, | Y |) \\ & = max {| X |, | Y |} + | p | - q + 1 - q d (α, | X |, | Y |) \\ & = t_{α} + | p | . \end{matrix}

Note that padding characters do not affect the edit distance calculation

d (α, | X |, | Y |)

between the original strings. □
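The effect of padding at a singularity point can be sketched as follows. The code assumes that |p| counts the total number of padding characters added to a token (one prefix plus one postfix character gives |p| = 2) and that the padding character '#' does not occur in the tokens themselves; the function names are hypothetical.

```python
from collections import Counter


def padded_qgrams(token: str, q: int, prefix: str = "#", suffix: str = "#") -> Counter:
    """Bag of Q-grams extracted from prefix + token + suffix."""
    padded = prefix + token + suffix
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def padded_threshold(len_x: int, len_y: int, q: int, d: int, pad_len: int) -> int:
    """Equation (42): the unpadded threshold shifted by the padding length."""
    return max(len_x, len_y) + pad_len - q + 1 - q * d


# Trigram singularity case from the unpadded filter: length 5, one substitution.
shared = sum((padded_qgrams("abcde", 3) & padded_qgrams("abXde", 3)).values())
```

Without padding the two strings share no trigram and the threshold is 0; with one padding character on each side they share the boundary trigrams "#ab" and "de#", which meets the padded threshold of 2.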

The behavior of the Q-gram filters near the singularity points plays a fundamental role in their theoretical and practical performance. While empirical studies across diverse datasets have demonstrated that padding Q-grams consistently improves F-measure statistics [55,56], prior work has lacked a rigorous theoretical framework explaining this phenomenon. We propose that the singularity behavior of Q-gram filters provides the theoretical foundation for these empirical observations. Specifically, we show that padding characters serve two complementary functions: First, they ensure non-zero Q-gram similarity at singularity points by introducing guaranteed common sub-strings at token boundaries. Second, they expand the feature space dimensionality through additional Q-grams generated from the padding–token boundary regions. This dual effect creates what we term a “smooth Q-gram set”—one where similarity measures transition continuously across token boundaries rather than exhibiting sharp discontinuities. The additional discriminative power provided by the extended feature space enables more robust classification, particularly in boundary cases where the base Q-gram filter approaches singularity.

Another solution could be to combine feature extraction with multiple sizes of the Q-grams, e.g., unigrams, bigrams, and trigrams. The singularities of those Q-gram filters would be mutually resolved. A generalization of such a combination could be expressed as a sum of shared Q-grams over different Q-gram sizes,

t_{1}

,

t_{2}

, and

t_{3}

, and their constants,

q_{1} = 1

,

q_{2} = 2

, and

q_{3} = 3

:

\begin{matrix} t_{1, 2, 3} = t_{1} + t_{2} + t_{3} = \\ 3 max {| X |, | Y |} - q_{1} - q_{2} - q_{3} + 3 - d (α, | X |, | Y |) (q_{1} + q_{2} + q_{3}) \end{matrix}

(43)

and evaluating

t_{1} + t_{2} + t_{3} = 3 max {| X |, | Y |} - 3 - 6 d (α, | X |, | Y |) .

(44)
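Equation (44) can be checked by summing the individual thresholds. The sketch below counts shared Q-grams as bags over the sizes 1, 2, and 3; the function names and the example strings are illustrative assumptions.

```python
from collections import Counter


def combined_threshold(len_x: int, len_y: int, d: int, sizes=(1, 2, 3)) -> int:
    """Sum of the per-size count thresholds, as in Equation (43)."""
    longest = max(len_x, len_y)
    return sum(longest - q + 1 - q * d for q in sizes)


def shared_over_sizes(x: str, y: str, sizes=(1, 2, 3)) -> int:
    """Total number of shared Q-grams (bag intersection) over all sizes."""
    total = 0
    for q in sizes:
        gx = Counter(x[i:i + q] for i in range(len(x) - q + 1))
        gy = Counter(y[i:i + q] for i in range(len(y) - q + 1))
        total += sum((gx & gy).values())
    return total
```

For two strings of length 5 at distance 1, the combined threshold is 3 · 5 - 3 - 6 = 6 even though the trigram threshold alone is 0; the pair "abcde" and "abXde" shares 4 unigrams, 2 bigrams, and no trigram.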

4.3. Optimal Count Q-Gram Filter for Fuzzy Token Similarity

In this generalization, we consider the records

R_{1}, R_{2}

, consisting of token sets that are matched in a bipartite graph. As mentioned previously,

| M |

is the number of matched token pairs (edges) in the solution of the optimal assignment problem. We assume that only the matching token pairs given by

M

are known, while their edit distances are unknown; however, we can predict them from the expected edit distances. It may seem contradictory that the edit distances are unknown while the model assumes a known matching

M

, which is itself calculated on the adjacency matrix. The reason is that the model is meant to depend only on the threshold

α

and token lengths of

X_{i} \in R_{1}

and

Y_{j} \in R_{2}

. We formulate this generalization in the following.

Theorem 9

(Optimal Count Q-gram Filter for Bipartite Matching [57,58]). Let

R_{1}

and

R_{2}

be records representing a set of tokens. Then, the Q-gram similarity in bipartite matching of

R_{1}

and

R_{2}

, and cardinality,

| M |

, for a given threshold,

s_{n} (R_{1}, R_{2}) \geq α

, is at least

\begin{matrix} t_{M} = inf_{α} {| Q_{R_{1}} \cap Q_{R_{2}} |} = \\ \underbrace{\sum_{(i, j) \in M} max {| X_{i} |, | Y_{j} |} - | M | q + | M |}_{\text{maximum shared Q-grams}} - q \underbrace{max_{α} \sum_{(i, j) \in M} d (α_{i, j}, | X_{i} |, | Y_{j} |)}_{\text{loss function}}, \end{matrix}

(45)

containing a linear combination of

d (α_{i, j}, | X_{i} |, | Y_{j} |) = \frac{1 - α_{i, j}}{1 + α_{i, j}} (| X_{i} | + | Y_{j} |)

(46)

under the constraint

α = \frac{\sum_{(i, j) \in M} α_{i, j}}{| M |}

for which the linear combination is maximized.

Proof.

From (31), consider the sum over connected pairs of tokens with cardinality

| M |

\begin{matrix} inf_{α} {| Q_{R_{1}} \cap Q_{R_{2}} |} \\ = inf_{α} \{\sum_{(i, j) \in M} | Q_{X_{i}} \cap Q_{Y_{j}} |\} = \sum_{(i, j) \in M} inf_{α_{i, j}} {| Q_{X_{i}} \cap Q_{Y_{j}} |} \\ = \sum_{(i, j) \in M} inf_{α_{i, j}} {max {| X_{i} |, | Y_{j} |} - q + 1 - q d (X_{i}, Y_{j})} \\ = \sum_{(i, j) \in M} {max {| X_{i} |, | Y_{j} |} - q + 1 - q sup_{α_{i, j}} d (X_{i}, Y_{j})} \\ = \sum_{(i, j) \in M} {max {| X_{i} |, | Y_{j} |} - q + 1 - q d (α_{i, j}, | X_{i} |, | Y_{j} |)} . \end{matrix}

(47)

Each

α_{i, j}

is the distributed minimum similarity for each token, giving a threshold vector that should maximize the sum of the expected distances

d (α_{i, j}, | X_{i} |, | Y_{j} |)

, so that

s_{n} (R_{1}, R_{2}) \geq α = \frac{\sum_{(i, j) \in M} α_{i, j}}{| M |}

holds for

t_{M}

. Formalizing this, we obtain the integer linear programming task

\begin{matrix} maximize & \sum_{(i, j) \in M} d (α_{i, j}, | X_{i} |, | Y_{j} |), & \\ subject to & \sum_{(i, j) \in M} α_{i, j} \geq α | M |, & α \in [0, 1], \\ & α_{i, j} \in [0, 1], & i, j = 1, \dots, | M | . \end{matrix}

After a small reformulation, the optimization reduces to the well-known knapsack problem, which can be solved using dynamic programming with complexity

O (n b)

, where n is the number of items and b is the capacity. While this appears polynomial, it is actually pseudo-polynomial, since the true input size depends on

{log}_{2} b

rather than b directly [59]. This distinction is important as the algorithm’s runtime depends on the magnitude of the numbers involved, not just their bit length. In other words, we can simply ask how the maximum expected edit distance can be allocated across tokens, so that there will be at least

s_{n} (R_{1}, R_{2}) \geq α

for records

R_{1}

and

R_{2}

. The above optimization answers this question. Note that the task is not entirely trivial, because longer tokens receive a larger share of the distance. This means that the error

ϵ_{i, j} = 1 - α_{i, j}

will be distributed first on the longest tokens to achieve the maximum sum of distances,

d (α_{i, j}, | X_{i} |, | Y_{j} |)

, where

α_{i, j}

are variables depending on the lengths of the tokens. □
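The allocation question posed in the proof can be explored on tiny inputs by exhaustive search. The sketch below is not the dynamic-programming knapsack solution referred to above; it is a brute-force illustration under one assumption: for an integer distance d on a token pair with total length L = |X_i| + |Y_j|, the largest token threshold whose expected distance still reaches d is (L - d) / (L + d), which follows from inverting Equation (46). The function names are hypothetical and tokens are assumed to be non-empty.

```python
from fractions import Fraction
from itertools import product


def similarity_for_distance(d: int, total_len: int) -> Fraction:
    """Largest alpha_ij whose expected distance still reaches d (assumed form)."""
    return Fraction(total_len - d, total_len + d)


def max_loss(pairs: list[tuple[int, int]], alpha: Fraction):
    """Maximize the summed distances subject to mean(alpha_ij) >= alpha.

    pairs holds the lengths (|X_i|, |Y_j|) of the matched tokens.
    Returns (maximum total distance, one allocation achieving it).
    """
    totals = [lx + ly for lx, ly in pairs]
    budget = alpha * len(pairs)
    best_total, best_alloc = 0, tuple(0 for _ in pairs)
    # Enumerate every integer distance vector; each d_ij is at most max(|X_i|, |Y_j|).
    for alloc in product(*(range(max(lx, ly) + 1) for lx, ly in pairs)):
        feasible = sum(similarity_for_distance(d, t)
                       for d, t in zip(alloc, totals)) >= budget
        if feasible and sum(alloc) > best_total:
            best_total, best_alloc = sum(alloc), alloc
    return best_total, best_alloc
```

For matched pairs with lengths (4, 4) and (10, 10) and α = 4/5, the search returns a total distance of 5 with the allocation (0, 5): the whole error budget goes to the longer token pair.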

For the sake of exposition, we explain two terms of the optimal Q-gram filter equation. The term maximum shared Q-grams counts the maximum possible number of shared Q-grams,

| Q_{R_{1}} \cap Q_{R_{2}} |

, that could be achieved with tokens with corresponding lengths,

| X_{i} |, | Y_{j} |

, given by the resulting matching set of

M

on the complete bipartite graph. In fact, this term is simple because it counts only the Q-grams across matching token pairs

| X_{i} |

and

| Y_{j} |

. The most crucial term is the loss function, which must be maximized for an unknown expected distance distribution of the tokens employing a user-defined global threshold,

α

. Table 1 (bottom) contains a complete list of the constraints previously imposed on the integer linear programming problem as well as the fuzzy token similarity functions that have been introduced. Note that, in previous publications, this task was simplified: there was no consideration of the dependence of

t_{M}

on variables such as the length of the Q-gram q and the lengths of the tokens

| X_{i} |

and

| Y_{j} |

. The derived model solves this task optimally, so the bound cannot be improved further.

A naive approximation would be to assume the constancy of

α_{i, j} = α

, which distributes the distance evenly (on average) among the tokens. This approximates the optimal lower bound on shared Q-gram pairs for maximum weight matching in a bipartite graph. Note that

X_{i}

and

Y_{j}

are connected by an edge of the resulting matching. However, we cannot predict which tokens will be matched in the bipartite graph, and so we cannot predict their lengths.

Theorem 10

(Approximate Count Q-gram Filter). Let

F_{R_{1}}

and

F_{R_{2}}

be discrete distribution functions representing the ascending sorted lengths

| X_{i} |

and

| Y_{i} |

, respectively. Let

R_{1}

and

R_{2}

be records (sets of tokens) with unknown edge connections, and let

M

denote the set of matched token pairs with cardinality

| M |

. Under the assumption of constant similarity

α_{i, j} = α

for all

(i, j)

, the Q-gram similarity in bipartite matching satisfies the following:

\begin{matrix} {\hat{t}}_{M} & = \frac{2 q α + α - 2 q + 1}{2 + 2 α} (F_{R_{1}} [| M |] + F_{R_{2}} [| M |]) \\ & + \frac{1}{2} | F_{R_{1}} [| M |] - F_{R_{2}} [| M |] | - | M | q + | M | \end{matrix}

(48)

for the classification threshold

s_{n} (R_{1}, R_{2}) \geq α

in fuzzy bipartite matching.

Proof.

We begin with three fundamental inequalities. For any matched pairs,

(i, j) \in M

max \{\sum_{(i, j) \in M} | X_{i} |, \sum_{(i, j) \in M} | Y_{j} |\} \leq \sum_{(i, j) \in M} max {| X_{i} |, | Y_{j} |}

(49)

For arbitrary real numbers,

a, b \in R

max {a, b} = \frac{1}{2} (a + b + | a - b |)

(50)

Furthermore, for the cumulative sums:

max {F_{R_{1}} [| M |], F_{R_{2}} [| M |]} \leq max \{\sum_{(i, j) \in M} | X_{i} |, \sum_{(i, j) \in M} | Y_{j} |\}

(51)

Beginning with the definition of Q-gram similarity in the context of bipartite matching:

\begin{matrix} t_{M} & = \sum_{(i, j) \in M} max {| X_{i} |, | Y_{j} |} - | M | q + | M | - q max_{α} \sum_{(i, j) \in M} d (α_{i, j}, | X_{i} |, | Y_{j} |) \end{matrix}

(52)

Initially keeping the individual

α_{i, j}

terms:

\begin{matrix} = \sum_{(i, j) \in M} max {| X_{i} |, | Y_{j} |} - | M | q + | M | - q max_{α} \sum_{(i, j) \in M} \frac{1 - α_{i, j}}{1 + α_{i, j}} (| X_{i} | + | Y_{j} |) \end{matrix}

(53)

Under the assumption of constant

α_{i, j} = α

for all

(i, j)

\begin{matrix} \approx \sum_{(i, j) \in M} max {| X_{i} |, | Y_{j} |} - | M | q + | M | - q \frac{1 - α}{1 + α} \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) \end{matrix}

(54)

Applying inequality (49):

\begin{matrix} \geq max \{\sum_{(i, j) \in M} | X_{i} |, \sum_{(i, j) \in M} | Y_{j} |\} - | M | q + | M | - \frac{q - q α}{1 + α} \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) \end{matrix}

(55)

Using the analytical maximum form (50):

\begin{matrix} = \frac{1}{2} \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) + \frac{1}{2} | \sum_{(i, j) \in M} | X_{i} | - \sum_{(i, j) \in M} | Y_{j} | | \\ - | M | q + | M | - \frac{q - q α}{1 + α} \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) \end{matrix}

(56)

Collecting terms with common factor

\sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |)

\begin{matrix} = (\frac{1}{2} - \frac{q - q α}{1 + α}) \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) \\ + \frac{1}{2} | \sum_{(i, j) \in M} | X_{i} | - \sum_{(i, j) \in M} | Y_{j} | | - | M | q + | M | \end{matrix}

(57)

Through algebraic manipulation, using

\frac{1}{2} - \frac{q - q α}{1 + α} = \frac{2 q α + α - 2 q + 1}{2 + 2 α}

:

\begin{matrix} = \frac{2 q α + α - 2 q + 1}{2 + 2 α} \sum_{(i, j) \in M} (| X_{i} | + | Y_{j} |) \\ + \frac{1}{2} | \sum_{(i, j) \in M} | X_{i} | - \sum_{(i, j) \in M} | Y_{j} | | - | M | q + | M | \end{matrix}

(58)

Applying inequality (51):

\begin{matrix} \geq \frac{2 q α + α - 2 q + 1}{2 + 2 α} (F_{R_{1}} [| M |] + F_{R_{2}} [| M |]) \\ + \frac{1}{2} | F_{R_{1}} [| M |] - F_{R_{2}} [| M |] | - | M | q + | M | \end{matrix}

(59)

Finally, taking the floor function to ensure integer values:

\begin{matrix} {\hat{t}}_{M} & = ⌊\frac{2 q α + α - 2 q + 1}{2 + 2 α} (F_{R_{1}} [| M |] + F_{R_{2}} [| M |]) \\ + \frac{1}{2} | F_{R_{1}} [| M |] - F_{R_{2}} [| M |] |⌋ - | M | q + | M | \end{matrix}

(60)

Therefore,

t_{M} \approx {\hat{t}}_{M}

provides a lower bound approximation for the Q-gram similarity under the constant

α

assumption, with the approximation becoming tighter as the variation in individual

α_{i, j}

values decreases. □

As a result, by employing a pre-built inverted Q-gram index that incorporates the distribution of ascending sorted lengths, a time complexity of

O (1)

can be attained.
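The approximate threshold of Theorem 10 can be sketched in a few lines. The code below makes two assumptions that should be read as such: F_R[|M|] is interpreted as the sum of the |M| shortest token lengths of a record, and the leading coefficient is written in the unsimplified form 1/2 - q(1 - α)/(1 + α) taken from Equation (57). The function names are hypothetical; in a real index the cumulative sums would be precomputed rather than sorted on every call.

```python
from fractions import Fraction
from math import floor


def cumulative_shortest(lengths: list[int], m: int) -> int:
    """Assumed reading of F_R[m]: sum of the m shortest token lengths."""
    return sum(sorted(lengths)[:m])


def approx_threshold(lens1: list[int], lens2: list[int], q: int, alpha: Fraction) -> int:
    """Approximate count threshold under the constant-alpha assumption."""
    m = min(len(lens1), len(lens2))          # matching cardinality |M|
    f1 = cumulative_shortest(lens1, m)
    f2 = cumulative_shortest(lens2, m)
    coeff = Fraction(1, 2) - q * (1 - alpha) / (1 + alpha)
    return floor(coeff * (f1 + f2) + Fraction(1, 2) * abs(f1 - f2)) - m * q + m
```

For two records whose token lengths are both [4, 10], bigrams, and α = 4/5, the coefficient is 5/18 and the threshold evaluates to 5 shared Q-grams.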

Theorem 11

(Q-gram Filter Efficiency). Let

α_{o p t i m}

be the minimal threshold at which a Q-gram filter is effective at filtering dissimilar records for bipartite matching

s_{n} (R_{1}, R_{2}) \geq α

. Then,

α_{o p t i m} \geq \frac{2 q - 1}{2 q + 1} .

(61)

From Equation (51), it follows that

F_{X} [| M |] \leq \sum_{i = 1}^{| M |} | X_{i} |

. So, we should prove that, for the order statistics

| X_{1} | \leq | X_{2} | \leq \dots \leq | X_{| M |} |

, we also have

t_{X_{1}} \leq t_{X_{2}} \leq \dots \leq t_{X_{| M |}}

. We show that, for any

| X_{i} |

and

| X_{j} | = | X_{i} | + 1

, we have

t_{X_{i}} \leq t_{X_{j}}

. Thus, we obtain the inequality

\begin{matrix} | X | - q + 1 - q d (α, | X |, | X |) & \leq (| X | + 1) - q + 1 - q d (α, | X | + 1, | X | + 1), \\ | X | - q + 1 - q \frac{2 | X | - 2 α | X |}{1 + α} & \leq (| X | + 1) - q + 1 - q \frac{2 (| X | + 1) - 2 α (| X | + 1)}{1 + α}, \\ - q \frac{2 | X | - 2 α | X |}{1 + α} & \leq 1 - q \frac{2 (| X | + 1) - 2 α (| X | + 1)}{1 + α}, \\ 2 | X | - 2 α | X | & \geq 2 (| X | + 1) - 2 α (| X | + 1) - \frac{1 + α}{q} \\ α & \geq \frac{2 q - 1}{2 q + 1} . \end{matrix}

(62)

As derived previously in Corollary 8, the Q-gram filter works for

α \geq α_{s i n g}

. If we put

α_{o p t i m} = \frac{2 q - 1}{2 q + 1}

, then the convergence of this is obvious:

lim_{| X | \to \infty} α_{s i n g} (| X |) = α_{o p t i m} .

(63)

Evaluating this expression, we obtain an effective Q-gram filter for bigrams when

α \geq 0.6

, for trigrams when

α \geq \frac{5}{7}

, and for four-grams when

α \geq \frac{7}{9}

. These minimum thresholds are reasonable for ordinary use in real applications and provide considerable robustness to errors.
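The thresholds quoted above, and the limit in Equation (63), can be verified with exact arithmetic. This is a small check with hypothetical function names; the singularity threshold uses the equal-length form of Equation (37).

```python
from fractions import Fraction


def alpha_optim(q: int) -> Fraction:
    """Minimal effective threshold of Equation (61)."""
    return Fraction(2 * q - 1, 2 * q + 1)


def alpha_sing(length: int, q: int) -> Fraction:
    """Singularity threshold for equal-length tokens, Equation (37)."""
    n = length - q + 1  # number of Q-grams |Q_X|
    return Fraction(2 * q * length - n, 2 * q * length + n)


# Gap between the singularity threshold of a very long token and the limit.
gap = alpha_sing(10**6, 3) - alpha_optim(3)
```

The gap is positive and shrinks roughly in proportion to 1 / |X|, so long tokens tolerate thresholds close to 5/7 for trigrams, while short tokens need a higher threshold.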

We now extend the previous model with padding, which should further improve the efficiency of the filtering.

Theorem 12

(Optimal Q-gram Filter with Padding for Bipartite Matching). Let

R_{1}

and

R_{2}

be records with token sets padded by string

p \in Σ^{*}

. Thus, the Q-gram similarity in bipartite matching with cardinality

| M |

for the threshold α is given by

t_{M_{p}} = inf_{α, p} {| Q_{R_{1}} \cap Q_{R_{2}} |} = t_{M} + | M | | p |

(64)

where

t_{M}

is the unpadded Q-gram similarity threshold.

Proof.

Apply Corollary 10 to each matched token pair

(X_{i}, Y_{j}) \in M

and sum over all

| M |

pairs:

\begin{matrix} t_{M_{p}} = \sum_{(i, j) \in M} (t_{α} + | p |) = \sum_{(i, j) \in M} t_{α} + | M | | p | = t_{M} + | M | | p | . \end{matrix}

□

Corollary 11

(Approximate Lower Bound to Optimal Count Q-gram Filter with Padding for Bipartite Matching). Let

R_{1}

and

R_{2}

be records representing a set of tokens, and let the total number of padding characters in the records

R_{1}

and

R_{2}

be denoted as

| p_{R_{1}} |

and

| p_{R_{2}} |

, respectively. Then, the Q-gram similarity in bipartite matching of

R_{1}

and

R_{2}

with cardinality

| M |

for a given threshold, α, is at least

t_{M_{p}} \approx {\hat{t}}_{M_{p}} = {\hat{t}}_{M} + min {| p_{R_{1}} |, | p_{R_{2}} |} .

(65)

Proof.

As before, apply Corollary 10 and Theorem 10. The padding characters are counted in total per record because tokens shorter than q may exist; for such tokens, we do not append any padding characters but only extend the string to a length of at least q, so that at least one Q-gram feature can be extracted. □

5. Results

5.1. Suggested Architecture

The architecture of the application for solving the approximate string-matching problem is shown in Figure 5. The region inside the dashed rectangle delineates the main topics of the article. The mathematically derived lower bound of FRMS

{\hat{t}}_{M}

filters dissimilar records,

| Q_{R_{1}} \cap Q_{R_{2}} | < {\hat{t}}_{M}

, from a Q-gram inverted index. The generated candidates only include a very small fraction of the indexed collection of records. This stage mainly guarantees real-time running. Furthermore, FRMS runs comparisons on the generated candidates and generates a subset,

M

, of the candidates for

s_{n} (R_{1}, R_{2}) \geq α

. This process is illustrated in the block of Figure 6. We repeat the critical point that a lower threshold,

α

, generates a larger set of candidates. The previous sections explained how the lower bound,

{\hat{t}}_{M}

, and the FRMS are related mathematically to the threshold,

α

5.2. Experiment Setup and Evaluation of Q-Gram Filter Efficiency

In this section, we present the results of an extensive set of experiments conducted to demonstrate the efficiency of the proposed mathematical models for Q-gram filtering and approximate record-weighted matching in a bipartite graph.

The quality of the measurement data can be checked by finding matches in a database created from other sources. We used a test suite that has been reported in various papers and is widely used to analyze these metrics [4,15].

In the dataset, we compared pairs of inputs that belonged to the same domain, and if their IDs matched, then we marked them as identical. Our test involved comparing datasets taken from [15], shown in Table 2. In the test, each dataset was divided into two parts. Between these two parts, we calculated the scores between all pairs of records. We sorted all the pairs according to the calculated similarity scores. Ideally, all matches should have a higher similarity score and, as a result, should appear in the sorted list before all mismatches.

We computed the non-interpolated average precision of this ranking. Following [4,15], we calculate the precision and recall as:

\begin{matrix} Precision = \frac{c (i)}{i}, \end{matrix}

(66)

\begin{matrix} Recall = \frac{c (i)}{P}, \end{matrix}

(67)

where

c (i)

is the number of correct matching pairs ranked before position i, and

P

is the total number of correct matches. Consequently, the interpolated precision at recall, r, is the

{max}_{i} \frac{c (i)}{i}

, where the max is taken over all ranks, i, such that

\frac{c (i)}{P} \geq r

. The graphs in Figure 7, Figure 8 and Figure 9 are plotted from the interpolated precision in the recall sequence

r = 0.0

,

0.05

, …,

0.95

, and

1.0

(21 equidistant recall levels), with a step length of

0.05

. The curves go through the points and are smoothed for better clarity. The overall relative performance of the compared similarity functions is calculated using the maximum F1-score as follows:

F_{1}\text{-score} = 2 \frac{Precision \times Recall}{Precision + Recall} .

(68)

The maximum F1-scores are shown in Table 3 and Table 4. The best result, 85.09%, was achieved using the FRMS method alone; its combination with the Q-gram filter reached 85.01%. These results confirm the high accuracy of the approximated optimal Q-gram filter and support the correctness of our mathematical derivation of its approximate form. However, because the filter is still an approximation, a detailed analysis revealed several matching records that did not pass through it.
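The evaluation protocol of Equations (66)–(68) can be sketched in a few lines. The following is a minimal illustration, not the code used for the reported experiments: the function names and the toy ranking are hypothetical, $c(i)$ is assumed to count the correct pairs up to and including rank $i$, and at least one correct match is assumed to exist.

```python
from typing import List, Tuple


def precision_recall_at_ranks(labels: List[bool]) -> List[Tuple[float, float]]:
    """labels[k] is True when the pair at rank k + 1 (pairs sorted by
    descending similarity score) is a correct match.
    Returns one (precision, recall) point per rank."""
    total_matches = sum(labels)  # P in Equation (67); assumed to be > 0
    points = []
    correct = 0  # c(i), assumed to include rank i itself
    for i, is_match in enumerate(labels, start=1):
        correct += is_match
        points.append((correct / i, correct / total_matches))
    return points


def interpolated_precision(points, recall_levels):
    """For each recall level r, the maximum precision over ranks with recall >= r."""
    return [max((p for p, rec in points if rec >= r), default=0.0)
            for r in recall_levels]


def max_f1(points):
    """Maximum F1-score over all ranks, as in Equation (68)."""
    return max((2 * p * r / (p + r) for p, r in points if p + r > 0), default=0.0)


# 21 equidistant recall levels: 0.00, 0.05, ..., 1.00
RECALL_LEVELS = [k / 20 for k in range(21)]

ranking = [True, True, False, True, False]  # toy ranking, made up for illustration
points = precision_recall_at_ranks(ranking)
curve = interpolated_precision(points, RECALL_LEVELS)
best_f1 = max_f1(points)
```

In this toy ranking, the third and fifth pairs are mismatches, so the interpolated precision drops from 1.0 at low recall to 0.75 at full recall.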

We performed a complete ablation study for combinations of our derived Q-gram filters with padding (Theorem 10 and Corollary 11) and the naive Q-gram filter with padding (Corollary 10), each together with FRMS; see Table 4. Padding $|p| = 2$ indicates that a single prefix and a single postfix character were used, $|p| = 1$ indicates that only a prefix character was used, and $|p| = 0$ indicates that no padding was applied. The experiments evaluated whether each compared pair of matches passes the filter at a fixed threshold, $α$, corresponding to the final similarity score of the FRMS method, i.e., a filter threshold of $α = s_{n}(R_{1}, R_{2})$. The thresholding parameter $α$ was reduced only by $0.01$ as an error tolerance. This is how we tested the effectiveness of the Q-gram filter as a lower bound on the FRMS method.

A demonstration of the results of the numerical study for this group of matching methods is provided in Table 4.
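The three padding variants used in the ablation study can be illustrated with a short sketch. This is only an assumption-laden example: the function name `qgrams` and the pad symbols `#` and `$` are hypothetical choices, and the filters themselves (Theorem 10, Corollaries 10 and 11) are not reproduced here.

```python
from typing import List


def qgrams(token: str, q: int, padding: int = 0,
           prefix_char: str = "#", suffix_char: str = "$") -> List[str]:
    """Return the overlapping q-grams of `token`.

    padding = 0: no padding
    padding = 1: a single prefix character
    padding = 2: a single prefix and a single postfix character
    The pad symbols are illustrative assumptions.
    """
    if padding not in (0, 1, 2):
        raise ValueError("padding must be 0, 1 or 2")
    padded = token
    if padding >= 1:
        padded = prefix_char + padded
    if padding == 2:
        padded = padded + suffix_char
    # A string shorter than q yields no q-grams (empty range).
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


no_pad = qgrams("cache", 3)               # ['cac', 'ach', 'che']
prefix_only = qgrams("cache", 3, 1)       # adds '#ca' at the front
prefix_and_suffix = qgrams("cache", 3, 2)  # additionally adds 'he$' at the end
```

Each padding character adds one Q-gram that anchors a token boundary, which is why padding enlarges the feature set for short tokens.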

5.3. Time and Space Complexity

As already shown in Figure 6, the Q-gram filter and the FRMS together perform fuzzy matching in real time. The speed depends on the threshold parameter $α$, which determines the size of the pool of potential candidates passed to the second stage (FRMS). Testing was performed to compare relative time complexity on a single core of an Intel i7 11370H device with a maximum turbo frequency of 4.80 GHz and 16 GB of RAM. The results of these tests are presented in Table 3 and Table 5. They are only meant to demonstrate the real-time capability of the proposed system architecture in comparison to the trivial configuration of individual similarity functions. The overall result of 220 ms for the Q-gram filter combined with FRMS was measured over all datasets using the system architecture shown in Figure 5. In the following, we analyze the time complexity of the FRMS and of the Q-gram filter separately.

5.3.1. FRMS

If we are given two tokens of sizes $|X_{i}|$ and $|Y_{j}|$ (keeping the previous notation), then the normalized edit similarity metric is computed by a dynamic programming algorithm with a time complexity of $O(|X_{i}| |Y_{j}|)$ and a space complexity of $Θ(|X_{i}| |Y_{j}|)$. Since we use an implementation that has been highly optimized for production systems, the cost matrix is reduced to a short vector, so our implementation achieves a space complexity of only $Θ(|Y_{j}|)$. The construction of the adjacency matrix for the complete bipartite graph takes $O(|R_{1}| |R_{2}|)$ and requires an allocation of $Θ(|R_{1}| |R_{2}|)$. As discussed before, the assignment problem is then solved on this adjacency matrix by the Kuhn–Munkres algorithm, incurring $O(|V|^{3})$. For simplicity, let us generally denote the number of elements by $n$. We obtain the total time complexity

O(|R_{1}| |R_{2}|) \, O(|X_{i}| |Y_{j}|) + O(|V|^{3}) \approx O(n^{2}) \, O(n^{2}) + O(n^{3}) = O(n^{4}).

(69)
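The reduction of the cost matrix to a short vector can be sketched as follows. This is a generic textbook formulation with unit edit costs, which is an assumption; it is not the optimized production implementation, and it computes only the raw edit distance rather than the normalized similarity metric.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance with unit costs.

    Only two rows of length |y| + 1 are kept at any time, so the space
    is Theta(|y|) while the running time stays O(|x| |y|).
    """
    previous = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        current = [i]  # cost of deleting the first i characters of x
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (cx != cy),  # substitution, or match at cost 0
            ))
        previous = current
    return previous[-1]


distance = edit_distance("kitten", "sitting")
```

Swapping the arguments so that the shorter token plays the role of `y` keeps the retained rows as short as possible.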

5.3.2. Q-Gram Filter

We consider the building of a Q-gram inverted index, whose cost depends on the algorithm used. We also treat the T-similarity join problem as a merging of posting lists. In addition, each query record, $R_{1}$, requires the construction of its Q-gram set, $|Q_{R_{1}}|$, incurring $O(n)$. Such an architecture is quite common in the majority of production systems of real-world search engines, so it is not our intention to discuss it in detail. We concentrate only on the pre-calculation of the filter while the inverted index is being built, which is simply an operation added to the existing algorithms for constructing the inverted index. To speed up the computation of $\hat{t}_{M}$ to $O(1)$, we must keep within the index all the pre-calculated discrete distribution functions of ascending sorted lengths, $F_{R_{2}}$, for each record, $R_{2}$. Hence, the required space is an array of integers of the average record size, $Θ(|{R_{2}}_{avg}|)$, for each record, $R_{2}$. Note that, with an efficient encoding of unsigned integers and a bounded maximum record size in the production system, this could take only a few bytes per record. Access to the two-dimensional array, $F_{R_{2}}[r][|M|]$, is clearly in $O(1)$, where the index $r$ denotes the record position. Finally, the calculation of $\hat{t}_{M}$ consists of a few multiplication and summation operations; therefore, we obtain a total of $O(1)$ for the calculation of the Q-gram filter threshold for each record, $R_{2}$. The Q-gram filter and FRMS together have a time complexity of $(1 - γ) \cdot O(1) + γ \cdot O(n^{4})$, where $γ$ is the small fraction of records that remain as potential candidates, as shown in Figure 6 and Table 6.
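The first stage (inverted index plus merging of posting lists) can be illustrated with a deliberately simplified sketch. All names are hypothetical, Q-grams are taken over the whole record string without padding, and the parameter `min_shared` is only a stand-in for the per-record threshold $\hat{t}_{M}$, whose formula is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def qgram_set(record: str, q: int) -> Set[str]:
    """Distinct character q-grams of a record (no padding, a simplification)."""
    return {record[i:i + q] for i in range(len(record) - q + 1)}


def build_index(records: List[str], q: int) -> Dict[str, List[int]]:
    """Map each q-gram to the posting list of record ids that contain it."""
    index = defaultdict(list)
    for rid, record in enumerate(records):
        for gram in qgram_set(record, q):
            index[gram].append(rid)
    return index


def candidates(query: str, index: Dict[str, List[int]], q: int,
               min_shared: int) -> List[int]:
    """Ids of records sharing at least `min_shared` distinct q-grams with the query."""
    shared = Counter()
    for gram in qgram_set(query, q):
        shared.update(index.get(gram, ()))  # merge the posting list of this q-gram
    return sorted(rid for rid, count in shared.items() if count >= min_shared)


records = ["rozinet software", "software company", "bird park"]
index = build_index(records, q=2)
strict = candidates("rozinek software", index, q=2, min_shared=10)
loose = candidates("rozinek software", index, q=2, min_shared=5)
```

Only the records returned by `candidates` would be passed to the second, more expensive stage; this surviving share corresponds to the fraction $γ$ in the combined complexity.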

6. Discussion

Our mathematical models have been developed in a newly proposed similarity space instead of a metric space. It has been proven, by means of various duality theorems, that the two formulations are fully interchangeable. A similarity space is a more direct and simpler way to formulate our tasks, and it avoids the misunderstandings caused by using the dual notation with the distance metric [43] in corresponding areas such as fuzzy matching, similarity search, and similarity join.

First, the articles cited above show that there is only one possible way to solve the token assignment problem between records $R_{1}$ and $R_{2}$ symmetrically in approximate record matching: maximum weight matching on a bipartite graph.

Secondly, we do not simplify the derivation of the lower bound of the Q-gram filter by applying the lower bound to the entire record as if it were a single token, according to the formula in Theorem 8; rather, we derive a much more accurate bound by assuming that the record consists of multiple tokens (Theorems 9 and 10). In this article, we have shown that maximizing the edit distance loss function among tokens under the constraint $α$ is an integer linear programming problem; additionally, we find that a more precise estimation exists for record matching. We have found an optimal solution for the Q-gram filter that cannot be further improved.

In our paper, we mainly focus on comparing two strings by their textual similarity. Our research does not include the context of the data, which would be the domain of supervised (deep learning) or unsupervised (e.g., TF-IDF, BM25) learning. The main focus of the research is to establish a baseline metric between two known records, without the context of the source or document collection they come from.

7. Conclusions

In this work, we have introduced several novel contributions to the field of record similarity and text mining:

Fuzzy Record Similarity Metric (FRMS): We developed the FRMS, a robust metric for measuring approximate record similarity. The FRMS adheres to key mathematical principles, making it highly suitable for applications in text mining and cluster analysis. Our experiments demonstrate its superior performance compared to existing methods without the need for parameter tuning.
Optimal Q-gram filter: We proposed an optimal Q-gram filtering method for token matching, ensuring the most efficient filtering based on shared token features. This filter serves as a foundational tool that is applicable to various similarity metrics.
Approximate Q-gram filter: To enhance computational efficiency, we introduced an approximate Q-gram filter that operates in constant time. This approximation maintains high accuracy while significantly reducing processing time, as evidenced by our experimental results.
Filter efficiency and properties: We analyzed the efficiency and key properties of the Q-gram filter, highlighting its effectiveness in relation to string lengths and similarity thresholds. Our analysis provides insights into optimizing filter performance for different applications.
Padding extension of the Q-gram filter: We enhanced the Q-gram filter by incorporating padding techniques, which improve similarity measures by smoothing token boundaries and expanding the feature set. This extension leads to more accurate similarity assessments.

Overall, our contributions provide advanced tools and methodologies for accurate and efficient record similarity measurement, with significant implications for text mining and data analysis applications.

Author Contributions

Conceptualization, O.R.; methodology, O.R.; software, O.R. and J.P.; experiments, O.R. and J.P.; investments, J.M. (Jan Mareš); writing—original draft preparation, O.R. and J.M. (Jaroslav Marek); writing—review and editing, O.R., J.M. (Jaroslav Marek) and J.M. (Jan Mareš); supervision, J.M. (Jaroslav Marek) and J.M. (Jan Mareš). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Links to the datasets used are given in the text of this paper.

Acknowledgments

We would like to thank the software technology company Rozinet s.r.o. (www.rozinet.net) for their support, without which this article would not have been possible; they provided technical resources, data, and space for research. Further, we would like to thank Pavel Kříž (Charles University, Faculty of Mathematics and Physics) for their help with the technical mathematical formalism and corrections.

Conflicts of Interest

Author Ondřej Rozinek was employed by the company Rozinet s.r.o. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A

Appendix A.1. Example of Asymmetric Matching

Monge and Elkan [39] proposed recursive field matching, which measures the similarity between two strings, $X$ and $Y$. Each string is broken up into token sets $X = \{X_{1}, X_{2}, \dots, X_{n}\}$ and $Y = \{Y_{1}, Y_{2}, \dots, Y_{m}\}$. Then, the similarity is expressed by

\mathrm{sim}_{ME}(X, Y) = \frac{1}{n} \sum_{i = 1}^{n} \max_{j = 1}^{m} \mathrm{sim}'(X_{i}, Y_{j}),

(A1)

where $\mathrm{sim}'$ is an internal similarity function capable of calculating the similarity between two individual tokens. This measure is independent of the sequential order of the tokens. The equation approximates the solution to the assignment problem in combinatorial optimization. Unfortunately, the Monge–Elkan approximation is not a symmetric similarity function, as shown in Figure A1.

Figure A1. Two examples of an asymmetric Monge–Elkan measure $\mathrm{sim}_{ME}$.
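The asymmetry of Equation (A1) is easy to reproduce numerically. In the following minimal sketch, the inner similarity is a stand-in based on Python's `difflib` ratio, which is an assumption made only to keep the example self-contained; the token sets are made up and are not those of Figure A1.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence


def token_similarity(a: str, b: str) -> float:
    """Stand-in inner similarity sim' in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(x_tokens: Sequence[str], y_tokens: Sequence[str],
                inner: Callable[[str, str], float] = token_similarity) -> float:
    """Average, over the tokens of X, of the best inner similarity to a token of Y."""
    return sum(max(inner(x, y) for y in y_tokens) for x in x_tokens) / len(x_tokens)


X = ["ab"]
Y = ["ab", "cd"]
forward = monge_elkan(X, Y)   # the single token of X finds an exact match
backward = monge_elkan(Y, X)  # "cd" has no counterpart in X and lowers the average
```

Here `forward` equals 1.0 while `backward` equals 0.5, because the average is normalized by the number of tokens of the first argument only.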

Appendix A.2. Example of Maximum Weighted Bipartite Matching

Suppose two records are given:

\begin{aligned} X &= \{\text{"Rozinet"}, \text{"software"}, \text{"company"}\}; \\ Y &= \{\text{"Software"}, \text{"from"}, \text{"Rozinek's"}, \text{"company"}\}. \end{aligned}

(A2)

We compute the adjacency matrix of the complete bipartite graph between the records broken into token sets $X$ and $Y$:

\begin{bmatrix} s_{n}(X_{1}, Y_{1}) & s_{n}(X_{1}, Y_{2}) & s_{n}(X_{1}, Y_{3}) & s_{n}(X_{1}, Y_{4}) \\ s_{n}(X_{2}, Y_{1}) & s_{n}(X_{2}, Y_{2}) & s_{n}(X_{2}, Y_{3}) & s_{n}(X_{2}, Y_{4}) \\ s_{n}(X_{3}, Y_{1}) & s_{n}(X_{3}, Y_{2}) & s_{n}(X_{3}, Y_{3}) & s_{n}(X_{3}, Y_{4}) \end{bmatrix} = \begin{bmatrix} 0.36 & 0.29 & \mathbf{0.76} & 0.4 \\ \mathbf{1.0} & 0.26 & 0.39 & 0.43 \\ 0.43 & 0.29 & 0.36 & \mathbf{1.0} \end{bmatrix}.

(A3)

Here, each element of the matrix is calculated with the normalized edit similarity metric from Theorem 5 for the corresponding token pair, that is, the edge $(X_{i}, Y_{j})$. The numbers in bold are the result of the maximum weight matching on the bipartite graph obtained by the Kuhn–Munkres algorithm, graphically drawn in Figure A2.

The FRMS from Definition 6 is then calculated as follows:

s_{n} (R_{1}, R_{2}) = \frac{\sum_{(i, j) \in M} s_{n} (X_{i}, Y_{j})}{| M |} = \frac{0.76 + 1.0 + 1.0}{3} = 0.92 .

(A4)

On the other hand, we consider the fuzzy Jaccard similarity from Definition 3 with an additional local threshold on the tokens, $δ = 0.8$, and we obtain quite a different result in this special case:

{sim}_{J} (R_{1}, R_{2}) = \frac{| R_{1} {\tilde{\cap}}_{δ = 0.8} R_{2} |}{| R_{2} | + | R_{1} | - | R_{1} {\tilde{\cap}}_{δ = 0.8} R_{2} |} = \frac{2.0}{3 + 4 - 2.0} = 0.4 .

(A5)

Figure A2. Maximum weighted bipartite matching between two records $X$ and $Y$.

If we define the threshold $α = 0.8$, then the FRMS score of $0.92$ lies above the threshold, and so we classify the pair as a match. Note that $|R_{1}| \neq |R_{2}|$; hence, we obtained a near-perfect matching, $M$, with one exposed vertex, $Y_{2} = \text{"from"}$. The fuzzy Jaccard similarity is quite sensitive to the number of exposed vertices: the final score goes down in direct proportion to the number of exposed vertices. Moreover, the token score $s_{n}(X_{1}, Y_{3}) = 0.76$ is below the token threshold, $δ = 0.8$, so this edge is declined. The final score, $0.4$, is classified as a non-match. We constructed this special example to demonstrate the main disadvantages of the fuzzy Jaccard similarity and of the use of a local threshold on the token level, $δ$.
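The numbers of this example can be re-derived from the matrix in Equation (A3) with a short sketch. Note the swapped-in technique: instead of the Kuhn–Munkres algorithm, the code below uses an exhaustive search over assignments, which is feasible only for such tiny matrices. Setting edges below $δ$ to zero before matching is our reading of "this edge is declined" and should be treated as an assumption.

```python
from itertools import permutations
from typing import List, Tuple

# Token similarities from Equation (A3): rows X_1..X_3, columns Y_1..Y_4.
S = [
    [0.36, 0.29, 0.76, 0.40],
    [1.00, 0.26, 0.39, 0.43],
    [0.43, 0.29, 0.36, 1.00],
]


def max_weight_matching(weights: List[List[float]]) -> Tuple[float, Tuple[int, ...]]:
    """Exhaustive search for the best assignment of rows to distinct columns.

    Assumes rows <= columns. The cost is factorial, so this is a stand-in
    for the Kuhn-Munkres algorithm on toy inputs only.
    """
    rows, cols = len(weights), len(weights[0])
    best = max(permutations(range(cols), rows),
               key=lambda p: sum(weights[i][j] for i, j in enumerate(p)))
    return sum(weights[i][j] for i, j in enumerate(best)), best


total, assignment = max_weight_matching(S)
frms = total / len(assignment)  # Equation (A4): mean weight of the matched edges

delta = 0.8  # local token threshold of the fuzzy Jaccard similarity
thresholded = [[w if w >= delta else 0.0 for w in row] for row in S]
overlap, _ = max_weight_matching(thresholded)
fuzzy_jaccard = overlap / (len(S) + len(S[0]) - overlap)  # Equation (A5)
```

The assignment found is $X_{1} \to Y_{3}$, $X_{2} \to Y_{1}$, $X_{3} \to Y_{4}$ (zero-based columns 2, 0, 3), giving an FRMS of about 0.92 and a fuzzy Jaccard score of 0.4, in agreement with Equations (A4) and (A5).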

Appendix A.3. Example of Violation of the Triangle Inequality

We know that a normalized similarity metric is given by the relation $s_{n}(x, y) = 1 - d_{n}(x, y)$. Hence, it suffices to examine whether the term $d_{n}(x, y) = \frac{d(x, y)}{\max\{|x|, |y|\}}$ is a distance metric. Recall the definition.

A metric space is an ordered pair, $(X, d)$, where $X$ is a set and $d$ is a function $d : X \times X \to \mathbb{R}^{+}$ that is a metric, i.e., such that, for any $x, y, z \in X$, the following hold:

(D1): $d (x, y) = d (y, x)$ (symmetry);
(D2): $d (x, z) \leq d (x, y) + d (y, z)$ (triangle inequality);
(D3): $d (x, y) = 0 \Leftrightarrow x = y$ (identity of indiscernibles).

The edit distance is a solution to a dynamic programming problem and can only be obtained numerically. The task of the embeddability of the edit distance into $l_{p}$ norms is still an open problem. It has been shown that such a metric cannot be embedded into the $l_{1}$ norm (with arbitrary dimension) with distortion better than $\frac{3}{2}$ [61].

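Let us rather illustrate this with the example of the strings $X = \text{"ab"}$, $Y = \text{"abc"}$, and $Z = \text{"bc"}$. The first line below holds for the edit distance ($2 \leq 1 + 1$), whereas substituting the values into the normalized inequality of the second line yields the violation in the third line:

\begin{aligned} d(X, Z) &\leq d(X, Y) + d(Y, Z), \\ \frac{d(X, Z)}{\max\{|X|, |Z|\}} &\leq \frac{d(X, Y)}{\max\{|X|, |Y|\}} + \frac{d(Y, Z)}{\max\{|Y|, |Z|\}}, \\ \frac{2}{2} &\nleq \frac{1}{3} + \frac{1}{3}. \end{aligned}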

(A6)
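The counterexample in Equation (A6) can be checked mechanically. The sketch below assumes the unit-cost Levenshtein distance for $d$ and uses exact fractions to avoid rounding; the function names are hypothetical.

```python
from fractions import Fraction


def edit_distance(x: str, y: str) -> int:
    """Unit-cost Levenshtein distance (two-row dynamic programming)."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cx != cy)))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(x: str, y: str) -> Fraction:
    """d(x, y) / max(|x|, |y|); assumes at least one string is non-empty."""
    return Fraction(edit_distance(x, y), max(len(x), len(y)))


X, Y, Z = "ab", "abc", "bc"
lhs = normalized_distance(X, Z)                              # 2/2
rhs = normalized_distance(X, Y) + normalized_distance(Y, Z)  # 1/3 + 1/3
violated = lhs > rhs  # True: the triangle inequality fails for d / max
```

The unnormalized distances satisfy $2 \leq 1 + 1$, so the violation is introduced purely by the division by the maximum length.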

References

1. Raeesi, M.; Asadpour, M.; Shakery, A. Swash: A collective personal name matching framework. Expert Syst. Appl. 2020, 147, 113115.
2. Bilenko, M.; Mooney, R.; Cohen, W.; Ravikumar, P.; Fienberg, S. Adaptive Name Matching in Information Integration. IEEE Intell. Syst. 2003, 18, 16–23.
3. Elmagarmid, A.K.; Ipeirotis, P.G.; Verykios, V.S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng. 2008, 19, 1–16.
4. Gali, N.; Istodor, R.M.; Hostettler, D.; Fränti, P. Framework for syntactic string similarity measures. Expert Syst. Appl. 2019, 129, 169–185.
5. Russell, R.C. Index. U.S. Patent 1,261,167, 2 April 1918.
6. Philips, L. Hanging on the Metaphone. Comput. Lang. Mag. 1990, 7, 39–44.
7. Philips, L. The Double Metaphone Search Algorithm. C/C++ Users J. 2000, 18, 38–43.
8. Lait, A.; Randell, B. An Assessment of Name Matching Algorithms; Tech. Rep. 176; Department of Computer Science, University of Newcastle upon Tyne: Newcastle upon Tyne, UK, 1993.
9. Gadd, T.N. PHONIX: The algorithm. Program Autom. Libr. Inf. Syst. 1990, 24, 363–366.
10. Taft, R.L. Name Search Techniques. In Technical Report Special Report No. 1, State Identification and Intelligence System; Bureau of Systems Development: Albany, NY, USA, 1970.
11. Holmes, D.; McCabe, C.M. Improving precision and recall for soundex retrieval. In Proceedings of the IEEE International Conference on Information Technology: Coding and Computing (ITCC), Las Vegas, NV, USA, 8–10 April 2002.
12. Christen, P. A Comparison of Personal Name Matching: Techniques and Practical Issues; Technical Reports; The Australian National University: Canberra, Australia, 2006.
13. Jaro, M.A. Advances in record linkage methodology as applied to the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 1989, 84, 414–420.
14. Winkler, W.E. String Comparator Metrics and Enhanced Decision Rules in the Fellegi–Sunter Model of Record Linkage. In Proceedings of the Annual Meeting of the American Statistical Association, Anaheim, CA, USA, 6–9 August 1990; pp. 354–359.
15. Cohen, W.W.; Ravikumar, P.; Fienberg, S.E. A Comparison of String Distance Metrics for Name-Matching Tasks. In Proceedings of the 2003 International Joint Conferences on Artificial Intelligence (IJCAI-03), Acapulco, Mexico, 9–10 August 2003; pp. 73–78.
16. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
17. Brill, E.; Moore, R.C. An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, Hong Kong, China, 3–6 October 2000; pp. 32–58.
18. Damerau, F.J. A technique for computer detection and correction of spelling errors. Commun. ACM 1967, 7, 659–664.
19. Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
20. Smith, T.F.; Waterman, M.S. Identification of Common Molecular Subsequences. J. Mol. Biol. 1981, 147, 195–197.
21. Gotoh, O. An improved algorithm for matching biological sequences. J. Mol. Biol. 1982, 162, 705–708.
22. Ristad, E.S.; Yianilos, P.N. Learning string-edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 522–532.
23. Marzal, A.; Vidal, E. Computation of Normalized Edit Distance and Applications. IEEE Trans. Pattern Anal. Mach. Intell. 1993, 15, 926–932.
24. Li, Y.; Liu, B. A Normalized Levensthein Distance Metric. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 1091–1095.
25. Weigel, A.; Fein, F. Normalizing the weighted edit distance. In Proceedings of the 12th IAPR International Conference on Pattern Recognition, Jerusalem, Israel, 9–13 October 1994; p. 3.
26. Kondrak, G. N-Gram Similarity and Distance. In String Processing and Information Retrieval; Consens, M., Navarro, G., Eds.; SPIRE 2005: Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; Volume 3772, pp. 115–126.
27. Ukkonen, E. Approximate String Matching with Q-grams and Maximal Matches. Theor. Comput. Sci. 1992, 92, 191–211.
28. Christen, P. A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Trans. Knowl. Data Eng. 2012, 24, 1537–1555.
29. Papadakis, G.; Skoutas, D.; Thanos, E.; Palpanas, T. Blocking and Filtering Techniques for Entity Resolution: A Survey. ACM Comput. Surv. 2020, 53, 31.
30. Yu, M.; Li, G.; Deng, D. String similarity search and join: A survey. Front. Comput. Sci. 2016, 10, 399–417.
31. Hosseini, K.; Nanni, F.; Ardanuy, M.C. DeezyMatch: A flexible deep learning approach to fuzzy string matching. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 62–69.
32. Ferragina, P.; Scaiella, U.T. On-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 1625–1628.
33. Santos, R.; Murrieta-Flores, P.; Calado, P.; Martins, B. Toponym matching through deep neural networks. Int. J. Geogr. Inf. Sci. 2018, 32, 324–348.
34. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555.
35. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
36. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
37. Qiu, L.; Yu, J.; Pu, Q.; Xiang, C. Knowledge entity learning and representation for ontology matching based on deep neural networks. Clust. Comput. 2017, 20, 969–977.
38. Jimenez, S.; Becerra, C.; Gelbukh, A.; Gonzales, F. Generalized Mongue–Elkan Method for Approximate Text String Comparison. In Proceedings of the Computational Linguistics and Intelligent Text Processing, 10th International Conference, Mexico City, Mexico, 1–7 March 2009; pp. 559–570.
39. Monge, A.E.; Elkan, C.P. The field matching problem: Algorithms and applications. In Proceedings of the KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 267–270.
40. Moreau, E.; Yvon, F.; Capp, E.O. Robust similarity measures for named entities matching. In Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, UK, 18–22 August 2008; Volume 1, pp. 593–600.
41. Wang, J.; Li, G.; Feng, J. Fast-Join: An Efficient Method for Fuzzy Token Matching based String Similarity Join. In Proceedings of the IEEE 27th International Conference on Data Engineering, Hannover, Germany, 11–16 April 2011; Volume 39, p. 7.
42. Wang, J.; Li, G.; Feng, J. Extending string similarity join to tolerant fuzzy token matching. ACM Trans. Database Syst. 2014, 39, 7.
43. Rozinek, O.; Mareš, J. The Duality of Similarity and Metric Spaces. Appl. Sci. 2021, 11, 1910.
44. Deng, D.; Kim, A.; Madden, S.; Stonebraker, M. SILKMOTH: An efficient method for finding related sets with maximum matching constraints. Proc. VLDB Endow. 2017, 10.
45. Gragera, A.; Suppakitpaisarn, V. Relaxed triangle inequality ratio of the Sørensen–Dice and Tversky indexes. Theor. Comput. Sci. 2018, 718, 37.
46. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
47. Kuhn, H.W. Variants of the Hungarian method for assignment problems. Nav. Res. Logist. Q. 1956, 3, 253–258.
48. Munkres, J. Algorithms for the assignment and transportation problems. J. Soc. Ind. Appl. Math. 1957, 5, 32–38.
49. Duan, R.; Pettie, S. Linear-Time Approximation for Maximum Weight Matching. J. ACM 2014, 61, 1–23.
50. Edmonds, J.; Karp, R.M. Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems. J. ACM 1972, 19, 31–33.
51. Xiao, C.; Wang, W.; Lin, X.; Yu, J.X.; Wang, G. Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. 2011, 36, 1.
52. Jokinen, P.; Ukkonen, E. Two algorithms for approximate string matching in static texts. In Proceedings of the International Symposium on Mathematical Foundations of Computer Science, Kazimierz Dolny, Poland, 9–13 September 1991; pp. 240–248.
53. Yang, Z.; Yu, J.; Kitsuregawa, M. Fast Algorithms for Top-k Approximate String Matching. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence 2010, Atlanta, GA, USA, 11–15 July 2010; pp. 1467–1473.
54. Rasmussen, K.R.; Stoye, J.; Myers, E.W. Efficient Q-gram Filters for Finding All ϵ-Matches over a Given Length. J. Comput. Biol. 2006, 13, 296–308.
55. Grzebala, P.; Cheatham, M. Private record linkage: Comparison of selected techniques for name matching. Eur. Semant. Web Conf. 2016.
56. Sababa, H.; Stassopoulou, A. A Classifier to Distinguish Between Cypriot Greek and Standard Modern Greek. In Proceedings of the 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), Valencia, Spain, 15–18 October 2018.
57. Rozinek, O.; Borkovcova, M.; Mareš, J. BipartiteJoin: Optimal Similarity Join for Fuzzy Bipartite Matching. In Proceedings of the World Conference on Information Systems and Technologies, Pisa, Italy, 4–6 April 2023; pp. 171–180.
58. Rozinek, O.; Borkovcova, M.; Mareš, J. Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data. In Proceedings of the World Conference on Information Systems and Technologies, Pisa, Italy, 4–6 April 2023; pp. 181–191.
59. Hu, T.C.; Kahng, A.B. Linear and Integer Programming Made Easy; Springer: Berlin, Germany, 2016.
60. Okazaki, N.; Tsuji, J. Simple and Efficient Algorithm for Approximate Dictionary Matching. In Proceedings of the 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010.
61. Andoni, A.; Deza, M.; Gupta, A.; Indyk, P.; Raskhodnikova, A. Lower bounds for embedding edit distance into normed spaces. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Baltimore, MD, USA, 12–14 January 2003.

Figure 1. The construction of a complete bipartite graph where every token vertex of the first record set, $R_{1}$, is connected to every token vertex of the second record set, $R_{2}$.

Figure 2. Maximum weighted bipartite matching of two records $R_{1}$ and $R_{2}$ and adjacent edges of token pairs $X_{i}, Y_{j}$ weighted by the normalized similarity metric $s_{n}(X_{i}, Y_{j})$.

Figure 3. An example of a singularity of a trigram filter for $d = 1$ and $|X|_{sing} = 5$. Suppose the worst case for fixed $d = 1$ is the substitution in the position of character C destroying all trigrams {“CAC”, “ACH”, “CHE”}; hence, $t = 0$.

Figure 4. Sawtooth function of different string lengths $|X|$ and fixed $q$ and $α$. The zero crossings of $t_{α}$ illustrate the singularities of the Q-gram filter. This is similar to the results in [54].

Figure 5. Production system architecture of fuzzy search/matching engine.

Figure 6. Block diagram of the processing of records from the source in real-time by a two-stage system of Q-gram filter and FRMS.

Figure 7. Relative performance of selected similarity functions from the group of hybrid, edit, and Q-gram similarities.

Figure 8. Comparison of the relative performance of Q-gram filter+FRMS and hybrid similarity functions.

Figure 9. Comparison of the relative performance of Q-gram filter+FRMS in an ablation study.

Table 1. Overview of fuzzy token similarity functions and their corresponding constraints for the integer linear programming problem, as discussed in [60].

Similarity Measure	Subject to $\sum_{(i, j) \in M} α_{i, j}$
Fuzzy Dice	$\frac{α}{2} (| R_{1} | + | R_{2} |)$
Fuzzy Cosine	$α \sqrt{| R_{1} | | R_{2} |}$
Fuzzy Jaccard	$\frac{α}{1 + α} (| R_{1} | + | R_{2} |)$
Fuzzy Overlap/FRMS	$α \min \{| R_{1} |, | R_{2} |\}$

Table 2. Datasets used in experiments from original sources [15].

Name	Number of Strings	Name	Number of Strings
Animal	5709	Game	911
Bird Kunkel	336	Park	654
Bird Nybird	982	Restaurant	863
Bird Scott1	38	Ucd-people	90
Bird Scott2	719	Census	841
Business	2139

Table 3. Comparison of selected similarity functions ranked in descending order of F1-score.

Similarity	F1-Score	Similarity	F1-Score
FRMS	85.09%	Smith–Waterman	75.71%
3-gram filter+FRMS	85.01%	Smith–Waterman-Gotoh	75.54%
2-gram filter+FRMS	84.88%	Jaro	75.29%
Fuzzy Jaccard (Levenshtein $δ = 0.8$ )	84.17%	Overlap 3-gram	73.21%
Jaro–Winkler	81.45%	Jaccard 2-gram	71.05%
L2 Monge–Elkan (Levenshtein)	80.80%	Dice 2-gram	71.05%
Damerau–Levenshtein	76.86%	Jaccard 3-gram	70.86%
Levenshtein	76.83%	Dice 3-gram	70.86%
Needleman–Wunsch	76.25%	Overlap 2-gram	66.92%

Table 4. Ablation study of the combination of Q-gram filters with FRMS evaluated by F1-score.

Similarity	F1-Score	Similarity	F1-Score
FRMS	85.09%	naive 2-gram ( $| p | = 2$ )+FRMS	34.75%
3-gram ( $| p | = 2$ )+FRMS	85.01%	naive 2-gram ( $| p | = 1$ )+FRMS	34.33%
2-gram ( $| p | = 2$ )+FRMS	84.88%	naive 2-gram ( $| p | = 0$ )+FRMS	31.69%
3-gram ( $| p | = 1$ )+FRMS	84.84%	naive 3-gram ( $| p | = 2$ )+FRMS	29.07%
2-gram ( $| p | = 1$ )+FRMS	84.74%	naive 3-gram ( $| p | = 1$ )+FRMS	29.07%
3-gram ( $| p | = 0$ )+FRMS	84.71%	naive 3-gram ( $| p | = 0$ )+FRMS	29.07%
2-gram ( $| p | = 0$ )+FRMS	84.38%

Table 5. Relative time complexity.

Similarity	Elapsed Time	Similarity	Elapsed Time
Levenshtein	13 s:426 ms	L2 Monge–Elkan (Levenshtein)	14 s:209 ms
Damerau–Levenshtein	22 s:824 ms	Jaccard 2-gram	10 s:542 ms
Jaro	3 s:902 ms	Jaccard 3-gram	9 s:829 ms
Jaro–Winkler	3 s:772 ms	Dice 2-gram	11 s:95 ms
Needleman–Wunsch	28 s:170 ms	Dice 3-gram	10 s:717 ms
Smith–Waterman	28 s:600 ms	Overlap 3-gram	10 s:251 ms
FRMS	13 s:474 ms	Overlap 2-gram	11 s:549 ms
Q-Gram Filter+FRMS	0 s:220 ms	Fuzzy Jaccard ( $δ = 0.8$ )	12 s:824 ms

Table 6. Time and space complexity of selected similarity functions.

Similarity	Time Complexity	Space Complexity
Q-gram filter + FRMS	$(1 - γ) \cdot O (1) + γ \cdot O (n^{4})$	$O (n^{2})$
FRMS	$O (n^{4})$	$O (n^{2})$
Smith–Waterman-Gotoh	$O (n m)$	$O (n m)$
Fuzzy Jaccard (Levenshtein $δ = 0.8$ )	$O (n m)$	$O (n m)$
Jaro	$O (n + m)$	$O (n + m)$
Jaro–Winkler	$O (n + m)$	$O (n + m)$
L2 Monge–Elkan (Levenshtein)	$O (n m)$	$O (n m)$
Damerau–Levenshtein	$O (n m)$	$O (n m)$
Levenshtein	$O (n m)$	$O (n m)$
Needleman–Wunsch	$O (n m)$	$O (n m)$
Smith–Waterman	$O (n m)$	$O (n m)$
Q-gram (2-gram, 3-gram)	$O (n)$	$O (n)$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Rozinek, O.; Marek, J.; Panuš, J.; Mareš, J. Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter. Algorithms 2025, 18, 150. https://doi.org/10.3390/a18030150

