A Java library of similarity and distance metrics, e.g. Levenshtein distance and cosine similarity. All similarity metrics return normalized values between 0 and 1 rather than unbounded similarity scores. Distance metrics return non-negative, unbounded scores.
For quick and easy use, StringMetrics and StringDistances provide a collection of well-known similarity and distance metrics.
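To make the difference between a distance and a normalized similarity concrete, here is a minimal sketch in plain Java. It does not use the library. The class name LevenshteinSketch and the normalization (1 minus the distance divided by the longer length) are assumptions chosen for illustration, and the library's own implementation may differ.

```java
// Illustrative sketch only; not the library's implementation.
final class LevenshteinSketch {

    // Classic dynamic-programming edit distance: non-negative and unbounded.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // cost of building a prefix of b from the empty string
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // cost of deleting the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    // Assumed normalization: 1 - distance / length of the longer string.
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are treated as identical
        }
        return 1.0f - (float) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting"));   // 3
        System.out.println(similarity("kitten", "sitting")); // about 0.571
    }
}
```

The distance grows with the length of the inputs, while the similarity stays in the range 0 to 1, which makes scores comparable across string pairs of different lengths.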
String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";
StringMetric metric = StringMetrics.cosineSimilarity();
float result = metric.compare(str1, str2); //0.4767
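As a rough illustration of where a score like this comes from, the sketch below computes cosine similarity over whitespace-separated token counts in plain Java. It does not call the library. The class CosineSketch and its methods are hypothetical, and the choice of whitespace tokenization with token counts is an assumption about the default, although it does yield about 0.4767 for the two sentences above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not the library's implementation.
final class CosineSketch {

    // Count how often each whitespace-separated token occurs.
    static Map<String, Integer> tokenCounts(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Euclidean length of the token-count vector.
    static double norm(Map<String, Integer> counts) {
        double sum = 0;
        for (int c : counts.values()) {
            sum += (double) c * c;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between the two token-count vectors.
    static float cosine(String a, String b) {
        Map<String, Integer> ca = tokenCounts(a);
        Map<String, Integer> cb = tokenCounts(b);
        if (ca.isEmpty() && cb.isEmpty()) {
            return 1.0f; // assumption: two empty inputs count as identical
        }
        if (ca.isEmpty() || cb.isEmpty()) {
            return 0.0f;
        }
        double dot = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            dot += e.getValue() * cb.getOrDefault(e.getKey(), 0);
        }
        return (float) (dot / (norm(ca) * norm(cb)));
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(cosine(str1, str2)); // about 0.4767
    }
}
```

Note that "sentence." and "sentence" are different tokens here because punctuation is left in place, so they do not contribute to the score.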
StringMetricBuilder and StringDistanceBuilder are convenience tools for building string similarity and distance metrics. Any class implementing Metric or Distance, respectively, can be used as the basis of a string metric or string distance. The builders support simplification, tokenization, token filtering, token transformation, and caching. For usage, see the examples section.
For a terse syntax, use a static import: import static org.simmetrics.builders.StringMetricBuilder.with;
String str1 = "This is a sentence. It is made of words";
String str2 = "This sentence is similar. It has almost the same words";
StringMetric metric =
        with(new CosineSimilarity<String>())
        .simplify(Simplifiers.toLowerCase(Locale.ENGLISH))
        .simplify(Simplifiers.replaceNonWord())
        .tokenize(Tokenizers.whitespace())
        .build();
float result = metric.compare(str1, str2); //0.5720
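The order of the stages matters: strings are simplified first, then tokenized, and only then compared. The plain Java sketch below mimics that pipeline without the library. The class PipelineSketch is hypothetical, and it assumes that replacing non-word characters means substituting a space for every character matching the regex \W; the library's simplifier may behave differently.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; not the library's builder.
final class PipelineSketch {

    // Stage 1: simplify. Lower-case, then replace non-word characters with spaces (assumed behavior).
    static String simplify(String s) {
        return s.toLowerCase(Locale.ENGLISH).replaceAll("\\W", " ");
    }

    // Stage 2: tokenize on whitespace and count each token.
    static Map<String, Integer> tokenize(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : s.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Stage 3: cosine similarity over the token counts.
    static float cosine(Map<String, Integer> a, Map<String, Integer> b) {
        if (a.isEmpty() || b.isEmpty()) {
            return a.isEmpty() && b.isEmpty() ? 1.0f : 0.0f;
        }
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += (double) e.getValue() * e.getValue();
        }
        for (int c : b.values()) {
            normB += (double) c * c;
        }
        return (float) (dot / Math.sqrt(normA * normB));
    }

    // The full pipeline: simplify, tokenize, compare.
    static float compare(String a, String b) {
        return cosine(tokenize(simplify(a)), tokenize(simplify(b)));
    }

    public static void main(String[] args) {
        String str1 = "This is a sentence. It is made of words";
        String str2 = "This sentence is similar. It has almost the same words";
        System.out.println(compare(str1, str2)); // about 0.572
    }
}
```

Compared with tokenizing the raw strings, simplification turns "sentence." into "sentence", so the two inputs share one more token and the score rises.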