Closed
Description
I noticed this bug when I saw that scores change if I simply duplicate an input text to make it twice as long.
Each difficult word is counted only once, no matter how many times it occurs in the text. This is wrong, because the algorithms need to compute ratios such as difficult word count / total word count, and the total does count repeated words. This bug throws off many of the scores. The following test checks for the problem:
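def test_difficult_words_counts_duplicates():
    textstat.set_lang("en_US")
    # long_test is the sample text already used by the existing tests
    twice_as_long = " ".join([long_test, long_test])
    result = textstat.difficult_words(twice_as_long)
    # doubling the text should double the difficult word count
    assert result == 2 * 55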
The bug is here, where a set is used. Changing this to a tuple fixes it.
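To illustrate why the container type matters, here is a minimal sketch. It is not textstat's actual code: `is_long` is a hypothetical stand-in for the real difficulty check, and both function names are made up for this example.

```python
def count_difficult_unique(words, is_difficult):
    """Buggy variant: a set collapses repeats, so each word counts once."""
    return len({w for w in words if is_difficult(w)})


def count_difficult_all(words, is_difficult):
    """Fixed variant: a list keeps every occurrence."""
    return len([w for w in words if is_difficult(w)])


def is_long(word):
    # Hypothetical difficulty check, used only for this illustration.
    return len(word) >= 10


text = "the extraordinary committee praised the extraordinary implementation"
words = text.split()
doubled = words + words  # same text repeated twice

# With a set, the count does not grow when the text is duplicated,
# so the difficult/total ratio shrinks as the text gets longer.
unique_ratio_once = count_difficult_unique(words, is_long) / len(words)
unique_ratio_twice = count_difficult_unique(doubled, is_long) / len(doubled)

# With a list, the count doubles along with the total, so the ratio is stable.
all_ratio_once = count_difficult_all(words, is_long) / len(words)
all_ratio_twice = count_difficult_all(doubled, is_long) / len(doubled)
```

With the set-based count the ratio halves when the text is duplicated, while with the list-based count it stays the same, which is the behavior I would expect from the scores.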
I wrote a PR: #193