Logic bug in difficult_words_list()

I noticed this bug when I saw that scores change if I simply duplicate an input text to make it twice as long.

Each difficult word is counted only once, no matter how many times it occurs in the text. This is wrong, because the algorithms need ratios such as difficult word count / total word count, and the total word count does include repeats. This bug is causing many of the scores to be off. This tests for the problem:

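import textstat

def test_difficult_words_counts_duplicates():
    textstat.set_lang("en_US")
    # long_test is the sample text already defined in the test module
    twice_as_long = " ".join([long_test, long_test])
    result = textstat.difficult_words(twice_as_long)

    # Doubling the text should double the difficult-word count
    assert result == 2 * 55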

The bug is here, where a set is used. A set keeps only one copy of each word, so repeated difficult words collapse into a single entry. Changing this to a tuple fixes it.
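To illustrate the effect outside of textstat, here is a minimal, self-contained sketch. The helper names (`is_difficult`, `count_difficult_unique`, `count_difficult_all`, `difficult_ratio`) are hypothetical, and the length-based difficulty check is only a stand-in assumption, not textstat's actual rule. It shows that the set-based count stays flat when the text is duplicated, so the difficult-word ratio is halved, while the tuple-based count keeps the ratio stable.

```python
def is_difficult(word: str) -> bool:
    # Hypothetical stand-in: treat long words as difficult.
    # textstat's real check is more involved than this.
    return len(word) >= 9


def count_difficult_unique(text: str) -> int:
    """Set-based count: repeated words collapse into one entry."""
    return len({w for w in text.lower().split() if is_difficult(w)})


def count_difficult_all(text: str) -> int:
    """Tuple-based count: every occurrence is kept."""
    return len(tuple(w for w in text.lower().split() if is_difficult(w)))


def difficult_ratio(text: str, counter) -> float:
    """Difficult word count / total word count."""
    return counter(text) / len(text.split())


sample = "the complicated explanation confused everybody"
doubled = " ".join([sample, sample])

# Set-based: the ratio changes when the text is duplicated.
print(difficult_ratio(sample, count_difficult_unique))
print(difficult_ratio(doubled, count_difficult_unique))

# Tuple-based: the ratio is the same for both inputs.
print(difficult_ratio(sample, count_difficult_all))
print(difficult_ratio(doubled, count_difficult_all))
```

Any ordered container that preserves duplicates (a list would also work) gives the same count; the sketch uses a tuple only to mirror the change described above.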

I wrote a PR: #193
