Logic bug in difficult_words_list() · Issue #192 · textstat/textstat · GitHub
Logic bug in difficult_words_list() #192
Closed
@dogweather

I noticed this bug when I saw that scores change if I simply duplicate an input text to make it twice as long.

Each difficult word is counted only once, no matter how many times it occurs in the text. This is wrong because the algorithms compute ratios such as difficult word count / total word count, so every occurrence has to be counted. The bug throws off many of the scores. This test reproduces the problem:

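import textstat

# long_test is the sample text defined elsewhere in the test module;
# difficult_words() is expected to return 55 for a single copy of it.
def test_difficult_words_counts_duplicates():
    textstat.set_lang("en_US")
    twice_as_long = " ".join([long_test, long_test])
    result = textstat.difficult_words(twice_as_long)

    # Doubling the text should double the difficult-word count.
    assert result == 2 * 55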

The bug is here, in difficult_words_list(), where a set is used, so repeated occurrences of a word collapse into one. Changing this to a tuple fixes it.
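To illustrate why the set causes the problem, here is a minimal standalone sketch. It does not use textstat's real code: `is_difficult`, `difficult_words_deduplicated`, and `difficult_words_all_occurrences` are hypothetical names, and the length-based difficulty check is only an assumption standing in for the library's actual criteria.

```python
import re


def is_difficult(word: str) -> bool:
    # Hypothetical stand-in: treat any word of 9+ letters as difficult.
    # textstat's real criteria are different and are not reproduced here.
    return len(word) >= 9


def difficult_words_deduplicated(text: str) -> int:
    # Mirrors the buggy behavior: a set keeps each difficult word once.
    words = re.findall(r"[a-z']+", text.lower())
    return len({w for w in words if is_difficult(w)})


def difficult_words_all_occurrences(text: str) -> int:
    # Mirrors the fix: a sequence keeps every occurrence.
    words = re.findall(r"[a-z']+", text.lower())
    return len([w for w in words if is_difficult(w)])


sample = "The committee postponed the extraordinary negotiation."
doubled = " ".join([sample, sample])

# With the set, doubling the text leaves the count unchanged, so the
# ratio difficult / total is halved even though the text is no harder.
print(difficult_words_deduplicated(sample), difficult_words_deduplicated(doubled))

# Counting every occurrence doubles the count along with the text,
# so the ratio stays the same.
print(difficult_words_all_occurrences(sample), difficult_words_all_occurrences(doubled))
```

With the deduplicated count the numerator stays fixed while the total word count doubles, which is why the scores shift when a text is merely repeated.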

I wrote a PR: #193
