Feature Request: Get all characters with confidence >x · Issue #41 · ropensci/tesseract · GitHub
Feature Request: Get all characters with confidence >x #41
Open
@billdenney

Description

This is related to #8 and #39 (or more accurately, the underlying ideas within them).

Because of the upstream issue that the whitelist and blacklist are not implemented in tesseract 4 (discussed in #39), it is difficult to extract all-numeric values. More generally, I have some text that follows very rigid formatting, with columns of person identifiers (a mix of alphanumeric and dash characters) and floating-point numbers. The allowed values for the person identifiers are hard to restrict, but the floating-point numbers are easy because they come from the set 0-9, ".", and "-".

Is it possible within the ocr_data() function to get a vector of all characters that matched with confidence >x, along with the confidence values of those characters (where x is supplied by the user)?

That way, I could manually implement whitelist or blacklist functionality.
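
To illustrate what I mean by a manual whitelist, here is a rough, language-agnostic sketch in Python. It assumes the OCR step could hand back (character, confidence) pairs, which is exactly the feature being requested and not something ocr_data() is known to provide. The function name `filter_chars`, the `NUMERIC` set, and the sample values are all hypothetical.

```python
from typing import Iterable, List, Optional, Set, Tuple

# Hypothetical per-character OCR output: (character, confidence in percent).
CharConf = Tuple[str, float]


def filter_chars(
    chars: Iterable[CharConf],
    min_conf: float,
    whitelist: Optional[Set[str]] = None,
) -> List[CharConf]:
    """Keep characters with confidence > min_conf that are in the whitelist.

    If whitelist is None, only the confidence threshold is applied.
    """
    kept = []
    for ch, conf in chars:
        if conf <= min_conf:
            continue  # below or at the user-supplied threshold x
        if whitelist is not None and ch not in whitelist:
            continue  # not an allowed character
        kept.append((ch, conf))
    return kept


# Characters allowed in a floating-point column: 0-9, ".", and "-".
NUMERIC = set("0123456789.-")

# Made-up sample data, not real tesseract output.
sample = [("-", 91.2), ("1", 96.5), ("l", 88.0), (".", 93.1), ("5", 97.4), ("O", 42.0)]

numeric_only = filter_chars(sample, min_conf=80.0, whitelist=NUMERIC)
text = "".join(ch for ch, _ in numeric_only)
print(text)
```

A blacklist would be the same loop with the membership test inverted.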
