Feature Request: Get all characters with confidence >x

This is related to #8 and #39 (or more accurately, the underlying ideas within them).

Because whitelists and blacklists are not implemented upstream in tesseract 4 (discussed in #39), it is difficult to extract all-numeric values. More generally, I have some text that follows very rigid formatting, with columns of person identifiers (a mix of alphanumeric and dash characters) and floating point numbers. The allowed characters for the person identifiers are hard to restrict, but the floating point numbers are easy, as they only use characters from the set 0-9, ".", and "-".

Is it possible within the ocr_data() function to get a vector of all characters that matched with >x confidence and the confidence values of those characters (where x is input by the user)?

That way, I could manually implement whitelist or blacklist functionality.
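
To make the request concrete, here is a rough sketch of the post-filtering I have in mind. It is written in Python purely for illustration and uses made-up (character, confidence) pairs. The `filter_chars` helper and the data are hypothetical and are not part of the tesseract package or its actual output format.

```python
# Hypothetical per-character OCR output: (character, confidence in percent).
# These values are invented for illustration; they are not real tesseract output.
chars = [("1", 96.2), ("2", 91.0), (".", 88.5), ("S", 41.3), ("5", 93.7), ("-", 35.0)]


def filter_chars(chars, min_conf, allowed=None):
    """Keep characters with confidence > min_conf.

    If `allowed` is given, also drop characters outside that set,
    which acts as a manual whitelist.
    """
    return [
        (c, conf)
        for c, conf in chars
        if conf > min_conf and (allowed is None or c in allowed)
    ]


# Manual whitelist for floating point numbers: digits, ".", and "-".
NUMERIC = set("0123456789.-")

kept = filter_chars(chars, 80, NUMERIC)
text = "".join(c for c, _ in kept)
print(text)
```

With a threshold of x = 80, the low-confidence "S" and "-" are dropped and only the confident numeric characters remain. A blacklist would be the same idea with the membership test inverted.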
