Format Lines: Fix line merging by tarun-menta · Pull Request #712 · datalab-to/marker · GitHub

Format Lines: Fix line merging #712


Merged
merged 13 commits into from
Jun 2, 2025

Conversation

@tarun-menta (Contributor) commented May 27, 2025

We may have detected text that is not covered by any provider line. Ideally, we don't want to lose these lines in format_lines mode. This arises in two cases: one-off lines, and when the layout check fails. This PR:

  • Fixes bugs in line merging

    • Adds in text detection boxes where there is no provider line
    • Corrects the sorting when a layout box contains lines from both sources
    • Filters out detection lines from layout boxes that won't use them
    • Covers the reverse case, where multiple detection lines belong to the same pdftext provider line
  • Adds an initial implementation for box expansion to avoid picture/figure blocks being cut off

  • Updates README to remove deprecated --languages flag
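
The first fix in the list above (adding detection boxes that no provider line covers) can be illustrated with a minimal sketch. This is not the code from the PR: the `Box` alias, the helper names, and the `min_overlap` threshold are all hypothetical, and boxes are assumed to be `(x0, y0, x1, y1)` tuples.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1)
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def uncovered_detections(
    detections: List[Box],
    providers: List[Box],
    min_overlap: float = 0.1,  # assumed threshold, not taken from the PR
) -> List[Box]:
    """Return detection boxes that no provider line covers."""
    missing = []
    for det in detections:
        det_area = box_area(det)
        if det_area == 0:
            continue  # skip degenerate detection boxes
        covered = any(
            intersection_area(det, prov) / det_area >= min_overlap
            for prov in providers
        )
        if not covered:
            missing.append(det)
    return missing
```

In this sketch, the boxes returned by `uncovered_detections` are the ones that would be added alongside the provider lines so their text is not lost.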

When merging provider and detection lines, some boxes may still be missing
when the layout check fails. This catches and merges in these boxes too.
@VikParuchuri (Member) commented

This could use a test

Avoid slight cutting off of the layout boxes
In format lines mode, we include lines from surya which were not present
in the provider lines. However, we do not have ordering of these
relative to the provider lines.

This commit identifies blocks which contain lines from both sources, and
sorts the lines within those blocks using a different method (unchanged
for all other blocks)
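
A rough sketch of that idea follows. The commit message does not say which sort method is used for mixed blocks, so plain geometric ordering (top to bottom, then left to right) is an assumption here, as are the `Line` dataclass and the `order_block_lines` name.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    """Hypothetical line record for this sketch."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    source: str  # "provider" or "detection"


def order_block_lines(lines: List[Line]) -> List[Line]:
    """Re-sort a block only when it mixes lines from both sources."""
    sources = {line.source for line in lines}
    if len(sources) < 2:
        # Single-source block: keep the existing order unchanged.
        return list(lines)

    def reading_key(line: Line) -> Tuple[float, float]:
        x0, y0, x1, y1 = line.bbox
        # Assumed ordering: vertical center first, then left edge.
        return ((y0 + y1) / 2, x0)

    return sorted(lines, key=reading_key)
```

Because detection lines carry no ordering relative to provider lines, a purely positional key is one way to interleave them; blocks with a single source keep whatever order they already had.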
@tarun-menta tarun-menta changed the title Format Lines: Make sure missed detection boxes are merged Format Lines: Fix line merging May 30, 2025
When merging multiple detected lines into a single provider line, skip
detected lines which have already been assigned to a different provider
line

Was causing repeated text otherwise
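
The deduplication described in this commit could look roughly like the sketch below. It is illustrative only: the function name, the index-based return value, and the `min_overlap` threshold are assumptions rather than the PR's code.

```python
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0 if they are disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def assign_detections(
    providers: List[Box],
    detections: List[Box],
    min_overlap: float = 0.5,  # assumed threshold, not taken from the PR
) -> Dict[int, List[int]]:
    """Map each provider line index to the detection indices merged into it."""
    assigned: Set[int] = set()
    result: Dict[int, List[int]] = {}
    for p_idx, prov in enumerate(providers):
        matches = []
        for d_idx, det in enumerate(detections):
            if d_idx in assigned:
                continue  # already merged into an earlier provider line
            det_area = (det[2] - det[0]) * (det[3] - det[1])
            if det_area > 0 and intersection_area(det, prov) / det_area >= min_overlap:
                matches.append(d_idx)
                assigned.add(d_idx)
        result[p_idx] = matches
    return result
```

Without the `assigned` set, a detection line that overlaps two nearby provider lines would be merged into both, which is the repeated text the commit message mentions.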
Ensure expansion doesn't cut into other layout blocks, still upper
bounded by the max fraction
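
One way to picture this bounded expansion is the sketch below. The function name, the `page` argument, and the default `max_fraction` are hypothetical, and only neighbours directly above, below, left, or right of the box are considered; diagonal neighbours are ignored for simplicity.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical (x0, y0, x1, y1)


def expand_box(
    box: Box,
    others: List[Box],
    page: Box,
    max_fraction: float = 0.05,  # assumed default, not taken from the PR
) -> Box:
    """Grow a box by at most max_fraction of its size on each side,
    stopping at the page edge or at a neighbouring layout block."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * max_fraction
    dy = (y1 - y0) * max_fraction
    # Upper bound: the max-fraction expansion, clamped to the page.
    new = [
        max(x0 - dx, page[0]),
        max(y0 - dy, page[1]),
        min(x1 + dx, page[2]),
        min(y1 + dy, page[3]),
    ]
    for ox0, oy0, ox1, oy1 in others:
        # Blocks sharing a horizontal range limit vertical growth.
        if ox0 < x1 and ox1 > x0:
            if oy1 <= y0:
                new[1] = max(new[1], oy1)  # neighbour above
            if oy0 >= y1:
                new[3] = min(new[3], oy0)  # neighbour below
        # Blocks sharing a vertical range limit horizontal growth.
        if oy0 < y1 and oy1 > y0:
            if ox1 <= x0:
                new[0] = max(new[0], ox1)  # neighbour to the left
            if ox0 >= x1:
                new[2] = min(new[2], ox0)  # neighbour to the right
    return (new[0], new[1], new[2], new[3])
```

For example, a picture block with a caption block just below it would grow by the full margin on three sides but stop at the caption's top edge on the fourth.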
@VikParuchuri VikParuchuri merged commit acadb06 into dev Jun 2, 2025
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jun 2, 2025