WIP: Foundation Model Integration by tarun-menta · Pull Request #616 · datalab-to/marker · GitHub

WIP: Foundation Model Integration #616


Merged
merged 83 commits into from
May 19, 2025


@tarun-menta (Contributor) commented Mar 14, 2025

Switches OCRBuilder and EquationProcessor over to the new foundation model.
This also removes the need for the inline math detection model and greatly simplifies LineBuilder.

In addition, this adds a special mode, --fix_lines, which lets the model rewrite lines that contain formatting, math, or garbled text. When this flag is set, every line in the document is passed through the OCR model for potential rewriting.

Pending:

  • Don't replace good lines, tune model
  • Support new tag types
  • Retain anchor tags when replacing lines in all cases
  • Fix math inside tables
  • Add tests for the new functionality, replace old tests which are currently being skipped

OCR now includes bboxes, so these are written into the marker lines. The
output now matches pdftext outputs more closely
Force span HTML, because MathML tags inside the span text field are
escaped when outputting
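
A minimal sketch of why the escaping matters, using only Python's standard `html` module. The `SpanSketch` class and `render` function are hypothetical illustrations and are not marker's actual classes; the assumption is simply that a text field is escaped on output while an explicit HTML field is written through as-is.

```python
import html
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanSketch:
    """Hypothetical stand-in for a span: plain text plus optional raw HTML."""
    text: str
    html: Optional[str] = None


def render(span: SpanSketch) -> str:
    # If explicit HTML is set, emit it unchanged so MathML tags survive.
    if span.html is not None:
        return span.html
    # Otherwise the text field is escaped, which turns tags into entities.
    return html.escape(span.text)


mathml = "Energy is <math><mi>E</mi></math>"

as_text = render(SpanSketch(text=mathml))
as_html = render(SpanSketch(text=mathml, html=mathml))

print(as_text)  # tags appear as &lt;math&gt; ... and would not render as math
print(as_html)  # tags are preserved
```

With only the text field set, the `<math>` markup ends up as literal entities in the output, which is the situation the commit message describes avoiding.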
Inline detection model is now removed
If `--fix_lines` is enabled, math lines should automatically be fixed by
the new model
Line builder is significantly simpler now :)
To align better with the outputted MathML, while also supporting LaTeX
outputs from the LLMs
@tarun-menta changed the title from "Foundation integration" to "WIP: Foundation Model Integration" on Mar 14, 2025
tarun-menta and others added 12 commits March 15, 2025 00:00
Provider line bboxes can be wrong even when the text inside is valid.
When this happens across a full page, we catch it thanks to layout and
other checks.

This fixes the case where only a few lines have bad boxes: the provider
line bboxes are replaced by the bbox of the detected line that they are
merged into
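
The idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `ProviderLine`, `intersection_area`, and `snap_to_detected` are hypothetical names, and "the detected line that a provider line is merged into" is approximated here as the detected box with the largest overlap.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class ProviderLine:
    """Hypothetical provider line: text that is trusted, bbox that may be wrong."""
    text: str
    bbox: BBox


def intersection_area(a: BBox, b: BBox) -> float:
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)


def snap_to_detected(lines: List[ProviderLine], detected: List[BBox]) -> List[ProviderLine]:
    """Keep each line's text, but take the bbox of the best-overlapping detected line."""
    result = []
    for line in lines:
        best = max(detected, key=lambda d: intersection_area(line.bbox, d), default=None)
        if best is not None and intersection_area(line.bbox, best) > 0:
            result.append(ProviderLine(line.text, best))
        else:
            # No overlapping detected line: leave the provider bbox untouched.
            result.append(line)
    return result


detected_boxes = [(0, 0, 100, 12), (0, 20, 100, 32)]
provider = [
    ProviderLine("valid text, box too wide", (0, 0, 500, 12)),
    ProviderLine("no detected line nearby", (0, 100, 50, 110)),
]
snapped = snap_to_detected(provider, detected_boxes)
```

In this sketch the text is never altered; only the geometry is swapped, which mirrors the commit's premise that the provider text can be valid even when its box is not.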
Prevents re-OCRing of lines inside equations, since the
EquationProcessor will handle that anyways

Significantly cuts down time on math heavy documents
Matches the input distribution of the model much better, since it was
never really trained on `\n` characters. A significantly higher number of
lines is retained now, at less than half the runtime on average
Line merging was broken when there were vertical lines in the page,
causing the input texts and image slices for OCR to be misaligned

Since we now use the detected line boxes to replace provider line boxes,
this was showing up in the `--fix_lines` mode.
Old logic could not handle the case where a single span had multiple
matches. Now the remaining text after a match is preserved and
considered for future matches as well
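
A rough sketch of the "keep the remainder for later matches" behavior described above. The commit message does not say what is being matched, so this assumes plain left-to-right substring matching; `split_span` is a hypothetical helper, not a function from marker.

```python
from typing import List, Tuple


def split_span(text: str, targets: List[str]) -> List[Tuple[str, bool]]:
    """Split text into (piece, matched) pairs, consuming targets left to right.

    After each match, the text that follows it is kept as `remaining`
    so that later targets can still match inside the same span.
    """
    pieces: List[Tuple[str, bool]] = []
    remaining = text
    for target in targets:
        if not target:
            continue  # an empty target would match everywhere; skip it
        idx = remaining.find(target)
        if idx == -1:
            continue  # this target is not in what is left of the span
        if idx > 0:
            pieces.append((remaining[:idx], False))
        pieces.append((target, True))
        # Preserve everything after the match for future matches.
        remaining = remaining[idx + len(target):]
    if remaining:
        pieces.append((remaining, False))
    return pieces


parts = split_span("a x^2 b y^2 c", ["x^2", "y^2"])
```

Logic that stops after the first match would leave `y^2` unmatched in this example; carrying `remaining` forward is what allows the second match.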
@VikParuchuri merged commit 772ae75 into dev on May 19, 2025
4 checks passed
@github-actions bot locked and limited conversation to collaborators on May 19, 2025