WIP: Foundation Model Integration #616

tarun-menta · 2025-03-14T20:21:27Z

Switches OCRBuilder and EquationProcessor over to the new foundation model.
This also removes the need for the inline math detection model and greatly simplifies LineBuilder.

In addition, this adds a special mode, `--fix_lines`, which lets the model rewrite lines that contain formatting, math, or garbled text. When this flag is set, every line in the document is passed through the OCR model for potential rewriting.

Pending:

Don't replace good lines, tune model
Support new tag types
Retain anchor tags when replacing lines in all cases
Fix math inside tables
Add tests for the new functionality and replace the old tests that are currently being skipped

Provider line bboxes can be wrong even when the text inside is valid. When this happens across a full page, we catch it thanks to layout and other checks. This fixes the case where only a few lines have bad boxes: the provider line bboxes are replaced by the bbox of the detected line that they are merged into.
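The bbox replacement described above can be illustrated with a minimal sketch. It assumes axis-aligned `(x0, y0, x1, y1)` boxes and picks the detected line with the largest overlap; `ProviderLine`, `intersection_area`, and `snap_to_detected` are hypothetical names for illustration, not the names used in this PR, and the actual merge criterion may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed layout


@dataclass
class ProviderLine:
    # Hypothetical stand-in for a line supplied by the PDF text provider.
    text: str
    bbox: BBox


def intersection_area(a: BBox, b: BBox) -> float:
    """Area of overlap between two axis-aligned boxes (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def snap_to_detected(
    provider_lines: List[ProviderLine], detected_boxes: List[BBox]
) -> List[ProviderLine]:
    """Keep each provider line's text, but take the bbox of the detected
    line it overlaps most. Lines with no overlap are left unchanged."""
    snapped = []
    for line in provider_lines:
        best = max(
            detected_boxes,
            key=lambda box: intersection_area(line.bbox, box),
            default=None,
        )
        if best is not None and intersection_area(line.bbox, best) > 0:
            snapped.append(ProviderLine(line.text, best))
        else:
            snapped.append(line)
    return snapped
```

The text is never touched in this sketch; only the geometry is replaced, which matches the premise that the provider text is valid even when its box is not.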

tarun-menta added 17 commits March 12, 2025 11:47

Start replacing texify with foundation OCR model

29acb15

Replace texify completely with new foundation model

e54ff7d

Cleanup and refactor old texify tests to the new model

ec82e25

Add new model support for OCR

66a4dbf

OCR now includes bboxes, so these are written into the marker lines. The output now matches pdftext outputs more closely.

Fix inline math

ed770e4

Force span html, because mathml tags inside the span text field are escaped when outputting
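The escaping problem behind this commit can be shown with a small sketch. It assumes a span carries plain `text` plus an optional pre-rendered `html` field and that plain text is escaped on output; the `Span` class and `render_span` function below are hypothetical and simplified, not the project's actual classes.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span: plain text plus optional raw HTML.
    text: str
    html: Optional[str] = None


def render_span(span: Span) -> str:
    """Emit the raw HTML when it is set; otherwise escape the plain text."""
    if span.html is not None:
        return span.html
    return escape(span.text)


mathml = "<math><mi>x</mi></math>"
escaped = render_span(Span(text=mathml))              # tags become &lt;math&gt;...
preserved = render_span(Span(text="x", html=mathml))  # tags survive as markup
```

Storing the MathML in the text field alone would yield the escaped form, which is why the markup has to travel through the HTML field in this sketch.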

Fix missing newlines at end of marker lines

de41b4a

Remove inline math llm processor

d0a74d6

Strip out inline math tests

26ffd5a

Fix tags

be2aeec

Replace inline math with new model

7d5926b

The inline detection model is now removed. If `--fix_lines` is enabled, math lines should automatically be fixed by the new model. Line builder is significantly simpler now :)

Handle edge case

7edadb4

Remove inline model stuff from tests

aeeb8e2

Fix how text extraction method is set

c473f3a

Update markdown rendering to support MathML

20498e0

Fix encoding issues with ftfy

2517cc7

Update LLM processor and markdown rendering

13874ba

To align better with the outputted MathML, while also supporting LaTeX outputs from the LLMs

tarun-menta changed the title ~~Foundation integration~~ WIP: Foundation Model Integration Mar 14, 2025

tarun-menta and others added 12 commits March 15, 2025 00:00

Update logic for preserving good original text

eba0aed

Skip OCR for lines inside certain blocks

f2dba4f

Prevents re-OCRing of lines inside equations, since the EquationProcessor will handle those anyway. Significantly cuts down time on math-heavy documents.
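A minimal sketch of this kind of filtering, assuming each line knows the type of the block that contains it. The `Line` class, the `parent_block_type` field, and the `SKIP_BLOCK_TYPES` set are hypothetical; the commit title only says "certain blocks", so the real list of skipped types may be longer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

# Hypothetical set of block types whose lines are handled by another processor.
SKIP_BLOCK_TYPES: FrozenSet[str] = frozenset({"Equation"})


@dataclass
class Line:
    # Hypothetical, simplified line record.
    text: str
    parent_block_type: str


def lines_to_ocr(
    lines: List[Line], skip_types: FrozenSet[str] = SKIP_BLOCK_TYPES
) -> List[Line]:
    """Return only the lines that should be sent to the OCR model."""
    return [line for line in lines if line.parent_block_type not in skip_types]
```

On a math-heavy page most lines sit inside equation blocks, so a filter like this removes most of the OCR batch before the model is called.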

Correct llm prompts

c791ba5

Add support for new tag types

baafa90

Merge branch 'foundation-integration' of https://github.com/VikParuch…

b5ca3f3

…uri/marker into foundation-integration

rstrip() lines before inputting to OCR model

6f573a1

Matches the input distribution of the model much better, since it was never really trained on `\n` characters. A significantly higher number of lines is retained now, at less than half the runtime on average.
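As a rough illustration of the preprocessing step, the sketch below strips trailing whitespace (including `\n`) from each line before it is handed to the model. `prepare_ocr_inputs` is a hypothetical helper name, not a function from this PR.

```python
from typing import List


def prepare_ocr_inputs(lines: List[str]) -> List[str]:
    """Strip trailing whitespace and newlines; leading whitespace is kept."""
    return [line.rstrip() for line in lines]


cleaned = prepare_ocr_inputs(["First line\n", "Second line  \n", "  indented"])
```

Note that `rstrip()` only touches the right-hand side, so indentation at the start of a line is preserved.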

Fix line merging logic

fc6a5ff

Line merging was broken when there were vertical lines in the page, causing the input texts and image slices for OCR to be misaligned. Since we now use the detected line boxes to replace provider line boxes, this was showing up in the `--fix_lines` mode.

Initial support for preserving spans with --fix_lines

521efc8

Make ref matching logic recursive

9be4db1

Old logic could not handle the case where a single span had multiple matches. Now the remaining text after a match is preserved and considered for future matches as well.
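The recursive idea can be sketched as follows, assuming a reference is identified by a literal substring of the span text. `split_on_refs` and the `(match_text, ref_id)` pairs are hypothetical and only illustrate how the remainder after a match is fed back in for further matching.

```python
from typing import List, Optional, Tuple

Piece = Tuple[str, Optional[str]]  # (text, ref id or None for plain text)


def split_on_refs(text: str, refs: List[Tuple[str, str]]) -> List[Piece]:
    """Split text into plain and reference pieces, recursing on the remainder
    so that a single span can contain several matches."""
    if not text:
        return []
    # Find the earliest match among all candidate references.
    best: Optional[Tuple[int, str, str]] = None
    for match_text, ref_id in refs:
        if not match_text:
            continue
        idx = text.find(match_text)
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, match_text, ref_id)
    if best is None:
        return [(text, None)]
    idx, match_text, ref_id = best
    pieces: List[Piece] = []
    if idx > 0:
        pieces.append((text[:idx], None))
    pieces.append((match_text, ref_id))
    # The text after the match is kept and searched again.
    pieces.extend(split_on_refs(text[idx + len(match_text):], refs))
    return pieces
```

For example, `split_on_refs("see [1] and [2] here", [("[1]", "ref-1"), ("[2]", "ref-2")])` yields five pieces, with both references tagged and the surrounding text kept as plain pieces.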

Update streamlit app

33d10de

Cleanup

911b603

VikParuchuri and others added 28 commits April 14, 2025 15:52

Fix README

026098d

Fix OCR tests

8925491

Clean up timings

f488c60

Merge pull request #662 from VikParuchuri/keep-chars

1ded28e

Keep chars

Fix batch sizes

b07545e

bump version

01c063f

Poetry lock

eca412f

Merge pull request #665 from VikParuchuri/keep-chars

f4ed6c1

Keep chars

Align text

3cb85f4

Bump prerelease

793bbb2

Fix structure merge bug

c6dae45

Cleanup streamlit app

5ba026a

Allow empty processor list in PdfConverter

31bd14b

An empty list is falsy, which was causing it to always be overwritten by the default list when passed in as an arg.
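The pitfall is a common one in Python and can be reproduced in a few lines. The sketch below is generic: `DEFAULT_PROCESSORS`, `resolve_buggy`, and `resolve_fixed` are hypothetical names, and the actual change in PdfConverter may be written differently.

```python
from typing import List, Optional

# Hypothetical default list, for illustration only.
DEFAULT_PROCESSORS: List[str] = ["equation", "table"]


def resolve_buggy(processor_list: Optional[List[str]]) -> List[str]:
    # `or` treats [] the same as None, so an explicit empty list is lost.
    return processor_list or list(DEFAULT_PROCESSORS)


def resolve_fixed(processor_list: Optional[List[str]]) -> List[str]:
    # Only fall back to the defaults when no list was passed at all.
    if processor_list is None:
        return list(DEFAULT_PROCESSORS)
    return processor_list
```

With the `is None` check, passing `[]` means "run no processors", while omitting the argument still selects the defaults.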

Remove global logging change

42d3aeb

Add proper logging

d979ff6

Basic structured extraction

3aae414

Better blank line detection

d127b87

Add extraction app

d5c8bfb

Improve structured extraction alpha

9817023

Add tests for extraction converter

8a01367

Review comments

ae6361f

Bump surya

a0a9dfa

Pass detection polygons

9bc10a8

Enable bytes in converter

5242c09

Patch tests

ff1d3e7

Bump version

f688c70

Merge pull request #687 from VikParuchuri/structured_extraction

47557ff

Structured extraction

Merge branch 'dev' into foundation-integration

d1dc215

VikParuchuri merged commit 772ae75 into dev May 19, 2025
4 checks passed

github-actions bot locked and limited conversation to collaborators May 19, 2025
