Releases: allenai/olmocr
v0.1.76
What's new
Commits
24a2f9b Bump version to v0.1.76 for release
cd93ca5 Version bump
ecce181 Merge pull request #256 from allenai/jakep/dockerfix
0c6d199 Update README.md
ec5c5b6 Updating pareto plots
6c51829 Some helper scripts
626952a Adding news
9d26079 README updates
69524cb Updating bench readme
v0.1.75
v0.1.74
v0.1.73
v0.1.72
What's new
Commits
5e5c31b Bump version to v0.1.72 for release
715b841 0.1.72
b03feb3 Fixed
b588ae2 Removing sglang tests, switch to vllm
6e3fba3 Lints
e489b28 Lints
6fcd26d Updating readme
8c62072 Merge remote-tracking branch 'origin/main' into jakep/vllm_perf
3eda2c0 updated vllm to 0.9.1
a83a0da Cleanup of vllm perf branch with @amanr
316d0af added dtype functionality
c8a5361 fixing packages of 22.04
c5d075c fixed apt_pkg module
08fd82f made changes wrt ubuntu 22.04
6507a65 updated ubuntu to 22.04 for glibc 2.32
25dfe0b Weird glibc error
9539eab AWS creds fix
e0fda1a Passing aws creds to benchmark so we can run custom models stored in s3
ecf0d48 Don't allow uncommitted changes
134bba9 Run benchmark adjustments
7009a7a Trying out FP8 compression
aad8428 Reverting custom pipeline image
5c52e01 Include cuda 12.8
5c524b5 Cleaning up stats reporting
916f0cb Trying with flash infer installed
2ccef7d Ugh, this code is bad
2f1957b Performance fixes with vllm backend
d717033 Fixing parse for waiting
d1baa51 Python alternatives
581915f Fixes for docker image
153f1e5 Final uv fixes
97da87a Hopefully a much better dockerfile
04dd71c Trying to get onto vllm latest
106070d Moving pipeline to vllm
2235b82 Beaker tests
967c83d Better way to setup beaker
v0.1.71
What's new
Commits
23f4a0e Bump version to v0.1.71 for release
8b4f6cd Upping version
24b6822 Pushing beaker images now too
208c29d Not including fallbacks in olmocr_pipeline bench runner so we can measure direct model performance better
5faf570 Format fixes
587b73f Try with more aggressive anchor changing
8f5d5bd Revert "Trying to add repetition penalty"
90f754e Trying to add repetition penalty
9dcdef6 Going to try with up to 5k tokens
8d92620 Merge remote-tracking branch 'origin/main' into retry_improvements
2cb14cc Allowing more tokens
022be37 Some better info strings in benchmark runner
22ee068 Merge remote-tracking branch 'origin/main' into retry_improvements
fbcd82a Cleanup attempt lookup code a bit
f8fd234 Idea to improve retry performance
61d427e Repo cleanup
7a50ee1 merge
241e5bf Merge branch 'main' of github.com:allenai/olmocr
470394d pareto plot
v0.1.70
What's new
Commits
e10a53c Bump version to v0.1.70 for release
76270f5 Upping to v70 to test new docker builds
a6d6c34 Refactored docker workflows
78ea21a Merge pull request #216 from allenai/amanr/docker
bea1873 Update README.md
7996a7d Update README.md
bdf0879 Merge pull request #202 from allenai/amanr/docker
74f4786 README updated with pip install and --markdown
v0.1.69
What's new
Commits
57238cf Bump version to v0.1.69 for release
71275cc Bumping version, adding more docs, more to come
7b640ae Merge branch 'main' of https://github.com/allenai/olmocr
8d8e323 Adding markdown flag to directly generate markdown outputs
1043491 Oops, removing submodule olmOCR bench repo, best if you just clone from hugging face
2c1c8a6 Updating readme more
v0.1.68
What's new
Commits
d2755ad Bump version to v0.1.68 for release
db9972c Readme updates
c97ce8b Lints
08806fd Fixups
10b5e9e Includes
63aee2c Code cleanup, version bump, remove unused permutation test
5de52e7 Update README.md
66f9b46 Merge branch 'main' of github.com:allenai/olmocr
7f4edb2 pareto plot
c970851 Merge branch 'main' of https://github.com/allenai/olmocr
bb3fe14 Pareto plot for paper
f0768bb Merge branch 'main' of https://github.com/allenai/olmocr
c4a0fb9 Adding back in proper CI estimation
d17210f Lint fix
ffee4c9 Big bug fix, moving the prompt to match how training was done, 2.3 point boost on olmocr-bench
28966b9 Adding CDF plots
2e8753a Docling runner based on CLI, but it's too slow to use. PII rule fixes
74ef2b6 Fixes for some pii taggers
b3b405d dedupe script
e06fd62 Adjusting tagging pipeline v2
1538163 Merge branch 'main' of https://github.com/allenai/olmocr
623c66c Fixing up tagging pipeline
1854ae1 A bit more work on tagging
72bcfd8 doing some extra pii tagging steps
9871e06 Merge branch 'main' of https://github.com/allenai/olmocr
424052d Outputting some nice reference docs to check pii
d18f3f7 More pii tag checking
80645c8 Hypothesis checker
3aba3a5 Committing script to get stats on PII tagging
6f62e05 Merge pull request #188 from allenai/amanr/miners
9e5965a Some PII filter
ef083bf Stats fix
d671be6 Working on some dataset filtering
da21074 More nits
88270e9 More work on qwen25 finetune
a2ec95e Testing out to see where we stand on qwen2.5
97e4992 Merge branch 'main' of https://github.com/allenai/olmocr
dcbe654 Report for benchmarking
791983c Tweaking some more pii detection
5cc0848 Rich tagger with bigger model
4ed00d0 Fixes for rich tagging
472ee10 Lints
8ef7e56 Trying a new rich tagging pipeline for PII
0a320e9 Some helper scripts for Aman
1067f80 Update README.md
4e9e13e Option in benchmark to output tests which fail on all models for debugging
e51362b Showing benchmark scores per category, speed improvements
f880847 Adding some small changes to the tagging pipeline
66d293c Decent resume/cv tagging
1f66b96 Adding openai dependency for benchmarking
689bcd9 Merge branch 'main' of https://github.com/allenai/olmocr
8ec7dbe Script updates
83002a0 Reinit credentials
2d5e183 Small corrections
df71dc3 Small fix for cluster usage
67a01cf Fixups for tagging pipeline
c326fae Refactoring tagging bigly
811d267 Merge branch 'main' of https://github.com/allenai/olmocr into main
479b2c1 Working on a tagger
717ed81 Cleanup
97ae48c Making some more progress
7d8e9d1 Fixing up tagging pipeline
12100b4 Adding some manual structure to be filled in
ee8c506 Example of a basic empty pipeline that I'm hoping to extend for tagging
582518f Merge pull request #181 from mhamada-ai2/patch-1
887efac Merge branch 'main' of https://github.com/allenai/olmocr
246490f Lint fixes
967210f Adjustments to task
3dffeea Saving prolific PID
b20a488 README for benchmark
b897bf1 Merge branch 'main' of https://github.com/allenai/olmocr
f0992b9 Better staggering of downloads
858cf69 Bumping version
10cb6aa Updating pipeline to take cloud storage model names and paths, as well as local directory
e361713 Update README.md
ac8c536 Update README.md
df65757 Update README.md
ca6e142 Adding some extra unit tests on some math cases I wasn't sure of
7a638c7 Adding some more options to prompt chatgpt
eabbe27 Lint fixes
7f82260 Merge pull request #173 from allenai/amanr/olmocr-bench-old_scans
e16f66d Working on annotation for dolma docs release
9a67f50 Doing some work on annotations again...
1d0c560 Upping version to fix issue with work queue and delimited paths
786b14a Final adjustments
4d8a8af Adjusting prolific script
dc2512c Adjusted annotation script
ee41449 Instructions updated in annotation tool
5ebec46 Merge branch 'main' of https://github.com/allenai/olmocr
0b5cd40 Staggering model downloads in big sharded jobs
3f34969 Rendering math in review app
590a92e Ruff fix
4e990e2 Merge branch 'main' of https://github.com/allenai/olmocr
a13a501 Formatting, fixes to annotation tool
a74800f New flowchart based annotation tool
cdc7fae Adjusting annotation script
2f74a2a Prompt6 for qwen2.7 vl
8c287a0 Basic prompt edits
ecbd3a2 Merge branch 'main' of https://github.com/allenai/olmocr into main
474e0ef Lint fixes, adjusting qwen2.5 vl prompt
aa5cb95 Typos fixed up
141fc69 Vllm based qwen2.5 evals
9d8a4cf Merge branch 'main' of https://github.com/allenai/olmocr into main
f5641c6 Convert script updated a bit
ae4fda7 Bugfixes
aa58370 Merge branch 'main' of https://github.com/allenai/olmocr into main
613a4f3 Adding additional runners and updating convert script
b607aec Lints
95b03a1 Fixing gemini convert script to use new API
3d19250 Removing progress bar in annotation UI
caf21b9 Lints
f1188dc Merge branch 'main' of https://github.com/allenai/olmocr
a0f8b02 Reporting results
cc7b113 Editing
9338f53 Saving pdf paths
61624a3 Fixed
d299119 Links updated
a113fd3 Review app
e8c14fc Saving prolific codes
cd9e370 Tinyhosting automatically
02cd002 Step by step annotation
6a0dbfc Adjusting buttons
d4d87f7 Force flag for review app, tests fixed for difference comparison in tables
e856e9d Test mining not including line numbers
2614fc9 Merge branch 'main' of https://github.com/allenai/olmocr
a96f154 Hopefully avoiding comparison issues now
b8b780f More mining of synthetic tests code
360b1be Better filtering of tests
6d3a7d6 Adding autorender if katex into synthetic pipeline
4604b59 Synth mining
69b0222 Improving miner script
841ce72 Miner improvements
9737649 More tests
748ab95 Miner unit tests for duplicate absent tests
594f473 Synth miner coming together more
fb8b23d Small adjustments to synthetic data pipeline
678c000 Nicer claude prompt for synth data gen
5c98a47 Mining upgrades
a34b158 Lints
83ae610 Scan dolma docs improvements for PII review
bc78e0d Adding feedback
213252f A few improvements to the dolma doc viewer script
3ca39ab Merge branch 'main' of https://github.com/allenai/olmocr
9b119c8 First attempt at mining actual test cases
abcf7f0 Lints
cd5a93d Rendering pdfs with playwright and chromium
9749e95 Merge branch 'main' of https://github.com/allenai/olmocr
731aa73 Better synth miner script
42be0cc Too much debug spew
d45c032 Better equation rendering checker with more tests.
b8e3034 Trying a change to the render script
2141f18 Adding a katex test case that should be fixed
4d6a97f Style fix, a few notes
c36d8fd Merge branch 'main' of https://github.com/allenai/olmocr into main
223d05a Adding basic prompt template
2417e61 Medoid
03285d9 Merge branch 'main' of https://github.com/allenai/olmocr
1f77aab Some early code for mining html templates of pages, pick medoid code
58276b0 Mining reading order checkpoint, convert script to use images
f79bd0d Cleanup review app
063d4f5 Review page
449900a Tests
9e3b554 More html table parsing goodness
2944d3b More fixes
16ab1a4 Progress on more complicated header and footers
1e13dde Sorting results
c25e9cb Adding some fixes
3005ebd Normalization
8ec1ebe Normalization
cb4dfeb Fix
a4605e4 Fixing normalizing during table cell comparison
1797911 Lints
b307f5a More robust markdown parsing
5344457 Tests
cac5ef1 Tests for the tests
196654e Merge branch 'main' of https://github.com/allenai/olmocr
0a3a5ef Lints
0afacd6 Less duped tests
9855f70 Some work on table dataset
bc41ba9 Merge branch 'main' of https://github.com/allenai/olmocr
ad82e55 Adding url reference for tests, some mining and cleanup scripts
3c22cf3 Lints
da05b4c Merge branch 'main' of https://github.com/allenai/olmocr
d620722 Review app is much nicer now
5ec9647 Keyboard shortcuts
9df5102 review document
7f921f4 review app
89b628d Slightly better
9344107 pdf viewer
4939e41 Flask based review app first attempt
93450c3 Table miner
b472845 Table miners
aee030c Fixing sample dataset, outputting some reports for debugging. Math is good enough for now
v0.1.60
What's new
Commits
dd72563 Bump version to v0.1.60 for release
baa0082 Don't go down too low in temp
f2951f3 Lints
1e42e5e Faster and nicer equation cache
1f8cc59 Pipeline scales temperature automatically, increases performance ~2%
4768ac4 Merge branch 'main' of https://github.com/allenai/olmocr
0968bd1 Mine headers footers
1270ca3 lints
d7361c4 Basic convert script
142a9cb Convert script to support broader folder structures
98c4283 Cap max workers to hopefully improve stability
5f3ef51 Faster equation cache and checking, cleanup data script
79e2677 Hmm, these should be passing!
f5d92bd Trying to get new CI to work
1db1b34 Merge pull request #122 from allenai/gpu-ci
9f38a8a Lints
5009bb3 Lints
acb0df3 Fixes
3eec2a8 Mining math
95f03e1 More small tests
d30a070 Tests
2696502 Much faster and responsive math bench
980121f Loading tests much faster in parallel
7729e5a Graphical pdf test from github
154a07c Math miner looks decent
d0b9b5b Fixes for math mining
09fd299 Mining
3f92265 Math miner working decently
5387a79 More tests for olmocrbench
189104b Fixing escaped html bug in mathml parsing
770bc36 Fixes for multipage
0553443 Convert scripts and other fun
8b3a9e4 Fixes for multipage runners
743e48e More fixes
b2fe82d Working on math compares
bc3a945 Adding some tests
35cc6f1 A few fixes for text comparisons and normalized chars
4709156 Leaving with some more data, but still cases to investigate
07be9ea More math testing
e39c3e4 New method for comparing equations
fff4050 More test documents
0ba56c0 Adjusting repeat test to be the "baseline" test which also looks for disallowed characters
a2b5ca8 Better markdown table parsing
3fef3f9 Gemini support, some debugging stuff
fc857f9 Starting on math dataset
d006e8f Working on equation matching
7003e9c Working on a better compare function
e144200 Fix markdown parsing for mistral
bdc0d75 Adding mistral ocr to eval
4053ea5 Work on image matching
b03d840 Better error handling on eqn rendering
438e68e Some more math stuff
7f36ac8 First math tests
b62ccc2 Equation rendering code, first pass
9be696f Adding a trailing repetition test
07466e1 Stats tests
eeb2733 Marker rerun, stats changes
50e55f4 Conversion fixes
fb0a729 Better convert script
fa68c6b Better conversion script, run on more things
c9ecd8e Need those chat templates
5611d79 Model runners
5cb32c3 Convert script work with server backends
87875b3 Merge branch 'main' of https://github.com/allenai/olmocr into main
2982526 Convert scripts for benchmark
1545a6d Adding more work on diffs
004486f Nice tables support
3a0bcb6 Better table tests
748fd62 Adding basic table relative tests
76476f9 Synth rendering ideas
c4f6b11 Fixing the mine diffs script, but it still doesn't work great
fcb1eab Consistent ordering on convert, with data dir script
ecac384 Making a nicer warning message when waiting for sglang server
03ef353 One last lint fix
7d7e81e Internal version bump
7a7c878 double parentheses for proper escaping
dc7cb5c Ruff fixes to CI
1348a29 Merge branch 'main' of https://github.com/allenai/olmocr into main
ca0f911 Probably need at least 20GB GPU ram to have a good time with olmocr
2241853 Merge branch 'main' of https://github.com/allenai/olmocr into main
a701a37 Fix for calling --pdfs with an invalid pdf
622540e Fix so that the pipeline.py attempts to download the model weights first, before starting the loading timeout
010fdf8 Small fix
7dd44ed convert script
701abdb Some new entries
1148b47 Minor fixes
361ed2a Merge branch 'main' of https://github.com/allenai/olmocr into main
9f12917 Organizing things for data entry
af02c63 Working viewer
8061aac Working on viewer/editor for rules
ab13ac6 Mining diff script outputs candidate rules
99ab046 Autominer work
143769b Merge pull request #61 from allenai/kylel/elo
1b78ec9 More work on automining
3670219 commits
2d4c1a1 Merge branch 'main' of https://github.com/allenai/olmocr into main
a03673e Working on some progress for the autominer, fixing more options in convert script
11e89dc Script fixups
505e08c automine draft
ae7efd3 Refactoring
9e019f1 More factoring
bd08fdb fixes missing OSS code for Issue #36
d4b902c Olmocr runner implemented
aac0c15 chatgpt converter
8a6e8b9 Basic rule viewer
9081f7f Update README.md
0130a97 fixed style
c2b54d8 updated readme
d841216 Merge branch 'main' of https://github.com/allenai/olmocr into main
813a355 Fixing mineru runner, added a few sample docs
cc1f476 Bugfixes
9da1f92 Cleaner implementations of benchmark stuff
53494d9 Refactoring
ff465f7 Starting refactor
a348cd6 olmocr bench runner
c20e3c0 Pdf for dataset
16a3244 olmocr running
422d08f Adding more rules and seeing how they should work
f2f7619 Adding mineru script
e5a80c5 Fixing up benchmark a bit
c3d0ce9 Some readmes and instructions
4e0339f Runner for olmocr bench
a8f6921 Benchmark runners for other systems
318abf2 Adding runbench
1230aef Making progress
072bc1d Making some progress
823629d Sample code for olmocrbench
9e62003 Adding readme for olmocr bench
e4f9b19 Infinigram counting script for paper
6020122 Match script
b871e4b Small helper to measure overlap