Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.
To do the visualizations yourself, simply run (change word into char or line),
poetry run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive
original | char | word | line |
---|---|---|---|
Install the package from PyPI
pip install docling-parse
Convert a PDF (see visualize.py for a more detailed example)
from docling_parse.document import SegmentedPdfPageLabel
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.yield_cells(label=SegmentedPdfPageLabel.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render(label=SegmentedPdfPageLabel.CHAR)
    img.show()
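As a follow-up to the snippet above, here is a minimal sketch that groups the extracted cell texts by page number. It relies only on the calls already shown (`iterate_pages()`, `yield_cells(label=...)` and the `.text` attribute); the helper names `words_by_page` and `load_and_collect` are hypothetical and not part of the package.

```python
from typing import Any, Dict, List


def words_by_page(pdf_doc: Any, label: Any) -> Dict[int, List[str]]:
    """Map each page number to the list of cell texts found on that page.

    Hypothetical helper: it only assumes that pdf_doc.iterate_pages() yields
    (page_no, page) pairs and that page.yield_cells(label=...) yields objects
    with a .text attribute, as in the example above.
    """
    pages: Dict[int, List[str]] = {}
    for page_no, page in pdf_doc.iterate_pages():
        # Pages without any cells still get an (empty) entry.
        pages[page_no] = [cell.text for cell in page.yield_cells(label=label)]
    return pages


def load_and_collect(path: str) -> Dict[int, List[str]]:
    """Load a PDF with docling-parse and collect its word-level texts."""
    # Imported locally so words_by_page stays usable without the package installed.
    from docling_parse.document import SegmentedPdfPageLabel
    from docling_parse.pdf_parser import DoclingPdfParser

    parser = DoclingPdfParser()
    pdf_doc = parser.load(path_or_stream=path)
    return words_by_page(pdf_doc, SegmentedPdfPageLabel.WORD)
```

Calling `load_and_collect("<path-to-pdf>")` should then return a dictionary keyed by page number, which is often easier to post-process than printing inside the loop.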
Use the CLI
$ docling-parse -h
usage: docling-parse [-h] -p PDF
Process a PDF file.
options:
-h, --help show this help message and exit
-p PDF, --pdf PDF Path to the PDF file
Version | Original | Word-level | Snippet-level | Performance |
---|---|---|---|---|
V1 | | Not Supported | | ~0.250 sec/page |
V2 | | | | ~0.050 sec/page [~5-10X faster than v1] |
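To put the per-page timings from the table into perspective, the following sketch estimates the total parsing time for a document collection. The two constants are the approximate figures quoted above; the function names and the 1000-page example are illustrative assumptions, and real timings will vary with hardware and document content.

```python
# Approximate per-page timings quoted in the table above (seconds per page).
V1_SEC_PER_PAGE = 0.250
V2_SEC_PER_PAGE = 0.050


def estimated_seconds(num_pages: int, sec_per_page: float) -> float:
    """Rough total parsing time, assuming a constant per-page cost."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * sec_per_page


def speedup(slow_sec_per_page: float, fast_sec_per_page: float) -> float:
    """Ratio between the slower and the faster per-page timing."""
    return slow_sec_per_page / fast_sec_per_page


if __name__ == "__main__":
    pages = 1000  # hypothetical collection size
    print("v1 estimate (s):", estimated_seconds(pages, V1_SEC_PER_PAGE))
    print("v2 estimate (s):", estimated_seconds(pages, V2_SEC_PER_PAGE))
    print("speedup factor:", speedup(V1_SEC_PER_PAGE, V2_SEC_PER_PAGE))
```

With these two figures the ratio works out to about 5x, which is the lower end of the quoted ~5-10X range.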
We ran the v1 and v2 parsers on DocLayNet and found the following overall behavior:
To build the parser, simply run the following command in the root folder,
rm -rf build; cmake -B ./build; cd build; make
You can run the parser from your build folder. Example from parse_v1,
% ./parse_v1.exe -h
A program to process PDF files or configuration files
Usage:
PDFProcessor [OPTION...]
-i, --input arg Input PDF file
-c, --config arg Config file
--create-config arg Create config file
-o, --output arg Output file
-l, --loglevel arg loglevel [error;warning;success;info]
-h, --help Print usage
Example from parse_v2,
% ./parse_v2.exe -h
program to process PDF files or configuration files
Usage:
PDFProcessor [OPTION...]
-i, --input arg Input PDF file
-c, --config arg Config file
--create-config arg Create config file
-p, --page arg Pages to process (default: -1 for all) (default:
-1)
-o, --output arg Output file
-l, --loglevel arg loglevel [error;warning;success;info]
-h, --help Print usage
If you don't have an input file, then a template input file will be printed on the terminal.
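If you want to drive the parse_v2 executable from a script, one option is a thin wrapper around `subprocess`. The sketch below only uses flags that appear in the help text above (`-i`, `-p`, `-o`, `-l`); the function names, the default executable path and the assumption that these flags can be freely combined are illustrative and not taken from the project.

```python
import subprocess
from typing import List, Optional

# Log levels as listed in the help text above.
LOG_LEVELS = ("error", "warning", "success", "info")


def build_parse_v2_command(
    input_pdf: str,
    page: Optional[int] = None,
    output_file: Optional[str] = None,
    loglevel: Optional[str] = None,
    executable: str = "./parse_v2.exe",  # assumed to be run from the build folder
) -> List[str]:
    """Assemble the argument list; optional flags are added only when given."""
    if loglevel is not None and loglevel not in LOG_LEVELS:
        raise ValueError(f"loglevel must be one of {LOG_LEVELS}, got {loglevel!r}")
    cmd = [executable, "-i", input_pdf]
    if page is not None:
        cmd += ["-p", str(page)]
    if output_file is not None:
        cmd += ["-o", output_file]
    if loglevel is not None:
        cmd += ["-l", loglevel]
    return cmd


def run_parse_v2(input_pdf: str, **kwargs) -> subprocess.CompletedProcess:
    """Run the parser; check=True raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_parse_v2_command(input_pdf, **kwargs), check=True)
```

Passing the arguments as a list (rather than a single shell string) avoids quoting problems with paths that contain spaces.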
To build the package, simply run (make sure poetry is installed),
poetry install
To test the package, run:
poetry run pytest ./tests -v -s
Please read Contributing to Docling Parse for details.
If you use Docling in your projects, please consider citing the following:
@techreport{Docling,
author = {Deep Search Team},
month = {8},
title = {Docling Technical Report},
url = {https://arxiv.org/abs/2408.09869},
eprint = {2408.09869},
doi = {10.48550/arXiv.2408.09869},
version = {1.0.0},
year = {2024}
}
The Docling Parse codebase is under MIT license. For individual model usage, please refer to the model licenses found in the original packages.