Docling Parse

Simple package to extract text, paths and bitmap images with coordinates from programmatic PDFs. This package is used in the Docling PDF conversion. Below, we show a few outputs of the latest parser with char-, word- and line-level output for text, in addition to the extracted paths and bitmap resources.

To create the visualizations yourself, run the following command (replace word with char or line to change the level):

poetry run python ./docling_parse/visualize.py -i <path-to-pdf-file> -c word --interactive

[Images: original page, char-level, word-level and line-level visualizations]

Quick start

Install the package from PyPI

pip install docling-parse

Convert a PDF (see visualize.py for a more detailed example)

from docling_parse.document import SegmentedPdfPageLabel
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

parser = DoclingPdfParser()

pdf_doc: PdfDocument = parser.load(
    path_or_stream="<path-to-pdf>"
)

# PdfDocument.iterate_pages() will automatically populate pages as they are yielded.
for page_no, pred_page in pdf_doc.iterate_pages():

    # iterate over the word-cells
    for word in pred_page.yield_cells(label=SegmentedPdfPageLabel.WORD):
        print(word.rect, ": ", word.text)

    # create a PIL image with the char cells
    img = pred_page.render(label=SegmentedPdfPageLabel.CHAR)
    img.show()
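
The loop above can also be wrapped in a small helper that gathers the text per page instead of printing it. The following is a minimal sketch, assuming only the calls shown in the example above (`load`, `iterate_pages`, `yield_cells` and the `text` attribute); the function names `collect_text_by_page` and `words_in_pdf` are hypothetical and not part of the package.

```python
from typing import Any, Dict, Iterable, List, Tuple


def collect_text_by_page(
    pages: Iterable[Tuple[int, Any]], label: Any
) -> Dict[int, List[str]]:
    """Map each page number to the text of its cells with the given label."""
    text_by_page: Dict[int, List[str]] = {}
    for page_no, page in pages:
        # yield_cells(label=...) is the call used in the quick-start example.
        text_by_page[page_no] = [cell.text for cell in page.yield_cells(label=label)]
    return text_by_page


def words_in_pdf(path: str) -> Dict[int, List[str]]:
    """Hypothetical convenience wrapper: word-level text for every page."""
    # Imported lazily so collect_text_by_page stays usable without the package.
    from docling_parse.document import SegmentedPdfPageLabel
    from docling_parse.pdf_parser import DoclingPdfParser

    pdf_doc = DoclingPdfParser().load(path_or_stream=path)
    return collect_text_by_page(pdf_doc.iterate_pages(), SegmentedPdfPageLabel.WORD)
```

Passing the page iterator and the label as arguments keeps the grouping logic independent of the parser, so the same helper should work for char- or line-level cells by changing the label.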

Use the CLI

$ docling-parse -h
usage: docling-parse [-h] -p PDF

Process a PDF file.

options:
  -h, --help         show this help message and exit
  -p PDF, --pdf PDF  Path to the PDF file
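
If you want to call the CLI from a script, one option is to go through `subprocess`. This is a sketch that relies only on the `--pdf` option from the help text above; the helper names are hypothetical, and since the help text does not describe what the command prints, the output is returned as raw text.

```python
import subprocess
from pathlib import Path
from typing import List, Union


def build_cli_command(pdf_path: Union[str, Path]) -> List[str]:
    """Return the argument list for `docling-parse --pdf <pdf>`."""
    return ["docling-parse", "--pdf", str(pdf_path)]


def run_cli(pdf_path: Union[str, Path]) -> str:
    """Run the CLI and return whatever it writes to standard output."""
    completed = subprocess.run(
        build_cli_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems for paths that contain spaces.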

Performance Benchmarks

Characteristics of different parser versions

Version	Original	Word-level	Snippet-level	Performance
V1		Not Supported		~0.250 sec/page
V2				~0.050 sec/page [~5-10X faster than v1]
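
To get a feeling for what these numbers mean for a larger job, the per-page timings can be turned into a rough estimate. The sketch below uses only the approximate figures from the table; the function names are hypothetical, and real timings will vary with the documents and the hardware.

```python
# Approximate per-page timings taken from the table above.
SECONDS_PER_PAGE = {"v1": 0.250, "v2": 0.050}


def estimated_seconds(num_pages: int, version: str) -> float:
    """Rough total parsing time for num_pages pages with the given version."""
    return num_pages * SECONDS_PER_PAGE[version]


def speedup(slow: str = "v1", fast: str = "v2") -> float:
    """Ratio of the nominal per-page timings of two parser versions."""
    return SECONDS_PER_PAGE[slow] / SECONDS_PER_PAGE[fast]


if __name__ == "__main__":
    for version in ("v1", "v2"):
        print(version, estimated_seconds(1000, version), "seconds for 1000 pages")
    print("nominal speedup:", speedup())
```

With the nominal figures, the ratio comes out at the lower end of the 5-10X range quoted in the table.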

Timings of different parser versions

We ran the v1 and v2 parsers on DocLayNet and observed the following overall behavior.

Development

CXX

To build the parser, run the following command in the root folder:

rm -rf build; cmake -B ./build; cd build; make

You can run the parser from your build folder. Example from parse_v1:

% ./parse_v1.exe -h
A program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

Example from parse_v2:

% ./parse_v2.exe -h
program to process PDF files or configuration files
Usage:
  PDFProcessor [OPTION...]

  -i, --input arg          Input PDF file
  -c, --config arg         Config file
      --create-config arg  Create config file
  -p, --page arg           Pages to process (default: -1 for all) (default:
                           -1)
  -o, --output arg         Output file
  -l, --loglevel arg       loglevel [error;warning;success;info]
  -h, --help               Print usage

If you don't have an input file, a template input file will be printed to the terminal.
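
For batch runs it can be convenient to drive the compiled parser from Python. The following sketch assembles a command line from the parse_v2 options listed above (`--input`, `--output`, `--page`, `--loglevel`); the helper names are hypothetical, and the executable path is an assumption that depends on where you built the project. The help text does not say whether page numbers start at 0 or 1, so check that before relying on `--page`.

```python
import subprocess
from typing import List, Optional


def build_parse_v2_command(
    executable: str,
    input_pdf: str,
    output_file: str,
    page: Optional[int] = None,
    loglevel: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the parse_v2 executable."""
    cmd = [executable, "--input", input_pdf, "--output", output_file]
    if page is not None:
        # When omitted, the parser falls back to its default of -1 (all pages).
        cmd += ["--page", str(page)]
    if loglevel is not None:
        # One of: error, warning, success, info (per the help text).
        cmd += ["--loglevel", loglevel]
    return cmd


def run_parse_v2(executable: str, input_pdf: str, output_file: str) -> None:
    """Run the parser on all pages and raise if it exits with an error."""
    subprocess.run(
        build_parse_v2_command(executable, input_pdf, output_file), check=True
    )
```

From the build folder, a call could look like `run_parse_v2("./parse_v2.exe", "input.pdf", "output.json")`, where both file names are placeholders.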

Python

To build the package, make sure Poetry is installed and run:

poetry install

To test the package, run:

poetry run pytest ./tests -v -s

Contributing

Please read Contributing to Docling Parse for details.

References

If you use Docling in your projects, please consider citing the following:

@techreport{Docling,
  author = {Deep Search Team},
  month = {8},
  title = {Docling Technical Report},
  url = {https://arxiv.org/abs/2408.09869},
  eprint = {2408.09869},
  doi = {10.48550/arXiv.2408.09869},
  version = {1.0.0},
  year = {2024}
}

License

The Docling Parse codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.
