Releases: deepdoctection/deepdoctection
v.0.22 Adding support for W&B, some new attributes for Image and Page, and small bug fixes
Enhancements
Summary
#121 Adding support for W&B logging and visualizing evaluation results
#132 Adding new properties for 'Page' and new attributes for 'Image'
Details
Adding support for W&B
The WandbTableAgent
is a new object that generates table rows with images and bounding boxes and sends this table to the W&B
server. Once a W&B account has been set up, this class allows monitoring evaluation results during training.
Moreover, a WandbWriter
has been added that allows writing logs in JSON format and sending them to W&B.
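For orientation, here is a minimal sketch of what such a table looks like on the W&B side, using the wandb client directly (the project name and box data are made up; the actual WandbTableAgent assembles this for you from evaluation results):

```python
import numpy as np
import wandb

run = wandb.init(project="deepdoctection-eval")   # hypothetical project name
table = wandb.Table(columns=["image_id", "annotated_image"])

image = np.zeros((400, 300, 3), dtype=np.uint8)   # stand-in for a page image
boxes = {
    "predictions": {
        "box_data": [
            {
                "position": {"minX": 0.1, "maxX": 0.6, "minY": 0.2, "maxY": 0.5},
                "class_id": 1,
                "box_caption": "table",
            }
        ],
        "class_labels": {1: "table"},
    }
}
table.add_data("sample-0001", wandb.Image(image, boxes=boxes))
run.log({"eval_samples": table})
run.finish()
```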
Adding new properties for Table and Image objects
Some properties have been added to Table:
- `csv`: Returns a list of lists of strings (one string per cell entry)
- `__str__`: Returns a string representation of the table
Some attributes have been added to Image in order to take care of data lineage for multi-page documents (see the sketch below):
- `document_id`: Global document identifier (equal to `image_id` for single page documents)
- `page_number`: Page number in multi-page documents (defaults to 0)
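A hedged usage sketch of the new properties and attributes; the analyzer call follows the usual deepdoctection pattern, the input file name is made up, and that both lineage attributes are surfaced on the Page view is assumed from the description above:

```python
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()
df = analyzer.analyze(path="sample.pdf")   # hypothetical multi-page document
df.reset_state()

for page in df:
    # same document_id on every page of a document, page_number starting at 0
    print(page.document_id, page.page_number)
    for table in page.tables:
        print(str(table))                  # __str__: human-readable representation
        rows = table.csv                   # list of lists of strings, one per cell
```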
Bugs
#129 with commit 5fb2355
NN (defrost Hugging Face hub version) with #122
#124 (partial) with #125
NN (small bug fixes related to PR #117) with #118
v.0.21 Adding support for table transformer and new pipeline component
Enhancements
Summary
#101 Docs are now built with MkDocs, Material for MkDocs as well as mkdocstrings. This PR is already in production
#110 Adding state_id
#115 Adding Table-Transformer with custom pipeline components
#117 Adding pipeline component for NMS per pairs
Details
Adding state_id
ImageAnnotation objects change as they pass through a pipeline. To better detect such changes, the state_id
is introduced, which, unlike the static annotation_id, changes when the annotation changes, e.g. by adding sub-categories.
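A minimal conceptual sketch (not the actual deepdoctection implementation) of how a static annotation_id and a state-dependent state_id can differ:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Annotation:
    category_name: str
    # static: fixed at creation and never changes
    annotation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    sub_categories: dict = field(default_factory=dict)

    @property
    def state_id(self) -> str:
        # derived from the current state, so it changes when sub-categories are added
        state = (self.annotation_id, tuple(sorted(self.sub_categories)))
        return str(uuid.uuid5(uuid.NAMESPACE_OID, repr(state)))

ann = Annotation("table")
before = ann.state_id
ann.sub_categories["row_number"] = "1"
assert ann.annotation_id == ann.annotation_id  # unchanged by design
assert ann.state_id != before                  # state_id reflects the change
```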
Adding Table-Transformer
The following has been added (see the usage sketch after this list):
- Dataset pubtables1m_struct for table structure recognition using Pubtables-1M
- A derived ObjectDetector wrapper HFDetrDerivedDetector for TableTransformerForObjectDetection
- A pipeline component PubtablesSegmentationService in order to segment the table structure recognition results of the model. (The segmentation following the first approach in deepdoctection cannot be used.)
- A training script for training TableTransformerForObjectDetection models for object detection, as well as DataCollator and Detr mappers.
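For orientation, this is roughly what the underlying Hugging Face model does when called directly with a recent transformers version (the checkpoint name and threshold are illustrative; in deepdoctection the HFDetrDerivedDetector wraps these steps):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-structure-recognition"  # illustrative
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

image = Image.open("table.png").convert("RGB")   # hypothetical table crop
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# convert logits to boxes in absolute image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```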
Adding AnnotationNmsService
The new AnnotationNmsService allows running non-maximum suppression on pairs, or more generally on groups, of image annotations. In contrast to the post-processing step within object detectors, this step can suppress annotations that have been detected by different detectors.
This service runs for TensorFlow and PyTorch and chooses the necessary functions accordingly.
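The core operation is standard non-maximum suppression; a minimal PyTorch sketch with two overlapping boxes, e.g. one coming from each of two detectors:

```python
import torch
from torchvision.ops import nms

# two heavily overlapping boxes in (x1, y1, x2, y2) format
boxes = torch.tensor([[10.0, 10.0, 100.0, 60.0],
                      [12.0, 11.0, 102.0, 58.0]])
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0]): the lower-scoring duplicate is suppressed
```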
Bugs
v.0.20 Adding support to LayoutLMv2, LayoutXLM, LayoutLMv3 and moving notebooks to separate repo
Enhancements
Summary
#94 Adding support for LayoutLMv2 and LayoutXLM
#99 Adding support for LayoutLMv3
#97 Refactoring repo structure and moving Jupyter notebooks to the separate notebooks repo
Details
Adding support for LayoutLMv2 and LayoutXLM
- Model wrappers for LayoutLMv2 have been added. To give the whole concept some more structure, two new base classes (subclassed from LMTokenClassifier and LMSequenceClassifier, respectively), HFLayoutLmTokenClassifierBase and HFLayoutLmSequenceClassifierBase, have been added.
- Adding sliding windows for training and inference (see the sketch further below): Before adding sliding windows, pages with more than 512 tokens could only be processed by splitting the page batch into several disjoint batches. This approach has the disadvantage that one loses context, especially for tokens very close to where the batch has been dissected. A sliding window generates several overlapping batches, so that there is always a batch in which any token (except the very first and last ones) has context. For inference one needs to add a post-processing step for tokens appearing in more than one batch: we currently choose the prediction with the highest score, but there are other approaches. The effect on inference has not been tested yet, and the implementation may be subject to change.
- Adding support in the training script for LayoutLMv2 and LayoutXLM.
Note: As transformers tokenizers implement the distribution of bounding boxes to tokens, which is already part of the pre-processing step in this library, users must not use these tokenizers but have to take the tokenizers that generate the vocabulary of the underlying language model. This means:
LayoutLMv2 -> LayoutLMTokenizerFast
LayoutXLM -> XLMRobertaTokenizerFast
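A self-contained sketch of the sliding-window idea described above (window size, stride and the max-score resolution step are illustrative, not the library's exact parameters):

```python
from typing import Dict, List, Tuple

def sliding_windows(token_ids: List[int], max_len: int = 512,
                    stride: int = 128) -> List[Tuple[int, List[int]]]:
    """Split a long token sequence into overlapping windows.
    Returns (offset, window) pairs; the offset maps window positions
    back to positions on the page."""
    windows, start = [], 0
    while True:
        windows.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            return windows
        start += max_len - stride

def resolve(candidates: Dict[int, List[Tuple[float, str]]]) -> Dict[int, str]:
    """For tokens covered by several windows, keep the prediction
    with the highest score."""
    return {idx: max(preds, key=lambda p: p[0])[1]
            for idx, preds in candidates.items()}
```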
Refactoring repo structure
Disentangling code base from jupyter notebooks.
Adding LayoutLMv3 and more features for LayoutLM processing
- Adding HFLayoutLmv3SequenceClassifier and HFLayoutLmv3TokenClassifier.
- LMTokenClassifierService does not require a mapping function in its __init__ method anymore, because the inference processing works with one single mapping for all models.
- When processing features, it is now possible to choose the segment positions to be used as bounding boxes. This implies that the segment positions need child-specific relationships to the words.
- Evaluator has a new method compare which makes it possible to compare ground truth samples from a dataset with predictions from a pipeline. Currently, only object detection models can be compared.
Bugs
v.0.19 Patch release
Patch release:
Due to changes of the hf_hub as of release 0.11.0, only versions <0.11.0 can currently be used.
v.0.18 Adding jdeskew, modifying the `Page` object and further improving usage (hopefully!)
Enhancements
Summary
#69 Modified cell merging in table refinement process and new row/column stretching rule
#72 Optimizing reading order
#76 Refactoring pipeline base classes
#82 Adding an image transformer and corresponding pipeline component
#86 Modify API for analyzing document output
Details
Modified cell merging in table refinement process and new row/column stretching rule
TableSegmentationRefinementService: When merging cells, the merged cell can be equal to one of the input cells (e.g. if the largest cell contains all other cells). In this case the merged cell cannot be dumped and the smaller cells won't be deactivated. Logic has been added that deals with this situation.
TableSegmentationService: To tile tables with rows and columns more evenly, a new row/column stretching rule has been added.
Optimizing reading order
Optimization of the arrangement of layout blocks, so that the reading order becomes more robust even when the layout elements vary heavily.
Refactoring pipeline base classes
- Adding a new attribute name so that each pipeline component in a pipeline can be uniquely described by its predictor and its component.
- Removing some parameters from classes to which they do not really belong.
- Adding the method get_pipeline_info to the abstract base class Pipeline.
Adding an image transformer (not a model) and corresponding pipeline component (closes #30)
- Adding the package jdeskew to estimate the distortion angle of a skewed document and to rotate it accordingly, so that text lines are horizontal and easier to consume for OCR systems.
- Adding a new class interface ImageTransformer that accepts and returns an image as a numpy array, and a new pipeline component SimpleTransformService that accepts an ImageTransformer and updates the necessary metadata.
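A minimal sketch of the underlying jdeskew calls (the file names are made up; inside a pipeline, SimpleTransformService performs this step for you):

```python
import cv2
from jdeskew.estimator import get_angle
from jdeskew.utility import rotate

image = cv2.imread("skewed_scan.png")
angle = get_angle(image)           # estimated distortion angle in degrees
deskewed = rotate(image, angle)    # text lines become horizontal again
cv2.imwrite("deskewed_scan.png", deskewed)
```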
Modify API for analyzing document output
- ObjectTypes strings have been changed to lower case. The reason for this is that ObjectTypes members are now made available as attributes for sub classes of ImageAnnotationObj.
- Unused and deprecated data structures have been deleted.
- A new Page object, now derived from Image, has been created. This new object replaces the object of the same name. Moreover, a couple of Layout structures have been created. Both Page and Layout represent a view on the underlying Image, resp. ImageAnnotation, and provide a more intuitive interface to document parsing, text extraction/text classification than the Image and ImageAnnotation classes.
- A new class CustomDataset has been added to provide users an easy interface for creating custom datasets. This class reduces the boilerplate: users now only have to write a DataFlowBuilder and instantiate CustomDataset (see the sketch after this list).
- ModelProfile has been provided with a new attribute model_wrapper.
- TextExtractionService has been provided with a new option run_time_ocr_language_selection. If Tesseract has been chosen as text_extract_detector and a LanguageDetectionService is a predecessor pipeline component, setting run_time_ocr_language_selection=True will select the Tesseract model for the predicted language. You can therefore have different languages in one stream of documents.
- All notebooks have been revisited and updated. Many of them were almost one year old and did not give an exhaustive overview of what can be solved with the library.
- Beside the notebooks, a substantial part of the docs has been updated.
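A hedged sketch of what the reduced boilerplate might look like; the constructor arguments and the builder base-class name are assumptions inferred from the description above, not verified signatures:

```python
import deepdoctection as dd

class MyBuilder(dd.DataFlowBaseBuilder):               # assumed base class
    def build(self, **kwargs):
        # yield Image datapoints built from your annotation files
        ...

my_dataset = dd.CustomDataset(
    name="my_layout_set",                              # hypothetical name
    dataset_type=dd.DatasetType.object_detection,      # assumed enum member
    location="my_layout_set",                          # sub-folder of the dataset cache
    init_categories=[dd.LayoutType.text, dd.LayoutType.table],
    dataflow_builder=MyBuilder,
)
```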
Bugs
#66 with PR #68
#70 with PR #71
#73 with PR #74
#77 with PR #78
#80 with PR #81
#84 with PR #85
v.0.17 More metrics, more LayoutLM tutorials, dropping names, unifying logging, simplifying distribution
Enhancements
Summary
#55 Adding precision/recall/F1 metrics
#57 More docs for LayoutLM
#61 Enums for categories
#63 Unifying log messages
#65 Reducing the number of extra install options
Details
Adding Precision/recall/F1 metrics
Precision, recall and F1 metrics (macro/micro/average versions) have been added to evaluate token classification models.
Regarding visualization, some options have been added to display token class output at page level.
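To make the macro/micro distinction concrete, here is a small stand-alone example with scikit-learn (the library's own metric implementation may differ):

```python
from sklearn.metrics import precision_recall_fscore_support

# token-level ground truth and predictions, e.g. FUNSD-style tags
y_true = ["B-header", "O", "B-answer", "B-answer", "O"]
y_pred = ["B-header", "O", "B-answer", "O", "O"]

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```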
More docs for LayoutLM
As side notes, two docs have been added that discuss
- results of sequence classification problems on modern type documents
- results of LayoutLM models with visual backbone trained on layout analysis tasks
Enums for categories
The current data model is based on object detection tasks. This can be seen by the choice of classes, which includes Image, CategoryAnnotation and ImageAnnotation, and the relationships between ImageAnnotation and sub categories. On the other hand, however, category types from Document AI tasks are generally used to set up the sequential steps in the code base. These category types have been stored in the category_names attribute as a string type. All category types are currently an attribute of the AttrDict instance names. As the number of category types increases, this procedure means that the names cannot be maintained well. Furthermore, one is not able to group category types.
This weakness is eliminated with the introduction of special Enum types for groups of categories. In the future, an Enum member will be stored in the category_names attribute. This ensures that categories can also be controlled via Enum types. Enum members will also be used as keys of sub categories. Enums are defined as string Enums, so one can still call Enum members with their original names.
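A minimal sketch of the string Enum pattern described above (the member names are illustrative):

```python
from enum import Enum

class LayoutType(str, Enum):
    text = "text"
    table = "table"

# a string Enum member compares equal to its plain string value,
# so existing code passing raw strings keeps working
assert LayoutType.table == "table"
assert LayoutType("table") is LayoutType.table
sub_categories = {LayoutType.table: ...}   # Enum members as sub-category keys
```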
Unifying log messages
Log messages have been unified across all libraries, while keeping logs unchanged when they are devoted to training scripts - so that Tensorboard works correctly. Moreover, many assertion errors have been replaced with a more precise built-in error type.
Reducing number of extra install options
The number of extra install options has been reduced by two. The installation docs have been modified accordingly.
The concept of lazy modules has been added. Lazy modules allow deferring the import of a module until the moment it is used for the first time, which gives some speed gains.
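One common way to implement lazy modules in Python is a module-level __getattr__ (PEP 562); a sketch, not necessarily the exact mechanism used here:

```python
# package __init__.py: defer a heavy import until first attribute access
import importlib
from typing import Any

_LAZY_SUBMODULES = {"heavy": ".heavy"}   # hypothetical submodule

def __getattr__(name: str) -> Any:
    if name in _LAZY_SUBMODULES:
        module = importlib.import_module(_LAZY_SUBMODULES[name], __package__)
        globals()[name] = module         # cache so __getattr__ runs only once
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```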
Bugs
fixes: #53 with PR #54
v.0.16 LayoutLM, Evaluation over pipelines, TEDS metric, Doclaynet and re-design of Page class
Enhancements
Summary
Evaluator running over pipelines: #38
Adding Tree edit distance metric: #38
Adding LayoutLMv1 model: #44
Adding Doclaynet dataset: #45
New design of Page class: #47
Details
Evaluator running over pipelines:
When running evaluation for table recognition, the predictions depend on a chain of pipeline components for object detection and post-processing (cell/row/column matching and table refinement). The evaluator therefore needs to compare the ground truth of a dataset with the prediction of a whole pipeline.
- For comparing prediction and ground truth on a datapoint, the evaluator first has to make a copy of the ground truth and then needs to erase all interim results that will later be generated when running through the pipeline. In order to know what has to be erased, a new metadata scheme had to be established for each pipeline component, indicating what type of annotation (image annotation or category annotation) will be generated when passing the datapoint through the component.
- Moreover, additional functions had to be added to each metric, so that one can specify over which sub category/summary the evaluation is required.
Adding TEDS metric
Tree edit distance has been proposed for comparing HTML representations of tables in the realm of table recognition. It is possible to call this metric on a given category for every task that generates an XML representation. The code has mainly been taken from the PubTabNet repo.
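The underlying operation is a plain tree edit distance; a tiny sketch with the apted package on two bracket-notation trees (TEDS itself additionally normalizes the distance into a similarity score):

```python
# pip install apted
from apted import APTED
from apted.helpers import Tree

# simplified table structures in bracket notation
t1 = Tree.from_text("{table{tr{td}{td}}}")
t2 = Tree.from_text("{table{tr{td}}}")
print(APTED(t1, t2).compute_edit_distance())  # 1: one td node must be deleted
```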
Adding LayoutLMv1
The major addition of this release is support for training, evaluating and running LayoutLMv1 models in deepdoctection pipelines. Using separate pipeline components for sequence and token classification with support of LayoutLM massively extends the applicability of this repo. LayoutLMv1 is basically a BERT model that accepts multimodal features like tokens and bounding boxes and comes in different flavors. Higher LayoutLM versions with additional features will be added in later releases.
The model comes with a training script for fine-tuning, based on the custom trainer from the transformers library. A notebook showcasing the new functionality has been added.
Adding Doclaynet
Doclaynet is a new dataset for document layout analysis that contains around 80k manually labeled images of financial reports, patents and other documents. Compared to the automatically generated labels of other datasets like Publaynet, Doclaynet has a high variability in document layouts, which allows training models that can determine layouts for a large variety of documents.
New design of page class
The original Page class suffered from poor design choices, resulting in challenges when adding additional features to the output. It therefore had to be completely redesigned and simplified; it now has a modular approach that can easily be extended with new components.
v.0.15 patch: update setup.py
Patch to add long description
v.0.14 - Refactoring extension modules, optimizing typing and adding Detectron2 training scripts
Enhancements
Summary
Re-organizing extra dependencies: #35
Optimizing typing: #36
Training script for Detectron2 and new models: #37
Details
Re-organizing extra dependencies
Adding basic, full and all extra dependencies for TF, as well as full and all dependencies for PT. Compared to the old dependency setting, it is no longer compulsory in the basic setting to have pycocotools or lxml. The all dependencies include all packages that have a predictor wrapper.
Setup now has several installation options, depending on whether the package has been downloaded from PyPI or GitHub.
The test suite has been divided into test groups according to the additional package distributions.
CI for tests when merging into master has been added.
Optimizing typing
Static typing has been optimized to reduce the massive number of typing issues caused by incorrect type hints. Some additional types (e.g. Pathlike) have been added.
Training scripts for Detectron2 and new PyTorch models
Detectron2 is now on an equal footing with Tensorpack's models and is easily trainable on dd datasets. Training metrics show that this framework is superior to the Tensorpack implementation in terms of speed and accuracy. A training script with an API identical to Tensorpack's has been provided, which is based on D2's train_net script. Central to this script is a trainer derived from D2's DefaultTrainer with custom data loading methods; a skeleton follows below.
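A skeleton of that pattern (build_train_loader is Detectron2's hook; the body here is a placeholder, see the actual training script):

```python
from detectron2.engine import DefaultTrainer

class DDTrainer(DefaultTrainer):
    """Trainer skeleton with a custom data loading hook."""

    @classmethod
    def build_train_loader(cls, cfg):
        # plug a deepdoctection dataflow in here instead of Detectron2's
        # default dataset loading (placeholder for this sketch)
        raise NotImplementedError

# usage sketch:
#   cfg = detectron2.config.get_cfg(); cfg.merge_from_file("config.yaml")
#   DDTrainer(cfg).train()
```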
Training of the provided PyTorch models has been resumed for 20-50k iterations to overcome the poor accuracy at higher IoU thresholds.
v.0.13 - Language detection, merging datasets, unifying registries
Enhancements
Summary
Language detection: #33
Merging datasets: #34
Details
Language detection
Adding a predictor for language detection that accepts a string and predicts its language. As model, we use the large fasttext word embedding model that is also included in the model catalog. Determining the language is crucial when applying downstream NLP tasks.
Along with the language detector, a new pipeline component LanguageDetectionService has been implemented. The service can be used in two situations (see the sketch after this list):
- before the text extraction: An OCR predictor extracts a snippet from a region of the page and passes it to the language detector. The result can then be used to do a proper text extraction with an OCR model specialized in the inferred language.
- after the text extraction: If the text extraction does not really depend on a specific language (e.g. text extraction with a PDF miner), one can use the pipeline component to determine a more confident prediction.
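For illustration, calling fasttext's language identification directly (the local model file name is an assumption for this sketch; deepdoctection downloads the model via its model catalog):

```python
import fasttext

# lid.176.bin is fastText's public language identification model
model = fasttext.load_model("lid.176.bin")
labels, scores = model.predict("Das ist ein deutscher Beispielsatz.")
print(labels[0], float(scores[0]))  # e.g. __label__de 0.99
```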
Merging datasets
Adding a new class derived from DatasetBase to construct datasets as a union of pre-selected datapoints without touching the original datasets.
To train models on multiple datasets, MergeDataset accepts a number of datasets and builds metadata (e.g. categories) and a dataflow based on their inputs. Configuring the datasets (filtering, replacing categories with sub-categories) before creating the merge is allowed, as is configuring the dataflow of each individual dataset.
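A hedged sketch of the intended usage; the dataset names and method calls below are assumptions based on the description above, not verified signatures:

```python
import deepdoctection as dd

ds_1 = dd.get_dataset("publaynet")     # assumed registry call
ds_2 = dd.get_dataset("fintabnet")

merged = dd.MergeDataset(ds_1, ds_2)   # union without touching the originals
merged.buffer_datasets()               # assumed: materialize selected datapoints
df = merged.dataflow.build()           # dataflow over the merged datapoints
```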
Bugs
Getting started notebook fails: #26
When printing tables from page object, output does not show last row: #28
Some pipeline components do not have a clone method: #31
Improvements
Dataclass for model profile and new ModelCatalog
Models are now registered with a dataclass that allows saving the necessary metadata (URLs, hf repo id, etc.) and retrieving the information from the ModelCatalog.
Unifying registries
For metrics, datasets and pipeline components we now use the small library catalogue, which easily allows creating registries for these components. This is especially useful for registering custom objects in individual projects.
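The catalogue library makes such a registry a one-liner; a small sketch (the namespace and object names are made up):

```python
import catalogue

# one registry per object kind, namespaced under the package name
metrics = catalogue.create("deepdoctection", "metrics")

@metrics.register("my_custom_metric")
def my_custom_metric():
    ...

assert "my_custom_metric" in metrics.get_all()
```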
Silence some TF warnings
Some TF warnings (esp. for warnings appearing in TF >= 2.5) are now silenced.