Releases: deepdoctection/deepdoctection
v.0.22 Adding support for W&B, some new attributes for Image and Page, and small bug fixes
Enhancements
Summary
#121 Adding support for W&B logging and visualizing evaluation results
#132 Adding new properties for 'Page' and new attributes for 'Image'
Details
Adding support for W&B
The WandbTableAgent
is a new object that generates table rows with images and bounding boxes and sends this table to the W&B
server. Once a W&B account has been set up, this class allows monitoring evaluation results during training.
Moreover, a WandbWriter
has been added that allows writing logs in JSON format and sending them to W&B.
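For orientation, here is a minimal sketch of what such a table looks like on the W&B side, using the wandb client directly (the project name and box data are made up; the actual WandbTableAgent assembles this for you from evaluation results):

```python
import numpy as np
import wandb

run = wandb.init(project="deepdoctection-eval")   # hypothetical project name
table = wandb.Table(columns=["image_id", "annotated_image"])

image = np.zeros((400, 300, 3), dtype=np.uint8)   # stand-in for a page image
boxes = {
    "predictions": {
        "box_data": [
            {
                "position": {"minX": 0.1, "maxX": 0.6, "minY": 0.2, "maxY": 0.5},
                "class_id": 1,
                "box_caption": "table",
            }
        ],
        "class_labels": {1: "table"},
    }
}
table.add_data("sample-0001", wandb.Image(image, boxes=boxes))
run.log({"eval_samples": table})
run.finish()
```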
Adding new properties for Table and Image objects
Some properties have been added to Table:
- `csv`: Returns a list of lists of strings (one string per cell entry)
- `__str__`: Returns a string representation of the table
Some attributes have been added to Image in order to take care of data lineage for multi-page documents (see the sketch below):
- `document_id`: Global document identifier (equal to `image_id` for single page documents)
- `page_number`: Page number in multi-page documents (defaults to 0)
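A hedged usage sketch of the new properties and attributes; the analyzer call follows the usual deepdoctection pattern, the input file name is made up, and that both lineage attributes are surfaced on the Page view is assumed from the description above:

```python
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()
df = analyzer.analyze(path="sample.pdf")   # hypothetical multi-page document
df.reset_state()

for page in df:
    # same document_id on every page of a document, page_number starting at 0
    print(page.document_id, page.page_number)
    for table in page.tables:
        print(str(table))                  # __str__: human-readable representation
        rows = table.csv                   # list of lists of strings, one per cell
```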
Bugs
#129 with commit 5fb2355
NN (defrost Hugging Face hub version) with #122
#124 (partial) with #125
NN (small bug fixes related to PR #117) with #118
v.0.21 Adding support for table transformer and new pipeline component
Enhancements
Summary
#101 Docs are now built with MkDocs, Material for MkDocs as well as mkdocstrings. This PR is already in production
#110 Adding state_id
#115 Adding Table-Transformer with custom pipeline components
#117 Adding pipeline component for NMS per pairs
Details
Adding state_id
ImageAnnotation objects change as they pass through a pipeline. To better detect such changes, the state_id
is introduced, which, unlike the static annotation_id, changes when the annotation changes, e.g. by adding sub-categories.
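A minimal conceptual sketch (not the actual deepdoctection implementation) of how a static annotation_id and a state-dependent state_id can differ:

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Annotation:
    category_name: str
    # static: fixed at creation and never changes
    annotation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    sub_categories: dict = field(default_factory=dict)

    @property
    def state_id(self) -> str:
        # derived from the current state, so it changes when sub-categories are added
        state = (self.annotation_id, tuple(sorted(self.sub_categories)))
        return str(uuid.uuid5(uuid.NAMESPACE_OID, repr(state)))

ann = Annotation("table")
before = ann.state_id
ann.sub_categories["row_number"] = "1"
assert ann.annotation_id == ann.annotation_id  # unchanged by design
assert ann.state_id != before                  # state_id reflects the change
```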
Adding Table-Transformer
The following has been added (see the usage sketch after this list):
- Dataset pubtables1m_struct for table structure recognition using Pubtables-1M
- A derived ObjectDetector wrapper HFDetrDerivedDetector for TableTransformerForObjectDetection
- A pipeline component PubtablesSegmentationService in order to segment the table structure recognition results of the model. (The segmentation following the first approach in deepdoctection cannot be used.)
- A training script for training TableTransformerForObjectDetection models for object detection, as well as DataCollator and Detr mappers.
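For orientation, this is roughly what the underlying Hugging Face model does when called directly with a recent transformers version (the checkpoint name and threshold are illustrative; in deepdoctection the HFDetrDerivedDetector wraps these steps):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-structure-recognition"  # illustrative
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

image = Image.open("table.png").convert("RGB")   # hypothetical table crop
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# convert logits to boxes in absolute image coordinates
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```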
Adding AnnotationNmsService
The new AnnotationNmsService allows running non-maximum suppression on pairs, or more generally on groups, of image annotations. In contrast to the post-processing step within object detectors, this step can suppress annotations that have been detected by different detectors.
This service runs for TensorFlow and PyTorch and chooses the necessary functions accordingly.
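The core operation is standard non-maximum suppression; a minimal PyTorch sketch with two overlapping boxes, e.g. one coming from each of two detectors:

```python
import torch
from torchvision.ops import nms

# two heavily overlapping boxes in (x1, y1, x2, y2) format
boxes = torch.tensor([[10.0, 10.0, 100.0, 60.0],
                      [12.0, 11.0, 102.0, 58.0]])
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0]): the lower-scoring duplicate is suppressed
```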
Bugs
v.0.20 Adding support to LayoutLMv2, LayoutXLM, LayoutLMv3 and moving notebooks to separate repo
Enhancements
Summary
#94 Adding support for LayoutLMv2 and LayoutXLM
#99 Adding support for LayoutLMv3
#97 Refactoring repo structure and moving Jupyter notebooks to the separate notebooks repo
Details
Adding support for LayoutLMv2 and LayoutXLM
- Model wrappers for LayoutLMv2 have been added. To give the whole concept some more structure, two new base classes (subclassed from LMTokenClassifier and LMSequenceClassifier, respectively), HFLayoutLmTokenClassifierBase and HFLayoutLmSequenceClassifierBase, have been added.
- Adding sliding windows for training and inference (see the sketch further below): Before adding sliding windows, pages with more than 512 tokens could only be processed by splitting the page batch into several disjoint batches. This approach has the disadvantage that one loses context, especially for tokens very close to where the batch has been dissected. A sliding window generates several overlapping batches, so that there is always a batch in which any token (except the very first and last ones) has context. For inference one needs to add a post-processing step for tokens appearing in more than one batch: we currently choose the prediction with the highest score, but there are other approaches. The effect on inference has not been tested yet, and the implementation may be subject to change.
- Adding support in the training script for LayoutLMv2 and LayoutXLM.
Note: As transformers tokenizers implement the distribution of bounding boxes to tokens, which is already part of the pre-processing step in this library, users must not use these tokenizers but have to take the tokenizers that generate the vocabulary of the underlying language model. This means:
LayoutLMv2 -> LayoutLMTokenizerFast
LayoutXLM -> XLMRobertaTokenizerFast
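A self-contained sketch of the sliding-window idea described above (window size, stride and the max-score resolution step are illustrative, not the library's exact parameters):

```python
from typing import Dict, List, Tuple

def sliding_windows(token_ids: List[int], max_len: int = 512,
                    stride: int = 128) -> List[Tuple[int, List[int]]]:
    """Split a long token sequence into overlapping windows.
    Returns (offset, window) pairs; the offset maps window positions
    back to positions on the page."""
    windows, start = [], 0
    while True:
        windows.append((start, token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            return windows
        start += max_len - stride

def resolve(candidates: Dict[int, List[Tuple[float, str]]]) -> Dict[int, str]:
    """For tokens covered by several windows, keep the prediction
    with the highest score."""
    return {idx: max(preds, key=lambda p: p[0])[1]
            for idx, preds in candidates.items()}
```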
Refactoring repo structure
Disentangling code base from jupyter notebooks.
Adding LayoutLMv3 and more features for LayoutLM processing
- Adding HFLayoutLmv3SequenceClassifier and HFLayoutLmv3TokenClassifier.
- LMTokenClassifierService does not require a mapping function in its __init__ method anymore, because the inference processing works with one single mapping for all models.
- When processing features, it is now possible to choose the segment positions to be used as bounding boxes. This implies that the segment positions need child-specific relationships to the words.
- Evaluator has a new method compare which makes it possible to compare ground truth samples from a dataset with predictions from a pipeline. Currently, only object detection models can be compared.
Bugs
v.0.19 Patch release
Patch release:
Due to changes of the hf_hub as of release 0.11.0, only versions <0.11.0 can currently be used.
v.0.18 Adding jdeskew, modifying the `Page` object and further improving usage (hopefully!)
Enhancements
Summary
#69 Modified cell merging in table refinement process and new row/column stretching rule
#72 Optimizing reading order
#76 Refactoring pipeline base classes
#82 Adding an image transformer and corresponding pipeline component
#86 Modify API for analyzing document output
Details
Modified cell merging in table refinement process and new row/column stretching rule
TableSegmentationRefinementService: When merging cells, the merged cell can be equal to one of the input cells (e.g. if the largest cell contains all other cells). In this case the merged cell cannot be dumped and the smaller cells won't be deactivated. Logic has been added that deals with this situation.
TableSegmentationService: To tile tables with rows and columns more evenly, a new row/column stretching rule has been added.
Optimizing reading order
Optimization of the arrangement of layout blocks, so that the reading order becomes more robust even when the layout elements vary heavily.
Refactoring pipeline base classes
- Adding a new attribute name so that each pipeline component in a pipeline can be uniquely described by its predictor and its component.
- Removing some parameters from classes to which they do not really belong.
- Adding the method get_pipeline_info to the abstract base class Pipeline.
Adding an image transformer (not a model) and corresponding pipeline component (closes #30)
- Adding the package jdeskew to estimate the distortion angle of a skewed document and to rotate it accordingly, so that text lines are horizontal and easier to consume for OCR systems.
- Adding a new class interface ImageTransformer that accepts and returns an image as a numpy array, and a new pipeline component SimpleTransformService that accepts an ImageTransformer and updates the necessary metadata.
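A minimal sketch of the underlying jdeskew calls (the file names are made up; inside a pipeline, SimpleTransformService performs this step for you):

```python
import cv2
from jdeskew.estimator import get_angle
from jdeskew.utility import rotate

image = cv2.imread("skewed_scan.png")
angle = get_angle(image)           # estimated distortion angle in degrees
deskewed = rotate(image, angle)    # text lines become horizontal again
cv2.imwrite("deskewed_scan.png", deskewed)
```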
Modify API for analyzing document output
- ObjectTypes strings have been changed to lower case. The reason for this is that ObjectTypes members are now made available as attributes for sub classes of ImageAnnotationObj.
- Unused and deprecated data structures have been deleted.
- A new Page object, now derived from Image, has been created. This new object replaces the object of the same name. Moreover, a couple of Layout structures have been created. Both Page and Layout represent a view on the underlying Image, resp. ImageAnnotation, and provide a more intuitive interface to document parsing, text extraction/text classification than the Image and ImageAnnotation classes.
- A new class CustomDataset has been added to provide users an easy interface for creating custom datasets. This class reduces the boilerplate: users now only have to write a DataFlowBuilder and instantiate CustomDataset (see the sketch after this list).
- ModelProfile has been provided with a new attribute model_wrapper.
- TextExtractionService has been provided with a new option run_time_ocr_language_selection. If Tesseract has been chosen as text_extract_detector and a LanguageDetectionService is a predecessor pipeline component, setting run_time_ocr_language_selection=True will select the Tesseract model for the predicted language. You can therefore have different languages in one stream of documents.
- All notebooks have been revisited and updated. Many of them were almost one year old and did not give an exhaustive overview of what can be solved with the library.
- Beside the notebooks, a substantial part of the docs has been updated.
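A hedged sketch of what the reduced boilerplate might look like; the constructor arguments and the builder base-class name are assumptions inferred from the description above, not verified signatures:

```python
import deepdoctection as dd

class MyBuilder(dd.DataFlowBaseBuilder):               # assumed base class
    def build(self, **kwargs):
        # yield Image datapoints built from your annotation files
        ...

my_dataset = dd.CustomDataset(
    name="my_layout_set",                              # hypothetical name
    dataset_type=dd.DatasetType.object_detection,      # assumed enum member
    location="my_layout_set",                          # sub-folder of the dataset cache
    init_categories=[dd.LayoutType.text, dd.LayoutType.table],
    dataflow_builder=MyBuilder,
)
```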
Bugs
#66 with PR #68
#70 with PR #71
#73 with PR #74
#77 with PR #78
#80 with PR #81
#84 with PR #85
v.0.17 More metrics, more LayoutLM tutorials, dropping names, unifying logging, simplifying distribution
Enhancements
Summary
#55 Adding precision/recall/F1 metrics
#57 More docs for LayoutLM
#61 Enums for categories
#63 Unifying log messages
#65 Reducing the number of extra install options
Details
Adding Precision/recall/F1 metrics
Precision, recall and F1 metrics (macro/micro/average versions) have been added to evaluate token classification models.
Regarding visualization, some options have been added to display token class output at page level.
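To make the macro/micro distinction concrete, here is a small stand-alone example with scikit-learn (the library's own metric implementation may differ):

```python
from sklearn.metrics import precision_recall_fscore_support

# token-level ground truth and predictions, e.g. FUNSD-style tags
y_true = ["B-header", "O", "B-answer", "B-answer", "O"]
y_pred = ["B-header", "O", "B-answer", "O", "O"]

for avg in ("micro", "macro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```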
More docs for LayoutLM
As side notes, two docs have been added that discuss
- results of sequence classification problems on modern type documents
- results of LayoutLM models with visual backbone trained on layout analysis tasks
Enums for categories
The current data model is based on object detection tasks. This can be seen by the choice of classes, which includes Image, CategoryAnnotation and ImageAnnotation, and the relationships between ImageAnnotation and sub categories. On the other hand, however, category types from Document AI tasks are generally used to set up the sequential steps in the code base. These category types have been stored in the category_names attribute as a string type. All category types are currently an attribute of the AttrDict instance names. As the number of category types increases, this procedure means that the names cannot be maintained well. Furthermore, one is not able to group category types.
This weakness is eliminated with the introduction of special Enum types for groups of categories. In the future, an Enum member will be stored in the category_names attribute. This ensures that categories can also be controlled via Enum types. Enum members will also be used as keys of sub categories. Enums are defined as string Enums, so one can still call Enum members with their original names.
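A minimal sketch of the string Enum pattern described above (the member names are illustrative):

```python
from enum import Enum

class LayoutType(str, Enum):
    text = "text"
    table = "table"

# a string Enum member compares equal to its plain string value,
# so existing code passing raw strings keeps working
assert LayoutType.table == "table"
assert LayoutType("table") is LayoutType.table
sub_categories = {LayoutType.table: ...}   # Enum members as sub-category keys
```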
Unifying log messages
Log messages have been unified across all libraries, while keeping logs unchanged when they are devoted to training scripts - so that Tensorboard works correctly. Moreover, many assertion errors have been replaced with a more precise built-in error type.
Reducing number of extra install options
The number of extra install options has been reduced by two. The installation docs have been modified accordingly.
The concept of lazy modules has been added. Lazy modules allow deferring the import of a module until the moment it is used for the first time, which gives some speed gains.
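One common way to implement lazy modules in Python is a module-level __getattr__ (PEP 562); a sketch, not necessarily the exact mechanism used here:

```python
# package __init__.py: defer a heavy import until first attribute access
import importlib
from typing import Any

_LAZY_SUBMODULES = {"heavy": ".heavy"}   # hypothetical submodule

def __getattr__(name: str) -> Any:
    if name in _LAZY_SUBMODULES:
        module = importlib.import_module(_LAZY_SUBMODULES[name], __package__)
        globals()[name] = module         # cache so __getattr__ runs only once
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```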
Bugs
fixes: #53 with PR #54
v.0.16 LayoutLM, Evaluation over pipelines, TEDS metric, Doclaynet and re-design of Page class
Enhancements
Summary
Evaluator running over pipelines: #38
Adding Tree edit distance metric: #38
Adding LayoutLMv1 model: #44
Adding Doclaynet dataset: #45
New design of Page class: #47
Details
Evaluator running over pipelines:
When running evaluation for table recognition, the predictions depend on a chain of pipeline components for object detection and post-processing (cell/row/column matching and table refinement). The evaluator therefore needs to compare the ground truth of a dataset with the prediction of a whole pipeline.
- For comparing prediction and ground truth on a datapoint, the evaluator first has to make a copy of the ground truth and then needs to erase all interim results that will later be generated when running through the pipeline. In order to know what has to be erased, a new metadata scheme had to be established for each pipeline component, indicating what type of annotation (image annotation or category annotation) will be generated when passing the datapoint through the component.
- Moreover, additional functions had to be added to each metric, so that one can specify over which sub category/summary the evaluation is required.
Adding TEDS metric
Tree edit distance has been proposed for comparing HTML representations of tables in the realm of table recognition. It is possible to call this metric on a given category for every task that generates an XML representation. The code has mainly been taken from the PubTabNet repo.
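The underlying operation is a plain tree edit distance; a tiny sketch with the apted package on two bracket-notation trees (TEDS itself additionally normalizes the distance into a similarity score):

```python
# pip install apted
from apted import APTED
from apted.helpers import Tree

# simplified table structures in bracket notation
t1 = Tree.from_text("{table{tr{td}{td}}}")
t2 = Tree.from_text("{table{tr{td}}}")
print(APTED(t1, t2).compute_edit_distance())  # 1: one td node must be deleted
```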
Adding LayoutLMv1
The major addition of this release is support for training, evaluating and running LayoutLMv1 models in deepdoctection pipelines. Using separate pipeline components for sequence and token classification with support of LayoutLM massively extends the applicability of this repo. LayoutLMv1 is basically a BERT model that accepts multimodal features like tokens and bounding boxes and comes in different flavors. Higher LayoutLM versions with additional features will be added in later releases.
The model comes with a training script for fine-tuning, based on the custom trainer from the transformers library. A notebook showcasing the new functionality has been added.
Adding Doclaynet
Doclaynet is a new dataset for document layout analysis that contains around 80k manually labeled images of financial reports, patents and other documents. Compared to the automatically generated labels of other datasets like Publaynet, Doclaynet has a high variability in document layouts, which allows training models that can determine layouts for a large variety of documents.
New design of page class
The original Page class suffered from poor design choices, resulting in challenges when adding additional features to the output. It therefore had to be completely redesigned and simplified; it now has a modular approach that can easily be extended with new components.
v.0.15 patch: update setup.py
Patch to add long description
v.0.14 - Refactoring extension modules, optimizing typing and adding Detectron2 training scripts
Enhancements
Summary
Re-organizing extra dependencies: #35
Optimizing typing: #36
Training script for Detectron2 and new models: #37
Details
Re-organizing extra dependencies
Adding basic, full and all extra dependencies for TF, as well as full and all dependencies for PT. Compared to the old dependency setting, it is no longer compulsory in the basic setting to have pycocotools or lxml. The all dependencies include all packages that have a predictor wrapper.
Setup now has several installation options, depending on whether the package has been downloaded from PyPI or GitHub.
The test suite has been divided into test groups according to the additional package distributions.
CI for tests when merging into master has been added.
Optimizing typing
Static typing has been optimized to reduce the massive number of typing issues caused by incorrect type hints. Some additional types (e.g. Pathlike) have been added.
Training scripts for Detectron2 and new PyTorch models
Detectron2 is now on an equal footing with Tensorpack's models and is easily trainable on dd datasets. Training metrics show that this framework is superior to the Tensorpack implementation in terms of speed and accuracy. A training script with an API identical to Tensorpack's has been provided, which is based on D2's train_net script. Central to this script is a trainer derived from D2's DefaultTrainer with custom data loading methods; a skeleton follows below.
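A skeleton of that pattern (build_train_loader is Detectron2's hook; the body here is a placeholder, see the actual training script):

```python
from detectron2.engine import DefaultTrainer

class DDTrainer(DefaultTrainer):
    """Trainer skeleton with a custom data loading hook."""

    @classmethod
    def build_train_loader(cls, cfg):
        # plug a deepdoctection dataflow in here instead of Detectron2's
        # default dataset loading (placeholder for this sketch)
        raise NotImplementedError

# usage sketch:
#   cfg = detectron2.config.get_cfg(); cfg.merge_from_file("config.yaml")
#   DDTrainer(cfg).train()
```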
Training of the provided PyTorch models has been resumed for 20-50k iterations to overcome the poor accuracy at higher IoU thresholds.
v.0.13 - Language detection, merging datasets, unifying registries
Enhancements
Summary
Language detection: #33
Merging datasets: #34
Details
Language detection
Adding a predictor for language detection that accepts a string and predicts its language. As model, we use the large fasttext word embedding model that is also included in the model catalog. Determining the language is crucial when applying downstream NLP tasks.
Along with the language detector, a new pipeline component LanguageDetectionService has been implemented. The service can be used in two situations (see the sketch after this list):
- before the text extraction: An OCR predictor extracts a snippet from a region of the page and passes it to the language detector. The result can then be used to do a proper text extraction with an OCR model specialized in the inferred language.
- after the text extraction: If the text extraction does not really depend on a specific language (e.g. text extraction with a PDF miner), one can use the pipeline component to determine a more confident prediction.
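For illustration, calling fasttext's language identification directly (the local model file name is an assumption for this sketch; deepdoctection downloads the model via its model catalog):

```python
import fasttext

# lid.176.bin is fastText's public language identification model
model = fasttext.load_model("lid.176.bin")
labels, scores = model.predict("Das ist ein deutscher Beispielsatz.")
print(labels[0], float(scores[0]))  # e.g. __label__de 0.99
```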
Merging datasets
Adding a new class derived from DatasetBase to construct datasets as a union of pre-selected datapoints without touching the original datasets.
To train models on multiple datasets, MergeDataset accepts a number of datasets and builds metadata (e.g. categories) and a dataflow based on their inputs. Configuring the datasets (filtering, replacing categories with sub-categories) before creating the merge is allowed, as is configuring the dataflow of each individual dataset.
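A hedged sketch of the intended usage; the dataset names and method calls below are assumptions based on the description above, not verified signatures:

```python
import deepdoctection as dd

ds_1 = dd.get_dataset("publaynet")     # assumed registry call
ds_2 = dd.get_dataset("fintabnet")

merged = dd.MergeDataset(ds_1, ds_2)   # union without touching the originals
merged.buffer_datasets()               # assumed: materialize selected datapoints
df = merged.dataflow.build()           # dataflow over the merged datapoints
```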
Bugs
Getting started notebook fails: #26
When printing tables from page object, output does not show last row: #28
Some pipeline components do not have a clone method: #31
Improvements
Dataclass for model profile and new ModelCatalog
Models are now registered with a dataclass that allows saving the necessary metadata (URLs, hf repo id, etc.) and retrieving the information from the ModelCatalog.
Unifying registries
For metrics, datasets and pipeline components we now use the small library catalogue, which easily allows creating registries for these components. This is especially useful for registering custom objects in individual projects.
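The catalogue library makes such a registry a one-liner; a small sketch (the namespace and object names are made up):

```python
import catalogue

# one registry per object kind, namespaced under the package name
metrics = catalogue.create("deepdoctection", "metrics")

@metrics.register("my_custom_metric")
def my_custom_metric():
    ...

assert "my_custom_metric" in metrics.get_all()
```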
Silence some TF warnings
Some TF warnings (esp. for warnings appearing in TF >= 2.5) are now silenced.