Annotated DocSet

Annotated DocSet is a synthetic dataset with instance and semantic segmentation annotations. It builds mixed-media documents from two source datasets, PubLayNet and IAMonDo. The annotations are generated automatically by matching during construction of the new documents and are saved to HDF5 files. More details are available in our paper "Classification of handwritten annotations in mixed-media documents" (CRV 2022).

Getting Data

This synthetic dataset is constructed from two external datasets, which must be downloaded from their authors' websites.

PubLayNet: Download the dataset here and extract to datasets. Note: We used the validation set only due to the limited size of the IAMonDo dataset.

Expected Folder Structure:

datasets/
-- publaynet/
   -- val/
      -- *.PNG

The labels for the val split can be found in val.json. The labels are in the MS COCO format.
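Because the labels follow the MS COCO layout (separate images, annotations, and categories lists linked by ids), they can be read with the standard json module. The following is a minimal sketch, not code from this repository: the helper names `load_coco` and `annotations_by_file` are hypothetical, and the path in the usage note is an assumed location for val.json.

```python
import json
from collections import defaultdict


def load_coco(path):
    """Read a COCO-style JSON label file from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def annotations_by_file(coco):
    """Group COCO annotations by the file name of the image they belong to."""
    # COCO keeps images and annotations in separate lists linked by image id.
    id_to_file = {img["id"]: img["file_name"] for img in coco["images"]}
    id_to_category = {cat["id"]: cat["name"] for cat in coco["categories"]}

    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[id_to_file[ann["image_id"]]].append(
            {"category": id_to_category[ann["category_id"]], "bbox": ann["bbox"]}
        )
    return dict(grouped)
```

Assuming val.json sits inside the extracted publaynet folder, `annotations_by_file(load_coco("datasets/publaynet/val.json"))` would return a dictionary mapping each image file name to its list of labelled boxes.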

IAMonDo: Download and extract the dataset here to datasets

Expected Folder Structure:

datasets/
-- IAMonDo-db-1.0/
   -- *.inkml
   -- *.set
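Before constructing the dataset, it can help to confirm that both source datasets were extracted to the expected folders. The sketch below is an illustration only: `check_layout` is a hypothetical helper, and it assumes that val/ is nested inside publaynet/ and that the IAMonDo files sit directly inside IAMonDo-db-1.0/.

```python
from pathlib import Path


def check_layout(root="datasets"):
    """Count the source files found under the expected dataset folders."""
    root = Path(root)
    publaynet_val = root / "publaynet" / "val"  # assumed nesting
    iamondo = root / "IAMonDo-db-1.0"

    def count(folder, suffix):
        # Compare suffixes case-insensitively so *.PNG and *.png both match.
        if not folder.is_dir():
            return 0
        return sum(1 for p in folder.iterdir() if p.suffix.lower() == suffix)

    return {
        "publaynet_images": count(publaynet_val, ".png"),
        "iamondo_inkml": count(iamondo, ".inkml"),
        "iamondo_set": count(iamondo, ".set"),
    }
```

A count of zero for any entry suggests that the corresponding archive was extracted to a different location.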

Construct the dataset

Requirements:

Python 3.7+
OpenCV 4.5+

Install the required Python packages

pip install -r requirements.txt

Convert the IAMonDo dataset from INKML format to PNG images

python parse_inkml.py [--iamondo PATH/TO/IAMonDo-db-1.0] [--output-dir PATH/WHERE/TO/SAVE]

Defaults:
python parse_inkml.py --iamondo datasets/IAMonDo-db-1.0 --output-dir datasets/IAMonDo-Images

This will create a folder structure:
path/to/output/
-- categories/
   -- IAMonDo Classes/
      -- *.png
-- labelled/
   -- *.png
-- original/
   -- *.png
-- colormap.[png,json]
-- inkml.json

Construct the Annotated DocSet. This will create a folder annotated_docset-{tag}, whose images sub-folder contains the created dataset files.

python create_dataset.py --tag v1.0

Annotation Format

The ground truth is stored as NumPy arrays inside HDF5 files.

instance_segmentations.hdf5: Segmentation mask for each instance
dense_segmentations.hdf5: Single-channel dense label masks and experimental multi-channel dense labels
multilabel_segmentations.hdf5: one-hot encoded masks
stroke_segmentations.hdf5: Handwritten classes ground-truth with only stroke masks
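The internal group and dataset names of these files are not described here, so a reasonable first step is to list what a file contains. The sketch below assumes the h5py package is installed; `describe_hdf5` is a hypothetical helper, not part of this repository, and it makes no assumption about how the arrays are organized.

```python
import h5py


def describe_hdf5(path):
    """Return (name, shape, dtype) for every dataset stored in an HDF5 file."""
    entries = []

    def visit(name, obj):
        # Groups are only containers; record the array-like datasets.
        if isinstance(obj, h5py.Dataset):
            entries.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return entries
```

Calling `describe_hdf5` on one of the generated files, for example dense_segmentations.hdf5, returns the dataset names together with their array shapes and dtypes. A single array can then be loaded into NumPy with `h5py.File(path, "r")[name][()]`.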

Visualize

A basic visualization script has been provided to view the contents of the H5 ground-truth files.

python visualize.py -d annotated_docset-v1.0 [-s val|test|train] [-i docset image]

Cite us

@inproceedings{dash2022classification,
  title={Classification of handwritten annotations in mixed-media documents},
  author={Dash, Amanda and Branzan Albu, Alexandra },
  booktitle={19th Conference on Robots and Vision 2022 (CRV)},
  year={2022},
  volume={},
  number={},
  pages={},
  doi={},
  ISSN={},
  month={May.},
  organization={IEEE}
}
