GitHub - jfma-USTC/InvoiceDatasets

dataset
Readme.txt
taxi_0001.jpg
vat_0001.jpg

Two public datasets of camera-captured invoice images for key word spotting

Until now, no public dataset of camera-captured invoice images has been available. To enable comparison among different text detection and word spotting algorithms, we collected two datasets containing taxi invoices and value-added tax (VAT) invoices from different provinces in China; both are now publicly available. The first is the taxi invoice dataset (TID for short), which covers 104 categories of key words and 140 categories of characters. Note that the key words of taxi invoices vary greatly between provinces; we collected samples from 25 different provinces. The second is the value-added tax invoice dataset (VATID), which covers 24 types of key words and 57 types of characters. For each dataset, we randomly select fifty percent of the images as the training set and assign the rest to the testing set.
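
The random fifty-percent split described above can be sketched as follows. This is only an illustration, not the script used to produce the released split: the function name `split_half`, the seed value, and the generated file names (modeled on the sample images taxi_0001.jpg and vat_0001.jpg) are all assumptions.

```python
import random


def split_half(image_names, seed=0):
    """Randomly assign half of the images to training and the rest to testing.

    Hypothetical helper: the seed is an assumption added for reproducibility.
    """
    names = sorted(image_names)   # sort first so the result does not depend on input order
    rng = random.Random(seed)     # local generator, leaves the global random state untouched
    rng.shuffle(names)
    half = len(names) // 2        # with an odd count, the testing set gets the extra image
    return names[:half], names[half:]


# Hypothetical file names following the pattern of the sample images.
images = [f"taxi_{i:04d}.jpg" for i in range(1, 11)]
train, test = split_half(images)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which matters when different algorithms are to be compared on the same training and testing sets.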