The Indian scene text detection model is developed as part of the work towards Indian Signboard Translation Project by AI4Bharat. I worked on this project under the mentorship of Mitesh Khapra and Pratyush Kumar from IIT Madras.
Indian Signboard Translation involves 4 modular tasks:

- T1: Detection: Detecting bounding boxes containing text in the images
- T2: Classification: Classifying the language of the text in the bounding box identified by T1
- T3: Recognition: Getting the text from the crop detected by T1, using the recognition model for the language classified by T2
- T4: Translation: Translating the text from T3 from one Indian language to another Indian language

Note: T2: Classification is not updated in the above picture
The Indian Scene Text Detection Dataset (D1-Big + D1-English) is used for training and evaluating the detection model. Text boxes are represented as axis-aligned bounding boxes.
The score map for an image marks the region within the shrunk bounding box. The geometry map at a point inside a bounding box holds the distances from that point to the left, top, right, and bottom box boundaries respectively.
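The construction of these targets can be sketched as follows. This is a minimal NumPy illustration for axis-aligned boxes, not the project's actual data-loading code; the shrink ratio of 0.3 follows the EAST paper and may differ from the value used here.

```python
import numpy as np

def build_maps(boxes, h, w, shrink=0.3):
    """Build per-pixel score and geometry targets for axis-aligned boxes.

    boxes: iterable of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns a score map of shape (h, w) and a geometry map of shape
    (h, w, 4) holding distances to the left, top, right, bottom edges.
    """
    score = np.zeros((h, w), dtype=np.float32)
    geo = np.zeros((h, w, 4), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        # Shrink the box so only its core region counts as positive.
        dx = shrink * (x_max - x_min)
        dy = shrink * (y_max - y_min)
        sx0, sx1 = int(x_min + dx), int(np.ceil(x_max - dx))
        sy0, sy1 = int(y_min + dy), int(np.ceil(y_max - dy))
        ys, xs = np.mgrid[sy0:sy1, sx0:sx1]
        score[sy0:sy1, sx0:sx1] = 1.0
        geo[sy0:sy1, sx0:sx1, 0] = xs - x_min   # distance to left edge
        geo[sy0:sy1, sx0:sx1, 1] = ys - y_min   # distance to top edge
        geo[sy0:sy1, sx0:sx1, 2] = x_max - xs   # distance to right edge
        geo[sy0:sy1, sx0:sx1, 3] = y_max - ys   # distance to bottom edge
    return score, geo
```

At inference time the process is inverted: each positive pixel plus its four distances reconstructs one candidate box.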
The fully convolutional neural network proposed in the paper "EAST: An Efficient and Accurate Scene Text Detector" is used to predict word instance regions and their geometries. Two variants of the model were experimented with:

M1: Pretrained VGG-16 as the feature extractor. It produces output at dimensions reduced by a factor of 4.
- Input Image Shape: [320, 320, 3]
- Output Score Map Shape: [80, 80, 1]
- Output Geometry Map Shape: [80, 80, 4]

M2: U-Net for feature extraction and merging. It produces per-pixel predictions of text regions and geometries.
- Input Image Shape: [320, 320, 3]
- Output Score Map Shape: [320, 320, 1]
- Output Geometry Map Shape: [320, 320, 4]
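The shared output head of both variants can be sketched in PyTorch as below. This is a hypothetical minimal head, not the architecture in model.py; the channel count and the distance upper bound (the 320-pixel input size) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """EAST-style output head: a 1-channel score map squashed by a
    sigmoid, and a 4-channel geometry map scaled to pixel distances."""

    def __init__(self, in_channels=32, max_dist=320.0):
        super().__init__()
        self.score_conv = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.geo_conv = nn.Conv2d(in_channels, 4, kernel_size=1)
        self.max_dist = max_dist  # upper bound on boundary distances

    def forward(self, features):
        score = torch.sigmoid(self.score_conv(features))             # (N, 1, H, W)
        geo = torch.sigmoid(self.geo_conv(features)) * self.max_dist  # (N, 4, H, W)
        return score, geo

# For M2 the merged feature map is full resolution, so H = W = 320:
head = DetectionHead()
score, geo = head(torch.randn(1, 32, 320, 320))
```

For M1 the same head would sit on the VGG-16 features at 80 x 80 resolution instead.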
Non-Maximal Suppression (NMS) is performed to remove overlapping bounding boxes, with a maximum permitted IoU threshold of 0.1.
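The NMS step for axis-aligned boxes can be sketched as a simple greedy loop. This is an illustrative implementation, not the project's actual code:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.1):
    """Greedy NMS: keep boxes in decreasing score order, dropping any
    box that overlaps an already-kept box beyond the IoU threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The low threshold of 0.1 is aggressive: any pair of predictions sharing more than 10% overlap collapses to the higher-scoring one.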
For detailed model architecture, check the file model.py
Sample Input-Output
M1 and M2 converged to similar score and geometry losses after training for the same number of epochs. As M1 is significantly more efficient in memory and computation, it is selected over M2. The detection model is trained for 30 epochs. The model weights are saved every 3 epochs and can be found in the Models directory.
The final hyperparameters can be accessed in config.yaml
The lowest validation loss is observed at epoch 12. Hence, the model Models/EAST-Detector-e12.pth is used to evaluate the detection performance. In the NMS stage, the minimum score threshold is set to 0.85 and the maximum permitted IoU threshold to 0.2.
The minimum IoU threshold for a predicted bounding box to be counted as correct is set to 0.70.
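The precision/recall/F1 computation under this matching rule can be sketched as greedy one-to-one IoU matching. This is an assumed formulation for illustration; the repository's evaluation notebook may match boxes differently:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def evaluate(pred_boxes, gt_boxes, iou_threshold=0.70):
    """Match each prediction to at most one unmatched ground-truth box
    with IoU >= threshold, then derive precision, recall, and F1."""
    matched, tp = set(), 0
    for p in pred_boxes:
        for k, g in enumerate(gt_boxes):
            if k not in matched and iou(p, g) >= iou_threshold:
                matched.add(k)
                tp += 1
                break
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

A prediction counts as a true positive only if it covers a ground-truth box with IoU of at least 0.70; unmatched predictions lower precision, unmatched ground-truth boxes lower recall.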
| Dataset  | Precision | Recall   | F1-Score |
|----------|-----------|----------|----------|
| Trainset | 0.311847  | 0.360114 | 0.426558 |
| Valset   | 0.331797  | 0.384548 | 0.446315 |
| Testset  | 0.267891  | 0.343183 | 0.343183 |
Sample Detections:
- Model: model.py
- Merging English Data: 0-Merge-English-Data.ipynb
- Training: 1-Indian-Scene-Text-Detection-Training
- Training Visualisation: 2-MLFlow-Training-Visualisation
- Prediction: 3-Indian-Scene-Text-Detection-Prediction
- Evaluation: 4-Indian-Scene-Text-Detection-Evaluation
- Indian Signboard Translation Project
- Indian Scene Text Dataset
- Indian Scene Text Detection
- Indian Scene Text Classification
- Indian Scene Text Recognition
- EAST: An Efficient and Accurate Scene Text Detector (CVPR 2017): https://openaccess.thecvf.com/content_cvpr_2017/papers/Zhou_EAST_An_Efficient_CVPR_2017_paper.pdf
- U-Net: Convolutional Networks for Biomedical Image Segmentation: https://arxiv.org/pdf/1505.04597.pdf
- https://github.com/liushuchun/EAST.pytorch
- https://github.com/GokulKarthik/EAST.pytorch
- PyImageSearch EAST text detection tutorial: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/