
NDOrder: Exploring a novel decoding order for scene text recognition

Published: 17 July 2024

Abstract

Text recognition in scene images is still considered a challenging task by the computer vision and pattern recognition community. For text images affected by multiple adverse factors, such as occlusion (due to obstacles) and poor quality (due to blur and low resolution), the performance of state-of-the-art scene text recognition methods degrades. The key reason is that the existing encoder–decoder framework follows a fixed left-to-right decoding order, which provides insufficient contextual information. In this paper, we present a novel decoding order in which good-quality characters are decoded first, followed by low-quality characters, preserving contextual information despite the aforementioned difficult scenarios. Our method, named NDOrder, extracts visual features with a ViT encoder and then decodes with the Random Order Generation (ROG) module, which learns to decode with random decoding orders, and the Vision-Content-Position (VCP) module, which exploits the connections among visual information, content and position. In addition, a new dataset named OLQT (Occluded and Low-Quality Text) is created by manually collecting text images that suffer from occlusion or low quality from several standard text recognition datasets. The dataset is available at https://github.com/djzhong1/OLQT. Experiments on OLQT and public scene text recognition benchmarks show that the proposed method achieves state-of-the-art performance.
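The paper's code is not reproduced on this page; purely as a hypothetical illustration of what decoding under a random (rather than left-to-right) order involves, the sketch below samples a decoding order and derives the attention mask it induces, so that each character position may attend only to the positions decoded before it. The function and variable names are placeholders, not the authors' ROG module.

```python
import torch

def sample_decoding_order(seq_len: int) -> torch.Tensor:
    """Sample a random order in which the character positions are decoded."""
    return torch.randperm(seq_len)

def order_to_attention_mask(order: torch.Tensor) -> torch.Tensor:
    """Boolean mask where mask[i, j] is True iff text position i may attend
    to text position j, i.e. j is decoded strictly before i under `order`."""
    step_of = torch.empty_like(order)
    step_of[order] = torch.arange(order.numel())  # text position -> decoding step
    return step_of.unsqueeze(1) > step_of.unsqueeze(0)

# Toy usage on a 4-character word: decode position 2 first, then 0, 3, 1.
order = torch.tensor([2, 0, 3, 1])
mask = order_to_attention_mask(order)
print(mask.int())
# A real decoder would translate this into its own masking convention
# (PyTorch attention masks, for instance, mark blocked positions as True).
```

Training over many such sampled orders is what would let a decoder condition on whichever characters are easiest to read first; the conventional left-to-right order is simply the special case order = [0, 1, 2, ...].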

Highlights

NDOrder can recognize text images with occluded and low-quality characters.
The ROG module enables the method to work with a novel decoding order.
The VCP module constructs a robust connection among image, content and position (a generic sketch follows this list).
We contribute a new scene text dataset, OLQT, with 245 text instances.
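The highlights, like the abstract, only name the VCP module; its actual design is not given here. Purely as a generic illustration of one way to couple vision, content and position, the sketch below lets learned position queries attend to ViT patch features and fuses the result with content (character) embeddings; every class name, dimension and shape is an assumption rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class VisionContentPositionBlock(nn.Module):
    """Illustrative block (not the paper's VCP module): position queries attend
    to visual features, and the result is fused with content embeddings."""

    def __init__(self, d_model: int = 256, num_heads: int = 8,
                 vocab_size: int = 97, max_len: int = 25):
        super().__init__()
        self.pos_emb = nn.Parameter(torch.randn(max_len, d_model) * 0.02)  # position queries
        self.content_emb = nn.Embedding(vocab_size, d_model)               # character content
        self.vis_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats: torch.Tensor, prev_chars: torch.Tensor) -> torch.Tensor:
        # vis_feats:  (B, N_patches, d_model) from a ViT encoder
        # prev_chars: (B, L) indices of already-decoded characters (padded elsewhere)
        B, L = prev_chars.shape
        pos_q = self.pos_emb[:L].unsqueeze(0).expand(B, -1, -1)   # position -> query
        vis_ctx, _ = self.vis_attn(pos_q, vis_feats, vis_feats)   # position attends to vision
        content = self.content_emb(prev_chars)                    # content embedding
        fused = self.fuse(torch.cat([vis_ctx, content], dim=-1))  # vision + content + position
        return self.classifier(fused)                             # per-position character logits

# Toy forward pass with random tensors.
block = VisionContentPositionBlock()
logits = block(torch.randn(2, 196, 256), torch.randint(0, 97, (2, 25)))
print(logits.shape)  # torch.Size([2, 25, 97])
```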



Published In

Expert Systems with Applications: An International Journal, Volume 249, Issue PC, Sep 2024, 1587 pages

Publisher

Pergamon Press, Inc.

United States


Author Tags

  1. Scene text recognition
  2. Transformer
  3. Decoding order optimization
  4. Random order generation
  5. Contextual information

Qualifiers

  • Research-article
