DOI: 10.1145/3460426.3463623
research-article

NASTER: Non-local Attentional Scene Text Recognizer

Published: 01 September 2021

Abstract

Scene text recognition has been widely investigated in computer vision. In the literature, the encoder-decoder framework, which first encodes an image into a feature map and then decodes that map into the corresponding text sequence, has achieved great success. However, this solution fails on low-quality images, because the local visual features extracted from curved or blurred images are difficult to decode into the corresponding text. To address this issue, we propose a new framework for Scene Text Recognition (STR), named Non-Local Attentional Scene Text Recognizer (NASTER). We use a ResNet with Global Context Blocks (GC blocks) to extract global visual features. The global context information is then captured in parallel by a self-attention module and finally decoded by a multi-layer attention decoder with an intermediate supervision module. The proposed method achieves state-of-the-art performance on seven benchmark datasets, demonstrating the effectiveness of our approach.
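
As an illustration of the global-context idea mentioned in the abstract, the following is a minimal pure-Python sketch of a GC-block-style operation: attention-weighted pooling over all spatial positions, followed by adding the pooled context vector back to every position. It is not the authors' implementation. The function names, the single linear `transform` (standing in for the bottleneck transform of a full GC block), and the toy inputs are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_context_pool(features, key_weights):
    """Pool N position vectors into one context vector.

    features: list of N vectors, each of length C (flattened feature map).
    key_weights: length-C vector; a hypothetical stand-in for the 1x1
    convolution that scores each position.
    """
    scores = [sum(w * x for w, x in zip(key_weights, feat)) for feat in features]
    attn = softmax(scores)  # one weight per position, summing to 1
    channels = len(features[0])
    context = [
        sum(a * feat[c] for a, feat in zip(attn, features))
        for c in range(channels)
    ]
    return attn, context


def gc_block(features, key_weights, transform):
    """Add the transformed global context to every position.

    transform: C x C matrix; an assumed simplification of the bottleneck
    transform used in a full GC block.
    """
    _, context = global_context_pool(features, key_weights)
    delta = [sum(t * c for t, c in zip(row, context)) for row in transform]
    return [[x + d for x, d in zip(feat, delta)] for feat in features]


# Toy usage: two positions, two channels, uniform attention, identity transform.
feats = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = gc_block(feats, [0.0, 0.0], identity)
print(out)  # [[3.0, 5.0], [5.0, 7.0]]
```

With zero key weights every position receives the same attention weight, so the context is simply the mean feature vector; learned key weights would instead emphasize informative positions, which is what lets each location see information from the whole image.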

Published In

ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
August 2021
715 pages
ISBN: 9781450384636
DOI: 10.1145/3460426

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. OCR
  2. iterative decoding
  3. non-local attention
  4. scene text recognizer

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
