Abstract
This paper proposes a method for localizing multi-oriented text in natural images that is suitable for real-time processing of high-definition video on portable and mobile devices. Our method is based on the connected components (CC) approach: first, CCs are isolated by convolving a multi-scale pyramid with a specifically designed linear spatial filter, followed by hysteresis thresholding. Next, non-textual CCs are pruned by a local classifier consisting of a cascade of multilayer perceptrons (MLPs) fed with increasingly extended feature vectors. The stroke width feature is estimated in linear time by computing the maximal inscribed squares in the CC. Candidate CCs and their neighbors are then checked using a more context-aware neural network classifier that takes into account each target CC and its vicinity. Finally, text sequences are extracted in all pyramid levels and fused using dynamic programming. The main contribution of the work presented here is execution speed: the CPU-only parallel implementation of the proposed method is capable of processing 1080p HD video at nearly 30 frames per second on a standard laptop. Furthermore, when benchmarked on the ICDAR 2013 Robust Reading and the ICDAR 2015 Incidental Scene Text data sets, our system runs more than twice as fast as the state of the art, while still delivering competitive results in terms of precision and recall.
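To illustrate the idea of estimating stroke thickness from inscribed squares, the following is a minimal sketch and not the authors' implementation. It assumes a binary mask of one CC and uses the classic dynamic-programming recurrence for the largest all-foreground square ending at each pixel, which visits every pixel once and is therefore linear in the number of pixels. The function names `inscribed_square_sides` and `max_inscribed_square`, and the choice of the largest side as a thickness proxy, are assumptions made for this example.

```python
from typing import List


def inscribed_square_sides(mask: List[List[int]]) -> List[List[int]]:
    """Return side[y][x]: the side length of the largest all-foreground
    square whose bottom-right corner is the pixel (y, x).

    Hypothetical helper for illustration; `mask` is a binary image of a
    single connected component (non-zero = foreground).
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    side = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue  # background pixels cannot close a square
            if y == 0 or x == 0:
                side[y][x] = 1  # border pixels only fit a 1x1 square
            else:
                # A square of side k ending here needs squares of side
                # at least k-1 ending above, to the left, and diagonally.
                side[y][x] = 1 + min(
                    side[y - 1][x], side[y][x - 1], side[y - 1][x - 1]
                )
    return side


def max_inscribed_square(mask: List[List[int]]) -> int:
    """Largest inscribed square side, used here as a crude (assumed)
    proxy for the thickest part of the stroke."""
    sides = inscribed_square_sides(mask)
    return max((v for row in sides for v in row), default=0)


# A horizontal bar, 3 pixels thick and 7 pixels long, padded with background.
bar = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
bar_thickness = max_inscribed_square(bar)
```

For the bar above, the largest inscribed square has the same side as the bar's thickness. How the per-pixel square sizes are aggregated into the stroke width feature described in the abstract is not specified here, so the maximum is only one possible choice.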
Notes
Physics constants have been removed.
2D vectors in the cross product extended to 3D by setting z to 0.
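As a small worked example of the second note, the sketch below embeds two 2D vectors in 3D with z = 0 and takes the ordinary cross product. Only the z component can be non-zero, and it equals a_x*b_y - a_y*b_x. The helper names `cross3` and `cross2d_as_3d` are hypothetical and chosen for this illustration.

```python
from typing import Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def cross3(a: Vec3, b: Vec3) -> Vec3:
    """Standard 3D cross product a x b."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def cross2d_as_3d(a: Vec2, b: Vec2) -> Vec3:
    """Extend 2D vectors to 3D by setting z to 0, then take the cross product."""
    return cross3((a[0], a[1], 0), (b[0], b[1], 0))


unit_cross = cross2d_as_3d((1, 0), (0, 1))
example_cross = cross2d_as_3d((2, 3), (4, 5))
```

The sign of the z component indicates the relative orientation of the two vectors: positive when b lies counterclockwise from a, negative when clockwise.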
Acknowledgements
Work supported by the Spanish government under Grant TIN2016-80250-R.
About this article
Cite this article
Gironés, X., Julià, C. Real-time localization of multi-oriented text in natural scene images using a linear spatial filter. J Real-Time Image Proc 17, 1505–1525 (2020). https://doi.org/10.1007/s11554-019-00911-9
DOI: https://doi.org/10.1007/s11554-019-00911-9