Abstract
Text-VQA refers to the class of problems that require reasoning about the text present in an image in order to answer questions about the image content. Previous work on text-VQA has largely followed a common strategy of feeding the various input modalities (OCR tokens, objects, question) into an attention-based learning framework. Such approaches treat OCR tokens as independent entities, ignoring the fact that these tokens are often correlated within an image and together represent a larger, 'meaningful' entity. The entity represented by a group of OCR tokens can primarily be discerned from the layout of the text in the image, together with the broader context in which it appears. In this work, we cluster OCR tokens using a novel spatially-aware and knowledge-enabled clustering technique that leverages an external knowledge graph to improve answer-prediction accuracy on the text-VQA problem. The proposed algorithm is generic enough to be applied to any multi-modal transformer architecture used for text-VQA training. We demonstrate the objective and subjective effectiveness of the approach by improving the performance of the M4C model on text-VQA datasets.
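The spatial part of the grouping idea can be illustrated with a minimal sketch: OCR tokens whose bounding boxes lie close together are merged into one candidate entity. The distance threshold `eps`, the union-find grouping, and the `cluster_tokens` helper below are illustrative assumptions, not the authors' exact algorithm, which additionally consults an external knowledge graph when forming clusters.

```python
# Hypothetical sketch: group OCR tokens by spatial proximity of their
# axis-aligned bounding boxes (x0, y0, x1, y1). This captures only the
# layout signal; the paper's method also uses external knowledge.

def center(box):
    """Return the (x, y) center of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def cluster_tokens(tokens, eps=60.0):
    """Group OCR tokens whose box centers lie within `eps` pixels.

    tokens: list of (text, (x0, y0, x1, y1)) pairs.
    Returns a list of clusters, each a list of token strings.
    """
    n = len(tokens)
    parent = list(range(n))  # union-find forest over token indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    centers = [center(box) for _, box in tokens]
    for i in range(n):
        for j in range(i + 1, n):
            dx = centers[i][0] - centers[j][0]
            dy = centers[i][1] - centers[j][1]
            if (dx * dx + dy * dy) ** 0.5 <= eps:
                union(i, j)  # merge spatially adjacent tokens

    groups = {}
    for i, (text, _) in enumerate(tokens):
        groups.setdefault(find(i), []).append(text)
    return list(groups.values())

# Toy example: two adjacent tokens form one entity, a distant token does not.
tokens = [
    ("COFFEE", (10, 10, 70, 30)),
    ("SHOP",   (75, 10, 120, 30)),
    ("EXIT",   (400, 300, 440, 320)),
]
print(cluster_tokens(tokens))  # → [['COFFEE', 'SHOP'], ['EXIT']]
```

In practice a density-based method such as DBSCAN (cited by the paper) plays a similar role, with the knowledge graph then validating whether a spatial cluster forms a coherent named entity.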
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Mishra, S.K., Joshi, S., Gopalakrishnan, V. (2023). Re-Thinking Text Clustering for Images with Text. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41678-1
Online ISBN: 978-3-031-41679-8