
Re-Thinking Text Clustering for Images with Text

  • Conference paper
  • First Online:
  • In: Document Analysis and Recognition - ICDAR 2023 (ICDAR 2023)

Abstract

Text-VQA refers to the set of problems that reason about the text present in an image to answer specific questions about the image content. Previous work in text-VQA has largely followed the common strategy of feeding various input modalities (OCR tokens, objects, question) to an attention-based learning framework. Such approaches treat the OCR tokens as independent entities and ignore the fact that these tokens are often correlated in an image, together representing a larger ‘meaningful’ entity. The ‘meaningful’ entity potentially represented by a group of OCR tokens can be discerned primarily from the layout of the text in the image along with the broader context in which it appears. In the proposed work, we aim to cluster the OCR tokens using a novel spatially-aware and knowledge-enabled clustering technique that uses an external knowledge graph to improve the answer-prediction accuracy of the text-VQA problem. Our proposed algorithm is generic enough to be applied to any multi-modal transformer architecture used for text-VQA training. We demonstrate the objective and subjective effectiveness of the proposed approach by improving the performance of the M4C model on the Text-VQA datasets.
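To make the clustering step concrete, the sketch below groups OCR tokens purely by spatial proximity. It is an illustration under stated assumptions, not the authors' published algorithm: the paper's method is additionally knowledge-enabled (it consults an external knowledge graph), which is not modeled here, and the input format, the function name cluster_ocr_tokens, and the choice of plain DBSCAN over normalized box centroids are all assumptions of this sketch.

    # Illustrative sketch only; not the paper's exact algorithm.
    # Assumption: OCR tokens arrive as dicts {"text": str, "box": (x1, y1, x2, y2)}.
    # The paper's knowledge-graph signal is deliberately omitted here.
    import numpy as np
    from sklearn.cluster import DBSCAN

    def cluster_ocr_tokens(tokens, image_w, image_h, eps=0.05, min_samples=1):
        """Group OCR tokens by spatial proximity; returns {cluster_id: [texts]}."""
        # Normalized box centroids, so that eps is resolution-independent.
        centroids = np.array([
            [((t["box"][0] + t["box"][2]) / 2.0) / image_w,
             ((t["box"][1] + t["box"][3]) / 2.0) / image_h]
            for t in tokens
        ])
        # min_samples=1 means every token joins some cluster (no noise label).
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(centroids)
        clusters = {}
        for token, label in zip(tokens, labels):
            clusters.setdefault(int(label), []).append(token["text"])
        return clusters

Tokens whose normalized centroids lie within eps of one another end up in the same group, approximating the intuition that nearby OCR tokens often form one larger ‘meaningful’ entity before being fed to a multi-modal transformer such as M4C.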

Author information


Corresponding author

Correspondence to Shwet Kamal Mishra.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Mishra, S.K., Joshi, S., Gopalakrishnan, V. (2023). Re-Thinking Text Clustering for Images with Text. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14188. Springer, Cham. https://doi.org/10.1007/978-3-031-41679-8_16

  • DOI: https://doi.org/10.1007/978-3-031-41679-8_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-41678-1

  • Online ISBN: 978-3-031-41679-8

  • eBook Packages: Computer Science; Computer Science (R0)
