
Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language

  • Conference paper
  • Pattern Recognition (DAGM GCPR 2023)

Abstract

Converting a model’s internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, we determine the similarity between candidate tokens and the attended image regions by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g. attribution instead of attention maps) or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available here.
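To make the guided generation concrete, below is a minimal sketch of such a decoding loop. It is not the authors' released implementation: the functions llm_next_token_logits and vqa_image_text_match are hypothetical stand-ins (filled here with toy random scores) for the pre-trained LLM and for the VQA model's image-text matching head, and the mixing weight alpha is an illustrative choice.

# Minimal sketch (not the paper's implementation) of an LLM-guided,
# training-free decoding loop in the spirit of ZS-A2T. The two model
# calls below are toy stand-ins that return random scores; a real setup
# would plug in a pre-trained LLM and the VQA model's image-text
# matching head.

import zlib
import numpy as np

VOCAB = ["a", "man", "holding", "red", "ball", "on", "the", "grass", "."]

def llm_next_token_logits(context: str) -> np.ndarray:
    # Stand-in for the LLM's next-token scores over VOCAB, conditioned on
    # the task prompt, question, predicted answer and the text so far.
    rng = np.random.default_rng(zlib.crc32(context.encode()))
    return rng.normal(size=len(VOCAB))

def vqa_image_text_match(text: str, attended_regions) -> float:
    # Stand-in for the VQA model's image-text matching score between the
    # partial translation and the attention-highlighted image regions.
    rng = np.random.default_rng(zlib.crc32(text.encode()) + 1)
    return float(rng.uniform())

def attention_to_text(prompt: str, attended_regions, max_len: int = 12,
                      top_k: int = 4, alpha: float = 0.5) -> str:
    # Greedy guided decoding: at every step the LLM proposes its top-k
    # tokens, and the VQA matching score re-ranks them so that the
    # generated text stays grounded in the attended image regions.
    text = ""
    for _ in range(max_len):
        logits = llm_next_token_logits(prompt + " " + text)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        candidates = np.argsort(probs)[-top_k:]
        best_token, best_score = None, -np.inf
        for idx in candidates:
            candidate_text = (text + " " + VOCAB[idx]).strip()
            score = (alpha * probs[idx]
                     + (1 - alpha) * vqa_image_text_match(candidate_text, attended_regions))
            if score > best_score:
                best_token, best_score = VOCAB[idx], score
        text = (text + " " + best_token).strip()
        if best_token == ".":
            break
    return text

if __name__ == "__main__":
    prompt = ("Question: What is the man holding? Answer: a ball. "
              "Describe what the model looked at:")
    print(attention_to_text(prompt, attended_regions=None))

A real setup would replace the two stand-ins with actual model calls and supply the attention-highlighted image regions (or attribution maps) as the guiding source.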

Notes

  1. Licensed under the BSD-2 license.

  2. Licensed under the MIT license.

Acknowledgements

The authors thank IMPRS-IS for supporting Leonard Salewski. This work was partially funded by the Max Planck Society, the BMBF Tübingen AI Center (FKZ: 01IS18039A), DFG (EXC number 2064/1 - Project number 390727645), ERC (853489-DEXIM), and DFG-CRC 1233 (Project number 276693517).

Author information

Corresponding author

Correspondence to Leonard Salewski.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 193 KB)

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Salewski, L., Koepke, A.S., Lensch, H.P.A., Akata, Z. (2024). Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_25

  • DOI: https://doi.org/10.1007/978-3-031-54605-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-54604-4

  • Online ISBN: 978-3-031-54605-1

  • eBook Packages: Computer Science, Computer Science (R0)
