Abstract
Converting a model’s internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, this guidance is obtained by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g., attribution instead of attention maps) or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available here.
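To make the guided-decoding idea concrete, the following minimal sketch shows one way such training-free guidance could be wired up: at each step the LLM proposes candidate next tokens, and the candidates are re-ranked by a visual-grounding score before one is accepted. This is an illustration under assumptions, not the authors' implementation: `itm_score` is a hypothetical callable standing in for the text-image matching score that ZS-A2T derives from the underlying VQA model and its attention over the image, and `gpt2` stands in for the larger pre-trained LLM used in the paper.

```python
# Minimal sketch of training-free, visually guided decoding (assumptions noted above).
from typing import Callable

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def guided_decode(
    prompt: str,
    itm_score: Callable[[str], float],  # hypothetical: image-text matching score for a candidate text
    model_name: str = "gpt2",           # stand-in for the paper's larger pre-trained LLM
    top_k: int = 10,                    # number of candidate tokens proposed by the LLM per step
    alpha: float = 1.0,                 # weight of visual guidance relative to LM likelihood
    max_new_tokens: int = 30,
) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt_ids = tokenizer(prompt).input_ids
    generated: list[int] = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([prompt_ids + generated])).logits[0, -1]
        log_probs = torch.log_softmax(logits, dim=-1)
        top = torch.topk(log_probs, top_k)  # the LLM proposes k candidate next tokens

        # Re-rank candidates by how well the continuation matches the attended image content.
        best_id, best_score = None, float("-inf")
        for token_id, lm_lp in zip(top.indices.tolist(), top.values.tolist()):
            candidate_text = tokenizer.decode(generated + [token_id], skip_special_tokens=True)
            score = lm_lp + alpha * itm_score(candidate_text)  # fluency + visual grounding
            if score > best_score:
                best_id, best_score = token_id, score

        generated.append(best_id)
        if best_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated, skip_special_tokens=True)
```

In the actual framework the guidance signal comes from the VQA model itself (e.g., its attention or attribution maps together with its text-image matching head); here that signal is abstracted into the user-supplied `itm_score` callable so that guiding sources and language models can be swapped freely.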
Notes
1. Licensed under the BSD-2 license.
2. Licensed under the MIT license.
Acknowledgements
The authors thank IMPRS-IS for supporting Leonard Salewski. This work was partially funded by the Max Planck Society, the BMBF Tübingen AI Center (FKZ: 01IS18039A), DFG (EXC number 2064/1 - Project number 390727645), ERC (853489-DEXIM), and DFG-CRC 1233 (Project number 276693517).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Salewski, L., Koepke, A.S., Lensch, H.P.A., Akata, Z. (2024). Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_25
DOI: https://doi.org/10.1007/978-3-031-54605-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54604-4
Online ISBN: 978-3-031-54605-1
eBook Packages: Computer Science, Computer Science (R0)