Abstract
Converting a model’s internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). ZS-A2T builds on a pre-trained large language model (LLM), which receives a task prompt, question, and predicted answer as inputs. The LLM is guided to select tokens that describe the regions in the input image that the VQA model attended to. Crucially, this guidance is obtained by exploiting the text-image matching capabilities of the underlying VQA model. Our framework does not require any training and allows the drop-in replacement of different guiding sources (e.g., attribution instead of attention maps) or language models. We evaluate this novel task on textual explanation datasets for VQA, achieving state-of-the-art performance in the zero-shot setting on GQA-REX and VQA-X. Our code is available here.
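To make the guided-decoding idea concrete, the following minimal sketch shows one way such training-free guidance could be wired up: at each step the LLM proposes candidate next tokens, and the candidates are re-ranked by a visual-grounding score before one is accepted. This is an illustration under assumptions, not the authors' implementation: `itm_score` is a hypothetical callable standing in for the text-image matching score that ZS-A2T derives from the underlying VQA model and its attention over the image, and `gpt2` stands in for the larger pre-trained LLM used in the paper.

```python
# Minimal sketch of training-free, visually guided decoding (assumptions noted above).
from typing import Callable

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def guided_decode(
    prompt: str,
    itm_score: Callable[[str], float],  # hypothetical: image-text matching score for a candidate text
    model_name: str = "gpt2",           # stand-in for the paper's larger pre-trained LLM
    top_k: int = 10,                    # number of candidate tokens proposed by the LLM per step
    alpha: float = 1.0,                 # weight of visual guidance relative to LM likelihood
    max_new_tokens: int = 30,
) -> str:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    prompt_ids = tokenizer(prompt).input_ids
    generated: list[int] = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(torch.tensor([prompt_ids + generated])).logits[0, -1]
        log_probs = torch.log_softmax(logits, dim=-1)
        top = torch.topk(log_probs, top_k)  # the LLM proposes k candidate next tokens

        # Re-rank candidates by how well the continuation matches the attended image content.
        best_id, best_score = None, float("-inf")
        for token_id, lm_lp in zip(top.indices.tolist(), top.values.tolist()):
            candidate_text = tokenizer.decode(generated + [token_id], skip_special_tokens=True)
            score = lm_lp + alpha * itm_score(candidate_text)  # fluency + visual grounding
            if score > best_score:
                best_id, best_score = token_id, score

        generated.append(best_id)
        if best_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated, skip_special_tokens=True)
```

In the actual framework the guidance signal comes from the VQA model itself (e.g., its attention or attribution maps together with its text-image matching head); here that signal is abstracted into the user-supplied `itm_score` callable so that guiding sources and language models can be swapped freely.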
Notes
1. Licensed under the BSD-2 license.
2. Licensed under the MIT license.
Acknowledgements
The authors thank IMPRS-IS for supporting Leonard Salewski. This work was partially funded by the Max Planck Society, the BMBF Tübingen AI Center (FKZ: 01IS18039A), DFG (EXC number 2064/1 - Project number 390727645), ERC (853489-DEXIM), and DFG-CRC 1233 (Project number 276693517).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Salewski, L., Koepke, A.S., Lensch, H.P.A., Akata, Z. (2024). Zero-Shot Translation of Attention Patterns in VQA Models to Natural Language. In: Köthe, U., Rother, C. (eds) Pattern Recognition. DAGM GCPR 2023. Lecture Notes in Computer Science, vol 14264. Springer, Cham. https://doi.org/10.1007/978-3-031-54605-1_25
DOI: https://doi.org/10.1007/978-3-031-54605-1_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54604-4
Online ISBN: 978-3-031-54605-1
eBook Packages: Computer Science, Computer Science (R0)