Abstract
Recently, Vision-Language Models (VLMs) trained on large-scale noisy data have shown strong generalization abilities on many downstream tasks. In this paper, we introduce a new technique for uncovering unexpected inconsistencies in VLMs, which leads to the formulation of new research questions on how to improve VLMs. Specifically, we propose that performance on original texts should be compared with performance on ‘mutant texts’, carefully designed variants of the original texts. In contrast to the text perturbations used to study robustness, ‘mutant texts’ represent large changes to the original texts that impact their semantics. We present two types of example mutant texts: one-word-only (OWO) mutants, which replace the original text with a single word it contains, and plus-one-word (POW) mutants, which add a word to the original text. Mutant texts allow us to discover the existence of dominating words in texts that correspond to images: the embedding of a dominating word is closer to the image embedding than the embedding of the entire original text. The existence of dominating words reflects an underlying inconsistency in a VLM’s embedding space and a possible source of bias that would remain undetected without the mutant text technique.
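To make the comparison concrete, the following is a minimal sketch of how mutant texts could be scored against an image with an off-the-shelf CLIP-style model. This is not the authors' implementation: the Hugging Face checkpoint (openai/clip-vit-base-patch32), the helper names (owo_mutants, pow_mutants, image_text_similarities), the example caption and image path, and the use of cosine similarity are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): score a caption and its
# mutant texts against an image with a CLIP-style model and flag "dominating
# words", i.e. single words whose embedding is closer to the image than the
# embedding of the full caption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def owo_mutants(text: str) -> list[str]:
    """One-word-only (OWO) mutants: each word of the caption on its own."""
    return list(dict.fromkeys(text.lower().split()))


def pow_mutants(text: str, extra_words: list[str]) -> list[str]:
    """Plus-one-word (POW) mutants: the caption with one extra word appended."""
    return [f"{text} {w}" for w in extra_words]


@torch.no_grad()
def image_text_similarities(image: Image.Image, texts: list[str]) -> torch.Tensor:
    """Cosine similarity between the image embedding and each text embedding."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # one similarity score per text


caption = "a dog catching a frisbee in the park"  # hypothetical example
image = Image.open("example.jpg")                 # hypothetical image path

words = owo_mutants(caption)
sims = image_text_similarities(image, [caption] + words)
caption_sim, word_sims = sims[0], sims[1:]

for word, sim in zip(words, word_sims):
    if sim > caption_sim:  # the OWO mutant beats the full caption
        print(f"dominating word: {word!r} ({sim:.3f} > {caption_sim:.3f})")
```

Under this setup, any word whose OWO mutant scores higher than the full caption would be flagged as dominating; POW mutants can be scored with the same helper by passing pow_mutants(caption, [...]) in place of the single-word list.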
Cite this paper
Liang, M., Liu, Z., Larson, M.: Mutant Texts: A Technique for Uncovering Unexpected Inconsistencies in Large-Scale Vision-Language Models. In: Rudinac, S., et al. (eds.) MultiMedia Modeling (MMM 2024). LNCS, vol. 14557. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-53302-0_16