Abstract
Recently, Vision-Language Models (VLMs) trained on large-scale noisy data have shown strong generalization abilities on many downstream tasks. In this paper, we introduce a new technique for uncovering unexpected inconsistencies in VLMs, which leads to the formulation of new research questions on how to improve VLMs. Specifically, we propose that performance on original texts should be compared with performance on ‘mutant texts’, carefully designed variants of the original texts. In contrast to the text perturbations used to study robustness, ‘mutant texts’ represent large changes to the original texts that impact their semantics. We present two types of example mutant texts: one-word-only (OWO) mutants, which replace the original text with a single word it contains, and plus-one-word (POW) mutants, which add a word to the original text. Mutant texts allow us to discover the existence of dominating words in texts that correspond to images: the embedding of a dominating word is closer to the image embedding than the embedding of the entire original text. The existence of dominating words reflects an underlying inconsistency in a VLM’s embedding space and a possible source of bias that would remain undetected without the mutant text technique.
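To make the comparison concrete, the following is a minimal sketch of how mutant texts could be scored against an image with an off-the-shelf CLIP-style model. This is not the authors' implementation: the Hugging Face checkpoint (openai/clip-vit-base-patch32), the helper names (owo_mutants, pow_mutants, image_text_similarities), the example caption and image path, and the use of cosine similarity are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): score a caption and its
# mutant texts against an image with a CLIP-style model and flag "dominating
# words", i.e. single words whose embedding is closer to the image than the
# embedding of the full caption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # illustrative checkpoint choice
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)


def owo_mutants(text: str) -> list[str]:
    """One-word-only (OWO) mutants: each word of the caption on its own."""
    return list(dict.fromkeys(text.lower().split()))


def pow_mutants(text: str, extra_words: list[str]) -> list[str]:
    """Plus-one-word (POW) mutants: the caption with one extra word appended."""
    return [f"{text} {w}" for w in extra_words]


@torch.no_grad()
def image_text_similarities(image: Image.Image, texts: list[str]) -> torch.Tensor:
    """Cosine similarity between the image embedding and each text embedding."""
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (txt @ img.T).squeeze(-1)  # one similarity score per text


caption = "a dog catching a frisbee in the park"  # hypothetical example
image = Image.open("example.jpg")                 # hypothetical image path

words = owo_mutants(caption)
sims = image_text_similarities(image, [caption] + words)
caption_sim, word_sims = sims[0], sims[1:]

for word, sim in zip(words, word_sims):
    if sim > caption_sim:  # the OWO mutant beats the full caption
        print(f"dominating word: {word!r} ({sim:.3f} > {caption_sim:.3f})")
```

Under this setup, any word whose OWO mutant scores higher than the full caption would be flagged as dominating; POW mutants can be scored with the same helper by passing pow_mutants(caption, [...]) in place of the single-word list.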
Cite this paper
Liang, M., Liu, Z., Larson, M.: Mutant Texts: A Technique for Uncovering Unexpected Inconsistencies in Large-Scale Vision-Language Models. In: Rudinac, S., et al. (eds.) MultiMedia Modeling (MMM 2024). LNCS, vol. 14557. Springer, Cham (2024). https://doi.org/10.1007/978-3-031-53302-0_16