DOI: 10.5555/3618408.3620067

Retrieval-augmented multimodal language modeling

Published: 23 July 2023

Abstract

Recent multimodal models such as DALL-E and CM3 have achieved remarkable progress in text-to-image and image-to-text generation. However, these models store all their knowledge (e.g., the appearance of the Eiffel Tower) in the model parameters, requiring increasingly larger models and training data to capture more knowledge. To integrate knowledge in a more scalable and modular way, we propose a retrieval-augmented multimodal model, which enables a base multimodal model (generator) to refer to relevant text and images fetched by a retriever from external memory (e.g., documents on the web). Specifically, for the retriever, we use a pretrained CLIP, and for the generator, we train a CM3 Transformer on the LAION dataset. Our resulting model, named Retrieval-Augmented CM3 (RA-CM3), is the first multimodal model that can retrieve and generate both text and images. We show that RA-CM3 significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks (12 FID and 17 CIDEr improvements on MS-COCO), while requiring much less compute for training (<30% of DALL-E). Moreover, we show that RA-CM3 exhibits novel capabilities, such as faithful image generation and multimodal in-context learning (e.g., image generation from demonstrations).
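
To make the retriever-generator split concrete, the sketch below illustrates the dense retrieval step in NumPy: each mixed-modal document in the external memory is embedded by averaging its CLIP text and image embeddings, and the top-scoring documents under inner-product similarity are handed to the generator as context. This is a minimal illustration, not the authors' implementation; the encoder stubs and random vectors are placeholders for frozen CLIP encoders, and a real system would use an approximate maximum-inner-product index rather than a dense matmul.

```python
import numpy as np

def encode_document(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    """Embed a mixed-modal document by averaging its CLIP text and image
    embeddings and L2-normalizing (the scoring scheme the paper describes
    for its mixed-modal retriever)."""
    mixed = (text_emb + image_emb) / 2.0
    return mixed / np.linalg.norm(mixed)

def retrieve(query_emb: np.ndarray, memory: np.ndarray, k: int = 2) -> np.ndarray:
    """Return indices of the top-k documents by inner-product similarity.
    Real systems replace this dense scan with an approximate index."""
    scores = memory @ (query_emb / np.linalg.norm(query_emb))
    return np.argsort(-scores)[:k]

# Toy external memory: random vectors stand in for CLIP embeddings of
# (caption, image) documents from a corpus such as LAION.
rng = np.random.default_rng(0)
dim = 64
memory = np.stack([
    encode_document(rng.normal(size=dim), rng.normal(size=dim))
    for _ in range(1000)
])

query = rng.normal(size=dim)  # stand-in for the CLIP embedding of the input prompt
top_k = retrieve(query, memory, k=2)
print("retrieved document indices:", top_k)

# The generator (a CM3 Transformer) then conditions on the retrieved documents,
# e.g., by prepending their token sequences to the main input before decoding.
```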

Published In

ICML'23: Proceedings of the 40th International Conference on Machine Learning
July 2023, 43479 pages

Publisher

JMLR.org

Qualifiers

  • Research-article
  • Research
  • Refereed limited
