arXiv:2410.14332 (cs)

[Submitted on 18 Oct 2024 (v1), last revised 24 Dec 2024 (this version, v3)]

Title: Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension

Authors: Yin Xie, Kaicheng Yang, Ninghua Yang, Weimo Deng, Xiangzi Dai, Tiancheng Gu, Yumeng Wang, Xiang An, Yongle Zhao, Ziyong Feng, Roy Miles, Ismail Elezi, Jiankang Deng

Abstract: Recent advances in Large Language Models (LLMs) have catalyzed the development of Large Multimodal Models (LMMs). However, existing research primarily focuses on tuning language and image instructions, ignoring the critical pretraining phase where models learn to process textual and visual modalities jointly. In this paper, we propose a new pretraining paradigm for LMMs that enhances the visual comprehension capabilities of LLMs by introducing a novel cross-modal comprehension stage. Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens. Then, we conceptualize visual tokens as analogous to a "foreign language" for the LLMs and propose a mixed attention mechanism with bidirectional visual attention and unidirectional textual attention to comprehensively enhance the understanding of visual tokens. Meanwhile, we integrate a detailed caption generation task, leveraging rich descriptions to further help LLMs understand visual semantic information. After pretraining on 1.5 million publicly accessible data samples, we present a new foundation model called Croc. Experimental results demonstrate that Croc achieves new state-of-the-art performance on a wide range of vision-language benchmarks. To support reproducibility and facilitate further research, we release the training code and pre-trained model weights at this https URL.
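
The mixed attention mechanism described in the abstract can be pictured as a boolean mask over the token sequence. The following is a minimal sketch only, not the authors' implementation: the function name `mixed_attention_mask`, the "v"/"t" modality labels, and the exact rule (visual tokens attend to every visual token in either direction, while all other attention stays causal) are assumptions made for illustration.

```python
def mixed_attention_mask(modalities):
    """Return mask[i][j] == True when query position i may attend to key position j.

    `modalities` is a sequence of "v" (visual token) or "t" (text token).
    Assumed rule for this sketch: visual-to-visual attention is bidirectional,
    everything else is unidirectional (causal).
    """
    n = len(modalities)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Unidirectional part: a position sees itself and earlier positions.
            causal = j <= i
            # Bidirectional part: visual queries also see later visual keys.
            both_visual = modalities[i] == "v" and modalities[j] == "v"
            mask[i][j] = causal or both_visual
    return mask


# Hypothetical layout: three visual tokens followed by two text tokens.
layout = ["v", "v", "v", "t", "t"]
mask = mixed_attention_mask(layout)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

In this toy layout, each visual row allows attention to all three visual positions, while each text row allows attention only to itself and the positions before it. In practice such a mask would be passed to the attention layers of the LLM in place of the usual purely causal mask.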

Comments:	14 pages, 12 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.14332 [cs.CV]
	(or arXiv:2410.14332v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.14332

Submission history

From: Kaicheng Yang
[v1] Fri, 18 Oct 2024 09:44:25 UTC (3,620 KB)
[v2] Sat, 30 Nov 2024 05:56:52 UTC (4,491 KB)
[v3] Tue, 24 Dec 2024 03:27:45 UTC (4,490 KB)
