Computer Science > Computer Vision and Pattern Recognition

arXiv:2402.11574 (cs)

[Submitted on 18 Feb 2024]

Title: Visual In-Context Learning for Large Vision-Language Models

Authors: Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen

Abstract: In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarizes images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method, and investigate the impact of the length and position of demonstrations on LVLMs. The use of in-context unlearning further shows promise in resetting specific model knowledge without retraining.

Comments:	13 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2402.11574 [cs.CV]
	(or arXiv:2402.11574v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.11574

Submission history

From: Yucheng Zhou
[v1] Sun, 18 Feb 2024 12:43:38 UTC (8,020 KB)
