Computer Science > Computer Vision and Pattern Recognition

arXiv:2309.09958 (cs)

[Submitted on 18 Sep 2023]

Title:An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Authors:Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

View PDF

Abstract:Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters or smaller. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on the multi-modal and language capabilities when completing real-world tasks in the wild.
We find that scaling LMM consistently enhances model performance and improves language capabilities, and performance of LoRA/QLoRA tuning of LMM are comparable to the performance of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing multimodal-language data to improve LMM performance, and visual instruction tuning can sometimes improve LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

Comments:	Released at LLaVA Model Zoo: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2309.09958 [cs.CV]
	(or arXiv:2309.09958v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.09958

Submission history

From: Chunyuan Li [view email]
[v1] Mon, 18 Sep 2023 17:30:46 UTC (205 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators