Computer Science > Computation and Language

arXiv:2410.06765 (cs)

[Submitted on 9 Oct 2024]

Title: To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models

Authors: Junyan Lin, Haoran Chen, Dawei Zhu, Xiaoyu Shen

Abstract: In recent years, multimodal large language models (MLLMs) have garnered significant attention from both industry and academia. However, there is still considerable debate on constructing MLLM architectures, particularly regarding the selection of appropriate connectors for perception tasks of varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance. Our findings reveal that feature-preserving connectors excel in fine-grained perception tasks due to their ability to retain detailed visual information. In contrast, feature-compressing connectors, while less effective in fine-grained perception tasks, offer significant speed advantages and perform comparably in coarse-grained perception and reasoning tasks. These insights are crucial for guiding MLLM architecture design and advancing the optimization of MLLM architectures.
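
The distinction between the two connector types can be made concrete with a toy sketch. The abstract does not name specific connector designs, so the per-token linear projection and the average pooling below are assumed stand-ins chosen for illustration, and the function names `preserve_connector` and `compress_connector` are hypothetical. The only point shown is that the first keeps the number of visual tokens unchanged while the second reduces it.

```python
from typing import List

Token = List[float]  # one visual token = one feature vector


def preserve_connector(tokens: List[Token], weight: List[List[float]]) -> List[Token]:
    """Feature-preserving sketch (assumed: per-token linear projection).

    Each token is projected independently, so the output has exactly as
    many tokens as the input. `weight` has shape (in_dim, out_dim).
    """
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def compress_connector(tokens: List[Token], window: int) -> List[Token]:
    """Feature-compressing sketch (assumed: average pooling over token runs).

    Every run of `window` consecutive tokens is averaged into one token,
    so the output has len(tokens) / window tokens.
    """
    if window <= 0 or len(tokens) % window != 0:
        raise ValueError("window must be positive and divide the token count")
    dim = len(tokens[0])
    return [
        [sum(tok[d] for tok in tokens[start:start + window]) / window for d in range(dim)]
        for start in range(0, len(tokens), window)
    ]


# Toy input: 4 visual tokens of dimension 2 (values are made up).
visual_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projection weights

preserved = preserve_connector(visual_tokens, identity)
compressed = compress_connector(visual_tokens, window=2)

print(len(preserved), len(compressed))  # 4 2
```

Fewer tokens passed to the language model means a shorter input sequence, which is consistent with the speed advantage the abstract attributes to feature-compressing connectors. The averaging step also shows where fine-grained detail can be lost.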

Comments:	Accepted to EMNLP 2024 Main Conference
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.06765 [cs.CL]
	(or arXiv:2410.06765v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2410.06765

Submission history

From: Junyan Lin
[v1] Wed, 9 Oct 2024 10:53:18 UTC (1,014 KB)
