Computer Science > Computation and Language

arXiv:2310.08166 (cs)

[Submitted on 12 Oct 2023 (v1), last revised 31 Oct 2023 (this version, v3)]

Title:Ziya-Visual: Bilingual Large Vision-Language Model via Multi-Task Instruction Tuning

Authors:Junyu Lu, Dixiang Zhang, Xiaojun Wu, Xinyu Gao, Ruyi Gan, Jiaxing Zhang, Yan Song, Pingjian Zhang

Abstract: Recent advancements have expanded the capabilities of large language models (LLMs) in zero-shot image-to-text generation and understanding by integrating multi-modal inputs. However, such success is typically limited to English scenarios due to the lack of large-scale, high-quality non-English multi-modal resources, making it extremely difficult to establish competitive counterparts in other languages. In this paper, we introduce the Ziya-Visual series, a set of bilingual large-scale vision-language models (LVLMs) designed to incorporate visual semantics into LLMs for multi-modal dialogue. Composed of Ziya-Visual-Base and Ziya-Visual-Chat, our models adopt the Querying Transformer from BLIP-2 and further explore optimization schemes such as instruction tuning, multi-stage training, and low-rank adaptation modules for visual-language alignment. In addition, we leverage the understanding ability of GPT-4 in multi-modal scenarios, translating our gathered English image-text datasets into Chinese and generating instruction-response pairs through in-context learning. The experimental results demonstrate that, compared to existing LVLMs, Ziya-Visual achieves competitive performance across a wide range of English-only tasks, including zero-shot image-text retrieval, image captioning, and visual question answering. The evaluation leaderboard assessed by GPT-4 also indicates that our models possess satisfactory image-text understanding and generation capabilities in Chinese multi-modal dialogue scenarios. Code, demo and models are available at this https URL.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2310.08166 [cs.CL]
	(or arXiv:2310.08166v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.08166

Submission history

From: JunYu Lu
[v1] Thu, 12 Oct 2023 09:39:17 UTC (13,341 KB)
[v2] Sun, 29 Oct 2023 15:39:51 UTC (13,341 KB)
[v3] Tue, 31 Oct 2023 17:51:51 UTC (13,341 KB)
