Computer Science > Computer Vision and Pattern Recognition

arXiv:2308.12966 (cs)

[Submitted on 24 Aug 2023 (v1), last revised 13 Oct 2023 (this version, v3)]

Title: Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Authors: Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

Abstract: In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of the Qwen-VL models by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models of similar scale on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo and models are available at this https URL.

Comments:	Code, demo and models are available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2308.12966 [cs.CV]
	(or arXiv:2308.12966v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.12966

Submission history

From: Shuai Bai
[v1] Thu, 24 Aug 2023 17:59:17 UTC (5,795 KB)
[v2] Thu, 14 Sep 2023 17:08:39 UTC (4,670 KB)
[v3] Fri, 13 Oct 2023 02:41:28 UTC (5,291 KB)
