Computer Science > Computer Vision and Pattern Recognition

arXiv:2401.05163 (cs)

[Submitted on 10 Jan 2024 (v1), last revised 19 Jun 2024 (this version, v3)]

Title: MISS: A Generative Pretraining and Finetuning Approach for Med-VQA

Authors: Jiawei Chen, Dingkang Yang, Yue Jiang, Yuxuan Lei, Lihua Zhang

Abstract: Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer classification task, which is difficult to transfer to practical application scenarios. Additionally, because medical images are privacy-sensitive and annotation is expensive, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using Large Language Models (LLMs), enabling data from traditional medical vision tasks to be applied to VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.

Comments:	ICANN, 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2401.05163 [cs.CV]
	(or arXiv:2401.05163v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.05163

Submission history

From: Jiawei Chen
[v1] Wed, 10 Jan 2024 13:56:40 UTC (1,295 KB)
[v2] Thu, 18 Jan 2024 09:34:31 UTC (1,295 KB)
[v3] Wed, 19 Jun 2024 11:14:40 UTC (1,411 KB)
