Electrical Engineering and Systems Science > Image and Video Processing

arXiv:2410.14200 (eess)

[Submitted on 18 Oct 2024]

Title: E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model

Authors: Haoran Lai, Zihang Jiang, Qingsong Yao, Rongsheng Wang, Zhiyang He, Xiaodong Tao, Wei Wei, Weifu Lv, S. Kevin Zhou

Abstract: The development of 3D medical vision-language models holds significant potential for disease diagnosis and patient treatment. However, compared to 2D medical images, 3D medical images, such as CT scans, face challenges related to limited training data and high dimensionality, which severely restrict the progress of 3D medical vision-language models. To address these issues, we collect a large amount of unlabeled 3D CT data and utilize self-supervised learning to construct a 3D visual foundation model for extracting 3D visual features. Then, we apply 3D spatial convolutions to aggregate and project high-level image features, reducing computational complexity while preserving spatial information. We also construct two instruction-tuning datasets based on BIMCV-R and CT-RATE to fine-tune the 3D vision-language model. Our model demonstrates superior performance compared to existing methods in report generation, visual question answering, and disease diagnosis. Code and data will be made publicly available soon.

Subjects:	Image and Video Processing (eess.IV); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2410.14200 [eess.IV]
	(or arXiv:2410.14200v1 [eess.IV] for this version)
	https://doi.org/10.48550/arXiv.2410.14200

Submission history

From: Haoran Lai
[v1] Fri, 18 Oct 2024 06:31:40 UTC (1,896 KB)
