DOI: 10.1145/3460426.3463584

Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Published: 01 September 2021

Abstract

Due to the severe lack of labeled data, existing methods for medical visual question answering usually rely on transfer learning to obtain effective image feature representations and on cross-modal fusion of visual and linguistic features to achieve question-related answer prediction. These two phases are performed independently, without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. We therefore reformulate image feature pre-training as a multi-task learning paradigm, which forces it to account for the applicability of the features to the specific image-comprehension task, and observe a marked improvement. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture the long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.
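
To make the multi-task pre-training idea concrete: a single image backbone is trained against several auxiliary classification heads at once, so the learned features must stay useful for every image-comprehension task simultaneously. Below is a minimal PyTorch sketch under assumed task names and class counts (the ResNet-34 backbone and the "organ"/"modality" heads are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class MultiTaskPretrainer(nn.Module):
    """Shared backbone with one classification head per auxiliary task."""
    def __init__(self, task_classes):
        # task_classes: dict of task name -> number of classes,
        # e.g. {"organ": 4, "modality": 3} (illustrative labels only).
        super().__init__()
        backbone = resnet34(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-34
        backbone.fc = nn.Identity()             # expose pooled features
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, n) for name, n in task_classes.items()})
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, images, labels):
        # images: (B, 3, H, W); labels: dict of (B,) class-index tensors.
        feats = self.backbone(images)
        # Summing per-task losses forces the shared features to remain
        # applicable to every task at once.
        return sum(self.criterion(self.heads[t](feats), labels[t])
                   for t in self.heads)

# Usage with random data, just to show the shapes involved:
model = MultiTaskPretrainer({"organ": 4, "modality": 3})
loss = model(torch.randn(2, 3, 224, 224),
             {"organ": torch.tensor([0, 2]), "modality": torch.tensor([1, 0])})
loss.backward()
```

Summing the per-task losses is the simplest possible weighting; per-task coefficients could be tuned instead.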

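The cross-modal self-attention module can likewise be pictured as non-local self-attention applied over visual features concatenated with a spatially tiled question embedding, so that every spatial position attends to every other position conditioned on the question. A minimal sketch with assumed feature dimensions, not the released implementation:

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Non-local attention over concatenated visual + tiled language features."""
    def __init__(self, vis_dim=1024, lang_dim=1024, hid_dim=256):
        super().__init__()
        in_dim = vis_dim + lang_dim
        self.query = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.key   = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.value = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.out   = nn.Conv2d(hid_dim, vis_dim, kernel_size=1)

    def forward(self, vis, lang):
        # vis:  (B, vis_dim, H, W) grid of image features
        # lang: (B, lang_dim) question embedding, tiled over the grid
        B, _, H, W = vis.shape
        x = torch.cat([vis, lang[:, :, None, None].expand(-1, -1, H, W)], dim=1)
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, hid)
        k = self.key(x).flatten(2)                     # (B, hid, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, hid)
        # Every position attends to every other, capturing long-range,
        # question-conditioned context across the whole feature map.
        attn = torch.softmax(q @ k / q.size(-1) ** 0.5, dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return vis + self.out(fused)                   # residual fusion

# Usage with random data:
fused = CrossModalSelfAttention()(torch.randn(2, 1024, 7, 7),
                                  torch.randn(2, 1024))
```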


      Published In

      ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
      August 2021
      715 pages
      ISBN: 9781450384636
      DOI: 10.1145/3460426


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Badges

      • Best Poster

      Author Tags

      1. multi-task learning
      2. self-attention
      3. transfer learning
      4. visual question answering

      Qualifiers

      • Short-paper


      Conference

      ICMR '21

      Acceptance Rates

      Overall Acceptance Rate 254 of 830 submissions, 31%


      Cited By

      • (2025) VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering. Neurocomputing, Vol. 613, 128730. https://doi.org/10.1016/j.neucom.2024.128730
      • (2024) Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering. Electronics, Vol. 13, 12, 2273. https://doi.org/10.3390/electronics13122273
      • (2024) Multi-Task Paired Masking With Alignment Modeling for Medical Vision-Language Pre-Training. IEEE Transactions on Multimedia, Vol. 26, 4706-4721. https://doi.org/10.1109/TMM.2023.3325965
      • (2024) Counterfactual Causal-Effect Intervention for Interpretable Medical Visual Question Answering. IEEE Transactions on Medical Imaging, Vol. 43, 12, 4430-4441. https://doi.org/10.1109/TMI.2024.3425533
      • (2024) Parameter-Efficient Transfer Learning for Medical Visual Question Answering. IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 8, 4, 2816-2826. https://doi.org/10.1109/TETCI.2023.3311333
      • (2024) Overcoming Data Limitations and Cross-Modal Interaction Challenges in Medical Visual Question Answering. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. https://doi.org/10.1109/IJCNN60899.2024.10651345
      • (2024) Prior-Posterior Knowledge Prompting-and-Reasoning for Surgical Visual Question Localized-Answering. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. https://doi.org/10.1109/IJCNN60899.2024.10650493
      • (2024) Medical Vision-Language Representation Learning with Cross-Modal Multi-Teacher Contrastive Distillation. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1891-1895. https://doi.org/10.1109/ICASSP48485.2024.10447344
      • (2024) Prompt-Based Personalized Federated Learning for Medical Visual Question Answering. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1821-1825. https://doi.org/10.1109/ICASSP48485.2024.10445933
      • (2024) Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering. Journal of Biomedical Informatics, Vol. 160, 104748. https://doi.org/10.1016/j.jbi.2024.104748
