DOI: 10.1145/3460426.3463584

Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering

Published: 01 September 2021

Abstract

Due to the severe lack of labeled data, existing methods for medical visual question answering usually rely on transfer learning to obtain effective image feature representations and on cross-modal fusion of visual and linguistic features to achieve question-related answer prediction. These two phases are performed independently, without considering the compatibility and applicability of the pre-trained features for cross-modal fusion. We therefore reformulate image feature pre-training as a multi-task learning paradigm, which forces it to account for the applicability of the features to the specific image-comprehension task, and observe a marked improvement. Furthermore, we introduce a cross-modal self-attention (CMSA) module to selectively capture the long-range contextual relevance for more effective fusion of visual and linguistic features. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art methods. Our code and models are available at https://github.com/haifangong/CMSA-MTPT-4-MedicalVQA.
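
To make the multi-task pre-training idea concrete: a single image backbone is trained against several auxiliary classification heads at once, so the learned features must stay useful for every image-comprehension task simultaneously. Below is a minimal PyTorch sketch under assumed task names and class counts (the ResNet-34 backbone and the "organ"/"modality" heads are illustrative assumptions, not the authors' exact configuration):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class MultiTaskPretrainer(nn.Module):
    """Shared backbone with one classification head per auxiliary task."""
    def __init__(self, task_classes):
        # task_classes: dict of task name -> number of classes,
        # e.g. {"organ": 4, "modality": 3} (illustrative labels only).
        super().__init__()
        backbone = resnet34(weights=None)
        feat_dim = backbone.fc.in_features      # 512 for ResNet-34
        backbone.fc = nn.Identity()             # expose pooled features
        self.backbone = backbone
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feat_dim, n) for name, n in task_classes.items()})
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, images, labels):
        # images: (B, 3, H, W); labels: dict of (B,) class-index tensors.
        feats = self.backbone(images)
        # Summing per-task losses forces the shared features to remain
        # applicable to every task at once.
        return sum(self.criterion(self.heads[t](feats), labels[t])
                   for t in self.heads)

# Usage with random data, just to show the shapes involved:
model = MultiTaskPretrainer({"organ": 4, "modality": 3})
loss = model(torch.randn(2, 3, 224, 224),
             {"organ": torch.tensor([0, 2]), "modality": torch.tensor([1, 0])})
loss.backward()
```

Summing the per-task losses is the simplest possible weighting; per-task coefficients could be tuned instead.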

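The cross-modal self-attention module can likewise be pictured as non-local self-attention applied over visual features concatenated with a spatially tiled question embedding, so that every spatial position attends to every other position conditioned on the question. A minimal sketch with assumed feature dimensions, not the released implementation:

```python
import torch
import torch.nn as nn

class CrossModalSelfAttention(nn.Module):
    """Non-local attention over concatenated visual + tiled language features."""
    def __init__(self, vis_dim=1024, lang_dim=1024, hid_dim=256):
        super().__init__()
        in_dim = vis_dim + lang_dim
        self.query = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.key   = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.value = nn.Conv2d(in_dim, hid_dim, kernel_size=1)
        self.out   = nn.Conv2d(hid_dim, vis_dim, kernel_size=1)

    def forward(self, vis, lang):
        # vis:  (B, vis_dim, H, W) grid of image features
        # lang: (B, lang_dim) question embedding, tiled over the grid
        B, _, H, W = vis.shape
        x = torch.cat([vis, lang[:, :, None, None].expand(-1, -1, H, W)], dim=1)
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, hid)
        k = self.key(x).flatten(2)                     # (B, hid, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, hid)
        # Every position attends to every other, capturing long-range,
        # question-conditioned context across the whole feature map.
        attn = torch.softmax(q @ k / q.size(-1) ** 0.5, dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return vis + self.out(fused)                   # residual fusion

# Usage with random data:
fused = CrossModalSelfAttention()(torch.randn(2, 1024, 7, 7),
                                  torch.randn(2, 1024))
```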


      Published In

      ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
      August 2021
      715 pages
      ISBN: 9781450384636
      DOI: 10.1145/3460426


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Badges

      • Best Poster

      Author Tags

      1. multi-task learning
      2. self-attention
      3. transfer learning
      4. visual question answering

      Qualifiers

      • Short-paper


      Conference

      ICMR '21

      Acceptance Rates

      Overall Acceptance Rate 254 of 830 submissions, 31%


      Cited By

      • (2025) VG-CALF: A vision-guided cross-attention and late-fusion network for radiology images in Medical Visual Question Answering. Neurocomputing, Vol. 613, 128730. https://doi.org/10.1016/j.neucom.2024.128730
      • (2024) Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering. Electronics, Vol. 13, 12, 2273. https://doi.org/10.3390/electronics13122273
      • (2024) Multi-Task Paired Masking With Alignment Modeling for Medical Vision-Language Pre-Training. IEEE Transactions on Multimedia, Vol. 26, 4706-4721. https://doi.org/10.1109/TMM.2023.3325965
      • (2024) Counterfactual Causal-Effect Intervention for Interpretable Medical Visual Question Answering. IEEE Transactions on Medical Imaging, Vol. 43, 12, 4430-4441. https://doi.org/10.1109/TMI.2024.3425533
      • (2024) Parameter-Efficient Transfer Learning for Medical Visual Question Answering. IEEE Transactions on Emerging Topics in Computational Intelligence, Vol. 8, 4, 2816-2826. https://doi.org/10.1109/TETCI.2023.3311333
      • (2024) Overcoming Data Limitations and Cross-Modal Interaction Challenges in Medical Visual Question Answering. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. https://doi.org/10.1109/IJCNN60899.2024.10651345
      • (2024) Prior-Posterior Knowledge Prompting-and-Reasoning for Surgical Visual Question Localized-Answering. 2024 International Joint Conference on Neural Networks (IJCNN), 1-9. https://doi.org/10.1109/IJCNN60899.2024.10650493
      • (2024) Medical Vision-Language Representation Learning with Cross-Modal Multi-Teacher Contrastive Distillation. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1891-1895. https://doi.org/10.1109/ICASSP48485.2024.10447344
      • (2024) Prompt-Based Personalized Federated Learning for Medical Visual Question Answering. ICASSP 2024 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1821-1825. https://doi.org/10.1109/ICASSP48485.2024.10445933
      • (2024) Cross-Modal self-supervised vision language pre-training with multiple objectives for medical visual question answering. Journal of Biomedical Informatics, Vol. 160, 104748. https://doi.org/10.1016/j.jbi.2024.104748
