DOI: 10.1145/3460426.3463667
research-article

Contextualized Keyword Representations for Multi-modal Retinal Image Captioning

Published: 01 September 2021

Abstract

Medical image captioning automatically generates a medical description for a given medical image. Traditional medical image captioning models produce this description from a single medical image input alone, which makes abstract medical descriptions or concepts hard to generate and limits the effectiveness of medical image captioning. Multi-modal medical image captioning is one approach to this problem: textual input, e.g., expert-defined keywords, is treated as one of the main drivers of medical description generation, so effectively encoding both the textual input and the medical image is important for the task. In this work, a new end-to-end deep multi-modal medical image captioning model is proposed, built on contextualized keyword representations, textual feature reinforcement, and masked self-attention. Evaluated on an existing multi-modal medical image captioning dataset, the proposed model improves on the state-of-the-art method by +53.2% in BLEU-avg and +18.6% in CIDEr. Code: https://github.com/Jhhuangkay/Contextualized-Keyword-Representations-for-Multi-modal-Retinal-Image-Captioning
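
To make the described architecture concrete, below is a minimal PyTorch sketch of the pipeline the abstract outlines: expert-defined keywords are encoded into contextualized representations, reinforced, fused with CNN image features, and decoded with masked (causal) self-attention. It assumes BERT as the contextualized keyword encoder and a VGG-16 backbone for the retinal image; all names here (e.g., MultiModalCaptioner, reinforce_gate) are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16
from transformers import BertModel


class MultiModalCaptioner(nn.Module):
    """Sketch: contextualized keywords + image features -> masked decoder."""

    def __init__(self, vocab_size, d_model=768, n_heads=8, n_layers=2):
        super().__init__()
        # Contextualized keyword representations (BERT assumed here).
        self.keyword_encoder = BertModel.from_pretrained("bert-base-uncased")
        # CNN image encoder: VGG-16 convolutional trunk only (512 channels out).
        self.image_encoder = vgg16(weights="IMAGENET1K_V1").features
        self.image_proj = nn.Linear(512, d_model)
        # "Textual feature reinforcement", approximated as a learned sigmoid
        # gate over the keyword features (an assumption, not the paper's exact op).
        self.reinforce_gate = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())
        # Caption decoder whose self-attention is causally masked.
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, keyword_ids, keyword_mask, caption_ids):
        # 1) Contextualized keyword representations, shape (B, K, 768).
        kw = self.keyword_encoder(
            input_ids=keyword_ids, attention_mask=keyword_mask
        ).last_hidden_state
        kw = kw * self.reinforce_gate(kw)  # textual feature reinforcement (sketch)
        # 2) Image features as a sequence of spatial vectors, (B, H'*W', d_model).
        feat = self.image_encoder(images).flatten(2).transpose(1, 2)
        img = self.image_proj(feat)
        # 3) Fuse both modalities into one memory the decoder attends over.
        memory = torch.cat([kw, img], dim=1)
        # 4) Masked self-attention: an upper-triangular -inf mask keeps each
        #    caption position from attending to future tokens.
        T = caption_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.token_emb(caption_ids), memory, tgt_mask=causal)
        return self.lm_head(h)  # (B, T, vocab_size) logits per time step
```

Training such a model would minimize cross-entropy between these logits and the reference descriptions; generated captions would then be scored with BLEU-avg (presumably the mean of BLEU-1 through BLEU-4) and CIDEr, the metrics reported in the abstract.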

    Published In

    ICMR '21: Proceedings of the 2021 International Conference on Multimedia Retrieval
    August 2021
    715 pages
    ISBN: 9781450384636
    DOI:10.1145/3460426

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. contextualized word representations
    2. multi-modal medical image captioning
    3. retinal images

    Funding Sources

    • University of Amsterdam

    Conference

    ICMR '21

    Acceptance Rates

    Overall Acceptance Rate: 254 of 830 submissions, 31%

    Cited By

    • (2024) Multi-modal Video Summarization. Proceedings of the 2024 International Conference on Multimedia Retrieval, 1214-1218. DOI: 10.1145/3652583.3657582. Online publication date: 30-May-2024.
    • (2024) M3T: Multi-Modal Medical Transformer To Bridge Clinical Context With Visual Insights For Retinal Image Medical Description Generation. 2024 IEEE International Conference on Image Processing (ICIP), 3037-3043. DOI: 10.1109/ICIP51287.2024.10647584. Online publication date: 27-Oct-2024.
    • (2024) Work like a doctor. Expert Systems with Applications: An International Journal, Vol. 237, Part A. DOI: 10.1016/j.eswa.2023.121442. Online publication date: 27-Feb-2024.
    • (2024) ECG Captioning with Prior-Knowledge Transformer and Diffusion Probabilistic Model. Journal of Healthcare Informatics Research. DOI: 10.1007/s41666-024-00176-3. Online publication date: 22-Oct-2024.
    • (2023) Vision–Language Model for Visual Question Answering in Medical Imagery. Bioengineering, Vol. 10, 3 (380). DOI: 10.3390/bioengineering10030380. Online publication date: 20-Mar-2023.
    • (2023) Deep Learning Approaches on Image Captioning: A Review. ACM Computing Surveys, Vol. 56, 3, 1-39. DOI: 10.1145/3617592. Online publication date: 5-Oct-2023.
    • (2023) Expert-defined Keywords Improve Interpretability of Retinal Image Captioning. 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1859-1868. DOI: 10.1109/WACV56688.2023.00190. Online publication date: Jan-2023.
    • (2023) Image Captioning for Chest X-Rays Using GRU-Based Attention Mechanism. 2023 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS), 518-525. DOI: 10.1109/ICCCIS60361.2023.10425174. Online publication date: 3-Nov-2023.
    • (2023) Causalainer: Causal Explainer for Automatic Video Summarization. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2630-2636. DOI: 10.1109/CVPRW59228.2023.00262. Online publication date: Jun-2023.
    • (2022) Non-local Attention Improves Description Generation for Retinal Images. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3250-3259. DOI: 10.1109/WACV51458.2022.00331. Online publication date: Jan-2022.
