
Image Captioning by Asking Questions

Published: 19 July 2019

Abstract

Image captioning and visual question answering (VQA) are representative tasks that connect computer vision and natural language processing. Both must effectively represent visual content with computer vision methods and process text with natural language processing techniques, and the core problem in both is to infer the target output from a joint understanding of the word sequence and the image. Although the two tasks use similar algorithms in practice, they have been studied independently in recent years. In this article, we exploit the mutual correlation between these two tasks. We propose the first VQA-improved image-captioning method, which transfers knowledge learned from VQA corpora to the image-captioning task. A VQA model is first pretrained on image--question--answer instances. The pretrained VQA model is then used to extract VQA-grounded semantic representations from selected free-form, open-ended visual question--answer pairs. The VQA-grounded features are complementary to the visual features because they interpret images from a different perspective. We incorporate the VQA model into the image-captioning model by adaptively fusing the VQA-grounded feature with the attended visual feature. We show that such simple VQA-improved image-captioning (VQA-IIC) models outperform conventional image-captioning methods on large-scale public datasets.
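To make the fusion step concrete, the sketch below illustrates one way the adaptive fusion described in the abstract could look in code. It is not the authors' implementation: the module, dimensions, and names (AdaptiveFusionCaptioner, visual_dim, vqa_dim, and so on) are hypothetical, and the gate-based blend is only a minimal, assumed reading of "adaptively fusing the VQA-grounded feature and the attended visual feature."

```python
# Minimal sketch (not the authors' code) of gated fusion between an attended visual
# feature and a VQA-grounded feature inside a caption decoder step. All names and
# dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class AdaptiveFusionCaptioner(nn.Module):
    def __init__(self, visual_dim=2048, vqa_dim=1024, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.vqa_proj = nn.Linear(vqa_dim, hidden_dim)
        # The gate decides, per decoding step, how much to rely on the VQA-grounded
        # feature versus the attended visual feature.
        self.gate = nn.Sequential(nn.Linear(3 * hidden_dim, 1), nn.Sigmoid())
        self.decoder = nn.LSTMCell(2 * hidden_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def step(self, word_ids, attended_visual, vqa_grounded, state):
        h, c = state
        v = self.visual_proj(attended_visual)        # attended CNN feature
        q = self.vqa_proj(vqa_grounded)              # feature from the pretrained VQA model
        g = self.gate(torch.cat([v, q, h], dim=-1))  # fusion weight in [0, 1]
        fused = g * q + (1.0 - g) * v                # adaptive blend of the two features
        x = torch.cat([self.embed(word_ids), fused], dim=-1)
        h, c = self.decoder(x, (h, c))
        return self.output(h), (h, c)


# Toy usage with random tensors standing in for real features.
model = AdaptiveFusionCaptioner()
B = 2
state = (torch.zeros(B, 512), torch.zeros(B, 512))
logits, state = model.step(
    torch.randint(0, 10000, (B,)),  # previous word ids
    torch.randn(B, 2048),           # attended visual feature
    torch.randn(B, 1024),           # VQA-grounded feature
    state,
)
print(logits.shape)  # torch.Size([2, 10000])
```

Under this reading, the learned gate lets the decoder lean on the VQA-grounded feature where question-like semantics help (objects, attributes, relations) and fall back to the attended visual feature elsewhere; the actual VQA-IIC design may differ in detail.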





Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 2s
Special Section on Cross-Media Analysis for Visual Question Answering, Special Section on Big Data, Machine Learning and AI Technologies for Art and Design and Special Section on MMSys/NOSSDAV 2018
April 2019
381 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3343360

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 July 2019
Accepted: 01 February 2019
Revised: 01 February 2019
Received: 01 July 2018
Published in TOMM Volume 15, Issue 2s


Author Tags

  1. Image captioning
  2. attention networks
  3. visual question answering

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Key Research Program of Frontier Sciences, CAS
  • Research Program of National Laboratory of Pattern Recognition
  • CCF-Tencent Open Fund
  • National Natural Science Foundation of China
  • National Key Research and Development Program of China


Cited By

  • (2023) Learning Video-Text Aligned Representations for Video Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications 19, 2, 1--21. https://doi.org/10.1145/3546828. Online publication date: 6-Feb-2023.
  • (2022) Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training. ACM Transactions on Multimedia Computing, Communications, and Applications 18, 2, 1--16. https://doi.org/10.1145/3473140. Online publication date: 16-Feb-2022.
  • (2022) A Framework for Image Captioning Based on Relation Network and Multilevel Attention Mechanism. Neural Processing Letters 55, 5, 5693--5715. https://doi.org/10.1007/s11063-022-11106-y. Online publication date: 17-Dec-2022.
  • (2022) Multilevel attention and relation network based image captioning model. Multimedia Tools and Applications 82, 7, 10981--11003. https://doi.org/10.1007/s11042-022-13793-0. Online publication date: 16-Sep-2022.
  • (2022) Image captioning improved visual question answering. Multimedia Tools and Applications 81, 24, 34775--34796. https://doi.org/10.1007/s11042-021-11276-2. Online publication date: 1-Oct-2022.
  • (2022) Remote Sensing Image Captioning via Multilevel Attention-Based Visual Question Answering. Innovations in Computational Intelligence and Computer Vision, 465--475. https://doi.org/10.1007/978-981-19-0475-2_41. Online publication date: 15-May-2022.
  • (2021) Community Detection in Partially Observable Social Networks. ACM Transactions on Knowledge Discovery from Data 16, 2, 1--24. https://doi.org/10.1145/3461339. Online publication date: 21-Jul-2021.
  • (2021) RCBE-AS: Rabin cryptosystem-based efficient authentication scheme for wireless sensor networks. Personal and Ubiquitous Computing 28, 1, 171--192. https://doi.org/10.1007/s00779-021-01592-7. Online publication date: 16-Jul-2021.
  • (2020) Nonnegative Residual Matrix Factorization for Community Detection. Web Information Systems Engineering – WISE 2020, 196--209. https://doi.org/10.1007/978-3-030-62005-9_15. Online publication date: 20-Oct-2020.
