
Visual Question Generation: The State of the Art

Published: 28 May 2020

Abstract

Visual question generation (VQG) is a problem that has recently attracted considerable attention. The task is to generate meaningful questions about a given input image. It is a multi-modal problem combining image understanding with natural language generation, typically addressed with deep learning methods, and it can be viewed as the complementary task of visual question answering (VQA). In this article, we review the current state of VQG in terms of formulations of the problem, existing datasets for training VQG models, evaluation metrics, and algorithms proposed to solve the task. Finally, we discuss the challenges that remain to be overcome and possible future directions toward effective VQG.
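
Deep learning approaches to VQG generally follow an encoder-decoder design: a convolutional network encodes the image, and a recurrent decoder generates the question word by word. The PyTorch sketch below illustrates that generic pattern only; the ResNet-18 backbone, layer sizes, and vocabulary size are illustrative assumptions, not the method of any particular paper covered by this survey.

import torch
import torch.nn as nn
from torchvision import models

class SimpleVQG(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder sketch of a VQG model.

    Illustrative only: the backbone and all sizes are assumptions,
    not taken from any specific paper in the survey.
    """

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)    # image encoder
        backbone.fc = nn.Identity()                 # expose 512-d features
        self.encoder = backbone
        self.img_proj = nn.Linear(512, embed_dim)   # image -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, question_tokens):
        # Treat the image feature as the first decoder input, followed by
        # the ground-truth question tokens (teacher forcing).
        feats = self.img_proj(self.encoder(images)).unsqueeze(1)    # (B, 1, E)
        words = self.embed(question_tokens)                         # (B, T, E)
        states, _ = self.decoder(torch.cat([feats, words], dim=1))  # (B, T+1, H)
        # Step t predicts token t, so dropping the final state aligns the
        # logits one-to-one with question_tokens as targets.
        return self.out(states[:, :-1, :])                          # (B, T, V)

# Usage sketch: a standard cross-entropy training step on dummy data.
model = SimpleVQG()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 10000, (2, 12))
logits = model(images, tokens)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), tokens.reshape(-1))

At generation time, the same decoder would be run autoregressively from the image feature alone, feeding each sampled word back in; many of the surveyed methods replace this greedy loop with beam search, variational sampling, or reinforcement learning objectives.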

Published In

ACM Computing Surveys, Volume 53, Issue 3 (May 2021), 787 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3403423
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 May 2020
Online AM: 07 May 2020
Accepted: 01 February 2020
Revised: 01 February 2020
Received: 01 November 2019
Published in CSUR Volume 53, Issue 3

Author Tags

  1. Image understanding
  2. Question generation

Qualifiers

  • Survey
  • Research
  • Refereed
