Abstract
The Visual Question Answering (VQA) task combines the challenges of visual and linguistic processing to answer basic 'common sense' questions about given images. Given an image and a natural-language question about it, a VQA system must produce the correct answer by combining visual elements of the image with inference over the question text. In this survey, we cover and discuss recent datasets released in the VQA domain, which address various question formats and probe the robustness of machine-learning models. Next, we discuss new deep learning models that have shown promising results on these VQA datasets. Finally, we present and discuss results we computed with the vanilla VQA model, the Stacked Attention Network, and the VQA Challenge 2017 winner model, along with a detailed analysis of the remaining challenges and future research directions.
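The vanilla VQA pipeline mentioned above fuses an image embedding and a question embedding into a joint representation, then classifies over a fixed answer vocabulary. The following is a minimal NumPy sketch of that fusion step; the dimensions, the random "features" standing in for CNN and LSTM outputs, and the element-wise product fusion are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions chosen for illustration only.
D_IMG, D_Q, D_JOINT, N_ANSWERS = 4096, 1024, 512, 1000

# Stand-in features: in a real system these would come from a CNN
# (e.g. a VGGNet fc7 layer) and an LSTM run over the question words.
img_feat = rng.standard_normal(D_IMG)
q_feat = rng.standard_normal(D_Q)

# Randomly initialised projections into a common joint space
# (in practice these weights are learned).
W_img = rng.standard_normal((D_JOINT, D_IMG)) * 0.01
W_q = rng.standard_normal((D_JOINT, D_Q)) * 0.01
W_out = rng.standard_normal((N_ANSWERS, D_JOINT)) * 0.01

def l2_normalize(x):
    """Scale a vector to unit L2 norm."""
    return x / np.linalg.norm(x)

# Project each modality, then fuse with an element-wise product.
joint = np.tanh(W_img @ l2_normalize(img_feat)) * np.tanh(W_q @ q_feat)

# Classify the answer with a numerically stable softmax
# over the top-N most frequent answers.
logits = W_out @ joint
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_answer_index = int(np.argmax(probs))
print(probs.shape)  # (1000,)
```

Attention-based models such as the Stacked Attention Network replace the single global image feature here with a spatial feature map that the question embedding attends over, typically in multiple rounds.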
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Cite this paper
Srivastava, Y., Murali, V., Dubey, S.R., Mukherjee, S. (2021). Visual Question Answering Using Deep Learning: A Survey and Performance Analysis. In: Singh, S.K., Roy, P., Raman, B., Nagabhushan, P. (eds) Computer Vision and Image Processing. CVIP 2020. Communications in Computer and Information Science, vol 1377. Springer, Singapore. https://doi.org/10.1007/978-981-16-1092-9_7
DOI: https://doi.org/10.1007/978-981-16-1092-9_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1091-2
Online ISBN: 978-981-16-1092-9
eBook Packages: Computer Science (R0)