Abstract
The problem of visual question answering (VQA) is of significant importance both as a challenging research question and for the rich set of applications it enables. In this context, however, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in VQA models that ignore visual information and thus convey an inflated sense of their capability. We propose to counter these language priors for the task of VQA and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset (Antol et al. 2015) by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at http://visualqa.org/ as part of the 2nd iteration of the VQA Dataset and Challenge (VQA v2.0). We further benchmark a number of state-of-the-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what was previously only a qualitative sense among practitioners. We also present interesting insights from an analysis of the participant entries in the VQA Challenge 2017, which we organized on the proposed VQA v2.0 dataset. The results of the challenge were announced at the 2nd VQA Challenge Workshop at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017. Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image but that it believes has a different answer to the same question. This can help in building trust for machines among their users.
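To make the pair-based construction concrete, below is a minimal Python sketch of how one might represent and score a balanced entry. This is not code or a data format from the paper; the class, field names, and the toy prior-only model are illustrative assumptions. It shows why a model that relies purely on language priors, i.e., predicts the same answer for both images in a pair, can answer at most one of the two complementary instances correctly.

from dataclasses import dataclass

@dataclass
class BalancedPair:
    # Illustrative representation (not the released annotation format):
    # one question paired with two similar images whose answers differ.
    question: str
    image_id_a: int   # original VQA image
    image_id_b: int   # complementary image collected during balancing
    answer_a: str
    answer_b: str     # differs from answer_a by construction

def pair_accuracy(model_answer, pair: BalancedPair) -> float:
    # Fraction of the two complementary (image, question) instances answered
    # correctly; a model that ignores the image gets at most 0.5 per pair.
    correct = 0
    correct += model_answer(pair.image_id_a, pair.question) == pair.answer_a
    correct += model_answer(pair.image_id_b, pair.question) == pair.answer_b
    return correct / 2.0

# A hypothetical prior-only model that always answers "yes":
pair = BalancedPair("Is the umbrella open?", 1, 2, "yes", "no")
prior_only = lambda image_id, question: "yes"
print(pair_accuracy(prior_only, pair))  # 0.5

Scored over complementary pairs in this way, blind reliance on question statistics is penalized, whereas a model that actually attends to the two images can answer both instances correctly.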
Notes
Note that this entry is a single model and does not use pretrained word embeddings or data augmentation, unlike the winning entry in the VQA Challenge 2016, which was an ensemble of 7 such MCB models trained with pretrained GloVe (Pennington et al. 2014) embeddings and data augmentation from the Visual Genome dataset (Krishna et al. 2016). These three factors lead to a 2–3% increase in performance.
It could easily also convey what color it thinks the fire-hydrant is in the counter-example. We will explore this in future work.
In practice, the answer to be explained would be the answer \(A_{pred}\) predicted in the first step. However, we only have access to negative explanation annotations from humans for the ground-truth answer A to the question. Providing A to the explanation module also helps in evaluating the two steps of answering and explaining separately.
Note that in theory, one could provide \(A_{pred}\) as input during training instead of A. After all, this matches the expected use case scenario at test time. However, this alternate setup (where \(A_{pred}\) is provided as input instead of A) leads to a peculiar and unnatural explanation training goal: the explanation head would still be learning to explain A, since that is the answer for which we collected negative explanation annotations from humans. It is simply unnatural to build a model that answers a question with \(A_{pred}\) but learns to explain a different answer A! Note that this is an interesting scenario where the current push towards “end-to-end” training for everything breaks down.
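For concreteness, here is a minimal PyTorch-style sketch of the two-step answering-and-explaining setup described above. It is a sketch under our own assumptions, not the implementation from the paper: the module names, feature dimensions, and number of candidate counter-example images are illustrative. The key point it encodes is that the explanation head is conditioned on the ground-truth answer A at training time, since the negative explanation annotations were collected for A.

import torch
import torch.nn as nn

class AnswerAndExplain(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, ans_vocab=3000, hid=512):
        super().__init__()
        # Step 1: answering head maps (image, question) features to answer scores.
        self.answer_head = nn.Sequential(
            nn.Linear(img_dim + q_dim, hid), nn.ReLU(), nn.Linear(hid, ans_vocab))
        # Step 2: explanation head scores candidate counter-example images,
        # conditioned on an embedding of the answer being explained.
        self.ans_embed = nn.Embedding(ans_vocab, hid)
        self.explain_head = nn.Sequential(
            nn.Linear(img_dim + q_dim + hid, hid), nn.ReLU(), nn.Linear(hid, 1))

    def forward(self, img_feat, q_feat, cand_img_feats, gt_answer_idx):
        # Answering: standard classification over the answer vocabulary.
        ans_scores = self.answer_head(torch.cat([img_feat, q_feat], dim=-1))
        # Explaining: rank K candidate images, conditioned on ground-truth A
        # (not A_pred) at training time, as discussed in the note above.
        a = self.ans_embed(gt_answer_idx)                          # (B, hid)
        B, K, _ = cand_img_feats.shape
        ctx = torch.cat([q_feat, a], dim=-1).unsqueeze(1).expand(B, K, -1)
        expl_scores = self.explain_head(
            torch.cat([cand_img_feats, ctx], dim=-1)).squeeze(-1)  # (B, K)
        return ans_scores, expl_scores

# Shape check with random features and an illustrative K = 24 candidates:
m = AnswerAndExplain()
ans, expl = m(torch.randn(4, 2048), torch.randn(4, 1024),
              torch.randn(4, 24, 2048), torch.randint(0, 3000, (4,)))
print(ans.shape, expl.shape)  # torch.Size([4, 3000]) torch.Size([4, 24])

In a deployed system the answer fed to the explanation head would be \(A_{pred}\); as the notes above explain, A is used here because the negative explanation annotations were collected for it and because it allows the answering and explaining steps to be evaluated separately.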
References
Agrawal, A., Batra, D., & Parikh, D. (2016). Analyzing the behavior of visual question answering models. In EMNLP.
Agrawal, A., Batra, D., Parikh, D., & Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In CVPR.
Agrawal, A., Kembhavi, A., Batra, D., & Parikh, D. (2017). C-VQA: A compositional split of the visual question answering (VQA) v1.0 dataset. arXiv preprint arXiv:1704.08243.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., et al. (2018). Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016). Neural module networks. In CVPR.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., et al. (2015). VQA: Visual question answering. In ICCV.
Berg, T., & Belhumeur, P. N. (2013). How do you tell a blackbird from a crow? In ICCV.
Chen, X., & Zitnick, C. L. (2015). Mind’s eye: A recurrent visual representation for image caption generation. In CVPR.
Devlin, J., Gupta, S., Girshick, R. B., Mitchell, M., & Zitnick, C. L. (2015). Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467.
Doersch, C., Singh, S., Gupta, A., Sivic, J., & Efros, A. A. (2012). What makes Paris look like Paris? ACM Transactions on Graphics (SIGGRAPH), 31(4), 101:1–101:9.
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., et al. (2015). Long-term recurrent convolutional networks for visual recognition and description. In CVPR.
Fang, H., Gupta, S., Iandola, F. N., Srivastava, R., Deng, L., Dollár, P., et al. (2015). From captions to visual concepts and back. In CVPR.
Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In EMNLP.
Gao, H., Mao, J., Zhou, J., Huang, Z., & Yuille, A. (2015). Are you talking to a machine? dataset and methods for multilingual image question answering. In NIPS.
Goyal, Y., Mohapatra, A., Parikh, D., & Batra, D. (2016). Towards transparent AI systems: Interpreting visual question answering models. In ICML workshop on visualization for deep learning.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
Hendricks, L. A., Akata, Z., Rohrbach, M., Donahue, J., Schiele, B., & Darrell, T. (2016). Generating visual explanations. In ECCV.
Hodosh, M., & Hockenmaier, J. (2016). Focused evaluation for image description with binary forced-choice tasks. In Workshop on vision and language, annual meeting of the association for computational linguistics.
Hu, R., Andreas, J., Rohrbach, M., Darrell, T., & Saenko, K. (2017). Learning to reason: End-to-end module networks for visual question answering. In ICCV.
Ilievski, I., Yan, S., & Feng, J. (2016). A focused dynamic attention model for visual question answering. arXiv preprint arXiv:1604.01485.
Jabri, A., Joulin, A., & van der Maaten, L. (2016). Revisiting visual question answering baselines. In ECCV.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., & Girshick, R. (2017). Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR.
Kafle, K., & Kanan, C. (2016a). Answer-type prediction for visual question answering. In CVPR.
Kafle, K., & Kanan, C. (2016b). Visual question answering: Datasets, algorithms, and future challenges. arXiv preprint arXiv:1610.01465.
Kafle, K., & Kanan, C. (2017). An analysis of visual question answering algorithms. In ICCV.
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In CVPR.
Kim, J. H., Lee, S. W., Kwak, D. H., Heo, M. O., Kim, J., Ha, J. W., et al. (2016). Multimodal residual learning for visual QA. In NIPS.
Kim, J. H., On, K. W., Lim, W., Kim, J., Ha, J. W., & Zhang, B. T. (2017). Hadamard product for low-rank bilinear pooling. In ICLR.
Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2015). Unifying visual-semantic embeddings with multimodal neural language models. In TACL.
Koh, P. W., & Liang, P. (2017). Understanding black-box predictions via influence functions. In ICML.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., et al. (2016). Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft COCO: Common objects in context. In ECCV.
Lu, J., Lin, X., Batra, D., & Parikh, D. (2015). Deeper LSTM and normalized CNN visual question answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN. Accessed 1 Sep 2017.
Lu, J., Yang, J., Batra, D., & Parikh, D. (2016). Hierarchical question-image co-attention for visual question answering. In NIPS.
Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In NIPS.
Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In ICCV.
Mao, J., Xu, W., Yang, Y., Wang, J., & Yuille, A. L. (2014). Explain images with multimodal recurrent neural networks. In NIPS.
Noh, H., & Han, B. (2016). Training recurrent answering units with joint loss minimization for vqa. arXiv preprint arXiv:1606.03647.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP.
Ray, A., Christie, G., Bansal, M., Batra, D., & Parikh, D. (2016). Question relevance in VQA: Identifying non-visual and false-premise questions. In EMNLP.
Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In NIPS.
Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. In Knowledge discovery and data mining (KDD).
Saito, K., Shin, A., Ushiku, Y., & Harada, T. (2016). Dualnet: Domain-invariant network for visual question answering. arXiv preprint arXiv:1606.06108.
Selvaraju, R. R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., & Batra, D. (2016). Grad-CAM: Why did you say that? Visual explanations from deep networks via gradient-based localization. arXiv preprint arXiv:1610.02391.
Shih, K. J., Singh, S., & Hoiem, D. (2016). Where to look: Focus regions for visual question answering. In CVPR.
Shin, A., Ushiku, Y., & Harada, T. (2016). The color of the cat is gray: 1 Million full-sentences visual question answering (FSVQA). arXiv preprint arXiv:1609.06657.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In ICLR.
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. In CVPR.
Teney, D., Anderson, P., He, X., & van den Hengel, A. (2018). Tips and tricks for visual question answering: Learnings from the 2017 challenge. In CVPR.
Torralba, A., & Efros, A. (2011). Unbiased look at dataset bias. In CVPR.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In CVPR.
Wang, P., Wu, Q., Shen, C., van den Hengel, A., & Dick, A. R. (2015). Explicit knowledge-based reasoning for visual question answering. arXiv preprint arXiv:1511.02570.
Wu, Q., Wang, P., Shen, C., van den Hengel, A., & Dick, A. R. (2016). Ask me anything: Free-form visual question answering based on knowledge from external sources. In CVPR.
Xiong, C., Merity, S., & Socher, R. (2016). Dynamic memory networks for visual and textual question answering. In ICML.
Xu, H., & Saenko, K. (2016). Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV.
Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In CVPR.
Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual madlibs: Fill-in-the-blank description generation and question answering. In ICCV.
Yu, Z., Yu, J., Fan, J., & Tao, D. (2017). Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV.
Zhang, P., Goyal, Y., Summers-Stay, D., Batra, D., & Parikh, D. (2016). Yin and Yang: Balancing and answering binary visual questions. In CVPR.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2015). Learning deep features for discriminative localization. In CVPR.
Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., & Fergus, R. (2015). Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167.
Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7w: Grounded question answering in images. In CVPR.
Acknowledgements
We thank Anitha Kannan for helpful discussions. This work was funded in part by NSF CAREER awards to DP and DB, an ONR YIP award to DP, ONR Grant N00014-14-1-0679 to DB, a Sloan Fellowship to DP, ARO YIP awards to DB and DP, an Allen Distinguished Investigator award to DP from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to DB and DP, Google Faculty Research Awards to DP and DB, Amazon Academic Research Awards to DP and DB, AWS in Education Research Grant to DB, and NVIDIA GPU donations to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.
Additional information
Communicated by Svetlana Lazebnik.
Yash Goyal, Tejas Khot, and Aishwarya Agrawal: Part of this work was done while at Virginia Tech, Blacksburg, VA, USA.
About this article
Cite this article
Goyal, Y., Khot, T., Agrawal, A. et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. Int J Comput Vis 127, 398–414 (2019). https://doi.org/10.1007/s11263-018-1116-0