Article

VQA: Visual Question Answering

Authors:

Devi Parikh

ICCV '15: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV)

Pages 2425 - 2433

https://doi.org/10.1109/ICCV.2015.279

Published: 07 December 2015

Abstract

We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
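The claim that short open-ended answers lend themselves to automatic evaluation can be illustrated with a minimal sketch. It is not taken from the text above: the function names, the normalization rules, and the `min_agree` threshold of 3 are assumptions chosen for illustration. The idea is to normalize a predicted answer, count how many human-provided answers it matches exactly, and give full credit once enough annotators agree.

```python
import string
from typing import Sequence

# Translation table that deletes ASCII punctuation (illustrative normalization choice).
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(answer.lower().translate(_PUNCT_TABLE).split())


def consensus_accuracy(
    predicted: str,
    human_answers: Sequence[str],
    min_agree: int = 3,  # assumed threshold, not stated in the abstract
) -> float:
    """Score a prediction by how many human answers it matches exactly.

    Returns min(matches / min_agree, 1.0), so a prediction earns full
    credit once at least `min_agree` annotators gave the same answer.
    """
    pred = normalize(predicted)
    matches = sum(1 for a in human_answers if normalize(a) == pred)
    return min(matches / min_agree, 1.0)


if __name__ == "__main__":
    humans = ["yes", "Yes", "no", "yes!"]
    print(consensus_accuracy("Yes.", humans))
    print(consensus_accuracy("no", humans))
```

Exact matching after normalization is only workable because most answers are a few words long; longer free-form answers would need a softer comparison.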




Information

Published in: ICCV '15: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), December 2015, 4730 pages

ISBN: 9781467383912

Publisher: IEEE Computer Society, United States


