DOI: 10.1145/3627631.3627670
Research article · Open access

S-VQA: Sentence-Based Visual Question Answering

Published: 31 January 2024

Abstract

A Visual Question Answering (VQA) system responds to a natural language question in the context of an image. The problem has primarily been formulated as classification, with a finite set of answers as the classes, so the generated response consists of a single word or a short phrase. However, this limits the linguistic capabilities of such a system. In contrast, this work presents Sentence-based VQA (S-VQA), which responds to questions with complete sentences as answers. The first contribution of this work is the construction of a dataset from the Task Directed Image Understanding Challenge (TDIUC) VQA dataset using natural language rules and pretrained paraphrasers; the new dataset is referred to as TDIUC-SVQA. The second contribution is a performance evaluation of multiple models on TDIUC-SVQA, carried out with two multi-modal models: the Bottom-Up Top-Down (BUTD) Attention VQA model combined with an LSTM decoder, and with an Attention-on-Attention network, for answer generation. The proposed models are observed to improve on the baseline model.
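To illustrate the dataset-construction step, the sketch below converts a TDIUC (question, short answer) pair into a full-sentence answer: a simple declarative template (a hypothetical rule of ours, not necessarily one of the paper's rules) produces a draft sentence, which the pretrained PEGASUS paraphraser [26] then rewrites for fluency; Gramformer [21] could be substituted or added for grammar correction.

```python
# Minimal sketch of the TDIUC -> TDIUC-SVQA conversion, assuming a
# template rule of our own plus the pretrained paraphraser cited as [26].
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)

def sentence_answer(question: str, short_answer: str) -> str:
    # Hypothetical rule: restate the question-answer pair as a draft
    # declarative sentence, then paraphrase it into fluent English.
    draft = f"The answer to the question '{question}' is {short_answer}."
    batch = tokenizer([draft], truncation=True, padding="longest",
                      return_tensors="pt")
    out = model.generate(**batch, max_length=60, num_beams=5)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# e.g. ("What color is the cat?", "gray") -> a sentence like "The cat is gray."
```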
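On the modelling side, a minimal sketch of the generation head is given below: BUTD-style fused image-question features drive an LSTM decoder that emits the answer sentence token by token. The module names, the 2048-dimensional feature size, and the single-layer decoder are our assumptions for illustration, not the authors' exact architecture.

```python
# Sketch (assumption, not the authors' released code) of an LSTM decoder
# conditioned on a fused BUTD image-question feature vector.
import torch
import torch.nn as nn

class SentenceDecoder(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)  # fused feature -> h0
        self.init_c = nn.Linear(feat_dim, hidden)  # fused feature -> c0
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, fused_feat: torch.Tensor, answer_tokens: torch.Tensor):
        # fused_feat: (B, feat_dim) joint image-question representation
        # answer_tokens: (B, T) teacher-forced ground-truth sentence tokens
        h0 = self.init_h(fused_feat).unsqueeze(0)
        c0 = self.init_c(fused_feat).unsqueeze(0)
        emb = self.embed(answer_tokens)
        states, _ = self.lstm(emb, (h0, c0))
        return self.out(states)  # (B, T, vocab) next-token logits
```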

References

[1]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6077–6086.
[2]
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425–2433.
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72.
[4]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[5]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805 (2019).
[6]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904–6913.
[7]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[8]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
[9]
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on Attention for Image Captioning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 4633–4642.
[10]
Kushal Kafle and Christopher Kanan. 2017. An analysis of visual question answering algorithms. In Proceedings of the IEEE international conference on computer vision. 1965–1973.
[11]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
[12]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (2017), 32–73.
[13]
Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81.
[14]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[15]
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. ArXiv abs/1606.00061 (2016).
[16]
Lin Ma, Zhengdong Lu, and Hang Li. 2015. Learning to Answer Questions from Image Using Convolutional Neural Network. ArXiv abs/1506.00333 (2015).
[17]
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19, 2 (1993), 313–330.
[18]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318.
[19]
Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. 2022. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1 (2022), 318–329.
[20]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[21]
Prithivida. 2021. Gramformer. https://huggingface.co/spaces/prithivida/gramformer. Accessed: 2023-06-29.
[22]
Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137–1149.
[23]
Kevin J. Shih, Saurabh Singh, and Derek Hoiem. 2016. Where to Look: Focus Regions for Visual Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4613–4621.
[24]
Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. 2016. The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA). arXiv preprint arXiv:1609.06657 (2016).
[25]
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[26]
Tuner007. 2020. PEGASUS Paraphrase. https://huggingface.co/tuner007/pegasus_paraphrase. Accessed: 2023-06-29.
[27]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS).
[28]
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4566–4575.
[29]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked Attention Networks for Image Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 21–29.



Published In

ICVGIP '23: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing
December 2023
352 pages
ISBN:9798400716256
DOI:10.1145/3627631
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AoANet
  2. BUTD
  3. LSTM
  4. S-VQA
  5. TDIUC
  6. VQA

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP '23

Acceptance Rates

Overall Acceptance Rate 95 of 286 submissions, 33%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 198
  • Downloads (last 12 months): 198
  • Downloads (last 6 weeks): 30

Reflects downloads up to 18 Dec 2024.

