DOI: 10.1145/3627631.3627670
Research article · Open access

S-VQA: Sentence-Based Visual Question Answering

Published: 31 January 2024

Abstract

A Visual Question Answering (VQA) system responds to a natural language question in the context of an image. The problem has primarily been formulated as classification, with a finite set of answers as the classes, so the generated response consists of a single word or a short phrase. However, this limits the linguistic capabilities of such a system. In contrast, this work presents Sentence-based VQA (S-VQA), which responds to questions with complete sentences as answers. The first contribution of this work is the construction of a dataset from the Task Directed Image Understanding Challenge (TDIUC) VQA dataset using natural language rules and pretrained paraphrasers; the new dataset is referred to as TDIUC-SVQA. The second contribution is a performance evaluation of multiple models on TDIUC-SVQA, carried out with two multi-modal models: the Bottom-Up Top-Down (BUTD) Attention VQA model combined with an LSTM decoder, and with an Attention-on-Attention network, for answer generation. The proposed models are observed to improve on the baseline model.
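To illustrate the dataset-construction step, the sketch below converts a TDIUC (question, short answer) pair into a full-sentence answer: a simple declarative template (a hypothetical rule of ours, not necessarily one of the paper's rules) produces a draft sentence, which the pretrained PEGASUS paraphraser [26] then rewrites for fluency; Gramformer [21] could be substituted or added for grammar correction.

```python
# Minimal sketch of the TDIUC -> TDIUC-SVQA conversion, assuming a
# template rule of our own plus the pretrained paraphraser cited as [26].
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

MODEL_NAME = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(MODEL_NAME)
model = PegasusForConditionalGeneration.from_pretrained(MODEL_NAME)

def sentence_answer(question: str, short_answer: str) -> str:
    # Hypothetical rule: restate the question-answer pair as a draft
    # declarative sentence, then paraphrase it into fluent English.
    draft = f"The answer to the question '{question}' is {short_answer}."
    batch = tokenizer([draft], truncation=True, padding="longest",
                      return_tensors="pt")
    out = model.generate(**batch, max_length=60, num_beams=5)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# e.g. ("What color is the cat?", "gray") -> a sentence like "The cat is gray."
```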
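On the modelling side, a minimal sketch of the generation head is given below: BUTD-style fused image-question features drive an LSTM decoder that emits the answer sentence token by token. The module names, the 2048-dimensional feature size, and the single-layer decoder are our assumptions for illustration, not the authors' exact architecture.

```python
# Sketch (assumption, not the authors' released code) of an LSTM decoder
# conditioned on a fused BUTD image-question feature vector.
import torch
import torch.nn as nn

class SentenceDecoder(nn.Module):
    def __init__(self, vocab_size: int, feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)  # fused feature -> h0
        self.init_c = nn.Linear(feat_dim, hidden)  # fused feature -> c0
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, fused_feat: torch.Tensor, answer_tokens: torch.Tensor):
        # fused_feat: (B, feat_dim) joint image-question representation
        # answer_tokens: (B, T) teacher-forced ground-truth sentence tokens
        h0 = self.init_h(fused_feat).unsqueeze(0)
        c0 = self.init_c(fused_feat).unsqueeze(0)
        emb = self.embed(answer_tokens)
        states, _ = self.lstm(emb, (h0, c0))
        return self.out(states)  # (B, T, vocab) next-token logits
```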

References

[1]
Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6077–6086.
[2]
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision. 2425–2433.
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, Ann Arbor, Michigan, 65–72.
[4]
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[5]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv abs/1810.04805 (2019).
[6]
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6904–6913.
[7]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
[8]
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (Nov. 1997), 1735–1780.
[9]
Lun Huang, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. Attention on Attention for Image Captioning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 4633–4642.
[10]
Kushal Kafle and Christopher Kanan. 2017. An analysis of visual question answering algorithms. In Proceedings of the IEEE international conference on computer vision. 1965–1973.
[11]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
[12]
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al. 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (2017), 32–73.
[13]
Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81.
[14]
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
[15]
Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical Question-Image Co-Attention for Visual Question Answering. ArXiv abs/1606.00061 (2016).
[16]
Lin Ma, Zhengdong Lu, and Hang Li. 2015. Learning to Answer Questions from Image Using Convolutional Neural Network. ArXiv abs/1506.00333 (2015).
[17]
Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19, 2 (1993), 313–330.
[18]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 311–318.
[19]
Liang Peng, Yang Yang, Zheng Wang, Zi Huang, and Heng Tao Shen. 2022. MRA-Net: Improving VQA Via Multi-Modal Relation Attention Network. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 1 (2022), 318–329.
[20]
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543.
[21]
Prithivida. 2021. Gramformer. https://huggingface.co/spaces/prithivida/gramformer. Accessed: 2023-06-29.
[22]
Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2015), 1137–1149.
[23]
Kevin J. Shih, Saurabh Singh, and Derek Hoiem. 2016. Where to Look: Focus Regions for Visual Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4613–4621.
[24]
Andrew Shin, Yoshitaka Ushiku, and Tatsuya Harada. 2016. The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering (FSVQA). arXiv preprint arXiv:1609.06657 (2016).
[25]
Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[26]
Tuner007. 2020. PEGASUS Paraphrase. https://huggingface.co/tuner007/pegasus_paraphrase. Accessed: 2023-06-29.
[27]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS).
[28]
Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4566–4575.
[29]
Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. 2016. Stacked Attention Networks for Image Question Answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 21–29.



Published In

ICVGIP '23: Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing
December 2023
352 pages
ISBN:9798400716256
DOI:10.1145/3627631
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of a national government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AoANet
  2. BUTD
  3. LSTM
  4. S-VQA
  5. TDIUC
  6. VQA

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICVGIP '23

Acceptance Rates

Overall Acceptance Rate 95 of 286 submissions, 33%


Article Metrics

  • Total Citations: 0
  • Total Downloads: 198
  • Downloads (last 12 months): 198
  • Downloads (last 6 weeks): 30

Reflects downloads up to 18 Dec 2024.

