
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu


Abstract
Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI models capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that (1) machines struggle to capture visual cues: GPT-4-O achieved 78.5% accuracy, while humans reached 98.0%; models also performed 19.5% worse at distinguishing irrelevant objects within the image than at distinguishing external objects. (2) Providing the relevant visual premises improved model performance significantly.
Anthology ID:
2024.emnlp-main.143
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
2423–2451
URL:
https://aclanthology.org/2024.emnlp-main.143/
DOI:
10.18653/v1/2024.emnlp-main.143
Cite (ACL):
Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, and Youngjae Yu. 2024. Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2423–2451, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding (Chung et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.143.pdf
Software:
2024.emnlp-main.143.software.zip
Data:
2024.emnlp-main.143.data.zip