Abstract
We propose a novel and challenging benchmark, AutoEval-Video, to comprehensively evaluate large vision-language models in open-ended video question answering. The comprehensiveness of AutoEval-Video is demonstrated in two aspects: 1) AutoEval-Video constructs open-ended video questions across 9 skill dimensions, addressing capabilities of perception, comprehension, and generation; 2) AutoEval-Video contains newly collected videos that cover over 40 distinct themes. To efficiently evaluate responses to the open-ended questions, we employ an LLM-based evaluation approach, but instead of merely providing a reference answer, we annotate unique evaluation rules for every single instance (video-question pair). To maximize the robustness of these rules, we develop a novel adversarial annotation mechanism. Using the instance-specific rules as the prompt, GPT-4, acting as an automatic evaluator, achieves a stable evaluation accuracy of around 97.0%, comparable to the 94.9%–97.5% accuracy of human evaluators. Furthermore, we assess the performance of eleven large vision-language models on AutoEval-Video. Among them, GPT-4V(ision) significantly outperforms the other models, achieving an accuracy of 32.2%. However, there is still substantial room for improvement compared to the human accuracy of 72.8%. Through an extensive case study, we uncover several drawbacks of GPT-4V, such as limited temporal and dynamic comprehension and overly general responses. Code is available at https://github.com/Xiuyuan-Chen/AutoEval-Video.
X. Chen—Work done during an internship at ByteDance Research.
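As a rough illustration of the rule-based judging described in the abstract, the following is a minimal sketch of prompting GPT-4 with an instance-specific evaluation rule to grade a model's open-ended answer. The prompt wording, the `judge_response` helper, and the example rule are illustrative assumptions of ours (written against the OpenAI Python SDK), not the paper's released evaluation code; only the general idea (question + instance-specific rules + candidate answer fed to GPT-4, which returns a verdict) follows the description above.

```python
# Illustrative sketch of rule-based LLM judging; not the paper's exact prompt or code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_response(question: str, evaluation_rules: str, model_answer: str) -> str:
    """Ask GPT-4 to grade an open-ended answer against instance-specific rules.

    Returns the judge's verdict text (expected to contain 'correct' or 'incorrect').
    """
    system_prompt = (
        "You are an evaluator for open-ended video question answering. "
        "Judge the candidate answer strictly according to the evaluation rules "
        "for this specific video-question pair. Reply with 'correct' or "
        "'incorrect' and a one-sentence justification."
    )
    user_prompt = (
        f"Question: {question}\n"
        f"Evaluation rules: {evaluation_rules}\n"
        f"Candidate answer: {model_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,  # deterministic judging for stable evaluation accuracy
    )
    return response.choices[0].message.content


# Hypothetical instance (video, question, and rule are made up for illustration):
# verdict = judge_response(
#     question="Why does the dog run away at the end of the video?",
#     evaluation_rules="Mark correct only if the answer attributes it to the firework going off.",
#     model_answer="The loud firework startles the dog, so it runs away.",
# )
```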
Notes
1. The videos collected from YouTube comply with the Creative Commons License (https://support.google.com/youtube/answer/2797468).
Acknowledgment
We would like to express our sincere gratitude to the reviewers of ECCV 2024 for their insightful and constructive feedback. Their valuable comments have greatly contributed to improving the quality of our work.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, X., Lin, Y., Zhang, Y., Huang, W. (2025). AutoEval-Video: An Automatic Benchmark for Assessing Large Vision Language Models in Open-Ended Video Question Answering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15126. Springer, Cham. https://doi.org/10.1007/978-3-031-73113-6_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73112-9
Online ISBN: 978-3-031-73113-6
eBook Packages: Computer Science, Computer Science (R0)