Abstract
Recent advances in vision-language models have shown notable generalization across a broad range of tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model (LLM) remains the bottleneck of the whole network. To improve cross-modality alignment, existing works typically fine-tune the model for question answering on additional visual instruction data covering a broader range of vision tasks; such data, however, is costly to obtain and still leaves the rich contextual information contained in images under-explored. This paper makes a first attempt to harness this overlooked context within visual instruction data by training the model, in a self-supervised manner, to ask high-quality image-related questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues together with prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts. Our code is available at https://github.com/heliossun/SQ-LLaVA.
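To make the self-questioning idea concrete, the sketch below shows one hedged way to turn existing visual instruction data into self-questioning training samples: each human-written question is reused as a supervised target, so the model learns to ask questions about an image without any extra annotation. The conversation layout, the "[vqa]"/"[sq]" task prompts, and the build_training_turns helper are illustrative assumptions for this sketch, not the exact tokens or code used in SQ-LLaVA.

```python
# Hedged sketch of the self-questioning objective described in the abstract:
# reuse the questions already present in visual instruction data as training
# targets, so the model learns to *ask* image-related questions in addition
# to answering them. Prompts and field names below are assumptions.

from typing import Dict, List


def build_training_turns(sample: Dict) -> List[Dict]:
    """Turn one (image, question, answer) record into two supervised turns."""
    turns = []

    # Standard visual instruction tuning: the question is the prompt,
    # the answer is the supervised target.
    turns.append({
        "image": sample["image"],
        "prompt": f"[vqa] {sample['question']}",
        "target": sample["answer"],
    })

    # Self-questioning turn: a fixed instruction is the prompt and the
    # human-written question becomes the supervised target, so no extra
    # annotation is needed beyond the original instruction data.
    turns.append({
        "image": sample["image"],
        "prompt": "[sq] Ask a meaningful question about this image.",
        "target": sample["question"],
    })
    return turns


if __name__ == "__main__":
    record = {
        "image": "coco/000000001234.jpg",
        "question": "What is the man on the left holding?",
        "answer": "A red umbrella.",
    }
    for turn in build_training_turns(record):
        print(turn["prompt"], "->", turn["target"])
```

In this setup, both turn types are mixed into the same fine-tuning stream, so the question-asking objective comes essentially for free from data the model already trains on.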
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, G., Qin, C., Wang, J., Chen, Z., Xu, R., Tao, Z. (2025). SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_9
DOI: https://doi.org/10.1007/978-3-031-72673-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72672-9
Online ISBN: 978-3-031-72673-6
eBook Packages: Computer Science, Computer Science (R0)