Abstract
Recent advances in vision-language models have shown notable generalization across a broad range of tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model (LLM) remains the bottleneck of the whole network. To improve cross-modality alignment, existing works typically fine-tune the model for question answering on additional visual instruction data covering a broader range of vision tasks; such data, however, is costly to obtain and still leaves the rich contextual information contained in images under-explored. This paper makes a first attempt to harness this overlooked context within visual instruction data by training the model, in a self-supervised manner, to ask high-quality image-related questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues together with prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts. Our code is available at https://github.com/heliossun/SQ-LLaVA.
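To make the self-questioning idea concrete, the sketch below shows one hedged way to turn existing visual instruction data into self-questioning training samples: each human-written question is reused as a supervised target, so the model learns to ask questions about an image without any extra annotation. The conversation layout, the "[vqa]"/"[sq]" task prompts, and the build_training_turns helper are illustrative assumptions for this sketch, not the exact tokens or code used in SQ-LLaVA.

```python
# Hedged sketch of the self-questioning objective described in the abstract:
# reuse the questions already present in visual instruction data as training
# targets, so the model learns to *ask* image-related questions in addition
# to answering them. Prompts and field names below are assumptions.

from typing import Dict, List


def build_training_turns(sample: Dict) -> List[Dict]:
    """Turn one (image, question, answer) record into two supervised turns."""
    turns = []

    # Standard visual instruction tuning: the question is the prompt,
    # the answer is the supervised target.
    turns.append({
        "image": sample["image"],
        "prompt": f"[vqa] {sample['question']}",
        "target": sample["answer"],
    })

    # Self-questioning turn: a fixed instruction is the prompt and the
    # human-written question becomes the supervised target, so no extra
    # annotation is needed beyond the original instruction data.
    turns.append({
        "image": sample["image"],
        "prompt": "[sq] Ask a meaningful question about this image.",
        "target": sample["question"],
    })
    return turns


if __name__ == "__main__":
    record = {
        "image": "coco/000000001234.jpg",
        "question": "What is the man on the left holding?",
        "answer": "A red umbrella.",
    }
    for turn in build_training_turns(record):
        print(turn["prompt"], "->", turn["target"])
```

In this setup, both turn types are mixed into the same fine-tuning stream, so the question-asking objective comes essentially for free from data the model already trains on.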
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sun, G., Qin, C., Wang, J., Chen, Z., Xu, R., Tao, Z. (2025). SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_9
DOI: https://doi.org/10.1007/978-3-031-72673-6_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72672-9
Online ISBN: 978-3-031-72673-6
eBook Packages: Computer Science, Computer Science (R0)