
SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant

  • Conference paper
  • Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Recent advances in vision-language models have shown notable generalization across broad tasks through visual instruction tuning. However, bridging the gap between the pre-trained vision encoder and the large language model (LLM) has become the bottleneck of the whole network. To improve cross-modality alignment, existing works typically gather more visual instruction data covering a broader range of vision tasks and fine-tune the model for question answering; such data, however, are costly to obtain and do not thoroughly exploit the rich contextual information contained in images. This paper is a first attempt to harness this overlooked context within visual instruction data, training the model in a self-supervised manner to learn how to ask high-quality questions. To this end, we introduce a novel framework named SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. SQ-LLaVA generates flexible and meaningful image-related questions by analyzing visual clues and prior language knowledge, signifying an advanced level of generalized visual understanding. Moreover, fine-tuning SQ-LLaVA on higher-quality instruction data yields a performance improvement over traditional visual instruction tuning methods. This improvement highlights the efficacy of self-questioning techniques in achieving a deeper and more nuanced comprehension of visual content across various contexts. Our code is available at https://github.com/heliossun/SQ-LLaVA.
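To make the self-questioning idea concrete, the sketch below shows one plausible way a standard visual-instruction record could be reused so that the questions themselves, rather than only the answers, become the prediction target. This is a minimal illustration, not the authors' released implementation (see the GitHub repository above for that): the prompt wording, the "USER"/"ASSISTANT" role tags, the "<image>" placeholder, and all function names are assumptions made here for clarity.

# Minimal sketch (assumptions noted above): turning an answer-supervised
# visual instruction record into a self-questioning training sample.
from dataclasses import dataclass
from typing import List, Tuple

IMAGE_TOKEN = "<image>"  # placeholder later replaced by visual embeddings
SELF_Q_INSTRUCTION = "Ask a few meaningful questions about this image."

@dataclass
class VisualInstructionRecord:
    image_path: str
    qa_pairs: List[Tuple[str, str]]  # (question, answer) turns from the dataset

def build_answering_sample(rec: VisualInstructionRecord) -> str:
    """Conventional visual instruction tuning: the model is supervised to answer."""
    turns = [f"USER: {IMAGE_TOKEN}\n{q}\nASSISTANT: {a}" for q, a in rec.qa_pairs]
    return "\n".join(turns)

def build_self_questioning_sample(rec: VisualInstructionRecord) -> str:
    """Self-questioning variant: the same record is reused, but the questions
    become the target, so the model learns to ask about the image."""
    questions = "\n".join(q for q, _ in rec.qa_pairs)
    return f"USER: {IMAGE_TOKEN}\n{SELF_Q_INSTRUCTION}\nASSISTANT: {questions}"

if __name__ == "__main__":
    rec = VisualInstructionRecord(
        image_path="example.jpg",
        qa_pairs=[
            ("What animal is in the photo?", "A golden retriever."),
            ("Where is it sitting?", "On a wooden porch."),
        ],
    )
    print(build_answering_sample(rec))
    print("---")
    print(build_self_questioning_sample(rec))

In this reading, no extra annotation is needed: the existing question-answer pairs already carry the contextual signal, and the second template simply exposes it as a learning objective alongside ordinary answering.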



Author information


Corresponding author

Correspondence to Guohao Sun.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 1078 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sun, G., Qin, C., Wang, J., Chen, Z., Xu, R., Tao, Z. (2025). SQ-LLaVA: Self-Questioning for Large Vision-Language Assistant. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15067. Springer, Cham. https://doi.org/10.1007/978-3-031-72673-6_9

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-72673-6_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72672-9

  • Online ISBN: 978-3-031-72673-6

  • eBook Packages: Computer Science (R0)
