DOI: 10.1145/3627673.3679963

Meta-Prompt Tuning Vision-Language Model for Multi-Label Few-Shot Image Recognition

Published: 21 October 2024

Abstract

Multi-label few-shot image recognition aims to identify multiple unseen objects using only a handful of examples. Recent methods typically tune pre-trained vision-language models with shared or class-specific prompts, but both strategies have drawbacks: a single shared prompt is insufficient for all samples, especially when tasks are complex, while tuning a specific prompt for each class inevitably sacrifices generalization, so neither captures diverse visual knowledge. To address these issues, we propose to meta-tune a generalized prompt pool, enabling each prompt to act as an expert for multi-label few-shot image recognition. Specifically, we first construct a diverse prompt pool to handle complex samples and tasks effectively. We then design a meta-tuning strategy that learns meta-knowledge and transfers it from source tasks to target tasks, enhancing the generalization of the prompts. Extensive experiments on two widely used multi-label image recognition datasets demonstrate the effectiveness of our method.
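
To make the described approach concrete, the sketch below illustrates one plausible way to combine a learnable prompt pool with an episodic meta-tuning loop on top of frozen CLIP-style image/text features. It is a minimal sketch under stated assumptions: the names PromptPool and meta_tune_episode, the attention-based top-k prompt selection, the residual fusion of prompt and image feature, and the simple first-order update scheme are all illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptPool(nn.Module):
    """A pool of learnable prompt vectors; each prompt can specialize like an
    'expert'. Prompts are selected per image via attention between the image
    feature and learnable prompt keys. (Hypothetical design, not the authors'
    released code.)"""

    def __init__(self, pool_size: int, prompt_dim: int, top_k: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(pool_size, prompt_dim) * 0.02)
        self.prompts = nn.Parameter(torch.randn(pool_size, prompt_dim) * 0.02)
        self.top_k = top_k

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, D). Score each prompt key against the image feature
        # and aggregate the top-k prompts into one conditioning vector.
        scores = F.normalize(image_feat, dim=-1) @ F.normalize(self.keys, dim=-1).T  # (B, P)
        top_scores, idx = scores.topk(self.top_k, dim=-1)                            # (B, k)
        weights = top_scores.softmax(dim=-1).unsqueeze(-1)                           # (B, k, 1)
        selected = self.prompts[idx]                                                 # (B, k, D)
        return (weights * selected).sum(dim=1)                                       # (B, D)


def multilabel_logits(image_feat, class_text_feat, prompt_vec, scale=100.0):
    """Fuse a frozen image feature with the pooled prompt and score it against
    frozen per-class text embeddings; one sigmoid logit per class."""
    fused = F.normalize(image_feat + prompt_vec, dim=-1)   # (B, D)
    text = F.normalize(class_text_feat, dim=-1)            # (C, D)
    return scale * fused @ text.T                          # (B, C)


def meta_tune_episode(pool, optimizer, support, query, class_text_feat, inner_steps=1):
    """One meta-tuning episode on a source task: adapt the pool on the support
    set, then update it again on the query set so the prompts generalize
    (a simple first-order scheme, assumed here rather than the paper's exact rule)."""
    sup_x, sup_y = support   # frozen image features (Ns, D), multi-hot labels (Ns, C)
    qry_x, qry_y = query
    for _ in range(inner_steps):
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(
            multilabel_logits(sup_x, class_text_feat, pool(sup_x)), sup_y)
        loss.backward()
        optimizer.step()
    optimizer.zero_grad()
    qry_loss = F.binary_cross_entropy_with_logits(
        multilabel_logits(qry_x, class_text_feat, pool(qry_x)), qry_y)
    qry_loss.backward()
    optimizer.step()
    return qry_loss.item()


if __name__ == "__main__":
    # Toy episode with random stand-ins for frozen CLIP-style features.
    D, C = 512, 20
    pool = PromptPool(pool_size=16, prompt_dim=D)
    opt = torch.optim.AdamW(pool.parameters(), lr=1e-3)
    text_feat = torch.randn(C, D)                                   # frozen class text embeddings
    support = (torch.randn(8, D), torch.randint(0, 2, (8, C)).float())
    query = (torch.randn(8, D), torch.randint(0, 2, (8, C)).float())
    print("query loss:", meta_tune_episode(pool, opt, support, query, text_feat))
```

In practice the image and class text features would come from a frozen pre-trained vision-language encoder, and episodes would be sampled from source tasks so that the tuned prompt pool transfers to unseen target classes, as the abstract describes.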




    Published In

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management
    October 2024, 5705 pages
    ISBN: 9798400704369
    DOI: 10.1145/3627673


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. few-shot learning
    2. meta-prompt learning
    3. multi-label image recognition

    Qualifiers

    • Short-paper

    Funding Sources

    • Science and Technology Project of State Grid Corporation of China

    Conference

    CIKM '24

    Acceptance Rates

    Overall Acceptance Rate 1,861 of 8,427 submissions, 22%



