Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition

Cited: 5
|
Authors
Sun, Hongbo [1 ,2 ]
He, Xiangteng [1 ,2 ]
Zhou, Jiahuan [1 ,2 ]
Peng, Yuxin [1 ,2 ]
Affiliations
[1] Peking Univ, Wangxuan Inst Comp Technol, Beijing, Peoples R China
[2] Peking Univ, Natl Key Lab Multimedia Informat Proc, Beijing, Peoples R China
Funding
National Natural Science Foundation of China; National Key R&D Program of China;
Keywords
Fine-grained visual prompt learning; Vision-language models; Image recognition with few training samples;
DOI
10.1145/3581783.3612403
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Large-scale pre-trained vision-language (VL) models have shown powerful generic representation capabilities for adapting to downstream tasks with limited training data, offering data-efficient solutions to applications such as image recognition. To enhance adaptation performance, most existing methods introduce learnable vectors into the text prompt to generate adaptive classification weights for the classes in the downstream task. However, they generally focus on the text side while neglecting adaptive visual feature generation on the image side, which is insufficient to fit the downstream task data. In this paper, we propose fine-grained visual prompt learning (FG-VPL) of vision-language models for image recognition with few training samples. The main contributions are: (1) A fine-grained visual prompt is introduced into the image encoder of the vision-language model to focus on the target object and conduct information interaction within the object, which facilitates generating discriminative visual features for image recognition. (2) A two-pathway adaptive recognition module is proposed to narrow the domain gap and exploit both the cross-modal knowledge of the vision-language model and the visual information of the few-sample training set, classifying images with the help of feature adapters. We conduct extensive experiments on 11 image recognition benchmark datasets under the few-training-samples setting, which demonstrate that our proposed approach achieves state-of-the-art performance. The code is available at https://github.com/PKU-ICST-MIPL/FG-VPL_ACMMM2023.
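Since the abstract only summarizes the two-pathway adaptive recognition module, the following is a minimal, hypothetical PyTorch sketch of what such a head could look like: one pathway scores the visual feature against frozen text-derived class weights (cross-modal knowledge), the other passes it through a lightweight residual feature adapter trained on the few-shot set, and the two sets of logits are blended. All names (TwoPathwayHead, alpha, the adapter shape) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPathwayHead(nn.Module):
    """Hypothetical sketch of a two-pathway adaptive recognition head.

    Pathway 1 (cross-modal): cosine similarity between the image feature
    and frozen text-derived class weights from the VL model.
    Pathway 2 (visual): a small residual feature adapter fine-tuned on the
    few-shot training set, followed by the same similarity scoring.
    """

    def __init__(self, text_weights: torch.Tensor, feat_dim: int, alpha: float = 0.5):
        super().__init__()
        # Frozen text classification weights, shape (num_classes, feat_dim),
        # assumed to be embeddings of class-name prompts (L2-normalized here).
        self.register_buffer("text_weights", F.normalize(text_weights, dim=-1))
        # Lightweight bottleneck adapter (learnable), in the style of adapter-based tuning.
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 4, feat_dim),
        )
        self.alpha = alpha  # blending weight between the two pathways (assumed hyper-parameter)

    def forward(self, image_feat: torch.Tensor, logit_scale: float = 100.0) -> torch.Tensor:
        image_feat = F.normalize(image_feat, dim=-1)
        # Pathway 1: zero-shot-style cross-modal logits.
        cross_modal_logits = logit_scale * image_feat @ self.text_weights.t()
        # Pathway 2: adapted visual feature (residual connection keeps the
        # pre-trained representation dominant), scored the same way.
        adapted = F.normalize(image_feat + self.adapter(image_feat), dim=-1)
        visual_logits = logit_scale * adapted @ self.text_weights.t()
        # Blend the two pathways.
        return self.alpha * cross_modal_logits + (1.0 - self.alpha) * visual_logits


# Usage sketch: 512-d image features, 10 classes, a batch of 4 images.
if __name__ == "__main__":
    head = TwoPathwayHead(text_weights=torch.randn(10, 512), feat_dim=512)
    logits = head(torch.randn(4, 512))
    print(logits.shape)  # torch.Size([4, 10])
```

In this sketch only the adapter parameters (and optionally the visual prompt tokens inside the image encoder, not shown) would be trained on the few-shot data, while the pre-trained VL backbone stays frozen.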
Pages: 5828-5836
Number of pages: 9
Related Papers
50 in total
  • [11] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [12] PROMETHEUS-VISION: Vision-Language Model as a Judge for Fine-Grained Evaluation
    Lee, Seongyun
    Kim, Seungone
    Park, Sue Hyun
    Kim, Geewook
    Seo, Minjoon
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 11286 - 11315
  • [13] PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition
    Zhang, Haosong
    Leong, Mei Chee
    Li, Liyuan
    Lin, Weisi
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 18857 - 18867
  • [14] Facial Expression Monitoring via Fine-Grained Vision-Language Alignment
    Ren, Weihong
    Gao, Yu
    Chen, Xiai
    Han, Zhi
    Wang, Zhiyong
    Wang, Jiaole
    Liu, Honghai
    IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, 2024,
  • [15] Fine-Grained Semantically Aligned Vision-Language Pre-Training
    Li, Juncheng
    He, Xin
    Wei, Longhui
    Qian, Long
    Zhu, Linchao
    Xie, Lingxi
    Zhuang, Yueting
    Tian, Qi
    Tang, Siliang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [16] ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data
    Varma, Maya
    Delbrouck, Jean-Benoit
    Hooper, Sarah
    Chaudhari, Akshay
    Langlotz, Curtis
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22168 - 22178
  • [17] Learning to locate for fine-grained image recognition
    Chen, Jiamin
    Hu, Jianguo
    Li, Shiren
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 206
  • [18] Incremental Learning for Fine-Grained Image Recognition
    Cao, Liangliang
    Hsiao, Jenhao
    de Juan, Paloma
    Li, Yuncheng
    Thomee, Bart
    ICMR'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2016, : 363 - 366
  • [19] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
    Zhu, Beier
    Niu, Yulei
    Lee, Saeil
    Hur, Minhoe
    Zhang, Hanwang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
  • [20] JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models
    Guo, Yuncheng
    Guo, Xiaodong
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 28695 - 28705