Fine-Tuning for Few-Shot Image Classification by Multimodal Prototype Regularization

Cited by: 2
Authors
Wu, Qianhao [1]
Qi, Jiaxin [2]
Zhang, Dong [1]
Zhang, Hanwang
Tang, Jinhui [1]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Research Foundation, Singapore;
Keywords
Training; Visualization; Testing; Task analysis; Prototypes; Feature extraction; Tuning; Few-shot classification; large pre-trained vision-language models; model fine-tuning; prototype regularization; NETWORK; MODELS;
DOI
10.1109/TMM.2024.3379896
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Large pre-trained vision-language models, such as CLIP [Radford et al. 2021], have demonstrated remarkable performance in few-shot image classification. To facilitate the rapid adaptation of CLIP to downstream tasks with limited visual samples, two primary frameworks have been proposed. The first framework centers on the image encoder and introduces a trainable visual classifier after the backbone to generate logits for each object class. Nevertheless, this framework depends heavily on the limited visual features extracted by the pre-trained visual encoder, which can lead to over-fitting. The second framework optimizes the text encoder through trainable soft language prompts and computes logits for each class from the similarity between image features and the optimized prompt features. However, this framework suffers from imperfect alignment between the representations produced by the image and text encoders, which makes it difficult to fine-tune the language prompts with visual samples. This paper proposes a Multi-Modal Prototype Regularization (MMPR) method for CLIP-based few-shot fine-tuning for image classification, addressing the challenge of effectively utilizing both image and text features. MMPR fine-tunes a classifier and regularizes its weights with both image-based (ImgPR) and text-based (TexPR) prototypes. An ImgPR is the mean of the image representations of a class, computed by the image encoder, and distills class-specific visual distribution knowledge for classifier adaptation. A TexPR is the hand-crafted prompt embedding of the class, computed by the text encoder, and injects general encyclopedic knowledge to mitigate visual over-fitting. MMPR exploits both image and text information without increasing computational complexity at inference time compared with existing methods. Experimental results on various challenging public benchmarks demonstrate the superiority of the proposed MMPR method over state-of-the-art methods.
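To make the regularization idea concrete, below is a minimal sketch of prototype-regularized classifier fine-tuning in the spirit of the abstract, assuming a frozen CLIP-style backbone whose image features and hand-crafted text-prompt embeddings are precomputed. The helper names, the squared-L2 penalty form, and the weights lambda_img / lambda_text are illustrative assumptions, not the paper's exact formulation.

# Illustrative sketch only: the function names, the squared-L2 regularizers, and
# the lambda weights below are assumptions; they are not the authors' released code.
import torch
import torch.nn.functional as F


def build_image_prototypes(features: torch.Tensor, labels: torch.Tensor,
                           num_classes: int) -> torch.Tensor:
    """Per-class mean of L2-normalized image features (ImgPR-style prototypes)."""
    feats = F.normalize(features, dim=-1)
    protos = torch.zeros(num_classes, feats.size(-1))
    for c in range(num_classes):
        protos[c] = feats[labels == c].mean(dim=0)
    return F.normalize(protos, dim=-1)


def mmpr_style_loss(classifier: torch.nn.Linear,
                    image_feats: torch.Tensor,
                    labels: torch.Tensor,
                    img_protos: torch.Tensor,
                    text_protos: torch.Tensor,
                    lambda_img: float = 1.0,
                    lambda_text: float = 1.0) -> torch.Tensor:
    """Cross-entropy on the few-shot batch plus two penalties that pull each
    class weight toward its image prototype and its text-prompt prototype."""
    logits = classifier(F.normalize(image_feats, dim=-1))
    ce = F.cross_entropy(logits, labels)
    w = F.normalize(classifier.weight, dim=-1)           # [num_classes, dim]
    reg_img = ((w - img_protos) ** 2).sum(dim=-1).mean()
    reg_text = ((w - text_protos) ** 2).sum(dim=-1).mean()
    return ce + lambda_img * reg_img + lambda_text * reg_text


if __name__ == "__main__":
    # Toy run with random stand-ins for frozen CLIP image features and
    # text-prompt embeddings (e.g. "a photo of a {class}").
    num_classes, dim, shots = 5, 512, 4
    feats = torch.randn(num_classes * shots, dim)
    labels = torch.arange(num_classes).repeat_interleave(shots)
    text_protos = F.normalize(torch.randn(num_classes, dim), dim=-1)
    img_protos = build_image_prototypes(feats, labels, num_classes)

    clf = torch.nn.Linear(dim, num_classes, bias=False)
    opt = torch.optim.AdamW(clf.parameters(), lr=1e-3)
    for _ in range(100):
        opt.zero_grad()
        loss = mmpr_style_loss(clf, feats, labels, img_protos, text_protos)
        loss.backward()
        opt.step()

Note that at inference only the fine-tuned linear classifier is applied to the frozen image features, which is consistent with the abstract's claim that no extra computational cost is added relative to existing classifier-based methods.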
Pages: 8543-8556
Page count: 14