Fine-Tuning for Few-Shot Image Classification by Multimodal Prototype Regularization

Times Cited: 2
Authors
Wu, Qianhao [1 ]
Qi, Jiaxin [2 ]
Zhang, Dong [1 ]
Zhang, Hanwang
Tang, Jinhui [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Research Foundation, Singapore;
Keywords
Training; Visualization; Testing; Task analysis; Prototypes; Feature extraction; Tuning; Few-shot classification; large pre-trained vision-language models; model fine-tuning; prototype regularization; NETWORK; MODELS;
DOI
10.1109/TMM.2024.3379896
Chinese Library Classification
TP [Automation & Computer Technology];
Discipline Classification Code
0812;
Abstract
Large pre-trained vision-language models such as CLIP [Radford et al. 2021] have demonstrated remarkable performance in few-shot image classification. To adapt CLIP rapidly to downstream tasks with limited visual samples, two primary frameworks have been proposed. The first centers on the image encoder: it appends a trainable visual classifier to the backbone to generate logits for each object class. However, this framework depends heavily on the limited visual features extracted by the pre-trained visual encoder, which can lead to over-fitting. The second optimizes the text encoder with trainable soft language prompts, computing per-class logits from the similarity between image features and the optimized prompt features. However, it suffers from imperfect alignment between the representations extracted by the image and text encoders, which makes it difficult to fine-tune the language prompts using visual samples. This paper proposes a Multi-Modal Prototype Regularization (MMPR) method for CLIP-based few-shot fine-tuning for image classification that effectively exploits both image and text features. MMPR fine-tunes a classifier and regularizes its weights using both image-based (ImgPR) and text-based (TexPR) prototypes. An ImgPR is the mean of the image representations of a class, derived from the image encoder, and distills class-specific visual-distribution knowledge for classifier adaptation. A TexPR is the embedding of the hand-crafted prompt for the class, derived from the text encoder, and injects general encyclopedic knowledge to mitigate visual over-fitting. MMPR thus leverages both image and text information without increasing computational complexity at inference compared with existing methods. Experimental results on various challenging public benchmarks demonstrate the superiority of the proposed MMPR method over state-of-the-art methods.
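The training objective sketched in the abstract (a classifier whose per-class weights are pulled toward prototypes from both encoders) can be illustrated roughly as follows. This is a minimal NumPy sketch, not the paper's implementation: the function names, the lambda weights, and the squared-Euclidean form of the regularizer are assumptions, and in practice the text prototypes (TexPR) would come from CLIP's text encoder applied to hand-crafted prompts.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """L2-normalize embeddings, as is standard for CLIP features."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def image_prototypes(feats, labels, num_classes):
    """ImgPR: per-class mean of normalized image embeddings
    produced by the (frozen) image encoder."""
    feats = l2_normalize(feats)
    return np.stack([feats[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def mmpr_loss(logits, labels, W, img_proto, txt_proto,
              lam_img=0.1, lam_txt=0.1):
    """Cross-entropy on the classifier logits plus two prototype
    regularizers pulling each class weight toward its image-based
    (ImgPR) and text-based (TexPR) prototype."""
    z = logits - logits.max(axis=1, keepdims=True)        # stable log-softmax
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_prob[np.arange(len(labels)), labels].mean()
    reg_img = ((W - img_proto) ** 2).sum(axis=1).mean()   # ||w_c - p_c^img||^2
    reg_txt = ((W - txt_proto) ** 2).sum(axis=1).mean()   # ||w_c - p_c^txt||^2
    return ce + lam_img * reg_img + lam_txt * reg_txt
```

Because the prototypes are fixed once computed, the regularizers add no cost at inference time; only the classifier weights W enter the forward pass, consistent with the abstract's claim of unchanged inference complexity.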
Pages: 8543 - 8556
Page count: 14