Fine-Tuning for Few-Shot Image Classification by Multimodal Prototype Regularization

Cited by: 2
Authors
Wu, Qianhao [1 ]
Qi, Jiaxin [2 ]
Zhang, Dong [1 ]
Zhang, Hanwang
Tang, Jinhui [1 ]
Affiliations
[1] Nanjing Univ Sci & Technol, Sch Comp Sci & Engn, Nanjing 210094, Peoples R China
[2] Nanyang Technol Univ, Sch Comp Sci & Engn, Singapore 639798, Singapore
Funding
National Research Foundation of Singapore
Keywords
Training; Visualization; Testing; Task analysis; Prototypes; Feature extraction; Tuning; Few-shot classification; large pre-trained vision-language models; model fine-tuning; prototype regularization; NETWORK; MODELS;
DOI
10.1109/TMM.2024.3379896
CLC Classification Number
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Large pre-trained vision-language models, such as CLIP [Radford et al. 2021], have demonstrated remarkable performance in few-shot image classification. To facilitate the rapid adaptation of CLIP to downstream tasks with limited visual samples, two primary frameworks have been proposed. The first centers on the image encoder and introduces a trainable visual classifier after the backbone to generate logits for each object class. However, this framework depends heavily on the limited visual features extracted by the pre-trained visual encoder, which can lead to over-fitting. The second optimizes the text encoder with trainable soft language prompts and computes logits for each class from the similarity between image features and optimized prompt features. This framework, however, suffers from imperfect alignment between the representations extracted by the image and text encoders, which makes it difficult to fine-tune the language prompts using visual samples. This paper proposes a Multi-Modal Prototype Regularization (MMPR) method for CLIP-based few-shot fine-tuning for image classification, which addresses the challenge of effectively utilizing both image and text features. MMPR fine-tunes a classifier and regularizes its weights using both image-based (ImgPR) and text-based (TexPR) prototypes. ImgPR, the mean of the image-encoder representations within each class, distills class-specific visual distribution knowledge for classifier adaptation. TexPR, the text-encoder embedding of the hand-crafted prompt for each class, incorporates general encyclopedic knowledge and mitigates visual over-fitting. MMPR thus leverages both image and text information without increasing computational complexity at the inference stage compared to existing methods.
Experimental results on various challenging public benchmarks demonstrate the superiority of the proposed MMPR method over state-of-the-art methods.
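The prototype-regularization idea in the abstract can be sketched in a few lines. The sketch below is illustrative only: it assumes a squared-L2 penalty pulling L2-normalized classifier weights toward both prototype sets, and the mixing weight `alpha` and function names are hypothetical, not taken from the paper; the authors' exact loss form may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize rows to unit length, as is standard for CLIP-style features
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def image_prototypes(features, labels, num_classes):
    # ImgPR: per-class mean of image-encoder features (class prototypes)
    return np.stack([features[labels == c].mean(axis=0)
                     for c in range(num_classes)])

def mmpr_regularizer(W, img_proto, txt_proto, alpha=0.5):
    # Pull classifier weights W toward both the image prototypes (ImgPR)
    # and the text prototypes (TexPR); `alpha` balances the two terms.
    # Squared-L2 form and `alpha` are assumptions for illustration.
    W, img_proto, txt_proto = map(l2_normalize, (W, img_proto, txt_proto))
    reg_img = ((W - img_proto) ** 2).sum()
    reg_txt = ((W - txt_proto) ** 2).sum()
    return alpha * reg_img + (1 - alpha) * reg_txt
```

In training, this penalty would be added to the usual classification loss; at inference only the fine-tuned classifier is used, so no extra cost is incurred, consistent with the abstract's claim.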
Pages: 8543-8556 (14 pages)
Related Papers
50 results
  • [1] Hybrid Fine-Tuning Strategy for Few-Shot Classification
    Zhao, Lei
    Ou, Zhonghua
    Zhang, Lixun
    Li, Shuxiao
    COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE, 2022, 2022
  • [2] Singular Value Fine-tuning: Few-shot Segmentation requires Few-parameters Fine-tuning
    Sun, Yanpeng
    Chen, Qiang
    He, Xiangyu
    Wang, Jian
    Feng, Haocheng
    Han, Junyu
    Ding, Errui
    Cheng, Jian
    Li, Zechao
    Wang, Jingdong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [3] Adaptive fine-tuning strategy for few-shot learning
    Zhuang, Xinkai
    Shao, Mingwen
    Gao, Wei
    Yang, Jianxin
    JOURNAL OF ELECTRONIC IMAGING, 2022, 31 (06)
  • [4] Embedding Hallucination for Few-Shot Language Fine-tuning
    Jian, Yiren
    Gao, Chongyang
    Vosoughi, Soroush
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 5522 - 5530
  • [5] Network Pruning and Fine-tuning for Few-shot Industrial Image Anomaly Detection
    Zhang, Jie
    Suganuma, Masanori
    Okatani, Takayuki
    2023 IEEE 21ST INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS, INDIN, 2023
  • [6] Chain of Thought Guided Few-Shot Fine-Tuning of LLMs for Multimodal Aspect-Based Sentiment Classification
    Wu, Hao
    Yang, Danping
    Liu, Peng
    Li, Xianxian
    MULTIMEDIA MODELING, MMM 2025, PT I, 2025, 15520 : 182 - 194
  • [7] EFTNet: an efficient fine-tuning method for few-shot segmentation
    Li, Jiaguang
    Wang, Yubo
    Gao, Zihan
    Wei, Ying
    APPLIED INTELLIGENCE, 2024, 54 (19) : 9488 - 9507
  • [8] Boosting Transductive Few-Shot Fine-tuning with Margin-based Uncertainty Weighting and Probability Regularization
    Tao, Ran
    Chen, Hao
    Savvides, Marios
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 15752 - 15761
  • [9] Dual Attention Relation Network With Fine-Tuning for Few-Shot EEG Motor Imagery Classification
    An, Sion
    Kim, Soopil
    Chikontwe, Philip
    Park, Sang Hyun
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2024, 35 (11) : 15479 - 15493