A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 4
|
Authors
Xing, Jialu [1 ]
Liu, Jianping [1 ,2 ,4 ]
Wang, Jian [3 ]
Sun, Lulu [1 ]
Chen, Xi [1 ]
Gu, Xunxun [1 ]
Wang, Yingfei [1 ]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024 | Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
DOI
10.1016/j.cag.2024.01.012
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Codes
081202 ; 0835 ;
Abstract
Vision-Language Models (VLMs) are a popular research field at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results on many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for most researchers. In recent years, Efficient Fine-Tuning (EFT), a family of very low-cost tuning methods, has greatly alleviated this problem and has driven the development of a new fine-tuning paradigm. Since Prompt and Adapter methods are the most widely used in the vision-language field, this review focuses on analysing the progress of these two approaches. First, we review the VLM research paradigm based on differences in pre-training and fine-tuning methods. Next, we categorize Prompt methods into 3 types (7 subtypes) of usage patterns according to the modal information involved, and categorize Adapter methods into 2 types of usage patterns according to whether they participate in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT and propose possible future research directions.
Pages: 23
Related Papers
50 total
  • [1] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
    Zhu, Beier
    Niu, Yulei
    Lee, Saeil
    Hur, Minhoe
    Zhang, Hanwang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
  • [2] Robust Fine-Tuning of Vision-Language Models for Domain Generalization
    Vogt-Lowell, Kevin
    Lee, Noah
    Tsiligkaridis, Theodoros
    Vaillant, Marc
    2023 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE, HPEC, 2023,
  • [3] Adversarial Prompt Tuning for Vision-Language Models
    Zhang, Jiaming
    Ma, Xingjun
    Wang, Xin
    Qiu, Lingyu
    Wang, Jiaqi
    Jiang, Yu-Gang
    Sang, Jitao
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 56 - 72
  • [4] Prompt-Ladder: Memory-efficient prompt tuning for vision-language models on edge devices
    Cai, Siqi
    Liu, Xuan
    Yuan, Jingling
    Zhou, Qihua
    PATTERN RECOGNITION, 2025, 163
  • [5] Distribution-Aware Prompt Tuning for Vision-Language Models
    Cho, Eulrang
    Kim, Jooyeon
    Kim, Hyunwoo J.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21947 - 21956
  • [6] Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models
    Ma, Chengcheng
    Liu, Yang
    Deng, Jiankang
    Xie, Lingxi
    Dong, Weiming
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4616 - 4629
  • [7] Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
    Lan, Long
    Wang, Fengxiang
    Zheng, Xiangtao
    Wang, Zengmao
    Liu, Xinwang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [8] UMPA: Unified multi-modal prompt with adapter for vision-language models
    Jin, Zhengwei
    Wei, Yun
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [9] How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?
    Ming, Yifei
    Li, Yixuan
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (02) : 596 - 609