A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 4
|
Authors
Xing, Jialu [1 ]
Liu, Jianping [1 ,2 ,4 ]
Wang, Jian [3 ]
Sun, Lulu [1 ]
Chen, Xi [1 ]
Gu, Xunxun [1 ]
Wang, Yingfei [1 ]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024 | Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
DOI
10.1016/j.cag.2024.01.012
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Codes
081202 ; 0835 ;
Abstract
Vision-Language Models (VLMs) are a popular research field at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results on many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for most researchers. In recent years, Efficient Fine-Tuning (EFT), a family of very low-cost tuning methods, has greatly alleviated this problem and has driven the development of a new fine-tuning paradigm. Since Prompt and Adapter methods are the most widely used in the vision-language field, this review focuses on analysing the progress of these two approaches. First, we review the VLM research paradigm based on differences in pre-training and fine-tuning methods. Next, we categorize Prompt methods into 3 types (7 subtypes) of usage patterns according to the modal information involved, and categorize Adapter methods into 2 types of usage patterns according to whether they participate in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT and propose possible future research directions.
Pages: 23
Related Papers
50 total
  • [1] Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization
    Zhu, Beier
    Niu, Yulei
    Lee, Saeil
    Hur, Minhoe
    Zhang, Hanwang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 3, 2023, : 3834 - 3842
  • [2] Robust Fine-Tuning of Vision-Language Models for Domain Generalization
    Vogt-Lowell, Kevin
    Lee, Noah
    Tsiligkaridis, Theodoros
    Vaillant, Marc
    2023 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE, HPEC, 2023,
  • [3] Adversarial Prompt Tuning for Vision-Language Models
    Zhang, Jiaming
    Ma, Xingjun
    Wang, Xin
    Qiu, Lingyu
    Wang, Jiaqi
    Jiang, Yu-Gang
    Sang, Jitao
    COMPUTER VISION - ECCV 2024, PT XLV, 2025, 15103 : 56 - 72
  • [4] Prompt-Ladder: Memory-efficient prompt tuning for vision-language models on edge devices
    Cai, Siqi
    Liu, Xuan
    Yuan, Jingling
    Zhou, Qihua
    PATTERN RECOGNITION, 2025, 163
  • [5] Distribution-Aware Prompt Tuning for Vision-Language Models
    Cho, Eulrang
    Kim, Jooyeon
    Kim, Hyunwoo J.
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 21947 - 21956
  • [6] Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models
    Ma, Chengcheng
    Liu, Yang
    Deng, Jiankang
    Xie, Lingxi
    Dong, Weiming
    Xu, Changsheng
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4616 - 4629
  • [7] Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
    Lan, Long
    Wang, Fengxiang
    Zheng, Xiangtao
    Wang, Zengmao
    Liu, Xinwang
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [8] UMPA: Unified multi-modal prompt with adapter for vision-language models
    Jin, Zhengwei
    Wei, Yun
    MULTIMEDIA SYSTEMS, 2025, 31 (02)
  • [9] How Does Fine-Tuning Impact Out-of-Distribution Detection for Vision-Language Models?
    Ming, Yifei
    Li, Yixuan
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024, 132 (02) : 596 - 609