A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 4
|
Authors
Xing, Jialu [1 ]
Liu, Jianping [1 ,2 ,4 ]
Wang, Jian [3 ]
Sun, Lulu [1 ]
Chen, Xi [1 ]
Gu, Xunxun [1 ]
Wang, Yingfei [1 ]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024 / Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
DOI
10.1016/j.cag.2024.01.012
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline Classification Code
081202; 0835;
Abstract
Vision-Language Models (VLM) are a popular research field at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPM), have achieved state-of-the-art results on many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for most researchers. In recent years, Efficient Fine-Tuning (EFT), a family of very low-cost tuning methods, has greatly alleviated this problem and driven the development of a new fine-tuning paradigm. Since Prompt and Adapter are the most widely used methods in the vision-language field, this review focuses on analysing the progress of these two approaches. First, we review the VLM research paradigm based on differences in pre-training and fine-tuning methods. Next, we categorize Prompt into 3 types (7 subtypes) of usage patterns according to the modal information involved, and categorize Adapter into 2 types of usage patterns according to whether it participates in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT, and propose possible future research directions.
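The two techniques the survey focuses on can be sketched in a few lines. The NumPy snippet below is a minimal illustration (not code from the paper), with assumed dimensions (hidden size d=768, bottleneck r=16, 4 soft-prompt tokens), showing why both methods train only a tiny fraction of a backbone's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 16                            # hidden size and bottleneck size (assumed values)

# --- Adapter: a small residual bottleneck inserted into a frozen backbone ---
W_down = rng.normal(0.0, 0.02, (r, d))    # trainable down-projection
W_up = np.zeros((d, r))                   # zero-init: adapter starts as an identity map

def adapter(x):
    """Bottleneck adapter: down-project, ReLU, up-project, residual add."""
    return x + W_up @ np.maximum(W_down @ x, 0.0)

x = rng.normal(size=d)                    # a feature vector from the frozen backbone
assert np.allclose(adapter(x), x)         # identity before any training step

trainable = W_down.size + W_up.size       # 2*d*r = 24,576 trainable parameters
full = 12 * d * d                         # rough per-layer cost of full fine-tuning (~7.1M)

# --- Prompt: learnable "soft" tokens prepended to the input embeddings ---
P = rng.normal(0.0, 0.02, (4, d))         # 4 trainable soft-prompt vectors (assumed)
tokens = rng.normal(size=(10, d))         # frozen embeddings of a 10-token input
prompted = np.concatenate([P, tokens])    # only P is updated during tuning
assert prompted.shape == (14, d)
```

Zero-initializing the up-projection is a common design choice in adapter-style tuning: the module is a no-op at the start of training, so the pre-trained model's behaviour is preserved until gradients move the adapter away from identity.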
Pages: 23
Related Papers
50 records in total
  • [21] Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks
    Qu, Tingyu
    Tuytelaars, Tinne
    Moens, Marie-Francine
    COMPUTER VISION - ECCV 2024, PT LXXXVIII, 2025, 15146 : 291 - 308
  • [22] Reparameterization-Based Parameter-Efficient Fine-Tuning Methods for Large Language Models: A Systematic Survey
    Chen, Zezhou
    Liu, Zhaoxiang
    Wang, Kai
    Lian, Shiguo
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT III, NLPCC 2024, 2025, 15361 : 107 - 118
  • [23] Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models
    Luo, Gen
    Zhou, Yiyi
    Ren, Tianhe
    Chen, Shengxin
    Sun, Xiaoshuai
    Ji, Rongrong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] How fine can fine-tuning be? Learning efficient language models
    Radiya-Dixit, Evani
    Wang, Xin
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 2435 - 2442
  • [25] Multi-task prompt tuning with soft context sharing for vision-language models
    Ding, Kun
    Wang, Ying
    Liu, Pengzhang
    Yu, Qiang
    Zhang, Haojian
    Xiang, Shiming
    Pan, Chunhong
    NEUROCOMPUTING, 2024, 603
  • [26] Vision-Language Models for Vision Tasks: A Survey
    Zhang, Jingyi
    Huang, Jiaxing
    Jin, Sheng
    Lu, Shijian
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (08) : 5625 - 5644
  • [27] Fine-Grained Visual Prompt Learning of Vision-Language Models for Image Recognition
    Sun, Hongbo
    He, Xiangteng
    Zhou, Jiahuan
    Peng, Yuxin
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 5828 - 5836
  • [28] Fine-grained multi-modal prompt learning for vision-language models
    Liu, Yunfei
    Deng, Yunziwei
    Liu, Anqi
    Liu, Yanan
    Li, Shengyang
    NEUROCOMPUTING, 2025, 636
  • [29] Learning Domain Invariant Prompt for Vision-Language Models
    Zhao, Cairong
    Wang, Yubin
    Jiang, Xinyang
    Shen, Yifei
    Song, Kaitao
    Li, Dongsheng
    Miao, Duoqian
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2024, 33 : 1348 - 1360
  • [30] DPO: Discrete Prompt Optimization for Vision-Language Models
    Liang, Nanhao
    Liu, Yong
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 671 - 675