A survey of efficient fine-tuning methods for Vision-Language Models - Prompt and Adapter

Cited by: 4
|
Authors
Xing, Jialu [1 ]
Liu, Jianping [1 ,2 ,4 ]
Wang, Jian [3 ]
Sun, Lulu [1 ]
Chen, Xi [1 ]
Gu, Xunxun [1 ]
Wang, Yingfei [1 ]
Affiliations
[1] North Minzu Univ, Coll Comp Sci & Engn, Yinchuan 750021, Peoples R China
[2] North Minzu Univ, Key Lab Images & Grap Intelligent Proc, State Ethn Affairs Commiss, Yinchuan 750021, Peoples R China
[3] Chinese Acad Agr Sci, Agr Informat Inst, Beijing 100081, Peoples R China
[4] 204,Wenchang North St, Yinchuan, Ningxia, Peoples R China
Source
COMPUTERS & GRAPHICS-UK | 2024, Vol. 119
Keywords
Vision-language; Computer vision; Efficient fine-tuning; Pre-training model; Prompt; Adapter;
D O I
10.1016/j.cag.2024.01.012
Chinese Library Classification (CLC)
TP31 [Computer Software];
Discipline classification codes
081202 ; 0835 ;
Abstract
The Vision-Language Model (VLM) is a popular research field at the intersection of computer vision and natural language processing (NLP). With the emergence of transformer networks and massive web data, numerous large-scale VLMs, or Vision-Language Pre-training Models (VLPMs), have achieved state-of-the-art results in many tasks, such as retrieval (CLIP) and generation (DALL-E). Although large models have shown impressive results, the cost of retraining and full fine-tuning is prohibitive for general researchers. In recent years, Efficient Fine-Tuning (EFT), a family of very low-cost tuning methods, has greatly alleviated this problem, and a new fine-tuning paradigm has developed around it. Since Prompt and Adapter are the most widely used techniques in the vision-language field, this review focuses on analysing the progress of these two methods. First, we review the VLM research paradigms based on differences in their pre-training and fine-tuning methods. Next, we categorize Prompt into 3 types (7 subtypes) of usage patterns according to the modal information involved, and categorize Adapter into 2 types of usage patterns according to whether it participates in modal fusion; we further discuss both in vision and vision-language tasks. Finally, we discuss the stability and social ethics of EFT and propose possible future research directions.
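To make the Adapter idea surveyed here concrete: a typical adapter is a small bottleneck module (down-projection, non-linearity, up-projection, residual connection) inserted into a frozen pre-trained backbone, so that only the tiny adapter matrices are trained. The following is a minimal NumPy sketch, not code from the paper; the class name, dimensions, and zero-initialization scheme are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class BottleneckAdapter:
    """Illustrative bottleneck adapter: h + ReLU(h @ W_down) @ W_up.

    Only W_down and W_up would be trained; the frozen backbone's
    weights stay untouched, which is what makes the tuning 'efficient'.
    """
    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        # Zero-initialising the up-projection makes the adapter an
        # identity map at step 0, so inserting it does not perturb
        # the pre-trained model before training starts.
        self.W_up = np.zeros((d_bottleneck, d_model))

    def __call__(self, h):
        # h: (batch, d_model) hidden states from a frozen layer
        return h + relu(h @ self.W_down) @ self.W_up

# A d_model=768 layer with a bottleneck of 64 adds only
# 2 * 768 * 64 ≈ 98k trainable parameters per insertion point.
adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
h = np.ones((1, 8))
out = adapter(h)  # identical to h at initialisation
```

The same residual-bottleneck pattern underlies most of the adapter variants the survey categorizes; the fusion-oriented variants differ mainly in placing such modules where visual and textual streams interact.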
Pages: 23
Related Papers
50 records
  • [31] Task Residual for Tuning Vision-Language Models
    Yu, Tao
    Lu, Zhihe
    Jin, Xin
    Chen, Zhibo
    Wang, Xinchao
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 10899 - 10909
  • [32] MMA: Multi-Modal Adapter for Vision-Language Models
    Yang, Lingxiao
    Zhang, Ru-Yuan
    Wang, Yanchen
    Xie, Xiaohua
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 23826 - +
  • [33] LAPT: Label-Driven Automated Prompt Tuning for OOD Detection with Vision-Language Models
    Zhang, Yabin
    Zhu, Wenjie
    He, Chenhang
    Zhang, Lei
    COMPUTER VISION - ECCV 2024, PT LXXII, 2025, 15130 : 271 - 288
  • [34] Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning
    Gao, Zhengqing
    Ao, Xiang
    Zhang, Xu-Yao
    Liu, Cheng-Lin
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 439 - 452
  • [35] Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models
    Shu, Manli
    Nie, Weili
    Huang, De-An
    Yu, Zhiding
    Goldstein, Tom
    Anandkumar, Anima
    Xiao, Chaowei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [36] Prompt Engineering or Fine-Tuning? A Case Study on Phishing Detection with Large Language Models
    Trad, Fouad
    Chehab, Ali
    MACHINE LEARNING AND KNOWLEDGE EXTRACTION, 2024, 6 (01): : 367 - 384
  • [37] Debiasing vision-language models for vision tasks: a survey
    Zhu, Beier
    Zhang, Hanwang
    FRONTIERS OF COMPUTER SCIENCE, 2025, 19 (01)
  • [38] Black-box Prompt Tuning for Vision-Language Model as a Service
    Yu, Lang
    Chen, Qin
    Lin, Jiaju
    He, Liang
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1686 - 1694
  • [39] APoLLo : Unified Adapter and Prompt Learning for Vision Language Models
    Chowdhury, Sanjoy
    Nag, Sayan
    Manocha, Dinesh
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 10173 - 10187
  • [40] CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning
    Dui, Yuexi
    Chang, Brian
    Dvornek, Nicha C.
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT XII, 2024, 15012 : 465 - 475