Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Citations: 0
Authors
Zhang, Yichi [1 ,2 ]
Dong, Yinpeng [1 ,2 ]
Zhang, Siyuan [1 ]
Min, Tianzan [1 ]
Su, Hang [1 ,3 ]
Zhu, Jun [1 ,2 ,3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Bosch Joint ML Ctr, Dept Comp Sci & Tech, Inst AI,THBI Lab,BNRist Ctr, Beijing 100084, Peoples R China
[2] RealAI, Beijing, Peoples R China
[3] Pazhou Lab Huangpu, Guangzhou, Guangdong, Peoples R China
Keywords
DOI
10.1109/CVPR52733.2024.02508
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to that of specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting in which we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generating visual prompts that, after being trained on only one model, can transfer to different models and improve their performance on downstream tasks. We introduce two strategies to address the cross-model feature corruption of existing visual prompting methods and to enhance the transferability of the learned prompts: 1) Feature Consistency Alignment, which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; and 2) Task Semantics Enrichment, which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks, ranging from object recognition and counting to multimodal reasoning and hallucination correction.
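The abstract describes an objective with three parts: a downstream task loss on the prompted image, a Feature Consistency Alignment term that keeps prompted features close to the original ones, and a Task Semantics Enrichment term that pulls prompted features toward language-derived task semantics. Below is a minimal pure-Python sketch of such a combined objective, under stated assumptions: the helper names (`add_visual_prompt`, `tvp_objective`), the use of an L2 distance for alignment, a dot-product similarity for the language term, and the weights `lam_fca`/`lam_tse` are all illustrative choices, not the authors' actual implementation.

```python
# Hedged sketch of a TVP-style training objective. Features and images are
# modeled as flat lists of floats; all names and loss choices are assumptions.

def add_visual_prompt(image, prompt):
    # Apply the learnable visual prompt as an additive perturbation.
    return [pix + p for pix, p in zip(image, prompt)]

def l2_distance(a, b):
    # Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def tvp_objective(image, prompt, features, task_loss, text_features,
                  lam_fca=1.0, lam_tse=1.0):
    prompted = add_visual_prompt(image, prompt)
    f_orig = features(image)
    f_prompted = features(prompted)
    # 1) Feature Consistency Alignment: constrain how far the prompted
    #    features drift from the original ones, preserving task-agnostic
    #    knowledge shared across models.
    fca = l2_distance(f_prompted, f_orig)
    # 2) Task Semantics Enrichment: reward similarity between prompted
    #    features and language-guided task features (e.g., text embeddings
    #    of class names); negated so that higher similarity lowers the loss.
    tse = -sum(x * y for x, y in zip(f_prompted, text_features))
    return task_loss(prompted) + lam_fca * fca + lam_tse * tse

# Toy usage with an identity feature extractor and a sum-based task loss.
loss = tvp_objective(
    image=[1.0, 2.0],
    prompt=[0.1, -0.1],
    features=lambda x: x,
    task_loss=lambda x: sum(x),
    text_features=[1.0, 1.0],
)
```

In a real setting, `features` would be the frozen vision encoder of one MLLM, and only `prompt` would receive gradients; the two regularizers are what the abstract credits with making the prompt transfer to models it was never trained on.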
Pages: 26552-26562
Page count: 11
Related Papers
50 items in total
  • [31] AdaShield: Safeguarding Multimodal Large Language Models from Structure-Based Attack via Adaptive Shield Prompting
    Wang, Yu
    Liu, Xiaogeng
    Li, Yu
    Chen, Muhao
    Xiao, Chaowei
    COMPUTER VISION - ECCV 2024, PT XX, 2025, 15078 : 77 - 94
  • [32] A Survey on Multimodal Large Language Models in Radiology for Report Generation and Visual Question Answering
    Yi, Ziruo
    Xiao, Ting
    Albert, Mark V.
    INFORMATION, 2025, 16 (02)
  • [33] From Large Language Models to Large Multimodal Models: A Literature Review
    Huang, Dawei
    Yan, Chuan
    Li, Qing
    Peng, Xiaojiang
    APPLIED SCIENCES-BASEL, 2024, 14 (12):
  • [34] A comprehensive survey of large language models and multimodal large models in medicine
    Xiao, Hanguang
    Zhou, Feizhong
    Liu, Xingyue
    Liu, Tianqi
    Li, Zhipeng
    Liu, Xin
    Huang, Xiaoxuan
    INFORMATION FUSION, 2025, 117
  • [35] 2AFC Prompting of Large Multimodal Models for Image Quality Assessment
    Zhu, Hanwei
    Sui, Xiangjie
    Chen, Baoliang
    Liu, Xuelin
    Chen, Peilin
    Fang, Yuming
    Wang, Shiqi
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (12) : 12873 - 12878
  • [36] Multimodal Prompting with Missing Modalities for Visual Recognition
    Lee, Yi-Lun
    Tsai, Yi-Hsuan
    Chiu, Wei-Chen
    Lee, Chen-Yu
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14943 - 14952
  • [37] Multimodal Large Language Models in Vision and Ophthalmology
    Lu, Zhiyong
    INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE, 2024, 65 (07)
  • [38] The application of multimodal large language models in medicine
    Qiu, Jianing
    Yuan, Wu
    Lam, Kyle
    LANCET REGIONAL HEALTH-WESTERN PACIFIC, 2024, 45
  • [39] Multimodal large language models for bioimage analysis
    Zhang, Shanghang
    Dai, Gaole
    Huang, Tiejun
    Chen, Jianxu
    NATURE METHODS, 2024, 21 (08) : 1390 - 1393
  • [40] Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation
    Chhun, Cyril
    Suchanek, Fabian M.
    Clavel, Chloe
    TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12 : 1122 - 1142