Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Cited by: 0
Authors
Zhang, Yichi [1 ,2 ]
Dong, Yinpeng [1 ,2 ]
Zhang, Siyuan [1 ]
Min, Tianzan [1 ]
Su, Hang [1 ,3 ]
Zhu, Jun [1 ,2 ,3 ]
Affiliations
[1] Tsinghua Univ, Tsinghua Bosch Joint ML Ctr, Dept Comp Sci & Tech, Inst AI, THBI Lab, BNRist Ctr, Beijing 100084, Peoples R China
[2] RealAI, Beijing, Peoples R China
[3] Pazhou Lab Huangpu, Guangzhou, Guangdong, Peoples R China
Keywords
DOI
10.1109/CVPR52733.2024.02508
CLC Number
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, incurring substantial computation and memory overhead. In this paper, we propose a novel setting in which we aim to improve the performance of diverse MLLMs with a single group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach that generates visual prompts which, after being trained on only one model, can transfer to different models and improve their performance on downstream tasks. We introduce two strategies to address the cross-model feature corruption of existing visual prompting methods and to enhance the transferability of the learned prompts: 1) Feature Consistency Alignment, which constrains the feature changes induced by the prompt so as to preserve task-agnostic knowledge; and 2) Task Semantics Enrichment, which uses language guidance to encourage the prompted images to carry richer task-specific semantics. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks, ranging from object recognition and counting to multimodal reasoning and hallucination correction.
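The two strategies above admit a straightforward reading as auxiliary loss terms on top of standard visual-prompt tuning. The PyTorch sketch below is one such reading, not the authors' implementation: the additive border-prompt design, the MSE distance for Feature Consistency Alignment, the cosine similarity against a task text embedding for Task Semantics Enrichment, the loss weights, and the stand-in encoder are all illustrative assumptions.

```python
# A minimal sketch of TVP-style prompt training as described in the abstract.
# Prompt design, loss functions, weights, and the toy encoder are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPrompt(nn.Module):
    """Learnable additive perturbation restricted to the image border."""

    def __init__(self, image_size: int = 224, pad: int = 16):
        super().__init__()
        mask = torch.zeros(1, 3, image_size, image_size)
        mask[..., :pad, :] = 1.0   # top border
        mask[..., -pad:, :] = 1.0  # bottom border
        mask[..., :, :pad] = 1.0   # left border
        mask[..., :, -pad:] = 1.0  # right border
        self.register_buffer("mask", mask)
        self.delta = nn.Parameter(torch.zeros(1, 3, image_size, image_size))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # Apply the prompt only on the border and keep pixels in [0, 1].
        return torch.clamp(images + self.mask * self.delta, 0.0, 1.0)


def tvp_loss(prompt, images, vision_encoder, task_text_embed, task_loss_fn,
             w_fca=1.0, w_tse=1.0):
    """Task loss plus the two transferability terms named in the abstract:
    Feature Consistency Alignment (FCA) keeps prompted features close to the
    clean features of the frozen encoder (preserving task-agnostic knowledge);
    Task Semantics Enrichment (TSE) pulls prompted features toward a language
    embedding of the task (injecting task-specific semantics)."""
    prompted = prompt(images)
    with torch.no_grad():
        feat_clean = vision_encoder(images)        # task-agnostic reference
    feat_prompted = vision_encoder(prompted)
    l_fca = F.mse_loss(feat_prompted, feat_clean)  # limit feature corruption
    sim = F.cosine_similarity(
        feat_prompted, task_text_embed.expand_as(feat_prompted), dim=-1)
    l_tse = (1.0 - sim).mean()
    return task_loss_fn(prompted) + w_fca * l_fca + w_tse * l_tse


# Toy usage with stand-in components (shapes only, not a real MLLM).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
for p in encoder.parameters():
    p.requires_grad_(False)  # models stay frozen; only the prompt is trained


def dummy_task_loss(x):  # placeholder for a real downstream objective
    return encoder(x).pow(2).mean()


prompt = VisualPrompt()
images = torch.rand(4, 3, 224, 224)
text_embed = torch.randn(512)  # e.g. a frozen text encoding of the task labels
loss = tvp_loss(prompt, images, encoder, text_embed, dummy_task_loss)
loss.backward()  # gradients reach only prompt.delta
```

Because only the prompt pixels are optimized while every model remains frozen, the same learned prompt can simply be added to the inputs of other MLLMs, which is the transfer setting the paper evaluates.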
Pages: 26552 - 26562
Page count: 11
Related Papers
50 in total
  • [41] Does Metacognitive Prompting Improve Causal Inference in Large Language Models?
    Ohtani, Ryusei
    Sakurai, Yuko
    Oyama, Satoshi
    2024 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI 2024, 2024, : 458 - 459
  • [42] On Hardware Security Bug Code Fixes by Prompting Large Language Models
    Ahmad, Baleegh
    Thakur, Shailja
    Tan, Benjamin
    Karri, Ramesh
    Pearce, Hammond
    IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, 2024, 19 : 4043 - 4057
  • [43] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
    Wei, Jason
    Wang, Xuezhi
    Schuurmans, Dale
    Bosma, Maarten
    Ichter, Brian
    Xia, Fei
    Chi, Ed H.
    Le, Quoc V.
    Zhou, Denny
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [44] A Communication Theory Perspective on Prompting Engineering Methods for Large Language Models
    Song, Yuan-Feng
    He, Yuan-Qin
    Zhao, Xue-Fang
    Gu, Han-Lin
    Jiang, Di
    Yang, Hai-Jun
    Fan, Li-Xin
    JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, 2024, 39 (04) : 984 - 1004
  • [45] SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
    Huang, Yuzhou
    Xie, Liangbin
    Wang, Xintao
    Yuan, Ziyang
    Cun, Xiaodong
    Ge, Yixiao
    Zhou, Jiantao
    Dong, Chao
    Huang, Rui
    Zhang, Ruimao
    Shan, Ying
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2024, 2024, : 8362 - 8371
  • [46] Evolution and Prospects of Foundation Models: From Large Language Models to Large Multimodal Models
    Chen, Zheyi
    Xu, Liuchang
    Zheng, Hongting
    Chen, Luyao
    Tolba, Amr
    Zhao, Liang
    Yu, Keping
    Feng, Hailin
    CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 80 (02) : 1753 - 1808
  • [48] Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
    Li, Yifan
    Guo, Hangyu
    Zhou, Kun
    Zhao, Wayne Xin
    Wen, Ji-Rong
    COMPUTER VISION - ECCV 2024, PT LXXIII, 2025, 15131 : 174 - 189
  • [49] Large language models and multimodal foundation models for precision oncology
    Truhn, Daniel
    Eckardt, Jan-Niklas
    Ferber, Dyke
    Kather, Jakob Nikolas
    NPJ PRECISION ONCOLOGY, 2024, 8 (01)