Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Citations: 0
Authors
Zhang, Yichi [1, 2]
Dong, Yinpeng [1, 2]
Zhang, Siyuan [1]
Min, Tianzan [1]
Su, Hang [1, 3]
Zhu, Jun [1, 2, 3]
Affiliations
[1] Tsinghua Univ, Tsinghua Bosch Joint ML Ctr, Dept Comp Sci & Tech, Inst AI,THBI Lab,BNRist Ctr, Beijing 100084, Peoples R China
[2] RealAI, Beijing, Peoples R China
[3] Pazhou Lab Huangpu, Guangzhou, Guangdong, Peoples R China
DOI
10.1109/CVPR52733.2024.02508
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to that of specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption in existing visual prompting methods and to enhance the transferability of the learned prompts: 1) Feature Consistency Alignment, which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; and 2) Task Semantics Enrichment, which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.
Pages: 26552 - 26562
Number of pages: 11
Related papers
50 in total
  • [1] EarthMarker: A Visual Prompting Multimodal Large Language Model for Remote Sensing
    Zhang, Wei
    Cai, Miaoxin
    Zhang, Tong
    Zhuang, Yin
    Li, Jun
    Mao, Xuerui
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2025, 63
  • [2] Visual cognition in multimodal large language models
    Buschoff, Luca M. Schulze
    Akata, Elif
    Bethge, Matthias
    Schulz, Eric
    NATURE MACHINE INTELLIGENCE, 2025, 7 (01) : 96 - 106
  • [3] Considerations for Prompting Large Language Models
    Schulte, Brian
    JAMA ONCOLOGY, 2024, 10 (04) : 538 - 538
  • [4] Prompting Is Programming: A Query Language for Large Language Models
    Beurer-Kellner, Luca
    Fischer, Marc
    Vechev, Martin
    PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2023, 7 (PLDI):
  • [5] Graph Neural Prompting with Large Language Models
    Tian, Yijun
    Song, Huan
    Wang, Zichen
    Wang, Haozhu
    Hu, Ziqing
    Wang, Fang
    Chawla, Nitesh V.
    Xu, Panpan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024 : 19080 - 19088
  • [6] Prompting Large Language Models With the Socratic Method
    Chang, Edward Y.
    2023 IEEE 13TH ANNUAL COMPUTING AND COMMUNICATION WORKSHOP AND CONFERENCE, CCWC, 2023 : 351 - 360
  • [7] Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
    Kritharoula, Anastasia
    Lymperaiou, Maria
    Stamou, Giorgos
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023 : 13053 - 13077
  • [8] Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
    Ma, Chuofan
    Jiang, Yi
    Wu, Jiannan
    Yuan, Zehuan
    Qi, Xiaojuan
    COMPUTER VISION - ECCV 2024, PT VI, 2025, 15064 : 417 - 435
  • [9] Compositional Chain-of-Thought Prompting for Large Multimodal Models
    Mitra, Chancharik
    Huang, Brandon
    Darrell, Trevor
    Herzig, Roei
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024 : 14420 - 14431
  • [10] Prompting Large Language Models to Power Educational Chatbots
    Farah, Juan Carlos
    Ingram, Sandy
    Spaenlehauer, Basile
    Lasne, Fanny Kim-Lan
    Gillet, Denis
    ADVANCES IN WEB-BASED LEARNING, ICWL 2023, 2023, 14409 : 169 - 188