Instruction Tuning-Free Visual Token Complement for Multimodal LLMs

Times Cited: 0
Authors
Wang, Dongsheng [1 ]
Cui, Jiequan [2 ]
Li, Miaoge [3 ]
Lin, Wang [4 ]
Chen, Bo [5 ]
Zhang, Hanwang [2 ]
Affiliations
[1] Shenzhen Univ, Shenzhen 518052, Peoples R China
[2] Nanyang Technol Univ, 50 Nanyang Ave, Singapore 639798, Singapore
[3] Hong Kong Polytech Univ, Hung Hom, Kowloon, Hong Kong, Peoples R China
[4] Zhejiang Univ, Hangzhou 310058, Peoples R China
[5] Xidian Univ, Xian 710126, Shaanxi, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1007/978-3-031-73004-7_26
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
As the open community around large language models (LLMs) matures, multimodal LLMs (MLLMs) promise an elegant bridge between vision and language. However, current research is constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To address this, we propose a Visual Token Complement (VTC) framework that helps MLLMs regain the missing visual features, thereby improving response accuracy. Specifically, VTC integrates text-to-image generation as a guide for identifying text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens that enrich the original visual input. Moreover, an iterative strategy is designed to extract more visual information by applying the visual selector repeatedly, without any additional training. Notably, the training pipeline requires no additional image-text pairs, yielding the desired instruction-tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.
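The abstract describes the mechanism only at a high level. The sketch below is a minimal, hypothetical Python/PyTorch illustration of the core idea: a visual selector scores patch features, treats the lowest-scoring (text-irrelevant) ones as complements, and appends them to the visual token sequence over several training-free rounds. All names (VisualSelector, complement_visual_tokens), shapes, and the linear scoring scheme are placeholder assumptions, not the authors' implementation; in particular, the text-to-image generation guidance used to identify text-irrelevant features is omitted.

import torch
import torch.nn as nn

class VisualSelector(nn.Module):
    # Hypothetical scorer: assigns each visual patch feature a text-relevance
    # score; low-scoring patches are treated as text-irrelevant candidates
    # for complementary visual tokens.
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, dim) -> (num_patches,) relevance scores
        return self.score(patch_feats).squeeze(-1)

def complement_visual_tokens(patch_feats, selector, base_tokens, rounds=2, k=8):
    # Training-free iterative enrichment: each round picks k not-yet-selected
    # patches with the lowest relevance scores and appends them to the visual
    # token sequence fed to the (frozen) MLLM.
    tokens, remaining = base_tokens, patch_feats
    for _ in range(rounds):
        if remaining.shape[0] < k:
            break
        scores = selector(remaining)
        idx = torch.topk(-scores, k=k).indices        # lowest scores first
        tokens = torch.cat([tokens, remaining[idx]], dim=0)
        keep = torch.ones(remaining.shape[0], dtype=torch.bool)
        keep[idx] = False
        remaining = remaining[keep]
    return tokens

# Illustrative usage with random features (dimensions are made up):
feats = torch.randn(256, 768)    # ViT patch features
base = torch.randn(32, 768)      # original projected visual tokens
enriched = complement_visual_tokens(feats, VisualSelector(768), base)
print(enriched.shape)            # torch.Size([48, 768])

Because the selector only reorders and appends existing features, the enrichment can be repeated at inference time without touching the MLLM weights, which matches the training-free iterative strategy the abstract mentions.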
Pages: 446-462
Page count: 17
Related Papers
50 records in total
  • [21] Multimodal Instruction Tuning with Conditional Mixture of LoRA
    Shen, Ying
    Xu, Zhiyang
    Wang, Qifan
    Cheng, Yu
    Yin, Wenpeng
    Huang, Lifu
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 637 - 648
  • [22] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
    Wu, Penghao
    Xie, Saining
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13084 - 13094
  • [23] Tuning-free method for parameter estimation under varying error characteristics
    Agarwal, Mukul
    OPTIMAL CONTROL APPLICATIONS AND METHODS, 20(04): 213-221
  • [24] Towards Tuning-Free Minimum-Volume Nonnegative Matrix Factorization
    Nguyen, Duc Toan
    Chi, Eric C.
    PROCEEDINGS OF THE 2024 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2024, : 217 - 225
  • [25] HONES: A Fast and Tuning-free Homotopy Method For Online Newton Step
    Ye, Yuting
    Ju, Cheng
    Lei, Lihua
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [26] Visual Instruction Tuning with Polite Flamingo
    Chen, Delong
    Liu, Jianfeng
    Dai, Wenliang
    Wang, Baoyuan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17745 - 17753
  • [28] TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation
    Mueller, Samuel G.
    Hutter, Frank
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 754 - 762
  • [29] Don't Fall for Tuning Parameters: Tuning-Free Variable Selection in High Dimensions With the TREX
    Lederer, Johannes
    Mueller, Christian L.
    PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 2729 - 2735
  • [30] Tuning-free controller to accurately regulate flow rates in a microfluidic network
    Heo, Young Jin
    Kang, Junsu
    Kim, Min Jun
    Chung, Wan Kyun
    SCIENTIFIC REPORTS, 2016, 6