Instruction Tuning-Free Visual Token Complement for Multimodal LLMs

Authors
Wang, Dongsheng [1]
Cui, Jiequan [2]
Li, Miaoge [3]
Lin, Wang [4]
Chen, Bo [5]
Zhang, Hanwang [2]
Affiliations
[1] Shenzhen Univ, Shenzhen 518052, Peoples R China
[2] Nanyang Technol Univ, 50 Nanyang Ave, Singapore 639798, Singapore
[3] Hong Kong Polytech Univ, Hung Hom, Kowloon, Hong Kong, Peoples R China
[4] Zhejiang Univ, Hangzhou 310058, Peoples R China
[5] Xidian Univ, Xian 710126, Shaanxi, Peoples R China
Source
Funding
National Natural Science Foundation of China
Keywords
DOI
10.1007/978-3-031-73004-7_26
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.
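The iterative strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `iterative_complement` and `toy_selector`, the list-of-floats token representation, and the mean-based selector are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of reusing a fixed selector several times, without any training step, to append complementary tokens to the original visual input.

```python
from typing import Callable, List

Token = List[float]


def iterative_complement(
    visual_tokens: List[Token],
    selector: Callable[[List[Token]], List[Token]],
    num_iters: int = 2,
) -> List[Token]:
    """Apply a fixed selector repeatedly and append the tokens it returns.

    Assumption: the selector is already trained and is only called here,
    so no parameters are updated inside the loop.
    """
    enriched = list(visual_tokens)  # copy so the caller's list is untouched
    for _ in range(num_iters):
        complement = selector(enriched)
        enriched.extend(complement)
    return enriched


def toy_selector(tokens: List[Token]) -> List[Token]:
    """Hypothetical stand-in selector: returns one token, the mean of the input."""
    dim = len(tokens[0])
    count = len(tokens)
    return [[sum(t[i] for t in tokens) / count for i in range(dim)]]


tokens = [[1.0, 0.0], [0.0, 1.0]]
out = iterative_complement(tokens, toy_selector, num_iters=2)
print(len(out))  # two original tokens plus one complementary token per iteration
```

In a real system the selector would be a learned module and the tokens would be embedding tensors; the loop structure is the only part this sketch is meant to convey.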
Pages: 446-462
Page count: 17