Instruction Tuning-Free Visual Token Complement for Multimodal LLMs

Cited by: 0

Authors:

Wang, Dongsheng [1]

Cui, Jiequan [2]

Li, Miaoge [3]

Lin, Wang [4]

Chen, Bo [5]

Zhang, Hanwang [2]

Affiliations:

[1] Shenzhen Univ, Shenzhen 518052, Peoples R China

[2] Nanyang Technol Univ, 50 Nanyang Ave, Singapore 639798, Singapore

[3] Hong Kong Polytech Univ, Hung Hom, Kowloon, Hong Kong, Peoples R China

[4] Zhejiang Univ, Hangzhou 310058, Peoples R China

[5] Xidian Univ, Xian 710126, Shaanxi, Peoples R China

Source:

COMPUTER VISION - ECCV 2024, PT LXXXI | 2025 / Vol. 15139

Funding:

National Natural Science Foundation of China

DOI:

10.1007/978-3-031-73004-7_26

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

As the open community of large language models (LLMs) matures, multimodal LLMs (MLLMs) have promised an elegant bridge between vision and language. However, current research is inherently constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To this end, we propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features and thus improve response accuracy. Specifically, our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens to enrich the original visual input. Moreover, an iterative strategy is further designed to extract more visual information by iteratively using the visual selector without any additional training. Notably, the training pipeline requires no additional image-text pairs, resulting in a desired instruction tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.

Pages: 446-462

Page count: 17
