Instruction Tuning-Free Visual Token Complement for Multimodal LLMs

Times Cited: 0
Authors
Wang, Dongsheng [1 ]
Cui, Jiequan [2 ]
Li, Miaoge [3 ]
Lin, Wang [4 ]
Chen, Bo [5 ]
Zhang, Hanwang [2 ]
Affiliations
[1] Shenzhen Univ, Shenzhen 518052, Peoples R China
[2] Nanyang Technol Univ, 50 Nanyang Ave, Singapore 639798, Singapore
[3] Hong Kong Polytech Univ, Hung Hom, Kowloon, Hong Kong, Peoples R China
[4] Zhejiang Univ, Hangzhou 310058, Peoples R China
[5] Xidian Univ, Xian 710126, Shaanxi, Peoples R China
Source
Funding
National Natural Science Foundation of China;
Keywords
DOI
10.1007/978-3-031-73004-7_26
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
As the open community around large language models (LLMs) matures, multimodal LLMs (MLLMs) promise an elegant bridge between vision and language. However, current research is constrained by challenges such as the need for high-quality instruction pairs and the loss of visual information in image-to-text training objectives. To address this, we propose a Visual Token Complement (VTC) framework that helps MLLMs regain the missing visual features, thereby improving response accuracy. Specifically, VTC integrates text-to-image generation as a guide for identifying text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens that enrich the original visual input. Moreover, an iterative strategy is designed to extract more visual information by applying the visual selector repeatedly, without any additional training. Notably, the training pipeline requires no additional image-text pairs, yielding the desired instruction-tuning-free property. Both qualitative and quantitative experiments demonstrate the superiority and efficiency of our VTC.
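The abstract describes the mechanism only at a high level. The sketch below is a minimal, hypothetical Python/PyTorch illustration of the core idea: a visual selector scores patch features, treats the lowest-scoring (text-irrelevant) ones as complements, and appends them to the visual token sequence over several training-free rounds. All names (VisualSelector, complement_visual_tokens), shapes, and the linear scoring scheme are placeholder assumptions, not the authors' implementation; in particular, the text-to-image generation guidance used to identify text-irrelevant features is omitted.

import torch
import torch.nn as nn

class VisualSelector(nn.Module):
    # Hypothetical scorer: assigns each visual patch feature a text-relevance
    # score; low-scoring patches are treated as text-irrelevant candidates
    # for complementary visual tokens.
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (num_patches, dim) -> (num_patches,) relevance scores
        return self.score(patch_feats).squeeze(-1)

def complement_visual_tokens(patch_feats, selector, base_tokens, rounds=2, k=8):
    # Training-free iterative enrichment: each round picks k not-yet-selected
    # patches with the lowest relevance scores and appends them to the visual
    # token sequence fed to the (frozen) MLLM.
    tokens, remaining = base_tokens, patch_feats
    for _ in range(rounds):
        if remaining.shape[0] < k:
            break
        scores = selector(remaining)
        idx = torch.topk(-scores, k=k).indices        # lowest scores first
        tokens = torch.cat([tokens, remaining[idx]], dim=0)
        keep = torch.ones(remaining.shape[0], dtype=torch.bool)
        keep[idx] = False
        remaining = remaining[keep]
    return tokens

# Illustrative usage with random features (dimensions are made up):
feats = torch.randn(256, 768)    # ViT patch features
base = torch.randn(32, 768)      # original projected visual tokens
enriched = complement_visual_tokens(feats, VisualSelector(768), base)
print(enriched.shape)            # torch.Size([48, 768])

Because the selector only reorders and appends existing features, the enrichment can be repeated at inference time without touching the MLLM weights, which matches the training-free iterative strategy the abstract mentions.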
Pages: 446-462
Page count: 17
Related Papers
50 records in total
  • [21] Multimodal Instruction Tuning with Conditional Mixture of LoRA
    Shen, Ying
    Xu, Zhiyang
    Wang, Qifan
    Cheng, Yu
    Yin, Wenpeng
    Huang, Lifu
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 637 - 648
  • [22] V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
    Wu, Penghao
    Xie, Saining
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 13084 - 13094
  • [23] Tuning-free method for parameter estimation under varying error characteristics
    Agarwal, Mukul
    OPTIMAL CONTROL APPLICATIONS AND METHODS, 20(04): 213-221
  • [24] Towards Tuning-Free Minimum-Volume Nonnegative Matrix Factorization
    Nguyen, Duc Toan
    Chi, Eric C.
    PROCEEDINGS OF THE 2024 SIAM INTERNATIONAL CONFERENCE ON DATA MINING, SDM, 2024, : 217 - 225
  • [25] HONES: A Fast and Tuning-free Homotopy Method For Online Newton Step
    Ye, Yuting
    Ju, Cheng
    Lei, Lihua
    INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 84, 2018, 84
  • [26] Visual Instruction Tuning with Polite Flamingo
    Chen, Delong
    Liu, Jianfeng
    Dai, Wenliang
    Wang, Baoyuan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17745 - 17753
  • [28] TrivialAugment: Tuning-free Yet State-of-the-Art Data Augmentation
    Mueller, Samuel G.
    Hutter, Frank
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 754 - 762
  • [29] Don't Fall for Tuning Parameters: Tuning-Free Variable Selection in High Dimensions With the TREX
    Lederer, Johannes
    Mueller, Christian L.
    PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2015, : 2729 - 2735
  • [30] Tuning-free controller to accurately regulate flow rates in a microfluidic network
    Heo, Young Jin
    Kang, Junsu
    Kim, Min Jun
    Chung, Wan Kyun
    SCIENTIFIC REPORTS, 2016, 6