Visual Instruction Tuning

Cited by: 0
Authors
Liu, Haotian [1 ]
Li, Chunyuan [2 ]
Wu, Qingyang [3 ]
Lee, Yong Jae [1 ]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] Microsoft Res, Redmond, WA USA
[3] Columbia Univ, New York, NY USA
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.
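The abstract describes LLaVA's core design: a vision encoder whose output features are mapped into the LLM's token-embedding space so that image and text tokens can be processed jointly by the language model. A minimal NumPy sketch of that connection is shown below; the dimensions, the random `W_proj` matrix, and the function name are illustrative stand-ins (the actual model uses a pretrained CLIP vision encoder, a learned projection, and a Vicuna LLM):

```python
import numpy as np

# Hypothetical dimensions: 256 image-patch features of size 1024,
# projected into a 4096-dim LLM token-embedding space.
VISION_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 256

rng = np.random.default_rng(0)
# Stand-in for the projection that LLaVA learns during training.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def image_to_llm_tokens(patch_features: np.ndarray) -> np.ndarray:
    """Map vision-encoder patch features to 'visual tokens' that
    live in the same space as the LLM's text token embeddings."""
    return patch_features @ W_proj  # shape: (num_patches, LLM_DIM)

# Simulated vision-encoder output and text embeddings.
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))
visual_tokens = image_to_llm_tokens(patch_features)
text_tokens = rng.standard_normal((16, LLM_DIM))

# The LLM then consumes visual and text tokens as one sequence.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (272, 4096)
```

The design choice this illustrates is that the vision side only needs a lightweight adapter into the LLM's embedding space; the instruction-following behavior itself comes from tuning on the GPT-4-generated data described above.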
Pages: 25
Related Papers
50 items in total
  • [1] Visual Instruction Tuning with Polite Flamingo
    Chen, Delong
    Liu, Jianfeng
    Dai, Wenliang
    Wang, Baoyuan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17745 - 17753
  • [2] EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
    Xie, Hongxia
    Peng, Chu-Jun
    Tseng, Yu-Wen
    Chen, Hung-Jen
    Hsu, Chan-Feng
    Shuai, Hong-Han
    Cheng, Wen-Huang
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26586 - 26595
  • [3] Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
    Li, Yanda
    Zhang, Chi
    Yu, Gang
    Yang, Wanqi
    Wang, Zhibin
    Fu, Bin
    Lin, Guosheng
    Shen, Chunhua
    Chen, Ling
    Wei, Yunchao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 14512 - 14531
  • [4] Instruction Tuning-Free Visual Token Complement for Multimodal LLMs
    Wang, Dongsheng
    Cui, Jiequan
    Li, Miaoge
    Lin, Wang
    Chen, Bo
    Zhang, Hanwang
    COMPUTER VISION - ECCV 2024, PT LXXXI, 2025, 15139 : 446 - 462
  • [5] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
    Lai, Bolin
    Dai, Xiaoliang
    Chen, Lawrence
    Pang, Guan
    Rehg, James M.
    Liu, Miao
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 135 - 155
  • [6] On the Exploitability of Instruction Tuning
    Shu, Manli
    Wang, Jiongxiao
    Zhu, Chen
    Geiping, Jonas
    Xiao, Chaowei
    Goldstein, Tom
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] VISUAL INSTRUCTION
    Skinner, Charles Edward
    SCHOOL AND SOCIETY, 1924, 19 (478): : 227 - 230
  • [8] Visual Tuning
    Yu, Bruce X. B.
    Chang, Jianlong
    Wang, Haixin
    Liu, Lingbo
    Wang, Shijie
    Wang, Zhiyu
    Lin, Junfan
    Xie, Lingxi
    Li, Haojie
    Lin, Zhouchen
    Tian, Qi
    Chen, Chang Wen
    ACM COMPUTING SURVEYS, 2024, 56 (12)
  • [9] Organisation for visual instruction
    Kimmins, CW
    NATURE, 1922, 109 : 617 - 618
  • [10] A Handbook of Visual Instruction
    Anonymous
    EDUCATION, 1934, 55 (02): : 125 - 125