Visual Instruction Tuning

Cited by: 0

Authors:

Liu, Haotian [1]

Li, Chunyuan [2]

Wu, Qingyang [3]

Lee, Yong Jae [1]

Affiliations:

[1] Univ Wisconsin Madison, Madison, WI 53706 USA

[2] Microsoft Res, Cambridge, MD USA

[3] Columbia Univ, New York, NY USA

Source:

ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023) | 2023

Keywords:

DOI:

Not available

Chinese Library Classification (CLC):

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.


Pages: 25
