Visual Instruction Tuning

Cited by: 0
Authors
Liu, Haotian [1]
Li, Chunyuan [2]
Wu, Qingyang [3]
Lee, Yong Jae [1]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] Microsoft Res, Cambridge, MD USA
[3] Columbia Univ, Columbia, MD USA
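Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract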
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.
Pages: 25