Visual Instruction Tuning

Cited by: 0
Authors
Liu, Haotian [1]
Li, Chunyuan [2]
Wu, Qingyang [3]
Lee, Yong Jae [1]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] Microsoft Res, Cambridge, MD USA
[3] Columbia Univ, Columbia, MD USA
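Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract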
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.
Pages: 25