Visual Instruction Tuning

Cited by: 0
Authors
Liu, Haotian [1 ]
Li, Chunyuan [2 ]
Wu, Qingyang [3 ]
Lee, Yong Jae [1 ]
Affiliations
[1] Univ Wisconsin Madison, Madison, WI 53706 USA
[2] Microsoft Res, Redmond, WA USA
[3] Columbia Univ, New York, NY USA
Keywords
DOI
N/A
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has been shown to improve zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. To facilitate future research on visual instruction following, we construct two evaluation benchmarks with diverse and challenging application-oriented tasks. Our experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model, and code publicly available.
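The abstract describes LLaVA's core design: a vision encoder whose output features are mapped into the LLM's token-embedding space so that image and text tokens can be processed jointly by the language model. A minimal NumPy sketch of that connection is shown below; the dimensions, the random `W_proj` matrix, and the function name are illustrative stand-ins (the actual model uses a pretrained CLIP vision encoder, a learned projection, and a Vicuna LLM):

```python
import numpy as np

# Hypothetical dimensions: 256 image-patch features of size 1024,
# projected into a 4096-dim LLM token-embedding space.
VISION_DIM, LLM_DIM, NUM_PATCHES = 1024, 4096, 256

rng = np.random.default_rng(0)
# Stand-in for the projection that LLaVA learns during training.
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02

def image_to_llm_tokens(patch_features: np.ndarray) -> np.ndarray:
    """Map vision-encoder patch features to 'visual tokens' that
    live in the same space as the LLM's text token embeddings."""
    return patch_features @ W_proj  # shape: (num_patches, LLM_DIM)

# Simulated vision-encoder output and text embeddings.
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))
visual_tokens = image_to_llm_tokens(patch_features)
text_tokens = rng.standard_normal((16, LLM_DIM))

# The LLM then consumes visual and text tokens as one sequence.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (272, 4096)
```

The design choice this illustrates is that the vision side only needs a lightweight adapter into the LLM's embedding space; the instruction-following behavior itself comes from tuning on the GPT-4-generated data described above.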
Pages: 25
Related Papers
50 items in total
  • [1] Visual Instruction Tuning with Polite Flamingo
    Chen, Delong
    Liu, Jianfeng
    Dai, Wenliang
    Wang, Baoyuan
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16, 2024, : 17745 - 17753
  • [2] EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning
    Xie, Hongxia
    Peng, Chu-Jun
    Tseng, Yu-Wen
    Chen, Hung-Jen
    Hsu, Chan-Feng
    Shuai, Hong-Han
    Cheng, Wen-Huang
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 26586 - 26595
  • [3] Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data
    Li, Yanda
    Zhang, Chi
    Yu, Gang
    Yang, Wanqi
    Wang, Zhibin
    Fu, Bin
    Lin, Guosheng
    Shen, Chunhua
    Chen, Ling
    Wei, Yunchao
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 14512 - 14531
  • [4] Instruction Tuning-Free Visual Token Complement for Multimodal LLMs
    Wang, Dongsheng
    Cui, Jiequan
    Li, Miaoge
    Lin, Wang
    Chen, Bo
    Zhang, Hanwang
    COMPUTER VISION - ECCV 2024, PT LXXXI, 2025, 15139 : 446 - 462
  • [5] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
    Lai, Bolin
    Dai, Xiaoliang
    Chen, Lawrence
    Pang, Guan
    Rehg, James M.
    Liu, Miao
    COMPUTER VISION - ECCV 2024, PT IX, 2025, 15067 : 135 - 155
  • [6] On the Exploitability of Instruction Tuning
    Shu, Manli
    Wang, Jiongxiao
    Zhu, Chen
    Geiping, Jonas
    Xiao, Chaowei
    Goldstein, Tom
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [7] VISUAL INSTRUCTION
    Skinner, Charles Edward
    SCHOOL AND SOCIETY, 1924, 19 (478): : 227 - 230
  • [8] Visual Tuning
    Yu, Bruce X. B.
    Chang, Jianlong
    Wang, Haixin
    Liu, Lingbo
    Wang, Shijie
    Wang, Zhiyu
    Lin, Junfan
    Xie, Lingxi
    Li, Haojie
    Lin, Zhouchen
    Tian, Qi
    Chen, Chang Wen
    ACM COMPUTING SURVEYS, 2024, 56 (12)
  • [9] Organisation for visual instruction
    Kimmins, CW
    NATURE, 1922, 109 : 617 - 618
  • [10] A Handbook of Visual Instruction
    Anonymous
    EDUCATION, 1934, 55 (02): : 125 - 125