How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Cited by: 1
Authors
Chen, Zhe [1,2]
Wang, Weiyun [2,5]
Tian, Hao [3]
Ye, Shenglong [2]
Gao, Zhangwei [2]
Cui, Erfei [2]
Tong, Wenwen [3]
Hu, Kongzhi [3]
Luo, Jiapeng [3]
Ma, Zheng [3]
Ma, Ji [3]
Wang, Jiaqi [2]
Dong, Xiaoyi [2,6]
Yan, Hang [2]
Guo, Hewei [3]
He, Conghui [2]
Shi, Botian [2]
Jin, Zhenjiang [2]
Xu, Chao [2]
Wang, Bin [2]
Wei, Xingjian [2]
Li, Wei [2]
Zhang, Wenjian [2]
Zhang, Bo [2]
Cai, Pinlong [2]
Wen, Licheng [2]
Yan, Xiangchao [2]
Dou, Min [2]
Lu, Lewei [3]
Zhu, Xizhou [2,3,4]
Lu, Tong [1]
Lin, Dahua [2,6]
Qiao, Yu [2]
Dai, Jifeng [2,4]
Wang, Wenhai [2,6]
Affiliations
[1] Nanjing Univ, State Key Lab Novel Software Technol, Nanjing 210023, Peoples R China
[2] Shanghai AI Lab, Shanghai 200232, Peoples R China
[3] SenseTime Res, Shanghai 200233, Peoples R China
[4] Tsinghua Univ, Dept Elect Engn, Beijing 100084, Peoples R China
[5] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
[6] Chinese Univ Hong Kong, Dept Informat Engn, Hong Kong 999077, Peoples R China
Funding
National Key R&D Program of China; National Natural Science Foundation of China;
Keywords
multimodal model; open-source; vision encoder; dynamic resolution; bilingual dataset;
DOI
10.1007/s11432-024-4231-5
Chinese Library Classification
TP [automation technology, computer technology];
Discipline code
0812;
Abstract
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements. (1) Strong vision encoder: we explore a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic high-resolution: we divide input images into 1 to 40 tiles of 448x448 pixels according to their aspect ratio and resolution, supporting inputs up to 4K resolution. (3) High-quality bilingual dataset: we carefully collect a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on optical character recognition (OCR) and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared with both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 multimodal benchmarks. Code and models are available at https://github.com/OpenGVLab/InternVL.
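The dynamic high-resolution scheme described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' exact implementation: the function names and the tie-breaking rule (prefer a tile count close to the image's area-based budget) are assumptions; only the 448x448 tile size and the 1-40 tile range come from the abstract.

```python
from itertools import product

TILE = 448       # tile side length used by InternVL 1.5
MAX_TILES = 40   # upper bound on tiles per image

def choose_grid(width, height, max_tiles=MAX_TILES):
    """Pick a (cols, rows) tiling whose aspect ratio best matches the image."""
    # Enumerate every grid with at most max_tiles tiles in total.
    grids = [(c, r) for c, r in product(range(1, max_tiles + 1), repeat=2)
             if c * r <= max_tiles]
    image_ratio = width / height
    # Roughly how many 448x448 tiles the image "deserves" by area
    # (assumed tie-breaker, not stated in the abstract).
    target = max(1, round(width * height / (TILE * TILE)))
    # Prefer the grid whose aspect ratio matches best; break ties by
    # staying close to the area-based tile budget.
    return min(grids, key=lambda g: (abs(g[0] / g[1] - image_ratio),
                                     abs(g[0] * g[1] - target)))

def tile_count(width, height):
    cols, rows = choose_grid(width, height)
    return cols * rows
```

For example, a 1792x448 image (4:1) would be resized to a 4x1 grid of 448x448 tiles, while a 896x896 image would map to a 2x2 grid.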
Pages: 18
Related papers
5 records
  • [1] Chen Z, Wang W, Tian H, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China (Information Sciences), 2024, 67(12): 5-22
  • [2] Perlis R, Collins N. Can open-source AI models diagnose complex cases as well as GPT-4? JAMA - Journal of the American Medical Association, 2025
  • [3] Zhang G, Jin Q, Zhou Y, et al. Closing the gap between open source and commercial large language models for medical evidence summarization. NPJ Digital Medicine, 2024, 7(1)
  • [4] Angeloni M, Rizzi D, Schoen S, et al. Closing the gap in the clinical adoption of computational pathology: an open-source workflow for the integration of deep-learning models into the laboratory information system. Virchows Archiv, 2024, 485: S113
  • [5] Li D, Gupta K, Bhaduri M, et al. Comparative diagnostic accuracy of GPT-4o and LLaMA 3-70b: proprietary vs. open-source large language models in radiology. Clinical Imaging, 2025, 118