How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Citations: 0
Authors
Zhe CHEN [1 ,2 ]
Weiyun WANG [3 ,2 ]
Hao TIAN [4 ]
Shenglong YE [2 ]
Zhangwei GAO [2 ]
Erfei CUI [2 ]
Wenwen TONG [4 ]
Kongzhi HU [4 ]
Jiapeng LUO [4 ]
Zheng MA [4 ]
Ji MA [4 ]
Jiaqi WANG [2 ]
Xiaoyi DONG [5 ,2 ]
Hang YAN [2 ]
Hewei GUO [4 ]
Conghui HE [2 ]
Botian SHI [2 ]
Zhenjiang JIN [2 ]
Chao XU [2 ]
Bin WANG [2 ]
Xingjian WEI [2 ]
Wei LI [2 ]
Wenjian ZHANG [2 ]
Bo ZHANG [2 ]
Pinlong CAI [2 ]
Licheng WEN [2 ]
Xiangchao YAN [2 ]
Min DOU [2 ]
Lewei LU [4 ]
Xizhou ZHU [6 ,2 ,4 ]
Tong LU [1 ]
Dahua LIN [5 ,2 ]
Yu QIAO [2 ]
Jifeng DAI [6 ,2 ]
Wenhai WANG [5 ,2 ]
Affiliations
[1] State Key Laboratory for Novel Software Technology, Nanjing University
[2] Shanghai AI Laboratory
[3] School of Computer Science, Fudan University
[4] SenseTime Research
[5] Department of Information Engineering, The Chinese University of Hong Kong
[6] Department of Electronic Engineering, Tsinghua University
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements. (1) Strong vision encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred to and reused across different LLMs. (2) Dynamic high-resolution: we divide images into tiles of 448×448 pixels, with the number of tiles ranging from 1 to 40 according to the aspect ratio and resolution of the input image, which supports inputs up to 4K resolution. (3) High-quality bilingual dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance in optical character recognition (OCR) and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary commercial models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 multimodal benchmarks. Code and models are available at https://github.com/OpenGVLab/InternVL.
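The dynamic high-resolution scheme in point (2) of the abstract can be sketched as follows. This is a hypothetical approximation, not the paper's exact implementation: it assumes the grid is chosen by minimizing the aspect-ratio mismatch between the image and a cols×rows tile layout (with 1 to 40 tiles of 448×448 pixels), breaking ties by matching the original pixel count. The function names `choose_grid` and `tile_boxes` are illustrative, not from the InternVL codebase.

```python
from itertools import product

TILE = 448              # tile side in pixels, as stated in the abstract
MIN_TILES, MAX_TILES = 1, 40

def choose_grid(width, height):
    """Pick a (cols, rows) tile grid whose aspect ratio best matches
    the input image, with cols * rows constrained to [1, 40]."""
    aspect = width / height
    candidates = [(c, r)
                  for c, r in product(range(1, MAX_TILES + 1), repeat=2)
                  if MIN_TILES <= c * r <= MAX_TILES]
    def score(grid):
        c, r = grid
        # primary: aspect-ratio mismatch; secondary: how far the
        # grid's pixel budget is from the original image area
        return (abs(aspect - c / r),
                abs(c * r * TILE * TILE - width * height))
    return min(candidates, key=score)

def tile_boxes(width, height):
    """Return the resize target (W, H) and the crop box of each
    448x448 tile, in row-major order."""
    cols, rows = choose_grid(width, height)
    W, H = cols * TILE, rows * TILE      # image is resized to this size
    boxes = [(x * TILE, y * TILE, (x + 1) * TILE, (y + 1) * TILE)
             for y in range(rows) for x in range(cols)]
    return (W, H), boxes
```

For a 896×448 input this selects a 2×1 grid (two tiles side by side); a larger or differently proportioned image would be resized to the nearest matching grid before being split into tiles.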
Pages: 5–22
Number of pages: 18