GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination

Cited by: 14
|
Authors
Hirano, Yuichiro [1 ,5 ]
Hanaoka, Shouhei [5 ]
Nakao, Takahiro [2 ]
Miki, Soichiro [2 ]
Kikuchi, Tomohiro [2 ,3 ]
Nakamura, Yuta [2 ]
Nomura, Yukihiro [2 ,4 ]
Yoshikawa, Takeharu [2 ]
Abe, Osamu [5 ]
Affiliations
[1] Int Univ Hlth & Welf, Narita Hosp, Dept Radiol, 852 Hatakeda, Narita, Chiba, Japan
[2] Univ Tokyo Hosp, Dept Computat Diagnost Radiol & Prevent Med, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
[3] Jichi Med Univ, Sch Med, Dept Radiol, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
[4] Chiba Univ, Ctr Frontier Med Engn, 1-33 Yayoicho,Inage Ku, Chiba, Japan
[5] Univ Tokyo Hosp, Dept Radiol, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
Keywords
Artificial intelligence (AI); Large language model (LLM); ChatGPT; GPT-4 Turbo; GPT-4 Turbo with Vision; Japan Diagnostic Radiology Board Examination (JDRBE)
DOI
10.1007/s11604-024-01561-z
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline codes
1002; 100207; 1009
Abstract
Purpose: To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021 and 2023. Six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. Questions were excluded if they lacked associated images, if no unanimous agreement on the answer was reached, or if their images were rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were text only. Both models were run on the dataset, and their accuracy was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were compared between models using Wilcoxon's signed-rank test.
Results: The dataset comprised 139 questions. GPT-4TV answered 62 questions correctly (45%), whereas GPT-4 T answered 57 correctly (41%). No significant difference in accuracy was found between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses.
Conclusion: Using GPT-4TV with image input did not significantly improve accuracy over text-only GPT-4 T on JDRBE questions.
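The paired comparisons described in the abstract (McNemar's exact test for per-question accuracy and Wilcoxon's signed-rank test for Likert legitimacy scores) can be reproduced with standard Python libraries. The sketch below is illustrative only: the per-question outcomes and per-response scores are not given in this record, so the contingency table and score vectors are assumed placeholder values, not the study's actual data.

```python
# Hedged sketch of the study's paired statistical comparisons.
# Contingency counts and Likert scores below are illustrative assumptions.
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import wilcoxon

# 2x2 table of paired correctness over the 139 questions:
# rows = GPT-4TV (correct, incorrect), columns = GPT-4 T (correct, incorrect).
# Only the marginals (62 vs. 57 correct) are reported; this off-diagonal
# split is made up for demonstration.
table = [[40, 22],   # GPT-4TV correct
         [17, 60]]   # GPT-4TV incorrect
result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"McNemar exact P = {result.pvalue:.3f}")

# Paired five-point Likert legitimacy scores from one radiologist
# (GPT-4TV vs. GPT-4 T on the same questions); values are placeholders.
scores_gpt4tv = [3, 2, 4, 1, 5, 2, 4, 3]
scores_gpt4t  = [4, 3, 4, 3, 5, 4, 5, 3]
stat, p = wilcoxon(scores_gpt4tv, scores_gpt4t)
print(f"Wilcoxon signed-rank P = {p:.3f}")
```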
Pages: 918-926
Number of pages: 9
Related articles (50 in total)
  • [21] The performance of the multimodal large language model GPT-4 on the European board of radiology examination sample test
    Besler, Muhammed Said
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (08) : 927 - 927
  • [22] Response accuracy of GPT-4 across languages: insights from an expert-level diagnostic radiology examination in Japan
    Harigai, Ayaka
    Toyama, Yoshitaka
    Nagano, Mitsutoshi
    Abe, Mirei
    Kawabata, Masahiro
    Li, Li
    Yamamura, Jin
    Takase, Kei
    JAPANESE JOURNAL OF RADIOLOGY, 2025, 43 (02) : 319 - 329
  • [23] Toward Improved Radiologic Diagnostics: Investigating the Utility and Limitations of GPT-3.5 Turbo and GPT-4 with Quiz Cases
    Kikuchi, Tomohiro
    Nakao, Takahiro
    Nakamura, Yuta
    Hanaoka, Shouhei
    Mori, Harushi
    Yoshikawa, Takeharu
    AMERICAN JOURNAL OF NEURORADIOLOGY, 2024, 45 (10) : 1506 - 1511
  • [24] GPT-4 and plastic surgery inservice training examination
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    JOURNAL OF PLASTIC RECONSTRUCTIVE AND AESTHETIC SURGERY, 2024, 88 : 71 - 72
  • [25] Experiences with Remote Examination Formats in Light of GPT-4
    Dobslaw, Felix
    Bergh, Peter
    PROCEEDINGS OF THE 5TH EUROPEAN CONFERENCE ON SOFTWARE ENGINEERING EDUCATION, ECSEE 2023, 2023, : 220 - 225
  • [26] Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations
    Ali, Rohaid
    Tang, Oliver Y.
    Connolly, Ian D.
    Sullivan, Patricia L. Zadnik
    Shin, John H.
    Fridley, Jared S.
    Asaad, Wael F.
    Cielo, Deus
    Oyelese, Adetokunbo A.
    Doberstein, Curtis E.
    Gokaslan, Ziya L.
    Telfeian, Albert E.
    NEUROSURGERY, 2023, 93 (06) : 1353 - 1365
  • [27] GPT-4 Vision: Multi-Modal Evolution of ChatGPT and Potential Role in Radiology
    Javan, Ramin
    Kim, Theodore
    Mostaghni, Navid
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (08)
  • [28] Performance of GPT-4 Vision on kidney pathology exam questions
    Miao, Jing
    Thongprayoon, Charat
    Cheungpasitporn, Wisit
    Cornell, Lynn D.
    AMERICAN JOURNAL OF CLINICAL PATHOLOGY, 2024, 162 (03) : 220 - 226
  • [29] Performance of GPT-4 Vision on kidney pathology exam questions
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    AMERICAN JOURNAL OF CLINICAL PATHOLOGY, 2024,
  • [30] Assessing the Performance of GPT-3.5 and GPT-4 on the 2023 Japanese Nursing Examination
    Kaneda, Yudai
    Takahashi, Ryo
    Kaneda, Uiri
    Akashima, Shiori
    Okita, Haruna
    Misaki, Sadaya
    Yamashiro, Akimi
    Ozaki, Akihiko
    Tanimoto, Tetsuya
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (08)