GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination

Cited by: 14
Authors
Hirano, Yuichiro [1 ,5 ]
Hanaoka, Shouhei [5 ]
Nakao, Takahiro [2 ]
Miki, Soichiro [2 ]
Kikuchi, Tomohiro [2 ,3 ]
Nakamura, Yuta [2 ]
Nomura, Yukihiro [2 ,4 ]
Yoshikawa, Takeharu [2 ]
Abe, Osamu [5 ]
Affiliations
[1] Int Univ Hlth & Welf, Narita Hosp, Dept Radiol, 852 Hatakeda, Narita, Chiba, Japan
[2] Univ Tokyo Hosp, Dept Computat Diagnost Radiol & Prevent Med, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
[3] Jichi Med Univ, Sch Med, Dept Radiol, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
[4] Chiba Univ, Ctr Frontier Med Engn, 1-33 Yayoicho,Inage Ku, Chiba, Japan
[5] Univ Tokyo Hosp, Dept Radiol, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
Keywords
Artificial intelligence (AI); Large language model (LLM); ChatGPT; GPT-4 Turbo; GPT-4 Turbo with Vision; Japan Diagnostic Radiology Board Examination (JDRBE)
DOI
10.1007/s11604-024-01561-z
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification codes
1002; 100207; 1009
Abstract
Purpose: To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. Questions were excluded if they lacked associated images, if no unanimous agreement on the answer was reached, or if they included images rejected by the OpenAI application programming interface. The inputs for GPT-4TV included both text and images, whereas those for GPT-4 T were text only. Both models were run on the dataset, and their accuracy was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were compared between the models using Wilcoxon's signed-rank test.
Results: The dataset comprised 139 questions. GPT-4TV answered 62 questions (45%) correctly, whereas GPT-4 T answered 57 (41%) correctly. Statistical analysis found no significant difference in performance between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses.
Conclusion: No significant improvement in accuracy was observed when GPT-4TV was given image input compared with text-only GPT-4 T on JDRBE questions.
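As a rough illustration of the analysis described in the abstract, the following Python sketch pairs each question's correctness for the two models and applies McNemar's exact test, then compares per-question legitimacy scores with Wilcoxon's signed-rank test. This is not the authors' code: the arrays hold hypothetical placeholder values, and the availability of scipy and statsmodels is assumed.

```python
# Minimal sketch (not the study's actual code) of the statistical comparison
# described in the abstract. All numbers below are hypothetical placeholders.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Paired correctness per question (1 = correct, 0 = incorrect) -- hypothetical.
gpt4tv_correct = np.array([1, 0, 1, 1, 0, 0, 1, 0])
gpt4t_correct = np.array([1, 1, 0, 1, 0, 0, 0, 0])

# 2x2 contingency table of agreement/disagreement between the two models.
both = np.sum((gpt4tv_correct == 1) & (gpt4t_correct == 1))
tv_only = np.sum((gpt4tv_correct == 1) & (gpt4t_correct == 0))
t_only = np.sum((gpt4tv_correct == 0) & (gpt4t_correct == 1))
neither = np.sum((gpt4tv_correct == 0) & (gpt4t_correct == 0))
table = [[both, tv_only], [t_only, neither]]

# Exact (binomial) McNemar test on the discordant pairs, as used for accuracy.
res = mcnemar(table, exact=True)
print("McNemar exact P =", res.pvalue)

# Five-point Likert legitimacy scores from one radiologist -- hypothetical.
scores_tv = np.array([3, 2, 4, 1, 3, 2, 5, 2])
scores_t = np.array([4, 3, 4, 2, 3, 3, 5, 3])
w = wilcoxon(scores_tv, scores_t)
print("Wilcoxon signed-rank P =", w.pvalue)
```

McNemar's test is the natural choice here because both models answer the same questions, so only the discordant pairs (questions one model gets right and the other gets wrong) carry information about a difference in accuracy.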
Pages: 918-926
Number of pages: 9
Related papers
50 records in total
  • [32] Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination
    Rosoł, Maciej
    Gąsior, Jakub S.
    Łaba, Jonasz
    Korzeniewski, Kacper
    Młyńczak, Marcel
    SCIENTIFIC REPORTS, 2023, 13 (01)
  • [33] GPT-4 as a Board-Certified Surgeon: A Pilot Study
    Roshal, Joshua A.
    Silvestri, Caitlin
    Sathe, Tejas
    Townsend, Courtney
    Klimberg, V. Suzanne
    Perez, Alexander
    MEDICAL SCIENCE EDUCATOR, 2025,
  • [34] An exploratory assessment of GPT-4o and GPT-4 performance on the Japanese National Dental Examination
    Morishita, Masaki
    Fukuda, Hikaru
    Yamaguchi, Shino
    Muraoka, Kosuke
    Nakamura, Taiji
    Hayashi, Masanari
    Yoshioka, Izumi
    Ono, Kentaro
    Awano, Shuji
    SAUDI DENTAL JOURNAL, 2024, 36 (12) : 1577 - 1581
  • [36] Letter to the editor response to "ChatGPT, GPT-4, and bard and official board examination: comment"
    Harigai, Ayaka
    Toyama, Yoshitaka
    Takase, Kei
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (02) : 214 - 215
  • [37] Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases
    Li, David
    Gupta, Kartik
    Bhaduri, Mousumi
    Sathiadoss, Paul
    Bhatnagar, Sahir
    Chong, Jaron
    RADIOLOGY, 2024, 310 (01)
  • [38] ChatGPT in radiology structured reporting: analysis of ChatGPT-3.5 Turbo and GPT-4 in reducing word count and recalling findings
    Mallio, Carlo A.
    Bernetti, Caterina
    Sertorio, Andrea C.
    Zobel, Bruno Beomonte
    QUANTITATIVE IMAGING IN MEDICINE AND SURGERY, 2024, 14 (02)
  • [39] Applying GPT-4 to the plastic surgery inservice training examination
    Zhao, Jiuli
    Du, Hong
    JOURNAL OF PLASTIC RECONSTRUCTIVE AND AESTHETIC SURGERY, 2024, 91 : 225 - 226
  • [40] Applying GPT-4 to the Plastic Surgery Inservice Training Examination
    Gupta, Rohun
    Park, John B.
    Herzog, Isabel
    Yosufi, Nahid
    Mangan, Amelia
    Firouzbakht, Peter K.
    Mailey, Brian A.
    JOURNAL OF PLASTIC RECONSTRUCTIVE AND AESTHETIC SURGERY, 2023, 87 : 78 - 82