GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination

Cited by: 14
Authors
Hirano, Yuichiro [1 ,5 ]
Hanaoka, Shouhei [5 ]
Nakao, Takahiro [2 ]
Miki, Soichiro [2 ]
Kikuchi, Tomohiro [2 ,3 ]
Nakamura, Yuta [2 ]
Nomura, Yukihiro [2 ,4 ]
Yoshikawa, Takeharu [2 ]
Abe, Osamu [5 ]
Affiliations
[1] Int Univ Hlth & Welf, Narita Hosp, Dept Radiol, 852 Hatakeda, Narita, Chiba, Japan
[2] Univ Tokyo Hosp, Dept Computat Diagnost Radiol & Prevent Med, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
[3] Jichi Med Univ, Sch Med, Dept Radiol, 3311-1 Yakushiji, Shimotsuke, Tochigi, Japan
[4] Chiba Univ, Ctr Frontier Med Engn, 1-33 Yayoicho,Inage Ku, Chiba, Japan
[5] Univ Tokyo Hosp, Dept Radiol, 7-3-1 Hongo,Bunkyo Ku, Tokyo, Japan
Keywords
Artificial intelligence (AI); Large language model (LLM); ChatGPT; GPT-4 Turbo; GPT-4 Turbo with Vision; Japan Diagnostic Radiology Board Examination (JDRBE)
DOI
10.1007/s11604-024-01561-z
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline codes
1002; 100207; 1009
Abstract
Purpose: To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021 and 2023. Six board-certified diagnostic radiologists discussed the questions and established ground-truth answers, consulting the relevant literature as necessary. Questions were excluded if they lacked associated images, if the radiologists could not reach unanimous agreement on the answer, or if their images were rejected by the OpenAI application programming interface. GPT-4TV received both text and images as input, whereas GPT-4 T received text only. Both models were run on the dataset, and their accuracy was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists, who assigned legitimacy scores on a five-point Likert scale; these scores were compared between the models using Wilcoxon's signed-rank test.
Results: The dataset comprised 139 questions. GPT-4TV answered 62 questions correctly (45%), and GPT-4 T answered 57 correctly (41%); the difference was not statistically significant (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores from both radiologists than the GPT-4 T responses.
Conclusion: Adding image input with GPT-4TV yielded no significant improvement in accuracy over text-only GPT-4 T on JDRBE questions.
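The statistical comparison described in the abstract reduces to two paired tests. The following is a minimal sketch in Python, not the authors' code: the per-question outcomes and Likert scores are hypothetical placeholders, and the test functions come from scipy and statsmodels on the assumption that these libraries are acceptable stand-ins for whatever software the study actually used.

    from scipy.stats import wilcoxon
    from statsmodels.stats.contingency_tables import mcnemar

    # Hypothetical per-question outcomes (1 = correct, 0 = incorrect),
    # aligned so index i refers to the same JDRBE question for both models.
    gpt4tv = [1, 0, 1, 1, 0, 0, 1, 0]
    gpt4t  = [1, 1, 0, 1, 0, 0, 1, 1]

    # Build the 2x2 paired contingency table:
    # rows index GPT-4TV (correct, incorrect); columns index GPT-4 T.
    table = [[0, 0], [0, 0]]
    for a, b in zip(gpt4tv, gpt4t):
        table[1 - a][1 - b] += 1

    # McNemar's exact test uses only the discordant cells
    # (questions that exactly one model answered correctly).
    print("McNemar exact P =", mcnemar(table, exact=True).pvalue)

    # Hypothetical paired legitimacy scores (five-point Likert) from one rater.
    tv_scores = [4, 3, 2, 5, 3, 2, 4, 1]
    t_scores  = [5, 4, 4, 5, 4, 3, 4, 3]

    # Wilcoxon's signed-rank test on the paired score differences.
    print("Wilcoxon signed-rank P =", wilcoxon(tv_scores, t_scores).pvalue)

Because McNemar's test conditions only on the discordant pairs, a paired accuracy comparison can be non-significant even when the raw percentages differ, as with the 45% versus 41% reported here.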
Pages: 918-926
Page count: 9
Related articles
50 records in total
  • [1] GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination: correspondence
    Kleebayoon, Amnuay
    Wiwanitkit, Viroj
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (10) : 1213 - 1213
  • [2] GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews
    Oami, Takehiko
    Okada, Yohei
    Nakada, Taka-aki
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [3] Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society
    Toyama, Yoshitaka
    Harigai, Ayaka
    Abe, Mirei
    Nagano, Mitsutoshi
    Kawabata, Masahiro
    Seki, Yasuhiro
    Takase, Kei
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (2) : 201 - 207
  • [4] Is GPT-4 a reliable rater? Evaluating consistency in GPT-4's text ratings
    Hackl, Veronika
    Mueller, Alexandra Elena
    Granitzer, Michael
    Sailer, Maximilian
    FRONTIERS IN EDUCATION, 2023, 8
  • [5] Exploring the Boundaries of GPT-4 in Radiology
    Liu, Qianchu
    Hyland, Stephanie L.
    Bannur, Shruthi
    Bouzid, Kenza
    Castro, Daniel C.
    Wetscherek, Maria Teodora
    Tinn, Robert
    Sharma, Harshita
    Perez-Garcia, Fernando
    Schwaighofer, Anton
    Rajpurkar, Pranav
    Khanna, Sameer Tajdin
    Poon, Hoifung
    Usuyama, Naoto
    Thieme, Anja
    Nori, Aditya
    Lungren, Matthew P.
    Oktay, Ozan
    Alvarez-Valle, Javier
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 14414 - 14445
  • [6] Using GPT-4 Turbo To Automatically Identify Defeaters In Assurance Cases
    Shahandashti, Kimya Khakzad
    Belle, Alvine Boaye
    Mohajer, Mohammad Mahdi
    Odu, Oluwafemi
    Lethbridge, Timothy C.
    Hemmati, Hadi
    Wang, Song
    32ND INTERNATIONAL REQUIREMENTS ENGINEERING CONFERENCE WORKSHOPS, REW 2024, 2024, : 46 - 56
  • [7] Assessing the Impact of GPT-4 Turbo in Generating Defeaters for Assurance Cases
    Shahandashti, Kimya Khakzad
    Sivakumar, Mithila
    Mohajer, Mohammad Mahdi
    Belle, Alvine B.
    Wang, Song
    Lethbridge, Timothy C.
    PROCEEDINGS 2024 IEEE/ACM FIRST INTERNATIONAL CONFERENCE ON AI FOUNDATION MODELS AND SOFTWARE ENGINEERING, FORGE 2024, 2024, : 52 - 56
  • [8] GPT-4 in Radiology: Improvements in Advanced Reasoning
    Bhayana, Rajesh
    Bleakney, Robert R.
    Krishna, Satheesh
    RADIOLOGY, 2023, 307 (05)
  • [9] ChatGPT, GPT-4, and Bard and official board examination: comment
    Daungsupawong, Hinpetch
    Wiwanitkit, Viroj
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (02) : 212 - 213