Comparing Vision-Capable Models, GPT-4 and Gemini, With GPT-3.5 on Taiwan's Pulmonologist Exam

Cited by: 1
Authors
Chen, Chih-Hsiung [1]
Hsieh, Kuang-Yu [1]
Huang, Kuo-En [1]
Lai, Hsien-Yun [2]
Affiliations
[1] Mennonite Christian Hosp, Dept Crit Care Med, Hualien, Taiwan
[2] Mennonite Christian Hosp, Dept Educ & Res, Hualien, Taiwan
Keywords
vision feature; pulmonologist exam; Gemini; GPT; large language models; artificial intelligence
DOI
10.7759/cureus.67641
Chinese Library Classification (CLC)
R5 [Internal Medicine]
Discipline Classification Codes
1002; 100201
Abstract
Introduction: The latest generation of large language models (LLMs) features multimodal capabilities, allowing them to interpret graphics, images, and videos, which is crucial in medical fields. This study investigates the vision capabilities of the next-generation Generative Pre-trained Transformer 4 (GPT-4) and Google's Gemini.
Methods: To establish a comparative baseline, we used GPT-3.5, a model limited to text processing, and evaluated the performance of all three models on questions from the Taiwan Specialist Board Exams in Pulmonary and Critical Care Medicine. Our dataset comprised 1,100 questions from 2013 to 2023, 100 per year. Of these, 1,059 were pure text and 41 were text with images; the majority were in a non-English language, with only six in pure English.
Results: On each annual 100-question exam from 2013 to 2023, GPT-4 scored 66, 69, 51, 64, 72, 64, 66, 64, 63, 68, and 67, respectively. Gemini scored 45, 48, 45, 45, 46, 59, 54, 41, 53, 45, and 45, while GPT-3.5 scored 39, 33, 35, 36, 32, 33, 43, 28, 32, 33, and 36.
Conclusions: These results show that the newer, vision-capable LLMs significantly outperform the text-only model. With a passing score set at 60, GPT-4 passed most exams and approached human performance.
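The record does not describe the authors' evaluation harness. As a rough illustration of how a text-plus-image exam item can be submitted to a vision-capable model, here is a minimal sketch assuming the OpenAI Python SDK; the model identifier, prompt, and image URL are placeholders, not details from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One multiple-choice item with an attached figure (both placeholders).
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative vision-capable model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Answer this board-exam question with a single letter (A-D): ..."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/question-figure.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

The summary statistics implied by the per-year scores can be checked with a few lines of Python; the score lists are copied from the abstract, while the variable names and pass/fail logic are ours.

```python
# Per-year scores for the 2013-2023 exams, as reported in the abstract.
SCORES = {
    "GPT-4":   [66, 69, 51, 64, 72, 64, 66, 64, 63, 68, 67],
    "Gemini":  [45, 48, 45, 45, 46, 59, 54, 41, 53, 45, 45],
    "GPT-3.5": [39, 33, 35, 36, 32, 33, 43, 28, 32, 33, 36],
}
PASS_MARK = 60  # passing threshold used in the abstract

for model, scores in SCORES.items():
    mean = sum(scores) / len(scores)
    passed = sum(s >= PASS_MARK for s in scores)
    print(f"{model}: mean {mean:.1f}, passed {passed}/{len(scores)} exams")

# Prints:
#   GPT-4: mean 64.9, passed 10/11 exams
#   Gemini: mean 47.8, passed 0/11 exams
#   GPT-3.5: mean 34.5, passed 0/11 exams
```

These figures match the conclusion: only GPT-4 clears the 60-point bar, and it does so in 10 of 11 years (all but 2015).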
Pages: 9