Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination

Cited by: 20
Authors
Krishna, Satheesh [1 ,2 ]
Bhambra, Nishaant [3 ]
Bleakney, Robert [1 ,2 ]
Bhayana, Rajesh [1 ,2 ]
Affiliations
[1] Univ Med Imaging Toronto, Univ Hlth Network, Univ Toronto, Mt Sinai Hosp, Joint Dept Med Imaging, 200 Elizabeth St, Toronto, ON M5G 2C4, Canada
[2] Univ Toronto, Dept Med Imaging, Toronto, ON, Canada
[3] Univ Ottawa, Dept Family Med, Ottawa, ON, Canada
Keywords
DOI
10.1148/radiol.232715
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Subject Classification Codes
1002 ; 100207 ; 1009 ;
Abstract
Background: ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit utility.

Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board-style examination.

Materials and Methods: In this exploratory prospective study, 150 radiology board-style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1 to 10 (with 10 being the highest level of confidence and 1 being the lowest) on the third attempt and after each challenge prompt.

Results: Neither version showed a difference in accuracy over the three attempts: for the first, second, and third attempts, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); and accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across the three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After the challenge prompt, both models changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1-10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; and GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; and GPT-4, 77% [27 of 35], respectively; P = .89).

Conclusion: Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but more influenced by an adversarial prompt.
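The Materials and Methods describe a concrete protocol: independent passes over the same 150 questions, three adversarial challenges per question on the third pass, and between-attempt agreement as the repeatability measure. The Python sketch below is a minimal illustration of that workflow under stated assumptions: the generic ask() callable stands in for whatever ChatGPT interface the authors used, the Question, run_attempt, and challenge names are hypothetical, and the kappa routine is the standard Cohen formula rather than the authors' own code; only the challenge prompt string is quoted from the abstract.

from collections import Counter
from dataclasses import dataclass
from typing import Callable, Sequence

# Verbatim adversarial prompt reported in the abstract.
CHALLENGE_PROMPT = "Your answer choice is incorrect. Please choose a different option."

@dataclass
class Question:
    stem: str    # question text plus lettered answer options (hypothetical structure)
    answer: str  # correct option letter, e.g. "B"

def run_attempt(ask: Callable[[list], str], questions: Sequence[Question]) -> list:
    """One independent pass over the exam; returns the chosen option letter per question."""
    return [ask([{"role": "user", "content": q.stem}]).strip()[:1].upper() for q in questions]

def challenge(ask: Callable[[list], str], q: Question, first_answer: str, n_challenges: int = 3) -> list:
    """Robustness test: challenge the model three times regardless of answer correctness."""
    history = [{"role": "user", "content": q.stem},
               {"role": "assistant", "content": first_answer}]
    revised = []
    for _ in range(n_challenges):
        history.append({"role": "user", "content": CHALLENGE_PROMPT})
        reply = ask(history)
        history.append({"role": "assistant", "content": reply})
        revised.append(reply.strip()[:1].upper())
    return revised

def percent_agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Raw repeatability: share of questions answered identically on two attempts."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement (Cohen kappa) between two attempts on the same questions."""
    n = len(a)
    po = percent_agreement(a, b)
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

if __name__ == "__main__":
    # Toy answer strings standing in for two attempts at a 10-question exam (not study data).
    attempt_1 = list("ABCDABCDAB")
    attempt_2 = list("ABCDABCDCA")
    print(f"agreement = {percent_agreement(attempt_1, attempt_2):.2f}")
    print(f"kappa     = {cohens_kappa(attempt_1, attempt_2):.2f}")

With answer letters recorded for each attempt, percent_agreement and cohens_kappa yield the kind of repeatability statistics reported in the Results (eg, agreement of 76.7% and κ = 0.78 for GPT-4); run_attempt and challenge additionally require a chat-model callable to execute.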
Pages: 7