Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board-style Examination

Cited by: 20
|
Authors
Krishna, Satheesh [1 ,2 ]
Bhambra, Nishaant [3 ]
Bleakney, Robert [1 ,2 ]
Bhayana, Rajesh [1 ,2 ]
Affiliations
[1] Univ Med Imaging Toronto, Univ Hlth Network, Univ Toronto, Mt Sinai Hosp, Joint Dept Med Imaging, 200 Elizabeth St, Toronto, ON M5G 2C4, Canada
[2] Univ Toronto, Dept Med Imaging, Toronto, ON, Canada
[3] Univ Ottawa, Dept Family Med, Ottawa, ON, Canada
DOI
10.1148/radiol.232715
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: ChatGPT (OpenAI) can pass a text-based radiology board-style examination, but its stochasticity and confident language when it is incorrect may limit its utility.
Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board-style examination.
Materials and Methods: In this exploratory prospective study, 150 radiology board-style multiple-choice text-based questions, previously used to benchmark ChatGPT, were administered to default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, ChatGPT was challenged three times with the adversarial prompt, "Your answer choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). ChatGPT was prompted to rate its confidence from 1 to 10 (10 being the highest level of confidence and 1 the lowest) on the third attempt and after each challenge prompt.
Results: Neither version showed a difference in accuracy over the three attempts. For the first, second, and third attempts, accuracy of GPT-3.5 was 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06); accuracy of GPT-4 was 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across the three attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After the challenge prompt, both changed responses for most questions, though GPT-4 did so more frequently than GPT-3.5 (97.3% [146 of 150] vs 71.3% [107 of 150], respectively; P < .001). Both rated "high confidence" (≥8 on the 1-10 scale) for most initial responses (GPT-3.5, 100% [150 of 150]; GPT-4, 94.0% [141 of 150]) as well as for incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; GPT-4, 77% [27 of 35], respectively; P = .89).
Conclusion: Default GPT-3.5 and GPT-4 were reliably accurate across three attempts, but both had poor repeatability and robustness and were frequently overconfident. GPT-4 was more consistent across attempts than GPT-3.5 but was more influenced by an adversarial prompt.
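The Materials and Methods describe a concrete protocol: three independent attempts at 150 multiple-choice questions, followed by three rounds of a fixed adversarial challenge prompt, with repeatability summarized by percent agreement and kappa. The sketch below illustrates one way such a protocol and its repeatability statistics could be scripted. It is a minimal sketch under assumptions: the study queried the default ChatGPT interface, whereas this code uses the OpenAI chat-completions API as a stand-in, and the model name, helper functions (ask_with_challenges, percent_agreement, cohen_kappa), and the pairwise kappa calculation are illustrative, not the authors' implementation.

```python
# Illustrative sketch only; not the study's code. Assumes the OpenAI Python SDK
# (pip install openai) and an OPENAI_API_KEY environment variable.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHALLENGE = "Your answer choice is incorrect. Please choose a different option."


def ask_with_challenges(question: str, model: str = "gpt-4") -> list[str]:
    """Pose one multiple-choice question, then issue the adversarial challenge
    prompt three times; return the initial reply plus the three post-challenge replies."""
    messages = [{"role": "user", "content": question}]
    replies: list[str] = []
    for _ in range(4):  # initial answer + 3 challenges
        resp = client.chat.completions.create(model=model, messages=messages)
        reply = resp.choices[0].message.content
        replies.append(reply)
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": CHALLENGE})
    return replies


def percent_agreement(attempt_a: list[str], attempt_b: list[str]) -> float:
    """Repeatability as raw agreement: fraction of questions answered identically
    on two attempts."""
    return sum(a == b for a, b in zip(attempt_a, attempt_b)) / len(attempt_a)


def cohen_kappa(attempt_a: list[str], attempt_b: list[str]) -> float:
    """Chance-corrected agreement between two attempts (unweighted Cohen's kappa).
    The paper reports a single intrarater kappa over three attempts; this pairwise
    version is shown only to illustrate the idea."""
    n = len(attempt_a)
    p_observed = sum(a == b for a, b in zip(attempt_a, attempt_b)) / n
    counts_a, counts_b = Counter(attempt_a), Counter(attempt_b)
    p_expected = sum(
        counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)
    ) / n**2
    return (p_observed - p_expected) / (1 - p_expected)
```

Keeping the assistant's prior reply in the message history before each challenge mirrors the conversational setting in which the adversarial prompt was issued, so the model is arguing against its own previous answer rather than seeing the question in isolation.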
Pages: 7
Related Papers
50 records in total
  • [21] A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course
    Yeadon, Will
    Peach, Alex
    Testrow, Craig
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [22] Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties
    Luk, Dik Wai Anderson
    Ip, Whitney Chin Tung
    Shea, Yat-fung
    JOURNAL OF THE CHINESE MEDICAL ASSOCIATION, 2024, 87 (03) : 259 - 260
  • [23] Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules
    Gamble, Joel
    Ferguson, Duncan
    Yuen, Joanna
    Sheikh, Adnan
    CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL - JOURNAL DE L'ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2024, 75 (02): : 412 - 416
  • [24] Examining Lexical Alignment in Human-Agent Conversations with GPT-3.5 and GPT-4 Models
    Wang, Boxuan
    Theune, Mariet
    Srivastava, Sumit
    CHATBOT RESEARCH AND DESIGN, CONVERSATIONS 2023, 2024, 14524 : 94 - 114
  • [25] Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4
    Lahat, Adi
    Sharif, Kassem
    Zoabi, Narmin
    Patt, Yonatan Shneor
    Sharif, Yousra
    Fisher, Lior
    Shani, Uria
    Arow, Mohamad
    Levin, Roni
    Klang, Eyal
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [26] Assessing readability of explanations and reliability of answers by GPT-3.5 and GPT-4 in non-traumatic spinal cord injury education
    Garcia-Rudolph, Alejandro
    Sanchez-Pinsach, David
    Wright, Mark Andrew
    Opisso, Eloy
    Vidal, Joan
    MEDICAL TEACHER, 2024
  • [27] BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study
    Cozzi, Andrea
    Pinker, Katja
    Hidber, Andri
    Zhang, Tianyu
    Bonomo, Luca
    Lo Gullo, Roberto
    Christianson, Blake
    Curti, Marco
    Rizzo, Stefania
    Del Grande, Filippo
    Mann, Ritse M.
    Schiaffino, Simone
    RADIOLOGY, 2024, 311 (01)
  • [28] GPT-4 Turbo with Vision fails to outperform text-only GPT-4 Turbo in the Japan Diagnostic Radiology Board Examination
    Hirano, Yuichiro
    Hanaoka, Shouhei
    Nakao, Takahiro
    Miki, Soichiro
    Kikuchi, Tomohiro
    Nakamura, Yuta
    Nomura, Yukihiro
    Yoshikawa, Takeharu
    Abe, Osamu
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (08) : 918 - 926
  • [29] Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society
    Toyama, Yoshitaka
    Harigai, Ayaka
    Abe, Mirei
    Nagano, Mitsutoshi
    Kawabata, Masahiro
    Seki, Yasuhiro
    Takase, Kei
    JAPANESE JOURNAL OF RADIOLOGY, 2024, 42 (02) : 201 - 207