Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; CANCER; OVERUTILIZATION; MAMMOGRAPHY; OVERUSE; CHATGPT;
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline classification codes
1002; 100207; 1009
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.
Methods: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.
Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.
Discussion: Our results demonstrate the eventual feasibility of using large language models like ChatGPT for radiologic decision making, with the potential to improve clinical workflow and the responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
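The evaluation design described in the Methods (OE and SATA prompts, three replicate submissions per prompt, scores averaged against the ACR Appropriateness Criteria) can be illustrated with a short script. The following is a minimal sketch, not the authors' actual pipeline: it assumes the OpenAI Python SDK (v1.x), the model identifiers "gpt-3.5-turbo" and "gpt-4", and a hypothetical score_open_ended() grader; the prompt wording and the 0-2 scoring rule are illustrative only.

# Minimal sketch of the replicate-and-average evaluation design described in the
# abstract. Assumptions (not from the paper): the OpenAI Python SDK (v1.x), the
# model identifiers "gpt-3.5-turbo" and "gpt-4", and a hypothetical
# score_open_ended() grader; prompt wording and scoring thresholds are illustrative.
from statistics import mean

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
N_REPLICATES = 3    # each prompt was submitted three times and the scores averaged


def ask(model: str, prompt: str) -> str:
    """Send one prompt to the chosen model and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def score_open_ended(reply: str, usually_appropriate: str, may_be_appropriate: set[str]) -> float:
    """Hypothetical 0-2 grader for open-ended (OE) prompts: 2 if the ACR
    'usually appropriate' modality is named, 1 if only a 'may be appropriate'
    option appears, 0 otherwise."""
    text = reply.lower()
    if usually_appropriate.lower() in text:
        return 2.0
    if any(m.lower() in text for m in may_be_appropriate):
        return 1.0
    return 0.0


def average_oe_score(model: str, prompt: str, usually: str, maybe: set[str]) -> float:
    """Average the OE score over replicate submissions, as in the study design."""
    return mean(
        score_open_ended(ask(model, prompt), usually, maybe)
        for _ in range(N_REPLICATES)
    )


if __name__ == "__main__":
    # Illustrative OE prompt for average-risk screening (not the study's exact wording).
    prompt = ("A 40-year-old woman at average risk asks how she should be screened "
              "for breast cancer. Which imaging examinations, if any, are appropriate?")
    for model in ("gpt-3.5-turbo", "gpt-4"):
        score = average_oe_score(model, prompt, "mammography", {"digital breast tomosynthesis"})
        print(f"{model}: mean OE score {score:.3f} / 2")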
Pages: 990-997
Number of pages: 8