Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; CANCER; OVERUTILIZATION; MAMMOGRAPHY; OVERUSE; CHATGPT;
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline classification codes
1002; 100207; 1009
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.
Methods: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.
Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.
Discussion: Our results demonstrate the eventual feasibility of using large language models like ChatGPT for radiologic decision making, with the potential to improve clinical workflow and the responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
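The evaluation design described in the Methods (OE and SATA prompts, three replicate submissions per prompt, scores averaged against the ACR Appropriateness Criteria) can be illustrated with a short script. The following is a minimal sketch, not the authors' actual pipeline: it assumes the OpenAI Python SDK (v1.x), the model identifiers "gpt-3.5-turbo" and "gpt-4", and a hypothetical score_open_ended() grader; the prompt wording and the 0-2 scoring rule are illustrative only.

# Minimal sketch of the replicate-and-average evaluation design described in the
# abstract. Assumptions (not from the paper): the OpenAI Python SDK (v1.x), the
# model identifiers "gpt-3.5-turbo" and "gpt-4", and a hypothetical
# score_open_ended() grader; prompt wording and scoring thresholds are illustrative.
from statistics import mean

from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
N_REPLICATES = 3    # each prompt was submitted three times and the scores averaged


def ask(model: str, prompt: str) -> str:
    """Send one prompt to the chosen model and return the text of its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def score_open_ended(reply: str, usually_appropriate: str, may_be_appropriate: set[str]) -> float:
    """Hypothetical 0-2 grader for open-ended (OE) prompts: 2 if the ACR
    'usually appropriate' modality is named, 1 if only a 'may be appropriate'
    option appears, 0 otherwise."""
    text = reply.lower()
    if usually_appropriate.lower() in text:
        return 2.0
    if any(m.lower() in text for m in may_be_appropriate):
        return 1.0
    return 0.0


def average_oe_score(model: str, prompt: str, usually: str, maybe: set[str]) -> float:
    """Average the OE score over replicate submissions, as in the study design."""
    return mean(
        score_open_ended(ask(model, prompt), usually, maybe)
        for _ in range(N_REPLICATES)
    )


if __name__ == "__main__":
    # Illustrative OE prompt for average-risk screening (not the study's exact wording).
    prompt = ("A 40-year-old woman at average risk asks how she should be screened "
              "for breast cancer. Which imaging examinations, if any, are appropriate?")
    for model in ("gpt-3.5-turbo", "gpt-4"):
        score = average_oe_score(model, prompt, "mammography", {"digital breast tomosynthesis"})
        print(f"{model}: mean OE score {score:.3f} / 2")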
Pages: 990-997
Number of pages: 8