Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; CANCER; OVERUTILIZATION; MAMMOGRAPHY; OVERUSE; CHATGPT;
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline codes
1002; 100207; 1009
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.
Methods: We compared ChatGPT's responses with the ACR Appropriateness Criteria for breast pain and breast cancer screening. Prompts were posed in an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Each prompt was entered in three replicates, and the average of these was used to determine the final score.
Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with 98.4% for ChatGPT-4, on breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.
Discussion: Our results demonstrate the eventual feasibility of using large language models such as ChatGPT for radiologic decision making, with the potential to improve clinical workflow and promote responsible use of radiology services. More use cases and greater accuracy are necessary before such tools can be fully evaluated and implemented.
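For clarity, the replicate-and-average scoring protocol described in the Methods can be sketched as follows. This is a minimal Python illustration, not the authors' code: query_model and score_response are hypothetical stand-ins for the ChatGPT query and the ACR-based rubric; only the three-replicate averaging step is taken from the abstract.

from statistics import mean

def query_model(prompt: str) -> str:
    # Hypothetical stand-in for submitting a prompt to ChatGPT-3.5 or GPT-4
    # (not the authors' implementation).
    raise NotImplementedError

def score_response(response: str, fmt: str) -> float:
    # Hypothetical rubric: per the abstract, open-ended (OE) answers are scored
    # out of 2 against the ACR Appropriateness Criteria, and select-all-that-apply
    # (SATA) answers as a percentage correct; the exact criteria are in the paper.
    raise NotImplementedError

def replicate_average(prompt: str, fmt: str, replicates: int = 3) -> float:
    # Per the abstract, each prompt is entered three times and the replicate
    # scores are averaged to give the final score.
    return mean(score_response(query_model(prompt), fmt) for _ in range(replicates))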
Pages: 990-997
Page count: 8