Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; cancer; overutilization; mammography; overuse
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Classification Codes
1002; 100207; 1009
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.
Methods: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.
Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with an average OE score of 1.666 (out of 2) and a SATA average percentage correct of 77.7% for ChatGPT-4.
Discussion: Our results demonstrate the eventual feasibility of using large language models like ChatGPT for radiologic decision making, with the potential to improve clinical workflow and promote responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
Pages: 990-997
Page count: 8
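
As an editorial illustration of the workflow described in the abstract (open-ended and select-all-that-apply prompts, three replicates per prompt, scores averaged per model), the sketch below shows how such an evaluation could be scripted against the OpenAI Python API. It is a minimal sketch under stated assumptions: the model identifiers, example prompts, and the toy score_response() function are placeholders introduced here for illustration and are not the authors' actual protocol, which scored ChatGPT responses against the ACR Appropriateness Criteria.

```python
# Illustrative sketch only: model names, prompts, and the toy scorer below are
# assumptions; the original study graded responses against the ACR
# Appropriateness Criteria rather than with an automated keyword check.
from statistics import mean
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # stand-ins for ChatGPT-3.5 / GPT-4
N_REPLICATES = 3                     # the paper averaged three replicate entries

PROMPTS = {
    # Open-ended (OE) format: the model proposes imaging on its own.
    "OE": "A 40-year-old woman at average risk asks about breast cancer "
          "screening. What imaging, if any, is appropriate?",
    # Select-all-that-apply (SATA) format: the model picks from a fixed list.
    "SATA": "A 40-year-old woman at average risk asks about breast cancer "
            "screening. Select all appropriate studies: (a) digital breast "
            "tomosynthesis, (b) screening mammography, (c) breast MRI, "
            "(d) breast ultrasound, (e) no imaging.",
}

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def score_response(fmt: str, text: str) -> float:
    """Toy stand-in for grading (OE on a 0-2 scale, SATA as % correct).
    Here we only check whether screening mammography is recommended."""
    hit = "mammogra" in text.lower()
    if fmt == "OE":
        return 2.0 if hit else 0.0
    return 100.0 if hit else 0.0

for model in MODELS:
    for fmt, prompt in PROMPTS.items():
        replicates = [score_response(fmt, ask(model, prompt))
                      for _ in range(N_REPLICATES)]
        print(model, fmt, round(mean(replicates), 3))
```

The design mirrors the abstract's comparison: each model answers each prompt format three times, and the replicate scores are averaged before the two models are compared.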