Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
|
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; CANCER; OVERUTILIZATION; MAMMOGRAPHY; OVERUSE; CHATGPT;
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline Codes
1002 ; 100207 ; 1009 ;
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.

Methods: We compared ChatGPT's responses to the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) format and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.

Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.

Discussion: Our results demonstrate the eventual feasibility of using large language models such as ChatGPT for radiologic decision making, with the potential to improve clinical workflow and promote responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
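The Methods paragraph above describes a simple evaluation protocol: each OE or SATA prompt is submitted three times and the rubric scores are averaged to produce the final score. The sketch below is a minimal illustration of how such a replicate-and-average loop could be scripted; the use of the OpenAI Python client, the model identifier, the prompt text, and the toy 0-2 scoring rule are assumptions made for demonstration and do not reflect the authors' actual pipeline or rubric.

# Illustrative sketch of the replicate-prompt scoring protocol described in Methods.
# Client usage, model name, prompt, and the toy 0-2 rubric are assumptions, not the study's code.
from statistics import mean

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set


def ask(model: str, prompt: str) -> str:
    """Submit a single prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def score_open_ended(answer: str, acr_first_line_modality: str) -> int:
    """Toy 0-2 rubric: full credit only if the guideline's first-line modality is named."""
    return 2 if acr_first_line_modality.lower() in answer.lower() else 0


def average_score(model: str, prompt: str, acr_first_line_modality: str, replicates: int = 3) -> float:
    """Average the rubric score over repeated submissions of the same prompt."""
    return mean(
        score_open_ended(ask(model, prompt), acr_first_line_modality)
        for _ in range(replicates)
    )


if __name__ == "__main__":
    oe_prompt = (
        "A 45-year-old woman at average risk asks about breast cancer screening. "
        "Which imaging study, if any, is most appropriate?"
    )
    print(average_score("gpt-4", oe_prompt, "mammography"))

Averaging over three replicates, as in the study, smooths out run-to-run variability in the model's responses before comparison against the ACR Appropriateness Criteria.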
Pages: 990 - 997
Page count: 8
Related Papers
50 records in total
  • [11] Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study
    Jin, Hye Kyung
    Kim, Eunyoung
    JMIR MEDICAL EDUCATION, 2024, 10
  • [12] Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study
    Takagi, Soshi
    Watari, Takashi
    Erabi, Ayano
    Sakaguchi, Kota
    JMIR MEDICAL EDUCATION, 2023, 9
  • [13] Comparing GPT-3.5 and GPT-4 Accuracy and Drift in Radiology Diagnosis Please Cases
    Li, David
    Gupta, Kartik
    Bhaduri, Mousumi
    Sathiadoss, Paul
    Bhatnagar, Sahir
    Chong, Jaron
    RADIOLOGY, 2024, 310 (01)
  • [14] GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews
    Oami, Takehiko
    Okada, Yohei
    Nakada, Taka-aki
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [15] Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination
    Lin, John C.
    Younessi, David N.
    Kurapati, Sai S.
    Tang, Oliver Y.
    Scott, Ingrid U.
    EYE, 2023, 37 (17) : 3694 - 3695
  • [17] A comparison of human, GPT-3.5, and GPT-4 performance in a university-level coding course
    Yeadon, Will
    Peach, Alex
    Testrow, Craig
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [18] The Performance of GPT-3.5, GPT-4, and Bard on the Japanese National Dentist Examination: A Comparison Study
    Ohta, Keiichi
    Ohta, Satomi
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2023, 15 (12)
  • [19] Performance of GPT-4 and GPT-3.5 in generating accurate and comprehensive diagnoses across medical subspecialties
    Luk, Dik Wai Anderson
    Ip, Whitney Chin Tung
    Shea, Yat-fung
    JOURNAL OF THE CHINESE MEDICAL ASSOCIATION, 2024, 87 (03) : 259 - 260
  • [20] Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules
    Gamble, Joel
    Ferguson, Duncan
    Yuen, Joanna
    Sheikh, Adnan
    CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL-JOURNAL DE L'ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2024, 75 (02) : 412 - 416