Evaluating GPT as an Adjunct for Radiologic Decision Making: GPT-4 Versus GPT-3.5 in a Breast Imaging Pilot

Cited by: 83
Authors
Rao, Arya [1 ,2 ]
Kim, John [1 ,2 ]
Kamineni, Meghana [1 ,2 ]
Pang, Michael [1 ,2 ]
Lie, Winston [1 ,2 ]
Dreyer, Keith J. [1 ,2 ,3 ,4 ]
Succi, Marc D. [1 ,3 ,5 ,6 ,7 ]
Affiliations
[1] Harvard Med Sch, Boston, MA USA
[2] Massachusetts Gen Hosp, Medically Engn Solut Healthcare Incubator Innovat, Boston, MA USA
[3] Massachusetts Gen Hosp, Dept Radiol, Boston, MA USA
[4] Mass Gen Brigham, Boston, MA USA
[5] Mass Gen Brigham Enterprise Radiol, Medically Engn Solut Healthcare Innovat Operat Res, Boston, MA USA
[6] Massachusetts Gen Hosp, MESH Incubator, Boston, MA USA
[7] Massachusetts Gen Hosp, Dept Radiol, 55 Fruit St, Boston, MA 02114 USA
Keywords
AI; breast imaging; ChatGPT; clinical decision making; clinical decision support; cancer; overutilization; mammography; overuse
DOI
10.1016/j.jacr.2023.05.003
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Classification Codes
1002; 100207; 1009
Abstract
Objective: Despite rising popularity and performance, studies evaluating the use of large language models for clinical decision support are lacking. Here, we evaluate the capacity of ChatGPT-3.5 and GPT-4 (Generative Pre-trained Transformer; OpenAI, San Francisco, California) for clinical decision support in radiology via the identification of appropriate imaging services for two important clinical presentations: breast cancer screening and breast pain.
Methods: We compared ChatGPT's responses with the ACR Appropriateness Criteria for breast pain and breast cancer screening. Our prompt formats included an open-ended (OE) and a select-all-that-apply (SATA) format. Scoring criteria evaluated whether the proposed imaging modalities were in accordance with ACR guidelines. Three replicate entries were conducted for each prompt, and the average of these was used to determine final scores.
Results: Both ChatGPT-3.5 and ChatGPT-4 achieved an average OE score of 1.830 (out of 2) for breast cancer screening prompts. ChatGPT-3.5 achieved a SATA average percentage correct of 88.9%, compared with ChatGPT-4's average percentage correct of 98.4%, for breast cancer screening prompts. For breast pain, ChatGPT-3.5 achieved an average OE score of 1.125 (out of 2) and a SATA average percentage correct of 58.3%, compared with ChatGPT-4's average OE score of 1.666 (out of 2) and SATA average percentage correct of 77.7%.
Discussion: Our results demonstrate the eventual feasibility of using large language models such as ChatGPT for radiologic decision making, with the potential to improve clinical workflow and promote responsible use of radiology services. More use cases and greater accuracy are necessary to evaluate and implement such tools.
Pages: 990-997
Number of pages: 8
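
To make the Methods described in the abstract concrete, below is a minimal Python sketch of one plausible way to run the replicate-prompting and score-averaging workflow against the OpenAI chat completions API. The scenario text, prompt wording, model identifiers, and the keyword-based scoring stub are illustrative assumptions, not the authors' actual protocol; in the study, responses were graded against the ACR Appropriateness Criteria (OE on a 0-2 scale, SATA as percentage correct) by reviewers rather than by an automated check.

from statistics import mean

from openai import OpenAI  # assumes the OpenAI Python SDK, v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scenario and prompt wording -- not the study's actual prompts.
SCENARIO = "An average-risk 40-year-old woman requests breast cancer screening."

PROMPTS = {
    "OE": SCENARIO + " Which imaging study, if any, is most appropriate?",
    "SATA": (
        SCENARIO + " Select all appropriate imaging studies: "
        "A) digital mammography, B) breast MRI, C) breast ultrasound, D) no imaging."
    ),
}

def ask(model: str, prompt: str) -> str:
    """Submit one prompt and return the model's free-text reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def score_response(reply: str) -> float:
    """Illustrative stand-in for grading: award 2 points if the guideline-
    concordant modality (mammography) is recommended, else 0. The study itself
    scored responses against the ACR Appropriateness Criteria."""
    return 2.0 if "mammograph" in reply.lower() else 0.0

def evaluate(model: str, replicates: int = 3) -> dict:
    """Run each prompt format the stated number of times and average the scores."""
    return {
        fmt: mean(score_response(ask(model, prompt)) for _ in range(replicates))
        for fmt, prompt in PROMPTS.items()
    }

# Example usage: compare the two models on the same scenario.
# print(evaluate("gpt-3.5-turbo"), evaluate("gpt-4"))

Averaging three replicates per prompt, as the abstract describes, dampens run-to-run variability in the model's answers; a real evaluation would also log each raw response so that human raters can apply the ACR criteria directly.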