BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study

Cited by: 16
Authors
Cozzi, Andrea [1]
Pinker, Katja [2]
Hidber, Andri [3]
Zhang, Tianyu [4,5,6]
Bonomo, Luca [1]
Lo Gullo, Roberto [2,4]
Christianson, Blake [2]
Curti, Marco [1]
Rizzo, Stefania [1,3]
Del Grande, Filippo [1,3]
Mann, Ritse M. [4,5]
Schiaffino, Simone [1,3]
Affiliations
[1] Ente Osped Cantonale, Imaging Inst Southern Switzerland IIMSI, Via Tesserete 46, CH-6900 Lugano, Switzerland
[2] Mem Sloan Kettering Canc Ctr, Dept Radiol, Breast Imaging Serv, New York, NY USA
[3] Univ Svizzera italiana, Fac Biomed Sci, Lugano, Switzerland
[4] Netherlands Canc Inst, Dept Radiol, Amsterdam, Netherlands
[5] Radboud Univ Nijmegen, Dept Diagnost Imaging, Med Ctr, NL-6500 HB Nijmegen, Netherlands
[6] Maastricht Univ, GROW Res Inst Oncol & Reprod, Maastricht, Netherlands
Keywords
Interobserver Variability; Agreement; Reliability
DOI
10.1148/radiol.232133
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: The performance of publicly available large language models (LLMs) remains unclear for complex clinical tasks.

Purpose: To evaluate the agreement between human readers and LLMs for Breast Imaging Reporting and Data System (BI-RADS) categories assigned based on breast imaging reports written in three languages and to assess the impact of discordant category assignments on clinical management.

Materials and Methods: This retrospective study included reports for women who underwent MRI, mammography, and/or US for breast cancer screening or diagnostic purposes at three referral centers. Reports with findings categorized as BI-RADS 1-5 and written in Italian, English, or Dutch were collected between January 2000 and October 2023. Board-certified breast radiologists and the LLMs GPT-3.5 and GPT-4 (OpenAI) and Bard, now called Gemini (Google), assigned BI-RADS categories using only the findings described by the original radiologists. Agreement between human readers and LLMs for BI-RADS categories was assessed using the Gwet agreement coefficient (AC1 value). Frequencies were calculated for changes in BI-RADS category assignments that would affect clinical management (ie, BI-RADS 0 vs BI-RADS 1 or 2 vs BI-RADS 3 vs BI-RADS 4 or 5) and compared using the McNemar test.

Results: Across 2400 reports, agreement between the original and reviewing radiologists was almost perfect (AC1 = 0.91), while agreement between the original radiologists and GPT-4, GPT-3.5, and Bard was moderate (AC1 = 0.52, 0.48, and 0.42, respectively). Across human readers and LLMs, differences were observed in the frequency of BI-RADS category upgrades or downgrades that would result in changed clinical management (118 of 2400 [4.9%] for human readers, 611 of 2400 [25.5%] for Bard, 573 of 2400 [23.9%] for GPT-3.5, and 435 of 2400 [18.1%] for GPT-4; P < .001) and that would negatively impact clinical management (37 of 2400 [1.5%] for human readers, 435 of 2400 [18.1%] for Bard, 344 of 2400 [14.3%] for GPT-3.5, and 255 of 2400 [10.6%] for GPT-4; P < .001).

Conclusion: LLMs achieved moderate agreement with human reader-assigned BI-RADS categories across reports written in three languages but also yielded a high percentage of discordant BI-RADS categories that would negatively impact clinical management.
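The Gwet first-order agreement coefficient (AC1) used in the abstract is defined for two raters as AC1 = (p_a - p_e) / (1 - p_e), where p_a is the observed agreement and p_e is the chance agreement, computed as the sum over the q rating categories of pi_k * (1 - pi_k), divided by (q - 1), with pi_k the average prevalence of category k across both raters. A minimal Python sketch of this computation follows; it is not the study's code, and the toy BI-RADS assignments are invented for illustration.

```python
from collections import Counter

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters.

    ratings_a, ratings_b: equal-length sequences of category labels,
    e.g. BI-RADS categories assigned to the same set of reports.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("Both raters must rate the same reports.")
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    q = len(categories)

    # Observed agreement: fraction of reports given identical categories.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Average prevalence of each category across both raters.
    counts = Counter(ratings_a) + Counter(ratings_b)
    prevalence = [counts[k] / (2 * n) for k in categories]

    # Gwet's chance agreement: sum of pi_k * (1 - pi_k), scaled by q - 1.
    p_e = sum(p * (1 - p) for p in prevalence) / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Toy example: invented BI-RADS categories for ten reports.
radiologist = [1, 2, 2, 3, 4, 5, 1, 2, 3, 4]
llm         = [1, 2, 3, 3, 4, 4, 1, 2, 3, 5]
print(f"AC1 = {gwet_ac1(radiologist, llm):.2f}")
```

Unlike Cohen's kappa, AC1 is comparatively robust to skewed category prevalence, which makes it a common choice for interobserver agreement studies of this kind.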
Pages: 8
Related Papers
50 items in total
  • [21] Limitations of GPT-3.5 and GPT-4 in Applying Fleischner Society Guidelines to Incidental Lung Nodules
    Gamble, Joel
    Ferguson, Duncan
    Yuen, Joanna
    Sheikh, Adnan
    Canadian Association of Radiologists Journal, 2024, 75(2): 412-416
  • [22] Examining Lexical Alignment in Human-Agent Conversations with GPT-3.5 and GPT-4 Models
    Wang, Boxuan
    Theune, Mariet
    Srivastava, Sumit
    Chatbot Research and Design (CONVERSATIONS 2023), 2024, 14524: 94-114
  • [23] Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4
    Lahat, Adi
    Sharif, Kassem
    Zoabi, Narmin
    Patt, Yonatan Shneor
    Sharif, Yousra
    Fisher, Lior
    Shani, Uria
    Arow, Mohamad
    Levin, Roni
    Klang, Eyal
    Journal of Medical Internet Research, 2024, 26
  • [24] Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
    Yudovich, Max Samuel
    Makarova, Elizaveta
    Hague, Christian Michael
    Raman, Jay Dilip
    Journal of Educational Evaluation for Health Professions, 2024, 21: 17
  • [26] Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources
    Srinivasan, Nitin
    Samaan, Jamil S.
    Rajeev, Nithya D.
    Kanu, Mmerobasi U.
    Yeo, Yee Hui
    Samakar, Kamran
    Surgical Endoscopy and Other Interventional Techniques, 2024, 38(5): 2522-2532
  • [27] Custom GPTs Enhancing Performance and Evidence Compared with GPT-3.5, GPT-4, and GPT-4o? A Study on the Emergency Medicine Specialist Examination
    Liu, Chiu-Liang
    Ho, Chien-Ta
    Wu, Tzu-Chi
    Healthcare, 2024, 12(17)
  • [28] Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study
    Meyer, Annika
    Riese, Janik
    Streichert, Thomas
    JMIR Medical Education, 2024, 10
  • [29] Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions
    Moshirfar, Majid
    Altaf, Amal W.
    Stoakes, Isabella M.
    Tuttle, Jared J.
    Hoopes, Phillip C.
    Cureus Journal of Medical Science, 2023, 15(6)
  • [30] Advancements in AI for Gastroenterology Education: An Assessment of OpenAI's GPT-4 and GPT-3.5 in MKSAP Question Interpretation
    Patel, Akash
    Samreen, Isha
    Ahmed, Imran
    American Journal of Gastroenterology, 2024, 119(10S): S1580