Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports

Cited by: 0
Authors
Dorfner, Felix J. [1 ,2 ,3 ,4 ,5 ]
Juergensen, Liv [3 ,5 ,6 ]
Donle, Leonhard [3 ]
Al Mohamad, Fares [3 ,5 ]
Bodenmann, Tobias R. [1 ,2 ,4 ]
Cleveland, Mason C. [1 ,2 ,4 ]
Busch, Felix [3 ,5 ]
Adams, Lisa C. [7 ]
Sato, James [8 ]
Schultz, Thomas [8 ]
Kim, Albert E. [1 ,2 ,4 ]
Merkow, Jameson [9 ]
Bressem, Keno K. [10 ,11 ,12 ]
Bridge, Christopher P. [1 ,2 ,4 ,5 ,8 ]
Affiliations
[1] Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, 149 Thirteenth St, Charlestown, MA 02129 USA
[2] Harvard Med Sch, 149 Thirteenth St, Charlestown, MA 02129 USA
[3] Charite Univ Med Berlin, Dept Radiol, Berlin, Germany
[4] Free Univ Berlin, Berlin, Germany
[5] Humboldt Univ, Berlin, Germany
[6] Dana Farber Canc Inst, Dept Pediat Oncol, Boston, MA USA
[7] Tech Univ Munich, Dept Diagnost & Intervent Radiol, Munich, Germany
[8] Mass Gen Brigham Data Sci Off, Boston, MA USA
[9] Microsoft Hlth & Life Sci HLS, Redmond, WA 98052 USA
[10] Tech Univ Munich, Klinikum Rechts Isar, Munich, Germany
[11] German Heart Ctr Munich, Dept Radiol & Nucl Med, Munich, Germany
[12] Tech Univ Munich, TUM Univ Hosp, Sch Med & Hlth, Dept Cardiovasc Radiol & Nucl Med,German Heart Ctr, Munich, Germany
Keywords
DOI
10.1148/radiol.241139
Chinese Library Classification: R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Classification Codes: 1002; 100207; 1009
Abstract
Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored OpenAI's GPT-4 for extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to leading open-source models.
Purpose: To compare leading open-source LLMs with GPT-4 on the task of extracting relevant findings from chest radiograph reports.
Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, which provides reference standard annotations from the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at the Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models, including Mistral-7B and Mixtral-8x7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.
Results: On the ImaGenome dataset (n = 450), the highest-scoring open-source model, Llama 2-70B, achieved micro F1 scores of 0.97 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and P < .001 for superiority of GPT-4, respectively). On the institutional dataset (n = 500), the highest-scoring open-source model, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99 for superiority of GPT-4, respectively).
Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched the performance of GPT-4. The benefit of few-shot prompting varied across datasets and models.
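As a rough, hedged illustration of the labeling and comparison setup summarized above, the Python sketch below builds zero-shot and few-shot prompts for a set of findings and compares two models' binary labels with a micro F1 score and an exact McNemar test. The prompt wording and the helpers `build_prompt`, `label_reports`, and the `query_model` callable are hypothetical placeholders, not the authors' code; only `f1_score` (scikit-learn) and `mcnemar` (statsmodels) are real library calls.

```python
# Minimal sketch, assuming a generic chat-style LLM wrapped as `query_model(prompt) -> str`.
# Helper names are hypothetical; this is not the implementation used in the study.
import numpy as np
from sklearn.metrics import f1_score                      # micro F1 over all (report, finding) pairs
from statsmodels.stats.contingency_tables import mcnemar  # paired test between two models

FINDINGS = ["pleural effusion", "pneumothorax", "consolidation"]  # illustrative subset

def build_prompt(report, finding, examples=None):
    """Zero-shot if `examples` is None; few-shot prepends labeled example reports."""
    prompt = (f"Does the following chest radiograph report indicate '{finding}'? "
              "Answer 'yes' or 'no' only.\n")
    if examples:  # few-shot: a small number of (report, 'yes'/'no') pairs
        for ex_report, ex_answer in examples:
            prompt += f"\nReport: {ex_report}\nAnswer: {ex_answer}\n"
    return prompt + f"\nReport: {report}\nAnswer:"

def label_reports(reports, query_model, examples=None):
    """Return an (n_reports, n_findings) binary matrix of LLM-assigned labels."""
    labels = np.zeros((len(reports), len(FINDINGS)), dtype=int)
    for i, report in enumerate(reports):
        for j, finding in enumerate(FINDINGS):
            answer = query_model(build_prompt(report, finding, examples))
            labels[i, j] = int("yes" in answer.lower())
    return labels

def compare_models(y_true, y_a, y_b):
    """Micro F1 per model plus an exact McNemar test on paired per-label correctness."""
    f1_a = f1_score(y_true, y_a, average="micro")
    f1_b = f1_score(y_true, y_b, average="micro")
    correct_a = (y_a == y_true).ravel()
    correct_b = (y_b == y_true).ravel()
    # 2x2 table: rows = model A correct/incorrect, columns = model B correct/incorrect
    table = [[np.sum(correct_a & correct_b), np.sum(correct_a & ~correct_b)],
             [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)]]
    return f1_a, f1_b, mcnemar(table, exact=True).pvalue
```

Wrapping, for example, a GPT-4 API call and a locally hosted open-source model behind the same `query_model` interface would yield paired per-finding labels to which this micro F1 and McNemar comparison can be applied, mirroring the evaluation described in the abstract.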
Pages: 8