Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports

Cited by: 0
Authors
Dorfner, Felix J. [1 ,2 ,3 ,4 ,5 ]
Juergensen, Liv [3 ,5 ,6 ]
Donle, Leonhard [3 ]
Al Mohamad, Fares [3 ,5 ]
Bodenmann, Tobias R. [1 ,2 ,4 ]
Cleveland, Mason C. [1 ,2 ,4 ]
Busch, Felix [3 ,5 ]
Adams, Lisa C. [7 ]
Sato, James [8 ]
Schultz, Thomas [8 ]
Kim, Albert E. [1 ,2 ,4 ]
Merkow, Jameson [9 ]
Bressem, Keno K. [10 ,11 ,12 ]
Bridge, Christopher P. [1 ,2 ,4 ,5 ,8 ]
Affiliations
[1] Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, 149 Thirteenth St, Charlestown, MA 02129 USA
[2] Harvard Med Sch, 149 Thirteenth St, Charlestown, MA 02129 USA
[3] Charite Univ Med Berlin, Dept Radiol, Berlin, Germany
[4] Free Univ Berlin, Berlin, Germany
[5] Humboldt Univ, Berlin, Germany
[6] Dana Farber Canc Inst, Dept Pediat Oncol, Boston, MA USA
[7] Tech Univ Munich, Dept Diagnost & Intervent Radiol, Munich, Germany
[8] Mass Gen Brigham Data Sci Off, Boston, MA USA
[9] Microsoft Hlth & Life Sci HLS, Redmond, WA 98052 USA
[10] Tech Univ Munich, Klinikum Rechts Isar, Munich, Germany
[11] German Heart Ctr Munich, Dept Radiol & Nucl Med, Munich, Germany
[12] Tech Univ Munich, TUM Univ Hosp, Sch Med & Hlth, Dept Cardiovasc Radiol & Nucl Med,German Heart Ctr, Munich, Germany
Keywords
DOI
10.1148/radiol.241139
Chinese Library Classification (CLC)
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Subject classification codes
1002 ; 100207 ; 1009 ;
Abstract
Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored the use of OpenAI's GPT-4 to extract information of interest from radiology reports, there has been no real-world comparison of GPT-4 with leading open-source models.
Purpose: To compare leading open-source LLMs with GPT-4 on the task of extracting relevant findings from chest radiograph reports.
Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study, performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, which provides reference standard annotations for the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models, including Mistral-7B and Mixtral-8x7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.
Results: On the ImaGenome dataset (n = 450), the best-performing open-source model, Llama 2-70B, achieved micro F1 scores of 0.97 for both zero-shot and few-shot prompting, compared with GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and P < .001, respectively, for superiority of GPT-4). On the institutional dataset (n = 500), the best-performing open-source approach, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99, respectively, for superiority of GPT-4).
Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched GPT-4 performance. The benefit of few-shot prompting varied across datasets and models.
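For readers who want to see the shape of the workflow the abstract describes, the Python sketch below illustrates zero-shot and few-shot prompting of an LLM to label findings in a report, micro F1 scoring against reference annotations, and a McNemar comparison of two models. It is not the study's code: the finding list, prompt wording, model name, and the OpenAI-compatible client assumption (open-source models can be served behind the same interface, e.g. with vLLM) are illustrative, and the authors' exact prompts and statistical procedure may differ.
```python
"""Illustrative sketch of zero-/few-shot report labeling and F1/McNemar comparison.
Finding list, prompts, and model names are assumptions, not the study's pipeline."""
import json

import numpy as np
from openai import OpenAI                                   # pip install openai
from sklearn.metrics import f1_score                        # pip install scikit-learn
from statsmodels.stats.contingency_tables import mcnemar    # pip install statsmodels

# Hypothetical subset of findings; the study labeled a larger set.
FINDINGS = ["pneumothorax", "pleural effusion", "consolidation", "cardiomegaly"]

client = OpenAI()  # assumes OPENAI_API_KEY; open-source models can expose the same API


def label_report(report, examples=None, model="gpt-4"):
    """Label one report; pass (report, labels) pairs in `examples` for few-shot prompting."""
    messages = [{
        "role": "system",
        "content": ("You label chest radiograph reports. For each finding, answer 1 if it "
                    "is present and 0 if it is absent. Reply with a JSON object whose keys "
                    "are exactly: " + ", ".join(FINDINGS)),
    }]
    for ex_report, ex_labels in examples or []:              # few-shot examples, if any
        messages.append({"role": "user", "content": ex_report})
        messages.append({"role": "assistant", "content": json.dumps(ex_labels)})
    messages.append({"role": "user", "content": report})
    reply = client.chat.completions.create(model=model, messages=messages, temperature=0)
    return json.loads(reply.choices[0].message.content)      # e.g. {"pneumothorax": 1, ...}


def micro_f1(preds, refs):
    """Micro-averaged F1 over every (report, finding) decision."""
    y_pred = np.array([[p[f] for f in FINDINGS] for p in preds]).ravel()
    y_true = np.array([[r[f] for f in FINDINGS] for r in refs]).ravel()
    return f1_score(y_true, y_pred, average="micro")


def mcnemar_pvalue(preds_a, preds_b, refs):
    """McNemar test on paired per-label correctness of two models."""
    a = np.array([[p[f] == r[f] for f in FINDINGS] for p, r in zip(preds_a, refs)]).ravel()
    b = np.array([[p[f] == r[f] for f in FINDINGS] for p, r in zip(preds_b, refs)]).ravel()
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=False, correction=True).pvalue
```
A call such as label_report(report_text, examples=few_shot_examples) would roughly correspond to the few-shot setting, while pointing the same client at a locally served Llama 2-70B or Mixtral-8x7B endpoint would cover the open-source models; the comparison of GPT-4 against each alternative would then reduce to micro_f1 and mcnemar_pvalue over the two sets of predictions.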
Pages: 8