Comparing Commercial and Open-Source Large Language Models for Labeling Chest Radiograph Reports

Authors
Dorfner, Felix J. [1, 2, 3, 4, 5]
Juergensen, Liv [3, 5, 6]
Donle, Leonhard [3]
Al Mohamad, Fares [3, 5]
Bodenmann, Tobias R. [1, 2, 4]
Cleveland, Mason C. [1, 2, 4]
Busch, Felix [3, 5]
Adams, Lisa C. [7]
Sato, James [8]
Schultz, Thomas [8]
Kim, Albert E. [1, 2, 4]
Merkow, Jameson [9]
Bressem, Keno K. [10, 11, 12]
Bridge, Christopher P. [1, 2, 4, 5, 8]
Affiliations
[1] Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, 149 Thirteenth St, Charlestown, MA 02129 USA
[2] Harvard Med Sch, 149 Thirteenth St, Charlestown, MA 02129 USA
[3] Charite Univ Med Berlin, Dept Radiol, Berlin, Germany
[4] Free Univ Berlin, Berlin, Germany
[5] Humboldt Univ, Berlin, Germany
[6] Dana Farber Canc Inst, Dept Pediat Oncol, Boston, MA USA
[7] Tech Univ Munich, Dept Diagnost & Intervent Radiol, Munich, Germany
[8] Mass Gen Brigham Data Sci Off, Boston, MA USA
[9] Microsoft Hlth & Life Sci HLS, Redmond, WA 98052 USA
[10] Tech Univ Munich, Klinikum Rechts Isar, Munich, Germany
[11] German Heart Ctr Munich, Dept Radiol & Nucl Med, Munich, Germany
[12] Tech Univ Munich, TUM Univ Hosp, Sch Med & Hlth, Dept Cardiovasc Radiol & Nucl Med, German Heart Ctr, Munich, Germany
DOI
10.1148/radiol.241139
Chinese Library Classification number
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Background: Rapid advances in large language models (LLMs) have led to the development of numerous commercial and open-source models. While recent publications have explored OpenAI's GPT-4 to extract information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to leading open-source models.
Purpose: To compare different leading open-source LLMs to GPT-4 on the task of extracting relevant findings from chest radiograph reports.
Materials and Methods: Two independent datasets of free-text radiology reports from chest radiograph examinations were used in this retrospective study performed between February 2, 2024, and February 14, 2024. The first dataset consisted of reports from the ImaGenome dataset, providing reference standard annotations from the MIMIC-CXR database acquired between 2011 and 2016. The second dataset consisted of randomly selected reports created at the Massachusetts General Hospital between July 2019 and July 2021. In both datasets, the commercial models GPT-3.5 Turbo and GPT-4 were compared with open-source models that included Mistral-7B and Mixtral-8x7B (Mistral AI), Llama 2-13B and Llama 2-70B (Meta), and Qwen1.5-72B (Alibaba Group), as well as CheXbert and CheXpert-labeler (Stanford ML Group), in their ability to accurately label the presence of multiple findings in radiograph text reports using zero-shot and few-shot prompting. The McNemar test was used to compare F1 scores between models.
Results: On the ImaGenome dataset (n = 450), the open-source model with the highest score, Llama 2-70B, achieved micro F1 scores of 0.97 and 0.97 for zero-shot and few-shot prompting, respectively, compared with the GPT-4 F1 scores of 0.98 and 0.98 (P > .99 and P < .001 for superiority of GPT-4). On the institutional dataset (n = 500), the open-source model with the highest score, an ensemble model, achieved micro F1 scores of 0.96 and 0.97 for zero-shot and few-shot prompting, respectively, compared with the GPT-4 F1 scores of 0.98 and 0.97 (P < .001 and P > .99 for superiority of GPT-4).
Conclusion: Although GPT-4 was superior to open-source models in zero-shot report labeling, few-shot prompting with a small number of example reports closely matched the performance of GPT-4. The benefit of few-shot prompting varied across datasets and models.
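The abstract reports micro F1 scores and McNemar test P values without stating how they were computed. The following is a minimal sketch of both quantities, not the authors' implementation. It assumes binary finding labels per report, pools all report-finding pairs for the micro average, and uses a two-sided exact (binomial) form of the McNemar test on pooled per-label correctness. The abstract's "for superiority" wording may correspond to a different variant. The function names and the toy labels are hypothetical.

```python
from math import comb


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every report-finding pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t and p:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on per-label correctness of two models.

    Only discordant pairs count: labels that exactly one model got right.
    """
    a_only = b_only = 0
    for t_row, a_row, b_row in zip(y_true, pred_a, pred_b):
        for t, a, b in zip(t_row, a_row, b_row):
            a_ok, b_ok = (a == t), (b == t)
            if a_ok and not b_ok:
                a_only += 1
            elif b_ok and not a_ok:
                b_only += 1
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical): 3 reports x 3 findings, 1 = finding present.
y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 0]]
model_a = [[1, 0, 1], [0, 0, 1], [1, 0, 0]]
model_b = [[1, 1, 1], [0, 0, 0], [1, 1, 0]]

f1_a = micro_f1(y_true, model_a)
f1_b = micro_f1(y_true, model_b)
p_value = mcnemar_exact(y_true, model_a, model_b)
print(f"micro F1 A={f1_a:.3f}, B={f1_b:.3f}, McNemar P={p_value:.3f}")
```

The McNemar test depends only on the discordant pairs, so two models with nearly identical F1 scores can still differ significantly when their errors fall on different labels. This is consistent with the abstract reporting both P > .99 and P < .001 alongside similar scores.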
Pages: 8