An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

Cited: 1
Authors
Serapio, Adrian [1 ]
Chaudhari, Gunvant [3 ]
Savage, Cody [2 ]
Lee, Yoo Jin [1 ]
Vella, Maya [1 ]
Sridhar, Shravan [1 ]
Schroeder, Jamie Lee [4 ]
Liu, Jonathan [1 ]
Yala, Adam [5 ,6 ]
Sohn, Jae Ho [1 ]
Affiliations
[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA
[2] Univ Maryland, Med Ctr, Dept Radiol, Baltimore, MD USA
[3] Univ Washington, Dept Radiol, Seattle, WA USA
[4] MedStar Georgetown Univ Hosp, Washington, DC USA
[5] Univ Calif Berkeley, Computat Precis Hlth, Berkeley, CA USA
[6] Univ Calif San Francisco, San Francisco, CA USA
Source
BMC MEDICAL IMAGING | 2024, Vol. 24, No. 1
Keywords
Natural language processing; Large language model; Open-source; Summarization; Impressions;
DOI
10.1186/s12880-024-01435-w
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging];
Discipline Codes
1002 ; 100207 ; 1009 ;
Abstract
Background: The impression section integrates the key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source Large Language Model (LLM) for automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.
Methods: In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an automatic natural language evaluation metric that measures word overlap, was used for automatic evaluation. A reader study with five cardiothoracic radiologists was performed to evaluate the model's performance more strictly on a specific modality (chest CT exams) against a subspecialist radiologist baseline. We stratified the results of the reader performance study by diagnosis category and original impression length to gauge case complexity.
Results: The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 at UCSFMC and, upon external validation, 40.74, 37.89, and 24.61 at ZSFG across the CT, US, and MRI modalities, respectively, implying substantial overlap between the model-generated impressions and the impressions written by subspecialist attending radiologists, with some degradation upon external validation. In our reader study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, 12.32 words, and 84, while the original impressions written by subspecialist radiologists achieved 3.75/4, 3.87/4, 3.54/4, 12.2 s, 5.74 words, and 89 for clinical accuracy, grammatical accuracy, stylistic quality, edit time, edit distance, and ROUGE-L score, respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and for shorter impressions.
Conclusions: An open-source fine-tuned LLM can generate impressions with a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists' workflows.
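The abstract reports ROUGE-L, which scores the longest common subsequence of tokens shared between a generated impression and the reference impression. The study presumably used a standard ROUGE implementation; the function names below (`lcs_length`, `rouge_l_f1`) are illustrative, and this minimal sketch assumes the common ROUGE-L F1 variant (beta = 1) on whitespace tokens, reported on a 0-100 scale as in the paper:

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence over token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall,
    # scaled to 0-100 to match the scores quoted in the abstract.
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 100 * 2 * precision * recall / (precision + recall)

# Hypothetical impressions for illustration only (not from the study data):
print(rouge_l_f1("no acute abnormality", "no acute intracranial abnormality"))
```

A model impression sharing 3 of 4 reference tokens in order scores about 85.7 here, which gives a sense of the overlap implied by the reported scores in the mid-40s to low-50s.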
Pages: 14