An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

Cited by: 1
Authors
Serapio, Adrian [1 ]
Chaudhari, Gunvant [3 ]
Savage, Cody [2 ]
Lee, Yoo Jin [1 ]
Vella, Maya [1 ]
Sridhar, Shravan [1 ]
Schroeder, Jamie Lee [4 ]
Liu, Jonathan [1 ]
Yala, Adam [5 ,6 ]
Sohn, Jae Ho [1 ]
Affiliations
[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA
[2] Univ Maryland, Med Ctr, Dept Radiol, Baltimore, MD USA
[3] Univ Washington, Dept Radiol, Seattle, WA USA
[4] MedStar Georgetown Univ Hosp, Washington, DC USA
[5] Univ Calif Berkeley, Computat Precis Hlth, Berkeley, CA USA
[6] Univ Calif San Francisco, San Francisco, CA USA
Source
BMC MEDICAL IMAGING | 2024, Vol. 24, No. 1
Keywords
Natural language processing; Large language model; Open-source; Summarization; Impressions;
DOI
10.1186/s12880-024-01435-w
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Discipline Codes
1002; 100207; 1009
Abstract
Background
The impression section integrates the key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source Large Language Model (LLM) for automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.

Methods
In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both under a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, an automatic metric that measures word overlap, was used for natural language evaluation. A reader study with five cardiothoracic radiologists was performed to more strictly evaluate the model's performance on a specific modality (CT chest exams) against a subspecialist radiologist baseline. We stratified the results of the reader performance study by diagnosis category and original impression length to gauge case complexity.

Results
The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on UCSFMC and, upon external validation, ROUGE-L scores of 40.74, 37.89, and 24.61 on ZSFG across the CT, US, and MRI modalities, respectively, implying a substantial degree of overlap between the model-generated impressions and those written by the subspecialist attending radiologists, with some degradation upon external validation. In our reader study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, 12.32 words, and 84, while the original impressions written by a subspecialist radiologist achieved overall mean scores of 3.75/4, 3.87/4, 3.54/4, 12.2 s, 5.74 words, and 89 for clinical accuracy, grammatical accuracy, stylistic quality, edit time, edit distance, and ROUGE-L score, respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and on shorter impressions.

Conclusions
An open-source fine-tuned LLM can generate impressions to a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models in drafting radiology report impressions that can aid in streamlining radiologists' workflows.
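ROUGE-L, the abstract's central metric, scores the longest common subsequence (LCS) of tokens shared between a generated impression and the reference impression, combined into an F-measure. A minimal sketch of the computation (whitespace tokenization; the function names are illustrative and not from the paper's code):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b,
    via the classic dynamic-programming recurrence."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two texts: precision and recall of the LCS
    length against the candidate and reference token counts."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Note that the scores reported in the abstract (e.g., 46.51) are on a 0-100 scale, i.e., this F1 multiplied by 100; published implementations also apply their own tokenization and stemming, which this sketch omits.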
Pages: 14
Related Articles
(50 items total)
  • [11] EpilepsyLLM: Domain-Specific Large Language Model Fine-tuned with Epilepsy Medical Knowledge
    Zhao, Xuyang
    Zhao, Qibin
    Tanaka, Toshihisa
    arXiv
  • [12] AECR: Automatic attack technique intelligence extraction based on fine-tuned large language model
    Chen, Minghao
    Zhu, Kaijie
    Lu, Bin
    Li, Ding
    Yuan, Qingjun
    Zhu, Yuefei
    COMPUTERS & SECURITY, 2025, 150
  • [13] Comparing Fine-Tuned Transformers and Large Language Models for Sales Call Classification: A Case Study
    Eisenstadt, Roy
    Asi, Abedelkader
    Ronen, Royi
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023: 5240-5241
  • [14] Fine-Tuned Large Language Model for Extracting Patients on Pretreatment for Lung Cancer from a Picture Archiving and Communication System Based on Radiological Reports
    Yasaka, Koichiro
    Kanzawa, Jun
    Kanemaru, Noriko
    Koshino, Saori
    Abe, Osamu
    JOURNAL OF IMAGING INFORMATICS IN MEDICINE, 2025, 38 (01): 327-334
  • [15] Accelerating the Classification of NOVA Food Processing Levels Using a Fine-Tuned Language Model: A Multi-Country Study
    Hu, Guanlan
    Flexner, Nadia
    Tiscornia, Maria Victoria
    L'Abbe, Mary R.
    NUTRIENTS, 2023, 15 (19)
  • [16] MIRA-ChatGLM: A Fine-Tuned Large Language Model for Intelligent Risk Assessment in Coal Mining
    Sun, Yi
    Zhang, Chao
    Wang, Chen
    Han, Ying
    APPLIED SCIENCES-BASEL, 2024, 14 (24)
  • [17] Extracting structured data from organic synthesis procedures using a fine-tuned large language model
    Ai, Qianxiang
    Meng, Fanwang
    Shi, Jiale
    Pelkie, Brenden
    Coley, Connor W.
    DIGITAL DISCOVERY, 2024, 3 (09): 1822-1831
  • [18] Assessing Programming Proficiency Through Eye Gaze Analysis Using Fine-Tuned Large Language Model
    Li, Zheng
    Holly, Dominic
    PROCEEDINGS OF THE 2024 IEEE 10TH IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING, HPSC 2024, 2024: 7-12
  • [19] Fine-Tuning and Evaluating Open-Source Large Language Models for the Army Domain
    Ruiz, Maj Daniel C.
    Sell, John
    arXiv
  • [20] The Fine-Tuned Large Language Model for Extracting the Progressive Bone Metastasis from Unstructured Radiology Reports
    Kanemaru, Noriko
    Yasaka, Koichiro
    Fujita, Nana
    Kanzawa, Jun
    Abe, Osamu
    JOURNAL OF IMAGING INFORMATICS IN MEDICINE, 2025, 38 (02): 865-872