An open-source fine-tuned large language model for radiological impression generation: a multi-reader performance study

Cited by: 1
Authors
Serapio, Adrian [1 ]
Chaudhari, Gunvant [3 ]
Savage, Cody [2 ]
Lee, Yoo Jin [1 ]
Vella, Maya [1 ]
Sridhar, Shravan [1 ]
Schroeder, Jamie Lee [4 ]
Liu, Jonathan [1 ]
Yala, Adam [5 ,6 ]
Sohn, Jae Ho [1 ]
Affiliations
[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA
[2] Univ Maryland, Med Ctr, Dept Radiol, Baltimore, MD USA
[3] Univ Washington, Dept Radiol, Seattle, WA USA
[4] MedStar Georgetown Univ Hosp, Washington, DC USA
[5] Univ Calif Berkeley, Computat Precis Hlth, Berkeley, CA USA
[6] Univ Calif San Francisco, San Francisco, CA USA
Source
BMC MEDICAL IMAGING, 2024, Vol. 24, Issue 1
Keywords
Natural language processing; Large language model; Open-source; Summarization; Impressions
DOI
10.1186/s12880-024-01435-w
Chinese Library Classification
R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes
1002; 100207; 1009
Abstract
Background: The impression section integrates the key findings of a radiology report but can be subjective and variable. We sought to fine-tune and evaluate an open-source large language model (LLM) for automatically generating impressions from the remainder of a radiology report across different imaging modalities and hospitals.
Methods: In this institutional review board-approved retrospective study, we collated a dataset of CT, US, and MRI radiology reports from the University of California San Francisco Medical Center (UCSFMC) (n = 372,716) and the Zuckerberg San Francisco General (ZSFG) Hospital and Trauma Center (n = 60,049), both within a single institution. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, a metric that measures word overlap, was used for automatic natural language evaluation. A reader study with five cardiothoracic radiologists was performed to evaluate the model's performance more strictly on a single modality (chest CT exams) against a subspecialist radiologist baseline. We stratified the results of the reader performance study by diagnosis category and original impression length to gauge case complexity.
Results: The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on UCSFMC and, upon external validation, ROUGE-L scores of 40.74, 37.89, and 24.61 on ZSFG across the CT, US, and MRI modalities, respectively, indicating substantial overlap between the model-generated impressions and those written by subspecialist attending radiologists, with some degradation upon external validation. In the reader study, the model-generated impressions achieved overall mean scores of 3.56/4, 3.92/4, 3.37/4, 18.29 s, 12.32 words, and 84, while the original impressions written by subspecialist radiologists achieved overall mean scores of 3.75/4, 3.87/4, 3.54/4, 12.2 s, 5.74 words, and 89 for clinical accuracy, grammatical accuracy, stylistic quality, edit time, edit distance, and ROUGE-L score, respectively. The LLM achieved the highest clinical accuracy ratings for acute/emergent findings and for shorter impressions.
Conclusions: An open-source fine-tuned LLM can generate impressions with a satisfactory level of clinical accuracy, grammatical accuracy, and stylistic quality. Our reader performance study demonstrates the potential of large language models to draft radiology report impressions and thereby help streamline radiologists' workflows.
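As an illustration of the evaluation metrics described in the abstract, the sketch below computes a ROUGE-L F-measure between a hypothetical model-generated impression and a hypothetical radiologist-written reference, together with a word-level edit distance. This is a minimal sketch, not the authors' evaluation pipeline: it assumes the open-source rouge-score Python package, the example impression strings are invented for illustration, and the study's edit-distance definition may differ from the simple word-level Levenshtein distance shown here.

    # Minimal sketch of the overlap metrics named in the abstract.
    # Assumptions: the rouge-score package is installed; the example strings
    # below are hypothetical and are not taken from the study data.
    from rouge_score import rouge_scorer


    def word_edit_distance(reference: str, candidate: str) -> int:
        """Word-level Levenshtein distance between two impressions."""
        ref, cand = reference.split(), candidate.split()
        # dp[i][j] = edits to turn the first i reference words into the first j candidate words
        dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(cand) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(cand) + 1):
                cost = 0 if ref[i - 1] == cand[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # delete a reference word
                               dp[i][j - 1] + 1,         # insert a candidate word
                               dp[i - 1][j - 1] + cost)  # substitute a word
        return dp[len(ref)][len(cand)]


    if __name__ == "__main__":
        reference = "No acute intracranial hemorrhage. Stable chronic microvascular changes."
        generated = "No acute hemorrhage identified. Chronic microvascular changes are stable."

        scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
        rouge_l = scorer.score(reference, generated)["rougeL"]

        print(f"ROUGE-L F1: {100 * rouge_l.fmeasure:.2f}")
        print(f"Word-level edit distance: {word_edit_distance(reference, generated)}")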
Pages: 14