Assessing Completeness of Clinical Histories Accompanying Imaging Orders Using Adapted Open-Source and Closed-Source Large Language Models

Cited by: 0
Authors:
Larson, David B. [1,2]
Koirala, Arogya [2]
Cheuy, Lina Y. [1,2]
Paschali, Magdalini [1,2]
Van Veen, Dave [3]
Na, Hye Sun [1,2]
Petterson, Matthew B. [1]
Fang, Zhongnan [1,2]
Chaudhari, Akshay S. [1,2,4]
Affiliations:
[1] Stanford Univ, Dept Radiol, Sch Med, 453 Quarry Rd, MC 5659, Stanford, CA 94304 USA
[2] Stanford Univ, AI Dev & Evaluat Lab, Sch Med, Palo Alto, CA 94305 USA
[3] Stanford Univ, Dept Elect Engn, Stanford, CA USA
[4] Stanford Univ, Dept Biomed Data Sci, Stanford, CA USA
DOI: 10.1148/radiol.241051
Chinese Library Classification: R8 [Special Medicine]; R445 [Diagnostic Imaging]
Subject Classification Codes: 1002; 100207; 1009
Abstract:
Background: Incomplete clinical histories are a well-known problem in radiology. Previous dedicated quality improvement efforts focusing on reproducible assessments of the completeness of free-text clinical histories have relied on tedious manual analysis.

Purpose: To adapt and evaluate open-source and closed-source large language models (LLMs) for their ability to automatically extract clinical history elements within imaging orders and to use the best-performing adapted open-source model to assess the completeness of a large sample of clinical histories as a benchmark for clinical practice.

Materials and Methods: This retrospective single-site study used previously extracted information accompanying CT, MRI, US, and radiography orders from August 2020 to May 2022 at an adult and pediatric emergency department of a 613-bed tertiary academic medical center. Two open-source (Llama 2-7B [Meta], Mistral-7B [Mistral AI]) and one closed-source (GPT-4 Turbo [OpenAI]) LLMs were adapted using prompt engineering, in-context learning, and fine-tuning (open-source only) to extract the elements "past medical history," "what," "when," "where," and "clinical concern" from clinical histories. Model performance, interreader agreement using Cohen kappa (none to slight, 0.01-0.20; fair, 0.21-0.40; moderate, 0.41-0.60; substantial, 0.61-0.80; almost perfect, 0.81-1.00), and semantic similarity between the models and the adjudicated manual annotations of two board-certified radiologists with 16 and 3 years of postfellowship experience, respectively, were assessed using accuracy, Cohen kappa, and BERTScore, an LLM metric that quantifies how well two pieces of text convey the same meaning; 95% CIs were also calculated. The best-performing open-source model was then used to assess completeness on a large dataset of unannotated clinical histories.

Results: A total of 50 186 clinical histories were included (794 training, 150 validation, 300 initial testing, 48 942 real-world application). Of the two open-source models, Mistral-7B outperformed Llama 2-7B in assessing completeness and was further fine-tuned. Both Mistral-7B and GPT-4 Turbo showed substantial overall agreement with radiologists (mean kappa, 0.73 [95% CI: 0.67, 0.78] to 0.77 [95% CI: 0.71, 0.82]) and adjudicated annotations (mean BERTScore, 0.96 [95% CI: 0.96, 0.97] for both models; P = .38). Mistral-7B also rivaled GPT-4 Turbo in performance (weighted overall mean accuracy, 91% [95% CI: 89, 93] vs 92% [95% CI: 90, 94]; P = .31) despite being a smaller model. Using Mistral-7B, 26.2% (12 803 of 48 942) of unannotated clinical histories were found to contain all five elements.

Conclusion: An easily deployable fine-tuned open-source LLM (Mistral-7B), rivaling GPT-4 Turbo in performance, could effectively extract clinical history elements with substantial agreement with radiologists and produce a benchmark for completeness of a large sample of clinical histories. The model and code will be fully open-sourced.
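To make the evaluation above concrete, the following is a minimal sketch, in Python, of the reported metrics on toy data: per-element agreement via Cohen kappa and accuracy, span-level semantic similarity via BERTScore, and the five-element completeness rule. All data, variable names, and the is_complete helper are hypothetical illustrations, not the authors' released pipeline; the sketch assumes the scikit-learn and bert-score packages.

```python
# Minimal sketch (toy data) of the evaluation reported in the abstract:
# per-element Cohen kappa and accuracy, BERTScore for extracted spans,
# and the all-five-elements completeness rule.
# Assumes: pip install scikit-learn bert-score
from sklearn.metrics import accuracy_score, cohen_kappa_score
from bert_score import score as bert_score

# The five clinical history elements assessed in the study.
ELEMENTS = ["past medical history", "what", "when", "where", "clinical concern"]

# Hypothetical presence labels for four orders (1 = element found).
radiologist = {e: [1, 0, 1, 1] for e in ELEMENTS}  # adjudicated annotations
model_pred = {e: [1, 0, 0, 1] for e in ELEMENTS}   # LLM extractions

for e in ELEMENTS:
    kappa = cohen_kappa_score(radiologist[e], model_pred[e])
    acc = accuracy_score(radiologist[e], model_pred[e])
    print(f"{e}: kappa={kappa:.2f}, accuracy={acc:.0%}")

# Semantic similarity between extracted spans and reference annotations.
model_spans = ["fell from a ladder yesterday", "right wrist"]
ref_spans = ["fall from ladder 1 day ago", "right wrist"]
P, R, F1 = bert_score(model_spans, ref_spans, lang="en")
print(f"mean BERTScore F1: {F1.mean().item():.2f}")

# Completeness benchmark: a history is complete when all five elements appear.
def is_complete(found: dict[str, bool]) -> bool:
    return all(found.get(e, False) for e in ELEMENTS)

histories = [{e: True for e in ELEMENTS}, {"what": True, "when": True}]
rate = sum(is_complete(h) for h in histories) / len(histories)
print(f"complete histories: {rate:.1%}")
```

In the study itself, these scores were computed against adjudicated radiologist annotations on 300 test histories rather than toy labels, and the completeness rule was then applied to the 48 942 unannotated histories.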
Pages: 11
Related Papers (50 total; 10 shown below):
  • [1] Large language models for error detection in radiology reports: a comparative analysis between closed-source and privacy-compliant open-source models
    Salam, Babak
    Stuewe, Claire
    Nowak, Sebastian
    Sprinkart, Alois M.
    Theis, Maike
    Kravchenko, Dmitrij
    Mesropyan, Narine
    Dell, Tatjana
    Endler, Christoph
    Pieper, Claus C.
    Kuetting, Daniel L.
    Luetkens, Julian A.
    Isaak, Alexander
EUROPEAN RADIOLOGY, 2025
  • [2] Classifying Cancer Stage with Open-Source Clinical Large Language Models
    Chang, Chia-Hsuan
    Lucas, Mary M.
    Lu-Yao, Grace
    Yang, Christopher C.
2024 IEEE 12TH INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS, ICHI 2024, 2024: 76 - 82
  • [3] Re: Open-Source Large Language Models in Radiology
    Kooraki, Soheil
    Bedayat, Arash
    ACADEMIC RADIOLOGY, 2024, 31 (10) : 4293 - 4293
  • [4] Servicing open-source large language models for oncology
    Ray, Partha Pratim
ONCOLOGIST, 2024
  • [5] ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models
    Liu, Qijiong
    Chen, Nuo
    Sakai, Tetsuya
    Wu, Xiao-Ming
PROCEEDINGS OF THE 17TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, WSDM 2024, 2024: 452 - 461
  • [6] A tutorial on open-source large language models for behavioral science
    Hussain, Zak
    Binz, Marcel
    Mata, Rui
    Wulff, Dirk U.
    BEHAVIOR RESEARCH METHODS, 2024, 56 (08) : 8214 - 8237
  • [7] Upgrading Academic Radiology with Open-Source Large Language Models
    Ray, Partha Pratim
    ACADEMIC RADIOLOGY, 2024, 31 (10) : 4291 - 4292
  • [8] Preliminary Systematic Review of Open-Source Large Language Models in Education
    Lin, Michael Pin-Chuan
    Chang, Daniel
    Hall, Sarah
    Jhajj, Gaganpreet
    GENERATIVE INTELLIGENCE AND INTELLIGENT TUTORING SYSTEMS, PT I, ITS 2024, 2024, 14798 : 68 - 77
  • [9] ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models
    Feuer, Benjamin
    Liu, Yurong
    Hegde, Chinmay
    Freire, Juliana
PROCEEDINGS OF THE VLDB ENDOWMENT, 2024, 17 (09): 2279 - 2292
  • [10] Comparison of Frontier Open-Source and Proprietary Large Language Models for Complex Diagnoses
    Buckley, Thomas A.
    Crowe, Byron
    Abdulnour, Raja-Elie E.
    Rodman, Adam
    Manrai, Arjun K.
JAMA HEALTH FORUM, 2025, 6 (03)