Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study

Cited: 0
Authors
Workum, Jessica D. [1 ,2 ,3 ]
Volkers, Bas W. S. [1 ,3 ]
van de Sande, Davy [1 ,3 ]
Arora, Sumesh [4 ]
Goeijenbier, Marco [1 ,5 ]
Gommers, Diederik [1 ,3 ]
van Genderen, Michel E. [1 ,3 ]
Affiliations
[1] Erasmus MC Univ Med Ctr, Dept Adult Intens Care, Rotterdam, Netherlands
[2] Elisabeth Tweesteden Hosp, Dept Intens Care, Tilburg, Netherlands
[3] Erasmus MC Univ Med Ctr, Erasmus MC Datahub, Rotterdam, Netherlands
[4] Prince Wales Hosp, Sydney, Australia
[5] Spaarne Gasthuis, Dept Intens Care Med, Hoofddorp, Netherlands
Keywords
Large language models; Generative artificial intelligence; Critical care; Benchmarking
DOI
10.1186/s13054-025-05302-0
CLC Classification
R4 [Clinical Medicine]
Subject Classification Codes
1002; 100602
Abstract
Background: Large language models (LLMs) show increasing potential for use in healthcare, both for administrative support and for clinical decision making. However, reports on their performance in critical care medicine are lacking.
Methods: This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407, and Llama 3.1 70B) on 1181 multiple-choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of critical care questions at European Diploma in Intensive Care examination level. Their performance was compared to random guessing and to 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.
Results: GPT-4o achieved the highest accuracy at 93.3%, followed by Llama 3.1 70B (87.5%), Mistral Large 2407 (87.9%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%). Random guessing yielded 41.5% (p < 0.001). On the practice test, all models surpassed the human physicians, scoring 89.0%, 80.9%, 84.4%, 80.3%, and 66.5%, respectively, compared to 42.7% for random guessing (p < 0.001) and 61.9% for the human physicians. However, in contrast to the other evaluated LLMs (p < 0.001), GPT-3.5-turbo did not significantly outperform the physicians (p = 0.196). Despite high overall consistency, all models gave consistently incorrect answers to some questions. The most expensive model was GPT-4o, costing over 25 times more than the least expensive model, GPT-4o-mini.
Conclusions: LLMs exhibit exceptional accuracy and consistency, with four outperforming human physicians on a European-level practice exam. GPT-4o led in performance but raised concerns about energy consumption. Despite their potential in critical care, all models produced consistently incorrect answers, highlighting the need for more thorough and ongoing evaluations to guide responsible implementation in clinical settings.
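The abstract reports three core metrics: per-model accuracy on the MCQ set, consistency across repeated runs, and a significance test against random guessing. The paper does not publish its scoring code; the following is a minimal sketch of how such metrics could be computed, assuming graded answers are available as simple lists of chosen options. The function names (`accuracy`, `consistency`, `binom_p_greater`) are illustrative, not from the study.

```python
from math import comb

def accuracy(answers, key):
    """Fraction of questions answered correctly against the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)

def consistency(runs):
    """Fraction of questions answered identically across repeated runs.

    `runs` is a list of answer lists, one per run, all the same length.
    """
    return sum(len(set(col)) == 1 for col in zip(*runs)) / len(runs[0])

def binom_p_greater(successes, n, p):
    """One-sided exact binomial p-value: P(X >= successes) under chance rate p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(successes, n + 1))

# Toy example with 4 questions and 2 repeated runs of one model.
key = ["A", "B", "D", "D"]
runs = [["A", "B", "C", "D"],
        ["A", "B", "C", "A"]]

print(accuracy(runs[0], key))        # run 1 got 3 of 4 correct -> 0.75
print(consistency(runs))             # identical on 3 of 4 questions -> 0.75
print(binom_p_greater(3, 4, 0.25))   # chance of >=3 correct guessing at 25%
```

Note that consistency is independent of correctness: a model can be perfectly consistent while consistently wrong, which is exactly the failure mode the study highlights.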
Pages: 8