Evaluation and mitigation of the limitations of large language models in clinical decision-making

Cited by: 43
Authors
Hager, Paul [1 ,2 ]
Jungmann, Friederike [1 ,2 ]
Holland, Robbie [3 ]
Bhagat, Kunal [4 ]
Hubrecht, Inga [5 ]
Knauer, Manuel [5 ]
Vielhauer, Jakob [6 ]
Makowski, Marcus [2 ]
Braren, Rickmer [2 ]
Kaissis, Georgios [1 ,2 ,3 ,7 ]
Rueckert, Daniel [1 ,3 ]
Affiliations
[1] Tech Univ Munich, Klinikum Rechts Isar, Inst AI & Informat, Munich, Germany
[2] Tech Univ Munich, Inst Diagnost & Intervent Radiol, Klinikum Rechts Isar, Munich, Germany
[3] Imperial Coll, Dept Comp, London, England
[4] ChristianaCare Hlth Syst, Dept Med, Wilmington, DE USA
[5] Tech Univ Munich, Dept Med 3, Klinikum Rechts Isar, Munich, Germany
[6] Ludwig Maximilian Univ Munich, Dept Med 2, Univ Hosp, Munich, Germany
[7] Helmholtz Munich, Inst Machine Learning Biomed Imaging, Reliable AI Grp, Munich, Germany
Keywords
AI; BIAS;
DOI
10.1038/s41591-024-03097-1
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification code
071010 ; 081704 ;
Abstract
Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.
Using a curated dataset of 2,400 cases and a framework to simulate a realistic clinical setting, current large language models are shown to incur substantial pitfalls when used for autonomous clinical decision-making.
Pages: 2613-2622
Page count: 26
Related papers
50 in total
  • [1] Towards AI-assisted cardiology: a reflection on the performance and limitations of using large language models in clinical decision-making
    Salihu, Adil
    Gadiri, Mehdi Ali
    Skalidis, Ioannis
    Meier, David
    Auberson, Denise
    Fournier, Annick
    Fournier, Romain
    Thanou, Dorina
    Abbe, Emmanuel
    Muller, Olivier
    Fournier, Stephane
    EUROINTERVENTION, 2023, 19 (10) : E798 - E801
  • [2] Large language models: Tools for new environmental decision-making
    Nie, Qiyang
    Liu, Tong
    JOURNAL OF ENVIRONMENTAL MANAGEMENT, 2025, 375
  • [3] Enhancing emergency decision-making with knowledge graphs and large language models
    Chen, Minze
    Tao, Zhenxiang
    Tang, Weitong
    Qin, Tingxin
    Yang, Rui
    Zhu, Chunli
    INTERNATIONAL JOURNAL OF DISASTER RISK REDUCTION, 2024, 113
  • [4] The Limitations of Decision-Making
    Walton, Paul
    INFORMATION, 2020, 11 (12) : 1 - 22
  • [5] Potentials and Challenges of Large Language Models (LLMs) in the Context of Administrative Decision-Making
    Pesch, Paulina Jo
    EUROPEAN JOURNAL OF RISK REGULATION, 2025,
  • [6] Assessment of Large Language Models (LLMs) in decision-making support for gynecologic oncology
    Gumilar, Khanisyah Erza
    Indraprasta, Birama R.
    Faridzi, Ach Salman
    Wibowo, Bagus M.
    Herlambang, Aditya
    Rahestyningtyas, Eccita
    Irawan, Budi
    Tambunan, Zulkarnain
    Bustomi, Ahmad Fadhli
    Brahmantara, Bagus Ngurah
    Yu, Zih-Ying
    Hsu, Yu-Cheng
    Pramuditya, Herlangga
    Putra, Very Great E.
    Nugroho, Hari
    Mulawardhana, Pungky
    Tjokroprawiro, Brahmana A.
    Hedianto, Tri
    Ibrahim, Ibrahim H.
    Huang, Jingshan
    Lij, Dongqi
    Lu, Chien-Hsing
    Yang, Jer-Yen
    Liao, Li-Na
    Tan, Ming
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2024, 23 : 4019 - 4026
  • [7] AutoPlan: Automatic Planning of Interactive Decision-Making Tasks With Large Language Models
    Ouyang, Siqi
    Li, Lei
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 3114 - 3128
  • [8] Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models' feasibility in clinical decision-making
    Sblendorio, Elena
    Dentamaro, Vincenzo
    Lo Cascio, Alessio
    Germini, Francesco
    Piredda, Michela
    Cicolini, Giancarlo
    INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS, 2024, 188
  • [10] Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review
    Ho, Cindy N.
    Tian, Tiffany
    Ayers, Alessandra T.
    Aaron, Rachel E.
    Phillips, Vidith
    Wolf, Risa M.
    Mathioudakis, Nestoras
    Dai, Tinglong
    Klonoff, David C.
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2024, 24 (01)