Evaluation and mitigation of the limitations of large language models in clinical decision-making

Cited by: 43
Authors
Hager, Paul [1 ,2 ]
Jungmann, Friederike [1 ,2 ]
Holland, Robbie [3 ]
Bhagat, Kunal [4 ]
Hubrecht, Inga [5 ]
Knauer, Manuel [5 ]
Vielhauer, Jakob [6 ]
Makowski, Marcus [2 ]
Braren, Rickmer [2 ]
Kaissis, Georgios [1 ,2 ,3 ,7 ]
Rueckert, Daniel [1 ,3 ]
Affiliations
[1] Tech Univ Munich, Klinikum Rechts Isar, Inst AI & Informat, Munich, Germany
[2] Tech Univ Munich, Inst Diagnost & Intervent Radiol, Klinikum Rechts Isar, Munich, Germany
[3] Imperial Coll, Dept Comp, London, England
[4] ChristianaCare Hlth Syst, Dept Med, Wilmington, DE USA
[5] Tech Univ Munich, Dept Med 3, Klinikum Rechts Isar, Munich, Germany
[6] Ludwig Maximilian Univ Munich, Dept Med 2, Univ Hosp, Munich, Germany
[7] Helmholtz Munich, Inst Machine Learning Biomed Imaging, Reliable AI Grp, Munich, Germany
Keywords
AI; BIAS;
DOI
10.1038/s41591-024-03097-1
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification code
071010 ; 081704 ;
Abstract
Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.
Using a curated dataset of 2,400 cases and a framework to simulate a realistic clinical setting, current large language models are shown to incur substantial pitfalls when used for autonomous clinical decision-making.
Pages: 2613-2622
Page count: 26