Evaluating large language models as agents in the clinic

Cited by: 0
Authors
Nikita Mehandru
Brenda Y. Miao
Eduardo Rodriguez Almaraz
Madhumita Sushil
Atul J. Butte
Ahmed Alaa
Affiliations
[1] University of California, Berkeley
[2] Bakar Computational Health Sciences Institute, University of California San Francisco
[3] Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco
[4] Department of Epidemiology and Biostatistics, University of California San Francisco
[5] Department of Pediatrics, University of California San Francisco
Source
Keywords
DOI
Not available
CLC number
Subject classification
Abstract
Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial for deploying LLM agents in medical settings.
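The abstract proposes evaluating LLM agents inside simulated clinical encounters rather than with static question-answer benchmarks, but it does not prescribe an implementation. As a rough, hypothetical sketch only (every name here, including SimulatedPatient, run_station, and toy_agent, is invented for illustration and is not from the paper), the following Python snippet shows one way such a simulated "station" could be scored on workflow-level behavior, such as how many relevant findings the agent elicits and whether it escalates care:

# Hypothetical sketch of an AI-SCE-style evaluation harness: an LLM agent is
# placed in a scripted clinical encounter and scored on workflow-level criteria
# (information gathering, escalation) rather than single-turn question answering.
# All names and scoring rules below are illustrative assumptions, not the paper's method.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SimulatedPatient:
    """Scripted stakeholder that reveals a finding only when asked about it."""
    findings: dict                      # e.g. {"chest pain": "radiates to left arm"}
    revealed: set = field(default_factory=set)

    def respond(self, agent_utterance: str) -> str:
        hits = [k for k in self.findings if k in agent_utterance.lower()]
        self.revealed.update(hits)
        if hits:
            return "; ".join(self.findings[k] for k in hits)
        return "I'm not sure, can you ask me something more specific?"

def run_station(agent: Callable[[List[str]], str],
                patient: SimulatedPatient,
                max_turns: int = 6) -> dict:
    """Run one simulated encounter ("station") and return workflow metrics."""
    transcript: List[str] = []
    for _ in range(max_turns):
        agent_turn = agent(transcript)          # agent sees the full history
        transcript.append(f"AGENT: {agent_turn}")
        transcript.append(f"PATIENT: {patient.respond(agent_turn)}")
        if "refer" in agent_turn.lower():       # toy escalation criterion
            break
    coverage = len(patient.revealed) / max(len(patient.findings), 1)
    return {"turns": len(transcript) // 2,
            "finding_coverage": coverage,
            "escalated": any("refer" in t.lower()
                             for t in transcript if t.startswith("AGENT"))}

if __name__ == "__main__":
    # Stand-in agent: in practice this would wrap an actual LLM call.
    def toy_agent(history: List[str]) -> str:
        questions = ["Tell me about your chest pain.",
                     "Any shortness of breath?",
                     "I will refer you to cardiology."]
        return questions[min(len(history) // 2, len(questions) - 1)]

    patient = SimulatedPatient(findings={"chest pain": "radiates to left arm",
                                         "shortness of breath": "worse on exertion"})
    print(run_station(toy_agent, patient))

In a real AI-SCE-style evaluation, toy_agent would wrap an actual LLM, the scripted patient would be a richer simulation of clinicians, patients, and other stakeholders, and the returned metrics would be replaced by clinically grounded rubrics defined with domain experts.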
Related papers
50 items in total
  • [11] Evaluating large language models on medical evidence summarization
    Tang, Liyan
    Sun, Zhaoyi
    Idnay, Betina
    Nestor, Jordan G.
    Soroush, Ali
    Elias, Pierre A.
    Xu, Ziyang
    Ding, Ying
    Durrett, Greg
    Rousseau, Justin F.
    Weng, Chunhua
    Peng, Yifan
    NPJ DIGITAL MEDICINE, 2023, 6 (01)
  • [12] Methodological Challenges in Evaluating Large Language Models in Radiology
    Li, David
    Kim, Woojin
    Yi, Paul H.
    RADIOLOGY, 2024, 313 (03)
  • [13] CLAIR: Evaluating Image Captions with Large Language Models
    Chan, David M.
    Petryk, Suzanne
    Gonzalez, Joseph E.
    Darrell, Trevor
    Canny, John
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 13638 - 13646
  • [15] Baby steps in evaluating the capacities of large language models
    Frank, Michael C.
    NATURE REVIEWS PSYCHOLOGY, 2023, 2 (08): 451 - 452
  • [16] Evaluating the ability of large language models to emulate personality
    Wang, Yilei
    Zhao, Jiabao
    Ones, Deniz S.
    He, Liang
    Xu, Xin
    SCIENTIFIC REPORTS, 2025, 15 (01)
  • [17] Evaluating Large Language Models on Controlled Generation Tasks
    Sun, Jiao
    Tian, Yufei
    Zhou, Wangchunshu
    Xu, Nan
    Hu, Qian
    Gupta, Rahul
    Wieting, John
    Peng, Nanyun
    Ma, Xuezhe
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 3155 - 3168
  • [19] EconNLI: Evaluating Large Language Models on Economics Reasoning
    Guo, Yue
    Yang, Yi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 982 - 994
  • [20] Evaluating Large Language Models for Tax Law Reasoning
    Cavalcante Presa, Joao Paulo
    Camilo Junior, Celso Goncalves
    Teles de Oliveira, Savio Salvarino
    INTELLIGENT SYSTEMS, BRACIS 2024, PT I, 2025, 15412: 460 - 474