Evaluating large language models as agents in the clinic

Cited by: 0
Authors
Nikita Mehandru
Brenda Y. Miao
Eduardo Rodriguez Almaraz
Madhumita Sushil
Atul J. Butte
Ahmed Alaa
Affiliations
[1] Bakar Computational Health Sciences Institute, University of California San Francisco
[2] University of California, Berkeley
[3] Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco
[4] Department of Epidemiology and Biostatistics, University of California San Francisco
[5] Department of Pediatrics, University of California San Francisco
Source
Keywords
DOI: Not available
CLC number
Subject classification
Abstract
Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial for deploying LLM agents in medical settings.
Related papers
50 items in total
  • [31] On Evaluating Adversarial Robustness of Large Vision-Language Models
    Zhao, Yunqing
    Pang, Tianyu
    Du, Chao
    Yang, Xiao
    Li, Chongxuan
    Cheung, Ngai-Man
    Lin, Min
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [32] Evaluating the persuasive influence of political microtargeting with large language models
    Hackenburg, Kobi
    Margetts, Helen
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2024, 121 (24)
  • [33] Evaluating the Ability of Large Language Models to Generate Motivational Feedback
    Gaeta, Angelo
    Orciuoli, Francesco
    Pascuzzo, Antonella
    Peduto, Angela
    GENERATIVE INTELLIGENCE AND INTELLIGENT TUTORING SYSTEMS, PT I, ITS 2024, 2024, 14798 : 188 - 201
  • [34] Establishing vocabulary tests as a benchmark for evaluating large language models
    Martinez, Gonzalo
    Conde, Javier
    Merino-Gomez, Elena
    Bermudez-Margaretto, Beatriz
    Hernandez, Jose Alberto
    Reviriego, Pedro
    Brysbaert, Marc
    PLOS ONE, 2024, 19 (12)
  • [35] Evaluating Attribute Comprehension in Large Vision-Language Models
    Zhang, Haiwen
    Yang, Zixi
    Liu, Yuanzhi
    Wang, Xinran
    He, Zheqi
    Liang, Kongming
    Ma, Zhanyu
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 98 - 113
  • [36] Evaluating the Application of Large Language Models in Clinical Research Contexts
    Perlis, Roy H.
    Fihn, Stephan D.
    JAMA NETWORK OPEN, 2023, 6 (10)
  • [37] Framework for evaluating code generation ability of large language models
    Yeo, Sangyeop
    Ma, Yu-Seung
    Kim, Sang Cheol
    Jun, Hyungkook
    Kim, Taeho
    ETRI JOURNAL, 2024, 46 (01) : 106 - 117
  • [38] Evaluating Large Language Models in Cybersecurity Knowledge with Cisco Certificates
    Keppler, Gustav
    Kunz, Jeremy
    Hagenmeyer, Veit
    Elbez, Ghada
    SECURE IT SYSTEMS, NORDSEC 2024, 2025, 15396 : 219 - 238
  • [39] Towards evaluating and building versatile large language models for medicine
    Wu, Chaoyi
    Qiu, Pengcheng
    Liu, Jinxin
    Gu, Hongfei
    Li, Na
    Zhang, Ya
    Wang, Yanfeng
    Xie, Weidi
    NPJ DIGITAL MEDICINE, 2025, 8 (01)
  • [40] An astronomical question answering dataset for evaluating large language models
    Li, Jie
    Zhao, Fuyong
    Chen, Panfeng
    Xie, Jiafu
    Zhang, Xiangrui
    Li, Hui
    Chen, Mei
    Wang, Yanhao
    Zhu, Ming
    SCIENTIFIC DATA, 2025, 12 (01)