Evaluating large language models as agents in the clinic

Cited by: 0

Authors:

Nikita Mehandru

Brenda Y. Miao

Eduardo Rodriguez Almaraz

Madhumita Sushil

Atul J. Butte

Ahmed Alaa

Affiliations:

[1] University of California, Bakar Computational Health Sciences Institute

[2] Berkeley, Neurosurgery Department Division of Neuro

[3] University of California San Francisco, Oncology

[4] University of California San Francisco, Department of Epidemiology and Biostatistics

[5] University of California San Francisco, Department of Pediatrics

[6] University of California San Francisco

Source:

npj Digital Medicine, vol. 7

Abstract:

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.
