Evaluating large language models as agents in the clinic

Cited by: 0

Authors:

Nikita Mehandru

Brenda Y. Miao

Eduardo Rodriguez Almaraz

Madhumita Sushil

Atul J. Butte

Ahmed Alaa

Affiliations:

[1] University of California, Berkeley

[2] Bakar Computational Health Sciences Institute, University of California San Francisco

[3] Neurosurgery Department Division of Neuro-Oncology, University of California San Francisco

[4] Department of Epidemiology and Biostatistics, University of California San Francisco

[5] Department of Pediatrics, University of California San Francisco

Source:

npj Digital Medicine, volume 7

Abstract:

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.
