Evaluating large language models as agents in the clinic

Cited by: 0

Authors:

Nikita Mehandru

Brenda Y. Miao

Eduardo Rodriguez Almaraz

Madhumita Sushil

Atul J. Butte

Ahmed Alaa

Affiliations:

[1] University of California, Bakar Computational Health Sciences Institute

[2] Berkeley, Neurosurgery Department Division of Neuro

[3] University of California San Francisco, Oncology

[4] University of California San Francisco, Department of Epidemiology and Biostatistics

[5] University of California San Francisco, Department of Pediatrics

[6] University of California San Francisco

Source:

npj Digital Medicine, Volume 7

Abstract:

Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial towards deploying LLM agents in medical settings.

Citations

[1] Evaluating large language models as agents in the clinic
Mehandru, Nikita
Miao, Brenda Y.
Almaraz, Eduardo Rodriguez
Sushil, Madhumita
Butte, Atul J.
Alaa, Ahmed
NPJ DIGITAL MEDICINE, 2024, 7 (01)