Evaluating large language models as agents in the clinic

Cited by: 0
Authors
Nikita Mehandru
Brenda Y. Miao
Eduardo Rodriguez Almaraz
Madhumita Sushil
Atul J. Butte
Ahmed Alaa
Affiliations
[1] University of California, Berkeley
[2] Bakar Computational Health Sciences Institute, University of California San Francisco
[3] Neurosurgery Department, Division of Neuro-Oncology, University of California San Francisco
[4] Department of Epidemiology and Biostatistics, University of California San Francisco
[5] Department of Pediatrics, University of California San Francisco
Source
NPJ DIGITAL MEDICINE, 2024, 7 (01)
DOI
Not available
Abstract
Recent developments in large language models (LLMs) have unlocked opportunities for healthcare, from information synthesis to clinical decision support. These LLMs are not just capable of modeling language, but can also act as intelligent “agents” that interact with stakeholders in open-ended conversations and even influence clinical decision-making. Rather than relying on benchmarks that measure a model’s ability to process clinical data or answer standardized test questions, LLM agents can be modeled in high-fidelity simulations of clinical settings and should be assessed for their impact on clinical workflows. These evaluation frameworks, which we refer to as “Artificial Intelligence Structured Clinical Examinations” (“AI-SCE”), can draw from comparable technologies where machines operate with varying degrees of self-governance, such as self-driving cars, in dynamic environments with multiple stakeholders. Developing these robust, real-world clinical evaluations will be crucial for deploying LLM agents in medical settings.
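The abstract argues for evaluating LLM agents inside simulated clinical encounters rather than with static question-answering benchmarks. As a minimal illustrative sketch (not the paper's AI-SCE framework), the toy Python harness below shows one way such a simulation-based evaluation could be wired up: a scripted "station" feeds stakeholder turns to an injected agent callable and scores the resulting transcript with a rubric. All names and the scoring criterion here (Station, run_station, escalation_rubric, toy_agent) are hypothetical.

```python
# Illustrative sketch only: a toy harness that runs an LLM "agent" through a
# simulated multi-turn clinical encounter and scores it against a simple rubric.
# The agent is injected as a plain callable so any model backend can be swapped in.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Station:
    """One simulated encounter: scripted stakeholder turns plus a scoring rubric."""
    name: str
    patient_turns: List[str]                  # scripted stakeholder messages
    rubric: Callable[[List[str]], float]      # maps the agent's replies to a 0-1 score

@dataclass
class EncounterResult:
    station: str
    transcript: List[str] = field(default_factory=list)
    score: float = 0.0

def run_station(agent: Callable[[List[str]], str], station: Station) -> EncounterResult:
    """Feed scripted turns to the agent, collect its replies, then apply the rubric."""
    history: List[str] = []
    replies: List[str] = []
    for turn in station.patient_turns:
        history.append(f"PATIENT: {turn}")
        reply = agent(history)                # the agent sees the running transcript
        history.append(f"AGENT: {reply}")
        replies.append(reply)
    return EncounterResult(station.name, history, station.rubric(replies))

# --- toy usage -------------------------------------------------------------
def escalation_rubric(replies: List[str]) -> float:
    # Hypothetical criterion: did the agent ever recommend contacting a clinician?
    return float(any("clinician" in r.lower() for r in replies))

def toy_agent(history: List[str]) -> str:
    # Stand-in for a real LLM call; always defers appropriately in this toy example.
    return "Based on these symptoms, please contact your clinician today."

station = Station(
    name="chest-pain triage",
    patient_turns=["I've had chest tightness since this morning.",
                   "Should I just wait and see?"],
    rubric=escalation_rubric,
)
result = run_station(toy_agent, station)
print(result.station, result.score)           # -> chest-pain triage 1.0
```

Injecting the agent as a plain callable keeps the harness backend-agnostic, so the same simulated stations and rubrics could be reused to compare different models or prompting strategies.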
Related papers (50 total)
  • [1] Evaluating large language models as agents in the clinic
    Mehandru, Nikita
    Miao, Brenda Y.
    Almaraz, Eduardo Rodriguez
    Sushil, Madhumita
    Butte, Atul J.
    Alaa, Ahmed
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [2] Evaluating large language models for annotating proteins
    Vitale, Rosario
    Bugnon, Leandro A.
    Fenoy, Emilio Luis
    Milone, Diego H.
    Stegmayer, Georgina
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (03)
  • [3] A bilingual benchmark for evaluating large language models
    Alkaoud, Mohamed
    PEERJ COMPUTER SCIENCE, 2024, 10
  • [4] SafetyBench: Evaluating the Safety of Large Language Models
    Zhang, Zhexin
    Lei, Leqi
    Wu, Lindong
    Sun, Rui
    Huang, Yongkang
    Long, Chong
    Liu, Xiao
    Lei, Xuanyu
    Tang, Jie
    Huang, Minlie
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 15537 - 15553
  • [5] Evaluating Large Language Models for Material Selection
    Grandi, Daniele
    Jain, Yash Patawari
    Groom, Allin
    Cramer, Brandon
    McComb, Christopher
    JOURNAL OF COMPUTING AND INFORMATION SCIENCE IN ENGINEERING, 2025, 25 (02)
  • [6] Evaluating large language models in pediatric nephrology
    Filler, Guido
    Niel, Olivier
    PEDIATRIC NEPHROLOGY, 2025,
  • [7] Evaluating large language models on their accuracy and completeness
    Edalat, Camellia
    Kirupaharan, Nila
    Dalvin, Lauren A.
    Mishra, Kapil
    Marshall, Rayna
    Xu, Hannah
    Francis, Jasmine H.
    Berkenstock, Meghan
    RETINA-THE JOURNAL OF RETINAL AND VITREOUS DISEASES, 2025, 45 (01): : 128 - 132
  • [8] Evaluating Intelligence and Knowledge in Large Language Models
    Bianchini, Francesco
    TOPOI-AN INTERNATIONAL REVIEW OF PHILOSOPHY, 2025, 44 (01): : 163 - 173
  • [9] Evaluating large language models for software testing
    Li, Yihao
    Liu, Pan
    Wang, Haiyang
    Chu, Jie
    Wong, W. Eric
    COMPUTER STANDARDS & INTERFACES, 2025, 93
  • [10] Augmenting autotelic agents with large language models
    Colas, Cedric
    Teodorescu, Laetitia
    Oudeyer, Pierre-Yves
    Yuan, Xingdi
    Cote, Marc-Alexandre
    CONFERENCE ON LIFELONG LEARNING AGENTS, VOL 232, 2023, 232 : 205 - 226