Large language models encode clinical knowledge

Cited by: 779
Authors
Singhal, Karan [1 ]
Azizi, Shekoofeh [1 ]
Tu, Tao [1 ]
Mahdavi, S. Sara [1 ]
Wei, Jason [1 ]
Chung, Hyung Won [1 ]
Scales, Nathan [1 ]
Tanwani, Ajay [1 ]
Cole-Lewis, Heather [1 ]
Pfohl, Stephen [1 ]
Payne, Perry [1 ]
Seneviratne, Martin [1 ]
Gamble, Paul [1 ]
Kelly, Chris [1 ]
Babiker, Abubakr [1 ]
Schaerli, Nathanael [1 ]
Chowdhery, Aakanksha [1 ]
Mansfield, Philip [1 ]
Demner-Fushman, Dina [2 ]
Arcas, Blaise Aguera y [1 ]
Webster, Dale [1 ]
Corrado, Greg S. [1 ]
Matias, Yossi [1 ]
Chou, Katherine [1 ]
Gottweis, Juraj [1 ]
Tomasev, Nenad [3 ]
Liu, Yun [1 ]
Rajkomar, Alvin [1 ]
Barral, Joelle [1 ]
Semturs, Christopher [1 ]
Karthikesalingam, Alan [1 ]
Natarajan, Vivek [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Natl Lib Med, Bethesda, MD USA
[3] DeepMind, London, England
Keywords
HARM;
DOI
10.1038/s41586-023-06291-2
Chinese Library Classification: O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification: 07; 0710; 09
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
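The instruction prompt tuning described in the abstract keeps the LLM's weights frozen and learns only a small set of soft-prompt vectors that are prepended to the input embeddings. A minimal NumPy sketch of that mechanism, with all names and dimensions illustrative rather than taken from the paper:

```python
import numpy as np

# Sketch of (instruction) prompt tuning: the base model's embedding table is
# frozen; only a short block of soft-prompt vectors is trainable. Dimensions
# here are toy values, not the 540B-parameter PaLM configuration.
rng = np.random.default_rng(0)

VOCAB, D_MODEL, PROMPT_LEN = 100, 16, 4

frozen_embeddings = rng.normal(size=(VOCAB, D_MODEL))        # frozen LLM weights
soft_prompt = rng.normal(size=(PROMPT_LEN, D_MODEL)) * 0.01  # only trainable parameters

def embed_with_soft_prompt(token_ids):
    """Prepend the learned soft-prompt vectors to the token embeddings."""
    tok_emb = frozen_embeddings[token_ids]               # (seq, d_model)
    return np.concatenate([soft_prompt, tok_emb])        # (prompt_len + seq, d_model)

x = embed_with_soft_prompt(np.array([3, 7, 42]))
print(x.shape)  # (7, 16): 4 soft-prompt vectors + 3 token embeddings
```

During tuning, gradients would flow only into `soft_prompt`, which is what makes the approach parameter-efficient: the number of trained values is `PROMPT_LEN * D_MODEL` regardless of model size.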
Pages: 172-180
Page count: 28