Large language models encode clinical knowledge

Cited by: 779
Authors
Singhal, Karan [1 ]
Azizi, Shekoofeh [1 ]
Tu, Tao [1 ]
Mahdavi, S. Sara [1 ]
Wei, Jason [1 ]
Chung, Hyung Won [1 ]
Scales, Nathan [1 ]
Tanwani, Ajay [1 ]
Cole-Lewis, Heather [1 ]
Pfohl, Stephen [1 ]
Payne, Perry [1 ]
Seneviratne, Martin [1 ]
Gamble, Paul [1 ]
Kelly, Chris [1 ]
Babiker, Abubakr [1 ]
Schaerli, Nathanael [1 ]
Chowdhery, Aakanksha [1 ]
Mansfield, Philip [1 ]
Demner-Fushman, Dina [2 ]
Arcas, Blaise Aguera y [1 ]
Webster, Dale [1 ]
Corrado, Greg S. [1 ]
Matias, Yossi [1 ]
Chou, Katherine [1 ]
Gottweis, Juraj [1 ]
Tomasev, Nenad [3 ]
Liu, Yun [1 ]
Rajkomar, Alvin [1 ]
Barral, Joelle [1 ]
Semturs, Christopher [1 ]
Karthikesalingam, Alan [1 ]
Natarajan, Vivek [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Natl Lib Med, Bethesda, MD USA
[3] DeepMind, London, England
Keywords
HARM;
DOI
10.1038/s41586-023-06291-2
Chinese Library Classification: O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification: 07; 0710; 09
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
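The instruction prompt tuning described in the abstract keeps the LLM's weights frozen and learns only a small set of soft-prompt vectors that are prepended to the input embeddings. A minimal NumPy sketch of that mechanism, with all names and dimensions illustrative rather than taken from the paper:

```python
import numpy as np

# Sketch of (instruction) prompt tuning: the base model's embedding table is
# frozen; only a short block of soft-prompt vectors is trainable. Dimensions
# here are toy values, not the 540B-parameter PaLM configuration.
rng = np.random.default_rng(0)

VOCAB, D_MODEL, PROMPT_LEN = 100, 16, 4

frozen_embeddings = rng.normal(size=(VOCAB, D_MODEL))        # frozen LLM weights
soft_prompt = rng.normal(size=(PROMPT_LEN, D_MODEL)) * 0.01  # only trainable parameters

def embed_with_soft_prompt(token_ids):
    """Prepend the learned soft-prompt vectors to the token embeddings."""
    tok_emb = frozen_embeddings[token_ids]               # (seq, d_model)
    return np.concatenate([soft_prompt, tok_emb])        # (prompt_len + seq, d_model)

x = embed_with_soft_prompt(np.array([3, 7, 42]))
print(x.shape)  # (7, 16): 4 soft-prompt vectors + 3 token embeddings
```

During tuning, gradients would flow only into `soft_prompt`, which is what makes the approach parameter-efficient: the number of trained values is `PROMPT_LEN * D_MODEL` regardless of model size.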
Pages: 172-180
Page count: 28