Large language models encode clinical knowledge

Cited by: 779
Authors
Singhal, Karan [1 ]
Azizi, Shekoofeh [1 ]
Tu, Tao [1 ]
Mahdavi, S. Sara [1 ]
Wei, Jason [1 ]
Chung, Hyung Won [1 ]
Scales, Nathan [1 ]
Tanwani, Ajay [1 ]
Cole-Lewis, Heather [1 ]
Pfohl, Stephen [1 ]
Payne, Perry [1 ]
Seneviratne, Martin [1 ]
Gamble, Paul [1 ]
Kelly, Chris [1 ]
Babiker, Abubakr [1 ]
Schaerli, Nathanael [1 ]
Chowdhery, Aakanksha [1 ]
Mansfield, Philip [1 ]
Demner-Fushman, Dina [2 ]
Arcas, Blaise Aguera y [1 ]
Webster, Dale [1 ]
Corrado, Greg S. [1 ]
Matias, Yossi [1 ]
Chou, Katherine [1 ]
Gottweis, Juraj [1 ]
Tomasev, Nenad [3 ]
Liu, Yun [1 ]
Rajkomar, Alvin [1 ]
Barral, Joelle [1 ]
Semturs, Christopher [1 ]
Karthikesalingam, Alan [1 ]
Natarajan, Vivek [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Natl Lib Med, Bethesda, MD USA
[3] DeepMind, London, England
Keywords
HARM;
DOI
10.1038/s41586-023-06291-2
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Code
07; 0710; 09;
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
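The instruction prompt tuning described above is parameter-efficient because only a small set of soft prompt vectors, prepended to the frozen model's input embeddings, is trained. The toy sketch below illustrates that idea on a trivial frozen linear "model" with NumPy; the dimensions, the scoring function and all variable names are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import numpy as np

# Toy sketch of soft-prompt tuning: the base "model" (a frozen linear
# scorer over mean-pooled embeddings) is never updated. Only the
# n_prompt x d soft prompt vectors prepended to the input are trained.
rng = np.random.default_rng(0)
d, n_prompt, seq_len = 8, 3, 5

W_frozen = rng.normal(size=(d,))        # frozen model weights
X = rng.normal(size=(seq_len, d))       # frozen token embeddings
y = 1.0                                 # target score for this example
soft_prompt = np.zeros((n_prompt, d))   # the only trainable parameters

def forward(prompt):
    # Prepend the soft prompt, mean-pool the sequence, apply frozen scorer.
    h = np.concatenate([prompt, X], axis=0).mean(axis=0)
    return float(h @ W_frozen)

lr = 0.5
for _ in range(200):
    err = forward(soft_prompt) - y      # gradient of 0.5 * err**2
    # Each prompt vector contributes W_frozen / (n_prompt + seq_len)
    # to the pooled score, so the gradient is the same for every row.
    grad = err * np.tile(W_frozen, (n_prompt, 1)) / (n_prompt + seq_len)
    soft_prompt -= lr * grad

print(round(forward(soft_prompt), 3))   # prediction converges toward y
```

Note the trainable parameter count here is `n_prompt * d = 24`, independent of the frozen model's size; that independence is what makes the approach attractive for adapting very large models with a few exemplars.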
Pages: 172 / +
Page count: 28