Large language models encode clinical knowledge

Cited by: 779
Authors
Singhal, Karan [1 ]
Azizi, Shekoofeh [1 ]
Tu, Tao [1 ]
Mahdavi, S. Sara [1 ]
Wei, Jason [1 ]
Chung, Hyung Won [1 ]
Scales, Nathan [1 ]
Tanwani, Ajay [1 ]
Cole-Lewis, Heather [1 ]
Pfohl, Stephen [1 ]
Payne, Perry [1 ]
Seneviratne, Martin [1 ]
Gamble, Paul [1 ]
Kelly, Chris [1 ]
Babiker, Abubakr [1 ]
Schaerli, Nathanael [1 ]
Chowdhery, Aakanksha [1 ]
Mansfield, Philip [1 ]
Demner-Fushman, Dina [2 ]
Arcas, Blaise Aguera y [1 ]
Webster, Dale [1 ]
Corrado, Greg S. [1 ]
Matias, Yossi [1 ]
Chou, Katherine [1 ]
Gottweis, Juraj [1 ]
Tomasev, Nenad [3 ]
Liu, Yun [1 ]
Rajkomar, Alvin [1 ]
Barral, Joelle [1 ]
Semturs, Christopher [1 ]
Karthikesalingam, Alan [1 ]
Natarajan, Vivek [1 ]
Affiliations
[1] Google Res, Mountain View, CA 94043 USA
[2] Natl Lib Med, Bethesda, MD USA
[3] DeepMind, London, England
Keywords
HARM;
DOI
10.1038/s41586-023-06291-2
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Subject Classification Code
07; 0710; 09;
Abstract
Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model(1) (PaLM, a 540-billion-parameter LLM) and its instruction-tuned variant, Flan-PaLM(2), on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA(3), MedMCQA(4), PubMedQA(5) and Measuring Massive Multitask Language Understanding (MMLU) clinical topics(6)), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today's models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.
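The instruction prompt tuning described above is parameter-efficient because only a small set of soft prompt vectors, prepended to the frozen model's input embeddings, is trained. The toy sketch below illustrates that idea on a trivial frozen linear "model" with NumPy; the dimensions, the scoring function and all variable names are illustrative assumptions, not the paper's actual architecture or training setup.

```python
import numpy as np

# Toy sketch of soft-prompt tuning: the base "model" (a frozen linear
# scorer over mean-pooled embeddings) is never updated. Only the
# n_prompt x d soft prompt vectors prepended to the input are trained.
rng = np.random.default_rng(0)
d, n_prompt, seq_len = 8, 3, 5

W_frozen = rng.normal(size=(d,))        # frozen model weights
X = rng.normal(size=(seq_len, d))       # frozen token embeddings
y = 1.0                                 # target score for this example
soft_prompt = np.zeros((n_prompt, d))   # the only trainable parameters

def forward(prompt):
    # Prepend the soft prompt, mean-pool the sequence, apply frozen scorer.
    h = np.concatenate([prompt, X], axis=0).mean(axis=0)
    return float(h @ W_frozen)

lr = 0.5
for _ in range(200):
    err = forward(soft_prompt) - y      # gradient of 0.5 * err**2
    # Each prompt vector contributes W_frozen / (n_prompt + seq_len)
    # to the pooled score, so the gradient is the same for every row.
    grad = err * np.tile(W_frozen, (n_prompt, 1)) / (n_prompt + seq_len)
    soft_prompt -= lr * grad

print(round(forward(soft_prompt), 3))   # prediction converges toward y
```

Note the trainable parameter count here is `n_prompt * d = 24`, independent of the frozen model's size; that independence is what makes the approach attractive for adapting very large models with a few exemplars.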
Pages: 172 / +
Page count: 28