Evaluation and mitigation of cognitive biases in medical language models

Cited by: 1
Authors
Schmidgall, Samuel [1]
Harris, Carl [2]
Essien, Ime [2]
Olshvang, Daniel [2]
Rahman, Tawsifur [2]
Kim, Ji Woong [3]
Ziaei, Rojin [4]
Eshraghian, Jason [5]
Abadir, Peter [6]
Chellappa, Rama [1,2]
Affiliations
[1] Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA
[2] Johns Hopkins Univ, Dept Biomed Engn, Baltimore, MD USA
[3] Johns Hopkins Univ, Dept Mech Engn, Baltimore, MD USA
[4] Univ Maryland, Dept Comp Sci, College Pk, MD USA
[5] Univ Calif Santa Cruz, Dept Elect & Comp Engn, Santa Cruz, CA USA
[6] Johns Hopkins Univ, Sch Med, Div Geriatr Med & Gerontol, Baltimore, MD USA
Source
NPJ DIGITAL MEDICINE, 2024, Vol. 7, No. 1
Funding
US National Science Foundation; US National Institutes of Health
DOI
10.1038/s41746-024-01283-6
Chinese Library Classification
R19 [Health care organization and services (health services administration)]
Abstract
Increasing interest in applying large language models (LLMs) to medicine stems in part from their impressive performance on medical exam questions. However, these exams do not capture the complexity of real patient-doctor interactions, which are shaped by factors such as patient compliance, experience, and cognitive bias. We hypothesized that LLMs would produce less accurate responses to clinically biased questions than to unbiased ones. To test this, we developed the BiasMedQA dataset, which consists of 1,273 USMLE questions modified to replicate common clinically relevant cognitive biases. We assessed six LLMs on BiasMedQA and found that GPT-4 stood out for its resilience to bias, whereas Llama 2 70B-chat and PMC Llama 13B showed large drops in performance. Additionally, we introduced three bias mitigation strategies, which improved accuracy but did not fully restore it. Our findings highlight the need to improve the robustness of LLMs to cognitive biases so that they can be applied more reliably in healthcare.
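The evaluation protocol the abstract describes lends itself to a simple implementation: take a multiple-choice USMLE-style question, prepend a bias-inducing statement, query a model on both the original and the modified version, and compare accuracy. The Python sketch below is a minimal illustration under stated assumptions: the MCQ structure, the add_confirmation_bias template text, and the abstract ask_model callable are hypothetical stand-ins, not the authors' released BiasMedQA code or prompt wording.

# Minimal sketch with hypothetical names; not the authors' released code.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MCQ:
    stem: str                 # USMLE-style question stem
    options: Dict[str, str]   # option letter -> option text
    answer: str               # correct option letter

def add_confirmation_bias(q: MCQ, suggested: str) -> MCQ:
    # Hypothetical bias template: nudge the model toward one (possibly wrong) option.
    biased_stem = (
        f"{q.stem}\nThe attending physician is convinced the diagnosis is "
        f"{q.options[suggested]}."
    )
    return MCQ(stem=biased_stem, options=q.options, answer=q.answer)

def format_prompt(q: MCQ) -> str:
    opts = "\n".join(f"{k}. {v}" for k, v in q.options.items())
    return f"{q.stem}\n{opts}\nAnswer with a single option letter."

def accuracy(questions: List[MCQ], ask_model: Callable[[str], str]) -> float:
    # ask_model wraps whichever LLM is being evaluated and returns an option letter.
    correct = sum(ask_model(format_prompt(q)) == q.answer for q in questions)
    return correct / len(questions)

# Usage: the robustness gap is the accuracy drop between unbiased and biased prompts.
# base_acc   = accuracy(questions, ask_model)
# biased_acc = accuracy([add_confirmation_bias(q, "B") for q in questions], ask_model)
# robustness_gap = base_acc - biased_acc

A mitigation strategy could be approximated in the same framework by, for example, prepending a warning about the named bias to format_prompt's output before querying the model; the exact mitigation prompts used in the study are defined in the paper, not here.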
Pages: 9