Large language models facilitate the generation of electronic health record phenotyping algorithms

Cited by: 7
Authors
Yan, Chao [1]
Ong, Henry H. [1]
Grabowska, Monika E. [1]
Krantz, Matthew S. [1]
Su, Wu-Chen [1]
Dickson, Alyson L. [1,2]
Peterson, Josh F. [1,2]
Feng, QiPing [2]
Roden, Dan M. [1]
Stein, C. Michael [2]
Kerchberger, V. Eric [2]
Malin, Bradley A. [1,3,4]
Wei, Wei-Qi [1,3,5]
Affiliations
[1] Vanderbilt Univ, Dept Biomed Informat, Med Ctr, Nashville, TN 37203 USA
[2] Vanderbilt Univ, Dept Med, Med Ctr, Nashville, TN 37203 USA
[3] Vanderbilt Univ, Dept Comp Sci, Nashville, TN 37203 USA
[4] Vanderbilt Univ, Dept Biostat, Med Ctr, Nashville, TN 37203 USA
[5] Vanderbilt Univ, Med Ctr, Dept Biomed Informat & Comp Sci, Suite 1500,2525 West End Ave, Nashville, TN 37203 USA
Keywords
phenotyping; electronic health records; large language models; ChatGPT; medical records
DOI
10.1093/jamia/ocae072
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objectives: Phenotyping is a core task in observational health research utilizing electronic health records (EHRs). Developing an accurate algorithm demands substantial input from domain experts, involving extensive literature review and evidence synthesis. This burdensome process limits scalability and delays knowledge discovery. We investigate the potential for leveraging large language models (LLMs) to enhance the efficiency of EHR phenotyping by generating high-quality algorithm drafts.

Materials and Methods: We prompted four LLMs (GPT-4 and GPT-3.5 of ChatGPT, Claude 2, and Bard) in October 2023, asking them to generate executable phenotyping algorithms in the form of SQL queries adhering to a common data model (CDM) for three phenotypes (ie, type 2 diabetes mellitus, dementia, and hypothyroidism). Three phenotyping experts evaluated the returned algorithms across several critical metrics. We further implemented the top-rated algorithms and compared them against clinician-validated phenotyping algorithms from the Electronic Medical Records and Genomics (eMERGE) network.

Results: GPT-4 and GPT-3.5 exhibited significantly higher overall expert evaluation scores in instruction following, algorithmic logic, and SQL executability, when compared to Claude 2 and Bard. Although GPT-4 and GPT-3.5 effectively identified relevant clinical concepts, they exhibited immature capability in organizing phenotyping criteria with the proper logic, leading to phenotyping algorithms that were either excessively restrictive (with low recall) or overly broad (with low positive predictive values).

Conclusion: GPT versions 3.5 and 4 are capable of drafting phenotyping algorithms by identifying relevant clinical criteria aligned with a CDM. However, expertise in informatics and clinical experience is still required to assess and further refine generated algorithms.
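
The two failure modes named in the Results (excessively restrictive versus overly broad algorithms) can be made concrete with a small calculation. The sketch below is illustrative only and is not the authors' evaluation code: the function name `recall_and_ppv` and the patient identifiers are hypothetical, and it assumes each algorithm simply returns a set of patient IDs that is compared against a reference cohort.

```python
def recall_and_ppv(predicted_ids, reference_ids):
    """Return (recall, PPV) of a predicted cohort against a reference cohort.

    Both arguments are iterables of patient identifiers (hypothetical data).
    """
    predicted = set(predicted_ids)
    reference = set(reference_ids)
    true_positives = len(predicted & reference)
    # Recall: share of reference cases that the algorithm found.
    recall = true_positives / len(reference) if reference else float("nan")
    # PPV: share of flagged patients that are reference cases.
    ppv = true_positives / len(predicted) if predicted else float("nan")
    return recall, ppv


# Hypothetical reference cohort of 8 confirmed cases.
reference_cases = set(range(1, 9))

# An overly restrictive algorithm flags only 2 patients, both true cases.
restrictive_cohort = {1, 2}

# An overly broad algorithm flags 16 patients, including all 8 true cases.
broad_cohort = set(range(1, 17))

print(recall_and_ppv(restrictive_cohort, reference_cases))  # (0.25, 1.0)
print(recall_and_ppv(broad_cohort, reference_cases))        # (1.0, 0.5)
```

In this toy setting the restrictive cohort keeps PPV high but misses most cases, while the broad cohort captures every case at the cost of PPV, which mirrors the trade-off described in the abstract.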
Pages: 1994-2001
Number of pages: 8