Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam

Authors
Builoff, Valerie [1]
Shanbhag, Aakash [1, 2]
Miller, Robert J. H. [1, 3]
Dey, Damini [1]
Liang, Joanna X. [1]
Flood, Kathleen [4]
Bourque, Jamieson M. [5]
Chareonthaitawee, Panithaya [6]
Phillips, Lawrence M. [7]
Slomka, Piotr J. [1]
Affiliations
[1] Cedars Sinai Med Ctr, Dept Med, Div Artificial Intelligence Med, Imaging & Biomed Sci, Los Angeles, CA 90048 USA
[2] Univ Southern Calif, Signal & Image Proc Inst, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA USA
[3] Univ Calgary, Dept Cardiac Sci, Calgary, AB, Canada
[4] Amer Soc Nucl Cardiol, Fairfax, VA USA
[5] Univ Virginia Hlth Syst, Div Cardiovasc Med & Radiol, Charlottesville, VA USA
[6] Mayo Clin, Dept Cardiovasc Med, Rochester, MN USA
[7] NYU Grossman Sch Med, Dept Med, Leon H Charney Div Cardiol, New York, NY USA
Funding
National Institutes of Health (USA)
Keywords
Nuclear cardiology board exam; Large language models; GPT; Cardiovascular imaging questions; Performance
DOI
10.1016/j.nuclcard.2024.102089
Chinese Library Classification number
R5 [Internal Medicine]
Discipline classification code
1002; 100201
Abstract
Background: Previous studies have evaluated large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs: GPT-4, GPT-4 Turbo, and GPT-4omni (GPT-4o) from OpenAI, and Gemini from Google Inc. Each model answered questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, which reflects the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions (141 text-only and 27 image-based), categorized into four sections mirroring the CBNC exam. Each LLM was given the same standardized prompt and was applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test was used to compare the proportions of correct responses.
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.5%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo, P < .001 vs GPT-4 and Gemini). On text-only questions, GPT-4o outperformed GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse than each of the other models on image-based questions (P < .001 for all).
Conclusion: GPT-4o demonstrated the best performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
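The Methods name McNemar's test for comparing correct-response proportions between models answering the same questions. The abstract does not say which variant was used or how the 30 runs per section were aggregated, so the following is only a minimal sketch of the exact (binomial) form of the test on paired per-question outcomes. The function name `mcnemar_exact` and the two answer lists are hypothetical illustrations, not data or code from the study.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    a_correct, b_correct: sequences of booleans, one entry per question,
    giving whether model A and model B answered that question correctly.
    Returns (b, c, p_value), where b counts questions only A got right
    and c counts questions only B got right.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same questions")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference

    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical per-question results for two models (NOT study data).
model_a_correct = [True, True, True, True, False, True, False, True]
model_b_correct = [False, False, False, False, True, True, False, True]

b, c, p_value = mcnemar_exact(model_a_correct, model_b_correct)
print(f"only A correct: {b}, only B correct: {c}, p = {p_value:.3f}")
```

The test ignores questions on which both models agree, which is why it suits paired designs like this one, where every model sees the identical question set.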
Pages: 11