Evaluating AI proficiency in nuclear cardiology: Large language models take on the board preparation exam

Cited: 0
Authors
Builoff, Valerie [1 ]
Shanbhag, Aakash [1 ,2 ]
Miller, Robert J. H. [1 ,3 ]
Dey, Damini [1 ]
Liang, Joanna X. [1 ]
Flood, Kathleen [4 ]
Bourque, Jamieson M. [5 ]
Chareonthaitawee, Panithaya [6 ]
Phillips, Lawrence M. [7 ]
Slomka, Piotr J. [1 ]
Affiliations
[1] Cedars Sinai Med Ctr, Dept Med, Div Artificial Intelligence Med, Imaging & Biomed Sci, Los Angeles, CA 90048 USA
[2] Univ Southern Calif, Signal & Image Proc Inst, Ming Hsieh Dept Elect & Comp Engn, Los Angeles, CA USA
[3] Univ Calgary, Dept Cardiac Sci, Calgary, AB, Canada
[4] Amer Soc Nucl Cardiol, Fairfax, VA USA
[5] Univ Virginia Hlth Syst, Div Cardiovasc Med & Radiol, Charlottesville, VA USA
[6] Mayo Clin, Dept Cardiovasc Med, Rochester, MN USA
[7] NYU Grossman Sch Med, Dept Med, Leon H Charney Div Cardiol, New York, NY USA
Funding
US National Institutes of Health
Keywords
Nuclear cardiology board exam; Large language models; GPT; Cardiovascular imaging questions; PERFORMANCE;
DOI
10.1016/j.nuclcard.2024.102089
CLC Number
R5 [Internal Medicine];
Discipline Code
1002; 100201;
Abstract
Background: Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology. This study assesses four LLMs, GPT-4, GPT-4 Turbo, GPT-4 omni (GPT-4o) (OpenAI), and Gemini (Google Inc.), in responding to questions from the 2023 American Society of Nuclear Cardiology Board Preparation Exam, reflecting the scope of the Certification Board of Nuclear Cardiology (CBNC) examination.
Methods: We used 168 questions: 141 text-only and 27 image-based, categorized into four sections mirroring the CBNC exam. Each LLM was presented with the same standardized prompt and applied to each section 30 times to account for stochasticity. Performance over six weeks was assessed for all models except GPT-4o. McNemar's test compared the proportions of correct responses.
Results: GPT-4, Gemini, GPT-4 Turbo, and GPT-4o correctly answered median percentages of 56.8% (95% confidence interval 55.4%-58.0%), 40.5% (39.9%-42.9%), 60.7% (59.5%-61.3%), and 63.1% (62.5%-64.3%) of questions, respectively. GPT-4o significantly outperformed the other models (P = .007 vs GPT-4 Turbo; P < .001 vs GPT-4 and Gemini). GPT-4o excelled on text-only questions compared to GPT-4, Gemini, and GPT-4 Turbo (P < .001, P < .001, and P = .001, respectively), while Gemini performed worse on image-based questions (P < .001 for all comparisons).
Conclusion: GPT-4o demonstrated superior performance among the four LLMs, achieving scores likely within or just outside the range required to pass a test akin to the CBNC examination. Although improvements in medical image interpretation are needed, GPT-4o shows potential to support physicians in answering text-based clinical questions.
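The Methods compare paired per-question outcomes (correct vs incorrect) between models using McNemar's test. The following minimal sketch illustrates how such a paired comparison could be computed in Python; it is not the authors' code, and the scored answers and accuracy levels below are hypothetical placeholders based only on the proportions reported in the abstract.

# Minimal sketch (assumption, not the authors' implementation): McNemar's test
# on paired correct/incorrect answers from two models over the same questions.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_compare(correct_a, correct_b):
    """correct_a, correct_b: boolean arrays with one entry per question,
    True if that model answered the question correctly."""
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    # 2x2 table of paired outcomes: rows = model A correct/incorrect,
    # columns = model B correct/incorrect.
    table = [
        [np.sum(correct_a & correct_b),  np.sum(correct_a & ~correct_b)],
        [np.sum(~correct_a & correct_b), np.sum(~correct_a & ~correct_b)],
    ]
    # Exact (binomial) McNemar test on the discordant pairs.
    return mcnemar(table, exact=True)

# Hypothetical scoring of 168 questions for two models, using accuracy levels
# close to those reported for GPT-4o (~63%) and GPT-4 (~57%).
rng = np.random.default_rng(0)
gpt4o_correct = rng.random(168) < 0.63
gpt4_correct = rng.random(168) < 0.57
result = mcnemar_compare(gpt4o_correct, gpt4_correct)
print(f"McNemar statistic = {result.statistic:.0f}, p = {result.pvalue:.3f}")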
Pages: 11