Performance of ChatGPT and Bard in self-assessment questions for nephrology board renewal

Cited by: 9
Authors
Noda, Ryunosuke [1 ]
Izaki, Yuto [1 ]
Kitano, Fumiya [1 ]
Komatsu, Jun [1 ]
Ichikawa, Daisuke [1 ]
Shibagaki, Yugo [1 ]
Affiliations
[1] St Marianna Univ, Dept Internal Med, Div Nephrol & Hypertens, Sch Med, 2-16-1 Sugao,Miyamae Ku, Kawasaki, Kanagawa 2168511, Japan
Keywords
ChatGPT; GPT-4; Large language models; Artificial intelligence; Nephrology
DOI
10.1007/s10157-023-02451-w
Chinese Library Classification (CLC)
R5 [Internal Medicine]; R69 [Urology (Urogenital Diseases)]
Discipline Classification Codes
1002; 100201
Abstract
Background Large language models (LLMs) have driven advances in artificial intelligence. Although LLMs have demonstrated high performance on general medical examinations, their performance in specialized areas such as nephrology remains unclear. This study aimed to evaluate the potential of ChatGPT and Bard for nephrology applications. Methods Ninety-nine questions from the Self-Assessment Questions for Nephrology Board Renewal from 2018 to 2022 were presented to two versions of ChatGPT (GPT-3.5 and GPT-4) and to Bard. We calculated the correct answer rates overall, for each year, and by question category, and checked whether they exceeded the pass criterion. The correct answer rates were also compared with those of nephrology residents. Results The overall correct answer rates for GPT-3.5, GPT-4, and Bard were 31.3% (31/99), 54.5% (54/99), and 32.3% (32/99), respectively; GPT-4 thus significantly outperformed both GPT-3.5 (p < 0.01) and Bard (p < 0.01). GPT-4 met the passing criterion in three of the five years, barely exceeding the minimum threshold in two of them. GPT-4 performed significantly better than GPT-3.5 and Bard on problem-solving, clinical, and non-image questions. GPT-4's performance fell between that of third- and fourth-year nephrology residents. Conclusions GPT-4 outperformed GPT-3.5 and Bard and met the Nephrology Board renewal standards in specific years, albeit marginally. These results highlight both the potential and the limitations of LLMs in nephrology. As LLMs advance, nephrologists should understand their performance characteristics for future applications.
Pages: 465-469
Page count: 5
Related papers
50 in total
  • [22] Self-assessment: scoring of multiple-choice questions
    Abbatt, F. R.
    Abbatt, J.
    Harden, R. M.
    MEDICAL TEACHER, 1979, 1 (03) : 155 - 156
  • [23] An unusual cause of shoulder pain: self-assessment questions
    Garg, Bhavuk
    Sharma, Vijay
    Khan, Shah Alam
    Malhotra, Rajesh
    NEW ZEALAND MEDICAL JOURNAL, 2007, 120 (1260) : 71 - 73
  • [24] An evaluation of pedagogically informed parameterised questions for self-assessment
    Sitthisak, Onjira
    Gilbert, Lester
    Davis, Hugh C.
    LEARNING MEDIA AND TECHNOLOGY, 2008, 33 (03) : 235 - 248
  • [25] Interactive Self-Assessment Questions Within a Virtual Environment
    Evans, Chris
    Palacios, Luis
    INTERNATIONAL JOURNAL OF E-ADOPTION, 2011, 3 (02) : 1 - 10
  • [26] Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment
    Patil, Nikhil
    Huang, Ryan
    van der Pol, Christian
    Larocque, Natasha
    CANADIAN ASSOCIATION OF RADIOLOGISTS JOURNAL-JOURNAL DE L ASSOCIATION CANADIENNE DES RADIOLOGISTES, 2024, 75 (02): : 344 - 350
  • [27] Self-assessment or self deception? A lack of association between nursing students' self-assessment and performance
    Baxter, Pamela
    Norman, Geoff
    JOURNAL OF ADVANCED NURSING, 2011, 67 (11) : 2406 - 2413
  • [28] Physiotherapy students' self-assessment of performance-Are there gender differences in self-assessment accuracy?
    Stove, Morten Pallisgaard
    PHYSIOTHERAPY RESEARCH INTERNATIONAL, 2021, 26 (01)
  • [29] Performance of ChatGPT on American Board of Surgery In-Training Examination Preparation Questions
    Tran, Catherine G.
    Chang, Jeremy
    Sherman, Scott K.
    De Andrade, James P.
    JOURNAL OF SURGICAL RESEARCH, 2024, 299 : 329 - 335
  • [30] Resident self-assessment of operative performance
    Ward, M
    MacRae, H
    Schlachta, C
    Mamazza, J
    Poulin, E
    Reznick, R
    Regehr, G
    AMERICAN JOURNAL OF SURGERY, 2003, 185 (06): : 521 - 524