New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology

Cited by: 41
Authors
Huynh, Linda My [1 ]
Bonebrake, Benjamin T. [2 ]
Schultis, Kaitlyn [2 ]
Quach, Alan [3 ]
Deibert, Christopher M. [3 ,4 ]
Affiliations
[1] Univ Nebraska Med Ctr, Omaha, NE USA
[2] Univ Nebraska Med Ctr, Coll Med, Omaha, NE USA
[3] Univ Nebraska Med Ctr, Div Urol, Omaha, NE USA
[4] Univ Nebraska Med Ctr, Dept Surg, Div Urol, 987521 Nebraska Med Ctr, Omaha, NE 68198 USA
Keywords
artificial intelligence; medical informatics applications; urology;
DOI
10.1097/UPJ.0000000000000406
Chinese Library Classification (CLC)
R5 [Internal Medicine]; R69 [Urology (Urogenital Diseases)];
Subject classification codes
1002; 100201;
Abstract
Introduction: Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We evaluated the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.
Methods: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open-ended or multiple-choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
Results: ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were produced on the initial output, 8 (22.2%) and 1 (2.6%) on the second output, and 4 (11.1%) and 1 (2.6%) on the final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained equally concordant whether answers were correct or incorrect.
Conclusions: ChatGPT has previously demonstrated promise on medical licensing exams; however, comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance improved with multiple-choice over open-ended questions. More concerning were the persistent justifications for incorrect responses: left unchecked, use of ChatGPT in medicine may facilitate the spread of medical misinformation.
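The grading protocol in the Methods (code each output as correct, incorrect, or indeterminate; regenerate indeterminate outputs up to 2 times; track which attempt produced a correct answer) can be illustrated with a short tallying sketch. The study graded responses manually; the snippet below is only a hypothetical illustration of that bookkeeping and of how the reported proportions are computed, with all function names and the toy data invented for this example.

```python
# Hypothetical sketch of the regeneration/grading bookkeeping described in Methods.
# Assumes each question already has its outputs coded, in order of generation.

from collections import Counter

MAX_REGENERATIONS = 2  # indeterminate responses were regenerated up to 2 times


def grade_item(attempt_codes):
    """Return (final_code, attempt_number) for one question.

    attempt_codes: codes ("correct" / "incorrect" / "indeterminate") for the
    initial output and up to MAX_REGENERATIONS regenerated outputs, in order.
    """
    for attempt, code in enumerate(attempt_codes[: MAX_REGENERATIONS + 1], start=1):
        if code != "indeterminate":
            return code, attempt
    return "indeterminate", len(attempt_codes)


def summarize(items):
    """Tally final codes and, for correct answers, which attempt produced them."""
    final = Counter()
    correct_by_attempt = Counter()
    for attempt_codes in items:
        code, attempt = grade_item(attempt_codes)
        final[code] += 1
        if code == "correct":
            correct_by_attempt[attempt] += 1
    n = len(items)
    print(f"correct: {final['correct']}/{n} ({final['correct'] / n:.1%})")
    print(f"indeterminate after regeneration: {final['indeterminate']}")
    print(f"correct responses by attempt: {dict(correct_by_attempt)}")


# Toy example (not the study data): three open-ended items.
summarize([
    ["correct"],                                      # correct on initial output
    ["indeterminate", "correct"],                     # correct on second output
    ["indeterminate", "indeterminate", "incorrect"],  # resolved only on final output
])
```

Run on the full item set, the same tallies yield the proportions reported in the Results (e.g., 36/135 correct open-ended, with 24, 8, and 4 of those correct on the first, second, and final outputs).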
Pages: 408+
Number of pages: 8
Related papers
45 records in total
  • [31] MSCopilot®, a new multiple sclerosis self-assessment digital solution: Results of a comparative study versus standard tests
    Maillart, E.
    Labauge, P.
    Cohen, M.
    Maarouf, A.
    Vukusic, S.
    Donze, C.
    Gallien, P.
    De Seze, J.
    Bourre, B.
    Moreau, T.
    Zinai, S.
    Tourbah, A.
    EUROPEAN JOURNAL OF NEUROLOGY, 2019, 26 : 671 - 671
  • [32] Letter: Generative Artificial Intelligence Platform for Automating Social Media Posts From Urology Journal Articles: A Cross-Sectional Study and Randomized Assessment
    Ramacciotti, Lorenzo Storino
    Gill, Inderbir S.
    Cacciamani, Giovanni E.
    JOURNAL OF UROLOGY, 2025, 213 (03): : 380 - 381
  • [33] How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology
    Kollitsch, Lisa
    Eredics, Klaus
    Marszalek, Martin
    Rauchenwald, Michael
    Brookman-May, Sabine D.
    Burger, Maximilian
    Koerner-Riffard, Katharina
    May, Matthias
    WORLD JOURNAL OF UROLOGY, 2024, 42 (01)
  • [34] Would Uro_Chat, a Newly Developed Generative Artificial Intelligence Large Language Model, Have Successfully Passed the In-Service Assessment Questions of the European Board of Urology in 2022?
    May, Matthias
    Koerner-Riffard, Katharina
    Marszalek, Martin
    Eredics, Klaus
    EUROPEAN UROLOGY ONCOLOGY, 2024, 7 (01): : 155 - 156
  • [35] Protective mechanisms of betablockers for coronary artery disease are by distal vasoconstriction and slower speed: a study by angiographic assessment and artificial intelligence program
    Duong, H.
    Nguyen, T.
    Ho, D.
    Le, M.
    Thai, M.
    EUROPEAN HEART JOURNAL, 2021, 42 : 3004 - 3004
  • [36] Performance assessment of artificial intelligence chatbots (ChatGPT-4 and Copilot) for sharing insights on 3D-printed orthodontic appliances: A cross-sectional study
    Yousuf, Asma Muhammad
    Ikram, Fizzah
    Gulzar, Munnal
    Sukhia, Rashna Hoshang
    Fida, Mubassar
    INTERNATIONAL ORTHODONTICS, 2025, 23 (03)
  • [37] Developing a readiness self-assessment tool for low- and middle-income countries establishing new radiotherapy services: A participant validation study
    Donkor, Andrew
    Luckett, Tim
    Aranda, Sanchia
    Vanderpuye, Verna
    Phillips, Jane
    PHYSICA MEDICA-EUROPEAN JOURNAL OF MEDICAL PHYSICS, 2020, 71 : 88 - 99
  • [38] MSCopilot®, a new multiple sclerosis self-assessment digital solution: results of a comparative study versus standard tests. A randomized clinical trial
    Maillart, E.
    Labauge, P.
    Cohen, M.
    Maarouf, A.
    Vukusic, S.
    Donze, C.
    Gallien, P.
    De Seze, J.
    Bourre, B.
    Moreau, T.
    Zinai, S.
    Tourbah, A.
    MULTIPLE SCLEROSIS JOURNAL, 2019, 25 : 185 - 185
  • [39] Validation of a new self-assessment questionnaire and the Skindex-29 quality of life (QoL) instrument for chronic hand dermatitis (ChHD): A pilot study
    Fowler, J
    Ghosh, A
    Duh, MS
    Raut, M
    Reynolds, J
    Thorn, D
    Den, E
    Chang, J
    VALUE IN HEALTH, 2004, 7 (03) : 262 - 263
  • [40] A feasibility study on disaster preparedness in regional and rural emergency departments in New South Wales: Nurses self-assessment of knowledge, skills and preparation for disaster management
    Brewer, Catherine A.
    Hutton, Alison
    Hammad, Karen S.
    Geale, Sara K.
    AUSTRALASIAN EMERGENCY CARE, 2020, 23 (01) : 29 - 36