New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology

Cited by: 41
Authors
Huynh, Linda My [1]
Bonebrake, Benjamin T. [2]
Schultis, Kaitlyn [2]
Quach, Alan [3]
Deibert, Christopher M. [3,4]
Affiliations
[1] Univ Nebraska Med Ctr, Omaha, NE USA
[2] Univ Nebraska Med Ctr, Coll Med, Omaha, NE USA
[3] Univ Nebraska Med Ctr, Div Urol, Omaha, NE USA
[4] Univ Nebraska Med Ctr, Dept Surg, Div Urol, 987521 Nebraska Med Ctr, Omaha, NE 68198 USA
Keywords
artificial intelligence; medical informatics applications; urology
DOI
10.1097/UPJ.0000000000000406
Chinese Library Classification
R5 [Internal Medicine]; R69 [Urology (genitourinary diseases)]
Discipline Classification Code
1002; 100201
Abstract
Introduction: Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We sought to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.
Methods: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open ended or multiple choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
Results: ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) were given on the initial output, 8 (22.2%) and 1 (2.6%) on the second output, and 4 (11.1%) and 1 (2.6%) on the final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant across correct and incorrect answers.
Conclusions: ChatGPT has previously shown promise on medical licensing exams, but comparable performance on the 2022 Self-assessment Study Program was not demonstrated. Performance was better on multiple-choice than on open-ended questions. More concerning were the persistent justifications offered for incorrect responses; left unchecked, use of ChatGPT in medicine may facilitate medical misinformation.
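The Methods describe a simple evaluation loop: each question is posed in a fresh chat session, the output is coded as correct, incorrect, or indeterminate, and indeterminate outputs are regenerated up to 2 more times. The sketch below shows one way such a loop could be scripted; it is not the authors' protocol. It assumes the OpenAI Python client and a model name, and it substitutes a naive keyword check (code_response) for the study's human coding by 3 researchers and 2 physician adjudicators.

```python
# Illustrative sketch only (assumptions: OpenAI Python client, model name,
# and a keyword stub in place of the study's human adjudication).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
MAX_REGENERATIONS = 2  # abstract: indeterminate responses regenerated up to 2 times


def ask_in_new_session(question: str) -> str:
    """Send the question with a fresh message list, i.e., a new session per entry."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption; the study used the public ChatGPT interface
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content


def code_response(output: str, answer_key: str) -> str:
    """Naive stand-in for human coding of correct / incorrect / indeterminate."""
    text = output.lower()
    if answer_key.lower() in text:
        return "correct"
    if "cannot answer" in text or "not enough information" in text:
        return "indeterminate"
    return "incorrect"


def grade_question(question: str, answer_key: str) -> str:
    """Regenerate indeterminate outputs up to MAX_REGENERATIONS times, then stop."""
    verdict = "indeterminate"
    for _ in range(1 + MAX_REGENERATIONS):
        verdict = code_response(ask_in_new_session(question), answer_key)
        if verdict != "indeterminate":
            break
    return verdict
```

Tallying the verdict from each attempt across all 135 items would yield per-attempt counts analogous to those reported in the Results.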
Pages: 408+
Page count: 8