New Artificial Intelligence ChatGPT Performs Poorly on the 2022 Self-assessment Study Program for Urology

Cited by: 41
Authors
Huynh, Linda My [1]
Bonebrake, Benjamin T. [2]
Schultis, Kaitlyn [2]
Quach, Alan [3]
Deibert, Christopher M. [3,4]
Affiliations
[1] Univ Nebraska Med Ctr, Omaha, NE USA
[2] Univ Nebraska Med Ctr, Coll Med, Omaha, NE USA
[3] Univ Nebraska Med Ctr, Div Urol, Omaha, NE USA
[4] Univ Nebraska Med Ctr, Dept Surg, Div Urol, 987521 Nebraska Med Ctr, Omaha, NE 68198 USA
Keywords
artificial intelligence; medical informatics applications; urology
DOI
10.1097/UPJ.0000000000000406
Chinese Library Classification
R5 [Internal Medicine]; R69 [Urology (Genitourinary Diseases)]
Subject Classification Codes
1002; 100201
Abstract
Introduction: Large language models have demonstrated impressive capabilities, but their application to medicine remains unclear. We sought to evaluate the use of ChatGPT on the American Urological Association Self-assessment Study Program as an educational adjunct for urology trainees and practicing physicians.
Methods: One hundred fifty questions from the 2022 Self-assessment Study Program exam were screened, and those containing visual assets (n=15) were removed. The remaining items were encoded as open ended or multiple choice. ChatGPT's output was coded as correct, incorrect, or indeterminate; if indeterminate, responses were regenerated up to 2 times. Concordance, quality, and accuracy were ascertained by 3 independent researchers and reviewed by 2 physician adjudicators. A new session was started for each entry to avoid crossover learning.
Results: ChatGPT was correct on 36/135 (26.7%) open-ended and 38/135 (28.2%) multiple-choice questions. Indeterminate responses were generated for 40 (29.6%) and 4 (3.0%) questions, respectively. Of the correct responses, 24/36 (66.7%) and 36/38 (94.7%) came on the initial output, 8 (22.2%) and 1 (2.6%) on the second output, and 4 (11.1%) and 1 (2.6%) on the final output, respectively. Although regeneration decreased indeterminate responses, the proportion of correct responses did not increase. For both open-ended and multiple-choice questions, ChatGPT provided consistent justifications for incorrect answers and remained concordant across correct and incorrect answers.
Conclusions: ChatGPT has previously shown promise on medical licensing exams, but it did not demonstrate comparable performance on the 2022 Self-assessment Study Program. Performance was better on multiple-choice than on open-ended questions. More concerning were the persistent justifications for incorrect responses: left unchecked, use of ChatGPT in medicine may facilitate the spread of medical misinformation.
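The regeneration protocol in the Methods maps naturally to code. The sketch below is a hypothetical automation of that protocol using the OpenAI Python client; the study itself administered questions through the ChatGPT interface with manual coding by 3 researchers, so the model name, the grade() stand-in, and all other identifiers are illustrative assumptions, not the authors' method.

# Hypothetical sketch of the Methods' protocol: a separate context per
# question (avoiding crossover learning) and up to 2 regenerations whenever
# a response is coded indeterminate. Assumes the OpenAI Python client and
# an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
MAX_REGENERATIONS = 2  # "regenerated up to 2 times"

def grade(response_text: str, answer_key: str) -> str:
    # Naive stand-in for the study's manual coding by 3 independent
    # researchers and 2 physician adjudicators.
    if not response_text.strip():
        return "indeterminate"
    return "correct" if answer_key.lower() in response_text.lower() else "incorrect"

def evaluate(question: str, answer_key: str) -> str:
    for _ in range(1 + MAX_REGENERATIONS):
        # Each call resamples the same single-question context,
        # approximating ChatGPT's "regenerate response" button.
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",  # illustrative; the study used ChatGPT directly
            messages=[{"role": "user", "content": question}],
        )
        verdict = grade(reply.choices[0].message.content, answer_key)
        if verdict != "indeterminate":
            return verdict
    return "indeterminate"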
Pages: 408+
Page count: 8