Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study

Cited by: 2
Authors
Ming, Shuai [1 ,2 ]
Guo, Qingge [1 ,2 ,3 ]
Cheng, Wenjun [4 ]
Lei, Bo [1 ,2 ,3 ]
Affiliations
[1] Henan Eye Hosp, Henan Prov Peoples Hosp, Dept Ophthalmol, 7 Weiwu Rd, Zhengzhou 450003, Peoples R China
[2] Henan Acad Innovat Med Sci, Eye Inst, Zhengzhou, Peoples R China
[3] Zhengzhou Univ, Henan Clin Res Ctr Ocular Dis, Peoples Hosp, Zhengzhou, Peoples R China
[4] Zhengzhou Univ, Dept Ophthalmol, Peoples Hosp, Zhengzhou, Peoples R China
Source
JMIR MEDICAL EDUCATION | 2024, Vol. 10
Keywords
ChatGPT; Chinese National Medical Licensing Examination; large language models; medical education; system role; LLM; LLMs; language model; language models; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; exams; examination; examinations; OpenAI; answer; answers; response; responses; accuracy; performance; China; Chinese; GPT-4
DOI
10.2196/52784
CLC number: G40 [Education]
Discipline codes: 040101; 120403
Abstract
Background: With the increasing application of large language models (LLMs) like ChatGPT across various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

Objective: The aim of this study was to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), whether the prompt designated a system role tailored to the medical subspecialty, and repetition for coherence. The passing accuracy threshold was set at 60%. Chi-square tests and kappa values were employed to evaluate the model's accuracy and consistency.

Results: GPT-4.0 achieved a passing accuracy of 72.7%, significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with kappa values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%) and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.

Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role yielded only nonsignificant gains in the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
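The Methods describe a query-and-repeat protocol: each question is sent to the model 8 to 12 times, optionally prefixed with a subspecialty-specific system role. The Python sketch below shows how such a protocol might look against the OpenAI chat completions API; the model identifier, the role wording, and the helper names (ask, repeated_answers) are illustrative assumptions, not the authors' actual code.

```python
# Minimal sketch of the query-and-repeat protocol, assuming the official
# `openai` Python package (v1+). Model names, role text, and helper names
# are hypothetical, not taken from the study's code.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(question: str, model: str = "gpt-4", system_role: str | None = None) -> str:
    """Send one multiple-choice question and return the model's answer text."""
    messages = []
    if system_role:
        # e.g., "You are an experienced ophthalmologist." (hypothetical wording)
        messages.append({"role": "system", "content": system_role})
    messages.append({"role": "user", "content": question})
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content.strip()


def repeated_answers(question: str, n: int = 10, **kwargs) -> Counter:
    """Ask the same question n times (8-12 in the study) to gauge coherence."""
    return Counter(ask(question, **kwargs) for _ in range(n))
```

A question whose repeated answers collapse to a single option counts as coherent; a spread across options contributes to the variability rate reported in the Results.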
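The reported statistics can be reconstructed in outline. In the sketch below, the contingency counts are back-calculated from the reported accuracies (72.7% vs 54% of 500 questions), and because the abstract says only "kappa values", Cohen's kappa on paired response sets is used purely as an assumption about the consistency measure.

```python
# Illustrative reconstruction of the abstract's statistics; counts and the
# kappa variant are assumptions, not the authors' analysis code.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Correct/incorrect counts out of 500 questions.
table = np.array([[364, 136],   # GPT-4.0 (364/500, nearest integer to the reported 72.7%)
                  [270, 230]])  # GPT-3.5 (270/500 = 54.0%)
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, P={p:.2e}")  # P well below .001, consistent with the abstract

# Consistency between two repeated passes over the same questions
# (placeholder option letters, not real exam data).
first_pass = ["A", "B", "C", "A", "D"]
second_pass = ["A", "B", "C", "B", "D"]
print("kappa:", cohen_kappa_score(first_pass, second_pass))
```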
Pages: 11