Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study

Cited by: 2

Authors:

Ming, Shuai ^{[1,2]}

Guo, Qingge ^{[1,2,3]}

Cheng, Wenjun ^{[4]}

Lei, Bo ^{[1,2,3]}

Affiliations:

[1] Henan Eye Hosp, Henan Prov Peoples Hosp, Dept Ophthalmol, 7 Weiwu Rd, Zhengzhou 450003, Peoples R China

[2] Henan Acad Innovat Med Sci, Eye Inst, Zhengzhou, Peoples R China

[3] Zhengzhou Univ, Henan Clin Res Ctr Ocular Dis, Peoples Hosp, Zhengzhou, Peoples R China

[4] Zhengzhou Univ, Dept Ophthalmol, Peoples Hosp, Zhengzhou, Peoples R China

Source:

JMIR MEDICAL EDUCATION | 2024 / Volume 10

Keywords:

ChatGPT; Chinese National Medical Licensing Examination; large language models; medical education; system role; LLM; LLMs; language model; language models; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; exams; examination; examinations; OpenAI; answer; answers; response; responses; accuracy; performance; China; Chinese; CHATGPT; GPT-4

DOI:

10.2196/52784

Chinese Library Classification (CLC) number:

G40 [Pedagogy];

Discipline classification codes:

040101 ; 120403 ;

Abstract:

Background: With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. The passing accuracy threshold was set at 60%. Chi-square (χ²) tests and kappa values were employed to evaluate the model's accuracy and consistency.

Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with kappa values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.

Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role improved the model's reliability and answer coherence, but not significantly. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.

Pages: 11
