Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study

Cited by: 2
Authors
Ming, Shuai [1, 2]
Guo, Qingge [1, 2, 3]
Cheng, Wenjun [4]
Lei, Bo [1, 2, 3]
Affiliations
[1] Henan Eye Hosp, Henan Prov Peoples Hosp, Dept Ophthalmol, 7 Weiwu Rd, Zhengzhou 450003, Peoples R China
[2] Henan Acad Innovat Med Sci, Eye Inst, Zhengzhou, Peoples R China
[3] Zhengzhou Univ, Henan Clin Res Ctr Ocular Dis, Peoples Hosp, Zhengzhou, Peoples R China
[4] Zhengzhou Univ, Dept Ophthalmol, Peoples Hosp, Zhengzhou, Peoples R China
Source
JMIR MEDICAL EDUCATION | 2024 / Volume 10
Keywords
ChatGPT; Chinese National Medical Licensing Examination; large language models; medical education; system role; LLM; LLMs; language model; language models; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; exams; examination; examinations; OpenAI; answer; answers; response; responses; accuracy; performance; China; Chinese; CHATGPT; GPT-4
DOI
10.2196/52784
Chinese Library Classification number
G40 [Education]
Discipline classification code
040101; 120403
Abstract
Background: With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.
Objective: This study aimed to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).
Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. The passing accuracy threshold was set at 60%. χ² tests and kappa values were used to evaluate the model's accuracy and consistency.
Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with kappa values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.
Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role did not significantly enhance the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
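The abstract names χ² tests and kappa values as its statistical tools but does not say which variants or software were used. The following is a minimal Python sketch of the two measures, assuming a Pearson χ² test without continuity correction on a 2×2 correct/incorrect table and Cohen's kappa between two repeated runs. The function names `chi_square_2x2` and `cohen_kappa` are hypothetical, and the sketch is not the authors' analysis code.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Example layout (an assumption): rows are two models, columns are
    correct / incorrect answer counts. Returns (statistic, p_value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / denom
    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def cohen_kappa(first, second):
    """Cohen's kappa between two equally long sequences of answer labels."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    observed = sum(x == y for x, y in zip(first, second)) / n
    labels = set(first) | set(second)
    # Agreement expected by chance from each run's label frequencies.
    expected = sum((first.count(k) / n) * (second.count(k) / n) for k in labels)
    if expected == 1:
        return 1.0  # both runs always give the same single label
    return (observed - expected) / (1 - expected)
```

Calling `chi_square_2x2(30, 20, 20, 30)` on made-up counts, or `cohen_kappa(list("AABB"), list("ABBB"))` on made-up answer letters, shows the mechanics; the study's per-question counts are not given in this record, so its P values and kappa values cannot be reproduced from it.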
Pages: 11