AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

Cited by: 0
Authors
Law, Alex K. K. [1 ,3 ]
So, Jerome [2 ]
Lui, Chun Tat [3 ]
Choi, Yu Fai [3 ]
Cheung, Koon Ho [3 ]
Hung, Kevin Kei-ching [1 ]
Graham, Colin Alexander [1 ,3 ]
Affiliations
[1] Chinese Univ Hong Kong (CUHK), Accid & Emergency Med Acad Unit (AEMAU), 2nd Floor, Main Clin Block & Trauma Ctr, Prince of Wales Hosp, Shatin, Hong Kong, Peoples R China
[2] Tseung Kwan O Hosp, Dept Accid & Emergency, Hong Kong, Peoples R China
[3] Hong Kong Coll Emergency Med, Hong Kong, Peoples R China
Keywords
Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes;
DOI
10.1186/s12909-025-06796-6
Chinese Library Classification
G40 [Education];
Discipline Classification Codes
040101; 120403;
Abstract
Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) like ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes exams.
Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared to human-created MCQs in a high-stakes medical licensing exam.
Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs: one AI-generated and one human-generated. Expert reviewers assessed MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply and analyse), and item-writing flaws. Psychometric analyses were performed, including difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.
Results: Among 24 participants, AI-generated MCQs were easier (mean difficulty index = 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices to human MCQs (mean = 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement was moderate (ICC = 0.62, p = 0.01, 95% CI: 0.12-0.84). Expert reviews identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in AI MCQs. AI questions primarily tested lower-order cognitive skills, while human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI significantly reduced time spent on question generation (24.5 vs. 96 person-hours).
Conclusion: ChatGPT-4o demonstrates the potential for efficiently generating MCQs but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes exams, offering a scalable model for medical education that balances time efficiency and content quality.
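The item statistics named in the abstract (difficulty index, discrimination index, KR-20 reliability) are standard classical test theory measures. The short Python sketch below illustrates one common way to compute them from a binary candidate-by-item response matrix; the toy data, array shapes, and variable names are hypothetical and do not reproduce the authors' analysis.

    # Illustrative sketch only: classical item analysis on a toy 0/1 response
    # matrix (rows = candidates, columns = MCQ items). Not the study's code.
    import numpy as np

    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(24, 100))  # 24 candidates x 100 MCQs (random toy data)

    # Difficulty index: proportion of candidates answering each item correctly
    # (higher = easier, as in the abstract's 0.78 vs. 0.69 comparison).
    difficulty = responses.mean(axis=0)

    # Discrimination index (upper-lower method): item facility in the top 27%
    # of candidates by total score minus facility in the bottom 27%.
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, round(0.27 * responses.shape[0]))
    lower_group = responses[order[:n_group]]
    upper_group = responses[order[-n_group:]]
    discrimination = upper_group.mean(axis=0) - lower_group.mean(axis=0)

    # KR-20 reliability: (k / (k - 1)) * (1 - sum(p * q) / variance of totals).
    k = responses.shape[1]
    p, q = difficulty, 1.0 - difficulty
    kr20 = (k / (k - 1)) * (1.0 - (p * q).sum() / totals.var(ddof=1))

    # With random toy data KR-20 will be close to zero; real exam data
    # typically yields substantially higher values.
    print(f"mean difficulty {difficulty.mean():.2f}, "
          f"mean discrimination {discrimination.mean():.2f}, KR-20 {kr20:.2f}")

Under this convention a higher difficulty index means an easier item, which is how the abstract reads the 0.78 versus 0.69 comparison; the ICC agreement and the χ² test on Bloom's taxonomy levels reported in the Results are separate analyses not shown in this sketch.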
Pages: 9