AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination

Cited by: 0
Authors
Law, Alex K. K. [1 ,3 ]
So, Jerome [2 ]
Lui, Chun Tat [3 ]
Choi, Yu Fai [3 ]
Cheung, Koon Ho [3 ]
Hung, Kevin Kei-ching [1 ]
Graham, Colin Alexander [1 ,3 ]
Affiliations
[1] Chinese Univ Hong Kong CUHK, Accid & Emergency Med Acad Unit AEMAU, 2nd Floor, Main Clin Block & Trauma Ctr, Prince of Wales Hosp, Shatin, Hong Kong, Peoples R China
[2] Tseung Kwan O Hosp, Dept Accid & Emergency, Hong Kong, Peoples R China
[3] Hong Kong Coll Emergency Med, Hong Kong, Peoples R China
Keywords
Artificial intelligence; Educational measurement; Multiple choice questions; Medical education; Cognitive processes
DOI
10.1186/s12909-025-06796-6
Chinese Library Classification
G40 [Education]
Discipline codes
040101; 120403
Abstract
Background: The creation of high-quality multiple-choice questions (MCQs) is essential for medical education assessments but is resource-intensive and time-consuming when done by human experts. Large language models (LLMs) such as ChatGPT-4o offer a promising alternative, but their efficacy remains unclear, particularly in high-stakes examinations.
Objective: This study aimed to evaluate the quality and psychometric properties of ChatGPT-4o-generated MCQs compared with human-created MCQs in a high-stakes medical licensing examination.
Methods: A prospective cohort study was conducted among medical doctors preparing for the Primary Examination on Emergency Medicine (PEEM) organised by the Hong Kong College of Emergency Medicine in August 2024. Participants attempted two sets of 100 MCQs: one AI-generated and one human-generated. Expert reviewers assessed the MCQs for factual correctness, relevance, difficulty, alignment with Bloom's taxonomy (remember, understand, apply and analyse), and item-writing flaws. Psychometric analyses included difficulty and discrimination indices and KR-20 reliability. Candidate performance and time efficiency were also evaluated.
Results: Among 24 participants, AI-generated MCQs were easier than human-generated MCQs (mean difficulty index 0.78 ± 0.22 vs. 0.69 ± 0.23, p < 0.01) but showed similar discrimination indices (mean 0.22 ± 0.23 vs. 0.26 ± 0.26). Agreement between the two question sets was moderate (ICC = 0.62, p = 0.01, 95% CI 0.12 to 0.84). Expert review identified more factual inaccuracies (6% vs. 4%), irrelevance (6% vs. 0%), and inappropriate difficulty levels (14% vs. 1%) in the AI MCQs. AI questions primarily tested lower-order cognitive skills, whereas human MCQs better assessed higher-order skills (χ² = 14.27, p = 0.003). AI substantially reduced the time spent on question generation (24.5 vs. 96 person-hours).
Conclusion: ChatGPT-4o can generate MCQs efficiently but lacks the depth needed for complex assessments. Human review remains essential to ensure quality. Combining AI efficiency with expert oversight could optimise question creation for high-stakes examinations, offering a scalable model for medical education that balances time efficiency and content quality.
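Note on the psychometric indices named above: the difficulty index, discrimination index and KR-20 reliability are standard classical test theory statistics. The Python sketch below is an illustration only, not the authors' analysis code; the function name item_analysis and the upper/lower 27% split for the discrimination index are assumptions. It shows how these statistics are commonly computed from a dichotomous (0/1) candidate-by-item response matrix.

import numpy as np

def item_analysis(responses, top_frac=0.27):
    """Classical item analysis for a 0/1 response matrix
    (rows = candidates, columns = items)."""
    responses = np.asarray(responses, dtype=float)
    n_candidates, n_items = responses.shape

    # Difficulty index p: proportion of candidates answering each item correctly.
    p = responses.mean(axis=0)

    # Discrimination index D: item p-value in the top-scoring group minus that in
    # the bottom-scoring group (upper/lower 27% of total scores is a common choice).
    totals = responses.sum(axis=1)
    order = np.argsort(totals)
    k = max(1, int(round(top_frac * n_candidates)))
    lower, upper = responses[order[:k]], responses[order[-k:]]
    d = upper.mean(axis=0) - lower.mean(axis=0)

    # KR-20 = (k_items / (k_items - 1)) * (1 - sum(p * q) / var(total scores)),
    # the internal-consistency reliability for dichotomously scored items.
    q = 1.0 - p
    kr20 = (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / totals.var(ddof=1))

    return p, d, kr20

# Toy example: 6 candidates, 4 items (1 = correct, 0 = incorrect).
X = [[1, 1, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 1, 1],
     [0, 1, 0, 0],
     [1, 1, 1, 1],
     [0, 0, 0, 0]]
p, d, kr20 = item_analysis(X)
print("difficulty:", p, "discrimination:", d, "KR-20:", round(kr20, 3))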
Pages: 9
Related Articles
50 records in total
  • [1] Answering multiple-choice questions in high-stakes medical examinations
    Fischer, MR
    Herrmann, S
    Kopp, V
    MEDICAL EDUCATION, 2005, 39 (09) : 890 - 894
  • [2] An experimental comparison of multiple-choice and short-answer questions on a high-stakes test for medical students
    Mee, Janet
    Pandian, Ravi
    Wolczynski, Justin
    Morales, Amy
    Paniagua, Miguel
    Harik, Polina
    Baldwin, Peter
    Clauser, Brian E.
    ADVANCES IN HEALTH SCIENCES EDUCATION, 2024, 29 (03) : 783 - 801
  • [3] Predicting the Difficulty of Multiple Choice Questions in a High-stakes Medical Exam
    Le An Ha
    Yaneva, Victoria
    Baldwin, Peter
    Mee, Janet
    INNOVATIVE USE OF NLP FOR BUILDING EDUCATIONAL APPLICATIONS, 2019, : 11 - 20
  • [4] Predicting Item Survival for Multiple Choice Questions in a High-stakes Medical Exam
    Yaneva, Victoria
    Le An Ha
    Baldwin, Peter
    Mee, Janet
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6812 - 6818
  • [5] Problems of multiple-choice questions in German medical examination
    Vogt-Moykopf, I.
    LANGENBECKS ARCHIV FUR CHIRURGIE, 1974, 337 : 463 - 468
  • [6] Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments
    Tarrant, Marie
    Ware, James
    MEDICAL EDUCATION, 2008, 42 (02) : 198 - 206
  • [7] Multiple-choice versus equivalent essay questions in a national examination
    BLUM, A
    AZENCOT, M
    EUROPEAN JOURNAL OF SCIENCE EDUCATION, 1986, 8 (02) : 225 - 228
  • [8] The relationship of correct option location, distractor efficiency, difficulty and discrimination indices in analysis of high-stakes multiple-choice questions exam of medical students
    Shafiayan, Madjid
    Izanloo, Balal
    REVISTA DE LA UNIVERSIDAD DEL ZULIA, 2019, 10 (27) : 132 - 151
  • [9] The optimal number of options for multiple-choice questions on high-stakes tests: application of a revised index for detecting nonfunctional distractors
    Raymond, Mark R.
    Stevens, Craig
    Bucak, S. Deniz
    ADVANCES IN HEALTH SCIENCES EDUCATION, 2019, 24 (01) : 141 - 150