Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions

Cited by: 20
Authors
Laupichler, Matthias Carl [1 ,2 ]
Rother, Johanna Flora [1 ]
Grunwald Kadow, Ilona C. [3]
Ahmadi, Seifollah [3 ]
Raupach, Tobias [1 ]
Affiliations
[1] Univ Hosp Bonn, Inst Med Educ, Venusberg Campus 1, D-53127 Bonn, Germany
[2] Univ Bonn, Inst Psychol, Bonn, Germany
[3] Univ Bonn, Inst Physiol 2, Dept Med, Bonn, Germany
DOI
10.1097/ACM.0000000000005626
Chinese Library Classification
G40 [Education]
Discipline Classification Codes
040101; 120403
Abstract
Problem
Creating medical exam questions is time-consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions by large language models (LLMs), such as ChatGPT, would therefore be desirable. However, no studies to date have compared students' performance on LLM-generated questions with their performance on questions developed by humans.

Approach
The authors compared student performance on questions generated by ChatGPT (LLM questions) with performance on questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set was written by an experienced medical educator; the second was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test offered in advance of the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or by ChatGPT.

Outcomes
The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher for human than for LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.

Next Steps
Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, whether LLMs are suitable for generating other question types, such as key feature questions, should be investigated.
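The generation step described under Approach can be approximated programmatically. The sketch below is a minimal illustration using the OpenAI Python client with the gpt-3.5-turbo model; the authors worked through the ChatGPT interface, and the learning objective and prompt wording here are assumptions for illustration, not the study's actual materials.

```python
# Minimal sketch of LLM-based MCQ generation (5 options, 1 correct), using
# the OpenAI Python client. Assumption: the study used the ChatGPT 3.5 web
# interface; this prompt and learning objective are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

learning_objective = "Explain the ionic basis of the neuronal action potential."

prompt = (
    "Write one multiple-choice exam question for undergraduate medical "
    f"students on this learning objective:\n{learning_objective}\n"
    "Provide exactly 5 answer options labeled A-E, exactly 1 of which is "
    "correct, and state the correct answer at the end."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```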
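The two outcome measures under Outcomes can be made concrete with classical test theory: item difficulty as the proportion of correct responses, and discriminatory power as the corrected item-total (point-biserial) correlation. The abstract does not specify the authors' exact computations or statistical test, so the sketch below, including the placeholder response matrix and the independent-samples t test, is an assumption rather than the study's analysis code.

```python
# Hedged sketch of the psychometric comparison: per-item difficulty and
# discriminatory power, then a comparison of the two question sets.
# Placeholder random data; definitions assumed, not taken from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_students, n_items = 161, 46  # sizes reported in the abstract
responses = rng.integers(0, 2, size=(n_students, n_items))  # 1 = correct
is_human = np.array([True] * 25 + [False] * 21)  # 25 human, 21 LLM items

# Item difficulty: proportion of students answering each item correctly.
difficulty = responses.mean(axis=0)

# Discriminatory power: point-biserial correlation of each item with the
# total score over the remaining items (corrected item-total correlation).
total = responses.sum(axis=1)
discrimination = np.array([
    stats.pointbiserialr(responses[:, j], total - responses[:, j])[0]
    for j in range(n_items)
])

# Compare the two question sets (the paper reports P = .001 for this gap).
t, p = stats.ttest_ind(discrimination[is_human], discrimination[~is_human])
print(f"human mean = {discrimination[is_human].mean():.2f}, "
      f"LLM mean = {discrimination[~is_human].mean():.2f}, p = {p:.3f}")
```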
Pages: 508 - 512
Number of pages: 5
Related Papers (50 in total)
  • [1] Comparing the quality of ChatGPT- and physician-generated responses to patients' dermatology questions in the electronic medical record
    Reynolds, Kelly
    Nadelman, Daniel
    Durgin, Joseph
    Ansah-Addo, Stephen
    Cole, Daniel
    Fayne, Rachel
    Harrell, Jane
    Ratycz, Madison
    Runge, Mason
    Shepard-Hayes, Amanda
    Wenzel, Daniel
    Tejasvi, Trilokraj
    CLINICAL AND EXPERIMENTAL DERMATOLOGY, 2024, 49 (07) : 715 - 718
  • [2] Large language models (ChatGPT) in medical education: Embrace or abjure?
    Luke, Nathasha
    Taneja, Reshma
    Ban, Kenneth
    Samarasekera, Dujeepa
    Yap, Celestial T.
    ASIA PACIFIC SCHOLAR, 2023, 8 (04) : 50 - 52
  • [3] ChatGPT Label: Comparing the Quality of Human-Generated and LLM-Generated Annotations in Low-Resource Language NLP Tasks
    Nasution, Arbi Haza
    Onan, Aytug
    IEEE ACCESS, 2024, 12 : 71876 - 71900
  • [4] ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review
    Aster, Alexandra
    Laupichler, Matthias Carl
    Rockwell-Kollmann, Tamina
    Masala, Gilda
    Bala, Ebru
    Raupach, Tobias
    MEDICAL SCIENCE EDUCATOR, 2024 : 555 - 567
  • [5] Using Large Language Models to Generate Script Concordance Test in Medical Education: ChatGPT and Claude
    Kiyak, Yavuz Selim
    Emekli, Emre
    SPANISH JOURNAL OF MEDICAL EDUCATION, 2025, 6 (01)
  • [6] Large language models in education: A focus on the complementary relationship between human teachers and ChatGPT
    Jeon, Jaeho
    Lee, Seongyong
    EDUCATION AND INFORMATION TECHNOLOGIES, 2023, 28 (12) : 15873 - 15892
  • [7] AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination
    Law, Alex K. K.
    So, Jerome
    Lui, Chun Tat
    Choi, Yu Fai
    Cheung, Koon Ho
    Hung, Kevin Kei-ching
    Graham, Colin Alexander
    BMC MEDICAL EDUCATION, 2025, 25 (01)
  • [8] Large language models (LLM) and ChatGPT: a medical student perspective
    Perera Molligoda Arachchige, Arosh S.
    EUROPEAN JOURNAL OF NUCLEAR MEDICINE AND MOLECULAR IMAGING, 2023, 50 : 2248 - 2249
  • [9] Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions
    Abbas, Ali
    Rehman, Mahad S.
    Rehman, Syed S.
    CUREUS JOURNAL OF MEDICAL SCIENCE, 2024, 16 (03)