Large Language Models in Medical Education: Comparing ChatGPT- to Human-Generated Exam Questions

Cited by: 20
Authors
Laupichler, Matthias Carl [1 ,2 ]
Rother, Johanna Flora [1 ]
Kadow, Ilona C. Grunwald [3 ]
Ahmadi, Seifollah [3 ]
Raupach, Tobias [1 ]
Affiliations
[1] Univ Hosp Bonn, Inst Med Educ, Venusberg Campus 1, D-53127 Bonn, Germany
[2] Univ Bonn, Inst Psychol, Bonn, Germany
[3] Univ Bonn, Inst Physiol 2, Dept Med, Bonn, Germany
DOI: 10.1097/ACM.0000000000005626
Chinese Library Classification: G40 [Education]
Subject classification codes: 040101; 120403
Abstract

Problem: Creating medical exam questions is time consuming, but well-written questions can be used for test-enhanced learning, which has been shown to have a positive effect on student learning. The automated generation of high-quality questions by large language models (LLMs), such as ChatGPT, would therefore be desirable. However, no studies to date have compared students' performance on LLM-generated questions with their performance on questions developed by humans.

Approach: The authors compared student performance on questions generated by ChatGPT (LLM questions) with performance on questions created by medical educators (human questions). Two sets of 25 multiple-choice questions (MCQs) were created, each with 5 answer options, 1 of which was correct. The first set was written by an experienced medical educator; the second was created by ChatGPT 3.5 after the authors identified learning objectives and extracted some specifications from the human questions. Students answered all questions in random order in a formative paper-and-pencil test offered in the run-up to the final summative neurophysiology exam (summer 2023). For each question, students also indicated whether they thought it had been written by a human or by ChatGPT.

Outcomes: The final data set consisted of 161 participants and 46 MCQs (25 human and 21 LLM questions). There was no statistically significant difference in item difficulty between the 2 question sets, but discriminatory power was statistically significantly higher for human than for LLM questions (mean = .36, standard deviation [SD] = .09 vs mean = .24, SD = .14; P = .001). On average, students identified 57% of question sources (human or LLM) correctly.

Next Steps: Future research should replicate the study procedure in other contexts (e.g., other medical subjects, semesters, countries, and languages). In addition, whether LLMs are suitable for generating other question types, such as key feature questions, should be investigated.
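The two item statistics reported in the abstract, item difficulty and discriminatory power, come from classical test theory. Below is a minimal sketch of how such indices are commonly computed, assuming binary-scored responses (1 = correct, 0 = incorrect), difficulty as the proportion of correct answers, discrimination as the item-rest correlation, and a two-sample t-test for the between-set comparison; the abstract does not specify the authors' exact indices or statistical test, and the data here are random placeholders.

# Illustrative sketch only: classical test theory item statistics.
# Assumes a binary response matrix (rows = students, columns = items),
# with 1 = correct and 0 = incorrect. Data below are random placeholders,
# not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(161, 46))  # 161 students, 46 MCQs

def item_statistics(responses):
    # Item difficulty: proportion of students answering each item correctly.
    difficulty = responses.mean(axis=0)
    # Discrimination: correlation of each item with the rest-score
    # (total score excluding that item), one common discrimination index.
    totals = responses.sum(axis=1)
    discrimination = np.array([
        stats.pearsonr(responses[:, i], totals - responses[:, i])[0]
        for i in range(responses.shape[1])
    ])
    return difficulty, discrimination

difficulty, discrimination = item_statistics(responses)

# Compare the two question sets; here the first 25 items stand in for the
# human questions and the remaining 21 for the LLM questions.
human, llm = discrimination[:25], discrimination[25:]
t, p = stats.ttest_ind(human, llm)
print(f"human: mean = {human.mean():.2f}, SD = {human.std(ddof=1):.2f}")
print(f"LLM:   mean = {llm.mean():.2f}, SD = {llm.std(ddof=1):.2f}")
print(f"t = {t:.2f}, P = {p:.3f}")

Whether the authors used this particular discrimination index or a corrected point-biserial variant is not stated in the abstract; the sketch only illustrates what the two reported statistics measure.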
Pages: 508-512
Number of pages: 5