Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Cited by: 15
Authors
Herrmann-Werner, Anne [1 ,2 ]
Festl-Wietek, Teresa [1 ]
Holderried, Friederike [1 ,3 ]
Herschbach, Lea [1 ]
Griewatz, Jan [1 ]
Masters, Ken [4 ]
Zipfel, Stephan [2 ]
Mahling, Moritz [1 ,5 ]
Affiliations
[1] Univ Tubingen, Tubingen Inst Med Educ, Fac Med, Elfriede Aulhorn Str 10, D-72076 Tubingen, Germany
[2] Univ Hosp Tubingen, Dept Psychosomat Med & Psychotherapy, Tubingen, Germany
[3] Univ Hosp Tubingen, Univ Dept Anesthesiol & Intens Care Med, Tubingen, Germany
[4] Sultan Qaboos Univ, Coll Med & Hlth Sci, Med Educ & Informat Dept, Muscat, Oman
[5] Univ Hosp Tubingen, Dept Diabetol Endocrinol Nephrol, Sect Nephrol & Hypertens, Tubingen, Germany
Keywords
answer; artificial intelligence; assessment; Bloom's taxonomy; ChatGPT; classification; error; exam; examination; generative; GPT-4; Generative Pre-trained Transformer 4; language model; learning outcome; LLM; MCQ; medical education; medical exam; multiple-choice question; natural language processing; NLP; psychosomatic; question; response; taxonomy; EDUCATION;
DOI
10.2196/52113
Chinese Library Classification: R19 [Health care organization and services (health services management)]
Abstract
Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are increasingly being used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how such errors relate to the different cognitive levels defined in Bloom's taxonomy.
Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.
Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the questions using 2 distinct prompt versions, detailed and short. The answers were analyzed using both quantitative and qualitative approaches. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.
Results: GPT-4 answered the exam questions with a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt version, GPT-4's lowest performance on any single exam was 78.9% (15/19), so it always surpassed the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors occurred primarily at the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.
Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were presented confidently, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
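Since the abstract does not reproduce the study's actual prompts or analysis pipeline, the Python sketch below is illustrative only: it shows one plausible way to pose a multiple-choice question to GPT-4 under a "detailed" and a "short" system prompt, and to compare item difficulty between correctly and incorrectly answered questions. The prompt wording, the "gpt-4" model identifier, the helper ask_gpt4, and the choice of a Mann-Whitney U test are all assumptions, not details taken from the paper.

```python
# Illustrative sketch only: prompts, model name, helper names, and the
# statistical test are assumptions, not details reported in the study.
from openai import OpenAI            # official OpenAI Python SDK (v1.x)
from scipy.stats import mannwhitneyu

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two system prompts standing in for the study's "detailed" and "short" versions.
DETAILED_PROMPT = (
    "You are sitting a medical school exam in psychosomatic medicine. "
    "Read the question stem and every answer option carefully, then reply "
    "with the letter of the single best answer."
)
SHORT_PROMPT = "Answer the multiple-choice question with a single letter."

def ask_gpt4(stem: str, options: dict[str, str], system_prompt: str) -> str:
    """Send one multiple-choice question to the model and return its reply."""
    options_text = "\n".join(f"{letter}) {text}" for letter, text in options.items())
    response = client.chat.completions.create(
        model="gpt-4",  # assumed identifier; the study reports using GPT-4
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"{stem}\n{options_text}"},
        ],
    )
    return response.choices[0].message.content.strip()

# Quantitative step: compare the difficulty of items GPT-4 answered
# correctly vs incorrectly. The values below are placeholders; a
# Mann-Whitney U test is one plausible nonparametric choice, since the
# abstract does not name the test used.
difficulty_correct = [0.85, 0.91, 0.78, 0.88]    # placeholder difficulty indices
difficulty_incorrect = [0.55, 0.62, 0.49, 0.58]  # placeholder difficulty indices
u_stat, p_value = mannwhitneyu(difficulty_correct, difficulty_incorrect,
                               alternative="two-sided")
print(f"U={u_stat:.1f}, P={p_value:.3f}")
```

A real replication would loop ask_gpt4 over all 307 items under each prompt version, parse the returned letter against the answer key, and feed the per-item student difficulty statistics into the test.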
Pages: 13
Related Papers (50 in total)
  • [21] Leadership coaching's efficacy and effect mechanisms - a mixed-methods study
    Halliwell, Peter R.
    Mitchell, Rebecca J.
    Boyle, Brendan
    COACHING-AN INTERNATIONAL JOURNAL OF THEORY RESEARCH AND PRACTICE, 2022, 15 (01) : 43 - 59
  • [22] A mixed-methods study of women's birthplace preferences and decisions in England
    Clancy, Georgia E.
    Boardman, Felicity K.
    Rees, Sophie
    WOMEN AND BIRTH, 2024, 37 (04)
  • [23] Perceptions of Black Children's Narrative Language: A Mixed-Methods Study
    Mills, Monique T.
    Moore, Leslie C.
    Chang, Rong
    Kim, Somin
    Frick, Bethany
    LANGUAGE SPEECH AND HEARING SERVICES IN SCHOOLS, 2021, 52 (01) : 84 - 99
  • [24] Diverging perspectives on children's rehabilitation services: a mixed-methods study
    Stefansdottir, Sara
    Egilson, Snaefridur Thora
    SCANDINAVIAN JOURNAL OF OCCUPATIONAL THERAPY, 2016, 23 (05) : 374 - 382
  • [25] Patient's expectations of privacy and confidentiality in Pakistan: A mixed-methods study
    Shirazi, Bushra
    Shekhani, Sualeha
    JOURNAL OF THE PAKISTAN MEDICAL ASSOCIATION, 2021, 71 (02) : 537 - 539
  • [26] Crowdsourcing the Evaluation of Multiple-Choice Questions Using Item-Writing Flaws and Bloom's Taxonomy
    Moore, Steven
    Fang, Ellen
    Nguyen, Huy A.
    Stamper, John
    PROCEEDINGS OF THE TENTH ACM CONFERENCE ON LEARNING @ SCALE, L@S 2023, 2023 : 25 - 34
  • [27] Distractor Efficiency in an Item Pool for a Statistics Classroom Exam: Assessing Its Relation With Item Cognitive Level Classified According to Bloom's Taxonomy
    Testa, Silvia
    Toscano, Anna
    Rosato, Rosalba
    FRONTIERS IN PSYCHOLOGY, 2018, 9
  • [28] Changing Levels of Bloom's Taxonomy in Learning Objectives and Exam Questions in First-Semester Introductory Chemistry before and during Adoption of Guided Inquiry
    Kowalski, Eileen M.
    Koleci, Carolann
    Mcdonald, Kenneth J.
    EDUCATION SCIENCES, 2024, 14 (09)
  • [29] Evaluation of Final Examination Papers in Engineering: A Case Study Using Bloom's Taxonomy
    Swart, Arthur James
    IEEE TRANSACTIONS ON EDUCATION, 2010, 53 (02) : 257 - 264
  • [30] Deconstructing University Learners' Adoption Intention Towards AIGC Technology: A Mixed-Methods Study Using ChatGPT as an Example
    Wang, Chengliang
    Chen, Xiaojiao
    Hu, Zhebing
    Jin, Sheng
    Gu, Xiaoqing
    JOURNAL OF COMPUTER ASSISTED LEARNING, 2025, 41 (01)