Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study

Cited by: 15

Authors:

Herrmann-Werner, Anne [1,2]

Festl-Wietek, Teresa [1]

Holderried, Friederike [1,3]

Herschbach, Lea [1]

Griewatz, Jan [1]

Masters, Ken [4]

Zipfel, Stephan [2]

Mahling, Moritz [1,5]

Affiliations:

[1] Univ Tubingen, Tubingen Inst Med Educ, Fac Med, Elfriede Aulhorn Str 10, D-72076 Tubingen, Germany

[2] Univ Hosp Tubingen, Dept Psychosomat Med & Psychotherapy, Tubingen, Germany

[3] Univ Hosp Tubingen, Univ Dept Anesthesiol & Intens Care Med, Tubingen, Germany

[4] Sultan Qaboos Univ, Coll Med & Hlth Sci, Med Educ & Informat Dept, Muscat, Oman

[5] Univ Hosp Tubingen, Dept Diabetol Endocrinol Nephrol, Sect Nephrol & Hypertens, Tubingen, Germany

Source:

JOURNAL OF MEDICAL INTERNET RESEARCH | 2024 / Volume 26

Keywords:

answer; artificial intelligence; assessment; Bloom's taxonomy; ChatGPT; classification; error; exam; examination; generative; GPT-4; Generative Pre-trained Transformer 4; language model; learning outcome; LLM; MCQ; medical education; medical exam; multiple-choice question; natural language processing; NLP; psychosomatic; question; response; taxonomy; EDUCATION;

DOI:

10.2196/52113

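Chinese Library Classification (CLC) number:

R19 [Health care organizations and services (health services administration)];

Discipline classification code:

Abstract: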

Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy.
Objective: This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions.
Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy.
Results: GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a statistically significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily in the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines.
Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
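
The proportions quoted in the Results can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not part of the study: the counts are copied from the abstract, while the helper name `percent` and the dictionary labels are illustrative assumptions.

```python
# Counts copied from the abstract as (numerator, denominator).
# The labels and the helper below are illustrative, not from the paper.
reported = {
    "detailed prompt": (284, 307),
    "short prompt": (278, 307),
    "lowest single exam": (15, 19),
    "errors at 'remember' level": (29, 68),
    "errors at 'understand' level": (23, 68),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


for label, (k, n) in reported.items():
    # One decimal place is enough to compare with the rounded values quoted.
    print(f"{label}: {k}/{n} = {percent(k, n):.1f}%")
```

Rounding the first two results to whole percentages gives the 93% and 91% quoted for the detailed and short prompts.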


Pages: 13
