Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods

Cited: 10
Authors
Moore, Steven [1 ]
Nguyen, Huy A. [1 ]
Chen, Tianying [1 ]
Stamper, John [1 ]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
Keywords
Question evaluation; Question quality; Rule-based; GPT-4; ITEM WRITING FLAWS;
DOI
10.1007/978-3-031-42682-7_16
Chinese Library Classification
TP39 [Computer Applications];
Discipline codes
081203; 0835;
Abstract
Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for classroom usage. Existing methods for evaluating multiple-choice questions often focus on machine-readability metrics, without considering their intended use within course materials and their pedagogical implications. In this study, we compared the performance of a rule-based method we developed to a machine-learning-based method utilizing GPT-4 for the task of automatically assessing multiple-choice questions based on 19 common item-writing flaws. By analyzing 200 student-generated questions from four different subject areas, we found that the rule-based method correctly detected 91% of the flaws identified by human annotators, as compared to 79% by GPT-4. We demonstrated the effectiveness of the two methods in identifying common item-writing flaws present in the student-generated questions across different subject areas. The rule-based method can accurately and efficiently evaluate multiple-choice questions from multiple domains, outperforming GPT-4 and going beyond existing metrics that do not account for the educational use of such questions. Finally, we discuss the potential for using these automated methods to improve the quality of questions based on the identified flaws.
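The abstract does not enumerate the 19 flaws or the paper's actual rules, but the general shape of a rule-based check can be sketched. The following is a minimal illustration (not the authors' implementation) assuming two commonly cited item-writing flaws: an "all/none of the above" answer option, and unemphasized negative wording in the question stem.

```python
import re

def check_flaws(stem: str, options: list[str]) -> list[str]:
    """Return names of detected item-writing flaws (illustrative subset only)."""
    flaws = []
    # Flaw 1: "all of the above" / "none of the above" used as an option.
    if any(re.search(r"\b(all|none) of the above\b", o, re.I) for o in options):
        flaws.append("all/none-of-the-above option")
    # Flaw 2: negative wording ("not", "except") in the stem.
    if re.search(r"\b(not|except)\b", stem, re.I):
        flaws.append("negative wording in stem")
    return flaws

# Example: this item triggers both illustrative rules.
print(check_flaws("Which of these is NOT a mammal?",
                  ["Dog", "Trout", "All of the above"]))
```

A full system along these lines would encode one such predicate per flaw and report the set of matches per question, which is consistent with the per-flaw detection rates the abstract reports.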
Pages: 229-245 (17 pages)
Related papers
50 total
  • [1] From GPT-3 to GPT-4: On the Evolving Efficacy of LLMs to Answer Multiple-Choice Questions for Programming Classes in Higher Education
    Savelka, Jaromir
    Agarwal, Arav
    Bogart, Christopher
    Sakr, Majd
    COMPUTER SUPPORTED EDUCATION, CSEDU 2023, 2024, 2052 : 160 - 182
  • [2] Assessing declarative and procedural knowledge using multiple-choice questions
    Abu-Zaid, Ahmed
    Khan, Tehreem A.
    MEDICAL EDUCATION ONLINE, 2013, 18
  • [3] Quality assessment of multiple-choice questions
    Akram, Zareena
    Aman, Khaliq
    Waheed, Nomira
    Rahim, Shiekh Kashif
    Hanif, Fouzia
    Abdullah, Zainab
    RAWAL MEDICAL JOURNAL, 2025, 50 (01): : 191 - 194
  • [4] GPT-4 generated answer rationales to multiple choice assessment questions in undergraduate medical education
    Ch'en, Peter Y.
    Day, Wesley
    Pekson, Ryan C.
    Barrientos, Juan
    Burton, William B.
    Ludwig, Allison B.
    Jariwala, Sunit P.
    Cassese, Todd
    BMC MEDICAL EDUCATION, 2025, 25 (01)
  • [5] Using cognitive models to develop quality multiple-choice questions
    Pugh, Debra
    De Champlain, Andre
    Gierl, Mark
    Lai, Hollis
    Touchie, Claire
    MEDICAL TEACHER, 2016, 38 (08) : 838 - 843
  • [6] Inconsistently Accurate: Repeatability of GPT-3.5 and GPT-4 in Answering Radiology Board-style Multiple Choice Questions
    Ballard, David H.
    RADIOLOGY, 2024, 311 (02)
  • [7] QUALITY AND FEATURE OF MULTIPLE-CHOICE QUESTIONS IN EDUCATION
    Jia, Bing
    He, Dan
    Zhu, Zhemin
    PROBLEMS OF EDUCATION IN THE 21ST CENTURY, 2020, 78 (04) : 576 - 594
  • [8] Assessing the quality of automatic-generated short answers using GPT-4
    Rodrigues, L.
    Dwan Pereira, F.
    Cabral, L.
    Gašević, D.
    Ramalho, G.
    Ferreira Mello, R.
    COMPUTERS AND EDUCATION: ARTIFICIAL INTELLIGENCE, 2024, 7
  • [9] Assessing the Quality of Multiple-Choice Test Items
    Clifton, Sandra L.
    Schriner, Cheryl L.
    NURSE EDUCATOR, 2010, 35 (01) : 12 - 16
  • [10] Information Quality in The Analysis of Multiple-Choice Questions on An Example of Assessing the Importance of The European Union
    Stanimir, Agnieszka
    EDUCATION EXCELLENCE AND INNOVATION MANAGEMENT: A 2025 VISION TO SUSTAIN ECONOMIC DEVELOPMENT DURING GLOBAL CHALLENGES, 2020, : 1410 - 1428