Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Cited by: 1

Authors:

Tang, Xiaoyi ^{[1]}

Chen, Hongwei ^{[1]}

Lin, Daoyu ^{[2]}

Li, Kexin ^{[1]}

Affiliations:

[1] Univ Sci & Technol Beijing, Sch Foreign Studies, Beijing 100083, Peoples R China

[2] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100094, Peoples R China

Source:

HELIYON | 2024 / Vol. 10 / Issue 14

Keywords:

Automated essay scoring (AES); Large language models (LLMs); Generative pre-trained transformer (GPT); Prompt engineering; Multi-dimensional writing assessment; LINGUISTIC FEATURES; QUALITY;

DOI:

10.1016/j.heliyon.2024.e34262

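Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes: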

07 ; 0710 ; 09 ;

Abstract:

Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in dimensions regarding Ideas (QWK=0.551) and Organization (QWK=0.584) under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study attached importance to the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
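
The QWK figures in the abstract refer to quadratic weighted kappa, a standard agreement statistic between two raters on an ordinal scale. As a minimal sketch of how such a value can be computed (this is not the authors' code; the function name, the integer score range, and the example scores are illustrative assumptions):

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    Assumes both raters score on the same integer scale
    [min_score, max_score]; the scale itself is a hypothetical input,
    not a value taken from the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    span = max_score - min_score
    if span <= 0:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model)
    human_hist = Counter(human)            # marginal counts, human rater
    model_hist = Counter(model)            # marginal counts, model rater

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Penalty grows with the squared distance between the scores.
            weight = ((i - j) ** 2) / (span ** 2)
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * human_hist[i] * model_hist[j] / (n * n)

    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical score")
    return 1.0 - weighted_observed / weighted_expected


# Illustrative scores only (not data from the study).
human_scores = [1, 2, 3, 4]
perfect = quadratic_weighted_kappa(human_scores, [1, 2, 3, 4], 1, 4)
reversed_ = quadratic_weighted_kappa(human_scores, [4, 3, 2, 1], 1, 4)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.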

Citations

Pages: 18

50 items in total

[1] Accurate multi-dimensional alignment
Keller, Y
Shkolnisky, Y
Averbuch, A
2005 International Conference on Image Processing (ICIP), Vols 1-5, 2005, : 2573 - 2576
[2] MULTI-DIMENSIONAL RELIABILITY ASSESSMENT OF FRACTAL SIGNATURE ANALYSIS IN AN OUTPATIENT ORTHOPAEDIC POPULATION
Roemer, F. W.
Jarraya, M.
Niu, J.
Duryea, J.
Lynch, J.
Guermazi, A.
OSTEOARTHRITIS AND CARTILAGE, 2015, 23 : A257 - A258
[3] MFKD: Multi-dimensional feature alignment for knowledge distillation
Guo, Zhen
Zhang, Pengzhou
Liang, Peng
IMAGE AND VISION COMPUTING, 2025, 157
[4] Multi-Dimensional Human Workload Assessment for Supervisory Human-Machine Teams
Heard, Jamison
Adams, Julie A.
JOURNAL OF COGNITIVE ENGINEERING AND DECISION MAKING, 2019, 13 (03) : 146 - 170
[5] A Multi-Dimensional Analysis of Writing Flexibility in an Automated Writing Evaluation System
Allen, Laura K.
Likens, Aaron D.
McNamara, Danielle S.
PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE (LAK'18): TOWARDS USER-CENTRED LEARNING ANALYTICS, 2018, : 380 - 388
[6] Eliciting engineering judgments in human reliability assessment
Renato, Paulo
Costa S. Menezes, Regilda da
Droguett, Enrique L.
de Lemos Duarte, Dayse C.
2006 PROCEEDINGS - ANNUAL RELIABILITY AND MAINTAINABILITY SYMPOSIUM, VOLS 1 AND 2, 2006, : 512 - +
[7] Multi-dimensional assessment of clinical dyspnoea
Yorke, J.
Moosavi, S.
Shuldham, C.
Haigh, C.
Lau-Walker, M.
Barnes, P.
Jones, P. W.
THORAX, 2007, 62 : A93 - A93
[8] A Multi-dimensional Peer Assessment System
Wahid, Usman
Chatti, Mohamed Amine
Anwar, Uzair
Schroeder, Ulrik
PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED EDUCATION (CSEDU), VOL 1, 2017, : 683 - 694
[9] Parsimonious Multi-dimensional Impact Assessment
Antle, John M.
AMERICAN JOURNAL OF AGRICULTURAL ECONOMICS, 2011, 93 (05) : 1292 - 1311
[10] Genre variation in student writing: A multi-dimensional analysis
Hardy, Jack A.
Friginal, Eric
JOURNAL OF ENGLISH FOR ACADEMIC PURPOSES, 2016, 22 : 119 - 131
