Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Cited by: 1
Authors
Tang, Xiaoyi [1]
Chen, Hongwei [1]
Lin, Daoyu [2]
Li, Kexin [1]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Foreign Studies, Beijing 100083, Peoples R China
[2] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing 100094, Peoples R China
Keywords
Automated essay scoring (AES); Large language models (LLMs); Generative pre-trained transformer (GPT); Prompt engineering; Multi-dimensional writing assessment; Linguistic features; Quality
DOI
10.1016/j.heliyon.2024.e34262
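Chinese Library Classification (CLC)
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract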
Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving 112% and 114% increases in scoring accuracy under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in the Ideas (QWK=0.551) and Organization (QWK=0.584) dimensions under well-crafted prompt engineering. These findings pave the way for a comprehensive exploration of LLMs' broader educational implications, offering insights into their capability to refine and potentially transform writing instruction, assessment, and the delivery of diagnostic and personalized feedback in the AI-powered educational age. While this study focused on the reliability and alignment of LLM-powered multi-dimensional AES, future research should broaden its scope to encompass diverse writing genres and a more extensive sample from varied backgrounds.
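The abstract reports agreement with human raters as QWK, which conventionally stands for quadratic weighted kappa. As a rough illustration of how such a value can be computed, here is a minimal Python sketch of the standard definition. The function name, the integer score range, and the sample scores are assumptions made for illustration; this is not the authors' code and does not reproduce their data.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    n_categories = max_score - min_score + 1

    observed = Counter(zip(human, model))  # joint counts of (human, model)
    hist_human = Counter(human)            # marginal counts per rater
    hist_model = Counter(model)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            # Disagreement penalty grows with the squared score distance.
            weight = (i - j) ** 2 / (n_categories - 1) ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters were independent.
            weighted_expected += weight * hist_human[i] * hist_model[j] / n

    if weighted_expected == 0:
        # Both raters gave one identical score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 1-4 scale, for illustration only.
human_scores = [1, 2, 3, 4, 2, 3]
model_scores = [1, 2, 3, 3, 2, 4]
kappa = quadratic_weighted_kappa(human_scores, model_scores, 1, 4)
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement.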
Pages: 18