Towards AI-Generated Essay Classification Using Numerical Text Representation

Cited by: 1
Authors
Krawczyk, Natalia [1]
Probierz, Barbara [1, 2]
Kozak, Jan [1]
Affiliations
[1] Univ Econ Katowice, Dept Machine Learning, 1 Maja 50, PL-40287 Katowice, Poland
[2] Lukasiewicz Res Network, Inst Innovat Technol EMAG, Leopolda 31, PL-40189 Katowice, Poland
Source
APPLIED SCIENCES-BASEL | 2024 / Vol. 14 / Issue 21
Keywords
natural language processing; numerical text representations; text classification; large language models
DOI
10.3390/app14219795
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
The detection of essays written by AI compared to those authored by students is increasingly becoming a significant issue in educational settings. This research examines various numerical text representation techniques to improve the classification of these essays. Utilizing a diverse dataset, we undertook several preprocessing steps, including data cleaning, tokenization, and lemmatization. Our system analyzes different text representation methods such as Bag of Words, TF-IDF, and fastText embeddings in conjunction with multiple classifiers. Our experiments showed that TF-IDF weights paired with logistic regression reached the highest accuracy of 99.82%. Methods like Bag of Words, TF-IDF, and fastText embeddings achieved accuracies exceeding 96.50% across all tested classifiers. Sentence embeddings, including MiniLM and distilBERT, yielded accuracies from 93.78% to 96.63%, indicating room for further refinement. Conversely, pre-trained fastText embeddings showed reduced performance, with a lowest accuracy of 89.88% in logistic regression. Remarkably, the XGBoost classifier delivered the highest minimum accuracy of 96.24%. Specificity and precision were above 99% for most methods, showcasing high capability in differentiating between student-created and AI-generated texts. This study underscores the vital role of choosing dataset-specific text representations to boost classification accuracy.
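As a rough illustration of what a numerical text representation such as TF-IDF looks like, the following is a minimal standard-library sketch. It assumes raw term counts and the smoothed inverse document frequency ln((1 + N) / (1 + df)) + 1, which is one common variant; the abstract does not state which weighting the authors used. The function name `tfidf` and the toy token lists are hypothetical and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumption: raw term frequency times the smoothed IDF
    ln((1 + N) / (1 + df)) + 1. This is a common variant, not
    necessarily the weighting used in the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Toy, made-up token lists (already cleaned and tokenized).
docs = [
    ["the", "essay", "argues", "clearly"],
    ["the", "essay", "delves", "into", "nuance"],
]
vectors = tfidf(docs)
```

Terms that occur in every document (here `"the"` and `"essay"`) receive the lowest weight, while terms unique to one document are weighted higher. Vectors of this kind are what a downstream classifier such as logistic regression would take as input.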
Pages: 23