Towards AI-Generated Essay Classification Using Numerical Text Representation

Cited by: 1
Authors
Krawczyk, Natalia [1]
Probierz, Barbara [1, 2]
Kozak, Jan [1]
Affiliations
[1] Univ Econ Katowice, Dept Machine Learning, 1 Maja 50, PL-40287 Katowice, Poland
[2] Lukasiewicz Res Network, Inst Innovat Technol EMAG, Leopolda 31, PL-40189 Katowice, Poland
Source
APPLIED SCIENCES-BASEL | 2024 | Volume 14 | Issue 21
Keywords
natural language processing; numerical text representations; text classification; large language models
DOI
10.3390/app14219795
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The detection of essays written by AI compared to those authored by students is increasingly becoming a significant issue in educational settings. This research examines various numerical text representation techniques to improve the classification of these essays. Utilizing a diverse dataset, we undertook several preprocessing steps, including data cleaning, tokenization, and lemmatization. Our system analyzes different text representation methods such as Bag of Words, TF-IDF, and fastText embeddings in conjunction with multiple classifiers. Our experiments showed that TF-IDF weights paired with logistic regression reached the highest accuracy of 99.82%. Methods like Bag of Words, TF-IDF, and fastText embeddings achieved accuracies exceeding 96.50% across all tested classifiers. Sentence embeddings, including MiniLM and distilBERT, yielded accuracies from 93.78% to 96.63%, indicating room for further refinement. Conversely, pre-trained fastText embeddings showed reduced performance, with a lowest accuracy of 89.88% in logistic regression. Remarkably, the XGBoost classifier delivered the highest minimum accuracy of 96.24%. Specificity and precision were above 99% for most methods, showcasing high capability in differentiating between student-created and AI-generated texts. This study underscores the vital role of choosing dataset-specific text representations to boost classification accuracy.
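As a rough illustration of the best-performing pairing reported in the abstract (TF-IDF weights with logistic regression), the following is a minimal standard-library sketch, not the authors' implementation. The toy corpus, the function names, the smoothed IDF variant, and the learning rate and epoch count are all assumptions made for illustration only; the preprocessing here is plain lowercasing and whitespace tokenization, without the cleaning and lemmatization steps the abstract mentions.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the paper): 0 = student-written, 1 = AI-generated.
texts = [
    "i think my summer trip was fun and kinda messy",
    "we went to the lake and i lost my shoe",
    "in conclusion it is important to consider multiple perspectives",
    "furthermore it is essential to consider the broader implications",
]
labels = [0, 0, 1, 1]


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def fit_idf(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(tokens, idf, vocab):
    """Relative term frequency times IDF; terms outside the vocabulary are ignored."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[term] / total * idf[term] for term in vocab]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=1.0, epochs=300):
    """Batch gradient descent on the mean logistic loss (no regularization)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b):
    """Return 1 (AI-generated) if the predicted probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


tokenized = [tokenize(t) for t in texts]
idf = fit_idf(tokenized)
vocab = sorted(idf)
X = [tfidf_vector(doc, idf, vocab) for doc in tokenized]
w, b = train_logreg(X, labels)
predictions = [predict(x, w, b) for x in X]
```

On a real dataset the same two stages would be fitted on a training split and evaluated on held-out essays; the predictions above are on the training texts themselves and say nothing about the accuracies reported in the abstract.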
Pages: 23