Detecting Artificial Intelligence-Generated Versus Human-Written Medical Student Essays: Semirandomized Controlled Study

Cited by: 0
Authors
Doru, Berin [1 ]
Maier, Christoph [1 ]
Busse, Johanna Sophie [1 ]
Luecke, Thomas
Schoenhoff, Judith [2 ]
Enax-Krumova, Elena [3 ]
Hessler, Steffen [4 ]
Berger, Maria [5 ]
Tokic, Marianne [6 ]
Affiliations
[1] Ruhr Univ Bochum, Univ Hosp Paediat & Adolescent Med, St Josef Hosp, Alexandrinenstr 5, D-44791 Bochum, Germany
[2] Ruhr Univ Bochum, Dept German Philol Gen & Comparat Literary Studies, Bochum, Germany
[3] Ruhr Univ Bochum, BG Univ Hosp Bergmannsheil gGmbH Bochum, Dept Neurol, Bochum, Germany
[4] Ruhr Univ Bochum, German Dept, German Linguist, Bochum, Germany
[5] Ruhr Univ Bochum, German Dept, Digital Forens Linguist, Bochum, Germany
[6] Ruhr Univ Bochum, Dept Med Informat Biometry & Epidemiol, Bochum, Germany
Source
JMIR MEDICAL EDUCATION | 2025, Vol. 11
Keywords
artificial intelligence; ChatGPT; large language models; textual analysis; writing style; AI; chatbot; LLMs; detection; authorship; medical student; linguistic quality; decision-making; logical coherence;
DOI
10.2196/62779
CLC Number
G40 [Education]
Subject Classification Codes
040101; 120403
Abstract
Background: Large language models, exemplified by ChatGPT, have reached a level of sophistication that makes distinguishing between human- and artificial intelligence (AI)-generated texts increasingly challenging. This has raised concerns in academia, particularly in medicine, where the accuracy and authenticity of written work are paramount.

Objective: This semirandomized controlled study aims to examine the ability of 2 blinded expert groups with different levels of content familiarity (medical professionals and humanities scholars with expertise in textual analysis) to distinguish between longer scientific texts in German written by medical students and those generated by ChatGPT. Additionally, the study sought to analyze the reasoning behind their identification choices, particularly the role of content familiarity and linguistic features.

Methods: Between May and August 2023, a total of 35 experts (medical: n=22; humanities: n=13) were each presented with 2 pairs of texts on different medical topics. Each pair had similar content and structure: 1 text was written by a medical student, and the other was generated by ChatGPT (version 3.5, March 2023). Experts were asked to identify the AI-generated text and justify their choice. These justifications were analyzed through a multistage, interdisciplinary qualitative analysis to identify relevant textual features. Before unblinding, experts rated each text on 6 characteristics: linguistic fluency and spelling/grammatical accuracy, scientific quality, logical coherence, expression of knowledge limitations, formulation of future research questions, and citation quality. Univariate tests and multivariate logistic regression analyses were used to examine associations between participants' characteristics, their stated reasons for author identification, and the likelihood of correctly determining a text's authorship.

Results: Overall, in 48 out of 69 (70%) decision rounds, participants accurately identified the AI-generated texts, with minimal difference between groups (medical: 31/43, 72%; humanities: 17/26, 65%; odds ratio [OR] 1.37, 95% CI 0.5-3.9). While content errors had little impact on identification accuracy, stylistic features played a crucial role in participants' decisions to identify a text as AI-generated, particularly redundancy (OR 6.90, 95% CI 1.01-47.1), repetition (OR 8.05, 95% CI 1.25-51.7), and thread/coherence (OR 6.62, 95% CI 1.25-35.2).

Conclusions: The findings suggest that both medical and humanities experts were able to identify ChatGPT-generated texts in medical contexts, with their decisions largely based on linguistic attributes. The accuracy of identification appears to be independent of experts' familiarity with the text content. As the decision-making process primarily relies on linguistic attributes, such as stylistic features and text coherence, further quasi-experimental studies using texts from other academic disciplines should be conducted to determine whether instructions based on these features can enhance lecturers' ability to distinguish between student-authored and AI-generated work.
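As a plausibility check on the reported group comparison, the sketch below reconstructs per-decision outcomes from the counts given in the Results (medical: 31/43 correct; humanities: 17/26 correct) and fits an unadjusted logistic regression with statsmodels. This is a minimal illustration under those assumptions, not the authors' multivariate model; in particular, it ignores participant covariates and the repeated-measures structure (each expert judged 2 text pairs).

import numpy as np
import statsmodels.api as sm

# Reconstruct decision-level data from the abstract's counts (assumption:
# medical experts were correct in 31 of 43 rounds, humanities in 17 of 26).
correct = np.r_[np.ones(31), np.zeros(12), np.ones(17), np.zeros(9)]
is_medical = np.r_[np.ones(43), np.zeros(26)]

# Unadjusted logistic regression of correctness on group membership; the
# exponentiated slope is the odds ratio for medical vs humanities experts.
fit = sm.Logit(correct, sm.add_constant(is_medical)).fit(disp=0)
print(np.exp(fit.params[1]))      # ~1.37, matching the reported OR
print(np.exp(fit.conf_int()[1]))  # ~[0.48, 3.90], matching the reported 95% CI

The same unadjusted OR can be read directly from the 2x2 table: (31/12) / (17/9) ~ 1.37, consistent with the value reported in the Results.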
Pages: 14