A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers

Cited by: 8
Authors
Abdalla, Mohamed Hesham Ibrahim [1]
Malberg, Simon [1]
Dementieva, Daryna [1]
Mosca, Edoardo [1]
Groh, Georg [1]
Affiliations
[1] Technical University of Munich, School of Computation, Information and Technology, D-80333 Munich, Germany
Keywords
text generation; large language models; machine-generated text detection
DOI
10.3390/info14100522
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
As generative NLP can now produce content nearly indistinguishable from human writing, it is becoming difficult to identify genuine research contributions in academic writing and scientific publications. Moreover, information in machine-generated text can be factually wrong or even entirely fabricated. In this work, we introduce a novel benchmark dataset containing human-written and machine-generated scientific papers from SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica, as well as papers co-created by humans and ChatGPT. We also experiment with several types of classifiers (linguistic-based and transformer-based) for detecting the authorship of scientific text. A strong focus is put on generalization capabilities and explainability to highlight the strengths and weaknesses of these detectors. Our work takes an important step towards creating more robust methods for distinguishing between human-written and machine-generated scientific papers, ultimately helping to ensure the integrity of scientific literature.
Pages: 33