MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

Cited by: 0
Authors
Macko, Dominik [1]
Moro, Robert [1]
Uchendu, Adaku [2, 3]
Lucas, Jason Samuel [3]
Yamashita, Michiharu [3]
Pikuliak, Matus [1]
Srba, Ivan [1]
Le, Thai [4]
Lee, Dongwon [3]
Simko, Jakub [1]
Bielikova, Maria [1]
Affiliations
[1] Kempelen Inst Intelligent Technol, Bratislava, Slovakia
[2] MIT, Lincoln Lab, Lexington, MA USA
[3] Penn State Univ, University Pk, PA 16802 USA
[4] Univ Mississippi, University, MS USA
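Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract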
There is a lack of research into the capabilities of recent LLMs to generate convincing text in languages other than English and into the performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks, which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE, a novel benchmarking dataset for multilingual machine-generated text detection comprising 74,081 authentic and machine-generated texts in 11 languages (ar, ca, cs, de, en, es, nl, pt, ru, uk, and zh) generated by 8 multilingual LLMs. Using this benchmark, we compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors. Considering the multilinguality, we evaluate 1) how these detectors generalize to unseen languages (linguistically similar as well as dissimilar) and unseen LLMs and 2) whether the detectors improve their performance when trained on multiple languages.
Pages: 9960-9987
Page count: 28
Related papers
50 in total
  • [31] Zero-Shot Detection of Machine-Generated Codes
    Yang, Xianjun
    Zhang, Kexun
    Chen, Haifeng
    Petzold, Linda
    Wang, William Yang
    Cheng, Wei
    arXiv, 2023
  • [32] A Benchmark Dataset to Distinguish Human-Written and Machine-Generated Scientific Papers
    Abdalla, Mohamed Hesham Ibrahim
    Malberg, Simon
    Dementieva, Daryna
    Mosca, Edoardo
    Groh, Georg
    INFORMATION, 2023, 14 (10)
  • [33] Limits of Detecting Text Generated by Large-Scale Language Models
    Varshney, Lav R.
    Keskar, Nitish Shirish
    Socher, Richard
    2020 INFORMATION THEORY AND APPLICATIONS WORKSHOP (ITA), 2020
  • [34] MINION: a Large-Scale and Diverse Dataset for Multilingual Event Detection
    Ben Veyseh, Amir Pouran
    Minh Van Nguyen
    Dernoncourt, Franck
    Thien Huu Nguyen
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 2286 - 2299
  • [35] Team lm-detector at PAN: Can NLI be an Appropriate Approach to Machine-Generated Text Detection
    Wu, Guojun
    Guan, Qinghao
    CEUR Workshop Proceedings, 2024, 3740 : 2956 - 2962
  • [36] COCO: Coherence-Enhanced Machine-Generated Text Detection Under Low Resource With Contrastive Learning
    Liu, Xiaoming
    Zhang, Zhaohan
    Wang, Yichen
    Pu, Hang
    Lan, Yu
    Shen, Chao
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 16167 - 16188
  • [37] A Large-Scale Homography Benchmark
    Barath, Daniel
    Mishkin, Dmytro
    Polic, Michal
    Forstner, Wolfgang
    Matas, Jiri
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 21360 - 21370
  • [38] DetectLLM: Leveraging Log-Rank Information for Zero-Shot Detection of Machine-Generated Text
    Su, Jinyan
    Zhuo, Terry Yue
    Wang, Di
    Nakov, Preslav
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 12395 - 12412
  • [39] Collaborative Camouflaged Object Detection: A Large-Scale Dataset and Benchmark
    Zhang, Cong
    Bi, Hongbo
    Xiang, Tian-Zhu
    Wu, Ranwan
    Tong, Jinghui
    Wang, Xiufang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 35 (12) : 1 - 15
  • [40] Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges
    Ding, Jian
    Xue, Nan
    Xia, Gui-Song
    Bai, Xiang
    Yang, Wen
    Yang, Michael Ying
    Belongie, Serge
    Luo, Jiebo
    Datcu, Mihai
    Pelillo, Marcello
    Zhang, Liangpei
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (11) : 7778 - 7796