MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

Cited by: 0
Authors
Macko, Dominik [1]
Moro, Robert [1]
Uchendu, Adaku [2, 3]
Lucas, Jason Samuel [3]
Yamashita, Michiharu [3]
Pikuliak, Matus [1]
Srba, Ivan [1]
Le, Thai [4]
Lee, Dongwon [3]
Simko, Jakub [1]
Bielikova, Maria [1]
Affiliations
[1] Kempelen Inst Intelligent Technol, Bratislava, Slovakia
[2] MIT, Lincoln Lab, Lexington, MA USA
[3] Penn State Univ, University Pk, PA 16802 USA
[4] Univ Mississippi, University, MS USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
There is a lack of research into the capabilities of recent LLMs to generate convincing text in languages other than English and into the performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks, which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE, a novel benchmarking dataset for multilingual machine-generated text detection comprising 74,081 authentic and machine-generated texts in 11 languages (ar, ca, cs, de, en, es, nl, pt, ru, uk, and zh) generated by 8 multilingual LLMs. Using this benchmark, we compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors. Considering the multilinguality, we evaluate 1) how these detectors generalize to unseen languages (linguistically similar as well as dissimilar) and unseen LLMs and 2) whether the detectors improve their performance when trained on multiple languages.
Pages: 9960-9987
Number of pages: 28
Related papers
50 in total
  • [21] DUTh at SemEval 2024 Task 8: Comparing classic Machine Learning Algorithms and LLM based methods for Multigenerator, Multidomain and Multilingual Machine-Generated Text Detection
    Kyriakou, Theodora
    Maslaris, Ioannis
    Arampatzis, Avi
    PROCEEDINGS OF THE 18TH INTERNATIONAL WORKSHOP ON SEMANTIC EVALUATION, SEMEVAL-2024, 2024, : 1080 - 1086
  • [22] MAGECODE: Machine-Generated Code Detection Method Using Large Language Models
    Pham, Hung
    Ha, Huyen
    Tong, Van
    Hoang, Dung
    Tran, Duc
    Le, Tuyen Ngoc
    IEEE ACCESS, 2024, 12 : 190186 - 190202
  • [23] Chart-to-Text: A Large-Scale Benchmark for Chart Summarization
    Kantharaj, Shankar
    Leong, Rixie Tiffany Ko
    Lin, Xiang
    Masry, Ahmed
    Thakkar, Megh
    Hoque, Enamul
    Joty, Shafiq
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4005 - 4023
  • [24] Are Large-Scale Data From Private Companies Reliable? An Analysis of Machine-Generated Business Location Data in a Popular Dataset
    Grigoropoulou, Nikolitsa
    Small, Mario L.
    SOCIAL SCIENCE COMPUTER REVIEW, 2024,
  • [26] On the Zero-Shot Generalization of Machine-Generated Text Detectors
    Pu, Xiao
    Zhang, Jingyu
    Han, Xiaochuang
    Tsvetkov, Yulia
    He, Tianxing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 4799 - 4808
  • [27] Team AT at SemEval-2024 Task 8: Machine-Generated Text Detection with Semantic Embeddings
    Wei, Yuchen
    PROCEEDINGS OF THE 18TH INTERNATIONAL WORKSHOP ON SEMANTIC EVALUATION, SEMEVAL-2024, 2024, : 492 - 496
  • [28] Overview of IberAuTexTification at IberLEF 2024: Detection and Attribution of Machine-Generated Text on Languages of the Iberian Peninsula
    Sarvazyan, Areg Mikael
    Gonzalez, Jose Angel
    Rangel, Francisco
    Rosso, Paolo
    Franco-Salvador, Marc
    PROCESAMIENTO DEL LENGUAJE NATURAL, 2024, (73): 421 - 434
  • [29] C-Net: A Compression-Based Lightweight Network for Machine-Generated Text Detection
    Zhou, Yinghan
    Wen, Juan
    Jia, Jianghao
    Gao, Liting
    Zhang, Ziwei
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 1269 - 1273
  • [30] k-SEMSTAMP: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text
    Hou, Abe Bohan
    Zhang, Jingyu
    Wang, Yichen
    Khashabi, Daniel
    He, Tianxing
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 1706 - 1715