MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark

Cited by: 0
Authors
Macko, Dominik [1]
Moro, Robert [1]
Uchendu, Adaku [2, 3]
Lucas, Jason Samuel [3]
Yamashita, Michiharu [3]
Pikuliak, Matus [1]
Srba, Ivan [1]
Le, Thai [4]
Lee, Dongwon [3]
Simko, Jakub [1]
Bielikova, Maria [1]
Affiliations
[1] Kempelen Inst Intelligent Technol, Bratislava, Slovakia
[2] MIT, Lincoln Lab, Lexington, MA USA
[3] Penn State Univ, University Pk, PA 16802 USA
[4] Univ Mississippi, University, MS USA
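Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification code
081104; 0812; 0835; 1405
Abstract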
There is a lack of research into the capabilities of recent LLMs to generate convincing text in languages other than English and into the performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks, which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE, a novel benchmarking dataset for multilingual machine-generated text detection comprising 74,081 authentic and machine-generated texts in 11 languages (ar, ca, cs, de, en, es, nl, pt, ru, uk, and zh) generated by 8 multilingual LLMs. Using this benchmark, we compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors. Considering the multilinguality, we evaluate 1) how these detectors generalize to unseen languages (linguistically similar as well as dissimilar) and unseen LLMs and 2) whether the detectors improve their performance when trained on multiple languages.
Pages: 9960-9987
Page count: 28