Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Authors
Haq I. [1]
Qiu W. [1]
Guo J. [1]
Tang P. [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Minhang, Shanghai
Keywords
BERT; Large language models; LLMs; Low-resource languages; NLP; Offensive language detection; OSN; Pashto; Social media; Text processing
DOI
10.7717/PEERJ-CS.1617
Abstract
Social media platforms have become inundated with offensive language. This issue must be addressed to support the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, it remains an open area of research for the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To achieve this goal, we developed a benchmark dataset called the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: "offensive" and "not offensive". To discriminate between these two categories, we investigated classic neural network classifiers, including CNNs and RNNs, using static word embeddings (Word2Vec, fastText, and GloVe) as features. Furthermore, we examined two transfer learning approaches. In the first approach, we fine-tuned the pre-trained multilingual language model XLM-R on the POLD dataset. In the second approach, we trained a monolingual BERT model for Pashto from scratch on a custom-developed text corpus and then fine-tuned it in the same way as XLM-R. The performance of all the deep learning and transformer-based models was evaluated on the POLD dataset. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%. © 2023 Haq et al.
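The abstract reports accuracy and F1-score for a two-class task. As a rough illustration of how such figures are typically computed, here is a minimal sketch in plain Python. It assumes labels are encoded as 1 ("offensive") and 0 ("not offensive") and that F1 is macro-averaged over both classes; the abstract does not state the averaging scheme, so that choice is an assumption. The helper names and the toy labels are hypothetical and are not taken from POLD.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 (harmonic mean of precision and recall) treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Hypothetical toy labels: 1 = "offensive", 0 = "not offensive".
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

if __name__ == "__main__":
    print(f"accuracy: {accuracy(y_true, y_pred):.4f}")
    print(f"macro F1: {macro_f1(y_true, y_pred):.4f}")
```

Accuracy alone can look good on imbalanced data, which is one reason F1 is commonly reported next to it for offensive language detection.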
Pages: 1 / 26
Page count: 25