Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Authors
Haq I. [1]
Qiu W. [1]
Guo J. [1]
Tang P. [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Minhang, Shanghai
Keywords
BERT; Large language models; LLMs; Low-resource languages; NLP; Offensive language detection; OSN; Pashto; Social media; Text processing
DOI
10.7717/PEERJ-CS.1617
Abstract
Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research for the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To achieve this goal, we have developed a benchmark dataset called the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: "offensive" and "not offensive". To discriminate between these two categories, we investigated classic neural network classifiers, including CNNs and RNNs, using static word embeddings (Word2Vec, fastText, and GloVe) as features. Furthermore, we examined two transfer learning approaches. In the first approach, we fine-tuned the pre-trained multilingual language model XLM-R on the POLD dataset. In the second approach, we trained a monolingual BERT model for Pashto from scratch using a custom-developed text corpus; this Pashto BERT was then fine-tuned in the same way as XLM-R. The performance of all the deep learning and transfer learning models was evaluated using the POLD dataset. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%. © 2023 Haq et al.
Pages: 1 / 26
Number of pages: 25
Related papers
50 in total
  • [31] Pashto Language Dialect Recognition using Mel Frequency Cepstral Coefficient and Support Vector Machines
    Khan, Saud
    Ali, Haider
    Ullah, Khalil
    2017 INTERNATIONAL CONFERENCE ON INNOVATIONS IN ELECTRICAL ENGINEERING AND COMPUTATIONAL TECHNOLOGIES (ICIEECT), 2017
  • [32] A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection
    Chowdhury, Shammur A.
    Mubarak, Hamdy
    Abdelali, Ahmed
    Jung, Soon-gyo
    Jansen, Bernard J.
    Salminen, Joni
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 6203 - 6212
  • [33] A Dataset of Offensive Language in Kosovo Social Media
    Ajvazi, Adem
    Hardmeier, Christian
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1860 - 1869
  • [34] Elevating Offensive Language Detection: CNN-GRU and BERT for Enhanced Hate Speech Identification
    Madhavi, M.
    Agal, Sanjay
    Odedra, Niyati Dhirubhai
    Chowdhary, Harish
    Ruprah, Taranpreet Singh
    Vuyyuru, Veera Ankalu
    El-Ebiary, Yousef A. Baker
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2024, 15 (05) : 1164 - 1172
  • [35] Robust Arabic and Pashto Text Detection in Camera-Captured Documents Using Deep Learning Techniques
    Khan, Nisar
    Ahmad, Riaz
    Ullah, Khalil
    Muhammad, Siraj
    Hussain, Ibrar
    Khan, Ahmad
    Ghadi, Yazeed Yasin
    Mohamed, Heba G.
    IEEE ACCESS, 2023, 11 : 135788 - 135796
  • [36] Advancing offensive language detection in Arabic social media: a BERT-based ensemble learning approach
    Mazari, Ahmed Cherif
    Benterkia, Asmaa
    Takdenti, Zineb
    SOCIAL NETWORK ANALYSIS AND MINING, 2024, 14 (01)
  • [37] Persian offensive language detection
    Kebriaei, Emad
    Homayouni, Ali
    Faraji, Roghayeh
    Razavi, Armita
    Shakery, Azadeh
    Faili, Heshaam
    Yaghoobzadeh, Yadollah
    MACHINE LEARNING, 2024, 113 (07) : 4359 - 4379
  • [38] Offensive Video Detection: Dataset and Baseline Results
    Alcantara, Cleber de S.
    Feijo, Diego
    Moreira, Viviane P.
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 4309 - 4319
  • [39] Unsupervised offensive speech detection for multimedia based on multilingual BERT
    Liu, Ge
    Yang, Xiaona
    Shi, Xiayang
    Li, Yinlin
    INTERNATIONAL JOURNAL OF SENSOR NETWORKS, 2024, 46 (03) : 186 - 196
  • [40] A Survey of Offensive Language Detection for the Arabic Language
    Husain, Fatemah
    Uzuner, Ozlem
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (01)