Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Authors
Haq I. [1]
Qiu W. [1]
Guo J. [1]
Tang P. [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Minhang, Shanghai
Keywords
BERT; Large language models; LLMs; Low-resource languages; NLP; Offensive language detection; OSN; Pashto; Social media; Text processing
DOI
10.7717/PEERJ-CS.1617
Abstract
Social media platforms have become inundated with offensive language. Addressing this issue is essential to the growth of online social networks (OSNs) and to a healthy online environment. While significant research has been devoted to identifying toxic content in major languages such as English, it remains an open research area for the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To this end, we developed a benchmark dataset, the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: "offensive" and "not offensive". To discriminate between these two categories, we investigated classic neural-network-based deep learning classifiers, including CNNs and RNNs, using three static word embeddings (Word2Vec, fastText, and GloVe) as features. We also examined two transfer learning approaches. In the first, we fine-tuned the pre-trained multilingual language model XLM-R on POLD. In the second, we trained a monolingual BERT model for Pashto from scratch on a custom-developed text corpus and then fine-tuned it in the same way as XLM-R. All deep learning and transformer-based models were evaluated on POLD. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%. © 2023 Haq et al.
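The abstract reports results as accuracy and F1-score on a two-class task. As a reading aid, the sketch below shows how these two metrics are commonly computed for binary labels. It is an illustration only, not the authors' evaluation code: the function name `accuracy_and_f1`, the choice of "offensive" as the positive class, and the toy labels are all assumptions made here, and the abstract does not say whether the reported F1 is per-class or averaged.

```python
def accuracy_and_f1(y_true, y_pred, positive="offensive"):
    """Return (accuracy, F1 of the positive class) for two label sequences.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    # Confusion counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


# Toy labels invented for this example; they are not drawn from POLD.
gold = ["offensive", "not offensive", "offensive", "not offensive", "offensive"]
pred = ["offensive", "not offensive", "not offensive", "not offensive", "offensive"]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Accuracy counts every correct prediction equally, while F1 balances precision and recall on the chosen class, which is why the two figures can differ on the same test set.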
Pages: 1 / 26
Page count: 25
Source
PeerJ Computer Science, 2023, 9