Pashto offensive language detection: a benchmark dataset and monolingual Pashto BERT

Authors
Haq I. [1]
Qiu W. [1]
Guo J. [1]
Tang P. [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Minhang, Shanghai
Keywords
BERT; Large language models; LLMs; Low-resource languages; NLP; Offensive language detection; OSN; Pashto; Social media; Text processing
DOI
10.7717/PEERJ-CS.1617
Abstract
Social media platforms have become inundated with offensive language. This issue must be addressed for the growth of online social networks (OSNs) and a healthy online environment. While significant research has been devoted to identifying toxic content in major languages like English, this remains an open area of research in the low-resource Pashto language. This study aims to develop an AI model for the automatic detection of offensive textual content in Pashto. To achieve this goal, we have developed a benchmark dataset called the Pashto Offensive Language Dataset (POLD), which comprises tweets collected from Twitter and manually classified into two categories: "offensive" and "not offensive". To discriminate these two categories, we investigated the classic deep learning classifiers based on neural networks, including CNNs and RNNs, using static word embeddings: Word2Vec, fastText, and GloVe as features. Furthermore, we examined two transfer learning approaches. In the first approach, we fine-tuned the pre-trained multilingual language model, XLM-R, using the POLD dataset, whereas, in the second approach, we trained a monolingual BERT model for Pashto from scratch using a custom-developed text corpus. Pashto BERT was then fine-tuned similarly to XLM-R. The performance of all the deep learning and transformer learning models was evaluated using the POLD dataset. The experimental results demonstrate that our pre-trained Pashto BERT model outperforms the other models, achieving an F1-score of 94.34% and an accuracy of 94.77%. © 2023 Haq et al.
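The abstract reports results as an F1-score and an accuracy over the two classes. As a rough illustration of how such metrics are computed for a binary "offensive" / "not offensive" task, here is a minimal sketch in plain Python. The function names and the toy labels are hypothetical and are not taken from POLD, and the abstract does not say which F1 averaging the authors used; this sketch assumes F1 on the "offensive" class.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive="offensive"):
    """F1 for one class: harmonic mean of precision and recall.

    Assumption: F1 is computed on the `positive` class only; the source
    does not state the averaging scheme used in the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy labels, for illustration only (not POLD data).
y_true = ["offensive", "not offensive", "offensive", "not offensive", "offensive"]
y_pred = ["offensive", "not offensive", "not offensive", "not offensive", "offensive"]

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"F1 (offensive) = {f1_score(y_true, y_pred):.2f}")
```

Accuracy treats both classes equally per example, while F1 balances precision and recall on the chosen class, which is presumably why the abstract reports both.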
Pages: 1 / 26
Number of pages: 25