The design, construction and evaluation of annotated Arabic cyberbullying corpus

Cited by: 7
Authors
Shannag, Fatima [1]
Hammo, Bassam H. [1, 2]
Faris, Hossam [1, 3]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Comp Informat Syst Dept, Amman, Jordan
[2] Princess Sumaya Univ Technol, King Hussein Sch Comp Sci, Amman, Jordan
[3] Al Hussein Tech Univ, Sch Comp & Informat, Amman, Jordan
Keywords
Annotated cyberbullying corpus; Offensive language; Hate speech; Arabic harassment dataset; Cyberbullying dataset; Profane lexicon; COMMUNICATION
DOI
10.1007/s10639-022-11056-x
Chinese Library Classification (CLC) number
G40 [Education]
Discipline classification code
040101; 120403
Abstract
Cyberbullying (CB) is one of the severe forms of misconduct on social media. CB detection systems have been developed for many natural languages to counter this phenomenon. However, Arabic remains an under-resourced language that lacks quality datasets in many areas of computational research. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a resource for Arabic CB detection and a motivation for future research in Arabic Natural Language Processing (NLP). The study describes the phases of ArCybC compilation. By way of illustration, it explores the corpus to discover the strategies used in rendering Arabic CB tweets pulled from four Twitter groups: gaming, sports, news, and celebrities. Based on a thorough analysis, we found that these groups were the most susceptible to harassment and cyberbullying. The collected tweets were filtered using a compiled harassment lexicon, a list of multi-dialect profane Arabic words drawn from four categories: sexual, racial, physical appearance, and intelligence. To annotate ArCybC, we asked five annotators to manually classify 4,505 tweets under two binary labels: Offensive/non-Offensive and CB/non-CB. We conducted a rigorous comparison of different machine learning approaches applied to ArCybC to detect Arabic CB using two language models: bag-of-words (BoW) and word embedding. The experiments showed that a Support Vector Machine (SVM) with word embedding achieved an accuracy of 86.3% and an F1-score of 85%. The main challenges encountered during the construction of ArCybC were the scarcity of freely available Arabic CB texts and the difficulty of annotating them.
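The lexicon-based filtering and five-annotator labeling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The lexicon entries (`term_a` to `term_d`) are neutral placeholders, the function names are hypothetical, and majority voting is an assumption, since the abstract does not say how the five annotations were combined.

```python
import re
from collections import Counter

# Hypothetical placeholder lexicon: category -> set of terms.
# The real ArCybC lexicon holds multi-dialect Arabic profane words;
# neutral tokens stand in for them here.
LEXICON = {
    "sexual": {"term_a"},
    "racial": {"term_b"},
    "physical_appearance": {"term_c"},
    "intelligence": {"term_d"},
}

# \w matches Unicode word characters, so Arabic tokens are handled too.
TOKEN_RE = re.compile(r"\w+")


def matched_categories(tweet, lexicon=LEXICON):
    """Return the sorted lexicon categories with at least one term in the tweet."""
    tokens = set(TOKEN_RE.findall(tweet.lower()))
    return sorted(cat for cat, terms in lexicon.items() if tokens & terms)


def filter_tweets(tweets, lexicon=LEXICON):
    """Keep only the tweets that contain at least one lexicon term."""
    return [t for t in tweets if matched_categories(t, lexicon)]


def majority_label(votes):
    """Return the most frequent label (assumed aggregation rule).

    With five annotators and a binary label, a tie cannot occur.
    """
    label, _count = Counter(votes).most_common(1)[0]
    return label


tweets = ["you are term_d", "nice match today", "TERM_A and term_c"]
kept = filter_tweets(tweets)
categories = matched_categories("TERM_A and term_c")
label = majority_label(["CB", "CB", "non-CB", "CB", "non-CB"])
```

Exact token matching is the simplest possible choice. Arabic text would likely need normalization (for example, of diacritics and letter variants) before matching, which this sketch leaves out.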
Pages: 10977-11023
Page count: 47