The design, construction and evaluation of annotated Arabic cyberbullying corpus

Times cited: 7
Authors
Shannag, Fatima [1]
Hammo, Bassam H. [1, 2]
Faris, Hossam [1, 3]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Comp Informat Syst Dept, Amman, Jordan
[2] Princess Sumaya Univ Technol, King Hussein Sch Comp Sci, Amman, Jordan
[3] Al Hussein Tech Univ, Sch Comp & Informat, Amman, Jordan
Keywords
Annotated cyberbullying corpus; Offensive language; Hate speech; Arabic harassment dataset; Cyberbullying dataset; Profane lexicon; COMMUNICATION;
DOI
10.1007/s10639-022-11056-x
Chinese Library Classification number
G40 [Education];
Discipline classification code
040101 ; 120403 ;
Abstract
Cyberbullying (CB) is classified as one of the severe misconducts on social media. Many CB detection systems have been developed for many natural languages to face this phenomenon. However, Arabic is one of the under-resourced languages suffering from the lack of quality datasets in many computational research areas. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a valuable resource for Arabic CB detection and motivation for future research directions in Arabic Natural Language Processing (NLP). The study describes the phases of ArCybC compilation. By way of illustration, it explores the corpus to discover strategies used in rendering Arabic CB tweets pulled from four Twitter groups, including gaming, sports, news, and celebrities. Based on thorough analysis, we discovered that these groups were the most susceptible to harassment and cyberbullying. The collected tweets were filtered based on a compiled harassment lexicon, which contains a list of multi-dialectical profane words in Arabic compiled from four categories: sexual, racial, physical appearance, and intelligence. To annotate ArCybC, we asked five annotators to classify 4,505 tweets into two classes manually: Offensive/non-Offensive and CB/non-CB. We conducted a rigorous comparison of different machine learning approaches applied on ArCybC to detect Arabic CB using two language models: bag-of-words (BoW) and word embedding. The experiments showed that Support Vector Machine (SVM) with word embedding achieved an accuracy rate of 86.3% and an F1-score rate of 85%. The main challenges encountered during the ArCybC construction were the scarcity of freely available Arabic CB texts and the deficiency of annotating the texts.
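The lexicon-based filtering and multi-annotator labeling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the lexicon entries are hypothetical placeholders (the actual ArCybC lexicon is not reproduced here), whitespace tokenization is a simplification, and majority voting is an assumed way of combining the five annotators' labels, since the abstract does not say how disagreements were resolved.

```python
from collections import Counter

# Hypothetical placeholder lexicon keyed by the four categories named in the
# abstract. The real entries are multi-dialect Arabic profane words.
LEXICON = {
    "sexual": {"badword_a"},
    "racial": {"badword_b"},
    "physical_appearance": {"badword_c"},
    "intelligence": {"badword_d"},
}


def matched_categories(tweet, lexicon=LEXICON):
    """Return the sorted lexicon categories whose words occur in the tweet."""
    # Assumption: naive lowercase whitespace tokenization; real Arabic text
    # would need normalization and dialect-aware tokenization.
    tokens = set(tweet.lower().split())
    return sorted(cat for cat, words in lexicon.items() if tokens & words)


def filter_tweets(tweets, lexicon=LEXICON):
    """Keep only tweets that contain at least one lexicon word."""
    return [t for t in tweets if matched_categories(t, lexicon)]


def majority_label(votes):
    """Combine annotator votes by majority (assumed aggregation rule)."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


tweets = [
    "you are badword_d",
    "nice match today",
    "BADWORD_A and badword_c",
]
kept = filter_tweets(tweets)
final = majority_label(["CB", "CB", "non-CB", "CB", "non-CB"])
```

With five annotators and a binary label, a majority always exists, so no tie-breaking rule is needed in this sketch.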
Pages: 10977-11023
Page count: 47