The design, construction and evaluation of annotated Arabic cyberbullying corpus

Cited by: 7
Authors
Shannag, Fatima [1]
Hammo, Bassam H. [1, 2]
Faris, Hossam [1, 3]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Comp Informat Syst Dept, Amman, Jordan
[2] Princess Sumaya Univ Technol, King Hussein Sch Comp Sci, Amman, Jordan
[3] Al Hussein Tech Univ, Sch Comp & Informat, Amman, Jordan
Keywords
Annotated cyberbullying corpus; Offensive language; Hate speech; Arabic harassment dataset; Cyberbullying dataset; Profane lexicon; COMMUNICATION;
DOI
10.1007/s10639-022-11056-x
Chinese Library Classification
G40 [Education]
Subject classification codes
040101; 120403
Abstract
Cyberbullying (CB) is among the most severe forms of misconduct on social media. CB detection systems have been developed for many natural languages to counter this phenomenon. However, Arabic remains an under-resourced language that lacks quality datasets in many computational research areas. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a valuable resource for Arabic CB detection and a motivation for future research directions in Arabic Natural Language Processing (NLP). The study describes the phases of ArCybC compilation. By way of illustration, it explores the corpus to discover strategies used in rendering Arabic CB tweets pulled from four Twitter groups: gaming, sports, news, and celebrities. Based on thorough analysis, we found that these groups were the most susceptible to harassment and cyberbullying. The collected tweets were filtered using a compiled harassment lexicon, a list of multi-dialect profane Arabic words drawn from four categories: sexual, racial, physical appearance, and intelligence. To annotate ArCybC, we asked five annotators to manually label 4,505 tweets along two binary dimensions: Offensive/non-Offensive and CB/non-CB. We conducted a rigorous comparison of different machine learning approaches applied to ArCybC to detect Arabic CB using two language models: bag-of-words (BoW) and word embedding. The experiments showed that a Support Vector Machine (SVM) with word embedding achieved an accuracy of 86.3% and an F1-score of 85%. The main challenges encountered during ArCybC construction were the scarcity of freely available Arabic CB texts and difficulties in annotating them.
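The lexicon-based filtering step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon entries are neutral placeholder tokens, and the names `LEXICON`, `matched_categories`, and `filter_tweets` are hypothetical. Only the four category names come from the abstract.

```python
import re

# Placeholder lexicon (assumption): the actual ArCybC harassment lexicon is
# not reproduced here. Only the four category names come from the abstract.
LEXICON = {
    "sexual": {"term_a"},
    "racial": {"term_b"},
    "physical_appearance": {"term_c"},
    "intelligence": {"term_d"},
}

# \w matches Unicode word characters for str patterns, so Arabic tokens
# are handled the same way as the Latin placeholders used here.
TOKEN_RE = re.compile(r"\w+")


def matched_categories(tweet, lexicon=LEXICON):
    """Return the lexicon categories that have at least one term in the tweet."""
    tokens = {token.lower() for token in TOKEN_RE.findall(tweet)}
    return {category for category, terms in lexicon.items() if tokens & terms}


def filter_tweets(tweets, lexicon=LEXICON):
    """Keep only tweets that contain at least one lexicon term."""
    return [tweet for tweet in tweets if matched_categories(tweet, lexicon)]
```

Exact token matching is the simplest possible choice. A real pipeline for Arabic would likely need normalization (for example, of diacritics and letter variants) before lookup, which this sketch omits.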
Pages: 10977-11023
Page count: 47