The design, construction and evaluation of annotated Arabic cyberbullying corpus

Cited by: 7
Authors
Shannag, Fatima [1]
Hammo, Bassam H. [1, 2]
Faris, Hossam [1, 3]
Affiliations
[1] Univ Jordan, King Abdullah II Sch Informat Technol, Comp Informat Syst Dept, Amman, Jordan
[2] Princess Sumaya Univ Technol, King Hussein Sch Comp Sci, Amman, Jordan
[3] Al Hussein Tech Univ, Sch Comp & Informat, Amman, Jordan
Keywords
Annotated cyberbullying corpus; Offensive language; Hate speech; Arabic harassment dataset; Cyberbullying dataset; Profane lexicon; COMMUNICATION;
DOI
10.1007/s10639-022-11056-x
Chinese Library Classification (CLC) number
G40 [Education];
Discipline classification code
040101; 120403;
Abstract
Cyberbullying (CB) is one of the most severe forms of misconduct on social media. Many CB detection systems have been developed for a range of natural languages to counter this phenomenon. However, Arabic is an under-resourced language that lacks quality datasets in many computational research areas. This paper discusses the design, construction, and evaluation of a multi-dialect, annotated Arabic Cyberbullying Corpus (ArCybC), a valuable resource for Arabic CB detection and a motivation for future research in Arabic Natural Language Processing (NLP). The study describes the phases of ArCybC compilation. By way of illustration, it explores the corpus to discover the strategies used in rendering Arabic CB tweets pulled from four Twitter groups: gaming, sports, news, and celebrities. Based on thorough analysis, we found that these groups were the most susceptible to harassment and cyberbullying. The collected tweets were filtered using a compiled harassment lexicon, a list of multi-dialectal profane Arabic words drawn from four categories: sexual, racial, physical appearance, and intelligence. To annotate ArCybC, we asked five annotators to manually classify 4,505 tweets along two binary labels: Offensive/non-Offensive and CB/non-CB. We rigorously compared different machine learning approaches applied to ArCybC for detecting Arabic CB, using two language models: bag-of-words (BoW) and word embedding. The experiments showed that a Support Vector Machine (SVM) with word embedding achieved an accuracy of 86.3% and an F1-score of 85%. The main challenges encountered during ArCybC construction were the scarcity of freely available Arabic CB texts and the deficiency of annotating the texts.
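The pipeline steps named in the abstract (lexicon-based filtering, labeling by five annotators, and scoring with accuracy and F1) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the lexicon entries are English placeholders, not the paper's Arabic terms; majority voting is one plausible way to combine five annotators' labels, and the abstract does not say which scheme the authors used; and the F1 shown is the positive-class binary F1, which may differ from the averaging behind the reported 85%.

```python
from collections import Counter

# Hypothetical placeholder lexicon. The paper's lexicon holds multi-dialect
# Arabic profane words in these four categories; the tokens here are stand-ins.
LEXICON = {
    "sexual": {"badword_a"},
    "racial": {"badword_b"},
    "physical_appearance": {"badword_c"},
    "intelligence": {"badword_d"},
}


def matched_categories(tweet, lexicon=LEXICON):
    """Return the sorted lexicon categories whose terms occur in the tweet."""
    tokens = set(tweet.lower().split())  # naive whitespace tokenization
    return sorted(cat for cat, terms in lexicon.items() if tokens & terms)


def filter_tweets(tweets, lexicon=LEXICON):
    """Keep only the tweets that contain at least one lexicon term."""
    return [t for t in tweets if matched_categories(t, lexicon)]


def majority_label(votes):
    """Combine annotator votes by simple majority (an assumed scheme)."""
    return Counter(votes).most_common(1)[0][0]


def accuracy_and_f1(gold, pred, positive="CB"):
    """Compute accuracy and the positive-class F1 for two label sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


if __name__ == "__main__":
    tweets = ["you are a badword_d", "nice goal today", "BADWORD_A and badword_c"]
    print(filter_tweets(tweets))
    print(majority_label(["CB", "CB", "non-CB", "CB", "non-CB"]))
    print(accuracy_and_f1(["CB", "CB", "non-CB"], ["CB", "non-CB", "non-CB"]))
```

With five annotators and a binary label, a simple majority can never tie, which is one reason an odd-sized annotator pool is convenient. A real system would also need Arabic-aware normalization and tokenization before lexicon matching, which this sketch omits.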
Pages: 10977-11023
Number of pages: 47
Related papers
50 in total
  • [1] The design, construction and evaluation of annotated Arabic cyberbullying corpus
    Fatima Shannag
    Bassam H. Hammo
    Hossam Faris
    Education and Information Technologies, 2022, 27 : 10977 - 11023
  • [2] A 700M+Arabic corpus: KACST Arabic corpus design and construction
    Al-Thubaity, Abdulmohsen O.
    LANGUAGE RESOURCES AND EVALUATION, 2015, 49 (03) : 721 - 751
  • [3] A 700M+ Arabic corpus: KACST Arabic corpus design and construction
    Abdulmohsen O. Al-Thubaity
    Language Resources and Evaluation, 2015, 49 : 721 - 751
  • [4] Hate speech detection in the Arabic language: corpus design, construction, and evaluation
    Ahmad, Ashraf
    Azzeh, Mohammad
    Alnagi, Eman
    Abu Al-Haija, Qasem
    Halabi, Dana
    Aref, Abdullah
    AbuHour, Yousef
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2024, 7
  • [5] BAAC: Bangor Arabic Annotated Corpus
    Alkhazi, Ibrahim S.
    Teahan, William J.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2018, 9 (11) : 131 - 140
  • [6] A Morphologically Annotated Corpus of Emirati Arabic
    Khalifa, Salam
    Habash, Nizar
    Eryani, Fadhl
    Obeid, Ossama
    Abdulrahim, Dana
    Al Kaabi, Meera
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3839 - 3846
  • [7] Curras: an annotated corpus for the Palestinian Arabic dialect
    Jarrar, Mustafa
    Habash, Nizar
    Alrimawi, Faeq
    Akra, Diyam
    Zalmout, Nasser
    LANGUAGE RESOURCES AND EVALUATION, 2017, 51 (03) : 745 - 775
  • [8] Curras: an annotated corpus for the Palestinian Arabic dialect
    Mustafa Jarrar
    Nizar Habash
    Faeq Alrimawi
    Diyam Akra
    Nasser Zalmout
    Language Resources and Evaluation, 2017, 51 : 745 - 775
  • [9] An academic Arabic corpus for plagiarism detection: design, construction and experimentation
    Al-Thwaib, Eman
    Hammo, Bassam H.
    Yagi, Sane
    INTERNATIONAL JOURNAL OF EDUCATIONAL TECHNOLOGY IN HIGHER EDUCATION, 2020, 17 (01)
  • [10] An academic Arabic corpus for plagiarism detection: design, construction and experimentation
    Eman Al-Thwaib
    Bassam H. Hammo
    Sane Yagi
    International Journal of Educational Technology in Higher Education, 17