Hate speech detection in the Arabic language: corpus design, construction, and evaluation

Cited by: 1
Authors
Ahmad, Ashraf [1]
Azzeh, Mohammad [2]
Alnagi, Eman [1]
Abu Al-Haija, Qasem [3]
Halabi, Dana [4]
Aref, Abdullah [1]
AbuHour, Yousef [5]
Affiliations
[1] Princess Sumaya Univ Technol PSUT, Dept Comp Sci, Amman, Jordan
[2] Princess Sumaya Univ Technol PSUT, Dept Data Sci, Amman, Jordan
[3] Jordan Univ Sci & Technol, Fac Comp & Informat Technol, Dept Cybersecur, Irbid, Jordan
[4] Luminus Tech Univ Coll LTUC, SAE Inst, Amman, Jordan
[5] Princess Sumaya Univ Technol PSUT, Dept Basic Sci, Amman, Jordan
Keywords
Arabic hate speech; natural language processing (NLP); machine learning; Arabic hate speech detection; Arabic hate speech corpus; social media
DOI
10.3389/frai.2024.1345445
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Hate speech detection in Arabic is a multifaceted challenge because of the language's broad and diverse linguistic terrain. With its multiple dialects and rich cultural subtleties, Arabic requires tailored measures to address online hate speech effectively. To this end, researchers and developers have used natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods have been hampered by the lack of a comprehensive dataset or corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using this dataset, we also characterize the performance of multiple machine learning models for hate speech identification in tweets written in the Jordanian dialect of Arabic. Specifically, the Word2Vec, TF-IDF, and AraBert text representation models were applied to produce vector representations of the text, which were then passed to the classification models. Seven machine learning classifiers were evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation showed that, in this challenging and unstructured setting, the gathered and annotated dataset was effective and produced encouraging results. This will enable researchers to investigate this crucial field of study further.
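The two-stage pipeline the abstract describes (a text representation followed by a classifier) can be illustrated with a minimal sketch. It pairs one of the named representations, TF-IDF, with one of the named classifiers, Naive Bayes, written in plain Python so that it runs without external libraries. This is not the authors' implementation: the class name TfidfNaiveBayes, the helper functions, the smoothing choices, the two toy labels ("flagged" and "neutral"), and the English placeholder sentences are all assumptions made for illustration. They do not reproduce the paper's four-class scheme or its Arabic data.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every token."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Smoothed IDF, log((1 + N) / (1 + df)) + 1, which is always positive.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(doc, idf):
    """Map a tokenized document to a sparse {token: weight} TF-IDF vector."""
    counts = Counter(t for t in doc if t in idf)  # unseen tokens are dropped
    return {t: c * idf[t] for t, c in counts.items()}


class TfidfNaiveBayes:
    """Hypothetical helper: multinomial Naive Bayes over TF-IDF weights."""

    def fit(self, docs, labels):
        self.idf = fit_idf(docs)
        self.class_counts = Counter(labels)
        # Per-class sum of TF-IDF weight for each token.
        self.weights = defaultdict(lambda: defaultdict(float))
        for doc, label in zip(docs, labels):
            for token, w in tfidf(doc, self.idf).items():
                self.weights[label][token] += w
        return self

    def predict(self, doc):
        vec = tfidf(doc, self.idf)
        vocab_size = len(self.idf)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.weights[label].values())
            score = math.log(n / total_docs)  # log prior
            for token, w in vec.items():
                # Laplace-smoothed token likelihood for this class.
                p = (self.weights[label].get(token, 0.0) + 1.0) / (total + vocab_size)
                score += w * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data (assumed, not taken from the corpus).
train_docs = [
    "you are awful and stupid".split(),
    "awful stupid people go away".split(),
    "the weather is nice today".split(),
    "nice game today with friends".split(),
]
train_labels = ["flagged", "flagged", "neutral", "neutral"]

model = TfidfNaiveBayes().fit(train_docs, train_labels)
print(model.predict("stupid awful".split()))
print(model.predict("nice weather".split()))
```

In practice, a library implementation of the vectorizer and classifiers, together with Arabic-specific tokenization and normalization, would replace these hand-written helpers. The sketch only shows how a representation step feeds a classifier.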
Pages: 19