Hate speech detection in the Arabic language: corpus design, construction, and evaluation

Cited by: 1
Authors
Ahmad, Ashraf [1 ]
Azzeh, Mohammad [2 ]
Alnagi, Eman [1 ]
Abu Al-Haija, Qasem [3 ]
Halabi, Dana [4 ]
Aref, Abdullah [1 ]
AbuHour, Yousef [5 ]
Affiliations
[1] Princess Sumaya Univ Technol PSUT, Dept Comp Sci, Amman, Jordan
[2] Princess Sumaya Univ Technol PSUT, Dept Data Sci, Amman, Jordan
[3] Jordan Univ Sci & Technol, Fac Comp & Informat Technol, Dept Cybersecur, Irbid, Jordan
[4] Luminus Tech Univ Coll LTUC, SAE Inst, Amman, Jordan
[5] Princess Sumaya Univ Technol PSUT, Dept Basic Sci, Amman, Jordan
Keywords
Arabic hate speech; natural language processing (NLP); machine learning; Arabic hate speech detection; Arabic hate speech corpus; social media
DOI
10.3389/frai.2024.1345445
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hate speech detection in Arabic is a multifaceted challenge because of the language's broad and diverse linguistic terrain. With its multiple dialects and rich cultural subtleties, Arabic requires dedicated measures to address online hate speech effectively. To this end, researchers and developers have applied natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods have been hampered by the lack of a comprehensive dataset/corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using this dataset, we also characterize the performance of multiple machine learning models for hate speech identification in Arabic Jordanian dialect tweets. Specifically, the Word2Vec, TF-IDF, and AraBERT text representation models were applied to produce word vectors, which give the classification models a numerical representation of the text. Seven machine learning classifiers were then evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation showed that, in this challenging and unstructured setting, the gathered and annotated datasets were reasonably effective and produced encouraging results. This will enable researchers to delve further into this crucial field of study.
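The abstract describes a two-stage pipeline: a text representation model turns each tweet into a vector, and a classifier is trained on those vectors. The following is a minimal sketch of one such combination, TF-IDF features fed to a multinomial Naive Bayes classifier, written with the Python standard library only. It is not the authors' implementation: the function names, the whitespace tokenizer, the smoothing choices, and the toy English placeholder texts are all assumptions made for illustration, and the abstract does not specify how any of these steps were configured.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization (assumption); real Arabic tweets would need
    # normalization and cleaning before this step.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf(doc, idf):
    # Raw term count times IDF; tokens unseen during fitting are dropped.
    counts = Counter(tokenize(doc))
    return {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}


def train_nb(docs, labels, idf, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with additive smoothing.
    class_docs = Counter(labels)
    weights = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        weights[label].update(tfidf(doc, idf))
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(weights[label].values())
        denom = total + alpha * len(idf)
        log_prior = math.log(n_docs / len(docs))
        log_like = {tok: math.log((weights[label][tok] + alpha) / denom)
                    for tok in idf}
        model[label] = (log_prior, log_like)
    return model


def predict(doc, idf, model):
    # Pick the class with the highest log posterior score.
    vec = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(w * log_like[tok] for tok, w in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus with placeholder tokens, not data from the paper.
train_docs = [
    "awful rude insult", "rude nasty insult",
    "weather report today", "bus schedule today",
    "great kind thanks", "kind lovely thanks",
]
train_labels = ["negative", "negative", "neutral", "neutral",
                "positive", "positive"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)
print(predict("rude insult again", idf, model))
```

Swapping in a different representation (for example Word2Vec or AraBERT embeddings) or a different classifier would change only the vectorization and training functions; the overall fit, transform, and predict flow stays the same.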
Pages: 19