Hate speech detection in the Arabic language: corpus design, construction, and evaluation

Cited by: 1
Authors
Ahmad, Ashraf [1]
Azzeh, Mohammad [2]
Alnagi, Eman [1]
Abu Al-Haija, Qasem [3]
Halabi, Dana [4]
Aref, Abdullah [1]
AbuHour, Yousef [5]
Affiliations
[1] Princess Sumaya Univ Technol PSUT, Dept Comp Sci, Amman, Jordan
[2] Princess Sumaya Univ Technol PSUT, Dept Data Sci, Amman, Jordan
[3] Jordan Univ Sci & Technol, Fac Comp & Informat Technol, Dept Cybersecur, Irbid, Jordan
[4] Luminus Tech Univ Coll LTUC, SAE Inst, Amman, Jordan
[5] Princess Sumaya Univ Technol PSUT, Dept Basic Sci, Amman, Jordan
Source
Keywords
Arabic hate speech; natural language processing (NLP); machine learning; Arabic hate speech detection; Arabic hate speech corpus; social media
DOI
10.3389/frai.2024.1345445
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Hate speech detection in Arabic presents a multifaceted challenge due to the language's broad and diverse linguistic terrain. With its multiple dialects and rich cultural subtleties, Arabic requires particular measures to address online hate speech successfully. To address this issue, academics and developers have used natural language processing (NLP) methods and machine learning algorithms adapted to the complexities of Arabic text. However, many proposed methods were hampered by the lack of a comprehensive dataset/corpus of Arabic hate speech. In this research, we propose a novel multi-class public Arabic dataset comprising 403,688 annotated tweets categorized as extremely positive, positive, neutral, or negative based on the presence of hate speech. Using this dataset, we also characterize the performance of multiple machine learning models for hate speech identification in Arabic Jordanian dialect tweets. Specifically, the Word2Vec, TF-IDF, and AraBERT text representation models were applied to produce word vectors, which supply the classification models with vector representations of the text. Seven machine learning classifiers were then evaluated: Support Vector Machine (SVM), Logistic Regression (LR), Naive Bayes (NB), Random Forest (RF), AdaBoost (Ada), XGBoost (XGB), and CatBoost (CatB). The experimental evaluation revealed that, in this challenging and unstructured setting, our gathered and annotated datasets were rather efficient and generated encouraging assessment outcomes. This will enable academics to delve further into this crucial field of study.
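The abstract names TF-IDF as one of the text representations used to turn tweets into vectors for the classifiers. The following is a minimal sketch of that idea only, not the authors' implementation: the function name `tfidf_vectors`, the whitespace tokenization, the smoothed IDF variant, and the three toy documents are all assumptions made for illustration, and real Arabic tweets would need proper normalization and tokenization first.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings.

    Hypothetical helper for illustration; tokenization is a plain
    whitespace split, which is an assumption, not the paper's method.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed inverse document frequency (one common variant).
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        # Scale to unit length so long and short texts are comparable.
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Toy placeholder documents (not taken from the dataset).
docs = ["good morning friends", "good night", "bad bad words"]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` could then be passed, together with its label, to any of the classifiers listed in the abstract; a token that is repeated within a document, or that is rare across documents, receives a larger weight.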
Pages: 19
Related papers
50 in total
  • [31] Hate speech detection in the Bengali language: a comprehensive survey
    Al Maruf, Abdullah
    Abidin, Ahmad Jainul
    Haque, Md. Mahmudul
    Jiyad, Zakaria Masud
    Golder, Aditi
    Alubady, Raaid
    Aung, Zeyar
    JOURNAL OF BIG DATA, 2024, 11 (01)
  • [32] Building a Rich Arabic Speech and Language Corpus Based on the Holy Quran
    Meftah, Ali
    Seddiq, Yasser
    Alotaibi, Yousef
    Selouani, Sid-Ahmed
    ARABIC LANGUAGE PROCESSING: FROM THEORY TO PRACTICE, 2018, 782 : 90 - 101
  • [33] Construction of Chinese Deceptive Speech Detection Corpus
    Fan, Cheng
    Zhao, Heming
    Chen, Xueqin
    Fan, Xiaohe
    Chen, Shuxi
    PROCEEDINGS OF THE 2015 JOINT INTERNATIONAL MECHANICAL, ELECTRONIC AND INFORMATION TECHNOLOGY CONFERENCE (JIMET 2015), 2015, 10 : 93 - 96
  • [34] A Multilingual Evaluation for Online Hate Speech Detection
    Corazza, Michele
    Menini, Stefano
    Cabrio, Elena
    Tonelli, Sara
    Villata, Serena
    ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2020, 20 (02)
  • [35] Evaluation of an Arabic Speech Corpus of Emotions: A Perceptual and Statistical Analysis
    Meftah, Ali Hamid
    Alotaibi, Yousef Ajami
    Selouani, Sid-Ahmed
    IEEE ACCESS, 2018, 6 : 72845 - 72861
  • [36] Construction and Evaluation of Tamil Speech Emotion Corpus
    P. Vasuki
    B. Sambavi
    Vijesh Joe
    National Academy Science Letters, 2020, 43 : 533 - 536
  • [37] Construction and Evaluation of Tamil Speech Emotion Corpus
    Vasuki, P.
    Sambavi, B.
    Joe, Vijesh
    NATIONAL ACADEMY SCIENCE LETTERS-INDIA, 2020, 43 (06): : 533 - 536
  • [38] The Construction of a New Lexicon Design for Arabic Language
    Bataineh, Bilal
    Bataineh, Emad
    BUSINESS TRANSFORMATION THROUGH INNOVATION AND KNOWLEDGE MANAGEMENT: AN ACADEMIC PERSPECTIVE, VOLS 3 AND 4, 2010, : 2086 - 2096
  • [39] BERT-based Approach to Arabic Hate Speech and Offensive Language Detection in Twitter: Exploiting Emojis and Sentiment Analysis
    Althobaiti, Maha Jarallah
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (05) : 972 - 980
  • [40] Semi-Supervised Self-Learning for Arabic Hate Speech Detection
    Alsafari, Safa
    Sadaoui, Samira
    2021 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC), 2021, : 863 - 868