Token replacement-based data augmentation methods for hate speech detection

Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliation
[1] Victoria University of Wellington, School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data;
DOI
Not available
Abstract
Hate speech detection mostly relies on text data. This data, usually sourced from various social media platforms, is known to be plagued by numerous issues that reduce its quality and hence the quality of the trained models. Among these issues are the lack of diversity and the small size of the class of interest in the dataset, which result in overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation by performing empirical studies that investigate which embedding method(s) provide a robust source of synonyms for the replacement process, which method(s) are effective for selecting the words to be replaced, and how to confirm that the label of each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by the known lack of diversity and small class of interest, significantly improve classification performance and provide insights into token replacement methods.
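To make the idea of token replacement concrete, the following is a minimal sketch and not the authors' method. It assumes a small hand-written neighbour table (`TOY_NEIGHBOURS`, hypothetical) in place of embedding-based synonym lookup, selects the words to replace at random, and simply copies the original label onto each synthetic sample. The abstract names each of these three choices as an open question that the study investigates empirically. The function names `replace_tokens` and `augment` are illustrative only.

```python
import random

# Hypothetical toy table standing in for embedding-based nearest neighbours.
# A real pipeline would query an embedding model for synonyms instead.
TOY_NEIGHBOURS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks", "persons"],
    "post": ["message", "comment"],
}


def replace_tokens(text, neighbours, rate=0.3, rng=None):
    """Return a copy of `text` with some replaceable tokens swapped for neighbours."""
    rng = rng or random.Random()
    tokens = text.split()
    # Only tokens that have at least one known neighbour can be replaced.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in neighbours]
    if not candidates:
        return text
    # Assumption: positions are chosen uniformly at random, at least one per call.
    k = min(len(candidates), max(1, round(rate * len(candidates))))
    for i in rng.sample(candidates, k):
        tokens[i] = rng.choice(neighbours[tokens[i].lower()])
    return " ".join(tokens)


def augment(samples, neighbours, copies=2, seed=0):
    """Create up to `copies` synthetic variants of each (text, label) pair."""
    rng = random.Random(seed)
    synthetic = []
    for text, label in samples:
        for _ in range(copies):
            new_text = replace_tokens(text, neighbours, rng=rng)
            if new_text != text:
                # Assumption: the label is carried over unchanged. Whether a
                # replacement really preserves the label needs separate checking.
                synthetic.append((new_text, label))
    return synthetic
```

Calling `augment([("those people wrote an awful post", 1)], TOY_NEIGHBOURS)` yields synthetic pairs with the same label and the same number of tokens as the original. Texts with no replaceable token produce no synthetic samples.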
Pages: 1129-1150
Page count: 21