Token replacement-based data augmentation methods for hate speech detection

Cited by: 0
Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliations
[1] Victoria University of Wellington,School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data;
DOI
Not available
Abstract
Hate speech detection relies mostly on text data. This data, usually sourced from various social media platforms, is known to suffer from numerous issues that reduce its quality and, in turn, the quality of the trained models. Among these issues are a lack of diversity and a small class of interest in the dataset, which lead to overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are a robust source of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a small class of interest, significantly improve classification performance and provide insights into token replacement methods.
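As a rough illustration of the token replacement idea described in the abstract, the following Python sketch swaps selected tokens for their nearest neighbor in an embedding space. It is not the authors' method: the embedding table `EMBEDDINGS`, the stopword list, the `min_sim` threshold, and the random selection strategy are all hypothetical choices made for this example. It also does not check label preservation, which in practice would require an extra step such as re-scoring the augmented sample with a classifier.

```python
import math
import random

# Toy embedding table with made-up values (an assumption for illustration;
# a real setup would load pretrained vectors instead).
EMBEDDINGS = {
    "movie": [0.9, 0.1, 0.0],
    "film": [0.88, 0.12, 0.02],
    "great": [0.1, 0.9, 0.1],
    "excellent": [0.12, 0.88, 0.08],
    "awful": [0.1, -0.9, 0.1],
    "the": [0.0, 0.0, 1.0],
    "was": [0.05, 0.0, 0.95],
}
# Hypothetical stopword list: these tokens are never replaced or proposed.
STOPWORDS = {"the", "was"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(token, min_sim=0.8):
    """Return the most similar non-stopword token above min_sim, or None."""
    if token not in EMBEDDINGS:
        return None
    best, best_sim = None, min_sim
    for candidate, vector in EMBEDDINGS.items():
        if candidate == token or candidate in STOPWORDS:
            continue
        sim = cosine(EMBEDDINGS[token], vector)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best


def augment(tokens, replace_prob=0.5, rng=None):
    """Replace each non-stopword token with probability replace_prob."""
    rng = rng or random.Random(0)
    augmented = []
    for token in tokens:
        replacement = None
        if token not in STOPWORDS and rng.random() < replace_prob:
            replacement = nearest_neighbor(token)
        # Keep the original token when no suitable neighbor exists.
        augmented.append(replacement if replacement is not None else token)
    return augmented
```

Calling `augment(["the", "movie", "was", "great"], replace_prob=1.0)` replaces every eligible token, while the similarity threshold leaves a word such as "awful" untouched because none of its neighbors in this toy table is close enough. Random selection is only one possible strategy; the choice of which words to replace is one of the questions the study investigates.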
Pages: 1129-1150
Number of pages: 21