Token replacement-based data augmentation methods for hate speech detection

Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliation
[1] Victoria University of Wellington, School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data;
DOI
Not available
Abstract
Hate speech detection relies mostly on text data. These data, usually sourced from social media platforms, are known to suffer from several issues that reduce their quality and, in turn, the quality of the trained models. Two such issues are a lack of diversity and a very small class of interest, which lead to overfitted models that do not generalize well to other or newly collected data. These issues can be handled by augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. Few extensive studies examine how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are robust sources of synonyms for the replacement process, which methods are effective for selecting the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a small class of interest, significantly improve classification performance and provide insights into token replacement methods.
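As a rough illustration of the token replacement idea named in the abstract, the following minimal Python sketch swaps selected words for synonyms. It is not the authors' method: the function name `replace_tokens`, the `rate` and `seed` parameters, and the hand-written `TOY_SYNONYMS` table are all assumptions made for this example. In practice the synonym candidates would presumably come from nearest neighbours in an embedding space, and the word selection step could be more informed than the random choice used here.

```python
import random

# Hypothetical toy synonym table (an assumption for this sketch); it stands in
# for synonym candidates that would normally be drawn from an embedding model.
TOY_SYNONYMS = {
    "awful": ["terrible", "dreadful"],
    "people": ["folks", "persons"],
}


def replace_tokens(text, synonyms, rate=0.5, seed=0):
    """Return a copy of `text` in which some tokens are replaced by synonyms.

    A token is eligible only if it has an entry in `synonyms`; each eligible
    token is replaced with probability `rate`. Tokens are split on whitespace.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    augmented = []
    for token in text.split():
        candidates = synonyms.get(token.lower())
        if candidates and rng.random() < rate:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)
    return " ".join(augmented)


original = "those people are awful"
augmented = replace_tokens(original, TOY_SYNONYMS, rate=1.0)
```

The augmented sentence keeps the token count of the original and would carry the original sample's class label. Whether that label really is preserved after replacement is one of the questions the abstract raises, and this sketch does not address it.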
Pages: 1129-1150
Number of pages: 21