Token replacement-based data augmentation methods for hate speech detection

Cited by: 0
Authors
Kosisochukwu Judith Madukwe
Xiaoying Gao
Bing Xue
Affiliation
[1] Victoria University of Wellington, School of Engineering and Computer Science
Source
World Wide Web | 2022 / Vol. 25
Keywords
Hate speech data; Data augmentation; Token substitution; Word replacement; Data generation; Text data
DOI
Not available
Abstract
Hate speech detection relies mostly on text data. These data, usually sourced from social media platforms, are known to suffer from numerous issues that reduce their quality and, in turn, the quality of the models trained on them. Two such issues are a lack of diversity and a diminutive class of interest in the dataset, which lead to overfitted models that do not generalize well to other or newly collected data. Ways of handling these issues include augmenting the data with diverse samples, engineering non-redundant features, and designing robust classification models. This study focuses on data augmentation, a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. There is a lack of extensive studies on how hate speech texts respond to different textual data augmentation techniques. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding methods are a robust source of synonyms for the replacement process, which methods can effectively select the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets known to suffer from a lack of diversity and a diminutive class of interest, significantly improve classification performance and provide insights into token replacement methods.
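To make the idea of embedding-based token replacement concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tiny hand-written embedding table (`EMBEDDINGS`, with made-up vectors), random selection of the words to replace, and a cosine-similarity threshold as a crude guard against substitutions that drift in meaning. The function names `nearest_neighbor` and `augment` and all parameter values are hypothetical; a real study would load pretrained embeddings and compare several word-selection strategies.

```python
import math
import random

# Toy 3-dimensional "embeddings". The values are invented for illustration;
# a real pipeline would load pretrained vectors instead.
EMBEDDINGS = {
    "awful": [0.9, 0.1, 0.0],
    "terrible": [0.88, 0.15, 0.02],
    "people": [0.1, 0.9, 0.1],
    "folks": [0.12, 0.88, 0.15],
    "great": [-0.8, 0.1, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, min_similarity=0.9):
    """Return the most similar other vocabulary word, or None if the word is
    out of vocabulary or no candidate reaches min_similarity (assumed value)."""
    if word not in EMBEDDINGS:
        return None
    best, best_sim = None, min_similarity
    for candidate, vector in EMBEDDINGS.items():
        if candidate == word:
            continue
        sim = cosine(EMBEDDINGS[word], vector)
        if sim >= best_sim:
            best, best_sim = candidate, sim
    return best


def augment(tokens, replace_prob=0.5, seed=0):
    """Create one synthetic sample by replacing some tokens with their nearest
    embedding neighbor. Words are selected at random here; this is only one of
    many possible selection strategies."""
    rng = random.Random(seed)
    augmented = []
    for token in tokens:
        substitute = nearest_neighbor(token.lower())
        if substitute is not None and rng.random() < replace_prob:
            augmented.append(substitute)
        else:
            augmented.append(token)
    return augmented


if __name__ == "__main__":
    sample = ["those", "people", "are", "awful"]
    # With replace_prob=1.0 every replaceable token is substituted.
    print(augment(sample, replace_prob=1.0))
    # prints ['those', 'folks', 'are', 'terrible']
```

The synthetic sample inherits the label of its source text, so whether the substitution preserves that label has to be checked separately; the similarity threshold in this sketch only limits how far a replacement can drift and does not verify the label.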
Pages: 1129-1150
Number of pages: 21
Related papers
50 items in total
  • [1] Token replacement-based data augmentation methods for hate speech detection
    Madukwe, Kosisochukwu Judith
    Gao, Xiaoying
    Xue, Bing
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2022, 25 (03) : 1129 - 1150
  • [2] Impact of Data Augmentation on Hate Speech Detection
    Batarfi, Hanan A.
    Alsaedi, Olaa A.
    Wali, Arwa M.
    Jamal, Amani T.
    INNOVATIONS FOR COMMUNITY SERVICES, I4CS 2023, 2023, 1876 : 187 - 199
  • [3] Data Augmentation for Improving Explainability of Hate Speech Detection
    Gunjan Ansari
    Parmeet Kaur
    Chandni Saxena
    Arabian Journal for Science and Engineering, 2024, 49 : 3609 - 3621
  • [4] Data Augmentation for Improving Explainability of Hate Speech Detection
    Ansari, Gunjan
    Kaur, Parmeet
    Saxena, Chandni
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2024, 49 (03) : 3609 - 3621
  • [5] Exploring Data Augmentation Strategies for Hate Speech Detection in Roman Urdu
    Azam, Ubaid
    Rizwan, Hammad
    Karim, Asim
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022 : 4523 - 4531
  • [6] Application of Data Augmentation Techniques for Hate Speech Detection with Deep Learning
    Venturott, Ligia Iunes
    Ciarelli, Patrick Marques
    PROGRESS IN ARTIFICIAL INTELLIGENCE (EPIA 2021), 2021, 12981 : 778 - 787
  • [7] Data-Efficient Methods For Improving Hate Speech Detection
    Roychowdhury, Sumegh
    Gupta, Vikram
    17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023 : 125 - 132
  • [8] An approach of data augmentation to improve the performance of BERTology models for Vietnamese hate speech detection
    Luu, Son T.
    Van Nguyen, Kiet
    Nguyen, Ngan Luu-Thuy
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (19) : 56763 - 56783
  • [9] Exploring Conditional Language Model Based Data Augmentation Approaches for Hate Speech Classification
    D'Sa, Ashwin Geet
    Illina, Irina
    Fohr, Dominique
    Klakow, Dietrich
    Ruiter, Dana
    TEXT, SPEECH, AND DIALOGUE, TSD 2021, 2021, 12848 : 135 - 146
  • [10] Hate Speech Detection in Twitter using Transformer Methods
    Mutanga, Raymond T.
    Naicker, Nalindren
    Olugbara, Oludayo O.
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (09) : 614 - 620