Challenges of Hate Speech Detection in Social Media: Data Scarcity, and Leveraging External Resources

Authors
Kovács G. [1]
Alonso P. [1]
Saini R. [1]
Affiliations
[1] Luleå University of Technology, Aurorum 1, Luleå
Keywords
BERT; Deep language processing; Hate speech; Transfer learning; Vocabulary augmentation
DOI
10.1007/s42979-021-00457-3
Abstract
The detection of hate speech in social media is a crucial task. The uncontrolled spread of hate has the potential to gravely damage our society and severely harm marginalized people or groups. A major arena for spreading hate speech online is social media. This significantly contributes to the difficulty of automatic detection, as social media posts include paralinguistic signals (e.g. emoticons and hashtags), and their linguistic content contains plenty of poorly written text. Another difficulty is presented by the context-dependent nature of the task and the lack of consensus on what constitutes hate speech, which makes the task difficult even for humans. This in turn makes the creation of large labeled corpora difficult and resource-consuming. The problem posed by ungrammatical text has been largely mitigated by the recent emergence of deep neural network (DNN) architectures that have the capacity to efficiently learn various features. For this reason, we proposed a deep natural language processing (NLP) model, combining convolutional and recurrent layers, for the automatic detection of hate speech in social media data. We applied our model to the HASOC2019 corpus and attained a macro F1 score of 0.63 in hate speech detection on the HASOC test set. The capacity of DNNs for efficient learning, however, also means an increased risk of overfitting, particularly when limited training data is available (as was the case for HASOC). For this reason, we investigated different methods for expanding the resources used. We explored various opportunities, such as leveraging unlabeled data and similarly labeled corpora, as well as the use of novel models. Our results showed that by doing so, it was possible to significantly increase the classification score attained. © 2021, The Author(s).
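The abstract reports its result as a macro F1 score. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition (the unweighted mean of per-class F1 scores). It is not the authors' evaluation script; the function name `macro_f1` and the label strings `"HATE"` and `"NONE"` are hypothetical placeholders chosen for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one example is required")

    # Count true positives, false positives and false negatives per class.
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    # Per-class F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 if the
    # denominator is zero.
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels, for illustration only.
gold = ["HATE", "HATE", "NONE", "NONE"]
pred = ["HATE", "NONE", "NONE", "NONE"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the average, a classifier that neglects the minority class is penalized more heavily than it would be under plain accuracy, which is why macro F1 is a common choice for imbalanced tasks such as hate speech detection.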