COVID-HateBERT: a Pre-trained Language Model for COVID-19 related Hate Speech Detection

Cited by: 9
Authors
Li, Mingqi [1]
Liao, Song [1]
Okpala, Ebuka [1]
Tong, Max [1,4]
Costello, Matthew [2]
Cheng, Long [1]
Hu, Hongxin [3]
Luo, Feng [1]
Affiliations
[1] Clemson Univ, Sch Comp, Clemson, SC 29631 USA
[2] Clemson Univ, Dept Sociol, Clemson, SC 29631 USA
[3] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY USA
[4] Christ Church Episcopal Sch, Greenville, SC USA
Keywords
hate speech detection; language model; COVID-19; BERT;
DOI
10.1109/ICMLA52953.2021.00043
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405
Abstract
With the dramatic growth of hate speech on social media during the COVID-19 pandemic, there is an urgent need to detect such hate speech effectively. Existing methods achieve high performance only when the training and testing data come from the same distribution; models trained on traditional hate speech datasets do not transfer well to COVID-19 related data. Meanwhile, manually annotating hate speech datasets for supervised learning is time-consuming. To address this problem, we propose COVID-HateBERT, a pre-trained language model for detecting hate speech in English tweets. We collect 200M English tweets based on COVID-19 related hateful keywords and hashtags, then use a classifier to extract 1.27M potentially hateful tweets and re-train BERT-base on them. We evaluate COVID-HateBERT on four benchmark datasets. COVID-HateBERT achieves a 14.8%-23.8% higher macro-average F1 score on traditional hate speech detection compared to baseline methods, and a 2.6%-6.73% higher macro-average F1 score on COVID-19 related hate speech detection compared to classifiers using BERT and BERTweet, which shows that COVID-HateBERT generalizes well across different datasets.
Pages: 233-238 (6 pages)
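The two-stage corpus construction described in the abstract (collecting tweets via COVID-19 related hateful keywords and hashtags, then keeping only those a classifier scores as potentially hateful) can be sketched roughly as below. This is a minimal illustration, not the authors' pipeline: the keyword list, sample tweets, and stand-in scoring function are hypothetical placeholders.

```python
# Minimal sketch of a two-stage corpus filter:
# stage 1 - keyword/hashtag match, stage 2 - classifier score threshold.
# Keywords, tweets, and the scoring function are hypothetical placeholders.

def keyword_match(text, keywords):
    """True if the tweet contains any tracked keyword or hashtag."""
    lowered = text.lower()
    return any(kw in lowered for kw in keywords)

def filter_corpus(tweets, keywords, score_fn, threshold=0.5):
    """Stage 1: keyword filter; stage 2: keep tweets scoring >= threshold."""
    matched = (t for t in tweets if keyword_match(t, keywords))
    return [t for t in matched if score_fn(t) >= threshold]

# Hypothetical usage with placeholder data:
keywords = ["#placeholderhashtag", "placeholderterm"]
tweets = [
    "Stay safe everyone, wash your hands",
    "An example tweet containing #PlaceholderHashtag",
]
# Stand-in for a trained hate-speech classifier's probability output:
stand_in_score = lambda t: 0.9 if "#placeholderhashtag" in t.lower() else 0.1

print(filter_corpus(tweets, keywords, stand_in_score))
```

In the paper's setting, the stand-in score would be replaced by a trained hate-speech classifier, and the surviving tweets would form the corpus for continued pre-training of BERT-base.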