COVID-HateBERT: a Pre-trained Language Model for COVID-19 related Hate Speech Detection

Cited by: 9

Authors:

Li, Mingqi ^{[1]}

Liao, Song ^{[1]}

Okpala, Ebuka ^{[1]}

Tong, Max ^{[1,4]}

Costello, Matthew ^{[2]}

Cheng, Long ^{[1]}

Hu, Hongxin ^{[3]}

Luo, Feng ^{[1]}

Affiliations:

[1] Clemson Univ, Sch Comp, Clemson, SC 29631 USA

[2] Clemson Univ, Dept Sociol, Clemson, SC 29631 USA

[3] Univ Buffalo, Dept Comp Sci & Engn, Buffalo, NY USA

[4] Christ Church Episcopal Sch, Greenville, SC USA

Source:

20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021) | 2021

Keywords:

hate speech detection; language model; COVID-19; BERT;

DOI:

10.1109/ICMLA52953.2021.00043

Chinese Library Classification number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the dramatic growth of hate speech on social media during the COVID-19 pandemic, there is an urgent need to detect various forms of hate speech effectively. Existing methods achieve high performance only when the training and testing data come from the same distribution. Models trained on traditional hate speech datasets do not fit COVID-19 related datasets well. Meanwhile, manually annotating hate speech datasets for supervised learning is time-consuming. To address this problem, we propose COVID-HateBERT, a pre-trained language model for detecting hate speech in English tweets. We collect 200M English tweets based on COVID-19 related hateful keywords and hashtags. Then, we use a classifier to extract 1.27M potentially hateful tweets, which we use to re-train BERT-base. We evaluate COVID-HateBERT on four benchmark datasets. COVID-HateBERT achieves a 14.8%-23.8% higher macro-average F1 score on traditional hate speech detection compared with baseline methods, and a 2.6%-6.73% higher macro-average F1 score on COVID-19 related hate speech detection compared with classifiers using BERT and BERTweet, which shows that COVID-HateBERT generalizes well across different datasets.
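The abstract reports its results as macro-average F1 scores. As a reading aid, the following is a minimal sketch of how that metric is commonly defined: F1 is computed separately for each class and the per-class values are averaged with equal weight, so a rare "hateful" class counts as much as the majority class. The function name `macro_f1` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        # Count true positives, false positives, false negatives for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        f1_scores.append(f1)
    # Every class contributes equally, regardless of how many examples it has.
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical): 1 = hateful, 0 = not hateful.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)
```

For real evaluations, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.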

Citation

Pages: 233-238

Page count: 6

50 items in total

[1] Comparing Pre-Trained Language Model for Arabic Hate Speech Detection
Daouadi, Kheir Eddine
Boualleg, Yaakoub
Guehairia, Oussama
COMPUTACION Y SISTEMAS, 2024, 28 (02): 681 - 693
[2] Comparing pre-trained language models for Spanish hate speech detection
Miriam Plaza-del-Arco, Flor
Dolores Molina-Gonzalez, M.
Alfonso Urena-Lopez, L.
Teresa Martin-Valdivia, M.
EXPERT SYSTEMS WITH APPLICATIONS, 2021, 166
[3] Asian hate speech detection on Twitter during COVID-19
Toliyat, Amir
Levitan, Sarah Ita
Peng, Zheng
Etemadpour, Ronak
FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2022, 5
[4] Blockchain-Based Trusted Federated Learning with Pre-Trained Models for COVID-19 Detection
Bian, Genqing
Qu, Wenjing
Shao, Bilin
ELECTRONICS, 2023, 12 (09)
[5] Predicting the Hate: A GSTM Model based on COVID-19 Hate Speech Datasets
Wu, Xiao-Kun
Zhao, Tian-Fang
Lu, Lu
Chen, Wei-Neng
INFORMATION PROCESSING & MANAGEMENT, 2022, 59 (04)
[6] COVID-19 outbreak: An ensemble pre-trained deep learning model for detecting informative tweets
Malla, SreeJagadeesh
Alphonse, P. J. A.
APPLIED SOFT COMPUTING, 2021, 107
[8] Leveraging Pre-trained Language Model for Speech Sentiment Analysis
Shon, Suwon
Brusco, Pablo
Pan, Jing
Han, Kyu J.
Watanabe, Shinji
INTERSPEECH 2021, 2021: 3420 - 3424
[9] SPEECHCLIP: INTEGRATING SPEECH WITH PRE-TRAINED VISION AND LANGUAGE MODEL
Shih, Yi-Jen
Wang, Hsuan-Fu
Chang, Heng-Jui
Berry, Layne
Lee, Hung-yi
Harwath, David
2022 IEEE SPOKEN LANGUAGE TECHNOLOGY WORKSHOP, SLT, 2022: 715 - 722
[10] COVID-19 Detection on X-Ray Images using a Combining Mechanism of Pre-trained CNNs
El Gannour, Oussama
Hamida, Soufiane
Saleh, Shawki
Lamalem, Yasser
Cherradi, Bouchaib
Raihani, Abdelhadi
INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (06): 564 - 570
