Chinese Cyber Threat Intelligence Named Entity Recognition via RoBERTa-wwm-RDCNN-CRF

Cited by: 3

Authors:

Zhen, Zhen [1]

Gao, Jian [1,2]

Affiliations:

[1] Peoples Publ Secur Univ China, Sch Informat Network Secur, Beijing 100038, Peoples R China

[2] Minist Publ Secur, Key Lab Safety Precaut & Risk Assessment, Beijing, Peoples R China

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2023 / Vol. 77 / No. 01

Keywords:

Cybersecurity; cyber threat intelligence; named entity recognition;

DOI:

10.32604/cmc.2023.042090

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

In recent years, cyber attacks have intensified, causing great harm to individuals, companies, and countries. Mining cyber threat intelligence (CTI) facilitates intelligence integration and helps combat cyber attacks. Named entity recognition (NER), a crucial component of text mining, can structure complex CTI text and help cybersecurity professionals counter threats effectively. However, current CTI NER research has focused mainly on English CTI, and in the limited studies conducted on Chinese text, existing models have performed poorly. To fully exploit Chinese pre-trained language models (PLMs) and to address the problem of long, infrequent English words mixed into Chinese CTI, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF), built on a robustly optimized bidirectional encoder representation from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open source dataset, achieving an F1-score of 82.35%, which exceeds the common baseline in this field, bidirectional encoder representation from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and exceeds the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder to verify the effectiveness of the proposed model. The RoBERTa-wwm-RDCNN-CRF model, together with the shared pre-processing and augmentation methods, can serve subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to downstream applications such as intrusion detection and advanced persistent threat (APT) attack detection.
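
To make the "residual dilated convolution" idea in the abstract concrete, here is a minimal sketch in plain Python. It is an illustration only and not the authors' implementation: the abstract gives no kernel size, dilation rates, or layer count, so the values used here (kernel size 3, dilations 1, 2, 4) are assumptions, as are the function names and the choice to add the skip connection after a ReLU. It operates on a sequence of scalars rather than on RoBERTa-wwm embeddings, purely to show how stacking dilated layers widens the span of tokens each position can see, which is plausibly why such an encoder suits long English words embedded in Chinese text.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Same-length 1D convolution with zero padding and the given dilation."""
    center = len(kernel) // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(x):              # positions outside x count as zero
                total += w * x[idx]
        out.append(total)
    return out


def residual_dilated_block(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Hypothetical block layout: output = x + ReLU(dilated_conv(x))."""
    conv = dilated_conv1d(x, kernel, dilation)
    activated = [max(0.0, v) for v in conv]
    return [a + b for a, b in zip(x, activated)]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions that can influence one output position."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Push a single impulse through three stacked blocks (assumed dilations 1, 2, 4)
# and count how many positions it reaches.
signal = [0.0] * 31
signal[15] = 1.0
for d in (1, 2, 4):
    signal = residual_dilated_block(signal, [1.0, 1.0, 1.0], d)
reached = sum(1 for v in signal if v != 0.0)
```

With these assumed settings, three layers cover 15 positions, whereas three undilated layers of the same kernel size would cover only 7; the skip connection keeps the original input available at every depth.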

Pages: 299-323

Page count: 25
