Chinese Cyber Threat Intelligence Named Entity Recognition via RoBERTa-wwm-RDCNN-CRF

Cited by: 3
Authors
Zhen, Zhen [1]
Gao, Jian [1, 2]
Affiliations
[1] Peoples Publ Secur Univ China, Sch Informat Network Secur, Beijing 100038, Peoples R China
[2] Minist Publ Secur, Key Lab Safety Precaut & Risk Assessment, Beijing, Peoples R China
Source
CMC-COMPUTERS MATERIALS & CONTINUA | 2023, Vol. 77, No. 1
Keywords
Cybersecurity; cyber threat intelligence; named entity recognition;
DOI
10.32604/cmc.2023.042090
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
In recent years, cyber attacks have intensified, causing great harm to individuals, companies, and countries. Mining cyber threat intelligence (CTI) can facilitate intelligence integration and help combat cyber attacks. Named entity recognition (NER), a crucial component of text mining, can structure complex CTI text and help cybersecurity professionals counter threats effectively. However, current CTI NER research has focused mainly on English CTI, and in the limited studies conducted on Chinese text, existing models have performed poorly. To fully utilize the power of Chinese pre-trained language models (PLMs) and address the problem of long, infrequent English words mixed into Chinese CTI text, we propose a residual dilated convolutional neural network (RDCNN) with a conditional random field (CRF), built on a robustly optimized bidirectional encoder representations from transformers pre-training approach with whole word masking (RoBERTa-wwm), abbreviated as RoBERTa-wwm-RDCNN-CRF. We are the first to experiment on the relevant open-source dataset, achieving an F1-score of 82.35%, which exceeds the common baseline in this field, bidirectional encoder representations from transformers (BERT)-bidirectional long short-term memory (BiLSTM)-CRF, by about 19.52% and the current state-of-the-art model, BERT-RDCNN-CRF, by about 3.53%. In addition, we conducted an ablation study on the encoder part of the model and an in-depth investigation of the PLMs and the encoder to verify the effectiveness of the proposed model.
The RoBERTa-wwm-RDCNN-CRF model, the shared pre-processing, and augmentation methods can serve the subsequent fundamental tasks such as cybersecurity information extraction and knowledge graph construction, contributing to important applications in downstream tasks such as intrusion detection and advanced persistent threat (APT) attack detection.
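As a rough illustration of the residual dilated convolution idea named in the abstract, the following minimal sketch applies a dilated 1D convolution over a sequence and adds the input back through a skip connection. It is not the authors' implementation: the abstract does not state the kernel size, dilation rates, or layer count, so the function names (`dilated_conv1d`, `residual_dilated_block`, `receptive_field`), the kernel, and the dilation values below are all hypothetical, and scalars stand in for the token embeddings a PLM would produce.

```python
from typing import List


def dilated_conv1d(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Same-length 1D dilated convolution with zero padding (odd kernel size assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * dilation  # taps are spaced `dilation` positions apart
            if 0 <= idx < len(x):            # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_dilated_block(x: List[float], kernel: List[float], dilation: int) -> List[float]:
    """Dilated convolution followed by ReLU, then added back to the input (skip connection)."""
    activated = [max(0.0, v) for v in dilated_conv1d(x, kernel, dilation)]
    return [a + b for a, b in zip(x, activated)]


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions seen by one output position after stacking the layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


if __name__ == "__main__":
    # Hypothetical toy input: a single spike in the middle of the sequence.
    print(residual_dilated_block([0, 0, 0, 1, 0, 0, 0], [1, 1, 1], dilation=2))
    # Hypothetical dilation schedule; the context window grows quickly with depth.
    print(receptive_field(3, [1, 2, 4]))
```

Stacking layers with increasing dilation widens the context each position sees without adding parameters per layer, which is one plausible reason such an encoder could cope with long English tokens embedded in Chinese text; the skip connection keeps the original representation available to later layers.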
Pages: 299-323
Number of pages: 25