Rethinking Textual Adversarial Defense for Pre-Trained Language Models

Cited by: 8
Authors
Wang, Jiayi [1,2,3]
Bao, Rongzhou [4]
Zhang, Zhuosheng [1,2,3]
Zhao, Hai [1,2,3]
Affiliations
[1] Shanghai Jiao Tong Univ, Dept Comp Sci & Engn, Shanghai 200240, Peoples R China
[2] Shanghai Jiao Tong Univ, Key Lab Shanghai Educ Commiss Intelligent Interac, Shanghai 200240, Peoples R China
[3] Shanghai Jiao Tong Univ, AI Inst, MoE Key Lab Artificial Intelligence, Shanghai 200240, Peoples R China
[4] Ant Grp, Hangzhou 310000, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Detectors; Perturbation methods; Robustness; Speech processing; Adaptation models; Predictive models; Computer vision; Adversarial attack; adversarial defense; pre-trained language models; ATTACKS;
DOI
10.1109/TASLP.2022.3192097
Chinese Library Classification (CLC)
O42 [Acoustics]
Subject Classification Codes
070206; 082403
Abstract
Although pre-trained language models (PrLMs) have achieved significant success, recent studies demonstrate that PrLMs are vulnerable to adversarial attacks. By generating adversarial examples with slight perturbations at different levels (sentence / word / character), adversarial attacks can fool PrLMs into producing incorrect predictions, which calls the robustness of PrLMs into question. However, we find that most existing textual adversarial examples are unnatural and can be easily distinguished by both humans and machines. Based on a general anomaly detector, we propose a novel metric (Degree of Anomaly) as a constraint that forces current adversarial attack approaches to generate more natural and imperceptible adversarial examples. Under this new constraint, the success rate of existing attacks drops drastically, which reveals that the robustness of PrLMs is not as fragile as previously claimed. In addition, we find that four types of randomization can invalidate a large portion of textual adversarial examples. Based on the anomaly detector and randomization, we design a universal defense framework, which is among the first to perform textual adversarial defense without knowledge of the specific attack. Empirical results show that our universal defense framework achieves after-attack accuracy comparable to or even higher than that of attack-specific defenses, while preserving higher accuracy on the original inputs. Our work discloses the essence of textual adversarial attacks and indicates that (i) future work on adversarial attacks should focus more on how to evade detection and resist randomization, since otherwise their adversarial examples will be easily detected and invalidated; and (ii) compared with unnatural and perceptible adversarial examples, it is the undetectable ones that pose real risks for PrLMs and deserve more attention in future robustness-enhancing strategies.
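The abstract combines an anomaly detector with input randomization and voting; the sketch below illustrates how those two ingredients could fit together. It is a minimal sketch, not the authors' released code: the detector, the synonym-substitution randomization (one plausible choice among the four randomization types the abstract mentions), and every function name and threshold here are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of a detect-then-randomize defense:
# score the input with an anomaly detector, and if it looks suspicious,
# classify several randomized copies and take a majority vote.
import random
from collections import Counter
from typing import Callable, Dict, List


def anomaly_score(text: str, detector: Callable[[str], float]) -> float:
    """Degree of Anomaly: the detector's estimate that `text` is a
    machine-perturbed (adversarial) input rather than natural text."""
    return detector(text)


def randomize(text: str, synonyms: Dict[str, List[str]], p: float = 0.2) -> str:
    """One illustrative randomization: with probability p, swap each word
    for a random synonym, which tends to break brittle adversarial edits."""
    return " ".join(
        random.choice(synonyms[w]) if w in synonyms and random.random() < p else w
        for w in text.split()
    )


def defended_predict(text: str,
                     classify: Callable[[str], int],
                     detector: Callable[[str], float],
                     synonyms: Dict[str, List[str]],
                     votes: int = 8,
                     tau: float = 0.5) -> int:
    """If the input looks natural, classify it directly; otherwise vote
    over several randomized copies to wash out the perturbation."""
    if anomaly_score(text, detector) < tau:
        return classify(text)
    preds = [classify(randomize(text, synonyms)) for _ in range(votes)]
    return Counter(preds).most_common(1)[0][0]
```

The same detector doubles as the attacker-side constraint the abstract describes: a candidate adversarial example whose Degree of Anomaly exceeds the threshold is rejected, which is what drives the reported drop in attack success rates.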
Pages: 2526-2540
Number of pages: 15
Related Papers
50 items in total (showing [21]-[30])
  • [21] Using Pre-trained Language Models to Resolve Textual and Semantic Merge Conflicts (Experience Paper)
    Zhang, Jialu
    Mytkowicz, Todd
    Kaufman, Mike
    Piskac, Ruzica
    Lahiri, Shuvendu K.
    PROCEEDINGS OF THE 31ST ACM SIGSOFT INTERNATIONAL SYMPOSIUM ON SOFTWARE TESTING AND ANALYSIS, ISSTA 2022, 2022, : 77 - 88
  • [22] G-Tuning: Improving Generalization of Pre-trained Language Models with Generative Adversarial Network
    Weng, Rongxiang
    Cheng, Wensen
    Zhang, Min
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 4747 - 4755
  • [23] VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
    Yin, Ziyi
    Ye, Muchao
    Zhang, Tianrong
    Du, Tianyu
    Zhu, Jinguo
    Liu, Han
    Chen, Jinghui
    Wang, Ting
    Ma, Fenglong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [24] A Study of Pre-trained Language Models in Natural Language Processing
    Duan, Jiajia
    Zhao, Hui
    Zhou, Qian
    Qiu, Meikang
    Liu, Meiqin
    2020 IEEE INTERNATIONAL CONFERENCE ON SMART CLOUD (SMARTCLOUD 2020), 2020, : 116 - 121
  • [25] How Should Pre-Trained Language Models Be Fine-Tuned Towards Adversarial Robustness?
    Dong, Xinshuai
    Luu Anh Tuan
    Lin, Min
    Yan, Shuicheng
    Zhang, Hanwang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [26] From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Models to Pre-trained Machine Reader
    Xu, Weiwen
    Li, Xin
    Zhang, Wenxuan
    Zhou, Meng
    Lam, Wai
    Si, Luo
    Bing, Lidong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [27] Pre-trained models for natural language processing: A survey
    Qiu XiPeng
    Sun TianXiang
    Xu YiGe
    Shao YunFan
    Dai Ning
    Huang XuanJing
    SCIENCE CHINA-TECHNOLOGICAL SCIENCES, 2020, 63 (10) : 1872 - 1897
  • [28] Probing Pre-Trained Language Models for Disease Knowledge
    Alghanmi, Israa
    Espinosa-Anke, Luis
    Schockaert, Steven
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021, 2021, : 3023 - 3033
  • [29] Analyzing Individual Neurons in Pre-trained Language Models
    Durrani, Nadir
    Sajjad, Hassan
    Dalvi, Fahim
    Belinkov, Yonatan
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 4865 - 4880
  • [30] Emotional Paraphrasing Using Pre-trained Language Models
    Casas, Jacky
    Torche, Samuel
    Daher, Karl
    Mugellini, Elena
    Abou Khaled, Omar
    2021 9TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW), 2021,