Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key R&D Program of China;
DOI
10.1162/tacl_a_00622
CLC Classification Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser that continually pre-trains the backdoored PTM with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from the PTM, while continual pre-training maintains its normal functionalities. We conduct extensive experiments on pre-trained models across different modalities and architectures. The results show that our method effectively removes backdoors inside PTMs and preserves their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain text. On the backdoored BERT, our defense reduces the average attack success rate across three downstream datasets from 99.88% to 8.10%. The code is publicly available at https://github.com/thunlp/RECIPE.
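The abstract describes the defense only at a high level. As a rough illustration of what "continual pre-training with a regularization term" can look like, the following PyTorch/transformers sketch continually pre-trains a possibly backdoored BERT with a masked-language-modeling loss on unlabeled auxiliary text plus a simple L2 weight penalty. The checkpoint name, the regularization weight reg_lambda, and the L2 form of the penalty are illustrative assumptions rather than the paper's actual regularizer; the authors' implementation is available at https://github.com/thunlp/RECIPE.

```python
# Minimal sketch of regularized continual pre-training (illustration only;
# the L2 weight penalty is a stand-in, not necessarily the paper's exact
# regularizer -- see https://github.com/thunlp/RECIPE for the authors' code).
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"  # stands in for a possibly backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Downstream-task-irrelevant auxiliary data: unlabeled plain text.
texts = ["replace with unlabeled plain-text sentences ..."]
enc = tokenizer(texts, truncation=True, max_length=128)
examples = [{k: v[i] for k, v in enc.items()} for i in range(len(texts))]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(examples, batch_size=8, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
reg_lambda = 1e-3  # hypothetical weight balancing the two loss terms

model.train()
for batch in loader:
    # Continual pre-training loss (MLM) preserves normal functionality.
    mlm_loss = model(**batch).loss
    # Regularization term meant to erode behavior (e.g., the backdoor)
    # that the clean auxiliary data does not reinforce.
    reg = sum((p ** 2).sum() for p in model.parameters())
    loss = mlm_loss + reg_lambda * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch the MLM loss on clean auxiliary text plays the "maintain normal functionality" role, while the penalty term plays the "remove backdoor functionality" role described in the abstract.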
Pages: 1608-1623
Number of pages: 16