Removing Backdoors in Pre-trained Models by Regularized Continual Pre-training

Cited by: 0
Authors
Zhu, Biru [1 ]
Cui, Ganqu [2 ]
Chen, Yangyi [3 ]
Qin, Yujia [2 ]
Yuan, Lifan [2 ]
Fu, Chong [4 ]
Deng, Yangdong [1 ]
Liu, Zhiyuan [2 ]
Sun, Maosong [2 ]
Gu, Ming [1 ]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Univ Illinois, Champaign, IL USA
[4] Zhejiang Univ, Zhejiang, Peoples R China
Funding
National Key Research and Development Program of China;
DOI
10.1162/tacl_a_00622
CLC Number
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
Recent research has revealed that pre-trained models (PTMs) are vulnerable to backdoor attacks before the fine-tuning stage. Attackers can implant transferable, task-agnostic backdoors in PTMs and control model outputs on any downstream task, which poses severe security threats to all downstream applications. Existing backdoor-removal defenses focus on task-specific classification models and are not suitable for defending PTMs against task-agnostic backdoor attacks. To this end, we propose the first task-agnostic backdoor-removal method for PTMs. Based on the selective activation phenomenon in backdoored PTMs, we design a simple and effective backdoor eraser, which continually pre-trains the backdoored PTMs with a regularization term in an end-to-end manner. The regularization term removes backdoor functionalities from PTMs, while the continual pre-training maintains the normal functionalities of PTMs. We conduct extensive experiments on pre-trained models across different modalities and architectures. The experimental results show that our method can effectively remove backdoors inside PTMs and preserve their benign functionalities using only a small amount of downstream-task-irrelevant auxiliary data, e.g., unlabeled plain texts. The average attack success rate on three downstream datasets is reduced from 99.88% to 8.10% after our defense is applied to the backdoored BERT. The code is publicly available at https://github.com/thunlp/RECIPE.
Pages: 1608-1623
Number of pages: 16
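The abstract above describes the method only at a high level: continue pre-training the backdoored PTM on a small amount of unlabeled, downstream-task-irrelevant auxiliary data while adding a regularization term that suppresses backdoor behavior. The following Python sketch is a rough, hedged illustration of such a joint objective, not the authors' implementation (which is in the linked repository). The model name, auxiliary texts, hyperparameters, and the regularizer itself are all placeholder assumptions; the paper's actual regularization term is derived from its selective-activation analysis.

# Minimal, illustrative sketch of regularized continual pre-training for
# backdoor removal, based only on the high-level description in the abstract.
# ASSUMPTIONS (not from the paper): "bert-base-uncased" stands in for a possibly
# backdoored PTM, the texts are placeholders, and the regularizer merely
# penalizes large hidden activations as a rough proxy for the paper's term.
import torch
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

model_name = "bert-base-uncased"                      # stand-in for a backdoored PTM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# A few downstream-task-irrelevant, unlabeled plain texts (auxiliary data).
texts = ["an unlabeled auxiliary sentence .", "another plain text example ."]
examples = [tokenizer(t, truncation=True, max_length=128) for t in texts]
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
loader = DataLoader(examples, batch_size=2, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
reg_weight = 0.1                                      # placeholder trade-off weight

model.train()
for batch in loader:
    outputs = model(**batch, output_hidden_states=True)
    mlm_loss = outputs.loss                           # continual pre-training preserves normal functionality
    # Placeholder regularization term: discourage abnormally large intermediate
    # activations; only mimics the structure of the paper's backdoor-erasing term.
    hidden = torch.stack(outputs.hidden_states[1:])   # (num_layers, batch, seq_len, dim)
    reg_loss = hidden.pow(2).mean()
    loss = mlm_loss + reg_weight * reg_loss           # end-to-end joint objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()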