Defending LLMs against Jailbreaking Attacks via Backtranslation

Cited by: 0
Authors
Wang, Yihan [1 ]
Shi, Zhouxing [1 ]
Bai, Andrew [1 ]
Hsieh, Cho-Jui [1 ]
Affiliations
[1] UCLA, Los Angeles, CA 90095 USA
Keywords
DOI
Not available
CLC number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Although many large language models (LLMs) have been trained to refuse harmful requests, they are still vulnerable to jailbreaking attacks that rewrite the original prompt to conceal its harmful intent. In this paper, we propose a new method for defending LLMs against jailbreaking attacks by "backtranslation". Specifically, given an initial response generated by the target LLM from an input prompt, our backtranslation prompts a language model to infer an input prompt that could lead to that response. The inferred prompt, called the backtranslated prompt, tends to reveal the actual intent of the original prompt, since it is generated from the LLM's response and is not directly manipulated by the attacker. We then run the target LLM again on the backtranslated prompt, and we refuse the original prompt if the model refuses the backtranslated prompt. We explain how the proposed defense benefits both effectiveness and efficiency. We empirically demonstrate that our defense significantly outperforms the baselines, particularly in cases that are hard for the baselines, while having little impact on generation quality for benign input prompts. Our implementation is based on our library for LLM jailbreaking defense algorithms at https://github.com/YihanWang617/llm-jailbreaking-defense, and the code for reproducing our experiments is available at https://github.com/YihanWang617/LLM-Jailbreaking-Defense-Backtranslation.
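The three-step procedure in the abstract (respond, backtranslate, recheck) is simple enough to sketch in a few lines. Below is a minimal Python sketch under stated assumptions: the helper names `llm` and `is_refusal` and the backtranslation instruction are illustrative, not the authors' actual API, which lives in the linked repositories.

```python
# Minimal sketch of the backtranslation defense described in the abstract.
# `llm`, `is_refusal`, and the backtranslation instruction are assumptions;
# see https://github.com/YihanWang617/llm-jailbreaking-defense for the real code.
from typing import Callable


def is_refusal(response: str) -> bool:
    # Crude refusal check (assumption); a real system would use a stronger classifier.
    return response.strip().lower().startswith(("i'm sorry", "i cannot", "i can't"))


def defend_with_backtranslation(prompt: str, llm: Callable[[str], str]) -> str:
    # Step 1: get the target LLM's initial response to the (possibly adversarial) prompt.
    response = llm(prompt)
    if is_refusal(response):
        return response  # the model already refused; nothing more to check

    # Step 2: backtranslate -- ask a language model to infer an input prompt
    # that could have produced this response. Because the inferred prompt is
    # derived from the response, the attacker cannot manipulate it directly.
    backtranslated = llm(
        "Please guess the user's request that the following response answers:\n"
        + response
    )

    # Step 3: rerun the target LLM on the inferred prompt; if it refuses the
    # backtranslated prompt, refuse the original prompt as well.
    if is_refusal(llm(backtranslated)):
        return "I'm sorry, but I cannot assist with that request."
    return response  # benign prompt: return the original response unchanged
```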
Pages: 16031-16046
Page count: 16
Related papers
50 records in total
  • [1] Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
    Zhang, Zhexin
    Yang, Junxiao
    Ke, Pei
    Mi, Fei
    Wang, Hongning
    Huang, Minlie
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 8865 - 8887
  • [2] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
    Mehrotra, Anay
    Zampetakis, Manolis
    Kassianik, Paul
    Nelson, Blaine
    Anderson, Hyrum
    Singer, Yaron
    Karbasi, Amin
arXiv, 2023
  • [3] Artemis: Defending Against Backdoor Attacks via Distribution Shift
    Xue, Meng
    Wang, Zhixian
    Zhang, Qian
    Gong, Xueluan
    Liu, Zhihang
    Chen, Yanjiao
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2025, 22 (02) : 1781 - 1795
  • [4] SybilGuard: Defending against sybil attacks via social networks
    Yu, Haifeng
    Kaminsky, Michael
    Gibbons, Phillip B.
    Flaxman, Abraham D.
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2008, 16 (03) : 576 - 589
  • [5] SybilGuard: Defending against sybil attacks via social networks
    Yu, Haifeng
    Kaminsky, Michael
    Gibbons, Phillip B.
    Flaxman, Abraham
    ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2006, 36 (04) : 267 - 278
  • [6] DiffDefense: Defending Against Adversarial Attacks via Diffusion Models
    Silva, Hondamunige Prasanna
    Seidenari, Lorenzo
    Del Bimbo, Alberto
    IMAGE ANALYSIS AND PROCESSING, ICIAP 2023, PT II, 2023, 14234 : 430 - 442
  • [7] Defending Against Adversarial Attacks via Neural Dynamic System
    Li, Xiyuan
    Zou, Xin
    Liu, Weiwei
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022,
  • [8] SybilBF: Defending against Sybil Attacks via Bloom Filters
    Wu, Hengkui
    Yang, Dong
    Zhang, Hongke
    ETRI JOURNAL, 2011, 33 (05) : 826 - 829
  • [9] Defending against Whitebox Adversarial Attacks via Randomized Discretization
    Zhang, Yuchen
    Liang, Percy
    22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89 : 684 - 693
  • [10] Defending against attacks tailored to transfer learning via feature distancing
    Ji, Sangwoo
    Park, Namgyu
    Na, Dongbin
    Zhu, Bin
    Kim, Jong
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2022, 223