Defending ChatGPT against jailbreak attack via self-reminders

被引:22
|
作者
Xie, Yueqi [1 ]
Yi, Jingwei [2 ]
Shao, Jiawei [1 ]
Curl, Justin [3 ]
Lyu, Lingjuan [4 ]
Chen, Qifeng [1 ]
Xie, Xing [5 ]
Wu, Fangzhao [5 ]
机构
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Sony AI, Tokyo, Japan
[5] Microsoft Res Asia, Beijing, Peoples R China
关键词
D O I
10.1038/s42256-023-00765-8
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT's ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate against jailbreaks without further training. Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompt and demonstrate a simple but effective technique to counter these attacks by encapsulating users' prompts in another standard prompt that reminds ChatGPT to respond responsibly.
引用
收藏
页码:1486 / 1496
页数:16
相关论文
共 50 条
  • [1] Defending ChatGPT against jailbreak attack via self-reminders
    Yueqi Xie
    Jingwei Yi
    Jiawei Shao
    Justin Curl
    Lingjuan Lyu
    Qifeng Chen
    Xing Xie
    Fangzhao Wu
    Nature Machine Intelligence, 2023, 5 : 1486 - 1496
  • [2] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
    Xu, Zhangchen
    Jiang, Fengqing
    Niu, Luyao
    Jia, Jinyuan
    Lin, Bill Yuchen
    Poovendran, Radha
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 5587 - 5605
  • [3] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
    Zhao, Wei
    Li, Zhe
    Li, Yige
    Zhang, Ye
    Sun, Jun
    EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024, 2024, : 5094 - 5109
  • [4] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
    Zhao, Wei
    Li, Zhe
    Li, Yige
    Zhang, Ye
    Sun, Jun
    arXiv,
  • [5] Defending Against Wormhole Attack in MANET
    Patel, Anal
    Patel, Nimisha
    Patel, Rajan
    2015 FIFTH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT2015), 2015, : 674 - 678
  • [6] AN APPROACH OF DEFENDING AGAINST DDOS ATTACK
    Wu Zhijun Duan Haixin Li Xing (Network Research Center
    Journal of Electronics(China), 2006, (01) : 148 - 153
  • [7] Defending Against Wormhole Attack in OLSR
    HONG Liang HONG Fan FU Cai
    Geo-Spatial Information Science, 2006, (03) : 229 - 233
  • [8] Defending Against Wormhole Attack in OLSR
    Hong Liang
    Hong Fan
    Fu Cai
    GEO-SPATIAL INFORMATION SCIENCE, 2006, 9 (03) : 229 - 233
  • [9] AN APPROACH OF DEFENDING AGAINST DDOS ATTACK
    Wu Zhijun Duan Haixin Li Xing Network Research Center Tsinghua University Beijing China Tianjin Key Lab for Advanced Signal Processing Civil Aviation University of China Tianjin China
    JournalofElectronics, 2006, (01) : 148 - 153
  • [10] Defending against the Pirate Evolution Attack
    Jin, Hongxia
    Lotspiech, Jeffrey
    INFORMATION SECURITY PRACTICE AND EXPERIENCE, PROCEEDINGS: 5TH INTERNATIONAL CONFERENCE, ISPEC 2009, 2009, 5451 : 147 - 158