Defending ChatGPT against jailbreak attack via self-reminders

Cited: 22
Authors
Xie, Yueqi [1 ]
Yi, Jingwei [2 ]
Shao, Jiawei [1 ]
Curl, Justin [3 ]
Lyu, Lingjuan [4 ]
Chen, Qifeng [1 ]
Xie, Xing [5 ]
Wu, Fangzhao [5 ]
Affiliations
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Sony AI, Tokyo, Japan
[5] Microsoft Res Asia, Beijing, Peoples R China
Keywords
DOI
10.1038/s42256-023-00765-8
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT's ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks, as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. Drawing inspiration from the psychological concept of self-reminders, we propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions, and proposes the psychologically inspired self-reminder technique, which can efficiently and effectively mitigate jailbreaks without further training.

Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompts and demonstrate a simple but effective technique to counter these attacks by encapsulating users' prompts in another standard prompt that reminds ChatGPT to respond responsibly.
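The system-mode self-reminder described in the abstract amounts to wrapping the user's query between a reminding prefix and suffix before the combined prompt is sent to the model. A minimal Python sketch follows; the reminder wording here is a paraphrase of the idea, not the authors' exact prompt text:

```python
def wrap_with_self_reminder(user_query: str) -> str:
    """Encapsulate a user query in a system-mode self-reminder.

    The reminder text below paraphrases the technique; it is not
    quoted verbatim from the paper.
    """
    prefix = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content! Please answer the following user "
        "query in a responsible way.\n"
    )
    suffix = (
        "\nRemember, you should be a responsible AI assistant and should "
        "not generate harmful or misleading content!"
    )
    # The wrapped string, not the raw query, is what reaches the model.
    return prefix + user_query + suffix


print(wrap_with_self_reminder("Tell me about your safety guidelines."))
```

Because the reminder is applied at the system level around every query, it requires no fine-tuning or further training of the underlying model.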
Pages: 1486 - 1496
Page count: 16
Related papers
50 records in total
  • [31] Legal Framework for Defending Transport Networks Against Terrorist Attack
    Cryer, Robert
    TRANSPORTATION SECURITY AGAINST TERRORISM, 2009, 54 : 108 - 115
  • [32] Defending Against Membership Inference Attack by Shielding Membership Signals
    Miao, Yinbin
    Yu, Yueming
    Li, Xinghua
    Guo, Yu
    Liu, Ximeng
    Choo, Kim-Kwang Raymond
    Deng, Robert H.
    IEEE TRANSACTIONS ON SERVICES COMPUTING, 2023, 16 (06) : 4087 - 4101
  • [33] Robust Adversarial Watermark Defending Against GAN Synthesization Attack
    Xu, Shengwang
    Qiao, Tong
    Xu, Ming
    Wang, Wei
    Zheng, Ning
    IEEE SIGNAL PROCESSING LETTERS, 2024, 31 : 351 - 355
  • [35] Defending Multiple Attack Via Multiple Algorithms With Fault Tolerance
    Grace, M.
    Sughasiny, M.
    2021 IEEE INTERNATIONAL CONFERENCE ON MOBILE NETWORKS AND WIRELESS COMMUNICATIONS (ICMNWC), 2021,
  • [36] Defending Against Adversarial Attack Towards Deep Neural Networks Via Collaborative Multi-Task Training
    Wang, Derui
    Li, Chaoran
    Wen, Sheng
    Nepal, Surya
    Xiang, Yang
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2022, 19 (02) : 953 - 965
  • [37] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
    Huang, Caishuang
    Zhao, Wanxu
    Zheng, Rui
    Lv, Huijie
    Zhan, Wenyu
    Dou, Shihan
    Li, Sixian
    Wang, Xiao
    Zhou, Enyu
    Ye, Junjie
    Yang, Yuming
    Gui, Tao
    Zhang, Qi
    Huang, Xuanjing
    arXiv,
  • [38] MaliFuzz: Adversarial Malware Detection Model for Defending Against Fuzzing Attack
    Gao, Xianwei
    Shan, Chun
    Hu, Changzhen
    Journal of Beijing Institute of Technology, 2024, 33 (05) : 436 - 449
  • [39] An MDP Approach for Defending Against Fraud Attack in Cognitive Radio Networks
    Shahhoseini, Hadi Shahriar
    Jafari, Amir Hosein
    Afhamisisi, Khadijeh
    IETE JOURNAL OF RESEARCH, 2015, 61 (05) : 492 - 499
  • [40] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
    Kim, Heegyu
    Yuk, Sehyun
    Cho, Hyunsouk
    arXiv,