Defending ChatGPT against jailbreak attack via self-reminders

被引:22
|
作者
Xie, Yueqi [1 ]
Yi, Jingwei [2 ]
Shao, Jiawei [1 ]
Curl, Justin [3 ]
Lyu, Lingjuan [4 ]
Chen, Qifeng [1 ]
Xie, Xing [5 ]
Wu, Fangzhao [5 ]
机构
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Sony AI, Tokyo, Japan
[5] Microsoft Res Asia, Beijing, Peoples R China
关键词
D O I
10.1038/s42256-023-00765-8
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT's ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate against jailbreaks without further training. Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompt and demonstrate a simple but effective technique to counter these attacks by encapsulating users' prompts in another standard prompt that reminds ChatGPT to respond responsibly.
引用
收藏
页码:1486 / 1496
页数:16
相关论文
共 50 条
  • [21] DEFENDING A MOVING TARGET AGAINST MISSILE OR TORPEDO ATTACK
    BOYELL, RL
    IEEE TRANSACTIONS ON AEROSPACE AND ELECTRONIC SYSTEMS, 1976, 12 (04) : 522 - 526
  • [22] Defending Against Model Inversion Attack by Adversarial Examples
    Wen, Jing
    Yiu, Siu-Ming
    Hui, Lucas C. K.
    PROCEEDINGS OF THE 2021 IEEE INTERNATIONAL CONFERENCE ON CYBER SECURITY AND RESILIENCE (IEEE CSR), 2021, : 551 - 556
  • [23] A Novel Proposal for Defending Against Vampire Attack in WSN
    Patel, Amee A.
    Soni, Sunil J.
    2015 FIFTH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT2015), 2015, : 624 - 627
  • [24] Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
    Wang, Youze
    Hu, Wenbo
    Dong, Yinpeng
    Liu, Jing
    Zhang, Hanwang
    Hong, Richang
    IEEE Transactions on Circuits and Systems for Video Technology,
  • [25] Defending the self against identity misclassification
    Prewitt-Freilino, Jennifer L.
    Bosson, Jennifer K.
    SELF AND IDENTITY, 2008, 7 (02) : 168 - 183
  • [26] Defending Against Man-In-The-Middle Attack in Repeated Games
    Li, Shuxin
    Li, Xiaohong
    Hao, Jianye
    An, Bo
    Feng, Zhiyong
    Chen, Kangjie
    Zhang, Chengwei
    PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3742 - 3748
  • [27] Defending Against Bruteforc Attack Using Open Source - SNORT
    Bharati, Manisha
    Tamane, Sharvaree
    PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INVENTIVE COMPUTING AND INFORMATICS (ICICI 2017), 2017, : 903 - 907
  • [28] TESTING AND DEFENDING METHODS AGAINST DOS ATTACK IN STATE ESTIMATION
    Zhang, Heng
    Qi, Yifei
    Zhou, Huan
    Zhang, Jian
    Sun, Jing
    ASIAN JOURNAL OF CONTROL, 2017, 19 (04) : 1295 - 1305
  • [29] Resistance maximization principle for defending networks against virus attack
    Li, Angsheng
    Zhang, Xiaohui
    Pan, Yicheng
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2017, 466 : 211 - 223
  • [30] A SOLUTION FOR DEFENDING AGAINST DENIAL OF SERVICE ATTACK ON WIRELESS LAN
    Nguyen, Dinh-Thuc
    Tran, Ngoc-Bao
    Nguyen-Ho, Minh-Duc
    MOBILE AND WIRELESS NETWORKS SECURITY, PROCEEDINGS, 2008, : 67 - +