SneakyPrompt: Jailbreaking Text-to-image Generative Models

Cited by: 1
Authors
Yang, Yuchen [1]
Hui, Bo [1]
Yuan, Haolin [1]
Gong, Neil [2]
Cao, Yinzhi [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Duke Univ, Durham, NC USA
Funding
US National Science Foundation
DOI
10.1109/SP54263.2024.00123
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL·E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and the quality of the generated NSFW images. SneakyPrompt is open-source and available at this repository: https://github.com/Yuchen413/text2image_safety.
Pages: 897-912
Page count: 16
Related papers
50 in total
  • [31] Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction
    Alhabeeb, Sarah K.
    Al-Shargabi, Amal A.
    IEEE ACCESS, 2024, 12 : 24412 - 24427
  • [32] Survey About Generative Adversarial Network and Text-to-Image Synthesis
    Lai, Lina
    Mi, Yu
    Zhou, Longlong
    Rao, Jiyong
    Xu, Tianyang
    Song, Xiaoning
    Computer Engineering and Applications, 2023, 59 (19) : 21 - 39
  • [33] Muse: Text-To-Image Generation via Masked Generative Transformers
    Chang, Huiwen
    Zhang, Han
    Barber, Jarred
    Maschinot, A. J.
    Lezama, Jose
    Jiang, Lu
    Yang, Ming-Hsuan
    Murphy, Kevin
    Freeman, William T.
    Rubinstein, Michael
    Li, Yuanzhen
    Krishnan, Dilip
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 202, 2023, 202
  • [34] Decoupling Control in Text-to-Image Diffusion Models
    Cao, Shitong
    Zhang, Xuejie
    Wang, Jin
    Zhou, Xiaobing
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VII, ICIC 2024, 2024, 14868 : 312 - 322
  • [35] Enhanced Text-to-Image Synthesis Conditional Generative Adversarial Networks
    Tan, Yong Xuan
    Lee, Chin Poo
    Neo, Mai
    Lim, Kian Ming
    Lim, Jit Yan
    IAENG International Journal of Computer Science, 2022, 49 (01) : 1 - 7
  • [36] A Comparative Study of Generative Adversarial Networks for Text-to-Image Synthesis
    Chopra, Muskaan
    Singh, Sunil K.
    Sharma, Akhil
    Gill, Shabeg Singh
    INTERNATIONAL JOURNAL OF SOFTWARE SCIENCE AND COMPUTATIONAL INTELLIGENCE-IJSSCI, 2022, 14 (01):
  • [37] Evaluating Data Attribution for Text-to-Image Models
    Wang, Sheng-Yu
    Efros, Alexei A.
    Zhu, Jun-Yan
    Zhang, Richard
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, ICCV, 2023, : 7158 - 7169
  • [38] Multilingual Conceptual Coverage in Text-to-Image Models
    Saxon, Michael
    Wang, William Yang
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 4831 - 4848
  • [39] Ablating Concepts in Text-to-Image Diffusion Models
    Kumari, Nupur
    Zhang, Bingliang
    Wang, Sheng-Yu
    Shechtman, Eli
    Zhang, Richard
    Zhu, Jun-Yan
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22634 - 22645
  • [40] SINE: SINgle Image Editing with Text-to-Image Diffusion Models
    Zhang, Zhixing
    Han, Ligong
    Ghosh, Arnab
    Metaxas, Dimitris
    Ren, Jian
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR, 2023, : 6027 - 6037