SneakyPrompt: Jailbreaking Text-to-image Generative Models

Cited by: 1
Authors
Yang, Yuchen [1]
Hui, Bo [1]
Yuan, Haolin [1]
Gong, Neil [2]
Cao, Yinzhi [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Duke Univ, Durham, NC USA
Funding
U.S. National Science Foundation
DOI
10.1109/SP54263.2024.00123
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL·E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and the quality of the generated NSFW images. SneakyPrompt is open-source and available at this repository: https://github.com/Yuchen413/text2image_safety.
Pages: 897-912
Number of pages: 16