Masked-attention diffusion guidance for spatially controlling text-to-image generation

Cited by: 1
Author(s)
Endo, Yuki [1]
Affiliation(s)
[1] Univ Tsukuba, Tsukuba, Ibaraki, Japan
Source
VISUAL COMPUTER | 2024, Vol. 40, No. 9
Funding
Japan Society for the Promotion of Science;
Keywords
Diffusion model; Text-to-image synthesis; Multimodal; Classifier guidance;
DOI
10.1007/s00371-023-03151-y
CLC number
TP31 [Computer software];
Discipline codes
081202; 0835;
Abstract
Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone is spatially ambiguous and offers limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels. Our aim is to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach that directly swaps the cross-attention maps with constant maps computed from the semantic regions. Some prior works also allow training-free spatial control of text-to-image diffusion models by directly manipulating cross-attention maps; however, these approaches still suffer from misalignment with the given masks because the manipulated attention maps are far from the actual ones learned by the diffusion models. To address this issue, we propose masked-attention guidance, which generates images more faithful to the semantic masks by indirectly controlling the attention to each word and pixel through manipulation of the noise images fed to the diffusion models. Masked-attention guidance can be easily integrated into pre-trained, off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to tasks such as text-guided image editing. Experiments show, both qualitatively and quantitatively, that our method enables more accurate spatial control than baseline methods.
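To make the "indirect control" idea concrete, below is a minimal PyTorch sketch of a masked-attention guidance step in the classifier-guidance style the abstract describes: a loss scores how well each word's cross-attention map agrees with its semantic mask, and its gradient with respect to the noisy latent nudges the input noise image rather than overwriting the attention maps. This is not the paper's actual implementation; the names unet_with_attn, masked_attention_loss, and guidance_scale, as well as the specific loss formula, are illustrative assumptions.

import torch

def masked_attention_loss(attn_maps, masks):
    """Penalize attention that a word places outside its semantic region.

    attn_maps: (num_tokens, H, W) cross-attention maps for chosen tokens.
    masks:     (num_tokens, H, W) binary masks, 1 inside each word's region.
    """
    inside = (attn_maps * masks).flatten(1).sum(dim=1)
    total = attn_maps.flatten(1).sum(dim=1) + 1e-8
    # Maximize the fraction of each token's attention falling inside
    # its mask; equivalently, minimize the attention leaking outside it.
    return (1.0 - inside / total).mean()

@torch.enable_grad()
def guided_step(latent, t, unet_with_attn, masks, guidance_scale=30.0):
    """One denoising step with masked-attention guidance (sketch).

    `unet_with_attn(latent, t)` is an assumed wrapper that returns the
    noise prediction together with the cross-attention maps for the
    prompt tokens of interest, e.g. collected via forward hooks on the
    attention layers of a pre-trained UNet such as Stable Diffusion's.
    """
    latent = latent.detach().requires_grad_(True)
    noise_pred, attn_maps = unet_with_attn(latent, t)
    loss = masked_attention_loss(attn_maps, masks)
    # Gradient of the attention loss w.r.t. the noisy latent: the
    # attention maps themselves are never overwritten; only the noise
    # image fed to the model is shifted before the scheduler step.
    grad = torch.autograd.grad(loss, latent)[0]
    return noise_pred.detach(), latent.detach() - guidance_scale * grad

By contrast, the direct-swap baseline mentioned in the abstract would replace attn_maps inside the attention layers with constant maps built from the masks; since such hand-crafted maps stray from what the model learned, the gradient-based indirect control sketched above is reported to align better with the given masks.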
Pages: 6033-6045
Page count: 13