Masked-attention diffusion guidance for spatially controlling text-to-image generation

Cited by: 1
Author
Endo, Yuki [1 ]
Affiliation
[1] Univ Tsukuba, Tsukuba, Ibaraki, Japan
Source
VISUAL COMPUTER | 2024, Vol. 40, Issue 9
Funding
Japan Society for the Promotion of Science (JSPS)
Keywords
Diffusion model; Text-to-image synthesis; Multimodal; Classifier guidance;
DOI
10.1007/s00371-023-03151-y
CLC number
TP31 [Computer Software]
Discipline codes
081202; 0835
Abstract
Text-to-image synthesis has achieved high-quality results with recent advances in diffusion models. However, text input alone is spatially ambiguous and offers limited user controllability. Most existing methods allow spatial control through additional visual guidance (e.g., sketches and semantic masks) but require additional training with annotated images. In this paper, we propose a method for spatially controlling text-to-image generation without further training of diffusion models. Our method is based on the insight that cross-attention maps reflect the positional relationship between words and pixels. We aim to control the attention maps according to given semantic masks and text prompts. To this end, we first explore a simple approach of directly swapping the cross-attention maps with constant maps computed from the semantic regions. Some prior works also allow training-free spatial control of text-to-image diffusion models by directly manipulating cross-attention maps. However, these approaches still suffer from misalignment with the given masks because the manipulated attention maps are far from the actual maps learned by diffusion models. To address this issue, we propose masked-attention guidance, which generates images more faithful to the semantic masks by indirectly controlling the attention to each word and pixel through manipulation of the noise images fed to the diffusion model. Masked-attention guidance can be easily integrated into pre-trained, off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the task of text-guided image editing. Experiments show that our method enables more accurate spatial control than baseline methods, both qualitatively and quantitatively.
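Illustrative sketch (not from the paper): the guidance described in the abstract can be read as a classifier-guidance-style update, in which a loss measuring how much each word's cross-attention falls outside its semantic mask is differentiated with respect to the noisy latent. The PyTorch-style sketch below is assumption-laden: unet_with_attn (a UNet hooked to also return cross-attention maps averaged over heads and layers), the tensor shapes, and guidance_scale are hypothetical stand-ins, not the author's actual interface.

import torch
import torch.nn.functional as F

def masked_attention_guidance_step(latent, t, text_emb, masks,
                                   unet_with_attn, guidance_scale=30.0):
    # latent:   noisy latent x_t, shape (1, C, H, W)
    # masks:    per-token binary masks, shape (L, Hm, Wm); 1 inside the
    #           region assigned to token l, 0 elsewhere
    # unet_with_attn: hypothetical callable returning the noise prediction
    #           and pixel-to-token cross-attention maps of shape (L, Ha, Wa)
    latent = latent.detach().requires_grad_(True)
    eps_pred, attn = unet_with_attn(latent, t, text_emb)

    # Resize the masks to the attention-map resolution.
    masks = F.interpolate(masks.unsqueeze(0).float(),
                          size=attn.shape[-2:], mode="nearest").squeeze(0)

    # Loss: total attention mass each token places outside its mask.
    loss = (attn * (1.0 - masks)).sum()

    # Gradient w.r.t. the noisy latent: the noise image fed to the model
    # is steered instead of the attention maps being edited directly.
    grad = torch.autograd.grad(loss, latent)[0]

    # Classifier-guidance-style shift of the noise prediction; the
    # noise-schedule factor is folded into guidance_scale here.
    return eps_pred.detach() + guidance_scale * grad

Because this update perturbs the noisy latent rather than overwriting the attention maps, the maps that actually drive generation remain ones the model itself produces, which is the property the abstract credits for better mask fidelity.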
Pages: 6033-6045
Page count: 13