BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Cited by: 0
Authors:
Li, Dongxu [1]
Li, Junnan [1]
Hoi, Steven C. H. [1]
Affiliations:
[1] Salesforce AI Research, Sydney, NSW, Australia
Keywords: (none listed)
DOI: not available
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Subject Classification Codes: 081104; 0812; 0835; 1405
Abstract
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, consuming subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation and efficient fine-tuning for customized subjects with up to 20x speedup. We also show that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.
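The abstract describes a conditioning path: a BLIP-2-style multimodal encoder turns a subject image and its category text into an embedding aligned with the text space, and the diffusion model is then conditioned jointly on this subject embedding and the prompt embedding. The sketch below illustrates that idea in PyTorch; it is a minimal conceptual sketch only, and every name in it (SubjectConditionedGenerator, subject_encoder, build_conditioning, etc.) is a hypothetical placeholder rather than the authors' released code or any specific library API.

```python
import torch
import torch.nn as nn


class SubjectConditionedGenerator(nn.Module):
    """Hypothetical sketch of BLIP-Diffusion-style subject conditioning."""

    def __init__(self, subject_encoder: nn.Module, text_encoder: nn.Module, unet: nn.Module):
        super().__init__()
        self.subject_encoder = subject_encoder  # BLIP-2-style multimodal encoder (assumed interface)
        self.text_encoder = text_encoder        # frozen text encoder of the diffusion model (assumed)
        self.unet = unet                        # latent-diffusion denoising network (assumed)

    def build_conditioning(self, subject_image, subject_category, prompt):
        # The multimodal encoder maps (subject image, subject category text) to a
        # small set of "soft tokens" aligned with the text embedding space.
        subj_tokens = self.subject_encoder(subject_image, subject_category)  # (B, k, d)
        prompt_tokens = self.text_encoder(prompt)                            # (B, n, d)
        # The denoiser sees prompt tokens and subject tokens together, so the text
        # prompt controls the rendition while the subject tokens preserve identity.
        return torch.cat([prompt_tokens, subj_tokens], dim=1)               # (B, n + k, d)

    def denoise_step(self, noisy_latents, timestep, conditioning):
        # One reverse-diffusion step conditioned on the joint embedding.
        return self.unet(noisy_latents, timestep, conditioning)
```

Under this reading, the subject representation comes from a pre-trained encoder rather than from per-subject optimization, so generation can run zero-shot from a reference image, and per-subject fine-tuning only needs to refine this compact conditioning path, which is consistent with the reported up-to-20x speedup over DreamBooth-style fine-tuning.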
Pages: 21
Related Papers (50 total)
  • [1] Liu, Hanyuan; Xie, Minshan; Xing, Jinbo; Li, Chengze; Wong, Tien-Tsin. Video Colorization with Pre-trained Text-to-Image Diffusion Models. arXiv, 2023.
  • [2] Li, Bowen; Qi, Xiaojuan; Lukasiewicz, Thomas; Torr, Philip H. S. Controllable Text-to-Image Generation. Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019.
  • [3] Zhong, Shanshan; Huang, Zhongzhan; Wen, Wushao; Qin, Jinghui; Lin, Liang. SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models. Proceedings of the 31st ACM International Conference on Multimedia (MM 2023), 2023, pp. 567-578.
  • [4] Gao, Jialu; Hu, Kaizhe; Xu, Guowei; Xu, Huazhe. Can Pre-Trained Text-to-Image Models Generate Visual Goals for Reinforcement Learning? Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 2023.
  • [5] Zhou, Yufan; Liu, Bingchen; Zhu, Yizhe; Yang, Xiao; Chen, Changyou; Xu, Jinhui. Shifted Diffusion for Text-to-Image Generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10157-10166.
  • [6] Zhang, Zhixing; Han, Ligong; Ghosh, Arnab; Metaxas, Dimitris; Ren, Jian. SINE: SINgle Image Editing with Text-to-Image Diffusion Models. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 6027-6037.
  • [7] Orgad, Hadas; Kawar, Bahjat; Belinkov, Yonatan. Editing Implicit Assumptions in Text-to-Image Diffusion Models. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7030-7038.
  • [8] Zhang, Hanqing; Song, Haolin; Li, Shaoyu; Zhou, Ming; Song, Dawei. A Survey of Controllable Text Generation Using Transformer-based Pre-trained Language Models. ACM Computing Surveys, 2024, 56(3).
  • [9] Li, Junyi; Tang, Tianyi; Zhao, Wayne Xin; Nie, Jian-Yun; Wen, Ji-Rong. Pre-Trained Language Models for Text Generation: A Survey. ACM Computing Surveys, 2024, 56(9).
  • [10] Ruiz, Nataniel; Li, Yuanzhen; Jampani, Varun; Pritch, Yael; Rubinstein, Michael; Aberman, Kfir. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 22500-22510.