BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Cited by: 0
Authors
Li, Dongxu [1 ]
Li, Junnan [1 ]
Hoi, Steven C. H. [1 ]
Affiliations
[1] Salesforce AI Res, Sydney, NSW, Australia
Keywords
DOI
N/A
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, consuming subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. We then design a subject representation learning task which enables a diffusion model to leverage such visual representation and generate novel subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation and efficient fine-tuning for customized subjects with up to a 20x speedup. We also show that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.
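For readers who want to try the zero-shot subject-driven generation described above, a minimal usage sketch follows. It assumes the BlipDiffusionPipeline integration shipped with Hugging Face diffusers and the Salesforce/blipdiffusion checkpoint on the Hub; the reference file dog.png and the subject category "dog" are illustrative assumptions, not details from this record, and argument names should be verified against the installed diffusers version.

import torch
from diffusers.pipelines import BlipDiffusionPipeline
from diffusers.utils import load_image

# Load the pre-trained pipeline; the subject encoder is already pre-trained,
# so no per-subject fine-tuning is needed for zero-shot generation.
pipe = BlipDiffusionPipeline.from_pretrained(
    "Salesforce/blipdiffusion", torch_dtype=torch.float16
).to("cuda")

# A single reference image supplies the subject appearance (hypothetical file).
subject_image = load_image("dog.png")

result = pipe(
    "swimming underwater",  # text prompt describing the new rendition
    subject_image,          # reference image providing the subject
    "dog",                  # source subject category
    "dog",                  # target subject category
    guidance_scale=7.5,
    num_inference_steps=25,
    height=512,
    width=512,
).images
result[0].save("dog_underwater.png")

For the ControlNet combination mentioned in the abstract, diffusers also provides a BlipDiffusionControlNetPipeline that additionally conditions generation on a structure map such as edges or depth; its call pattern mirrors the sketch above.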
Pages: 21
Related Papers
50 items in total
  • [31] MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
    Zhao, Jing
    Zheng, Heliang
    Wang, Chaoyue
    Lan, Long
    Yang, Wenjing
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2023), 2023, : 22535 - 22545
  • [32] RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
    Xue, Zeyue
    Song, Guanglu
    Guo, Qiushan
    Liu, Boxiao
    Zong, Zhuofan
    Liu, Yu
    Luo, Ping
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [33] Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference
    Yu, Zihao
    Li, Haoyang
    Fu, Fangcheng
    Miao, Xupeng
    Cui, Bin
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 15, 2024, : 16605 - 16613
  • [34] Locally controllable network based on visual–linguistic relation alignment for text-to-image generation
Li, Zaike
Liu, Li
Zhang, Huaxiang
Liu, Dongmei
Song, Yu
Li, Boqun
MULTIMEDIA SYSTEMS, 2024, 30
  • [35] Pre-trained Diffusion Models for Plug-and-Play Medical Image Enhancement
    Ma, Jun
    Zhu, Yuanzhi
    You, Chenyu
    Wang, Bo
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT III, 2023, 14222 : 3 - 13
  • [36] A System of Multimodal Image-Text Retrieval Based on Pre-Trained Models Fusion
    Li, Qiang
    Zhao, Feng
    Zhao, Linlin
    Liu, Maokai
    Wang, Yubo
    Zhang, Shuo
    Guo, Yuanyuan
    Wang, Shunlu
    Wang, Weigang
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2025, 37 (03):
  • [37] Masked-attention diffusion guidance for spatially controlling text-to-image generation
    Endo, Yuki
    VISUAL COMPUTER, 2024, 40 (09): : 6033 - 6045
  • [38] Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation
    Pan, Zhihong
    Zhou, Xin
    Tian, Hao
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 4450 - 4460
  • [39] Text-to-image Generation Model Based on Diffusion Wasserstein Generative Adversarial Networks
Zhao, H.
Li, W.
DIANZI YU XINXI XUEBAO/JOURNAL OF ELECTRONICS AND INFORMATION TECHNOLOGY, 2023, 45 (12): : 4371 - 4381
  • [40] Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
    Ma, Jian
    Liang, Junhao
    Chen, Chen
    Lu, Haonan
    PROCEEDINGS OF SIGGRAPH 2024 CONFERENCE PAPERS, 2024,