Open-Vocabulary Text-Driven Human Image Generation

Cited by: 1
Authors
Zhang, Kaiduo [1 ,2 ]
Sun, Muyi [1 ,3 ]
Sun, Jianxin [1 ,2 ]
Zhang, Kunbo [1 ,2 ]
Sun, Zhenan [1 ,2 ]
Tan, Tieniu [1 ,2 ,4 ]
Affiliations
[1] CASIA, CRIPAC, MAIS, Beijing 100190, Peoples R China
[2] UCAS, Sch AI, Beijing 101408, Peoples R China
[3] BUPT, Sch AI, Beijing 100875, Peoples R China
[4] Nanjing Univ, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Multi-modal biometric analysis; Human image generation; Text-to-human generation; Human image editing; Manipulation
DOI
10.1007/s11263-024-02079-7
CLC number
TP18 [Theory of artificial intelligence]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) well, and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The framework consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. Using the vision-language pretrained CLIP model, we first encode coarse features of the local human appearance. The SMR module then refines these initial coarse features using an external database containing clothing texture details. Through SMR refinement, the HIG task can be performed with arbitrary text inputs, greatly expanding the range of expressible styles. The MFM module, embedded in the diffusion backbone, learns fine-grained appearance features, achieving precise semantic alignment between different body parts and their appearance features and accurately expressing the desired human appearance. Together, these modules enable HumanDiffusion to perform freestyle, high-accuracy text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
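The retrieval-then-refine idea behind the SMR module can be illustrated with a minimal sketch: a coarse text-derived feature is compared against an external memory bank of texture features, and the nearest entries are blended back in. All names, the top-k selection, and the linear blending rule here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Stylized Memory Retrieval (SMR):
# refine a coarse CLIP-style feature by blending in its nearest
# neighbors from an external texture memory bank.
# The function names, top-k rule, and blend factor are assumptions.
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def smr_refine(coarse_feat, memory_bank, top_k=2, alpha=0.5):
    """Return the coarse feature interpolated toward the mean of its
    top-k most similar memory entries (illustrative refinement)."""
    ranked = sorted(memory_bank, key=lambda m: cosine(coarse_feat, m),
                    reverse=True)
    neighbors = ranked[:top_k]
    mean = [sum(dim) / len(neighbors) for dim in zip(*neighbors)]
    return [(1 - alpha) * c + alpha * m for c, m in zip(coarse_feat, mean)]


# Toy example: a 2-D "text feature" pulled toward two nearby textures.
bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined = smr_refine([1.0, 0.2], bank, top_k=2, alpha=0.5)
```

In the actual framework, the refined features are consumed by the MFM module inside the diffusion backbone; this sketch only shows the retrieval-and-blend step in isolation.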
Pages: 4379-4397
Page count: 19