Open-Vocabulary Text-Driven Human Image Generation

Cited by: 1
Authors
Zhang, Kaiduo [1, 2]
Sun, Muyi [1, 3]
Sun, Jianxin [1, 2]
Zhang, Kunbo [1, 2]
Sun, Zhenan [1, 2]
Tan, Tieniu [1, 2, 4]
Affiliations
[1] CASIA, CRIPAC, MAIS, Beijing 100190, Peoples R China
[2] UCAS, Sch AI, Beijing 101408, Peoples R China
[3] BUPT, Sch AI, Beijing 100875, Peoples R China
[4] Nanjing Univ, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal biometric analysis; Human image generation; Text-to-human generation; Human image editing; MANIPULATION;
DOI
10.1007/s11263-024-02079-7
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) well and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, namely HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The proposed framework mainly consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. Using the vision-language pretrained CLIP model as the encoder, we obtain coarse features of the local human appearance. Then, the SMR module uses an external database that contains clothing texture details to refine the initial coarse features. Through SMR refreshing, we can achieve the HIG task with arbitrary text inputs, and the range of expression styles is greatly expanded. Later, the MFM module embedded in the diffusion backbone learns fine-grained appearance features, which achieves precise semantic-coherence alignment of different body parts with appearance features and realizes the accurate expression of the desired human appearance. The seamless combination of the proposed modules in HumanDiffusion enables free-style, highly accurate text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
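The abstract describes SMR only at a high level: coarse CLIP features are refined using an external database of clothing texture details. The following is a minimal sketch of one generic way such retrieval-based refinement could work, not the authors' implementation. The function name refine_with_memory, the cosine-similarity top-k lookup, the softmax weighting, and the temperature and blend parameters are all illustrative assumptions that do not come from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def refine_with_memory(coarse, memory, k=2, temperature=0.1, blend=0.5):
    """Hypothetical refinement step (assumption, not the paper's SMR module).

    coarse: coarse appearance feature (e.g. from a text encoder).
    memory: list of stored texture features with the same dimension.
    Returns the coarse feature blended with a similarity-weighted
    average of its k nearest memory entries.
    """
    # Keep the k memory entries most similar to the coarse feature.
    nearest = sorted(memory, key=lambda m: cosine(coarse, m), reverse=True)[:k]
    sims = [cosine(coarse, m) for m in nearest]

    # Softmax over similarities (shifted by the maximum for numerical stability).
    peak = max(sims)
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted average of the retrieved entries, dimension by dimension.
    retrieved = [
        sum(w * m[i] for w, m in zip(weights, nearest))
        for i in range(len(coarse))
    ]

    # Blend the retrieved detail back into the coarse feature.
    return [(1 - blend) * c + blend * r for c, r in zip(coarse, retrieved)]


if __name__ == "__main__":
    coarse_feature = [1.0, 0.0]
    texture_memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    print(refine_with_memory(coarse_feature, texture_memory, k=1))
```

In this sketch, blend controls how far the refined feature moves toward the retrieved entries: blend=0 returns the coarse feature unchanged, and blend=1 returns only the retrieved average.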
Pages: 4379-4397
Number of pages: 19