Open-Vocabulary Text-Driven Human Image Generation

Cited by: 1
Authors
Zhang, Kaiduo [1 ,2 ]
Sun, Muyi [1 ,3 ]
Sun, Jianxin [1 ,2 ]
Zhang, Kunbo [1 ,2 ]
Sun, Zhenan [1 ,2 ]
Tan, Tieniu [1 ,2 ,4 ]
Affiliations
[1] CASIA, CRIPAC, MAIS, Beijing 100190, Peoples R China
[2] UCAS, Sch AI, Beijing 101408, Peoples R China
[3] BUPT, Sch AI, Beijing 100875, Peoples R China
[4] Nanjing Univ, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Multi-modal biometric analysis; Human image generation; Text-to-human generation; Human image editing; MANIPULATION;
DOI
10.1007/s11263-024-02079-7
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they cannot handle the open-vocabulary setting with arbitrary text inputs (e.g., unseen clothing appearances) well and rely heavily on limited preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, namely HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The proposed framework mainly consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. We first obtain coarse features of the local human appearance by encoding with the vision-language pretrained CLIP model. Then, the SMR module uses an external database that contains clothing texture details to refine these initial coarse features. Through this SMR refinement, the HIG task can be performed with arbitrary text inputs, and the range of expressible styles is greatly expanded. Next, the MFM module, embedded in the diffusion backbone, learns fine-grained appearance features, which achieves precise, semantically coherent alignment between different body parts and appearance features and yields an accurate expression of the desired human appearance. The seamless combination of the proposed modules in HumanDiffusion enables free-style, highly accurate text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.
Pages: 4379-4397
Page count: 19