Open-Vocabulary Text-Driven Human Image Generation

Cited by: 1

Authors:

Zhang, Kaiduo [1,2]

Sun, Muyi [1,3]

Sun, Jianxin [1,2]

Zhang, Kunbo [1,2]

Sun, Zhenan [1,2]

Tan, Tieniu [1,2,4]

Affiliations:

[1] CASIA, CRIPAC, MAIS, Beijing 100190, Peoples R China

[2] UCAS, Sch AI, Beijing 101408, Peoples R China

[3] BUPT, Sch AI, Beijing 100875, Peoples R China

[4] Nanjing Univ, Nanjing 210008, Peoples R China

Source:

INTERNATIONAL JOURNAL OF COMPUTER VISION | 2024 / Vol. 132 / Issue 10

Funding:

National Natural Science Foundation of China;

Keywords:

Multi-modal biometric analysis; Human image generation; Text-to-human generation; Human image editing; MANIPULATION;

DOI:

10.1007/s11263-024-02079-7

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Generating human images from open-vocabulary text descriptions is an exciting but challenging task. Previous methods (e.g., Text2Human) face two problems: (1) they handle the open-vocabulary setting poorly, since they cannot cope with arbitrary text inputs (e.g., unseen clothing appearances) and rely heavily on a limited set of preset words (e.g., pattern styles of clothing appearances); (2) the generated human images are inaccurate in open-vocabulary settings. To alleviate these drawbacks, we propose a flexible diffusion-based framework, namely HumanDiffusion, for open-vocabulary text-driven human image generation (HIG). The proposed framework mainly consists of two novel modules: the Stylized Memory Retrieval (SMR) module and the Multi-scale Feature Mapping (MFM) module. We first obtain coarse features of the local human appearance by encoding with the vision-language pretrained CLIP model. Then, the SMR module utilizes an external database that contains clothing texture details to refine the initial coarse features. Through this SMR refinement, we can achieve the HIG task with arbitrary text inputs, and the range of expressible styles is greatly expanded. Next, the MFM module embedded in the diffusion backbone learns fine-grained appearance features, which achieves precise, semantically coherent alignment of different body parts with appearance features and realizes the accurate expression of the desired human appearance. The seamless combination of the proposed modules in HumanDiffusion enables free-style, highly accurate text-guided HIG and editing. Extensive experiments demonstrate that the proposed method achieves state-of-the-art (SOTA) performance, especially in the open-vocabulary setting.


Pages: 4379-4397

Page count: 19
