Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Cited by: 22

Authors:

Huang, Yupan [1]

Xue, Hongwei [2]

Liu, Bei [3]

Lu, Yutong [1]

Affiliations:

[1] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

[3] Microsoft Res Asia, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Keywords:

cross-modal generation; image captioning; text-to-image synthesis; LANGUAGE;

DOI:

10.1145/3474085.3481540

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models, one for each task, which imposes expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, where we represent images and text as unified sequences of tokens, and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves the previous Transformer-based model X-LXMERT's FID from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves the CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.


Pages: 1138-1147

Number of pages: 10
