Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Cited by: 22

Authors:

Huang, Yupan [1]

Xue, Hongwei [2]

Liu, Bei [3]

Lu, Yutong [1]

Affiliations:

[1] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

[3] Microsoft Res Asia, Beijing, Peoples R China

Source:

PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021

Keywords:

cross-modal generation; image captioning; text-to-image synthesis; LANGUAGE;

DOI:

10.1145/3474085.3481540

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

We study the joint learning of image-to-text and text-to-image generations, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models for each task, which impose expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, where we represent images and text as unified sequences of tokens, and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves previous Transformer-based model X-LXMERT's FID from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
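The abstract's idea of casting both directions as sequence generation over one token stream can be illustrated with a minimal sketch. This is not the authors' implementation: the vocabulary sizes, the special tokens `BOS`, `SEP`, and `EOS`, the id offsets, and the helper `build_sequence` are all hypothetical, and the source-then-target layout is only one possible way to arrange such a sequence.

```python
# Hypothetical sizes and special-token ids (assumptions, not from the paper).
TEXT_VOCAB_SIZE = 1000   # assumed number of text tokens
IMAGE_VOCAB_SIZE = 512   # assumed number of discrete image tokens
BOS, SEP, EOS = 0, 1, 2  # assumed special tokens
NUM_SPECIAL = 3


def text_token(index):
    """Map a text-vocabulary index into the shared id space."""
    if not 0 <= index < TEXT_VOCAB_SIZE:
        raise ValueError("text index out of range")
    return NUM_SPECIAL + index


def image_token(index):
    """Map an image-vocabulary index into the shared id space."""
    if not 0 <= index < IMAGE_VOCAB_SIZE:
        raise ValueError("image index out of range")
    return NUM_SPECIAL + TEXT_VOCAB_SIZE + index


def build_sequence(text_ids, image_ids, direction):
    """Lay out one training sequence: source modality, SEP, target modality."""
    text = [text_token(i) for i in text_ids]
    image = [image_token(i) for i in image_ids]
    if direction == "text-to-image":
        source, target = text, image
    elif direction == "image-to-text":
        source, target = image, text
    else:
        raise ValueError("unknown direction: " + direction)
    return [BOS] + source + [SEP] + target + [EOS]


t2i = build_sequence([5, 7], [0, 3], "text-to-image")
i2t = build_sequence([5, 7], [0, 3], "image-to-text")
print(t2i)
print(i2t)
```

Under these assumptions, a single sequence model sees the same shared id space for both tasks, and only the order of the two modalities changes with the generation direction.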


Pages: 1138-1147

Number of pages: 10
