Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Cited by: 22
Authors
Huang, Yupan [1 ]
Xue, Hongwei [2 ]
Liu, Bei [3 ]
Lu, Yutong [1 ]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
cross-modal generation; image captioning; text-to-image synthesis; LANGUAGE;
DOI
10.1145/3474085.3481540
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Existing works typically design a separate task-specific model for each task, which imposes expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, where we represent images and text as unified sequences of tokens, and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves on the previous Transformer-based model X-LXMERT, lowering FID from 37.0 to 29.9 (lower is better) for text-to-image generation and raising the CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
Pages: 1138-1147
Page count: 10