Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Cited by: 22
Authors
Huang, Yupan [1 ]
Xue, Hongwei [2 ]
Liu, Bei [3 ]
Lu, Yutong [1 ]
Affiliations
[1] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Microsoft Res Asia, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
cross-modal generation; image captioning; text-to-image synthesis; language
DOI
10.1145/3474085.3481540
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Existing works typically design two separate task-specific models, one for each task, which imposes expensive design effort. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt the Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation, representing images and text as unified sequences of tokens from which the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves the FID of the previous Transformer-based model X-LXMERT from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves the CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
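The unified-sequence formulation described above lends itself to a short illustration. The following is a minimal, hypothetical PyTorch sketch, not the authors' released code: it assumes text is tokenized into a word vocabulary and the image is discretized into a grid of visual codebook indices (in the spirit of X-LXMERT's clustered grid features), so that both directions reduce to autoregressive next-token prediction over one shared sequence. All vocabulary sizes, grid sizes, and model dimensions below are illustrative assumptions.

```python
# Hypothetical sketch of the unified bi-directional formulation in the
# abstract. Names, sizes, and the tokenization scheme are assumptions for
# illustration, not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 30000   # assumed word-piece vocabulary size
IMAGE_VOCAB = 10000  # assumed visual codebook size (e.g. clustered grid features)
GRID_TOKENS = 8 * 8  # assumed 8x8 grid of discrete visual tokens per image


class UnifiedTransformer(nn.Module):
    """A single Transformer over a shared text+image token space."""

    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        # Text ids occupy [0, TEXT_VOCAB); image ids are offset after them,
        # so one embedding table and one output head cover both modalities.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        n = tokens.size(1)
        # Causal mask: each position attends only to earlier positions,
        # enabling autoregressive generation over the unified sequence.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=mask)
        return self.head(h)  # logits over the joint text+image vocabulary


model = UnifiedTransformer()
# Text-to-image direction: caption tokens first, image tokens as targets.
# Image-to-text is the same model with the two spans swapped.
text = torch.randint(0, TEXT_VOCAB, (2, 16))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, GRID_TOKENS))
sequence = torch.cat([text, image], dim=1)
logits = model(sequence)
# Position t predicts token t+1, so logits at [T-1, T+G-1) predict the G
# image tokens; next-token cross-entropy trains text-to-image generation.
T = text.size(1)
loss = F.cross_entropy(
    logits[:, T - 1 : -1].reshape(-1, logits.size(-1)),
    image.reshape(-1),
)
loss.backward()
```

Note that the paper's two-level granularity feature representations and sequence-level training objective are not reflected in this sketch; it only illustrates the shared-token-space idea.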
Pages: 1138-1147
Number of pages: 10
Related Papers
50 records in total (first 10 shown)
  • [1] The Bi-directional Framework for Unifying Parametric Image Alignment Approaches
    Megret, Remi
    Authesserre, Jean-Baptiste
    Berthoumieu, Yannick
    COMPUTER VISION - ECCV 2008, PT III, PROCEEDINGS, 2008, 5304 : 400 - 411
  • [2] Bengali Text generation Using Bi-directional RNN
    Abujar, Sheikh
    Masum, Abu Kaisar Mohammad
    Chowdhury, S. M. Mazharul Hoque
    Hasan, Mahmudul
    Hossain, Syed Akhter
    2019 10TH INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING TECHNOLOGIES (ICCCNT), 2019,
  • [3] Bi-directional Adapter for Multimodal Tracking
    Cao, Bing
    Guo, Junliang
    Zhu, Pengfei
    Hu, Qinghua
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 2, 2024, : 927 - 935
  • [4] A Modular Bi-Directional Power Electronic Transformer
    Gao, Zhigang
    Fan, Hui
    JOURNAL OF POWER ELECTRONICS, 2016, 16 (02) : 399 - 413
  • [5] Bi-Directional Image-to-Text Mapping for NLP-Based Schedule Generation and Computer Vision Progress Monitoring
    Nunez-Morales, Juan D.
    Jung, Yoonhwa
    Golparvar-Fard, Mani
    CONSTRUCTION RESEARCH CONGRESS 2024: ADVANCED TECHNOLOGIES, AUTOMATION, AND COMPUTER APPLICATIONS IN CONSTRUCTION, 2024, : 826 - 835
  • [6] Bi-Directional Spatial-Semantic Attention Networks for Image-Text Matching
    Huang, Feiran
    Zhang, Xiaoming
    Zhao, Zhonghua
    Li, Zhoujun
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2019, 28 (04) : 2008 - 2020
  • [7] Uniting Image and Text Deep Networks via Bi-directional Triplet Loss for Retrieval
    Hua, Yan
    Du, Jianhe
    PROCEEDINGS OF 2019 IEEE 9TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2019), 2019, : 297 - 300
  • [8] BiRGAN: Bi-directional Deep Image Retargeting
    Sun, Di
    Wang, Yunxiang
    Yang, Tingting
    Mei, Yijing
    Pan, Gang
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VII, ICIC 2024, 2024, 14868 : 113 - 123
  • [9] Image Fusion Using Bi-directional Similarity
    Bai Chunshan
    Luo Xiaoyan
    HOLOGRAPHY: ADVANCES AND MODERN TRENDS IV, 2015, 9508
  • [10] Speech-to-Gesture Generation Using Bi-directional LSTM Network
    Kaneko N.
    Takeuchi K.
    Hasegawa D.
    Shirakawa S.
    Sakuta H.
    Sumi K.
    Transactions of the Japanese Society for Artificial Intelligence, 2019, 34 (06):