Unifying Multimodal Transformer for Bi-directional Image and Text Generation

Cited by: 22
Authors
Huang, Yupan [1]
Xue, Hongwei [2]
Liu, Bei [3]
Lu, Yutong [1]
Affiliations
[1] Sun Yat-sen University, Guangzhou, Guangdong, People's Republic of China
[2] University of Science and Technology of China, Hefei, People's Republic of China
[3] Microsoft Research Asia, Beijing, People's Republic of China
Source
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021 | 2021
Keywords
cross-modal generation; image captioning; text-to-image synthesis; LANGUAGE
DOI
10.1145/3474085.3481540
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Existing works typically design a separate task-specific model for each task, which imposes expensive design effort. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt the Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, in which images and text are represented as unified sequences of tokens and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves on the previous Transformer-based model X-LXMERT, reducing FID from 37.0 to 29.9 (lower is better) for text-to-image generation and raising the CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.
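The sequence formulation described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that images have already been mapped to discrete token ids and that text and image ids share one vocabulary by offsetting the image ids. The vocabulary sizes, the `SEP`/`EOS` tokens, and the function names are hypothetical and are not taken from the paper or its released code.

```python
from typing import List

# All sizes and special tokens below are hypothetical, chosen for illustration.
TEXT_VOCAB_SIZE = 1000    # assumed number of text token ids
IMAGE_VOCAB_SIZE = 512    # assumed number of discrete image token ids
SEP = TEXT_VOCAB_SIZE + IMAGE_VOCAB_SIZE  # separates source from target
EOS = SEP + 1                             # marks the end of the target


def image_to_shared(image_ids: List[int]) -> List[int]:
    """Shift image token ids so they do not collide with text token ids."""
    for i in image_ids:
        if not 0 <= i < IMAGE_VOCAB_SIZE:
            raise ValueError(f"image token id out of range: {i}")
    return [TEXT_VOCAB_SIZE + i for i in image_ids]


def build_sequence(text_ids: List[int], image_ids: List[int], task: str) -> List[int]:
    """Build one token sequence: source tokens, SEP, target tokens, EOS."""
    text = list(text_ids)
    image = image_to_shared(image_ids)
    if task == "text-to-image":
        source, target = text, image
    elif task == "image-to-text":
        source, target = image, text
    else:
        raise ValueError(f"unknown task: {task}")
    return source + [SEP] + target + [EOS]


def target_mask(sequence: List[int]) -> List[bool]:
    """Mark positions after SEP, i.e. the tokens a model would be trained to generate."""
    sep_index = sequence.index(SEP)
    return [position > sep_index for position in range(len(sequence))]
```

Under these assumptions, the same model input format serves both directions: only the order of the text and image segments changes, and `target_mask` selects the segment to be generated.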
Pages: 1138-1147
Number of pages: 10