MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Cited by: 0
Authors
Wang, Junjie [1 ]
Ji, Yatai [2 ]
Sun, Jiaqi [2 ]
Yang, Yujiu [2 ]
Sakai, Tetsuya [1 ]
Affiliations
[1] Waseda Univ, Shinjuku, Tokyo, Japan
[2] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Guangdong, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021 | 2021
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or used only as labels for classification. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently exploit the inter-modality information among answers, questions, and images, while ignoring intra-modality information. Motivated by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
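To make the notion of trilinear interaction concrete, the sketch below shows one minimal way such an interaction among answer, question, and image features can be computed, in the spirit of CTI (Do et al., 2019): a joint attention tensor over all (region, question-token, answer-token) triples, used to pool each modality. All shapes, variable names, and the final fusion step are illustrative assumptions, not the actual MIRTT architecture.

```python
import numpy as np

# Hypothetical minimal sketch of a trilinear interaction: one feature
# matrix per modality -- image regions V, question tokens Q, answer
# tokens A -- combined into a joint attention tensor. Illustrative only.
rng = np.random.default_rng(0)
d = 8                              # shared (assumed) feature dimension
V = rng.standard_normal((3, d))    # 3 image regions
Q = rng.standard_normal((4, d))    # 4 question tokens
A = rng.standard_normal((2, d))    # 2 answer tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Trilinear score M[i, j, k] = sum_d V[i,d] * Q[j,d] * A[k,d],
# normalized with a softmax over all triples.
M = np.einsum('id,jd,kd->ijk', V, Q, A)
W = softmax(M.ravel()).reshape(M.shape)

# Pool each modality by the joint attention mass on its items.
v = np.einsum('ijk,id->d', W, V)   # attended image summary
q = np.einsum('ijk,jd->d', W, Q)   # attended question summary
a = np.einsum('ijk,kd->d', W, A)   # attended answer summary

joint = v * q * a                  # one simple fusion of the summaries
print(joint.shape)                 # (8,)
```

MIRTT replaces this single fixed attention tensor with transformer-style attention layers so that intra-modality relationships (token-to-token, region-to-region) are modeled alongside the inter-modality ones.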
Pages: 2280-2292
Page count: 13
Related papers
50 records in total
  • [1] Compact Trilinear Interaction for Visual Question Answering
    Tuong Do
    Thanh-Toan Do
    Huy Tran
    Tjiputra, Erman
    Tran, Quang D.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 392 - 401
  • [2] Trilinear Distillation Learning and Question Feature Capturing for Medical Visual Question Answering
    Long, Shaopei
    Li, Yong
    Weng, Heng
    Tang, Buzhou
    Wang, Fu Lee
    Hao, Tianyong
    NEURAL COMPUTING FOR ADVANCED APPLICATIONS, NCAA 2024, PT III, 2025, 2183 : 162 - 177
  • [3] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [4] LEARNING REPRESENTATIONS FROM EXPLAINABLE AND CONNECTIONIST APPROACHES FOR VISUAL QUESTION ANSWERING
    Mishra, Aakansha
    Soumitri, Miriyala Srinivas
    Rajendiran, Vikram N.
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6420 - 6424
  • [5] Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering
    Gao, Ling
    Zhang, Hongda
    Sheng, Nan
    Shi, Lida
    Xu, Hao
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [6] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [7] Fusing Visual and Textual Representations via Multi-layer Fusing Transformers for Vietnamese Visual Question Answering
    Cong Phu Nguyen
    Huy Tien Nguyen
    Tung Le
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2024, PT II, 2024, 2166 : 185 - 196
  • [8] Faithful Multimodal Explanation for Visual Question Answering
    Wu, Jialin
    Mooney, Raymond J.
    BLACKBOXNLP WORKSHOP ON ANALYZING AND INTERPRETING NEURAL NETWORKS FOR NLP AT ACL 2019, 2019, : 103 - 112
  • [9] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [10] Adaptive Transformers for Learning Multimodal Representations
    Bhargava, Prajjwal
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 1 - 7