MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Cited by: 0
Authors
Wang, Junjie [1 ]
Ji, Yatai [2 ]
Sun, Jiaqi [2 ]
Yang, Yujiu [2 ]
Sakai, Tetsuya [1 ]
Affiliations
[1] Waseda Univ, Shinjuku, Tokyo, Japan
[2] Tsinghua Univ, Grad Sch Shenzhen, Shenzhen, Guangdong, Peoples R China
Source
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021 | 2021
Funding
National Natural Science Foundation of China
Keywords
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions. As a result, the answers are either spliced into the questions or used only as labels for classification. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently exploit the inter-modality information among answers, questions, and images, while ignoring intra-modality information. Motivated by these observations, we propose a new trilinear interaction framework called MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), incorporating attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
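To make the notion of trilinear interaction concrete, the sketch below shows one minimal way such an interaction among answer, question, and image features can be computed, in the spirit of CTI (Do et al., 2019): a joint attention tensor over all (region, question-token, answer-token) triples, used to pool each modality. All shapes, variable names, and the final fusion step are illustrative assumptions, not the actual MIRTT architecture.

```python
import numpy as np

# Hypothetical minimal sketch of a trilinear interaction: one feature
# matrix per modality -- image regions V, question tokens Q, answer
# tokens A -- combined into a joint attention tensor. Illustrative only.
rng = np.random.default_rng(0)
d = 8                              # shared (assumed) feature dimension
V = rng.standard_normal((3, d))    # 3 image regions
Q = rng.standard_normal((4, d))    # 4 question tokens
A = rng.standard_normal((2, d))    # 2 answer tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Trilinear score M[i, j, k] = sum_d V[i,d] * Q[j,d] * A[k,d],
# normalized with a softmax over all triples.
M = np.einsum('id,jd,kd->ijk', V, Q, A)
W = softmax(M.ravel()).reshape(M.shape)

# Pool each modality by the joint attention mass on its items.
v = np.einsum('ijk,id->d', W, V)   # attended image summary
q = np.einsum('ijk,jd->d', W, Q)   # attended question summary
a = np.einsum('ijk,kd->d', W, A)   # attended answer summary

joint = v * q * a                  # one simple fusion of the summaries
print(joint.shape)                 # (8,)
```

MIRTT replaces this single fixed attention tensor with transformer-style attention layers so that intra-modality relationships (token-to-token, region-to-region) are modeled alongside the inter-modality ones.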
Pages: 2280-2292
Page count: 13
Related papers
50 records in total
  • [1] Compact Trilinear Interaction for Visual Question Answering
    Tuong Do
    Thanh-Toan Do
    Huy Tran
    Tjiputra, Erman
    Tran, Quang D.
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 392 - 401
  • [2] Trilinear Distillation Learning and Question Feature Capturing for Medical Visual Question Answering
    Long, Shaopei
    Li, Yong
    Weng, Heng
    Tang, Buzhou
    Wang, Fu Lee
    Hao, Tianyong
    NEURAL COMPUTING FOR ADVANCED APPLICATIONS, NCAA 2024, PT III, 2025, 2183 : 162 - 177
  • [3] Multimodal Learning and Reasoning for Visual Question Answering
    Ilievski, Ilija
    Feng, Jiashi
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [4] LEARNING REPRESENTATIONS FROM EXPLAINABLE AND CONNECTIONIST APPROACHES FOR VISUAL QUESTION ANSWERING
    Mishra, Aakansha
    Soumitri, Miriyala Srinivas
    Rajendiran, Vikram N.
    2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2024, 2024, : 6420 - 6424
  • [5] Learning neighbor-enhanced region representations and question-guided visual representations for visual question answering
    Gao, Ling
    Zhang, Hongda
    Sheng, Nan
    Shi, Lida
    Xu, Hao
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [6] Multimodal Attention for Visual Question Answering
    Kodra, Lorena
    Mece, Elinda Kajo
    INTELLIGENT COMPUTING, VOL 1, 2019, 858 : 783 - 792
  • [7] Fusing Visual and Textual Representations via Multi-layer Fusing Transformers for Vietnamese Visual Question Answering
    Cong Phu Nguyen
    Huy Tien Nguyen
    Tung Le
    ADVANCES IN COMPUTATIONAL COLLECTIVE INTELLIGENCE, ICCCI 2024, PT II, 2024, 2166 : 185 - 196
  • [8] Faithful Multimodal Explanation for Visual Question Answering
    Wu, Jialin
    Mooney, Raymond J.
    BLACKBOXNLP WORKSHOP ON ANALYZING AND INTERPRETING NEURAL NETWORKS FOR NLP AT ACL 2019, 2019, : 103 - 112
  • [9] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [10] Adaptive Transformers for Learning Multimodal Representations
    Bhargava, Prajjwal
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 1 - 7