Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

Cited by: 241
Authors
Duy-Kien Nguyen [1]
Okatani, Takayuki [1,2]
Institutions
[1] Tohoku Univ, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Wako, Saitama, Japan
DOI
10.1109/CVPR.2018.00637
CLC classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A key challenge in visual question answering (VQA) is how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations: each question word attends to image regions and each image region attends to question words. The architecture can be stacked to form a hierarchy that models multi-step interactions between an image-question pair. Experiments show that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating that the proposed attention mechanism generates reasonable attention maps over images and questions, leading to correct answer predictions.
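The abstract describes the co-attention mechanism only at a high level. As a minimal illustrative sketch in PyTorch, assuming a single bilinear affinity weight, 512-dimensional features, and hypothetical names (DenseSymmetricCoAttention and its parameters are this editor's inventions), one layer of dense symmetric co-attention could look like the following; it is a reconstruction under those assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSymmetricCoAttention(nn.Module):
    """Hypothetical sketch of one co-attention layer: every question word
    attends over all image regions, and every image region attends over
    all question words (not the authors' released code)."""

    def __init__(self, dim):
        super().__init__()
        # Bilinear weight producing the word-region affinity matrix (assumed form).
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, v, q):
        # v: (batch, num_regions, dim) image-region features
        # q: (batch, num_words,   dim) question-word features
        # Affinity between every word and every region: (batch, words, regions)
        a = torch.bmm(self.affinity(q), v.transpose(1, 2))
        # Each word attends over regions (softmax along the region axis) ...
        attn_over_regions = F.softmax(a, dim=2)
        # ... and each region attends over words (softmax along the word axis).
        attn_over_words = F.softmax(a, dim=1)
        # Attended summaries of the other modality.
        q_ctx = torch.bmm(attn_over_regions, v)                  # words gather regions
        v_ctx = torch.bmm(attn_over_words.transpose(1, 2), q)    # regions gather words
        # Residual fusion keeps the layer stackable into a hierarchy.
        return v + v_ctx, q + q_ctx

# Stacking layers yields the multi-step interaction hierarchy from the abstract.
layers = nn.ModuleList(DenseSymmetricCoAttention(512) for _ in range(3))
v = torch.randn(2, 36, 512)   # e.g. 36 object-region features per image
q = torch.randn(2, 14, 512)   # e.g. 14 word features per question
for layer in layers:
    v, q = layer(v, q)

Both softmaxes are taken over the same word-region affinity matrix, which is what makes the interaction symmetric and dense: normalizing along the region axis lets every word attend to all regions, while normalizing along the word axis lets every region attend to all words; the residual connections allow the layers to be stacked.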
Pages: 6087-6096
Page count: 10
Related papers (50 items in total)
  • [41] Fusing Attention with Visual Question Answering
    Burt, Ryan
    Cudic, Mihael
    Principe, Jose C.
    2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 949 - 953
  • [42] Transformer Gate Attention Model: An Improved Attention Model for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [43] Multimodal feature fusion by relational reasoning and attention for visual question answering
    Zhang, Weifeng
    Yu, Jing
    Hu, Hua
    Hu, Haiyang
    Qin, Zengchang
    INFORMATION FUSION, 2020, 55 : 116 - 126
  • [44] Language and Visual Relations Encoding for Visual Question Answering
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Lu, Hanqing
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3307 - 3311
  • [45] MDAnet: Multiple Fusion Network with Double Attention for Visual Question Answering
    Feng, Junyi
    Gong, Ping
    Qiu, Guanghui
    ICVIP 2019: PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON VIDEO AND IMAGE PROCESSING, 2019, : 143 - 147
  • [47] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Yan, Feng
    Silamu, Wushouer
    Chai, Yachuang
    Li, Yanbing
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 7085 - 7096
  • [48] Visual Question Answering with Textual Representations for Images
    Hirota, Yusuke
    Garcia, Noa
    Otani, Mayu
    Chu, Chenhui
    Nakashima, Yuta
    Taniguchi, Ittetsu
    Onoye, Takao
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021), 2021, : 3147 - 3150
  • [49] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [50] Collaborative Modality Fusion for Mitigating Language Bias in Visual Question Answering
    Lu, Qiwen
    Chen, Shengbo
    Zhu, Xiaoke
    JOURNAL OF IMAGING, 2024, 10 (03)