From Pixels to Objects: Cubic Visual Attention for Visual Question Answering

Cited by: 0

Authors:

Song, Jingkuan

Zeng, Pengpeng

Gao, Lianli [1]

Shen, Heng Tao [1]

Affiliations:

[1] Univ Elect Sci & Technol China, Ctr Future Media, Chengdu 611731, Peoples R China

Source:

PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE | 2018

Funding:

National Natural Science Foundation of China;

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, attention-based Visual Question Answering (VQA) has achieved great success by using the question to selectively attend to the visual areas that are related to the answer. Existing visual attention models are generally planar, i.e., the different channels of an image's last conv-layer feature map share the same weight. This conflicts with the attention mechanism, because CNN features are naturally both spatial and channel-wise. In addition, visual attention is usually computed at the pixel level, which may cause region discontinuity problems. In this paper we propose a Cubic Visual Attention (CVA) model that applies a novel channel and spatial attention to object regions to improve the VQA task. Specifically, instead of attending to pixels, we first use object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we use the question to guide the computation of channel attention and spatial attention on the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of the proposed CVA on three public image QA datasets: COCO-QA, VQA and Visual7W. Experimental results show that the proposed method significantly outperforms the state of the art.
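
The pipeline described in the abstract (question-guided channel attention followed by attention over object regions) can be illustrated with a minimal sketch. This is not the authors' formulation: the scoring functions, the projection matrix `w_q`, and the function name `cubic_attention` are all hypothetical choices made for illustration, and real object features would come from an object proposal network rather than hand-written lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cubic_attention(object_feats, question, w_q):
    """Illustrative channel + object attention (assumed scoring, not the paper's).

    object_feats: K object regions, each a list of C channel values.
    question:     question embedding of length D.
    w_q:          hypothetical C x D projection of the question into channel space.
    """
    num_objects = len(object_feats)
    num_channels = len(object_feats[0])
    q_proj = matvec(w_q, question)  # length C

    # Channel attention: one weight per channel, shared by all objects.
    channel_desc = [
        sum(obj[c] for obj in object_feats) / num_objects
        for c in range(num_channels)
    ]
    beta = softmax([math.tanh(d + q) for d, q in zip(channel_desc, q_proj)])
    reweighted = [[b * x for b, x in zip(beta, obj)] for obj in object_feats]

    # Object-level (spatial) attention: one weight per object region.
    alpha = softmax(
        [sum(x * q for x, q in zip(obj, q_proj)) for obj in reweighted]
    )

    # Attended visual feature: attention-weighted sum over object regions.
    attended = [
        sum(a * obj[c] for a, obj in zip(alpha, reweighted))
        for c in range(num_channels)
    ]
    return beta, alpha, attended


# Toy usage with made-up numbers: 3 objects, 2 channels, 3-dim question.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question = [1.0, 0.5, -0.5]
w_q = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
beta, alpha, attended = cubic_attention(feats, question, w_q)
print(beta, alpha, attended)
```

In this sketch the channel weights `beta` and the object weights `alpha` each sum to 1, and the returned `attended` vector is the feature that would then be combined with the question to infer the answer.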


Pages: 906-912

Page count: 7
