Dual-feature collaborative relation-attention networks for visual question answering

Cited by: 1
Authors
Yao, Lu [1 ]
Yang, You [1 ,2 ]
Hu, Juntao [1 ]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Visual question answering; Region feature; Grid feature; Relation attention; Positional encoding;
DOI
10.1007/s13735-023-00283-8
CLC Number
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Region and grid features extracted by object detection networks contain abundant image information and are widely used in visual question answering (VQA). Region features capture object-level information, whereas grid features better represent contextual information and fine-grained image attributes. However, most existing VQA models process visual information with one-way attention, failing to capture the internal relations between objects or to analyze feature details. To address this issue, we propose a novel multi-level collaborative decoder (MLCD) layer, built on the encoder-decoder framework, that incorporates visual location vectors into attention. Specifically, each MLCD layer is equipped with three different attention-MLP sub-modules that progressively and accurately mine the intrinsic interactions of features and strengthen the influence of image content on prediction results. Additionally, to fully exploit the respective advantages of the two feature types, we propose a novel relativity-augmented cross-attention (RACA) unit and add it to the MLCD layer; in this unit, features that have passed through simple attention are complementarily augmented with global information and self-attributes. To validate the proposed methods, we stack MLCD layers deeply to constitute our dual-feature collaborative relation-attention network (DFCRAN). Extensive experiments and visualizations on three benchmark datasets (COCO-QA, VQA 1.0, and VQA 2.0) demonstrate the effectiveness of our model, which achieves performance competitive with state-of-the-art single models trained without pre-training.
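The following is a minimal PyTorch sketch of the dual-feature cross-attention idea the abstract describes: region and grid features attend to each other, and each stream is then complementarily augmented with pooled global information and its own pre-attention self-attributes via a learned gate. All module names, dimensions, and the gating scheme are assumptions for illustration, not the authors' RACA/MLCD implementation.

```python
# Hypothetical sketch of a relativity-augmented cross-attention (RACA) unit.
# Region features attend to grid features (and vice versa); each stream is
# then complementarily augmented with a pooled global vector and its own
# pre-attention "self-attributes" through a sigmoid gate. Dimensions, gating,
# and naming are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class RACASketch(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention in both directions between the two feature streams.
        self.region_to_grid = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.grid_to_region = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gates that mix attended output, global context, and self-attributes.
        self.region_gate = nn.Linear(3 * dim, dim)
        self.grid_gate = nn.Linear(3 * dim, dim)
        self.norm_r = nn.LayerNorm(dim)
        self.norm_g = nn.LayerNorm(dim)

    @staticmethod
    def _augment(attended, original, gate, norm):
        # Global information: mean-pool the attended sequence, broadcast back.
        global_ctx = attended.mean(dim=1, keepdim=True).expand_as(attended)
        # Fuse attended features, global context, and self-attributes (the
        # original, pre-attention features) through a sigmoid gate.
        g = torch.sigmoid(gate(torch.cat([attended, global_ctx, original], dim=-1)))
        return norm(g * attended + (1.0 - g) * original)

    def forward(self, region_feats, grid_feats):
        # region_feats: (B, N_r, dim); grid_feats: (B, N_g, dim)
        r_att, _ = self.region_to_grid(region_feats, grid_feats, grid_feats)
        g_att, _ = self.grid_to_region(grid_feats, region_feats, region_feats)
        region_out = self._augment(r_att, region_feats, self.region_gate, self.norm_r)
        grid_out = self._augment(g_att, grid_feats, self.grid_gate, self.norm_g)
        return region_out, grid_out


if __name__ == "__main__":
    raca = RACASketch()
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected regions
    grids = torch.randn(2, 49, 512)    # e.g. a 7x7 grid of features
    r, g = raca(regions, grids)
    print(r.shape, g.shape)            # torch.Size([2, 36, 512]), torch.Size([2, 49, 512])
```

In this reading, the gate decides per position how much of the cross-attended signal to keep versus the stream's original features, which is one plausible way to realize the "complementary augmentation using global information and self-attributes" that the abstract mentions.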
Pages: 15
Related Papers
50 records in total
  • [1] Dual-feature collaborative relation-attention networks for visual question answering
    Yao, Lu
    Yang, You
    Hu, Juntao
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12
  • [2] Feature Enhancement in Attention for Visual Question Answering
    Lin, Yuetan
    Pang, Zhangyang
    Wang, Donghui
    Zhuang, Yueting
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4216 - 4222
  • [3] Feature Fusion Attention Visual Question Answering
    Wang, Chunlin
    Sun, Jianyong
    Chen, Xiaolin
    ICMLC 2019: 2019 11TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND COMPUTING, 2019, : 412 - 416
  • [4] Dual Self-Guided Attention with Sparse Question Networks for Visual Question Answering
    Shen, Xiang
    Han, Dezhi
    Chang, Chin-Chen
    Zong, Liang
    IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2022, E105D (04) : 785 - 796
  • [5] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
PATTERN RECOGNITION, 2021, 117
  • [6] Collaborative Attention Network to Enhance Visual Question Answering
    Gu, Rui
    BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY, 2019, 124 : 304 - 305
  • [7] Dual Attention and Question Categorization-Based Visual Question Answering
    Mishra, A.
    Anand, A.
    Guha, P.
    IEEE TRANSACTIONS ON ARTIFICIAL INTELLIGENCE, 2023, 4 (01): 81 - 91
  • [8] Multi-modal co-attention relation networks for visual question answering
    Guo, Zihan
    Han, Dezhi
    VISUAL COMPUTER, 2023, 39 (11): 5783 - 5795
  • [9] Dual-Branch Collaborative Learning for Visual Question Answering
    Tian, Weidong
    Zhao, Junxiang
    Xu, Wenzheng
    Zhao, Zhongqiu
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT III, ICIC 2024, 2024, 14864 : 96 - 107