Dual-feature collaborative relation-attention networks for visual question answering

Times Cited: 1
Authors
Yao, Lu [1]
Yang, You [1,2]
Hu, Juntao [1]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Visual question answering; Region feature; Grid feature; Relation attention; Positional encoding;
DOI
10.1007/s13735-023-00283-8
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Region and grid features extracted by object detection networks contain abundant image information and are widely used in visual question answering (VQA). Region features focus on object-level information, whereas grid features are better at representing contextual information and fine-grained attributes of images. However, most existing VQA models process visual information with one-way attention, failing to capture the internal relations between objects or to analyze feature details. To address this issue, we propose a novel multi-level collaborative decoder (MLCD) layer based on the encoder-decoder framework, which incorporates visual location vectors into the attention computation. Specifically, each MLCD is equipped with three different attention-MLP sub-modules that progressively and accurately mine the intrinsic interactions of features and enhance the influence of image content on prediction results. Additionally, to fully exploit the respective advantages of the two features, we propose a novel relativity-augmented cross-attention (RACA) unit and add it to the MLCD, in which the attended features are complementarily augmented with global information and their own attributes. To validate the proposed methods, we stack MLCD layers deeply to constitute our dual-feature collaborative relation-attention network (DFCRAN). Extensive experiments and visualizations on three benchmark datasets (COCO-QA, VQA 1.0, and VQA 2.0) demonstrate the effectiveness of our model, which achieves competitive performance compared with state-of-the-art single models without pre-training.
Pages: 15
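The abstract above describes a cross-attention unit (RACA) in which region and grid features complement each other using global context and their own attributes, but it gives no implementation details. The following is a minimal, hypothetical PyTorch-style sketch of what such a cross-feature attention unit might look like; every module name, dimension, and the gating scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: one visual stream (e.g. region features) queries the
# other (e.g. grid features), and the attended context is mixed with a global
# summary of the context stream via a learned gate. Names and layout are assumed.
import torch
import torch.nn as nn


class CrossFeatureAttention(nn.Module):
    """Cross-attention where one visual stream queries the other and is gated
    against a global summary of that other stream."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate mixing the attended context with the global context summary.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Nq, dim), e.g. region features
        # context_feats: (B, Nc, dim), e.g. grid features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        # Global summary of the context stream, broadcast to every query position.
        global_ctx = context_feats.mean(dim=1, keepdim=True).expand_as(attended)
        g = self.gate(torch.cat([attended, global_ctx], dim=-1))
        # Keep the stream's own attributes (residual) and add gated cross-stream context.
        return self.norm(query_feats + g * attended + (1.0 - g) * global_ctx)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected object regions
    grids = torch.randn(2, 49, 512)    # e.g. 7x7 grid features
    unit = CrossFeatureAttention()
    print(unit(regions, grids).shape)  # torch.Size([2, 36, 512])
```

In this sketch the residual connection preserves each stream's self-attributes while the sigmoid gate decides, per position, how much cross-stream and global information to blend in; the paper's actual RACA unit and its placement inside the MLCD layer may differ.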