Dual-feature collaborative relation-attention networks for visual question answering

Times Cited: 1
Authors
Yao, Lu [1]
Yang, You [1,2]
Hu, Juntao [1]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Visual question answering; Region feature; Grid feature; Relation attention; Positional encoding;
DOI
10.1007/s13735-023-00283-8
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Region and grid features extracted by object detection networks contain abundant image information and are widely used in visual question answering (VQA). Region features focus on object-level information, whereas grid features are better at representing contextual information and fine-grained attributes of images. However, most existing VQA models process visual information with one-way attention, failing to capture the internal relations between objects or to analyze feature details. To address this issue, we propose a novel multi-level collaborative decoder (MLCD) layer based on the encoder-decoder framework, which incorporates visual location vectors into the attention computation. Specifically, each MLCD is equipped with three different attention-MLP sub-modules that progressively and accurately mine the intrinsic interactions of features and enhance the influence of image content on prediction results. Additionally, to fully exploit the respective advantages of the two features, we propose a novel relativity-augmented cross-attention (RACA) unit and add it to the MLCD, in which the attended features are complementarily augmented with global information and their own attributes. To validate the proposed methods, we stack MLCD layers deeply to constitute our dual-feature collaborative relation-attention network (DFCRAN). Extensive experiments and visualizations on three benchmark datasets (COCO-QA, VQA 1.0, and VQA 2.0) demonstrate the effectiveness of our model, which achieves competitive performance compared with state-of-the-art single models without pre-training.
Pages: 15
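The abstract above describes a cross-attention unit (RACA) in which region and grid features complement each other using global context and their own attributes, but it gives no implementation details. The following is a minimal, hypothetical PyTorch-style sketch of what such a cross-feature attention unit might look like; every module name, dimension, and the gating scheme are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch only: one visual stream (e.g. region features) queries the
# other (e.g. grid features), and the attended context is mixed with a global
# summary of the context stream via a learned gate. Names and layout are assumed.
import torch
import torch.nn as nn


class CrossFeatureAttention(nn.Module):
    """Cross-attention where one visual stream queries the other and is gated
    against a global summary of that other stream."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Gate mixing the attended context with the global context summary.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (B, Nq, dim), e.g. region features
        # context_feats: (B, Nc, dim), e.g. grid features
        attended, _ = self.attn(query_feats, context_feats, context_feats)
        # Global summary of the context stream, broadcast to every query position.
        global_ctx = context_feats.mean(dim=1, keepdim=True).expand_as(attended)
        g = self.gate(torch.cat([attended, global_ctx], dim=-1))
        # Keep the stream's own attributes (residual) and add gated cross-stream context.
        return self.norm(query_feats + g * attended + (1.0 - g) * global_ctx)


if __name__ == "__main__":
    regions = torch.randn(2, 36, 512)  # e.g. 36 detected object regions
    grids = torch.randn(2, 49, 512)    # e.g. 7x7 grid features
    unit = CrossFeatureAttention()
    print(unit(regions, grids).shape)  # torch.Size([2, 36, 512])
```

In this sketch the residual connection preserves each stream's self-attributes while the sigmoid gate decides, per position, how much cross-stream and global information to blend in; the paper's actual RACA unit and its placement inside the MLCD layer may differ.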