Dual-decoder transformer network for answer grounding in visual question answering

Cited by: 6
Authors
Zhu, Liangjun [1 ]
Peng, Li [1 ]
Zhou, Weinan [1 ]
Yang, Jielong [1 ]
Affiliations
[1] Jiangnan Univ, Engn Res Ctr Internet Things Appl Technol, Wuxi 214122, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Answer grounding; Dual-decoder transformer;
DOI
10.1016/j.patrec.2023.04.003
CLC number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual Question Answering (VQA) has made stunning advances by exploiting Transformer architectures and large-scale visual-linguistic pretraining. State-of-the-art methods generally require large amounts of data and devices to predict textualized answers and fail to provide visualized evidence of the answers. To mitigate these limitations, we propose a novel dual-decoder Transformer network (DDTN) for predicting the language answer and the corresponding vision instance. Specifically, the linguistic features are first embedded by a Long Short-Term Memory (LSTM) block and a Transformer encoder, which are shared between the two Transformer decoders. Then, we introduce an object detector to obtain vision region features and grid features, reducing the size and cost of DDTN. These visual features are combined with the linguistic features and are respectively fed into the two decoders. Moreover, we design an instance query to guide the fused visual-linguistic features toward outputting the instance mask or bounding box. Finally, the classification layers aggregate the results from both decoders and predict the answer as well as the corresponding instance coordinates. Without bells and whistles, DDTN achieves state-of-the-art performance and is even competitive with pretrained models on the VizWizGround and GQA datasets. The code is available at https://github.com/zlj63501/DDTN. (c) 2023 Published by Elsevier B.V.
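The abstract describes the pipeline only at a high level; the following is a minimal PyTorch sketch of that dual-decoder structure, assuming illustrative module sizes, a single learned instance query, and simple concatenation for visual-linguistic fusion. None of the class names, dimensions, or fusion details come from the paper; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of a dual-decoder VQA model as outlined in the abstract.
# All names, sizes, and the fusion scheme are illustrative assumptions,
# not the authors' implementation (see https://github.com/zlj63501/DDTN).
import torch
import torch.nn as nn


class DualDecoderVQA(nn.Module):
    def __init__(self, vocab_size=10000, num_answers=3000, d_model=512, n_heads=8):
        super().__init__()
        # Question encoding: word embedding -> LSTM -> Transformer encoder,
        # shared by both decoder branches.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

        # Two decoders: one attends to detector region features (answer branch),
        # one attends to grid features plus a learned instance query (grounding branch).
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.answer_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.ground_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.instance_query = nn.Parameter(torch.randn(1, 1, d_model))

        # Classification heads: answer logits and box coordinates (cx, cy, w, h).
        self.answer_head = nn.Linear(d_model, num_answers)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, question_ids, region_feats, grid_feats):
        # question_ids: (B, L) token ids; region_feats: (B, R, d); grid_feats: (B, G, d)
        q = self.embed(question_ids)
        q, _ = self.lstm(q)
        q = self.encoder(q)                                  # shared linguistic memory

        # Answer branch: linguistic tokens attend to the detected regions.
        ans_tokens = self.answer_decoder(tgt=q, memory=region_feats)
        answer_logits = self.answer_head(ans_tokens.mean(dim=1))

        # Grounding branch: the instance query attends to grid features fused with language.
        query = self.instance_query.expand(q.size(0), -1, -1)
        vis_lang = torch.cat([grid_feats, q], dim=1)         # naive fusion by concatenation
        inst = self.ground_decoder(tgt=query, memory=vis_lang)
        box = self.box_head(inst.squeeze(1)).sigmoid()       # normalized box coordinates

        return answer_logits, box


if __name__ == "__main__":
    model = DualDecoderVQA()
    logits, box = model(
        torch.randint(0, 10000, (2, 12)),   # dummy question tokens
        torch.randn(2, 36, 512),            # dummy detector region features
        torch.randn(2, 49, 512),            # dummy grid features
    )
    print(logits.shape, box.shape)          # (2, 3000) (2, 4)
```

In this sketch the grounding branch regresses a bounding box only; the paper additionally predicts instance masks, which would require a pixel-level head on top of the grid features.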
Pages: 53-60
Number of pages: 8