Dual-decoder transformer network for answer grounding in visual question answering

Cited by: 6
Authors
Zhu, Liangjun [1 ]
Peng, Li [1 ]
Zhou, Weinan [1 ]
Yang, Jielong [1 ]
Affiliations
[1] Jiangnan Univ, Engn Res Ctr Internet Things Appl Technol, Wuxi 214122, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Visual question answering; Answer grounding; Dual-decoder transformer;
DOI
10.1016/j.patrec.2023.04.003
CLC (Chinese Library Classification) number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Visual Question Answering (VQA) has made stunning advances by exploiting the Transformer architecture and large-scale visual-linguistic pretraining. State-of-the-art methods generally require large amounts of data and computing devices to predict textual answers, and they fail to provide visual evidence for those answers. To mitigate these limitations, we propose a novel dual-decoder Transformer network (DDTN) that predicts the language answer and the corresponding vision instance. Specifically, the linguistic features are first embedded by a Long Short-Term Memory (LSTM) block and a Transformer encoder, and are shared between the two Transformer decoders. Then, we introduce an object detector to obtain vision region features and grid features, reducing the size and cost of DDTN. These visual features are combined with the linguistic features and fed into the two decoders, respectively. Moreover, we design an instance query to guide the fused visual-linguistic features toward outputting the instance mask or bounding box. Finally, classification layers aggregate the results from both decoders and predict the answer as well as the corresponding instance coordinates. Without bells and whistles, DDTN achieves state-of-the-art performance and is even competitive with pretraining models on the VizWizGround and GQA datasets. The code is available at https://github.com/zlj63501/DDTN. (c) 2023 Published by Elsevier B.V.
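To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the dual-decoder layout: a shared LSTM-plus-Transformer-encoder language branch feeds two decoders, one attending over detector region features to produce the answer and one attending over grid features via a learned instance query to produce the grounding box. All dimensions, module choices, and the feature-fusion scheme here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

import torch
import torch.nn as nn

class DualDecoderVQA(nn.Module):
    """Illustrative dual-decoder Transformer for answer grounding.
    Sizes and fusion details are assumptions, not the paper's exact design."""

    def __init__(self, vocab_size=10000, d_model=256, n_answers=3000, n_queries=1):
        super().__init__()
        # Language branch: embedding + LSTM + Transformer encoder,
        # shared by both decoders (as described in the abstract)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Two decoders: one for the answer, one for grounding
        ans_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.answer_decoder = nn.TransformerDecoder(ans_layer, num_layers=4)
        grd_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.ground_decoder = nn.TransformerDecoder(grd_layer, num_layers=4)
        # Learned instance query that drives the grounding branch
        self.instance_query = nn.Parameter(torch.randn(n_queries, d_model))
        # Heads: answer classification and box regression (cx, cy, w, h)
        self.answer_head = nn.Linear(d_model, n_answers)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, question_ids, region_feats, grid_feats):
        # Encode the question; the encoded text is shared by both branches
        q, _ = self.lstm(self.embed(question_ids))      # (B, L, d)
        q = self.encoder(q)
        # Answer branch: region features fused with the text features
        ans_mem = torch.cat([region_feats, q], dim=1)
        ans = self.answer_decoder(q, ans_mem).mean(dim=1)
        # Grounding branch: instance query attends over grid + text features
        grd_mem = torch.cat([grid_feats, q], dim=1)
        query = self.instance_query.unsqueeze(0).expand(q.size(0), -1, -1)
        inst = self.ground_decoder(query, grd_mem)
        return self.answer_head(ans), self.box_head(inst).sigmoid()

A forward pass takes tokenized question IDs together with detector-extracted region and grid features (each projected to the model width), and returns answer logits alongside a normalized bounding box per instance query.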
Pages: 53-60
Page count: 8