Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Cited by: 5
Authors
Sharma, Himanshu [1 ,2 ]
Srivastava, Swati [1 ]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura, India
[2] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura 281406, UP, India
Source
IMAGING SCIENCE JOURNAL, 2021, Vol. 69, No. 1-4
Keywords
Visual question answering; co-attention; transformer; multimodal fusion; ENCRYPTION; ALGORITHM; IMAGES;
DOI
10.1080/13682199.2022.2153489
CLC number
TB8 [Photographic technology];
Subject classification code
0804 ;
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image to predict an answer to an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representational power of the text tokens by using FastText embeddings together with appearance, bounding-box, and PHOC features. Our model employs two-way co-attention, using self-attention and guided-attention mechanisms, to obtain discriminative image features. We compute the text-token position and combine this information with the predicted answer embedding for final answer generation. We achieved accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, our model achieved an ANLS score of 0.698 on the validation set and 0.686 on the test set.
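The two-way co-attention described in the abstract can be illustrated with a minimal sketch: each modality first attends over itself (self-attention), then attends over the other modality (guided attention). This is not the authors' implementation; learned projection matrices, multi-head structure, and feed-forward layers are omitted, and the feature dimensions are arbitrary placeholders.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: rows of Q attend over rows of K/V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def two_way_co_attention(text_feats, image_feats):
    # Step 1: self-attention within each modality
    text_sa = attention(text_feats, text_feats, text_feats)
    img_sa = attention(image_feats, image_feats, image_feats)
    # Step 2: guided attention -- each modality attends to the other,
    # in both directions (hence "two-way")
    text_ga = attention(text_sa, img_sa, img_sa)
    img_ga = attention(img_sa, text_sa, text_sa)
    return text_ga, img_ga

# Toy features: 5 text tokens and 8 image regions, 64-dim (placeholder sizes)
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((5, 64))
image_feats = rng.standard_normal((8, 64))
t_out, v_out = two_way_co_attention(text_feats, image_feats)
```

Each output keeps its own modality's token count (5 text tokens, 8 image regions) but is now conditioned on the other modality, after which the fused features would feed the answer-generation head.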
Pages: 177-189 (13 pages)
Related papers
Showing 10 of 50 records
  • [1] Integrating multimodal features by a two-way co-attention mechanism for visual question answering
    Sharma, Himanshu
    Srivastava, Swati
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 83 (21) : 59577 - 59595
  • [2] Enhancing visual question answering with a two-way co-attention mechanism and integrated multimodal features
    Agrawal, Mayank
    Jalal, Anand Singh
    Sharma, Himanshu
    COMPUTATIONAL INTELLIGENCE, 2024, 40 (01)
  • [3] Co-attention Network for Visual Question Answering Based on Dual Attention
    Dong, Feng
    Wang, Xiaofeng
    Oad, Ammar
    Talpur, Mir Sajjad Hussain
    JOURNAL OF ENGINEERING SCIENCE AND TECHNOLOGY REVIEW, 2021, 14 (06) : 116 - 123
  • [4] Multimodal feature-wise co-attention method for visual question answering
    Zhang, Sheng
    Chen, Min
    Chen, Jincai
    Zou, Fuhao
    Li, Yuan-Fang
    Lu, Ping
    INFORMATION FUSION, 2021, 73 : 1 - 10
  • [5] Multimodal Fusion with Co-attention Mechanism
    Li, Pei
    Li, Xinde
    PROCEEDINGS OF 2020 23RD INTERNATIONAL CONFERENCE ON INFORMATION FUSION (FUSION 2020), 2020, : 607 - 614
  • [6] Co-Attention Network With Question Type for Visual Question Answering
    Yang, Chao
    Jiang, Mengqi
    Jiang, Bin
    Zhou, Weixin
    Li, Keqin
    IEEE ACCESS, 2019, 7 : 40771 - 40781
  • [7] Dynamic Co-attention Network for Visual Question Answering
    Ebaid, Doaa B.
    Madbouly, Magda M.
    El-Zoghabi, Adel A.
    2021 8TH INTERNATIONAL CONFERENCE ON SOFT COMPUTING & MACHINE INTELLIGENCE (ISCMI 2021), 2021, : 125 - 129
  • [8] Sparse co-attention visual question answering networks based on thresholds
    Guo, Zihan
    Han, Dezhi
    APPLIED INTELLIGENCE, 2023, 53 (01) : 586 - 600
  • [10] A medical visual question answering approach based on co-attention networks
    Cui W.
    Shi W.
    Shao H.
    SHENGWU YIXUE GONGCHENGXUE ZAZHI / JOURNAL OF BIOMEDICAL ENGINEERING, 2024, 41 (03) : 560 - 568