Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Cited by: 5
Authors
Sharma, Himanshu [1]
Srivastava, Swati [1 ]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura 281406, UP, India
Source
IMAGING SCIENCE JOURNAL | 2021, Vol. 69, Issue 1-4
Keywords
Visual question answering; co-attention; transformer; multimodal fusion; ENCRYPTION; ALGORITHM; IMAGES;
DOI
10.1080/13682199.2022.2153489
Chinese Library Classification
TB8 [Photographic Technology]
Subject Classification Code
0804
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image to predict an answer to an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representation power of the text tokens by combining FastText embeddings with appearance, bounding-box, and PHOC features. Our model employs two-way co-attention, using self-attention and guided-attention mechanisms to obtain discriminative image features. We compute the text token position and combine this information with the predicted answer embedding for final answer generation. Our model achieves accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, it achieves ANLS scores of 0.698 on the validation set and 0.686 on the test set.
Pages: 177-189
Page count: 13
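
The record carries no code, so as an illustration of the general two-way co-attention pattern the abstract describes (self-attention within each modality, followed by guided attention across modalities), a minimal PyTorch sketch might look as follows. The class name, feature dimensions, head count, and residual/LayerNorm placement are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class TwoWayCoAttention(nn.Module):
    """Sketch of a two-way co-attention block: each modality first refines
    itself with self-attention, then attends to the other modality with
    guided (cross-) attention. Dimensions and layout are assumed."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Guided attention: queries from one modality, keys/values from the other.
        self.guide_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.guide_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img, txt):
        # Self-attention within each modality, with residual + LayerNorm.
        img = self.norm_img(img + self.self_img(img, img, img)[0])
        txt = self.norm_txt(txt + self.self_txt(txt, txt, txt)[0])
        # Guided attention in both directions ("two-way").
        img_out = img + self.guide_img(img, txt, txt)[0]
        txt_out = txt + self.guide_txt(txt, img, img)[0]
        return img_out, txt_out

# Usage: 36 image-region features and 20 text-token features, both 512-d.
img = torch.randn(2, 36, 512)
txt = torch.randn(2, 20, 512)
block = TwoWayCoAttention()
img_out, txt_out = block(img, txt)
print(img_out.shape, txt_out.shape)  # torch.Size([2, 36, 512]) torch.Size([2, 20, 512])

In the paper, the text-token inputs would themselves be fused multimodal vectors (FastText, appearance, bounding-box, and PHOC features); that fusion step is omitted here for brevity.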