Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Cited: 5
Authors
Sharma, Himanshu [1 ,2 ]
Srivastava, Swati [1 ]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura, India
[2] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura 281406, UP, India
Source
IMAGING SCIENCE JOURNAL | 2021, Vol. 69, No. 1-4
Keywords
Visual question answering; co-attention; transformer; multimodal fusion; ENCRYPTION; ALGORITHM; IMAGES;
DOI
10.1080/13682199.2022.2153489
Chinese Library Classification
TB8 [Photographic technology];
Discipline code
0804 ;
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual content and the text in an image to predict an answer to an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representation power of the text tokens by combining FastText embeddings with appearance, bounding-box and PHOC features. Our model employs two-way co-attention, using self-attention and guided-attention mechanisms, to obtain discriminative image features. We compute the text-token position and combine this information with the predicted answer embedding for final answer generation. Our model achieves accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, it achieves ANLS scores of 0.698 on the validation set and 0.686 on the test set.
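As a rough illustration only (not the authors' implementation, whose projection layers, multi-head setup and fusion details are not specified here), the two-way co-attention idea from the abstract — self-attention within each modality followed by guided attention across modalities — can be sketched with plain scaled dot-product attention:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: each query row attends over the key rows.
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores) @ value

def two_way_co_attention(img_feats, txt_feats):
    # Self-attention within each modality.
    img_sa = attention(img_feats, img_feats, img_feats)
    txt_sa = attention(txt_feats, txt_feats, txt_feats)
    # Guided attention: each modality's queries attend to the other modality.
    img_ga = attention(img_sa, txt_sa, txt_sa)   # image guided by text
    txt_ga = attention(txt_sa, img_sa, img_sa)   # text guided by image
    return img_ga, txt_ga

# Hypothetical feature shapes for illustration: 36 image-region features and
# 20 text-token features, each 64-dimensional.
rng = np.random.default_rng(0)
img = rng.standard_normal((36, 64))
txt = rng.standard_normal((20, 64))
img_out, txt_out = two_way_co_attention(img, txt)
print(img_out.shape, txt_out.shape)  # (36, 64) (20, 64)
```

In the full model each attention step would use learned query/key/value projections and multiple heads (transformer-style), and the guided outputs would be fused before answer prediction; this sketch shows only the information flow.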
Pages: 177-189 (13 pages)
Related papers
(50 in total)
  • [41] Aspect-level multimodal sentiment analysis based on co-attention fusion
    Wang, Shunjie
    Cai, Guoyong
    Lv, Guangrui
    INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS, 2024,
  • [42] An Adaptive Multimodal Fusion Network Based on Multilinear Gradients for Visual Question Answering
    Zhao, Chengfang
    Tang, Mingwei
    Zheng, Yanxi
    Ran, Chaocong
    ELECTRONICS, 2025, 14 (01):
  • [44] MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain
    Sharma, Dhruv
    Purushotham, Sanjay
    Reddy, Chandan K.
    SCIENTIFIC REPORTS, 2021, 11 (01)
  • [45] ViCLEVR: a visual reasoning dataset and hybrid multimodal fusion model for visual question answering in Vietnamese
    Tran, Khiem Vinh
    Phan, Hao Phu
    Van Nguyen, Kiet
    Nguyen, Ngan Luu Thuy
    MULTIMEDIA SYSTEMS, 2024, 30 (04)
  • [46] Visual Question Answering Research on Multi-layer Attention Mechanism Based on Image Target Features
    Cao, Danyang
    Ren, Xu
    Zhu, Menggui
    Song, Wei
    HUMAN-CENTRIC COMPUTING AND INFORMATION SCIENCES, 2021, 11
  • [47] A focus fusion attention mechanism integrated with image captions for knowledge graph-based visual question answering
    Ma, Mingyang
    Tohti, Turdi
    Liang, Yi
    Zuo, Zicheng
    Hamdulla, Askar
    SIGNAL IMAGE AND VIDEO PROCESSING, 2024, 18 (04) : 3471 - 3482
  • [49] Cascading Attention Visual Question Answering Model Based on Graph Structure
    Zhang, Haoyu
    Zhang, De
    Computer Engineering and Applications, 2023, 59 (06) : 155 - 161
  • [50] Owner name entity recognition in websites based on multiscale features and multimodal co-attention
    Ren, Yimo
    Li, Hong
    Liu, Peipei
    Liu, Jie
    Zhu, Hongsong
    Sun, Limin
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 224