Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Cited by: 5
Authors
Sharma, Himanshu [1 ,2 ]
Srivastava, Swati [1 ]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura, India
[2] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura 281406, UP, India
Source
IMAGING SCIENCE JOURNAL | 2021, Vol. 69, Issue 1-4
Keywords
Visual question answering; co-attention; transformer; multimodal fusion; ENCRYPTION; ALGORITHM; IMAGES;
D O I
10.1080/13682199.2022.2153489
CLC Number
TB8 [Photography];
Discipline Code
0804 ;
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual contents and the texts in an image to predict an answer for an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representation power of the text tokens by using FastText embedding, appearance, bounding box and PHOC features for each token. Our model employs two-way co-attention, using self-attention and guided attention mechanisms, to obtain discriminative image features. We compute the text token position and combine this information with the predicted answer embedding for final answer generation. We achieved accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, our model achieved an ANLS score of 0.698 on the validation set and 0.686 on the test set.
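The two-way co-attention described in the abstract pairs self-attention within one modality with guided attention across modalities (e.g. image features attending to question features). The following is a minimal single-head NumPy sketch of these two operations; the dimensions, random weights, and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (n_q, d)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys and values all come from one modality."""
    return scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

def guided_attention(X, Y, W_q, W_k, W_v):
    """Guided attention: X (e.g. image regions) attends to Y (e.g. question tokens)."""
    return scaled_dot_product_attention(X @ W_q, Y @ W_k, Y @ W_v)

rng = np.random.default_rng(0)
d = 8
img = rng.standard_normal((5, d))   # 5 hypothetical image-region features
txt = rng.standard_normal((7, d))   # 7 hypothetical question-token features
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # shared toy projections

img_sa = self_attention(img, *W)            # image refined by itself
img_ga = guided_attention(img_sa, txt, *W)  # image refined under question guidance
```

In a full transformer-style co-attention block, each operation would also use multiple heads, residual connections and layer normalization, and the same pattern would be applied symmetrically so the question features are likewise guided by the image.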
Pages: 177-189
Page count: 13
Related Papers
50 records total
  • [31] SPCA-Net: a based on spatial position relationship co-attention network for visual question answering
    Feng Yan
    Wushouer Silamu
    Yanbin Li
    Yachuang Chai
    The Visual Computer, 2022, 38 : 3097 - 3108
  • [32] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [33] Bi-direction Co-Attention Network on Visual Question Answering for Blind People
    Tung Le
    Thong Bui
    Huy Tien Nguyen
    Minh Le Nguyen
    FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084
  • [34] AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering
    Pan, Haiwei
    He, Shuning
    Zhang, Kejia
    Qu, Bo
    Chen, Chunling
    Shi, Kun
    KNOWLEDGE-BASED SYSTEMS, 2022, 255
  • [35] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
  • [36] DSAF: A Dual-Stage Attention Based Multimodal Fusion Framework for Medical Visual Question Answering
    K. Mukesh
    S. L. Jayaprakash
    R. Prasanna Kumar
    SN Computer Science, 6 (4)
  • [37] TRAFMEL: Multimodal Entity Linking Based on Transformer Reranking and Multimodal Co-Attention Fusion
    Zhang, Xiaoming
    Meng, Kaikai
    Wang, Huiyong
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2024, 34 (06) : 973 - 997
  • [38] Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering
    Asri H.S.
    Safabakhsh R.
    Multimedia Tools and Applications, 2024, 83 (40) : 87959 - 87986
  • [39] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Feng Yan
    Wushouer Silamu
    Yachuang Chai
    Yanbing Li
    Multimedia Tools and Applications, 2024, 83 : 7085 - 7096