Visual question answering model based on the fusion of multimodal features by a two-way co-attention mechanism

Cited by: 5
Authors
Sharma, Himanshu [1 ,2 ]
Srivastava, Swati [1 ]
Affiliations
[1] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura, India
[2] GLA Univ Mathura, Dept Comp Engn & Applicat, Mathura 281406, UP, India
Source
IMAGING SCIENCE JOURNAL | 2021, Vol. 69, Issue 1-4
Keywords
Visual question answering; co-attention; transformer; multimodal fusion; ENCRYPTION; ALGORITHM; IMAGES;
D O I
10.1080/13682199.2022.2153489
CLC Number
TB8 [Photography];
Discipline Code
0804 ;
Abstract
Scene Text Visual Question Answering (VQA) requires understanding both the visual contents and the texts in an image to predict an answer for an image-related question. Existing Scene Text VQA models predict an answer by choosing a word from a fixed vocabulary or from the extracted text tokens. In this paper, we strengthen the representation power of the text tokens by using FastText embedding, appearance, bounding box and PHOC features for each token. Our model employs two-way co-attention, using self-attention and guided attention mechanisms, to obtain discriminative image features. We compute the text token position and combine this information with the predicted answer embedding for final answer generation. We achieved accuracies of 51.27% and 52.09% on the TextVQA validation and test sets, respectively. On the ST-VQA dataset, our model achieved an ANLS score of 0.698 on the validation set and 0.686 on the test set.
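The two-way co-attention described in the abstract pairs self-attention within one modality with guided attention across modalities (e.g. image features attending to question features). The following is a minimal single-head NumPy sketch of these two operations; the dimensions, random weights, and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (n_q, d)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention: queries, keys and values all come from one modality."""
    return scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)

def guided_attention(X, Y, W_q, W_k, W_v):
    """Guided attention: X (e.g. image regions) attends to Y (e.g. question tokens)."""
    return scaled_dot_product_attention(X @ W_q, Y @ W_k, Y @ W_v)

rng = np.random.default_rng(0)
d = 8
img = rng.standard_normal((5, d))   # 5 hypothetical image-region features
txt = rng.standard_normal((7, d))   # 7 hypothetical question-token features
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]  # shared toy projections

img_sa = self_attention(img, *W)            # image refined by itself
img_ga = guided_attention(img_sa, txt, *W)  # image refined under question guidance
```

In a full transformer-style co-attention block, each operation would also use multiple heads, residual connections and layer normalization, and the same pattern would be applied symmetrically so the question features are likewise guided by the image.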
Pages: 177-189
Page count: 13
Related Papers
50 records total
  • [31] SPCA-Net: a based on spatial position relationship co-attention network for visual question answering
    Feng Yan
    Wushouer Silamu
    Yanbin Li
    Yachuang Chai
    The Visual Computer, 2022, 38 : 3097 - 3108
  • [32] QAlayout: Question Answering Layout Based on Multimodal Attention for Visual Question Answering on Corporate Document
    Mahamoud, Ibrahim Souleiman
    Coustaty, Mickael
    Joseph, Aurelie
    d'Andecy, Vincent Poulain
    Ogier, Jean-Marc
    DOCUMENT ANALYSIS SYSTEMS, DAS 2022, 2022, 13237 : 659 - 673
  • [33] Bi-direction Co-Attention Network on Visual Question Answering for Blind People
    Tung Le
    Thong Bui
    Huy Tien Nguyen
    Minh Le Nguyen
    FOURTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2021), 2022, 12084
  • [34] AMAM: An Attention-based Multimodal Alignment Model for Medical Visual Question Answering
    Pan, Haiwei
    He, Shuning
    Zhang, Kejia
    Qu, Bo
    Chen, Chunling
    Shi, Kun
    KNOWLEDGE-BASED SYSTEMS, 2022, 255
  • [35] Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
    Yu, Zhou
    Yu, Jun
    Fan, Jianping
    Tao, Dacheng
    2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, : 1839 - 1848
  • [36] DSAF: A Dual-Stage Attention Based Multimodal Fusion Framework for Medical Visual Question Answering
    K. Mukesh
    S. L. Jayaprakash
    R. Prasanna Kumar
    SN Computer Science, 6 (4)
  • [37] TRAFMEL: Multimodal Entity Linking Based on Transformer Reranking and Multimodal Co-Attention Fusion
    Zhang, Xiaoming
    Meng, Kaikai
    Wang, Huiyong
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2024, 34 (06) : 973 - 997
  • [38] Advanced Visual and Textual Co-context Aware Attention Network with Dependent Multimodal Fusion Block for Visual Question Answering
    Asri H.S.
    Safabakhsh R.
    Multimedia Tools and Applications, 2024, 83 (40) : 87959 - 87986
  • [39] OECA-Net: A co-attention network for visual question answering based on OCR scene text feature enhancement
    Feng Yan
    Wushouer Silamu
    Yachuang Chai
    Yanbing Li
    Multimedia Tools and Applications, 2024, 83 : 7085 - 7096