Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration

Cited by: 0

Authors:

Nguyen, Ngoc Son [1,3]

Nguyen, Van Son [1,3]

Le, Tung [2,3]

Affiliations:

[1] Univ Sci, Fac Math & Comp Sci, Ho Chi Minh, Vietnam

[2] Univ Sci, Fac Informat Technol, Ho Chi Minh, Vietnam

[3] Vietnam Natl Univ, Ho Chi Minh, Vietnam

Source:

COMPUTERS & ELECTRICAL ENGINEERING | 2024 / Volume 119

Keywords:

Visual question answering; ViVQA; EfficientNet; BLIP-2; Convolutional;

DOI:

10.1016/j.compeleceng.2024.109474

Chinese Library Classification number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Visual Question Answering (VQA) has recently emerged as a potential research domain, captivating the interest of many in the field of artificial intelligence and computer vision. Despite the prevalence of approaches in English, there is a notable lack of systems specifically developed for certain languages, particularly Vietnamese. This study aims to bridge this gap by conducting comprehensive experiments on the Vietnamese Visual Question Answering (ViVQA) dataset, demonstrating the effectiveness of our proposed model. In response to community interest, we have developed a model that enhances image representation capabilities, thereby improving overall performance in the ViVQA system. Therefore, we propose AViVQA-TranConI (Advancing Vietnamese Visual Question Answering with Transformer and Convolutional Integration). AViVQA-TranConI integrates the Bootstrapping Language-Image Pre-training with frozen unimodal models (BLIP-2) and the convolutional neural network EfficientNet to extract and process both local and global features from images. This integration leverages the strengths of transformer-based architectures for capturing comprehensive contextual information and convolutional networks for detailed local features. By freezing the parameters of these pre-trained models, we significantly reduce the computational cost and training time, while maintaining high performance. This approach significantly improves image representation and enhances the performance of existing VQA systems. We then leverage a multi-modal fusion module based on a general-purpose multi-modal foundation model (BEiT-3) to fuse the information between visual and textual features. Our experimental findings demonstrate that AViVQA-TranConI surpasses competing baselines, achieving promising performance. This is particularly evident in its accuracy of 71.04% on the test set of the ViVQA dataset, marking a significant advancement in our research area.
The code is available at https://github.com/nngocson2002/ViVQA.


Pages: 18

