Differentiated Attention with Multi-modal Reasoning for Video Question Answering

Cited by: 0

Authors:

Yao, Shentao [1]

Li, Kun [1]

Xing, Kun [1]

Wu, Kewei [1]

Xie, Zhao [2]

Guo, Dan [1]

Affiliations:

[1] Hefei Univ Technol, Sch Comp & Informat, Hefei, Peoples R China

[2] Hefei Univ Technol, Sch Microelect, Hefei, Peoples R China

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING, BIG DATA AND ALGORITHMS (EEBDA) | 2022

Keywords:

video question answering; differentiated attention; multi-modal fusion; multi-modal interaction;

DOI:

10.1109/EEBDA53927.2022.9744732

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Inferring answers for long questions and complex videos is extremely challenging. Video question answering must not only capture clues from the question but also reason about the relevant clip or frame in the video. In this paper, we propose a novel method that understands the entire video and reasons toward the answer to the question. We combine a traditional attention mechanism with a multi-head structure to construct a differentiated attention module. Unlike existing methods, our method is dedicated to obtaining differentiated features. In our method, each video is split into a few clips, and these clips overlap heavily. Thus, simply using the self-attention mechanism to aggregate features leads to excessive redundancy in the captured features. To tackle this issue, the differentiated attention module focuses on the core semantics and decodes different clips or phrases. In addition, we apply the differentiated attention block to question aggregation and video clue reasoning. We use a different query attention loss (DQALoss) to address the stronger differentiation that questions require. Meanwhile, we use the multi-modal factorized bilinear pooling method for multi-modal feature reasoning and interaction. Experiments show that the proposed method outperforms existing methods on TGIF-QA datasets by large margins, demonstrating its effectiveness.
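
The abstract names multi-modal factorized bilinear pooling but gives no formula. As a rough illustration only, the sketch below follows the commonly described form of that technique: project both feature vectors, multiply them element-wise, sum-pool in windows of size k, then apply signed square-root and L2 normalization. The function name `mfb_pool`, the dimensions, and the random weights are all hypothetical, and this is not the authors' implementation.

```python
import math
import random


def mfb_pool(x, y, U, V, k):
    """Fuse two feature vectors with factorized bilinear pooling.

    x: list of length dx (e.g. a question feature; assumed for illustration)
    y: list of length dy (e.g. a video-clip feature; assumed for illustration)
    U: dx x (o*k) projection matrix, V: dy x (o*k) projection matrix
    k: number of factors summed into each of the o output dimensions
    """
    def project(vec, mat):
        # Compute vec^T @ mat with plain Python lists.
        cols = len(mat[0])
        return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
                for j in range(cols)]

    px = project(x, U)
    py = project(y, V)
    joint = [a * b for a, b in zip(px, py)]  # element-wise product
    # Sum pooling over non-overlapping windows of size k.
    pooled = [sum(joint[i:i + k]) for i in range(0, len(joint), k)]
    # Signed square-root (power) normalization.
    pooled = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    # L2 normalization (skipped if the vector is all zeros).
    norm = math.sqrt(sum(v * v for v in pooled))
    return [v / norm for v in pooled] if norm > 0 else pooled


# Hypothetical toy sizes and random weights, for demonstration only.
rng = random.Random(0)
dx, dy, o, k = 4, 6, 3, 2
U = [[rng.uniform(-1, 1) for _ in range(o * k)] for _ in range(dx)]
V = [[rng.uniform(-1, 1) for _ in range(o * k)] for _ in range(dy)]
x = [rng.uniform(-1, 1) for _ in range(dx)]
y = [rng.uniform(-1, 1) for _ in range(dy)]
z = mfb_pool(x, y, U, V, k)
print(len(z))  # 3
```

In a trained model the projection matrices would be learned parameters; here they are random so that the shape of the computation is visible without any training code.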


Pages: 525-530

Page count: 6
