Differentiated Attention with Multi-modal Reasoning for Video Question Answering

Cited by: 0
Authors
Yao, Shentao [1 ]
Li, Kun [1 ]
Xing, Kun [1 ]
Wu, Kewei [1 ]
Xie, Zhao [2 ]
Guo, Dan [1 ]
Affiliations
[1] Hefei Univ Technol, Sch Comp & Informat, Hefei, Peoples R China
[2] Hefei Univ Technol, Sch Microelect, Hefei, Peoples R China
Keywords
video question answering; differentiated attention; multi-modal fusion; multi-modal interaction;
DOI
10.1109/EEBDA53927.2022.9744732
CLC Classification Number
TP39 [Computer Applications];
Subject Classification Codes
081203; 0835;
Abstract
Inferring answers to long questions about complex videos is extremely challenging. Video question answering must not only capture clues from the question but also reason about the specific clip or frame in the video that supports the answer. In this paper, we propose a novel method to understand the entire video and infer the answer to the question. We combine a traditional attention mechanism with a multi-head structure to construct a differentiated attention module. Unlike existing methods, our method is dedicated to obtaining differentiated features. In our method, a video is split into a small number of clips, and these clips overlap heavily, so simply using a self-attention mechanism to aggregate features leads to excessive redundancy in the captured features. To tackle this issue, the differentiated attention module, which consists of a traditional attention mechanism and a multi-head structure, focuses on the core semantics and decodes different clips or phrases. We further apply the differentiated attention block to question aggregation and video clue reasoning, and introduce a different query attention loss (DQALoss) to provide the stronger differentiation that questions require. Meanwhile, we use multi-modal factorized bilinear pooling for multi-modal feature reasoning and interaction. Experiments show that the proposed method outperforms existing methods on the TGIF-QA dataset by a large margin, demonstrating its effectiveness.
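A minimal PyTorch sketch of two building blocks the abstract names: multi-head attention pooling over overlapping clip features (one reading of "differentiated attention") and multi-modal factorized bilinear (MFB) pooling for fusing the pooled video feature with the question feature. Module names, dimensions, and the head-diversity penalty are illustrative assumptions; the abstract does not define DQALoss, so the penalty below is a standard stand-in rather than the authors' loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttentionPool(nn.Module):
    """Aggregate clip features with several attention heads so each head can
    attend to a different clip or phrase (hypothetical module, not the paper's code)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.score = nn.Linear(dim, num_heads)       # one attention logit per head

    def forward(self, feats):                        # feats: (batch, num_clips, dim)
        logits = self.score(feats)                   # (batch, num_clips, heads)
        attn = F.softmax(logits, dim=1)              # normalize over clips per head
        pooled = torch.einsum('bch,bcd->bhd', attn, feats)  # (batch, heads, dim)
        return pooled.mean(dim=1), attn              # fused feature + per-head weights

def head_diversity_penalty(attn):
    """Push heads toward different clips; the standard ||A A^T - I|| penalty,
    used here only as an assumed analogue of the paper's DQALoss."""
    a = attn.transpose(1, 2)                         # (batch, heads, clips)
    gram = torch.bmm(a, a.transpose(1, 2))           # (batch, heads, heads)
    eye = torch.eye(gram.size(1), device=gram.device)
    return ((gram - eye) ** 2).sum(dim=(1, 2)).mean()

class MFBFusion(nn.Module):
    """Multi-modal factorized bilinear pooling: project both modalities to a high-dim
    space, multiply element-wise, sum-pool over the factor axis, then normalize."""
    def __init__(self, v_dim, q_dim, out_dim=1000, factor=5):
        super().__init__()
        self.factor = factor
        self.v_proj = nn.Linear(v_dim, out_dim * factor)
        self.q_proj = nn.Linear(q_dim, out_dim * factor)

    def forward(self, v, q):                         # v: (batch, v_dim), q: (batch, q_dim)
        joint = self.v_proj(v) * self.q_proj(q)      # factorized bilinear interaction
        joint = joint.view(joint.size(0), -1, self.factor).sum(dim=2)   # sum pooling
        joint = torch.sign(joint) * torch.sqrt(torch.abs(joint) + 1e-8) # power norm
        return F.normalize(joint, dim=1)             # l2 norm
```

As a usage sketch, the pooled video feature and a question encoding of matching batch size would be passed to `MFBFusion`, with `head_diversity_penalty(attn)` added to the training objective to discourage the heads from collapsing onto the same clips.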
Pages: 525-530
Number of pages: 6