Text Visual Question Answering Based on Interactive Learning and Relationship Modeling

Cited by: 1
Authors
Zhang, Chao [1 ]
Wu, Wei [1 ]
Ma, Bingzhuo [1 ]
Affiliations
[1] Inner Mongolia Univ, Hohhot 010021, Inner Mongolia, Peoples R China
Keywords
TextVQA; Multimodal Fusion; Relative Position Relationship; Cascade Guidance;
DOI
10.1007/978-3-031-72347-6_7
CLC classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Text Visual Question Answering (TextVQA) task requires understanding textual information in everyday scenes and answering questions about it. In TextVQA, the interactive fusion of different modalities (question text, visual objects, and visual text) plays an important role: it captures the correlations between modalities and thereby improves the model's ability to understand and answer questions accurately. However, most existing methods do not handle this cross-modal interaction fusion effectively. To integrate information from different modalities more effectively, this paper proposes a Multimodal Feature Cascade Guidance (MFCG) module, which addresses the tendency of previous methods to ignore the importance of certain modalities. In addition, a Relative Position Relationship Enhanced Transformer (RPRET) layer is introduced to model the relative position relationships between different modalities in the image, improving performance on questions about spatial position. The proposed method outperforms various state-of-the-art models on two public datasets, confirming its effectiveness.
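The abstract does not detail how the RPRET layer encodes relative positions. As a minimal sketch only, the snippet below shows one common way such pairwise spatial relationships between detected regions (objects or text tokens) are encoded in vision transformers: log-scaled center offsets and size ratios computed from normalized bounding boxes. The function name and feature choice are illustrative assumptions, not the paper's actual formulation.

```python
import math

def relative_position_features(boxes):
    """Pairwise relative-position features between normalized boxes.

    Each box is (x1, y1, x2, y2) in [0, 1]. For every ordered pair
    (i, j) we compute log-scaled center offsets and size ratios, a
    common encoding of spatial relationships; such features are
    typically projected and added as a bias to attention scores.
    """
    feats = []
    for (ax1, ay1, ax2, ay2) in boxes:
        acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
        aw = max(ax2 - ax1, 1e-6)
        ah = max(ay2 - ay1, 1e-6)
        row = []
        for (bx1, by1, bx2, by2) in boxes:
            bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
            bw = max(bx2 - bx1, 1e-6)
            bh = max(by2 - by1, 1e-6)
            row.append((
                math.log(abs(bcx - acx) / aw + 1e-6),  # horizontal offset
                math.log(abs(bcy - acy) / ah + 1e-6),  # vertical offset
                math.log(bw / aw),                      # width ratio
                math.log(bh / ah),                      # height ratio
            ))
        feats.append(row)
    return feats

# Example: two detected regions in the unit square.
boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
feats = relative_position_features(boxes)
```

In a transformer layer, an N x N x 4 tensor of such features would typically be mapped through a small MLP to per-head scalars and added to the attention logits, letting attention weights depend on where regions sit relative to each other.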
Pages: 95-109
Page count: 15
Related Papers
50 in total
  • [41] Cross-attention Based Text-image Transformer for Visual Question Answering
    Rezapour M.
    Recent Advances in Computer Science and Communications, 2024, 17 (04) : 72 - 78
  • [42] Mathematical Modeling of Question Popularity in User-Interactive Question Answering Systems
    Quan, Xiaojun
    Lu, Yao
    Xu, Feifei
    Lei, Jingsheng
    Liu, Wenyin
    JOURNAL OF ADVANCED MATHEMATICS AND APPLICATIONS, 2013, 2 (01) : 24 - 31
  • [43] Beyond Question-Based Biases: Assessing Multimodal Shortcut Learning in Visual Question Answering
    Dancette, Corentin
    Cadene, Remi
    Teney, Damien
    Cord, Matthieu
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 1554 - 1563
  • [44] MMQL: Multi-Question Learning for Medical Visual Question Answering
    Chen, Qishen
    Bian, Minjie
    Xu, Huahu
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION - MICCAI 2024, PT V, 2024, 15005 : 480 - 489
  • [45] Bidirectional Contrastive Split Learning for Visual Question Answering
    Sun, Yuwei
    Ochiai, Hideya
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 19, 2024, : 21602 - 21609
  • [46] Learning to Specialize with Knowledge Distillation for Visual Question Answering
    Mun, Jonghwan
    Lee, Kimin
    Shin, Jinwoo
    Han, Bohyung
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [47] Multiple Context Learning Networks for Visual Question Answering
    Zhang, Pufen
    Lan, Hong
    Khan, Muhammad Asim
    SCIENTIFIC PROGRAMMING, 2022, 2022
  • [48] Adversarial Learning with Bidirectional Attention for Visual Question Answering
    Li, Qifeng
    Tang, Xinyi
    Jian, Yi
    SENSORS, 2021, 21 (21)
  • [49] Visual Question Answering
    Nada, Ahmed
    Chen, Min
    2024 INTERNATIONAL CONFERENCE ON COMPUTING, NETWORKING AND COMMUNICATIONS, ICNC, 2024, : 6 - 10
  • [50] Learning Visual Question Answering by Bootstrapping Hard Attention
    Malinowski, Mateusz
    Doersch, Carl
    Santoro, Adam
    Battaglia, Peter
    COMPUTER VISION - ECCV 2018, PT VI, 2018, 11210 : 3 - 20