Text Visual Question Answering Based on Interactive Learning and Relationship Modeling

Cited by: 1
Authors
Zhang, Chao [1]
Wu, Wei [1]
Ma, Bingzhuo [1]
Affiliations
[1] Inner Mongolia Univ, Hohhot 010021, Inner Mongolia, Peoples R China
Keywords
TextVQA; Multimodal Fusion; Relative Position Relationship; Cascade Guidance
DOI
10.1007/978-3-031-72347-6_7
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Text Visual Question Answering (TextVQA) task requires a model to understand the textual information in everyday scenes and answer questions about it. In TextVQA, interaction and fusion between the different modalities (question text, visual objects, and visual text) play an important role: they capture the correlations between modalities and thereby improve the model's ability to understand and answer questions accurately. However, most existing methods do not handle this cross-modal interaction and fusion effectively. To integrate information from the different modalities more effectively, this paper proposes a Multimodal Feature Cascade Guidance (MFCG) module, which addresses the tendency of previous methods to overlook the importance of certain modalities. In addition, a Relative Position Relationship Enhanced Transformer (RPRET) layer is introduced to model the relative position relationships between different modalities in the image, thereby improving performance on questions involving spatial position relationships. The proposed method outperforms various state-of-the-art models on two public datasets, which confirms its effectiveness.
Pages: 95-109
Number of pages: 15
Related papers
  • [1] Learning Hierarchical Reasoning for Text-Based Visual Question Answering
    Li, Caiyuan
    Du, Qinyi
    Wang, Qingqing
    Jin, Yaohui
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2021, PT III, 2021, 12893 : 305 - 316
  • [2] Scene Text Visual Question Answering
    Biten, Ali Furkan
    Tito, Ruben
    Mafla, Andres
    Gomez, Lluis
    Rusinol, Marcal
    Valveny, Ernest
    Jawahar, C. V.
    Karatzas, Dimosthenis
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4290 - 4300
  • [3] Visual question answering model based on visual relationship detection
    Xi, Yuling
    Zhang, Yanning
    Ding, Songtao
    Wan, Shaohua
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2020, 80
  • [4] Ontological modeling for interactive question answering
    Basili, Roberto
    De Cao, Diego
    Giannone, Cristina
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2007: OTM 2007 WORKSHOPS, PT 1, PROCEEDINGS, 2007, 4805 : 544 - +
  • [5] Separate and Locate: Rethink the Text in Text-based Visual Question Answering
    Fang, Chengyang
    Li, Jiangnan
    Li, Liang
    Ma, Can
    Hu, Dayong
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4378 - 4388
  • [6] Interactive Language Learning by Question Answering
    Yuan, Xingdi
    Cote, Marc-Alexandre
    Fu, Jie
    Lin, Zhouhan
    Pal, Christopher
    Bengio, Yoshua
    Trischler, Adam
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2796 - 2813
  • [7] IQA: Visual Question Answering in Interactive Environments
    Gordon, Daniel
    Kembhavi, Aniruddha
    Rastegari, Mohammad
    Redmon, Joseph
    Fox, Dieter
    Farhadi, Ali
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4089 - 4098
  • [8] Cascade Reasoning Network for Text-based Visual Question Answering
    Liu, Fen
    Xu, Guanghui
    Wu, Qi
    Du, Qing
    Jia, Wei
    Tan, Mingkui
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 4060 - 4069
  • [9] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [10] Multitask Learning for Visual Question Answering
    Ma, Jie
    Liu, Jun
    Lin, Qika
    Wu, Bei
    Wang, Yaxian
    You, Yang
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2023, 34 (03) : 1380 - 1394