Cascade Reasoning Network for Text-based Visual Question Answering

Cited by: 38
Authors
Liu, Fen [1]
Xu, Guanghui [1]
Wu, Qi [2]
Du, Qing [1]
Jia, Wei [3]
Tan, Mingkui [1]
Affiliations
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] CVTE, Guangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Text-based VQA; Multimodal Information; Progressive Attention; Reasoning Graph;
DOI
10.1145/3394171.3413924
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvga.
Pages: 4060-4069
Number of pages: 10
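
The abstract describes the progressive attention module (PAM) as a stepwise encoding process in which earlier attention results guide the next fusion step. The following is a minimal sketch of that general idea only, not the authors' implementation: the function names, the scaled dot-product scoring, the residual update rule, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(guide, feats):
    """Scaled dot-product attention of one guide vector over feature rows."""
    scale = math.sqrt(len(guide))
    scores = [sum(g * f for g, f in zip(guide, row)) / scale for row in feats]
    weights = softmax(scores)
    # Weighted sum of the feature rows, one output value per dimension.
    summary = [
        sum(w * row[j] for w, row in zip(weights, feats))
        for j in range(len(guide))
    ]
    return summary, weights


def progressive_attention(question, feature_sets):
    """Fuse feature sets one at a time (hypothetical stepwise scheme).

    The attended summary from each step is added to the guide vector,
    so earlier attention results shape the next attention step.
    """
    guide = list(question)
    history = []
    for feats in feature_sets:
        summary, weights = attend(guide, feats)
        guide = [g + s for g, s in zip(guide, summary)]  # assumed update rule
        history.append(weights)
    return guide, history


# Toy inputs (made up): a question vector, then OCR-token and object features.
question = [1.0, 0.0]
ocr_tokens = [[1.0, 0.0], [0.0, 1.0]]
objects = [[0.5, 0.5], [-1.0, 0.0], [0.0, 2.0]]
fused, history = progressive_attention(question, [ocr_tokens, objects])
print(fused)
print(history)
```

In this sketch the order of `feature_sets` matters, because the second attention step sees a guide vector that has already absorbed the first step's summary. That ordering dependence is the property the abstract attributes to stepwise fusion.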