Cascade Reasoning Network for Text-based Visual Question Answering

被引:38
|
作者
Liu, Fen [1 ]
Xu, Guanghui [1 ]
Wu, Qi [2 ]
Du, Qing [1 ]
Jia, Wei [3 ]
Tan, Mingkui [1 ]
机构
[1] South China Univ Technol, Guangzhou, Peoples R China
[2] Univ Adelaide, Adelaide, SA, Australia
[3] CVTE, Guangzhou, Peoples R China
基金
中国国家自然科学基金;
关键词
Text-based VQA; Multimodal Information; Progressive Attention; Reasoning Graph;
D O I
10.1145/3394171.3413924
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvga.
引用
收藏
页码:4060 / 4069
页数:10
相关论文
共 50 条
  • [31] A question-guided multi-hop reasoning graph network for visual question answering
    Xu, Zhaoyang
    Gu, Jinguang
    Liu, Maofu
    Zhou, Guangyou
    Fu, Haidong
    Qiu, Chen
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [32] Efficient Multi-step Reasoning Attention Network for Visual Question Answering
    Zhang, Haotian
    Wu, Wei
    Zhang, Meng
    THIRTEENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2021), 2022, 12083
  • [33] Hierarchical reasoning based on perception action cycle for visual question answering
    Mohamud, Safaa Abdullahi Moallim
    Jalali, Amin
    Lee, Minho
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 241
  • [34] Visual question answering method based on relational reasoning and gating mechanism
    Wang X.
    Chen Q.-H.
    Sun Q.
    Jia Y.-B.
    Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science), 2022, 56 (01): : 36 - 46
  • [35] Improving reasoning with contrastive visual information for visual question answering
    Long, Yu
    Tang, Pengjie
    Wang, Hanli
    Yu, Jian
    ELECTRONICS LETTERS, 2021, 57 (20) : 758 - 760
  • [36] VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning
    Chen, Kang
    Wu, Xiangqian
    2024 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2024, : 27208 - 27217
  • [37] Text Visual Question Answering Based on Interactive Learning and Relationship Modeling
    Zhang, Chao
    Wu, Wei
    Ma, Bingzhuo
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING-ICANN 2024, PT VI, 2024, 15021 : 95 - 109
  • [38] Coarse-to-Fine Reasoning for Visual Question Answering
    Nguyen, Binh X.
    Tuong Do
    Huy Tran
    Tjiputra, Erman
    Tran, Quang D.
    Anh Nguyen
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS, CVPRW 2022, 2022, : 4557 - 4565
  • [39] Medical Visual Question Answering via Conditional Reasoning
    Zhan, Li-Ming
    Liu, Bo
    Fan, Lu
    Chen, Jiaxin
    Wu, Xiao-Ming
    MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, 2020, : 2345 - 2354
  • [40] Affective Visual Question Answering Network
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Dong, Ming
    IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 170 - 173