Visual question answering via Attention-based syntactic structure tree-LSTM

Cited by: 25
Authors
Liu, Yun [1 ]
Zhang, Xiaoming [2 ]
Huang, Feiran [3 ]
Tang, Xianghong [4 ]
Li, Zhoujun [5 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Jinan Univ, Coll Informat Sci & Technol, Coll Cyber Secur, Guangzhou 510632, Guangdong, Peoples R China
[4] Guizhou Univ, Key Lab Adv Mfg Technol, Minist Educ, Guiyang 550025, Guizhou, Peoples R China
[5] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Visual question answering; Visual attention; Tree-LSTM; Spatial-semantic correlation;
DOI
10.1016/j.asoc.2019.105584
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to the diverse visual patterns of images and the free-form language of questions, the performance of Visual Question Answering (VQA) remains unsatisfactory. Existing approaches mainly infer answers from low-level features and sequential question words, neglecting the syntactic structure of the question sentence and its correlation with the spatial structure of the image. To address these problems, we propose a novel VQA model, i.e., the Attention-based Syntactic Structure Tree-LSTM (ASST-LSTM). Specifically, a tree-structured LSTM is used to encode the syntactic structure of the question sentence. A spatial-semantic attention model is proposed to learn the visual-textual correlation and the alignment between image regions and question words. In the attention model, a Siamese network is employed to explore the alignment between visual and textual contents. Then, the tree-structured LSTM and the spatial-semantic attention model are integrated into a joint deep model, which is trained for answer inference with a multi-task learning method. Experiments conducted on three widely used VQA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art approaches. (C) 2019 Elsevier B.V. All rights reserved.
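The tree-structured question encoder the abstract refers to is typically built on a Child-Sum Tree-LSTM cell (Tai et al., 2015), which composes a node's state from an arbitrary set of children rather than a single predecessor. The sketch below is illustrative only, assuming that variant; dimensions, initialization, and the `ChildSumTreeLSTMCell` name are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    """Minimal Child-Sum Tree-LSTM cell (hypothetical sketch, not the
    paper's implementation). Each node sums its children's hidden states
    for the input/output/update gates and keeps one forget gate per child."""

    def __init__(self, in_dim, mem_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Separate weights for the i (input), o (output), f (forget),
        # and u (candidate update) gates.
        self.W = {g: rng.normal(0, scale, (mem_dim, in_dim)) for g in "iofu"}
        self.U = {g: rng.normal(0, scale, (mem_dim, mem_dim)) for g in "iofu"}
        self.b = {g: np.zeros(mem_dim) for g in "iofu"}

    def forward(self, x, child_h, child_c):
        # child_h, child_c: (num_children, mem_dim); leaves pass (0, mem_dim).
        h_tilde = child_h.sum(axis=0) if len(child_h) else np.zeros_like(self.b["i"])
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h_tilde + self.b["i"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h_tilde + self.b["o"])
        u = np.tanh(self.W["u"] @ x + self.U["u"] @ h_tilde + self.b["u"])
        # One forget gate per child, conditioned on that child's own hidden state.
        f = np.array([sigmoid(self.W["f"] @ x + self.U["f"] @ hk + self.b["f"])
                      for hk in child_h])
        c = i * u + (f * child_c).sum(axis=0) if len(child_h) else i * u
        h = o * np.tanh(c)
        return h, c
```

Encoding a question then amounts to a post-order traversal of its parse tree, feeding each node's word embedding as `x` and its children's `(h, c)` pairs upward; the root's hidden state serves as the question representation.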
Pages: 12
Related papers
50 records
  • [41] Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Zhou, Zhibo
    Zhao, Zhonghua
    Li, Zhoujun
    KNOWLEDGE-BASED SYSTEMS, 2020, 207
  • [42] Hierarchical Attention Networks for Fact-based Visual Question Answering
    Yao, Haibo
    Luo, Yongkang
    Zhang, Zhi
    Yang, Jianhang
    Cai, Chengtao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (06) : 17281 - 17298
  • [43] SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Liu, Zichuan
    Liu, Fayao
    Tang, Zhenmin
    Yao, Yazhou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9666 - 9675
  • [45] EFFECT OF A PRAGMATIC PRESUPPOSITION ON SYNTACTIC STRUCTURE IN QUESTION ANSWERING
    BOCK, JK
    JOURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR, 1977, 16 (06): : 723 - 734
  • [46] Depth and Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Yao, Yazhou
    Liu, Fayao
    Liu, Zichuan
    Tang, Zhenmin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 6807 - 6819
  • [47] Chinese Knowledge Base Question Answering by Attention-Based Multi-Granularity Model
    Shen, Cun
    Huang, Tinglei
    Liang, Xiao
    Li, Feng
    Fu, Kun
    INFORMATION, 2018, 9 (04)
  • [48] Step Counting with Attention-based LSTM
    Khan, Shehroz S.
    Abedi, Ali
    2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022, : 559 - 566
  • [49] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [50] Learning to Supervise Knowledge Retrieval Over a Tree Structure for Visual Question Answering
    Xu, Ning
    Lu, Zimu
    Tian, Hongshuo
    Kang, Rongbao
    Cao, Jinbo
    Zhang, Yongdong
    Liu, An-An
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6689 - 6700