Visual question answering via Attention-based syntactic structure tree-LSTM

Cited by: 25
Authors
Liu, Yun [1 ]
Zhang, Xiaoming [2 ]
Huang, Feiran [3 ]
Tang, Xianghong [4 ]
Li, Zhoujun [5 ]
Affiliations
[1] Beihang Univ, Beijing Key Lab Network Technol, Beijing 100191, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Jinan Univ, Coll Informat Sci & Technol, Coll Cyber Secur, Guangzhou 510632, Guangdong, Peoples R China
[4] Guizhou Univ, Key Lab Adv Mfg Technol, Minist Educ, Guiyang 550025, Guizhou, Peoples R China
[5] Beihang Univ, Sch Comp Sci & Engn, State Key Lab Software Dev Environm, Beijing 100191, Peoples R China
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China;
Keywords
Visual question answering; Visual attention; Tree-LSTM; Spatial-semantic correlation;
DOI
10.1016/j.asoc.2019.105584
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Due to the diverse visual patterns of images and the free-form language of questions, the performance of Visual Question Answering (VQA) remains unsatisfactory. Existing approaches mainly infer answers from low-level features and sequential question words, neglecting the syntactic structure of the question sentence and its correlation with the spatial structure of the image. To address these problems, we propose a novel VQA model, i.e., the Attention-based Syntactic Structure Tree-LSTM (ASST-LSTM). Specifically, a tree-structured LSTM is used to encode the syntactic structure of the question sentence. A spatial-semantic attention model is proposed to learn the visual-textual correlation and the alignment between image regions and question words. In the attention model, a Siamese network is employed to explore the alignment between visual and textual contents. Then, the tree-structured LSTM and the spatial-semantic attention model are integrated into a joint deep model, which is trained for answer inference with a multi-task learning method. Experiments conducted on three widely used VQA benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art approaches. (C) 2019 Elsevier B.V. All rights reserved.
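The tree-structured question encoder the abstract refers to is typically built on a Child-Sum Tree-LSTM cell (Tai et al., 2015), which composes a node's state from an arbitrary set of children rather than a single predecessor. The sketch below is illustrative only, assuming that variant; dimensions, initialization, and the `ChildSumTreeLSTMCell` name are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    """Minimal Child-Sum Tree-LSTM cell (hypothetical sketch, not the
    paper's implementation). Each node sums its children's hidden states
    for the input/output/update gates and keeps one forget gate per child."""

    def __init__(self, in_dim, mem_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1
        # Separate weights for the i (input), o (output), f (forget),
        # and u (candidate update) gates.
        self.W = {g: rng.normal(0, scale, (mem_dim, in_dim)) for g in "iofu"}
        self.U = {g: rng.normal(0, scale, (mem_dim, mem_dim)) for g in "iofu"}
        self.b = {g: np.zeros(mem_dim) for g in "iofu"}

    def forward(self, x, child_h, child_c):
        # child_h, child_c: (num_children, mem_dim); leaves pass (0, mem_dim).
        h_tilde = child_h.sum(axis=0) if len(child_h) else np.zeros_like(self.b["i"])
        i = sigmoid(self.W["i"] @ x + self.U["i"] @ h_tilde + self.b["i"])
        o = sigmoid(self.W["o"] @ x + self.U["o"] @ h_tilde + self.b["o"])
        u = np.tanh(self.W["u"] @ x + self.U["u"] @ h_tilde + self.b["u"])
        # One forget gate per child, conditioned on that child's own hidden state.
        f = np.array([sigmoid(self.W["f"] @ x + self.U["f"] @ hk + self.b["f"])
                      for hk in child_h])
        c = i * u + (f * child_c).sum(axis=0) if len(child_h) else i * u
        h = o * np.tanh(c)
        return h, c
```

Encoding a question then amounts to a post-order traversal of its parse tree, feeding each node's word embedding as `x` and its children's `(h, c)` pairs upward; the root's hidden state serves as the question representation.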
Pages: 12
Related papers
50 records
  • [41] Visual Question Answering via Combining Inferential Attention and Semantic Space Mapping
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Zhou, Zhibo
    Zhao, Zhonghua
    Li, Zhoujun
    KNOWLEDGE-BASED SYSTEMS, 2020, 207
  • [42] Hierarchical Attention Networks for Fact-based Visual Question Answering
    Yao, Haibo
    Luo, Yongkang
    Zhang, Zhi
    Yang, Jianhang
    Cai, Chengtao
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (06) : 17281 - 17298
  • [43] SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Liu, Zichuan
    Liu, Fayao
    Tang, Zhenmin
    Yao, Yazhou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9666 - 9675
  • [45] EFFECT OF A PRAGMATIC PRESUPPOSITION ON SYNTACTIC STRUCTURE IN QUESTION ANSWERING
    BOCK, JK
    JOURNAL OF VERBAL LEARNING AND VERBAL BEHAVIOR, 1977, 16 (06): : 723 - 734
  • [46] Depth and Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Yao, Yazhou
    Liu, Fayao
    Liu, Zichuan
    Tang, Zhenmin
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 6807 - 6819
  • [47] Chinese Knowledge Base Question Answering by Attention-Based Multi-Granularity Model
    Shen, Cun
    Huang, Tinglei
    Liang, Xiao
    Li, Feng
    Fu, Kun
    INFORMATION, 2018, 9 (04)
  • [48] Step Counting with Attention-based LSTM
    Khan, Shehroz S.
    Abedi, Ali
    2022 IEEE SYMPOSIUM SERIES ON COMPUTATIONAL INTELLIGENCE (SSCI), 2022, : 559 - 566
  • [49] Focal Visual-Text Attention for Visual Question Answering
    Liang, Junwei
    Jiang, Lu
    Cao, Liangliang
    Li, Li-Jia
    Hauptmann, Alexander
    2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 6135 - 6143
  • [50] Learning to Supervise Knowledge Retrieval Over a Tree Structure for Visual Question Answering
    Xu, Ning
    Lu, Zimu
    Tian, Hongshuo
    Kang, Rongbao
    Cao, Jinbo
    Zhang, Yongdong
    Liu, An-An
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 6689 - 6700