Image Caption Generation with Hierarchical Contextual Visual Spatial Attention

Cited by: 10
Authors
Khademi, Mahmoud [1 ]
Schulte, Oliver [1 ]
Affiliations
[1] Simon Fraser Univ, Burnaby, BC, Canada
DOI
10.1109/CVPRW.2018.00260
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present a novel context-aware attention-based deep architecture for image caption generation. Our architecture employs a Bidirectional Grid LSTM, which takes visual features of an image as input and learns complex spatial patterns based on two-dimensional context by selecting or ignoring its input. The Grid LSTM has not previously been applied to the image caption generation task. Another novel aspect is that we leverage a set of local region-grounded texts obtained by transfer learning. The region-grounded texts often describe the properties of the objects and their relationships in an image. To generate a global caption for the image, we integrate the spatial features from the Grid LSTM with the local region-grounded texts, using a two-layer Bidirectional LSTM. The first layer models the global scene context, such as object presence. The second layer utilizes a novel dynamic spatial attention mechanism, based on another Grid LSTM, to generate the global caption word-by-word while considering the caption context around a word in both directions. Unlike recent models that use a soft attention mechanism, our dynamic spatial attention mechanism considers the spatial context of the image regions. Experimental results on the MS-COCO dataset show that our architecture outperforms the state-of-the-art.
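The abstract describes a decoder that, at each generated word, attends over per-region visual features. As a rough illustration of this kind of attention step, the sketch below implements generic additive (soft) attention over region features conditioned on a decoder state; it is not the authors' Grid-LSTM-based dynamic spatial attention, and all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(regions, hidden, W_r, W_h, v):
    """One soft-attention step over image regions (illustrative only).

    regions: (N, D) visual features for N image regions
    hidden:  (H,)   decoder state at the current word
    W_r, W_h, v: learned projections (here just placeholder arrays)
    """
    # additive attention: score each region against the decoder context
    scores = np.tanh(regions @ W_r + hidden @ W_h) @ v   # (N,)
    alpha = softmax(scores)                              # weights, sum to 1
    context = alpha @ regions                            # (D,) weighted visual context
    return context, alpha
```

The paper's mechanism differs in that the attention weights are produced by a Grid LSTM that sees the two-dimensional neighborhood of each region, so spatially adjacent regions influence each other's weights rather than being scored independently as above.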
Pages: 2024 - 2032
Page count: 9
Related Papers
50 records in total
  • [41] Attention-based Visual-Audio Fusion for Video Caption Generation
    Guo, Ningning
    Liu, Huaping
    Jiang, Linhua
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS (ICARM 2019), 2019, : 839 - 844
  • [42] Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
    Zhang, Huawei
    Ma, Chengbo
    Jiang, Zhanjun
    Lian, Jing
    IEEE ACCESS, 2023, 11 : 134 - 143
  • [43] Mind's Eye: A Recurrent Visual Representation for Image Caption Generation
    Chen, Xinlei
    Zitnick, C. Lawrence
    2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2015, : 2422 - 2431
  • [44] TVPRNN for image caption generation
    Yang, Liang
    Hu, Haifeng
    ELECTRONICS LETTERS, 2017, 53 (22) : 1471 - +
  • [45] Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards
    Wu, Chunlei
    Yuan, Shaozu
    Cao, Haiwen
    Wei, Yiwei
    Wang, Leiquan
    IEEE ACCESS, 2020, 8 (08): : 57943 - 57951
  • [46] Attention based sequence-to-sequence framework for auto image caption generation
    Khan, Rashid
    Islam, M. Shujah
    Kanwal, Khadija
    Iqbal, Mansoor
    Hossain, Md Imran
    Ye, Zhongfu
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2022, 43 (01) : 159 - 170
  • [47] Image Caption with Endogenous–Exogenous Attention
    Teng Wang
    Haifeng Hu
    Chen He
    Neural Processing Letters, 2019, 50 : 431 - 443
  • [48] CNN image caption generation
    Li Y.
    Cheng H.
    Liang X.
    Guo Q.
    Qian Y.
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2019, 46 (02): : 152 - 157
  • [49] Enhancing image caption generation through context-aware attention mechanism
    Bhuiyan, Ahatesham
    Hossain, Eftekhar
    Hoque, Mohammed Moshiul
    Dewan, M. Ali Akber
    HELIYON, 2024, 10 (17)
  • [50] Hierarchical cross-modal contextual attention network for visual grounding
    Xu, Xin
    Lv, Gang
    Sun, Yining
    Hu, Yuxia
    Nian, Fudong
    MULTIMEDIA SYSTEMS, 2023, 29 (04) : 2073 - 2083