Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2
Authors
Pourkeshavarz, Mozhgan [1 ]
Nabavi, Shahabedin [1 ]
Moghaddam, Mohsen Ebrahimi [1 ]
Shamsfard, Mehrnoush [1 ]
Institutions
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN;
DOI
10.1007/s11042-023-15869-x
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline Classification Code
0812;
Abstract
The attention-enriched encoder-decoder framework has recently attracted considerable interest in image captioning because of its rapid progress. Many visual attention models leverage salient regions directly to generate image descriptions. However, a direct transition from visual space to text is not sufficient to produce fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. We propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning that consolidates cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provide a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. Experimental results indicate that the proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K datasets.
Pages: 12209-12233
Page count: 25
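The abstract above describes the architecture only at a high level. Below is a minimal sketch of the stacked cross-modal consolidation idea, assuming a PyTorch implementation with illustrative module names, tensor shapes, and a simple concatenate-and-project compounding function; none of these details are taken from the authors' released code.

```python
# Hypothetical sketch of stacked cross-modal feature consolidation:
# region features and attribute embeddings are attended with the decoder
# state as the query, compounded, and refined over several reasoning steps.
import torch
import torch.nn as nn


class SCFCBlock(nn.Module):
    """One consolidation step: attend over visual regions and semantic
    attributes, then compound the two attended vectors into one feature."""

    def __init__(self, dim: int):
        super().__init__()
        self.vis_att = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.sem_att = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        # Illustrative compounding function: concatenate and project.
        self.compound = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, query, regions, attributes):
        # query: (B, 1, D); regions: (B, R, D); attributes: (B, A, D)
        v, _ = self.vis_att(query, regions, regions)        # attended visual context
        s, _ = self.sem_att(query, attributes, attributes)  # attended semantic context
        return self.compound(torch.cat([v, s], dim=-1))     # consolidated feature (B, 1, D)


class StackedSCFC(nn.Module):
    """Stack consolidation blocks so each step's consolidated feature
    becomes the query for the next step (multi-step reasoning)."""

    def __init__(self, dim: int, steps: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([SCFCBlock(dim) for _ in range(steps)])

    def forward(self, hidden, regions, attributes):
        q = hidden.unsqueeze(1)                # decoder hidden state as initial query
        for block in self.blocks:
            q = block(q, regions, attributes)  # refine the cross-modal feature
        return q.squeeze(1)                    # (B, D), fed to the caption LSTM


if __name__ == "__main__":
    B, R, A, D = 2, 36, 10, 512                # batch, regions, attributes, feature dim
    model = StackedSCFC(D, steps=2)
    fused = model(torch.randn(B, D), torch.randn(B, R, D), torch.randn(B, A, D))
    print(fused.shape)                         # torch.Size([2, 512])
```

In the paper the consolidated feature conditions an SCFC-LSTM decoder at each time step; here the decoder is omitted and only the consolidation path is shown.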
Related papers
50 records in total
  • [31] Joint feature approach for image-text cross-modal retrieval
    Gao, Dihui
    Sheng, Lijie
    Xu, Xiaodong
    Miao, Qiguang
    Xi'an Dianzi Keji Daxue Xuebao/Journal of Xidian University, 2024, 51 (04) : 128 - 138
  • [32] Heterogeneous Feature Fusion and Cross-modal Alignment for Composed Image Retrieval
    Zhang, Gangjian
    Wei, Shikui
    Pang, Huaxin
    Zhao, Yao
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021 : 5353 - 5362
  • [33] Deep discriminative image feature learning for cross-modal semantics understanding
    Zhang, Hong
    Liu, Fangming
    Li, Bo
    Zhang, Ling
    Zhu, Yihai
    Wang, Ziwei
    KNOWLEDGE-BASED SYSTEMS, 2021, 216
  • [34] Cross-modal links in spatial attention
    Driver, J
    Spence, C
    PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES, 1998, 353 (1373) : 1319 - 1331
  • [35] Cross-modal decoupling in temporal attention
    Muehlberg, Stefanie
    Oriolo, Giovanni
    Soto-Faraco, Salvador
    EUROPEAN JOURNAL OF NEUROSCIENCE, 2014, 39 (12) : 2089 - 2097
  • [36] Cross-modal orienting of visual attention
    Hillyard, Steven A.
    Stoermer, Viola S.
    Feng, Wenfeng
    Martinez, Antigona
    McDonald, John J.
    NEUROPSYCHOLOGIA, 2016, 83 : 170 - 178
  • [37] Cross-modal attention and letter recognition
    Wesner, Michael
    Miller, Lisa
    INTERNATIONAL JOURNAL OF PSYCHOLOGY, 2008, 43 (3-4) : 343 - 343
  • [38] Cross-modal synergies in spatial attention
    Driver, J
    Eimer, M
    Macaluso, E
    Van Velzen, J
    PERCEPTION, 2003, 32 : 15 - 15
  • [39] Deliberate Attention Networks for Image Captioning
    Gao, Lianli
    Fan, Kaixuan
    Song, Jingkuan
    Liu, Xianglong
    Xu, Xing
    Shen, Heng Tao
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8320 - 8327
  • [40] Cross-Modal Scene Networks
    Aytar, Yusuf
    Castrejon, Lluis
    Vondrick, Carl
    Pirsiavash, Hamed
    Torralba, Antonio
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2018, 40 (10) : 2303 - 2314