Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2
Authors
Pourkeshavarz, Mozhgan [1 ]
Nabavi, Shahabedin [1 ]
Moghaddam, Mohsen Ebrahimi [1 ]
Shamsfard, Mehrnoush [1 ]
Affiliations
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN;
DOI
10.1007/s11042-023-15869-x
CLC Classification Number
TP [Automation technology, computer technology]
Discipline Classification Code
0812
Abstract
The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning due to its remarkable progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. Thus, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of our proposed compounding function, where the CAA provide a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. The experimental results indicate that the proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K datasets.
Pages: 12209 - 12233
Number of pages: 25
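
The abstract above describes the architecture only at a high level. The following is a minimal, hypothetical sketch of what a stacked cross-modal consolidation step could look like, assuming pre-extracted region features and context-aware attribute embeddings projected to a shared dimension. The class names, the gated compounding function, and all dimensions are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of stacked cross-modal feature consolidation attention.
# Assumes region features (B, K, d) and attribute embeddings (B, M, d) are given.
import torch
import torch.nn as nn

class ConsolidationBlock(nn.Module):
    """One reasoning step: attend to visual regions and semantic attributes,
    then compound the two attended vectors into a single cross-modal feature."""

    def __init__(self, d_model):
        super().__init__()
        self.vis_att = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.sem_att = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # stand-in for the compounding function

    def forward(self, query, regions, attributes):
        # query: (B, 1, d), regions: (B, K, d), attributes: (B, M, d)
        v, _ = self.vis_att(query, regions, regions)         # attended visual context
        s, _ = self.sem_att(query, attributes, attributes)   # attended semantic context
        g = torch.sigmoid(self.gate(torch.cat([v, s], dim=-1)))
        return g * v + (1.0 - g) * s                         # consolidated cross-modal feature

class StackedConsolidationDecoder(nn.Module):
    """Stack several consolidation blocks for multi-step reasoning; the output of
    each step becomes the query of the next and then conditions an LSTM decoder."""

    def __init__(self, d_model=512, num_steps=2, vocab_size=10000):
        super().__init__()
        self.blocks = nn.ModuleList([ConsolidationBlock(d_model) for _ in range(num_steps)])
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.LSTMCell(2 * d_model, d_model)     # word embedding + consolidated feature
        self.out = nn.Linear(d_model, vocab_size)

    def step(self, word_ids, state, regions, attributes):
        h, c = state
        query = h.unsqueeze(1)                               # hidden state guides the attention
        for block in self.blocks:
            query = block(query, regions, attributes)        # multi-step consolidation
        x = torch.cat([self.embed(word_ids), query.squeeze(1)], dim=-1)
        h, c = self.decoder(x, (h, c))
        return self.out(h), (h, c)                           # next-word logits and new state

In the paper, the compounding function and the SCFC-LSTM cell are presumably more elaborate; the gated sum and plain LSTMCell here only indicate where those components would sit in the decoding loop.
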
Related Papers
50 records in total
  • [1] Stacked cross-modal feature consolidation attention networks for image captioning
    Pourkeshavarz, Mozhgan
    Nabavi, Shahabedin
    Moghaddam, Mohsen Ebrahimi
    Shamsfard, Mehrnoush
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 : 12209 - 12233
  • [2] HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning
    Yang, Zhigang
    Li, Qiang
    Yuan, Yuan
    Wang, Qi
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 11
  • [3] Exploring and Distilling Cross-Modal Information for Image Captioning
    Liu, Fenglin
    Ren, Xuancheng
    Liu, Yuanxin
    Lei, Kai
    Sun, Xu
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 5095 - 5101
  • [4] Cross-modal recipe retrieval with stacked attention model
    Chen, Jing-Jing
    Pang, Lei
    Ngo, Chong-Wah
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (22) : 29457 - 29473
  • [5] Cross-modal recipe retrieval with stacked attention model
    Chen, Jing-Jing
    Pang, Lei
    Ngo, Chong-Wah
    MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 : 29457 - 29473
  • [6] CM-SC: Cross-modal spatial-channel attention network for image captioning
    Hossain, Md. Shamim
    Aktar, Shamima
    Hossain, Mohammad Alamgir
    Gu, Naijie
    Huang, Zhangjin
    DISPLAYS, 2025, 87
  • [7] SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval
    Ji, Zhong
    Wang, Haoran
    Han, Jungong
    Pang, Yanwei
    IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (02) : 1086 - 1097
  • [8] Cross-modal attention for multi-modal image registration
    Song, Xinrui
    Chao, Hanqing
    Xu, Xuanang
    Guo, Hengtao
    Xu, Sheng
    Turkbey, Baris
    Wood, Bradford J.
    Sanford, Thomas
    Wang, Ge
    Yan, Pingkun
    MEDICAL IMAGE ANALYSIS, 2022, 82
  • [9] Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning
    Yang, Yang
    Wei, Hongchen
    Zhu, Hengshu
    Yu, Dianhai
    Xiong, Hui
    Yang, Jian
    IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02) : 890 - 902
  • [10] Learning Cross-modal Representations with Multi-relations for Image Captioning
    Cheng, Peng
    Le, Tung
    Racharak, Teeradaj
    Cao, Yiming
    Kong, Weikun
    Nguyen, Minh Le
    PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS (ICPRAM), 2021, : 346 - 353