Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2
Authors
Pourkeshavarz, Mozhgan [1]
Nabavi, Shahabedin [1]
Moghaddam, Mohsen Ebrahimi [1]
Shamsfard, Mehrnoush [1]
Affiliations
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN
DOI
10.1007/s11042-023-15869-x
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The attention-enriched encoder-decoder framework has recently aroused great interest in image captioning due to its overwhelming progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, seeking a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information regarding the contextual environment in a fully end-to-end manner. Thus, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. Besides, we jointly employ spatial information and context-aware attributes (CAA) as the principal components in our proposed compounding function, where our CAA provides a concise context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information through the caption generation process. The experimental results indicate that our proposed SCFC can outperform various state-of-the-art image captioning benchmarks in terms of popular metrics on the MSCOCO and Flickr30K datasets.
Pages: 12209 - 12233
Page count: 25
Related papers
50 in total
  • [11] Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning
    Li, Zhengxin
    Zhao, Wenzhe
    Du, Xuanyi
    Zhou, Guangyao
    Zhang, Songlin
    REMOTE SENSING, 2024, 16 (01)
  • [12] VLCA: vision-language aligning model with cross-modal attention for bilingual remote sensing image captioning
    Wei, Tingting
    Yuan, Weilin
    Luo, Junren
    Zhang, Wanpeng
    Lu, Lina
    JOURNAL OF SYSTEMS ENGINEERING AND ELECTRONICS, 2023, 34 (01) : 9 - 18
  • [14] Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation
    Zhao, Wentian
    Wu, Xinxiao
    Luo, Jiebo
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 1180 - 1192
  • [15] Semantic-Enhanced Cross-Modal Fusion for Improved Unsupervised Image Captioning
    Xiang, Nan
    Chen, Ling
    Liang, Leiyan
    Rao, Xingdi
    Gong, Zehao
    ELECTRONICS, 2023, 12 (17)
  • [16] Cross-modal image fusion guided by subjective visual attention
    Fang, Aiqing
    Zhao, Xinbo
    Zhang, Yanning
    NEUROCOMPUTING, 2020, 414 (414) : 333 - 345
  • [17] CROSS-MODAL DEEP NETWORKS FOR DOCUMENT IMAGE CLASSIFICATION
    Bakkali, Souhail
    Ming, Zuheng
    Coustaty, Mickael
    Rusinol, Marcal
    2020 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2020, : 2556 - 2560
  • [18] A Regenerated Feature Extraction Method for Cross-modal Image Registration
    Yang, Jian
    Wang, Qi
    Li, Xuelong
    ADVANCES IN BRAIN INSPIRED COGNITIVE SYSTEMS, BICS 2018, 2018, 10989 : 441 - 451
  • [19] Cross-Modal feature description for remote sensing image matching
    Li, Liangzhi
    Liu, Ming
    Ma, Lingfei
    Han, Ling
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2022, 112
  • [20] Fine-Grained Correlation Learning with Stacked Co-attention Networks for Cross-Modal Information Retrieval
    Lu, Yuhang
    Yu, Jing
    Liu, Yanbing
    Tan, Jianlong
    Guo, Li
    Zhang, Weifeng
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT (KSEM 2018), PT I, 2018, 11061 : 213 - 225