Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2
Authors
Pourkeshavarz, Mozhgan [1]
Nabavi, Shahabedin [1]
Moghaddam, Mohsen Ebrahimi [1]
Shamsfard, Mehrnoush [1]
Affiliation
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN
DOI
10.1007/s11042-023-15869-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning due to its remarkable progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, seeking a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. Thus, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of our proposed compounding function, where the CAA provides a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. The experimental results indicate that our proposed SCFC can outperform various state-of-the-art image captioning methods in terms of popular metrics on the MSCOCO and Flickr30K datasets.
Pages: 12209-12233
Number of pages: 25
Related papers
50 in total
  • [41] TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning
    Wu, Yinan
    Li, Lingling
    Jiao, Licheng
    Liu, Fang
    Liu, Xu
    Yang, Shuyuan
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [42] Low-Rank HOCA: Efficient High-Order Cross-Modal Attention for Video Captioning
    Jin, Tao
    Huang, Siyu
    Li, Yingming
    Zhang, Zhongfei
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 2001 - 2011
  • [43] Lightweight dense video captioning with cross-modal attention and knowledge-enhanced unbiased scene graph
    Han, Shixing
    Liu, Jin
    Zhang, Jinyingming
    Gong, Peizhu
    Zhang, Xiliang
    He, Huihua
    COMPLEX & INTELLIGENT SYSTEMS, 2023, 9 (05) : 4995 - 5012
  • [45] Revamping Image-Recipe Cross-Modal Retrieval with Dual Cross Attention Encoders
    Liu, Wenhao
    Yuan, Simiao
    Wang, Zhen
    Chang, Xinyi
    Gao, Limeng
    Zhang, Zhenrui
    MATHEMATICS, 2024, 12 (20)
  • [46] Auditory Attention Detection via Cross-Modal Attention
    Cai, Siqi
    Li, Peiwen
    Su, Enze
    Xie, Longhan
    FRONTIERS IN NEUROSCIENCE, 2021, 15
  • [47] Image captioning: Semantic selection unit with stacked residual attention
    Song, Lifei
    Li, Fei
    Wang, Ying
    Liu, Yu
    Wang, Yuanhua
    Xiang, Shiming
    IMAGE AND VISION COMPUTING, 2024, 144
  • [48] Cross-modal fusion for multi-label image classification with attention mechanism
    Wang, Yangtao
    Xie, Yanzhao
    Zeng, Jiangfeng
    Wang, Hanpin
    Fan, Lisheng
    Song, Yufan
    Computers and Electrical Engineering, 2022, 101
  • [50] CCAFusion: Cross-Modal Coordinate Attention Network for Infrared and Visible Image Fusion
    Li, Xiaoling
    Li, Yanfeng
    Chen, Houjin
    Peng, Yahui
    Pan, Pan
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (02) : 866 - 881