Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2
Authors
Pourkeshavarz, Mozhgan [1 ]
Nabavi, Shahabedin [1 ]
Moghaddam, Mohsen Ebrahimi [1 ]
Shamsfard, Mehrnoush [1 ]
Affiliations
[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran
Keywords
Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN;
DOI
10.1007/s11042-023-15869-x
CLC number
TP [Automation technology, computer technology];
Discipline code
0812 ;
Abstract
The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning due to its remarkable progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. We therefore propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, which consolidates cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provide a concise, context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption-generation process. Experimental results indicate that the proposed SCFC outperforms various state-of-the-art image captioning models in terms of popular metrics on the MSCOCO and Flickr30K datasets.
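The abstract's core idea, consolidating visual region features and semantic attribute features through repeated attention steps, can be illustrated with a minimal sketch. This is not the paper's actual SCFC formulation (its compounding function and CAA module are not reproduced here); it only shows the general stacked pattern, with randomly initialized projections standing in for learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def scfc_sketch(visual, attrs, steps=3, seed=0):
    """Hypothetical multi-step cross-modal consolidation.

    visual: (R, d) region features; attrs: (A, d) attribute embeddings.
    At each step a query state attends over both modalities, and the
    attended summaries are compounded (here: a simple additive fusion)
    to refine the query for the next reasoning step."""
    rng = np.random.default_rng(seed)
    d = visual.shape[1]
    W_v = rng.standard_normal((d, d)) / np.sqrt(d)  # visual projection (stand-in for learned weights)
    W_a = rng.standard_normal((d, d)) / np.sqrt(d)  # attribute projection
    q = np.zeros(d)                                 # initial query state
    for _ in range(steps):
        av = softmax(visual @ (W_v @ q))            # attention over regions
        aa = softmax(attrs @ (W_a @ q))             # attention over attributes
        v_ctx = av @ visual                         # attended visual summary
        a_ctx = aa @ attrs                          # attended semantic summary
        q = np.tanh(v_ctx + a_ctx + q)              # compound and refine state
    return q
```

In the paper, the refined cross-modal state would feed the SCFC-LSTM decoder at each word-generation step; here the returned vector merely stands in for that consolidated representation.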
Pages: 12209-12233
Page count: 25
Related papers
50 items total
  • [21] Improving Cross-Modal Alignment with Synthetic Pairs for Text-Only Image Captioning
    Liu, Zhiyue
    Liu, Jinyuan
    Ma, Fanrong
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4, 2024, : 3864 - 3872
  • [22] Cross-Modal Graph With Meta Concepts for Video Captioning
    Wang, Hao
    Lin, Guosheng
    Hoi, Steven C. H.
    Miao, Chunyan
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 5150 - 5162
  • [23] Hybrid-attention based Feature-reconstructive Adversarial Hashing Networks for Cross-modal Retrieval
    Li, Chen
    Wang, Hua
    JOURNAL OF ENGINEERING RESEARCH, 2022, 10
  • [24] Cross-Modal Self-Attention Network for Referring Image Segmentation
    Ye, Linwei
    Rochan, Mrigank
    Liu, Zhi
    Wang, Yang
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 10494 - 10503
  • [26] Cross-modal attention guided visual reasoning for referring image segmentation
    Zhang, Wenjing
    Hu, Mengnan
    Tan, Quange
    Zhou, Qianli
    Wang, Rong
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (19) : 28853 - 28872
  • [27] Cross-Modal Attention With Semantic Consistence for Image-Text Matching
    Xu, Xing
    Wang, Tan
    Yang, Yang
    Zuo, Lin
    Shen, Fumin
    Shen, Heng Tao
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (12) : 5412 - 5425
  • [28] A role for consolidation in cross-modal category learning
    Ashton, Jennifer E.
    Jefferies, Elizabeth
    Gaskell, M. Gareth
    NEUROPSYCHOLOGIA, 2018, 108 : 50 - 60
  • [29] Cross-Modal Hybrid Feature Fusion for Image-Sentence Matching
    Xu, Xing
    Wang, Yifan
    He, Yixuan
    Yang, Yang
    Hanjalic, Alan
    Shen, Heng Tao
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2021, 17 (04)
  • [30] CROSS-MODAL TRANSFER WITH NEURAL WORD VECTORS FOR IMAGE FEATURE LEARNING
    Irie, Go
    Asami, Taichi
    Tarashima, Shuhei
    Kurozumi, Takayuki
    Kinebuchi, Tetsuya
    2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 2916 - 2920