Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2

Authors:

Pourkeshavarz, Mozhgan [1]

Nabavi, Shahabedin [1]

Moghaddam, Mohsen Ebrahimi [1]

Shamsfard, Mehrnoush [1]

Affiliations:

[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2024 / Vol. 83 / Issue 04

Keywords:

Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN;

DOI:

10.1007/s11042-023-15869-x

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

The attention-enriched encoder-decoder framework has recently attracted great interest in image captioning owing to its rapid progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, seeking a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. We therefore propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which cross-modal features are consolidated simultaneously through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of the proposed compounding function, where the CAA provide a concise, context-sensitive semantic representation. To make better use of the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. The experimental results indicate that the proposed SCFC can outperform various state-of-the-art image captioning methods in terms of popular metrics on the MSCOCO and Flickr30K datasets.

Pages: 12209-12233

Page count: 25
