Stacked cross-modal feature consolidation attention networks for image captioning

Cited by: 2

Authors:

Pourkeshavarz, Mozhgan [1]

Nabavi, Shahabedin [1]

Moghaddam, Mohsen Ebrahimi [1]

Shamsfard, Mehrnoush [1]

Affiliations:

[1] Shahid Beheshti Univ, Fac Comp Sci & Engn, Tehran, Iran

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2024 / Vol. 83 / Issue 04

Keywords:

Contextual representation; Cross-modal feature fusion; Image captioning; Stacked attention network; Visual and semantic information; BOTTOM-UP; TOP-DOWN;

DOI:

10.1007/s11042-023-15869-x

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

The attention-enriched encoder-decoder framework has recently aroused great interest in image captioning due to its overwhelming progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, seeking a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. Thus, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components of our proposed compounding function, where the CAA provides a concise context-sensitive semantic representation. To better exploit the potential of the consolidated features, we propose an SCFC-LSTM as the caption generator, which can leverage discriminative semantic information throughout the caption generation process. The experimental results indicate that our proposed SCFC can outperform various state-of-the-art image captioning benchmarks in terms of popular metrics on the MSCOCO and Flickr30K datasets.

Pages: 12209 - 12233

Number of pages: 25

50 items in total

[1] Stacked cross-modal feature consolidation attention networks for image captioning
Mozhgan Pourkeshavarz
Shahabedin Nabavi
Mohsen Ebrahimi Moghaddam
Mehrnoush Shamsfard
Multimedia Tools and Applications, 2024, 83 : 12209 - 12233
[2] HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning
Yang, Zhigang
Li, Qiang
Yuan, Yuan
Wang, Qi
IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62 : 1 - 11
[3] Exploring and Distilling Cross-Modal Information for Image Captioning
Liu, Fenglin
Ren, Xuancheng
Liu, Yuanxin
Lei, Kai
Sun, Xu
PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 5095 - 5101
[4] Cross-modal recipe retrieval with stacked attention model
Chen, Jing-Jing
Pang, Lei
Ngo, Chong-Wah
MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (22) : 29457 - 29473
[5] Cross-modal recipe retrieval with stacked attention model
Jing-Jing Chen
Lei Pang
Chong-Wah Ngo
Multimedia Tools and Applications, 2018, 77 : 29457 - 29473
[6] CM-SC: Cross-modal spatial-channel attention network for image captioning
Hossain, Md. Shamim
Aktar, Shamima
Hossain, Mohammad Alamgir
Gu, Naijie
Huang, Zhangjin
DISPLAYS, 2025, 87
[7] SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval
Ji, Zhong
Wang, Haoran
Han, Jungong
Pang, Yanwei
IEEE TRANSACTIONS ON CYBERNETICS, 2022, 52 (02) : 1086 - 1097
[8] Cross-modal attention for multi-modal image registration
Song, Xinrui
Chao, Hanqing
Xu, Xuanang
Guo, Hengtao
Xu, Sheng
Turkbey, Baris
Wood, Bradford J.
Sanford, Thomas
Wang, Ge
Yan, Pingkun
MEDICAL IMAGE ANALYSIS, 2022, 82
[9] Exploiting Cross-Modal Prediction and Relation Consistency for Semisupervised Image Captioning
Yang, Yang
Wei, Hongchen
Zhu, Hengshu
Yu, Dianhai
Xiong, Hui
Yang, Jian
IEEE TRANSACTIONS ON CYBERNETICS, 2024, 54 (02) : 890 - 902
[10] Learning Cross-modal Representations with Multi-relations for Image Captioning
Cheng, Peng
Le, Tung
Racharak, Teeradaj
Cao Yiming
Kong Weikun
Minh Le Nguyen
PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION APPLICATIONS AND METHODS (ICPRAM), 2021, : 346 - 353