Dynamic-balanced double-attention fusion for image captioning

Cited by: 10
|
Authors
Wang, Changzhi [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Attention fusion; DSR; Attention variance; Semantic attention; Network;
DOI
10.1016/j.engappai.2022.105194
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Image captioning has received significant attention in the cross-modal field, in which spatial and channel attention play a crucial role. However, such attention-based approaches ignore two issues: (1) errors or noise in the channel feature map are amplified in the spatial feature map, lowering model reliability; (2) image spatial features and channel features contribute differently to the prediction of function words (e.g., "in'', "out'' and "on'') and notional words (e.g., "girl'', "teddy'' and "bear''). To alleviate these issues, in this paper we propose Dynamic-Balanced Double-Attention Fusion (DBDAF) for the image captioning task, which exploits attention variation to enhance the overall performance of the model. Technically, DBDAF first integrates a parallel Double Attention Network (DAN) in which channel attention serves as a supplement to region attention, enhancing model reliability. Then, an attention-variation-based Balancing Attention Fusion Mechanism (BAFM) module is devised. When predicting function words and notional words, BAFM strikes a dynamic balance between channel attention and region attention based on attention variation. Moreover, to achieve richer image descriptions, we further devise a Doubly Stochastic Regularization (DSR) penalty and integrate it into the model loss function. DSR makes the model attend equally to every pixel and every channel when generating the entire sentence. Extensive experiments on three typical datasets show that our DBDAF clearly outperforms related end-to-end approaches. More remarkably, DBDAF achieves 1.04% and 1.75% improvements in BLEU4 and CIDEr, respectively, on the MSCOCO dataset.
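The abstract names three components: a parallel Double Attention Network (region plus channel attention), a variance-driven Balancing Attention Fusion Mechanism, and a Doubly Stochastic Regularization penalty on the loss. The PyTorch sketch below illustrates one plausible reading of these ideas under assumed tensor shapes; every module name, projection, and the gate form are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the ideas described in the abstract: parallel region
# (spatial) and channel attention (DAN), a variance-driven balancing gate
# (BAFM), and a doubly stochastic penalty over both attention maps (DSR).
# Shapes, projections, and the gate form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedDoubleAttention(nn.Module):
    def __init__(self, feat_channels: int, hid_dim: int):
        super().__init__()
        self.q_spatial = nn.Linear(hid_dim, feat_channels)  # query for spatial scores
        self.q_channel = nn.Linear(hid_dim, feat_channels)  # query for channel scores
        self.gate = nn.Linear(2, 1)  # maps [var(alpha), var(beta)] to a balance weight

    def forward(self, fmap: torch.Tensor, h: torch.Tensor):
        # fmap: (B, C, R) CNN feature map with R = H*W positions; h: (B, D) decoder state
        # Region attention: one weight per spatial position.
        qs = self.q_spatial(h).unsqueeze(2)                          # (B, C, 1)
        alpha = F.softmax((fmap * qs).sum(dim=1), dim=1)             # (B, R)
        ctx_region = torch.bmm(fmap, alpha.unsqueeze(2)).squeeze(2)  # (B, C)
        # Channel attention: one weight per channel of the pooled map.
        pooled = fmap.mean(dim=2)                                    # (B, C)
        beta = F.softmax(self.q_channel(h) * pooled, dim=1)          # (B, C)
        ctx_channel = beta * pooled                                  # (B, C)
        # BAFM-style gate: a sharper (higher-variance) attention map is
        # treated as more informative and gets more weight in the fusion.
        variances = torch.stack([alpha.var(dim=1), beta.var(dim=1)], dim=1)  # (B, 2)
        lam = torch.sigmoid(self.gate(variances))                    # (B, 1)
        fused = lam * ctx_region + (1.0 - lam) * ctx_channel         # (B, C)
        return fused, alpha, beta

def dsr_penalty(alphas: torch.Tensor, betas: torch.Tensor, coef: float = 1.0):
    # alphas: (B, T, R) and betas: (B, T, C) attention weights collected over
    # the T decoding steps. Penalizing deviation of the per-position and
    # per-channel totals from 1 pushes the model to cover every pixel and
    # every channel, mirroring the DSR idea described in the abstract.
    pen_regions = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
    pen_channels = ((1.0 - betas.sum(dim=1)) ** 2).sum(dim=1).mean()
    return coef * (pen_regions + pen_channels)
```

In training, the penalty would simply be added to the captioning loss, e.g. `loss = cross_entropy + dsr_penalty(alphas, betas, coef)`, with `coef` a tunable weight; the abstract does not specify its value.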
Pages: 15
Related papers
50 records in total
  • [31] Refining Attention: A Sequential Attention Model for Image Captioning
    Fang, Fang
    Li, Qinyu
    Wang, Hanli
    Tang, Pengjie
    2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018,
  • [32] Boosted Attention: Leveraging Human Attention for Image Captioning
    Chen, Shi
    Zhao, Qi
    COMPUTER VISION - ECCV 2018, PT XI, 2018, 11215 : 72 - 88
  • [33] Identification of Tool Wear Based on Infographics and a Double-Attention Network
    Ni, Jing
    Liu, Xuansong
    Meng, Zhen
    Cui, Yiming
    MACHINES, 2023, 11 (10)
  • [34] The synergy of double attention: Combine sentence-level and word-level attention for image captioning
    Wei, Haiyang
    Li, Zhixin
    Zhang, Canlong
    Ma, Huifang
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2020, 201
  • [35] Double-attention mechanism-based segmentation grasping detection network
    Li, Qinghua
    Wang, Xuyang
    Zhang, Kun
    Yang, Yiran
    Feng, Chao
    JOURNAL OF ELECTRONIC IMAGING, 2024, 33 (02)
  • [36] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [37] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [38] Recurrent Fusion Network for Image Captioning
    Jiang, Wenhao
    Ma, Lin
    Jiang, Yu-Gang
    Liu, Wei
    Zhang, Tong
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 510 - 526
  • [39] Cascade Semantic Fusion for Image Captioning
    Wang, Shiwei
    Lan, Long
    Zhang, Xiang
    Dong, Guohua
    Luo, Zhigang
    IEEE ACCESS, 2019, 7 : 66680 - 66688
  • [40] Self-Enhanced Attention for Image Captioning
    Sun, Qingyu
    Zhang, Juan
    Fang, Zhijun
    Gao, Yongbin
    NEURAL PROCESSING LETTERS, 2024, 56 (02)