Dynamic-balanced double-attention fusion for image captioning

Cited by: 0

Authors:

Wang, Changzhi [1]

Gu, Xiaodong [1]

Affiliations:

[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2022 / Vol. 114

Keywords:

Image captioning; Attention fusion; DSR; Attention variance;

DOI:

Not available

Chinese Library Classification:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Image captioning has received significant attention in the cross-modal field, where spatial and channel attention play a crucial role. However, such attention-based approaches ignore two issues: (1) errors or noise in the channel feature map are amplified in the spatial feature map, lowering model reliability; (2) image spatial features and channel features contribute differently to the prediction of function words (e.g., "in", "out" and "on") and notional words (e.g., "girl", "teddy" and "bear"). To alleviate these issues, we propose Dynamic-Balanced Double-Attention Fusion (DBDAF) for the image captioning task, which exploits attention variation to enhance the overall performance of the model. Technically, DBDAF first integrates a parallel Double Attention Network (DAN) in which channel attention is used as a supplement to region attention, enhancing model reliability. Then, an attention-variation-based Balancing Attention Fusion Mechanism (BAFM) module is devised. When predicting function words and notional words, BAFM dynamically balances channel attention and region attention based on attention variation. Moreover, to obtain richer image descriptions, we further devise a Doubly Stochastic Regularization (DSR) penalty and integrate it into the model loss function. The DSR penalty makes the model focus equally on every pixel and every channel when generating the entire sentence. Extensive experiments on three typical datasets show that DBDAF clearly outperforms the related leading end-to-end approaches. More remarkably, DBDAF achieves improvements of 1.04% and 1.75% in terms of BLEU4 and CIDEr on the MSCOCO dataset.
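
The abstract describes the DSR penalty only in words, so the following is a minimal sketch and not the authors' formulation. It assumes the penalty follows the familiar doubly stochastic form used with soft attention in captioning, lam * sum_i (1 - sum_t alpha[t][i])^2, applied once to the spatial weights and once to the channel weights. The function name dsr_penalty, the parameter lam, and the target value of 1 per location or channel are all assumptions made for illustration.

```python
def dsr_penalty(spatial_attn, channel_attn, lam=1.0):
    """Hypothetical doubly stochastic regularization penalty.

    spatial_attn: list of T rows, each holding one weight per spatial location.
    channel_attn: list of T rows, each holding one weight per channel.
    T is the number of decoding steps (words in the sentence).
    """
    def one_sided(attn):
        n = len(attn[0])
        # Total attention each location/channel received over the whole sentence.
        totals = [sum(step[i] for step in attn) for i in range(n)]
        # Penalize deviation from an (assumed) target total of 1.
        return sum((1.0 - s) ** 2 for s in totals)

    return lam * (one_sided(spatial_attn) + one_sided(channel_attn))


# Attention spread evenly over the sentence incurs no penalty.
balanced = dsr_penalty([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]])

# Attention stuck on one location at every step is penalized.
stuck = dsr_penalty([[1.0, 0.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]], lam=0.5)
```

Under these assumptions the penalty is zero only when every location and every channel accumulates the same total weight across the sentence, which matches the abstract's description of focusing equally on every pixel and every channel. In training it would be added to the captioning loss with lam as a tunable weight.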

Pages: 15

50 items in total

[21] Deliberate Attention Networks for Image Captioning
Gao, Lianli
Fan, Kaixuan
Song, Jingkuan
Liu, Xianglong
Xu, Xing
Shen, Heng Tao
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8320 - 8327
[22] Gated Hierarchical Attention for Image Captioning
Wang, Qingzhong
Chan, Antoni B.
COMPUTER VISION - ACCV 2018, PT IV, 2019, 11364 : 21 - 37
[23] Delving into Precise Attention in Image Captioning
Hu, Shaohan
Huang, Shenglei
Wang, Guolong
Li, Zhipeng
Qin, Zheng
NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 74 - 82
[24] Multivariate Attention Network for Image Captioning
Wang, Weixuan
Chen, Zhihong
Hu, Haifeng
COMPUTER VISION - ACCV 2018, PT VI, 2019, 11366 : 587 - 602
[25] Distributed Attention for Grounded Image Captioning
Chen, Nenglun
Pan, Xingjia
Chen, Runnan
Yang, Lei
Lin, Zhiwen
Ren, Yuqiang
Yuan, Haolei
Guo, Xiaowei
Huang, Feiyue
Wang, Wenping
PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 1966 - 1975
[26] Feedback Attention Model for Image Captioning
Lyu F.
Hu F.
Zhang Y.
Xia Z.
Sheng V.S.
Jisuanji Fuzhu Sheji Yu Tuxingxue Xuebao/Journal of Computer-Aided Design and Computer Graphics, 2019, 31 (07): : 1122 - 1129
[27] Attention Correctness in Neural Image Captioning
Liu, Chenxi
Mao, Junhua
Sha, Fei
Yuille, Alan
THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4176 - 4182
[28] IMAGE CAPTIONING WITH WORD LEVEL ATTENTION
Fang, Fang
Wang, Hanli
Tang, Pengjie
2018 25TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2018, : 1278 - 1282
[29] Hierarchical Attention Network for Image Captioning
Wang, Weixuan
Chen, Zhihong
Hu, Haifeng
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8957 - 8964
[30] Hybrid attention network for image captioning
Jiang, Wenhui
Li, Qin
Zhan, Kun
Fang, Yuming
Shen, Fei
DISPLAYS, 2022, 73