Dynamic-balanced double-attention fusion for image captioning

Cited by: 10
|
Authors
Wang, Changzhi [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Attention fusion; DSR; Attention variance; Semantic attention; Network;
DOI
10.1016/j.engappai.2022.105194
CLC number
TP [Automation Technology, Computer Technology];
Discipline code
0812;
Abstract
Image captioning has received significant attention in the cross-modal field, in which spatial and channel attention play a crucial role. However, such attention-based approaches ignore two issues: (1) errors or noise in the channel feature map are amplified in the spatial feature map, lowering model reliability; (2) image spatial features and channel features contribute differently to the prediction of function words (e.g., "in'', "out'' and "on'') and notional words (e.g., "girl'', "teddy'' and "bear''). To alleviate these issues, in this paper we propose Dynamic-Balanced Double-Attention Fusion (DBDAF) for the image captioning task, which exploits attention variation to enhance the overall performance of the model. Technically, DBDAF first integrates a parallel Double Attention Network (DAN) in which channel attention serves as a supplement to region attention, enhancing model reliability. Then, an attention-variation-based Balancing Attention Fusion Mechanism (BAFM) module is devised. When predicting function words and notional words, BAFM strikes a dynamic balance between channel attention and region attention based on attention variation. Moreover, to achieve richer image descriptions, we further devise a Doubly Stochastic Regularization (DSR) penalty and integrate it into the model loss function. DSR makes the model attend equally to every pixel and every channel when generating the entire sentence. Extensive experiments on three typical datasets show that our DBDAF clearly outperforms related end-to-end approaches. More remarkably, DBDAF achieves 1.04% and 1.75% improvements in BLEU4 and CIDEr, respectively, on the MSCOCO dataset.
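The abstract names three components: a parallel Double Attention Network (region plus channel attention), a variance-driven Balancing Attention Fusion Mechanism, and a Doubly Stochastic Regularization penalty on the loss. The PyTorch sketch below illustrates one plausible reading of these ideas under assumed tensor shapes; every module name, projection, and the gate form are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the ideas described in the abstract: parallel region
# (spatial) and channel attention (DAN), a variance-driven balancing gate
# (BAFM), and a doubly stochastic penalty over both attention maps (DSR).
# Shapes, projections, and the gate form are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BalancedDoubleAttention(nn.Module):
    def __init__(self, feat_channels: int, hid_dim: int):
        super().__init__()
        self.q_spatial = nn.Linear(hid_dim, feat_channels)  # query for spatial scores
        self.q_channel = nn.Linear(hid_dim, feat_channels)  # query for channel scores
        self.gate = nn.Linear(2, 1)  # maps [var(alpha), var(beta)] to a balance weight

    def forward(self, fmap: torch.Tensor, h: torch.Tensor):
        # fmap: (B, C, R) CNN feature map with R = H*W positions; h: (B, D) decoder state
        # Region attention: one weight per spatial position.
        qs = self.q_spatial(h).unsqueeze(2)                          # (B, C, 1)
        alpha = F.softmax((fmap * qs).sum(dim=1), dim=1)             # (B, R)
        ctx_region = torch.bmm(fmap, alpha.unsqueeze(2)).squeeze(2)  # (B, C)
        # Channel attention: one weight per channel of the pooled map.
        pooled = fmap.mean(dim=2)                                    # (B, C)
        beta = F.softmax(self.q_channel(h) * pooled, dim=1)          # (B, C)
        ctx_channel = beta * pooled                                  # (B, C)
        # BAFM-style gate: a sharper (higher-variance) attention map is
        # treated as more informative and gets more weight in the fusion.
        variances = torch.stack([alpha.var(dim=1), beta.var(dim=1)], dim=1)  # (B, 2)
        lam = torch.sigmoid(self.gate(variances))                    # (B, 1)
        fused = lam * ctx_region + (1.0 - lam) * ctx_channel         # (B, C)
        return fused, alpha, beta

def dsr_penalty(alphas: torch.Tensor, betas: torch.Tensor, coef: float = 1.0):
    # alphas: (B, T, R) and betas: (B, T, C) attention weights collected over
    # the T decoding steps. Penalizing deviation of the per-position and
    # per-channel totals from 1 pushes the model to cover every pixel and
    # every channel, mirroring the DSR idea described in the abstract.
    pen_regions = ((1.0 - alphas.sum(dim=1)) ** 2).sum(dim=1).mean()
    pen_channels = ((1.0 - betas.sum(dim=1)) ** 2).sum(dim=1).mean()
    return coef * (pen_regions + pen_channels)
```

In training, the penalty would simply be added to the captioning loss, e.g. `loss = cross_entropy + dsr_penalty(alphas, betas, coef)`, with `coef` a tunable weight; the abstract does not specify its value.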
Pages: 15
Related papers
50 records in total
  • [31] Refining Attention: A Sequential Attention Model for Image Captioning
    Fang, Fang
    Li, Qinyu
    Wang, Hanli
    Tang, Pengjie
    2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018,
  • [32] Boosted Attention: Leveraging Human Attention for Image Captioning
    Chen, Shi
    Zhao, Qi
    COMPUTER VISION - ECCV 2018, PT XI, 2018, 11215 : 72 - 88
  • [33] Identification of Tool Wear Based on Infographics and a Double-Attention Network
    Ni, Jing
    Liu, Xuansong
    Meng, Zhen
    Cui, Yiming
    MACHINES, 2023, 11 (10)
  • [34] The synergy of double attention: Combine sentence-level and word-level attention for image captioning
    Wei, Haiyang
    Li, Zhixin
    Zhang, Canlong
    Ma, Huifang
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2020, 201
  • [35] Double-attention mechanism-based segmentation grasping detection network
    Li, Qinghua
    Wang, Xuyang
    Zhang, Kun
    Yang, Yiran
    Feng, Chao
    JOURNAL OF ELECTRONIC IMAGING, 2024, 33 (02)
  • [36] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [37] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [38] Recurrent Fusion Network for Image Captioning
    Jiang, Wenhao
    Ma, Lin
    Jiang, Yu-Gang
    Liu, Wei
    Zhang, Tong
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 510 - 526
  • [39] Cascade Semantic Fusion for Image Captioning
    Wang, Shiwei
    Lan, Long
    Zhang, Xiang
    Dong, Guohua
    Luo, Zhigang
    IEEE ACCESS, 2019, 7 : 66680 - 66688
  • [40] Self-Enhanced Attention for Image Captioning
    Sun, Qingyu
    Zhang, Juan
    Fang, Zhijun
    Gao, Yongbin
    NEURAL PROCESSING LETTERS, 2024, 56 (02)