Dynamic-balanced double-attention fusion for image captioning

Cited by: 10
Authors
Wang, Changzhi [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Attention fusion; DSR; Attention variance; SEMANTIC ATTENTION; NETWORK;
DOI
10.1016/j.engappai.2022.105194
Chinese Library Classification (CLC)
TP [Automation technology; computer technology];
Discipline code
0812;
Abstract
Image captioning has received significant attention in the cross-modal field, in which spatial and channel attention play a crucial role. However, such attention-based approaches overlook two issues: (1) errors or noise in the channel feature map are amplified in the spatial feature map, lowering model reliability; (2) image spatial features and channel features contribute differently to the prediction of function words (e.g., "in", "out" and "on") and notional words (e.g., "girl", "teddy" and "bear"). To alleviate these issues, in this paper we propose Dynamic-Balanced Double-Attention Fusion (DBDAF) for the image captioning task, which novelly exploits attention variance to enhance the overall performance of the model. Technically, DBDAF first integrates a parallel Double Attention Network (DAN), in which channel attention serves as a supplement to region attention, enhancing model reliability. Then, an attention-variance-based Balancing Attention Fusion Mechanism (BAFM) module is devised. When predicting function words and notional words, BAFM strikes a dynamic balance between channel attention and region attention based on their attention variance. Moreover, to achieve richer image descriptions, we further devise a Doubly Stochastic Regularization (DSR) penalty and integrate it into the model loss function. The DSR penalty makes the model attend equally to every pixel and every channel when generating the entire sentence. Extensive experiments on three typical datasets show that our DBDAF clearly outperforms related leading end-to-end approaches. Most remarkably, DBDAF achieves 1.04% and 1.75% improvements in BLEU4 and CIDEr, respectively, on the MSCOCO dataset.
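The two mechanisms the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's actual equations: the function names `bafm_fuse` and `dsr_penalty` are hypothetical, the variance-based balance coefficient is one plausible reading of "dynamic balance based on attention variance", and the penalty mirrors the standard doubly stochastic regularization term rather than the authors' exact formulation.

```python
import numpy as np

def bafm_fuse(region_ctx, channel_ctx, region_att, channel_att, eps=1e-8):
    """Hypothetical sketch of the BAFM idea: fuse the region and channel
    context vectors, weighting each by the variance of its attention map.
    A peaky (confident) attention distribution has higher variance than a
    near-uniform one, so it receives a larger share of the fused context."""
    var_r, var_c = np.var(region_att), np.var(channel_att)
    beta = var_r / (var_r + var_c + eps)  # dynamic balance coefficient
    return beta * region_ctx + (1.0 - beta) * channel_ctx

def dsr_penalty(att_over_time, lam=1.0):
    """Sketch of a doubly stochastic regularization penalty.
    att_over_time: (T, N) attention weights over T decoding steps and
    N regions (or channels). Encourages each region/channel to receive
    roughly one unit of total attention across the whole sentence."""
    return lam * np.sum((1.0 - att_over_time.sum(axis=0)) ** 2)
```

For example, if the region attention is uniform (zero variance) while the channel attention is sharply peaked, `beta` goes to zero and the fused context is dominated by the channel branch; the DSR penalty vanishes only when every column of `att_over_time` sums to one.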
Pages: 15
Related papers
50 records total
  • [41] Self-Enhanced Attention for Image Captioning
    Qingyu Sun
    Juan Zhang
    Zhijun Fang
    Yongbin Gao
    Neural Processing Letters, 56
  • [42] Social Image Captioning: Exploring Visual Attention and User Attention
    Wang, Leiquan
    Chu, Xiaoliang
    Zhang, Weishan
    Wei, Yiwei
    Sun, Weichen
    Wu, Chunlei
    SENSORS, 2018, 18 (02)
  • [43] Task-Adaptive Attention for Image Captioning
    Yan, Chenggang
    Hao, Yiming
    Li, Liang
    Yin, Jian
    Liu, Anan
    Mao, Zhendong
    Chen, Zhenyu
    Gao, Xingyu
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 43 - 51
  • [44] Improve Image Captioning by Self-attention
    Li, Zhenru
    Li, Yaoyi
    Lu, Hongtao
    NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 91 - 98
  • [45] Hadamard Product Perceptron Attention for Image Captioning
    Weitao Jiang
    Haifeng Hu
    Neural Processing Letters, 2023, 55 : 2707 - 2724
  • [46] Image Captioning with Affective Guiding and Selective Attention
    Wang, Anqi
    Hu, Haifeng
    Yang, Liang
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2018, 14 (03)
  • [47] Adaptive Syncretic Attention for Constrained Image Captioning
    Liang Yang
    Haifeng Hu
    Neural Processing Letters, 2019, 50 : 549 - 564
  • [48] An Image Captioning Approach Using Dynamical Attention
    Wang, Changzhi
    Gu, Xiaodong
2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [49] Contextual and selective attention networks for image captioning
    Wang, Jing
    Li, Yehao
    Pan, Yingwei
    Yao, Ting
    Tang, Jinhui
    Mei, Tao
    SCIENCE CHINA-INFORMATION SCIENCES, 2022, 65 (12)
  • [50] Looking deeper and transferring attention for image captioning
    Fang Fang
    Hanli Wang
    Yihao Chen
    Pengjie Tang
    Multimedia Tools and Applications, 2018, 77 : 31159 - 31175