Dynamic-balanced double-attention fusion for image captioning

Cited by: 10
Authors
Wang, Changzhi [1 ]
Gu, Xiaodong [1 ]
Affiliations
[1] Fudan Univ, Dept Elect Engn, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
关键词
Image captioning; Attention fusion; DSR; Attention variance; SEMANTIC ATTENTION; NETWORK;
DOI
10.1016/j.engappai.2022.105194
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline code
0812
Abstract
Image captioning has received significant attention in the cross-modal field, where spatial and channel attention play a crucial role. However, such attention-based approaches ignore two issues: (1) errors or noise in the channel feature map are amplified in the spatial feature map, lowering model reliability; (2) image spatial features and channel features contribute differently to the prediction of function words (e.g., "in", "out" and "on") and notional words (e.g., "girl", "teddy" and "bear"). To alleviate these issues, in this paper we propose Dynamic-Balanced Double-Attention Fusion (DBDAF) for the image captioning task, which exploits attention variation and enhances the overall performance of the model. Technically, DBDAF first integrates a parallel Double Attention Network (DAN) in which channel attention serves as a supplement to region attention, enhancing model reliability. Then, an attention-variation-based Balancing Attention Fusion Mechanism (BAFM) module is devised. When predicting function words and notional words, BAFM dynamically balances channel attention and region attention according to the attention variation. Moreover, to achieve richer image descriptions, we further devise a Doubly Stochastic Regularization (DSR) penalty and integrate it into the model loss function. DSR makes the model attend equally to every pixel and every channel when generating the entire sentence. Extensive experiments on three typical datasets show that DBDAF clearly outperforms related end-to-end leading approaches. More remarkably, DBDAF achieves 1.04% and 1.75% improvements in BLEU4 and CIDEr, respectively, on the MSCOCO dataset.
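To make the abstract's three components concrete, here is a minimal NumPy sketch of one decoding step: parallel region and channel attention, a variance-based dynamic balance between the two contexts, and a doubly stochastic penalty over all time steps. This is an illustrative reading only, not the paper's implementation: the dot-product scoring, the variance-ratio balancing weight, and the function names (`double_attention_step`, `dsr_penalty`) are assumptions of this sketch.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def double_attention_step(V, h, eps=1e-8):
    """One decoding step of a parallel double attention with a
    variance-balanced fusion (hypothetical reading of DAN + BAFM).
    V: (R, D) region features; h: (D,) decoder hidden state.
    Returns the fused context, both attention maps, and the weight."""
    # Region (spatial) attention: one weight per image region.
    alpha = softmax(V @ h)            # (R,)
    c_region = alpha @ V              # (D,) region context
    # Channel attention: one gate per channel of the pooled descriptor.
    pooled = V.mean(axis=0)           # (D,)
    beta = softmax(pooled * h)        # (D,)
    c_channel = beta * pooled         # (D,) channel context
    # Dynamic balance: the branch whose attention distribution is more
    # peaked (higher variance) contributes more to the fused context.
    lam = np.var(alpha) / (np.var(alpha) + np.var(beta) + eps)
    fused = lam * c_region + (1.0 - lam) * c_channel
    return fused, alpha, beta, lam

def dsr_penalty(alphas, betas):
    """Doubly Stochastic Regularization: push the total attention each
    region and each channel receives over all T steps toward 1.
    alphas: (T, R) region maps; betas: (T, D) channel maps."""
    region_term = np.sum((1.0 - alphas.sum(axis=0)) ** 2)
    channel_term = np.sum((1.0 - betas.sum(axis=0)) ** 2)
    return region_term + channel_term
```

In this reading, function words would tend to produce flat region maps (low variance), shifting weight toward the channel branch, while notional words produce peaked region maps and lean on region attention; the DSR term would be added to the captioning loss with a small coefficient.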
Pages: 15
Related papers
50 records
  • [1] Dynamic-balanced double-attention fusion for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2022, 114
  • [2] Image Captioning With Visual-Semantic Double Attention
    He, Chen
    Hu, Haifeng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2019, 15 (01)
  • [3] Research on Image Captioning Based on Double Attention Model
    Zhuo Y.-Q.
    Wei J.-H.
    Li Z.-X.
    Tien Tzu Hsueh Pao/Acta Electronica Sinica, 2022, 50 (05): 1123 - 1130
  • [4] Attention on Attention for Image Captioning
    Huang, Lun
    Wang, Wenmin
    Chen, Jie
    Wei, Xiao-Yong
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019: 4633 - 4642
  • [5] Balanced image captioning with task-aware decoupled learning and fusion
    Ding, Yuxuan
    Liu, Lingqiao
    Tian, Chunna
    Zhang, Xiangnan
    Tian, Xilan
    NEUROCOMPUTING, 2023, 538
  • [6] Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
    Zhang, Jing
    Xie, Yingshuai
    Ding, Weichao
    Wang, Zhe
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4257 - 4268
  • [7] A dynamic-balanced scheduler for genetic algorithms for grid computing
    Santiago, A.J. Sánchez
    Yuste, A.J.
    Expósito, J.E. Muñoz
    Galán, S. García
    Marín, J.M. Maqueira
    Bruque, S.
    WSEAS Transactions on Computers, 2009, 8 (01): 11 - 20
  • [8] Attention Based Double Layer LSTM for Chinese Image Captioning
    Wu, Wei
    Sun, Deshuai
    2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [9] Reasoning like Humans: On Dynamic Attention Prior in Image Captioning
    Wang, Yong
    Sun, Xian
    Li, Xuan
    Zhang, Wenkai
    Gao, Xin
    KNOWLEDGE-BASED SYSTEMS, 2021, 228
  • [10] Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module
    Zhu, Hegui
    Wang, Ru
    Zhang, Xiangde
    NEURAL PROCESSING LETTERS, 2021, 53 (02) : 1101 - 1118