Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Cited by: 0
Authors
Hegui Zhu
Ru Wang
Xiangde Zhang
Affiliations
[1] Northeastern University, College of Sciences
Source
Neural Processing Letters | 2021 / Vol. 53
Keywords
Image captioning; Masked convolution; Dense fusion connection; Improved stacked attention module
DOI
Not available
Abstract
In existing image captioning methods, masked convolution is usually used to generate the language description, and the traditional residual network (ResNet) scheme used with masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet approach of combining features through summation. The improved stacked attention module can capture more fine-grained visual information that is highly relevant to word prediction. Finally, we apply the Transformer to the image encoder to sufficiently obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
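The abstract describes the architecture only in prose. As a purely illustrative aid, the following is a minimal PyTorch sketch of the dense fusion connection idea as the abstract states it: DenseNet-style concatenation of all earlier feature maps feeding each layer, followed by ResNet-style fusion by summation. The module name, the use of plain 1-D convolutions in place of the paper's masked convolutions, and all shapes are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class DenseFusionConnection(nn.Module):
    # Hypothetical sketch: dense (DenseNet-style) connectivity between
    # layers, fused by ResNet-style summation, per the abstract.
    # Plain 1-D convolutions stand in for the paper's masked convolutions.
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        # Layer i consumes the concatenation of the input and all
        # previous layer outputs, i.e. channels * (i + 1) input channels.
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels * (i + 1), channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # Dense connection: concatenate every earlier feature map
            # along the channel dimension before the next convolution.
            features.append(layer(torch.cat(features, dim=1)))
        # ResNet-style fusion: combine all feature maps by summation.
        return torch.stack(features, dim=0).sum(dim=0)

# Usage sketch on a batch of 2 sequences, 256 channels, length 20:
# y = DenseFusionConnection(256)(torch.randn(2, 256, 20))  # -> (2, 256, 20)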
Pages: 1101 - 1118
Number of pages: 18
Related Papers
50 records in total
  • [1] An Improved Image Inpainting Algorithm Based on Attention Fusion Module
    Ding, Zhen
    Wang, Tong
    Hao, Kuangrong
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE (CCDC), 2022 : 5087 - 5092
  • [2] Image captioning: Semantic selection unit with stacked residual attention
    Song, Lifei
    Li, Fei
    Wang, Ying
    Liu, Yu
    Wang, Yuanhua
    Xiang, Shiming
    IMAGE AND VISION COMPUTING, 2024, 144
  • [3] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (02) : 3447 - 3459
  • [4] Captioning Transformer with Stacked Attention Modules
    Zhu, Xinxin
    Li, Lixiang
    Liu, Jing
    Peng, Haipeng
    Niu, Xinxin
    APPLIED SCIENCES-BASEL, 2018, 8 (05)
  • [5] Attention on Attention for Image Captioning
    Huang, Lun
    Wang, Wenmin
    Chen, Jie
    Wei, Xiao-Yong
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4633 - 4642
  • [6] Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
    Zhang, Jing
    Xie, Yingshuai
    Ding, Weichao
    Wang, Zhe
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4257 - 4268
  • [7] Stacked cross-modal feature consolidation attention networks for image captioning
    Pourkeshavarz, Mozhgan
    Nabavi, Shahabedin
    Moghaddam, Mohsen Ebrahimi
    Shamsfard, Mehrnoush
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12209 - 12233