Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Cited by: 0
Authors
Hegui Zhu
Ru Wang
Xiangde Zhang
Affiliations
[1] Northeastern University, College of Sciences
Source
Neural Processing Letters | 2021 / Volume 53
Keywords
Image captioning; Masked convolution; Dense fusion connection; Improved stacked attention module
DOI
Not available
Abstract
In existing image captioning methods, masked convolution is usually used to generate the language description, and the traditional residual network (ResNet) approach used for masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet approach to combine features through summation. The improved stacked attention module can capture more fine-grained visual information that is highly relevant to word prediction. Finally, we apply the Transformer to the image encoder to sufficiently obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying the effectiveness of the proposed approach.
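To illustrate the DFC idea sketched in the abstract, the following minimal PyTorch block combines DenseNet-style concatenation of all earlier layer outputs with a masked (causal) 1-D convolution and a ResNet-style summation of the block input. This is an assumption-based sketch, not the authors' implementation: the module name DenseFusionBlock, layer widths, kernel size, and number of layers are all hypothetical.

# Hypothetical sketch of a dense-fusion-connection block:
# DenseNet-style concatenation + masked (causal) convolution + ResNet-style summation.
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    def __init__(self, channels: int, num_layers: int = 3, kernel_size: int = 3):
        super().__init__()
        self.crop = kernel_size - 1  # amount of right-side context to discard for causality
        # Layer i sees the concatenation of the block input and all i earlier outputs.
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels * (i + 1), channels, kernel_size, padding=kernel_size - 1)
             for i in range(num_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len) word-embedding features for the caption decoder.
        features = [x]
        out = x
        for conv in self.layers:
            dense_in = torch.cat(features, dim=1)            # DenseNet-style concatenation
            out = conv(dense_in)[..., : x.size(-1)]          # crop so position t sees only <= t (masked conv)
            out = torch.relu(out)
            features.append(out)
        return x + out                                       # ResNet-style summation

# Usage (toy shapes): y = DenseFusionBlock(channels=256)(torch.randn(2, 256, 20))

The design choice mirrored here is the one stated in the abstract: feature reuse through concatenation to ease gradient flow, with a final additive skip connection so the block still behaves like a residual unit.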
Pages: 1101-1118
Page count: 17
Related Papers
50 results
  • [41] Image Captioning for Nantong Blue Calico Through Stacked Local-Global Channel Attention Network
    Guo, Chenyi
    Zhang, Li
    Yu, Xiang
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT II, 2023, 14255 : 357 - 372
  • [42] Dense semantic embedding network for image captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION, 2019, 90 : 285 - 296
  • [43] Post-Attention Modulator for Dense Video Captioning
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Laaksonen, Jorma
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1536 - 1542
  • [44] Image Captioning with Compositional Neural Module Networks
    Tian, Junjiao
    Oh, Jean
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 3576 - 3584
  • [45] Variational Stacked Local Attention Networks for Diverse Video Captioning
    Deb, Tonmoay
    Sadmanee, Akib
    Bhaumik, Kishor Kumar
    Ali, Amin Ahsan
    Amin, M. Ashraful
    Rahman, A. K. M. Mahbubur
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2493 - 2502
  • [46] Cascade Semantic Fusion for Image Captioning
    Wang, Shiwei
    Lan, Long
    Zhang, Xiang
    Dong, Guohua
    Luo, Zhigang
    IEEE ACCESS, 2019, 7 : 66680 - 66688
  • [47] Recurrent Fusion Network for Image Captioning
    Jiang, Wenhao
    Ma, Lin
    Jiang, Yu-Gang
    Liu, Wei
    Zhang, Tong
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 510 - 526
  • [48] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [49] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [50] Text to Image Synthesis for Improved Image Captioning
    Hossain, Md. Zakir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Laga, Hamid
    Bennamoun, Mohammed
    IEEE ACCESS, 2021, 9 : 64918 - 64928