Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Cited by: 0
Authors
Hegui Zhu
Ru Wang
Xiangde Zhang
Affiliations
[1] Northeastern University, College of Sciences
Source
Neural Processing Letters | 2021 / Volume 53
Keywords
Image captioning; Masked convolution; Dense fusion connection; Improved stacked attention module
DOI
Not available
Abstract
In existing image captioning methods, masked convolution is usually used to generate the language description, and the traditional residual network (ResNet) approach used for masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet approach to combine features through summation. The improved stacked attention module can capture more fine-grained visual information that is highly relevant to word prediction. Finally, we apply the Transformer to the image encoder to sufficiently obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying the effectiveness of the proposed approach.
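To illustrate the DFC idea sketched in the abstract, the following minimal PyTorch block combines DenseNet-style concatenation of all earlier layer outputs with a masked (causal) 1-D convolution and a ResNet-style summation of the block input. This is an assumption-based sketch, not the authors' implementation: the module name DenseFusionBlock, layer widths, kernel size, and number of layers are all hypothetical.

# Hypothetical sketch of a dense-fusion-connection block:
# DenseNet-style concatenation + masked (causal) convolution + ResNet-style summation.
import torch
import torch.nn as nn

class DenseFusionBlock(nn.Module):
    def __init__(self, channels: int, num_layers: int = 3, kernel_size: int = 3):
        super().__init__()
        self.crop = kernel_size - 1  # amount of right-side context to discard for causality
        # Layer i sees the concatenation of the block input and all i earlier outputs.
        self.layers = nn.ModuleList(
            [nn.Conv1d(channels * (i + 1), channels, kernel_size, padding=kernel_size - 1)
             for i in range(num_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len) word-embedding features for the caption decoder.
        features = [x]
        out = x
        for conv in self.layers:
            dense_in = torch.cat(features, dim=1)            # DenseNet-style concatenation
            out = conv(dense_in)[..., : x.size(-1)]          # crop so position t sees only <= t (masked conv)
            out = torch.relu(out)
            features.append(out)
        return x + out                                       # ResNet-style summation

# Usage (toy shapes): y = DenseFusionBlock(channels=256)(torch.randn(2, 256, 20))

The design choice mirrored here is the one stated in the abstract: feature reuse through concatenation to ease gradient flow, with a final additive skip connection so the block still behaves like a residual unit.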
Pages: 1101-1118
Page count: 17
Related Papers
50 results
  • [41] Image Captioning for Nantong Blue Calico Through Stacked Local-Global Channel Attention Network
    Guo, Chenyi
    Zhang, Li
    Yu, Xiang
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT II, 2023, 14255 : 357 - 372
  • [42] Dense semantic embedding network for image captioning
    Xiao, Xinyu
    Wang, Lingfeng
    Ding, Kun
    Xiang, Shiming
    Pan, Chunhong
    PATTERN RECOGNITION, 2019, 90 : 285 - 296
  • [43] Post-Attention Modulator for Dense Video Captioning
    Guo, Zixin
    Wang, Tzu-Jui Julius
    Laaksonen, Jorma
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1536 - 1542
  • [44] Image Captioning with Compositional Neural Module Networks
    Tian, Junjiao
    Oh, Jean
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 3576 - 3584
  • [45] Variational Stacked Local Attention Networks for Diverse Video Captioning
    Deb, Tonmoay
    Sadmanee, Akib
    Bhaumik, Kishor Kumar
    Ali, Amin Ahsan
    Amin, M. Ashraful
    Rahman, A. K. M. Mahbubur
    2022 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV 2022), 2022, : 2493 - 2502
  • [46] Cascade Semantic Fusion for Image Captioning
    Wang, Shiwei
    Lan, Long
    Zhang, Xiang
    Dong, Guohua
    Luo, Zhigang
    IEEE ACCESS, 2019, 7 : 66680 - 66688
  • [47] Recurrent Fusion Network for Image Captioning
    Jiang, Wenhao
    Ma, Lin
    Jiang, Yu-Gang
    Liu, Wei
    Zhang, Tong
    COMPUTER VISION - ECCV 2018, PT II, 2018, 11206 : 510 - 526
  • [48] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [49] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [50] Text to Image Synthesis for Improved Image Captioning
    Hossain, Md. Zakir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Laga, Hamid
    Bennamoun, Mohammed
    IEEE ACCESS, 2021, 9 : 64918 - 64928