Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Cited by: 0
Authors
Hegui Zhu
Ru Wang
Xiangde Zhang
Affiliations
[1] Northeastern University, College of Sciences
Source
Neural Processing Letters | 2021 / Vol. 53
Keywords
Image captioning; Masked convolution; Dense fusion connection; Improved stacked attention module
DOI
Not available
Abstract
In existing image captioning methods, masked convolution is usually used to generate the language description, and the traditional residual network (ResNet) scheme used with masked convolution suffers from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet approach of combining features through summation. The improved stacked attention module can capture more fine-grained visual information that is highly relevant to word prediction. Finally, we apply the Transformer to the image encoder to sufficiently obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
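The abstract describes the architecture only in prose. As a purely illustrative aid, the following is a minimal PyTorch sketch of the dense fusion connection idea as the abstract states it: DenseNet-style concatenation of all earlier feature maps feeding each layer, followed by ResNet-style fusion by summation. The module name, the use of plain 1-D convolutions in place of the paper's masked convolutions, and all shapes are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class DenseFusionConnection(nn.Module):
    # Hypothetical sketch: dense (DenseNet-style) connectivity between
    # layers, fused by ResNet-style summation, per the abstract.
    # Plain 1-D convolutions stand in for the paper's masked convolutions.
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        # Layer i consumes the concatenation of the input and all
        # previous layer outputs, i.e. channels * (i + 1) input channels.
        self.layers = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels * (i + 1), channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # Dense connection: concatenate every earlier feature map
            # along the channel dimension before the next convolution.
            features.append(layer(torch.cat(features, dim=1)))
        # ResNet-style fusion: combine all feature maps by summation.
        return torch.stack(features, dim=0).sum(dim=0)

# Usage sketch on a batch of 2 sequences, 256 channels, length 20:
# y = DenseFusionConnection(256)(torch.randn(2, 256, 20))  # -> (2, 256, 20)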
Pages: 1101 - 1118
Number of pages: 18
Related Papers
50 records in total
  • [1] An Improved Image Inpainting Algorithm Based on Attention Fusion Module
    Ding, Zhen
    Wang, Tong
    Hao, Kuangrong
    2022 34TH CHINESE CONTROL AND DECISION CONFERENCE (CCDC), 2022 : 5087 - 5092
  • [2] Image captioning: Semantic selection unit with stacked residual attention
    Song, Lifei
    Li, Fei
    Wang, Ying
    Liu, Yu
    Wang, Yuanhua
    Xiang, Shiming
    IMAGE AND VISION COMPUTING, 2024, 144
  • [3] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2024, 46 (02) : 3447 - 3459
  • [4] Captioning Transformer with Stacked Attention Modules
    Zhu, Xinxin
    Li, Lixiang
    Liu, Jing
    Peng, Haipeng
    Niu, Xinxin
    APPLIED SCIENCES-BASEL, 2018, 8 (05)
  • [5] Attention on Attention for Image Captioning
    Huang, Lun
    Wang, Wenmin
    Chen, Jie
    Wei, Xiao-Yong
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 4633 - 4642
  • [6] Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
    Zhang, Jing
    Xie, Yingshuai
    Ding, Weichao
    Wang, Zhe
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (08) : 4257 - 4268
  • [7] Stacked cross-modal feature consolidation attention networks for image captioning
    Pourkeshavarz, Mozhgan
    Nabavi, Shahabedin
    Moghaddam, Mohsen Ebrahimi
    Shamsfard, Mehrnoush
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (04) : 12209 - 12233