Image Captioning with Dense Fusion Connection and Improved Stacked Attention Module

Cited by: 0

Authors:

Hegui Zhu

Ru Wang

Xiangde Zhang

Affiliation:

[1] Northeastern University, College of Sciences

Source:

Neural Processing Letters | 2021 / Vol. 53

Keywords:

Image captioning; Masked convolution; Dense fusion connection; Improved stacked attention module

DOI:

Not available

Abstract:

In existing image captioning methods, masked convolution is commonly used to generate the language description, and the traditional residual network (ResNets) methods used with masked convolution bring about the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) and an improved stacked attention module. DFC uses the dense convolutional network (DenseNets) architecture to connect each layer to every other layer in a feed-forward fashion, then adopts the ResNets method of combining features through summation. The improved stacked attention module can capture more fine-grained visual information that is highly relevant to word prediction. Finally, we apply the Transformer to the image encoder to obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, which is higher than that of comparable models and verifies the effectiveness of the proposed model.
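The abstract describes DFC only at a high level, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a 1-D masked (causal) convolution over a sequence of scalar features and fuses the input with all earlier layer outputs by element-wise summation (DenseNets-style connectivity, ResNets-style addition). The function names `causal_conv1d` and `dense_fusion_stack`, the scalar features, and the kernel values are all hypothetical and chosen for illustration.

```python
from typing import List


def causal_conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Masked (causal) 1-D convolution: out[t] depends only on x[0..t]."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - j  # look back only, never ahead (this is the "mask")
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def dense_fusion_stack(x: List[float], kernels: List[List[float]]) -> List[float]:
    """Assumed DFC reading: every layer sees the sum of the input and all
    earlier layer outputs; the final output sums all collected features."""
    features = [x]
    for kernel in kernels:
        # Dense connectivity, but fused by summation instead of concatenation.
        fused = [sum(vals) for vals in zip(*features)]
        features.append(causal_conv1d(fused, kernel))
    return [sum(vals) for vals in zip(*features)]


if __name__ == "__main__":
    words = [1.0, 2.0, 3.0, 4.0]        # toy per-position features
    layers = [[0.5, 0.5], [1.0, -1.0]]  # toy kernels, one per layer
    print(dense_fusion_stack(words, layers))
```

Because every layer is causal and summation is position-wise, the output at position t still depends only on positions up to t, which is the property masked convolution needs for word-by-word caption generation.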

Pages: 1101 - 1118

Number of pages: 17

50 items in total

[21] Triplet attention fusion module: A concise and efficient channel attention module for medical image segmentation
Wu, Yanlin
Wang, Guanglei
Wang, Zhongyang
Wang, Hongrui
Li, Yan
BIOMEDICAL SIGNAL PROCESSING AND CONTROL, 2023, 82
[22] Dense video captioning based on local attention
Qian, Yong
Mao, Yingchi
Chen, Zhihao
Li, Chang
Bloh, Olano Teah
Huang, Qian
IET IMAGE PROCESSING, 2023, 17 (09) : 2673 - 2685
[23] Image Graph Production by Dense Captioning
Sahba, Amin
Das, Arun
Rad, Paul
Jamshidi, Mo
2018 WORLD AUTOMATION CONGRESS (WAC), 2018 : 193 - 198
[24] Video captioning with stacked attention and semantic hard pull
Rahman, Md Mushfiqur
Abedin, Thasin
Prottoy, Khondokar S. S.
Moshruba, Ayana
Siddiqui, Fazlul Hasan
PEERJ COMPUTER SCIENCE, 2021, 7 : 1 - 18
[25] Attention-guided image captioning with adaptive global and local feature fusion
Zhong, Xian
Nie, Guozhang
Huang, Wenxin
Liu, Wenxuan
Ma, Bo
Lin, Chia-Wen
JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 78
[26] Visual Relationship Attention for Image Captioning
Zhang, Zongjian
Wu, Qiang
Wang, Yang
Chen, Fang
2019 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2019
[27] Bengali Image Captioning with Visual Attention
Ami, Amit Saha
Humaira, Mayeesha
Jim, Md Abidur Rahman Khan
Paul, Shimul
Shah, Faisal Muhammad
2020 23RD INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (ICCIT 2020), 2020
[28] Deliberate Attention Networks for Image Captioning
Gao, Lianli
Fan, Kaixuan
Song, Jingkuan
Liu, Xianglong
Xu, Xing
Shen, Heng Tao
THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019 : 8320 - 8327
[29] Gated Hierarchical Attention for Image Captioning
Wang, Qingzhong
Chan, Antoni B.
COMPUTER VISION - ACCV 2018, PT IV, 2019, 11364 : 21 - 37
[30] Delving into Precise Attention in Image Captioning
Hu, Shaohan
Huang, Shenglei
Wang, Guolong
Li, Zhipeng
Qin, Zheng
NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 74 - 82
