Improving image captioning with Pyramid Attention and SC-GAN

Cited by: 23
Authors
Chen, Tianyu [1 ]
Li, Zhixin [1 ]
Wu, Jingli [1 ]
Ma, Huifang [2 ]
Su, Bianping [3 ]
Affiliations
[1] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[2] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
[3] Xian Univ Architecture & Technol, Coll Sci, Xian 710055, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Image captioning; Pyramid Attention network; Self-critical training; Reinforcement learning; Generative adversarial network; Sequence-level learning;
DOI
10.1016/j.imavis.2021.104340
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most existing image captioning models mainly use global attention, which represents whole-image features; local attention, which represents object features; or a combination of the two. Few models integrate the relationship information between the various object regions of an image, yet this information is highly instructive for caption generation: for example, if a football appears, there is a high probability that the image also contains people near it. In this article, the relationship feature is embedded into the global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. Besides, to alleviate the exposure bias problem and make the training process more efficient, we propose a new method to apply the Generative Adversarial Network to sequence generation. The greedy decoding method is used to generate an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model can generate more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics on both local and online test sets. (c) 2021 Elsevier B.V. All rights reserved.
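The self-critical training mentioned in the abstract follows the usual SCST pattern: the reward of a greedily decoded caption serves as the baseline for a sampled caption, so no separate baseline network is needed. A minimal sketch of the resulting policy-gradient loss (the function name and the scalar rewards are illustrative assumptions, not taken from the paper):

```python
def self_critical_loss(sample_logprob, sample_reward, greedy_reward):
    """Policy-gradient loss with a greedy-decoding baseline (SCST-style).

    sample_logprob: summed log-probability of the sampled caption
    sample_reward:  sequence-level score (e.g. CIDEr) of the sampled caption
    greedy_reward:  score of the greedily decoded caption, used as baseline
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the probability of samples that beat
    # the greedy baseline and lowers it for samples that fall short.
    return -advantage * sample_logprob

# Toy usage: the sampled caption scores 0.9, the greedy one 0.7,
# so the sample is reinforced (negative loss gradient on its log-prob).
loss = self_critical_loss(sample_logprob=-3.2, sample_reward=0.9, greedy_reward=0.7)
```

In practice the reward would be a captioning metric such as CIDEr computed against reference captions, and the loss would be averaged over a batch before backpropagation.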
Pages: 12