Complementary Shifted Transformer for Image Captioning

Cited by: 1

Authors:

Liu, Yanbo [1]

Yang, You [2]

Xiang, Ruoyu [1]

Ma, Jixin [1]

Affiliations:

[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China

[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China

Source:

NEURAL PROCESSING LETTERS | 2023 / Vol. 55 / Issue 06

Keywords:

Image captioning; Transformer; Positional encoding; Multi-branch self-attention; Spatial shift;

DOI:

10.1007/s11063-023-11314-0

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Transformer-based models have dominated many vision and language tasks, including image captioning. However, such models still suffer from the limitation of expressive ability and information loss during dimensionality reduction. In order to solve the above problems, this paper proposes a Complementary Shifted Transformer (CST) for image captioning. We first introduce a complementary Multi-branch Bi-positional encoding Self-Attention (MBSA) module. It utilizes both absolute and relative positional encoding to learn precise positional representations. Meanwhile, MBSA is equipped with Multi-Branch Architecture, which replicates multiple branches for each head. To improve the expressive ability of the model, we utilize the drop branch technique, which trains the branches in a complementary way. Furthermore, we propose a Spatial Shift Augmented module, which takes advantage of both low-level and high-level features to enhance visual features with fewer parameters. To validate our model, we conduct extensive experiments on the MSCOCO benchmark dataset. Compared to the state-of-the-art methods, the proposed CST achieves a competitive performance of 135.3% CIDEr (+0.2%) on the Karpathy split and 136.3% CIDEr (+0.9%) on the official online test server. In addition, we also evaluate the inference performance of our model on a novel object dataset. The source codes and trained models are publicly available at https://github.com/noonisy/CST.
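
The abstract mentions replicated branches trained with a drop branch technique but does not say how branches are dropped or merged. The following is a minimal sketch of one common way such a scheme can work, not the authors' implementation: it assumes each branch is dropped independently with probability `drop_rate` during training, that surviving branches are rescaled by `1 / (1 - drop_rate)`, and that branch outputs are averaged. The function name `drop_branch_combine` and all of its parameters are hypothetical.

```python
import random
from typing import List, Optional, Sequence

Vector = List[float]


def drop_branch_combine(
    branch_outputs: Sequence[Vector],
    drop_rate: float,
    training: bool,
    rng: Optional[random.Random] = None,
) -> Vector:
    """Average parallel branch outputs, randomly dropping branches in training.

    Assumption (not stated in the abstract): branches are dropped
    independently and survivors are rescaled so that the expected
    training output equals the plain average used at inference.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    n = len(branch_outputs)
    dim = len(branch_outputs[0])

    if not training or drop_rate == 0.0:
        # Inference: every branch contributes with weight 1.
        weights = [1.0] * n
    else:
        rng = rng or random.Random()
        scale = 1.0 / (1.0 - drop_rate)
        # Training: a dropped branch gets weight 0, a kept one is rescaled.
        weights = [0.0 if rng.random() < drop_rate else scale for _ in range(n)]

    # Weighted sum over branches, divided by the total number of branches.
    return [
        sum(weights[b] * branch_outputs[b][i] for b in range(n)) / n
        for i in range(dim)
    ]


branches = [[1.0, 2.0], [3.0, 4.0]]
inference_out = drop_branch_combine(branches, drop_rate=0.5, training=False)
training_out = drop_branch_combine(
    branches, drop_rate=0.5, training=True, rng=random.Random(0)
)
```

Because different subsets of branches are active at each training step under this assumption, no single branch can rely on the others, which is one plausible reading of training the branches "in a complementary way".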


Pages: 8339-8363

Page count: 25

50 items in total

[31] Efficient Image Captioning Based on Vision Transformer Models
Elbedwehy, Samar
Medhat, T.
Hamza, Taher
Alrahmawy, Mohammed F.
CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01) : 1483 - 1500
[32] External knowledge-assisted Transformer for image captioning
Li, Zhixin
Su, Qiang
Chen, Tianyu
IMAGE AND VISION COMPUTING, 2023, 140
[33] Dual-Spatial Normalized Transformer for image captioning
Hu, Juntao
Yang, You
An, Yongzhi
Yao, Lu
ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
[34] Caption TLSTMs: combining transformer with LSTMs for image captioning
Yan, Jie
Xie, Yuxiang
Luan, Xidao
Guo, Yanming
Gong, Quanzhi
Feng, Suru
INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
[35] Reinforcement Learning Transformer for Image Captioning Generation Model
Yan, Zhaojie
FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
[36] Improving Stylized Image Captioning with Better Use of Transformer
Tan, Yutong
Lin, Zheng
Liu, Huan
Zuo, Fan
ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
[37] Graph Alignment Transformer for More Grounded Image Captioning
Tian, Canwei
Hu, Haiyang
Li, Zhongjin
2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
[38] Visual contextual relationship augmented transformer for image captioning
Su, Qiang
Hu, Junbo
Li, Zhixin
APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
[39] Spiking-Transformer Optimization on FPGA for Image Classification and Captioning
Udeji, Uchechukwu Leo
Margala, Martin
SOUTHEASTCON 2024, 2024, : 1353 - 1357
[40] Improved image captioning with subword units training and transformer
蔡强
Li Jing
Li Haisheng
Zuo Min
HIGH TECHNOLOGY LETTERS, 2020, 26 (02) : 211 - 216
