Complementary Shifted Transformer for Image Captioning

Cited by: 1
Authors
Liu, Yanbo [1]
Yang, You [2]
Xiang, Ruoyu [1]
Ma, Jixin [1]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Image captioning; Transformer; Positional encoding; Multi-branch self-attention; Spatial shift
DOI
10.1007/s11063-023-11314-0
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Transformer-based models have come to dominate many vision-and-language tasks, including image captioning. However, such models still suffer from limited expressive ability and from information loss during dimensionality reduction. To address these problems, this paper proposes a Complementary Shifted Transformer (CST) for image captioning. We first introduce a complementary Multi-branch Bi-positional encoding Self-Attention (MBSA) module, which uses both absolute and relative positional encoding to learn precise positional representations. MBSA also adopts a multi-branch architecture that replicates multiple branches for each attention head; to improve the expressive ability of the model, we train these branches in a complementary way using a drop-branch technique. Furthermore, we propose a Spatial Shift Augmented module, which exploits both low-level and high-level features to enhance the visual features with fewer parameters. To validate our model, we conduct extensive experiments on the MSCOCO benchmark dataset. Compared with state-of-the-art methods, the proposed CST achieves competitive performance: 135.3% CIDEr (+0.2%) on the Karpathy split and 136.3% CIDEr (+0.9%) on the official online test server. We also evaluate the inference performance of our model on a novel object dataset. The source code and trained models are publicly available at https://github.com/noonisy/CST.
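The abstract describes two mechanisms concretely enough to sketch: a spatial shift over grid features and complementary drop-branch training. The PyTorch sketch below illustrates both under stated assumptions; `spatial_shift`, `DropBranch`, and the `p_drop` parameter are illustrative names for the general techniques, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four channel groups of a (B, H, W, C) grid-feature map by
    one position along +/-H and +/-W, zero-padding at the borders."""
    B, H, W, C = x.shape
    g = C // 4                                      # channels per shift direction
    out = torch.zeros_like(x)
    out[:, 1:, :, :g]       = x[:, :-1, :, :g]      # shift down
    out[:, :-1, :, g:2*g]   = x[:, 1:, :, g:2*g]    # shift up
    out[:, :, 1:, 2*g:3*g]  = x[:, :, :-1, 2*g:3*g] # shift right
    out[:, :, :-1, 3*g:]    = x[:, :, 1:, 3*g:]     # shift left
    return out

class DropBranch(nn.Module):
    """Average parallel branch outputs; during training, randomly drop
    branches so the survivors are forced to learn complementary cues.
    A generic sketch of stochastic branch dropping, not the paper's code."""
    def __init__(self, p_drop: float = 0.3):
        super().__init__()
        self.p_drop = p_drop

    def forward(self, branches: list[torch.Tensor]) -> torch.Tensor:
        if not self.training:                        # inference: use all branches
            return torch.stack(branches).mean(dim=0)
        keep = torch.rand(len(branches)) >= self.p_drop
        if not keep.any():                           # always keep at least one branch
            keep[torch.randint(len(branches), (1,))] = True
        kept = [b for b, k in zip(branches, keep) if k]
        return torch.stack(kept).mean(dim=0)
```

Averaging only the surviving branches at training time forces each branch to remain predictive on its own, which is one way to realize the complementary training the abstract describes.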
Pages: 8339 - 8363
Page count: 25
Related Papers
50 records in total
  • [1] Complementary Shifted Transformer for Image Captioning
    Liu, Yanbo
    Yang, You
    Xiang, Ruoyu
    Ma, Jixin
    NEURAL PROCESSING LETTERS, 2023, 55 : 8339 - 8363
  • [2] Distance Transformer for Image Captioning
    Wang, Jiarong
    Lu, Tongwei
    Liu, Xuanxuan
    Yang, Qi
    2021 4TH INTERNATIONAL CONFERENCE ON ROBOTICS, CONTROL AND AUTOMATION ENGINEERING (RCAE 2021), 2021, : 73 - 76
  • [3] Rotary Transformer for Image Captioning
    Qiu, Yile
    Zhu, Li
    SECOND INTERNATIONAL CONFERENCE ON OPTICS AND IMAGE PROCESSING (ICOIP 2022), 2022, 12328
  • [4] Entangled Transformer for Image Captioning
    Li, Guang
    Zhu, Linchao
    Liu, Ping
    Yang, Yi
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 8927 - 8936
  • [5] Boosted Transformer for Image Captioning
    Li, Jiangyun
    Yao, Peng
    Guo, Longteng
    Zhang, Weicun
    APPLIED SCIENCES-BASEL, 2019, 9 (16):
  • [6] Reinforced Transformer for Medical Image Captioning
    Xiong, Yuxuan
    Du, Bo
    Yan, Pingkun
    MACHINE LEARNING IN MEDICAL IMAGING (MLMI 2019), 2019, 11861 : 673 - 680
  • [7] Transformer with a Parallel Decoder for Image Captioning
    Wei, Peilang
    Liu, Xu
    Luo, Jun
    Pu, Huayan
    Huang, Xiaoxu
    Wang, Shilong
    Cao, Huajun
    Yang, Shouhong
    Zhuang, Xu
    Wang, Jason
    Yue, Hong
    Ji, Cheng
    Zhou, Mingliang
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2024, 38 (01)
  • [8] ReFormer: The Relational Transformer for Image Captioning
    Yang, Xuewen
    Liu, Yingru
    Wang, Xin
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 5398 - 5406
  • [9] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
    PATTERN RECOGNITION LETTERS, 2021, 143 : 43 - 49
  • [10] Direction Relation Transformer for Image Captioning
    Song, Zeliang
    Zhou, Xiaofei
    Dong, Linhua
    Tan, Jianlong
    Guo, Li
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021, : 5056 - 5064