ETransCap: efficient transformer for image captioning

Authors
Mundu, Albert [1]
Singh, Satish Kumar [1]
Dubey, Shiv Ram [1]
Affiliations
[1] IIIT Allahabad, Dept IT, Comp Vis & Biometr Lab CVBL, Allahabad, India
Keywords
Deep learning; Natural language processing; Image captioning; Scene understanding; Transformers; Efficient transformers; Attention
DOI
10.1007/s10489-024-05739-w
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a challenging computer vision task: a model must automatically generate a textual description of an image by integrating visual and linguistic information, and the generated caption must describe the image's content accurately while adhering to the conventions of natural language. We adopt the encoder-decoder framework that CNN-RNN-based captioning models have used over the past few years. Recently, CNN-Transformer-based models have achieved great success and surpassed traditional CNN-RNN-based models in this area, and many researchers have concentrated on Transformers, exploring their possibilities. Unlike conventional CNN-RNN-based models, Transformer-based models also handle longer input sequences more efficiently. However, they are resource-intensive to train and deploy, particularly for large-scale tasks or tasks that require real-time processing. In this work, we introduce a lightweight and efficient Transformer-based model, the Efficient Transformer Captioner (ETransCap), which consumes fewer computational resources to generate captions. Our model operates with linear complexity and has been trained and tested on the MS-COCO dataset. Comparisons with existing state-of-the-art models show that ETransCap achieves promising results, which supports its potential as an approach for image captioning in real-time applications. Code for this project will be available at https://github.com/albertmundu/etranscap.
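The abstract states that the model operates with linear complexity but does not say which mechanism achieves this. As an illustration only, the sketch below shows one widely used way to make attention linear in the number of tokens: replacing the softmax with a positive feature map so that the key-value summary can be computed once and reused for every query. The function names (`phi`, `linear_attention`, `quadratic_attention`) and the choice of `elu(x) + 1` as the feature map are assumptions made for this example, not details taken from ETransCap.

```python
import math


def phi(x):
    # Positive feature map elu(x) + 1 (an assumption, not from the abstract).
    # For v > 0 this is v + 1; for v <= 0 it is exp(v).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention(Q, K, V):
    """Non-causal kernelized attention; cost grows linearly with token count."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out


def quadratic_attention(Q, K, V):
    """Same kernel, but with all n x n weights computed explicitly."""
    dv = len(V[0])
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(wj * v[b] for wj, v in zip(w, V)) / total
                    for b in range(dv)])
    return out
```

Both functions compute the same output up to floating-point rounding. The first never builds the token-by-token weight matrix, so its cost scales with the number of tokens rather than with its square.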
Pages: 10748-10762
Number of pages: 15