ETransCap: efficient transformer for image captioning

Cited by: 0

Authors:

Mundu, Albert [1]

Singh, Satish Kumar [1]

Dubey, Shiv Ram [1]

Affiliations:

[1] IIIT Allahabad, Dept IT, Comp Vis & Biometr Lab CVBL, Allahabad, India

Source:

APPLIED INTELLIGENCE | 2024 / Vol. 54 / Issue 21

Keywords:

Deep learning; Natural language processing; Image captioning; Scene understanding; Transformers; Efficient transformers; Attention

DOI:

10.1007/s10489-024-05739-w

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Image captioning is a challenging computer vision task: a model must automatically generate a textual description of an image by integrating visual and linguistic information, and the generated caption must accurately describe the image's content while adhering to the conventions of natural language. We adopt the encoder-decoder framework employed by various CNN-RNN-based image captioning models in the past few years. Recently, CNN-Transformer-based models have achieved great success and surpassed traditional CNN-RNN-based models in this area, and many researchers have concentrated on Transformers, exploring their possibilities. Unlike conventional CNN-RNN-based models, Transformer-based models offer the benefit of handling longer input sequences more efficiently. However, they are resource-intensive to train and deploy, particularly for large-scale tasks or for tasks that require real-time processing. In this work, we introduce a lightweight and efficient Transformer-based model called the Efficient Transformer Captioner (ETransCap), which consumes fewer computational resources to generate captions. Our model operates with linear complexity and has been trained and tested on the MS-COCO dataset. Comparisons with existing state-of-the-art models show that ETransCap achieves promising results. Our results support the potential of ETransCap as a good approach for image captioning in real-time applications. Code for this project will be available at https://github.com/albertmundu/etranscap.

Pages: 10748-10762

Page count: 15

